Language Model Guided Knowledge Graph Embeddings

Knowledge graph embedding models have become a popular approach for knowledge graph completion by predicting the plausibility of (potential) triples. This is performed by transforming the entities and relations of the knowledge graph into an embedding space. However, knowledge graphs often include further textual information stored in literals, which is ignored by such embedding models. As a consequence, the learning process stays limited to the structure and the connections between the entities, which has the potential to negatively influence the performance. We bridge this gap by leveraging the capabilities of pre-trained language models to include textual knowledge in the learning process of embedding models. This is achieved by introducing a new loss function that guides embedding models in measuring the likelihood of triples while taking such complementary knowledge into consideration. The proposed solution is a model-independent loss function that can be plugged into any knowledge graph embedding model. In this paper, Sentence-BERT and fastText are used as the pre-trained language models from which the embeddings of the textual knowledge are obtained and injected into the loss function. The loss function contains a trainable slack variable that determines the degree to which the language models influence the plausibility of triples. Our experimental evaluation on six benchmarks, namely Nations, UMLS, WordNet, and three versions of CoDEx, confirms the advantage of using pre-trained language models for boosting the accuracy of knowledge graph embedding models. We showcase this by performing evaluations on top of five well-known knowledge graph embedding models: TransE, RotatE, ComplEx, DistMult, and QuatE. The results show an improvement in accuracy of up to 9% on the UMLS dataset for the DistMult model and of 4.2% on the Nations dataset for the ComplEx model when they are guided by pre-trained language models. We additionally studied the effect of multiple factors, such as the structure of the knowledge graphs and the number of training steps, and present them as ablation studies.

Although knowledge graphs (KGs) contain large numbers of triples, they remain highly incomplete and usually do not capture all relevant knowledge for a domain of interest. Various link prediction approaches have been proposed to tackle the incompleteness of KGs, among which link prediction using knowledge graph embeddings (KGE) has become popular for KG completion tasks.
KGEs typically receive a knowledge graph as a set of correct triples and transfer entities and relations from their symbolic representation to an embedding space (often vectors). In the learning process, the existing triples in the KG are used as positive samples. Additionally, negative samples are generated from positive ones, usually by random corruption techniques. The learning process is performed by employing a score function to compute the plausibility of potentially correct triples. Furthermore, a loss function is used to adjust the randomly initialized embeddings in a way that positive samples get higher scores than negative ones.
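To make the sampling procedure concrete, the following is a minimal sketch of the uniform corruption technique described above. The entity names are hypothetical, and for brevity the sketch does not filter out corrupted triples that happen to exist in the KG, as production implementations typically do.

```python
import random

def corrupt_triple(triple, entities):
    """Create a negative sample by uniformly replacing either the
    subject or the object of a positive triple with a random entity."""
    s, p, o = triple
    if random.random() < 0.5:
        s = random.choice(entities)  # corrupt the subject
    else:
        o = random.choice(entities)  # corrupt the object
    return (s, p, o)

# One corrupted counterpart per positive triple (names are illustrative).
entities = ["Louis_Armstrong", "Singer", "Composer", "New_Orleans"]
positives = [("Louis_Armstrong", "occupation", "Singer")]
negatives = [corrupt_triple(t, entities) for t in positives]
```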
Although the above-described procedure works well when aiming to preserve the structural aspects of a KG, the literals in a knowledge graph that contain complementary textual knowledge remain unused. The left side of Figure 1 illustrates a KG where entities and relations contain complementary textual knowledge. The right side of the figure shows the typical input of standard KGE models, where only the symbolic representation (structure) is considered. However, the performance of KGE models suffers when a KG lacks sufficient structural knowledge [3]. In such cases, complementary knowledge from textual descriptions can mitigate the problem and enhance the performance of the link prediction task. As an example, there is a lack of structure around the triple (Louis Armstrong, occupation, Singer) shown in Figure 1. As can be seen, further contextual information about the subject Louis Armstrong is available (American Jazz Trumpeter, Composer, and Singer). Although the entity Composer is structurally present in the KG, its link to Louis Armstrong is missing. However, the connection can be found in the textual information of the entity Louis Armstrong if the KGE model is able to perceive it. Despite the ceiling performances reported by state-of-the-art KGE-based approaches [4], [5], this textual information is considered by almost none of these models [6]. In order to bridge this gap, we incorporate complementary knowledge into the learning process of KGE models with a unique use of language models. Several methods providing embeddings of textual data can facilitate this process, among which are the recent Transformer-based [7] pre-trained language models (PLMs) such as BERT [8], RoBERTa [9], and GPT-2 [10].
In this work, we propose a novel combination of knowledge graph embedding models and pre-trained language models through a unified loss function. This is done for the purpose of utilizing the embeddings of textual knowledge in the learning process of KGEs. Through a systematic analysis, we selected fastText [11] to obtain embeddings at the sub-word or character level and Sentence-BERT [12] to encode sentence-level descriptions. In our approach, the primary score of each triple is calculated by the KGE model. In addition, an auxiliary term representing the plausibility of the same triple is obtained from the PLMs, considering the available textual information. These two criteria are then transformed into a likelihood estimation. In this way, we enforce an upper bound on the score of positive samples and add a margin between positive and negative samples. In addition, a confidence function representing the plausibility of triples is obtained from the language model; it changes the boundary based on structural and textual information and affects the triple prediction performance of the underlying model. In order to optimize the embeddings based on the KG and the PLM, a log-likelihood loss is computed and maximized for the guidance of the baseline KGE model. The results emphasize the effectiveness of our proposed approach as well as the improved performance of KGE models in the link prediction task.
In summary, our main contributions are:
• We address the problem of knowledge graph embedding models ignoring complementary textual knowledge.
• The gap between knowledge graph embedding models and language models is bridged by a unique approach that incorporates PLMs into a loss function.
• A novel model-independent loss function is proposed that considers the embeddings of complementary textual knowledge.
• The standard benchmark datasets are adapted so that they can be used for knowledge graph embeddings in the presence of language models at once; these adapted datasets are also usable for similar works.
• An extensive evaluation is performed to study the effect of considering textual knowledge in embedding models, on six benchmark datasets and five known knowledge graph embedding models.
• The evaluation of the impact of using PLMs is extended by several studies, including: a) inclusion of both structural and textual knowledge by feeding the vectors obtained from the PLMs and KGE models into the proposed loss function; b) inclusion of only textual knowledge by considering the vectors obtained by PLMs in the score function of the KGE; c) comparison of the similarities and differences of the vectors corresponding to the structural and textual knowledge of the KGs.
The rest of the paper is organized as follows. In Section II, we describe the notation and the information required for understanding the proposed methodology. In Section III, we review the literature on current embedding models that utilize PLMs in their learning process. Section IV provides the details of the proposed method and the learning process. In Section V, we provide the details of the experimental setup and the analysis of the obtained results; that section also contains the ablation studies supporting the analysis of the results in subsection V-D. Finally, in Section VI, we summarise the main conclusions and outline future directions.

II. PRELIMINARIES
In this section, we provide the preliminaries required to understand the methodology of this work. We introduce the core concepts of knowledge graph embedding models, such as embedding vectors, the score function, and the loss function, and likewise the concepts related to pre-trained language models. In both cases, we introduce the notation used throughout the paper.

A. KNOWLEDGE GRAPH
For a set of entities E and relations R, a Knowledge Graph K is a formalism for the triple-based representation of facts, defined as K = {(s, p, o) | s, o ∈ E, p ∈ R}, where s, p, and o refer to the subject, predicate, and object, respectively.

B. KNOWLEDGE GRAPH EMBEDDING
In this part, we introduce the core components of KGEs, namely embedding vectors, score, and loss functions.

1) EMBEDDING VECTORS
A KGE model is a mapping function φ_KG : E ∪ R → R^d that transfers the symbolic representation of the entities in E and the relations in R of a KG into a d-dimensional latent feature space.
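As an illustration, below is a minimal PyTorch sketch of φ_KG: entities and relations are indexed, and each index is mapped to a randomly initialized d-dimensional vector that is adjusted during training. The sizes are hypothetical.

```python
import torch
import torch.nn as nn

num_entities, num_relations, d = 1000, 20, 100  # hypothetical sizes

# phi_KG: symbolic indices -> d-dimensional latent vectors,
# randomly initialized and learned during training.
entity_emb = nn.Embedding(num_entities, d)
relation_emb = nn.Embedding(num_relations, d)

s_vec = entity_emb(torch.tensor(0))    # embedding of the entity with index 0
p_vec = relation_emb(torch.tensor(3))  # embedding of the relation with index 3
```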

2) SCORE FUNCTION
Each KGE model defines a score function f (s, p, o) in which the input is a triple (s, p, o) and the output is a value showing the degree to which the triple is plausible. Usually, a higher value for the score shows a triple to be more plausible.

3) LOSS FUNCTION
The learning process in a KGE starts with a random initialization of the embedding vectors; therefore, the scores of positive and negative samples are initially also random. Optimizing a loss function L (often via Stochastic Gradient Descent) adjusts the embeddings such that positive samples get higher scores than negative ones. As the prediction of the triples gets better with each iteration, the loss decreases.

C. PRE-TRAINED LANGUAGE MODEL
Language models are trained on large corpora to generate word tokens. The probability of generating a sentence S is computed as p(S) = ∏_{i=1}^{|S|} p(s_i | s_{<i}), where s_i is the token generated at time step i. The trained checkpoint (PLM) contains all the learned weights, which we leverage to obtain the embedding representation of the text. Let PLM be a function that takes a text as input and returns a learned vector representation as output. The text corresponding to the subject and object entities, as well as the predicate, can be the input of the function, and PLM returns a 1 × d_LM dimensional vector for each of s, p, o; for the subject, PLM : s_text → s_LM ∈ R^{1×d_LM} (the same applies to the object and relation). We denote the embeddings of the textual descriptions corresponding to the entities and relations of a KG obtained from a pre-trained language model by s_LM, p_LM, o_LM. Similar to KGEs, this is also a mapping function that transfers text into a d_LM-dimensional latent space, which we denote by φ_LM : E ∪ R → E_LM ∪ R_LM. The difference between φ_KG and φ_LM is that the embeddings of the PLMs are not learned; instead, they are considered as a guide in the learning process of KGE models.
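As a sketch, the frozen PLM embeddings can be obtained as follows, assuming the sentence-transformers and fasttext Python packages; the checkpoint names are illustrative, as the paper does not pin specific ones.

```python
from sentence_transformers import SentenceTransformer
import fasttext

st_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
ft_model = fasttext.load_model("cc.en.300.bin")     # illustrative pre-trained vectors

def plm_embed(text, backend="ST"):
    """PLM: text -> vector in R^{d_LM}. The vectors stay frozen;
    they only guide the KGE training."""
    if backend == "ST":
        return st_model.encode(text)            # sentence-level context
    return ft_model.get_sentence_vector(text)   # sub-word level

s_lm = plm_embed("Louis Armstrong, American jazz trumpeter, composer and singer")
```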

III. RELATED WORK
In this paper, we aim to utilize the KG and the text embedded by language models to enhance the performance of KGEs in link prediction tasks. Therefore, in this section, we review three categories of embedding models: standard knowledge graph embedding, text-enhanced knowledge graph embedding, and pre-trained language models. The selection of the models covered in the related work, as well as in the evaluations, follows the best practices of the embedding models [1], [4], [13]. Similarly, for the choice of pre-trained language models, we selected a handful of models that are aligned with our work in terms of their methodology for capturing textual patterns [14].

A. KNOWLEDGE GRAPH EMBEDDING MODELS
Generally, KGE models can be classified into two groups [1] based on the design of their score function: translational-distance and semantic matching-based models. Here we name several state-of-the-art KGEs that we also use in our evaluation. TransE [15] and the family of its follow-up models (e.g., TransH [16] and TransR [17]) are examples of the translational-distance group. The score functions of these models are designed such that entities are encoded as vectors and the relations between them as translation vectors. DistMult [18] is a semantic matching model that uses a predicate-specific matrix in order to capture the pairwise interaction between subjects and objects [1]. ComplEx [19] is another model in this category that assesses the plausibility of triples by considering the similarity of their latent representations through matrix multiplications. The RotatE model [20] provides a rotation-based score function where the subject is rotated towards the object via the predicate; this model is influenced by both DistMult and ComplEx. QuatE [21] is designed in the quaternion space and, similar to RotatE, represents a predicate as a rotation. However, a rotation in quaternion space is more expressive than a rotation in complex space. The subjects, predicates, and objects are modeled in quaternion space, whose latent representation contains three imaginary components.
Here, we use the score functions of several state-of-the-art models, as listed in Table 1.
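For reference, two of the well-known score functions from Table 1 can be written as follows. This is a simplified PyTorch sketch (using the L1 norm for TransE); actual implementations add regularization and complex or quaternion algebra where applicable.

```python
import torch

def transe_score(s, p, o):
    # Translational distance: plausible triples satisfy s + p ≈ o,
    # so the negated distance serves as the score.
    return -torch.norm(s + p - o, p=1, dim=-1)

def distmult_score(s, p, o):
    # Semantic matching: a trilinear product capturing the pairwise
    # interaction between subject and object through the predicate.
    return torch.sum(s * p * o, dim=-1)
```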

B. ASSISTING KGEs WITH TEXTUAL DESCRIPTION
There is a thread of works that utilize text embeddings to enhance the performance of KGEs. For an entity classification task, DKRL [23] uses the descriptions of entities by employing Word2Vec [24] in the score function of the TransE [15] model. This approach does not bridge PLMs and KGEs and cannot be generalized to other types of KGEs, as it is designed explicitly for TransE. Similar to LiteralE [25], it also ignores the textual information of the relations inside a KG, while using a computationally expensive approach to obtain entity descriptions. Additionally, in [26], a method is proposed that uses the textual descriptions of entities and relations in a language model; the output of the language model is modeled as a score function to learn the knowledge graph embedding, which differs from our work since we propose a loss function. A few recent works transfer structural knowledge into PLMs using various knowledge encoding techniques [26]-[29]; these solve the problem from the side of the language models, not of the KGEs as considered in our work. In [30], a method is proposed to jointly train language models and KGEs. It utilizes the structural knowledge from the KG in order to achieve better inference capability for PLMs. However, this approach suffers from high computational and memory costs, as the LMs are trained during the learning process, which is not the case for our proposed solution. In addition, there is a risk of information loss due to the use of two separate objective functions for the PLM and the KGE, as well as the shared embedding space (using an encoder). In contrast, we propose one unified loss function and use pre-trained models, which lowers the computation costs to a large extent. Our approach can be plugged into any KGE model independently because the PLM information is utilized in the proposed loss function.

C. PRE-TRAINED LANGUAGE MODELS
In recent years, Transformer-based pre-trained language models [7] have revolutionized the domain of natural language processing. PLMs are widely used to solve various downstream tasks such as question answering [31], document retrieval [32], and language evaluation [33], [34]. PLMs are trained on huge corpora with the objective of capturing textual patterns. Architecturally, PLMs such as BERT [8], Sentence-BERT [12], fastText [11], GPT-2 [10], and T5 [35] contain a high number of parameters (e.g., BERT-base with 110 million parameters), which enables them to cover a large vocabulary and capture a wide range of patterns. However, language models alone do not consider other forms of knowledge representation, such as graph-based structures. Within the scope of this work, we focus on two types of PLMs: a Transformer-based model [7] that captures sentence-level context, namely Sentence-BERT, and a Skip-gram-based statistical language model [36], namely fastText, which was pre-trained on the Wikipedia corpus using the Skip-gram model. Both Sentence-BERT and fastText [11] can capture language patterns at the sub-word level and handle out-of-vocabulary words, where other models such as GloVe [37] and Word2vec [24] fall short. In this work, we leveraged pre-trained language models to obtain contextualized embeddings of the entities to guide the learning process of the knowledge graph embedding models.

IV. PROPOSED APPROACH
The proposed approach is a language model-guided loss function that allows KGE models to employ pre-trained embeddings in the learning process. The loss function takes two different types of embeddings as input, along with the slack variable λ and the margin γ. At the beginning of the learning process, the embeddings of all entities E and relations R are randomly initialized. These embedding vectors are learned during the whole training process by using the KGE score function and optimization over a loss function. The second set of embeddings comes from PLMs; these are denoted as E_LM for the set of entities E and R_LM for the set of relations R. A confidence value C(s, p, o) is computed that represents the plausibility of each triple (s, p, o) from the textual point of view only. Here, C_{s,p,o} = C(s, p, o) is the product of the three vectors s_LM, p_LM, and o_LM (subject, predicate, and object of a triple). We use a product to compute the confidence from text because, if a triple is correct, the object and relation are likely to appear in the text of the subject entity (and vice versa for the object); their embeddings from the language model are then close, and the product of the vectors gives a high confidence value. Such a confidence value can guide the learning process of KGE models. However, this does not hold for all triples, as the reliability of the confidence value depends on the quality of the gathered text of subject, predicate, and object. Therefore, a common trainable slack variable λ, shared by all triples of the underlying KG, is defined to regulate the effect of the textual information on the final score values. The slack variable conveys the confidence based on the contextual information of the triple to guide the score obtained from the KGE. Finally, a likelihood estimation is computed using a Sigmoid function, which then gives the final maximum log-likelihood loss. All the steps of our approach are described in more detail in Algorithm 4.
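A minimal sketch of the confidence computation is given below. We read the "product of the three vectors" as a trilinear product, which is an assumption on the exact form; the λ scaling is applied later in the loss.

```python
import torch

def plm_confidence(s_lm, p_lm, o_lm):
    """Textual confidence C(s, p, o) from frozen PLM vectors. If the text
    of the subject mentions the predicate and object (and vice versa), the
    vectors are close and the product yields a high confidence value."""
    return torch.sum(s_lm * p_lm * o_lm, dim=-1)
```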
The construction of the loss function is presented step-wise, using the distance-based class of embedding models as an example. For each positive triple, the distance should be a small value, upper-bounded by a term containing the confidence obtained from the language model; for each negative triple, the distance should be a large value, lower-bounded by a term containing that confidence. The effect of the boundary after including the PLM confidence is depicted in equation 1:

d(s, p, o) − λ C_{s,p,o} ≤ γ,   d(s′, p, o′) − λ C_{s′,p,o′} > γ   (1)

where λ is a learnable parameter, γ is a hyper-parameter, and (s′, p, o′) denotes a negative sample. Note that the parameter λ is used to adjust the scale of the confidence value coming from the language model; it therefore balances the negative effect that might be caused by irrelevant text present for entities or relations. Rearranging the components of equation 1 leads to equation 2:

γ − d(s, p, o) + λ C_{s,p,o} ≥ 0,   d(s′, p, o′) − λ C_{s′,p,o′} − γ > 0   (2)

Applying the Sigmoid function σ to both conditions yields equation 3, which is equivalent to equation 2 since σ(x) → 1 as x → ∞:

σ(γ − d(s, p, o) + λ C_{s,p,o}) → 1,   σ(d(s′, p, o′) − λ C_{s′,p,o′} − γ) → 1   (3)

In the following equation, we use a maximum likelihood estimation (max L) to satisfy equation 3:

L = σ(γ − d(s, p, o) + λ C_{s,p,o}) · ∏_i σ(d(s′_i, p, o′_i) − λ C_{s′_i,p,o′_i} − γ)   (4)

where (s′_i, p, o′_i) are the negative samples of (s, p, o). Each negative sample is obtained by uniform sampling [15], [22]. In order to relax the problem, we use the log-likelihood log L, which gives the final objective:

log L = log σ(γ − d(s, p, o) + λ C_{s,p,o}) + ∑_i log σ(d(s′_i, p, o′_i) − λ C_{s′_i,p,o′_i} − γ)   (5)

The terms of equation 5 have been analysed together with C_{s,p,o} in the ablation study in Section V-D. It is noteworthy that our proposed loss function (equation 5) is a generalization of the loss function proposed in [20], with the additional capability of injecting textual knowledge; the confidence terms vanish when λ = 0.
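Below is a PyTorch sketch of the resulting loss, under the assumption that d(·) is a distance-based score and negatives are weighted uniformly (the self-adversarial weighting of [20] is omitted for brevity); minimizing the returned value maximizes the log-likelihood of equation 5.

```python
import torch
import torch.nn.functional as F

def lm_guided_loss(d_pos, d_neg, c_pos, c_neg, lam, gamma):
    """d_pos: distances of positive triples, shape (B,);
    d_neg: distances of negative triples, shape (B, N);
    c_pos / c_neg: PLM confidences C_{s,p,o};
    lam: trainable slack variable; gamma: margin hyper-parameter."""
    pos = F.logsigmoid(gamma - d_pos + lam * c_pos)           # positive term of eq. 5
    neg = F.logsigmoid(d_neg - gamma - lam * c_neg).mean(-1)  # negative term of eq. 5
    return -(pos + neg).mean()

lam = torch.nn.Parameter(torch.tensor(0.1))  # shared across all triples
```

With lam fixed to zero, the confidence terms vanish and the sketch reduces to a uniform-negative-sampling variant of the loss in [20].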

V. EXPERIMENTS
In this section, we provide the results of an extensive set of experiments that were done to evaluate the effect of the proposed loss function. The following describes the considered KGs and the experimental setup.

A. BENCHMARK DATASETS
We evaluate our approach on several publicly available datasets, namely: Nations [38], UMLS [39], WN9 (constructed in this work), and CoDEx [40]. We conduct experiments on three different splits of the CoDEx dataset (small, medium, and large): CoDEx-S, CoDEx-M, and CoDEx-L. Here, we provide a detailed description of the datasets, and comprehensive statistics are reported in Table 2.
• Nations dataset includes a set of relationships between nations and their features. The dataset consists of binary and unary relations.
• UMLS dataset (standing for Unified Medical Language System) is a high-level ontology for organizing a large number of terminologies used in the biomedical domain into a unified vocabulary that allows uniform access to disparate medical resources.
• WN9 is a subset of the WN18 [41] dataset with 9 relations. The textual information is constructed in this work from the associated resources that come with the WordNet glosstag files. The XML resource contains <synset id> and <terms> tags, which refer to the entity ID and name, respectively. In the case of multiple <term> entries under a <terms> tag, we consider the first <term> as the entity name.
• CoDEx [40] provides three comprehensive knowledge graph datasets that include positive and hard negative triples, entity types, and entity and relation descriptions. The knowledge graph in CoDEx is constructed from Wikidata [42] and Wikipedia.

B. EXPERIMENTAL SETUP
The experimental setup includes the introduction of baseline models, evaluation metrics as well as the hyperparameters search setting.

1) BASELINE MODELS
The evaluations of the proposed loss function have been conducted on the following well-known knowledge graph embedding models: TransE [15], RotatE [22], ComplEx [19], DistMult [18], and QuatE [21]. The respective score functions of these KGEs are reported in Table 1. The selection of these models has been made through a systematic analysis considering the following criteria: a) variety in model type (translation-based, semantic matching) based on the design of their score function; b) diversity of geometric space (Euclidean, complex, and quaternion); c) strong performance within their group or beyond.

2) EVALUATION METHODOLOGY & METRICS
We measure the performance of the models in the link prediction task using the standard metrics used in [4], [22]. The evaluation of knowledge graph embedding models aims to solve the link prediction task: given a set of test triples K_test ⊆ E × R × E that are not seen during the training process, for each triple (s, p, o) ∈ K_test we predict either the subject or the object. If the model aims to predict the subject, it is called a head prediction, where the query is (?, p, o); correspondingly, the tail query is (s, p, ?).
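For completeness, a minimal sketch of how the rank-based metrics can be computed for one query follows; the filtered-ranking protocol (discarding other known true entities) is omitted for brevity.

```python
import torch

def rank_metrics(scores, true_idx):
    """`scores` holds the model score of every candidate entity for a
    query such as (s, p, ?); `true_idx` is the index of the correct one."""
    rank = 1 + (scores > scores[true_idx]).sum().item()
    return {"MRR": 1.0 / rank,
            "Hits@1": float(rank <= 1),
            "Hits@3": float(rank <= 3),
            "Hits@10": float(rank <= 10)}

metrics = rank_metrics(torch.randn(100), true_idx=7)  # illustrative call
```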

3) HYPER-PARAMETER SETTINGS
To make a fair comparison, we obtained results on each dataset with the same hyper-parameters for all settings. We train the models on CoDEx for 20,000 steps (S), on WN9 for 30,000 steps, and on Nations and UMLS for 3,000 steps, where each step is one pass of the optimization per batch. The training objective is optimized using Adam [43], and an embedding dimension d of 500 is used. A negative sample size (N) of 50 and an adversarial temperature (T) of 1.0 were chosen for the experiments on Nations, UMLS, and CoDEx, while for WN9 we use 1024 negative samples and an adversarial temperature of 0.5. All experiments were conducted on a Tesla V100 machine with 16 GB memory.

C. EVALUATIONS AND RESULTS
Table 3 and Table 4 report the evaluation results on the benchmark datasets. The approaches suffixed with ST and FT are our methods incorporating Sentence-Transformer (ST, also known as Sentence-BERT) and fastText (FT) embeddings, respectively. We show the results for each standard KGE model, its Sentence-BERT-guided variant (KGE model+ST), and its fastText-guided variant (KGE model+FT). In both tables, the deltas δ1 and δ2 denote the performance differences between the standard knowledge graph embedding models and our ST- and FT-based methods, respectively; improvements are highlighted in green, otherwise in pink. Our outperforming results are highlighted in blue and bold; for other models, the best results are underlined. The results in Table 3 show that including text embeddings in the learning process enables the knowledge graph embedding models to achieve better results, with some of the most significant improvements on Nations and UMLS for RotatE.
On the Nations dataset, Hits@3 improves from 0.435 to 0.455 for TransE+FT. On the UMLS dataset, RotatE+FT improves Hits@1 from 0.874 to 0.886 and Hits@3 from 0.952 to 0.961. The computation time of the proposed approach is almost unchanged: we collected the runtime of the RotatE model over 5 runs on UMLS, with and without ST, using a step size (S) of 3000 and a dimension (d) of 100; the difference between the average runtimes is 0.487 seconds, which means our approach does not increase the runtime drastically. An overall improvement in Hits@1 and Hits@3 is evident across most datasets, with comparable results on the other metrics such as Hits@10 and MRR. Obtaining improved results in Hits@1 is generally challenging; nevertheless, the results in Table 3 exhibit a considerable improvement in Hits@1 on the test set across several benchmarks. Table 4 shows the results on the three splits of the CoDEx dataset, where a remarkable improvement across all metrics is noticeable for the baseline models. Overall, we observe improved results across several datasets; only for RotatE and DistMult on the CoDEx-M dataset does our proposed approach achieve slightly lower yet comparable results. One possible reason for not achieving the desired result in some cases is a disagreement between the connectivity of the graph and the PLM vectors, which is studied below.

1) TRAINED EMBEDDING VS PLM EMBEDDING
In Figure 3, through a systematic analysis, we sampled entities from CoDEx-S for which we computed cosine similarities. It can be seen that there are often dissimilarities between the trained embeddings, which capture structural information, and the PLM embeddings. In many cases, the similarity between two entities is high when their embeddings are trained by the KGE model, whereas for the embeddings coming from the PLMs the score is lower due to disagreement with the structural information. For example, ''Canadian musician'' and ''American rapper'' have high similarity in the trained embeddings from the KGE, but their PLM vectors are not very similar. Our observation is that this can be caused by the textual difference between ''Canadian'' and ''American''. This situation can lead to performance degradation due to the anomaly in the conveyed information. For this reason, using only PLM vectors for the link prediction task does not perform as expected, which can be clearly seen in Table 5. As mentioned, having a very small value for λ mitigates the negative effect of disagreements between structural and textual information. However, this does not solve the problem in all scenarios, because λ is shared between all triples during the training phase. Overall, this becomes problematic when triples in the test set contain less informative text than the training triples overall.
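The disagreement discussed above can be quantified with a simple check like the following sketch, which compares the cosine similarity of an entity pair under the trained KGE vectors against the frozen PLM vectors.

```python
import torch.nn.functional as F

def similarity_gap(e1_kge, e2_kge, e1_lm, e2_lm):
    """Returns the cosine similarities of an entity pair in both spaces;
    a large gap signals disagreement between structure and text."""
    sim_kge = F.cosine_similarity(e1_kge, e2_kge, dim=0).item()
    sim_lm = F.cosine_similarity(e1_lm, e2_lm, dim=0).item()
    return sim_kge, sim_lm
```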

D. ANALYSIS OF THE ABLATION STUDIES
We further analyze several aspects. First, we analyze the effect of using only PLMs; then, the effect of the embedding dimension and the influence of the number of training steps are analyzed in order to understand further dynamics such as overfitting.

1) EFFECT OF USING ONLY PLMs AS SCORING FUNCTION
As a first step of the ablation study, we explore the effect of using only the pre-trained language models. In this case, we evaluated the score C_{s,p,o} obtained from the PLM vectors s_LM, p_LM, o_LM alone. The evaluations are performed on UMLS, Nations, and CoDEx-S and are shown in Table 5; here ST and FT correspond to Sentence-Transformer and fastText, respectively. We observe a significant decrease in the results when considering only PLMs (highlighted in red). More specifically, on the UMLS dataset, Hits@10 dropped from 0.996 (RotatE+ST) to 0.160 (only ST), and on the CoDEx-S dataset, Hits@10 dropped from 0.696 (RotatE+ST) to 0.012 (only PLM ST). Since PLMs lack the structural information of the KG, using only PLMs is not sufficient for link prediction. However, the evaluations on the Nations dataset show satisfactory results, which are highly affected by the structure of the KG. This shows that the contextual meaning of the entities and relations is highly correlated with the structural connectivity of the KG and the number of entities and relations. A structural analysis of the Nations dataset also confirmed that the connectivity of the graph is generally well reflected in the contextual meaning of the entities and relations. In order to perform this evaluation, we fixed the dimension d to 100, the learning rate α to 0.01, the number of negative samples N to 10, the batch size B to 256, the adversarial temperature T to 0.5, the margin γ to 10, and the step size S to 5000. One possibility is that, since this dataset only has 13 entities, an evaluation focused on Hits@10 might result in high values for this metric.

2) EFFECT OF EMBEDDING DIMENSION
As an ablation study, we evaluated the effect of dimension on the performance of the models with/without using PLMs. Here, we demonstrate the results of the RotatE model in the respective Figures 4, 5, and 6.
Figure 4 compares the Hits@1 performance of PLM-guided RotatE against the baseline RotatE over a large spectrum of selected dimensions {10, 50, 100, 150, 200, 500, 1000}; in most cases, the performance is improved after the inclusion of PLMs, with either FT or ST with RotatE achieving better results. The same effect can be seen in Figure 5. The improvement in Hits@1 is higher compared to Hits@3 across the dimensions (compare Figures 4 and 5); except for a few cases, the inclusion of PLMs shows an improvement in almost every dimension. Figure 6 also demonstrates a significant improvement across the selected dimensions for Hits@10. Generally, we observe that from dimension 10 to 200 the scale of improvement is larger than for the remaining dimensions. By increasing the dimension, the models can learn more structural information with higher complexity; therefore, the structural information becomes more influential in the overall performance. For these analyses, the same hyper-parameters as in the previous ablation are used, except for the dimension d, which is varied in order to show its effect.

3) INFLUENCE OF TRAINING STEP
Since the λ parameter changes at each training step, we conducted an ablation on whether increasing the number of training steps improves performance. To do so, the step sizes are divided into three regions (low, medium, and high). The low region includes the steps S_low = {100, 500, 1000}, the medium region S_mid = {3000, 5000}, and the high region S_high = {10000, 20000, 50000}. Figures 7, 8, and 9 demonstrate the effect of the training steps. The evaluations for Hits@1 can be seen in Figure 7: in the S_low and S_mid regions (i.e., until step size 3000) the improvement for PLMs+KGE is noticeable, while the baseline KGE starts to catch up with our approach in the S_high region. Figure 8 also shows a noticeable improvement, specifically in the S_low and S_mid regions (until training step 10000); afterwards, the results of the baseline KGE model (ComplEx) start to improve. The same behavior as for Hits@1 and Hits@3 is observable for Hits@10, shown in Figure 9. In a nutshell, the improvement can become insignificant, or even negative, when the number of training steps is too large.
The hyper-parameters chosen in this regard are: negative sample size N = 50, embedding dimension d = 100, batch size B = 256, adversarial temperature T = 1.0, learning rate α = 0.01, and ComplEx as the KGE model.

4) LIMITATIONS
The proposed approach highly depends on the quality of textual information. There are knowledge graphs in which such complementary knowledge is completely missing. In order to make use of language models, the textual information needs to be collected through a complex procedure. Additionally, the knowledge graph with textual information needs to go through a complex data quality check. Our model falls short when the dissimilarity between the structural and textual information is huge.

VI. CONCLUSION AND FUTURE WORK
This work proposes a novel approach for leveraging pre-trained language models to guide knowledge graph embedding models. The main contribution of this work is the design and development of a model-independent loss function that utilizes additional textual information through PLMs and can be plugged into any KGE model. The empirical evaluations demonstrate that using additional embeddings, which correspond to the textual knowledge of the entities and relations of a knowledge graph and are obtained from pre-trained language models, in a maximum log-likelihood loss substantially improves the performance of baseline embedding models. Using the embeddings from PLMs alone, however, does not lead to high performance. In addition, we reported observations regarding the effect of the datasets and their structural information, as well as of the evaluation setting in terms of training steps and dimensions.

VII. FUTURE WORK
We plan to investigate the proposed loss function using embeddings generated by contemporary large-scale language models such as GATO [44] and OPT [45]. While this work used sentence-level textual information, the next step is to leverage pre-trained embeddings at the word and document levels. We strongly believe that our findings will motivate researchers to further investigate the use of PLMs for enhancing representation learning. This is expected to influence downstream AI tasks that use KG embedding models in practice, such as recommendation, prediction, or question answering systems.
MIRZA MOHTASHIM ALAM received the bachelor's degree from the Department of Computer Science, BRAC University, Bangladesh, and the master's degree from the University of Bonn, Germany. He is currently pursuing the Ph.D. degree with the Smart Data Analytics (SDA) Research Group. He is a Senior Research Scientist with the Nature-Inspired Machine Intelligence Research Group, Institute of Applied Informatics (InfAI). His research interests include knowledge graph analysis, machine learning, and pattern recognition. He was awarded the Vice-Chancellors Gold Medal for his outstanding performance during his bachelor's degree.
MD RASHAD AL HASAN RONY received the bachelor's degree from BRAC University, Bangladesh, and the master's degree from the University of Bonn, Germany, where he is currently pursuing the Ph.D. degree with the Smart Data Analysis Group. He is a Research Scientist with Fraunhofer IAIS, Dresden. He has worked on several research and industry projects related to dialogue systems, question answering, and machine reading comprehension. His primary research interests include knowledge graph-based dialogue systems, question answering systems, evaluation of generative systems, machine reading comprehension, and language models.

JENS LEHMANN (Member, IEEE) received the master's degree in computer science from the Technical University of Dresden and the University of Bristol and the Ph.D. degree (summa cum laude) from the University of Leipzig. He is currently the Head of the Smart Data Analysis Research Group, a full-time Professor at the University of Bonn, and a Lead Scientist at Fraunhofer IAIS. He has contributed to various open-source projects, such as DL-Learner, SANSA, LinkedGeoData, and DBpedia. He has authored more than 100 publications, which have been cited more than 18000 times, and has won 12 international awards. His research interests include semantic web technologies, question answering, machine learning, and knowledge graph analysis.