Utilizing Textual Information in Knowledge Graph Embedding: A Survey of Methods and Applications



I. INTRODUCTION
In recent years, knowledge graphs (KGs) have experienced rapid development, and several large-scale KGs have been constructed and published, e.g., YAGO [1], Freebase [2] and DBpedia [3]. A KG stores human knowledge in a structured form: it is a structured representation of relational facts, composed of entities, relations and descriptions. Entities represent concrete objects and abstract concepts, relations represent the relationships between entities, and descriptions define or describe the entities. A piece of knowledge, also called a fact, is typically stored as a triple (head entity, relation, tail entity) under the Resource Description Framework (RDF) scheme. For example, the fact that the moon is the satellite of the earth is stored as (Moon, satellite of, Earth). KGs make knowledge available to intelligent systems and are applied to many artificial intelligence tasks, such as recommendation systems [4], question answering [5], [6], semantic parsing [7], named entity disambiguation [8] and information extraction [9].
(The associate editor coordinating the review of this manuscript and approving it for publication was Chang Choi. Volume 8, 2020.)
Knowledge representation learning, or KG embedding, is one of these advanced research directions; it projects the elements of a KG into a continuous vector space. The technique is widely applied to KG-related tasks such as KG completion [10], relation extraction [11] and entity classification [12]. Much work has been proposed, e.g., TransE and its extensions [14]-[19] and semantic matching models [20]-[23]. These models directly represent entities and relations in a real-valued or complex-valued vector space. The representations are made compatible with the true facts in the KG and thus rely heavily on its connectivity. Yet most KGs are built through semi-automatic methods, which leaves them incomplete, and the performance of KG embedding on downstream tasks can suffer from this sparseness [24]. At the same time, entities in KGs usually have precise and concise descriptions; examples of entity descriptions are illustrated in FIGURE 1. In addition, rich external context information is available, e.g., textual mentions and text corpora. Such textual information can be used to discover new relationships and offer precise expression, and, more importantly, it is only slightly affected by the sparse connectivity of the KG. As a result, methods that utilize textual information in KG embedding have started to attract attention [25]-[29]. We categorize these works according to whether the representation is built from the textual information: (i) Text-based KG embedding: the textual information is encoded to represent the entities and relations. The text-based representation is then used to extend existing KG embedding techniques or to perform the embedding task alone. The encoding models are learned by maximizing the overall plausibility of the facts in the KG. (ii) Text-improved KG embedding: unlike the former, no encoding model is made compatible with the facts or used to build a text-based representation.
The textual information is instead incorporated into different phases of existing techniques, i.e., initialization, augmenting the representation and joint embedding, aiming to achieve better performance. Previous literature has surveyed relational learning for KGs [30], KG refinement [31], KG embedding [32] and knowledge representation learning [33]. Ji et al. [34] make a comprehensive review of KG representation, acquisition and application. Wang et al. [32] categorize fact-only KG embedding by scoring function and discuss the different types of auxiliary information utilized in KG embedding. Lin et al. [33] focus on quantitative analysis of KG embedding techniques. These works only briefly review a few methods that incorporate textual information. In this paper, by contrast, we propose a comprehensive review of techniques that utilize textual information in KG embedding, covering both the state of the art and the latest trends. The techniques are classified into two categories based on whether the work builds its representation from the texts. Additionally, we introduce the details of the training procedure and how embeddings with textual information are applied to downstream tasks.
The remaining parts are organized as follows. Section II provides the premise, including the notation and definition, KG embedding with facts and various textual information. Then we introduce the text-based KG embedding in Section III and describe the encoding models and scoring function. Section IV reviews the text-improved KG embedding techniques which utilize the texts to achieve better performance. Section V discusses the training procedure of the mentioned methods. Some specific applications of KG embedding with textual information are further explored in section VI. At last, the conclusion and future directions are presented in section VII.

II. PREMISE

A. NOTATION AND DEFINITION
In this section, we first introduce the definition and components of a KG. Following the definition in [32], a KG is defined as G = {E, R, F}, where E denotes the set of entities, R the set of relations and F the set of facts. In particular, a fact is defined as a triple (h, r, t), where h, r and t denote the head, relation and tail, respectively. Other notations and their descriptions are listed in TABLE 1.

B. KG EMBEDDING WITH FACTS
We then introduce some representative embedding techniques that depend only on facts; much previous work has studied such techniques. They have proved to be state-of-the-art and are widely used as the basis of embedding tasks with textual information. These works capture the structural information of the KG and provide structure-based embeddings. Entities and relations are directly represented as real-valued vectors, matrices or complex-valued vectors, and scoring functions are defined to assess the validity of facts. We divide the models into two groups according to their scoring functions.

1) TRANSLATION-BASED MODEL
TransE [10] projects entities and relations into the same vector space. Inspired by [35], TransE treats a relation as a translation between the entity embeddings. Particularly, given a fact (h, r, t), it expects r_s ≈ t_s − h_s. Intuitively, to measure the validity of a fact, its score is set to the negative distance between h_s + r_s and t_s, i.e.,

f_r(h, t) = −‖h_s + r_s − t_s‖_{1/2}   (1)

If (h, r, t) is observed in the KG, f_r(h, t) should obtain a high score. TransE has proved to be concise and effective. However, it assigns a unique embedding to each entity regardless of which triples the entity constitutes, and it is known to be poor at dealing with multi-relation problems. To handle 1-to-N, N-to-1 and N-to-N relations, TransH [13] introduces a hyperplane for each relation. h_s and t_s are projected onto the relation-specific hyperplane with normal vector w_r, i.e., h_⊥ = h_s − w_r^T h_s w_r and t_⊥ = t_s − w_r^T t_s w_r. The scoring function of TransH follows TransE and is defined as

f_r(h, t) = −‖h_⊥ + r_s − t_⊥‖²_2   (2)

In TransR, each relation has an associated space. Entity embeddings are first projected into the relation-specific space with a projection matrix M_r ∈ R^{d×d}, i.e., h_r = M_r h_s and t_r = M_r t_s. The translation is then performed in the relation space as

f_r(h, t) = −‖h_r + r_s − t_r‖²_2   (3)

2) SEMANTIC MATCHING MODEL
RESCAL [21] models the interaction between an entity pair by representing the relation as a weight matrix M_r. The scoring function is based on a bilinear formulation, i.e.,

f_r(h, t) = h_s^T M_r t_s   (4)

To simplify the model, M_r is restricted to be diagonal in DISTMULT [37]. ComplEx [38] introduces a complex vector space and represents entities and relations with complex-valued embeddings, i.e., h_s, r_s, t_s ∈ C^d. Its scoring function extends DISTMULT, i.e.,

f_r(h, t) = Re(h_s^T diag(r_s) t̄_s)   (5)

where Re(·) keeps the real part of a complex value and t̄_s denotes the conjugate of t_s. Neural Tensor Network (NTN) [36] introduces a neural network architecture. The vectors of the entity pair are given as input. In the hidden layer, a bilinear tensor M_r^{[1:d]} ∈ R^{d×d×d} combines the entities h_s, t_s ∈ R^d across different relations. The score, which reflects the plausibility of the relation, is given by the linear output layer. The model can be represented as

f_r(h, t) = u_r^T tanh(h_s^T M_r^{[1:d]} t_s + W_{r,1} h_s + W_{r,2} t_s + b_r)   (6)

where b_r ∈ R^d is the bias, W_{r,1}, W_{r,2} ∈ R^{d×d} denote relation-specific weight matrices and u_r ∈ R^d is the output weight.
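As a concrete illustration of the two families of scoring functions, a minimal sketch of the TransE and DISTMULT scores might look as follows; the dimension and random embeddings are toy values, not taken from the surveyed papers:

```python
import numpy as np

# Toy dimension and random embeddings; real systems use d ~ 50-200.
d = 4
rng = np.random.default_rng(0)
h_s, r_s, t_s = rng.normal(size=(3, d))

def transe_score(h, r, t):
    """TransE: negative L1 distance between h + r and t."""
    return float(-np.abs(h + r - t).sum())

def distmult_score(h, r, t):
    """DISTMULT: bilinear score with an implicit diagonal relation matrix."""
    return float(np.sum(h * r * t))

# A perfectly translated fact (t == h + r) reaches TransE's maximal score 0.
score_true = transe_score(h_s, r_s, h_s + r_s)
score_random = transe_score(h_s, r_s, t_s)
```

The translation family rewards small distances (scores near zero), while the semantic matching family rewards large bilinear products; both conventions assign higher scores to more plausible facts.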

C. TEXTUAL INFORMATION
Rich textual information is available for KG embedding. In this section, different kinds of textual information are introduced. In particular, entity descriptions and textual mentions are relatively precise, labeled texts. In most cases, however, textual information is noisy and only weakly related to the KG, which makes it hard to build representations directly from it. Some preprocessing methods are described as well, according to the type of information.

1) RAW TEXTS
Raw or unlabeled texts can contain much semantic information and are easy to collect, e.g., a news corpus [39] or Wikipedia articles, and they are expected to have high coverage of entities. However, such texts have no direct, close association with the KG. Entities appear at random positions in documents with unknown titles, relations are implicit, and entities are often buried in noise. The desired information is hard to derive from raw texts, which makes it difficult to represent entities and relations accurately. Wang et al. [26] traverse the corpus and treat each distinct word pair in a context window, together with the implicit relation between them, as a candidate fact; entity names and descriptions are used to align the candidate facts with the facts in the KG. Some KG-related tools, e.g., entity linking, are also available for preprocessing these texts. An et al. [40] link a Wikipedia inner link to the corresponding entity in Freebase if they share the same title, and link a word in the corpus to a WordNet entity if the word belongs to its synsets. Wang et al. [24] likewise annotate the corpus by labeling the entities of the KG, as illustrated in FIGURE 2. Wu et al. [42] extract all sentences containing an entity's name from the corpus as the reference sentences of that entity. General tools [41] are used, but they cannot completely remove the noise in raw texts.

2) TEXTUAL MENTIONS
Textual mentions refer to individual sentences containing entity pairs, derived from ClueWeb. Each sentence is processed with a dependency parser and represented as a lexicalized dependency path in [43] and [44]. The path, which contains words and dependency arcs, is then defined as the textual relation between the entity pair. An instance of a textual relation is illustrated in FIGURE 3. However, textual mentions are built to express the relationship between entity pairs and inevitably introduce noise. For example, the sentence ''Miami Dolphins in 1966 and the Cincinnati Bengals in 1968'' does not express any relationship. An et al. [40] aim to provide precise textual mentions for subsequent relation representation learning. An extractor collects precise textual mentions for each fact (h, r, t). All sentences containing both h and t are first collected as candidate textual mentions. A sentence is kept as an accurate textual mention only if it meets one of two conditions: (i) it contains a hyponym or synonym of r, or (ii) it contains words similar to the relation name.
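The two-condition filter above can be sketched as a simple candidate-and-keep pass; the sentences, entity strings and relation synonym list below are invented for illustration:

```python
# Sketch of the mention filter: first collect candidate sentences that
# contain both entities, then keep only those that hint at the relation.
def filter_mentions(sentences, head, tail, relation_synonyms):
    """Keep sentences containing both entities and a relation synonym."""
    candidates = [s for s in sentences if head in s and tail in s]
    return [s for s in candidates
            if any(syn in s.lower() for syn in relation_synonyms)]

sentences = [
    "The Moon is the only natural satellite of Earth.",
    "Moon rocks and Earth photos were displayed in 1969.",
]
kept = filter_mentions(sentences, "Moon", "Earth", ["satellite", "orbits"])
# Only the first sentence survives: it contains a synonym of the relation.
```

Real systems would use hyponym/synonym lists from a lexical resource such as WordNet rather than a hand-written list.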

3) ENTITY DESCRIPTION
Most KGs, e.g., Freebase and Wikidata, provide precise descriptions or definitions for their entities. The entity description is considered to have a strong association with the KG. As a result, entity descriptions are promising texts for entity representation and are popular in most text-based representation models [25], [27], [40]. Classical textual features such as TF-IDF can be used to extract keywords from an entity description. The flaw of entity descriptions is that not every entity in a KG has an associated description. Other textual information, such as entity names, is also considered, although it carries little semantic information.

III. TEXT-BASED KG EMBEDDING
Most text-based KG embedding techniques utilize textual information to extend existing fact-only techniques and represent each entity or relation with both a text-based and a structure-based embedding. In recent years, owing to expressive encoders, research that represents entities and relations with textual information alone [46], [47] has started to appear. We review these text-based methods and find that they typically cover three key elements: (i) building text-based representations of entities and relations, (ii) defining a scoring function containing the text-based representations, and (iii) training the encoding models and making them compatible with the facts. In this section we introduce the methods from two views: encoding models and scoring functions. The details of the training procedure are presented in section V.

A. ENCODING MODELS
The representations are constructed from word embeddings via encoding models. Different encoders and mechanisms have been proposed to learn expressive representations from textual information. We further categorize them into linear models, convolutional neural networks, recurrent neural networks, topic models and transformers.

1) LINEAR MODELS
Simple linear models can be used to generate the text-based representation. Description-embodied knowledge representation learning (DKRL) [25] and Joint(BOW) [48] use a continuous bag-of-words (CBOW) encoder on the keywords extracted from the entity description: the vectors of the keywords are summed up as the text-based entity representation. The limitation of CBOW is that it treats all keywords equally and neglects the word order in the description.
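A CBOW encoder of this kind reduces to a sum of vectors; the toy vocabulary and word-vector table below are invented for illustration:

```python
import numpy as np

# Toy vocabulary and word-vector table; the vectors are invented.
d = 3
vocab = {"natural": 0, "satellite": 1, "earth": 2}
W = np.arange(9, dtype=float).reshape(3, d)   # one row per word

def cbow_encode(keywords):
    """CBOW encoder: sum the keyword vectors into one entity vector."""
    return sum(W[vocab[w]] for w in keywords)

e_d = cbow_encode(["natural", "satellite"])
# Word order is ignored: any permutation of the keywords gives the same vector.
```

The permutation invariance is exactly the limitation the text points out: "satellite natural" and "natural satellite" encode identically.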
Veira et al. [49] use the Wikipedia entry for the entity name as the entity description and generate relation-specific weighted word vectors (WWV) for the entity. WWV assigns different importance to the words in the description, decided by each word's frequency and its relevance to the relation. A matrix A records the number of occurrences of each word in each entity's description, and a matrix B records each word's relevance to each relation. Given an entity e_i with relation r_j and its entity description, the text-based representation is defined as

e_d = (A_i ⊙ B_j) W

where W is an n_w × d matrix of word vectors, A_i and B_j denote the rows for e_i and r_j, and ⊙ is the element-wise product. Moreover, WWV introduces an auxiliary matrix P to decrease the number of parameters to be estimated, referred to as parameter-efficient weighted word vectors (PE-WWV). P is composed of the word features of the relations, and the relevance between r_j and word w_k is defined as P_j W_k^T, so that e_i with r_j can be expressed as

e_d = (A_i ⊙ (P_j W^T)) W

WWV and PE-WWV are much more expressive than CBOW, but they still fail to exploit the word order.

2) CONVOLUTIONAL NEURAL NETWORKS
Deep neural networks can achieve impressive performance when encoding entity descriptions and textual mentions. The convolutional neural network (CNN) is an efficient and effective model in computer vision and natural language processing research and can be used to learn deep, expressive features from textual information. DKRL(CNN) [25] assumes that the word order in an entity description may contain implicit relations between entities that the KG omits and that these can be learned by neural network models [50]. It constructs a five-layer CNN to discover the implicit relations in the word order. The embeddings of the words in the description, excluding stop words, are given as input. A max-pooling operation reduces the scale of the CNN's parameter space and filters noise; a subsequent mean-pooling operation preserves the textual information for the entity embedding, and a non-linear output layer builds the text-based entity representation. The architecture of DKRL(CNN) is illustrated in FIGURE 4. He et al. [51] assume that relation descriptions can also supply much semantic information and propose Relation Text-embodied Knowledge Representation Learning (RTKRL), which adopts a similar CNN architecture and takes position features into consideration to represent the relation by encoding fine-grained descriptions. Lexicalized dependency paths represent the textual relation between entities; synonymous textual relations share similar paths consisting of similar patterns, words and dependency arcs. To exploit this statistical strength and learn Compositional Representations of Textual Relations (CONV), Toutanova et al. [44] construct a one-hidden-layer CNN to encode the internal structure of the paths. Words and dependency arcs in a path are projected to the input layer as word embeddings v ∈ R^d via an embedding matrix V.
The text-based relation representation is then obtained by max-pooling over the convolution outputs, i.e.,

r_d = max_i tanh(W_{p(i)} v_i + b)

where the W_{p(i)} are the position-specific maps and b is the bias vector.
However, a textual relation may occur in multiple textual mentions containing the same entity pair, and it is essential to figure out which mention best expresses the relation. Tang et al. [43] propose Multi-source Knowledge Representation Learning (MKRL), introducing position embeddings and an attention mechanism [52] into a CNN to encode the lexicalized dependency paths extracted from the textual mentions. For the i-th word in the path, position embeddings x_ih and x_it are defined by the relative distance from the word to the head entity and the tail entity. The position embeddings and the word embedding then constitute the concatenated embedding (x_i, x_ih, x_it), which is the input of the encoder. To build the final representation from the related textual mentions {s_1, . . . , s_n}, a sentence-level attention mechanism is used to filter noise and preserve information. For each sentence representation s_i output by the max-pooling operation, the corresponding structure-based relation embedding r is used to compute the attention

α_i = exp(s_i^T r) / Σ_j exp(s_j^T r)

Then the text-based relation representation is

r_d = Σ_i α_i s_i

3) RECURRENT NEURAL NETWORKS
Recurrent neural networks are utilized to capture long-term relational dependencies. Wu et al. [42] propose Sequential Text-embodied Knowledge Representation Learning (STKRL). STKRL first extracts the reference sentences for each entity from the corpus and treats entity representation as a multi-instance learning problem. Given an entity e, a position-based RNN/LSTM encoder produces the set of sentence-level representations {s_1, . . . , s_m}. The position features of the words constituting the entity name are marked as 0.
The words around the name are marked by their relative distance, negative on the left side and positive on the right. The extracted sentences express the entity differently, so the corresponding structure-based embedding e_s is used to assign a different importance to each s_i by computing the cosine similarity att(s_i, e_s) between them. The text-based representation of e is then

e_d = Σ_i att(s_i, e_s) s_i / Σ_j att(s_j, e_s)

Wang et al. [29] propose Entity Descriptions-Guided Embedding (EDGE) to encode the entity description. A BiLSTM is introduced in EDGE to handle the context and word sequence in the entity description. The pre-trained word2vec embeddings of the words in the description are given as the input of the BiLSTM. However, the word embeddings are not fed into the encoder directly: since the entity description can enhance the structure-based embedding and vice versa, EDGE refines the word embeddings with the structure-based embeddings iteratively. For a word or phrase w_i in the description, if there exists an entity e_i with the same name, w_i is updated as

w_i ← α_i e_i + Σ_{e_j ∈ r(e_i)} β_ij e_j

where α_i and β_ij are adjustable values and e_j ∈ r(e_i) denotes a neighbor of e_i in the KG connected by a relation edge. Although these measures enhance the text-based representation, RNN/LSTM models still fail to handle the multi-relation problem: they represent the same entity or relation in different triples with a single text-based representation. Xu et al. [48] take both directions of the word sequence into consideration to fully discover the semantic information in the word order. They utilize a BiLSTM to represent the entity with its description, referred to as Joint(BiLSTM). The output vectors of the two directions are concatenated at each step. Particularly, given the output z_i ∈ R^d of the BiLSTM at the i-th position, the text-based entity representation is defined as the average

e_d = (1/n) Σ_i z_i

Words in the description, however, contribute differently to the entity representation under different relations.
Some words are keywords for one relation but not for another; for example, the ''parentOf'' relation emphasizes social relations and the gender attributes of a person. Based on Joint(LSTM), a word-level attention mechanism is applied in the BiLSTM to perform relation-specific encoding of the entity description in different contexts, referred to as Joint(A-LSTM). Given a structure-based relation representation r_s and an entity description D of size n, the attention for position i is defined as

α_i = exp(v_a^T tanh(W_s z_i + r_s)) / Σ_j exp(v_a^T tanh(W_s z_j + r_s))

where W_s ∈ R^{d×d} and v_a ∈ R^d are parameters. For each entity e related to r, the text-based representation is then defined as

e_d = Σ_i α_i z_i

An et al. [40] employ a BiLSTM to encode entity descriptions and triple-specific relation mentions to learn accurate text-based representations, referred to as ATE, and introduce mutual attention between the entities and relations, referred to as AATE. The mutual attention has two phases. Given a relation mention {w_1, w_2, . . . , w_n}, in the first phase the representation of the textual relation r_d ∈ R^d, generated by averaging the hidden vectors of the BiLSTM, is used as attention to infer the entity representation e_d, i.e.,

α_i = exp(h_i^T r_d) / Σ_j exp(h_j^T r_d)

where h_i ∈ R^h is the hidden representation of w_i. The text-based entity representation is then

e_d = Σ_i α_i h_i

In the second phase, h_d + t_d is used to infer the attention-based relation representation r_d, following the same formulation. The experiments show that, with mutual attention, AATE handles the multi-relation problem better than ATE.
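The attention pooling shared by these encoders (sentence-level in MKRL, word-level in Joint(A-LSTM) and ATE/AATE) can be sketched in a few lines; the dot-product scoring used here is a simplification, and all sizes and vectors are toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_pool(vectors, query):
    """Attention pooling: score every encoding against a query embedding
    (e.g., a structure-based relation vector), normalise with softmax,
    and return the weighted sum as the text-based representation."""
    weights = softmax(vectors @ query)   # one weight per sentence/word
    return weights @ vectors

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 5))   # 4 toy mention encodings of dimension 5
r = rng.normal(size=5)        # structure-based relation embedding (toy)
r_d = attention_pool(S, r)    # text-based relation representation
```

The query vector is what makes the encoding relation-specific: the same mentions pooled with a different query yield a different representation.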

4) TOPIC MODELS
The aforementioned methods focus on capturing word-level or sentence-level information in the texts but neglect the latent topics of entities. Ouyang et al. [53] propose entity topic based representation learning (ETRL) and Xiao et al. [54] propose the Semantic Space Projection (SSP) model. They utilize the latent topics in entity descriptions to represent the entities and relations. Particularly, entity descriptions are treated as documents and the topic model NMF is used to learn the text-based representations by minimizing

Σ_{i,j} (M_{i,j} − v_{e_i}^T s_{w_j})²

where M_{i,j} denotes the frequency of word w_j in the description of entity e_i, and v_{e_i} and s_{w_j} are their respective topic representations. In ETRL, the topic representation of a relation is defined as the average of those of its entities.

5) TRANSFORMER
KG-BERT [46] fine-tunes the pre-trained language model BERT for the embedding task. The name or description tokens of the head entity, relation and tail entity are concatenated into a single sequence, and the final hidden vector c ∈ R^h of the special [CLS] token serves as the representation of the triple, as illustrated in FIGURE 5. KG-BERT discards the structure-based representation, and the scoring function is

f_r(h, t) = sigmoid(c W^T)

where W ∈ R^{2×h} is the weight of the classification layer. All the mentioned encoding models and related textual information are listed in TABLE 2.
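The NMF factorization used by the topic-model encoders above can be sketched with standard multiplicative updates; the entity-word count matrix and factor rank below are invented toy values:

```python
import numpy as np

# Toy entity-description word-count matrix M (entities x vocabulary).
M = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 2.0]])

def nmf(M, k=2, steps=200, seed=0):
    """Multiplicative-update NMF: M ~ V @ S.T with V, S >= 0.
    Rows of V are entity topic vectors, rows of S are word topic vectors."""
    rng = np.random.default_rng(seed)
    V = rng.random((M.shape[0], k)) + 0.1
    S = rng.random((M.shape[1], k)) + 0.1
    for _ in range(steps):
        # Standard Lee-Seung updates for the squared-error objective.
        V *= (M @ S) / (V @ S.T @ S + 1e-9)
        S *= (M.T @ V) / (S @ V.T @ V + 1e-9)
    return V, S

V, S = nmf(M)
# After the updates, the reconstruction V @ S.T approximates M.
```

The multiplicative form keeps both factors non-negative throughout, which is what gives the rows an interpretable topic reading.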

B. CONSTRUCT THE SCORING FUNCTION
Apart from KG-BERT [46], the other methods extend existing embedding techniques. To measure the plausibility of facts from two different perspectives and learn the representations simultaneously, the scoring function is composed of both the structure-based and text-based representations. Since the text-based representation of a single method can be applied to multiple models, we concentrate on how the two representations are combined in the scoring function rather than on its formulation.

1) REPLACE THE ASSOCIATED REPRESENTATIONS
Replacement is a simple and effective way to construct the extended scoring function. Reference [49] replaces the structure-based entity representations in TransE with the corresponding text-based ones, yielding the scoring function

f_r(h, t) = −‖h_d + r_s − t_d‖

The text-based representation is not limited to TransE and has been applied to other techniques, such as TransR and RESCAL, as well. Instead of directly substituting the representations in a single formula, [25] and [42] build multiple formulas for different replacements. Given the fact (h, r, t), the scoring function is the sum of the formulas:

f_r(h, t) = f_ss + f_sd + f_ds + f_dd

DKRL extends TransE, so that, e.g.,

f_sd = −‖h_s + r_s − t_d‖

The structure-based and text-based representations share the same relation embedding, resulting in mutual promotion between the two types of representation. In RDRL [55], instead of TransE, the text-based embeddings are mapped onto the relation-specific hyperplane or into the relation-specific space before being fed into the scoring functions of TransH and TransR.
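The DKRL-style sum of the four cross-combinations can be sketched directly; the embeddings here are toy random vectors, not trained values:

```python
import numpy as np

def translate_score(h, r, t):
    """TransE-style negative L1 distance."""
    return float(-np.abs(h + r - t).sum())

def dkrl_score(h_s, h_d, r_s, t_s, t_d):
    """DKRL-style score: sum the four cross-combinations of the
    structure-based (s) and description-based (d) entity embeddings,
    all sharing the same relation embedding r_s."""
    return (translate_score(h_s, r_s, t_s) + translate_score(h_s, r_s, t_d)
            + translate_score(h_d, r_s, t_s) + translate_score(h_d, r_s, t_d))

rng = np.random.default_rng(2)
h_s, h_d, r_s, t_s, t_d = rng.normal(size=(5, 4))   # toy embeddings
score = dkrl_score(h_s, h_d, r_s, t_s, t_d)
# Each term is a negative distance, so the total is at most 0.
```

Because the relation embedding appears in all four terms, gradients from the text-based terms also update the shared relation, which is the "mutual promotion" the text describes.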

2) UNIFY THE REPRESENTATION
Some combination mechanisms have been proposed to integrate both representations into a single one. Given an entity e, a gate mechanism can be used to combine e_s and e_d. Joint(BiLSTM) applies a linear interpolation between e_s and e_d and uses a real-valued vector g to control how much the joint representation depends on structural or textual information. Particularly,

e = g_e ⊙ e_s + (1 − g_e) ⊙ e_d

where g_e is the gate balancing e_s and e_d, its elements lie in [0, 1], and ⊙ is an entry-wise multiplication. The unified representations are applied to TransE as

f_r(h, t) = −‖h + r_s − t‖

The construction of Joint(BiLSTM) is illustrated in FIGURE 6. The gate mechanism only considers entity representations projected in a real-valued vector space. An et al. [40] give a more comprehensive combination mechanism: a weight factor α ∈ [0, 1] is used to integrate the embeddings,

e = α e_s + (1 − α) e_d,   r = α Re(r_s) + (1 − α) r_d

where Re(·) denotes the real part of the representation. If the structure-based relation representation is a matrix, it is treated as a vector whose elements are those of the diagonal. Simple linear transformations can also be used to unify the embeddings; in this way, [53] defines the joint representation as

e = W_1 e_s + W_2 e_d
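The gate mechanism reduces to an element-wise interpolation; a minimal sketch, with invented two-dimensional embeddings and a sigmoid-squashed gate, might look like this:

```python
import numpy as np

def gated_join(e_s, e_d, gate_logits):
    """Gate mechanism: element-wise interpolation between the
    structure-based (e_s) and text-based (e_d) entity embeddings."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # squash the gate into (0, 1)
    return g * e_s + (1.0 - g) * e_d

e_s = np.array([1.0, 0.0])   # toy structure-based embedding
e_d = np.array([0.0, 1.0])   # toy text-based embedding
# A strongly negative gate hands the representation over to the text side.
e = gated_join(e_s, e_d, np.array([-20.0, -20.0]))
```

Because the gate is a vector, each dimension can independently trust structure or text, unlike the single scalar weight α.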

3) OTHERS
SSP executes the embedding process in a semantic subspace by modeling the association between the KG facts and the textual information. The architecture is similar to TransH, but SSP measures the plausibility of a fact by projecting h + r − t onto a semantic-specific hyperplane rather than a relation-specific one. In this way the facts and the corresponding texts interact, so semantic relationships and textual contexts can contribute to the KG embedding. Let d ≈ h + r − t denote the loss vector and s the normal vector; the scoring function of SSP is then defined as

f_r(h, t) = −λ ‖d − s^T d s‖²_2 − ‖d‖²_2

where λ balances the two parts. The embeddings and extended models are listed in TABLE 3, categorized by the combination mechanism used when constructing the scoring function.
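The SSP-style projection score can be sketched as follows; the embeddings, dimension and the value of λ are toy assumptions, not the paper's settings:

```python
import numpy as np

def ssp_score(h, r, t, s, lam=0.2):
    """SSP-style score: form the loss vector d = h + r - t, remove its
    component along the (unit) semantic normal vector s, and penalise
    both the in-hyperplane component and the full loss vector."""
    d = h + r - t
    s = s / np.linalg.norm(s)
    in_plane = d - (s @ d) * s            # d minus its normal component
    return float(-lam * np.sum(in_plane ** 2) - np.sum(d ** 2))

rng = np.random.default_rng(3)
h, r, t, s = rng.normal(size=(4, 6))      # toy embeddings
score = ssp_score(h, r, t, s)
# A perfectly translated fact has d = 0 and therefore the top score 0.
```

The extra in-plane penalty is what couples the translation residual to the semantic (text-derived) hyperplane.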

IV. TEXT-IMPROVED KG EMBEDDING
The difference between text-improved and text-based KG embedding techniques is that the former do not cover the three key elements; instead they focus on enhancing the structure-based KG embedding with textual information.
We roughly categorize the methods by how the texts are used.

A. INITIALIZE THE ENTITY EMBEDDING
In existing embedding approaches, entity embeddings are usually initialized randomly. Some methods improve KG embedding by initializing the embeddings with textual information. NTN [36] and DISTMULT [37] learn distributed embeddings of the words constituting the entity name; the vectors are then averaged and used to initialize the entity embedding. However, the entity name contains little semantic information. Long et al. [58] follow NTN but replace the entity name with the entity description. Particularly, for each entity e with an associated fragment of textual information D, e_d is defined as

e_d = (1/|D|) Σ_{w_i ∈ D} w_i

where w_i denotes the embedding of a word in the text. NTN learns the word embeddings from an unlabeled news corpus, and DISTMULT uses the pre-trained embeddings released by word2vec. The word2vec and GloVe embeddings in [58] are trained on a large corpus containing many entities from Freebase. DISTMULT introduces another method that treats the entity as a word or phrase in the corpus and learns a distributed embedding of the entity directly for initialization.
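The averaging initialization above is a one-liner in practice; the word-vector table and description words here are invented toy values:

```python
import numpy as np

# Toy pre-trained word vectors; the words and values are invented.
word_vec = {
    "natural":   np.array([1.0, 2.0]),
    "satellite": np.array([3.0, 4.0]),
    "of":        np.array([0.0, 0.0]),
    "earth":     np.array([5.0, 6.0]),
}

def init_entity_embedding(description_words):
    """Initialise an entity embedding as the average of the word
    vectors of its description fragment, skipping unknown words."""
    vecs = [word_vec[w] for w in description_words if w in word_vec]
    return np.mean(vecs, axis=0)

e_init = init_entity_embedding(["natural", "satellite", "of", "earth"])
# Average of the four toy vectors: [2.25, 3.0]
```

The result only seeds the optimizer; the embedding is still refined by the usual fact-based training afterwards.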
The experiments of all these methods show that initialization with textual embeddings is useful in KG embedding. However, not every initialization with word embeddings improves performance on the downstream application. In DISTMULT, initialization with the averaged word vectors of the entity name was observed to degrade performance on the link prediction task, because more than 73% of the entities in the datasets are non-compositional phrases, such as person names, locations and films, whose word vectors are not suitable for representing the entity.

B. AUGMENT THE STRUCTURE-BASED KG EMBEDDING
To incorporate textual information, some works augment the entity and relation embeddings of existing models; specifically, they represent the entity or relation with a linear transformation of the embedding of the textual information. To capture the implicit relationship between entities and attributes, FeatureSum [49] utilizes an unstructured corpus and uses the word2vec model, which embeds entity names appearing in the same context with close word vectors, to generate the associated word vector e_d. FeatureSum then designs the transformation

ê = e_s + M e_d

where M maps e_d into the vector space of e and is independent of the relation type. Sun et al. [56] concatenate the pre-trained vectors of the entity description produced by the Doc2Vec models [59], i.e., DM and DBOW, as the description embedding. The entity representation is defined as

ê = e_s + M_r e_d

where M_r is a relation-specific weight matrix for the description embedding.
The word and description embeddings so produced are unique for an entity across facts with different relations, which negatively impacts handling of the multi-relation problem. Wang et al. [24] present a text-enhanced KG embedding method (TEKE). To better cope with the 1-to-N, N-to-1 and N-to-N problems, entity representations are augmented with the corresponding textual context embeddings. Additionally, TEKE defines the joint context of an entity pair as the textual context embedding of each relation between them, so that relations between different entities, or the same entities in different contexts, obtain distinct representations. TEKE transforms the Wikipedia text corpus into a co-occurrence network constructed from the words and labeled entities. Each entity is treated as a node, and the words in its textual context are defined as its neighbor nodes n(e). The co-occurrence frequency y weights the edge between them, and a threshold is used to remove noise. The textual context embedding of an entity is defined as the weighted average

x_e = Σ_{w ∈ n(e)} y_{e,w} w / Σ_{w ∈ n(e)} y_{e,w}

Given two nodes h and t, the textual context of the relation is defined as the intersection of n(h) and n(t), and the associated embedding x_{h,t} is again the weighted average. The augmented representations for existing embedding techniques, such as TransE, are

ĥ = x_h W_e + b_h,   t̂ = x_t W_e + b_t,   r̂ = x_{h,t} W_r + b_r

where W_e and W_r are weight matrices and b_h, b_t, b_r are bias vectors. The extension of the entity representation is available for all the existing models except ComplEx; the extension of the relation representation applies to TransH and TransR.
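The frequency-weighted context embedding with threshold pruning can be sketched as follows; the neighbour counts and word vectors are invented toy values:

```python
import numpy as np

# Toy co-occurrence counts y between an entity and its neighbour words,
# plus toy word vectors; all names and numbers are invented.
neighbour_freq = {"satellite": 5, "orbit": 3, "cheese": 1}
word_vec = {"satellite": np.array([1.0, 0.0]),
            "orbit":     np.array([0.0, 1.0]),
            "cheese":    np.array([9.0, 9.0])}

def context_embedding(freq, threshold=2):
    """TEKE-style textual context embedding: frequency-weighted average
    of the neighbour word vectors, dropping low-frequency (noisy) edges."""
    kept = {w: y for w, y in freq.items() if y >= threshold}
    total = sum(kept.values())
    return sum(y * word_vec[w] for w, y in kept.items()) / total

x_e = context_embedding(neighbour_freq)
# "cheese" is pruned by the threshold: x_e = (5*[1,0] + 3*[0,1]) / 8
```

A relation's context embedding would be computed the same way over the intersection of the two entities' neighbour sets.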

C. JOINT EMBEDDING OF THE TEXTS AND FACTS
Joint embedding aims to project the textual information and the structural knowledge into the same continuous vector space to improve the structure-based embedding. Specifically, these methods represent and score the facts with existing models, while additionally modeling the textual information and making it interact with the entities and relations.
Wang et al. [26] first present a method of jointly learning (Jointly) the KG embedding and word embedding by aligning the facts and the words in raw texts. A knowledge model, a text model and an alignment model are the components of Jointly. The knowledge model follows TransE to score the facts and designs a conditional likelihood loss L_K to learn the general KG embedding. The text model defines a scoring function measuring the plausibility of two words w and v co-occurring in the same context.
Based on this function, a loss L T is designed to measure the overall fitness of the word pairs. The alignment model guarantees that the entity embeddings and word embeddings are projected into the same vector space. Various alignment texts are used: entity names, Wikipedia anchors and entity descriptions. Particularly, given a fact (h, r, t) and a word pair (w, v), the alignment model generates new triples and pairs such as (h, r, w) and (h, v), depending on whether the words appear in the alignment texts. The total plausibility is measured by the loss L A. Finally, [26], [27] learn the embeddings by maximizing the sum of the three losses. The alignment by entity description has been proved better than the other two alignment mechanisms in the experiments. RLKB [28] proposes a method of jointly embedding the entities, relations and words in entity descriptions into the same vector space. Following Jointly, RLKB designs L K based on TransE to measure the fitness of facts. Then the entity descriptions are made to interact with the entities: given the set of keywords {w 1, w 2, . . . , w n} extracted from the description of entity e, RLKB forces the entity embedding close to the keyword embeddings, with a loss L D defined to measure the total distance. The embeddings are learned by optimizing the joint loss. Jointly and RLKB perform only shallow linguistic analysis on the textual information, which is not powerful enough to catch the important features: the words, entities and relations are directly embedded into continuous vectors, failing to exploit the salient features in texts and facts. Han et al. [39] introduce a mutual attention mechanism between the knowledge model and the text model to filter the noise in sentences and obtain more discriminative KG embeddings, referred to as JointE+SATT. Given the set of sentences {s 1, s 2, . . . , s m} containing the associated entity pair (h, t) and expressing the textual relation r s, a position-based CNN is used to encode each sentence in the text model.
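The alignment model's generation of hybrid triples and pairs can be sketched as follows; the helper name and data layout are our own, and the exact generation rules in [26] may differ.

```python
def generate_alignment_triples(fact, entity_words):
    """For a fact (h, r, t), emit hybrid triples in which h or t is replaced
    by each word from its alignment text (entity name, Wikipedia anchor, or
    description words), tying word and entity embeddings together."""
    h, r, t = fact
    triples = []
    for w in entity_words.get(h, []):
        triples.append((w, r, t))   # head replaced by an alignment word
    for w in entity_words.get(t, []):
        triples.append((h, r, w))   # tail replaced by an alignment word
    return triples
```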
To represent r s, the latent relation r ht = h − t is defined as the attention over the output embeddings. A scoring function o = M r s is defined to measure the fitness. Given the entity pairs φ r = {(h 1, t 1), . . . , (h n, t n)} sharing the common relation r, the knowledge model utilizes M r s as the attention over the latent relations. The framework of the mutual attention is illustrated in FIGURE 7. The related facts are scored with the global relation r k in the formulation of TransE or TransD. The embeddings are learned via the combined plausibility of the relations, facts and parameters θ.
where λ is the harmonic factor. The different mechanisms and textual information are summarized in the accompanying table. All the aforementioned techniques are trained under the open world assumption (OWA): a fact triple that does not exist in the KG is regarded as unknown rather than false [60]. OWA fits the status of knowledge graphs much better than the closed world assumption (CWA), since existing KGs are far from complete and huge amounts of true facts are still missing.

A. LOSS FUNCTION
Some favored loss functions are introduced for model optimization. To make true facts have higher scores, most methods select the margin-based pairwise ranking loss to learn the text-based and structure-based KG embedding:

L = Σ_(h,r,t)∈F+ Σ_(h',r',t')∈F− max(0, γ − f(h, r, t) + f(h', r', t')), (38)

where F + denotes the set of positive samples, F − the negatives, and γ the margin; (h, r, t) is a positive sample in F + and (h', r', t') a negative one in F −. A higher score is assigned to the true fact than to the false one, and the embeddings are learned by maximizing the margin between (h, r, t) and (h', r', t'). The joint embedding techniques favor the likelihood-based loss function, which measures the plausibility of every element in the triple:

L = Σ_(h,r,t)∈F+ [log P(h|r, t) + log P(r|h, t) + log P(t|h, r)], (41)

where P(h|r, t) is the conditional probability of the head h given r and t, defined as a softmax over the candidate entities, P(h|r, t) = exp{f(h, r, t)} / Σ_h'∈E exp{f(h', r, t)}.
The same formulation is defined for P(r|h, t) and P(t|h, r).
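The margin-based pairwise ranking loss of Eq. (38) can be sketched as follows, assuming (as the text states) that higher scores indicate more plausible facts.

```python
import numpy as np

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Pairwise margin-based ranking loss: for each positive/negative score
    pair, add max(0, margin - f(pos) + f(neg)); the loss is zero once every
    positive outscores its negative by at least the margin."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    return float(np.maximum(0.0, margin - pos + neg).sum())
```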

B. NEGATIVE SAMPLING
Negative samples can be generated by corrupting true facts. A simple method is to replace the head or tail entity of a positive fact (h, r, t) ∈ F + with an entity sampled uniformly at random from the entity set E [10]. However, this may produce false negatives, i.e., corrupted triples that are actually true. More effective sampling strategies are proposed in subsequent research such as [13], [61]. Bernoulli sampling replaces the head with probability tph/(tph + hpt) and the tail with probability hpt/(tph + hpt), where tph and hpt denote the average number of tail entities per head entity and the average number of head entities per tail entity, respectively. For textual information learning, entity descriptions are replaced accordingly during training. Some of the introduced models treat the entities and relations as words/phrases in the textual information to learn their contexts and expressions, so the training procedure needs to be modified according to their own definitions or assumptions. In [26], a word w and a context word v are considered a positive pair (w, v) with an implicit relation; a distinct word w̃ is sampled uniformly from the textual information to form negative pairs (w̃, v) that never co-occur in the texts.
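Bernoulli sampling can be sketched as below; this is a minimal implementation of the tph/(tph + hpt) replacement rule, with names of our own choosing.

```python
import random
from collections import defaultdict

def bernoulli_probs(triples):
    """For each relation, tph = average number of distinct tails per head and
    hpt = average number of distinct heads per tail; the head of a training
    triple is corrupted with probability tph / (tph + hpt), the tail otherwise."""
    tails_of = defaultdict(set)   # (r, h) -> set of tails
    heads_of = defaultdict(set)   # (r, t) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    probs = {}
    for r in {rel for _, rel, _ in triples}:
        tph_vals = [len(v) for (r2, _), v in tails_of.items() if r2 == r]
        hpt_vals = [len(v) for (r2, _), v in heads_of.items() if r2 == r]
        tph = sum(tph_vals) / len(tph_vals)
        hpt = sum(hpt_vals) / len(hpt_vals)
        probs[r] = tph / (tph + hpt)
    return probs

def corrupt(triple, entities, probs, rng=random):
    """Produce one negative sample by replacing the head or the tail."""
    h, r, t = triple
    if rng.random() < probs[r]:
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))
```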
Jointly [26] and RLKB [28] use negative sampling to simplify the loss function. The likelihood-based loss is hard to normalize because of the millions of normalizers, and the large number of candidate entities slows the training phase. Hence the negative sampling strategy [35] is used to transform the loss objective and reduce the number of candidate entities. A subset of negative candidates is sampled from the corresponding negative distribution P neg; the probability of a triple being positive is defined as P(1|h, r, t) and of being negative as P(0|h', r, t). Then log P(h|r, t) in (41) is approximated by

log P(1|h, r, t) + Σ_h'∼P neg log P(0|h', r, t)

and replaced with it. The same transformation is applied to log P(r|h, t) and log P(t|h, r).
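The sampled approximation of log P(h|r, t) can be sketched as follows; we assume a sigmoid parameterization of P(1|·) and uniform negatives for illustration, whereas the actual distribution P neg in [26], [28] may differ.

```python
import math
import random

def sampled_log_likelihood(score_fn, h, r, t, entities, n_neg=5, rng=random):
    """Negative-sampling approximation of log P(h | r, t):
    log sigma(f(h, r, t)) + sum over sampled h' of log sigma(-f(h', r, t)),
    avoiding normalization over all candidate entities."""
    sigma = lambda x: 1.0 / (1.0 + math.exp(-x))
    total = math.log(sigma(score_fn(h, r, t)))          # positive term
    for _ in range(n_neg):
        h_neg = rng.choice(entities)                    # sampled negative head
        total += math.log(sigma(-score_fn(h_neg, r, t)))
    return total
```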

C. DATASET
FB15K [10] and WN18 [10] are the most widely used datasets for knowledge graph model training. But for models that require entity descriptions, they still need to be refined: entities with too few or no descriptive words, along with the triples they form, are removed from FB15K, and the same processing is applied to WN18. New datasets have also been created for model training and testing, e.g., DBpedia500k [47]. Some downstream tasks need task-specific datasets. FB20K [25] is generated for open-world KG completion, in which one of the entities of each test triple is unseen in the KG. FB15k-237-OWE is generated based on FB15k-237 [62], in which redundant inverse relations have been removed and the entity descriptions are restricted to 5 words on average. Two works are excluded from the following comparison: one does not share the details of its model, and the other, [43], introduces entity types as auxiliary information besides texts. n e, n r, and n w denote the sizes of the entity set, relation set and vocabulary, respectively; p is the dimensionality of the position embedding and k is the number of nodes in the hidden layer. All the models are trained under the open world assumption. Prerequisites are the techniques used for preprocessing or pre-training, whose parameters are not counted. The parameters of the different models are calculated under the condition that the structure-based embedding is provided by TransE, except for [36], [37] and [44]. Several conclusions can be drawn from the parameter counts. First, initialization is apparently the most parameter-efficient way to enhance the KG embedding, and the improvement depends on suitable textual information. Second, most text-improved techniques have fewer parameters than the text-based ones, since they rarely apply complicated mechanisms to cope with the textual information and incorporate it into the existing models.
Compared with TABLE 4, more expressive encoding models have more parameters in text-based embedding. On the other hand, models based on raw texts tend to have more parameters than those based on labeled texts. DNN models cannot cope with raw texts directly; preprocessing methods, e.g., entity linking and distant supervision, are required to generate textual mentions or annotate the raw texts. However, these methods may produce noisy labels and force the DNNs to introduce additional mechanisms, e.g., attention, which also results in more parameters. Models working on entity descriptions usually have the fewest parameters, for the entity descriptions in KGs are precise and concise. We next discuss the performance of these methods, restricted to the link prediction task on two datasets, i.e., the WordNet and Freebase data. Intuitively, models with more mechanisms have more parameters and should lead to better performance. However, given the same textual information, models with more parameters do not necessarily perform better. For example, Joint(A-LSTM) is slightly worse than Joint(LSTM) on WordNet; the reason is that the number of relations is too small for Joint(A-LSTM) to use the relation as attention. In [53], the CNN architecture of DKRL is reimplemented and the extended model is replaced with TransR; better performance is observed in the experiments, which indicates that more expressive composition can be helpful. In most cases, the combination of structure-based and text-based embedding should outperform using either alone. However, the Hits@10 of PE-WWV differs between WordNet and Freebase: on WordNet, the structure-based embeddings provided by TransE and TransR perform better than PE-WWV, while on Freebase PE-WWV performs better [49]. The reason is that there are no common words between the definitions of the paired entities.
A similar case happens to TEKE, whose models have more parameters and need more rounds to converge. TABLE 3 compares the structure-based and text-based embeddings and the extended models of text-based KG embedding. The ''Extended Models'' column lists the existing embedding techniques used in the experiment sections. Apparently, TransE is the most popular model for capturing the structural information. Most models choose to combine the representations into a joint one or to constitute a hybrid scoring function. It remains an open question which way of constructing the scoring function is best, since no related comparisons have been found to the best of our knowledge.

VI. APPLICATIONS IN KG-RELATED TASKS
Having reviewed the current approaches that utilize textual information in KG embedding, we now show some applications to which the entity and relation embeddings learned by such techniques can be applied. Moreover, we briefly review some embedding techniques with texts aimed at these specific applications. The applications are restricted to those relevant to knowledge graphs, including KG completion tasks, i.e., link prediction, triple classification and entity classification, as well as out-of-KG tasks, i.e., entity alignment, relation extraction and recommender systems.

A. LINK PREDICTION
Link prediction searches for an entity that probably forms a new fact with a given entity and a specific relation. Since KGs are always incomplete, link prediction aims to discover missing knowledge and add it to the KG. With an existing relation and entity, candidate entities are selected to form a new fact. The task has been experimented with in many previous studies [10], [13]. A common form is: given (?, r, t), predict h, or given (h, r, ?), predict t, where h ∈ E, r ∈ R, t ∈ E, and (h, r, t) / ∈ G. For instance, (?, satellite of, Earth) asks what goes around the Earth, and (Moon, satellite of, ?) asks what object the Moon moves around. It is easy for the learned entity and relation embeddings to make such predictions by treating the task as a ranking procedure: feeding the candidate entities' embeddings into the scoring function f r (h, t), the answer can be obtained by ranking the scores of the candidate facts. As a result, the task can be handled by all KG embedding methods. Evaluation criteria, such as mean rank, Hits@n (the proportion of ranks no larger than n) and AUC-PR, are computed from the ranks.
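The ranking procedure and the derived metrics can be sketched as follows, again assuming higher score means more plausible.

```python
import numpy as np

def rank_of_entity(score_fn, target, r, t, candidates):
    """Rank of the true head among all candidates for the query (?, r, t):
    one plus the number of candidates that score strictly higher."""
    target_score = score_fn(target, r, t)
    return 1 + sum(1 for e in candidates if score_fn(e, r, t) > target_score)

def mean_rank_and_hits(ranks, n=10):
    """Mean rank and Hits@n (proportion of ranks no larger than n)."""
    ranks = np.asarray(ranks, dtype=float)
    return float(ranks.mean()), float((ranks <= n).mean())
```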

1) ZERO-SHOT SETTING
In the zero-shot scenario, the restrictions on entities are relaxed. The setting introduces a set of entities E i that are out of the KG, so the task is formulated as: given (?, r, t), r ∈ R, t ∈ E ∪ E i, predict h ∈ E ∪ E i, or given (h, r, ?), r ∈ R, h ∈ E ∪ E i, predict t ∈ E ∪ E i, with (h, r, t) / ∈ G. The out-of-KG entities do not carry any structural information, so KG embeddings learned from facts alone fail to cope with the task, for they rely heavily on the structural information and connectivity in the KG. However, text-based embedding is still applicable as long as the entities have associated textual information [25], [26]. Link prediction in the zero-shot scenario is also called open-world KG completion in [47].
Recently, [64] introduced an Open-World Extension (OWE) that enables traditional KG embedding models to perform link prediction in the zero-shot scenario. OWE utilizes entity descriptions to supply the missing structural information. Particularly, the text-based entity embedding is defined as the mean of the description's word vectors and mapped to the vector space of the structure-based embedding; a linear function, an affine function and a 4-layer multi-layer perceptron are introduced as candidate transformations Ψ θ. The transformations are trained by minimizing the distance between the mapped text-based embedding Ψ θ (v h) and the structure-based embedding e s released by the pre-trained KG embedding model, where θ denotes the parameters of the transformation model. Supposing the extended model is TransE and the head entity h is out of the KG, the task is performed based on the score ||Ψ θ (v h) + r − t||. ConMask [47] performs the embedding-based entity prediction task solely based on the entity description, entity name and relation name. Given a fact (h, r, t), ConMask builds its text-based representation with relationship-dependent content masking, semantic averaging and target fusion. The masking highlights the relevant words by computing the similarity between each word in the description and the words constituting the relation name; the weight of the i-th word in the entity description, w i, is the largest score among the (i − k : i)-th word embeddings. Semantic averaging produces the mean vector of the name words. Target fusion then extracts the entity embedding with a fully convolutional network (FCN). The normalized text-based embedding is used to measure the fitness of the fact.
The overall framework of ConMask is illustrated in FIGURE 8. Additionally, link prediction in the zero-shot scenario can mitigate the poor scalability of KGs [63], because the KG can be extended with the out-of-KG entities.
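The OWE pipeline described above (mean word vectors, a learned map Ψ, and a TransE-style score) can be sketched as follows, using the linear variant of the transformation fitted by least squares; names are our own.

```python
import numpy as np

def text_embedding(words, word_vecs):
    """Text-based entity embedding: mean of the pre-trained word vectors
    of the entity's name/description."""
    return np.mean([word_vecs[w] for w in words], axis=0)

def fit_linear_map(text_embs, graph_embs):
    """Fit the linear transformation Psi by least squares so that
    Psi(text embedding) approximates the structure-based embedding e_s."""
    W, *_ = np.linalg.lstsq(text_embs, graph_embs, rcond=None)
    return W

def open_world_score(desc_words, word_vecs, W, r, t):
    """TransE-style score for a query whose head is out of the KG:
    -||Psi(v_h) + r - t||."""
    h_hat = text_embedding(desc_words, word_vecs) @ W
    return -float(np.linalg.norm(h_hat + r - t))
```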

B. ENTITY CLASSIFICATION
Entity classification is a multi-label classification task that aims to predict the types of an entity, since each entity e has multiple types in KGs. The entity embeddings learned by the models are fed as features into a classifier, e.g., logistic regression. In the zero-shot scenario, an unseen entity is represented by its text-based embedding [25]. For evaluation, mean average precision (MAP) is used.
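A minimal sketch of the classification step, with a hand-rolled one-vs-rest logistic regression standing in for an off-the-shelf classifier:

```python
import numpy as np

def train_one_vs_rest(X, Y, lr=0.5, epochs=200):
    """One-vs-rest logistic regression: entity embeddings X (n x d) as
    features, multi-label type matrix Y (n x L) with entries in {0, 1}."""
    n = X.shape[0]
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-X @ W))   # per-label probabilities
        W -= lr * X.T @ (P - Y) / n        # gradient step on logistic loss
    return W

def predict_types(W, x, threshold=0.5):
    """Predict all types whose probability exceeds the threshold."""
    p = 1.0 / (1.0 + np.exp(-x @ W))
    return (p >= threshold).astype(int)
```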

C. TRIPLE CLASSIFICATION
Triple classification verifies whether unseen facts in the testing data are true or not, which is typically regarded as a binary classification problem. The scoring function assigns a score to each triple, and the decision rule is a relation-specific threshold. The aforementioned embedding methods can all be applied to triple classification. Ranking metrics and the micro- and macro-averaged accuracy can be used to evaluate the task.
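The threshold-based decision rule can be sketched as:

```python
def classify_triples(score_fn, triples, thresholds):
    """Predict a triple as true iff its score exceeds the relation-specific
    threshold (tuned on validation data)."""
    return [score_fn(h, r, t) > thresholds[r] for h, r, t in triples]

def accuracy(predictions, labels):
    """Fraction of triples classified correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```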

D. ENTITY ALIGNMENT
Entity alignment (EA) aims to identify entity pairs with an equivalence relation across heterogeneous KGs. Given two different KGs G 1 and G 2, let E 1 and E 2 respectively denote their entity sets. EA aligns entities e 1 ∈ E 1 and e 2 ∈ E 2 into an entity pair (e 1, e 2) where e 1 is equivalent to e 2. In practice, a small set of aligned entities (i.e., equal entities in different knowledge graphs) is given to seed the alignment process. Embedding-based alignment calculates the similarity between the embeddings of a pair of entities [65]. Recently, embedding models have been leveraged to align entities in multilingual KGs [66]. The application is novel but challenging, for the cross-lingual links are far from complete and lack supervision, which influences the performance of cross-lingual inference.
KDCoE [67] first conducts co-training of structure-based KG embedding and text-based embedding to align entities across multilingual KGs. The method is composed of a multilingual KG embedding model (KGEM) and a multilingual entity description embedding model (DEM). KGEM is modified from the previous embedding model MTransE-LT [66]: TransE and a linear-transformation alignment model are introduced to jointly learn the cross-lingual inferences. In DEM, a gated recurrent unit (GRU) encoder incorporating a self-attention mechanism is applied to learn the semantic features in multilingual entity descriptions. Pre-trained word embeddings released by the cross-lingual BilBOWA [68] are given as input to the encoder. The text-based embedding is obtained by applying the affine matrix M to the average of the encoder output vectors v i over the description sequence d, where v i is the i-th output vector of the encoder. The scoring function of the description embedding measures the distance between the description embeddings of each entity pair (e, e') in an unordered language pair (L, L'). Finally, KGEM and DEM are iteratively co-trained to propose new inter-lingual links (ILLs) in turn.

E. RELATION EXTRACTION
Relation extraction [69] aims to extract the latent relationships connecting entities in raw texts, to help build KGs automatically. Although relation extraction is extensively studied in NLP, due to the lack of labeled relational data, much research utilizes KGs in a distant supervision manner [70], [71], [72], also referred to as weak supervision or self supervision: heuristic matching creates training data by assuming that sentences containing the same entity mentions may express the same relation recorded in a relational database. Riedel et al. [11] jointly embed the textual information and the KG into the same matrix to perform relation extraction. Entity pairs are represented as the rows of the matrix, and the textual mentions or KG relations as the columns. The value of an entry is set to 1 if the entity pair appears in the textual mention or forms a fact with the observed relation in the KG, and to 0 otherwise. During training, entity pairs are associated with textual mentions and KG relations at the same time, the latter providing the distant supervision. Relation extraction is then performed by predicting the missing relations given the textual mentions and the entity pair. Fan et al. [73] use text features in place of the textual mentions in the first columns of a similar matrix.
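The construction of the binary entity-pair by column matrix can be sketched as follows; names and toy data are our own.

```python
import numpy as np

def build_pair_column_matrix(pairs, columns, observations):
    """Binary matrix in the spirit of [11]: rows are entity pairs, columns
    are textual mentions and KG relations; an entry is 1 iff the pair is
    observed with that mention or forms a fact with that relation."""
    row = {p: i for i, p in enumerate(pairs)}
    col = {c: j for j, c in enumerate(columns)}
    M = np.zeros((len(pairs), len(columns)), dtype=int)
    for pair, c in observations:
        M[row[pair], col[c]] = 1
    return M
```

Relation extraction then amounts to filling in the zero entries of the relation columns via matrix factorization.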

F. RECOMMENDER SYSTEM
Recommender systems provide users with suggestions about what they may want to buy or inspect. Collaborative filtering is widely used in various recommender systems and has achieved great success; it models the interaction between users and items as the product of their latent representations. However, because user-item interactions can be very sparse, such techniques do not always work well.
Zhang et al. [4] improve collaborative filtering by introducing a hybrid recommendation framework that utilizes the facts, textual information and visual information in a KG. Given an item, TransR is used to obtain the structure-based entity representation, a stacked denoising auto-encoder generates the text-based entity representation and stacked convolutional auto-encoders capture the visual-based representation. For each item j, η j is defined as its original latent vector. After incorporating the representations from the KG, the latent vector is redefined as e j = η j + s j + t j + v j, where s j, t j and v j are the associated structure-based, text-based and visual-based representations, respectively. Given the latent vector u i of user i, the preference between them is represented as u i e j. The representations are learned and proved effective in the recommender system.
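The fusion of the latent vectors and the preference score can be sketched as follows, assuming the additive composition described above.

```python
import numpy as np

def item_latent_vector(eta_j, s_j, t_j, v_j):
    """Item latent vector after incorporating the KG representations:
    e_j = eta_j + s_j + t_j + v_j."""
    return eta_j + s_j + t_j + v_j

def preference(u_i, e_j):
    """Predicted preference of user i for item j as the inner product."""
    return float(u_i @ e_j)
```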

VII. CONCLUSION AND FUTURE DIRECTIONS
Much work has been done to handle the sparseness of KGs and enhance the embedding performance with textual information. The text-based KG embedding methods introduce entity and relation representations built from entity descriptions and textual mentions; low frequency or poor connectivity of entities and relations only slightly influences the text-based representations. Various powerful encoding models, e.g., linear models, deep neural networks and topic models, have been proposed for the representation, and attention mechanisms are used to deal with the noise in texts and the multi-relation problem. Scoring functions are constructed from the structure-based and text-based representations or from unified representations. The text-improved KG embedding methods help fact-based embedding techniques with different mechanisms: pre-processed word vectors are used to initialize or augment the KG embedding, and embedding models and text models are made to interact and trained simultaneously, so that the KG embedding can be improved by the texts. Both the text-based and text-improved methods are proved effective. In addition, utilizing textual information in KG embedding can handle zero-shot KG completion and many other tasks. We outline several future directions for utilizing textual information in KG embedding. (i) Incorporation with the latest embedding models: The reviewed techniques mainly incorporate the textual information into TransE, TransR, and TransH. Recently, many advances have emerged in the field of KG embedding with facts, e.g., ConvKB [74], RotatE [75], and CrossE [76]. The latest encoders, e.g., capsule networks and graph neural networks, and new characteristics, e.g., relation patterns and interaction matrices, have been introduced to enhance the performance of KG embedding. Nevertheless, such techniques still suffer from the incompleteness of KGs. Extending these newly proposed models with textual information is an interesting research direction.
(ii) Scalability of KG: Scalability is an important attribute for large-scale KGs containing millions of entities and relations. To deal with the noise in texts and the multi-relation problem, existing methods simply increase the expressiveness of the model but ignore computational efficiency, which can have a negative impact on scalability. Thus, the balance between model expressiveness and computation cost should be studied. (iii) Open-world KG Completion: Novel entities and relations are added to KGs every day. Embedding techniques based on the observed facts alone fail to learn representations for the newcomers. As mentioned before, the problem can be tackled by introducing text-based representations. However, except for DKRL, ConMask, and OWE, few works try to solve this task. Since structural information is no longer available for new entities, expressive encoding models that generate precise embeddings from texts are necessary. In addition, the above three achievements only consider entity prediction and fail to predict out-of-KG relations.
We also note several challenges in utilizing textual information. (i) Coverage of entities: Currently, text-based KG embedding relies heavily on entity descriptions, which do not cover entities as broadly as raw texts. Experiments are usually conducted on processed datasets in which every entity owns a description; in reality, this condition does not hold and the encoding models may become useless. Even the text corpora used in text-improved methods cannot be guaranteed to contain all the entities to be embedded. (ii) High-frequency entities: Contrary to sparseness, high-frequency entities own relatively rich structural information; based on their connectivity, their embeddings can be well trained by fact-only techniques. The text-related mechanisms that help low-frequency entities might damage these well-learned embeddings, and consequently the overall performance of the model suffers. (iii) Latent relations in texts: Relations are explicit in facts but implicit in texts. Even in textual mentions, definitively representing the relation between an entity pair is formidable. Most text-based and text-improved methods choose to target only the entities; the rest either treat the representations of the paired entities as a substitute or introduce complicated processing procedures. The former is not precise enough and the latter is inefficient.
As a result, there are still unsolved problems and valuable research directions worth exploring. We expect this survey to facilitate future work on KG embedding with textual information.