Knowledge Graph Embedding via Graph Attenuated Attention Networks

Knowledge graphs contain a wealth of real-world knowledge that can provide strong support for artificial intelligence applications. Much progress has been made in knowledge graph completion, state-of-the-art models are based on graph convolutional neural networks. These models automatically extract features, in combination with the features of the graph model, to generate feature embeddings with a strong expressive ability. However, these methods assign the same weights on the relation path in the knowledge graph and ignore the rich information presented in neighbor nodes, which result in incomplete mining of triple features. To this end, we propose Graph Attenuated Attention networks(GAATs), a novel representation method, which integrates an attenuated attention mechanism to assign different weight in different relation path and acquire the information from the neighborhoods. As a result, entities and relations can be learned in any neighbors. Our empirical research provides insight into the effectiveness of the attenuated attention-based models, and we show significant improvement compared to the state-of-the-art methods on two benchmark datasets WN18RR and FB15k-237.


I. INTRODUCTION
The knowledge graph (KG) is a graph-based data structure composed of ''node-edge-node'' that represents a semantic network. The node represents a ''concept'' or ''entity'', and the edge represents a relation between two entities. For example, in Figure 1, a triple (Joe Russo, born_in, Cleveland) is represented as two entities: Joe Russo and Cleveland with a relation born_in linking them. Knowledge graphs are used to describe concepts, entities, and the rich relations between them in the real world. At present, knowledge graphs have been widely used in finance [1], medical [2], semantic search [3], and other fields.
However, the use of knowledge graphs is often limited due to the deficiency in relations arising from incomplete processing data. It is difficult to complete the missing knowledge by means of extraction or fusion because of the sparseness of the data. The completion of true relations, namely knowledge graph completion, remains an active research area. For example, in Figure 1, given triples (Anthony Russo, brother_of, Joe The associate editor coordinating the review of this manuscript and approving it for publication was Huiling Chen . Russo), (Joe Russo, born_in, Cleveland), (Cleveland, city_of, Ohio) and (Ohio, state_of, USA), the inference (Anthony Russo, nationality, USA) should stands. State-of-the-art methods aim to match entities and relations to low-dimensional continuous vector spaces to characterize their latent semantic features.
The most advanced knowledge embedding approach is knowledge graph representation learning, which mainly categories into tensor decomposition, translational model, and neural networks model. While statistical models need to capture the joint distribution triple features between multiple atomic, which will cause the exponential growth of feature space, the representation learning maps features to the distributed space so that the complex relations are decoupled and dimensionality disaster problem is alleviated. In addition, the data sparsity is prominent in knowledge graph, while the representation learning fills the sparse matrix by numerical calculation, which solves the data sparsity problem to some extent. Finally, representation learning allows symbol data to directly participate in the computation without using statistics such as counts, distributions, and so on.
However, there are still some problems in representation learning. First of all, state-of-the-art methods of representation learning only consider the triples and their hidden features, while the rich information contained in the neighboring triples is not taken into account.
Secondly, the relation embedding is too simple.The existing methods are mainly based on entity embedding, ignoring the influence of the diversity of the relations on the representation of the triple. At last, the existing method adopts a strategy of equally assigning weights to multiple relations on the same path, so that the importance of the relations is treated equally, resulting in error in link prediction.
We use event knowledge graph to make prediction of events, so the accuracy of relations is critically significant. Once the relation reasoning on the critical path goes wrong, it may lead to the prediction of completely oppsite information, misleading the decision analysis. Therefore, improving the accuracy of relation prediction is an urgent problem we need to solve. Inspired by the previous research, we propose the attenuated attention-based graph embedding for knowledge graph completion. Graph attention networks(GATs) with n-hop neighbors [14] have achieved improvements in relation prediction. Our model introduces the attenuated attention mechanism while considering the n-hop neighbors, i.e., the closer the entity is to a given entity, the higher the weight of attention gained. We add it to the GATs and learn the new embedding of entities and relations. Then we use our model to train the relation and entity embeddings separately. More details will be described in Section III.
Our contributions are as follows. First, in order to learn the more expressive embedding, we introduce the graph attention network. Secondly, in view of its existing limitations, we propose the attenuated attention mechanism, and then propose a novel entity and relation embedding method based on graph attenuated attention networks. Third, we use the information of the n-hop neighbor nodes to extend the representation of the entities and relations. Fourth, we introduce the encoderdecoder model. The graph attenuated attention network is used as the encoder, and the Capsule Networks Embedding(CapsE) [18] is used as the decoder. This model supports the exploration of the triple features on a deeper level. Finally, we evaluate the proposed approach with the experiment that uses two benchmark datasets. The experimental results show that our model accuracy performs better than the state-ofthe-art methods on the FB15k-237, and most indicators in WN18RR are better than the state-of-the-art methods.
The rest of the paper is structured as follows. We first review the related work in Section II and then present our detailed approach in Section III. The experimental dataset descriptions, results, and analysis are reported in Section IV and followed by our conclusion and future work in Section V.

II. RELATED WORK
Recently, several different representation learning methods have been proposed for relation prediction. These methods can be broadly classified into (i) tensor decomposition, (ii) knowledge embedding models.
The basic idea of tensor decomposition is to replace the original relation matrix with multiple low-dimensional matrices or tensor products, thus replacing sparse and large amounts of raw data with a small number of parameters. The RESCAL [6] takes the inherent structure of dyadic relational data into account by employing the tensor factorization. The Neural Tensor Networks (NTN) [4] model expresses the relations as a matrix to characterize the correlation of potential features. The bilinear matching between the entities (head, tail) and the relations are used to judge the possibility of the relation being established. However, both of these models require a large number of matrix multiplication operations, which greatly increases the time cost.
The knowledge embedding models are further classified as translational models and neural network models. The translational models are based on the energy function, which demostrate the fact that golden triples have low energy while corrupted triples have high energy [8]. The judgment of the established tirple is golden or not is determined by the calculation of the energy function. Translating Embedding(TransE) [5] is a simple model with fewer parameters, but there are certain problems in dealing with oneto-many and many-to-one relations. In order to solve the problem, Translating embedding on Hyperplanes(TransH) [7] and Translating Embedding in Relation spaces (TransR) [8] models are proposed. TransH and TransR calculate a triple score with the same entity using alternative representations in different relation spaces, effectively avoiding the convergence problem. Translating Embedding via Dynamic Mapping Matrix(TransD) [9] solves the diversity of entities and relations by applying the transitional matrix determined by the corresponding entities and relations. Since the initial introduction of TransE in 2013, a variety of methods have been proposed under this framework. The community has also proposed additional methods (e.g., Translating Embedding via relational Mapping properties(TransM) [12], Translating Embedding via Adaptive Approach(TransA) [10]). A summary is shown in Table 1. The scoring functions of the methods are introduced, and the number of parameters represents the model complexity. Scoring function and complexity of proposed translational embedding models. n e and n r represent the number of entities and relations, respectively. d is the dimension of entity and relation embedding. h, r, t represent the head entity, relation, and tail entity embedding, respectively, M r represents the relation transitional matrices, I denotes the identity matrix.
Different from the translational models, in the field of neural networks, the representation of the knowledge graph is how to express entities and relations through neural network models so that entities and relations can be calculated by symbols. The neural networks models, which focus more on mining latent semantic information of triples, mainly include Neural Tensor Networks (NTN) [4], Convolutional Knowledge Base Embedding(ConvKB) [16], Convolutional Network Embedding(ConvE) [15], Relational Graph Convolutional Networks (R-GCN) [21], Graph Convolutional Networks(GCNs) [11], and Capsule Networks Embedding (CapsE) [18]. ConvKB uses the internal structure of the textual relation as input to the convolutional neural network. NTN learns a tensor network for each relation in the knowledge graph. ConvE uses convolutional neural networks (CNNs) to optimize the input vector to score the triples. The R-GCN uses a graph convolutional network to obtain an embedding of the triples, then applies DistMult [19] to compute a score for the embeddings. In the European space represented by the image, the number of neighbors of the node is fixed. However, in the non-European space represented by the knowledge graph, neighbors are not fixed. The convolution operation in the European space is a feature of the pixel extracted by a fixed-size learnable convolution kernel. Nevertheless, the traditional convolution kernel cannot be directly used to extract the features of the nodes on the graph due to the unfixed number of neighbors. Therefore, GCNs was introduced to find a learnable convolution kernel suitable for graphs. CapsE represents each triple as a 3-column matrix and feeds into capsule neural networks to generalize the expressive embedding. A key limitation of tensor decomposition and neural network based approaches is the high computational cost. To address this limitation, the holographic embedding model(HOLE) [17] has been proposed to construct a more efficient embedding representation by applying the cyclic correlation of entity embedding. Besides, taking path learning into consideration, other than [14], reinforcement learning(RL) with multi-hop knowledge embedding is proposed for the use of query answering based on knowledge graph [28]. Relation path embedding(RPE) [31] adds multihop relation paths to the translation model and simultaneously embeds each entity into two types of latent spaces.
Path-based Attribute-aware Representation Learning model (PARL) [32] is proposed to perform path denoising and path representation learning for the relation prediction task. Discriminative path-based embedding model (DPTransE) [33] builds interactions from the jointly learned latent features and graph features and uses the graph features as the crucial prior to offer precise and discriminative embedding. Metabased multi-hop knowledge graph reasoning (Meta-KGR) [34] adopts meta-learning to learn effective meta parameters from high-frequency relations that could quickly adapt to few-shot relations.
In addition to the models above, the incorporation between attention mechanisms and neural networks, which aim to improve the robustness of models, is widely studied. Attention Graph Convolution Network (AGCN) [35] consists of an attention mechanism layer and Graph Convolution Networks(GCNs) to perform superpixel-wise segmentation in big SAR imagery data. Graph Attention Model (GAM) [36] focuses on small but informative parts of the graph, avoiding noise in the rest of the graph. Attention mechanism is applied to reason evidence from the representation of multiple paths to predict whether the entities should be connected by the candidate relation [41]. Besides, attention mechanism is also applied to the classic compositional method to find reasoning paths between entities [40]. Open-domain conversation generation model with graph attention model [37] retrieves relevant knowledge graphs from a knowledge base and then encodes the graphs with a static graph attention mechanism. Learning node embeddings via graph attention utilizes attention parameters exclusively on the data itself instead of inference. Knowledge aware Path Recurrent Network (KPRN) [38] generates path representations by composing the semantics of both entities and relations and allows effective reasoning on paths to infer the underlying rationale of a user-item interaction for recommendation. Knowledge Graph Attention Network (KGAT) [39] recursively propagates the embeddings from a node neighbors to refine the node embedding, and employs an attention mechanism to discriminate the importance of the neighbors for recommendation as well.
In summary, although the neural network-based methods suffer the problems of high complexity and computational cost, the triples based on them have a stronger expressive ability. But state-of-the-art models only regard the relation embedding as an auxiliary feature of the entity embedding, and does not deeply consider how to embed the relations. Besides, when there are indirect connections (not 1-hop relation) between the two entities, the former research assigns the same weight to the relations of each hop, resulting in the loss of information about the partial neighbor triples. At this time, a model that integrates the neural network model and translational model, leveraging the advantages of both models, is not yet available in the literature. Here, we propose a combined model to further improve the task of knowledge graph completion in terms of prediction accuracy and computational cost.

III. APPROACH
Here, we first formalize the notation adopted in this work.
. . , e n } and R = {r 1 , . . . , r i , . . . , r m } denote the set of entities and relations, respectively. The triples are represented as (e i , r k , e j ) where e i denotes a head entity, e j denotes a tail entity, and r k denotes a relation linking e i and e j . Their embeddings are denoted by (h i , r k , t j ). The embedding representation model attempts to learn efficient representations of the entities, relations and scoring functions f , such that for a given triple a = (e i , r k , e j ), f (a) gives the probability of a being a golden triple.

A. GRAPH ATTENTION NETWORKS
As we introduced in Section II, GCNs make a great progress in graph features extraction. But there remains a problem that GCNs only focus on the node itself without considering its neighbors, which contain rich valuable information. The graph attention network (GATs) [13] is a further improvement of GCNs. The GATs emphasize the assignment of varying importance values to different node neighbors, rather than allocating an equal weight for all neighbor nodes, as is done in GCNs.
The input feature of the node-set in the graph attention network layer is e = {e 1 , e 2 , . . . , e N }, the output features of the nodes are e = {e 1 , e 2 , . . . , e N }, where e i indicates the input embedding of the i-th node. e i indicates the output embedding of the i-th node. N indicates the number of nodes. The attention value of a node can be formalized as where W represents a parametric linear transformation matrix that maps input features to high-dimensional output feature space. The attention value e ij indicates the edge between node e i and node e j and f a represents an attention function. The attention value reflects the importance of the edge (e i , e j ), which can be used to measure the importance of the head node e i . The attention value can be obtained as it shown in Figure 2. We learn a weight of attention for each edge and then gather information from neighbors based on those  weights as where, η ij represents relative attention, which computed by a softmax function over all the values in the neighbors. W represents a linear mapping matrix. After this operation, the neighbors representation of the node e i is output. The η ij can be calculated as follows where, i denotes the neighbors set of entity e i , in denotes the relations set which connects between e i and e n .
To prevent over-fitting of the model, we use multiple independent different attentions for the attention calculations [22]. The multi-head attention structure is shown in Figure 3.

VOLUME 8, 2020
The head node (HN), edge (E), and tail node(TN) first enter a linear transformation and then are input to the expansion point product attention for M times, namely the multi-head attention. For each of the M times, the parameters between the headers are not shared, and the parameters for the linear transformation of HN, E, and TN are different. Then, the M -th order expansion results are concatenated, and the values obtained by the linear transformation are used as the result of the multi-head attention. The multi-head attention process of concatenating M attention heads is shown as follows where, || represents a concatenate operation, σ represents a nonlinear activation function, η m ij represents the weight obtained by the m-th attention mechanism, and W m represents a linear mapping matrix of the m-th attention mechanism. Then, we use the averaging method to obtain the final node representation as follows

B. ATTENUATED ATTENTION MECHANISM
The attention mechanism involved in deep learning essentially plays a similar role as the human selective visual attention mechanism [28]. The core is to select critical features for the current mission objectives from sufficient information. However, from the perspective of human experience, the scope of influence of attention weight is not equivalent. For example, people always pay more attention to objects that are close to themselves, attention to objects that are further away decrease. As shown in Figure 3, Angelina Jolie's 4-hop neighbor Canada should have a much smaller impact than the 1-hop neighbor Brad Pitt. Therefore, in a knowledge graph, the closer an entity is to a given entity, the higher its attention value obtains. Besides, Figure 4 also demonstrates that the lighter the color of the relation in the graph, the lower the weight and the smaller the impact on a given entity. Based on this, we propose the attenuated attention mechanism to assign different weights to the n-hop neighbors attention values of a given entity.
Following the principle that the closer to a given entity, the higher weight of attention is gained, we define the attenuated attention coefficient θ id , which denotes the attenuation of the d-th hop neighbor for a given node e i . Suppose the distance between a given node and their d-hop neighbor is x d , so the attenuated attention coefficient can be formalized as follows where, θ 0 denotes the initial attenuated attention coefficient, and x 0 represents the path step length, default is 1.
We incorporate the attenuated attention mechanism into GATs to form our proposed Graph Attenuated Attention Networks (GAATs) model. The model gathers features from the neighbors of nodes by assigning different weights to different relations to reinforce the entities and relations embedding.

C. RELATION EMBEDDING WITH GAATs
State-of-the-art methods are focused on entity embedding, while relation embedding only using TransE-trained initialization vectors. However, as the most important part of the knowledge graph, relation plays a decisive role in the quality of knowledge reasoning. In addition, specifical relation embedding can make representation embedding containing more valuable features. Therefore, we propose to refine the type of relations and use our GAATs model to learn the embedding of relations.
We define the initial relations embeddings matrix R ∈ R N r ×T , where N r represents the number of relations, T represents the initial feature dimension embedded in each relation, and the i-th row represents the embedding of the i-th relation. The corresponding output matrix after the GAATs layer processing is R ∈ R N t ×T , where N t represents the number of triples, and T represents the output feature dimension. In other words, we learn relations embeddings independently for each triple.
We use the GAATs model to learn the relation embedding. In the graph structure of the relations, we first calculate the attention value of each relation separately, and obtain the output relation embedding matrix as R = R · W R , where W R ∈ R T ×T denotes the weight matrix. Specifically, extracting some of the subgraphs in Figure 1, we divide the relations into three categories as it shown in Figure 5.
(1) If there is a direct relation between the two entities, as shown in Figure 5(a). Then, the relation is expressed as 5216 VOLUME 8, 2020 where, R ij indicates the relation vector between the entities e i and e j , η ij denotes attention value between the entities e i and e j , and R(cor) represents the corresponding row of the initial relation matrix.
(2) If there is an indirect connection between the two entities, but they are accessible through the neighbor nodes, as shown in Figure 5(b). Then, the relation is expressed as where, i denotes the neighbors set of entity e i , θ is denotes the attenuated attention coefficient between the entities e i and e s . We use the attenuated attention mechanism to calculate the neighbor weighted attention values and then use the product between the relation embeddings of the neighbors and the attention values as the representation of the relation.
(3) If there is an indirect connection between the two entities, and a direct connection exists, as shown in Figure 5(c). We use the attenuated attention mechanism to calculate the neighbor attention values and then use the relation of the existing edge for its representation. Then, the relation is expressed as follows However, while learning the new relation embedding, the relations lose their initialized information. To address this issue, we assign two matrices in order to add the original relation embedding matrix into the final representation as where, W N ∈ R N t ×N r and W T ∈ R T ×T denote linear transformation matrices, and R represents the final relation embedding matrix.

D. ENTITY EMBEDDING WITH GAATs
Entities play an important role in knowledge graphs, and the correspondence between entities and relations is especially important in knowledge graph reasoning. For example, the entity Anthoy Russo plays different role in different triples, i.e. (brother_of, Joe Russo), (college, Robert Downey Jr.). Entity Embedding with GAATs has the following advantages: (i) Our model can solve the existence of insufficient flexibility of the translational models. (ii) Our model can embed the same entity under different relations. (iii) Our model calculates the attention weight of an entity using the attenuated attention mechanism, which allows entity carry the distance information of the relation path in the knowledge graph, to concrete entity embedding and easier to handle complex relations such as 1-N, N-1 and N-M.
We define the initial entity embedding matrix E ∈ R N e ×D , where N e represents the number of entities, D represents the initial feature dimension of the entity embedding, and the j-th row represents the embedding of the j-th entity. The corresponding output matrix after the GAATs layer processing is E ∈ R N e ×D , where D represents the output feature dimension.
The attention value reflects the importance of the edge feature [20], which can be used to measure the importance of the head entity. We first learn the embedding of the triple in KGs. We concatenate the initialized entity embedding with the relation vector of Section III.C to obtain the initial representation t ijk of the triple, as follows where, W ∈ R 1×(2D+T ) denotes linear transitional matrix, t ijk indicates the vector of triple (e i , r k , e j ), and h i , t j , r k represent the embedding of e i , r k , e j , respectively. Then, we use the ReLU activation function to learn a triple embedding to ensure that the triple attention is non-negative.
where, c ijk means vector after transformation and V ∈ R 1×(2D+T ) denotes the linear transformation parameter matrix. The softmax function is applied to obtain the attention values of the entity's n-hop neighbors, thus calculating the attention weight η ijk of each triple. Meanwhile, as we learn the weight of each triple attention, we add our attenuated attention coefficient to ensure that the closer we get to the given triple, the higher the weight gains. η ijk can be calculated as follows where, i denotes the neighbors set of entity e i , in denotes the relations set which connects between e i and e n , and θ ijk is the attenuated attention coefficient of the triple (e i , r k , e j ).
The new embedding of the entity e i is updated to the weighted VOLUME 8, 2020 sum of the product between the triple attention weight and the triple embedding, as follows In order to improve the stability of the learning process and learn more neighbor information, we add a multi-head attention mechanism similar to GATs. Finally, the neighbors of each entity are represented as follows Similar to relation embedding, while learning the new entity embedding, the entities lose their initialized information. To address this shortcoming, we assign a linear transitional matrix W E ∈ R T ×T in order to add the original relation embedding matrix into the final representation as follows This is the graph attenuated attention layer shown in Figure 6. In our structure, the entity gathers the neighbors' information layer by layer. For example, in Figure 4, the entity Brad Pitt gathers information from the direct neighbor in the first layer. Then, it gathers information from the indirect entity USA and the direct entity Angelina Joile in the second layer. In general, an n-hop-neighbor entity can be accumulated by an n-layer model.

E. TRAINING OBJECTIVE
Our model draws on the scoring function of the translational model, in which the basic idea is the condition h i + r k ≈ t j holds for a golden triple (e i , r k , e j ). Specifically, we try to learn the entities embeddings and relations embeddings by minimizing the L1-norm dissimilarity measure given by We train our model with the margin-based ranking loss as the objective for training as follows where, [x] + max{x, 0}γ > 0 is a hyper-parameter, T denotes the set of golden triples, and T denotes the set of corrupted triples, given formally as In order to extract the latent features inside the triples and analyze the global embedding properties of a triple across each dimension, we apply CapsE [18] as a decoder.
Similar to ConvKB, CapsE first learns a three-column matrix for the triples. Each column represents the vector of the head entity, relation, and tail entity. Then, the convolution operation is performed on the matrix using multiple convolution kernels to extract the deep relations within the triples. The difference with ConvKB is that the capsule networks are added behind the feature vector. After the feature vector transformation, it becomes the first layer of capsule network, and each vector is a capsule. After completing the routing process and the activation function, the information is passed to the capsule network of the second layer. There is only one capsule in the second layer, and the modulus of the vector corresponding to the capsule is the probability that the (e i , r k , e j ) triple exists in the knowledge graph. The score function can be written formally as f (e i , r k , e j ) = ||capsnet(ReLU ([e i , r k , e j ] * ))|| (19) where, capsnet represents the capsule network operation, represents the convolutional layer filter set of the shared parameters, and * represents the convolution operation. The model is trained using the Adam optimizer and the soft margin loss as follows (20) where,

IV. EXPERIMENTS RESULTS AND ANALYSIS A. DATASETS
To evaluate our model, we used two benchmark datasets: WN18RR [15] and FB15k-237 [15]. The knowledge graph completion task mainly consists of link prediction and triple classification. In addition, in order to testify the performance of our model, our apply a task named relation prediction to make sure our model can handle the inference of relations. The research shows that when the link prediction task is performed on the datasets WN18 and FB15k, it is affected by the inverse relation, which is a simple inverse-rule-based model can be used to obtain optimal results. Therefore, the corresponding sub-datasets WN18RR and FB15k-237 are used to solve the problem of the reversible relation between the datasets [27]. We use the dataset partitioning method to divide the training set and the test set. Table 2 provides all the information we use for the datasets.

B. EVALUATION PROTOCOL
The goal of the link prediction task is to predict a golden triple when the head or tail entity is missing, i.e., given (r k , e j ) to predict e i or given (e i , r k ) to predict e j . In this task, we remove the head or the tail entity and replace it with all other entities in the corpus. We first calculate the scores of these corrupted triples and rank them in descending order. The ranking of the correct entity is finally recorded. The link prediction emphasizes the final ranking of the correct entity rather than just finding the best entity. Similar to TransE, we also use two methods as our evaluation strategy: MeanRank (MR) and  the proportion of the top N correct entities (Hit@N), N = 1, 3, 10. In addition, we also use the Mean Reciprocal Rank (MRR) commonly used in information retrieval as another strategy for evaluation. The lower MR values and the higher MRR or Hit@N values demonstrate better accuracy performance of the model [23]. We call this evaluation method ''Raw''. Note that there may also be corrupted triples in the KG, and these corrupted triples will be considered as the correct triples. Therefore, before ranking, we should delete the corrupted triples contained in the training set, the validation set, and the test set. This evaluation method is called ''Filter''.
In this article, we summarize the evaluation results of using the two methods, ''Raw'' and ''Filter''.

C. TRAINING PROTOCOL
We create two sets of corrupted triples using the method of replacing the head or tail entity of a golden triple. One of them is a collection that replaces only the header entities, and the other is a collection that replaces only the tail entities [26].
To ensure the robustness of the link prediction, we use the classical Bernoulli sampling method, and the two sets extract an equal number of corrupted triples as training samples. We use the TransE method to initialize the entity embedding matrix and the relation embedding matrix. First, we use the GAATs model to train the embedding of entities and relations in the datasets., and then we train a decoder CapsE to complete our link prediction task. The attenuated attention mechanism is introduced while joining the n-hop neighbor. We use the Adam optimizer to determine the values for all of the parameters. Among them, the initial learning rate is set to λ = {0.001, 0.01, 0.1}, the embedding dimension of the entity and relation is set to k = {50, 100, 150, 200}, and the margin is set to γ = {1, 2, 10}. The optimal parameters are configured as: λ = 0.01, k = 100, γ = 2 on the WN18RR dataset; λ = 0.01, k = 100, γ = 1 on the FB15k-237 dataset.

D. RESULT AND ANALYSIS
We have selected eight current, state-of-the-art methods: Dis-Mult, ConvE, CapsE, TransE, TransH, ConvKB, R-GCN, and n-hop-GATs. Table 3 and Table 4 compares the results of our model experiments with the state-of-the-art results published previously using the same evaluation protocol. The experimental results show that our model performs better than the state-of-the-art method in Hit@N on WN18RR; the performance is better than the state-of-the-art method on FB15k-237. In particular, our Hit@N has increased by 3% overall on the FB15k-237. The results also show that all of the indicators of ''Filter'' are better than those for the ''Raw'' method.
From Table 3 and Table 4, we know that: (1) Compared to the matrix decomposition model DisMult, the proposed GAATs model uses the neural network to mine  the deep features of the triples to extract more feature information in the link prediction task. Therefore, all indicators for the proposed method are higher than the matrix decomposition method; (2) Compared to the translational model (TransE, TransH), since the core of the translational model is h + r ≈ t for golden triples, which ignores the features of the triple itself, the proposed GAATs model emphasizes h + r ≈ t with adding attention mechanisms. The graph model has been introduced under the premise of the neural network, and the nhop neighbors of the triples have been found, which enriches the meaning of the triple. Consequently, the experimental results for the proposed model are better than the translation models.
(3) Compared to the neural network models (ConvE, Con-vKB, R-GCN, CapsE), the proposed GAATs model applies the encoder-decoder model, which uses two different neural network structures to describe the features of the triples. In particular, CapsE acts as a decoder to discover the hidden features of the triples so that the proposed model performs better than the neural network models.
(4) Finally, compared to the n-hop-GATs, the proposed GAATs model introduces an attenuated attention mechanism, which emphasizes the closer the relation to a given entity is, the higher the attention weight obtains. In addition, we replace the ConvKB with CapsE, so that the proposed model performance is also better.
The results also show that the performance on the WN18RR dataset is lower than on the FB15k-237 dataset. The main reason is that there are fewer types of relations on WN18RR and there are fewer intermediate paths in the relation, which can not take advantage of our model.
For each relation r on FB15k-237, we calculate the average number n h of head entities per tail entity and the average number n t of tail entities per head entity [24]. If n h < 1.5& &n t < 1.5, r is classified as one-to-one (1-1). If n h < 1.5& &n t ≥ 1.5, r is classified as one-to-many (1-M). If n h ≥ 1.5& &n t < 1.5, r is classified as many-to-one (M-1). If n h ≥ 1.5& &n t ≥ 1.5, r is classified as many-to-many (M-M). We provide an embedding representation for each triple, making the representation of the triple more precise, and the discrimination between the triples is higher. so compared to other models, our model has better performance; (2) The proposed attenuated attention mechanism makes the relation learned by each triple differently, so when making predictions, whether (r k , e j ) or (e i , r k ), they can still learn different representations,  so the results are more accurate when ranking. (3) The nhop neighbors of a given triple in the encoder of our model replaces the embedding representation of the triple itself so that the triple carries more features and the decoder applies the CapsE instead of max-pooling ConvKB to ensures that the triple features are not lost. Based on this, the generalization ability of the model is further improved. Figure 8 shows the results for Hits@10 and MRR on WN18RR. also_see, similar_to, derivationally_related_form, verb_group can all be regarded as M-M relations, and our model performs better than others.
Meanwhile, we study the value of n-hop neighbors and the effect of the number of iterations on the final result. Table 5 shows the value of the validation set Hit@10 of WN18RR when the number of iterations and the value of n change. When n = 3 and epochs = 1000, the model works best. We can see that as the training progresses, our model collects more information from the neighbors, which focuses more on the direct neighbor and gets secondary information from the far neighbor. Once the model converges, it learns to collect multi-hop and cluster relation information from the node n-hop neighbors. However, when n > 3, the effect of the model will decrease. This also implies that the farther away from the relation, the lower the attention value obtained. When n = 4 or more, the neighbor nodes are no longer important.

E. RELATION PREDICTION
Relation prediction requires two steps: the discovery of a pair of entities with potential relation, and the reasoning of potential relations.Our experiment assumes that an entity pair with a potential relation has been recognized. i.e., given (e i , e j ) to predict r k .
We drop the relations of the triples in the test data, reasoning the relations based on the embeddings of the trained entities and relations, and comparing it with the standard answer to calculate the accuracy. Besides, we apply Precision, Recall, and F-Measure in machine learning algorithms For each pair of entities, we traverse all the relations into a triple. For each triple, we calculate the distance d betweent the embedding vector from the head entity and the relation to the tail entity. The smaller the distance d(h + r,t), the greater the possibility that the triple is established.We record the relations of top 10 triples for each pair of entities in ascending order of the distance d(h + r,t). We record the Hit@10 and Hit@1 of the correct relations as the accuracy rate, and record the recall rate and calculate the F-Measure. The experimental results are shown in Table 6 and Table 7.
From Table 6 and Table 7, we know that: (1) On the WN18RR and FB15k-237 datasets, our model has achieved good result in relation prediction. Because there are only 11 kinds relations in WN18RR, and FB15k-237 contains 237 kinds of relations, the model has better effect on simple dataset WN18RR than the complex one FB15k-237.
(2) Since our model has independently embedded of relations and incorporates the distance information of the relations in the entity embedding, our model is better than the state-of-the-art models in most cases. Among them, the translational models(TransE/TransH) perform the worst. Because they only models the relations by optimizing h + r ≈ t, and obviously ignores the semantic information of the relations. Our model adds an attenuated attention mechanism based on n-hop-GATs, so that each relation can obtain different weights according to the distance of the path, and solve the probem that n-hop-GATs does not separately embed the relations. Our model uses CapsE as a decoder and uses GAATs as an encoder to further mine the semantic information of the relations to prove the performance of relation prediction.
(3) The experimental results also show the importance of separate embedding of relations in relation prediction. In order to ensure the accuracy of the relation prediction, the independent effect of the relationship embedding is far better than the auxiliary information embedded only as the entity.

F. TRIPLE CLASSIFICATION
The task of triple classification is to determine whether a given triple (e i , r k , e j ) is a golden triple or not, which is a binary classification task. We use the WN18RR and FB15k-237 datasets for these experiments as well. We create two  Experimental results on the MN18RR test set. The best score is in bold and second-best score is underlined.

TABLE 7.
Experimental results on the FB15k-237 test set. The best score is in bold and second-best score is underlined.
corrupted triple sets as presented in Section IV.B for the triple classification experiment. For the triple classification task, we set a threshold δ r for each relation, which is obtained by maximizing the classification accuracy of the validation set. Given a triple (e i , r k , e j ), if its score is greater than δ r , it is classified as a golden triple, otherwise it is classified as corrupted [25].
We choose TransE, TransH, TransR, TransD, ConvKB, and n-hop-GATs as our baseline methods. In this experiment, we select ADADELTA SGD as our optimization goal. We For the two datasets, we testify the task by using 1000 iterations. The experimental results are presented in Table 8. Table 8 shows that in WN18RR and FB15k-237, the accuracy of the triple classification of the proposed GAATs model is higher than the other models. The main reasons are: (1) The translational models (TransE, TransH, TransR, TransD) first randomly initialize the representation of entities and relations, and then represent the embedding of entities and relations based on h + r ≈ t. The initial embedding is further optimized using a neural network, so our model representation is better than the translational models; (2) ConvKB method uses a CNNs to extract the features of the triple, as a decoder in the n-hop-GATs model. Although the hidden features of the triples can be found, it is easy to lose features using the pooling method. (3) Compared with the n-hop-GATs model, the proposed GAATs model first introduces the attenuated attention mechanism, and then the decoder adopts the more advanced CapsE model, so the accuracy is further improved.

V. CONCLUSION
We propose the GAATs model, which applys an attenuated attention mechanism into the representation of relations and entities to obtain new embeddings on both entities and relations. In addition, we use the CapsE as our decoder, avoiding the loss of features exhibited by the ConvKB. Therefore, our model takes the diversity of triples into account. An extensive experiment shows that our method is more effective than the state-of-the-art methods in link prediction, triple classification and relation prediction. However, our model has a high computational complexity due to the graph feature extraction, and time cost arising from representing each triple separately. How to improve the efficiency of the graph feature extraction is an urgent problem to be solved. In addition, the temporal and spatial attributes for dynamic KGs also need to be studied as the next research goal.