SR-HGAT: Symmetric Relations Based Heterogeneous Graph Attention Network

Graph neural networks, as a deep learning based graph representation technology, can capture the structural information encapsulated in graphs and generate more effective feature embeddings, and we have recently witnessed an emerging research interest in them. However, existing models primarily focus on handling homogeneous graphs. When designing graph neural networks for heterogeneous graphs, heterogeneity and rich semantic information bring great challenges. In this paper, we extend graph neural networks to heterogeneous graph scenarios and propose a novel high-order Symmetric Relation based Heterogeneous Graph Attention Network, denoted as SR-HGAT, which takes into account the features of nodes and high-order relations simultaneously and exploits a two-layer attention based aggregator to efficiently capture essential semantics in an end-to-end manner. The proposed SR-HGAT first identifies the latent semantics underneath the observed explicit symmetric relations guided by different meta-paths and meta-graphs in a heterogeneous graph. A nested propagation mechanism, which aggregates the semantic and structural features that different links contain, is then designed to calculate the interaction strength of each symmetric relation. As the core of the proposed model, a two-layer attention mechanism is applied to learn the importance of different neighborhood information as well as the weights of different symmetric relations, thereby comprehensively capturing both structural and semantic feature information. These latent semantics are then automatically fused to obtain unified embeddings for specific mining tasks. Extensive experimental results offer insights into the efficacy of the proposed model and demonstrate that it significantly outperforms state-of-the-art baselines on various downstream tasks across three benchmark datasets.


I. INTRODUCTION
Graphs are among the most expressive data structures and have been used to model a variety of problems. Early processing of graph data was mainly done through network embedding ([1]-[3]); the basic idea behind the various embedding approaches is to use dimensionality reduction techniques to distill the high-dimensional topological structure and the relationships between nodes into a dense representation space. With the rapid development of deep learning, many scholars have begun to leverage the powerful feature extraction ability of deep neural networks to process graph data. As a powerful deep learning based representation technology, the Graph Neural Network (GNN) employs deep neural networks to aggregate the feature information of neighboring nodes, which makes the aggregated embeddings more powerful and has aroused considerable research interest. In addition, GNN based models (e.g., GCN [4], GraphSAGE [5], and GAT [6]) can be naturally applied to inductive tasks involving nodes that are not present during training. However, the advances and applications of GNNs are largely concentrated on homogeneous graphs.
As a matter of fact, real-world online platforms (e.g., Facebook, Amazon, DBLP and IMDB) usually come together with a heterogeneous graph structure, which contains abundant information with various relations among multi-typed nodes as well as unstructured content associated with each node, also widely known as a heterogeneous information network (HIN). For convenience, we uniformly call it a heterogeneous graph in this paper. Current state-of-the-art GNNs lack the aggregation mechanisms to encapsulate the rich semantics and complex relationships in heterogeneous graphs, and directly applying them to embed heterogeneous graphs will inevitably lead to unsatisfactory embedding results and reduced performance in downstream tasks. Based on the above analysis, when designing a graph neural network architecture to deal with heterogeneous graphs, we need to address the following challenges.
• Capturing distinctive semantic characteristics of different relations. Existing GNNs usually learn heterogeneous graph embeddings by solely focusing on node features without taking into account the features of links, making them insufficient to capture the properties of heterogeneous graphs. Moreover, heterogeneous graphs have commonly been used for abstracting and modeling complex systems, in which various relations with rich latent semantics are involved. One must effectively discover the latent semantics underlying a specific semantic relationship and embed these rich semantics into the feature vector space.
• Integrating the attention mechanism into graph neural networks for heterogeneous graphs. In a heterogeneous graph, any node pair can be connected by various types of relationships. For any symmetric relationship, each node has a large number of neighborhood nodes and links based on this relation. How to distinguish the subtle differences among neighbors and then automatically infer different attention values for them remains an open question. Moreover, depending on the form of the relationships, meaningful and complex semantic information can be contained in heterogeneous graphs. It is impractical to treat these different semantic relationships equally, as this weakens the semantic information of the more useful relationships. How to choose the best semantic relationship information and integrate it according to specific mining tasks is still an open problem.
In light of the previous limitations and challenges, we propose a high-order Symmetric Relation based Heterogeneous Graph Attention Network framework (SR-HGAT). SR-HGAT mainly consists of three modules that correspondingly address the challenges of graph neural network modeling for heterogeneous scenarios. First, in the semantics capturing module, the concept of symmetric relations guided by different meta-paths and meta-graphs is proposed, which facilitates better identification and capture of the rich and complex latent semantic information of any node's neighborhood. Moreover, a specifically designed module, namely the nested propagation of multi-typed link features, is proposed. For each kind of symmetric relation, both the semantic and structural feature information of links are nestedly aggregated to capture the distinct characteristics of high-order relations. Hence, the features of both relations and neighbor nodes can be integrated to capture more accurate neighborhood information.
As the core of our proposed model, in the two-layer attention based aggregator module, the first layer is adopted to learn the attention values between nodes and their symmetric relation based neighbors, while the second layer is able to automatically determine the weights of the latent semantics represented by symmetric relations for different downstream tasks. Based on the attention values learned by the two-layer mechanism, our proposed model can obtain the optimal combination of neighbors and multiple symmetric relations in a hierarchical manner.
It is worthwhile to highlight our main contributions as follows:
• We formalize the problem of heterogeneous graph representation learning, which involves both node features and link features of multiple types simultaneously.
• To explicitly capture the distinctive semantic characteristics of diverse high-order symmetric relations, we design a nested propagation strategy of multi-typed link features on these relations.
• We propose a two-layer attention mechanism, which can automatically capture both the importance of the neighbor information of any node and the weights of different symmetric relations in a hierarchical manner.
• Extensive experiments conducted on three benchmark datasets validate the superiority of our proposed model via comparison with other state-of-the-art baselines on numerous graph mining tasks.
The rest of this paper is organized as follows: a review of related work is provided in Section II. Preliminaries and several related definitions are introduced in Section III. The proposed model is elaborated in Section IV. Experimental results and dataset descriptions are reported in Section V, and conclusions and future research directions are presented in Section VI.

II. RELATED WORK
A. GRAPH REPRESENTATION LEARNING
Graph representation learning aims to embed a network into a low-dimensional space while preserving its structure and properties, so that the learned embeddings can be applied to downstream network tasks. Representative approaches include the random walk based methods [1], [2], the deep neural network based methods [7], the matrix factorization based methods [8], [9], and others, e.g., LINE [10]. However, all these algorithms are proposed for homogeneous graphs. Some elaborate reviews can be found in [11], [12]. In many real-world applications, however, data are naturally represented as heterogeneous graphs. Over the past decade, several attempts [13] have been made on heterogeneous graph embedding and have achieved promising performance in various tasks. Metapath2vec [3] designs a meta-path based random walk and utilizes skip-gram to perform heterogeneous graph embedding. However, metapath2vec can only utilize one meta-path and may ignore other useful information. To resolve this issue, ESim [14] attempts to capture various semantic relations of nodes through multiple meta-paths, but it heavily depends on a group of meta-paths and the weight of each meta-path being assigned by the user in advance. Moreover, Zhang et al. [15] propose Metagraph2vec, which uses meta-graphs as guidelines for random walks and thereby successfully extracts richer structural details and more complete semantic information. HIN2Vec [16] carries out multiple prediction training tasks that learn the latent vectors of nodes in the same relation space and conducts heterogeneous link prediction. In addition, there are also several other heterogeneous graph embedding methods designed for particular tasks, such as identifying authors [17] and recommendation [18].

B. GRAPH NEURAL NETWORK
The graph neural network, introduced in [19], [20], aims to develop deep neural networks that handle network-structured data. It has attracted considerable attention lately because of its remarkable success in various graph analysis tasks. The early attempts to derive a graph convolutional layer are based on graph spectral theory [21]-[23]. However, spectral theory based graph convolutional networks still have difficulty capturing local graph information. Recently, non-spectral approaches [24]-[26] have been proposed that perform a convolution process to aggregate neighbor nodes' information directly, so that the information of spatially close neighbors can be captured. One of the most prominent advances is GCN [4], which averages the neighbors of each node in the graph, followed by a linear projection and non-linear activation operations. Hamilton et al. [5] proposed GraphSAGE, an inductive framework that leverages node sampling and feature aggregation techniques to efficiently generate node embeddings for unseen data, which breaks the limitation of applying GCN in transductive settings. The graph attention network (GAT) [6] incorporates the attention mechanism into GCN. By calculating attention coefficients among nodes, GAT allows each node to focus on its most relevant neighbors when making decisions. The aforementioned models exhibit state-of-the-art performance in various graph representation learning applications. However, the above graph neural networks cannot be directly applied to heterogeneous graphs with multi-relational edges and multi-typed nodes. Recently, several attempts have been made to adapt GNNs to learn with heterogeneous graphs. Schlichtkrull et al. [27] propose the relational graph convolutional network (R-GCN) to model knowledge graphs. R-GCN keeps a distinct linear projection weight for each edge type. Zhang et al. [28] present the heterogeneous graph neural network (HetGNN), which adopts different RNNs for different node types to integrate multi-modal features.
Although these methods have been shown to be empirically better than traditional models, they have not fully utilized the heterogeneous graphs' properties.

III. PRELIMINARIES
We begin this section by introducing the notations used in the rest of the paper, followed by several related definitions about heterogeneous graphs and a brief background on the graph attention network (GAT). Finally, we introduce our proposed symmetric relations guided by meta-paths and meta-graphs.

A. HETEROGENEOUS GRAPH
A heterogeneous graph is a directed graph, denoted as G = (V, E, A, R), which consists of the set of nodes V, the set of links E, the set of node types A and the set of edge types R. A heterogeneous graph is associated with a node type mapping function φ: V → A, which represents that each vertex v ∈ V can be mapped to a node type a ∈ A, and a link type mapping function ψ: E → R, meaning that each edge e ∈ E can be mapped to an edge type r ∈ R. A network can be deemed a heterogeneous graph when the numbers of node types |A| and edge types |R| satisfy |A| + |R| > 2. Otherwise, it is a homogeneous graph.
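As a concrete illustration of this definition (our own minimal sketch, not part of the paper's implementation), the two type-mapping functions can be encoded as dictionaries and the |A| + |R| > 2 condition checked directly:

```python
from collections import namedtuple

# Minimal illustrative sketch of G = (V, E, A, R) with the type-mapping
# functions phi: V -> A (node_type) and psi: E -> R (edge_type).
HetGraph = namedtuple("HetGraph", ["nodes", "edges", "node_type", "edge_type"])

def is_heterogeneous(g):
    """A graph is heterogeneous iff |A| + |R| > 2."""
    num_node_types = len(set(g.node_type.values()))
    num_edge_types = len(set(g.edge_type.values()))
    return num_node_types + num_edge_types > 2

# Toy DBLP-style graph: authors a1, a2 write papers p1, p2.
g = HetGraph(
    nodes=["a1", "a2", "p1", "p2"],
    edges=[("a1", "p1"), ("a2", "p2")],
    node_type={"a1": "A", "a2": "A", "p1": "P", "p2": "P"},
    edge_type={("a1", "p1"): "write", ("a2", "p2"): "write"},
)
print(is_heterogeneous(g))  # two node types + one edge type -> True
```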

B. SEMANTIC STRUCTURE OF HETEROGENEOUS GRAPH
The complexity of a heterogeneous graph forces us to propose a structure that can describe its meta-level semantic information, so that the node types and the relations between nodes in the network can be better comprehended. Given a heterogeneous graph, a graph schema can be abstracted as G = (A, R), where G is a directed graph that contains all allowable node and link types; these types are combined together to give a meta-level semantic description of G. To be more specific, in Figure 1, the DBLP schema consists of four node types (A for authors, P for papers, V for venues and T for topics) and three link types: an author-paper link signifying papers published by an author; a paper-venue link signifying the venue at which a paper is published; and a paper-topic link signifying the keywords of a paper. By contrast, in Figure 2, the MovieLens schema comprises six node types, U (users), M (movies), A (actors), D (directors), T (tags) and G (genres), and five link types, including users watching and reviewing movies, actors acting in movies, directors directing movies, tags of movies and cinematic genres.
A heterogeneous graph contains abundant semantic information that can be expressed by graph schemas. Moreover, both meta-path and meta-graph are semantic structures generated from heterogeneous graph schema.
Definition 1 (Meta-Path): As an abstract sequence of node types connected by link types, a meta-path is formed by transforms of a graph schema and is able to capture the rich semantic information preserved in heterogeneous graphs. Specifically, given a HIN schema denoted as G = (A, R), a meta-path can be expressed in the form P = a_1 −r_{1,2}→ a_2 −r_{2,3}→ ··· −r_{l−1,l}→ a_l, where a_i ∈ A (i = 1, ..., l) indicates node types and r_{i,j} ∈ R represents the link type between a_i and a_j, 1 ≤ i, j ≤ l. For example, in Figure 3, the meta-path A-P-V-P-A in the academic cooperation network DBLP indicates that papers written by two authors are published at the same venue. In addition, A-P-T-P-A and A-P-A-P-A respectively mean that papers published by two authors share the same subject and that papers published by two authors have the same co-author. Moreover, in Figure 4, the meta-path U-M-G-M-U in the movie review network MovieLens indicates that movies rated by two users share the same genres. In addition, U-M-A-M-U and U-M-D-M-U respectively mean that movies rated by two users feature the same actor and have a common director. Clearly, different meta-paths represent different semantics.
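For illustration, one common way (our sketch, not necessarily the paper's procedure) to enumerate the node pairs connected by a symmetric meta-path such as A-P-A is to compose the author-paper adjacency matrix with its transpose; a non-zero off-diagonal entry (i, j) then means authors i and j share a paper:

```python
import numpy as np

# Toy author-paper adjacency matrix (illustrative data, not from the paper).
AP = np.array([[1, 1, 0],   # author 0 wrote papers 0 and 1
               [0, 1, 0],   # author 1 wrote paper 1
               [0, 0, 1]])  # author 2 wrote paper 2

APA = AP @ AP.T             # path counts along the meta-path A-P-A
np.fill_diagonal(APA, 0)    # drop trivial self-connections
print(APA > 0)              # (0,1) and (1,0) are True: authors 0 and 1 co-authored paper 1
```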
Definition 2 (Meta-Graph): In terms of topological structure, a meta-graph is more complicated than a meta-path and thus has the potential to represent richer and more complicated semantic content. Similar to a meta-path, given a heterogeneous graph schema G = (A, R), the corresponding meta-graph can be represented as a directed acyclic graph M_G = (N, M, n_s, n_t), which contains a source node n_s (in-degree 0) and a target node n_t (out-degree 0). It depicts a complex semantic relation formed between n_s and n_t by stitching multiple meta-paths together. Moreover, N and M respectively represent subsets of the node types and edge types, satisfying N ⊆ A and M ⊆ R.
We continue to take DBLP and MovieLens as examples in Figures 5 and 6 respectively. The meta-graph (A, P, (V, T), P, A) implies that papers written by two authors are not only published at the same venue but also contain identical topics. In a similar way, the meta-graph (U, M, (A, D), M, U) in the MovieLens dataset represents that movies rated by two users feature the same actor and have a common director.
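Since a meta-graph relation such as (A, P, (V, T), P, A) requires both of its constituent meta-paths to hold simultaneously, its node-pair relation can be approximated (our illustrative sketch, with toy data) as the element-wise AND of the two meta-path adjacencies:

```python
import numpy as np

# Toy meta-path adjacencies (boolean): same-venue and same-topic relations.
APVPA = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=bool)  # A-P-V-P-A
APTPA = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]], dtype=bool)  # A-P-T-P-A

# The meta-graph (A, P, (V, T), P, A) holds only where BOTH hold.
meta_graph = APVPA & APTPA
print(meta_graph.astype(int))  # only the author pair satisfying both meta-paths survives
```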

C. SYMMETRIC RELATIONS
In heterogeneous graphs, there are various relationships between two same-typed nodes guided by meta-paths and meta-graphs. Therefore, we propose to leverage both meta-graphs and their embedded meta-paths to describe the various symmetric relations between node pairs. The related definitions are summarized as follows:

Definition 3 (Symmetric Relation Guided by Meta-Path):
Given a meta-path denoted as P = A_1 −E_1− A_2 −E_2− ··· −E_l− A_{l+1}, where A_i denotes a node type and E_j denotes a link type, we say that P is symmetric, as illustrated in Figure 7, if A_i and A_{l+2−i} (i = 1, 2, ..., l + 1) have the same type and E_j and E_{l+1−j} (j = 1, 2, ..., l) represent the same semantics. SR_P is then defined as the symmetric relation guided by this symmetric meta-path P.
Definition 4 (Symmetric Relation Guided by Meta-Graph): In Figure 8, (A_{i,1}, A_{i,2}, ..., A_{i,n_a(i)}) denotes the i-th node type group, with n_a(i) the number of node types in it; similarly, (E_{j,1}, E_{j,2}, ..., E_{j,n_e(j)}) denotes the j-th link type group, with n_e(j) the number of link types in it. We say a meta-graph M is symmetric if M is equal to its reverse M^{−1}. SR_M is then defined as the symmetric relation guided by the symmetric meta-graph M.

Definition 5 (Symmetric Relation Based Neighbors):
We notice that each node connects with other nodes of the same type according to the symmetric relations guided by meta-paths or meta-graphs. Given a node-relation triple (v_s, sr, v_t) in a heterogeneous graph, where sr denotes a symmetric relation between nodes v_s and v_t, the relation based neighbors N^{sr}_{v_s} of node v_s are defined as the set of nodes which connect with v_s via the symmetric relation sr. Due to the different strengths of semantic relations, the neighbors of each node based on different symmetric relations show different importance. Notice that we could in principle enumerate all possible symmetric relations based on meta-paths and meta-graphs. However, not all relations have meaningful semantics and positive effects on the embeddings. Hence, we only choose important and meaningful symmetric relations in this paper.
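Given the adjacency matrix of one symmetric relation, the neighbor sets N^{sr}_v can be collected directly; the helper below is our own illustrative sketch, not from the paper:

```python
import numpy as np

def relation_neighbors(adj):
    """Map each node v to its relation-based neighbor set N^sr_v,
    given the (symmetric) adjacency matrix of one relation sr."""
    n = adj.shape[0]
    return {v: set(np.nonzero(adj[v])[0].tolist()) for v in range(n)}

# Toy symmetric-relation adjacency: node 0 relates to nodes 1 and 2.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
print(relation_neighbors(adj))  # {0: {1, 2}, 1: {0}, 2: {0}}
```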

IV. SR-HGAT: SYMMETRIC RELATIONS BASED HETEROGENEOUS GRAPH ATTENTION NETWORK
In this paper, we extend the graph attention network to heterogeneous graph scenarios and propose a semi-supervised Symmetric Relation based Heterogeneous Graph Attention Network, denoted as SR-HGAT, which takes into account the features of nodes and high-order symmetric relations simultaneously and adopts a two-layer hierarchical attention mechanism to embed the heterogeneous graph in an end-to-end fashion. Figure 9 shows the overall framework of our proposed SR-HGAT; we provide a detailed description of it in the following subsections.

A. MODEL INPUTS
In our model, the feature vectors of nodes and high-order relations are taken simultaneously as inputs. For any node v_i, we first define the triple (v_i, sr_k, v_j), which means that neighbor node v_j connects with node v_i via symmetric relation sr_k, where v_j ∈ N^{sr_k}_{v_i} (N^{sr_k}_{v_i} represents the set of neighbor nodes of v_i via symmetric relation sr_k). Moreover, h_i, h_j ∈ H represent the feature vectors of v_i and v_j respectively, and r_k represents the feature vector of sr_k. Specifically, the node features can be expressed as a matrix H ∈ R^{N_a × T_a}, where the i-th row h_i ∈ H represents the vector of node v_i, N_a represents the total number of nodes and T_a is the feature dimension of each node representation.

B. NESTED PROPAGATION OF MULTI-TYPED LINK FEATURES
The GAT designed for homogeneous graphs ignores the semantic and structural features of different types of links, as well as the complex and rich semantic relationships between nodes, which are an extremely important part of heterogeneous graphs. To fit the general framework to heterogeneous graphs, in addition to the information from neighbor nodes, the features of the symmetric relations connected to them are also collected in the proposed model. Each symmetric relation guided by a meta-path or a meta-graph is composed of multiple types of links, and the hidden state is updated according to the nested aggregation of the features of the composing links. The detailed process is described as follows.
Definition 6 (Structural and Semantic Feature of Link): Considering that multi-typed links in a heterogeneous graph have significantly different characteristics, for any link e_i ∈ sr_k, we first explore how to calculate its structural and semantic features respectively.
In the structural feature view: considering that the degree of a node can well reflect the topological characteristics of a network, a degree based measure is defined to explore the structural features of various links. We assume that d_i represents the degree of the source node of a directed link e_i; the structural strength of link e_i, denoted as d̂_i, is then defined in Eq. (1). Links with a smaller value of d̂_i imply stronger structural strength.
In the semantic feature view: it is assumed that a_i represents the semantic feature of link e_i; the absolute semantic strength of link e_i, denoted as ā_i, is then defined in Eq. (2), where M_i is the maximum strength of the semantic which link e_i represents. For instance, in the MovieLens dataset, the semantics of the link User −rate→ Movie represent that a user has watched a movie, and the semantic strength of this link represents the user's attitude toward the film: the larger the value, the more important the film is to the user. Therefore, User −5→ Movie should be attached more importance than User −1→ Movie. Similarly, in the DBLP dataset, the semantics of the link Author −write→ Paper represent that an author has written a paper, and the semantic strength of this link is the author's serial number: Author −1st-author→ Paper should be attached more importance than Author −4th-author→ Paper. Therefore, according to the particular heterogeneous graph, we should set up the semantic strength of each link type in advance.
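Since Eqs. (1)-(2) are not reproduced in this text, the Python sketch below only illustrates plausible forms consistent with the prose; both formulas are our assumptions, not the paper's exact equations. We assume structural strength d̂_i = 1/d_i (so high-degree sources yield small d̂_i, i.e., stronger structural strength) and semantic strength as the raw value normalized by the maximum M_i:

```python
# Hedged sketch of the per-link strengths in Definition 6.
# Both formulas below are ASSUMED forms, not the paper's Eqs. (1)-(2).

def structural_strength(d_i):
    # Assumed d_hat = 1/d_i: smaller value <=> stronger structural strength.
    return 1.0 / d_i

def semantic_strength(a_i, M_i):
    # Assumed a_bar = a_i / M_i, e.g. a star rating over the maximum rating.
    return a_i / M_i

print(semantic_strength(5, 5))  # a 5/5 rating -> 1.0
print(semantic_strength(1, 5))  # a 1/5 rating -> 0.2
```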
Definition 7 (Aggregation of Symmetric Link Pairs on Meta-Path): Considering that the symmetry relationship is generated by nested aggregating of symmetric link pairs, we thus continue to introduce how to aggregate features of symmetric link pairs that belong to any symmetric relation guided by a meta-path or meta-graph as follows.
For a meta-path instance p = a_1 −e_1− a_2 − ··· −e_l− a_{l+1}, it is assumed that the links e_i and e_{l+1−i} form a symmetric link pair and represent the same semantics. A measure rs(i, l+1−i) is then defined to explore the distinction of this link pair from both the structural and the semantic view in a unified manner.
Specifically, the difference in semantic strength between a symmetric link pair, denoted as ra(i, l + 1 − i), is calculated in Eq. (3), where D_i is the maximum semantic strength of the specific link type that e_i or e_{l+1−i} belongs to.
The difference in structural strength between any symmetric link pair, denoted as rd(i, l + 1 − i), is then calculated in Eq. (4): the out-degrees of the two source nodes connected with the links e_i and e_{l+1−i} are compared by dividing the smaller one by the larger one, so a small value indicates quite inequivalent structural strength of the symmetric link pair. Consider a toy example: the symmetric link pair (A-P, P-A) represents the same semantics, ''authors write papers''. A famous author a_1 has published 20 papers, while author a_2 may be a student who has published only one paper; the symmetric link pair connected with the two authors as respective source nodes is then structurally inequivalent, and the value of rd is only 0.05, which is consistent with common sense.
Combining ra(i, l + 1 − i) and rd(i, l + 1 − i), the distinction of a symmetric link pair is then calculated as in Eq. (5): a large value of rs(i, l + 1 − i) indicates quite equivalent semantic and structural strength of the symmetric link pair on the relation sr. In other words, any symmetric link pair with a large value of rs(i, l + 1 − i) shows a much stronger affiliation relationship and much more similar properties.
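Because the exact forms of Eqs. (3)-(5) are not reproduced in this text, the following sketch uses assumed forms that are merely consistent with the surrounding description (rd as smaller over larger out-degree, matching the rd = 0.05 toy example above; ra and the combination rs are our own placeholder choices):

```python
# Hedged sketch of the symmetric-link-pair distinction measures.
# All three formulas are ASSUMPTIONS, not the paper's Eqs. (3)-(5).

def ra(a_i, a_j, D):
    # Assumed: semantic-strength agreement, normalized by the maximum strength D.
    return 1.0 - abs(a_i - a_j) / D

def rd(d_i, d_j):
    # Assumed: smaller out-degree divided by larger one; small => inequivalent.
    return min(d_i, d_j) / max(d_i, d_j)

def rs(a_i, a_j, D, d_i, d_j):
    # Assumed combination: simple average; large => equivalent pair.
    return 0.5 * (ra(a_i, a_j, D) + rd(d_i, d_j))

# The paper's toy example: 20 papers vs. 1 paper -> rd = 1/20 = 0.05.
print(rd(20, 1))  # 0.05
```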
Definition 8 (Aggregation of Symmetric Link Pairs on Meta-Graph): In a similar manner, for a meta-graph instance, we may face the following two situations. For each symmetric link pair, the distinction measure is the same as in the meta-path case above.
For each symmetric link group pair (e_{i,1}, e_{i,2}, ..., e_{i,n}) and (e_{l+1−i,1}, e_{l+1−i,2}, ..., e_{l+1−i,n}), the differences in structural and semantic strength between any symmetric link pair e_{i,j} ∈ (e_{i,1}, ..., e_{i,n}) and e_{l+1−i,j} ∈ (e_{l+1−i,1}, ..., e_{l+1−i,n}), denoted as rd(i, l + 1 − i, j) and ra(i, l + 1 − i, j), can be calculated in a similar manner in Eqs. (6)-(7), where D_{i,j} is the difference between the maximum and minimum semantic strength of the j-th link type belonging to the i-th link group. The distinction of any symmetric link group pair (e_{i,1}, ..., e_{i,n}) and (e_{l+1−i,1}, ..., e_{l+1−i,n}), which both belong to the meta-graph and represent the same semantics, is then defined in Eq. (8).

Definition 9 (Nested Propagation of Symmetric Relations):
The idea of ''nested propagation'' is employed to recursively propagate the distinction of all link pairs. After the nested aggregation process, the distinction strength of any symmetric relation guided by a meta-path or meta-graph is learned as in Eq. (9): a symmetric relation with a smaller value of RS implies a much stronger interaction relationship between nodes.
Definition 10 (Neighbor Representation): The features of a symmetric relation learned from the above nested propagation process, together with the representation of the node that this relation connects to, are utilized to obtain the ''neighbor representation'', which enables the neighborhood feature information to be depicted in more detail and is defined in Eq. (10), where concat represents a linear transformation over the combination of the symmetric relation sr_k's features and the neighbor node v_j's features h_j ∈ R^d, for the source node v_i.
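Definition 10 can be sketched as follows (our illustration: the weight matrix W and the dimensions are assumed placeholders; the neighbor representation is a linear map of the concatenated relation and node features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hedged sketch of Definition 10: the "neighbor representation" is a
# linear transformation of the concatenation of the relation feature
# vector r_k and the neighbor node feature vector h_j. W is a model
# parameter (randomly initialized here for illustration).
d_node, d_rel, d_out = 4, 3, 5
W = rng.standard_normal((d_out, d_node + d_rel))

def neighbor_representation(h_j, r_k):
    return W @ np.concatenate([r_k, h_j])

h_j = rng.standard_normal(d_node)
r_k = rng.standard_normal(d_rel)
print(neighbor_representation(h_j, r_k).shape)  # (5,)
```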

C. TWO-LAYER ATTENTION MECHANISM
Before aggregating the information from symmetric high-order relation based neighbors for each node, it should be noted that the different relation based neighbors of each node play different roles and show different importance when learning node embeddings for a specific task. As discussed above, distinct from traditional neighborhood aggregators (e.g., GCN and GAT), which only take node representations as inputs, we adopt a two-layer hierarchical attention mechanism as the aggregation function, which can holistically encode semantically different high-order relations in the neighborhood of any given node during propagation on a heterogeneous graph.

1) INTRA-RELATION ATTENTION
The intra-relation attention is applied first to infer the weight coefficients of the neighbors that are actually connected with the source node by a specific symmetric relation. The weight parameter e^k_{ij}, which measures how important neighbor node v_j, reached via symmetric relation sr_k, is to node v_i, is formulated in Eq. (11), where att_node denotes the deep neural network that performs the intra-relation attention, and H^{k,j}_i is the ''neighbor vector'', i.e., the combination of symmetric relation sr_k's features and neighbor node v_j's features for the source node v_i. Note also that e^k_{ij} is asymmetric, which maintains the different contributions of the neighbor vector H^{k,j}_i to the current node v_i. After obtaining the importance between symmetric relation based node pairs, these values are normalized to obtain the weight coefficients α^k_{ij} via a softmax function (Eq. (12)), where σ denotes the activation function, || denotes the concatenation operation, and N^k_i denotes the neighborhood of node v_i via symmetric relation sr_k.
Then, the embedding of node v_i is updated by the symmetric relation based neighbor features with the corresponding coefficients, as shown in Eq. (13), where z^k_i is the updated embedding of node v_i for that symmetric relation: the new embedding is the sum of the triple representations weighted by their attention values. Since the embedding z^k_i is generated for one single symmetric relation, it is semantic-specific and able to capture one kind of semantic information.
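The intra-relation step can be sketched as below. Since att_node is only described as a deep neural network, we model it, GAT-style, as a single weight vector applied to the concatenation of the source embedding and each neighbor vector, followed by a LeakyReLU; this parametrization is our assumption, not the paper's exact Eqs. (11)-(13):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def intra_relation_attention(h_i, neighbor_vecs, a_vec, slope=0.2):
    """Assumed GAT-style att_node: score = a^T [h_i || H_i^{k,j}], LeakyReLU,
    softmax-normalize, then attention-weighted sum of neighbor vectors."""
    scores = []
    for H_kj in neighbor_vecs:
        e = a_vec @ np.concatenate([h_i, H_kj])
        scores.append(e if e > 0 else slope * e)  # LeakyReLU
    alpha = softmax(np.array(scores))             # Eq. (12)-style normalization
    z_i = sum(a * H_kj for a, H_kj in zip(alpha, neighbor_vecs))
    return alpha, z_i

rng = np.random.default_rng(1)
d = 4
h_i = rng.standard_normal(d)
neighbors = [rng.standard_normal(d) for _ in range(3)]
a_vec = rng.standard_normal(2 * d)
alpha, z_i = intra_relation_attention(h_i, neighbors, a_vec)
print(z_i.shape)  # (4,)
```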
Generally speaking, each node pair in a heterogeneous graph contains multiple types of semantic information and specific semantic based embedding can only reflect node feature from one aspect. To learn a more comprehensive node embedding, we need to further fuse multiple semantics which can be revealed by symmetric relations guided by different meta-paths and meta-graphs.

2) RELATION-LEVEL ATTENTION
To address the challenge of symmetric relation selection and semantic fusion in a heterogeneous graph, a novel combiner with relation-level attention mechanism is proposed to automatically learn the importance of different symmetric relations and fuse them for the specific tasks to obtain the updated node representations.
Given the symmetric relation set {sr_1, sr_2, ..., sr_K}, after feeding the neighbor features into the intra-relation attention, we obtain K groups of semantic-specific node embeddings, denoted as (Z_{sr_1}, Z_{sr_2}, ..., Z_{sr_K}). Taking these K groups of semantic-specific node embeddings as inputs, the learned importance of each aggregated symmetric relation, {ω_{sr_1}, ω_{sr_2}, ..., ω_{sr_K}}, can be expressed as follows:

{ω_{sr_1}, ω_{sr_2}, ..., ω_{sr_K}} = att_sem(Z_{sr_1}, Z_{sr_2}, ..., Z_{sr_K}),  (14)

where att_sem denotes the deep neural network which performs the symmetric relation level attention. This shows that the relation-level attention can capture the various types of semantic information behind a heterogeneous graph.
After obtaining the importance of each symmetric relation, these values are normalized via a softmax function. Specifically, the weight of the k-th symmetric relation, denoted as β_{sr_k}, is obtained by normalizing the importance values of all the above relations (Eq. (15)); it can be interpreted as the contribution of symmetric relation sr_k to the specific task: obviously, the higher β_{sr_k} is, the more important relation sr_k is. Note that for different tasks, different symmetric relations may have different weights. With the learned weights as coefficients, these semantic-specific representations can be fused to obtain the final representation Z (Eq. (16)). With the learned importance of neighbors and symmetric relations, the proposed model can pay more attention to meaningful neighbors or relations for the specific task and give a more comprehensive description of a heterogeneous graph.
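The relation-level step can be sketched as follows. Since att_sem is only described as a deep neural network, we model it after the semantic attention popularized by HAN (importance = average over nodes of q^T tanh(Wz + b), softmax into β, then a β-weighted sum); this parametrization is our assumption:

```python
import numpy as np

def relation_level_attention(Z_list, W, b, q):
    """Assumed HAN-style att_sem: per-relation importance omega (Eq. (14)-style),
    softmax into beta (Eq. (15)-style), beta-weighted fusion (Eq. (16)-style)."""
    omega = np.array([np.mean([q @ np.tanh(W @ z + b) for z in Z]) for Z in Z_list])
    beta = np.exp(omega - omega.max())
    beta /= beta.sum()
    Z_final = sum(bk * Zk for bk, Zk in zip(beta, Z_list))
    return beta, Z_final

rng = np.random.default_rng(2)
n, d, dq = 5, 4, 6                                       # 5 nodes, 4-dim embeddings
Z_list = [rng.standard_normal((n, d)) for _ in range(3)]  # K = 3 relations
W = rng.standard_normal((dq, d))
b = rng.standard_normal(dq)
q = rng.standard_normal(dq)
beta, Z = relation_level_attention(Z_list, W, b, q)
print(Z.shape)  # (5, 4)
```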

D. MODEL OPTIMIZATION
Compared with general GNN methods or traditional heterogeneous graph representation learning models, our proposed model optimizes the representation learning and the specific task together in an end-to-end manner. In previous models, once the feature vectors of nodes are learned, they are frozen, and the supervised information generated in the downstream task cannot effectively guide the updating of the graph representation results. In contrast, in our model, the representation learning of nodes and the learning of downstream tasks are combined for end-to-end learning: the supervised signals of the entire model guide the updating of the relevant parameters of the specific task layer (such as the classification layer) and the graph attention layers at the same time.

1) FOR SEMI-SUPERVISED NODE CLASSIFICATION
We minimize the cross-entropy over all labeled nodes between the ground truth and the prediction:

L = − Σ_{i ∈ o_L} o_i ln(ω · z_i),   (17)

where ω represents the parameters of the classifier, o_L is the set of node indices that have labels, and o_i stands for the label of z_i. Guided by the labeled data, we can optimize the proposed model via back propagation and learn the embeddings of the nodes.
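A minimal sketch of this masked cross-entropy, assuming the classifier's per-node class scores have already been computed (a softmax over classes is folded in for illustration):

```python
import math

def semi_supervised_ce(scores, labels, labeled_idx):
    """Eq. (17): cross-entropy averaged over labeled nodes only.
    scores: per-node class scores from the classifier; labels: maps a
    labeled node index to its ground-truth class; labeled_idx: the set
    o_L of labeled node indices. Unlabeled nodes are simply ignored."""
    loss = 0.0
    for i in labeled_idx:
        exps = [math.exp(s) for s in scores[i]]  # softmax over classes
        prob = exps[labels[i]] / sum(exps)
        loss -= math.log(prob)
    return loss / len(labeled_idx)

# Node 2 carries no label and does not contribute to the loss.
scores = [[2.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
loss = semi_supervised_ce(scores, {0: 0, 1: 1}, [0, 1])
```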

2) FOR RELATION PREDICTION TASK (RECOMMENDATION TASK)
The labeled instances are a collection of actual links between nodes that belong to two different types, such as the viewing relationships between user nodes and movie nodes in the MovieLens dataset, and the purchasing relationships between user nodes and item nodes in the Amazon dataset. For any node pair v_i^A and v_j^B, which belong to types A and B respectively, after encoding from the different views we obtain their aggregated latent representations, denoted as (z_i^A, z_j^B). The probability of an interaction between them can then be calculated as follows:

ŷ_{ij} = sigmoid(z_i^A · z_j^B),   (18)

where sigmoid(·) is the sigmoid layer and ŷ_{ij} is a probability in the range [0, 1]. The loss function of the proposed model is then the following pointwise loss:

Loss = − Σ_{(i,j) ∈ Y ∪ Y^−} [ y_{ij} log ŷ_{ij} + (1 − y_{ij}) log(1 − ŷ_{ij}) ],   (19)

where y_{ij} is the ground truth of the labeled instance, and Y and Y^− are the sets of positive and negative instances, respectively.
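The interaction probability and pointwise loss can be sketched as follows (a minimal illustration assuming an inner-product score between the two aggregated representations; the toy embeddings are for illustration only):

```python
import math

def interaction_prob(z_a, z_b):
    """Eq. (18): interaction probability via the sigmoid of the
    inner product of the two aggregated representations."""
    score = sum(a * b for a, b in zip(z_a, z_b))
    return 1.0 / (1.0 + math.exp(-score))

def pointwise_loss(examples):
    """Eq. (19): pointwise log loss over positive (y = 1) and
    sampled negative (y = 0) instances; each example is (z_a, z_b, y)."""
    loss = 0.0
    for z_a, z_b, y in examples:
        y_hat = interaction_prob(z_a, z_b)
        loss -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return loss

# A well-separated positive pair and negative pair give a small loss.
examples = [([1.0, 1.0], [1.0, 1.0], 1),    # positive instance
            ([1.0, 1.0], [-1.0, -1.0], 0)]  # sampled negative instance
loss = pointwise_loss(examples)
```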

E. MODEL TRAINING
Using mini-batch stochastic gradient descent (SGD) with the Adaptive Moment Estimation (Adam) optimizer, the relevant parameters, including those of the structure loss function and the interaction loss function, can be continuously optimized. We first perform a forward pass to calculate the loss and then back-propagate to minimize the loss function; the relevant model parameters and the weights corresponding to the different views are then automatically updated in each iteration. Only a small amount of labeled data corresponding to a specific mining task is needed to train the attention mechanism and fine-tune the GNN encoder.

V. EXPERIMENTAL EVALUATION
The proposed method was evaluated on three real-world datasets to answer the following research questions:
Q1: How does SR-HGAT perform compared with other state-of-the-art methods on various graph mining tasks, such as node classification, node clustering, and relation prediction?
Q2: How do different components (e.g., nested propagation of multi-typed links, attention mechanism, and aggregator selection) affect the representation capability of the proposed SR-HGAT?
Q3: How do different hyper-parameter settings affect model performance?

A. EXPERIMENTAL SETTING 1) DATASETS
We study all models using data collected from three well-known online applications. Detailed statistics of all datasets are shown in Table 1.

a: DBLP
The DBLP dataset is an academic network dataset in the field of computer science. Here we extract a subset, denoted as DBLP-4area, which contains relevant literature from four research areas: databases, information retrieval, data mining, and machine learning. Each author's research area is labeled according to the venues in which they mainly published. For each paper, we use a pre-trained XLNet [29], [30] to obtain the representation of each word in its title, and we then average these representations, weighted by each word's attention, to obtain the title representation of the paper. The initial feature of each author is simply the average of his/her published papers' representations.

b: MovieLens
The MovieLens dataset comprises the movie viewing records of a massive number of users and other details related to movies. Here we extract a movie subset that covers five genres: action, adventure, science, education, and crime. Each movie falls into at least one of these genres. Movie features correspond to the elements of a bag-of-words representation of plots.

c: AMAZON
The Amazon dataset records user ratings on businesses and contains social relations and attribute data of businesses. In the experiments, we select items of the Electronics categories for evaluation, and each item falls into its corresponding category. Item features correspond to the elements of a bag-of-words representation of product descriptions. The initial feature of each user is simply the average of his/her purchased items' representations.

2) BASELINES FOR COMPARATIVE EVALUATION
We compare with some state-of-the-art baselines, including the network embedding methods and graph neural network based methods, to verify the effectiveness of the proposed SR-HGAT.
The first class of baselines includes several state-of-the-art network embedding approaches: VOLUME 8, 2020 a: DEEPWALK [2] a homogeneous graph embedding method, which performs a random walk on networks and then learns low-dimensional node representations via the skip-gram model. b: Node2vec [1] a homogeneous graph embedding method. Node2vec defines two parameters of p and q, so that random walk strikes a balance between BFS and DFS to preserve local and global information of nodes. c: Metapath2vec [3] a heterogeneous graph embedding method, which leverages meta-path based random walk and utilizes skip gram to perform node embedding. Considering that heterogeneous graph may have diverse meta-paths, we select the most efficient meta-path for experiment here to guide random walk. d: HIN2Vec [16] a heterogeneous network embedding approach. The core of the HIN2Vec framework is a neural network model, which is designed to capture the rich semantics embedded in HINs by exploiting different types of relationships among nodes.
The second class includes several GNN variants designed for homogeneous and heterogeneous graphs, respectively:

e: GCN [4]
A GNN variant designed for homogeneous graphs, which is a semi-supervised graph convolutional network that simply averages the neighbors' embeddings followed by a linear projection. Here we test all symmetric relations for GCN and report the best performance.

f: GAT [6]
A GNN variant designed for homogeneous graphs, which adopts multi-head additive attention over neighbors. Similar to GCN, we test all symmetric relations for GAT and report the best performance.

g: R-GCN [27]
A dedicated heterogeneous GNN, which keeps a different weight for each relationship, i.e., each relation triplet.
The third class includes several variants of our SR-HGAT as baselines:

h: SR-HGATnl
It is a variant of SR-HGAT, which removes the nested propagation of multi-typed link features mechanism and only considers the node features.

i: SR-HGATnar
It is a variant of SR-HGAT, which removes the relation-level attention and assigns the same weight to each symmetric relation.

j: SR-HGATnan
It is a variant of SR-HGAT, which removes the intra-relation attention and assigns the same weight to each neighbor vector.

k: SR-HGAT
Our proposed semi-supervised graph neural network which simultaneously employs two-layer hierarchical attention and nested propagation of multi-typed link features.

3) IMPLEMENTATION DETAILS
For the proposed SR-HGAT, in the training stage we randomly initialize the parameters and optimize them with the Adam optimizer. The batch size and learning rate are searched in {64, 128, 256, 512} and {0.0005, 0.001, 0.002, 0.005}, respectively. Dropout is applied to our model, with the dropout rate tested in {0.1, 0.4, 0.5, 0.6}. These parameters are then carefully fine-tuned to reach the optimal setting. Early stopping with a patience of 100 is applied to avoid overfitting, i.e., we stop training if the validation loss does not decrease for 100 consecutive epochs. For all baseline methods, we follow the instructions described in their original papers, perform a grid search over the hyper-parameter values, and choose an optimal combination on each dataset. For the sake of fair comparison, we set the same embedding dimension for all of the above algorithms. Moreover, for the semi-supervised graph neural networks, including GCN, GAT, R-GCN, and our SR-HGAT, we use exactly the same training and testing splits to ensure fairness. For the random walk based methods, including DeepWalk and metapath2vec, we set the window size to 5, the walk length to 100, the number of walks per node to 40, and the number of negative samples to 5. In the parameter sensitivity analysis subsection, we vary the number of symmetric relations K in the range {1, 2, 3, 4, 5} and the embedding dimension d in the range {32, 64, 128, 256, 512, 1024}, and further investigate their impact on the performance of the proposed model on the different datasets. All experiments are conducted on an Ubuntu 14.04 LTS system with two CPUs (Intel Xeon E5 × 2) and two GPUs (NVIDIA GTX-1080Ti × 2).
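The early-stopping rule described above can be sketched as follows (a minimal sketch: the training loop, model, and validation pass are omitted, and a toy patience value is used in the example instead of 100):

```python
def early_stop_epoch(val_losses, patience=100):
    """Return the epoch at which training stops: the first epoch at
    which the validation loss has failed to improve for `patience`
    consecutive epochs, or the last epoch if that never happens."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch  # new best checkpoint
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1

# With patience 2, training stops two epochs after the minimum at epoch 1.
stop = early_stop_epoch([1.0, 0.5, 0.6, 0.7, 0.8], patience=2)
```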

B. NODE CLASSIFICATION
We start by evaluating the quantitative results through the node classification task. Node representations learned on DBLP, Amazon, and MovieLens by the various baselines were treated as input features for the node classification task, and the classification results were assessed to estimate the quality of the embeddings. A logistic regression classifier was chosen as the classification algorithm. In the course of the experiments, the ratio of training data was varied from 10% to 90%. Node representations in the training set were used to train the logistic regression classifier, which was then applied to the testing set. Each experiment was repeated 10 times and the average results are reported. Micro-F1 and Macro-F1 were chosen as the evaluation metrics for this task. The results are presented in Tables 2, 3, and 4. As demonstrated by the experimental results, the node representations learned by the proposed SR-HGAT model perform best on the classification task. SR-HGAT achieves significant improvements in terms of Micro-F1 and Macro-F1 over the other baselines on all three real-world datasets. We also note that SR-HGAT improves classification results more significantly on Amazon and MovieLens than on DBLP, because the single symmetric relation guided by the meta-path A-P-V-P-A is much more important than the remaining symmetric relations in the DBLP dataset. Compared to the heterogeneous graph approaches, homogeneous graph embedding methods such as DeepWalk fail to perform well. In addition, among the traditional heterogeneous graph embedding methods, HIN2Vec, which can leverage multiple meta-paths, performs better than metapath2vec. Generally, GNN-based models that combine structure and feature information, e.g., GCN and GAT, perform better.
Moreover, compared to GCN's simple averaging over node neighbors, GAT can weigh the information properly and improve the performance of the learned embeddings. The proposed SR-HGAT, which is designed for heterogeneous graphs, can encode more complex neighborhood information into the latent embedding space and shows its superiority. Furthermore, without nested propagation or the two-layer attention mechanism, performance becomes worse than that of the complete SR-HGAT. Specifically, SR-HGATnl performs worse than the other two variants, SR-HGATnan and SR-HGATnar.
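The Micro-F1 and Macro-F1 metrics used in this task can be computed from predictions as follows (a minimal self-contained sketch; the labels in the example are toy values):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Micro-F1 (F1 over global TP/FP/FN counts) and Macro-F1
    (unweighted mean of per-class F1), as used for node classification."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1
    def f1(t, f_p, f_n):
        return 2 * t / (2 * t + f_p + f_n) if t else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# Toy example: one of the four nodes is misclassified.
micro, macro = f1_scores([0, 0, 1, 1], [0, 1, 1, 1])
```

Micro-F1 weights every node equally, while Macro-F1 weights every class equally, which is why the two can diverge on imbalanced label distributions.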

C. NODE CLUSTERING
Node clustering was further carried out to evaluate the embeddings learned by the above algorithms. In the DBLP dataset, we cluster authors using the set of author labels. For the MovieLens and Amazon datasets, movie clustering and item clustering were conducted, with the corresponding label sets selected according to the genres of movies and the categories of items, respectively. The K-Means algorithm was utilized to perform node clustering, with the number of clusters K set to the number of classes. Since the performance of K-Means is affected by the initial centroids, all clustering experiments were repeated 10 times and the average results are reported. Clustering performance was evaluated by the normalized mutual information (NMI) metric: the higher the NMI value, the better the clustering performance.
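The NMI metric can be sketched in plain Python (a minimal sketch; the geometric-mean normalization shown here is one common convention, and the toy labels are for illustration only):

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between a clustering and the
    ground-truth labels: I(T; P) / sqrt(H(T) * H(P))."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information from the joint and marginal label counts.
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    entropy = lambda counts: -sum(c / n * math.log(c / n)
                                  for c in counts.values())
    denom = math.sqrt(entropy(ct) * entropy(cp))
    return mi / denom if denom else 1.0

# A clustering that matches the labels up to a permutation has NMI = 1.
score = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```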
Clustering results on the three datasets are presented in Table 5. It can be seen that the proposed model consistently performs much better than the other baseline methods. Specifically, SR-HGAT yields at least 1.5%, 3.9%, and 1.8% improvements in NMI over the other baseline models when clustering author nodes in DBLP, movie nodes in MovieLens, and item nodes in Amazon, respectively. The SR-HGAT model is able to produce high-quality representations that capture accurate network information specific to the features of multi-typed nodes and links in a uniform way. As on the node classification task, the traditional network embedding methods perform poorly, while graph neural network based algorithms usually achieve better performance. Without distinguishing the importance of neighbors, however, GCN cannot perform well. With the guidance of multiple symmetric relations, SR-HGAT performs significantly better than GCN and GAT. In contrast, without intra-relation attention or relation-level attention, the performance of SR-HGAT shows various degrees of degeneration. This demonstrates that by assigning different importance to neighbors and semantic relations, the proposed SR-HGAT can learn a more meaningful and comprehensive representation. Based on the above analysis, the proposed SR-HGAT gives a comprehensive description of the complex structures and rich semantics behind a heterogeneous graph and achieves a significant improvement.

D. RELATION PREDICTION
A qualified network representation method should not only reconstruct the edges visible during training but also predict edges that should exist but are missing from the training data. Since the research object of this paper is the heterogeneous graph, in which there is no direct connection between nodes of the same type, we choose the relation prediction (recommendation) task, which tests the ability to predict interaction links between two types of nodes. In the recommendation task, for the Amazon dataset we predict the purchase relationship between user nodes and commodity nodes, and for the MovieLens dataset we predict the rating relationship between user nodes and movie nodes. We apply the leave-one-out method for evaluation. For a fair comparison with the baselines, we use the same negative sample set for each (user, item) or (user, movie) pair in the testing sets of Amazon and MovieLens, respectively, for all methods. We then adopt the widely used evaluation protocols HR@K and NDCG@K to measure the quality of recommendation. We set K = 5, 10, 15, 20, and the average metrics over all users in the test set are reported. The results of the recommendation task are reported in Tables 6 and 7 with HR and NDCG scores, respectively.
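The HR@K and NDCG@K metrics can be sketched as follows, assuming the standard leave-one-out setting with a single held-out positive item per user (the 0-based rank of that item among the scored candidates is assumed to be precomputed):

```python
import math

def hr_at_k(rank, k):
    """Hit Ratio@K: 1 if the held-out positive item is ranked
    within the top K candidates, else 0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    """NDCG@K with a single positive item: the ideal DCG is 1, so
    NDCG reduces to 1 / log2(rank + 2) if the item is in the top K."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# The positive item is ranked 3rd (0-based rank 2) among candidates.
hr = hr_at_k(2, 10)      # hit within the top 10
ndcg = ndcg_at_k(2, 10)  # discounted by the item's position
```

Per-user values are then averaged over all users in the test set, as described above.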
We have the following observations: (1) Our model consistently outperforms all the baselines, suggesting the effectiveness of SR-HGAT for the relation prediction task. On the Amazon dataset, SR-HGAT generates at least 7.1% and 4.3% improvements in HR@K and NDCG@K, respectively, compared with the other baselines. Moreover, on the MovieLens dataset, the average relative improvements over the best-performing baselines are 1.7% and 2.2%, respectively. The results also show that our model performs best among all methods regardless of whether the datasets are sparse or dense. (2) Among the baselines, GCN and GAT generally outperform traditional embedding approaches, such as metapath2vec and HIN2Vec, suggesting the power of graph neural network models in representation learning for graph data. It is worth noting that conventional embedding methods do not perform well, since they cannot simultaneously preserve the features of links and nodes. (3) Compared with SR-HGAT, the performances of SR-HGATnar and SR-HGATnan show various degrees of degeneration. These results are consistent with our assumptions that not all neighbors contribute equally and not all relations have the same importance in learning the final embeddings. This phenomenon also demonstrates that the hierarchical two-layer attention is effective for generating meaningful node representations. Moreover, SR-HGATnl exhibits worse performance than the other variants in most cases, indicating that our proposed nested propagation of link features has a superior capability of capturing the ''deep'' content of link features.

E. HYPER-PARAMETER SENSITIVITY ANALYSIS
The hyper-parameters play important roles in SR-HGAT, as they determine how the node embeddings are generated. We conduct experiments to analyze the impact of several key parameters, involving (1) the dimension of node embeddings, (2) the selection of different symmetric relations, and (3) the number of symmetric relations. In these experiments, we investigate a specific parameter by changing its value while fixing the others.

1) DIMENSION OF NODE REPRESENTATIONS
How to determine the optimal number of embedding dimensions is still an open research problem. The dimension of the representations (denoted by d) is a key parameter that controls the complexity and capacity of SR-HGAT. Node clustering and relation prediction performance are analyzed as the dimension of the node representations learned by SR-HGAT varies. We explore the experimental results under various dimensions.
The results are shown in Figure 12. Generally speaking, as the embedding dimension is gradually increased, the performance improves, since a larger dimension enhances the representation capability. Nevertheless, when the dimension exceeds the optimal value, performance starts to drop slowly. The reason is that SR-HGAT needs a suitable dimension to encode the semantic information, and a larger dimension may introduce additional redundancy. It can be seen that SR-HGAT achieves the best performance when the dimension d is set to 128 on DBLP and to 256 on MovieLens and Amazon. Beyond that, the performance of SR-HGAT starts to degenerate, which may be due to overfitting. Therefore, a proper embedding dimension should be employed to balance the trade-off between performance and complexity.

2) IMPACT OF DIFFERENT SYMMETRIC RELATIONS
As mentioned before, the proposed SR-HGAT can learn the importance of relations for the specific task. To verify the ability of relation-level attention, taking DBLP and Movie-Lens as examples, we rerun SR-HGAT with individual relation and report the clustering results (NMI) of each symmetric relation guided by meta-path or meta-graph and corresponding attention values in Figure 13 and 14.
Obviously, a positive correlation exists between the performance of each symmetric relation and its attention value. For the MovieLens dataset, SR-HGAT gives the symmetric relations guided by the meta-graphs U-M-(A,D)-M-U and M-(A,D)-M the largest weights, which means that SR-HGAT considers them the most critical relations in inferring the relation between user nodes and movie nodes. This makes sense because users' preferences and the stars (including famous actors and directors) they pursue are highly correlated. Moreover, from the results shown in Figure 13, one can observe that the symmetric relation about purchasing (UIU-IUI) achieves superior performance over the other individual relations, because it contains more important information, which indicates the purchase history and plays a leading role in identifying a user's preferences. It can be seen that relation-level attention can reveal the differences between these symmetric semantic relations and weigh them properly.

3) NUMBER OF SYMMETRIC RELATIONS
To investigate whether SR-HGAT can benefit from multiple symmetric relations, we gradually increase the number of relations K over the range {1, 2, 3, 4, 5} and check the performance changes, while keeping the other parameters the same. Note that the relation-level attention is removed when the number of symmetric relations considered is set to 1. Figure 15 shows the experimental results on the different datasets. It can be observed that performance generally improves as more symmetric relations are incorporated, at which point the power of multiple relations becomes more prominent. However, adding more relations does not always yield improvement, and the performance fluctuates slightly. The reason is that some relations guided by meta-paths or meta-graphs may contain information that is noisy or conflicts with the existing relations. Moreover, the performance remains stable or becomes slightly worse as the number of relations increases further. We also analyze how many relations should be considered simultaneously to achieve a balance between better performance and lower computational cost. In our experiments, the optimal number of relations varies with the dataset: considering three relations simultaneously is sufficient on Amazon, and four on MovieLens, to meet the demands of most downstream applications. The experimental results also show that our proposed collaborative framework can indeed improve performance by facilitating the alignment of different symmetric relations.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose an end-to-end GAT framework for heterogeneous graphs, named SR-HGAT, which applies the attention mechanism to the aggregation over heterogeneous graphs by jointly considering the characteristics of both nodes and edges. Our proposed model can capture the rich semantics beneath the observed explicit symmetric relations guided by different meta-paths and meta-graphs in heterogeneous graphs. Meanwhile, SR-HGAT utilizes the nested aggregation mechanism over these link features to calculate the interaction strength of each symmetric relation. As the core of the proposed model, a two-layer attention mechanism is adopted to learn the importance of different neighborhood information as well as the weights of different symmetric relations; these latent semantics are then fused automatically to obtain unified embeddings specific to graph mining tasks. Extensive experiments on various tasks demonstrate the effectiveness of our proposed model. In the future, we will explore whether SR-HGAT is able to handle heterogeneous graph dynamics, and whether we can conduct efficient and scalable training of our model on Web-scale graph data.

ACKNOWLEDGMENT
Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.
ZHENGHAO ZHANG is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Xidian University, Xi'an. His research interests include graph data mining, machine learning, and heterogeneous information networks.
JIANBIN HUANG received the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an. In 2009, he held a postdoctoral research position with the University of Illinois at Urbana-Champaign. He is currently a Professor with the School of Computer Science and Technology, Xidian University. His research interests include data mining and knowledge discovery.
QINGLIN TAN is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Xidian University, Xi'an, China. His research interests include data mining, reinforcement learning, and traffic pattern analysis.