Siamese Network Based Multiscale Self-Supervised Heterogeneous Graph Representation Learning

Owing to label-free modeling of complex heterogeneity, self-supervised heterogeneous graph representation learning (SS-HGRL) has been widely studied in recent years. The goal of SS-HGRL is to design an unsupervised learning framework to represent complicated heterogeneous graph structures. However, most existing SS-HGRL methods are based on contrastive learning and require a large number of negative samples, which significantly increases computation and memory costs. Furthermore, many methods cannot fully extract knowledge from a heterogeneous graph. To learn global and local information simultaneously at low time and space costs, we propose a novel Siamese Network based Multi-scale bootstrapping contrastive learning approach for Heterogeneous graphs (SNMH). Specifically, we first obtain views under the meta-path schema and the 1-hop relation type schema through dual-schema view generation. Then, we propose cross-schema and cross-view bootstrapping contrastive objectives to maximize the similarity of node representations between different schemas and views. By integrating and optimizing these objectives, we extract local and global information and eventually obtain node representations for downstream tasks. To demonstrate the effectiveness of our model, we conduct experiments on several public datasets. Experimental results show that our model outperforms state-of-the-art methods while incurring lower time and space complexity. The source code and datasets are publicly available at https://github.com/lorisky1214/SNMH.

learn high-order embeddings of nodes or graphs that preserve the information of node attributes and graph topological structure, which can be used for a wide variety of downstream tasks. Benefiting from the development of deep learning, most successful GRL methods extend neural networks to graph data and are classified as graph neural networks (GNNs). They have obtained significant results on many tasks, such as node classification [3]-[5], recommendation systems [6]-[8], and link prediction [9]-[11].

Despite this fruitful progress, GNNs are mostly applied in a supervised manner [3], [4], [12], [13], which requires a large number of labeled nodes for training. Moreover, the acquisition of label information in the real world is very costly and

(The associate editor coordinating the review of this manuscript and approving it for publication was Mauro Tucci.)

FIGURE 1. An example of a heterogeneous graph from the ACM dataset and related illustrations of meta-path and relation based 1-hop neighbors. Nodes with red frames indicate that their information is discarded during the encoding process.
However, real-world graphs often contain multiple node types and relation types represented by edges; these are called heterogeneous graphs and carry more comprehensive information and richer semantics. As a typical characteristic of heterogeneous graphs, the meta-path [24] can capture semantic information in a graph by representing the composite relation between two nodes. Fig. 1 shows an example of a heterogeneous graph from the ACM dataset, which contains three types of nodes (Author, Paper and Subject) with Write and Belong-to relation types between them. Meanwhile, the meta-paths between two papers can be divided into two types, i.e., Paper-Author-Paper (PAP) and Paper-Subject-Paper (PSP). PAP means that two papers share the same author, and PSP means that they share the same subject. DMGI [25] and HDGI [26], two current self-supervised heterogeneous GRL methods, first generate node embeddings for each meta-path type and then integrate the embeddings carrying different semantic information using a consistent regularization framework. HeCo [27] goes a step further by proposing a collaborative contrastive learning mechanism that encodes nodes to handle heterogeneity from both network schema and meta-path views. Although the aforementioned methods have achieved significant success, they are all subject to one of the following problems: 1) Ignoring local neighborhood information. If we only focus on semantic information, the representations will fail to extract useful information from direct neighbors. To this end, it is necessary to design a mechanism that simultaneously learns the rich local and global information in a graph. 2) Dependence on a large number of negative samples. This leads to high time and space complexity. At the same time, it is difficult to define negative samples on graphs in a principled way.
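To make the meta-path idea concrete, the PAP connectivity described above can be derived from the Paper-Author incidence structure. Below is a minimal NumPy sketch with a hypothetical toy incidence matrix (not taken from the paper's datasets):

```python
import numpy as np

# Hypothetical Paper-Author incidence matrix (3 papers x 2 authors):
# entry (i, j) is 1 if author j wrote paper i.
A_pa = np.array([[1, 0],
                 [1, 0],
                 [0, 1]])

# Two papers are PAP-connected when they share at least one author,
# which is exactly a nonzero entry of A_pa @ A_pa.T.
A_pap = (A_pa @ A_pa.T > 0).astype(int)
np.fill_diagonal(A_pap, 0)  # drop trivial self-connections
```

Here papers 0 and 1 share an author and become PAP neighbors, while paper 2 does not; a PSP adjacency would be built the same way from a Paper-Subject incidence matrix.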
To solve the aforementioned problems, inspired by bootstrapping in the Siamese network [1], we propose SNMH, a novel multi-scale self-supervised heterogeneous GRL method that comprehensively extracts rich information from heterogeneous graphs at low time and space costs. Specifically, distinct from current methods, we propose dual-schema view generation to obtain meta-path based views and relation type based 1-hop views, which represent global and local information, respectively. Furthermore, we construct cross-view and cross-schema variant Siamese architectures. By maximizing the similarity of node representations between different views under the same schema and between the two schemas, we obtain node representations containing abundant node attribute and topological information without negative samples. Experimental results on various datasets demonstrate the excellent performance of our model.

Our contributions can be summarized as follows:

• SNMH is the first trial to apply bootstrapping in the Siamese network to self-supervised heterogeneous graph representation learning, which can reduce time and space costs by avoiding negative samples.

A heterogeneous graph is defined as G = {V, E}, where V is a set of nodes and E is a set of edges. It has a node-type mapping function φ : V → T and an edge-type mapping function ψ : E → R, where T and R represent the node-type set and the edge-type set, respectively, and |T| + |R| > 2. Fig. 1 (a) shows an example of a heterogeneous graph with Author (A), Paper (P) and Subject (S) nodes. There are two types of relations, i.e., Write and Belong-to, which mean that the author writes the paper and the paper belongs to the subject, respectively.
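The definition above can be sketched in a few lines of Python; the node and edge names below are hypothetical ACM-style examples, not data from the paper:

```python
# phi maps nodes to their types (the set T); psi maps edges to relation types (the set R).
phi = {"a1": "Author", "p1": "Paper", "p2": "Paper", "s1": "Subject"}
psi = {("a1", "p1"): "Write", ("a1", "p2"): "Write", ("p1", "s1"): "Belong-to"}

T = set(phi.values())  # node-type set
R = set(psi.values())  # edge-type set

# The heterogeneity condition from the definition: |T| + |R| > 2.
is_heterogeneous = len(T) + len(R) > 2
```

With one node type and one relation type the condition fails, which recovers the usual notion of a homogeneous graph.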

In this paper, we represent the attributes of nodes with type φ_i as the initial feature matrix X^{φ_i} ∈ R^{|V^{φ_i}| × F^{φ_i}}, where |V^{φ_i}| is the number of nodes with type φ_i and F^{φ_i} is the initial feature dimension. We specify the set of target nodes for representation learning as V^{φ_t}.

A meta-path is defined as a path of the form v_1 →(R_1) v_2 →(R_2) · · · →(R_l) v_{l+1}. It describes the composite relation R = R_1 • R_2 • · · · • R_l between nodes v_1 and v_{l+1}, where • represents a combination operator on relations. Meta-paths can model rich semantic information in heterogeneous graphs. As shown in Fig. 1 (b), two papers can be connected by PAP and PSP meta-paths. PAP means that two papers share the same author, and PSP means that they share the same subject. In this paper, we represent the set of meta-paths by its P types, where P is the number of meta-path types, and the topology of the view based on meta-path type k can be expressed as an adjacency matrix.

We represent the neighborhood information pertaining to each view as a relation type based adjacency matrix A_k ∈ R^{|V^{φ_t}| × |V^{φ_k}|}. If there is an edge of relation type k between target node v_i ∈ V^{φ_t} and node v_j ∈ V^{φ_k}, then (A_k)_{ij} = 1; otherwise, (A_k)_{ij} = 0.

To simultaneously learn the above information, we innovatively propose a dual-schema view generation mechanism, as shown in Fig. 3.

Assume that the original graph is G. To start with, according to the first schema, i.e., the meta-path, we generate a series of views. Next, for the second schema, i.e., the 1-hop relation type, we obtain another series of views.

Since heterogeneous graphs contain multiple node types with different feature spaces or feature dimensions, we first need to project them into the same dimension space to facilitate processing by subsequent modules. Specifically, for the nodes of a certain type φ_i, we construct a type-specific linear transformation that maps the initial features X^{φ_i} of the nodes into a unified dimension.

The cross-schema learning procedure is shown in Fig. 3. We apply bootstrapping contrastiveness by maximizing the cosine similarity between the node representations obtained under the two schemas.

Each GCN encoder encodes node embeddings under one meta-path. Apparently, the effects of different meta-paths on the quality of the resulting node embeddings are distinct. Intuitively, if target nodes are mostly connected through a certain type of meta-path, this meta-path type affects their representations most. Based on this, we treat the encoder of the view containing the largest number of meta-path instances as the anchor encoder. As shown in Fig. 3 (a), we assume that the top encoder is the anchor encoder and the bottom encoders are non-anchor encoders. The ellipses in the figure indicate that the number of non-anchor encoders may be 1, 2, 3, . . . , depending on the number of meta-path types in a heterogeneous graph (we assume a minimum of 2). During the learning process, only the parameters of the anchor encoder are updated by gradient descent to reduce the target loss, while the parameters of the non-anchor encoders follow different targets.
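The type-specific projection step described above can be sketched as follows; the feature dimensions, unified dimension d, and random weights are hypothetical stand-ins for the learned transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # unified hidden dimension (hypothetical)

# Hypothetical initial features: papers carry 7-dim features, authors 5-dim.
X = {"Paper": rng.normal(size=(4, 7)), "Author": rng.normal(size=(3, 5))}

# One linear transformation per node type, projecting every type into the
# shared d-dimensional space so later modules can process them uniformly.
W = {t: rng.normal(size=(x.shape[1], d)) for t, x in X.items()}
H = {t: x @ W[t] for t, x in X.items()}
```

After this step all node types live in the same space, regardless of their original feature dimensions.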
The intuition behind this is that the slow-moving non-anchor encoders act as a stabilizer to encode the meta-paths that are not the most influential. This guides the anchor encoder to explore richer and better representations on the basis of the most influential meta-paths, without relying on additional negative samples to avoid collapse. The parameters of the non-anchor encoders are updated as an exponential moving average (EMA) of the parameters of the anchor encoder:

δ ← τ · δ + (1 − τ) · η,

where η and δ are the parameters of the anchor encoder and the non-anchor encoders, respectively, and τ is a decay rate that controls the distance between η and δ; its update can be seen in formula (19).
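The EMA update above can be sketched in plain NumPy with hypothetical toy parameter tensors:

```python
import numpy as np

def ema_update(anchor_params, target_params, tau):
    # delta <- tau * delta + (1 - tau) * eta: the non-anchor (target) encoder
    # slowly tracks the anchor (online) encoder instead of receiving gradients.
    return [tau * d + (1.0 - tau) * e
            for e, d in zip(anchor_params, target_params)]

eta = [np.ones((2, 2))]    # anchor encoder parameters (toy values)
delta = [np.zeros((2, 2))]  # non-anchor encoder parameters (toy values)
delta = ema_update(eta, delta, tau=0.99)
```

With tau close to 1, each step moves the non-anchor parameters only slightly toward the anchor, which is what makes them a stable target.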

After the above meta-path specific node representation learning, we obtain a set of node embeddings {H_i}_{i=1}^{P}.

Now, we need to aggregate the node embeddings above. Considering that the appropriate aggregation method may change for datasets with different distributions of the number of meta-paths, we implement distinct aggregation methods for different datasets. The first aggregation method is average pooling, which calculates the average of the set of embedding matrices. For the second method, we employ semantic-level attention [31] to fuse the node embeddings into the final embedding H_mp in the meta-path schema, where β_i weighs the importance of meta-path i.

Similar to the meta-path schema, each relation type based view is fed into an identically structured GNN.
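The two aggregation options described above can be sketched as follows; the importance scores w are hypothetical stand-ins for the learned semantic-attention logits:

```python
import numpy as np

def average_pooling(H_list):
    # First option: element-wise mean over the per-meta-path embeddings.
    return np.mean(H_list, axis=0)

def semantic_attention(H_list, w):
    # Second option: softmax-normalized weights beta_i fuse the embeddings.
    w = np.asarray(w, dtype=float)
    beta = np.exp(w - w.max()) / np.exp(w - w.max()).sum()
    return sum(b * H for b, H in zip(beta, H_list))

H_list = [np.ones((3, 4)), 3 * np.ones((3, 4))]
H_avg = average_pooling(H_list)                  # every entry is 2.0
H_att = semantic_attention(H_list, [0.0, 0.0])   # equal logits reduce to the mean
```

With equal logits the attention weights are uniform and both methods coincide; in practice the logits are learned, so influential meta-paths receive larger beta_i.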

Unlike the MP encoder, we set g_i as a node-level attention layer here. For node n in the i-th relation type view, its representation in this layer can be calculated as

h'_n = σ( Σ_{m ∈ N_n^i} α_{nm}^i · h_m ),

where the attention coefficient α_{nm}^i can be calculated as

α_{nm}^i = softmax_m ( LeakyReLU( a_i^T [h_n || h_m] ) ),

where a_i ∈ R^{2F×1} is the node-level attention vector and || indicates the concatenation operation.

For the selection of nodes in N_n^i, we do not simply take all the nodes directly connected with node n through relation i. Instead, we design a per-relation threshold. When the number of neighbors under relation i is greater than the specified threshold, we choose that many neighbors at random without repetition to join N_n^i; otherwise, neighbors can be selected repeatedly. In this way, the threshold ensures that each node under the same view aggregates the same amount of neighborhood information, while the random selection ensures the diversity of node embeddings in each epoch.
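The fixed-size neighbor sampling just described can be sketched as below; the threshold value and neighbor lists are hypothetical:

```python
import numpy as np

def sample_neighbors(neighbors, threshold, rng):
    # Sample exactly `threshold` neighbors: without replacement when there
    # are enough of them, with replacement otherwise, so every node
    # aggregates the same amount of neighborhood information.
    replace = len(neighbors) < threshold
    return rng.choice(neighbors, size=threshold, replace=replace)

rng = np.random.default_rng(0)
many = sample_neighbors([1, 2, 3, 4, 5], threshold=3, rng=rng)  # no repeats
few = sample_neighbors([7, 8], threshold=3, rng=rng)            # repeats allowed
```

Re-running the sampler each epoch yields a different neighbor subset, which is the source of the embedding diversity mentioned above.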

The specific learning process of each view is similar to that of the meta-path schema. The only difference is that here we choose the encoder of the view with the largest number of 1-hop neighbors as the anchor encoder.

After learning the 1-hop relation type specific node representations described above, we obtain a set of node embeddings {H_i}_{i=1}^{|R|}. Next, we use type-level attention to fuse them into the final embedding H_rt in the 1-hop relation type schema.

In addition to the bootstrapping contrastiveness between the two schemas, we additionally consider the relationship between the views within the meta-path schema, which acts as a strong regularization and is highly informative for improving the performance of our model. Details are shown in Fig. 4.

VOLUME 10, 2022

Since there are usually more than two meta-path types in a heterogeneous graph, more than two encoders are involved, which yields additional cross-view optimization objectives. We define the overall objective as the combination of the cross-schema and cross-view objectives, balanced by an equilibrium parameter λ.
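A BYOL-style negative-cosine form of the bootstrapping objective, combined with an equilibrium parameter, can be sketched as below; the exact loss in the paper may differ in details, and the embeddings and lam value here are hypothetical:

```python
import numpy as np

def cosine_loss(z1, z2):
    # Negative mean cosine similarity between paired node representations;
    # minimizing this loss maximizes their agreement, with no negative samples.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    return -(z1 * z2).sum(axis=1).mean()

rng = np.random.default_rng(0)
h_mp, h_rt, h_view = (rng.normal(size=(5, 8)) for _ in range(3))

lam = 0.5  # hypothetical equilibrium parameter
loss = cosine_loss(h_mp, h_rt) + lam * cosine_loss(h_mp, h_view)
```

The first term plays the role of the cross-schema objective and the second the cross-view objective; lam trades the two off.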

To evaluate the performance of SNMH, we conduct experiments on three public datasets, whose basic statistics are shown in Table 1. We implement five methods as baselines; their specific characteristics are shown in Table 2. We set the initial learning rate to γ_0 = 0.5 and the total number of epochs to n_total = 10000.

To increase stability, we use batch normalization between layers and a learning rate with a cosine schedule [28]:

γ_t = γ_0 · (cos(π t / n_total) + 1) / 2.

The decay rate τ in formula (7) is initialized as τ_0 = 0.99 and also follows a cosine schedule:

τ_t = 1 − (1 − τ_0) · (cos(π t / n_total) + 1) / 2.

We run each experiment 10 times with random initializations and present the average results with standard deviation values.
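One common realization of such cosine schedules is the interpolation below; this is a sketch consistent with the stated initial values γ_0 = 0.5 and τ_0 = 0.99, not necessarily the paper's exact formulas:

```python
import math

def cosine_schedule(v_start, v_end, t, n_total):
    # Cosine interpolation from v_start at t = 0 to v_end at t = n_total.
    return v_end + (v_start - v_end) * (1 + math.cos(math.pi * t / n_total)) / 2

n_total = 10000
lr_mid = cosine_schedule(0.5, 0.0, 5000, n_total)    # learning rate decays to 0
tau_mid = cosine_schedule(0.99, 1.0, 5000, n_total)  # decay rate grows toward 1
```

The learning rate shrinks smoothly to zero while the EMA decay rate approaches 1, so the non-anchor encoders move more and more slowly as training ends.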

In the comparative experiments, we mainly follow the settings of HeCo [27]. For HERec, we set the window size, the number of walks per node and the walk length to 5, 40 and 100, respectively. For both HERec and DGI, we test all meta-paths and present their best results. Other parameters follow the settings in the original papers.

To evaluate the trained graph encoder, we use the learned node embeddings to fit a logistic regression classifier. During fitting, the embeddings are frozen so that no gradient flows back to the encoder. For the Freebase, DBLP and ACM datasets, we randomly select 20, 40 and 60 labeled nodes per class as the training set, respectively, together with 1000 nodes as the validation set and 1000 nodes as the test set. We report the test performance at the point where the validation set achieves its optimal result. We compare SNMH with the baselines in terms of AUC, Micro-F1 and Macro-F1. The results are shown in Table 3, where the best performance is marked in bold. As the table shows, SNMH outperforms the other baselines on all datasets. We attribute these results to two points: 1) we design a novel network based on the Siamese network, which can extract the information in historical representations without relying on negative samples; 2) we construct a multi-scale mechanism to learn node features as well as global and local structural information in heterogeneous graphs from cross-view and cross-schema perspectives. Even though HAN uses node label information during training, SNMH is still superior to HAN, which further confirms the effectiveness of the self-supervised learning of SNMH.
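The linear-evaluation protocol above can be sketched with scikit-learn on hypothetical frozen embeddings; the class means, split sizes and embedding dimension here are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical frozen embeddings for two well-separated classes.
Z = np.vstack([rng.normal(0.0, 1.0, (50, 16)),
               rng.normal(3.0, 1.0, (50, 16))])
y = np.array([0] * 50 + [1] * 50)

# Fit a logistic regression on the frozen embeddings; no gradient reaches the
# encoder because Z is treated as fixed input data.
clf = LogisticRegression(max_iter=1000).fit(Z[::2], y[::2])  # train split
acc = clf.score(Z[1::2], y[1::2])                            # held-out split
```

Because the classifier is linear, its held-out accuracy directly reflects how linearly separable the frozen embeddings are.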

To intuitively evaluate our model, we visualize the node embeddings of ACM obtained by HERec, HAN, DMGI and SNMH using the t-SNE [38] algorithm; the results are shown in Fig. 5. We also compute the Silhouette scores of the four methods, which are 0.209, 0.323, 0.327 and 0.335, respectively. We can see that HERec cannot effectively distinguish different classes because of its lack of initial features. Even though HAN takes node labels as input, SNMH still works better than HAN. SNMH shows clearer boundaries and denser clusters than the others, as well as a higher Silhouette score, which demonstrates the effectiveness of our method. The time and space comparison results are shown in Table 4.

As shown in Table 4, SNMH has a clear advantage in terms of time and space when sampling more 1-hop neighbors of target nodes.

In addition, we perform experiments to see how the number of meta-paths affects time and space. The results are shown in Fig. 7, where the solid lines represent graphics memory and the dashed lines represent time. It can be seen from the figure that the time curves for different meta-paths show almost the same trend and are all much lower than that of HeCo, indicating that the number of meta-paths has minimal influence on the time required by SNMH. When the number of MAM instances increases, the graphics memory required by SNMH increases slowly and always remains below that of HeCo, whereas it stays constant when the number of MDM or MWM instances increases. This finding suggests that MAM, as the meta-path encoded by the online (anchor) encoder, has a greater impact on the space required by the model. These results also demonstrate the superiority of SNMH in time and space when more meta-paths are encoded.

In this section, we also investigate a key hyperparameter, i.e., the equilibrium parameter λ in the final objective (17). We perform node classification on ACM, DBLP and Freebase under different values of λ. The results indicate that the cross-view objective helps balance the information between different meta-paths for datasets where the number of different meta-paths varies substantially (i.e., imbalanced datasets). Since the majority of real-world datasets are imbalanced, these findings illustrate the significance of utilizing both the cross-schema and cross-view parts for multi-scale learning.

In this paper, we propose a novel self-supervised heterogeneous graph representation learning method called SNMH. To capture rich self-supervised signals, we conduct dual-schema view generation to obtain meta-path based views and relation type based 1-hop views, which represent the global and local information in a heterogeneous graph, respectively. Based on the Siamese network, SNMH implements a multi-scale bootstrapping contrastive learning mechanism to learn node representations in heterogeneous graphs from cross-schema and cross-view aspects. Our method does not require any negative samples, thus reducing time and space costs. Experimental results show that our method is superior to other methods. In the future, we will continue to research more efficient methods with lower time and space complexity.

The authors are grateful to the anonymous referee, who made valuable suggestions to help improve the article.