Multi-Grained Semantics-Aware Graph Neural Networks

Graph Neural Networks (GNNs) are powerful techniques for representation learning on graphs and have been increasingly deployed in a multitude of applications involving node- and graph-wise tasks. Most existing studies solve either the node-wise task or the graph-wise task independently, although the two are inherently correlated. This work proposes a unified model, AdamGNN, that interactively learns node and graph representations in a mutual-optimisation manner. Compared with existing GNN models and graph pooling methods, AdamGNN enhances node representations with the learned multi-grained semantics and avoids losing node features and graph structure information during pooling. Specifically, a differentiable pooling operator is proposed to adaptively generate a multi-grained structure that involves meso- and macro-level semantic information in the graph. We also devise an unpooling operator and a flyback aggregator in AdamGNN to better leverage the multi-grained semantics to enhance node representations. The updated node representations can further adjust the graph representation in the next iteration. Experiments on 14 real-world graph datasets show that AdamGNN significantly outperforms 17 competing models on both node- and graph-wise tasks. Ablation studies confirm the effectiveness of AdamGNN's components, and a further empirical analysis reveals AdamGNN's ability to capture long-range interactions.


INTRODUCTION
In many real-world applications, such as social networks, recommendation systems, and biological networks, data can be naturally organised as graphs [10]. Nevertheless, learning powerful node and graph representations remains a challenge, since it requires integrating rich inherent features with complex structural information. To address this challenge, Graph Neural Networks (GNNs), which generalise deep neural networks to graph-structured data, have drawn remarkable attention from academia and industry, and achieve state-of-the-art performance in a multitude of applications [32], [43]. The current literature on GNNs falls into two categories of tasks. One is to learn node representations to perform tasks such as link prediction [41] and node classification [15], [34]. The other is to learn graph representations for tasks such as graph classification [9], [37], [39].
For node-wise tasks, existing GNN models learn node representations with a similar methodology: a GNN layer aggregates the sampled neighbouring nodes' features over a number of iterations, via non-linear transformation and aggregation functions. Its effectiveness has been widely demonstrated; however, a major limitation of these GNN models is that they are inherently flat, as they only propagate information across the observed edges in the original graph. Thus, they lack the capacity to encode features of the high-order neighbourhood in the graph [1], [38]. For example, in an academic collaboration network, flat GNN models can capture the micro-level semantics (e.g., co-authorships) between authors, but neglect their macro-level semantics (e.g., belonging to different research institutes).
On the other hand, graph classification is to predict the label associated with an entire graph by utilising the given graph structure and initial node features. Nevertheless, existing GNNs for graph classification are unable to learn graph representations in a multi-grained manner, which is crucial to encode the meso- and macro-level semantics hidden in graphs for many practical applications such as drug design [27] and program analysis [18]. To remedy this limitation, novel pooling approaches have been proposed, where sets of nodes are recursively aggregated to form super nodes in the pooled graph. DIFFPOOL [37] is a differentiable pooling operator, but its assignment matrix is too dense [4] to apply on large graphs. G-U-NET [9], SAGPOOL [16], GXN [20] and ASAP [26] are four recently proposed methods that adopt the Top-k selection strategy to address the sparsity concerns of DIFFPOOL. They score nodes based on a learnable projection vector and select a fraction of high-scoring nodes as super nodes. However, the pre-defined pooling ratio limits these models' adaptivity on graphs of different sizes, and the Top-k selection may easily lose important node features or graph structure by simply ignoring low-scoring nodes. As shown in Figure 1 (Section 2), different values of k significantly affect the number of nodes covered by super nodes in the pooled graph, which means important nodes' features can be lost by this trivial pooling strategy. The hyper-parameter k is also crucial for the final performance [9], which reduces these models' convenience in applications.
Finally, we argue that node- and graph-wise tasks are inherently correlated with one another. That is, node representations form the graph representation, and the graph representation can provide node representations with meso/macro-level semantic information about the graph. Jointly modelling node- and graph-wise tasks allows GNNs to overcome the limitation of the flat propagation mode in capturing multi-grained semantics, and the enriched node representations can further ameliorate the graph representation. However, to the best of our knowledge, none of the existing works simultaneously exploits node- and graph-wise tasks, along with capturing the multi-grained semantics hidden in the graph, to learn representations of nodes and the graph.
In this work, we propose a novel framework, Adaptive Multi-grained Graph Neural Networks (AdamGNN), which integrates graph convolution, adaptive pooling and unpooling operations into one framework to generate both node- and graph-level representations interactively. Unlike the above-mentioned GNN models, we treat node and graph representation learning in a unified framework so that the two tasks can collectively optimise each other during training. In modelling multi-grained semantics, the adaptive pooling and unpooling operators preserve the important node features and hierarchical structural features. More concretely, as shown in Figure 2-(a), we employ (i) an adaptive graph pooling (AGP) operator to generate a multi-grained structure based on the primary node representations derived by a GNN layer, (ii) graph unpooling (GUP) operators to further distribute the explored meso- and macro-level semantics to the corresponding nodes of the original graph, and (iii) a flyback mechanism to integrate all received multi-grained semantic messages into the evolved node representations. Besides, the attention-enhanced flyback aggregator provides a reasonable explanation of the importance of messages from different grains. Experimental results reveal the effectiveness of AdamGNN, and the ablation and empirical studies confirm the effectiveness and flexibility of its components. Finally, through case studies, AdamGNN is shown to highlight variant-range node interactions in different graph datasets.
Our contributions can be summarised as follows.
(1) We propose a novel framework, AdamGNN, that adaptively integrates multi-grained semantics into node representations and achieves mutual optimisation between node-wise and graph-wise tasks in one unified process. (2) An adaptive and efficient pooling operator is devised in AdamGNN to generate the multi-grained structure without introducing any hyper-parameters. (3) An attention-based flyback aggregator provides model explainability on how different grains benefit the prediction tasks. (4) Extensive experiments on 14 real-world datasets demonstrate the promising performance of AdamGNN, along with insightful explanations from case studies.

RELATED WORK
Graph neural networks. Existing GNN models can be generally categorised into spectral and spatial approaches. The spectral approach utilises the Fourier transform to define the convolution operation in the graph domain [3]. However, its heavy computation cost hinders its application to large-scale graphs. Later on, a series of spatial models drew remarkable attention due to their effectiveness and efficiency in node-wise tasks [6], [7], [8], [10], [15], [22], [29], [31], [34], such as link prediction, node classification and node clustering. They mainly rely on a flat message-passing mechanism that defines convolution by iteratively aggregating messages from neighbouring nodes. Recent studies have shown that the spatial approach is a special form of Laplacian smoothing and is limited to summarising each node's local information [5], [21]. Besides, these models are either unable to capture global information or incapable of aggregating messages in a multi-grained manner to support graph classification tasks.
Graph pooling. Pooling operations overcome GNNs' weakness in generating graph-level representations by recursively merging sets of nodes to form super nodes in the pooled graph. DIFFPOOL [37] is a differentiable pooling operator, which learns a soft assignment matrix that maps each node to a set of clusters, treated as super nodes. Since this assignment is rather dense and incurs a high computation cost, it is not scalable to large graphs [4]. Following this direction, a Top-k selection based pooling layer (G-U-NET) was proposed to select important nodes from the original graph to build a pooled graph [9]. SAGPOOL [16] and ASAP [26] further use attention and self-attention for cluster assignment, and GXN [20] uses mutual information estimation for super node selection. They address the sparsity problem of DIFFPOOL; however, the manually defined hyper-parameter k is quite sensitive for the final performance [9], which limits the adaptivity of these models on graphs of different sizes. In addition, as shown in Figure 1, different k values significantly affect the number of covered nodes in the graph, which means important node features can be lost during this trivial pooling operation. Note that the nodes covered by a super node refer to the nodes involved in the super node's aggregation tree.

Preliminaries
A graph with n nodes can be formally represented as G = (V, E, X), where V = {v_1, . . . , v_n} is the node set, E ⊆ V × V denotes the set of edges, and X ∈ R^{n×π} represents the nodes' features (π is the dimensionality of node features). Besides, V and E can be summarised in an adjacency matrix A ∈ {0, 1}^{n×n}. For node-wise tasks, the goal is to learn a mapping function f_n : G → Z, where Z ∈ R^{n×d} and each row z_i ∈ Z corresponds to node v_i's representation. For graph-wise tasks, the goal is similarly to learn a mapping f_g : D → Z, where D = {G_1, G_2, . . .} is a set of graphs and each row z_i ∈ Z corresponds to graph G_i's representation. The effectiveness of the mapping functions f_n and f_g is evaluated by applying Z to different downstream tasks.
Primary node representation. We use the Graph Convolutional Network (GCN) [15] as an example primary GNN encoder to obtain node representations:

H^(ℓ) = σ(D̂^{-1/2} Â D̂^{-1/2} H^(ℓ-1) W^(ℓ)),   (1)

where Â = A + I, D̂_{ii} = Σ_j Â_{ij}, and W^(ℓ) is a trainable weight matrix for layer ℓ. H^(ℓ) is the node representation generated at layer ℓ, and the output of the last layer is taken as the primary node representation H = H^(ℓ). Node representations are generated from each target node's local neighbours, aggregated via learning based on the adjacency matrix A. GCN cannot capture meso/macro-level knowledge, even when stacking multiple layers. Hence we term the node representations generated this way primary node representations.
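To make Equation 1 concrete, the propagation step can be sketched in plain Python. This is a minimal illustration only: the learned weight matrix W^(ℓ) and the non-linearity σ are omitted, and the function name and list-of-lists representation are our own choices, not part of the paper's implementation.

```python
import math

def gcn_propagate(A, H):
    """One GCN propagation step: H' = D^{-1/2} (A + I) D^{-1/2} H.
    The learned weights W and the non-linearity are omitted for brevity."""
    n = len(A)
    # A_hat = A + I (add self-loops)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]           # D_hat diagonal
    d_inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    dim = len(H[0])
    out = [[0.0] * dim for _ in range(n)]
    # out[i] = sum_j d_i^{-1/2} * A_hat[i][j] * d_j^{-1/2} * H[j]
    for i in range(n):
        for j in range(n):
            w = d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j]
            if w:
                for k in range(dim):
                    out[i][k] += w * H[j][k]
    return out
```

Stacking several such steps (with weights and activations in between) only ever mixes λ-hop neighbourhood information, which is why the paper calls these representations "primary".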

Adaptive Graph Pooling for Multi-grained Structure Generation
The proposed model, AdamGNN, adaptively generates a multi-grained structure to realise the collective optimisation between node- and graph-level tasks within one unified framework. The key idea is to apply an adaptive graph pooling operator to present the multi-grained semantics of G explicitly and to improve node representation generation with the derived meso/macro information. While AdamGNN usually operates under multiple levels of granularity (T different grains), in this section we present how the super graph at level t is adaptively constructed from the graph of level t−1, i.e., G_{t−1} = (V_{t−1}, E_{t−1}, X_{t−1}).
Ego-network formation. We initially cast graph pooling as an ego-network selection problem, i.e., each ego node can determine whether to aggregate its local neighbours to form a super node, resolving the density issue of DIFFPOOL by avoiding a dense assignment matrix. As shown in Figure 2-(b)-(i), each ego-network c_λ(v_i) contains the ego v_i and its local neighbours N_i^λ within λ hops, i.e., N_i^λ = {v_j | d(v_i, v_j) ≤ λ}, where d(v_i, v_j) denotes the shortest-path length between v_i and v_j. Thus an ego-network can be formally presented as c_λ(v_i) = {v_i} ∪ N_i^λ, and a set of n ego-networks C_λ = {c_λ(v_1), . . . , c_λ(v_n)} can be generated from G. We will investigate the impact of the ego-network size in the ablation studies of Section 4.2.

Super node determination. Given that G with n nodes has n ego-networks, forming a super graph with all ego-networks would blur the useful multi-grained semantics and lead to a high computation cost. We therefore select a fraction of ego-networks from C_λ to organise the multi-grained semantics of G. We make the selection based on a closeness score φ_i that evaluates the representativeness of the ego v_i to its local neighbours v_j ∈ c_λ(v_i). We first define the closeness score φ_ij between v_i and v_j as:

φ_ij = Sigmoid(σ(a⃗^T (x_j ∥ x_i))) + x_j^T x_i,   (2)

where a⃗ ∈ R^{2π} is a weight vector, ∥ is the concatenation operator, and σ is an activation function (LeakyReLU) that avoids the vanishing gradient problem [12] during model training.
The first component, Sigmoid(σ(a⃗^T (x_j ∥ x_i))), captures the non-linear relationship between node v_j's and ego v_i's representations, and its output lies in (0, 1) as a valid probability for ego-network selection. Meanwhile, we add the second component, x_j^T x_i, to supercharge φ_ij with the linear relation between node v_j and ego v_i to capture more comprehensive information. Consequently, nodes whose features and structural information are similar to the ego's contribute higher closeness scores. In the end, we produce the closeness score of c_λ(v_i) as:

φ_i = (1 / |N_i^λ|) Σ_{v_j ∈ N_i^λ} φ_ij,   (3)

where |N_i^λ| indicates the number of nodes in N_i^λ. After obtaining the ego-networks' closeness scores, we propose an adaptive approach to select a fraction of ego-networks to form super nodes without pre-defined hyper-parameters (cf. the Top-k selection strategy [9]). Our key intuition is that a high-diameter ego-network can be composed of multiple low-diameter ego-networks. Therefore, we first find these low-diameter ego-networks, then recursively aggregate them to form super nodes that contain them. Specifically, we select the fraction of egos N_p as:

N_p = {v_i | φ_i > φ_j, ∀ v_j ∈ N_i^1},   (4)

where N_i^1 denotes the neighbour nodes of node v_i within the first hop. Note that each node may belong to various ego-networks, since nodes may play different roles in different groups. Therefore, we allow overlap between different selected ego-networks and utilise N_i^1 instead of N_i^λ: if we adopted N_i^λ here, a selected ego node v_i could not be involved in any other ego-network. Following this, we can select a fraction of ego-networks to form super nodes at granularity level t.
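A minimal sketch of Eqs. 2-4 in plain Python. The fixed weight list `a` stands in for the learned vector a⃗, and the dict-based graph encoding is our own illustrative choice; the point is that selection needs no pooling ratio k, only local score comparisons.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def closeness(x_ego, x_nbr, a):
    """Sketch of Eq. 2: sigmoid-squashed non-linear term on the
    concatenated features plus a linear dot-product term."""
    cat = x_nbr + x_ego                                # [x_j || x_i]
    nonlinear = sigmoid(sum(w * f for w, f in zip(a, cat)))
    linear = sum(p * q for p, q in zip(x_nbr, x_ego))  # x_j^T x_i
    return nonlinear + linear

def select_egos(neighbors, phi):
    """Adaptive ego selection (Eq. 4): keep ego v_i iff phi_i exceeds
    the score of every 1-hop neighbour."""
    return sorted(v for v in neighbors
                  if all(phi[v] > phi[u] for u in neighbors[v]))
```

Because the comparison is purely local, several non-adjacent egos can be selected at once, and their ego-networks may overlap.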
Proposition 1. Let G be a connected graph with n nodes, from which a total of n ego-networks can be formed, and let each ego-network c_λ(v_i) be assigned a closeness score φ_i. Then there exists at least one ego-network c_λ(v_i) that satisfies φ_i > φ_j, ∀ v_j ∈ N_i^1.

Proof of Proposition 1. For G = (V, E, X) with n nodes, n ego-networks can be generated by following the procedures above, and each ego-network is given a closeness score φ_i as in Equation 3. We assume that these closeness scores are not all the same; thus there exists at least one maximum φ_max. Hence, the ego-networks with closeness score φ_max satisfy the ego-network selection requirement, i.e., φ_i = φ_max > φ_j, ∀ v_j ∈ N_i^1. So, for any connected G with n nodes, there exists at least one ego-network satisfying the requirements of our selection approach.
Proposition 1 ensures that our super node determination method can find at least one ego-network to generate a super graph for any graph.It guarantees the generality of our strategy.
Meanwhile, we also retain the nodes that do not belong to any selected ego-network, denoted as N_r, to maintain the graph structure:

N_r = {v_j ∈ V | v_j ∉ c_λ(v_i), ∀ v_i ∈ N_p}.

In this way, a super node formation matrix S_t ∈ R^{n×(|N_p|+|N_r|)} can be formed, where (|N_p| + |N_r|) is the number of nodes of the generated super graph, the rows of S_t correspond to the n nodes of G_{t−1}, and the columns of S_t correspond to the selected ego-networks (N_p) plus the remaining nodes (N_r). We set S_t[i, j] = φ_ij if node v_j belongs to the selected ego-network c_λ(v_i), and S_t[i, j] = 1 if node v_j is a remaining node corresponding to node v_i in the super graph; otherwise S_t[i, j] = 0. The weighted super node formation matrix S_t better maintains the relations between different super nodes in the pooled graph.
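A sketch of how such a formation matrix could be assembled. The column ordering (egos first, retained nodes after) and the dict-of-member-weights encoding are illustrative assumptions, not the paper's data layout.

```python
def formation_matrix(n, egos, members, remaining):
    """Build S^t of shape n x (|Np| + |Nr|): one phi-weighted column per
    selected ego-network, then one unit column per retained node."""
    cols = [("ego", v) for v in egos] + [("kept", v) for v in remaining]
    S = [[0.0] * len(cols) for _ in range(n)]
    for c, (kind, v) in enumerate(cols):
        if kind == "ego":
            for j, phi_ij in members[v].items():  # members[v]: {v_j: phi_ij}
                S[j][c] = phi_ij
        else:                                     # retained node: weight 1
            S[v][c] = 1.0
    return S
```

Because a node can appear in several ego-network columns, overlapping super nodes fall out of this construction naturally.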
Maintaining super graph connectivity. After selecting the ego-networks and retaining nodes at level t−1, as shown in Figure 2-(b)-(iii-iv), we construct the new adjacency matrix A_t for the super graph using Â_{t−1} and S_t as follows:

A_t = S_t^T Â_{t−1} S_t.

This formula connects any two super nodes that share common nodes, or whose member nodes are already neighbours in G_{t−1}. In addition, A_t retains the edge weights passed by S_t, which encode the relation weights between super nodes. Eventually, we obtain the generated super graph G_t at granularity level t.

Super node feature initialisation. All nodes in the super graph G_t need initial feature vectors to support the graph convolution operation. Recall that we have the closeness score of Eq. 2 between node v_j and the ego v_i. However, this is not equivalent to the contribution of node v_j's features to the super node's features, since we need to compare the relationship strength between ego v_i and v_j with the relations between v_i and the other nodes v_r ∈ c_λ(v_i). Therefore, we further propose a super node feature initialisation method based on a self-attention mechanism [31]:

x_i^t = Σ_{v_j ∈ c_λ(v_i)} α_ij h_j^{t−1},

where H_{t−1} is the node representation generated by the (t−1)-th primary GNN layer, i.e., at level t−1 with a method similar to Equation 1, and α_ij describes the importance of node v_j to the initial feature of c_λ(v_i) at level t. The weight α_ij is learned as follows:

α_ij = exp(σ(a⃗_1^T (h_i^{t−1} ∥ h_j^{t−1}))) / Σ_{v_r ∈ c_λ(v_i)} exp(σ(a⃗_1^T (h_i^{t−1} ∥ h_r^{t−1}))),

where a⃗_1 is a weight vector. For the remaining nodes N_r that do not belong to any super node, we keep their representations in H_{t−1} as initial node features.
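The connectivity rule is a single quadratic form, A_t = S_tᵀ Â_{t−1} S_t. A dense plain-Python sketch (a real implementation would use sparse matrices):

```python
def matmul(A, B):
    """Plain-Python dense matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def pooled_adjacency(S, A_hat):
    """A_t = S^T A_hat S: two super nodes become connected when their
    member sets share a node or are linked by an edge in G_{t-1}."""
    S_T = [list(r) for r in zip(*S)]
    return matmul(matmul(S_T, A_hat), S)
```

With a 0/1 formation matrix, off-diagonal entries of A_t count the shared nodes and crossing edges between two super nodes; with φ-weighted entries of S they become relation weights.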

Graph Unpooling
Different from existing graph pooling models [4], [16], [26], [37], [39], which only coarsen graphs to generate graph representations, we aim to mutually utilise node-wise and graph-wise tasks to better encode multi-grained semantics into both node and graph representations under a unified framework. We design a mechanism that allows the learned multi-grained semantics to enrich the node representations of the original graph G, as shown in Figure 2-(a). Vice versa, the updated node representations can further ameliorate the graph representation in the next training iteration.
A reasonable unpooling operation that passes macro-level information back to the original nodes has not been well studied in the literature. For instance, Gao et al. [9] directly relocate the super node back into the original graph and utilise additional GNN layers to spread its message to other nodes. However, these additional aggregation operations cannot guarantee that each node receives meaningful information, since some nodes may be distant from super nodes; in such cases these operations can even exacerbate local smoothing [21].
We implement the unpooling process by devising a top-down message-passing mechanism, which endows GNN models with meso/macro-level knowledge. Specifically, since S_t records how the nodes of G_{t−1} form the super nodes of G_t, we utilise S_t to restore the generated ego-network representations at level t to those at level t−1, repeating until we arrive at the original graph G, i.e., t → 0:

Ĥ_{t−1} = S_t Ĥ_t,

where the fully restored representation satisfies Ĥ ∈ R^{n×d}. At the end of each iteration, the nodes in the original graph G receive high-level semantic messages from the different levels, i.e., {Ĥ_1, . . . , Ĥ_T}. As illustrated in Figure 2-(b)-(iv-i), the graph unpooling process can be treated as the inverse of the adaptive graph pooling process.
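The restoration step is again a single matrix product with the formation matrix, this time applied from the left. A plain-Python sketch:

```python
def unpool(S, H_super):
    """Graph unpooling sketch: each finer-level node recovers a mixture
    of the super-node representations it contributed to, H = S @ H_super."""
    return [[sum(s * h for s, h in zip(row, col)) for col in zip(*H_super)]
            for row in S]
```

A node belonging to two super nodes (an overlapping ego-network member) receives a weighted blend of both super-node messages, while a retained node simply copies its own column back.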

Flyback Aggregation
The super graphs at different granularity levels present multi-grained semantics, and how each node combines the received semantic information with its flat representation is a challenging question. Moreover, nodes of the same graph may need information from different granularity levels. Therefore, we propose a novel attention mechanism to integrate the derived representations at different levels:

Z = H + Σ_{t=1}^{T} β_t Ĥ_t,

where the attention score β_t estimates the importance of the message from level t:

β_t = exp(σ(a⃗_2^T (H ∥ Ĥ_t))) / Σ_{t'=1}^{T} exp(σ(a⃗_2^T (H ∥ Ĥ_{t'}))),

where a⃗_2 ∈ R^{2π} is a weight vector. We term this process flyback aggregation; it considers the attention scores of different levels and allows each node to decide whether and how to utilise semantic information from different granularity levels. We verify the effectiveness of flyback aggregation in the ablation study of Section 4.2 and discuss its explainability in Section 4.3.
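A per-node sketch of flyback aggregation. A plain dot product between the flat representation and each level's message stands in for the learned scoring vector a⃗_2; everything else (softmax over levels, residual sum) follows the formulation above.

```python
import math

def flyback(h, level_msgs):
    """z = h + sum_t beta_t * h_t, with beta a softmax over per-level
    scores. Scoring by dot product is an illustrative stand-in for the
    learned attention vector."""
    scores = [sum(a * b for a, b in zip(h, m)) for m in level_msgs]
    mx = max(scores)                          # stabilise the softmax
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    betas = [e / total for e in exps]
    z = list(h)
    for beta, m in zip(betas, level_msgs):
        for k in range(len(z)):
            z[k] += beta * m[k]
    return z, betas
```

The returned `betas` are exactly the per-level weights the paper later inspects for explainability: a node that leans on macro-level semantics shows large β at the coarser levels.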

Training Strategy
Till now, there remain two challenges in training the model. The first is how to highlight the differences among nodes' representations from different ego-networks. Nodes belonging to neighbouring ego-networks receive closely related messages from super nodes, since their super nodes are connected in the super graph, and local smoothing makes their representation vectors similar. Representations of proximal nodes can thereby drift even closer to each other in the latent space. To address this problem and enhance the discrimination between ego-networks, we exploit a self-optimisation strategy [33] that makes nodes in different ego-networks distinguishable. Specifically, we use the Student's t-distribution (Q) as a kernel to measure the similarity between the representation vectors of v_j and ego v_i:

q_ij = (1 + ∥z_j − z_i∥² / µ)^{−(µ+1)/2} / Σ_{i'} (1 + ∥z_j − z_{i'}∥² / µ)^{−(µ+1)/2},

where the v_{i'} are the other ego nodes and µ is the degrees of freedom of the Student's t-distribution. Following this, q_ij can be interpreted as the probability of assigning node v_j to ego v_i. In this paper, we set µ = 1, the same as [33]. We then learn better node representations by matching Q to an auxiliary target distribution (P); we adopt the proposition of [33], which first raises q_ij to the second power and then normalises by frequency per ego-network:

p_ij = (q_ij² / g_i) / Σ_{i'} (q_{i'j}² / g_{i'}), where g_i = Σ_j q_ij.

Therefore, apart from the task-related loss function L_Task, we further define a KL-divergence loss:

L_KL = KL(P ∥ Q) = Σ_j Σ_i p_ij log(p_ij / q_ij).

The second challenge is to avoid the over-smoothing problem, i.e., that the nodes of a graph tend to have indistinguishable representations. GNNs have been proved to be a special form of Laplacian smoothing [5], [21], which naturally assimilates nearby nodes' representations. AdamGNN may further exacerbate this problem because it distributes semantic information from one super node representation to all nodes of the ego-network. Therefore, we introduce a reconstruction loss, which can alleviate the over-smoothing issue and drive the node representations to retain the structural information of G by differentiating non-connected nodes' representations. Specifically, the reconstruction loss is defined as:

L_R = ∥A − Ã∥_F², where Ã = Sigmoid(Z Z^T).

Therefore, the overall loss function consists of the training task L_Task, the self-optimising task L_KL, and the reconstruction task L_R:

L = L_Task + γ L_KL + δ L_R,   (10)

where L_Task is a flexible task-specific loss function, and γ and δ are two hyper-parameters that we discuss in Section 4.1. Note that for the link prediction task we have L = L_R + γ L_KL, since L_Task equals L_R. Moreover, we demonstrate the effectiveness of each component of the loss function L in the ablation study of Section 4.2.

[Algorithm 1: Adaptive Multi-grained GNNs]
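The self-optimisation terms can be sketched directly from the formulas above: the Student's t soft assignment Q, the sharpened target distribution P, and the KL divergence between them. Function names and the list-based layout are illustrative.

```python
import math

def soft_assign(z_nodes, z_egos, mu=1.0):
    """Student's t kernel (Q): q_ij ~ (1 + ||z_j - z_i||^2 / mu)^(-(mu+1)/2),
    normalised over egos so each row is a valid assignment distribution."""
    Q = []
    for zj in z_nodes:
        w = [(1.0 + sum((a - b) ** 2 for a, b in zip(zj, zi)) / mu)
             ** (-(mu + 1) / 2) for zi in z_egos]
        s = sum(w)
        Q.append([x / s for x in w])
    return Q

def target_dist(Q):
    """Auxiliary target P: square q_ij, normalise by per-ego frequency
    g_i = sum_j q_ij, then renormalise each row."""
    g = [sum(Q[j][i] for j in range(len(Q))) for i in range(len(Q[0]))]
    P = []
    for row in Q:
        w = [q * q / gi for q, gi in zip(row, g)]
        s = sum(w)
        P.append([x / s for x in w])
    return P

def kl_loss(P, Q):
    """L_KL = KL(P || Q), summed over nodes and egos."""
    return sum(p * math.log(p / q) for prow, qrow in zip(P, Q)
               for p, q in zip(prow, qrow) if p > 0)
```

Matching Q to the sharper P pulls each node towards its most confident ego and pushes it away from the others, which is what keeps neighbouring ego-networks distinguishable.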

Algorithm
We have presented the idea of AdamGNN and the design details of each component in Section 3. Given a graph G, we first apply a primary GNN encoder to generate the primary node embeddings (line 1). Then we construct the t-th level of the multi-grained structure (lines 3-13) with the proposed adaptive graph pooling operator. Meanwhile, we define the initial features of the pooled super nodes (lines 14-19). The graph connectivity of the pooled graph is maintained by line 20. We apply a GNN encoder on the pooled graph to summarise the relationships between super nodes (line 21) and learn the macro-grained semantics of the t-th granularity level. The learned multi-grained semantics are then distributed back to the original graph by the unpooling operator (line 22). Last, the flyback aggregator integrates the meso/macro-level knowledge from the different levels into the node representations of G (line 24), and an additional READOUT operator [7] turns the node representations into the graph representation (line 25).
Model scalability. By the design of the AdamGNN framework, the primary node representation learning module of each level and the adaptive graph pooling and unpooling operators are categorised as local network algorithms [28], which only involve local exploration of the network structure. Therefore, our design enables AdamGNN to scale to representation learning on large-scale networks and to be friendly to distributed computing settings [25]. We present instances that utilise a multi-GPU computing framework to accelerate the training of AdamGNN in Section 4.3.

Experimental Setup
We evaluate our proposed model, AdamGNN, on 14 benchmark datasets, and compare it with 16 competing methods over both node- and graph-wise tasks, including node classification, link prediction and graph classification.
Datasets. To validate the effectiveness of our model on real-world applications, we adopt datasets from different domains with different topics and relations. We use 8 datasets for node-wise tasks (data statistics are summarised in Table 2). Ogbn-arxiv [13], ACM [2], Cora [15], Citeseer [15] and Pubmed [15] are paper citation graph datasets. DBLP [2] is an author graph derived from the DBLP dataset. Emails [17] is an email communication graph dataset. Wiki [35] is a webpage graph dataset.
For the graph classification task, we adopt 6 bioinformatics datasets [39] (data statistics are summarised in Table 3). D&D and PROTEINS are datasets containing proteins as graphs. NCI1 and NCI109 involve anticancer activity graphs. MUTAG and Mutagenicity consist of chemical compounds divided into two classes according to their mutagenic effect on a bacterium. Note that all datasets are downloaded automatically with our published code.
Evaluation settings. For the node-wise tasks, we follow the supervised node classification (Sup-NC) settings of PNA [7], i.e., using two sets of 10% labelled nodes as validation and test sets, with the remaining 80% labelled nodes used as the training set. Meanwhile, we follow the semi-supervised node classification (Semi-NC) settings of GCN [15]; the data split is shown in Table 2. That is, for Cora, Citeseer and Pubmed we use the fixed splits, and for the other datasets we randomly assign 20 labelled nodes of each class for training, and 500 and 1000 nodes for validation and testing, respectively. Note that since Emails does not have a sufficient number of nodes for the classic Semi-NC setting, we choose 10 labelled nodes per class for training, and the remaining data is evenly separated into validation and test sets. Wiki is imbalanced, with some classes having very few labelled nodes (e.g., class 12 has 9 labelled nodes and class 4 has 10), which cannot support Semi-NC settings. Therefore, we follow the Sup-NC settings to split Wiki for the Semi-NC experiments, but use only 20% of the labelled nodes for training and the remaining nodes for validation and testing. Ogbn-arxiv follows the fixed split of the OGB leaderboard [13]. For the Link Prediction (LP) task, we follow the settings of [38], i.e., using two sets of 10% of the existing edges as validation and test sets, with the remaining 80% of edges used as the training set. An equal number of non-existent links is randomly sampled and added to every set. The AUC score evaluates link prediction, and node classification tasks are evaluated by accuracy. We conduct the experiments with random parameter initialisation using 10 random seeds and report the average performance.
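The 10%/10%/80% edge-split protocol can be sketched as follows; the function name and seeding scheme are illustrative, and sampling the matching non-existent links is omitted.

```python
import random

def split_edges(edges, seed=0):
    """Shuffle the observed edges, then carve off two 10% sets for
    validation and test; the remaining ~80% form the training set."""
    rng = random.Random(seed)
    pool = list(edges)
    rng.shuffle(pool)
    n_hold = len(pool) // 10
    val, test = pool[:n_hold], pool[n_hold:2 * n_hold]
    train = pool[2 * n_hold:]
    return train, val, test
```

Fixing the seed per run and repeating over 10 seeds matches the averaged-performance protocol described above.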
For the graph-wise task, i.e., the graph classification (GC) task, we perform all experiments following the pooling pipeline of SAGPOOL [16]: 80% of the graphs are randomly selected for training, and two sets of 10% of the graphs are used for validation and testing, respectively. We conduct the experiments using 10-fold cross-validation and report the average classification accuracy over 10 random seeds.
Model configuration. For all methods, we set the embedding dimension d = 64 and use the same learning rate (0.01), the Adam optimiser, and 1000 training epochs with early stopping (patience 100). In terms of the number of neural network layers, for GCNII we report the best performance among {8, 16, 32, 64, 128} layers; for the other models, we report the best performance with between 2 and 4 layers. For all models with a hierarchical structure (including AdamGNN), we use GCN as the GNN encoder for a fair comparison. In terms of the number of levels required by hierarchical models, we present the best performance with between 2 and 5 levels. For the other hyper-parameters of the competing methods, we employ the default values from each method's official implementation. For AdamGNN in particular, by tuning the hyper-parameters on the validation set, we set γ = 0.1 and δ = 0.01 in Eq. 10 to let the loss values lie in a reasonable range, i.e., (0, 10). We implement all models with PyTorch and PyTorch Geometric. Experiments were conducted on GPU (NVIDIA Tesla V100) machines.

Experimental Evaluation and Ablation Study
Performance on node-wise tasks. We compare AdamGNN with 7 GNN models and one pooling-based model, i.e., G-U-NET, since the other pooling approaches do not provide an unpooling operator and thus cannot support node-wise tasks. Results on node classification (under supervised and semi-supervised settings) are summarised in Table 4. They show that AdamGNN outperforms most competing methods, with up to 10.47% and 5.39% improvements in the semi-supervised and supervised settings, respectively. AdamGNN brings the most significant improvement on the Wiki dataset under the semi-supervised setting, where the competing method that only uses node features, i.e., MLP, achieves a poor accuracy of 17.46%. We argue that, because node features and node labels are only weakly correlated in this dataset, the multi-grained semantics provided by AdamGNN help to ameliorate the performance. Link prediction results in Table 5 show that AdamGNN significantly outperforms the 7 competing methods, by up to a 25.3% improvement in terms of AUC. This indicates the versatility of AdamGNN on different node-wise tasks and exhibits the usefulness of encoding multi-grained semantics into node representations. Similar to the node classification task, AdamGNN again brings the most significant AUC improvement on the Wiki dataset, i.e., a 29.76% improvement over the flat GNN models.
Performance on graph-wise task. Experimental results are summarised in Table 6. AdamGNN achieves the best performance on 4 of the 6 datasets and consistently outperforms most of the competing pooling-based techniques, with a 1.76% improvement. For the MUTAG and PROTEINS datasets, our results remain competitive.
Ablation study of different loss functions. The loss function of AdamGNN consists of three parts, i.e., L_Task, L_R and L_KL. We examine how each part contributes to the performance.
Ablation study of the number of granularity levels. As it has been shown that existing GNN models perform worse when the neural network goes deeper [21], we examine here how AdamGNN benefits from more granularity levels. Varying the number of granularity levels, we report the performance of AdamGNN on link prediction, supervised node classification and graph classification in Table 9. We observe that increasing the number of granularity levels improves both link prediction and node classification. For graph classification, 2 levels is a proper choice.
Ablation study of ego-network size (λ). The size of an ego-network, as defined in Section 3, is captured by λ. We present an ablation study to investigate the influence of λ on AdamGNN's performance; the results are summarised in Figure 3. The figure indicates that λ has no significant influence on the model performance. We simply adopt λ = 1 throughout the paper.
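The λ-hop ego-networks underlying this ablation can be sketched with a simple breadth-first search; the adjacency-dict representation and the function name below are our own illustration, not the paper's implementation:

```python
from collections import deque

def ego_network(adj, ego, lam=1):
    """Return the set of nodes within lam hops of the ego-node.

    adj: dict mapping each node to its neighbour list. lam corresponds to
    the ego-network size parameter in the paper; lam = 1 keeps the
    ego-node and its direct neighbours.
    """
    dist = {ego: 0}
    q = deque([ego])
    while q:
        u = q.popleft()
        if dist[u] == lam:          # stop expanding at the lam-hop frontier
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

# Toy path graph 0-1-2-3-4-5
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(sorted(ego_network(adj, 2, lam=1)))  # [1, 2, 3]
```

With λ = 1 the ego-network of node 2 is {1, 2, 3}; raising λ to 2 grows it to {0, 1, 2, 3, 4}, matching the intuition that a larger λ pools coarser regions of the graph.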

More Model Analysis
Running time comparison. We present the average per-epoch training time of the different node and graph classification models in Table 11 and Table 12, respectively. For the node classification task, AdamGNN requires more training time due to the cost of computing α_ij and β_t, similar to other attention-enhanced models [30]. However, AdamGNN is designed as a local network algorithm and thus maintains good scalability; it can be easily accelerated by mini-batch and multi-GPU computing frameworks [14], which significantly mitigates the computational issues. On the other hand, AdamGNN follows a sparse design similar to SAGPOOL and ASAP, striking a balance between performance improvement and time efficiency. DIFFPOOL and STRUCTPOOL employ a dense mechanism that does not easily scale to larger graphs [4], and G-U-NET uses convolution operations to distribute the received information over the graph, which brings additional computation cost.
Visualisation of different granularity levels. To better understand the process of learning multi-grained semantics, we report the relative number of nodes (i.e., the node ratio with respect to the original graph) at the different granularity levels generated by AdamGNN on 7 datasets, as shown in Figure 5.
In particular, we set the maximum number of granularity levels to 5, where level 0 indicates the original graph. We train AdamGNN with semi-supervised node classification and report the relative number of nodes and selected ego-nodes at each level. In the right panel, we can see that the number of ego-nodes stays stable after 1-2 rounds of our adaptive pooling, which indicates that AdamGNN effectively finds a compact structure that contains multi-grained semantics.
In the left panel, we can see that the number of nodes at each level stabilises after 3-4 rounds of our adaptive pooling, which illustrates that AdamGNN maintains the graph size at a proper level and avoids a dense super graph.
Visualisation of adjacency matrices at different granularity levels. One fundamental limitation of existing GNNs is their inability to capture long-range node interactions in the graph [19]. We find that AdamGNN provides a possible solution to this limitation: it allows nodes to receive messages from far-away nodes with the support of the adaptive multi-grained structure. That is, the learned multi-grained structure can be regarded as a set of short-cuts that let far-away nodes be aware of each other. We visualise these short-cut connections at different levels on the Wiki dataset in Figure 6. We can clearly see that the original graph of the Wiki dataset is very sparse, and that AdamGNN adds short-cuts between nodes with the help of the learned multi-grained structure.
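The short-cut effect can be reproduced in miniature: if S is a (hypothetical, hard) node-to-super-node assignment matrix, the pooled adjacency SᵀAS can be lifted back to the node level as S(SᵀAS)Sᵀ, whose new non-zero entries are short-cuts between originally distant nodes. The DiffPool-style assignment below is purely for illustration; AdamGNN's actual pooling is ego-network based:

```python
import numpy as np

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by one bridge edge.
A = np.zeros((6, 6), dtype=int)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Hypothetical hard assignment: each triangle collapses into one super node.
S = np.zeros((6, 2), dtype=int)
S[[0, 1, 2], 0] = 1
S[[3, 4, 5], 1] = 1

A_pooled = S.T @ A @ S                              # super-graph adjacency
shortcuts = (S @ A_pooled @ S.T > 0).astype(int)    # lifted to node level

# Nodes 0 and 5 are 3 hops apart in A, but connected in the lifted matrix.
print(A[0, 5], shortcuts[0, 5])  # 0 1
```

This is exactly the stacking of pooled adjacency matrices visualised in Figure 6: each additional level contributes another layer of short-cuts at the original node resolution.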
In this way, AdamGNN allows nodes to capture global information with a few adaptively pooled graphs.
Comparing aggregation mechanisms. To demonstrate the internal aggregation mechanism of AdamGNN and explain the performance improvements shown in Section 4.2, we give a toy example of the aggregation schemes in vanilla GNNs and AdamGNN. As shown in Figure 8, a 1-layer vanilla GNN can only capture the limited information presented in the left rooted tree. In contrast, thanks to the adaptive hierarchical structure learned by AdamGNN, target nodes receive multi-grained semantics and are endowed with the ability to capture information from long-range nodes. For instance, node v_a's message cannot be obtained by the target node with a few-layer vanilla GNN, but AdamGNN allows the target node to receive v_a's information with just 1 granularity level: the level-1 graph allows the super node to receive a message from node v_a and pass it to the target node via the flyback aggregator.
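The toy example can be made concrete by measuring message-passing distance with and without a pooled super node; the super node 's' and its member edges below are hypothetical, and a k-hop GNN needs roughly k layers to cover a length-k path:

```python
import networkx as nx

# Toy path graph: target node 0 and a far-away node v_a = 5.
G = nx.path_graph(6)
print(nx.shortest_path_length(G, 0, 5))  # 5: a 5-layer vanilla GNN is needed

# Hypothetical super node 's' covering the regions around nodes 0 and 5;
# its edges act as short-cuts in the multi-grained structure.
H = G.copy()
H.add_edges_from([('s', 0), ('s', 5)])
print(nx.shortest_path_length(H, 0, 5))  # 2: one granularity level suffices
```

The super node shortens the path from 5 hops to 2, which is why the target node can see v_a after a single round of pooling and flyback aggregation.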

CONCLUSION
To summarise, we proposed AdamGNN, a method that integrates multi-grained semantics into node representations and realises collective optimisation between node- and graph-wise tasks in one unified process. We designed an adaptive and efficient pooling operator with a novel ego-network selection approach to encode the multi-grained structural semantics, and a training strategy to overcome the over-smoothing problem. Extensive experiments conducted on 14 real-world datasets showed the promising effectiveness of AdamGNN on node- and graph-wise downstream tasks. One future direction is to apply the adaptive multi-grained structure to heterogeneous networks for node- and graph-level tasks.

Fig. 4. Visualisation of attention weights for messages at different granularity levels. Dark colours indicate higher weights.
The relatively general topics, i.e., AI and WC, receive messages from different levels with relatively indistinguishable weights, i.e., the higher attention scores of their nodes are distributed across levels. The DM topic shows different attention patterns in the two datasets: it receives messages from level 1 with the greatest attention in ACM, but receives the greatest-attention messages from level 3 in DBLP. This is because DM is not closely related to the other two topics in the ACM dataset: scholars of DM-related papers are less likely to collaborate with researchers from DB or WC, so DM papers are close to each other in the network. Thus, DM papers only need to receive the level-1 granularity semantic information summarised from neighbouring nodes. In contrast, DBLP's other 3 topics are close to DM: DM-related scholars may cite any other scholars' papers, and information related to DM is scattered over the author nodes in the DBLP network. Therefore, DM researchers tend to be characterised by level-3 semantics drawn from a wide range.
The level-0 panel plots the original adjacency matrix, and the level-1 panel exhibits the short-cuts learned by the first pooled super graph (in green). Similarly, the level-2 and level-3 panels present the short-cuts derived by further stacking the adjacency matrices of the second (in blue) and third (in yellow) pooled graphs, respectively.

Fig. 7. Visualisation of the network structure and adjacency matrices at different granularity levels of the Karate-club dataset.
The experimental results are summarised in Table 13, and the network structure and two adjacency matrices are shown in Figure 7. Short-cuts derived by the first pooled super graph are depicted in green in the level-1 adjacency matrix. We find that AdamGNN outperforms GCN on the 1-shot NC task with up to a 40.5% performance improvement. The two figures in the right part of Figure 7 clearly demonstrate that the short-cuts derived by AdamGNN make the example node aware of far-away nodes.

TABLE 2
Data statistics for node-wise tasks and the splits for the semi-supervised node classification task. N.A. indicates that a dataset does not contain node attributes or does not support the semi-supervised setting.

TABLE 3
Data statistics for graph classification.

TABLE 5
Results in AUC for link prediction on seven datasets.

TABLE 7
Comparison of AdamGNN with different loss functions on three tasks.NC task follows the supervised settings.

Table 7 provides the results. For the link prediction task, we have L = L_R + γL_KL, since L_Task equals L_R; thus, two comparison experiments are missing for link prediction. From the results, we can see that L_R significantly improves the performance on all three tasks. This is because it alleviates the over-smoothing problem caused by the messages received from different granularity levels. Meanwhile, L_KL slightly improves the results of the node-wise tasks as well.
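Assuming the full objective takes the additive form L = L_Task + L_R + γL_KL (with L_Task equal to L_R for link prediction, as noted above), the combination can be sketched in plain Python; the function name and the default γ are our own choices for illustration:

```python
def total_loss(task_loss, recon_loss, kl_loss, gamma=0.1, task='nc'):
    """Combine AdamGNN's three loss terms (illustrative sketch only).

    For link prediction the task loss *is* the reconstruction loss,
    so L_R is not added a second time: L = L_R + gamma * L_KL.
    """
    if task == 'lp':
        return task_loss + gamma * kl_loss
    return task_loss + recon_loss + gamma * kl_loss

# Node classification: all three terms contribute.
print(total_loss(0.7, 0.2, 0.1))             # 0.91
# Link prediction: the reconstruction term is the task term.
print(total_loss(0.7, 0.2, 0.1, task='lp'))  # 0.71
```

Ablating a term then simply amounts to dropping it from the sum, which is what the comparison experiments in Table 7 do.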

Ablation study of the flyback aggregation.
The experimental results on node-wise tasks confirm that capturing multi-grained semantics in AdamGNN helps to learn more meaningful node representations. Here, we further study whether the flyback aggregator also improves graph representations. Specifically, we examine how the flyback aggregator contributes to graph classification performance by removing and keeping it. The results are summarised in Table 8. It is clear that the node representations enhanced by the flyback aggregation indeed improve the graph representation for the classification task.
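A minimal sketch of a flyback-style aggregator: a node's representations from the T granularity levels are fused with softmax attention weights β_t. The scoring function here (a dot product with a learnable vector) is an assumption for illustration, not necessarily the paper's exact form:

```python
import numpy as np

def flyback_aggregate(h_levels, att):
    """Fuse one node's per-level messages into a single representation.

    h_levels: (T, d) array, one row per granularity level (level 0 is the
    original graph); att: a learnable (d,) scoring vector (assumed form).
    """
    scores = h_levels @ att                 # (T,) one score per level
    beta = np.exp(scores - scores.max())    # numerically stable softmax
    beta /= beta.sum()                      # beta_t: attention over levels
    return beta @ h_levels                  # (d,) fused node representation

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))             # 4 granularity levels, d = 8
att = rng.standard_normal(8)
z = flyback_aggregate(h, att)
print(z.shape)  # (8,)
```

Since the β_t sum to one, the fused vector is a convex combination of the per-level messages, which is what the attention visualisation in Figure 4 inspects.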

TABLE 9
Comparison of AdamGNN with different numbers of granularity levels on different tasks. The NC task follows the supervised settings.

TABLE 10
Comparison of AdamGNN with different primary GNN encoders, following the semi-supervised node classification setting.
Fig. 3. Ablation study of the ego-network size λ in terms of different tasks. The NC task follows the supervised settings.
research areas: AI, DB, DM and computer vision. The node classification task on these two datasets is to predict a paper's/scholar's research area. The attention scores of nodes, which highlight the importance of the different levels' messages, are plotted in Figure 4. We find different distributions of attention weights over the granularity levels for the classification of different areas.

TABLE 13
Performance comparison for 1-shot-NC task on Karate-club dataset.
Exploration of short-cuts of AdamGNN. To explore the short-cuts derived by AdamGNN, we perform an additional empirical analysis on one more network, i.e., the Karate-club network [40]. We choose 1-shot NC as the target task, where we randomly select one sample from each class as the training set, an equal number of nodes as the validation set, and the rest of the nodes for testing.