Graphs, Entities, and Step Mixture for Enriching Graph Representation

Graph neural networks have shown promising results on representing and analyzing diverse graph-structured data, such as social networks, traffic flow, drug discovery, and recommendation systems. Existing approaches for graph neural networks typically suffer from the oversmoothing issue that results in indistinguishable node representations as recursive and simultaneous neighborhood aggregation deepens. Also, most methods focus on transductive scenarios that are limited to fixed graphs, which do not generalize properly to unseen graphs. To address these issues, we propose a novel graph neural network that considers both edge-based neighborhood relationships and node-based entity features with multiple steps, i.e., Graph Entities with Step Mixture via random walk (GESM). GESM employs a mixture of various steps through random walk to alleviate the oversmoothing problem, an attention mechanism to dynamically reflect interrelations depending on node information, and structure-based regularization to enhance embedding representation. Through extensive experiments, we show that our proposed GESM achieves state-of-the-art or comparable performance on eight benchmark graph datasets in both transductive and inductive learning tasks. Furthermore, we empirically demonstrate the superiority of our method on the oversmoothing issue with rich graph representations. Our source code is available at https://github.com/ShinKyuY/GESM.


I. INTRODUCTION
Graphs are general data representations that exist in a wide variety of real-world problems, such as analyzing social networks [1], [2], forecasting traffic flow [3], [4], predicting side-effects or interactions for drug discovery [5]- [7], and recommending products based on personal preferences [8], [9]. Owing to breakthroughs in deep learning, recent graph neural networks (GNNs) [10] have achieved considerable success on diverse graph problems by collectively aggregating information from graph structures [11]- [13]. As a result, in recent years, much research has been focused on how to aggregate feature representations of neighboring nodes to effectively utilize the dependence of the graph.
The majority of studies have predominantly depended on edges to aggregate the neighboring nodes' information. These edge-based methods are premised on the concept of relational inductive bias within graphs [14], which implies that two connected nodes have similar properties and are more likely to share the same label [15]. Although this approach leverages the unique properties of graphs for structural relations, it appears less capable of generalizing to new or unseen graphs [16].
To improve the neighborhood aggregation scheme, some studies have incorporated node information; they fully utilize node information and reduce the effects of structural (edge) information. A recent approach, graph attention networks (GAT), employs the attention mechanism for adjusting aggregation of neighboring nodes depending on the node information [17]. This approach has yielded impressive performance and has shown potential for improving generalization to unseen graphs.

FIGURE 1. Random walk aggregation and propagation procedure. From left to right are step-0, step-1, step-2, and step-infinite. The values in each node indicate the distribution of a random walk. In the leftmost picture, only the starting node has a value of 100, and all other nodes are initialized to zero. As the number of steps increases, values spread further throughout the graph, become smoothed (indistinguishable), and converge to some extent.
Regardless of neighborhood aggregation or propagation schemes, most methods, however, suffer from a common problem called oversmoothing. It means that as the aggregation step goes deeper, all node representations become similar and eventually indistinguishable. To avoid the oversmoothing issue, neighbor information is considered within a limited degree [18], weakening the rich graph representation. For example, graph convolutional networks (GCNs) [15] limit the aggregation to shallow depths and only operate on nodes that are closely connected, to keep distinguishable node information [19]. Consequently, information becomes localized and access to global information is restricted [12], leading to poor generalization performance on datasets with only a few labeled nodes [19].
In order to address the aforementioned issues, we propose a novel method, Graph Entities with Step Mixture via random walk (GESM), which considers information from all nodes in the graph and can be generalized to new graphs by incorporating a mixture of random walks and attention. The mixture of random walks enriches the graph representation and alleviates the oversmoothing issue, allowing global information to be considered. Hence, our method can be effective particularly on peripheral nodes or sparsely labeled datasets. The attention mechanism further advances our model by adaptively aggregating neighbor node information, which enhances the generalizability of the model to diverse graph structures.
Even with the attention mechanism, some neighbor nodes may still not be clustered closely in the embedding space. We therefore employ a triplet loss-based regularization technique [20], which enforces neighbor nodes to be closer and irrelevant nodes to be distant.
To validate our approach, we conducted experiments on eight standard benchmark datasets: Cora, Citeseer, and Pubmed, which are well-known citation network datasets, and Coauthor CS, Coauthor Physics, Amazon Computers, and Amazon Photo, which are co-authorship and co-purchase datasets, for transductive learning. We also conducted experiments on the protein-protein interaction (PPI) dataset for inductive learning, in which test graphs remain unseen during training. In addition to these experiments, we verified whether our model uses information from remote nodes by reducing the labeled ratio. The experimental results demonstrate the consistently competitive performance of GESM on all the datasets, covering both transductive and inductive scenarios. This manuscript is an expanded version of our prior workshop paper [21].
The key contributions of our approach are as follows:
• We propose Graph Entities with Step Mixture via random walk (GESM), which incorporates step mixture with bilinear pooling-based attention and a novel regularization technique.
• We experimentally demonstrate that our proposed model is consistently competitive compared to other models on eight benchmark datasets and is applicable to both transductive and inductive learning tasks. In addition, we show its effectiveness in terms of global aggregation as the rate of labels decreases.
• We provide an in-depth analysis regarding the effects on performance and inference time as the aggregation step increases, and confirm our superiority on the oversmoothing issue.

II. RELATED WORK
A. RANDOM WALKS
Random walk, a widely used method in graph theory, mathematically models how node information aggregates and propagates throughout the graph. A random walk moves from a starting node to one of its neighbor nodes in a succession of random steps. FIGURE 1 shows how values change during the random walk aggregation and propagation procedure from zero to infinite steps. For a given graph, the transition matrix P, which describes the transition probabilities, can be formulated as

P = AD^{-1},        (1)

where A denotes the adjacency matrix of the graph and D the diagonal degree matrix. The probability of moving from one node to any of its neighbors is equal, and the probabilities of moving to the neighboring nodes sum to one. Let u_t be the distribution of the random walk at step t (u_0 represents the starting distribution). The t-step random walk distribution is obtained by multiplying the transition matrix P a total of t times. In other words,

u_t = P^t u_0.        (2)

The entries of the transition matrix are non-negative, and each column sums to one, indicating that P is the matrix form of a Markov chain with a steady state. One of its eigenvalues is equal to 1, and the corresponding eigenvector is a steady-state distribution [22]. Therefore, even if the transition matrix is multiplied infinitely many times, convergence is guaranteed.
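The following short NumPy sketch illustrates Equations 1 and 2 on a toy, hypothetical graph (the adjacency matrix and starting node are illustrative assumptions, not taken from the paper): the distribution starts as one-hot, spreads over the graph, and approaches a steady state as the number of steps grows, mirroring FIGURE 1.

```python
import numpy as np

# Toy undirected graph: a triangle (nodes 1-2-3) with a pendant node 0.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 1.],
              [0., 1., 0., 1.],
              [0., 1., 1., 0.]])

# Transition matrix P = A D^{-1}; each column sums to one.
P = A / A.sum(axis=0)

# u_0: the walk starts at node 0 with probability 1.
u = np.array([1., 0., 0., 0.])

# u_t = P^t u_0: the distribution spreads and converges toward the steady state.
for t in range(1, 51):
    u = P @ u
    if t in (1, 2, 50):
        print(f"step {t}: {np.round(u, 3)}")
```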

B. ATTENTION
The attention mechanism was introduced in sequence-to-sequence modeling to solve the long-term dependency problems that occur in machine translation [23]. The key idea of attention is to allow the model to learn and focus on what is important by examining features of the hidden layer. In the case of GNNs [10], the attention mechanism has been used to give different importance to neighboring nodes depending on their informational relevance to a center node. During the propagation process, node features are given more emphasis than structural information (edges). Consequently, using attention is advantageous for training and testing graphs that share the same structure (edges) but have different node features. GAT [17] achieved state-of-the-art performance by using the attention mechanism, which is based on a concatenation operation between node information.
Given the many benefits of attention, we incorporate the attention mechanism to our model, which is based on bilinear pooling to fully utilize node information and fine-grained relevance. Our attention mechanism enables different weights to neighboring nodes by considering interactions. Combining attention with mixture-step random walk allows our model to adaptively highlight features with salient information in a global scope.

C. NODE CLASSIFICATION TASK
Node classification is the task of predicting the labels of masked nodes by learning from the other nodes of given graphs. It can be divided into transductive and inductive learning tasks based on whether unseen graphs are used.
In transductive learning tasks, graph learning is performed within a single fixed graph, and focuses on learning the given structural relationships between nodes of the graph. GCN [15] and HGCN [24], which are state-of-the-art models of transductive learning, focus on how to effectively handle the graph's structural representation. However, these models learn from the spectral-domain or hyperbolic-space by limiting the graph structure to only one fixed graph [25], thus leading to poor generalization on unseen graphs.
In inductive learning tasks, graph learning is performed to deal with multiple graphs; it focuses on learning dynamic relationships between nodes with new graphs. Existing models such as GraphSAGE [25] and GAT [17] focused on the local range of aggregations by sampling neighborhood nodes and actively utilizing node embeddings. However, in the case of node sampling, there are issues of which node and how many nodes to sample. Other challenges pertaining to node sampling include the excessive computation time and loss of structural information.
In general, most existing approaches for graph representation learning commonly suffer from the oversmoothing issue. Despite the efforts of recent work including JK-Net [12] and APPNP [18], these methods cannot completely solve the oversmoothing issue. Since APPNP employs a simple sum of Neumann series [26], it cannot adjust to the global and local aggregation scheme. JK-Net similarly does not work effectively for global aggregation as shown in FIGURE 5.

III. METHODOLOGY
Our proposed method enriches graph representation by employing a mixture of various steps through random walk, an attention mechanism for dynamic aggregation, and structure-based embedding regularization. On top of the base step mixture component, we add the attention mechanism and the regularization.
The notations used in this manuscript are as follows. Let G = (V, E) be a graph, where V and E denote the sets of nodes and edges, respectively. Nodes are represented as a feature matrix X ∈ R^{n×f}, where n and f respectively denote the number of nodes and the input dimension per node. The label matrix is Y ∈ R^{n×c} with the number of classes c, and a learnable weight matrix is denoted by W. The adjacency matrix of graph G is represented as A ∈ R^{n×n}. The adjacency matrix with added self-loops is Ã = A + I_n, and its column-normalized form is Â = ÃD^{-1}, with Â^0 = I_n.
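For concreteness, the column-normalized adjacency matrix with self-loops can be built as in the short sketch below (a dense NumPy illustration; an actual implementation would typically use sparse matrices):

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Return A_hat = (A + I) D^{-1}, the column-normalized adjacency
    matrix with self-loops, where D is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])   # add self-loops
    deg = A_tilde.sum(axis=0)          # column degrees (never zero due to self-loops)
    return A_tilde / deg               # divides each column j by deg[j]

# Example: a 3-node path graph.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = normalized_adjacency(A)
print(A_hat.sum(axis=0))               # each column sums to 1
```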

A. STEP MIXTURE
The step mixture as our base component has a simple structure that is composed of three stages, as shown in FIGURE 2. First, input X passes through a fully connected layer with a nonlinear activation and generates embedded node features:

Z = σ(XW^{(0)}).        (3)

Second, Z is multiplied by the normalized adjacency matrix Â for each random walk step that is to be considered. As indicated in the first and second stages, the node embedding and propagation processes are separated. Finally, the step mixture is implemented by concatenating each step into f_concat and passing it through the last prediction layer to produce ŷ. The entire propagation process can be formulated as

ŷ = softmax( (Â^0 Z ∥ Â^1 Z ∥ ⋯ ∥ Â^s Z) W^{(1)} ),        (4)

where ∥ is the concatenation operation, s is the maximum number of steps considered for aggregation, and Â^k is the normalized adjacency matrix Â multiplied k times. As shown in Equation 3, weights are shared across nodes. In our method, the adjacency matrix Â is an asymmetric matrix, which is generated by random walks and is flexible to arbitrary graphs. On the other hand, prior methods such as JK-Net [12] and MixHop [27] use a symmetric Laplacian adjacency matrix, which limits graph structures to given fixed graphs.
For step mixture, localized sub-graphs are concatenated with global graphs, which allows the neural network to adaptively select global and local information through learning (see FIGURE 3). While traditional graph convolution methods consider aggregated information within three steps by A(A(AXW^{(0)})W^{(1)})W^{(2)}, our method can take all previous aggregations into account by (Â^0 Z ∥ Â^1 Z ∥ Â^2 Z ∥ Â^3 Z)W^{(1)}. Most graph neural networks suffer from the oversmoothing issue. Although JK-Net [12] tried to handle oversmoothing by utilizing GCN blocks with multiple propagation steps, it could not completely resolve the issue, as shown in FIGURE 5. Unlike JK-Net, we explicitly separate the node embedding and propagation processes by employing a mixture of multiple random walk steps. This step mixture approach allows our model to alleviate the oversmoothing issue along with localized aggregation.
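To make the three stages concrete, the following NumPy sketch implements one forward pass of the base step mixture corresponding to Equations 3 and 4. The shapes, random weights, and the toy graph are illustrative assumptions, not the full training pipeline; the ELU activation follows the experimental setup described later.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def step_mixture_forward(X, A, W0, W1, s):
    """One forward pass of the base step mixture (sketch): embed once,
    propagate k = 0..s random walk steps with the column-normalized
    adjacency, concatenate all steps, then predict."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    A_hat = A_hat / A_hat.sum(axis=0)       # column-normalize: (A + I) D^{-1}
    Z = elu(X @ W0)                         # node embedding, computed once
    steps, cur = [], Z
    for _ in range(s + 1):                  # A_hat^0 Z, A_hat^1 Z, ..., A_hat^s Z
        steps.append(cur)
        cur = A_hat @ cur
    f_concat = np.concatenate(steps, axis=1)
    return softmax(f_concat @ W1)           # per-node class probabilities

# Toy usage with illustrative sizes (n nodes, f features, h hidden, c classes).
rng = np.random.default_rng(0)
n, f, h, c, s = 6, 10, 4, 3, 2
X = rng.normal(size=(n, f))
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T                                 # random undirected adjacency matrix
W0 = rng.normal(size=(f, h))
W1 = rng.normal(size=(h * (s + 1), c))
print(step_mixture_forward(X, A, W0, W1, s).shape)   # -> (6, 3)
```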

B. ATTENTION MECHANISM
For a more sophisticated design of the node embedding Z, we adopt an attention mechanism that dynamically aggregates the information of neighboring nodes according to the relationships between targets and their neighboring nodes. The degree of relationship, denoted as the attention coefficient α, is based on bilinear pooling of the encoding vectors of two nodes to reflect all element-level relationships. Bilinear pooling-based attention can consider element-level interactions, unlike the concatenation-based attention used in GAT.
The attention mechanism is implemented by simply replacing Z in Equation 4 with the attention feature H_multi, as shown in Equation 5:

ŷ = softmax( (Â^0 H_multi ∥ Â^1 H_multi ∥ ⋯ ∥ Â^s H_multi) W^{(1)} ),        (5)

H_multi = α_1 Z_1 ∥ α_2 Z_2 ∥ ⋯ ∥ α_m Z_m.        (6)

As described in Equations 6 and 7, we employ multi-head attention, where H_multi and α_i denote the concatenation of m attention layers and the i-th attention coefficient, respectively. Z_i denotes the node embedding of each attention head. We only compute α for targets and their neighboring nodes in order to maintain the structure of the graph.
The attention coefficient α is calculated as the sum of outer products between the encoding vectors of a target node t and its neighbor nodes j over all neighbors:

α_t = Σ_{j ∈ N(t)} e_t ⊗ e_j,        (7)

where ⊗ is the outer product, N(t) is the set of neighbors of t, and e is the Hadamard product of the node embedding Z and the weight matrix W, i.e., e = Z ∘ W. By incorporating attention into our base model, we can avoid or ignore noisy parts of the graph, providing a guide for the random walk [28]. Utilizing attention can also improve combinatorial generalization for inductive learning, where training and testing graphs are completely different. In particular, datasets with the same structure but different node information can benefit from our method because these datasets can only be distinguished by node information. Focusing on node features for aggregation can thus provide more reliable results in inductive learning.
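For intuition only, the sketch below uses a generic bilinear-form attention (score z_t^T W_att z_j, softmax-normalized over each neighborhood) as a simpler stand-in; it is not GESM's exact sum-of-outer-products formulation in Equation 7, and W_att, the neighbor dictionary, and the normalization are illustrative assumptions.

```python
import numpy as np

def bilinear_attention(Z, W_att, neighbors):
    """Generic bilinear attention over graph neighborhoods (a stand-in
    illustration): the score for edge (t, j) is the bilinear form
    z_t^T W_att z_j, normalized with softmax over the neighbors of t."""
    coeffs = {}
    for t, nbrs in neighbors.items():
        scores = np.array([Z[t] @ W_att @ Z[j] for j in nbrs])
        scores = np.exp(scores - scores.max())      # stable softmax
        coeffs[t] = scores / scores.sum()
    return coeffs

# Toy usage (shapes and values are illustrative only).
rng = np.random.default_rng(1)
Z = rng.normal(size=(4, 3))                          # 4 nodes, 3-dim embeddings
W_att = rng.normal(size=(3, 3))                      # hypothetical bilinear weight
neighbors = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
print(bilinear_attention(Z, W_att, neighbors))       # coefficients sum to 1 per node
```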

C. EMBEDDING REGULARIZATION
Although our attention effectively filters out noisy nodes, there is still a possibility that the correct neighbor nodes may not be clustered in the process, as shown in FIGURE 8. Therefore, it is necessary for the node embedding Z to have a regularization that helps cluster neighbor nodes together and move irrelevant nodes further away by reflecting the graph structure. Our push and pull-based triplet regularization R can be formulated as

R = (β / |S|) Σ_{(c,p,n) ∈ S} ( Dis(Z_c, Z_p) − Dis(Z_c, Z_n) ),        (8)

where S ⊂ E and its cardinality |S| is the number of samples, and β denotes a weight for the distances between the center node embedding and its positive and negative node embeddings. A positive node p represents a neighbor node of a center node c, and a negative node n represents any node except the positive and center nodes. The distance function Dis(·) uses a sigmoid of dot products, i.e., Dis(Z_i, Z_j) = 1 − sigmoid(Z_i · Z_j). The distance between positive nodes and the center node is minimized, and the distance between negative nodes and the center node is maximized. Finally, our objective function L is L = J + R, where J denotes the cross-entropy loss between the prediction ŷ and the target label to reduce classification errors, and R denotes the embedding regularizer that further differentiates embedding representations.
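A minimal NumPy sketch of the distance function and the push-and-pull term follows, assuming the simple averaging and β-weighting written in Equation 8; the sampled triplets are hypothetical, and the final objective would combine this value with a cross-entropy loss as L = J + R.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dis(z_i, z_j):
    """Dis(Z_i, Z_j) = 1 - sigmoid(Z_i . Z_j): small when embeddings agree."""
    return 1.0 - sigmoid(np.dot(z_i, z_j))

def triplet_regularizer(Z, triplets, beta=1.0):
    """Push-and-pull regularizer over sampled (center, positive, negative)
    triplets: minimizing it pulls neighbor (positive) nodes toward the center
    and pushes negative nodes away. The weighting follows the sketch in
    Equation 8, not necessarily the paper's exact formulation."""
    total = 0.0
    for c, p, n in triplets:
        total += dis(Z[c], Z[p]) - dis(Z[c], Z[n])
    return beta * total / len(triplets)

# Toy usage with a random embedding and hypothetical sampled triplets.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))                      # 5 nodes, 4-dim embeddings
triplets = [(0, 1, 3), (2, 3, 0)]
print(triplet_regularizer(Z, triplets, beta=0.5))
```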

IV. EXPERIMENTS
A. DATASETS
1) TRANSDUCTIVE LEARNING
We utilize three benchmark datasets for node classification: Cora, Citeseer, and Pubmed [34]. These three datasets are citation networks, in which the nodes represent documents and the edges correspond to citation links. The edge configuration is undirected, and the feature of each node consists of word representations of a document. Detailed statistics of the datasets are described in Table 1.
For experiments on datasets with the public label rate, we follow the transductive experimental setup of [35]. Although all of the node feature vectors are accessible, only 20 node labels per class are used for training. Accordingly, only 5.1% of the labels for Cora, 3.6% for Citeseer, and 0.3% for Pubmed can be learned from. In addition to the experiments with public label rate settings, we conducted experiments using datasets where labels were randomly split into a smaller set for training. To check whether our model can propagate node information to the entire graph, we reduced the label rate of Cora to 3% and 1%, Citeseer to 1% and 0.5%, and Pubmed to 0.1%, and followed the experimental settings of [19] for these datasets with low label rates. For all of the experiments, we report the results using 1,000 test nodes and 500 validation nodes.

2) INDUCTIVE LEARNING
We use the protein-protein interaction (PPI) dataset [36], which is preprocessed by [17]. As detailed in Table 1, the PPI dataset consists of 24 different graphs, where 20 graphs are used for training, 2 for validation, and 2 for testing. The test set remains completely unobserved during training. Each node is multi-labeled with 121 labels and has 50 features regarding gene sets and immunological signatures.

3) ROBUSTNESS
To further verify robustness against overfitting, we extended our experiments to four additional node classification datasets. Coauthor CS and Coauthor Physics are co-authorship graphs from the KDD Cup 2016 challenge, in which nodes are authors, features represent the keywords of each author's papers, and class labels indicate each author's most active research area. Amazon Computers and Amazon Photo are co-purchase graphs of Amazon, where nodes represent items and edges indicate that two items have been purchased together. The node features are bag-of-words representations of product reviews, and class labels represent product categories. Detailed statistics of the datasets are described in Table 2.

B. EXPERIMENTAL SETUP
Regarding the hyperparameters of our transductive learning models, we set the dropout probability to 0.7 and the number of attention heads to m = 8. The size of the hidden layer h ∈ {64, 512} and the maximum number of steps used for aggregation s ∈ {5, 15} were adjusted for each dataset. We trained for a maximum of 300 epochs with L2 regularization λ ∈ {0.003, 0.0005}, triplet regularization weight β ∈ {0.5, 1.0}, and learning rate lr ∈ {0.003, 0.0008}. We report the average classification accuracy over 20 runs.
For inductive learning, we set the size of hidden layer h = 256, number of steps s = 3, multi-head attention m = 8, and β = 1.0 for GESM. L2 regularization and dropout were not used for inductive learning [17]. We trained our models for a maximum of 3,000 epochs with learning rate lr = 0.008. The evaluation metric is the micro-F1 score, and we report the averaged results of 10 runs.
Our models were initialized using Glorot initialization [37], and the nonlinearity of the first fully connected layer was an exponential linear unit (ELU) [38]. Models were trained to minimize the cross-entropy loss using the Adam optimizer [39]. We employed an early stopping strategy based on the loss and accuracy of the validation sets, with a patience of 20 epochs for Cora and 100 epochs for the other datasets. All experiments were performed on the NAVER Smart Machine Learning (NSML) platform [40], [41].

V. RESULTS
Table 3 summarizes the experimental results for transductive and inductive learning tasks based on the public label rate. In general, not only can few methods perform on both transductive and inductive learning tasks, but the performance of such methods is not consistently high. Our methods, however, rank in the top-3 for every task, indicating that our method can be applied to either type of task with strong predictive power.

A. NODE CLASSIFICATION
1) RESULTS ON BENCHMARK DATASETS
For transductive learning tasks, the experimental results of our methods are higher than or equivalent to those of other methods. As can be identified from the table, our model GESM (w/o reg), which considers node information in the aggregation process, outperforms many existing baseline models. These results indicate the significance of considering both global and local information with the attention mechanism. It can also be observed that GESM achieves more stable results than GESM (w/o reg), suggesting the importance of reconstructing node representations. In addition, we calculate confidence intervals via bootstrapping, denoted by GESM (bootstrap.). For the bootstrapping setup, we varied the train nodes with replacement and fixed valid/test nodes. This demonstrates a statistically rigorous performance of GESM.
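As a rough illustration of how the interval for GESM (bootstrap.) could be derived, the sketch below computes a percentile confidence interval from per-run accuracies, assuming each run is trained on train nodes resampled with replacement; the accuracy values are placeholders, not results from Table 3.

```python
import numpy as np

# Placeholder accuracies, one per run trained on train nodes resampled with
# replacement (illustrative values only, not the paper's results).
accs = np.array([0.831, 0.827, 0.835, 0.829, 0.833, 0.826, 0.838, 0.830])

# Percentile bootstrap: the runs themselves are the bootstrap replicates,
# so the 2.5th and 97.5th percentiles give a 95% confidence interval.
low, high = np.percentile(accs, [2.5, 97.5])
print(f"mean = {accs.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```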
For the inductive learning task, our proposed models GESM (w/o reg) and GESM outperform GAT, despite the fact that GAT consists of more attention layers. These results for unseen graphs are consistent with those of previous work [17], which improved generalization by enhancing the influence of node information. JK-Net shows comparable performance to GESM on the PPI dataset. However, subsequent in-depth comparisons on the oversmoothing issue confirm the superiority of our method over JK-Net.

2) RESULTS ON DATASETS WITH LOW LABEL RATES
To demonstrate that our methods can consider global information, we experimented on transductive learning datasets with low label rates. As indicated in Table 4, our models show remarkable performance even on sparse datasets. In particular, we can observe that our methods trained on only 3% of the Cora dataset outperformed some methods trained on 5.1% of the data. Since each run of the experiment uses a separate random split with very low label rates (e.g., the 1% Cora split provides only 25 nodes for training), the overall bootstrapped confidence interval is larger than its public-split counterpart.
It could be speculated from the results that using a mixture of random walks played a key role in the accuracy of our proposed model GESM; the improved results can be explained by our method adaptively selecting node information from local and global neighborhoods, and allowing even peripheral nodes to receive information.

3) EXPERIMENTS ON OTHER DATASETS FOR ROBUSTNESS
Existing works have suffered from experimental bias and overfitting caused by using a single train/valid/test split and tuning hyperparameters to each dataset. GNN algorithms are very sensitive to both data splits and weight initialization (as shown by [18], [42]). Thus, [42] suggests a carefully designed evaluation protocol that runs experiments on multiple random splits and initializations.
To check robustness, we extended our experiments to the Coauthor CS, Coauthor Physics, Amazon Computers, and Amazon Photo datasets. We followed the experimental setup of [42]. The experiment is performed by repeating 100 random train/valid/test splits with 20 random weight initializations and using the same hyperparameters (unified hidden size: 64, step size: 15, multi-head for GAT and GESM: 8) without additional tuning. Table 5 shows that our proposed methods do not overfit to any particular dataset and consistently rank in the top-3. Moreover, GAT shows a very high standard deviation on Amazon Computers and Amazon Photo, whereas the performance of GESM is more accurate and stable. For a more rigorous statistical validation, we report the uncertainties of the models at the 95% confidence level (see Table 6).

B. MODEL ANALYSIS
1) OVERSMOOTHING AND ACCURACY
To investigate the oversmoothing issue empirically, we conducted an experiment to check the classification accuracy as the number of aggregation or propagation steps increases. As shown in FIGURE 4, GCN [15], SGC [31], and GAT [17] suffer from oversmoothing. GCN and GAT show severe degradation in accuracy after the 8th step; the accuracy of SGC does not drop as much as GCN and GAT but nevertheless gradually decreases as the step size increases. Unlike the others, our proposed GESM maintains its performance without any degradation because it avoids rank loss [43] and alleviates oversmoothing through step mixture. Intriguingly, according to FIGURE 4, JK-Net [12] also maintains its training accuracy regardless of the step size by using GCN blocks with multiple steps. Following a similar procedure as above, we further compared the test accuracy of GESM with that of JK-Net by step size. To inspect the adaptability of GESM and JK-Net to larger steps, we concatenated features after the 10th step. As shown in FIGURE 5, GESM outperforms JK-Net, even though both methods use the concatenated features. These results demonstrate that JK-Net obtains global information in a manner similar to GCN or GAT. Consequently, the larger the step, the more difficult it is for JK-Net to maintain test accuracy. GESM, on the other hand, maintains a steady performance, which confirms that GESM does not collapse even at large steps.
We also observe the effect on accuracy as the number of steps increases under various label rates for GESM. As represented in FIGURE 6, it is evident that considering remote  nodes can contribute to the increase in accuracy. By taking into account more data within a larger neighborhood, our model can make reliable decisions. The figure also indicates that the accuracy converges faster for datasets with higher  label rates, presumably because a small number of walk steps can be used to explore the entire graph.

2) INFERENCE TIME
The computational complexity of all methods increases linearly as the step size increases, as shown in FIGURE 7. We can observe that the inference time of GCN [15] is slightly faster than that of GESM, by a nearly constant margin. The inference time of GESM is much faster than that of GAT [17], even though both methods employ multi-head attention mechanisms. GESM provides higher accuracy within a reasonable inference time due to its sophisticated design comprising a mixture of random walk steps.

3) EFFECTS OF REGULARIZER
The last embedding representations of the proposed methods are visualized using t-SNE [44], as shown in FIGURE 8. The figure illustrates the difference between GESM (w/o reg) and GESM (with reg). While the nodes are scattered for GESM (w/o reg), they are closely clustered for GESM due to the effects of regularization. According to the results in Table 3, the more distinctly clustered GESM generally produces better results than the loosely clustered GESM without regularization.

VI. CONCLUSION
Conventional graph neural networks suffer from the oversmoothing issue as the number of aggregation (or propagation) steps increases, and from poor generalization to unseen graphs. To tackle these issues, we propose a simple but effective model that weights neighboring nodes differently depending on node information in the aggregation process and adaptively considers global and local information by employing a mixture of multiple random walk steps. To further refine the graph representation, we have presented a new regularization term, which enforces similar neighbor nodes to be closely clustered in the node embedding space. The results from extensive experiments show that our GESM achieves state-of-the-art or competitive performance on both transductive and inductive learning tasks across eight benchmark graph datasets.
For future research, we will refine the way our method utilizes node information to improve the computational efficiency of attention. In addition, we will extend GESM so that it can be applied to real-world large-scale graph data.