Learning Augmentation for GNNs With Consistency Regularization

Graph neural networks (GNNs) have demonstrated superior performance in various tasks on graphs. However, existing GNNs often suffer from weak generalization due to sparsely labeled datasets. Here we propose a novel framework that learns to augment the input features using topological information and automatically controls the strength of augmentation. Our framework learns the augmentor to minimize GNNs' loss on unseen labeled data while maximizing the consistency of GNNs' predictions on unlabeled data. This can be formulated as a meta-learning problem, and our framework alternately optimizes the augmentor and the GNN for a target task. Our extensive experiments demonstrate that the proposed framework is applicable to any GNN and significantly improves the performance of graph neural networks on node classification. In particular, our method provides a 5.78% improvement with the graph convolutional network (GCN) on average across five benchmark datasets.

One solution to the overfitting problem is data augmentation. It increases the amount and diversity of data and improves the generalization of neural networks trained on randomly augmented samples. Conventional data augmentation includes, for instance in image recognition, simple transformations such as random cropping, cutout, additive Gaussian noise, or blurring. Intuitively, data augmentation teaches invariances to neural networks for a target task. The key to effective data augmentation is to find an optimal combination of operations and augmentation strength. Since every task needs different invariances, data augmentation has been manually optimized with domain knowledge for each dataset. To reduce the manual effort and scale up data augmentation to new tasks and domains, learning augmentation has recently been proposed; this includes AutoAugment [19], which optimizes the hyperparameters of a non-differentiable augmentor via reinforcement learning; PBA [20], which generates augmentation policy schedules using population-based training; Fast AutoAugment [21], which searches for the optimal policy via Bayesian optimization; and Faster AutoAugment [22], which approximates gradient information and optimizes augmentation policies using the straight-through estimator [23].
The learning augmentation paradigm, however, has been less studied in the graph domain. One technical challenge is that it is difficult to define simple operations that augment graph data without changing its meaning. Recent advances in learning augmentation have mainly focused on finding an optimal policy to combine existing operations and their strengths rather than learning the operations themselves, but only a few simple operations for graph data augmentation have been proposed in the literature, such as DropEdge [18], DropNode [24], and attribute masking [25]. Recently, with these simple operations, [26], [27] studied frameworks to learn graph data augmentation, but they are limited to perturbations of adjacency matrices. In this work, we propose a framework to learn graph data augmentation that transforms input features instead of adjacency matrices. Our framework learns topology-aware input feature transformations and adaptively controls the strength of augmentation based on a GNN's performance. It is formulated as a meta-learning problem, and our framework explicitly maximizes the generalization power of graph neural networks trained on augmented data. We observe that scarcely labeled data may cause meta-overfitting, i.e., overfitting in meta-learning, and result in excessive and ineffective data augmentation. Inspired by the success of consistency regularization in representation learning [28]-[32], we regularize our augmentation model by maintaining the consistency of the GNN's predictions leveraging unlabeled data. To our knowledge, this is the first work to study consistency regularization in the context of learning augmentation. In the conventional setting, consistency regularization has been used for representation learning with a manually designed augmentation method.
Our experiments show that consistency regularization is effective in reducing meta-overfitting as well.
Our contributions are summarized as follows:
• Unlike existing graph data augmentations based on simply removing/adding edges, dropping nodes, or masking attributes, we propose a novel framework that learns an operation to augment input features and adaptively controls the strength of augmentation.
• We study consistency regularization in the context of learning augmentation. To our knowledge, this is the first work to show the effectiveness of the consistency regularizer in learning augmentation and reducing meta-overfitting.

• We demonstrate the effectiveness of our framework in improving the generalization power of GNNs. Overall, our method provides 5.78%, 2.63%, and 3.14% improvements on average across various datasets compared to the vanilla baselines GCN [4], GraphSAGE [2], and SGC [33], respectively.

II. RELATED WORK
A. GRAPH NEURAL NETWORKS
Graph neural networks (GNNs) have been widely adopted for representation learning on graphs in various tasks such as node classification, link prediction, and graph classification. They can be categorized into spectral approaches [4], [34], [35] and non-spectral (spatial) approaches [2], [3], [36]. [4] presented GCN, which developed spectral approaches into spatial approaches by localizing a first-order approximation of graph convolutions. GraphSAGE [2], a non-spectral approach, learns to generate embeddings by sampling and aggregating features from the neighborhoods of nodes. Existing supervised approaches in graph representation learning leverage the labels of only a small subset of nodes. Therefore, to fully utilize a large amount of unlabeled data, recent works on semi-supervised learning [24], [37] have emerged.
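As a concrete illustration of the spatial view described above, the GCN propagation rule H' = σ(Â H W) can be sketched in a few lines of NumPy; the toy graph, dimensions, and function names below are ours for illustration, not from the paper:

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                    # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    """One GCN layer: aggregate 1-hop neighbours, transform, apply ReLU."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy graph: 3 nodes on a path, 2-dim input features, 2 hidden units.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.random.randn(3, 2)
W = np.random.randn(2, 2)
H1 = gcn_layer(normalize_adjacency(A), H, W)
print(H1.shape)  # (3, 2)
```

Stacking two such layers (with a softmax at the end) gives the two-layer GCN classifier used as the base model throughout the experiments.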

B. GRAPH AUGMENTATION
Data augmentation is an effective technique to improve generalization by increasing the diversity of training data. It has become a de facto necessity in modern machine learning to employ simple data augmentation (e.g., image rotation, cropping, flipping, translation, and so on). Despite its effectiveness, few approaches have been explored in the graph domain, and most of them are based on the graph topology. DropEdge [18] is a simple graph data augmentation strategy that randomly removes a certain number of edges from the graph. Similarly, a method that propagates node features perturbed by random node-wise dropping was proposed in [24]. [27] introduces AdaEdge, which adaptively adjusts the graph topology by removing inter-class edges and adding intra-class edges based on the model's predictions. In contrast, our proposed method augments node features based on the topological structure of the graph by interpolating between original input features and augmented features.

C. META-LEARNING FOR GRAPHS
Meta-learning has shown success in diverse tasks [38], and there are some works applying meta-learning to data augmentation on images [39]-[41]. Recently, it has also been applied to non-Euclidean domains. For instance, Meta-GNN [42] and Meta-Graph [43] use gradient-based meta-learning in a few-shot setting for node classification and link prediction, respectively. MetaR [44] is a meta-learning framework leveraging a knowledge graph for few-shot link prediction. [45] proposed a meta-learner that explicitly relates tasks by describing the relations of predicted classes. [46] proposed G-META, which learns transferable knowledge faster via meta gradients by leveraging local subgraphs. Most meta-learning works for graphs are specialized to the few-shot setting and are used to train a learner. Recently, SELAR [47] learns how to assist the primary task by optimizing a weighting function, rather than a classifier, via meta-learning. In this paper, we propose a graph augmentation strategy trained in a meta-learning manner that learns how to augment input features to improve performance on unseen data.

III. METHOD
We present a novel framework (AugGCR) that learns data Augmentation for GNNs with Consistency Regularization to enhance the generalization power of GNNs. Our framework includes 1) an augmentor that generates augmented input features taking graph topology into account and 2) a novel learning strategy to alternately train a GNN and the augmentor, avoiding both overfitting and meta-overfitting to scarce label information.

FIGURE 1.
The augmentor Aug(·; φ), with parameters φ = (α, θ, β), is alternately optimized with a target model f(·; w). Our augmentation consists of three steps: (a) learnable mixed-order propagation randomly drops node features and propagates them to k-hop neighbors with weight α_k; (b) node-wise transformation applies a two-layer MLP denoted by g_θ; (c) mixup interpolates the augmented samples with the original features. Our optimization involves three data splits: meta-train data (m_tr), meta-test data (m_te), and the entire data including unlabeled nodes. The training procedure of the augmentor can be divided into three steps. First, update the dummy target model f* with meta-train data and compute the cross-entropy loss L_xe on meta-test data. Second, compute the consistency regularization loss L_con with S different augmentations X̃ of the entire data. Lastly, optimize with L_aug. (Best viewed in color.)

A. TOPOLOGY-AWARE AUGMENTATION
Given an input graph G = (V, E) with |V| nodes, their input features X ∈ R^{|V|×d}, and adjacency matrix A ∈ R^{|V|×|V|}, our augmentor generates the augmented features X̃ ∈ R^{|V|×d} taking the graph topology into account as follows:

X̃ = Aug(X, A; φ),

where Aug(·) is the augmentor and φ is its model parameters.
Our augmentor consists of three components: a learnable mixed-order propagation, node-wise nonlinear transformation, and mix-up with original features.

Algorithm 1 Augmentor
Input: training features X, normalized adjacency Â; drop ratio δ; number of hops K
Output: augmented features X̃

1) LEARNABLE MIXED-ORDER PROPAGATION
To perturb features based on graph topologies, our augmentor first randomly drops node features as in DropNode [24]: each node's feature vector is zeroed out with probability δ and the surviving features are scaled by 1/(1 − δ). The randomly dropped node features are then propagated by the power series of the normalized adjacency matrix as in SGC [33]. For augmentation, [24] adopted the mixed-order propagation, i.e., X̄ = Σ_{k=0}^{K} (1/(K+1)) Â^k X, where Â = D^{-1/2}(A + I)D^{-1/2} is the normalized adjacency matrix and D denotes the degree matrix of A + I. Note that Â^0 is the identity matrix. We extend this feature perturbation with learnable weights {α_k}_{k=0}^{K} for the power series of Â, i.e., X̄ = Σ_{k=0}^{K} α_k Â^k X. Our preliminary experiment shows that these coefficients allow more flexible feature augmentation.
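The first augmentation step can be sketched as follows. This is a minimal NumPy illustration under our own naming: the DropNode rescaling by 1/(1 − δ) follows the description in Section IV, a constant matrix stands in for a real normalized adjacency, and the weights α_k are fixed here although they are learnable in the paper:

```python
import numpy as np

def drop_node(X, delta, rng):
    """DropNode: zero entire feature rows w.p. delta, rescale survivors by 1/(1-delta)."""
    mask = rng.random(X.shape[0]) >= delta
    return X * mask[:, None] / (1.0 - delta)

def mixed_order_propagation(A_norm, X, alpha):
    """Weighted power series: sum_k alpha_k * A_norm^k @ X for k = 0..K."""
    out = alpha[0] * X            # A^0 is the identity
    Ak_X = X
    for k in range(1, len(alpha)):
        Ak_X = A_norm @ Ak_X      # incrementally forms A^k @ X
        out = out + alpha[k] * Ak_X
    return out

rng = np.random.default_rng(0)
A_norm = np.full((4, 4), 0.25)      # stand-in for a normalized adjacency matrix
X = rng.standard_normal((4, 3))
alpha = np.array([0.5, 0.3, 0.2])   # learnable in the paper; fixed for this sketch
X_bar = mixed_order_propagation(A_norm, drop_node(X, 0.5, rng), alpha)
print(X_bar.shape)  # (4, 3)
```

Setting alpha = [1, 0, 0] recovers the unpropagated features, while the uniform choice alpha_k = 1/(K+1) recovers the fixed mixed-order propagation of [24].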

2) NODE-WISE TRANSFORMATION
In the second step, the augmented node features obtained by the learnable mixed-order propagation are further transformed by a simple two-layer perceptron, i.e., X̂ = g(X̄; θ). As mentioned in [24], the multi-layer perceptron (MLP) can be replaced with any neural network.
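The node-wise transformation applies the same two-layer MLP to every node's feature row; a minimal sketch, where the weight shapes and names are illustrative:

```python
import numpy as np

def node_mlp(X, W1, b1, W2, b2):
    """Node-wise two-layer MLP g(X; theta): the same transform for every row."""
    H = np.maximum(X @ W1 + b1, 0.0)  # hidden ReLU layer
    return H @ W2 + b2                # project back to the input feature dimension

rng = np.random.default_rng(1)
X_bar = rng.standard_normal((5, 4))                 # propagated features from step (a)
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 4)), np.zeros(4)
X_hat = node_mlp(X_bar, W1, b1, W2, b2)
print(X_hat.shape)  # (5, 4): one transformed feature row per node
```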

3) MIXUP WITH ORIGINAL FEATURES
Lastly, our augmentor generates the final augmented node features by adaptively averaging the perturbed feature matrix X̂ and the original node feature matrix X as:

X̃ = σ(β) X̂ + (1 − σ(β)) X,

where β is a learnable parameter and σ is the sigmoid function. The mixup coefficient β controls the strength of augmentation by adjusting the influence of the original feature matrix X on the augmented feature matrix X̃. Target models' sensitivity to β is discussed in Section IV-D. Our augmentation process is summarized in Algorithm 1.
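The mixup step can be sketched as below; the helper names are ours, and the demo values only illustrate how the sign of β controls the augmentation strength:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mixup_with_original(X_hat, X, beta):
    """Final step: sigma(beta) * perturbed features + (1 - sigma(beta)) * original."""
    g = sigmoid(beta)
    return g * X_hat + (1.0 - g) * X

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = X + 10.0  # stand-in for the propagated, MLP-transformed features
weak = mixup_with_original(X_hat, X, beta=-20.0)   # sigma(beta) ~ 0
print(np.allclose(weak, X, atol=1e-6))  # True: near-zero augmentation strength
```

Because β enters through a sigmoid, the mixing weight stays in (0, 1), so gradient updates to β can smoothly strengthen (β → +∞) or weaken (β → −∞) the augmentation without ever leaving the interpolation regime.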

B. STRATEGIES FOR LEARNING AUGMENTATION
We present our meta-learning formulation to learn the augmentor with consistency regularization to reduce metaoverfitting of the augmentor.

1) LEARNING TO AUGMENT VIA META-LEARNING
The objective of our augmentation model is to generate data that improves the learner's generalization power. In other words, the objective is to maximize the accuracy of a learner GNN on unseen meta-test data D_m_te. Given meta-train data D_m_tr and meta-test data D_m_te, both of which belong to a training set, learning augmentation for a GNN parameterized by w is naturally formulated as a meta-learning problem:

min_φ L_aug(X̃(φ), w*(φ))  s.t.  w*(φ) = argmin_w L_cls(X̃(φ), w),    (3)

where X̃(φ) is the augmented samples from our augmentor parameterized by φ, and L_cls(·) and L_aug(·) are the cross-entropy L_xe(·), used as a surrogate loss for accuracy, on the meta-train and meta-test data, respectively. Note that X̃(φ) is used in both losses since, in node classification, GNNs also utilize the input features of neighboring nodes. More concretely, L_cls(·) in (3) is defined as

L_cls = − Σ_{i ∈ I_tr} y_i^T log Z̃_i,

where I_tr is the set of indices of the meta-train data and Z̃ = f(X̃; w) is the predictions of the learner GNN. L_aug(·) is defined similarly with I_te, the set of indices of the meta-test data. Although the formulation in (3) is seemingly plausible, it is problematic when the meta data is small: it suffers from meta-overfitting. The augmentor maximizes the accuracy of the learner GNN on the small meta-test data rather than improving its generalization power, which often leads to the GNN overfitting.

2) CONSISTENCY REGULARIZATION FOR AUGMENTATION
We impose consistency regularization to address meta-overfitting and to leverage underutilized node features for learning augmentation. The loss for the augmentor is now defined as

L_aug := L_xe + L_con,    (5)

where L_xe(·) and L_con(·) are the cross-entropy and consistency losses. Specifically, the consistency loss is defined as

L_con = (1/S) Σ_{s=1}^{S} Σ_i ||Ẑ_i − Z̃_i^{(s)}||²,

where Z̃^{(s)} denotes the predictions of the learner GNN on the sth augmented sample, i.e., Z̃^{(s)} = f(X̃^{(s)}(φ); w), and Ẑ_i is the expected probability of the ith node. Given the average of the outputs Z̄ = (1/S) Σ_{s=1}^{S} Z̃^{(s)} over S augmented samples, we apply the sharpening trick [48] to obtain the expected probability Ẑ by reducing the entropy of the predictions. It is written as

Ẑ_ic = Z̄_ic^{1/t} / Σ_{c'=1}^{C} Z̄_ic'^{1/t},    (7)

for all i = 1, . . . , |V| and c = 1, . . . , C, where |V| is the number of nodes, C is the number of classes, and 0 < t ≤ 1 is a temperature controlling the sharpness. In equation (7), Ẑ_ic denotes the expected probability of the cth class for the ith node, i.e., Ẑ_i = [Ẑ_i1, . . . , Ẑ_iC].
Since the consistency loss does not require labels, we apply it to all nodes, including unlabeled ones. The consistency loss L_con is commonly used to constrain model predictions to be invariant to input noise. In this work, we apply it to regularize the augmentor, not the learner model. Leveraging a large amount of unlabeled data, this suppresses meta-overfitting and excessive augmentation that would drastically change the GNN's predictions on augmented data. More discussion on the effectiveness of consistency regularization is given in Section IV-D.
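The sharpening trick and the consistency loss can be sketched as follows. This is an assumption-laden illustration: the squared-difference distance below is in the spirit of the sharpened-target consistency of [48] rather than a verbatim transcription of the paper's L_con, and the prediction matrices are toy values:

```python
import numpy as np

def sharpen(Z_bar, t=0.5):
    """Sharpening trick: lower the entropy of averaged predictions (rows sum to 1)."""
    p = Z_bar ** (1.0 / t)
    return p / p.sum(axis=1, keepdims=True)

def consistency_loss(Z_list, t=0.5):
    """Mean squared distance between each augmented prediction and the sharpened mean."""
    Z_bar = np.mean(Z_list, axis=0)   # average over S augmented forward passes
    Z_hat = sharpen(Z_bar, t)         # fixed target (treated as constant in practice)
    return np.mean([(Z - Z_hat) ** 2 for Z in Z_list])

# Toy predictions for 2 nodes, 2 classes, under S = 2 augmentations.
Z1 = np.array([[0.6, 0.4], [0.3, 0.7]])
Z2 = np.array([[0.8, 0.2], [0.1, 0.9]])
loss = consistency_loss([Z1, Z2])
print(loss > 0.0)  # True: the two augmented predictions disagree
```

Note that with t = 1 sharpening is the identity, and as t → 0 the target approaches a one-hot distribution, which is exactly the entropy-reduction role described above.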

3) OPTIMIZATION
Our optimization algorithm is outlined in Algorithm 2. To solve the bi-level optimization in (3), we alternately optimize the augmentor and the GNN. The lower-level task is to train the GNN parameterized by w by minimizing the supervised loss L_cls on augmented train samples X̃(φ_n). We approximate the solution to the lower-level optimization with the updated parameters ŵ obtained from a single gradient descent step:

ŵ(φ) = w − η_f ∇_w L_cls(X̃(φ), w),

where η_f is the learning rate for w. Here ŵ(φ) is a computational graph for updating φ and is not explicitly evaluated; we do not update the GNN parameters w at this point. Using this first-order approximation, the upper-level optimization minimizes L_aug(·) on the entire data. The update of the augmentation model parameters is given as

φ_{n+1} = φ_n − η_g ∇_φ L_aug(X̃(φ_n), ŵ(φ_n)),

where η_g is the learning rate for φ. We adopt cross-validation to alleviate meta-overfitting and average C gradients from C different splits, as in Lines 2-7 of Algorithm 2. The augmentor parameters φ are optimized with the supervised loss L_cls on meta-test data and the consistency loss L_con on the entire data, including unlabeled data.
After updating the augmentation model parameters, we update the GNN parameters w using X̃(φ_{n+1}):

w_{n+1} = w_n − η_f ∇_w L_cls(X̃(φ_{n+1}), w_n).

The classifier parameters w are optimized with the supervised loss L_cls on the meta-train and meta-test data, which belong to the training set, as in standard supervised learning. Note that we study our augmentor in the context of test-time augmentation (TTA) [49]-[52], so the augmentor is optimized by minimizing the loss on the augmented data X̃ instead of the original data X.
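The alternating first-order scheme can be illustrated on a toy scalar problem. Everything below is our own illustrative setup, not the paper's implementation: a single scalar stands in for the GNN weights and the augmentor, the "augmentor" merely shifts the input, and a finite-difference gradient stands in for backpropagating through the inner step:

```python
import numpy as np

# Toy scalars standing in for GNN weights (w) and augmentor parameters (phi).
x_tr, y_tr, x_te, y_te = 1.0, 2.0, 1.5, 3.5
eta_f, eta_g = 0.1, 0.05      # learner and augmentor learning rates

def augment(x, phi):
    """Stand-in augmentor: shift the input by phi."""
    return x + phi

def inner_step(w, phi):
    """One gradient step of the learner on augmented train data (lower level)."""
    pred = w * augment(x_tr, phi)
    grad_w = 2.0 * (pred - y_tr) * augment(x_tr, phi)
    return w - eta_f * grad_w

def outer_loss(w, phi):
    """Meta-test loss of the once-updated learner (upper level)."""
    w_hat = inner_step(w, phi)    # computational graph through the inner step
    return (w_hat * augment(x_te, phi) - y_te) ** 2

w, phi = 0.0, 0.0
for _ in range(200):
    # Update the augmentor through the first-order inner step
    # (central finite differences stand in for autograd here).
    eps = 1e-5
    grad_phi = (outer_loss(w, phi + eps) - outer_loss(w, phi - eps)) / (2 * eps)
    phi -= eta_g * grad_phi
    # Then update the learner on the freshly augmented data.
    w = inner_step(w, phi)
print(outer_loss(w, phi) < outer_loss(0.0, 0.0))  # True: meta-test loss decreased
```

The structure mirrors Algorithm 2: an inner step that is differentiated but never committed, an outer update of φ, and only then a committed update of w on the newly augmented data.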

IV. EXPERIMENT
We evaluate the effectiveness of our framework on five popular benchmark datasets. The results include the following.
(1) Our framework significantly improves the node classification performance of widely used GNNs: GCN, GraphSAGE, and SGC.
(2) It also shows competitive performance compared to other state-of-the-art semi-supervised methods: GRAND and GraphMix.
(3) An ablation study shows that topology-aware augmentation and keeping the consistency of GNN predictions on unlabeled data improve the performance of GNNs.
Datasets. We evaluate our method on five benchmark datasets for node classification in various domains: movie dataset IMDB, citation network DBLP preprocessed by [53], air-traffic dataset AIRPORT [54], Wikipedia page-page network SQUIRREL and CHAMELEON [55] preprocessed by [56]. The statistics of datasets are summarized in Table 1.
Implementation Details. All experiments are implemented in PyTorch [57] with the geometric deep learning library PyTorch Geometric [58], and we use the macro-F1 score as the evaluation metric. We set the dimensionality of hidden representations to 64 and the number of layers to 2 for IMDB and DBLP (and 3 for AIRPORT, SQUIRREL, and CHAMELEON) for all methods. Models are optimized by the Adam optimizer [59] for 200 epochs, except for AIRPORT, on which models are trained for 500 epochs. We did not use batch normalization, dropout, or weight decay. More details of the hyperparameter settings are in Table 2. Hyperparameter tuning for the baseline methods is performed by grid search. The drop ratio δ for the baselines DropEdge and DropNode is tuned by a grid search over {0.1, 0.2, · · · , 0.9}. For the DropNode subroutine in our AugGCR, we did not tune the drop ratio; δ = 0.5 is used in all experiments. We report the test accuracy of the models with the best performance on the validation sets. All experiments are repeated 10 times, and the average accuracy and standard deviation are reported. The standard deviation of the Gaussian noise in AddNoise is searched over {0.001, 0.01, 0.1, 1}. AdaEdge has two hyperparameters, the numbers of added and removed edges, which are searched over {0, 0.2|E|, 0.4|E|, 0.8|E|}.
Baselines. We evaluate the effectiveness of the proposed method by integrating it with three widely used graph neural networks: GCN [4], SGC [33], and GraphSAGE [2]. We compare our method with four augmentation strategies: AdaEdge [27], which adaptively adjusts the graph topology by removing/adding edges based on its predictions; DropEdge [18], which randomly removes a fixed number of edges from the graph; DropNode [24], which randomly drops the entire feature vector of a node with probability δ and scales the remaining features by a factor of 1/(1 − δ); and AddNoise, which adds random Gaussian noise ∼ N(0, σ) to node features for augmentation. Table 3 shows that our method significantly improves the node classification performance of most base models (denoted as Vanilla). AugGCR consistently enhances the F1-score for almost all GNNs compared to the other augmentation strategies. More specifically, we observe that AugGCR improves the performance by 6.0% compared to the vanilla GraphSAGE model on IMDB. In addition, AugGCR provides an 8.5% absolute performance improvement on both GCN and SGC compared to Vanilla on CHAMELEON. Due to the saturated performance of GraphSAGE and SGC on DBLP, room for improvement over Vanilla is limited; in the GCN case, however, our method yields a high performance gain of 7.7% compared to Vanilla. On heterophilic graph datasets such as SQUIRREL and CHAMELEON, the performance of GNNs relying on the homophily assumption, i.e., that neighboring nodes have similar labels, is generally inferior to that of an MLP that does not utilize the graph structure. Nevertheless, our method successfully learns an augmentor that adjusts the input features by leveraging the graph topology and improves the performance of all Vanilla models on heterophilic graph datasets. In summary, AugGCR shows 5.8%, 2.6%, and 3.1% improvements on average for GCN, GraphSAGE, and SGC over all datasets.
It is worth noting that no significant gain was observed for the baseline augmentations even when we employed the TTA setting on those methods for a fair comparison. We also evaluate the effectiveness of AugGCR on the widely used citation network datasets Citeseer, Cora, and Pubmed. We observe that AugGCR consistently enhances the performance (accuracy) compared to the Vanilla model. More specifically, AugGCR improves the F1-score by 2.1% on Citeseer, 1.3% on Cora, and 1.4% on Pubmed compared to the Vanilla model.

A. RESULTS ON NODE CLASSIFICATION
In addition, we examine the training time of AugGCR. Since meta-learning optimization usually takes quite a long time, AugGCR requires more training time than Vanilla (GCN). See Table 5 for details.

B. COMPARISON WITH SEMI-SUPERVISED METHODS
We verify the efficacy of AugGCR on semi-supervised node classification by comparing it with a strong baseline and two state-of-the-art methods. GCN+CR minimizes the cross-entropy on labeled data as well as the consistency loss on unlabeled data augmented by DropNode; GraphMix [37] is a semi-supervised regularization method based on linear interpolation between pairs of samples on graphs; GRAND [24] minimizes both classification and consistency losses with a pre-defined augmentation for representation learning. Our AugGCR and GraphMix use GCNs as the base model. GRAND proposes both an architecture and a semi-supervised learning method; as mentioned in [24], a GCN equipped with GRAND underperforms their own architecture, so we did not include it in our experiments. Table 4 presents the semi-supervised node classification performance. We observe that our method outperforms all baselines on all five datasets. In particular, on the AIRPORT dataset, the proposed method improves the F1-score by 5.5% on average compared to all baselines. Overall, our method provides 6.2%, 6.3%, and 2.5% improvements on average compared to the baselines GCN+CR, GraphMix, and GRAND, respectively.

C. ABLATION STUDY
In this paper, we present AugGCR, which comprises a new augmentor architecture and a meta-learning formulation to optimize it. We provide an ablation study of each component of AugGCR in Table 6. We also examine the effectiveness of our architecture and formulation by comparing with a standard graph neural network and its supervised loss, e.g., GCN and the cross-entropy, as described in Table 7. Lastly, in Table 8, we examine the efficacy of the learnable coefficients α_k and β in Algorithm 1. Table 6 summarizes the results of an ablation study examining the contribution of each component of our framework. Experiments were conducted with GCN on all datasets with respect to three components: Augmentor, graph Topology, and Consistency. 'Vanilla' denotes standard training of GCN without augmentation. '+A' trains the GNN with our augmentor, but neither the consistency loss nor the graph topology is used; specifically, the loss for augmentation has no consistency regularization term, i.e., L_aug := L_xe in (3), and the learnable mixed-order propagation (Σ_{k=0}^{K} α_k Â^k X) is removed from Algorithm 1. '+A&T' trains the GNN with our augmentor and topology information leveraged through the learnable mixed-order propagation. Similarly, '+A&C' trains the GNN while learning the augmentor with consistency regularization, L_aug := L_xe + L_con as in (5), but without graph topology. Our experiments show that, on average, using all components achieves 2.1% and 4.0% improvements compared to +A&C and +A&T, respectively. Moreover, AugGCR improves the performance on average by 4.9% compared to +A. On IMDB and CHAMELEON, the gain from topology information in AugGCR is not significant: the gap between AugGCR and +A&C is statistically insignificant since the difference is much smaller than the variance of the accuracy. Table 7 compares our AugGCR with a vanilla GCN without augmentation and two variants of AugGCR.
The results of the vanilla GCN without augmentation are shown in the first row. The second row of Table 7 shows the performance of AugGCR without our formulation; in other words, the stack of AugGCR's augmentor and the learner GNN (GCN) is trained by standard supervised training with the cross-entropy loss. This construction improves the performance on IMDB, DBLP, and CHAMELEON, whereas degradation is observed on SQUIRREL. The results in the third row of Table 7 are obtained with the AugGCR formulation but without our augmentor architecture: a standard GCN is used to learn the augmentation, i.e., two different GCNs are used, one as the learner GNN and the other as the augmentor. This demonstrates the efficacy of our formulation (or learning method), which can learn effective augmentation even with a plain graph neural network as the augmentor. The augmentation with both our architecture and formulation, shown in the last row, consistently provides the largest improvement on all five datasets.

3) ABLATION STUDY OF LEARNABLE COEFFICIENTS α_k AND β
We conduct an ablation study of the learnable coefficients α_k and β in Algorithm 1. In Table 8, ''−α_k'' denotes adopting the constant 1/(K + 1) instead of the learnable parameter α_k; ''−β'' means generating the augmented node features only from X̂ in Algorithm 1; ''−α_k, β'' means adopting both ''−α_k'' and ''−β''. The table shows that AugGCR achieves a 10.83% improvement on average compared to AugGCR without β, which implies that controlling the strength of augmentation by adjusting the influence of the original node features on the augmented node features is essential for proper augmentation. In particular, it provides a 28.89% performance gain on the IMDB dataset. In addition, α_k consistently provides improvements on all five datasets. Thus, the learnable coefficients α_k and β are both important to AugGCR, especially β.

D. ANALYSIS 1) PREVENTING OVERFITTING
In Figure 2, from left to right, the plots show the training loss, validation loss, and validation F1-score. The learning curves are shown for various augmentation methods in their best-performing configuration (*) with GCN on DBLP (e.g., for DropEdge*, the drop ratio δ is set to 0.6). The training loss decreases rapidly along the training epochs for all methods except AugGCR and DropNode. As expected from the training loss, the validation loss curves of Vanilla, DropEdge, and AddNoise increase, which implies that these models overfit to the training data. Interestingly, the F1-score gaps between DropNode and the other baselines suffering from overfitting are negligible despite their large differences in validation loss. In contrast, the training loss of AugGCR decreases more slowly than the other, steeper curves, and its validation loss drops steadily as the augmentor feeds in adjusted input features. This demonstrates the effect of AugGCR on preventing overfitting.

FIGURE 3.
Learning curves of AugGCR on the DBLP dataset: cross-entropy loss on (a) train data and (b) validation data, and (c) the distance between the original features X_i and the augmented features X̃_i, i.e., Σ_i |X_i − X̃_i|. AugGCR (blue) shows stable learning curves, whereas AugGCR without the consistency loss (orange) converges more slowly and even diverges. In (c), the large distance between original and augmented samples indicates that AugGCR without the consistency loss (orange) generates excessive augmentations caused by meta-overfitting. This shows the effectiveness of consistency regularization for learning augmentation.

As shown in the rightmost plot of Figure 2, the F1-score of AugGCR significantly outperforms the other augmentation baselines.

2) CONSISTENCY REGULARIZATION
We analyze the efficacy of consistency regularization, a component of our proposed framework. In FIGURE 3, we illustrate the training and validation loss curves of AugGCR and AugGCR without consistency regularization, on the DBLP dataset with GCN as the learner model. The rightmost plot shows the distance between the original and augmented node features for AugGCR and AugGCR w/o C; the distance is measured by MAE (mean absolute error), i.e., Σ_i |X_i − X̃_i|. As shown in FIGURE 3(c), the consistency loss prevents excessive feature augmentation. It is noteworthy that AugGCR utilizes the consistency regularization only for training the augmentor and not the classifier. Our preliminary experiment showed no significant performance gain from adopting consistency regularization for the classifier; see Table 9 for more details.

3) COMPARISON WITH OTHER REGULARIZATION TECHNIQUES
Dropout and weight decay are also standard regularization techniques for the weak-generalization problem of deep neural networks. To study the difference between AugGCR and these techniques, we plot the learning curves of GCN with AugGCR, dropout, and weight decay. FIGURE 4 shows the train and test losses of GCN on the AIRPORT dataset under various configurations. We observe that AugGCR has considerable generalization ability, while the other regularization techniques still suffer from overfitting. We also evaluate the node classification performance (F1-score) of dropout and weight decay, tuned by grid search with the dropout ratio in {0, 0.2, 0.4, 0.6, 0.8} and the weight decay in {0, 0.0001, 0.0005, 0.001}; their performance gain is small compared to that of AugGCR. See Table 10 for details.

4) MIXUP COEFFICIENT β
Our proposed framework can adaptively augment the node feature matrix via the mixup coefficient β, a learnable parameter that averages the perturbed feature matrix X̂ and the original node feature matrix X, i.e., X̃ = σ(β)X̂ + (1 − σ(β))X. That is, β adjusts the influence of the original node features on the augmented features. To analyze the change of β, we plot the cross-entropy losses of AugGCR and Vanilla on the training and validation data, with GCN as the base model, on DBLP.
We observe that β adjusts dynamically as the model is optimized.

FIGURE 4.
Comparison of learning curves between AugGCR (blue) and other regularization techniques: dropout (first row) and weight decay (second row). It shows the cross-entropy losses on the train set (first column) and the test set (second column).

FIGURE 5.
Learning curves of GCN, AugGCR with GCN as a learner model, and β. Training and validation losses are illustrated with solid and dashed lines, respectively. AugGCR (blue) adaptively augments the node feature matrix via the learnable mixup coefficient β (red). As the validation loss of Vanilla (green) begins to increase, AugGCR decreases β to alleviate overfitting by focusing on the perturbed features X̂.

In FIGURE 5, we show that, after several updates, β decreases to attend to the perturbed features more than the original features. This can be interpreted as follows: leveraging the node feature matrix perturbed by the learnable mixed-order propagation helps keep the model from overfitting to the input features.

V. CONCLUSION
We propose AugGCR, a novel framework that learns how to augment the input features of graph-structured data to enhance the generalization ability of the learner model. The procedure alternately optimizes the augmentor and the learner model, which can be formulated as meta-learning. We also adopt a consistency regularizer to make the augmentor generate data on which the learner's predictions remain consistent. AugGCR is applicable to any GNN and yields up to 8.46% improvement in F1-score over the vanilla model. We analyze the behavior of AugGCR with respect to generalization ability. In future work, we plan to adopt self-supervised learning to further enhance the generalization ability of our augmentor.

ACKNOWLEDGMENT
(Hyeonjin Park and Seunghun Lee equally contributed to this work.)