Graph Neural Networks With Triple Attention for Few-Shot Learning

Hao Cheng, Joey Tianyi Zhou, Senior Member, IEEE, Wee Peng Tay, Senior Member, IEEE, and Bihan Wen, Member, IEEE

Abstract—Recent advances in Graph Neural Networks (GNNs) have achieved superior results in many challenging tasks, such as few-shot learning. Despite its capacity to learn and generalize a model from only a few annotated samples, GNN is limited in scalability, as deep GNN models usually suffer from severe over-fitting and over-smoothing. In this work, we propose a novel GNN framework with a triple-attention mechanism, i.e., node self-attention, neighbor attention, and layer memory attention, to tackle these challenges. We provide both theoretical analysis and illustrations to explain why the proposed attentive modules can improve GNN scalability for few-shot learning tasks. Our experiments show that the proposed Attentive GNN model outperforms the state-of-the-art few-shot learning methods using both GNN and non-GNN approaches. The improvement is consistent over the mini-ImageNet, tiered-ImageNet, CUB-200-2011, and Flowers-102 benchmarks, using both ConvNet-4 and ResNet-12 backbones, and under both the inductive and transductive settings. Furthermore, we demonstrate the superiority of our method for few-shot fine-grained and semi-supervised classification tasks with extensive experiments.

Index Terms—Graph neural network, self-attention mechanism, few-shot classification, meta learning.

I. INTRODUCTION
The success of deep learning lies with the promise of training deep neural networks from a large-scale dataset in a supervised manner. However, conventional deep learning methods may suffer from sample inefficiency, so the trained models can hardly generalize to new tasks with limited supervision. Such a task is known as few-shot learning [1], which attempts to learn a classifier that predicts the novel labels of query samples using only a few labeled support samples of each class. To tackle the few-shot learning challenges, various methods have been recently proposed [2], [3], [4], [5], including the popular meta-learning framework [2] based on episodic training.
Classic few-shot methods [2], [3], [4], [6] applied Convolutional Neural Networks (CNNs) for image classification. More recent works proposed to apply Graph Neural Networks (GNNs) [5], [7], [8], [9] or Graph Convolutional Networks (GCNs) [10], [11] to process data with rich relational structures in few-shot scenarios. Compared with CNNs, graph networks are more powerful in exploiting the intra- and inter-class relationships amongst samples, and are thus more effective for few-shot learning. Current GNN-based few-shot methods improve accuracy and generalizability from the perspectives of node/edge updates [8], [9], [12], [13] and graph structure design [5], [14], [15]. In general, graph-based methods model the feature embeddings of samples as vertices in a graph and propagate label information between nodes by aggregating node or edge features from neighbor nodes with graph convolution. Unlike general-purpose GNN models for other tasks, few-shot scenarios provide no structural priors for the adjacency matrices. Therefore, existing few-shot GNN methods usually construct fully connected graph models and utilize neighbor similarity information for graph updates.
GNNs have achieved superior performance on many tasks such as node classification [16], skeleton action recognition [17], point cloud classification [18], and video classification [19]. However, several works [20], [21], [22], [23] reported over-fitting and over-smoothing issues when learning deeper GNN models (i.e., poor scalability), as shown in Fig. 1, since applying GCN or GNN is a special form of Laplacian smoothing that averages the neighbors of the target nodes. Furthermore, graph-based few-shot methods usually model each task as a fully connected graph, i.e., each node is adjacent to all other nodes, making them more prone to this issue. Some recent works [21], [23], [24], [25], [26] have been proposed to alleviate the above issues and enable deeper graph layers. DropEdge [21] attempted to alleviate these obstacles by randomly dropping graph edges during training, showing improvement on node classification tasks. DeepGCNs [23] borrowed ideas from popular CNNs (e.g., ResNet [27] and DenseNet [24]) and adopted residual/dense connections and dilated convolutions in deep GCN layers for point cloud semantic segmentation tasks. To the best of our knowledge, no work to date has addressed these issues for few-shot learning using the graph attention mechanism.
In this work, we propose an attentive graph neural network (AGNN) with a novel triple-attention mechanism, i.e., node self-attention, neighbor attention, and layer memory attention, for highly scalable and effective few-shot learning. Specifically, node self-attention exploits inter-node and inter-class correlation beyond CNN-based features and class information. Neighbor attention imposes sparsity on the adjacency matrices to attend to the most related neighbor nodes across layers. Layer memory attention applies dense connections to earlier-layer "memory" of node features and adjacency matrices. Furthermore, we explain how the attentive modules help GNN generate discriminative features and alleviate over-smoothing and over-fitting with feature visualization and theoretical analysis. We conduct extensive experiments demonstrating that the proposed AGNN outperforms state-of-the-art methods over four datasets, including two standard few-shot classification benchmarks, mini-ImageNet and tiered-ImageNet, and two fine-grained datasets, CUB and Flowers-102.
The contributions of this paper are summarized as follows:

• We propose an AGNN model that contains a triple-attention mechanism to tackle the over-fitting and over-smoothing problems and improve the few-shot performance of GNN models.

• We provide both theoretical analysis and visualizations to explain the effectiveness of the proposed AGNN for alleviating over-smoothing and over-fitting problems in few-shot scenarios for GNN-based models.

• Extensive experiments are conducted over four benchmarks for standard, fine-grained, and semi-supervised few-shot learning tasks under both inductive and transductive settings. Results show that our AGNN method achieves state-of-the-art performance over all benchmarks under all settings.

This paper is an extension of our recent conference work [28] that briefly investigated three attention mechanisms on GNNs for few-shot classification. Compared with this earlier work, here we provide more details and discussions about the relation to existing graph-based models. We also improve the performance of AGNN by modifying the proposed triple attentions. Specifically, we adopt the self-attention transformer block to replace the self-correlation computation to provide a more flexible task-specific embedding for graph initialization. We combine the latter two attention mechanisms to propose a novel enhanced layer-wise sparsity mechanism. Compared with the previously proposed neighbor attention mechanism with a fixed sparsity rate, each AGNN layer adopts a variable sparsity rate, which provides flexible relationships between nodes across different AGNN layers. For the layer memory attention mechanism, in addition to using the output feature as the layer memory, the output adjacency matrix of the current layer is also considered as part of the layer memory, which is passed to the next layer as edge knowledge. Furthermore, we include more extensive experimental results to illustrate the properties of the proposed method with extensive evaluation and comparisons on additional datasets under more challenging settings, e.g., fine-grained few-shot classification and semi-supervised few-shot classification.
The remainder of this article is organized as follows. Section II summarizes the related work, including few-shot learning, graph neural networks, and attention mechanisms on graph models. Section III gives a brief overview of the few-shot learning task and the general GNN model. Section IV describes the proposed AGNN with triple attention mechanisms, how to apply it to the few-shot task, and its relation to existing graph-based models. Section V gives a theoretical analysis of the proposed attention mechanisms and why they help to alleviate the over-smoothing and over-fitting problems for few-shot learning with graph models. Section VI demonstrates the effectiveness of the proposed AGNN model for few-shot classification over four benchmarks under standard, fine-grained, and semi-supervised few-shot settings. Section VII concludes this article.

II. RELATED WORK

A. Few-Shot Learning
Few-shot learning is a challenging task that aims to recognize novel categories with limited labeled examples of each class. Following the meta-learning framework [2], existing methods can be generally divided into three groups: gradient-based methods [29], [30], [31], [32], data augmentation-based methods [33], [34], [35], and metric-based methods [2], [3], [4], [6], [11], [36], [37], [38], [39], [40], [41], [42], [43]. Recently, a rising trend is to apply attention mechanisms to solve few-shot tasks. For example, CAN [44] generated cross attention maps for each pair of nodes to highlight the object regions for classification. Inspired by the non-local block, Binary Attention Network [45] considered a non-local attention module to learn the similarity between node embeddings globally. Considering the attention between query samples and each support class, CTM [37] found task-relevant features based on both intra-class commonality and inter-class uniqueness. FEAT [11] utilized a self-attention Transformer to learn task-specific adaptive instance embeddings. RENet [46] employed self-correlation and cross-correlation modules to extract relational feature embeddings within and between images. SET-RCL [42] proposed a style-aware episodic training strategy with robust contrastive learning to learn a style-invariant feature representation for cross-domain few-shot learning tasks. CubMeta [43] introduced the concept of curriculum learning into meta-learning and proposed an effective self-paced meta-learning method to obtain stronger meta-learners for few-shot classification. DUAL ATT-NET [47] adopted a dual attention to explicitly model the crucial relations of fine-grained parts and implicitly capture discriminative yet subtle fine-grained details. While these methods are all based on CNNs for feature embedding, most recent works exploit GNNs for more effective modeling of inter- and intra-class relations in few-shot classification. It is unclear how these attention schemes can be extended to GNN frameworks.

B. Graph Neural Network
GNN [48], [49] was first proposed for learning with graph-structured data and has proved to be a powerful technique for aggregating information from neighboring vertices in the graph. Recently, there has been growing interest in GNNs [5], [7], [8], [9], [12], [13], [14], [15] for the few-shot learning task. GNN was first used for few-shot learning in [5], which aims to learn a complete graph network of nodes with both feature and class information. Based on the episodic training mechanism, meta-graph parameters were trained to predict the label of a query node on the graph. Later, TPN [12] introduced the transductive setting into few-shot learning and constructed a top-k graph to propagate labels from the support set to the query set in the graph. Besides node label information, EGNN [13] exploited edge information of the directed graph by defining both class and edge labels to fully explore the internal information of the graph. DPGN [14] constructed a dual complete graph network to combine instance-level and distribution-level relations. MCGN [15] combined the GNN and a conditional random field (CRF) as a unified model, modeling the graph affinity as the pair-wise marginal probabilities in the CRF for feature updates. TLRM [7] proposed a sample-to-task relation module to capture the task-level relation representations in each GNN layer. TRPN-D [8] adopted a decoupling training strategy to preserve the diversity across different few-shot tasks and enhance the generalizability of GNN models. GCLR [9] applied a VAE-based encoder-decoder module to enrich the node representations in the latent feature space. DR-CapsGNN [19] extended the capsule network to the few-shot video classification task and explored local-global relations while preserving the detailed properties of videos.

C. Attention Mechanism on Graph Models
The attention mechanism [50] aims to focus on image regions that are more task-related by learning a binary matrix or a weighted matrix. In particular, self-attention [47], [51], [52], [53] considers the inherent correlation (attention) of the input features themselves and is mostly applied in deep models. In GCN scenarios, GAT [54] used a graph attention layer to learn a weighted parameter vector based on entire neighborhoods to update node representations. ReGAT [55] modeled multi-type visual object relations via a graph attention mechanism to learn question-adaptive relation representations for VQA tasks. SAGPool [56] selected the top-k percentage of nodes based on the self-attention score to generate a mask matrix for graph pooling. Despite the promising results achieved by these methods, no attention-based GNN has been proposed specifically for few-shot learning. In this work, a novel triple-attention mechanism in GNN is introduced to alleviate the over-fitting and over-smoothing challenges in few-shot classification.

III. PRELIMINARIES
We first provide the formal problem definition of few-shot classification tasks, followed by an overview of the general GNN models.

A. Problem Definition
A general few-shot classification task consists of a large-scale, labeled training set with classes $\mathcal{C}_{\mathrm{train}}$ and a few-shot testing set with classes $\mathcal{C}_{\mathrm{test}}$, which are mutually exclusive, i.e., $\mathcal{C}_{\mathrm{train}} \cap \mathcal{C}_{\mathrm{test}} = \emptyset$. Few-shot image classification algorithms aim to train a classification model over the training set that can be applied to the testing set with only a few labeled samples of each given class. For the testing set, each few-shot test task $\mathcal{T}$ follows the N-way K-shot setting, where N is the number of selected classes and K is the number of labeled samples per class, often set from 1 to 5. That is, the testing set contains a labeled N-class support set $\mathcal{S} = \{x_i, y_i\}_{i=1}^{NK}$ with K samples of each class, and a query set $\mathcal{Q} = \{x_j, y_j\}_{j=1}^{Q}$ with Q unlabeled query samples drawn from the same N classes to be predicted, denoted as $\mathcal{T} = \mathcal{S} \cup \mathcal{Q}$. The values of N and K are both very small for few-shot learning.
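To make the task construction concrete, below is a minimal sketch in Python of sampling one N-way K-shot episode from a labeled dataset; the function and variable names are hypothetical and not taken from the paper's implementation.

```python
# Minimal sketch (hypothetical names) of building one N-way K-shot episode.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_query=1):
    """dataset: iterable of (image, label) pairs; returns support/query lists."""
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)
    classes = random.sample(list(by_class.keys()), n_way)   # pick N novel classes
    support, query = [], []
    for new_lbl, cls in enumerate(classes):                  # relabel to 0..N-1
        picks = random.sample(by_class[cls], k_shot + q_query)
        support += [(x, new_lbl) for x in picks[:k_shot]]    # K labeled shots
        query += [(x, new_lbl) for x in picks[k_shot:]]      # Q samples to predict
    return support, query
```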
A popular and effective way is to apply the meta-learning framework to exploit information in the training set and improve generalizability. Specifically, meta-learning methods separate the training set $\mathcal{C}_{\mathrm{train}}$ into various few-shot training tasks $\mathcal{C}_{\mathrm{train}} = \{\mathcal{T}_{\mathrm{train}}^{l}\}_{l=1}^{L}$ to mimic the test setting, and apply episodic training [2] to learn model parameters from a large number of simulated meta-tasks by minimizing the classification error over the query set $\mathcal{Q}$ of the L meta-tasks on the training set $\mathcal{C}_{\mathrm{train}}$ as

$$\min_{\theta} \sum_{l=1}^{L} \sum_{(x_j, y_j) \in \mathcal{Q}^{l}} \ell\left(f_{\theta}\left(x_j; \mathcal{S}^{l}\right), y_j\right), \qquad (1)$$

where $\ell$ denotes the cross-entropy loss function and $f_{\theta}$ represents the model $f(\cdot)$ with the parameters $\theta$.
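As an illustration of episodic training under this objective, here is a minimal sketch (interfaces and names are assumptions, not the authors' code) of a meta-training loop that samples one task per iteration and minimizes the query-set cross-entropy:

```python
# Minimal sketch (assumed interfaces) of episodic meta-training.
import torch
import torch.nn.functional as F

def meta_train(model, task_sampler, optimizer, num_episodes=10000):
    for _ in range(num_episodes):
        support_x, support_y, query_x, query_y = task_sampler()  # one N-way K-shot task
        logits = model(support_x, support_y, query_x)            # predict query labels
        loss = F.cross_entropy(logits, query_y)                  # classification error on Q
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```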

B. General GNN Models
Fig. 2. The overall framework of the proposed Attentive GNN (AGNN) for the few-shot learning task. This figure shows an example of a 3-way 1-shot setting with one query sample. For each support and query sample, the color and shape represent its corresponding class. X and Y denote the feature embedding extracted from the backbone and the one-hot label embedding, respectively. The grey box denotes the node self-attention module. Specifically, the node self-attention module first applies self-attention and self-correlation blocks on features and labels to generate attention maps, which are then fed into a fusion block to generate the attention map $C_f$ for graph initialization. AGNN then predicts the query sample after N AGNN layers. Detailed information on each AGNN layer is shown in Fig. 3.

GNNs [48], [49], [57] are neural networks for learning with graph-structured data. Similar to the classic CNNs that exploit local features (e.g., image patch textures, sparsity) for representation, researchers designed GNNs to mimic the behavior of CNNs on graph-structured data. In a GNN model, we consider a graph G = (V, E) with nodes V and edges E. Each data sample (e.g., an image) is represented as a node in the graph, and the GNN mines the neighborhood information of each node
based on the graph structure, which is crucial for building discriminative and generalizable features for many tasks, e.g., node classification, graph classification, etc. To be specific, consider a multi-layer GNN model; following previous work [58], [59], the output of the k-th GNN layer can be represented as

$$X^{(k+1)} = \rho\left(\hat{A}^{(k)} X^{(k)} W^{(k)}\right), \qquad (2)$$

where $X^{(k)} = [x_1^{(k)}; \ldots; x_V^{(k)}] \in \mathbb{R}^{V \times d_k}$ denotes the input feature and $x_i^{(k)}$ denotes the feature of node i in the k-th layer, with V and $d_k$ being the number of nodes and the feature dimension at the k-th layer. Besides, $\hat{A}^{(k)} \in \mathbb{R}^{V \times V}$ is called the weighted adjacency matrix, $W^{(k)} \in \mathbb{R}^{d_k \times d_{k+1}}$ is the trainable linear transformation, and ρ denotes a non-linear function, e.g., ReLU or Leaky ReLU.
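The generic layer in (2) can be written compactly in code. The following is a minimal sketch (a dense adjacency and Leaky ReLU are assumptions), not the authors' implementation:

```python
# Minimal sketch of a generic GNN layer X^(k+1) = rho(A_hat @ X @ W).
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^(k)
        self.act = nn.LeakyReLU(0.2)                          # rho

    def forward(self, x, adj):
        # x: [V, d_k] node features; adj: [V, V] weighted adjacency A_hat^(k)
        return self.act(adj @ self.linear(x))
```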
There are different ways to construct the adjacency matrix $A^{(k)}$. For example, in the classic GCN [48], $A^{(k)}_{i,j}$ indicates whether nodes i and j are directly connected in the graph. Besides, $A^{(k)}_{i,j}$ can be the similarity or distance between nodes i and j as in [2], [5], i.e.,

$$A^{(k)}_{i,j} = f_{\theta}\left(\phi(x_i), \phi(x_j)\right),$$

where φ denotes the node feature embedding, and the parameters θ of the distance metric function f can be fixed or learned. One classic example is to apply cosine correlation as the similarity metric, while a more flexible method is to learn a multi-layer perceptron (MLP) as the metric, i.e., $A^{(k)}_{i,j} = \mathrm{MLP}_{\theta}\left(\left|\phi(x_i) - \phi(x_j)\right|\right)$, where $|\cdot|$ denotes the element-wise absolute value function. More recent works applied a Gaussian similarity function to construct the adjacency matrix; e.g., TPN [12] proposed the similarity function $A_{i,j} = \exp\left(-\frac{1}{2} d\left(\phi(x_i)/\sigma_i, \phi(x_j)/\sigma_j\right)\right)$, with σ being an example-wise length-scale parameter learned by a relation network and used for normalization.
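Below is a minimal sketch of the learned MLP metric $A_{i,j} = \mathrm{MLP}_{\theta}(|\phi(x_i) - \phi(x_j)|)$ mentioned above; the hidden width and the row-wise softmax normalization are assumptions, not details from the paper.

```python
# Minimal sketch of an MLP-based adjacency over pairwise absolute differences.
import torch
import torch.nn as nn

class AdjacencyMLP(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        # x: [V, d] node embeddings; |x_i - x_j| has shape [V, V, d]
        diff = (x.unsqueeze(1) - x.unsqueeze(0)).abs()
        scores = self.mlp(diff).squeeze(-1)        # raw similarity scores [V, V]
        return torch.softmax(scores, dim=-1)       # assumed row-wise normalization
```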

IV. ATTENTIVE GRAPH NEURAL NETWORKS
Based on the GNN model, we propose an AGNN model containing triple attention mechanisms: node self-attention, neighbor attention, and layer memory attention. Fig. 2 shows the pipeline of AGNN for few-shot learning, and Fig. 3 illustrates the details of one AGNN layer. We discuss each attention mechanism, followed by how AGNN is applied to few-shot learning.

A. Node Self-Attention
Denote the feature of each sample (i.e., node) i as $x_i \in \mathbb{R}^{d}$, and the one-hot vector of its corresponding label as $y_i \in \mathbb{R}^{N}$, where d is the feature dimension, N is the total number of classes, and $1 \le i \le V$. The one-hot vector sets only the element corresponding to the ground-truth category to 1, while all other elements are set to 0. Note that the one-hot encoding of a query sample is initialized with the uniform distribution (i.e., all values in the vector are set to 1/N). To obtain a task-specific feature as a suitable graph initialization, we propose node self-attention to exploit the sample correlation in the initial stage at the feature and category levels, respectively. Denote the sample and label matrices as

$$X = [x_1; x_2; \ldots; x_V] \in \mathbb{R}^{V \times d}, \quad Y = [y_1; y_2; \ldots; y_V] \in \mathbb{R}^{V \times N}.$$

We first consider the self-attention between feature embeddings of nodes in a graph. Inspired by the popular and powerful Transformer architecture [50], we employ two linear projection layers with mapping functions $W_Q, W_K \in \mathbb{R}^{d \times d_l}$ and compute the self-attention matrix as

$$C_x = \mathrm{softmax}\left(\frac{(X W_Q)(X W_K)^{T}}{\sqrt{d_l}}\right),$$

where $d_l$ is the feature dimension of the latent space. For simplicity, we set $d_l = d$ in this paper. Here, softmax(·) denotes a row-wise softmax operator, which is also applied to the label correlation matrices.
Fig. 3. Illustration of the k-th AGNN layer. The grey and green boxes denote neighbor attention and layer memory attention, respectively. Specifically, in the k-th layer, the proposed AGNN method first adopts an MLP block to encode the node similarity and generate the adjacency (Adj.) matrix $A^{(k)}$. Then, before applying the sparsity constraint to construct the optimal Adj. matrix $\hat{A}^{(k)}$, AGNN reweights $A^{(k)}$ as $U^{(k)}$ by considering the "earlier edge memory" of the last GNN layer, i.e., the optimal Adj. matrix $\hat{A}^{(k-1)}$. Then neighbor attention applies sparsity constraints with a variable sparsity ratio $\beta^{(k)}$ to attend to the most related nodes and generate the output Adj. matrix $\hat{A}^{(k)}$. AGNN then applies the layer memory attention module to update the node features of the graph.
For a correlation matrix such as $M = YY^{T}$, the row-wise softmax operator is defined as

$$\mathrm{softmax}(M)_{i,j} = \frac{\exp\left(M_{i,j}\right)}{\sum_{j' \in \mathcal{N}_i} \exp\left(M_{i,j'}\right)},$$

where $\mathcal{N}_i$ denotes the set of nodes that are connected to the node $x_i$. We then calculate the label correlation matrix as

$$C_y = \mathrm{softmax}\left(Y Y^{T}\right).$$

Note that the element (i, j) of the self-correlation function $YY^{T}$ for the one-hot label matrix Y indicates whether samples i and j are in the same class, which means that the matrix $YY^{T}$ is a binary matrix. However, referring to the study in [60], a binary correlation vector for each sample (i.e., each row of the matrix $YY^{T}$) only contains information about the correct class but no information about other classes. To solve this problem and achieve a good initialization of the graph, we enlarge the correlation weights of the inter-class samples and reduce the correlation weights of the intra-class samples by introducing the softmax function. In detail, for the matrix $YY^{T}$, if a row shares a common class with many other samples (corresponding to K-shot with K > 1), the softmax results in small equal weights. In this way, each sample combines the information from all neighbors. The proposed node self-attention module exploits the correlation amongst both image features and label vectors, which should share the information from different perspectives for the same node. The next step is to fuse $C_x$ and $C_y$ using trainable 1 × 1 kernels as

$$C_f = f_{\tau}\left(\left[C_x, C_y\right]\right),$$

where $[C_x, C_y]$ denotes the concatenated attention maps and $f_{\tau}$ is a 1 × 1 convolution layer. With the fused self-attention map, both the feature and the label vectors are updated on the nodes as

$$X^{(1)} = C_f X, \quad Y^{(1)} = \alpha Y + (1 - \alpha)\, C_f Y, \qquad (8)$$

where α ∈ [0, 1] is a weighting parameter. Unlike the feature update, the label update preserves the initial labels in the support set, which are the ground truth, using the weighting parameter α to regularize the label update. The updated sample features $X^{(1)}$ and labels $Y^{(1)}$ are concatenated to form the node features in $\mathbb{R}^{V \times (d+N)}$ for the first AGNN layer.
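To make the module concrete, here is a minimal sketch in PyTorch of the node self-attention described above; the layer shapes, the scaling by $\sqrt{d_l}$, and the fusion details are assumptions rather than the paper's released code.

```python
# Minimal sketch (assumed details) of node self-attention: feature attention C_x,
# label correlation C_y, 1x1 fusion to C_f, and the alpha-weighted update in (8).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeSelfAttention(nn.Module):
    def __init__(self, d, alpha=0.5):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)        # W_Q
        self.wk = nn.Linear(d, d, bias=False)        # W_K
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)   # f_tau: 1x1 fusion kernel
        self.alpha = alpha

    def forward(self, x, y):
        # x: [V, d] backbone features; y: [V, N] (soft) one-hot labels
        cx = F.softmax(self.wq(x) @ self.wk(x).t() / x.size(1) ** 0.5, dim=-1)
        cy = F.softmax(y @ y.t(), dim=-1)                      # label correlation
        cf = self.fuse(torch.stack([cx, cy])[None]).squeeze()  # fused map C_f: [V, V]
        x_new = cf @ x                                         # feature update
        y_new = self.alpha * y + (1 - self.alpha) * (cf @ y)   # label update, (8)
        return torch.cat([x_new, y_new], dim=-1)               # node features [V, d+N]
```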

B. Neighbor Attention Via Sparsity
Similar to various successful GNN frameworks, the proposed AGNN applies an MLP to learn the adjacency matrix $A^{(k)}_{i,j}$ for feature updates. When the GNN model becomes deeper, the risk of over-smoothing increases, as the GNN tends to mix information from all neighbor nodes and eventually converge to a stationary point in training. To tackle this challenge, we propose a novel neighbor attention via two strategies, i.e., a sparsity constraint and memory attention, to attend to the most related nodes, as illustrated in Fig. 3.
To exploit the neighbor information of the graph across all AGNN layers, we consider the relationship between two nodes in both the current and previous layers when computing the adjacency matrix in each layer. Specifically, when calculating the weight $U^{(k)}(i, j)$ between two nodes $(x_i, x_j)$ in the k-th layer using an $\mathrm{MLP}^{(k)}$, we also consider the relationship between them in the previous (k − 1)-th layer as

$$U^{(k)}(i, j) = \hat{A}^{(k-1)}(i, j) \cdot \mathrm{MLP}^{(k)}\left(\left|x_i^{(k)} - x_j^{(k)}\right|\right), \qquad (9)$$

where $\hat{A}^{(k-1)}$ is the output adjacency matrix of the last layer and $\mathrm{MLP}^{(k)}(\cdot)$ contains two convolutional blocks followed by a sigmoid layer.
Note that each element $\hat{A}^{(k-1)}(i, j)$ of $\hat{A}^{(k-1)}$ lies in [0, 1], indicating the similarity between node i and node j in the previous layer. Hence, (9) can be regarded as a regularized MLP function that reweights the similarity between two nodes according to the "edge memory" obtained from the previous layer.
With $U^{(k)}(i, j)$, we then apply a sparsity constraint to attend to the most related nodes by solving the following sparse problem:

$$\hat{A}^{(k)} = \arg\min_{A} \left\|A - U^{(k)}\right\|_F^2, \quad \mathrm{s.t.}\ \left\|A_i\right\|_0 \le \beta^{(k)} V, \ \forall i, \qquad (10)$$

where $\beta^{(k)} \in (0, 1]$ denotes the ratio of nodes maintained for feature update in the k-th AGNN layer. With the $\ell_0$ constraint, the adjacency matrix $\hat{A}^{(k)}$ has at most $\beta^{(k)} V$ non-zeros in each row, corresponding to the attended neighbor nodes.
Different from setting a fixed sparsity ratio for all AGNN layers, we change the sparsity ratio in each layer following $\beta^{(k)} = 1 - 0.1k$. Such a dynamic layer-wise scheme brings more flexibility to each AGNN layer. Specifically, we do not impose strong sparsity constraints in the first few layers, so as to aggregate more neighbor information. As the AGNN layers go deeper, the stronger sparsity constraint with a lower maintenance ratio forces the model to capture information only from the most relevant nodes, i.e., nodes with a higher probability of belonging to the same class.
The solution to (10) is obtained by projection onto an $\ell_0$ ball, i.e., keeping the $\beta^{(k)} V$ elements of each row $U^{(k)}_i$ with the largest magnitudes [61]. Since the solution to (10) is non-differentiable, we apply alternating projection for training, i.e., in each epoch, $U^{(k)}$ is first updated using back-propagation through (9), followed by (10) to update $\hat{A}^{(k)}$, which is constrained to be sparse. For simplicity, we keep the top-m values in each row of $U^{(k)}$ and set the others to 0 to construct the sparse matrix, with $m = \beta^{(k)} V$.
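The following is a minimal sketch (hypothetical names, not the authors' code) of the layer-wise neighbor attention in (9)-(10): the MLP similarity is reweighted by the previous layer's adjacency and then sparsified by keeping the top-m entries per row with $m = \beta^{(k)} V$.

```python
# Minimal sketch of neighbor attention with edge memory and top-m sparsification.
import torch

def neighbor_attention(sim_mlp, x, adj_prev, layer_idx):
    # sim_mlp: module mapping |x_i - x_j| ([V, V, d]) to a scalar score per pair
    # x: [V, d] node features; adj_prev: [V, V] sparse adjacency from layer k-1
    v = x.size(0)
    diff = (x.unsqueeze(1) - x.unsqueeze(0)).abs()
    a_k = sim_mlp(diff).squeeze(-1)             # A^(k): raw similarities [V, V]
    u_k = adj_prev * a_k                        # reweight by edge memory, as in (9)
    beta_k = 1.0 - 0.1 * layer_idx              # dynamic layer-wise sparsity ratio
    m = max(1, int(beta_k * v))
    topk = torch.topk(u_k, m, dim=-1)           # keep top-m values per row
    return torch.zeros_like(u_k).scatter_(-1, topk.indices, topk.values)
```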

C. Layer Memory Attention
To avoid the over-smoothing and over-fitting issues caused by "over-mixing" the information of neighboring nodes, another approach is to attend to the "earlier memory" of intermediate features, including both edge and node features at previous layers. Inspired by DenseNet [24], JKNet [62], GFCN [63], and the few-shot GNN [5], we densely connect the output of each GNN layer, as the intermediate GNN node features maintain a more consistent and general representation across different GNN layers. During the adjacency matrix (i.e., edge feature) update, AGNN treats the output adjacency matrix of the previous layer as "edge memory" and adopts it according to (9) to balance the relationships among neighbors in each AGNN layer before applying the sparsity constraint. Then, AGNN applies a transition function based on (2) to update the node features. In addition, we utilize a graph self-loop, i.e., the identity matrix I, to incorporate self information. Thus, the node update of the AGNN in the k-th layer is formulated as

$$F_k\left(X^{(k)}, W^{(k)}\right) = \rho\left(\left[\hat{A}^{(k)} X^{(k)},\ I X^{(k)}\right] W^{(k)}\right) \in \mathbb{R}^{V \times m}, \qquad (11)$$

where $[\cdot, \cdot]$ denotes row-wise feature concatenation and $W^{(k)} \in \mathbb{R}^{2 d_k \times m}$. Furthermore, instead of using $F_k(X^{(k)}, W^{(k)}) \in \mathbb{R}^{V \times m}$ directly as the input node feature at the (k+1)-th layer, we propose to attend to the "early memory" in a similar way as [5] by concatenating the node feature at the k-th layer as

$$X^{(k+1)} = \left[X^{(k)},\ F_k\left(X^{(k)}, W^{(k)}\right)\right] \in \mathbb{R}^{V \times (d_k + m)}. \qquad (12)$$

Equation (12) shows that the output feature size of the k-th layer $X^{(k+1)}$, the size of the MLP block for computing the adjacency matrix in (10), and the corresponding transform matrix $W^{(k)}$ are all positively correlated with the number of layers. Thus, as the number of AGNN layers k increases, more memory is needed to store the features and parameters of each GNN layer. Only V × m new features are introduced in a new layer, while the node features $X^{(k)}$ of the earlier layer are attended to as the early memory.
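Putting the pieces together, here is a minimal sketch of one AGNN layer that reuses the neighbor_attention function from the sketch above; the MLP width and activation are assumptions and this is not the released implementation.

```python
# Minimal sketch of one AGNN layer: sparse neighbor attention for the adjacency,
# a self-loop branch, and dense "layer memory" concatenation as in (11)-(12).
import torch
import torch.nn as nn

class AGNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, layer_idx, hidden=64):
        super().__init__()
        self.sim_mlp = nn.Sequential(                     # MLP^(k) on |x_i - x_j|
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2), nn.Linear(hidden, 1))
        self.linear = nn.Linear(2 * in_dim, out_dim)      # W^(k) in R^{2 d_k x m}
        self.act = nn.LeakyReLU(0.2)
        self.layer_idx = layer_idx

    def forward(self, x, adj_prev):
        # neighbor attention with edge memory (see neighbor_attention above)
        adj_k = neighbor_attention(self.sim_mlp, x, adj_prev, self.layer_idx)
        agg = torch.cat([adj_k @ x, x], dim=-1)           # [A_hat X, I X]
        new_feat = self.act(self.linear(agg))             # F_k(X, W): [V, m]
        return torch.cat([x, new_feat], dim=-1), adj_k    # layer memory concat, (12)
```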

D. AGNN for Few-Shot Learning
Following the same strategy of episodic training [2] with the meta-learning framework, we simulate N-way K-shot tasks $\mathcal{C}_{\mathrm{train}} = \{\mathcal{T}_{\mathrm{train}}^{l}\}_{l=1}^{L}$ that are randomly sampled from the training set, in which the support set includes K labeled samples (e.g., images) from each of the N classes and the query set includes unlabeled samples from the same N classes. Each task is modeled as a graph [5], in which each node represents an image sample with its label. The objective is to learn the parameters of the AGNN model using the simulated tasks, which generalize to unseen few-shot tasks.
Loss Function: We adopt a single-stage training scheme without pre-training the feature extractor and jointly train the AGNN model with the backbone network. For each simulated few-shot task $\mathcal{T}_{\mathrm{train}}^{l}$ with its query set $\mathcal{Q}^{l}$, the parameters of the backbone feature extractor, the node self-attention block $f_{\tau}$, and the M AGNN layers are trained by minimizing the summation of the cross-entropy classification loss over all query samples from each layer as

$$\mathcal{L} = \sum_{k=1}^{M} \sum_{(x_i, y_i) \in \mathcal{Q}^{l}} \ell\left(\hat{y}_i^{(k)}, y_i\right),$$

where $\hat{y}_i^{(k)}$ denotes the predicted label of the query sample $x_i$ at the k-th AGNN layer and $y_i$ is the corresponding ground-truth label. We evaluate the proposed AGNN on the few-shot task under both inductive and transductive settings, which correspond to Q = 1 and Q = Nq with q ≥ 1, respectively. For each query sample in the N-way K-shot task, we initialize the one-hot label vector y with a uniform distribution, i.e., each value is set to 1/N.
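A minimal sketch of this per-layer loss (assuming the model returns one query prediction per AGNN layer) could look like the following:

```python
# Minimal sketch (assumed interface) of summing query cross-entropy over layers.
import torch.nn.functional as F

def agnn_loss(per_layer_logits, query_labels):
    # per_layer_logits: list of [num_query, N] predictions, one per AGNN layer
    return sum(F.cross_entropy(logits, query_labels) for logits in per_layer_logits)
```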

E. Relation to Existing Graph-Based Models
GAT [54]: Different from the classic GNNs, GAT [54] exploits an attention mechanism amongst all neighbor nodes in the feature domain after the linear transformation $W^{(k)}$ and computes the weights α based on attention coefficients for the graph update as

$$x_i^{(k+1)} = \rho\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W^{(k)} x_j^{(k)}\right), \qquad (14)$$

where $\mathcal{N}_i$ denotes the set of the neighbor (i.e., connected) nodes of $x_i$. Similar to our proposed AGNN model, GAT also considers self-attention on the nodes. However, unlike our proposed method, which applies the node self-attention mechanism before the GNN layers, GAT applies a self-attention mechanism after the linear transformation W. With a shared attention mechanism parametrized by a weight vector $\vec{a}$, GAT allows all neighbor nodes to attend to the target node with attention coefficients

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}\left[W^{(k)} x_i,\ W^{(k)} x_j\right]\right)\right)}{\sum_{p \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}\left[W^{(k)} x_i,\ W^{(k)} x_p\right]\right)\right)}. \qquad (15)$$

However, GAT only considers the relationships among neighbors in the same layer and fails to utilize layer-wise information, which may lead to over-smoothing. Furthermore, GAT only applies self-attention based on node features while ignoring label information.
TPN [12]: TPN [12] first introduced a transductive mechanism to utilize the entire query set for transductive inference in few-shot learning. A graph construction module is proposed to exploit the manifold structure of the novel class space using the union of the support set and query set. Specifically, TPN applies a relation module to learn the example-wise length-scale parameter σ, which is then used to compute the node similarity matrix A as

$$A_{i,j} = \exp\left(-\frac{1}{2}\, d\left(\frac{\phi(x_i)}{\sigma_i}, \frac{\phi(x_j)}{\sigma_j}\right)\right),$$

where d(·, ·) is the Euclidean distance function. Similar to the proposed neighbor attention mechanism, TPN only keeps the top-k values in each row of A to construct a sparse k-nearest-neighbor graph. With this graph, instead of iterative label propagation, TPN applies a closed-form solution to propagate the labels from the support set to the query set:

$$F^{*} = (I - \alpha S)^{-1} Y,$$

where α is a propagation parameter and $S = D^{-1/2} A D^{-1/2}$ is the normalized symmetric matrix, in which D is a diagonal matrix with its (i, i)-th element equal to the sum of the i-th row of the affinity matrix A. TPN can predict the class of query samples by regressing directly from support features to query features in closed form without large-scale learnable parameters. However, the graph structure is fixed once the sparse similarity matrix A is computed for each iteration. Moreover, TPN does not consider the label information for graph construction or the relationships among different layers, which limits its performance.

DPGN [14]: Unlike the single-graph-based methods mentioned above, DPGN [14] proposes a dual-graph architecture including a point graph and a distribution graph to leverage both instance-level and distribution-level representations to better propagate label information. The point graph contains node features of images and follows the same steps as in (2) to update the graph, identical to our proposed AGNN model and other GNN-based methods. However, the strategy of updating the adjacency matrix in DPGN is different from our proposed method. Our proposed AGNN learns a sparse adjacency matrix $\hat{A}^{(k)}$ with an MLP in (10) by considering both the node feature information of the current layer and the adjacency information of the previous layer, while DPGN constructs a dual distribution graph by gathering the 1-vs-n relations on each node to refine the point graph by delivering distribution relations between each pair of samples.
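For illustration, the TPN-style closed-form label propagation above can be written as a few lines of code; this is a sketch of the generic formula (with α as the propagation parameter), not the implementation of TPN or of our method.

```python
# Minimal sketch of closed-form label propagation F* = (I - alpha*S)^(-1) Y.
import torch

def propagate_labels(affinity, one_hot_labels, alpha=0.99):
    # affinity: [V, V] non-negative similarity matrix A; one_hot_labels: [V, N]
    d = affinity.sum(dim=-1)
    d_inv_sqrt = torch.diag(d.clamp(min=1e-12).rsqrt())
    s = d_inv_sqrt @ affinity @ d_inv_sqrt                 # S = D^-1/2 A D^-1/2
    eye = torch.eye(affinity.size(0))
    return torch.linalg.solve(eye - alpha * s, one_hot_labels)
```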

V. THEORETICAL ANALYSIS

A. Discriminative Sample Representation
It is critical that the initial feature representations of the samples are sufficiently discriminative (i.e., samples of different classes are separated) for the GNN models in few-shot tasks. However, most of the existing GNN models work with generic features from a CNN-based backbone and fail to capture the task-specific structure. The proposed node self-attention module exploits the cross-sample correlation and can thus effectively guide the feature representation for each few-shot task.

B. Alleviation of Over-Smoothing and Over-Fitting Problems
Over-fitting arises when learning an over-parametric model from limited training data, and it is especially severe since the objective of few-shot learning is to generalize the knowledge from the training set to few-shot tasks. On the other hand, the over-smoothing phenomenon refers to the case where the features of all (connected) nodes converge to similar values as the model depth increases. We provide a theoretical analysis to show that the proposed triple-attention mechanism can alleviate both over-fitting and over-smoothing in GNN training.
Lemma 1: The node self-attention module is equivalent to a GNN layer if α = 0, with

$$\hat{A}^{(k)} = C_f, \quad W^{(k)} = I, \quad \rho(\cdot) \ \text{being the identity map}.$$

Proposition 1: Applying the node self-attention module to replace a GNN layer in AGNN reduces the trainable-parameter complexity from $\mathcal{O}\{d_x (d_x + L)\}$ to $\mathcal{O}\{d_x d_e\}$, where $d_x$ and $d_e$ represent the input feature dimension and the projected dimension in the latent space, respectively, and L denotes the depth of the MLP for generating the adjacency matrix.
The node self-attention module only involves two linear projection layers and the 1 × 1 kernels that are trainable.
Lemma 1 and Proposition 1 prove that the node self-attention module involves far fewer trainable parameters than a normal GNN layer. Thus, applying node self-attention instead of another GNN layer reduces the model complexity, thereby lowering the risk of over-fitting.
Next, we show that using neighbor attention can help alleviate over-smoothing when training GNN models. The analysis is based on the recent works on DropEdge [21] and GNN information loss [64]. They proved that a sufficiently deep GNN model will always suffer from "ε-smoothing" [64], where ε is defined as the error bound of the maximum distance among node features. Another concept is the "information loss" [64] of a graph model G, i.e., the dimensionality reduction of the node feature space after T layers of GNNs, denoted as $\Theta_{T,G}$. We use these two concepts to quantify the over-smoothing issue in our analysis.
Theorem 1: Denote the same multi-layer GNN model with and without neighbor attention as $\tilde{G}$ and G, respectively. Besides, denote the number of GNN layers for them to encounter ε-smoothing [64] as $T(\tilde{G}, \epsilon)$ and $T(G, \epsilon)$, respectively. With a sufficiently small β in the neighbor attention module, either (i) $T(\tilde{G}, \epsilon) > T(G, \epsilon)$, or (ii) the information loss of $\tilde{G}$ is smaller than that of G, i.e., $\Theta_{T,\tilde{G}} < \Theta_{T,G}$, when $T(\tilde{G}, \epsilon) = T(G, \epsilon)$.

Remarks:
The result shows that the GNN model with neighbor attention (i) increases the maximum number of layers before encountering over-smoothing, or, if the number of layers remains the same, (ii) the over-smoothing phenomenon is alleviated.
For the above results, we provide the full proofs in Section V-C.

C. Proofs of the Proposed Attention Mechanism
We present the detailed proofs of Lemma 1 and Proposition 1 regarding the proposed node self-attention, and of Theorem 1 regarding the neighbor attention.
Proof of Lemma 1: First of all, we analyze the proposed node self-attention, whose feature and label vector updates are

$$X^{(1)} = C_f X, \quad Y^{(1)} = \alpha Y + (1 - \alpha)\, C_f Y,$$

where $C_f$ denotes the attention map, and X and Y (resp. $X^{(1)}$ and $Y^{(1)}$) denote the input (resp. output) feature and label vectors, respectively. We prove Lemma 1, which shows that the proposed node self-attention can alleviate over-fitting by reducing the model complexity compared to adding more GNN layers. The output of the general k-th GNN layer can be represented as

$$X^{(k+1)} = \rho\left(\hat{A}^{(k)} X^{(k)} W^{(k)}\right).$$

With the condition for equivalence, i.e., α = 0, $\hat{A}^{(k)} = C_f$, $W^{(k)} = I$, and ρ being the identity map, the output of the k-th GNN layer becomes

$$X^{(k+1)} = C_f X^{(k)}. \qquad (21)$$

Thus, (21) is equivalent to applying the node self-attention in place of the k-th GNN layer, with $X^{(k+1)} = X^{(1)}$ and $Y^{(k+1)} = Y^{(1)}$.

Proof of Proposition 1: Next, we prove Proposition 1, which shows the model-complexity decrease from a trainable GNN layer to the proposed node self-attention module.
For a GNN layer following (2), both $W^{(k)}$ and the $\mathrm{MLP}^{(k)}$ are trainable, corresponding to free-parameter complexities of $\mathcal{O}\{d_x^2\}$ and $\mathcal{O}\{d_x L\}$, respectively. On the contrary, based on Lemma 1, the proposed node self-attention is equivalent to a GNN layer with $W^{(k)}$ and $\mathrm{MLP}^{(k)}$ fixed. The only trainable parameters are the linear layers that project features into the latent space for attention computation and the 1 × 1 kernels that fuse $C_x$ and $C_y$, whose complexities scale as $\mathcal{O}\{d_x d_e\}$ and $\mathcal{O}\{1\}$, respectively.
Proof of Theorem 1: Next, we show that using neighbor attention can help alleviate over-smoothing when training GNNs. We first quantify the degree of over-smoothing using the definitions from [21] and [64].
Definition 1 (Feature Subspace): Let $\mathcal{M} \subseteq \mathbb{R}^{V \times d_x}$ denote an M-dimensional subspace of the node feature space.

Definition 2 (Projection Loss): Denote the operator projecting $X \in \mathbb{R}^{V \times d_x}$ onto the M-dimensional subspace $\mathcal{M}$ as $P_{\mathcal{M}}$. The projection loss $\theta_{\mathcal{M}}(X)$ is defined as

$$\theta_{\mathcal{M}}(X) := \left\|X - P_{\mathcal{M}}(X)\right\|_F.$$

Definition 3 (ε-smoothing): A GNN layer suffers from ε-smoothing if $\theta_{\mathcal{M}}(X) < \epsilon$. Given a multi-layer GNN G with the feature output of each layer denoted as $X^{(k)}$, we define the ε-smoothing layer as the minimal value of k that encounters ε-smoothing, i.e.,

$$T(G, \epsilon) := \min\left\{k : \theta_{\mathcal{M}}\left(X^{(k)}\right) < \epsilon\right\}.$$

Definition 4 (Dimensionality Reduction): The dimensionality reduction of the node feature space after T layers of the GNN G is denoted as $\Theta_{T,G}$, i.e., the difference between the dimension of the input feature space and the dimension of the subspace in which the node features lie after T layers.

With these definitions from [21] and [64], we can now prove Theorem 1 for the neighbor attention in (9) and (10).
Given the original $U^{(k)}$, the solution to (10) is achieved using the projection onto an $\ell_0$ ball, i.e., keeping the $\beta^{(k)} V$ elements of each row $U^{(k)}_i$ with the largest magnitudes [61], i.e.,

$$\hat{A}^{(k)}_i = P_{\beta^{(k)} V}\left(U^{(k)}_i\right), \quad \forall i.$$

Here, the set of retained indices in each row corresponds to the attended neighbor nodes, so that each node mixes features from at most $\beta^{(k)} V$ neighbors.

VI. EXPERIMENTS
To evaluate the performance of our proposed AGNN method for few-shot classification, we conduct various experiments on four benchmarks. In this section, we first describe the dataset information and implementation details of our network. Then we conduct extensive experiments on several few-shot classification tasks, including standard classification, fine-grained classification, and semi-supervised classification, under both transductive and inductive settings to evaluate the generalizability of the AGNN model. Finally, we perform ablation studies to analyze the effectiveness of each attention mechanism.
A. Datasets

Flowers-102: Flowers-102 [66] was initially proposed for fine-grained image classification of flowers. It contains 102 different flower categories with 8,189 images, and each image size is 84 × 84 × 3. There are large variations in scale, pose, and lighting among flower images. In addition, some categories have significant variations within the class. Following the split in [36], we divide the 102 classes into 52, 25, and 25 for training, validation, and testing, respectively.
Caltech-UCSD Birds 200-2011: CUB [67] was also initially proposed for fine-grained image classification. It contains 200 different bird species with 11,788 images, and each image size is 84 × 84 × 3. Compared with the classic image classification task, we need to find minor differences between classes, making it a more challenging problem. Following the split in [69], we divide the 200 classes into 100, 50, and 50 for training, validation, and testing, respectively.

B. Implementation Details
We follow most of the DNN-based few-shot learning schemes [2], [3], [5] and first apply the popular ConvNet-4 as the backbone feature extractor, with 3 × 3 convolution kernels, numbers of channels [64, 96, 128, 256] at the corresponding layers, a batch normalization layer, a max-pooling layer, and a LeakyReLU activation layer. Besides, two dropout layers are added to the last two convolution blocks to alleviate over-fitting [5]. Furthermore, to compare with the more complicated CNN-based methods, we also apply ResNet-12 as the backbone, following a similar setup in [38]. On this basis, a fully-connected layer with batch normalization is added at the end for dimensionality reduction. We conduct both 5-way 1-shot and 5-way 5-shot experiments under both inductive and transductive settings [12]. We use only one query sample for the inductive setting and one query sample per class for the transductive experiments on the ConvNet-4 and ResNet-12 backbones. Our models are all trained using the Adam optimizer with an initial learning rate of 1 × 10^{-3}. For the ConvNet-4 backbone, the weight decay is set to 10^{-6} and the mini-batch size is set to 40 for all settings. For the ResNet-12 backbone, the weight decay is 10^{-5}, and the mini-batch size is set to 28. We decay the learning rate by a factor of 0.1 every 15 K and 18 K epochs over mini-ImageNet and tiered-ImageNet, respectively. The output feature dimension of the two backbones is 128, and the number of GNN layers is set to 5. The weighting parameter α in (8) is set to 0.5 and 0.9 for mini-ImageNet and tiered-ImageNet, respectively.
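For concreteness, below is a minimal sketch of a ConvNet-4 backbone matching the description above; the exact layer ordering, dropout rate, and pooling placement are assumptions rather than the paper's released configuration.

```python
# Minimal sketch of a ConvNet-4 backbone: four 3x3 conv blocks with channels
# [64, 96, 128, 256], BN, max pooling, LeakyReLU, dropout in the last two
# blocks, and a final FC + BN producing the 128-d node feature.
import torch.nn as nn

def conv_block(in_ch, out_ch, dropout=0.0):
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
        nn.LeakyReLU(0.2),
    ]
    if dropout > 0:
        layers.append(nn.Dropout2d(dropout))
    return nn.Sequential(*layers)

class ConvNet4(nn.Module):
    def __init__(self, out_dim=128, dropout=0.3):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 64),
            conv_block(64, 96),
            conv_block(96, 128, dropout),
            conv_block(128, 256, dropout),
        )
        # an 84x84 input becomes 5x5 spatially after four 2x2 poolings
        self.fc = nn.Sequential(nn.Linear(256 * 5 * 5, out_dim), nn.BatchNorm1d(out_dim))

    def forward(self, x):
        return self.fc(self.blocks(x).flatten(1))
```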

C. Standard Few-Shot Image Classification
We compare the proposed AGNN to state-of-the-art CNN- and GNN-based methods using the ConvNet-4 and ResNet-12 backbones, and Tables I and II list the average accuracy of few-shot image classification, respectively. Table I shows that the proposed AGNN achieves state-of-the-art performance under the 5-way 1-shot and 5-shot settings and outperforms GNN-based methods by about 9.67% and 3.33% over the mini-ImageNet and tiered-ImageNet datasets, respectively, with the ConvNet-4 backbone under the 1-shot setting. When adopting a deeper backbone network (i.e., ResNet-12), we observe a consistent result, which demonstrates the effectiveness of the AGNN approach. Furthermore, when comparing the results of the two backbones on the same tiered-ImageNet dataset, we find that the performance improvement of the GNN-based methods is not obvious. One possible explanation is that GNN methods mainly exploit the graph structure to collect and exploit important information from neighboring nodes, whereas a deeper feature extractor can only help with a better initialization, which is less relevant to the GNN layers. Another observation is that, for the same method, the accuracy of transductive learning is typically better than that of inductive learning, by further exploiting the correlation amongst the multiple query samples.

D. Fine-Grained Few-Shot Classification
We also evaluate the proposed AGNN method on two fine-grained datasets (i.e., CUB and Flowers-102) under the few-shot setting. Compared with classification tasks on standard datasets such as mini-ImageNet, the few-shot fine-grained classification task is more challenging due to the significant intra-class variance and inter-class similarity. Table III summarizes the 5-way classification results with the ConvNet-4 backbone over the CUB and Flowers-102 datasets. It can be seen that our proposed AGNN achieves state-of-the-art performance on both datasets under both the 1-shot and 5-shot settings. Our proposed AGNN improves over GNN by a large margin ranging from 2.09% to 8.36% on both target datasets, which validates the effectiveness of the proposed attention mechanisms. Notably, we observe that GNN-based few-shot methods perform better than other few-shot methods, which shows that GNNs can help exploit the intra-class and inter-class relationships between samples, especially for the fine-grained classification task.

E. Semi-Supervised Few-Shot Classification
For the semi-supervised experiment, we follow the typical 5-way 5-shot setting with only a partially labeled support set [5], [13]. We conduct the experiments over the tiered-ImageNet benchmark, and the results are presented in Table IV. For each class, we set the same labeled ratio of the support samples; e.g., a 20% labeled ratio corresponds to one labeled support sample and four unlabeled support samples. We use the ProtoNet method [3] as the baseline for comparison and report the results of two versions according to whether the method utilizes the unlabeled support samples. Here, the original version of the ProtoNet method means that we ignore the unlabeled support samples and only use the partially labeled support samples of each class to compute the prototype for classification, which is equivalent to the corresponding few-shot setting, i.e., 20% (resp. 80%) in the semi-supervised setting is equal to the fully supervised 5-way 1-shot (resp. 4-shot) setting. In contrast, we also implement a new version of ProtoNet denoted as "ProtoNet w/ unlabeled," i.e., considering these unlabeled support samples as extra query samples during training for semi-supervised learning.
As shown in Table IV, we observe that semi-supervised learning increases the performance of all methods in comparison to the results under the typical fully supervised few-shot setting with the same number of labeled support samples (0.85%, 0.97%, 1.09%, 2.70%, 4.78%, and 8.07% for ProtoNet, GNN, EGNN, AGNN, DPGN, and TLRM, respectively).
Furthermore, we find that the gap between the ProtoNet baseline and the GNN-based methods becomes larger under the semi-supervised setting. This indicates that GNN-based methods can obtain a more discriminative relationship representation via the graph structure. The proposed AGNN outperforms GNN, EGNN, and TLRM in all cases. Furthermore, AGNN achieves better performance than DPGN under the 20% semi-supervised setting, while DPGN performs slightly better under the 80% semi-supervised setting. A plausible explanation is that DPGN applies a dual-graph structure to encode distribution and instance information, which can better propagate label information when there is less unlabeled data (corresponding to the 80% semi-supervised setting) but may convey misinformation between the dual graphs when label missingness is severe (corresponding to the 20% semi-supervised setting). Notably, the performance gap between the proposed AGNN and EGNN becomes larger under the semi-supervised setting, which demonstrates the ability of the proposed attention mechanisms to integrate more accurate and essential information via graph convolution. Moreover, there exists another popular semi-supervised few-shot setting, i.e., each few-shot task contains additional unlabeled data for each category. Different from the semi-supervised setting in GNN-based methods, methods of this type utilize the additional information to help adjust the class distribution and reduce bias. To evaluate the effectiveness of the proposed method, we also compare against one of the state-of-the-art methods, Cluster-FSL [77], in our setting. For a fair comparison, we change the support and unlabeled sizes for Cluster-FSL to match our setting; e.g., for the 80% semi-supervised setting, we set the support size to 4 and the unlabeled size to 1. Compared to Cluster-FSL, the GNN-based methods perform better, which demonstrates the superiority of our proposed attention modules.

F. Robustness in Transductive Learning
Compared to the typical inductive setting in few-shot learning, transductive few-shot learning is a setting first proposed in [12], which allows the model to utilize all unlabeled query instances in each few-shot task, leading to promising results. While the sampled query samples are always uniformly distributed across classes in the conventional transductive learning setting [12], such an assumption may not hold in practice, e.g., when the query set contains a random number of samples for each class. This problem may be severe, especially for GNN-based models, which may learn the distribution of each class by graph convolution during meta-training. We study how robust the proposed AGNN is in such a setting by comparing it to the baseline GNN [5] (with only layer memory attention), DPGN [14], and AGNN variants without a specific attention mechanism. In the training phase, we simulate a fixed number of query samples (i.e., 25) and change the number of test samples for each class correspondingly for all methods under this setting. Table V shows the image classification accuracy for 5-way 1-shot transductive learning with 5 query samples per class, averaged over the tiered-ImageNet dataset. It can be seen that the accuracies of all GNN-based methods decrease due to the different class distributions in each graph between training and testing tasks, especially for DPGN, which delivers distribution relations between nodes in the distribution graph to refine the point graph for classification. In contrast, with query-set samples of "random" labels, the proposed AGNN can still generate significantly better results compared to the vanilla GNN. We also observe that each proposed attention mechanism contributes to robustness. For example, neighbor attention can help prevent "over-mixing" with all nodes, as the sparse adjacency matrix can attend to the related nodes (i.e., nodes with the same class) in an adaptive way.

G. Ablation Study
We investigate the effectiveness of each proposed attention module by conducting the following ablation study.
1) Impact of AGNN Layers: Fig. 4 plots the image classification accuracy over the tiered-ImageNet dataset for different variants of the proposed AGNN, obtained by removing the node self-attention (self att), neighbor attention (neigh att), and layer memory attention (memory att) modules. We also compare with our previous work [28]. It is clear that all variants except the full method generate degraded results as the GNN layers go deeper, and some even suffer from more severe over-smoothing, i.e., the accuracy of GNN without any attention mechanism drops quickly as the number of GNN layers increases. Results also show that neighbor and layer memory attention are more essential for alleviating the over-smoothing problem. This is consistent with our expectations, as node self-attention is only adopted before the graph convolution operations and thus has less impact on the over-smoothing issue. Furthermore, comparing the two AGNN versions, we conclude that the proposed layer-wise neighbor attention can provide better affinity. A reasonable explanation is that the layer-wise neighbor attention mechanism adopts the previously learned neighbor information (i.e., the adjacency matrix) as prior knowledge, which helps learn a more accurate affinity between nodes and avoids over-fitting to the features in the current layer. Moreover, the dynamic sparsity ratio can force GNNs to aggregate different degrees of neighbor information at different layers, thereby providing flexible graph structures for various tasks and alleviating the over-fitting and over-smoothing issues.
We also plot the t-SNE visualizations of different layers for the proposed AGNN (n = 3, n = 5, and n = 7) and the baseline GNN method (n = 7) over mini-ImageNet, shown in Fig. 5. Results show that as the GNN goes deeper, i.e., n = 7, the baseline method suffers from over-smoothing on the testing set. In this case, the few-shot GNN-based method fails to generalize to novel tasks without attention or regularization terms. In contrast, the proposed AGNN method can alleviate this issue even at very deep layers. Another observation is that, as the number of layers increases, the AGNN performance first increases and then decreases, and at n = 5 AGNN achieves the best t-SNE separation. A reasonable explanation is that the proposed attention modules can build a flexible graph model, which helps to alleviate the over-smoothing issue for deep GNN layers and generalizes well to unseen tasks to avoid the over-fitting problem.
2) Comparison With the Self-Attention Transformer Mechanism: As mentioned before, we propose a node self-attention module to exploit the relationship between samples by learning correlation matrices at the feature and label levels, respectively, and introducing a fusion strategy to combine the information. Besides the self-attention mechanism we propose, the self-attention Transformer [11], [50] is also a popular module that leverages the relationship between samples to learn discriminative feature embeddings. To validate the effectiveness of our proposed node self-attention module, we experiment with both kinds of modules for few-shot image classification. For a fair comparison, we add each module (our proposed node self-attention module and the self-attention Transformer module) before the AGNN layers to evaluate the performance, respectively. Note that we concatenate the embedding feature and the one-hot label feature as the input to the self-attention Transformer module. The output dimension of the linear mapping function for the Transformer is set the same as the input dimension. We apply ProtoNet [3] as the baseline method for comparison. We also include FEAT [11], a few-shot method that applies the self-attention Transformer as a kind of embedding adaptation function, in the comparison. As we can see from the results in Table VI, under our AGNN framework, our node self-attention mechanism achieves improvements of 13.26% and 1.40% over the self-attention Transformer under the 5-way 1-shot and 5-way 5-shot settings, respectively. This shows that node self-attention can exploit rich relationships between samples at different levels and fuse them better. Moreover, we find that our proposed method with the Transformer also obtains a performance improvement compared with ProtoNet with the Transformer, which validates the effectiveness of our proposed AGNN method.
3) Design Choice of Sparsity Ratio: To validate the effectiveness of our proposed flexible setting of the sparsity ratio in the neighbor attention module, we consider two different sparsity ratio designs, i.e., a fixed value for all AGNN layers or different values for each AGNN layer. In both designs, the number of AGNN layers is set to 5, i.e., the integer k ranges from 0 to 4. Table VII shows the classification accuracy with the two sparsity ratio designs in the neighbor attention module under the 5-way 1-shot setting. Results show that the classification accuracy of AGNN is affected by the sparsity ratio, and $\beta^{(k)} = 1.0$ is the optimal setting among the fixed sparsity ratio designs under the 1-shot setting. Compared with the results of the fixed sparsity design, we observe that our proposed variable sparsity ratios can significantly improve the performance of few-shot classification tasks. A plausible explanation is that this dynamic layer-wise sparsity ratio design brings more flexibility to each AGNN layer. Specifically, we need to aggregate more neighbor information in shallow AGNN layers to learn feature representations. Thus, there is no need to impose strong sparsity constraints in the first few layers, that is, no or little sparsity, i.e., a higher value of β. However, as AGNN goes deeper, we need to force the model to capture information only from the most relevant nodes for classification and avoid over-smoothing issues, i.e., nodes with a higher probability of belonging to the same class.
4) Impact of Hyper-Parameters: There are two hyper-parameters in the proposed AGNN method, namely α and m, corresponding to the label-fusion ratio and the output feature dimension of each AGNN layer, respectively. The weighting parameter α for label fusion ranges between 0 and 1, and the dimension m is selected from {16, 32, 48}. Table VIII shows how varying these two parameters affects the test accuracy for image classification averaged over the mini-ImageNet and tiered-ImageNet datasets under the transductive setting. We also test the model with the label fusion mechanism removed entirely, denoted as "-" in the table. The results show that introducing the node self-attention mechanism before the AGNN layers learns more flexible task-relevant features, thereby improving the performance of the proposed model. It is clear that m = 32 is a proper choice for all datasets. The empirical results also show that mini-ImageNet is not sensitive to these hyper-parameters, whereas retaining a larger proportion of ground-truth label information (i.e., a larger value of α) is more beneficial on tiered-ImageNet.
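As a purely illustrative aid, the sketch below shows how such a grid over α and m could be swept on validation episodes; `evaluate_agnn` is a hypothetical placeholder that returns a dummy accuracy and is not part of the released method.

```python
# Minimal sketch (hypothetical harness, not the authors' pipeline): grid-search
# the label-fusion ratio alpha and the per-layer output dimension m.
import itertools
import random

alphas = [0.0, 0.3, 0.5, 0.7, 1.0]   # candidate label-fusion ratios in [0, 1]
dims = [16, 32, 48]                  # candidate per-layer output dimensions m

def evaluate_agnn(alpha, m):
    """Placeholder: in practice this would train/evaluate AGNN on validation episodes."""
    return random.random()           # dummy accuracy so the sketch runs end-to-end

best = max((evaluate_agnn(a, m), a, m) for a, m in itertools.product(alphas, dims))
print("best (val_acc, alpha, m):", best)
```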
5) The Effects of the Proposed Attention Modules: Table IX summarizes the effects of the proposed node self-attention (self att), neighbor attention (neigh att), and layer memory attention (mem att) modules over mini-ImageNet. Without node self-attention, the proposed method directly uses the feature embeddings extracted from the backbone network as the node representations of the AGNN model. Without neighbor attention, each AGNN layer adopts a fully connected adjacency matrix to update node features. Without layer memory attention, the proposed method does not consider the dense connections between different AGNN layers, i.e., the node features of each layer are simply the output of the current AGNN layer. The results show that all modules consistently improve the classification performance under both the 5-way 1-shot and 5-shot settings over mini-ImageNet. Furthermore, neighbor attention and layer memory attention contribute more than node self-attention. Since neighbor attention via sparsity constructs a task-specific, dynamic, and flexible relationship between nodes, it improves generalization to unseen tasks in few-shot scenarios. In addition, the layer memory mechanism enables AGNN to aggregate information from every layer, which also helps improve performance. In contrast, the node self-attention mechanism combines image and category information to provide a task-specific feature initialization, which helps to a smaller extent.
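The toy sketch below illustrates the layer memory ablation as described above: with dense connections, each layer receives the concatenation of all previous layer outputs, whereas without them only the latest output is propagated; `gnn_layer` is a stand-in toy aggregation and not the actual AGNN update.

```python
# Minimal sketch (illustrative reading of the ablation, not the exact model):
# toggling dense connections across graph layers.
import numpy as np

def gnn_layer(feats, adj, out_dim, rng):
    """Toy graph layer: aggregate neighbor features, then apply a random linear map."""
    w = rng.normal(size=(feats.shape[1], out_dim)) / np.sqrt(feats.shape[1])
    return np.tanh(adj @ feats @ w)

def forward(feats, adj, n_layers=5, out_dim=32, dense=True, seed=0):
    rng = np.random.default_rng(seed)
    memory = [feats]                                   # outputs of all layers seen so far
    for _ in range(n_layers):
        inp = np.concatenate(memory, axis=1) if dense else memory[-1]
        memory.append(gnn_layer(inp, adj, out_dim, rng))
    return memory[-1]

adj = np.abs(np.random.randn(30, 30))
adj = adj / adj.sum(axis=1, keepdims=True)             # row-normalized dense adjacency
out = forward(np.random.randn(30, 64), adj)            # shape (30, 32)
```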

VII. CONCLUSION
In this paper, we proposed a novel Attentive GNN model for few-shot learning. The proposed AGNN makes full use of the relationships between image samples for knowledge modeling and generalization. By introducing a triple-attention mechanism for graph initialization, graph update, and correlation across graph layers, the proposed AGNN model effectively alleviates the over-smoothing and over-fitting issues that arise when applying deep GNN models. Extensive experiments on both standard few-shot classification benchmarks and two more challenging scenarios (i.e., fine-grained and semi-supervised few-shot classification tasks) show that the proposed AGNN achieves state-of-the-art results compared with existing few-shot learning methods.
Limitations: Although the proposed AGNN performs well for few-shot learning tasks, GNN-based few-shot methods in general are limited by the graph size. Specifically, as the number of samples in each meta-task increases, GNN-based few-shot methods require more memory and computation for graph construction and feature updates, which limits the extension of such methods to other applications. In future work, it is worth investigating how the proposed method can be adapted to large-scale graphs for more classification tasks.

Fig. 1. t-SNE visualization of the image features extracted from the selected classes in deeper GNN layers (here we select the output features of the 8-th GNN layer). Different colors denote different classes.
it is equivalent to removing the edge connecting the i-th node and the j-th node. Thus, $|\Omega_i^{\beta^{(k)}}(\mathcal{V})|$ equals the number of edges dropped by the neighbor attention, and $|\Omega_i^{\beta^{(k)}}(\mathcal{V})| \to |\mathcal{V}|$ as $\beta^{(k)} \to 0$. Therefore, when $\beta^{(k)}$ is sufficiently small, a sufficient number of edges are dropped by the neighbor attention. Based on Theorem 1 in [21], at least one of the following holds to alleviate the over-smoothing phenomenon: the number of layers without $\epsilon$-smoothing increases under neighbor attention via sparsity, i.e., $T(\mathcal{G}, \epsilon) \leq T(\hat{\mathcal{G}}, \epsilon)$, where $\hat{\mathcal{G}}$ denotes the graph after the edges are dropped.

Fig. 5. t-SNE visualization of node features under the 5-way 1-shot setting with 10 query samples in each class on the testing set of the mini-ImageNet dataset. From left to right: the baseline GNN method with n = 7 layers and the proposed AGNN method with different numbers of layers, i.e., n = 3, 5, and 7.

TABLE I
FEW-SHOT CLASSIFICATION ACCURACY AVERAGED OVER MINI-IMAGENET AND TIERED-IMAGENET DATASETS WITH THE CONVNET-4 BACKBONE. UNDER EACH SETTING (I.E., TRANSDUCTIVE OR INDUCTIVE, 1-SHOT OR 5-SHOT), THE BEST AND SECOND BEST RESULTS UNDER EACH DATASET ARE HIGHLIGHTED IN RED AND BLUE, RESPECTIVELY. † INDICATES THAT THE SETFEAT METHOD ADOPTS THE CONVNET-4 BACKBONE WITH 10 ADDITIONAL SELF-ATTENTION MODULES. DENOTES THAT THE GNN RESULT IS OUR IMPLEMENTATION BASED ON PUBLIC CODE. SHOWS THAT MCGN HAS A DIFFERENT TRANSDUCTIVE SETTING FROM OTHER GNN-BASED METHODS

TABLE II
FEW-SHOT CLASSIFICATION ACCURACY AVERAGED OVER TIERED-IMAGENET WITH THE RESNET-BASED BACKBONE. THE BEST (RESP. SECOND BEST) RESULTS ARE HIGHLIGHTED IN RED (RESP. BLUE)

TABLE III
FEW-SHOT CLASSIFICATION ACCURACY AVERAGED OVER THE CUB AND FLOWERS-102 DATASETS WITH THE CONVNET-4 BACKBONE. THE BEST AND SECOND BEST RESULTS UNDER EACH SETTING AND DATASET ARE HIGHLIGHTED IN RED AND BLUE, RESPECTIVELY

TABLE IV
COMPARISON OF FULLY SUPERVISED AND SEMI-SUPERVISED FEW-SHOT CLASSIFICATION ON THE TIERED-IMAGENET BENCHMARK UNDER THE 5-WAY SETTING. HERE "FS" MEANS THE TYPICAL FULLY SUPERVISED FEW-SHOT SETTING. THE PERCENTAGE UNDER THE SEMI-SUPERVISED SETTING CORRESPONDS TO THE PROPORTION OF LABELED SAMPLES IN EACH CLASS OF THE SUPPORT SET UNDER THE 5-SHOT SETTING

TABLE V
EFFECT OF THE QUERY SAMPLE DISTRIBUTION OVER THE TIERED-IMAGENET DATASET FOR THE 5-WAY 1-SHOT TASK UNDER THE TRANSDUCTIVE SETTING. THE TOTAL NUMBER OF QUERY SAMPLES UNDER THE TWO SETTINGS REMAINS THE SAME (I.E., 25)

TABLE VI
TEST ACCURACY OF DIFFERENT SELF-ATTENTION MECHANISMS OVER THE MINI-IMAGENET DATASET. INDICATES THAT THE PROPOSED METHOD ADOPTS THE SELF-ATTENTION TRANSFORMER LAYER INSTEAD OF THE NODE SELF-ATTENTION MECHANISM

TABLE VII
COMPARISON OF DIFFERENT DESIGNS (I.E., FIXED OR FLEXIBLE) OF THE SPARSITY RATIO β^(k) IN THE NEIGHBOR ATTENTION MODULE OF THE AGNN METHOD UNDER THE 5-WAY 1-SHOT SETTING

TABLE VIII
INDUCTIVE ACCURACY ON MINI-IMAGENET AND TIERED-IMAGENET DATASETS UNDER THE 5-WAY 1-SHOT SETTING. "-" MEANS NOT APPLYING THE NODE SELF-ATTENTION MECHANISM

TABLE IX
ABLATION STUDY: EFFECTS OF THE PROPOSED ATTENTION MODULES OVER MINI-IMAGENET