A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation

Few-shot segmentation focuses on generalizing models to segment unseen objects with limited annotated samples. However, existing approaches still face two main challenges. First, the large feature distinction between support and query images creates a barrier to knowledge transfer, which harms segmentation performance. Second, limited support prototypes cannot adequately represent the features of support objects, which makes it hard to guide high-quality query segmentation. To deal with these two issues, we propose a self-distillation embedded supervised affinity attention model to improve the performance of the few-shot segmentation task. Specifically, the self-distillation guided prototype module uses self-distillation to align the features of support and query. The supervised affinity attention module generates a high-quality query attention map to provide sufficient object information. Extensive experiments show that our model significantly improves performance compared to existing methods. Comprehensive ablation experiments and visualization studies also show the significant effect of our method on the few-shot segmentation task. On the COCO-20$^i$ dataset, we achieve new state-of-the-art results. Training code and pretrained models are available at https://github.com/cv516Buaa/SD-AANet.


I. INTRODUCTION
SEMANTIC segmentation, as a significant computer vision task, aims to assign a class label to each pixel in the image. Fully convolutional networks (FCNs) [1] pioneered handling this task in an end-to-end manner, and various deep neural network models [2]-[7] have since made great improvements. However, the massive pixel-level annotated data required in semantic segmentation leads to expensive annotation cost. In addition, these methods suffer a dramatic performance decline when meeting unseen classes. In contrast, humans can easily and accurately complete complex tasks such as recognition based on existing knowledge and a small amount of new labeled data.
To address the above-mentioned issues, few-shot segmentation [8] is proposed, using limited annotated data to segment unseen classes. Different from fully supervised semantic segmentation, few-shot segmentation splits the whole data into a support set and a query set. The support set provides meaningful and critical features of a certain class to guide the method in extracting targets of the same class in the query set.

This work was supported in part by the National Natural Science Foundation of China under Grant 62072021. (Corresponding author: Huojin Chen.)
Q. Zhao, B. Liu, and S. Lyu are with the Department of Electronic and Information Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing, P.R. China, 100191 (e-mail: zhaoqi@buaa.edu.cn, liubinghao@buaa.edu.cn, lyushuchang@buaa.edu.cn).
H. Chen is with the College of Art and Design, Beijing University of Technology, 100 Pingleyuan, Chaoyang District, Beijing, P.R. China, 100124 (e-mail: chenhuojin@bjut.edu.cn).

Current few-shot segmentation methods are mainly based on metric learning and follow two main technical routes: affinity learning [9]-[11] and prototypical learning [12]-[14]. Affinity learning acquires the feature of the support object with the help of the support mask. Each pixel-wise support feature is then matched to the query feature through various correlation measures to guide query segmentation.
Prototypical learning methods usually use one or a few prototypes to represent the support object feature and guide segmentation of the query target. These methods adopt masked global average pooling (masked GAP) [12] to obtain the support prototype. Generally, the support prototype is combined with the query feature through correlation metrics to realize high-quality segmentation.
However, two major challenges still need to be solved in the few-shot segmentation task. First, objects in images may appear in different quantities, perspectives, illumination intensities, etc. The feature distinction between support and query dramatically reduces segmentation performance. Second, limited support prototypes are incapable of providing sufficient representative information. Some typical failure cases of existing methods can be seen in Fig. 2.
To deal with the issue caused by image distinctions, we introduce the self-distillation method into the few-shot segmentation task. We propose a novel module named self-distillation guided prototype generating module (SDPM), which adopts a self-distillation approach to bridge the gap between support and query features by finding commonalities between the two. SDPM takes the support label, support feature and query feature as inputs, and outputs a channel-reweighted query feature and a support prototype with intrinsic class features.
In the few-shot segmentation task, one or a few support prototypes cannot provide sufficient representative information to segment the query target. We therefore design the supervised affinity attention module (SAAM), a CNN-based end-to-end module that can be easily embedded in deep CNN models and introduces negligible computation cost. SAAM has the same inputs as SDPM and aims to generate an affinity attention map that gives a prior prediction of the query target.
Based on the two modules mentioned above, we propose the Self-Distillation embedded Affinity Attention network (SD-AANet) to produce the intrinsic prototype and affinity attention map efficiently. Extensive experiments show that our SD-AANet achieves state-of-the-art performance on COCO-20$^i$ and comparable state-of-the-art results on PASCAL-5$^i$.
Our contributions are summarized as follows:
• We propose the SDPM to generate an intrinsic prototype through a self-distillation approach, which can efficiently align the features of support and query. In addition, our SAAM produces a query attention map that teaches the decoder where to focus. By combining SDPM and SAAM, SD-AANet can better address the two challenges mentioned above.

II. RELATED WORK
A. Semantic Segmentation
Semantic segmentation aims to predict a semantic category for each pixel in an image. Convolutional neural network (CNN) based methods have made great progress in the semantic segmentation field. The fully convolutional network (FCN) [1] replaces fully connected layers with convolutional layers, achieving semantic segmentation in an end-to-end manner. SegNet [2] and UNet [3] employ symmetric "Encoder-Decoder" architectures to map the original image to same-size predictions. PSPNet [6] integrates a pyramid pooling module into baseline architectures such as ResNet [17], [18] to obtain contextual information at different scales by using pooling layers with different kernel sizes. Chen et al. [19], [20] employ dilated convolution to expand the receptive field. In addition, some works focus on the attention mechanism. PSANet [21] proposes point-wise spatial attention to explore better connection information between pixels. DANet [22] adopts a position attention module and a channel attention module to learn position and channel inter-dependencies. CCNet [23] adopts a criss-cross attention module to capture contextual information from full-image dependencies. However, well-performing semantic segmentation networks need a large amount of annotated training data, which is expensive to obtain.

B. Few-shot learning
Few-shot learning seeks to recognize new objects with only a few annotated samples. In this field, metric learning [24]-[26] is widely used as an interpretable approach. Koch et al. [24] propose a siamese architecture that performs well on k-shot image classification tasks. This architecture can also be extended to k-shot semantic segmentation. Meta-learning [27] enables a machine to quickly acquire useful prior information from limited labeled samples. Meta-learning LSTM [28] applies a recurrent neural network (RNN) to represent and store prior information for the few-shot problem, while the model-agnostic method [29] learns an initialization that adapts to a new task within a few gradient steps. To gain the advantages of both directions, ProtoMAML [30] combines the complementary strengths of metric-learning and gradient-based meta-learning methods.

C. Few-shot Segmentation
Few-shot semantic segmentation aims at performing dense pixel-wise classification for unseen classes. Shaban et al. [8] were the first to formally define the few-shot semantic segmentation problem. They propose a two-branch architecture (OSLSM) that produces a binary mask for the new semantic class in a dot-similarity manner. SG-One [12], now a benchmark architecture in the one-shot segmentation task, proposes an architecture that consists of a guidance branch and a segmentation branch. Based on the two-branch design of SG-One [12], the works in [9], [13], [14], [16], [31]-[33] further improve few-shot segmentation performance. PFENet [34] proposes a training-free prior generation process to produce prior segmentation attention for the model, and a feature enrichment module to enrich query features with the support features. ASR [35] reformulates few-shot segmentation as a semantic reconstruction problem and converts base class features into a series of basis vectors. HSNet [36] introduces 4D convolutions to extract diverse features from different levels of intermediate convolutional layers. ASNet [37] trains a learner to construct class-wise foreground maps for multi-label classification and pixel-wise segmentation. NTRENet [38] explicitly mines and eliminates background and distracting object regions for better segmentation. MSANet [39] exploits multiple feature maps of support images and query images to estimate accurate semantic relationships. However, the feature distinction and the weak representation of limited support prototypes still hinder performance. Our SD-AANet utilizes self-distillation, prototypical learning and affinity learning to address the problems above and achieve performance improvements.
III. PROPOSED METHOD
In this section, we first briefly describe the definition of the few-shot segmentation task in Subsection III-A. Then, in Subsection III-B and Subsection III-C, we introduce our self-distillation guided prototype generating module (SDPM) and supervised affinity attention module (SAAM) in detail. Finally, in Subsections III-D and III-E, we discuss optimization details and the multi-class segmentation application of our proposed self-distillation embedded affinity attention model (SD-AANet).

A. Problem Setting
Training and testing processes of few-shot segmentation can be seen as episodes. The episode paradigm was proposed in [40], and Shaban et al. [8] first introduced it to few-shot segmentation. Each episode consists of a support set $S$ and a query set $Q$ of the same class $C$. There are $K$ samples in the support set $S$, which is formulated as
$$S = \{(I_s^i, M_s^i)\}_{i=1}^{K}.$$
Each image-label pair $(I_s^i, M_s^i)$ represents a sample in $S$, where $I_s^i$ and $M_s^i$ are the support image and its ground truth respectively. Similar to the support set $S$, the query set $Q$ has one sample $(I_q, M_q)$, where $I_q$ and $M_q$ are the query image and its ground truth respectively, of the same class $C$ as the support set $S$. The input of the model is a pair of query image $I_q$ and support set $S$, formulated as $\{I_q, S\} = \{I_q, (I_s^i, M_s^i)_{i=1}^{K}\}$. The query ground truth $M_q$ is invisible in the training stage and is used to evaluate the performance of methods.
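The episode structure described above can be sketched as a small data container. This is only an illustrative sketch: the names `Sample`, `Episode` and `sample_episode`, and the idea of drawing from a per-class `pool` without replacement, are assumptions made for the example and are not taken from the released code.

```python
import random
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Sample:
    image: Any  # support or query image (any array-like object)
    mask: Any   # binary ground-truth mask for the episode class


@dataclass
class Episode:
    class_id: int          # class C shared by support and query
    support: List[Sample]  # the K samples of the support set S
    query: Sample          # the single sample of the query set Q


def sample_episode(class_id, pool, k, rng):
    """Draw K support samples and one query sample of the same class.

    `pool` is a hypothetical list of Sample objects that all contain
    `class_id`; drawing without replacement keeps the query out of S.
    """
    if len(pool) < k + 1:
        raise ValueError("need at least K + 1 samples of the class")
    chosen = rng.sample(pool, k + 1)
    return Episode(class_id=class_id, support=chosen[:k], query=chosen[k])
```

For a 5-shot episode, `sample_episode(c, pool, 5, random.Random(0))` would return five support pairs and one query pair of class `c`.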

B. Self-distillation Guided Prototype Generating Module
Current prototypical learning methods, such as PFENet [34], achieve strong performance on PASCAL-5$^i$ and COCO-20$^i$, outperforming previous works by a large margin. Prototypes generated by these methods can efficiently guide the segmentation of the query target. However, there are large feature differences between support and query targets, so we need to align the features of support and query. Objects always have two types of features: intrinsic features, which commonly exist in all objects of the class, and unique features, which may differ across objects. Take the aeroplane as an example: all aeroplanes are made of metal and have wings. These features, which exist in all aeroplanes, can be seen as intrinsic features. Because of differences in shooting angle and lighting conditions, the shape and color of aeroplanes can differ, so these are unique features. Normally, humans have the cognitive ability to easily spot intrinsic features and apply them to subsequent tasks. For similar purposes, in few-shot segmentation, we need to find representative features of support and query images that contain abundant intrinsic features.

Fig. 4. SDPM first applies masked GAP to generate the support prototype, then uses the support prototype to produce a channel reweighting vector. Channels of both the support feature and the query feature are reweighted by this vector. After that, a new support prototype and a query prototype are generated by masked GAP, and the self-distillation approach is applied between the two prototypes to produce the intrinsic support prototype. To promote the learning process of the model, the teacher vector in the self-distillation approach is the average of the support prototype and the query prototype, as shown in the blue dotted box. The outputs of SDPM are the query channel reweighting feature and the support prototype, shown in red dotted boxes.

The knowledge distillation approach proposed by Hinton et al. [41] inspires us to transfer knowledge between support and query prototypes. Zagoruyko et al. [42] and He et al. [43] extend the knowledge distillation technique by distilling attention in middle layers. Fukuda et al. [44] propose integrating multiple teacher networks to teach the student network. Lyu et al. [45] realize knowledge distillation in a single deep neural network, where the student network is a part of the teacher network.
Inspired by the above methods, we introduce the self-distillation approach to the prototypical learning method, aiming to extract intrinsic features and align the features of the support prototype and the query prototype.
1) Support Guided Channel Reweighting: Different from SENet [46], which uses the global feature to reweight channels of the feature map, we instead adopt the support prototype, which is better suited to obtaining object-related information. As shown in Fig. 4, the architecture in the gray dotted box is a variation of the SE module. Support image $I_s$ and query image $I_q$ of the same class go through a shared backbone CNN, denoted as $F(\cdot)$. $M_s$ and $M_q$ denote the ground truths of support and query, and $F_s$ and $F_q$ denote the support feature and query feature, which are outputs of middle-level layers of the backbone CNN:
$$F_s = F(I_s), \quad F_q = F(I_q). \tag{1}$$
Note that $F_s$ and $F_q$ have the same shape $n \times c \times h \times w$, in which $n$, $c$, $h$ and $w$ represent the batch size, number of channels, height and width of the feature map. The support prototype is then generated by masked GAP, which averages the feature vectors inside the object area of the feature map:
$$p_s = F_p(F_s, M_s) = \frac{\sum_{i=1}^{h}\sum_{j=1}^{w} F_s^{(i,j)} \left[M_s^{(i,j)} = 1\right]}{\sum_{i=1}^{h}\sum_{j=1}^{w} \left[M_s^{(i,j)} = 1\right]}, \tag{2}$$
where $i$ and $j$ denote the row and column indices, $F_s^{(i,j)}$ denotes the position at row $i$, column $j$ in the support feature map and $M_s^{(i,j)}$ denotes the position at row $i$, column $j$ in the support ground truth. $F_p(\cdot)$ denotes the masked GAP operation. To guarantee the correctness of Eq. 2, $M_s$ is resized to the same height and width as the support feature map. $[\cdot]$ denotes the Iverson bracket, a notation that evaluates to 1 if the condition in square brackets is satisfied, and 0 otherwise, i.e.
$$[P] = \begin{cases} 1, & \text{if } P \text{ is true}, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$
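As a minimal illustration of masked GAP, the sketch below averages each channel over the positions where the mask equals 1. It uses plain nested lists instead of tensors, drops the batch dimension, and assumes the mask has already been resized to the feature resolution; the function name `masked_gap` is chosen for this example only.

```python
def masked_gap(feature, mask):
    """Masked global average pooling.

    feature: nested list of shape [c][h][w]
    mask:    nested list of shape [h][w] with values 0 or 1
    Returns a list of c channel averages over the foreground positions.
    """
    # Denominator of the masked average: number of foreground positions.
    area = sum(1 for row in mask for m in row if m == 1)
    if area == 0:
        raise ValueError("mask contains no foreground position")
    prototype = []
    for channel in feature:
        total = 0.0
        for f_row, m_row in zip(channel, mask):
            for f, m in zip(f_row, m_row):
                if m == 1:  # Iverson bracket [M == 1]
                    total += f
        prototype.append(total / area)
    return prototype
```

With a two-channel 2 × 2 feature and a mask that selects the diagonal, each prototype entry is the mean of the two selected values in that channel.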
The acquired support prototype is then fed to a series of fully connected (FC) layers, with ReLU functions between them, to learn the contribution of each feature channel. The output of the FC layers is a vector with the same number of channels as the support feature and query feature. We have $v_s = FC(p_s)$, where $FC(\cdot)$ and $v_s$ denote the FC layers and the output channel reweighting vector respectively.
The channel reweighting vector $v_s$ scales the channels of the support feature and query feature according to each channel's importance. Instead of using the SE block directly, we adopt a feature fusion strategy that uses the average of the scaled feature and the input feature as the final channel reweighting feature:
$$\hat{F}_s = \frac{F_{scale}(v_s, F_s) + F_s}{2}, \quad \hat{F}_q = \frac{F_{scale}(v_s, F_q) + F_q}{2}, \tag{4}$$
where $F_{scale}(v_s, F_s)$ denotes the channel scale function and $\hat{F}_s$ denotes the final channel reweighting feature; $F_{scale}(v_s, F_q)$ and $\hat{F}_q$ are defined in the same way for the query.
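The scale-then-average fusion can be illustrated with a short sketch. The reweighting vector `v` is taken as given here (in the model it comes from the learned FC layers), features are plain nested lists of shape [c][h][w], and the function names are chosen for this example only.

```python
def scale_channels(v, feature):
    """Multiply every value of channel k by the weight v[k]."""
    return [[[w * x for x in row] for row in channel]
            for w, channel in zip(v, feature)]


def reweight_and_fuse(v, feature):
    """Average the channel-scaled feature with the input feature."""
    scaled = scale_channels(v, feature)
    return [[[(s + x) / 2.0 for s, x in zip(s_row, x_row)]
             for s_row, x_row in zip(s_ch, x_ch)]
            for s_ch, x_ch in zip(scaled, feature)]
```

One consequence of the averaging is that a channel with weight 0 is halved rather than removed, so the fused feature always keeps part of the input signal.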
2) Self-distillation Embedded Method: Since Hinton et al. [41] first proposed knowledge distillation in deep learning, many studies [45], [47], [48] have been conducted to let models learn from themselves. These approaches are named self-distillation and aim to improve model performance without external knowledge input.
Inspired by the above works, we introduce the self-distillation approach to the prototypical neural network, which can significantly improve few-shot segmentation performance by extracting the intrinsic support feature.
To generate the intrinsic support prototype with the help of the self-distillation approach, the average of the support and query prototypes is used as the teacher. Masked GAP is employed to obtain both the support prototype and the query prototype from the channel reweighting features as $p_s = F_p(\hat{F}_s, M_s)$, $p_q = F_p(\hat{F}_q, M_q)$, where $p_s$ and $p_q$ denote the support prototype and query prototype after channel reweighting.
The query prototype and support prototype are adopted in the self-distillation process. Both prototypes can be seen as a combination of two kinds of feature: intrinsic feature and unique feature. So $p_s$ can be represented as $p_s(f_i, f_s)$, where $f_i$ and $f_s$ denote the intrinsic feature and the support unique feature respectively. Similarly, the query prototype $p_q$ can be represented as $p_q(f_i, f_q)$, where $f_q$ is the query unique feature. Following the knowledge distillation method, we apply the Kullback-Leibler (KL) divergence loss to supervise the support prototype:
$$d_s = Softmax(p_s), \quad d_q = Softmax(p_q), \quad \mathcal{L}_{KD} = KL(d_t \,\|\, d_s), \tag{5}$$
where $Softmax(\cdot)$ denotes the softmax function, and $d_s$ and $d_q$ denote the outputs of the softmax function with inputs $p_s$ and $p_q$. $d_t$ denotes the teacher prototype in the knowledge distillation operation and is equal to $d_t = \frac{d_s + d_q}{2}$. $\mathcal{L}_{KD}$ in Eq. 5 denotes the loss of self-distillation between the support prototype and the teacher prototype, and $KL(\cdot)$ denotes the KL divergence function.
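A small numerical sketch may help make the self-distillation loss concrete. It assumes the teacher distribution is the first argument of the KL divergence, which is the usual knowledge-distillation convention but is an assumption here; the function names and the `eps` guard are likewise choices made for this example.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(p_s, p_q):
    """KL between the teacher (mean of both softmaxed prototypes)
    and the softmaxed support prototype."""
    d_s = softmax(p_s)
    d_q = softmax(p_q)
    d_t = [(a + b) / 2.0 for a, b in zip(d_s, d_q)]  # teacher distribution
    return kl_divergence(d_t, d_s)
```

When the two prototypes coincide, the teacher equals the student and the loss is zero; the loss grows as the channel distributions of support and query diverge.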
The self-distillation approach enhances the consistency of the support prototype and the query prototype. Because the two prototypes are combinations of intrinsic feature and unique feature, the approach weakens the unique feature and strengthens the intrinsic feature, which means that $p_s(f_i, f_s)$ moves toward a prototype with only the intrinsic feature $f_i$, more suitable for guiding the query segmentation.
SDPM can efficiently reduce the unique feature in the support prototype and significantly narrow the gap between support and query features. The experiments in Section IV show the resulting performance improvement.
3) K-shot Setting: In addition to 1-shot segmentation, segmenting the query target under the guidance of K (greater than 1) support images is defined as K-shot segmentation. To extend SDPM to K-shot segmentation, the module needs to be modified appropriately. Because of the distinctions between the K support images, the teacher prototype extracted from the query feature should supervise each support prototype separately. Depending on whether the K support prototypes share the same teacher prototype, we design two strategies of SDPM for the K-shot segmentation task: the Integral Teacher Prototype Strategy and the Separate Teacher Prototype Strategy. Details of the two strategies are shown in Fig. 5.
The core idea of the Integral Teacher Prototype Strategy is to apply the average of the K reweighting vectors to scale each channel of the query feature. Then masked GAP is adopted to extract the teacher prototype, and K knowledge distillation losses are calculated between the teacher prototype and each support prototype. The final self-distillation loss is the average of the K losses:
$$\hat{F}_q = \frac{1}{2}\left(F_{scale}\left(\frac{1}{K}\sum_{i=1}^{K} v_s^i, F_q\right) + F_q\right), \quad p_q = F_p(\hat{F}_q, M_q), \quad \mathcal{L}_{KD}^{I} = \frac{1}{K}\sum_{i=1}^{K} \mathcal{L}_{KD}^{i},$$
where $v_s^i$ denotes the channel reweighting vector of the i-th support sample, $p_q$ denotes the query prototype generated from $\hat{F}_q$, $\mathcal{L}_{KD}^{i}$ denotes the knowledge distillation loss between the teacher prototype and the i-th support prototype, and $\mathcal{L}_{KD}^{I}$ denotes the knowledge distillation loss of the Integral Teacher Prototype Strategy.

Fig. 6. SAAM produces the support prototype using masked GAP and concatenates the support prototype to both the support feature and the query feature. PPM [6] is adopted to extract features of the two concatenated results. Then two 1 × 1 convolutions, with 2 and 1 output channels respectively, are applied to generate the support prediction and the query attention map. The support ground truth is used to supervise the support prediction, and the output of SAAM is the query attention map.

Different from the Integral Teacher Prototype Strategy, as shown in Fig. 5b, the Separate Teacher Prototype Strategy produces an exclusive teacher prototype for each support prototype by applying each reweighting vector to scale the query feature separately:
$$F_q^i = \frac{1}{2}\left(F_{scale}(v_s^i, F_q) + F_q\right), \quad p_q^i = F_p(F_q^i, M_q), \quad \mathcal{L}_{KD}^{S} = \frac{1}{K}\sum_{i=1}^{K} \mathcal{L}_{KD}^{i},$$
where $F_q^i$ and $p_q^i$ denote the query reweighting feature and the teacher prototype produced by the i-th support reweighting vector, and $\mathcal{L}_{KD}^{i}$ is here computed between the i-th teacher prototype and the i-th support prototype. $\mathcal{L}_{KD}^{S}$ denotes the knowledge distillation loss of the Separate Teacher Prototype Strategy. The final output query feature $\hat{F}_q$ of SDPM is the average of the K query reweighting features, formulated as
$$\hat{F}_q = \frac{1}{K}\sum_{i=1}^{K} F_q^i.$$
No matter which strategy is applied, the final output support prototype is the average of the K support prototypes. The ablation experiments in Section IV-C show the performance of the two strategies. In the end, we use the Separate Teacher Prototype Strategy in our overall model.
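The difference between the two K-shot strategies can be sketched in a few lines. This is a simplified illustration, not the released implementation: it assumes the per-shot loss keeps the 1-shot form (teacher = mean of the softmaxed support and query prototypes, KL with the teacher first), takes the reweighting vectors `vs` and support prototypes as given, and uses nested lists of shape [c][h][w] for features. All function names are chosen for this example.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(p_s, p_q, eps=1e-12):
    """Assumed per-shot loss: KL(teacher || student) with a mean teacher."""
    d_s, d_q = softmax(p_s), softmax(p_q)
    d_t = [(a + b) / 2.0 for a, b in zip(d_s, d_q)]
    return sum(t * math.log(t / max(s, eps)) for t, s in zip(d_t, d_s) if t > 0.0)


def masked_gap(feature, mask):
    area = sum(1 for row in mask for m in row if m == 1)
    return [sum(f for f_row, m_row in zip(ch, mask)
                for f, m in zip(f_row, m_row) if m == 1) / area
            for ch in feature]


def fuse(v, feature):
    """Average of the channel-scaled feature and the input feature."""
    return [[[(w * x + x) / 2.0 for x in row] for row in ch]
            for w, ch in zip(v, feature)]


def integral_teacher_loss(vs, support_protos, f_q, m_q):
    """One shared teacher: the query is scaled by the mean reweighting vector."""
    k = len(vs)
    v_mean = [sum(col) / k for col in zip(*vs)]
    p_q = masked_gap(fuse(v_mean, f_q), m_q)
    return sum(kd_loss(p_s, p_q) for p_s in support_protos) / k


def separate_teacher_loss(vs, support_protos, f_q, m_q):
    """One teacher per shot: the query is scaled by each vector in turn."""
    losses = [kd_loss(p_s, masked_gap(fuse(v, f_q), m_q))
              for v, p_s in zip(vs, support_protos)]
    return sum(losses) / len(losses)
```

When all K reweighting vectors are identical, the two strategies give the same loss; they differ only when the support samples weight the channels differently.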

C. Supervised Affinity Attention Mechanism
Limited support prototypes cannot provide enough target features, so we use the attention mechanism to provide more adequate prior information about the target. The attention mechanism can effectively capture the location of an object and tell deep neural network models where to focus, and some previous works also utilize it.
PFENet [34] uses high-level features of both support and query to generate a query attention map. By employing an ImageNet [49] pre-trained model as the backbone and fixing its weights, the prior attention mechanism is training-free.
However, the ImageNet pre-trained model is built for the classification task and is not directly suitable for generating attention maps in the few-shot segmentation task. So we propose a supervised affinity attention mechanism (SAAM), which addresses the problem caused by the weak representativeness of limited support features. The architecture of our SAAM is shown in Fig. 6.
1) Supervised Attention: We first utilize masked GAP to obtain the support prototype and expand it to the same spatial shape as the support feature. The expanded prototype is then concatenated to both the support feature and the query feature, and we define the results as $F_{C,s}$ and $F_{C,q}$ respectively. Next, $F_{C,s}$ and $F_{C,q}$ are separately fed to a pyramid feature extractor, and the outputs are defined as $F_{P,s}$ and $F_{P,q}$. We use the Pyramid Pooling Module (PPM) [6] as the pyramid feature extractor.
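The expand-and-concatenate step can be pictured with a minimal sketch. It works on nested lists of shape [c][h][w], takes the prototype as a list of c values, and covers only the concatenation; the PPM and the 1 × 1 convolutions are learned layers and are not modelled. The function names are chosen for this example.

```python
def expand_prototype(prototype, height, width):
    """Tile each prototype value over an h x w grid, giving [c][h][w]."""
    return [[[p] * width for _ in range(height)] for p in prototype]


def concat_prototype(feature, prototype):
    """Concatenate the expanded prototype to the feature along channels."""
    height, width = len(feature[0]), len(feature[0][0])
    return feature + expand_prototype(prototype, height, width)
```

Applying `concat_prototype` with the same support prototype to the support feature and to the query feature yields the two concatenated inputs, each with twice the original number of channels.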
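On top of the PPM, there are two 1 × 1 convolution layers (Convs) that generate the support prediction and the query attention map respectively. The support prediction is generated by a 1 × 1 Conv with two output channels. The 1 × 1 Conv for query attention map generation has only one output channel. The support label is then applied to supervise the SAAM:
$$\mathcal{L}_{ce,s} = -\frac{1}{hw}\sum_{i=1}^{h}\sum_{j=1}^{w}\left[M_s^{(i,j)}\log P_s^{(i,j)} + \left(1 - M_s^{(i,j)}\right)\log\left(1 - P_s^{(i,j)}\right)\right],$$
where $\mathcal{L}_{ce,s}$ is the cross entropy loss of the support prediction in SAAM. $M_s^{(i,j)}$ and $P_s^{(i,j)}$ are the $(i, j)$ locations of the support mask and the support prediction in SAAM, with $P_s^{(i,j)}$ taken as the predicted foreground probability.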
The SAAM uses the whole support feature to guide the generation of query attention map, so the missing intrinsic features of support are learned under supervision.Combined with SDPM, the SD-AANet makes a trade-off during training and obtains richer intrinsic features to achieve better segmentation performance.
2) K-shot Setting: The K-shot setting of SAAM is intuitive, because the only part that differs is the support path. The K support features go through SAAM separately, and each support prediction is supervised by its own label. The loss of K-shot SAAM is the average of the K losses:
$$\mathcal{L}_{ce,s} = \frac{1}{K}\sum_{i=1}^{K} \mathcal{L}_{ce,s}^{i},$$
where $\mathcal{L}_{ce,s}^{i}$ is the cross entropy loss of the i-th support image.
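The supervision of the support prediction can be sketched as a pixel-wise cross entropy over a two-channel prediction, averaged over the K shots. The channel order (index 0 for background, index 1 for foreground) and the function names are assumptions made for this example; logits are nested lists of shape [2][h][w].

```python
import math


def support_ce_loss(logits, mask):
    """Mean pixel-wise cross entropy for a two-channel prediction.

    logits: [2][h][w], channel 0 = background, channel 1 = foreground
    mask:   [h][w] with values 0 or 1
    """
    total, count = 0.0, 0
    for i, m_row in enumerate(mask):
        for j, m in enumerate(m_row):
            z_bg, z_fg = logits[0][i][j], logits[1][i][j]
            top = max(z_bg, z_fg)
            # log-sum-exp of the two logits, computed stably
            lse = top + math.log(math.exp(z_bg - top) + math.exp(z_fg - top))
            total += lse - (z_fg if m == 1 else z_bg)
            count += 1
    return total / count


def k_shot_support_loss(logits_list, mask_list):
    """Average of the K per-shot support losses."""
    losses = [support_ce_loss(l, m) for l, m in zip(logits_list, mask_list)]
    return sum(losses) / len(losses)
```

An uninformative prediction (equal logits everywhere) gives a loss of log 2 per pixel, while a confident and correct prediction drives the loss toward zero.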

D. Optimization
Based on the SDPM and SAAM, we propose the Self-Distillation Embedded Supervised Affinity Attention Network (SD-AANet), as shown in Fig. 3. For the whole model, we choose the cross entropy loss $\mathcal{L}_{ce}$ for the final segmentation prediction. Including the losses of SDPM and SAAM, the total loss of SD-AANet is the combination of $\mathcal{L}_{ce}$, $\mathcal{L}_{KD}$ and $\mathcal{L}_{ce,s}$ as
$$\mathcal{L} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{KD} + \beta \mathcal{L}_{ce,s},$$
where $\alpha$ and $\beta$ are the coefficients of $\mathcal{L}_{KD}$ and $\mathcal{L}_{ce,s}$, used to balance the three loss components.

E. Multi-class Few-shot Segmentation
To further explore the potential of SD-AANet, we propose a new pipeline for segmenting multi-class objects simultaneously under the few-shot setting. Because the aforementioned method and pipeline cannot be applied directly to segmenting multi-class objects, we modify the decoder and design a new training and testing pipeline. We describe the modification and the new pipeline under the 1-shot setting as follows.
The input support images and masks come from five different classes, at least one of which is the class of an object in the query image. After going through the encoder, SDPM and SAAM, there are one query feature, five support prototypes and five attention maps. Before the decoder, we concatenate these prototypes into one vector and add an MLP to reduce its dimension. Then we concatenate the obtained vector, the query feature and the five attention maps as the input of the decoder. To make the decoder segment objects from five classes simultaneously, we change the number of final output channels from 2 to 5. The output prediction has five channels, whose order is the same as the concatenation order of the five support prototypes. The whole learning process of multi-class 1-shot segmentation can be seen in Algorithm 1.
PASCAL-5$^i$ consists of two parts: PASCAL VOC 2012 [58] and extended annotations from the SDS dataset [59]. There are 20 classes in the original PASCAL VOC 2012 and SDS, and they are evenly divided into 4 folds, defined as Fold-i, i ∈ {0, 1, 2, 3}. Each fold contains 5 classes following the settings in OSLSM [8], and 1000 support-query pairs are used in our test.
Following [16], the COCO dataset, which has 80 classes in total, is also split into 4 folds with 20 classes in each fold. The set of class indexes contained in Fold-i is written as {4x − 3 + i}, x ∈ {1, 2, ..., 20}, i ∈ {0, 1, 2, 3}. The COCO validation set contains 40137 images, far more than PASCAL-5$^i$, so we randomly choose 4000 support-query pairs per fold during testing following [16], which provides more reliable and stable results for 20 classes than 1000 pairs. To realize few-shot segmentation, we train the model on three folds and test it on the remaining fold in a cross-validation manner. We rotate the test fold to evaluate the performance of our model, carry out five rounds of experiments, and take the average as the final experimental result.
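The fold rule can be checked with a short sketch. It assumes that class indexes are 1-based, that x runs from 1 to 20 and that the fold index i runs from 0 to 3, which is the reading under which the four folds together cover classes 1 to 80 exactly once; the function names are chosen for this example.

```python
def coco_fold_classes(fold, num_folds=4, classes_per_fold=20):
    """Class indexes of one fold under the rule 4x - 3 + i."""
    if not 0 <= fold < num_folds:
        raise ValueError("fold must be in [0, num_folds)")
    return [num_folds * x - (num_folds - 1) + fold
            for x in range(1, classes_per_fold + 1)]


def train_test_classes(test_fold, num_folds=4):
    """Classes used for training (other folds) and testing (held-out fold)."""
    test = coco_fold_classes(test_fold, num_folds)
    train = sorted(c for f in range(num_folds) if f != test_fold
                   for c in coco_fold_classes(f, num_folds))
    return train, test
```

Under this reading, Fold-0 holds classes 1, 5, 9, ..., 77 and Fold-3 holds classes 4, 8, ..., 80, so each test fold of 20 classes is disjoint from its 60 training classes.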
2) Experimental Setting: We use PyTorch to construct our framework, and we apply ResNet50 and ResNet101 [17] as our backbones for PASCAL-5$^i$ and COCO-20$^i$ respectively. We choose the ResNet with atrous convolutions, the same as previous works [13], [16], [60]. The ImageNet [49] pretrained weights provided by PyTorch are used to initialize the backbone networks. We use SGD as our optimizer and set the momentum and weight decay to 0.9 and 0.0001 respectively. The 'poly' policy [5] is adopted in our experiments to decay the learning rate by multiplying it by $\left(1 - \frac{\text{currentiter}}{\text{maxiter}}\right)^{\text{power}}$, where power is set to 0.9. $\alpha$ and $\beta$ are set to 50 and 0.5 respectively. We use PFENet [34] as our baseline.
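The 'poly' decay can be written as a one-line function. The sketch below is framework-free; the function name and the range check are choices made for this example, while the power of 0.9 and the base learning rate of 0.0025 in the usage note come from the settings in the text.

```python
def poly_lr(base_lr, current_iter, max_iter, power=0.9):
    """Learning rate under the 'poly' policy.

    Returns base_lr * (1 - current_iter / max_iter) ** power.
    """
    if not 0 <= current_iter <= max_iter:
        raise ValueError("current_iter must lie in [0, max_iter]")
    return base_lr * (1.0 - current_iter / max_iter) ** power
```

For example, `poly_lr(0.0025, 0, 1000)` returns the initial rate 0.0025, and the value decreases monotonically to 0 at the last iteration.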
The experiments on PASCAL-5$^i$ train models for 200 epochs as in [34], with the initial learning rate and batch size set to 0.0025 and 4. Because there are more images in the training set of COCO-20$^i$, we train models for 50 epochs with an initial learning rate of 0.0005 and a batch size of 8. We fix the parameters of the backbone networks and update the other parameters during training. Each example is processed with a mirror operation and random rotation from -10 to 10 degrees. Finally, limited by equipment performance, we randomly crop 321 × 321 patches from the processed images as training samples, which significantly reduces storage consumption and runtime. During evaluation, we resize the processed images to 321 × 321 and pad with zeros to maintain the original aspect ratio of the images. The prediction is then resized back to the original label size to evaluate performance. Following [34], for COCO-20$^i$, we also resize the prediction to 473 × 473 with respect to its original aspect ratio to make another evaluation. The single-scale results are output without multi-scale testing or any other post-processing. Our experiments are conducted on an NVIDIA GeForce RTX3090 GPU and Intel Xeon CPU 10900K.
3) Evaluation Metrics: Following [13], [16], [34], we adopt class mean intersection over union (mIoU) as our evaluation metric, because class mIoU is more reasonable than foreground-background IoU (FB-IoU) [13]. The formulation of class mIoU is $\frac{1}{C}\sum_{i=1}^{C} \mathrm{IoU}_i$, where $C$ is the number of classes belonging to each fold. So $C = 5$ for PASCAL-5$^i$ and $C = 20$ for COCO-20$^i$. $\mathrm{IoU}_i$ is the intersection over union of the i-th class.
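A compact way to compute class mIoU over a set of test episodes is sketched below. It assumes that intersections and unions are accumulated per class across all episodes before the ratio is taken, which is a common practice but an assumption here; the function name and the episode tuple layout are chosen for this example.

```python
def class_miou(episodes):
    """Class mean IoU over test episodes.

    episodes: iterable of (class_id, pred, gt), where pred and gt are
    binary masks given as nested lists of shape [h][w].
    """
    inter, union = {}, {}
    for class_id, pred, gt in episodes:
        i = u = 0
        for p_row, g_row in zip(pred, gt):
            for p, g in zip(p_row, g_row):
                i += int(p == 1 and g == 1)  # pixel in both masks
                u += int(p == 1 or g == 1)   # pixel in either mask
        inter[class_id] = inter.get(class_id, 0) + i
        union[class_id] = union.get(class_id, 0) + u
    # Classes whose union is empty are skipped to avoid division by zero.
    ious = [inter[c] / union[c] for c in inter if union[c] > 0]
    return sum(ious) / len(ious)
```

With one class predicted perfectly and another with half of the predicted area overlapping the ground truth, the class mIoU is the mean of 1.0 and 0.5.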

B. Results
As shown in Tables I and II, we adopt ResNet50 and ResNet101 to build our models for PASCAL-5$^i$ and COCO-20$^i$ respectively, and we report class mIoU results to demonstrate the performance of our proposed models. By incorporating the SDPM and SAAM, with an input image size of 321 × 321, which is smaller than the 473 × 473 used in previous works, our SD-AANet still achieves comparable state-of-the-art results on PASCAL-5$^i$ and reaches new state-of-the-art results on COCO-20$^i$ for the class mIoU metric.
In Table I, we compare our model with other state-of-the-art methods on PASCAL-5$^i$. On this dataset, ASNet [37] and HSNet [36] achieve results that are significantly higher than those of other methods. However, HSNet introduces 4D convolutions to integrate multi-level features; although center-pivot convolutions are used to decrease the space and time complexities, the costs are still huge. ASNet computes the correlations between each point in the support feature and those in the query feature, which introduces non-negligible computational cost. Nevertheless, our SD-AANet still shows advantages in some cases, such as the result on Fold-1 for the 1-shot task.
In Table II, our SD-AANet achieves new state-of-the-art performance on COCO-20$^i$ for both the 1-shot task and the 5-shot task, and surpasses CMN [57] by 1.6% and 5.6%, respectively. In addition, the SDPM and SAAM improve performance over our baseline by 2.4% and 3.7%.
We analyze the complexity and computational efficiency of the baseline model and SD-AANet in Tab. III. Compared to the baseline model, our SD-AANet increases the GPU memory cost during the inference phase and the number of learnable parameters by 12% and 30%, respectively. In addition, SD-AANet slightly reduces the inference speed from 18.75 to 17.65 frames per second (FPS).
In summary, owing to the introduction of SDPM and SAAM, SD-AANet achieves superior results on the few-shot segmentation task.

C. Ablation Study
1) Ablation Study of SDPM and SAAM: To quantitatively analyze the influence of SDPM and SAAM, we conduct an experiment on the performance of the model with and without the SDPM and SAAM. Table IV shows the class mIoU results of each model on PASCAL-5$^i$ for the 1-shot task.
It can be seen in Table IV that, compared to the baseline, using only SAAM or only SDPM improves performance, with class mIoU increases of 1.2% and 1.6% respectively. Adopting both SAAM and SDPM further improves performance, with a 2.5% class mIoU gain.
2) Ablation Study of Multi-scale Inference: Table V shows a comparison experiment between single-scale inference and  As show in Tab.VII, due to the difficulty of segmenting objects from multiply classes simultaneously, the results of multi-class 1-shot segmentation experiments are significantly lower than current few-shot segmentation results.However, Tab.VII still clearly shows the advantages of SD-AANet over the baseline model.On each fold, our SD-AANet can improve the performance of baseline model by at least 3.2 mIoU.For average result on all four folds, SD-AANet gets 3.9 mIoU increase compared to baseline.
We also analyze the results of the two models on each class. As shown in Tab. VIII, "Class1" to "Class5" denote the five classes in each fold, in order. We can see that for some classes, SD-AANet obtains only a slight improvement, such as "Class 5" of Fold-3, which is "tv/monitor". However, on some hard classes, such as "Class 4" of Fold-2, "motorbike", SD-AANet achieves remarkable progress.

D. Visualization Analysis
1) Qualitative Visualization of Segmentation Results:
To show the performance of our proposed architecture intuitively, we visualize the final prediction masks produced by our SD-AANet in Fig. 7. We also compare the segmentation results of the baseline and SD-AANet to evaluate the improvement realized by SD-AANet.
As shown in Fig. 7, columns (a) and (f) are support images with their ground truths marked in green. Columns (b) and (g) are query images, and columns (c) and (h) are their ground truths, also marked in green. Columns (d) and (i) are the predictions of the baseline, and columns (e) and (j) are the predictions of SD-AANet, marked in red.
As we can see, the second row in Fig. 7 shows cases with tremendous differences between the support objects and the query target. Taking columns (a) to (e) as an example, the bottles in the support image and the query image have totally different colors, shapes and perspectives, which leads to the segmentation failure of the baseline. By relying on the intrinsic features of the class extracted by SD-AANet, we can ignore the interference of other factors and successfully segment the bottle with negligible error. Other samples also confirm this point.
2) t-SNE Visualization of Support Prototypes: We conduct a t-distributed stochastic neighbor embedding (t-SNE) visualization experiment for support prototypes in Fig. 8. In Fig. 8, the four columns represent, in turn, the results of the baseline, the baseline with SAAM, the baseline with SDPM and SD-AANet, and the four rows represent the four folds from Fold-0 to Fold-3. We use each model to process 5000 samples of 5 novel classes and obtain 5000 support prototypes output from SDPM or the backbone. Then t-SNE is adopted to embed the prototypes into a 2-dimensional space for visualization, and the operation is repeated for the 4 methods on the 4 folds. As shown in Fig. 8, SDPM significantly expands the distance between prototypes of different classes and makes prototypes of the same class more compact, so the results in columns (c) and (d) are more distinct. Columns (b) and (d) show that SAAM can also make support prototypes more discriminative to a certain extent.
Taking Fold-1 and Fold-3 as examples, the figures for these two folds in the first column show that the prototypes of the 5 classes mix together and some classes are split across both ends of the figures. The second column, which represents the baseline with SAAM, distinguishes classes slightly more clearly, such as the grey points in Fold-1 and the orange points in Fold-3. In the third column, the baseline with SDPM is better at producing intrinsic prototypes, so points with the same color appear more compact, such as the grey points in Fold-1 and the blue points in Fold-3. Combining SAAM and SDPM, SD-AANet achieves the best performance, which can be clearly seen in the figures: clear dividing lines are visible in the fourth-column figures of Fold-1 and Fold-3.
3) Visualization of Performance on Small Target: To further analyze the segmentation performance of SD-AANet on small targets, we select the samples whose targets have fewer than 5000 pixels and conduct a comparison experiment between SD-AANet and the baseline. The samples are split into 10 parts, each spanning 500 pixels. We calculate the average class mIoU of each part for the two models and draw a histogram, shown in Fig. 9.
The results in Fig. 9 illustrate that our SD-AANet achieves higher class mIoU in all 10 parts, so SD-AANet outperforms the baseline on small-target segmentation.
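The binning behind such a histogram can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `mean_iou_by_target_size` and its arguments are hypothetical, and it averages one score per sample within each 500-pixel bin, which only approximates the class mIoU used in the paper.

```python
import numpy as np

def mean_iou_by_target_size(target_pixels, ious, max_pixels=5000, bin_width=500):
    """Average a per-sample score inside fixed-width bins of target size.

    target_pixels: number of foreground pixels of each query target (ints).
    ious: one score per sample (assumption: a per-sample IoU).
    Returns (means, counts), each of length max_pixels // bin_width.
    """
    target_pixels = np.asarray(target_pixels)
    ious = np.asarray(ious, dtype=float)
    n_bins = max_pixels // bin_width

    keep = target_pixels < max_pixels            # keep only small targets
    bin_idx = target_pixels[keep] // bin_width   # bin 0 covers 0..499 pixels, etc.

    sums = np.bincount(bin_idx, weights=ious[keep], minlength=n_bins)
    counts = np.bincount(bin_idx, minlength=n_bins)

    means = np.full(n_bins, np.nan)              # empty bins stay NaN
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means, counts
```

Calling the function once per model and plotting the two `means` arrays side by side would give a bar chart in the spirit of Fig. 9.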

4) Visualization of Affinity Attention:
To intuitively analyze the quality of the affinity attention produced by SAAM, we visualize the attention maps in Fig. 10. The first two rows are the query image and its ground truth label, respectively, and the third row is the affinity attention map, where colors closer to red (warm-toned) mean more attention and vice versa.
In the first three columns, we can see that SAAM effectively captures the spatial information of targets, even if they are small or there are multiple targets in one image. The last two columns show that SAAM can focus on large-scale targets and capture key information of separate parts, such as the rearview mirror of the bus and the wheels of the aeroplane, with only one support sample.
5) Visualization of Prototype Representativeness: To discuss how representative the support prototype is, we calculate the cosine similarity between the support prototype and its feature map:
$$\cos(x_{i,j}, p_s) = \frac{x_{i,j}^{T} \cdot p_s}{\|x_{i,j}\| \cdot \|p_s\|} \tag{13}$$
where $x_{i,j}$ denotes the vector in the support feature at location $(i, j)$ and $x_{i,j}^{T}$ denotes its transpose, $p_s$ denotes the support prototype output from SDPM, and $\|\cdot\|$ denotes the norm of a vector.
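As an illustration of Eq. (13), the similarity map can be computed for every spatial location at once. The sketch below assumes a channel-first feature layout of shape (C, H, W) and adds a small `eps` guard against zero-norm vectors; both are assumptions of this example rather than details given in the paper.

```python
import numpy as np

def cosine_similarity_map(feature, prototype, eps=1e-8):
    """Cosine similarity between a prototype and every location of a feature map.

    feature: array of shape (C, H, W), the support feature (layout assumed).
    prototype: array of shape (C,), the support prototype p_s.
    Returns an (H, W) map with values in [-1, 1].
    """
    feature = np.asarray(feature, dtype=float)
    prototype = np.asarray(prototype, dtype=float)

    # Numerator of Eq. (13): x_{i,j}^T p_s at every (i, j).
    dots = np.einsum("chw,c->hw", feature, prototype)

    # Denominator of Eq. (13): ||x_{i,j}|| * ||p_s||, guarded by eps (assumption).
    norms = np.linalg.norm(feature, axis=0) * np.linalg.norm(prototype)
    return dots / np.maximum(norms, eps)
```

Mapping the returned values to a warm-to-cold color scale gives a similarity map of the kind described for Fig. 11.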
Similar to the attention maps in Fig. 10, the similarity maps shown in Fig. 11 use warm-toned colors to represent high similarity. The three rows, from top to bottom, illustrate the support image with its ground truth, the similarity map produced by the baseline and the similarity map produced by SD-AANet. In the first four columns, we can see that the prototype produced by SD-AANet filters out more irrelevant background than the baseline, which means the prototype focuses on intrinsic features to capture the target regardless of other environmental factors. In the fifth column, the baseline fails to find the part of the sheep that is mixed up with the fence, while SD-AANet finds the whole spatial area of the sheep.
6) Visualization of Failure Cases: As shown in Fig. 12, SD-AANet still fails to segment some targets because the target is too small or is very similar to the background. The first column is the support image with its ground truth, and the next two columns are the query image and its ground truth, respectively. The fourth column is the attention map produced by SAAM and the last column is the prediction of SD-AANet. In the first row, the target is a very small bird, and there are great differences between the support object and the query target. In the second row, the bottle is very similar to the flower, which makes it hard to discriminate.

V. CONCLUSION
In this paper, we propose a novel few-shot segmentation method named SD-AANet. Our method differs significantly from existing methods by combining self-distillation, prototypical learning and affinity learning. To address the problem of large intra-class variation, we propose the self-distillation guided prototype module (SDPM) to extract an intrinsic prototype, which can efficiently align the features of support and query. We further construct a supervised affinity attention module (SAAM) to generate a high-quality prior attention map of the query image. Extensive experiments on two standard benchmarks verify the performance superiority of our method. Future work may focus on applying self-distillation to the zero-shot segmentation task.

Fig. 1. Two main challenges in the few-shot segmentation task. (a) The huge feature distinction between support and query, caused by differences in shooting angle, lighting conditions, and the shape and color of objects of the same category. (b) Limited support prototypes cannot adequately represent the features of support objects.

Fig. 2. Some failure cases of current methods. From top to bottom: (a) support image and its ground truth, (b) query image, (c) ground truth of query image, (d) prediction of current methods.

Fig. 3. Architecture of SD-AANet. In SD-AANet, the middle-level features of the CNN backbone and the support mask are input to SDPM and SAAM. SDPM uses the support prototype to realize channel reweighting on the support and query features. Then the self-distillation approach in SDPM produces the intrinsic support prototype. SAAM introduces support ground truth supervision to a learnable pyramid feature extractor, producing a high-quality query attention map. Finally, the fusion of the intrinsic prototype, query feature and query attention goes through the decoder to predict the final result.

Few-shot segmentation is proposed to segment targets of unseen classes under the guidance of limited annotated samples of the same classes. The dataset is split into two sets, a training set $D_{train}$ and a test set $D_{test}$, using the class as the split criterion. Denoting the classes in $D_{train}$ as $C_{train}$ and the classes in $D_{test}$ as $C_{test}$, the two class sets do not intersect, i.e., $C_{train} \cap C_{test} = \emptyset$.
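The class-disjoint split can be made concrete with a short sketch. It assumes the common PASCAL-5i convention in which the 20 classes are numbered 1 to 20 and fold i holds out five consecutive classes for testing; the helper name `split_classes` is hypothetical and this section does not itself spell out the fold assignment.

```python
def split_classes(num_classes, num_folds, fold):
    """Return (train_classes, test_classes) for one cross-validation fold.

    Assumption: classes are numbered 1..num_classes and each fold holds out
    a block of consecutive class indices as the unseen test classes.
    """
    per_fold = num_classes // num_folds
    test = set(range(fold * per_fold + 1, (fold + 1) * per_fold + 1))
    train = set(range(1, num_classes + 1)) - test
    return train, test


# Example: fold 1 of a 20-class, 4-fold split holds out classes 6..10.
train_classes, test_classes = split_classes(20, 4, 1)
assert train_classes.isdisjoint(test_classes)  # C_train and C_test do not intersect
```

Under this convention, "Class 4" of Fold-2 is class 14 and "Class 5" of Fold-3 is class 20, which matches the "motorbike" and "tv/monitor" examples discussed in the ablation study.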

Fig. 5. Two strategies of SDPM in the K-shot task. The Integral Teacher Prototype Strategy (a) uses the average of K reweighting vectors and masked GAP to produce the teacher prototype, while the Separate Teacher Prototype Strategy (b) produces an exclusive teacher prototype for each support prototype.

Algorithm 1: The learning process of Multi-class 1-shot Segmentation
Input: support set $S = \{(I_s^1, M_s^1), \cdots, (I_s^5, M_s^5)\}$, query image $I_q$ and mask $M_q$
Output: learnable parameters $p$ of SD-AANet
1 for each batch i = 0 to total batch do
2   Obtain support prototypes $v_s^i$ ($i = 1, \ldots, 5$);
3   Get attention maps $m_i$ ($i = 1, \ldots, 5$) and query feature $F_q$;
4   Concatenate $v_s^i$ to $v_c$;
5   $v_{new} = MLP(v_c)$;
6   Concatenate $v_{new}$, $F_q$ and $m_i$ as $F_{new}$;
7   $p = Decoder(F_{new})$;
8   Calculate $L_{ce}$, $L_{KD}$ and $L_{ce,s}$;
9   Train model with $L$
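The tensor shapes in steps 4 to 6 of Algorithm 1 can be traced with a small NumPy sketch. Everything beyond the step order is an assumption made for illustration: the sizes, the two-layer ReLU MLP with random weights, and the spatial tiling of $v_{new}$ before concatenation are not specified in this section, and the function name `fuse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 5, 8, 6, 6  # assumed sizes: 5 classes, 8 channels, 6x6 feature map

def fuse(prototypes, attention, query_feat, w1, w2):
    """Shape-level sketch of steps 4-6 of Algorithm 1.

    prototypes: (N, C) support prototypes, one per class.
    attention:  (N, H, W) attention maps, one per class.
    query_feat: (C, H, W) query feature F_q.
    w1, w2:     weights of an assumed two-layer MLP.
    """
    v_c = prototypes.reshape(-1)                 # step 4: concatenate -> (N*C,)
    v_new = np.maximum(v_c @ w1, 0.0) @ w2       # step 5: MLP with ReLU -> (C,)
    h, w = query_feat.shape[1:]
    # Assumption: tile v_new over the spatial grid so it can be concatenated.
    v_map = np.broadcast_to(v_new[:, None, None], (v_new.shape[0], h, w))
    return np.concatenate([v_map, query_feat, attention], axis=0)  # step 6

prototypes = rng.normal(size=(N, C))
attention = rng.random(size=(N, H, W))
query_feat = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(N * C, 16))
w2 = rng.normal(size=(16, C))

f_new = fuse(prototypes, attention, query_feat, w1, w2)
print(f_new.shape)  # channel count is C + C + N
```

In the actual model, the fused tensor would then be passed to the decoder (step 7) and trained with the losses listed in step 8.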

4) Ablation Study of Multi-class Segmentation: To further explore the potential of SD-AANet, we propose a new pipeline for segmenting multi-class objects simultaneously under the few-shot setting. We conduct experiments on PASCAL-5i using the baseline model and our SD-AANet. The experiments are based on the 1-shot setting.

Fig. 7. Qualitative results of the proposed SD-AANet and the baseline for the 1-shot task. Columns (a) to (e) are results on PASCAL-5i, and columns (f) to (j) are results on COCO-20i.

Fig. 8. Comparison of t-SNE visualizations of support prototypes. Each figure contains 5000 support prototypes generated from the same 5000 image pairs.

Fig. 9. Quantitative comparison of segmentation performance on small-scale targets between the baseline and SD-AANet on PASCAL-5i for 1-shot segmentation.

Fig. 10. Visualization of the affinity attention maps generated by SAAM on PASCAL-5i for 1-shot segmentation.

Fig. 11. Visualization of the support similarity maps generated from the support prototype and the support feature using cosine similarity. The comparison is conducted on PASCAL-5i for 1-shot segmentation.

TABLE I
COMPARISON WITH STATE-OF-THE-ART METHODS ON PASCAL-5i WITH CLASS MIOU METRIC. BASELINE IN TABLE FOLLOWS THE PFENET [34].

TABLE II
COMPARISON WITH STATE-OF-THE-ART METHODS ON COCO-20i WITH CLASS MIOU METRIC. BASELINE IN TABLE FOLLOWS THE PFENET [34]. MODELS WITH † ARE EVALUATED ON THE LABELS RESIZED TO 473 × 473 SIZE. MODELS WITHOUT † ARE TESTED ON LABELS WITH THE ORIGINAL SIZES.

TABLE IV
ABLATION STUDY OF OUR PROPOSED SDPM AND SAAM ON PASCAL-5i FOR 1-SHOT SEGMENTATION. "SD-AANET" MEANS THE MODEL WITH BOTH SDPM AND SAAM.

TABLE VI
ABLATION STUDY OF SDPM WITH INTEGRAL TEACHER PROTOTYPE STRATEGY OR SEPARATE TEACHER PROTOTYPE STRATEGY ON PASCAL-5i FOR 5-SHOT SEGMENTATION.
AANet. The experiment is conducted on PASCAL-5i for the 1-shot task. In this experiment, we resize the logits before the classification layer to 321 × 321 and 473 × 473, and apply the softmax operation to obtain two predictions of different sizes. Then we resize the 321 × 321 prediction to 473 × 473 using bilinear interpolation. Finally, the two 473 × 473 predictions are added to obtain the final prediction. The result shows that multi-scale inference slightly improves the performance, with a 0.3% class mIoU gain.
3) Ablation Study of 5-shot Strategy: Table VI studies the influence of different 5-shot strategies of SDPM. We design two strategies for SDPM in the 5-shot task, the Integral Teacher Prototype Strategy and the Separate Teacher Prototype Strategy, referred to below as the Integral Strategy and the Separate Strategy. The comparison experiment in Table VI indicates that the Separate Strategy achieves better results than the Integral Strategy. This means that assigning a unique teacher prototype to each support prototype facilitates the production of the intrinsic support prototype, which improves the segmentation performance on the query target.
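The multi-scale inference procedure described above (two rescaled predictions summed at 473 × 473) can be sketched in NumPy. This is a hedged illustration: the helper names are hypothetical, the input is treated as a (K, H, W) tensor of per-class scores, and the use of a corner-aligned bilinear grid for every resize is an assumption, since the text only states that the 321 × 321 prediction is upsampled bilinearly.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Resize a (K, H, W) array bilinearly on a corner-aligned grid (assumed)."""
    _, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]   # vertical interpolation weights
    wx = (xs - x0)[None, None, :]   # horizontal interpolation weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def softmax(z, axis=0):
    """Numerically stable softmax over the class axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_scale_predict(logits, sizes=(321, 473)):
    """Sum softmax predictions computed at several scales, then take the argmax.

    logits: (K, H, W) per-class scores (assumed input form).
    sizes:  square resolutions; the last one is the output resolution.
    """
    out = sizes[-1]
    total = np.zeros((logits.shape[0], out, out))
    for s in sizes:
        prob = softmax(bilinear_resize(logits, s, s))  # prediction at scale s
        total += bilinear_resize(prob, out, out)       # bring to output size
    return total.argmax(axis=0)
```

Summing probabilities rather than logits keeps the two scales on a comparable range, which is consistent with the description of applying softmax before the predictions are added.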