Hierarchically Structured Network With Attention Convolution and Embedding Propagation for Imbalanced Few-Shot Learning

Generally, a few-shot distribution shift leads to poor generalization. Furthermore, while the number of instances per class in the real world may differ significantly, existing few-shot classification methods assume that every class has an equal number of samples, which can render the trained classifier invalid. Moreover, although ResNet and WRN (Wide Residual Network) have achieved great success in the image processing field, the depth and width of CNNs constrain the performance of the conventional convolution layer. To overcome these problems, this paper proposes a novel few-shot classification model that uses learned balance variables to decide how much to learn from an imbalanced dataset and dynamically generates the convolution kernel for each input. In our model, to extend the decision boundaries and enhance the class representations, embedding propagation is used as a regularizer for manifold smoothing; manifold smoothing effectively addresses the above problems in the transductive setting. Embedding propagation interpolates between neural network features based on a similarity graph. Experiments show that embedding propagation produces a better embedding manifold and that our model achieves state-of-the-art results on standard few-shot datasets such as miniImagenet, tieredImagenet, and CUB. It significantly outperforms existing few-shot approaches, consistently improving the accuracy of the models by about 11%.


I. INTRODUCTION
Contemporary machine learning models usually use hand-designed, fixed learning algorithms trained from scratch for a specific task. Compared with classical machine learning, deep learning largely removes the need to hand-craft features: data are fed into the network, which learns features automatically, significantly improving model performance. Traditional machine learning and deep learning have achieved great success in many fields (images [1], [2], NLP [3], [4]).
However, these methods face pressing problems: in many cases, it is impossible to obtain a large number of labeled samples and robust computing power. Few-shot learning has been proposed to address these limitations.

(The associate editor coordinating the review of this manuscript and approving it for publication was Liviu-Adrian Cotfas.)
Like humans learning new concepts and skills, children can recognize something after seeing it only a few times, and people who can ride a bicycle may ride a motorcycle skillfully without ever being taught. How to quickly learn new concepts and skills from only a few training examples is the problem addressed by few-shot learning. Recently, two families of approaches have produced successful results in few-shot classification (FSC): metric learning and meta-learning. Prototypical Networks [9] and Relation Networks [10] are metric-learning methods. Moreover, [11] introduces terminology that distinguishes the data used within the training stage from the actual test data of the testing stage: the support set (training data within the training stage) and the query set (test data within the training stage); the actual data used in the testing stage is still called test data. The details of this definition are shown in Fig.1.

FIGURE 1. The distinction between support data, query data, and test data; the right part shows the definition of n-way-k-shot.

Another direction in few-shot learning is meta-learning [12], [13], which learns a set of initial network parameters that can effectively shift to new tasks through two-stage optimization. Transfer learning [14], [15] leverages prior knowledge well and learns the class representation.
However, a critical challenge in few-shot classification is that a model cannot fit a global set of parameters to all tasks from only a few images. To mitigate this problem, several methods have been used, such as dropout [15], batch normalization [15], manifold mixup [14], and some task-distribution algorithms [16]. Among these works, we found that manifold smoothing steadily improves classifier performance. Our method differs from manifold mixup [14] in that it operates between embeddings, which captures higher-order representation information.
The EP (embedding propagation) method in our model takes the features captured by the network and forms a set of interpolations based on their similarity in a graph. The pairwise similarities in the constructed graph are computed with a radial basis function (RBF). EP is applied on top of the feature extractor: it acts as a layer during the training stage, and at test time it yields a regularized manifold. The details of EP are shown in Fig.2.
ResNet and other networks have achieved many successful results in the image field. Nevertheless, to improve the ability to capture features and reduce the degradation of the convolutional network, our model uses Dynamic Convolution [17] in place of the conventional convolution layer. It dynamically yields the convolution kernel depending on the input by utilizing attention [18].
Furthermore, most published few-shot methods do not consider the imbalance within classes (in the real world, some classes have many instances while others have very few), so class imbalance causes classification results to favor the classes with more samples. Moreover, in the training stage, the model randomly chooses classes from the train set to construct n-way-k-shot episodes as proposed by [18]. As a result, some classes may be chosen more often than others, producing an imbalance in how classes are selected (class-chosen imbalance). In our work, the model learns balancing variables that decide how much it learns from specific classes, solving both imbalance problems.
We refer to our network as EPCINet (Embedding Propagation Dynamic Imbalance Network). For few-shot classification, EPCINet leverages the unlabeled data in the testing stage to yield better labels for the query set; following [18], [19], semi-supervised learning is used in our few-shot classification setup. For each unlabeled image, EPCINet chooses the class with the maximum predicted probability as the pseudo label. EPCINet then uses these pseudo labels together with the support set to perform label propagation and predict the labels of the query set. This approach achieves a significant improvement over the previous state-of-the-art in the 1-shot SSL setting. We hypothesize that EPCINet is effective in the SSL setting because of the properties of smoother manifolds [18].
Overall, EPCINet achieves state-of-the-art results on miniImagenet [11], CUB [20], and tieredImagenet [21] in few-shot classification. In our experiments, we evaluate variations of embedding propagation and their impact on the smoothness of the decision boundaries. Experiments show that, with EP, our model achieves a clear improvement compared to the same model without EP. Our main contributions can be summarized as follows.
First, we leverage embedding interpolations to capture higher-order feature information.
Second, the proposed model considers a problem easily overlooked by previous few-shot works: class imbalance and class-chosen imbalance in real application scenarios.
Third, dynamic convolution is used to improve the ability to capture features and to significantly reduce the degradation of the network.
Fourth, the effectiveness of the proposed methods is verified by ablation experiments.

II. RELATED WORK
Our work belongs to the few-shot classification domain. Nevertheless, manifold regularization, meta-learning, and dynamic convolution are also used in this model. This section discusses the recent related topics and their connections to our method.

A. FEW-SHOT CLASSIFICATION
Most existing meta-learning methods are based on bilevel optimization, the most famous being MAML [13], which proposed 'Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks'. However, traditional MAML requires substantial computation and often suffers from model oscillation. Reptile [22] was therefore proposed. This algorithm is similar to joint training: first-order updates are used instead of second-order ones to update the parameters, which significantly reduces the model's computation without losing gradient information. Experimental results show that the first-order approximation to the second-order update has almost no effect on model performance while dramatically reducing computational complexity. Based on this previous work, other meta-learning methods [23]-[25] were proposed.
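As an illustration of the first-order idea described above, the Reptile meta-update can be sketched in a few lines of NumPy. This is a toy sketch on invented quadratic tasks, not the authors' implementation; the task distribution, step counts, and learning rates are chosen purely for illustration.

```python
import numpy as np

def reptile_step(theta, task_grad_fn, inner_steps=5, inner_lr=0.1, meta_lr=0.5):
    """One Reptile meta-update: run plain SGD on a task, then move the
    initialization toward the adapted weights. Only first-order gradients
    are needed; no second-order derivatives are computed."""
    phi = theta.copy()
    for _ in range(inner_steps):
        phi -= inner_lr * task_grad_fn(phi)      # ordinary inner-loop SGD
    return theta + meta_lr * (phi - theta)       # interpolate toward the adapted weights

# Toy illustration: tasks are quadratics L_c(w) = ||w - c||^2 with centers c
# drawn around the origin; the meta-initialization should drift toward the
# mean of the task optima.
rng = np.random.default_rng(0)
theta = np.array([5.0, -5.0])
for _ in range(200):
    c = rng.normal(0.0, 1.0, size=2)             # this task's optimum
    grad = lambda w: 2.0 * (w - c)
    theta = reptile_step(theta, grad)
```

After a couple of hundred meta-updates, `theta` has moved from its distant start toward the region of the task optima, which is the behavior Reptile is designed to produce.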
Most metric methods use episodic training, and metric learning can also be seen as a part of meta-learning. Metric learning first learns a classifier over a feature space; in EPCINet, labels are propagated as in [26]. Graph-based methods also belong to this category [27]-[29]. GNNs [30] can produce the classification weights and leverage an autoencoder to regularize their representation; however, the unsupervised, non-parametric graph in our work is different from GNNs. Metric learning provides a convenient way (see Fig.3) to learn transferable features, and neural networks trained with conventional supervised learning already learn transferable features, which has been confirmed experimentally [15], [31]-[33]. Therefore, this paper fine-tunes by learning a classifier for a new task based on prior knowledge; this also yields a more productive feature extractor for the episodic scenario. In our pre-training stage, we use transfer learning. In the end, EPCINet integrates metric classifiers and transferred features to attain an adequate representation.

B. DYNAMIC CONVOLUTION
Traditional convolution has had great success in image processing. In general, a deeper network is more productive; nevertheless, depth often brings vanishing- and exploding-gradient problems. Previous work has proposed many productive CNN-based networks, such as ResNet [34] and WRN [35]. Compared with a simple CNN, these networks can capture more feature information, making them perform better. However, the above methods use traditional convolution, and light-weight CNNs suffer performance degradation under computation constraints. Dynamic convolution [17] was presented to overcome this issue: it improves model performance without increasing network depth. The general architecture is shown in Fig.4 and Fig.5. Applying attention to the input, it dynamically yields a specific convolution kernel per layer. Experiments [36], [37] indicate that dynamic convolution can improve results in the testing stage. Not surprisingly, our model obtains a more productive feature extractor, and ablation experiments demonstrate the effectiveness of dynamic convolution. The architecture of the dynamic convolution generator is shown in Fig.4.

C. REGULARIZATION FOR GENERALIZATION
Many methods have been used to overcome the generalization problem, such as dropout and batch normalization, which improve model robustness. However, a more effective regularization method, manifold regularization, gives better results in our work. The methods proposed in [14], [32], [38] soften the decision boundaries and improve the class-representation ability, which are key factors for improving model generalization [39].
Similarly, in the training stage, this paper smooths the manifold by incorporating embedding propagation on top of the extractor. Some recent works [32] have also applied manifold regularization to few-shot learning. Nevertheless, ours differs from previous works in the following ways. First, they do not consider the practical problems that arise when using manifold regularization. Second, the effectiveness of the regularization method is not verified by ablation experiments in their works. Third, we use cosine annealing to adjust the learning rate during training. On the few-shot classification benchmarks, our model achieves state-of-the-art classification accuracy on the miniImagenet and CUB datasets in the 1-shot, 5-shot, and 10-shot cases.

III. PROPOSED METHOD
This paper proposes a novel network called EPCINet built from the following components. The overall architecture of the dynamic convolution generator is shown in Fig.6. Each input yields a dynamic convolution kernel through the dynamic convolution layer, and the extractor extracts the image features. We then leverage a new embedding propagation method to construct the map between features and embeddings, and the classifier uses these embeddings to label images. This section explains EPCINet in detail.

A. DYNAMIC CONVOLUTION
1) DEFINITION
Dynamic convolution yields a dynamic convolution kernel for each input in each convolution layer instead of a fixed kernel per layer. To make this concrete, we introduce some notation. We denote the traditional perceptron as y = g(W^T x + b), where W is the weight matrix, b is the bias vector, and g is an activation function (e.g. ReLU). Our paper aggregates multiple (K) linear functions W_k^T x + b_k as follows:

y = g(W̃(x)^T x + b̃(x)),  W̃(x) = Σ_{k=1}^{K} π_k(x) W_k,  b̃(x) = Σ_{k=1}^{K} π_k(x) b_k,

where π_k(x) is the k-th attention weight for W_k^T x + b_k; W̃(x) and b̃(x) share the same attention scores.

2) ATTENTION
Each input x dynamically changes π_k(x), so the aggregated model y = g(W̃(x)^T x + b̃(x)) is a non-linear function of the input, which gives better performance. An SE module [18] is used to obtain the attention scores π_k(x): the global information is first squeezed by GAP (global average pooling); then a fully-connected layer, ReLU, and a second fully-connected layer follow; finally, a softmax layer yields the K attention weights for the input.
The aggregated kernel has the same dimensions as a standard kernel, W̃(x) ∈ R^{C_out × C_in × D_k × D_k}, where D_k is the kernel size and C_out is the number of output channels.
From CNNs to dynamic CNNs: dynamic convolution can be used in any network built from convolutions, such as ResNet and WRN. From [17], we know it does not add much computation; given the degree of performance improvement, the additional cost is worth it.
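To make the attention-based kernel aggregation concrete, the following NumPy sketch mirrors the GAP → FC → ReLU → FC → softmax branch and the weighted sum of the K candidate kernels. It is a minimal sketch with hypothetical shapes and random weights, not the implementation used in the paper; the convolution itself is omitted, since only the kernel generation is of interest here.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dynamic_kernel(x, kernels, biases, w1, w2):
    """Aggregate K candidate convolution kernels into one input-conditioned kernel.
    x:       input feature map, shape (C_in, H, W)
    kernels: K candidate kernels, shape (K, C_out, C_in, D, D)
    biases:  K candidate biases, shape (K, C_out)
    w1, w2:  weights of the two FC layers in the SE-style attention branch
    """
    squeeze = x.mean(axis=(1, 2))              # global average pooling -> (C_in,)
    hidden = np.maximum(0.0, w1 @ squeeze)     # FC + ReLU
    pi = softmax(w2 @ hidden)                  # K attention weights, summing to 1
    W = np.tensordot(pi, kernels, axes=1)      # W~(x) = sum_k pi_k(x) * W_k
    b = pi @ biases                            # b~(x) = sum_k pi_k(x) * b_k
    return W, b, pi

# Hypothetical shapes: K=4 kernels, 3 input channels, 8 output channels, 3x3.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
kernels = rng.standard_normal((4, 8, 3, 3, 3))
biases = rng.standard_normal((4, 8))
w1 = rng.standard_normal((6, 3))
w2 = rng.standard_normal((4, 6))
W, b, pi = dynamic_kernel(x, kernels, biases, w1, w2)
```

The aggregated kernel `W` has the shape of a single standard kernel, so it can be passed to an ordinary convolution routine; only the kernel's values depend on the input.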

B. EMBEDDING PROPAGATION
In an episode, after applying the feature extractor (dynamic CNN), we obtain feature vectors z_i ∈ R^m. Embedding propagation takes this set of vectors and outputs embeddings z̃_i ∈ R^m in two steps.
First, for each pair of features (i, j), our model computes the squared distance d²_ij = ||z_i − z_j||² and the adjacency A_ij = exp(−d²_ij / σ²), where σ² is a scaling variable and A_ii = 0, as mentioned in TPN [22]. In our work, we set σ² = Var(d²_ij) to make the model more stable.
Next, we compute the normalized graph operator (4) from the adjacency matrix and the propagator matrix P by (5), as in the label propagation formulation of [39]:

L = D^{-1/2} A D^{-1/2},  D = diag(Σ_j A_ij),   (4)

P = (I − αL)^{-1},   (5)

where α ∈ R is a controlling variable and I is the identity matrix. Finally, we compute the embeddings by (6):

Z̃ = P Z.   (6)
Because each z̃_i is a weighted combination of its neighbors, useless noise is removed from the feature vectors. EP is easy to apply and is compatible with many extractors and classifiers. More significantly, its computational cost is negligible in few-shot episodes [22], since each episode contains only a few samples.
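The two EP steps, Eqs. (4)-(6), can be sketched directly in NumPy. This is a minimal standalone sketch operating on a random feature matrix; in the paper, EP sits on top of the feature extractor inside the network.

```python
import numpy as np

def embedding_propagation(Z, alpha=0.5):
    """Z: (n, m) feature vectors of one episode. Returns the propagated
    embeddings P @ Z, following the steps of Sec. III-B: RBF adjacency with
    zero diagonal and sigma^2 = Var(d_ij^2), symmetric normalization, and
    the propagator P = (I - alpha * L)^{-1}."""
    n = Z.shape[0]
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    sigma2 = d2.var()                                      # sigma^2 = Var(d_ij^2)
    A = np.exp(-d2 / sigma2)
    np.fill_diagonal(A, 0.0)                               # A_ii = 0
    Dinv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
    L = Dinv_sqrt @ A @ Dinv_sqrt                          # normalized operator, Eq. (4)
    P = np.linalg.inv(np.eye(n) - alpha * L)               # propagator, Eq. (5)
    return P @ Z, P                                        # embeddings, Eq. (6)

Z = np.random.default_rng(0).standard_normal((10, 5))
Z_prop, P = embedding_propagation(Z)
```

Each propagated embedding is a weighted average over the whole episode, which is exactly the smoothing effect described above; note that P is symmetric because L is.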

C. FEW-SHOT SETUP
Previous work [1] proposed the few-shot setup. We use three datasets: a train set (D_train); a test set whose samples X_j belong to unseen classes, which is used to measure the model's transfer ability; and a validation set D_val used for optimizing the model's parameters.
Furthermore, each episode chooses n classes without repetition from all classes and k samples per class, called n-way-k-shot. To distinguish the train and test splits within the training stage, we call them the support set and the query set, respectively.
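The episode construction described above can be sketched as a small sampler. This is an illustrative sketch with hypothetical argument names, not the authors' data pipeline; it assumes a flat label array and draws disjoint support and query indices per class.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, q_queries, rng):
    """Sample one n-way-k-shot episode: choose n classes without repetition,
    then k support and q query indices per class, kept disjoint."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                       # k support samples
        query.extend(idx[k_shot:k_shot + q_queries])       # q query samples
    return np.array(support), np.array(query), classes

# Toy dataset: 10 classes with 20 samples each; a 5-way-1-shot episode
# with 15 queries per class, as in the fine-tuning setup.
labels = np.repeat(np.arange(10), 20)
rng = np.random.default_rng(0)
support, query, classes = sample_episode(labels, 5, 1, 15, rng)
```

Because the per-class index permutation is split once, a sample can never appear in both the support and the query set of the same episode.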

D. INFERENCE PHASE
In the inference phase, the model applies EP to the features of the input images and then performs label propagation. In general, we obtain the matrix of propagated embeddings Z̃ ∈ R^{(k+q)×m} by applying (1)-(6) to the support and query sets, where P is the propagator matrix. We predict the labels of the query set by label propagation as described in [37].
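The label propagation step above can be sketched as follows: one-hot support labels and all-zero query labels are multiplied by the propagator P, and each query takes the argmax class. The propagator construction mirrors Sec. III-B; the 1-D toy points are invented for illustration.

```python
import numpy as np

def build_propagator(X, alpha=0.5):
    """Propagator over stacked support+query features, as in Sec. III-B:
    RBF adjacency (zero diagonal), symmetric normalization, P = (I - alpha*L)^-1."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / d2.var())
    np.fill_diagonal(A, 0.0)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
    L = Dinv_sqrt @ A @ Dinv_sqrt
    return np.linalg.inv(np.eye(len(X)) - alpha * L)

def label_propagation(P, support_labels, n_classes):
    """Support rows come first in P; query rows carry all-zero label vectors."""
    k = len(support_labels)
    Y = np.zeros((P.shape[0], n_classes))
    Y[np.arange(k), support_labels] = 1.0      # one-hot support labels
    scores = P @ Y                              # propagate labels along the graph
    return scores[k:].argmax(axis=1)            # predicted query labels

# Two support points (classes 0 and 1) and two queries, each near one support
# sample: every query should take the label of its closest support point.
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.2, 0.0], [3.8, 0.0]])
pred = label_propagation(build_propagator(X), np.array([0, 1]), n_classes=2)
```

This transductive step is what lets the unlabeled query (or pseudo-labeled) points influence each other through the graph rather than being classified independently.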

E. TRAINING PROCEDURE
The model is trained in two stages. In the first stage, to learn a general representation, our method pre-trains over D_train. The second stage generalizes to novel classes, with episodes constructed from D_train. In both stages, an extractor F_θ(X), parametrized by θ, captures the features Z of image X.

F. PRE-TRAINING STAGE
As described in Fig.6, we train F_θ(X) with two simple linear classifiers, parametrized by W_l and W_r respectively. At the start of the training stage, the classifier predicts the class labels, and we optimize the model with a cross-entropy loss:

L_ce = − Σ_i log p(y_i | x_i; θ, W_l),

where y_i ∈ Y_b and the probabilities are computed with a softmax.

G. EPISODIC LEARNING STAGE
In this stage, we use the learned model to generalize to new classes. The model again uses two classifiers to handle images: the first uses label propagation, computing class probabilities over the query set.
The second classifier is the same as the W_l-based classifier of the pre-training stage. We therefore minimize the sum of the following losses:

IV. VARIATIONAL INFERENCE
The conditional log-likelihood log p(Y_T, Ỹ_T | X_T, X̃_T; θ) of the joint datasets D_T and D̃_T must be maximized for each task. However, computing the posterior p(φ_T | D_T, D̃_T) is intractable, as shown in Fig.7. Variational inference is therefore used to overcome this limitation: the model introduces a tractable approximate posterior q(φ_T | D_T, D̃_T; ψ), parameterized by ψ. The same network generates the two variables to minimize the computational cost.
For consistency, we would have to observe the whole test dataset in the training stage, but at test time its labels are unavailable. Thus, the new approximate posterior is q(φ_T | D_T; ψ), which reduces the computational difficulty of the inference framework while the evidence lower bound [43] remains valid. To make the log-likelihood contain only test samples with an appropriate scaling factor, the objective function is modified slightly; our expected objective is as follows. Each dimension of q follows a univariate Gaussian with trainable mean and variance, and each dimension of the prior p(ψ_T) is N(0, 1), so the KL-divergence takes an especially simple closed form [41].
The final objective uses a Monte-Carlo (MC) approximation. To keep the gradient estimate for ψ stable and unbiased, we apply the reparameterization trick to φ_T [41]. For computational efficiency, S = 1 MC sample is used during training; in the testing stage, we set S = 10.
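The reparameterization trick and the closed-form Gaussian KL mentioned above can be sketched in NumPy. This is a generic sketch of the two standard formulas, with invented toy values, not the paper's full variational objective.

```python
import numpy as np

def reparameterize(mu, log_var, S, rng):
    """Draw S Monte-Carlo samples phi = mu + sigma * eps, eps ~ N(0, I).
    Since the randomness sits in eps, gradients flow through (mu, log_var)."""
    eps = rng.standard_normal((S,) + mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.array([0.5, -0.5]), np.zeros(2)
phi_train = reparameterize(mu, log_var, S=1, rng=rng)    # S = 1 during training
phi_test = reparameterize(mu, log_var, S=10, rng=rng)    # S = 10 at test time
```

When the approximate posterior coincides with the N(0, 1) prior, the KL term vanishes, which is the sanity check worth keeping in mind when implementing this objective.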

V. DATASET ENCODING
To model the variational posterior, this paper encodes the training dataset D_T into a designated representation, which is critical. Simple statistics (mean, variance, and cardinality) computed over D_T capture the information needed to identify class imbalance. Prior work often uses sum-pooling or mean-pooling as an efficient set encoder, where each instance is processed by a shared network and the results are pooled into a vector summarizing the set. However, on a classification dataset, this flat, non-hierarchical structure ignores the label information and performs poorly. Thus, this paper uses a two-layer hierarchy: the first layer encodes each class from its set of samples, and the second encodes the dataset as the set of class codes. From [26], experiments show that mean pooling is limited because it discards the set size, giving a wrong impression of an imbalanced dataset; our model therefore feeds the number of instances into the encoder. Since cardinality alone is still not enough to capture the important information about a dataset, this paper encodes the variance of the set as well.
In summary, this model calls the set encoder StatisticsPooling(*), which generates a series of set statistics: mean, variance, and cardinality.
Table 1 examines the effectiveness of the dataset encoding with an ablation study. Mean and var denote the mean and variance of the dataset, respectively, and N is the set cardinality. We find that element-wise sample skewness and kurtosis hardly improve performance, but the set cardinality does.
where c = 1, . . . , C, and X_c^T denotes the instances of class c in task T. NN_1 and NN_2 are neural networks, and V^T contains the dataset statistics. The balancing variables ω_c^T are generated from S_c (see Fig.8).
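The two-layer hierarchical set encoder can be sketched as follows. In this minimal sketch, the networks NN_1 and NN_2 are replaced by identity maps (an assumption made purely to keep the example self-contained); only the StatisticsPooling structure, with its mean, variance, and cardinality outputs, is illustrated.

```python
import numpy as np

def statistics_pooling(feats):
    """StatisticsPooling(*): concatenate the element-wise mean, the
    element-wise variance, and the set cardinality into one vector."""
    n = feats.shape[0]
    return np.concatenate([feats.mean(0), feats.var(0), [float(n)]])

def encode_dataset(class_feats):
    """Two-layer hierarchical set encoder: pool within each class first
    (so the label structure and per-class counts survive), then pool the
    class codes into a single dataset representation V.
    NN_1 and NN_2 are stood in for by identity maps in this sketch."""
    class_codes = np.stack([statistics_pooling(f) for f in class_feats])
    return statistics_pooling(class_codes)

# A toy imbalanced task: class 0 has 20 samples, class 1 only 5.
rng = np.random.default_rng(0)
class_feats = [rng.standard_normal((20, 4)), rng.standard_normal((5, 4))]
V = encode_dataset(class_feats)
```

Note that the per-class cardinalities survive into the class codes, which is exactly what lets the downstream network detect class imbalance; a flat mean-pooled encoder would discard them.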

VI. LEARN TO BALANCE
Following [11], we enlarge the definition of n-way-k-shot and the process of generating tasks. D contains a train set D_T = (X_T, Y_T) and a test set D̃_T = (X̃_T, Ỹ_T). In each episode, this paper randomly chooses n classes and k instances belonging to each of them. Here we identify two kinds of imbalance (class imbalance and class-chosen imbalance). Existing few-shot approaches do not take these imbalances into account, preventing them from efficiently solving realistic problems.
1) Class imbalance: few-shot methods have no mechanism for handling class imbalance. Thus, classes with more samples may dominate task-specific learning in the training stage, while the fewer-shot classes get poor representations (e.g., class A has 500 samples while class B has ten).
2) Class-chosen imbalance: as in previous work [1], the model randomly draws n-way-k-shot tasks in each episode, so some classes are chosen more often than others. When the dataset is large and there are many episodes, this imbalance becomes more serious: frequently chosen categories are learned more fully, and vice versa. In the end, some categories are selected far more often than others, leading to insufficient training for the rest and hurting classifier performance.

A. TACKLING IMBALANCE
In view of the above two kinds of imbalance, this paper proposes the following solutions.
1) Tackling class imbalance: for each inner-optimization step, we set a class-specific learning rate. For categories c = 1, . . . , C, we use a set of class-specific scalars ω^T = (ω_1^T, . . . , ω_C^T) ∈ [0, 1]^C; for each gradient-descent step, the model obtains the per-class gradients ∇_θ L(θ; D_1^T), . . . , ∇_θ L(θ; D_C^T). When a category has very few samples, ω_c^T is large.
2) Tackling class-chosen imbalance: to control the ratio between the number of times the current task has been selected and the total number of iterations of the whole model, we introduce task-specific rates γ^T = (γ_1^T, . . . , γ_C^T) ∈ [0, 1)^C for categories c = 1, . . . , C, so the learning-rate scalars become γ_1^T α, . . . , γ_C^T α. The fewer times a category has been selected, the larger γ^T is. By controlling the relationship between these variables and the learning rate, the model determines how much to learn.
Finally, with all the proposed variables integrated into the original meta-learning framework, the update rule becomes

θ^k = θ^{k−1} − γ^T ∘ α · Σ_{c=1}^{C} ω_c^T ∇_θ L(θ^{k−1}; D_c^T),  k = 1, . . . , K,

where α is the global learning rate and ∘ is element-wise multiplication. The last step θ^K corresponds to the task-specific predictor θ_T.
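The balanced inner-loop step can be sketched as a plain gradient-descent update in NumPy. This is an illustrative sketch under the description above, with invented toy gradients, not the authors' meta-learning code: ω reweights each class's gradient and γ rescales the global learning rate per class.

```python
import numpy as np

def balanced_inner_update(theta, per_class_grads, omega, gamma, alpha):
    """One balanced inner-loop step: omega[c] in [0, 1] reweights the
    gradient of class c (larger for rarer classes), and gamma[c] in [0, 1)
    rescales the global learning rate alpha according to how often the
    class has been chosen across episodes."""
    step = sum(g_c * w_c * grad_c
               for g_c, w_c, grad_c in zip(gamma, omega, per_class_grads))
    return theta - alpha * step

# Toy example: class 0 is rare (large omega), class 1 is common (small omega),
# both with the same selection rate gamma.
theta = np.zeros(2)
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # per-class gradients
theta = balanced_inner_update(theta, grads, omega=[0.9, 0.1],
                              gamma=[0.5, 0.5], alpha=1.0)
```

With these values, the rare class moves the parameters nine times as far as the common one, which is the rebalancing effect the learned variables are meant to produce.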

B. GENERATING IMBALANCE VARIABLE
As shown in Fig.8, our model introduces γ and ω to reduce the impact of data imbalance. First, each input is processed by the NN_1 network, whose structure is shown in Fig.9(a). The results are passed to statistics pooling (the SP principle is described in Section V, DATASET ENCODING), which yields useful statistics such as the mean, variance, and set cardinality. The results are then processed by the NN_2 network (Fig.9(b)) and pooled again by SP to obtain filtered statistics. Using the original mean, variance, and cardinality statistics together with the filtered ones, the balancing variables ω and γ are generated.

VII. EXPERIMENTS
To verify the effectiveness of the proposed model, experiments are conducted on the following standard FSC datasets: miniImagenet, CUB, and tieredImagenet. We also use ablation experiments to verify the effectiveness of each proposed component. At the end of this section, we compare our results with previous state-of-the-art models.

A. DATASETS
MiniImagenet is a subset of the Imagenet dataset with 100 classes and 600 instances per class.

1) IN THE TRAINING STAGE
we use SGD to optimize the model with a learning rate of 0.01 for 100 epochs, reduced by 0.001 every 10 epochs. If the model goes more than thirty epochs without improvement, training is terminated early.

2) FOR THE FINE-TUNING STAGE
As mentioned above, in each episode, n classes and k samples per class are selected for the support set, and 15 samples per class for the query set. We use SGD with a learning rate of 0.001 to optimize the model; the rest is the same as in the training stage. In addition, all conventional convolution layers in our network are replaced with dynamic convolution layers.

3) FOR WRN (WIDE RESIDUAL NETWORKS)
we only use simple data augmentation approaches [44] with these architectures.

C. IMBALANCE SETUP
To simulate natural imbalance conditions, we randomly chose 40 classes of the miniImagenet dataset and set the number of instances in each of them to 100; the other classes still contain 600 instances.

D. THE EXPERIENCE RESULTS
Table 2 compares our model with previous state-of-the-art models on miniImagenet and CUB in the 1-shot and 5-shot settings. Parameter counts are shown in the second column, and the third and fourth columns show the 1-shot and 5-shot accuracy on the corresponding dataset.
This is in line with our conjecture that, as a substitute for conventional convolution, dynamic convolution can significantly improve the network's performance.

E. EXTRA RESULTS ABOUT A SIMILAR NETWORK
EPNet and our network are evaluated on tieredImagenet with 84 × 84 images. EPNet uses a 12-layer residual network; MMNet uses manifold mixup. The detailed comparison is in Table 3.

F. SEMI-SCENARIOS
SSL with 100 unlabeled samples: *NET and EPCINet are identical except that *NET omits EP (embedding propagation). Results are shown in Table 4, Table 5, and Table 6.

VIII. ABLATION STUDIES
To verify the effectiveness of the proposed methods, this paper applies ablation studies to dynamic convolution, EP (embedding propagation), the balance variables, and LP (label propagation) on 1-shot miniImagenet, as shown in Table 7.
A. ALGORITHM ABLATION
In Table 7, the ablation separately shows the improvement in image-classification accuracy contributed by dynamic convolution, LP (label propagation), EP (embedding propagation), and the balance variables on 1-shot miniImagenet.
We apply the algorithm ablation to conv-4 on 1-shot miniImagenet. Interestingly, when the dynamic convolution layer replaces the conventional convolution layer, the improvement is the largest among the methods proposed in our model.

B. BALANCE VARIABLE ABLATION
BV1 handles class imbalance. Table 8 shows how accuracy varies with the degree of imbalance; BV1+EPNet gives better results than EPNet. The improvement shown in Fig.10 demonstrates that our balance variable is influential: BV1+EPNet learns an appropriate balance variable and performs better on 1-shot miniImagenet using WRN-28-10. To verify the effectiveness of BV1 and BV2, EPCINet is used as the baseline for their ablation experiments. For simplicity, we use only 500 episodes on 1-shot miniImagenet with WRN-28-10. As shown in Fig.11, using BV1 and BV2 with EPCINet improves the model's accuracy by about 3% and 4%, respectively.

IX. CONCLUSION
Our work proposes a novel few-shot classification model, EPCINet, which handles the distribution-shift limitations of few-shot learning by using EP and LP. Moreover, we fully consider the imbalance problem of the real world, which none of the existing few-shot classification methods takes into account, by learning balance variables that control the degree of imbalance. Furthermore, to the best of our knowledge, dynamic convolution is introduced into few-shot classification for the first time. Empirically, ablation studies prove the effectiveness of the above methods, and our model achieves state-of-the-art results in the 1-shot and 5-shot cases on three standard few-shot classification datasets (miniImagenet, tieredImagenet, CUB). We are also interested in few-shot tasks with multiple labels, a direction for future work, and in applying the model to NLP and other fields.