Multi-Scale Decision Network With Feature Fusion and Weighting for Few-Shot Learning



I. INTRODUCTION
A. BACKGROUND
In the past few years, the performance of image recognition models [1]-[7] in deep learning has improved significantly on the benchmark datasets [8]-[11]. These models typically rely on deep convolutional networks and large-scale labelled training examples, which increases the number of parameters and the amount of computation and yields a high training cost. Moreover, they can only recognize the image classes present in the training data, which limits the further development of image recognition. Unlike machines, humans can learn easily and effectively from a few training examples.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li .
For example, when children are shown a few pictures of tigers, they can remember what tigers look like. Researchers hope that machines can also have such an ability; thus the concept of few-shot learning [12], [13] was proposed in machine learning.
Few-shot learning aims to learn information about object classes from one or only a few labelled images. A straightforward idea is to fine-tune a deep neural network on few-shot datasets. However, this leads to severe over-fitting, because only a few training examples cannot accurately represent the real data distribution, which harms the generalization ability of the model [14].
Meta-learning, also known as learning to learn [15], is a promising approach to few-shot recognition. In the training stage of meta-learning, the dataset is divided into different meta tasks so that the model learns to generalize as the image classes change. In the testing stage, the recognition task can be fulfilled for new image classes without changing the existing model. Currently, few-shot learning algorithms based on meta-learning mainly include model optimization-based methods [16]-[19], metric learning-based methods [20]-[28], data augmentation-based methods [29]-[35] and attention-based methods [21], [36]-[40]. Among these, the metric learning-based method is regarded as a simple and effective way to solve the few-shot recognition problem.
Metric learning-based methods divide the few-shot recognition task into two stages: 1) The image features of the support set and the query set are extracted, respectively. 2) A metric algorithm is used to classify the test images by comparing the distance or similarity between the support set feature and the query set feature. Some metric learning-based methods [20]-[22], [24] have attracted attention. However, they still cannot achieve high recognition accuracy. We summarize three reasons for the low accuracy of these models as follows:
1) They only exploit the single-scale feature (top-layer feature) of the images and ignore the low-layer features that are also critical for few-shot recognition. A traditional convolutional neural network (CNN) can extract discriminative feature information. For example, the low layers of a CNN capture high-resolution details such as texture and contour, while the high layers capture strong semantic representations [41]. Since few-shot learning cannot obtain enough examples, it is necessary to exploit multi-scale feature information on the limited dataset. Reference [42] shows that recognition accuracy can be improved by adding low-layer texture features. In addition, using only top-layer feature information leads to severe information loss, especially for small-sized images.
2) The methods in [20]-[22] use a fixed metric (e.g., Euclidean distance or cosine similarity), which lacks a non-linear internal structure that captures the similarity between features. Although [24] learns a non-linear metric through a comparable convolutional network, called Relation Network, it simply concatenates the features of the support set and the query set in the channel dimension before training Relation Network, which leads to insufficient feature fusion. Therefore, none of them can establish a good connection between the features and learn the features adaptively.
3) In [20], [22] and [24], the image features of the same class and of different classes are treated equally, so the metric function has no point of emphasis.
Based on the analysis above, this paper studies three main issues. 1) How to sufficiently extract features? 2) How to enhance the adaptability of features when measuring their similarity? 3) How to pay more attention to images of the same class? Therefore, we propose a multi-scale decision network with feature fusion and weighting for few-shot learning (MSDN) to solve these issues. MSDN not only exploits multi-layer features to capture abundant image information, but also enables the metric function to measure the similarity of features better and to place more emphasis on images of the same class.

B. MAIN CONTRIBUTIONS
Our paper has four main contributions.
1) The idea of multi-scale is exploited to extract features sufficiently. After calculating the relation score [24] of each layer feature, the test images are classified via a well-designed voting strategy. The useful information of every layer is considered, which makes the multi-scale recognition more accurate.
2) A non-linear feature fusion item is designed to fuse the features of the support set and the query set. It enhances the adaptability of features and improves the ability of the metric function.
3) An attention mechanism is introduced by calculating the cosine similarity between the support set and the query set features as the weight of each layer feature vector. It makes the metric function place more emphasis on images of the same class during training.
4) The binary cross entropy (BCE) loss is used during training.
Our model achieves state-of-the-art accuracy on the Omniglot and miniImageNet datasets compared with popular few-shot recognition models [16], [18], [21], [22], [24]-[26], [33], [34], [37], [40]. Meanwhile, the ablation experiments on the miniImageNet dataset also demonstrate the effectiveness of every improvement in our model.
The structure of this paper is organized as follows. Section I introduces the development of few-shot learning and the main contributions of this paper. Section II summarizes the related work on few-shot learning. Section III presents the method and model of this paper. Section IV analyzes and discusses the experimental performance. Finally, Section V summarizes the study and outlines future work.

II. RELATED WORK
A. MODEL OPTIMIZATION-BASED METHODS
Model-agnostic meta-learning (MAML) [16] is an effective few-shot learning algorithm. It can be trained on different meta tasks with a few steps of gradient update and quickly generalizes well to a new task. Reference [17] introduces a new algorithm, called Reptile, which can be regarded as an updated version of MAML. It learns a parameter initialization of the neural network that can be adjusted with a small amount of new task data. Unlike MAML, which unrolls the computational graph of the gradient descent algorithm, Reptile performs standard stochastic gradient descent (SGD) on each task, which requires less computation and memory. Meta-Learner LSTM [18] observes that the update rule of Long Short-Term Memory (LSTM) is very similar to that of SGD, so an LSTM architecture is used to train the meta-learning model with a good parameter initialization. However, it suffers from the need to fine-tune on the target problem. Reference [19] proposes UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning for classification tasks. Statistical diversity properties and domain-specific augmentations are used to generate the training and validation data for synthetic tasks. It can be applied to other tasks as well, such as video classification.

B. METRIC LEARNING-BASED METHODS
Siamese Network [20] is a special neural network architecture with weight sharing. It takes two images as input to extract image features, and calculates the Euclidean distance between the training data and the test data to measure their similarity. Matching Network [21] encodes the support set and the query set by using different LSTM architectures and compares them by using a weighted (attention-based) metric function. Once the model is trained, it can produce sensible test labels for unobserved classes without any changes to the network. Prototype Network [22] learns an embedding space by computing the mean of the support set to get the prototype representation of each class. Therefore, the few-shot recognition task can be regarded as finding the nearest neighbor in the embedding space. The task dependent adaptive metric model (TADAM) [23] uses the idea of Prototype Network to construct a class representation for metric calculation. It introduces several techniques, including metric scaling, task conditioning and auxiliary task co-training, to improve general metric learning-based methods. Relation Network [24] argues that the metric is also a critical factor influencing the recognition result. Therefore, it trains a comparable convolutional network to learn a non-linear metric in the embedding space instead of using a fixed metric distance. Our proposed model (MSDN) can be regarded as an improvement of Relation Network as well. Reference [25] proposes Prototype-Relation Network (PRN) by combining the ideas of Prototype Network and Relation Network. Moreover, PRN designs a novel loss function, which takes both inter-class and intra-class distance into account. References [26] and [27] use the graph neural network (GNN) to solve the few-shot recognition problem. According to the similarity between nodes, GNN selectively propagates the image information of the existing labels to the test image that is most similar to it. Reference [28] proposes a hybrid meta-learning model, called Meta-Metric-Learner, which combines the benefits of optimization-based and metric-based methods. It demonstrates the effectiveness of fusing different few-shot learning methods.

C. DATA AUGMENTATION-BASED METHODS
Attribute Guided Augmentation (AGA) [29] uses an attribute-guided method to augment the training examples by mapping images into an attribute space. The method can be applied to few-shot recognition in a transfer-learning setting without prior knowledge of the new classes, and to object-based few-shot scene recognition. References [30] and [31] use the idea of hallucination to synthesize new labelled training examples. Reference [32] proposes DAGAN, which uses a generative adversarial network (GAN) to generate new examples. The model can be applied to novel unseen classes because the generative process does not depend on the classes themselves. MetaGAN [33] also draws on the idea of generative adversarial networks. It exploits the imperfect generator in a GAN to generate fake data between the manifolds of different real data classes, which provides additional training signals to the classifier and makes the decision boundaries much sharper. Reference [34] proposes Adaptive Learning Knowledge Networks (ALKN) to learn the knowledge of different classes from the features of labeled samples and store the learned knowledge in a memory that is dynamically updated during the learning process. This knowledge augmentation can make up for the lack of training samples. Reference [35] proposes a novel auto-encoder network, dual TriNet, for feature augmentation. It can directly synthesize multi-layer instance features by utilizing semantic information to solve the few-shot recognition problem.

D. ATTENTION-BASED METHODS
Simple Neural Attentive Learner (SNAIL) [36] is a simple meta-learner model with attention. By combining temporal convolutions and soft attention, it overcomes the bottleneck of the meta-learner in internalizing and referring to past experience. Meta Network [37] uses meta information to produce the fast weight. The fast weight and the slow weight are combined to classify the test images. Meanwhile, the similarity between the memory index and the input embedding is calculated by using cosine similarity as attention. Matching Network [21] also introduces the attention mechanism to predict the output class label. The attention mechanism can take a simple form, such as a Softmax function over the cosine distance. Attentive Matching Network (AMN) [38] proposes a feature-level attention mechanism that helps the similarity function place more emphasis on the features that better reflect the inter-class differences and helps the embedding network learn better feature extraction. Moreover, AMN also learns a discriminative embedding space that maximizes inter-class distance and minimizes intra-class distance. Reference [39] extends the target recognition system by using a few-shot recognition weight generator with an attention mechanism, and redesigns the classifier of the convolutional network model with cosine similarity. It is able to quickly learn new classes without sacrificing the initial accuracy of training. Reference [40] applies a channel attention and spatial attention module (C-SAM) to Relation Network. It can mine more effective information by using samples of different classes that exist in different tasks.

III. METHOD AND MODEL
In this section, we first define the notation and terminology of few-shot learning. Then, we propose our model MSDN and introduce its design details, including the loss function, voting strategy, feature fusion and feature weighting. Finally, the training algorithm is given to describe the training procedure.

A. PROBLEM SETUP
Meta-learning has been widely used in the field of few-shot learning. In general, meta-learning divides the dataset into training tasks and test tasks (sometimes validation tasks may be required). During training, we randomly extract C × K samples to construct the support set of a meta task, where C is the number of classes and K is the number of images per class. Then, we extract a batch of samples from the remaining images of the C classes as the query set. The target of few-shot learning is to learn how to classify the C classes from a series of different meta tasks. This is called the C-way K-shot problem. During testing, the testing dataset is also divided into the support set and the query set. By feeding them into the trained model, the test images in the query set can be classified.
Formally, the support set and the query set can be formulated as, respectively:
$$S = \{(x_s^i, y_s^i)\}_{i=1}^{C \times K} \tag{1}$$
$$Q = \{(x_q^j, y_q^j)\}_{j=1}^{N} \tag{2}$$
$S$ denotes the support set. $x_s$ and $y_s$ denote an image and its corresponding class label from the support set, respectively.
$Q$ denotes the query set and $N$ denotes the number of images in the query set. $x_q$ and $y_q$ denote an image and its corresponding class label from the query set, respectively.
When $K = 1$, the problem is one-shot learning; when $K > 1$, the problem is few-shot learning.
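The episode construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_episode`, the `n_query` parameter (query images per class, so that N = C × `n_query`) and the toy dataset are assumptions introduced here, not part of the original method description.

```python
import random
from collections import defaultdict

def sample_episode(dataset, c_way, k_shot, n_query, rng=None):
    """Sample one C-way K-shot episode.

    dataset: list of (image, label) pairs (any image representation).
    Returns (support, query), each a list of (image, label) pairs.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    # Pick C distinct classes for this meta task.
    classes = rng.sample(sorted(by_class), c_way)

    support, query = [], []
    for label in classes:
        # Draw K support and n_query query images of this class without overlap.
        images = rng.sample(by_class[label], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 "images" each (strings stand in for images).
toy = [(f"img_{c}_{i}", c) for c in range(10) for i in range(20)]
support, query = sample_episode(toy, c_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 support images, 75 query images
```

Because the support and query images of a class are drawn in a single call without replacement, no image appears in both sets.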

B. MODEL OVERVIEW
In this paper, we propose an end-to-end, metric learning-based model, called multi-scale decision network with feature fusion and weighting (MSDN), to solve the few-shot recognition problem. MSDN is shown in Fig. 1.
As shown in Fig. 1, the MSDN model consists of two networks: Feature Extraction Network (FN) and Relation Network (RN). Take the 2-way 1-shot problem as an example.
We randomly sample two images of different classes, $x_{s1}$ and $x_{s2}$, from the support set: a cat (left in Fig. 1) and a dog (right in Fig. 1); and we sample one image $x_{q1}$ from the query set: a cat (middle in Fig. 1). Our target is to classify the image $x_{q1}$ in the query set.
First, we input $x_{s1}$, $x_{s2}$ and $x_{q1}$ into FN to extract the feature of each layer. For each image, we construct a four-layer feature pyramid, denoted as $\{F_i(x_{s1})\}_{i=1}^{4}$, $\{F_i(x_{s2})\}_{i=1}^{4}$ and $\{F_i(x_{q1})\}_{i=1}^{4}$, respectively, where $F_i$ denotes the feature of the $i$-th layer. Next, we concatenate the support set feature and the query set feature of each layer with the operator $C_i(\cdot, \cdot)$, which denotes concatenating the two features of the $i$-th layer in the channel dimension. The improved concatenation is designed in Section III.C. Then, we input the concatenated features into RN to calculate the relation scores of each layer, $r_i^{s1,q1}$ and $r_i^{s2,q1}$. The relation score is a scalar in the range of 0 to 1 representing the image similarity between the support set and the query set [24]. For example, in Fig. 1, we get $r_{i=1}^{s1,q1} = 0.9$ (cat) and $r_{i=1}^{s2,q1} = 0.3$ (dog) for the first-layer Relation Network (RN). Finally, we use a majority voting strategy, in which the minority is subordinate to the majority, to decide which class the query set image $x_{q1}$ belongs to.
In Fig. 1, the feature comparison results of the first, second and fourth layers show that the probability of the cat class is high for the query set, whereas the feature comparison result of the third layer assigns a high probability to the dog class. According to the voting strategy, the query set image $x_{q1}$ is classified as the cat (the correct classification in Fig. 1).

C. DESIGN DETAILS OF MSDN MODEL
1) LOSS FUNCTION
In this paper, the binary cross entropy (BCE) loss is used to train our model. With this loss, the model produces a relation score $r$ between 0 and 1 to represent the image similarity between the support set and the query set. The BCE loss function is:
$$L = -\left[ y \log r + (1 - y) \log (1 - r) \right] \tag{3}$$
where $y$ denotes:
$$y = \begin{cases} 1, & y_s = y_q \\ 0, & y_s \neq y_q \end{cases} \tag{4}$$
In our model, each layer feature is used to predict the relation score, so the total loss is:
$$L = \sum_{i=1}^{4} L_i \tag{5}$$
where $L_i$ is the BCE loss of the $i$-th layer.

2) VOTING STRATEGY
We use a majority voting strategy, in which the minority is subordinate to the majority, to classify the test images. The relation score of each layer is calculated, so we obtain four classification results. The specific voting strategy is as follows. If all four results are the same (AAAA), or three results are the same and the remaining one is different (AAAB), or two results are the same and the remaining two differ from them and from each other (AABC), then the majority result (A) is selected as the final classification result.
If two results are the same and the remaining two are also the same (AABB), or all four results are different (ABCD), then the result of the fourth layer is selected as the final classification result because it is obtained from the top-layer feature. In other words, when the vote is tied, we classify the test images according to the result of the top layer.
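The voting rule above can be written compactly as a sketch. The function name `vote` and the convention that the list is ordered from the first layer to the top (fourth) layer are assumptions made here for illustration.

```python
from collections import Counter

def vote(layer_predictions):
    """Majority vote over per-layer predictions (first layer first, top layer last).

    A class wins if it is the unique most frequent prediction (AAAA, AAAB, AABC).
    Otherwise (AABB, ABCD) the top-layer prediction is returned.
    """
    counts = Counter(layer_predictions).most_common()
    best_class, best_count = counts[0]
    # A tie exists when the runner-up has as many votes as the leader.
    tie = len(counts) > 1 and counts[1][1] == best_count
    return layer_predictions[-1] if tie else best_class

print(vote(["cat", "cat", "dog", "cat"]))    # AAAB -> cat (the Fig. 1 example)
print(vote(["cat", "cat", "dog", "bird"]))   # AABC -> cat
print(vote(["cat", "dog", "cat", "dog"]))    # AABB -> dog (top layer)
print(vote(["cat", "dog", "bird", "fish"]))  # ABCD -> fish (top layer)
```

Checking for a tie between the two most frequent classes covers both fallback cases (AABB and ABCD) without enumerating the patterns explicitly.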

3) FEATURE FUSION
For the K-shot problem where K > 1, Relation Network [24] element-wise sums the top-layer feature outputs of FN over all the support set samples of a class to form the feature map of that training class. The pooled class-level feature map is then combined with the query set image feature map. In our work, we perform the same operation, but for each layer of FN.
In Relation Network [24], the concatenation of the support set and the query set features can be described by:
$$C(F_s, F_q) = \mathrm{Concate}(F_s, F_q) \tag{6}$$
where $F_s$ denotes the feature of the support set and $F_q$ denotes the feature of the query set. The function $\mathrm{Concate}(\cdot, \cdot)$ denotes that the features are concatenated in the channel dimension. However, it is only a simple linear concatenation, which cannot sufficiently fuse the support set and the query set features. To enhance the fitting ability of RN, we design a non-linear feature fusion item $F_s \odot F_q$ and concatenate it in the channel dimension for each layer feature:
$$C(F_s, F_q) = \mathrm{Concate}(F_s, F_q, F_s \odot F_q) \tag{7}$$
where $\odot$ denotes the element-wise product.
Adding $F_s \odot F_q$ allows RN to adaptively adjust how the features are fused, so it can learn a stronger relation score during training.
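As an illustration of the fusion item, the sketch below concatenates the support feature, the query feature and their element-wise product along the channel axis. The representation of a feature map as a list of channels (each a flat list of activations) and the function name `fuse` are assumptions chosen to keep the example dependency-free; a real implementation would operate on tensors.

```python
def fuse(f_s, f_q):
    """Concatenate F_s, F_q and their element-wise product along the channel axis.

    f_s, f_q: feature maps as lists of channels; each channel is a flat list
    of activations. Both must have the same shape.
    """
    if len(f_s) != len(f_q) or any(len(a) != len(b) for a, b in zip(f_s, f_q)):
        raise ValueError("support and query features must have the same shape")
    # Non-linear fusion item: element-wise product of the two feature maps.
    product = [[a * b for a, b in zip(ch_s, ch_q)] for ch_s, ch_q in zip(f_s, f_q)]
    return f_s + f_q + product

f_s = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 spatial positions
f_q = [[0.5, 0.5], [2.0, 0.0]]
fused = fuse(f_s, f_q)
print(len(fused))  # 6 channels = 3 x 2
print(fused[4:])   # [[0.5, 1.0], [6.0, 0.0]]
```

With 64-channel features, the same operation yields 64 × 3 = 192 channels per layer.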

4) FEATURE WEIGHTING
According to Equations (6) and (7), RN treats the features of the support set and the query set equally. To make RN pay more attention to images of the same class and less attention to images of different classes, the cosine similarity between the support set and the query set features is calculated as the weight of each layer feature vector:
$$\mathrm{Cos}(F_s, F_q) = \frac{F_s \cdot F_q}{\|F_s\| \times \|F_q\|} \tag{8}$$
where $\cdot$ denotes the dot product and $\times$ denotes ordinary multiplication.
We multiply each layer feature $C(F_s, F_q)$ by $\mathrm{Cos}(F_s, F_q)$. Therefore, the final feature concatenation is:
$$C(F_s, F_q) = \mathrm{Cos}(F_s, F_q) \times \mathrm{Concate}(F_s, F_q, F_s \odot F_q) \tag{9}$$
Feature weighting can also be regarded as an attention-based method, so RN places more emphasis on images of the same class during training.
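A minimal sketch of the weighting step follows. It assumes that the cosine similarity is computed over the whole flattened feature map of a layer, which is one possible reading of the description above; the function names and the list-of-channels representation are likewise illustrative assumptions.

```python
import math

def cosine_similarity(f_s, f_q):
    """Cosine similarity between two feature maps, treated as flat vectors."""
    flat_s = [v for channel in f_s for v in channel]
    flat_q = [v for channel in f_q for v in channel]
    dot = sum(a * b for a, b in zip(flat_s, flat_q))
    norm = math.sqrt(sum(a * a for a in flat_s)) * math.sqrt(sum(b * b for b in flat_q))
    return dot / norm if norm > 0 else 0.0

def weighted_concat(f_s, f_q):
    """Scale the fused feature Concate(F_s, F_q, F_s * F_q) by Cos(F_s, F_q)."""
    product = [[a * b for a, b in zip(cs, cq)] for cs, cq in zip(f_s, f_q)]
    weight = cosine_similarity(f_s, f_q)
    return [[weight * v for v in channel] for channel in f_s + f_q + product]

same = weighted_concat([[1.0, 0.0]], [[2.0, 0.0]])       # parallel features
different = weighted_concat([[1.0, 0.0]], [[0.0, 1.0]])  # orthogonal features
print(same)       # weight 1: the fused feature is kept unchanged
print(different)  # weight 0: the fused feature is suppressed
```

Similar support and query features therefore pass through almost unchanged, while dissimilar pairs are attenuated before they reach RN.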

IV. EXPERIMENTS
In this section, we first introduce the environments and datasets. Next, we set the parameters of the network architecture. After giving more specific experimental details, we carry out quantitative empirical comparisons to demonstrate that our model achieves the best result compared with popular few-shot models. The ablation experiment also demonstrates the effectiveness of every improvement in our model. Finally, we briefly analyze the influence of parameter selection on the experimental results and the time complexity of the algorithm.

A. ENVIRONMENTS AND DATASETS
In this paper, all experiments on the few-shot recognition task are implemented on the PyTorch 1.0 GPU platform on a server running Ubuntu 16.04. A 1080Ti graphics card with 11 GB of memory is enough for these experiments.
We use two classic image datasets in few-shot learning: Omniglot [43] and miniImageNet [21].Table 1 shows the details for the two datasets.
As shown in Table 1, Omniglot consists of 1623 characters from 50 different alphabets. Each character represents a class. There are 20 examples for each class with an image size of 28 × 28. Following [21], [22], and [24], we rotate images by 90 degrees, 180 degrees and 270 degrees to add new classes, and use 1200 classes plus rotated classes as the training data and the remaining 423 classes plus rotated classes as the test data.
As shown in Table 1, miniImageNet consists of 100 randomly selected classes, with 600 examples for each class and an image size of 84 × 84. We use the same split proposed by [18], which consists of 64 classes for training, 16 for validation and 20 for test.
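The rotation-based class augmentation for Omniglot can be illustrated with a small sketch. Images are represented as lists of rows, and the helper names `rot90` and `rotated_classes` are assumptions for illustration; the class counts simply follow from treating each original class and its three rotations as four separate classes.

```python
def rot90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def rotated_classes(image):
    """Return the image at 0, 90, 180 and 270 degrees; each view is a new class."""
    views = [image]
    for _ in range(3):
        views.append(rot90(views[-1]))
    return views

image = [[1, 2],
         [3, 4]]
for view in rotated_classes(image):
    print(view)

# Class counts after augmentation: each original class yields 4 classes.
train_classes, test_classes = 1200 * 4, 423 * 4
print(train_classes, test_classes)  # 4800 1692
```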

B. NETWORK ARCHITECTURE PARAMETERS
The network architecture of MSDN model is shown in Fig. 2.
In order to fairly compare our model with the baseline models in Section IV.D, we use the same network depth and similar parameter settings for Feature Extraction Network (FN) and Relation Network (RN).
FN is a four-layer convolutional neural network. Each layer combines a convolution with a 3 × 3 kernel and 64 filters, a batch normalisation and a ReLU activation function [44], and a 2 × 2 max pooling is added to the first two layers. Because each layer feature needs to be concatenated as the input of RN, the features must be reshaped to the same size before concatenating. In FN, a 2 × 2 pooling is used to reshape the feature of the first layer so that it has the same size as the features of the next three layers. According to the concatenation defined in Section III.C, concatenated features with 64 × 3 channels are obtained for each layer.
RN is also a four-layer neural network, with two convolutional layers and two fully connected layers. Each of the first two layers combines a convolution with a 3 × 3 kernel and 64 filters, a batch normalisation and a ReLU activation function. Before entering the fully connected layers, an input size of 64 and 64 × 3 × 3 is obtained for the Omniglot and miniImageNet datasets, respectively. Then, the first fully connected layer combines 8 hidden units with a ReLU activation function, and the second fully connected layer combines 1 hidden unit with a Sigmoid activation function in order to produce the relation score.

C. MORE EXPERIMENTAL DETAILS
For the comparative experiments on the Omniglot and miniImageNet datasets, our model is trained from scratch with random initialization. We use the Adam optimization algorithm [45] with an initial learning rate of $10^{-3}$ and halve it every 50,000 episodes.
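The learning-rate schedule described above is a plain step decay and can be sketched as follows. The function name `learning_rate` and its parameter names are assumptions for illustration.

```python
def learning_rate(episode, base_lr=1e-3, step=50_000, factor=0.5):
    """Step decay: multiply the learning rate by `factor` every `step` episodes."""
    return base_lr * factor ** (episode // step)

for episode in (0, 49_999, 50_000, 100_000):
    print(episode, learning_rate(episode))
```

In PyTorch, a comparable schedule could presumably be obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=50000` and `gamma=0.5`, stepped once per episode.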
There are two popular C-way settings on Omniglot dataset: 5-way and 20-way.Because the accuracy of 5-way is almost 100%, we only compare 20-way result with 1-shot and 5-shot settings.As shown in Table 1, in each training episode, the 20-way 1-shot experiment consists of 1 support image and 10 query images and the 20-way 5-shot experiment consists of 5 support images and 5 query images.During testing, following [22] and [24], we batch 1 and 5 query images per class respectively and calculate the accuracy results by averaging over 1000 randomly generated episodes from the testing data for evaluation.
We take the 5-way 1-shot and the 5-way 5-shot settings on miniImageNet dataset.As shown in Table 1, in each training episode, the 5-way 1-shot experiment consists of 1 support image and 15 query images and the 5-way 5-shot experiment consists of 5 support images and 10 query images.During testing, following [22] and [24], we batch 15 query images per class respectively and calculate the accuracy results by averaging over 600 randomly generated episodes from the testing data for evaluation.
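Averaging the accuracy over test episodes and attaching a 95% confidence interval can be sketched as below. The normal approximation (1.96 standard errors) is an assumption made here, since the text does not state how the interval is computed, and the simulated accuracies merely stand in for real per-episode results.

```python
import math
import random
import statistics

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# Simulated per-episode accuracies stand in for 600 real test episodes.
rng = random.Random(0)
episode_acc = [rng.uniform(0.5, 0.8) for _ in range(600)]
mean, half_width = mean_and_ci95(episode_acc)
print(f"{100 * mean:.2f} +/- {100 * half_width:.2f}%")
```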

D. RESULTS AND ANALYSIS
1) COMPARATIVE EXPERIMENTS
We compare our model with several state-of-the-art models in various experiment settings and datasets. Table 2 and Table 3 show the comparative accuracies on the Omniglot and miniImageNet datasets, respectively.
In Table 2 and Table 3, the ±N% values denote the standard deviation with a 95% confidence interval. The experimental data show that our model (MSDN) achieves the best accuracy compared with the baseline models on the two benchmark datasets, except that it has the same accuracy as ALKN [34] in the 5-way 1-shot setting of Table 3. In MSDN, the multi-scale features are fully utilized and the useful information of each layer is retained. Therefore, the extracted features have a stronger representation ability on various types of datasets for the image recognition task. Meanwhile, unlike in these baseline models, the low-layer features capture abundant detail information for small-sized and easily confused objects. During feature concatenation, the non-linear feature fusion term and the attention mechanism also enhance the fitting ability of Relation Network and make the relation score of images of the same class higher. Therefore, our model achieves a higher recognition accuracy.

2) ABLATION EXPERIMENT
In order to objectively analyse the impact of each improvement in our model, we perform an ablation experiment, similar to the variable-controlling approach, on the miniImageNet dataset. The result of the ablation experiment is shown in Table 4.
As shown in Table 4, we carry out the following four groups of experiments based on Relation Network [24]: (A) only multi-scale voting, (B) only feature fusion, (C) only feature weighting and (A + B + C) the entire MSDN model. The experimental results of (A), (B) and (C) show that every improvement in our model performs better than Relation Network. We also find that using only multi-scale voting achieves a more significant improvement in accuracy than using only feature fusion or only feature weighting. This suggests that extracting more abundant feature information is critical for the few-shot recognition task. During feature concatenation, the accuracy of (B) is slightly higher than that of (C). This is because the non-linear feature fusion item we designed can significantly improve the way features are fused and enhance the adaptability of features during training. Although feature weighting makes our model pay more attention to images of the same class, it is only a simple and fixed attention-based method (weighted by the cosine similarity). Therefore, its influence on the experimental result is not as obvious as that of (B). Finally, we also find that the accuracy of the entire MSDN model far exceeds that of Relation Network.

3) THE SELECTION OF PARAMETERS
"Higher way" setting experiment. In Table 3, we have a slightly higher standard deviation compared with Prototype Network [22]. This is understandable because Prototype Network uses more classes (a higher "way") for training instead of our standard number of training classes. Therefore, we add the "higher way" setting experiment to determine its influence on the experimental results. The result is shown in Table 5.
In [22], Prototype Network uses 30-way (30 classes) training for the 5-way 1-shot problem and 20-way (20 classes) training for the 5-way 5-shot problem, and thus shows a higher accuracy compared with our standard training way (class) setting. When the number of ways (classes) in training is increased, the accuracy of our model exceeds that of Prototype Network with a lower standard deviation. Moreover, we also find that the accuracy reaches a bottleneck when the number of ways (classes) is increased further.

4) THE COMPARISON OF TIME COMPLEXITY
According to Algorithm 1, although our model uses the idea of multi-scale, the time consumption is almost the same as that of Relation Network [24] at the stage of feature extraction, because the four-layer features can be obtained simultaneously by Feature Extraction Network. However, the time complexity of our model is almost four times that of Relation Network [24] when calculating the relation score (Algorithm 1, lines 9-14). Considering that the Relation Networks (RNs) contain only four layers each, the time complexity does not increase much. In practice, our model still produces the experimental results quickly.

V. CONCLUSION
In this paper, we propose a multi-scale decision network with feature fusion and weighting for few-shot learning (MSDN). We use Feature Extraction Network to extract the features of the support set and the query set in each layer and input them into Relation Network for comparison. The test images are classified by a clear majority voting strategy. Meanwhile, we introduce feature fusion and feature weighting to enhance the fitting ability of Relation Network during feature concatenation. The comparison experiments on the Omniglot and miniImageNet datasets show that our model achieves state-of-the-art results compared with popular few-shot recognition models. The ablation experiment on the miniImageNet dataset also demonstrates the effectiveness of every improvement in MSDN.
Future work will mainly focus on three aspects. 1) We will study the voting mechanism to design a more robust voting strategy. 2) We will extend MSDN from

FIGURE 1. Overview of our model MSDN for a 2-way 1-shot problem.

Algorithm 1 gives the training loss computation of our proposed MSDN. In Algorithm 1, $N$ is the number of examples in the training set, $K$ is the number of classes in the training set, $N_c \le K$ is the number of classes per episode, $N_s$ is the number of support examples per class, $N_q$ is the number of query examples per class, and $D_k$ denotes the subset of $D$ containing all elements $(x_i, y_i)$ such that $y_i = k$. $F_{s_i}$ and $F_{q_i}$ denote the $i$-th layer feature of the support set and the query set, respectively. $r_i$ and $L_i$ denote the relation score and the loss of the $i$-th layer, respectively.

Algorithm 1 Training Procedure of MSDN
Input: Training set $D = \{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \{1, 2, \ldots, K\}$
Output: The loss $L$ for a randomly generated training episode
1: Randomly select $N_c$ classes from the $K$ classes to construct the support set $S$ and the query set $Q$
2: for $k$ in $\{1, 2, \ldots, N_c\}$ do
3:   Randomly select $N_s$ examples from $D_k$ to constitute $S$
4:   Randomly select $N_q$ examples from $(D_k - S_k)$ to constitute $Q$
5: end for
6: $\{F_{s_i}\}_{i=1}^{4} = FN(S)$ by Feature Extraction Network (FN)
7: $\{F_{q_i}\}_{i=1}^{4} = FN(Q)$ by Feature Extraction Network (FN)
8: $L = 0$
9: for $i$ in $\{1, 2, 3, 4\}$ do
10:   Concatenate $F_{s_i}$ and $F_{q_i}$ by Equation (9) to constitute $C_i$
11:   $r_i = RN(C_i)$ by Relation Network (RN)
12:   Compute the loss $L_i$ by Equations (3) and (4)
13:   $L = L + L_i$
14: end for
15: Update the model parameters to minimize $L$ with the Adam optimization algorithm
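The loss accumulation in lines 8-14 of Algorithm 1 can be sketched in plain Python. The relation scores are given directly as toy numbers rather than produced by a network, and averaging the BCE over all (query, support) pairs within a layer is an assumption made here; the names `bce` and `episode_loss` are illustrative.

```python
import math

def bce(r, y, eps=1e-7):
    """Binary cross entropy for one relation score r in (0, 1) and target y in {0, 1}."""
    r = min(max(r, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(r) + (1 - y) * math.log(1.0 - r))

def episode_loss(layer_scores, support_labels, query_labels):
    """Total loss L = sum_i L_i over the feature layers.

    layer_scores[i][q][s] is the relation score of layer i between query q and
    support class s; here L_i is the mean BCE over all (query, support) pairs.
    """
    total = 0.0
    for scores in layer_scores:
        pair_losses = [
            bce(scores[q][s], 1 if y_q == y_s else 0)
            for q, y_q in enumerate(query_labels)
            for s, y_s in enumerate(support_labels)
        ]
        total += sum(pair_losses) / len(pair_losses)
    return total

# 2-way 1-shot toy episode: one query (a cat) against support classes [cat, dog].
support_labels, query_labels = ["cat", "dog"], ["cat"]
layer_scores = [[[0.9, 0.3]], [[0.8, 0.2]], [[0.4, 0.6]], [[0.7, 0.1]]]
print(round(episode_loss(layer_scores, support_labels, query_labels), 4))
```

Each layer contributes its own loss term, so a layer that ranks the classes wrongly (the third layer in this toy episode) is penalized independently of the others.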

FIGURE 2. The network architecture parameters of MSDN model.

TABLE 1. Details of the Omniglot and miniImageNet datasets.