Improving Augmentation Efficiency for Few-Shot Learning

While human intelligence can recognize characteristics of classes from one or a few examples, learning from few examples remains a challenging task in machine learning. Deep learning generally requires hundreds of thousands of samples to achieve generalization, and despite recent advances, it is not easy to generalize to new classes with little supervision. Few-shot learning (FSL) aims to learn how to recognize new classes with few examples per class. However, learning with few examples makes it difficult for the model to generalize and leaves it susceptible to overfitting. To overcome this difficulty, data augmentation techniques have been applied to FSL. Existing data augmentation approaches, however, rely heavily on human experts with prior knowledge to find effective augmentation strategies manually. In this work, we propose an efficient data augmentation network, called EDANet, that automatically selects the most effective augmentation approaches to achieve optimal few-shot learning performance without human intervention. Our method avoids the disadvantages of relying on domain knowledge and expensive manual labor to design data augmentation rules. We demonstrate the proposed approach on widely used FSL benchmarks (Omniglot and mini-ImageNet). Experimental results using three popular FSL networks show that our method improves performance over existing baselines through an optimal combination of candidate augmentation strategies.


I. INTRODUCTION
Over the past decade, we have witnessed remarkable performance gains with deep learning in many tasks, such as classification [2], [3], detection [4], [5], and segmentation [6], [7]. The extensive computation of artificial neural networks, advances in computing power, and large numbers of labeled examples have allowed deep learning models to achieve these improvements. However, many real-world scenarios do not allow access to sufficient labeled data for reasons including privacy, security, high labeling costs, and difficulty in managing data. Therefore, many researchers have attempted to train deep learning algorithms with few examples, and the field of few-shot learning (FSL) [17], [18] has recently emerged.
In this work, we address both metric-based and data augmentation-based FSL approaches. Among them, metric-based FSL has been actively studied, and Matching Networks (MatchingNet) [19], Prototypical Networks (PrototypicalNet) [20], and Relation Networks (RelationNet) [21] have been popularly used. Note that it is important to learn an appropriate metric space for these approaches. One can learn the similarity between two samples, extract features from the samples, and calculate the distance between the features. MatchingNet compares the cosine distance between samples mapped into an embedding space, and PrototypicalNet computes the Euclidean distance between prototype samples representing classes. RelationNet introduces a relation metric to compare relationships between samples or between classes.

FIGURE 1: Overview of the proposed EDANet. In the augmentation phase, we automatically find the most efficient augmentation policy from the original data $D_{\text{train}}$ to produce the augmented data $\mathcal{T}^*(D_{\text{train}})$. In the classification phase, the FSL network is trained with the augmented training data, and then the performance is evaluated using the original test dataset.
Although the metric-based FSL algorithms made important contributions to early FSL, they remain unsatisfying compared with traditional many-shot learning approaches in terms of task performance. Data augmentation is a de facto practical technique for improving task performance. Therefore, this study aims to find an optimal data augmentation strategy from a pool of candidate augmentation techniques for metric-based FSL.
In the data augmentation-based approach, the few available samples are augmented to generate diverse samples that enrich the training experience. Data augmentation methods provide additional training data by transforming existing data samples into supplementary ones. The data augmentation-based FSL approaches [22]- [24] augment data by hallucinating feature vectors from the training dataset; IDeMeNet [24] generates synthesized examples from a data augmentation method. However, effective augmentation strategies differ depending on the domain. It is important to note that such augmentation-based algorithms generate training samples via hand-crafted data augmentation rules, which play an important role in improving performance. However, those approaches rely on expert domain knowledge and thus require high-cost labor. Moreover, optimal augmentation rules can be problem-specific, making them hard to apply to other tasks. Therefore, manually designed data augmentation strategies may produce suboptimal solutions.
We overcome the difficulty of manual design choices in augmentation-based FSL. Focusing on this point, we first apply various augmentation methods to the metric-based FSL algorithms. Since most existing FSL algorithms apply augmentation methods manually, we propose an efficient data augmentation method, termed EDANet, that automatically searches for an optimal augmentation rule. It thus provides an optimal combination of candidate augmentation methods that enrich the knowledge of the network. EDANet consists of two phases: an augmentation phase and a classification phase. In the augmentation phase, we explore data augmentation policies that improve performance, using a density matching algorithm [30] to find the best combination of data augmentation strategies automatically. After that, we evaluate the performance of metric-based FSL networks with the augmented dataset in the classification phase. Figure 1 gives an overview of the proposed EDANet. Our approach reduces manual design efforts, improves the quality of augmented data, and enhances the generalization ability of FSL models. It also automatically searches for augmentation policies, including the augmentation type and magnitude. To our knowledge, this is the first work to automate the search mechanism for augmentation-based FSL.
We apply the proposed method to a range of metric-based FSL methods and evaluate them on image classification benchmarks. We use the Omniglot [18] and mini-ImageNet [19] datasets as benchmarks and analyze the performance of the approaches in terms of different distance metrics. Experimental results show that EDANet improves performance by providing an optimal combination of candidate data augmentation techniques. Specifically, EDANet significantly improves classification accuracy over baseline methods (i.e., MatchingNet [19], PrototypicalNet [20], and RelationNet [21]) on mini-ImageNet, achieving 63.51% one-shot accuracy and 79.74% five-shot accuracy. Our approach achieves 98.61% one-shot accuracy and 99.13% five-shot accuracy on Omniglot, outperforming existing augmentation-based FSL baselines. By changing the backbone architecture to ResNet-18 and ResNet-50, EDANet improves one-shot accuracy by 5.67% on the Omniglot dataset and 1.97% on mini-ImageNet, respectively.

II. RELATED WORK

A. FEW-SHOT LEARNING
Few-shot learning (FSL) aims to learn a model with a limited amount of labeled examples for predicting novel classes. FSL is also known as n-way k-shot learning, where n and k denote the number of classes and the number of data points per class, respectively. Each sub-task consisting of n-way k-shot is called an episode (mini-batch), which is why FSL is called episodic learning. A support set and a query set are used for episodic training, where the support set is used to learn to solve a classification task, and the query set is used to evaluate the performance on the task. The major challenge of FSL is the insufficient supply of training examples, as massive datasets are expensive to label correctly. Thus, FSL tries to improve the prediction capability and generalization performance of a model with limited data. FSL approaches can be classified into four categories: metric-based [19]- [21], data augmentation-based [22]- [24], optimizer-based [25], [26], and semantic-based approaches [27], [28]. Among them, we address the metric-based and data augmentation-based approaches that are closely related to this work.
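The episodic n-way k-shot setup described above can be sketched in a few lines. The sampler below is an illustrative sketch (the function and variable names are ours, not from the paper) that assumes the dataset is a flat list of (sample, label) pairs:

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, q_queries, rng=random):
    """Sample one n-way k-shot episode (support + query) from a labeled dataset.

    `dataset` is a list of (x, label) pairs; classes with fewer than
    k_shot + q_queries examples are skipped.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + q_queries]
    classes = rng.sample(eligible, n_way)  # pick the n episode classes
    support, query = [], []
    for c in classes:
        xs = rng.sample(by_class[c], k_shot + q_queries)
        support += [(x, c) for x in xs[:k_shot]]   # k examples per class
        query += [(x, c) for x in xs[k_shot:]]     # remaining go to the query set
    return support, query
```

The support set is then used to adapt or compare against, while the query set measures episode accuracy.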
Metric-based approaches [19]- [21] learn a representation of data with a metric in the feature space. These approaches transform data into a low-dimensional subspace, cluster the transformed samples, and compare the clusters using a metric function. MatchingNet [19] is one of the popular metric-based methods; it computes the cosine distance to classify query samples and proposes the full context encoding (FCE) module, which conditions the weights on the whole support set across classes. PrototypicalNet [20] computes an average feature, called a prototype, for each class in the support set, then classifies query samples by calculating the Euclidean distance between each prototype and the samples in the query set. RelationNet [21] classifies samples of novel classes by calculating the relation score between the query and a sample of each novel class.

B. DATA AUGMENTATION
Data augmentation is a practical strategy used in a variety of learning-based tasks. Random crop, rotation, clipping, flip, scaling, and color transformation are used as baseline augmentation methods [1], [2], [16] on image benchmark datasets such as CIFAR [10] and ImageNet [9]. Mixup [12] generates an augmented image by mixing two images via linear interpolation. Cutout [13] is a regional dropout strategy in which random patches are replaced with zeros (black pixels). Recently, CutMix [14] has been proposed to cut and paste a part of a randomly selected image into another image. Note that such augmentation techniques have been naturally used in supervised learning. The chronic problem of FSL is that there are not enough training samples. Since FSL suffers from data scarcity, it also deploys data augmentation approaches [22]- [24]. SGM and PMN [22], [23] generate additional samples from trained hallucinators and propose strategies to improve FSL performance using the generated samples. IDeMeNet [24] adaptively fuses samples from the support set to generate synthesized samples; it trains an embedding sub-module that maps samples to feature representations and then performs FSL.
However, the aforementioned approaches rely on hand-crafted rules and require task-specific domain knowledge and cumbersome exploration to find optimal augmentation strategies. We break away from the augmentation techniques that have been routinely applied in existing metric-based FSL. In our work, we do not generate synthetic samples but explore promising augmentation techniques to find an optimal augmentation strategy. As such, we neither rely on hand-crafted rules nor require task-specific knowledge to find the optimal augmentation strategy. We propose an efficient automatic augmentation method for FSL.

C. AUTOMATED LEARNING
One promising direction is to find data augmentation methods automatically [29]- [32]. Recently, automating the design process of a neural network, called neural architecture search (NAS) [8], has been proposed to search for an optimal network architecture and to reduce manual design efforts. Automating augmentation using NAS has recently been proposed in the field of data augmentation [29], [30], [32]. AutoAugment [29] uses an RNN controller to augment training data with randomly selected augmentation methods: initially, all augmentation methods are explored uniformly, and the optimal augmentation techniques are then identified based on a reward function. While AutoAugment has achieved promising results, it is computationally expensive. PBA [32] generates augmentation policies based on population-based training [33]. Fast AutoAugment [30] uses hyperparameter optimization to explore optimal augmentation policies. Unlike the above-mentioned methods, in this work we focus on developing an automatic augmentation method for FSL. We adopt the augmentation strategy of [30] to accelerate our augmentation phase, as described in the following section.

III. METHODOLOGY
We present an automatic augmentation strategy for metric-based FSL that requires neither domain knowledge nor painful design efforts. The proposed Efficient Data Augmentation network, termed EDANet, consists of an augmentation phase and a classification phase. In the augmentation phase, augmented data are obtained by automatically exploring the most efficient augmentation policy for the FSL model from the original data. In the classification phase, the FSL network is trained with the augmented training data, and then the performance is evaluated with the original test dataset. We describe the proposed augmentation framework and few-shot augmentation with it in Section III-A. In Section III-B, we apply EDANet to popular FSL problems.

A. EDANET: EFFICIENT DATA AUGMENTATION NETWORK
Manual data augmentation techniques generally require expert knowledge and painstaking design efforts despite the advantage of improving performance by enriching training data. As a remedy, we automate the augmentation procedure to find an optimal strategy and improve task performance. The search space of the automatic augmentation process contains the diverse augmentation techniques listed in Table 1. An augmentation operation $O$ receives a sample $x$ and a magnitude $\lambda$; the result is either $O(x; \lambda)$ with probability $p$ or the original $x$ itself with probability $1-p$. For example, when rotation is selected as the augmentation operation, the magnitude is the rotation angle. In the augmentation phase, we first introduce the search space of augmentation operations and then perform density matching to find optimal augmentation policies. Figure 2 (b) shows the augmentation phase of EDANet. We follow the search strategy presented in [30] to find a good augmentation policy for FSL.
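The stochastic application rule above, apply $O(x; \lambda)$ with probability $p$ and keep $x$ otherwise, can be sketched as follows. This is an illustrative sketch with our own names; `op` stands in for any operation from the search space:

```python
import random

def apply_op(x, op, magnitude, prob, rng=random):
    """Apply operation `op` with magnitude `magnitude` to sample `x`
    with probability `prob`; otherwise return x unchanged."""
    if rng.random() < prob:
        return op(x, magnitude)
    return x

# A toy stand-in for a rotation operation, acting on an angle in degrees.
rotate = lambda x, deg: (x + deg) % 360
```

With `prob=1.0` the operation is always applied; with `prob=0.0` the sample passes through untouched.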
Auto Augmentation Phase. Let $\mathcal{O}$ denote the set of augmentation operations, each of which transforms an input sample. Specifically, each augmentation operation $O_i$ has a magnitude parameter $\lambda_i$. The input samples are augmented by a policy consisting of $t$ different sub-policies $\tau$, and each $O_i(x; \lambda_i)$ is applied to transform $x$ with probability $p$. Figure 2 (a) shows an example of augmenting samples by each sub-policy $\tau$. One can generate $\mathcal{T}(D)$, the set of augmented samples of dataset $D$ transformed by all sub-policies $\tau \in \mathcal{T}$:

$$\mathcal{T}(D) = \{\tau(x) \mid x \in D,\ \tau \in \mathcal{T}\}.$$

Note that the dataset $D$ is divided into $k$ folds, where each fold consists of $D_{\text{model}}$ and $D_{\text{aug}}$. We find a promising policy giving the highest performance on $D_{\text{aug}}$ based on a model $M(\theta)$ trained on $D_{\text{model}}$. Using the Bayesian optimization approach of [30], the top-$n$ augmentation methods are selected based on the minimum error rates of the classifier $\theta$ when predicting $D_{\text{aug}}$. The best augmentation methods from each fold are then merged.
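The construction of $\mathcal{T}(D)$ can be sketched directly: a sub-policy is a sequence of (operation, probability, magnitude) triples, and the augmented set collects every sample transformed by every sub-policy. The names below are ours, and the original dataset is merged back in later, as described in the experiments:

```python
def apply_subpolicy(x, subpolicy, rng):
    """A sub-policy tau is a sequence of (op, prob, magnitude) triples,
    each applied to x with its own probability."""
    for op, prob, mag in subpolicy:
        if rng.random() < prob:
            x = op(x, mag)
    return x

def augment_dataset(dataset, policy, rng):
    """T(D): every sample of D transformed by every sub-policy tau in the
    policy. Labels are preserved; the original samples are not included
    here (they are added back to the training set separately)."""
    out = []
    for tau in policy:
        out += [(apply_subpolicy(x, tau, rng), y) for x, y in dataset]
    return out
```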
Density Matching. Our goal is to find an efficient augmentation policy that matches the density of $D_{\text{model}}$ with that of the augmented $D_{\text{aug}}$. Following the practice in [30], we select the policy

$$\mathcal{T}_* = \operatorname*{argmax}_{\mathcal{T}} R\big(\theta \mid \mathcal{T}(D_{\text{aug}})\big),$$

where $R(\theta \mid D_{\text{aug}})$ gives the accuracy of the model parameters $\theta$ on $D_{\text{aug}}$. We can thus find a policy based on the model trained with $D_{\text{model}}$; in other words, the search minimizes the distance between the density of $D_{\text{model}}$ and the density of $\mathcal{T}(D_{\text{aug}})$. We optimize the search process of augmentation strategies in EDANet through Bayesian optimization [30].
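The search objective can be sketched as follows. For simplicity, random search over a fixed candidate list stands in for the Bayesian optimization (HyperOpt) actually used; `eval_accuracy` is a user-supplied callable returning $R(\theta \mid \mathcal{T}(D_{\text{aug}}))$ for a given policy, and all names are ours:

```python
import random

def search_policy(candidate_policies, eval_accuracy, n_trials=50, rng=random):
    """Pick the policy T maximizing R(theta | T(D_aug)): the accuracy of a
    model trained on D_model, evaluated on the augmented D_aug.
    Simplification: random search replaces the Bayesian optimization
    used in the paper."""
    best_policy, best_acc = None, float("-inf")
    for _ in range(n_trials):
        policy = rng.choice(candidate_policies)
        acc = eval_accuracy(policy)  # proxy for R(theta | T(D_aug))
        if acc > best_acc:
            best_policy, best_acc = policy, acc
    return best_policy, best_acc
```

A real implementation would also sample each sub-policy's probabilities and magnitudes as continuous hyperparameters rather than enumerating fixed policies.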
Classification Phase. After the augmentation phase, we train few-shot classification networks with the augmented dataset $\mathcal{T}^*(D_{\text{train}})$, which contains $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x$ is a sample and $y$ is its class label. After that, we test the proposed method with the original test set. Figure 3 shows the classification phase of the proposed method. In the training stage, each task takes the form of $N$-way $k$-shot and consists of a support set $X^s$ and a query set $X^q$. We then train the classification model $M$ to minimize the $N$-way prediction loss on the query set, also known as episodic training. In the testing stage, each episode consists of a novel support set and a novel query set, and the classification model $M$ predicts the novel classes. The model $M$ can use MatchingNet [19], PrototypicalNet [20], or RelationNet [21] as the backbone of the proposed method, as described in the following section.
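The episodic evaluation protocol, average query accuracy over many support/query episodes, can be sketched generically. `predict` stands in for whichever backbone is used (MatchingNet, PrototypicalNet, or RelationNet); the helper name is ours:

```python
def episodic_accuracy(episodes, predict):
    """Average query accuracy over episodes. Each episode is a
    (support, query) pair of (x, label) lists; `predict(support, x)`
    returns a predicted class label for query sample x."""
    correct, total = 0, 0
    for support, query in episodes:
        for x, y in query:
            correct += (predict(support, x) == y)
            total += 1
    return correct / total
```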

B. FEW-SHOT LEARNING OBJECTIVE FUNCTIONS
The support set and query set for training FSL are extracted from $D_{\text{aug}}$ and $D_{\text{model}}$, respectively. The support set $X^s$ contains $m$ examples, $(x_1^s, y_1^s), \ldots, (x_i^s, y_i^s), \ldots, (x_m^s, y_m^s)$. The query set $X^q$ contains $t$ examples, $(x_1^q, y_1^q), \ldots, (x_j^q, y_j^q), \ldots, (x_t^q, y_t^q)$.

1) Matching Networks
MatchingNet [19], given a query sample $\hat{x}$ (a novel unseen sample), predicts the class of $\hat{x}$ by comparing it with the support set. The prediction for the query sample is defined as

$$\hat{y} = \sum_{i=1}^{m} \text{Attention}(\hat{x}, x_i^s)\, y_i^s,$$

where $\hat{y}$ is the predicted class of the query sample.

$\text{Attention}(\cdot, \cdot)$ is a simple attention mechanism between $\hat{x}$ and $x_i^s$, namely the softmax over the cosine distance $c(\cdot, \cdot)$ between $\hat{x}$ and $x_i^s$:

$$\text{Attention}(\hat{x}, x_i^s) = \frac{e^{c(f(\hat{x}),\, g(x_i^s))}}{\sum_{j=1}^{m} e^{c(f(\hat{x}),\, g(x_j^s))}}.$$

To be specific, two embedding functions $f$ and $g$ learn embeddings of $\hat{x}$ (input in the query set) and $x_i^s$ (input in the support set), respectively. We convert the support-set labels into one-hot encodings and multiply them with the attention weights to obtain the probability of the query belonging to each class in the support set. Then, we select the class with the maximum probability value as the predicted label $\hat{y}$.

2) Prototypical Networks

PrototypicalNet [20] learns embeddings of the data points using an embedding function $f_\phi$, where $\phi$ is the set of parameters of the embedding function. The prototype represents the mean embedding of the data points in each class. The prototype $c_j$ of the $j$-th class is calculated from the embeddings of the data points of that class as

$$c_j = \frac{1}{|S_j|} \sum_{(x_i^s,\, y_i^s) \in S_j} f_\phi(x_i^s),$$

where $S_j$ denotes the support samples labeled with class $j$. Then, the softmax function is applied after calculating the Euclidean distance $d(\cdot, \cdot)$ between the query embedding and each class prototype. Through this, the probability that a query sample $x^q$ belongs to class $j$ is predicted as

$$p(y = j \mid x^q) = \frac{\exp\!\big(-d(f_\phi(x^q),\, c_j)\big)}{\sum_{j'} \exp\!\big(-d(f_\phi(x^q),\, c_{j'})\big)}.$$

Finally, we define the loss function as the negative log-probability of the true class $j$,

$$L = -\log p(y = j \mid x^q),$$

and we minimize the loss using SGD.
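The prototype computation and the softmax over negative Euclidean distances can be sketched in plain Python. `embed` stands in for the embedding function $f_\phi$, and the function names are ours, not the paper's:

```python
import math

def prototypes(support, embed):
    """Mean embedding per class: c_j = (1/|S_j|) * sum of f_phi(x) over class j."""
    sums, counts = {}, {}
    for x, y in support:
        e = embed(x)
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(e), 0
        sums[y] = [s + v for s, v in zip(sums[y], e)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def proto_probs(xq, protos, embed):
    """Softmax over negative squared Euclidean distance to each prototype."""
    e = embed(xq)
    d = {y: sum((a - b) ** 2 for a, b in zip(e, c)) for y, c in protos.items()}
    z = sum(math.exp(-v) for v in d.values())
    return {y: math.exp(-v) / z for y, v in d.items()}
```

The predicted class is the argmax of `proto_probs`, and the training loss is the negative log of the true class's probability.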

3) Relation Networks
RelationNet [21] consists of two important functions: the embedding function, denoted by $f_\eta$, and the relation function, denoted by $g_\phi$, where $\eta$ and $\phi$ are the parameters of the embedding and relation functions, respectively. It first takes a sample $x_i^s$ from the support set and feeds it into the embedding function to extract a feature. Similarly, it learns the embedding of a query sample $x_j^q$ by passing it through the embedding function $f_\eta(\cdot)$. Then, it combines $f_\eta(x_i^s)$ and $f_\eta(x_j^q)$ using the concatenation operation, i.e., $\text{concat}(f_\eta(x_i^s), f_\eta(x_j^q))$, which is fed into the relation function $g_\phi$. The function generates a relation score ranging from 0 to 1:

$$r_{ij} = g_\phi\!\big(\text{concat}(f_\eta(x_i^s),\, f_\eta(x_j^q))\big).$$

This represents the similarity between the samples in the support and query sets. For RelationNet, we use the mean squared error as the loss function, computed as

$$L = \sum_{i=1}^{m} \sum_{j=1}^{t} (r_{ij} - \alpha_{ij})^2,$$

where $\alpha_{ij}$ is 1 if $y_i^s = y_j^q$ and 0 otherwise.
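The MSE objective over support/query pairs can be sketched directly; `scores` stands in for the relation scores $r_{ij}$ produced by $g_\phi$, and the function name is ours:

```python
def relation_loss(scores, support_labels, query_labels):
    """MSE loss: sum over all (i, j) pairs of (r_ij - alpha_ij)^2, where
    alpha_ij = 1 when support sample i and query sample j share a label,
    else 0. `scores[i][j]` is the relation score for the pair."""
    loss = 0.0
    for i, ys in enumerate(support_labels):
        for j, yq in enumerate(query_labels):
            alpha = 1.0 if ys == yq else 0.0
            loss += (scores[i][j] - alpha) ** 2
    return loss
```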

IV. EXPERIMENTS
In this section, we show the results of our approach on the mini-ImageNet and Omniglot datasets and compare them with other few-shot approaches. We summarize the proposed method as follows. The proposed method first finds an efficient augmentation policy in the augmentation phase and applies it to the original training set $D_{\text{train}}$ to produce $\mathcal{T}^*(D_{\text{train}})$. Then, we add the augmented data to $D_{\text{train}}$ and use it in the classification phase. In the classification phase, the few-shot models described in Section III-B are trained with the augmented dataset $\mathcal{T}^*(D_{\text{train}})$ and evaluated with the test set $D_{\text{test}}$. We run every experimental scenario independently five times and report the average results. All experiments were implemented with the PyTorch library [36].

A. SETTINGS

1) Datasets
We used the mini-ImageNet and Omniglot benchmarks to evaluate the few-shot classification algorithms. The mini-ImageNet dataset proposed in [19] is derived from the ILSVRC-12 dataset [11]. Existing methods [21] usually augment the training set by rotating the images. For the mini-ImageNet dataset, however, we used the augmentation techniques shown in Table 1 to find and apply the optimal augmentation policy. Note that since the Omniglot dataset consists of grayscale images, only five affine transformation methods among the 16 operations shown in Table 1 were chosen as augmentation candidates. The dataset produced by the efficient augmentation methods is added to the original training set through the augmentation phase, after which the few-shot classification task is performed in the classification phase. We compared with the augmentation-based approaches SGM [22], PMN [23], and IDeMeNet [24] for few-shot recognition.

2) Implementation Details
Following the practice in [30], we applied the following steps to simplify implementation. First, we split the training data $D_{\text{train}}$ into five folds. Second, we performed exploration and exploitation using HyperOpt [35] to search for optimal augmentation policies. We selected augmentation operations from the PIL Python library. We applied Test-Time Augmentation (TTA) [34] to estimate an appropriate augmentation policy without repeated training: after training the FSL model, an effective augmentation policy is estimated from a pool of candidate augmentation policies, and finally the policy achieving the highest performance is selected. The original implementations of MatchingNet [19], PrototypicalNet [20], and RelationNet [21] use an embedding architecture containing four convolution layers (Conv-4). Beyond this architecture, we also used ResNet-18 and ResNet-50 as additional embedding functions for diverse experiments. We used the same training procedure for these embedding functions as the existing approaches, and we used the 5-way classification setting in all experiments.
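The first step, splitting $D_{\text{train}}$ into five folds of $(D_{\text{model}}, D_{\text{aug}})$ pairs, can be sketched as below. A simple contiguous split is assumed (the paper does not specify the split mechanics), and the names are ours:

```python
def kfold_splits(data, k=5):
    """Split the training data into k folds; each fold yields a
    (D_model, D_aug) pair, where D_aug is the held-out part used to
    score candidate augmentation policies."""
    n = len(data)
    folds = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        d_aug = data[lo:hi]          # held-out part for policy evaluation
        d_model = data[:lo] + data[hi:]  # part used to train the model
        folds.append((d_model, d_aug))
    return folds
```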

B. RESULTS

1) mini-ImageNet
We performed the experiments on 5-way 1-shot and 5-shot according to the common setup in metric-based FSL. Since the mini-ImageNet dataset consists of RGB images, EDANet considered all 16 augmentation operations listed in Table 1.
We obtained the classification accuracy by averaging over 600 randomly generated episodes from the test set. We compare EDANet with the existing augmentation-based algorithms SGM [22], PMN [23], and IDeMeNet [24]. EDANet was trained under the three different embedding functions (models) described in Section III-B. We report the accuracy of the compared methods on mini-ImageNet in Table 2. EDANet achieves the best accuracy when using PrototypicalNet as the embedding function. We employed ResNet-50 as another backbone for a fair comparison with the data augmentation-based approaches [22]- [24]; it performs better than the Conv-4 architecture, and the deeper backbone outperforms shallow networks on every shot. Notably, our approach based on Conv-4 gives similar or better performance than the compared methods employing the ResNet-50 backbone, showing that automatic augmentation is effective in FSL compared to approaches with hand-crafted augmentation rules. Note that 5-shot learning achieves higher accuracy than 1-shot learning, and the augmented dataset helps the network generalize better. Compared with SGM [22], PMN [23], and IDeMeNet [24], EDANet performs better in both 1-shot and 5-shot settings. For EDANet based on MatchingNet, 5-shot performance is lower than that of SGM; however, EDANet performs better than the other methods on average for 5-shot learning.

2) Omniglot
We conducted another experiment on Omniglot and computed few-shot classification accuracy by averaging over 1,000 randomly generated episodes from the test set. We experimented with the proposed method using two different backbone architectures in metric-based few-shot classification.
For comparison, we evaluated MatchingNet [19], PrototypicalNet [20], and RelationNet [21] without data augmentation and with manual augmentations (rotation, crop, and flip). Since the Omniglot dataset contains black-and-white binary images, we applied a total of five affine transformation methods, excluding the color enhancement techniques shown in Table 1.
The results of the compared approaches are shown in Table 3. The proposed method achieved better performance than the baseline methods when PrototypicalNet was used under 5-way 1-shot learning and when RelationNet was used under 5-way 5-shot learning. Compared with manual augmentation using the Conv-4 backbone, the proposed method achieves better performance by about 1%. As shown in the table, there is a performance gap depending on the distance metric, and RelationNet, which uses the relation score, produces the best performance among the distance metrics for both 1-shot and 5-shot. PrototypicalNet and RelationNet outperform MatchingNet, which uses the cosine distance, and their performance difference is marginal. EDANet with the automatic augmentation strategy gives better performance than the manual augmentation-based approaches under the same models (embedding functions). With the larger backbone network, ResNet-18, it further improves performance by 2% to 3%. The experiments show that EDANet finds an optimal data augmentation strategy from the pool of candidate augmentation techniques for metric-based FSL.

C. ABLATION STUDY

1) Effectiveness of automatic augmentation
We demonstrate the effectiveness of the proposed EDANet, which explores an optimal augmentation strategy from a pool of candidate strategies in metric-based FSL, and compare it with manual augmentation approaches. Table 4 shows the results with respect to different augmentation strategies. No augmentation follows the method suggested in the original papers without adding augmentation techniques. Manual augmentation applies three commonly used augmentation techniques: flip, crop, and rotation. Auto augmentation corresponds to EDANet, which explores the best augmentation rule among the 16 augmentation techniques listed in Table 1. In this study, we used the small backbone network (Conv-4). The proposed automatic augmentation yields the best accuracy, outperforming the other methods by large margins. It automatically explores optimal augmentation methods without requiring expert effort, resulting in more than 7% performance improvement over no augmentation for both 1-shot and 5-shot results. Similarly, it outperforms the manual augmentation method by a margin of 4% to 6%. The experiments show that EDANet is a promising candidate for achieving competitive performance without laborious manual design costs.

2) Results on various augmentation methods
We conducted additional experiments to examine the performance of individual augmentation operations in Table 1. We selected eight augmentation operations (including no augmentation) and applied each of them as the augmentation method under the MatchingNet framework. The few-shot learning results are summarized in Table 5. MatchingNet without augmentation gives the lowest performance (37.1% for 1-shot learning and 50.4% for 5-shot learning); note that there is a slight gap between the results reported in [19] and the results from our implementation due to implementation differences. All of the augmentation operations give a performance improvement of 2% or more over the method without augmentation. We also observe that augmentation methods other than rotation (e.g., translate, contrast, brightness) can improve task performance better in metric-based FSL. Note also that the color enhancement operations (contrast, brightness, saturation, and hue) give a 2.9% higher performance improvement on average than the affine transformation operations (rotate, translateY, and translateX). In addition, we applied some combinations of the augmentation methods to see whether combining them leads to further improvement. The number of combinations choosing two of the above eight selected augmentation operations is 28, and we randomly selected three of them. The table shows that combining two augmentation methods performs better than a standalone augmentation method (at least 2% improvement). Since there is a large number of combinations when selecting two of the 16 augmentation operations, it is nearly impossible to select appropriate augmentation methods manually. For this reason, EDANet can be a promising strategy for the success of FSL. Figure 4 shows some examples of augmented images with their corresponding policies and parameter values selected by EDANet.
EDANet provides the probability and magnitude for each policy combination and selects an augmented image from the combination of multiple augmentation operations. This reveals that the proposed method is capable of exploring various augmentation policies for different images.

V. CONCLUSION
Data augmentation is one of the promising methods for the success of few-shot learning, which operates with only a small amount of data. In this work, we have proposed an efficient automatic augmentation approach for few-shot learning to explore the best augmentation strategies from a pool of candidate augmentation operations and to reduce hand-crafted design efforts. The proposed method, EDANet, searches for a combination of augmentation techniques among various candidate augmentation methods using Bayesian optimization and density matching. We have applied the proposed method to three popular few-shot learning baseline models using different distance metrics. Experimental results show that the automatic augmentation rule yields better performance than manual augmentation-based counterparts under the same model, demonstrating its effectiveness in few-shot learning.