Observations on K-image Expansion of Image-Mixing Augmentation for Classification

Image-mixing augmentations (e.g., Mixup and CutMix), which typically involve mixing two images, have become de-facto training techniques for image classification. Despite their huge success, the optimal number of images to mix has not been established in the literature: naive K-image expansions have even been shown to degrade performance. This study derives a new K-image mixing augmentation based on the stick-breaking process under a Dirichlet prior distribution. We demonstrate the superiority of our K-image expansion over conventional two-image mixing augmentation methods through extensive experiments and analyses: (1) more robust and generalized classifiers; (2) a more desirable loss landscape shape; (3) better adversarial robustness. Moreover, we show that our probabilistic model can measure sample-wise uncertainty and boost the efficiency of network architecture search, achieving a 7-fold reduction in search time. Code will be available at https://github.com/yjyoo3312/DCutMix-PyTorch.git.


I. INTRODUCTION
The advent of deep classification networks has emphasized the importance of data augmentation [1]-[3]. Proper data augmentation can remedy the performance degradation caused by insufficient data and weak robustness to noisy data [4]. Accordingly, many researchers have proposed training strategies that apply data augmentation methods to deep classification networks.
Among the popular data augmentation methods, image-mixing augmentation methods, especially CutMix [2], have exhibited impressive performance in training large-scale deep classification networks. Image-mixing augmentation methods augment a new image by mixing two paired images. For example, CutMix mixes the paired images by reformulating their segments into one image. By applying this simple principle, image-mixing augmentation successfully improves the performance of deep classification networks in various scenarios. Furthermore, through image-mixing augmentation, the deep learning model becomes robust to corrupted and uncertain data.
However, the mechanism underlying image-mixing augmentation is still not fully understood. Specifically, even the optimal number of images to mix has not been elucidated: the number of images K has simply been set to 2 empirically. Researchers [3], [5] have made naive attempts at K-image expansions of the augmentation. However, these K-image expansion attempts have been unsuccessful in terms of classification performance improvements. Here, we aim to answer the following question: Is K = 2 the optimal number of images for image-mixing augmentation?
In this study, we derive a novel formulation for generalizing image-mixing augmentation and apply it to obtain improved results for image-mixing augmentation methods. Notably, we find that a mixture of three or more images can further improve the performance of baseline methods that use only paired images. The superiority of the generalized formulation is validated under different classification scenarios. In addition, we test the robustness of our method: the results reveal that our method can drive the model into the widest (flattest) and deepest local minima. In terms of adversarial robustness, we experimentally demonstrate that the proposed image-mixing augmentation methods strengthen adversarial robustness and reveal that the expansion to the K-image case further improves the robustness. Additionally, we demonstrate that the proposed image-mixing augmentation can be used to characterize and estimate the uncertainty of data samples. Based on the estimated uncertainty, we acquire a subsampled data pool that efficiently represents the overall data distribution. We validate the efficiency of our subsampling framework on network architecture search (NAS). Notably, our method preserves performance while achieving 7.7 times higher training speed when using the subsampled data pool as a training set.
Our contribution can be summarized as follows.
• We generalize the image-mixing augmentations for image classification and achieve better generalization on unseen data than the baseline methods.
• We experimentally analyze the mechanism behind the better generalization of K-image augmentation by illustrating the loss landscape near the discovered minima. Accordingly, we reveal its ability to converge to wider and deeper local minima. We also demonstrate that K-image augmentation improves the adversarial robustness of the model.
• We propose a new data subsampling method that measures sample uncertainty based on the proposed image-mixing augmentation, which is especially beneficial when handling a small number of training samples. We further verify the efficiency of the proposed subsampling method by applying it to NAS.
The rest of the paper is organized as follows. In Section II, we review related studies on image augmentation, data efficiency, and architecture search. Section III describes the formulation, implementation, and applications of the proposed augmentation method. Section IV demonstrates the experimental results on CIFAR and ImageNet classification, adversarial robustness, and NAS using the dataset subsampled by the proposed method. Finally, we conclude the paper and discuss its limitations in Section V.

II. RELATED WORK
A. AUGMENTATION
Including augmentation in training classification networks has become standard practice for achieving high performance. Beginning with simple augmentations, such as random crop, flipping, and color jittering, increasingly complex techniques, including Cutout [1], Mixup [3], CutMix [2], PuzzleMix [6], SaliencyMix [7], and Co-Mixup [5], have been applied. Among the latter, CutMix, Mixup, PuzzleMix, and SaliencyMix typically mix two images, and a recent variant, Co-Mixup, has achieved an impressive enhancement in classification performance. Co-Mixup also generalized image-mixing augmentation to K-image cases using submodular-supermodular optimization, which incurs a huge computational cost. Notably, our proposed K-image mixing augmentation does not require such optimization and thus has less computational overhead than Co-Mixup while achieving similar performance (Table 2).

B. DATA EFFICIENCY
Several approaches aiming to utilize a training dataset efficiently via a semantically meaningful measure have focused on collecting informative examples by re-weighting the importance of training samples: computing importance values from an additional forward [8] or backward [9] pass during training, defining an approximated function [10], or using a loss-based training scheme [11]-[13]. Nevertheless, a criterion based on example hardness does not generalize if the samples contain label noise. [14] also showed that hard examples, unlike easy examples, are unsuitable for the initial stages of training. Building upon previous work on measuring sample importance, we propose a robust importance-based subsampling methodology. We apply our subsampling concept to differentiable-search-based NAS and achieve improvements in both search time and classification accuracy.

C. NETWORK ARCHITECTURE SEARCH
Early NAS methods [15]-[19] utilizing reinforcement learning (RL) require significant computational cost, making them difficult to apply to ImageNet-scale datasets. To alleviate this problem, weight-sharing NAS methods [20]-[26] introduce the SuperNet concept, which includes all operations in the search space, and extract the target architecture (the SubNet) from the SuperNet. For SubNet extraction, [20]-[22], [26] propose gradient-based search methods, which are currently dominant in the field. In this study, we demonstrate the effectiveness of the data subsampled by our proposed DCutMix in NAS by applying it to PC-DARTS [21]. Like the other methods, PC-DARTS focuses on designing a cell, and a user can easily adapt the layer depth during the architecture search phase by appending or removing search cells in the search space.

III. PROPOSED METHOD
In this section, we define a formulation of the proposed K-image mixing augmentation and apply the resulting probabilistic augmentation framework to an image classification task. Furthermore, as a novel application of the proposed K-image mixing augmentation, we propose a subsampling method that utilizes the uncertainty measured on augmented data samples.

A. FORMULATION FOR K-IMAGE MIXING AUGMENTATION
K-image mixing augmentation In this subsection, we formulate the K-image generalization of image-mixing augmentation for the image classification task. We consider an augmented sample x_c composed of x_1, ..., x_K, denoted as:

x_c = f_c(x_1, ..., x_K; φ), (1)

where the function f_c(·) denotes the composite function, and the term φ = {φ_1, ..., φ_K} is a mixing parameter denoting the portion of each sample x_k in the composite sample x_c. Note that Equation (1) can be considered the general form of popular image-mixing augmentations, such as CutMix [2] and Mixup [3], which mix only two images. Specifically, in the case of Mixup (denoted as DMixup), the function f_c(·) is defined by the weighted summation as follows:

f_c(x_1, ..., x_K; φ) = Σ_{k=1}^{K} φ_k x_k. (2)

The mixing parameter φ is defined by the Beta distribution in the usual two-image cases (e.g., CutMix [2] and Mixup [3]). Note that Equation (2) can be naturally expanded to the K-image mixing case by applying a Dirichlet distribution φ ∼ Dir(α), α ∈ R^K. In this case, the composite sample x_c becomes a random variable for the given hyper-parameter α.

FIGURE 1: Example of three-image composition in the CutMix case. We composite the red box onto the green box, and the green box onto the blue box, with ratios 1 : 1 − v_1 and 1 : 1 − v_2, as in (4). Consequently, the region proportions of the image fractions r_1 (red diagonal pattern), r_2 (green diagonal pattern), and r_3 (blue diagonal pattern) correspond to {φ_1, φ_2, φ_3}, which follows a Dirichlet distribution. Notably, a low-variability anchor image (anchor image with red border) mostly serves as either an easy or a hard sample regardless of the occlusion position. In contrast, a high-variability anchor image (anchor image with green border) serves as both an easy and a hard sample depending on the random image-mixing operation. Such highly variable anchor images can provide more diverse information during training.

K-image generalization of CutMix Based on the above formulation, we define the K-image generalization of CutMix (denoted as DCutMix). Note that the definition of the function f_c(·) becomes more complicated because the function should contribute all segments of the images x_1, ..., x_K to the composite image x_c considering their mixing parameters φ. Here, we composite the images proportionally following the stick-breaking process (SBP [27]), a widely used approach for sampling from a Dirichlet distribution.
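The weighted-sum form of Equation (2) with Dirichlet weights can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function name `dmixup` and the array layout are our assumptions:

```python
import numpy as np

def dmixup(images, alpha):
    """K-image Mixup (DMixup sketch): mix K images with weights phi ~ Dir(alpha).

    images: array of shape (K, H, W, C); alpha: length-K concentration vector.
    Returns the composite image and the weights phi (also used as the soft label).
    """
    phi = np.random.dirichlet(alpha)           # phi_k >= 0, sum_k phi_k = 1
    mixed = np.tensordot(phi, images, axes=1)  # weighted sum over the K axis
    return mixed, phi
```

The corresponding soft label is the same convex combination of the one-hot labels, Σ_k φ_k l_k.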
Assume that φ is sampled from the prior distribution Dir(α). The K-image mixing augmentation of CutMix is conducted by compositing the image with respect to the proportion φ k ∈ φ, where Σ k φ k = 1. For sampling φ from Dir(α), we use SBP by leveraging an intermediate variable v as follows.
Each v_k is obtained as follows:

v_k ∼ Beta(1, α), φ_k = v_k ∏_{j=1}^{k−1} (1 − v_j), φ_K = ∏_{j=1}^{K−1} (1 − v_j). (3)

Note that the variable v is sampled from the Beta distribution Beta(1, α) in deriving the SBP. Now, we define the image fractions r = {r_1, ..., r_K} from the K different images, which constitute a mixed sample x ∈ R^{W×H×C}. Let the function r̃ = d(x|v) randomly discriminate the image fractions r̃ : x\r̃ with the area ratio v : 1 − v, where x\r̃ denotes the region of x excluding r̃. Consequently, the fractions r are determined by the following equation:

r_k = d(x \ ∪_{j=0}^{k−1} r_j | v_k), k = 1, ..., K − 1, (4)

where the virtual fraction r_0 and the last fraction r_K are set to ∅ and x \ ∪_{j=1}^{K−1} r_j, respectively. The discrimination function d(·) determines the exact bounding-box coordinates r_{k,x}, r_{k,y}, r_{k,w}, r_{k,h} of image fraction r_k, which is located within the bounding box of the former image patch r_{k−1}. These coordinates are randomly sampled from the uniform distribution with random variable γ, as follows:

r_{k,x} = r_{k−1,x} + γ_x (r_{k−1,w} − r_{k,w}), r_{k,y} = r_{k−1,y} + γ_y (r_{k−1,h} − r_{k,h}), γ_x, γ_y ∼ U(0, 1), (5)

where the width r_{k,w} and height r_{k,h} are determined by v_k as defined in Equation (4). Note that, in the case of k = 1, r_{1,x} = 0, r_{1,y} = 0, r_{1,w} = W, and r_{1,h} = H. Hence, the composite function f_c of DCutMix is governed by the hyperparameter α and the random variable γ. An illustration of the proposed K-image mixing augmentation following Equation (4) is presented in Figure 1.
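The stick-breaking construction described above can be sketched as a minimal NumPy routine; following the text, each v_k is drawn from Beta(1, α), and the last weight takes the remaining stick so the φ_k sum to one (the function name is ours):

```python
import numpy as np

def stick_breaking_phi(K, alpha):
    """Sample mixing weights phi via the stick-breaking process (SBP).

    v_k ~ Beta(1, alpha); phi_k = v_k * prod_{j<k}(1 - v_j); the final
    weight phi_K is the leftover stick, so sum_k phi_k == 1 exactly.
    """
    v = np.random.beta(1.0, alpha, size=K - 1)
    phi = np.empty(K)
    remaining = 1.0                    # length of the stick still unbroken
    for k in range(K - 1):
        phi[k] = v[k] * remaining      # break off a v_k fraction
        remaining *= 1.0 - v[k]
    phi[K - 1] = remaining
    return phi
```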
In the subsequent experiment section, we experimentally demonstrate the advantages of the proposed K-image generalization in terms of the loss landscape, adversarial robustness, and classification accuracy.

Probabilistic framework The overall probabilistic framework of the classification problem, considering the proposed augmentation, can be defined as:

p(l_c | x_c) ≈ (1/M) Σ_{j=1}^{M} f^W(f_c(x_1, ..., x_K; φ^{(j)})), (6)

where φ^{(j)} is the j-th sample drawn from the Dirichlet prior distribution p(·|α). Hereafter, we define the label l_i ∈ R^L as a one-hot indexing variable denoting one of the L total classes. Based on the derivation of Monte-Carlo dropout [28], we can approximate the distribution p(l_c|x_c) with the variational function f^W(·) over several different φ and γ samples. The variational function is realized by a classification network, parameterized by W, with a softmax output. In the case of DCutMix, we additionally consider the variable γ, extending (6) as:

p(l_c | x_c) ≈ (1/M) Σ_{j=1}^{M} f^W(f_c(x_1, ..., x_K; φ^{(j)}, γ^{(j)})). (7)

Consequently, from (6) and (7), we can approximate the posterior p(l_c|x_c) by estimating the predictive mean of the network outputs over several differently augmented data samples with varying φ and γ values. Similarly, the uncertainty of a given data sample x_c for the given classification network can be approximated by calculating the posterior estimated from the augmented data samples.
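The Monte-Carlo approximation above amounts to averaging the network's softmax outputs over M differently mixed inputs. A schematic sketch follows, where the `predict` callable stands in for the softmax network f^W and all names are ours:

```python
import numpy as np

def mc_posterior(predict, mixed_samples):
    """Approximate p(l_c | x_c): predictive mean of softmax outputs over
    M composite samples x_c^(j), each mixed with different (phi, gamma)."""
    probs = np.stack([predict(x) for x in mixed_samples])
    return probs.mean(axis=0)

def mc_uncertainty(predict, mixed_samples):
    """Per-class predictive variance over the same M mixed samples,
    usable as a simple uncertainty estimate."""
    probs = np.stack([predict(x) for x in mixed_samples])
    return probs.var(axis=0)
```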

B. IMPLEMENTATION DETAILS OF DCUTMIX
In this section, we present the implementation details of DCutMix. We describe the pseudo-code of the mixing process of DCutMix in Algorithm 1. First, we sample the variable φ from Dir(α) (see Line 1). For K − 1 iterations, we cut and mix K − 1 image fractions. At each iteration, a mini-batch input and target are shuffled along the batch dimension. An intermediate variable v is then obtained via SBP from φ (see Lines 6, 7, 12, and 13). The variable v determines the width and height of the image patch to be mixed, where the exact position is bounded by the former image patch (see Lines 17 and 18). We then cut an image patch from the source images x_s and mix it onto x_c (see Line 19). In Lines 20-27, the soft label is accordingly mixed by λ_k and λ_{K−1}, which denote the exact area ratios of the mixed image patches.
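The mixing loop of Algorithm 1 can be sketched roughly as follows. This is a simplified single-sample sketch under our own reading of Equations (3)-(5): nested patches, each shrunk by sqrt(1 − v_k) and placed uniformly inside the previous box, with soft labels recovered from an ownership mask. The real implementation operates on shuffled mini-batches, and all names here are ours:

```python
import numpy as np

def dcutmix_single(images, labels, v, seed=0):
    """Nested cut-and-mix of K images (DCutMix sketch).

    images: (K, H, W, C); labels: (K, L) one-hot; v: K-1 stick-breaking
    variables from Beta(1, alpha). Returns composite image and soft label.
    """
    rng = np.random.default_rng(seed)
    K, H, W, _ = images.shape
    x_c = images[0].copy()
    owner = np.zeros((H, W), dtype=int)   # which image each pixel shows
    bx, by, bw, bh = 0, 0, W, H           # current bounding box
    for k in range(1, K):
        s = np.sqrt(1.0 - v[k - 1])       # next box keeps area ratio 1 - v_k
        pw = max(1, int(round(bw * s)))
        ph = max(1, int(round(bh * s)))
        px = bx + rng.integers(0, bw - pw + 1)  # uniform placement (gamma)
        py = by + rng.integers(0, bh - ph + 1)
        x_c[py:py + ph, px:px + pw] = images[k, py:py + ph, px:px + pw]
        owner[py:py + ph, px:px + pw] = k
        bx, by, bw, bh = px, py, pw, ph
    phi = np.bincount(owner.ravel(), minlength=K) / (H * W)
    return x_c, phi @ labels              # soft label = visible-area ratios
```

Because later patches occlude earlier ones, reading the label off the ownership mask keeps the soft label consistent with the actually visible area of each image fraction.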

C. SUBSAMPLING USING THE MEASURED DATA UNCERTAINTY
Algorithm 1: Pseudo-code of the mixing process of DCutMix.

As a new method of utilizing the K-image mixing augmentation, we propose a novel subsampling method that, for the first time, considers the data uncertainty obtained from K-image augmentation. In order to measure the uncertainty of a data sample, we define the loss distribution L(l_c|x_c) over variously augmented data samples depending on φ and γ; its expectation can be approximated based on (7) as follows:

E_{φ,γ}[L(l_c | x_c)] ≈ (1/M) Σ_{j=1}^{M} L(l_c, f^W(f_c(x; φ^{(j)}, γ^{(j)}))), (8)

where x = {x_1, ..., x_K} and L denotes the cross-entropy loss. The expectation is defined over the space of the random variables φ and γ. Similarly, the uncertainty can also be acquired by estimating the variance of the loss distribution L. Figure 2 shows qualitative examples of uncertainty measurement, given sample data and their mixed images. Noticeably, the loss values change to diverse degrees for each mixed image, depending mainly on the randomly selected position of the occlusion caused by the non-anchor image patches.
To measure the sample-wise uncertainty using the loss distribution, we select an anchor sample x_i ∈ x with fixed φ_i and then jitter φ\φ_i, the parameters of the other non-anchor samples x\x_i, to calculate the uncertainty of the anchor x_i. The φ\φ_i are drawn from a conditional Dirichlet distribution D(α\α_i), according to its definition. We term L_i = {L_{i,m} | m = 1, ..., M} the loss distribution over all the mixed images given the anchor x_i, where each loss is calculated from (8). The number M denotes the total number of samples of φ\φ_i drawn from D(α\α_i).
Based on the sample-wise uncertainty measurement, we aim to sample a core training subset from the entire training dataset. Presumably, for better generalization of a neural network trained with a small number of data points and image-mixing augmentation, the core training subset should consist of highly uncertain samples, which can serve as both easy- and hard-level samples depending on the image-mixing augmentation (e.g., the images bordered with green in Figure 2). Therefore, a new training subset is subsampled in descending order of the uncertainty measure. We observed that employing the coefficient of variation (CV) metric, defined as σ(L_i)/m(L_i), where σ(·) and m(·) are the standard deviation and average of L_i, is most effective for measuring the uncertainty (see Figure 5a for details).
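As a sketch, the CV measure over an anchor's loss distribution L_i can be computed as below; the names are ours, and `per_sample_losses` holds the M losses of each anchor collected under different mixings:

```python
import numpy as np

def loss_cv(anchor_losses):
    """CV = sigma(L_i) / m(L_i) of one anchor's loss distribution."""
    L = np.asarray(anchor_losses, dtype=float)
    return L.std() / L.mean()

def rank_by_cv(per_sample_losses):
    """Indices of anchors sorted by descending CV (most variable first).

    per_sample_losses: (N, M) losses of N anchors over M random mixings.
    High-CV anchors behave as easy or hard depending on the mixing, which
    is the property the subsampling below exploits.
    """
    cvs = np.array([loss_cv(row) for row in per_sample_losses])
    return np.argsort(-cvs)
```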

1) Subsampling details
We herein describe the implementation details of the proposed subsampling framework. A newly subsampled set D for each class is defined as follows:

D = { x_j | S(O(x_j), t) = 1, j = 1, ..., N_intra }, (9)

where O(·) denotes the subsampling measure and S(·) denotes a sampling function indicating whether data sample x_j is included in D, given the subsampling ratio t. Here, j denotes the index over the N_intra intra-class images sharing the same class label. O(·) serves as a proxy for subsampling: data samples are subsampled in order of O. The sampling function S(·) selects t × N_intra data samples based on the measure O(·) and falls into two categories: a deterministic function sampling the top t × N_intra samples sorted by O(·), and an interval-based function that collects samples sorted by O(·) at a fixed interval.
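The two variants of S(·) described above can be sketched as follows (illustrative NumPy, with `scores` standing in for the per-sample values of O(·); the function name and `mode` flag are ours):

```python
import numpy as np

def subsample(scores, t, mode="top"):
    """Select a t-fraction of samples per class by the measure O(.).

    mode="top": deterministic top-t fraction, sorted by score descending.
    mode="interval": samples taken at a fixed stride across the sorted
    order, so the subset spans the whole range of O(.).
    """
    N = len(scores)
    n = max(1, int(round(t * N)))
    order = np.argsort(-np.asarray(scores))   # descending by O(.)
    if mode == "top":
        return order[:n]
    step = N / n                              # fixed interval over the ranking
    return order[(np.arange(n) * step).astype(int)]
```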
Regarding the subsampling measure O(·), we employed the sample-wise uncertainty measure using the coefficient of variation (CV), σ(L_i)/m(L_i), where L_i is derived from (8). For estimating the sample-wise uncertainty, we set the number of non-anchor images x\x_i to 2 and their Dirichlet sampling parameters α\α_i to {2/9, 2/9}. Additionally, we set the total

IV. EXPERIMENTS AND DISCUSSION
Here, we experimentally verify the effect of the K-expanded image mixing augmentation. First, we show the improved classification performance after applying our method to CIFAR-10/100 and analyze its advantages in terms of the shape of the loss landscape. Second, on ImageNet, we propose an elaborately designed K-image mixing augmentation that considers the saliency map to overcome label noise. Moreover, we present the experimental result on classification and adversarial robustness for further discussion. Finally, we demonstrate the effectiveness of the proposed data subsampling method and its practical application in NAS.

A. EXPERIMENTAL RESULT ON IMAGE CLASSIFICATION 1) CIFAR-10/100
We present classification test results on CIFAR-10 and CIFAR-100 [38]. We further compare our DCutMix and DMixup with state-of-the-art image-mixing augmentation methods, including PuzzleMix [6], Co-Mixup [5], and StyleMix [34], in Table 2. As the results show, DCutMix achieves better accuracy than PuzzleMix and accuracy competitive with Co-Mixup. We note that the proposed augmentation provides comparable classification performance while requiring far less computation than recently published augmentation methods such as StyleMix and Co-Mixup [39]: Co-Mixup and StyleMix incur over 20 times and 50 times more training-time overhead than ours, respectively, due to their high optimization cost. These overall results demonstrate the effectiveness of the proposed K-image generalization of augmentation methods.
b: Analysis on the shape of the loss landscape
For a more explicit investigation, we analyze DCutMix with regard to its loss landscape. The flatness of the loss landscape near local minima has been considered a key indicator of improved model generalization in numerous previous studies [40]-[44]. Regarding the shape of the loss landscape, convergence to a wide (flat) local minimum is generally considered to indicate a model with better generalization on an unseen test dataset. Accordingly, we use the PyHessian [45] framework to obtain the loss landscape patterns of each model, as illustrated in Figure 3a. The plotted result shows that DCutMix has the widest loss landscape near local minima among the compared models. Moreover, DCutMix exhibits lower losses overall, denoting good generalization to unseen test data as well. We further plotted the loss landscape patterns of each model in Figure 3b by perturbing the model parameters with random Gaussian noise of increasing variance σ [42]. DCutMix and DMixup clearly exhibited the widest and lowest loss landscapes compared to the other methods, including CutMix and Mixup, the baseline two-image mixing augmentation methods. For DMixup in particular, we observed that the convergence stability was better than that of Mixup. We believe this reveals the superiority of the proposed K-image mixing augmentation.
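The Gaussian-perturbation probe behind Figure 3b can be sketched as follows; `loss_fn` abstracts the network loss as a function of a flattened parameter vector, and all names are ours. A flat (wide) minimum keeps the loss low as σ grows, while a sharp one degrades quickly:

```python
import numpy as np

def perturbed_loss_curve(loss_fn, theta, sigmas, n_draws=20, seed=0):
    """Average loss after Gaussian weight perturbation, for each sigma.

    loss_fn: parameter vector -> scalar loss; theta: minimizer to probe;
    sigmas: increasing noise standard deviations.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    curve = []
    for s in sigmas:
        draws = [loss_fn(theta + rng.normal(0.0, s, theta.shape))
                 for _ in range(n_draws)]
        curve.append(float(np.mean(draws)))
    return curve
```

On two toy quadratic losses, a sharp bowl (large curvature) loses far more under the same perturbation than a flat one, which is exactly the qualitative signature the figure plots.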
From a more analytical point of view, we can hypothesize that the convergence of DCutMix and DMixup to wide, flat local minima is attributable to their labels being softer than those of CutMix and Mixup. Several researchers have reported that a model trained with an artificially smoothed label can converge to wide local minima and thus achieve better generalization [41], [42], [44], as in the case of the superior results of Label Smoothing [41] compared to the baseline in Figures 3a and 3b. However, as opposed to previous regularization methods using artificially smoothed labels, our softened label directly reveals the mixed ratio of several images; hence, we conjecture that this tendency is a key factor in why the model trained by our approach converges to lower and wider local minima.

2) ImageNet
We present the ImageNet classification results of DCutMix and DMixup compared to the two-image mixing baselines, CutMix and Mixup. The results are obtained under the equivalent training and augmentation-specific hyperparameter setup used in [2]. As presented in Table 3 (left), DMixup considerably improved on Mixup and Manifold Mixup, reducing the top-1 error by 0.7% and 0.62%, respectively. However, DCutMix exhibited a higher top-1 error rate than CutMix. This result is attributable to DCutMix suffering from the label noise problem, where a background object other than the ground-truth class object is contained in the randomly cropped image [46], as shown in Figure 4. Moreover, Table 3 (right) reveals that as the number of mixed images K increased, the performance of DCutMix deteriorated due to the higher probability of background objects being accumulated.
To address this label noise problem, we devised a more sophisticated mixing method named Saliency-DCutMix, which integrates saliency-map information with our DCutMix. First, we obtain a salient image patch by selecting the most salient pixel of the saliency map as the center point, as suggested in [7]. Here, the width and height of each patch are determined from (3) to ensure that the Dirichlet distribution is followed. Consequently, we mix these salient image patches with SBP, similarly to DCutMix, as given in (4). Figure 4 shows qualitative examples of DCutMix and Saliency-DCutMix. Samples augmented with DCutMix contain background class objects other than the ground-truth object, which could lead to label noise during training. Meanwhile, samples augmented with Saliency-DCutMix reveal that the foreground class objects are mixed without background class objects being included. In Table 3 (right), Saliency-DCutMix indeed exhibits relatively stable and significantly improved performance regardless of K compared to DCutMix. Furthermore, Saliency-DCutMix achieved higher performance than its baseline two-image mixing augmentation method, CutMix, as demonstrated in Table 3 (left).
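The salient-patch selection step can be sketched minimally as below: center the patch on the arg-max of the saliency map and clip it to the image bounds, with the patch size supplied externally by the stick-breaking variables. The function name and box convention are ours:

```python
import numpy as np

def salient_patch_box(saliency, pw, ph):
    """Place a pw x ph box centered on the most salient pixel.

    saliency: (H, W) saliency map. The box is clipped so it stays fully
    inside the image. Returns (x0, y0, pw, ph).
    """
    H, W = saliency.shape
    cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
    x0 = int(np.clip(cx - pw // 2, 0, W - pw))
    y0 = int(np.clip(cy - ph // 2, 0, H - ph))
    return x0, y0, pw, ph
```

Mixing these boxes instead of uniformly random ones keeps the foreground class object inside each pasted fraction, which is the point of Saliency-DCutMix.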

a: ImageNet-O
To evaluate the robustness of our proposed model to out-of-distribution (OOD) data samples, we performed tests on the ImageNet-O dataset [47]. The dataset contains OOD images whose class labels do not belong to the 1000 classes of the ImageNet-1K dataset. The ideal output of a classification model for an OOD sample is a uniform, low-confidence prediction over all classes, because no OOD class was considered when training the model. These OOD images reliably cause various models to misclassify with high confidence. To evaluate the robustness of each model against OOD samples, we measured the area under the precision-recall curve (AUPR) on the ImageNet-O dataset, where a higher AUPR denotes that the model robustly predicts OOD samples with lower confidence. Notably, in Table 4, the model trained without augmentation (Vanilla) exhibited the best AUPR. All the augmentation methods are highly overconfident on the OOD samples, and the results demonstrate the fragility of the augmentation methods when a label distribution shift is present.

B. ADVERSARIAL ROBUSTNESS
After the vulnerability of deep neural networks was elucidated by [48], achieving superior classification performance on both non-attacked examples (standard accuracy) and adversarial examples (robust accuracy) has been considered key to making deep neural networks truly robust and reliable [49]-[52]. To achieve higher adversarial robustness, several competing attack and defense methods have been alternately proposed [49]-[52]. In contrast to the above-mentioned studies, [2], [50] reported that training a model with an input transformation or augmentation enhances its robustness against adversarial examples without adversarial training [50], which suffers from high training cost and a severe trade-off between standard and robust accuracy.
In this section, we demonstrate the additional advantage of K-image mixing augmentation in terms of adversarial robustness. We selected classification models (ResNet-50) trained by the baseline (w/o augmentation), CutMix, DCutMix, and Saliency-DCutMix on the ImageNet training dataset. To evaluate adversarial robustness against more diverse types of attack, we considered not only white-box attacks as in [2], but also gray- and black-box attacks [51].
In a white-box attack, the attacker can freely access the model's parameters; here, we use the FGSM_∞ attack [49] with ε = 8, as in [2]. A black-box attack is more challenging for the attacker because they have no information about the target model. For this case, we set a substitute model (ResNet-152) and generated adversarial examples by attacking it. A gray-box attack is a compromise between the white- and black-box settings: the attacker knows the architecture of the model (ResNet-50) but has no access to the weight parameters. Therefore, we generated adversarial examples using a substitute ResNet-50 model trained on the ImageNet dataset with a different random seed. For the gray- and black-box attacks, we generated adversarial examples of the ImageNet validation dataset using a stronger attack method, the PGD_∞ attack [50], with ε = 8. Table 5 shows the top-1 accuracy on the adversarial examples generated by each attack. As reported in [2], a model trained with CutMix exhibits better adversarial robustness than the baseline against all types of attacks. DCutMix achieves further improved adversarial robustness against the gray- and black-box attacks compared to CutMix. However, this was not the case against the white-box attack, which is the most powerful of the three. We hypothesized that the label noise problem associated with DCutMix degrades adversarial robustness against a strong (white-box) attack, and this hypothesis was indirectly confirmed through experiments on Saliency-DCutMix: the saliency-map-guided DCutMix exhibits stronger adversarial robustness than CutMix and DCutMix for all types of attacks.
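For reference, the FGSM_∞ step used in the white-box evaluation is a single signed-gradient perturbation of magnitude ε. A minimal sketch on the 0-255 pixel scale, with the loss gradient assumed precomputed (obtaining it requires the actual model, which this sketch abstracts away):

```python
import numpy as np

def fgsm(x, grad, eps):
    """FGSM_inf: move each pixel by eps in the sign of the loss gradient,
    then clip back to the valid image range [0, 255]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 255.0)
```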
For a more sophisticated investigation of robustness against shifts in the input data distribution, we evaluated the accuracy of the augmentation methods on the ImageNet-A dataset [47]. This dataset contains natural adversarial examples, which cause wrong classifications in an ImageNet-pre-trained model without any adversarial attack. In Table 5, we found that the performance tendency on ImageNet-A is similar to that on the adversarially attacked ImageNet. CutMix exhibited better accuracy than the baseline and DCutMix. Meanwhile, Saliency-DCutMix improved on the CutMix accuracy, exhibiting better generalization on natural adversarial examples and, hence, better robustness to input data distribution shift.

1) Data Subsampling
We investigate the effect of the proposed subsampling method when training with DCutMix as augmentation in Figure 5a, where we compare our data subsampling method with others using different subsampling measures. For all subsequent experiments involving data subsampling, the full 10K CIFAR-100 validation set was used for evaluation, and we report results averaged over three independent random seeds using PyramidNet [29].
As shown in Figure 5a, sampling easy-only or hard-only examples based on m(L_i) deteriorates performance compared to random subsampling. The hard-only subsampling suffered severely from poor generalization. This result indicates that subsampling only hard samples, whose salient regions are occluded when image-mixing augmentation is applied (see Figure 2), is not desirable under the constraint of a small number of training samples. In a similar manner, subsampling only easy samples extracts biased data samples that are not helpful for better generalization. Moreover, simply employing the standard deviation σ(L_i) as the subsampling measure induced a test error curve similar to those of the above mean-based sampling methods. On the other hand, our high-CV-based subsampling significantly outperformed random sampling. Specifically, the test error was 5.79% lower when the number of subsampled training samples was extremely small (i.e., t = 0.05). High-CV subsampling enables us to acquire data samples of various levels, from easy to hard: it basically selects easy data samples that can frequently become hard samples depending on the image-mixing augmentation. Therefore, high-CV subsampling leads to better performance when training with image-mixing augmentation. We also demonstrated the superiority of our subsampling method over methods employing uncertainty derived from weight dropout [56] and K-Center Coreset sampling [57].

2) Application on NAS
We further demonstrate the practicality and effectiveness of our proposed data subsampling method in another domain, namely NAS. Our goal is to reduce the time spent searching for architectures by searching on the subsampled dataset drawn from our framework rather than on the full training dataset. We demonstrate that the architecture search time is greatly reduced without accuracy degradation. Notably, the data subset subsampled by our algorithm can be applied to any neural architecture search framework, including gradient-based and non-gradient-based search methods. We adopt one of the most computationally efficient and stable NAS methods, PC-DARTS [21], as our baseline. For the search process, we divided the subsampled (or entire) training dataset into two equal parts, one for optimizing the network parameters and the other for optimizing the architecture hyperparameters (i.e., α, β in [21]). Additionally, we adopted a warm-up strategy during the search process, in which only the network parameters are optimized: we freeze the hyperparameters α, β for the first 15 epochs as in [21]. For the baseline method (i.e., searching on the entire dataset), we applied the warm-up strategy for the first five epochs when the total number of search epochs was 10 (i.e., the left-most point for the Baseline in Figure 5b). We used a Tesla V100 GPU to perform the search.
For the evaluation, in which the searched network is trained from scratch, we used the same training hyperparameters as in [21]. Figure 5b plots the performance of the neural networks searched on the entire CIFAR-100 dataset (baseline), obtained by varying the number of search epochs, alongside the networks searched on the randomly subsampled dataset and on our high-CV subsampled dataset, obtained by varying the subsampling ratio t. The results demonstrate the outstanding efficiency of our subsampling framework in terms of search time and accuracy. Specifically, searching on the high-CV subsampled dataset achieved comparable accuracy with a 7.7-fold reduction in search time compared to the baselines, and it consistently outperformed random subsampling given an equivalent number of data samples for searching.
As listed in Table 6, our framework serves as an effective proxy dataset: the neural network searched on it generalizes well to ImageNet. Notably, it reduced the GPU search time to as little as 0.01 d (i.e., 16 min) while achieving comparable or even higher accuracy than the models searched with PC-DARTS on the entire CIFAR-10, CIFAR-100, and randomly subsampled ImageNet datasets. Moreover, compared with the other NAS methods, ours achieved the best accuracy at a significantly lower search cost.

V. CONCLUDING REMARKS AND LIMITATION
In this study, we presented the advantages of expanding the number of images used in image-mixing augmentation through various experiments and analyses. First, we proposed a generalized K-image mixing augmentation motivated by the stick-breaking process (SBP). Second, we demonstrated that the proposed augmentation improves classification performance and, from a novel perspective, showed that the key factor behind this improvement is convergence to wide local minima. We also empirically found that increasing the number of mixed images enhances the adversarial robustness of a classification model against various types of adversarial examples. Additionally, we derived a new subsampling method that utilizes the proposed K-image mixing augmentation in a novel way, and experimentally demonstrated that it can effectively reduce architecture search time without performance degradation. We believe our observations can inspire new research directions for image-mixing augmentation and data subsampling.
Limitation: Because our method focuses on establishing a probabilistic framework that explains CutMix augmentation and its potential effectiveness, we did not employ additional semantic knowledge, such as spatial attention or saliency maps. However, when strictly targeting state-of-the-art classification performance, incorporating such information into the augmentation process would be a promising future direction. Furthermore, applying the idea to other computer vision tasks, such as object detection and segmentation, would enhance the applicability of the method.