Introduction
DATA augmentation, which mainly consists of geometric transformations (rotate, translate, etc.) and color transformations (invert, contrast, etc.), is a commonly used tool to generate additional data from the original data. It increases the diversity of the training dataset without collecting extra data. This technique is widely used in computer vision tasks to help train deep neural networks without severely changing the high-level semantics of images. It can also be seen as a regularization method that alleviates over-fitting. Many synthetic data augmentation strategies [1], [2], [3], [4] have been designed and have achieved great success in recent years. However, these designs usually require expert knowledge, a large number of experimental trials, and prior information to find a proper configuration. Improper application or choice of augmentations can even introduce outliers to the training data, which harms the final performance [3], [5], [6].
With the advances of automated machine learning (AutoML), automatically exploring data augmentation strategies directly from the dataset has become popular. The process of exploring an optimal augmentation strategy, including a set of parameters or rules, is called the search. For example, AutoAugment [7] (AA) focuses on learning augmentations based on reinforcement learning (RL). Compared with traditional data augmentation, AA requires less expert knowledge and prior information to achieve impressive results by automatically searching for the amounts of transformations and their combinations. It is an offline search method that decouples the search for augmentation strategies from the training of the target model. The searched strategy is called a policy, which can also be easily transferred to new classification tasks for wider applications. However, the search cost of AA is extremely expensive, even on a proxy task that adopts a subset of the target dataset. As a result, subsequent works have tried to improve the search efficiency. Differentiable methods [8], [9], [10] have emerged and reduce the search cost of policy learning to a few hours. This trend in the development of automatic data augmentation shows the potential of differentiable methods. Nevertheless, these methods generally sacrifice a small amount of performance, especially on complex datasets. Besides, current differentiable methods mainly focus on learning a set of fixed magnitudes for transformations, which limits the scale of the augmented image space and the upper bound of model performance.
Online automatic data augmentation methods take a different approach, learning augmentation strategies together with model training [5], [11], [12], [13], [14] to improve model performance. These methods avoid an extra search stage before training while expanding the augmented image space. However, the frameworks of online augmentation methods are complex, and the search overhead during training remains high. Although search methods combined with meta-learning [15] and bi-level optimization [16] reduce the time for the online search, they suffer from obvious performance degradation, losing one of the most important benefits of online learning. Besides, the augmentation policies of online search methods can hardly be transferred to different tasks. These shortcomings limit the wide application of online augmentation methods.
Recent works, such as RandAugment [5] (RA), adopt randomness to improve performance for wide applications. RA uniformly samples combinations of transformations from the search space to augment images. This simple design achieves unexpectedly impressive results. Due to its simplicity and effectiveness, RA has been integrated into other works (e.g., DeiT [17], Swin Transformer [18]) as an augmented training strategy. However, to achieve optimal performance, RA requires an offline grid search on the whole training dataset to find the proper policy parameters, which is time-consuming even with a dramatically reduced search space. Although RA can achieve satisfactory performance with manually selected parameters to avoid the search, the simple search space limits the upper bound of the augmented image space. UniformAugment [14] and TrivialAugment [13] are two other methods that benefit from randomness with no search process, but they are not flexible enough for different target models and tasks that rely on varying types and magnitudes of transformations.
In this work, we take advantage of differentiable augmentation and the effectiveness of random factors, and develop Differentiable RandAugment (DRA), an offline automatic augmentation method that can effectively learn the policy parameters, including the selecting weights and magnitude distributions of transformations, with a small search cost to expand the augmented image space of RA. Our DRA treats each augmentation transformation as a module in the model, with a weight indicating its sampling probability in the forward pass and a magnitude following a learnable normal distribution to control the deformation. Besides, we introduce the gradients of the transformations to reduce the bias of gradient estimation during the search. We also revisit the optimization objective of the differentiable augmentation search and identify an inconsistency between updating the model parameters and the policy parameters. To reduce the gap, we propose a loss function with KL divergence, which measures the similarity of augmented and original images after model inference. Experiments on CIFAR-10/100 [19] and ImageNet [20] show that our DRA achieves better accuracy compared with mainstream methods within a short search time. Transfer learning on object detection using COCO [21] further demonstrates the effectiveness of DRA. We emphasize that our DRA outperforms RA by 0.28% under the same settings, with only 0.95 GPU hours of search overhead on a single Tesla P100 on ImageNet using ResNet-50. Compared with prior works, it has a flexible design that adapts to different tasks, with a small search cost to improve the performance of offline automatic data augmentation. The pipeline of DRA is shown in Fig. 1.
Fig. 1. A simplified example of the forward pipeline of Differentiable RandAugment. Each yellow box represents an augmentation layer, which samples an operator from a categorical distribution according to the operator weights.
The contributions of our work can be summarized as follows:
We propose DRA, a differentiable automatic data augmentation method that models the magnitude of each transformation with a learnable normal distribution and achieves better performance on classification and object detection compared with RA. The search overhead is reduced to only 0.95 GPU hours on ImageNet with a single Tesla P100.
DRA adopts a search space based on RA and applies an operator-sharing strategy to reduce both the search cost and the search difficulty, which differs from many previous methods built on the search space of AA.
We revisit the inconsistency between updating model parameters and policy parameters in policy learning, and introduce KL divergence as a loss term in the outer loop of bi-level optimization to reduce the optimization gap.
Related Works
A. Data Augmentation
Data augmentation (DA) has been widely used in computer vision tasks to improve the robustness of models, especially in image classification. The most widely used augmentations include cropping, flipping, and resizing, which usually couple random factors to transform images. In recent years, several novel augmentation methods [1], [2], [3], [4], [6], [22], [23], [24] have been designed according to expert or domain knowledge and improve the performance and robustness of models. Apart from supervised training, DA has also been applied in other areas, such as contrastive learning [25], [26], [27] and reinforcement learning [28]. Although these augmentation methods achieve great success in many tasks, they require careful design with painstaking labor and a large number of trials. Improper usage of augmentation shows no effect or even hurts performance. For example, Cutout [1] significantly decreased performance on reduced SVHN [29], as reported by Cubuk et al. [7]. As a result, the automatic design of data augmentation may be more suitable for different tasks.
B. Automatic Data Augmentation
Since Google proposed neural architecture search [30] (NAS), automatically learning the architectures and hyperparameters of deep neural networks has become popular. Inspired by NAS, automatic data augmentation has emerged and achieved great success in computer vision tasks. Current mainstream automatic data augmentation methods can be divided into two types: offline [7], [8], [9], [10], [31], [32] and online [5], [11], [12], [13], [14], [33], [34].
The offline methods attempt to search for proper combinations of different image transformations, namely a policy. Many offline data augmentation methods search for the optimal policy on a proxy task to reduce the huge computation cost, assuming that the policy found on the proxy performs as well as the one found on the whole dataset and target model. For instance, AutoAugment [7] (AA) is based on reinforcement learning and automatically finds a policy for optimal data augmentation. The policy contains a series of sub-policies, each consisting of a list of sequentially applied operators specified by a name, an applying probability, and a level indicating the magnitude of the operator. It achieves excellent performance on image classification on several datasets. Nevertheless, the search is too expensive to be widely applied to different datasets. Although the learned policy can be transferred to other tasks, it is not as good as one directly searched on the target task. Several algorithms have been proposed afterward to speed up the search procedure. PBA [32] proposes to search for an augmentation schedule instead of a fixed policy. Fast AutoAugment [31] (Fast AA) avoids the trials on each policy to accelerate the search procedure. Although their search cost is greatly reduced compared with AA, it is still expensive for wide application of these methods.
Recently, inspired by DARTS [35], differentiable methods for automatic data augmentation have appeared and shown great efficiency in policy search. Faster AutoAugment [10] (Faster AA) regards data augmentation as filling the missing points in the training dataset. DADA [8] uses a GDAS-like [36] sampling strategy in the forward pass and applies the RELAX estimator [37] to estimate unbiased gradients of the policy parameters in the backward pass. DDAS [9] directly derives the search formula from the training loss without the Gumbel-Softmax estimator for a more accurate gradient estimation. These differentiable methods use a gradient update strategy to solve the optimization problem and reduce the search cost to an affordable level. However, they mainly adopt fixed magnitudes, which limits the scale of the augmented image space and the upper bound of model performance.
On the other hand, online data augmentation methods adjust the policy parameters dynamically during model training. OHL-Auto-Aug [12] and Adversarial Augment [11] jointly adjust the policy parameters and model parameters on the target dataset, dynamically augmenting images during training and adjusting the augmentation policy without retraining the model. These online search methods achieve superior results compared with offline methods, but the search overhead remains large; meanwhile, the learned policy can hardly be transferred to other tasks, or suffers from obvious performance degradation when it is. These shortcomings limit the wide application of online search methods. Meta online data augmentation [15] and online bi-level optimization [16] for data augmentation search also use differentiable learning to reduce the overhead of online methods, but they severely sacrifice performance.
Another simple but effective way to augment images is to introduce random factors into transformations. RandAugment [5] (RA), UniformAugment [14] (UA), and TrivialAugment [13] (TA) carefully design the augmentation ranges and uniformly sample transformations to augment the input data. This simple design shows impressive performance in classification tasks. However, since these random factors are not specific to the dataset, there is still considerable room for improvement. RA also suffers from heavy constraints and a large search cost to find the optimal policy parameters.
Very recently, DAAS [33] and DHA [34] jointly optimize the policy parameters with architecture parameters or even hyperparameters. These methods try to find an optimal augmentation strategy that fits the searched model. However, they suffer from the same limitations as online search methods.
Differentiable RandAugment
In this section, we first reformulate automatic data augmentation and define our search space based on RandAugment. Then, we introduce the relaxation and approximation for differentiable learning. After that, we revisit the optimization objective of the differentiable augmentation search and add KL divergence as a loss term to alleviate the inconsistency between updating policy parameters and model parameters.
A. Reformulate Data Augmentation
Data augmentation can be represented as a sequence of operations that transform the input image, which can also be viewed as layers sequentially stacked in the model before normalization. Let $O$ denote the candidate operator set, $X^{0}$ the input image, and $D$ the number of augmentation layers (the augmentation depth). The output of the $d$-th augmentation layer is \begin{equation*} X^{d}=\sum\limits_{o\in O} {o\left({X^{d-1}}\right)}\cdot c_{o}^{d},\quad d=1,2,\cdots,D, \tag{1}\end{equation*}
where the one-hot selection vector $c^{d}$ is sampled from a categorical distribution parameterized by the operator weights $w^{d}$,
\begin{equation*} c^{d}\sim \mathrm{Categorical}\left({w^{d}}\right), \tag{2}\end{equation*}
and each candidate operator acts on the output of the previous layer as
\begin{align*} o\left({X^{d-1}}\right)=\begin{cases} \boldsymbol{1}, & \text{if } c_{o}^{d}=0 \\ OP_{o}\left({X^{d-1},m_{o}^{d}}\right), & \text{if } c_{o}^{d}=1 \text{ and } p< p^{d} \\ X^{d-1}, & \text{otherwise,} \end{cases} \tag{3}\end{align*}
where $OP_{o}$ denotes the transformation of operator $o$ applied with magnitude $m_{o}^{d}$, $p$ is drawn uniformly from $[0,1]$, and $p^{d}$ is the applying probability of the $d$-th layer.
In RA, operators are sampled with equal probability, or from a categorical distribution when manually designed weights are given, as proposed by Wightman et al. [38]. In DRA, the weights $w^{d}$ of the operators in each augmentation layer are instead learnable policy parameters.
Besides, to improve the generalization ability of the target model, the magnitude of each operator in DRA follows a learnable distribution to generate more transformations of the input images. This design is motivated by the idea that randomized magnitudes can improve the diversity of the data, while the augmented images should follow a distribution similar to the original ones to alleviate the over-transformation problem [10], [39]. Since the magnitude controls the deformation of the original image, with 0.0 indicating no change and 1.0 indicating the maximum, using different magnitudes for the same operator to generate different transformations of the input is an intuitive idea. The effectiveness of variable magnitudes in data augmentation has also been shown in TA and UA. Thus, we assume that introducing randomness into the magnitudes can yield abundant transformations to improve model performance. On the other hand, since fixed magnitudes in sub-policy-based methods [7], [31] achieve good results, we assume that augmented images that are similar to images generated by the learned sub-policies can yield better performance. As a result, we adopt a magnitude sampling strategy that samples the magnitude from a specific normal distribution rather than a uniform distribution within the feasible region. We model the magnitude of each operator in each augmentation layer with a separate learnable normal distribution. Under this setting, the augmented images are expected to have more variants that are close to the one transformed with the mean magnitude, while a smaller number of variants have larger deformations. However, when the standard deviation of the magnitude is large, the magnitude distribution becomes smoother. The smoother distribution will generate diverse transformations, which may increase the number of over-transformed images. To alleviate the over-transformation raised by uncontrolled sampling from the normal distribution, we minimize a KL divergence term that measures the distance between the distributions of augmented samples and the original ones during policy parameter learning, which is discussed in Section III-C. The magnitude in our method can be denoted as \begin{equation*} m_{o}^{d}\sim N\left({\mu_{o}^{d},\sigma_{o}^{d^{2}}}\right), \tag{4}\end{equation*} where $\mu_{o}^{d}$ and $\sigma_{o}^{d}$ are the learnable mean and standard deviation of the magnitude of operator $o$ in the $d$-th augmentation layer.
In our search space, the weight parameters, the means of the magnitudes, and the standard deviations of the magnitudes are learnable. These learnable parameters are called policy parameters in this paper. Note that the applying probability $p^{d}$ is not a learnable parameter and is kept fixed during the search.
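To make the formulation concrete, the following sketch illustrates how the policy parameters of (1)-(4) could be organized and sampled in the forward pass. The operator list, layer count, applying probability, and the `apply_op` helper are hypothetical placeholders for illustration, not the released implementation.

```python
import numpy as np

# Hypothetical candidate operators; DRA uses 16 operators in total.
OPS = ["Identity", "Rotate", "ShearX", "Solarize"]   # truncated list for brevity
D = 2            # number of augmentation layers (augmentation depth)
P_APPLY = 0.5    # applying probability p^d, fixed (not learned) in this sketch

# Learnable policy parameters: per-layer operator weights w^d and, per layer and
# operator, the magnitude mean mu^d_o and standard deviation sigma^d_o of (4).
w     = np.zeros((D, len(OPS)))        # logits of the categorical distribution in (2)
mu    = np.full((D, len(OPS)), 0.3)    # means of the magnitude distributions
sigma = np.full((D, len(OPS)), 0.05)   # standard deviations of the magnitudes

def apply_op(image, name, magnitude):
    """Placeholder for the actual image transformation OP_o(X, m)."""
    return image                        # a real implementation would transform the image

def augment(image, rng=np.random.default_rng()):
    """Forward pass through D augmentation layers, following (1)-(4)."""
    x = image
    for d in range(D):
        probs = np.exp(w[d]) / np.exp(w[d]).sum()   # softmax over the operator weights
        o = rng.choice(len(OPS), p=probs)           # c^d ~ Categorical(w^d), (2)
        m = rng.normal(mu[d, o], sigma[d, o])       # m^d_o ~ N(mu, sigma^2), (4)
        m = float(np.clip(m, 0.0, 1.0))             # keep the magnitude feasible
        if rng.uniform() < P_APPLY:                 # random applying probability, (3)
            x = apply_op(x, OPS[o], m)
    return x
```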
B. Estimate Gradient of Policy Parameters
1) Relax Weight Parameters:
Since sampling is not differentiable w.r.t. the weight parameters, a relaxation is applied so that gradients can flow to them during backpropagation. In the relaxed setting, the Gumbel-Softmax estimator [40], [41] is introduced and (1) becomes \begin{equation*} X^{d}=\sum\limits_{o\in O} {o\left({X^{d-1}}\right)\cdot \hat{c}_{o}^{d}},\quad d=1,2,\cdots,D, \tag{5}\end{equation*}
where
\begin{equation*} \hat{c}^{d}\sim \mathrm{RelaxCategorical}\left({w^{d},\tau}\right) \tag{6}\end{equation*}
denotes the relaxed categorical distribution with temperature $\tau$, whose samples are computed as
\begin{equation*} \hat{c}_{o}^{d}=\frac{\exp\left({\left({w_{o}^{d}+g_{o}^{d}}\right)/\tau}\right)}{\sum\nolimits_{o'\in O}\exp\left({\left({w_{o'}^{d}+g_{o'}^{d}}\right)/\tau}\right)}, \tag{7}\end{equation*}
where $g_{o}^{d}$ are i.i.d. samples from the Gumbel(0, 1) distribution.
Following a GDAS-like [36] one-hot sampling strategy, only the operator with the largest relaxed coefficient is applied in the forward pass:
\begin{align*} X^{d}&=\sum\nolimits_{o\in O} {o\left({X^{d-1}}\right)}\cdot h_{o}^{d},\quad d=1,2,\cdots,D, \tag{8}\\ h^{d}&=\mathcal{H}\left({\arg\max_{o} \hat{c}^{d}}\right), \tag{9}\end{align*}
where $\mathcal{H}(\cdot)$ converts the index of the selected operator into a one-hot vector, so that each augmentation layer transforms the image only once while gradients flow back through the relaxed coefficients $\hat{c}^{d}$.
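A possible TensorFlow sketch of this relaxation, assuming the operator weights of one layer are stored as a logits vector with a static shape: Gumbel noise and the temperature $\tau$ give the relaxed coefficients $\hat{c}^{d}$ of (7), a one-hot vector $h^{d}$ selects a single operator in the forward pass as in (8)-(9), and a stop-gradient term routes the backward pass through the soft coefficients.

```python
import tensorflow as tf

def sample_operator(logits, tau=1.0):
    """Straight-through Gumbel-Softmax selection of one operator, as in (5)-(9).

    `logits` holds the weights w^d of one augmentation layer (static shape [num_ops]).
    The returned vector equals the hard one-hot h^d in the forward pass, while its
    gradient is that of the relaxed coefficients c_hat^d.
    """
    u = tf.random.uniform(tf.shape(logits), minval=1e-8, maxval=1.0)
    g = -tf.math.log(-tf.math.log(u))                  # Gumbel(0, 1) noise g^d
    c_hat = tf.nn.softmax((logits + g) / tau)          # relaxed coefficients, (7)
    h = tf.one_hot(tf.argmax(c_hat), logits.shape[0])  # hard selection, (9)
    return c_hat + tf.stop_gradient(h - c_hat)         # hard forward, soft backward
```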
2) Learn Magnitude Distributions:
Unlike optimizing the weight parameters, the gradient of the magnitude parameters needs approximation, because some operators (e.g., posterize, solarize) are not differentiable w.r.t. their magnitudes. Thus, we use the straight-through estimator [42] to evaluate the gradient of the magnitudes. The straight-through estimator can be denoted as \begin{equation*} \frac{\partial OP_{o}\left({X^{d-1},m_{o}^{d}}\right)}{\partial m_{o}^{d}}=\mathbf{1}, \tag{10}\end{equation*}
To pass the gradient of the magnitudes in backpropagation, the magnitude is attached to the input in the forward pass, and the original input $X^{d-1}$ is replaced by \begin{equation*} \hat{X}^{d-1}=X^{d-1}+m_{o}^{d}-\mathrm{StopG}\left({m_{o}^{d}}\right),\quad d=1,2,\cdots,D, \tag{11}\end{equation*} where $\mathrm{StopG}(\cdot)$ stops the gradient of its argument, so that the value of $\hat{X}^{d-1}$ equals that of $X^{d-1}$ while the gradient w.r.t. $m_{o}^{d}$ is preserved. Accordingly, (3) becomes
\begin{align*} o\left({X^{d-1}}\right)=\begin{cases} \boldsymbol{1}, & \text{if } h_{o}^{d}=0 \\ OP_{o}\left({\hat{X}^{d-1},m_{o}^{d}}\right), & \text{if } h_{o}^{d}=1 \text{ and } p< p^{d} \\ X^{d-1}, & \text{otherwise.} \end{cases} \tag{12}\end{align*}
As mentioned in Section III-A, the magnitude of each operator in each augmentation layer of DRA follows a normal distribution. However, the sampling operation is not differentiable w.r.t. the magnitude parameters. Therefore, we use the reparameterization trick [43] to make the magnitude parameters differentiable. Reparameterization can be denoted as \begin{equation*} \hat{m}_{o}^{d}=\mu_{o}^{d}+\epsilon\cdot\sigma_{o}^{d},\quad \epsilon\sim N\left({0,1}\right), \tag{13}\end{equation*} where $\hat{m}_{o}^{d}$ is used in place of $m_{o}^{d}$ in (11) and (12).
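Combining (10)-(13), a magnitude can be sampled by reparameterization and attached to the input through the stop-gradient term, so that gradients reach $\mu_{o}^{d}$ and $\sigma_{o}^{d}$ even for operators with no useful gradient w.r.t. their magnitude. The sketch below assumes `op_fn` is one of the differentiable operator layers; it is an illustration under these assumptions, not the released code.

```python
import tensorflow as tf

def apply_with_magnitude(op_fn, x, mu, sigma):
    """Apply op_fn with a reparameterized magnitude, following (10)-(13).

    mu and sigma are the learnable magnitude parameters of this operator in this layer.
    """
    eps = tf.random.normal([])              # epsilon ~ N(0, 1)
    m = mu + eps * sigma                    # m_hat^d_o = mu + eps * sigma, (13)
    x_hat = x + m - tf.stop_gradient(m)     # value of x unchanged, gradient of m kept, (11)
    # If op_fn is not differentiable w.r.t. its magnitude, the gradient still reaches
    # mu and sigma through x_hat, realizing the straight-through estimator in (10).
    return op_fn(x_hat, m)
```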
Note that when magnitudes follow normal distributions, RA can be seen as a special case of DRA with equal operator weights, identical magnitude means, and standard deviations of zero.
3) Use the Operator Gradient:
Data augmentation is usually decoupled from model training (e.g., augmentation based on Pillow), which implicitly introduces the straight-through gradient estimator for the policy parameters if the backward process is not modified. However, the straight-through estimator is biased. To reduce the impact of this bias, Hataya et al. [10] use operators in Kornia [44], a PyTorch-based [45] differentiable computer vision library, to augment data and calculate the operator gradients w.r.t. the inputs of the operators. Similarly, we rewrite the operators using TensorFlow [46] and encapsulate them into differentiable layers. For operators that are not differentiable w.r.t. the input (e.g., posterize, equalize), we use the straight-through estimator to pass the gradient directly.
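As an illustration of the last point, a non-differentiable operator can be wrapped with `tf.custom_gradient` so that the incoming gradient is passed to the input unchanged. The posterize-like quantization below is a hypothetical stand-in, not the exact operator used in DRA.

```python
import tensorflow as tf

@tf.custom_gradient
def straight_through_posterize(x):
    """Quantize pixel values (a posterize-like, non-differentiable forward pass)."""
    y = tf.round(x * 16.0) / 16.0      # hypothetical quantization to 16 levels

    def grad(dy):
        return dy                      # straight-through: treat the op as identity w.r.t. x

    return y, grad
```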
C. Revisit the Optimization Objective in the Differentiable Data Augmentation Search
Bi-level optimization has become the standard and widely applied optimization formulation for differentiable learning. It separates the training dataset into two equal subsets and alternates between updating the model parameters toward the optimum (the inner loop) and updating the architecture parameters for one iteration (the outer loop) to minimize the loss w.r.t. the architecture parameters. As proposed in DARTS [35], the one-step gradient update can reduce the expensive inner optimization cost in bi-level optimization, using the result of one iteration of the inner loop to approximate the optimal model parameters. In differentiable automatic data augmentation, the policy parameters play a role similar to the architecture parameters. As a result, the policy parameters can be optimized using the same strategy. However, we revisit the optimization objective of the differentiable data augmentation search and find an inconsistency between the outer and inner optimization. Specifically, the optimization objective of the differentiable data augmentation search can be written as \begin{align*} &\min_{T} {L_{val}\left({\theta^{\ast}}\right)} \\ &\;\mathrm{s.t.}\;\; \theta^{\ast}=\arg\min_{\theta} L_{train}\left({\theta,T}\right), \tag{14}\end{align*} where $T$ denotes the policy parameters, $\theta$ the model parameters, and $L_{train}$ and $L_{val}$ the losses on the two subsets.
Recently proposed contrastive learning methods focus on the distributions of augmented views using a contrastive loss, aiming to maximize the agreement between similar images while expanding the differences between views from different inputs [25], [26], [27]. Inspired by this idea, we hypothesize that to achieve good performance, the augmented images should have post-Softmax outputs similar to those of the original ones, while keeping the validation loss low. Therefore, a metric measuring the distance between the model outputs of augmented and original images after inference is required. This idea is consistent with the previous viewpoint that data augmentation is a process of filling in the missing points of the original data distribution through density matching [10]. KL divergence, the most widely used metric for measuring the difference between two distributions [47], is selected in our setting. It has also been used in adaptive knowledge distillation for different training samples, which shows a similar nature to our DRA [39], [48]. To reduce the gap between the original optimization objective and the approximation, we add KL divergence to the loss function. Thus, the optimization objective becomes \begin{align*} &\min_{T} L_{val}\left({\theta^{\ast},T}\right)+\lambda\cdot KL\left({p^{ori}\,\Vert\, p^{aug}}\right) \\ &\;\mathrm{s.t.}\;\; \theta^{\ast}=\arg\min_{\theta} L_{train}\left({\theta,T}\right), \tag{15}\end{align*} where $p^{ori}$ and $p^{aug}$ denote the post-Softmax outputs of the original and augmented images, respectively, and $\lambda$ controls the weight of the KL divergence term.
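A condensed sketch of one search iteration under (15) is given below. It assumes a Keras classifier `model` with softmax outputs, variable collections `model_vars` and `policy_vars`, optimizers `opt_model` and `opt_policy`, and a differentiable `augment` pipeline built from the components sketched above; all of these names are assumptions for illustration. The inner loop takes one model step on the training half of the proxy data (the one-step approximation), and the outer loop updates the policy on the validation half with the additional KL term.

```python
import tensorflow as tf

kl = tf.keras.losses.KLDivergence()
ce = tf.keras.losses.SparseCategoricalCrossentropy()      # model outputs are probabilities

def search_step(model, augment, model_vars, policy_vars, opt_model, opt_policy,
                x_train, y_train, x_val, y_val, lam=0.1):
    # Inner loop (one-step approximation): update the model on augmented training data.
    with tf.GradientTape() as tape:
        loss_train = ce(y_train, model(augment(x_train), training=True))
    opt_model.apply_gradients(zip(tape.gradient(loss_train, model_vars), model_vars))

    # Outer loop: update the policy parameters with the objective in (15).
    with tf.GradientTape() as tape:
        p_aug = model(augment(x_val), training=False)            # predictions on augmented images
        p_ori = tf.stop_gradient(model(x_val, training=False))   # target distribution p^ori
        loss_val = ce(y_val, p_aug) + lam * kl(p_ori, p_aug)     # L_val + lambda * KL(p^ori || p^aug)
    opt_policy.apply_gradients(zip(tape.gradient(loss_val, policy_vars), policy_vars))
```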
D. The Relationship to DADA, DDAS, and Faster AA
Prior offline works DADA, DDAS, and Faster AA share a similar spirit with DRA, all using differentiable learning to estimate the policy parameters. Apart from the learnable magnitude distributions specially proposed in DRA, many other design choices differ among these works as well. To highlight the innovations of DRA, we list some key differences here.
DADA adopts an AA-based search space that uses separate sub-policies to augment images. It uses the RELAX estimator with second-order gradient estimation to learn a more accurate policy. In contrast, DRA uses an RA-based search space that shares operators in each augmentation layer to reduce the number of learnable parameters. The gradients are estimated directly through backpropagation without second-order gradient estimation. Meanwhile, DRA uses KL divergence to reduce the impact of the biased straight-through gradient estimation for a more accurate policy.
DDAS directly uses the expectation of the training loss to derive the formulas of the policy parameters without gradient estimators. It adopts a repeated augmentation strategy for the same minibatch to estimate the loss expectation. The same operator with different magnitudes is treated as different candidate operators during training to avoid estimating the gradients of non-differentiable magnitudes. In contrast, DRA only requires augmenting the same minibatch once, which reduces the search cost when applying the operators is computationally expensive. In addition, gradient estimators are kept to learn the magnitudes for more flexible policies.
Faster AA also adopts an RA-based search space to learn the policy parameters. It passes the weighted sum of transformed images in each augmentation layer to estimate the gradients during the search, which is very close to the feature aggregation design in DARTS that passes the information flow between inner nodes. Besides, it adopts adversarial learning through a Wasserstein GAN to estimate the distance between the augmented and original images to achieve density matching, where the distance is estimated between two different minibatches from the training dataset. This design avoids the nested loop in bi-level optimization, where the outer loop has no gradient for the policy parameters in the basic design. In contrast, DRA uses one-hot relaxed categorical sampling for the operators in each augmentation layer in the forward pass, which transforms the image only once and reduces the computation time, especially on the large-scale ImageNet dataset. Besides, it adopts the bi-level optimization strategy with an approximation to estimate the gradients of the policy parameters, solving the problem in the outer loop. KL divergence is also adopted to achieve density matching without any additional model for estimating the distribution distance.
Experiments
In this section, we conduct classification experiments on CIFAR-10/100 and ImageNet, and compare the performance of DRA with other augmentation methods. The results of the compared methods are taken from the original papers unless specifically mentioned. These results are expected to be strong, since the authors usually tuned the settings to suit their proposed methods; thus, the comparison with DRA is relatively fair and acceptable. Since RA is the most relevant method, with an augmentation pipeline similar to that of DRA, we also reimplemented RA under our settings for comparison. Note that the original RA has only 14 operators in its search space, while there are 16 operators in the reimplemented RA and our DRA. Our DRA shows superior performance compared with other methods, especially on ImageNet. To further demonstrate the generalization ability of DRA in downstream tasks, we conduct transfer learning on COCO and compare the performance of RetinaNet [49] and GFLV2 [50] with backbones pre-trained under the basic settings, RA, and DRA. We also visualize the changes of the policy parameters during the search process and the searched results for a better understanding of DRA.
A. Implementation Details
Our search is conducted on a proxy task using a split from the original training dataset with fewer search epochs, without using the original validation dataset to update any learnable parameter. The proxy task greatly decreases the search cost to a widely affordable level, especially on large-scale datasets like ImageNet. Half of the proxy dataset is used for updating the policy parameters, while the other half is used for updating the model parameters. Note that, unlike previous methods, we do not use a smaller proxy model to search for the policy parameters.
We have 16 operators in our candidate operator set, which can be divided into two groups:
Geometric Operators. The geometric shape or position of the image is transformed (e.g., TranslateX, ShearY, Rotate).
Color Operators. The general geometric shape of the image remains unchanged, while the pixel values of part or all of the image are transformed (e.g., Solarize, Sharpness, Equalize).
Note that the ranges of the translate operators use pixels instead of ratios in DRA. We use WRN-28-10 [51] and PyramidNet+ShakeDrop [52], [53] to evaluate DRA on CIFAR, while ResNet-50 [54], ResNet-200, and the vision Transformer DeiT-Tiny-16-224 [17] without distillation are used on ImageNet. Detailed hyperparameter settings for all classification experiments are listed in Supplementary Materials Table SIII. Note that we tune the initial value of the magnitude standard deviation for different target models.
For object detection, we select ResNet-50 as the backbone and apply transfer learning, changing only the pre-trained weights of the backbone. For RetinaNet, we use horizontal flipping as the augmentation, and set the batch size to 32 and the weight decay to
We implement our experiments on TensorFlow 2.3 and Python 3.7. All search experiments are conducted on a single Tesla P100, and other experiments are conducted on TPU v2 (8 Cores).
B. Results on CIFAR-10/100
CIFAR-10 and CIFAR-100 are two small datasets with balanced class distributions. They both have a resolution of $32\times 32$ and contain 50,000 training images and 10,000 test images.
As shown in Table II, DRA improves the accuracy on both CIFAR-10 and CIFAR-100 compared with RA and other methods. We note that the reimplemented WRN-28-10 using RA under our settings shows an obvious performance increase over the reported one on CIFAR-100, which may arise from the operators that exist only in our search space working well on CIFAR-100. We also notice that DRA performs slightly worse than AA on CIFAR-100 using PyramidNet+ShakeDrop, while better using WRN-28-10. The inconsistency may arise from two aspects. PyramidNet+ShakeDrop benefits more from the separate sub-policies in AA on CIFAR-100, which capture the detailed impact of previously applied transformations that PyramidNet+ShakeDrop is able to distinguish; in contrast, WRN-28-10 benefits less due to its limited ability to distinguish such detailed changes. Besides, different proxy models have a different impact on DRA, where PyramidNet+ShakeDrop benefits less from DRA compared with WRN-28-10. Further analysis of the impact of different proxy model structures on DRA is given in Section V-F.
The search costs of different augmentation methods are shown in Table III. We notice that DRA has a smaller gap in search time between CIFAR and ImageNet compared with other methods. The reasons are twofold.
On one hand, ImageNet has a much higher input resolution than CIFAR, which makes each transformation more expensive to apply; since DRA transforms each image only once per augmentation layer in the forward pass, its search time grows more slowly with image size.
On the other hand, DRA uses WRN-28-10 as the proxy model on CIFAR, while the counterparts adopt the smaller WRN-40-2. Since the proxy dataset from CIFAR is small, the efficiency of DRA is not obvious, and the influence of the larger proxy model dominates the search time. Although the search time of DRA is slightly longer than that of other differentiable methods, it is still within 0.5 GPU hours, which is short and affordable for wide applications.
C. Results on ImageNet
ImageNet is a large-scale dataset with an almost balanced class distribution. It has 1.3 M images in 1,000 classes from daily life, which are quite different from those in CIFAR. As mentioned by DeVries and Taylor [1] and Cubuk et al. [7], an augmentation that performs well on one dataset may not work well on another with a different data distribution. Thus, directly conducting the search on the task dataset is a proper choice. Thanks to the efficiency of DRA, we can directly search on a proxy of ImageNet rather than transferring the policy searched on CIFAR. Following AA, we randomly sample 120 classes from the 1,000 classes in the training dataset of ImageNet with 50 images per class as the proxy dataset, and search for the policy parameters on it. The Top-1 and Top-5 accuracy of DRA and other augmentation methods are shown in Table IV, also evaluated on three runs. Similar to the experiments on CIFAR, we reimplement RA under our framework for a fair comparison. Note that the original training hyperparameter settings of RA are different from those of other works, so we reselect the values of the RA hyperparameters under our settings.
As shown in Table IV, DRA has the best accuracy compared with other methods, with an obvious improvement of up to 0.28% Top-1 accuracy over the reimplemented RA on ResNet-50. Even the reimplemented RA shows a great improvement over the original one, which we attribute to the longer training time that allows the model to converge with augmented images. Note that the reimplemented RA under our settings shows Top-1 performance competitive with the reimplementation reported by Liu et al. [9], demonstrating that the result is reasonable. DRA also achieves the best performance on ResNet-200, demonstrating the capability of DRA to adapt to different target models. With DRA, the model learns more robust features that yield better performance. However, the improvement over RA on ResNet-200 is not as obvious as that on ResNet-50. We conjecture that this is due to the stronger ability of ResNet-200 to capture features in the input data, which may prefer larger variances of the operator magnitudes. With a larger initial value of the magnitude standard deviation, the improvement on ResNet-200 may become more obvious.
To further evaluate the effect of DRA on non-convolutional neural networks, we conduct experiments on the vision Transformer DeiT-Tiny with a patch size of 16, denoted as DeiT-Ti-16 in Table IV. No distillation is applied during training. The input resolution is set to 224. We reimplement DeiT-Ti-16 with the same training hyperparameters except for three augmentation settings, including the basic, RA, and DRA. Since the comparison is designed to reveal the power of DRA compared with RA on Transformers, augmentations other than the basic random resized cropping and horizontal flipping are not applied. The hyperparameter settings generally follow the original DeiT-Basic. Note that for RA we set
As shown in Table IV, DRA also achieves the best performance on ImageNet compared with the basic setting and RA, with a 0.16% Top-1 improvement. We also notice that a significant performance improvement is achieved by our reimplemented version compared with the original one in [17], which may arise from the larger number of candidate operators in DRA. We also evaluate the performance of RA without the random applying probability, to be consistent with the setting in [17], and of the implementation without standard deviation, as in the original design of RA; the Top-1 accuracy is only 73.21% and 72.84%, respectively (not shown in Table IV). The results demonstrate the generalization of DRA and its capability to be integrated into more types of neural network structures, indicating bright prospects for wide application.
D. Transfer Learning on Object Detection
Using ImageNet pre-trained backbones is a common strategy in object detection for fast convergence [58]. To evaluate whether a backbone trained with our DRA has a stronger ability to extract features in object detection, we conduct experiments on the COCO dataset with backbones pre-trained under three different settings, including the basic, RA, and DRA. We test two single-stage detection networks, RetinaNet and GFLV2, to better understand the influence of the pre-trained backbone when the detection head has either weak or strong feature extraction capability. We compare the mean bounding box Average Precision (AP) to evaluate the performance. The results, averaged over three runs, are shown in Table V. Models using backbones pre-trained with DRA generally outperform those using the basic setting and RA. In particular, DRA mainly helps to capture features of medium-sized objects, while slightly sacrificing the ability to capture those of smaller objects. DRA also helps to detect the coarse locations of objects. These phenomena appear in both the simplified and augmented settings, which may explain how DRA helps models extract semantic features from images.
E. Visualization
1) Visualize the Search Process:
To better understand how DRA works during the search, we visualize the changes of the mean magnitudes of the operators during the search on CIFAR-10 and ImageNet.
As shown in Fig. 2(a), DRA prefers larger transform intensities for geometric transformations and smaller ones for color blending operators on CIFAR-10. Changes of
Fig. 2. Visualization of the search process for the mean magnitudes.
The corresponding search process on ImageNet is shown in Fig. 2(b).
We also notice that the changes of
2) Visualize the Searched Results:
We also display the final searched magnitude parameters, including both the means and standard deviations of the magnitudes, on CIFAR-10 and ImageNet in Fig. 3 with WRN-28-10 and ResNet-50, respectively. Note that the sampled magnitudes of DRA should be within the range $[0, 1]$.
3) Visualize the Loss Curves:
To reveal the impact of DRA on training the target model, we further analyze whether DRA makes training easier or harder. The loss curves during training on ImageNet, shown in Fig. 4, are used to measure the difficulty. DRA has a larger augmented image space compared with RA, which is expected to increase the difficulty of training. Surprisingly, the results show that DRA has both smaller training and validation losses at the end of training. Meanwhile, the loss gap of DRA between training and validation is also smaller. These results indicate that DRA makes samples easier to learn from compared with RA. The reasons are twofold. For one thing, DRA has a smaller average magnitude compared with RA, which makes it easier for models to extract features from variants of the original image. For another, since DRA has a standard deviation that makes the sampled magnitudes different for the same operator, the same image has many variants that show gradual deformations. These gradual deformations fill the missing points in the original image space, making the space smoother for the target model to learn. The easier training may explain why DRA performs better than RA.
Fig. 4. Training and validation loss curves of ResNet-50 on ImageNet, evaluated on three runs. Colored areas show the range of the loss, while lines show the average results.
DRA slightly increases the performance while reducing the training difficulty, achieving a good balance between accuracy and diversity. It suggests that a balanced augmentation design between good performance and randomness may exist in a policy adapted to the target task. We hope DRA can inspire future work on better data augmentation designs that further promote accuracy, with techniques such as augmentation customization or hard sample mining.
Discussion
DRA achieves admirable results compared with RA at only a small search cost. To explore how DRA improves classification performance, we conduct ablation and hyperparameter studies on ImageNet with ResNet-50. We also discuss the influence of the KL divergence term and the proxy task in DRA in this section. Hyperparameter settings are the same as those in Section IV-A unless otherwise mentioned.
A. Components of DRA
The components of our DRA can be summarized into four parts: learnable magnitude distributions, learnable selecting weights, usage of operator gradients, and modification of the outer optimization objective in bi-level optimization. The latter two provide a more accurate estimation of the gradients of the learnable policy parameters. To further understand how these components affect the performance, we explore their cumulative impact and list the ablation results in Table VI, focusing on the improvement of Top-1 accuracy on ImageNet with ResNet-50. Results are averaged over three runs.
We find that the proposed components gradually improve performance, especially the learnable magnitude distributions and the use of KL divergence. The joint use of the learnable policy parameters and the improved gradient estimation contributes obviously to the performance gain. Besides, accurate gradient estimation for differentiable methods also contributes to good results.
B. Magnitude Distribution
We further explore how the magnitude distributions improve the performance of DRA. We compare six cases to study the detailed impact of the magnitude parameters: fixed magnitudes, learnable magnitudes, magnitudes following fixed distributions, learnable magnitude distributions with a fixed standard deviation, learnable magnitude distributions with a fixed mean, and fully learnable magnitude distributions. The results, evaluated on three runs, are shown in Table VII.
Learning both the means and the standard deviations of the magnitudes yields the best performance among the six cases.
C. KL Divergence
KL divergence is used to measure the difference between two distributions. It has been adopted in automatic data augmentation to avoid outliers caused by the removal of part of the semantic information when heavy augmentations are applied [39]. Inspired by this idea, we also introduce KL divergence to refine the searched policy parameters in DRA and avoid generating outliers.
To explore the impact of KL divergence in DRA, we choose a group of values for the ratio $\lambda$ and compare the resulting performance, as shown in Fig. 5(a).
Fig. 5. Impact of some hyperparameters on ImageNet evaluated with ResNet-50. (a) The ratio of KL divergence during the search. (b) The total number of augmentation layers.
Although more values of $\lambda$ are worth evaluating, the current results already show the benefit of the KL divergence term.
D. Augmentation Depth
With more augmentation layers, the transformation of the input image is expected to be more obvious and severe, which will increase the diversity of the input dataset. However, more augmentation layers may also introduce outliers that lose important semantic information and hurt model performance. To explore the impact of the augmentation depth on DRA, we conduct experiments with different total augmentation depths $D$, as shown in Fig. 5(b).
In our experiments, we select
E. Operators
Operators transform images into different variants that provide deformations of the input dataset to improve the robustness of the trained models. However, improper use of operators may hurt performance. As reported by Cubuk et al. [5], not all operators are beneficial to performance. To get a sense of the influence of each operator, we separately remove each operator from the candidate operator set of DRA and evaluate the performance.
In particular, we remove one operator from the candidate operator set and search for new policy parameters on the remaining ones to evaluate the performance decrement. The results, averaged over three runs, are shown in Fig. 6. We group the operators into three groups: geometric transformations (left), color transformations with magnitudes (middle), and color transformations without magnitudes (right). Removing any of the operators causes performance to decrease. We also find that rotation is the most important operator for classification on ImageNet. Besides, among geometric transformations, those along the x-axis contribute more to performance than those along the y-axis. Color transformations with magnitudes contribute similarly to performance. For color transformations without magnitudes, the contributions vary significantly, with invert contributing little compared with contrast-changing operators.
Fig. 6. Decrement of performance on ImageNet using ResNet-50 with each operator removed from the candidate operator set. Results are reported as the average of three runs. The policy parameters are searched from scratch after the removal of each operator.
The results show that ImageNet prefers variants generated from both geometric and color transformations, demonstrating the complexity of the dataset, which requires abundant transformations during training for better generalization. Experiments with more operators in the candidate set are worth evaluating in the future, and we believe they will further improve the performance of DRA on ImageNet.
F. Proxy Task
1) Reasons to Use the Proxy Task:
Differentiable automatic augmentation methods generally use a proxy task to reduce the search cost, which sacrifices the precision of gradient estimation and slightly decreases performance, as reported by Lim et al. [31]. To better estimate the loss in the outer loop of bi-level optimization, RA searches the policy parameters directly on the target task [5]. However, it also introduces heavy constraints to reduce the size of the search space, so its flexibility is limited. The search cost of RA is also bound to the scale of the target dataset and is difficult to reduce further.
Considering both the advantages and disadvantages, our DRA adopts the proxy task with several tricks for better estimation of the gradient, which strikes a balance between the search cost and performance. The search strategy based on the proxy task works well, thus we believe DRA is practical and worthwhile for wide applications. In the future, we will try to apply DRA directly on the target dataset, which is in theory more precise for gradient optimization.
2) Impact of the Proxy Model:
DRA adopts the target model as the proxy model to search for the optimal augmentation strategy. The idea is intuitive, because searching on the target model is expected to have a smaller gap to the target task than searching on a smaller proxy model. To verify this intuition, we evaluate the impact of proxy models on CIFAR-100. We compare the performance of both WRN-28-10 and PyramidNet+ShakeDrop trained with policies searched on themselves and on the smaller proxy model WRN-40-2. We also compare the performance of WRN-28-10 trained with the policy searched on PyramidNet+ShakeDrop, which can be viewed as a larger proxy model. The results are shown in Table VIII.
Using either a smaller or a larger proxy model decreases the performance on the target tasks, indicating that selecting a proper proxy model is important for proxy tasks. This supports the intuitive idea that directly using the target model as the proxy model reduces the gap between the proxy task and the target task and promotes the final performance.
Besides, the performance gaps of PyramidNet+ShakeDrop with policies searched on different proxy models are smaller than those of WRN-28-10, indicating that the impact of the proxy model on larger models is smaller. This finding may suggest that DRA is more suitable for smaller models, which require more regularization to achieve better performance.
3) Impact of the Balanced Proxy Dataset:
Following previous proxy-based automatic data augmentation methods, DRA randomly samples part of the images from the training dataset of the class-balanced CIFAR dataset as the proxy dataset. For CIFAR-100, which has many classes, the randomly sampled small-scale proxy dataset may be unbalanced. We note that previous works neglect the analysis of whether a randomly sampled or a class-balanced proxy dataset is better for the proxy task on a balanced target dataset. To evaluate this, we compare the performance of DRA trained on a randomly sampled proxy dataset and on a stratified one drawn from the CIFAR-100 training dataset. Note that the separation of the stratified proxy dataset into two halves for the proxy search is random, the same as in previous experiments. The results show that a balanced proxy dataset does not benefit the target task on the balanced target dataset; randomly sampling the proxy dataset is enough to yield satisfactory performance.
G. Influence to Future Works
DRA increases the performance compared with RA. Meanwhile, it reduces the training difficulty, achieving a good balance between accuracy and randomness. It demonstrates that a balanced policy design may exist when the policy is adapted to the specific target task. We hope the idea presented in DRA can inspire future work on better data augmentation designs that further promote accuracy.
A specific direction worth trying is online DRA with techniques such as adversarial learning [59], hard sample mining, or knowledge distillation [39]. With better control over the difficulty of generated samples, or by transforming samples with adaptive policies, DRA may achieve better performance. These directions are part of our future efforts.
Due to its simple design and offline characteristics, DRA is easy to integrate into existing training pipelines with almost no extra training budget, especially in applications that already use RA as an augmentation strategy. Besides, DRA may promote model performance in cases with limited computational resources or a large cost for collecting new data, such as mobile computing and medical image analysis [60]. It shows broad prospects for the community.
Conclusion
In this work, we focus on the inflexibility and large search cost of RandAugment (RA) and propose Differentiable RandAugment (DRA), a method that automatically learns the selecting weights and magnitude distributions of different transformations. DRA generally outperforms RA with a small search cost. It adopts the search space of RA, models the magnitude of each transformation with a learnable normal distribution, and uses relaxation and approximation to make the policy parameters differentiable. We also introduce operator gradients and KL divergence to reduce the bias in gradient estimation. Experiments on several datasets and tasks demonstrate the efficiency and effectiveness of DRA, especially on ImageNet classification. Our DRA is one of the few methods to outperform RA on ImageNet under a similar training budget. We believe this framework can be integrated into more computer vision tasks and serve as a baseline for subsequent research.
ACKNOWLEDGMENT
The authors would like to thank Weichen Yu, Qinglin Zhu, Jiyang Guan, and Xinjian Wu for their help.