Class-Adaptive Data Augmentation for Image Classification

Data augmentation is a widely used regularization technique for improving the performance of convolutional neural networks (CNNs) in image classification tasks. To improve the effectiveness of data augmentation, it is important to find label-preserving transformations that fit the domain knowledge of a given dataset. In many real-world datasets, appropriate augmentation policies differ between classes owing to their different characteristics. In this paper, we propose a class-adaptive data augmentation method that utilizes class-specific augmentation policies. First, we train the CNN without data augmentation. Subsequently, we derive a suitable augmentation policy for each class through an optimization procedure that maximizes the degree of transformation while maintaining the label-preserving property with respect to the trained CNN. Finally, we re-train the model using data augmentation based on the derived class-specific augmentation policies. Through experiments on benchmark datasets with class-specific transformation constraints, we demonstrate that the proposed method achieves classification accuracy comparable to or higher than that of baseline methods that use the same augmentation policy for all classes. Additionally, we confirm that the derived class-specific augmentation policies are consistent with the domain knowledge of each dataset.


I. INTRODUCTION
Data augmentation plays an important role as a regularization technique for improving the generalization performance of convolutional neural networks (CNNs) in image classification tasks [1], [2]. Data augmentation synthetically increases the amount and diversity of a training dataset with randomly transformed images through label-preserving transformations. Owing to its effectiveness, data augmentation has been widely used to improve classification accuracy in various image classification tasks.
An appropriate augmentation policy can significantly improve the classification accuracy of CNNs. For a given dataset, it is necessary to determine an augmentation policy that specifies the types of transformations and their distributions to randomly transform images while preserving their class labels. However, if the labels of transformed images differ from those of the original images, data augmentation can instead degrade the classification accuracy of the CNN.
A naive approach is a manual specification by human experts, based on the domain knowledge of the dataset [3], [4], [5], [6], [7]. If the domain knowledge is sufficient, an appropriate augmentation policy can be easily derived. However, domain knowledge may not be available in many real-world scenarios. Additionally, this approach generally requires a trial-and-error process and thus can be labor-intensive and time-consuming [8], [9].
Recently, attempts have been made to automatically search for an appropriate augmentation policy for a given training dataset in a data-driven manner [10], [11], [12], [13], [14], [15]. Existing methods can effectively apply data augmentation to improve the performance of a CNN in the absence of domain knowledge. These methods mainly focus on optimizing the augmentation policy to be dataset-specific, implying that every image in the dataset is randomly transformed in the same manner, regardless of the class label.
Our research motivation stems from the fact that appropriate augmentation policies can differ between classes. For example, in the digit images of the MNIST and SVHN datasets, random horizontal and vertical flips preserve the class labels of images for classes 0, 1, and 8, but not for the other classes. Random rotation over a wide range preserves the class labels only for class 0. When different classes correspond to different types of label-preserving transformations, consideration of class-specific characteristics can lead to a substantial improvement in data augmentation.
In this study, we propose a class-adaptive data augmentation method that utilizes the class-specific characteristics of a dataset without domain knowledge. The proposed method builds a CNN in three stages: (1) CNN training without data augmentation; (2) optimization of class-specific augmentation policies; (3) CNN re-training with class-adaptive data augmentation. In the first stage, the CNN is trained on the original training dataset without data augmentation. In the second stage, a suitable augmentation policy for each class is derived through an optimization procedure that maximizes the degree of transformation while maintaining the label-preserving property of the CNN. In the third stage, the CNN is re-trained using data augmentation based on the class-specific augmentation policies. To examine whether the proposed method derives suitable class-specific augmentation policies, we conducted experiments using three benchmark datasets comprising classes with different label-preserving transformation characteristics.
The remainder of this paper is organized as follows. In Section II, related studies are reviewed. In Section III, the proposed method is introduced. Section IV describes the experimental settings and results. In Section V, the conclusions and future work are presented.

II. RELATED WORK

A. TRANSFORMATION OPERATIONS FOR DATA AUGMENTATION
In image classification tasks, various transformation operations are readily available for data augmentation. Traditional transformation operations slightly vary an existing image using geometric and photometric transformations [16]. Geometric transformation operations change the image geometry by changing the pixel positions [8], [17], [18]. Examples include rotation, shifting, zooming, shearing, and flipping. Photometric transformation operations change pixel values in the color channels of an image [2], [9], [19]. Examples include brightness, contrast, and color inversion. Recently, transformation operations that synthesize a new image by combining two existing images, such as Mixup [20], CutMix [21], and CutPaste [22] have been proposed.
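As a minimal illustration of the two families of operations (this sketch is ours, not from the paper), a horizontal flip is a geometric transformation that rearranges pixel positions, whereas color inversion is a photometric transformation that changes pixel values in place:

```python
import numpy as np

def hflip(img):
    """Geometric: mirror the image left-right (pixel positions change)."""
    return img[:, ::-1, :]

def color_invert(img):
    """Photometric: invert pixel intensities (positions stay fixed)."""
    return 255 - img

# Tiny H x W x C image with distinct values, just to show the effect.
img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
flipped = hflip(img)
inverted = color_invert(img)
```

Applying a geometric operation twice with opposite parameters recovers the original image, which is why such operations are easy to reason about when checking label preservation.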
In this study, geometric and photometric transformation operations are used to formulate class-specific augmentation policies. These operations are well suited to capturing the class-specific characteristics of a dataset and are easy to interpret.

B. OPTIMIZATION OF DATA AUGMENTATION POLICY
A data augmentation policy comprises transformations and their distributions. The distribution determines the degree of transformation of the image. For example, the rotation operation rotates an image by an angle randomly sampled from a predefined distribution, such as Uniform([−30°, 30°]). The horizontal flip determines whether to flip an image according to a distribution such as Bernoulli(0.5). To improve the effectiveness of data augmentation, it is important to appropriately set the hyperparameters of these distributions so that the augmentation policy does not change the original class labels of the images. On the one hand, if a distribution is set too narrow, images are not well diversified, reducing the data augmentation effect. On the other hand, if a distribution is too broad, there is a risk that the class label of an image will not be preserved after transformation, which can negatively affect the classification performance.
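The sampling step described above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation; the dictionary-based policy format and the operation names are our assumptions:

```python
import random

def sample_transformation(policy):
    """Sample one concrete transformation from an augmentation policy.

    `policy` maps an operation name to the hyperparameters of its sampling
    distribution: a uniform range for continuous degrees (e.g. rotation
    angle) or a Bernoulli probability for on/off operations (e.g. flips).
    """
    params = {}
    for op, hp in policy.items():
        if hp["dist"] == "uniform":
            params[op] = random.uniform(-hp["bound"], hp["bound"])
        elif hp["dist"] == "bernoulli":
            params[op] = random.random() < hp["p"]
    return params

policy = {
    "rotate": {"dist": "uniform", "bound": 30.0},  # degrees in [-30, 30]
    "hflip":  {"dist": "bernoulli", "p": 0.5},     # flip with prob. 0.5
}
tau = sample_transformation(policy)
```

Each call yields a fresh transformation, so a new random variant of every training image is produced at each epoch.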
While manually designing an augmentation policy for a dataset based on domain knowledge is challenging and often infeasible, recent studies have attempted to automatically determine an appropriate augmentation policy. Existing methods define the search space of an augmentation policy and optimize the hyperparameters of transformation operations in the policy in a data-driven manner. The methods for defining a search space can be categorized into dataset-level and instance-level approaches, based on the manner in which the search space is defined.
The dataset-level approach optimizes the augmentation policy to make it suitable for a given dataset. Cubuk et al. [10] used reinforcement learning based on proximal policy optimization to search for the best augmentation policy for a dataset, in which each candidate augmentation policy is directly evaluated by training the CNN. Ho et al. [11] leveraged population-based training to generate a non-stationary augmentation policy schedule that defines the best policy for each training epoch, instead of a fixed augmentation policy. Lim et al. [12] optimized the augmentation policy via efficient density matching based on Bayesian optimization, which does not require repeated CNN training for evaluating augmentation policies and thereby significantly reduces the computational cost. Zhang et al. [13] formulated the augmentation policy search as an adversarial learning problem, wherein a policy network generates an augmentation policy to maximize the training loss of the CNN, while the CNN learns from the augmented training data to improve generalization. Hataya et al. [14] and Li et al. [15] used a differentiable augmentation policy to simultaneously optimize the augmentation policy and train the CNN. These methods apply the same augmentation policy to all instances without considering the differences in characteristics between classes.
The instance-level approach builds a policy network that predicts an image-specific augmentation policy suitable for each input image. Zhou et al. [23] used a policy network to assign different weights to transformed images in CNN training. Cheung and Yeung [24] used a policy network that adaptively generated the hyperparameters for the transformation operations of each image. For image-dependent data augmentation, these methods commonly require a policy network to be maintained, which increases the computational cost.
The proposed method aims to improve the effectiveness of data augmentation for CNN training by leveraging class-specific augmentation policies. Compared with the dataset-level approach, optimizing the hyperparameters of individual policies to be class-specific contributes to further diversifying transformations while preserving class labels. Compared with the instance-level approach, the proposed method obtains class-specific augmentation policies using a single CNN without requiring any additional network to be trained.

III. METHODOLOGY
The problem situation addressed in this study is as follows. For a C-class image classification task, we suppose that a training set D = {(X_i, y_i)}_{i=1}^{N} and a validation set D′ = {(X_i, y_i)}_{i=N+1}^{N+N′} are given, where X_i represents the i-th image and y_i ∈ {1, . . . , C} represents the corresponding class label. In the training phase, we build a CNN f using the training and validation sets D and D′. The input and output of the CNN f are an image X and its softmax response vector f(X) ∈ [0, 1]^C, respectively, for predicting the class label y.
Data augmentation is applicable during training to improve the classification accuracy of the CNN f. The conventional approach uses an augmentation policy A(φ) comprising M transformation operations, where φ = (φ_1, φ_2, . . . , φ_M) is the set of hyperparameters for the transformation operations involved in the policy. For the m-th operation, the hyperparameter φ_m specifies the distribution from which the degree of transformation is sampled. At each training epoch, each training image X is randomly transformed using a transformation τ sampled from the augmentation policy A(φ).
This study aims to make data augmentation further diversify images while preserving their class labels during training by reflecting the class-specific characteristics of the dataset. Unlike the conventional approach, which uses the same augmentation policy regardless of the class, the proposed method introduces class-specific augmentation policies A(φ^1), . . . , A(φ^C) such that each image is randomly transformed in a different manner depending on its class label. For this, we derive suitable hyperparameters for individual classes, namely φ^1, . . . , φ^C, by performing an optimization procedure using a CNN trained without data augmentation.

[Algorithm 1: pseudocode of the proposed method. Its Train function initializes a CNN f and updates it for one epoch at a time on the (augmented) training set until a termination condition on D′ is met.]

Figure 1 shows a schematic diagram of the proposed method, which consists of three stages: (1) CNN training without data augmentation, (2) optimization of class-specific augmentation policies, and (3) CNN re-training with class-adaptive data augmentation. Algorithm 1 presents the pseudocode of the proposed method; the first, second, and third stages correspond to line 2, lines 3-10, and line 11, respectively. The following subsections describe each stage.

A. TRAINING WITHOUT DATA AUGMENTATION
In this stage, we train a CNN f on the training set D without applying data augmentation, using the validation set D′ to monitor the validation performance. The training is terminated upon meeting a predefined termination condition on the validation set D′. The training procedure is presented in the Train function of Algorithm 1.
The trained CNN f is used in the optimization of class-specific augmentation policies. As the CNN f learns the characteristics of original images, we determine the degrees of transformations for individual transformation operations such that the augmentation policies do not change the class labels of images after they are transformed.
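One plausible way to quantify how much a transformation perturbs the trained CNN's softmax responses is sketched below. The exact change measure is not reproduced in this excerpt, so the L1-style distance here is an assumption; `f` and `transform` are stand-ins for the trained CNN and a candidate transformation operation:

```python
import numpy as np

def softmax_response_change(f, transform, images):
    """Mean change between the CNN's softmax responses for original and
    transformed validation images. A large value suggests the
    transformation may not preserve class labels for this class."""
    orig = np.array([f(x) for x in images])
    trans = np.array([f(transform(x)) for x in images])
    return float(np.abs(orig - trans).sum(axis=1).mean())

# Toy demo: the "CNN" is the identity on already-softmaxed vectors, and the
# "transformation" shifts probability mass between the two classes.
f = lambda x: x
images = [np.array([0.7, 0.3]), np.array([0.6, 0.4])]
shift = np.array([0.1, -0.1])
change = softmax_response_change(f, lambda x: x + shift, images)
```

A transformation that leaves the softmax response unchanged gives a change of zero, which is the behavior expected of a label-preserving operation.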

B. OPTIMIZATION OF CLASS-SPECIFIC AUGMENTATION POLICIES
This stage optimizes the hyperparameters for the class-specific augmentation policies. For the augmentation policy of the c-th class, A(φ^c), M hyperparameters φ^c = (φ^c_1, . . . , φ^c_M) must be optimized. Each hyperparameter φ^c_m corresponds to the degree of transformation for the m-th transformation operation involved in the policy. In this study, we selected 10 transformation operations typically used in image augmentation. These operations are relatively simple and easy to interpret. Table 1 lists the transformation operations and their hyperparameters.
The objective here is to maximize the degree of transformation in the augmentation policy of each class while maintaining the label-preserving property. For the m-th transformation operation of the c-th class, the hyperparameter φ^c_m is optimized by solving the optimization problem

    φ^c_m = arg min_{φ ∈ [0, u_m]} J_m(φ; f, D′_c),    (1)

where D′_c is the subset of the validation set D′ for the c-th class (i.e., D′_c = {(X_i, y_i) ∈ D′ | y_i = c}) and u_m is the upper bound of the search space of φ^c_m specified in Table 1. In the objective function J_m, the first term provides a penalty, with a tolerance of ε, if the softmax response output by the CNN f changes significantly when the input image is transformed by a degree of φ^c_m. The change in the softmax response is quantified using a function Δ_m.

If m ∈ {1, 2, 3, 4, 5, 6, 7}, the search space of the hyperparameter φ^c_m is a continuous range. The function Δ_m compares the softmax responses of the CNN f for τ_m(X_i; −φ^c_m) and τ_m(X_i; φ^c_m), which are the transformed images of X_i by −φ^c_m and φ^c_m degrees, respectively, under the m-th transformation operation. These two are the maximum transformations that can be obtained from the m-th transformation operation given the hyperparameter φ^c_m. Because the objective function J_m has a complex form with respect to φ^c_m, we use Bayesian optimization to efficiently solve the optimization problem in Equation 1.
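The continuous-case search can be sketched as follows. Both the exact form of J_m and the use of grid search are our assumptions: the objective below combines a hinge penalty on the softmax-response change beyond the tolerance ε with a small reward (weighted by λ) for larger degrees, and grid search stands in for the paper's Bayesian optimization:

```python
import numpy as np

def optimize_degree(delta, u, eps=0.1, lam=0.01, n_grid=101):
    """Pick the degree phi in [0, u] minimizing an assumed objective:
    hinge penalty on the softmax-response change `delta(phi)` beyond
    tolerance `eps`, minus a small reward for larger degrees."""
    grid = np.linspace(0.0, u, n_grid)
    J = np.array([max(0.0, delta(p) - eps) - lam * p / u for p in grid])
    return float(grid[np.argmin(J)])

# Toy softmax-change curve: the change grows quadratically with the degree,
# so the search settles near the largest degree within the tolerance.
best = optimize_degree(lambda p: (p / 30.0) ** 2, u=30.0)
```

With the quadratic toy curve, the penalty is zero up to roughly 30·√0.1 ≈ 9.5 degrees, so the selected degree lands just below that threshold.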
If m ∈ {8, 9, 10}, the search space of the hyperparameter φ^c_m is discrete and contains only two values: 0 and 0.5. In this case, the function Δ_m evaluates the change in the softmax response when the transformation is applied, where τ_m(X_i; 2 · φ^c_m) indicates that the m-th transformation operation is applied to image X_i with a probability of 2 · φ^c_m. Because the hyperparameter φ^c_m can take a value of either 0 or 0.5, we optimize it by directly comparing Δ_m(0; f, D′_c) and Δ_m(0.5; f, D′_c).

After solving the optimization problems for all pairs of transformation operations and classes, we obtain the optimized hyperparameters of the M transformation operations that specify the class-specific augmentation policies. The c-th hyperparameter set φ^c specifies the distributions of the transformation operations for the class-specific augmentation policy A(φ^c). For class-adaptive data augmentation, an image X belonging to the c-th class is randomly transformed by applying a transformation τ ∼ A(φ^c).

An important consideration is the choice of the tolerance hyperparameter ε. If it is set to a larger value, the hyperparameters of the transformation operations become larger. Because the augmentation policy uses multiple transformation operations simultaneously, an image can then be so extremely transformed that its class label may not be preserved. Conversely, setting ε to a smaller value results in smaller hyperparameters of the transformation operations, and the effect of data augmentation is therefore reduced. To sidestep this difficulty, we set ε to be sufficiently large and introduce an augmentation decay strategy for class-adaptive data augmentation in the next stage.
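The discrete case reduces to a two-point comparison. The selection rule below, keeping the larger flip probability only when the measured change stays within the tolerance, is our reading of the comparison; `delta` stands in for the change measure evaluated on the class-wise validation subset:

```python
def optimize_flip_probability(delta, candidates=(0.0, 0.5), eps=0.1):
    """Two-point search for on/off operations (e.g. flips): keep the
    larger probability only if the resulting softmax-response change
    stays within the tolerance eps."""
    p_off, p_on = candidates
    return p_on if delta(p_on) <= eps else p_off

# Toy class where flipping barely changes the softmax response: keep it.
p_keep = optimize_flip_probability(lambda p: 0.06 if p > 0 else 0.0)
# Toy class where flipping changes the response a lot: disable it.
p_drop = optimize_flip_probability(lambda p: 0.2 if p > 0 else 0.0)
```

This mirrors the MNIST example from the introduction: flips would be kept for digits such as 0, 1, and 8 and disabled for the others.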

C. TRAINING WITH CLASS-ADAPTIVE DATA AUGMENTATION
In this stage, we re-train the CNN f with data augmentation using the class-specific augmentation policies derived in the previous stage. At each training epoch, each training image X is transformed differently depending on its class label y using the class-specific augmentation policy A(φ^y). For example, an image X belonging to the c-th class is transformed using a transformation τ ∼ A(φ^c). The CNN takes the transformed image X̃ as input and outputs its softmax response f(X̃).
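The per-image dispatch can be sketched as follows. The policy format (per-class degree bounds) and the `apply_op` callback are illustrative assumptions, not the paper's implementation:

```python
import random

def class_adaptive_transform(image, label, policies, apply_op):
    """Transform `image` with the policy of its class: sample a degree for
    each operation from the class-specific bounds and apply it.
    `policies[label]` maps operation names to degree bounds; `apply_op`
    stands in for the actual image transformations."""
    out = image
    for op, bound in policies[label].items():
        degree = random.uniform(-bound, bound)
        out = apply_op(out, op, degree)
    return out

# Two classes with different rotation limits, as in the digit examples.
policies = {0: {"rotate": 30.0}, 1: {"rotate": 5.0}}
applied = []
def apply_op(img, op, degree):
    applied.append((op, degree))  # record instead of transforming pixels
    return img
out = class_adaptive_transform("img", 1, policies, apply_op)
```

Images of class 1 are only ever rotated within ±5 degrees here, while images of class 0 would use the full ±30-degree range.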
For class-adaptive data augmentation, we introduce an augmentation decay strategy that initially uses relatively large values for the hyperparameters Φ = {φ^1, . . . , φ^C} and gradually decreases them over time, such that the gap between original and transformed images is reduced as the training progresses [25]. To this end, we decay the hyperparameters Φ by a factor of γ ∈ [0, 1) at every training epoch. With the augmentation decay, the CNN f is trained with intensively transformed images at early training epochs and is refined with less heavily transformed images at later stages. This can contribute to better regularization of the CNN f [25].

IV. EXPERIMENTS

A. DATA DESCRIPTION
We validated the effectiveness of the proposed method using three benchmark image datasets with class-specific constraints of transformations: MNIST [26], SVHN [27], and WM-811K [28].
• MNIST: The dataset contains handwritten digit images, each belonging to one of the digits from 0 to 9. The original shape of each image is 28 × 28. In the experiments, we resized each image using bilinear interpolation and converted it into three channels, resulting in a shape of 32×32×3. For data augmentation, each class has different restrictions on the rotation and flip operations owing to the characteristics of digits.
• SVHN: The dataset contains printed digit images cropped from images of house number plates. There are 10 digit classes from 0 to 9. Every image has a shape of 32×32×3. For data augmentation, shear and color invert operations are known to be effective for all classes [10]. Each class has different restrictions on the rotation and flip operations, as in MNIST.
• WM-811K: The dataset contains semiconductor wafer bin maps with 9 defect pattern classes: Center, Donut, Edge-Loc, Edge-Ring, Loc, Random, Scratch, Near-Full, and None. In a wafer map, each element has a value of 1 if the corresponding die is defective and 0 otherwise. Among the total of 811,457 wafer maps, 172,946 labeled wafer maps with a size greater than 100 were used in the experiment. Because the wafer maps had different shapes, we resized them to 64 × 64 using bicubic interpolation and duplicated the channels. For data augmentation, the rotation and flip operations are label-preserving transformations for all classes [5]. However, shift and zoom operations do not preserve labels for most classes.

Table 2 lists the dataset details. For each dataset, we used the default split of the training/validation and test sets in the original release. All pixel values were normalized using the mean and standard deviation of the training set.

B. EXPERIMENTAL SETTINGS
For the architecture of the CNN f , we used VGG19 [29] with batch normalization [30] and ResNet34 [31]. For each of them, we removed all fully-connected layers and added an output layer having as many nodes as the number of classes in the target image classification task. In the case of VGG19, we replaced the flatten operation with global average pooling.
In the first and third stages of the proposed method, we used the following configurations to train the CNN f . The ratio of training set D and validation set D ′ was set to 8:2. Categorical cross-entropy was used as the loss function. We used the Adam optimizer for parameter updating, with a mini-batch size of 128. L2 regularization with a factor of 10 −4 was applied to the parameters. The augmentation decay γ for the third stage was set to 0.98. The initial learning rate was set to 10 −4 . The learning rate was reduced by a factor of 0.1 if the validation accuracy did not improve during 25 successive training epochs. After 100 training epochs, training was terminated if the validation accuracy did not improve during 50 successive epochs or the number of epochs reached 500.
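The learning-rate schedule and termination rule described above can be condensed into a small controller. This is our reading of the schedule, not the authors' code; the class name and structure are illustrative:

```python
class TrainingController:
    """Sketch of the schedule described above: cut the learning rate by
    10x after 25 epochs without validation improvement, and stop once
    epoch 100 has passed and 50 epochs elapse without improvement, or at
    epoch 500 at the latest."""

    def __init__(self, lr=1e-4):
        self.lr, self.best, self.stale, self.epoch = lr, -1.0, 0, 0

    def step(self, val_acc):
        """Record one epoch's validation accuracy; return True to stop."""
        self.epoch += 1
        if val_acc > self.best:
            self.best, self.stale = val_acc, 0
        else:
            self.stale += 1
            if self.stale % 25 == 0:  # 25 successive epochs w/o improvement
                self.lr *= 0.1
        return (self.epoch >= 100 and self.stale >= 50) or self.epoch >= 500
```

For instance, if validation accuracy peaks at epoch 1 and never improves again, the controller reduces the learning rate at epochs 26, 51, and 76 and stops at epoch 100.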
In the second stage of the proposed method, we used the 10 transformation operations listed in Table 1 to form one class-specific augmentation policy. We set ϵ = 0.1 and λ = 0.01 in the objective function J m . For Bayesian optimization, we used the Gaussian process with the Matern kernel as the surrogate model and expected improvement as the acquisition function. The numbers of initial random exploration steps and optimization steps were set to 5 and 30, respectively.
The CNN performance was evaluated in terms of the classification accuracy of the test set, which is a typically used performance measure in image classification tasks. All experiments were repeated 10 times independently with different random seeds, and the mean and standard deviation of each result over the 10 replications were calculated.

C. COMPARED METHODS
The proposed method, referred to as ClassAdaptive-DA, was compared with three baselines: a CNN trained without data augmentation (Without-DA), a CNN trained with data augmentation based on a known dataset-level augmentation policy from the literature (Baseline-DA), and an ablation of the proposed method in which the same augmentation policy is applied to all classes instead of class-adaptive policies (DatasetAdaptive-DA).
For Baseline-DA, the augmentation policies for the MNIST, SVHN, and WM-811K datasets were obtained from previous studies [5], [8], and [10], respectively. The other training configurations were the same as those in the proposed method.
For DatasetAdaptive-DA, we altered the second stage of the proposed method by optimizing a dataset-level augmentation policy. This was to examine the effectiveness of the class-adaptive augmentation policies in the proposed method while controlling for other conditions.

Table 3 presents a comparison of the classification accuracies of the baseline and proposed methods with two CNN architectures, VGG19 and ResNet34, on the benchmark datasets. In each row, the highest accuracy is shown in bold. The results show that ClassAdaptive-DA yielded a classification accuracy comparable to or higher than that of the baseline methods in most of the experimented cases. Without-DA always yielded the lowest classification accuracy. In the case of MNIST, wherein all methods yielded very high classification accuracy, ClassAdaptive-DA yielded the highest accuracy; however, there were no significant performance differences among Baseline-DA, DatasetAdaptive-DA, and ClassAdaptive-DA. In the case of SVHN, DatasetAdaptive-DA performed the best, followed by ClassAdaptive-DA. In the case of WM-811K, ClassAdaptive-DA outperformed all the baseline methods. Regarding the CNN architecture, VGG19 showed slightly better performance than ResNet34.

D. RESULTS AND DISCUSSION
For ClassAdaptive-DA, we examined the class-specific augmentation policies derived using VGG19 for each benchmark dataset. The optimized hyperparameters for MNIST, SVHN, and WM-811K are listed in Tables 4, 5, and 6, respectively. Notable results in the tables are highlighted in bold. Figure 2 shows examples of class-adaptively transformed images at the 50th training epoch of VGG19 for the benchmark datasets. We found that the class-specific augmentation policies derived using ClassAdaptive-DA were consistent with the known domain knowledge about each benchmark dataset. In addition, we found other class-specific label-preserving transformations that have not been discussed in the literature. Our findings for each dataset are described below.
• MNIST: The class-specific augmentation policies of MNIST are determined according to the characteristics of the digits. For rotation, the full range was used only for class 0, whereas smaller ranges were used for the other classes. Wide shear ranges were used for all classes; shear is a label-preserving transformation for handwritten digits, as they are naturally sheared. For horizontal and vertical shifts, almost the full range was used only for class 1. The probability of horizontal flip was set to the maximum for classes 0, 1, and 8, whereas horizontal flip was not used for classes 2, 3, 5, 6, and 9. The probability of vertical flip was the maximum for classes 0, 1, 3, and 8, whereas vertical flip was not used for classes 2, 5, 6, 7, and 9. For class 4, horizontal and vertical flips had non-zero probabilities; for certain handwritten '4' digits in the dataset, these flips appeared to preserve the labels.
• SVHN: The class-specific augmentation policies of SVHN were similar to those of MNIST. Additionally, characteristics of the digits printed on house number plates were observed. Compared with MNIST, the differences in the optimized hyperparameters between classes were relatively small, which may have reduced the effect of class-adaptive augmentation policies on classification accuracy. Some interesting characteristics different from those of MNIST are as follows: the ranges of shear were set relatively narrow, as printed digits are generally less sheared than handwritten digits. Because the images in this dataset have various colors, color invert is label-preserving and was thus used with a wide range for all classes.
• WM-811K: Color invert was applied only to class Random.

V. CONCLUSION
We presented a class-adaptive data augmentation method for effectively using data augmentation in image classification tasks in the absence of domain knowledge of a given dataset. Using a CNN trained without data augmentation, class-specific augmentation policies for individual classes were derived through an optimization procedure. The CNN was then re-trained using data augmentation with the class-specific augmentation policies. Experimental results on benchmark datasets with class-specific characteristics showed that the proposed method yielded performance comparable or superior to that of the baseline methods. Additionally, the class-specific augmentation policies derived by the proposed method for each dataset were shown to be consistent with the domain knowledge.
We expect that the proposed method will be useful in further improving the classification accuracy of a CNN when class-specific constraints of transformations are imposed on the dataset. The proposed method requires maintaining only one CNN, which is advantageous in terms of the memory usage. Additionally, the analysis of class-specific augmentation policies can aid in understanding the class-specific characteristics of a dataset without domain knowledge.
In future work, we will investigate methods to further improve the efficiency and effectiveness of the proposed method. The efficiency can be improved by training the CNN and optimizing class-specific augmentation policies simultaneously. The effectiveness can be improved by introducing more randomness into augmentation policies, such as randomly permuting the order between transformation operations and randomly sampling transformation operations to apply, to generate more diversified variations of images.
JISU YOO received the B.S. degrees in industrial and systems engineering and in software convergence from Dongguk University, in 2021, and the M.S. degree in industrial engineering from Sungkyunkwan University, in 2023. Her research interests include data augmentation, hyperparameter optimization, and industrial artificial intelligence.
SEOKHO KANG received the B.S. and Ph.D. degrees in industrial engineering from Seoul National University, in 2011 and 2015, respectively. He was a Research Staff Member with the Samsung Advanced Institute of Technology. He is currently an Assistant Professor of industrial engineering with Sungkyunkwan University. His research interests include bridging the gap between theory and practice in machine learning through practical methodologies and data mining applications in various industrial fields, including manufacturing, materials, and healthcare.