Regional Patch-Based Feature Interpolation Method for Effective Regularization

Deep convolutional neural networks (CNNs) can depend too heavily on their training data, which causes a generalization problem: trained models may fail to predict real-world data correctly. To address this problem, various regularization methods, such as image manipulation and feature map regularization, have been proposed for their strong generalization ability. In this paper, we propose a patch-based regularization method that applies both image manipulation and feature map regularization. Because the method regularizes in two stages, the model generalizes better and its performance improves. Moreover, our method adds features extracted from other images at the hidden-state stage, which not only makes the model robust to noise but also lets it capture the distribution of each label. Through experiments on the CIFAR and Tiny-ImageNet datasets, we show that our method performs competitively on models that have a large number of parameters and generate multiple feature maps.


I. INTRODUCTION
With the application of deep convolutional neural networks (CNNs) in diverse computer vision tasks (e.g., image captioning [1], [2], object recognition [3], [4], and semantic segmentation [5]), a range of models have been explored, including multi-path networks [6], deep networks [7], and networks utilizing the attention mechanism [8]. These models tend to learn more parameters to increase their representational power. Moreover, as deep CNNs learn sparse representations, their decision boundaries are less clear than those of conventional statistical methods [9]. Consequently, excessive dependence on training data causes a generalization failure [10]. Models that fail to generalize may perform poorly on the test data. To address these challenges, diverse regularization methods have been proposed.
The most widely utilized regularization method involves the direct augmentation of images. Some methods simply crop, rotate, and flip the images [11], whereas others eliminate or combine image information to generate new images. Techniques that eliminate partial information from the image make the model consider the entire object area, as well as the most discriminative part of the object, during training. They can thereby improve generalization and localization, but some beneficial information may be removed from the image. Methods that combine images, in turn, have the limitation that the resulting images are visually unnatural. Several methods have been explored to overcome this problem, one of which is CutMix [12]. CutMix preserves image information by cropping patches from one image and pasting them into another.
The associate editor coordinating the review of this manuscript and approving it for publication was Haruna Chiroma.
Furthermore, regularization methods that manipulate feature maps in a hidden state have been extensively researched. Methods that eliminate some values from feature maps [13], [14], add noise to feature maps [15]-[17], or combine feature maps in hidden states have been proposed. However, methods that eliminate values or add extra noise slow convergence because they directly affect the gradient. In addition, methods that combine feature maps via additional networks to obtain regularization effects may incur additional computational costs, depending on how the feature maps are combined [9], [18].
In this paper, we propose a patch-based regularization method that applies both image augmentation and feature map regularization. The proposed method creates a new image by mixing input images on a patch basis; the label of the new image is obtained by mixing the label of the patch and the label of the existing input image in proportion to the patch size. A CNN is then applied to the new image to generate a feature map. Finally, this feature map is linearly interpolated with the feature map of the patch in proportion to the patch size.
The proposed method obtains regularization effects in both the image manipulation stage and the feature map regularization stage. In the image manipulation stage, two images are combined on a patch basis to generate a new image without information loss, as in CutMix [12]. In the feature map regularization stage, patch-based features from another image are added by interpolating between the feature maps of the new images. This allows the model to simultaneously learn the image distribution of other labels and to become robust to noise.
We used the CIFAR [19] and Tiny ImageNet [20] datasets to test the proposed method with ResNet [21] and WideResNet-28 [22] models. The top-1 accuracy on CIFAR-10 improved by 8.03% to 15.07% over the baseline, and that on CIFAR-100 improved by 18.60% to 29.28% over the baseline. On Tiny ImageNet, the top-1 accuracy improved by 2.54% to 5.28% over the baseline. In particular, models that have many parameters and generate multiple feature maps benefited the most.

II. RELATED WORKS
Regularization methods are largely divided into two types: image manipulation, which is the most widely used, and regularization in feature maps. We also review methods that mix feature maps.

A. IMAGE MANIPULATION
Image manipulation refers to transforming an image to create a new image and involves diverse techniques such as image flipping, cropping, and rotating, as well as color space transformation and random erasing [11], [23]. Cutout [24] is a method that randomly drops a square region from the input image. These techniques are easily applicable but may cause information loss; the noise may also become detrimental to the performance in the case of images biased towards specific textures or shapes or in case of geometrically biased images.
Recently, various methods have been researched that not only transform a single image but also mix two or more images. Among them, Mixup [25] mixes two images on a pixel basis for augmentation. Pairing sample [26] also mixes two images, by averaging their RGB pixels. In addition to mixing pixels, one study connects images region by region [27]. Despite their effectiveness in regularization, Mixup and Pairing sample have the limitation that the mixed images look unnatural. To overcome this problem, one study mixes images with a deep learning model so that the results look more intuitive [28]. However, this approach incurs a large computational cost because the mixing model must be trained separately. CutMix [12] avoids the information loss of Cutout by not just eliminating the pixels in a patch but also filling the patch with pixels from another image. This patch-based image mixing achieves not just augmentation but also localization effects.
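For reference, the two mixing rules contrasted above are commonly written as below. The symbols ($x_i$, $x_j$, $x_A$, $x_B$, the mask $M$) follow the usual presentation of Mixup and CutMix and are not notation defined in this section; $M$ is 0 inside the pasted patch and 1 elsewhere, and $\lambda$ is the fraction of the image retained from the first sample.

```latex
% Mixup: pixel-wise interpolation of two samples
\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad
\tilde{y} = \lambda y_i + (1 - \lambda) y_j, \qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha)

% CutMix: patch-wise replacement with a binary mask M
\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad
\tilde{y} = \lambda y_A + (1 - \lambda) y_B
```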

B. REGULARIZATION IN FEATURE MAPS
In addition to image augmentation, regularization in feature maps is another method universally used in many models. This method draws on the feature map obtained from the hidden state of a model, not the input image.
Dropout [13] is the most widely used and robust technique for dropping features from the hidden states to achieve regularization. Another popular technique, DropBlock [14], does not drop features at random positions but drops contiguous regions of the feature map. In addition, for some tasks such as object detection, dropout combined with other techniques, including the attention mechanism, has been examined [29], [30]. Batch normalization [31] can prevent a decline in performance by reducing internal covariate shift and thereby mitigating the vanishing gradient problem in the nonlinearity; it directly influences the gradient and thus renders a model robust against noise. Yet, convergence is slower with batch normalization than with other methods.
Furthermore, techniques for mixing noise into feature maps are being continuously explored. In practice, test data may be corrupted. Therefore, learning with intentionally added noise enables the models to focus on the essentials of the task rather than on texture biases [32], [33]. Such methods do not simply use random noise values but draw the noise from a statistically grounded probability distribution [16], [17]. Among the various noise distributions, the Gaussian distribution is the most widely used. Still, depending on how it is applied, Gaussian noise has been shown to cause substantial confusion for these models [11]. In addition, ongoing work examines distributions other than the Gaussian for adding noise that perturbs the data distribution and enables the models to perform tasks more robustly [15].

C. MIXING IN FEATURE MAP
Applying manipulation techniques to feature maps is also of interest to researchers. Mixing in feature maps is a widely used technique for regularization in hidden states, where the feature maps of the extracted characteristics of the images are mixed through different operations and affect the gradients of the models.
Manifold Mixup [9] addresses challenges such as sharp decision boundaries and the short distance between the boundaries and the data by mixing in hidden states. Moreover, a method using several networks for Mixup in hidden states has been suggested [18]. In that method, a triple network structure is used: two shallow networks extract the features of two images, and a third network mixes the two feature maps.
In this paper, we propose a method for efficiently achieving both image manipulation and feature map regularization effects. The proposed method transforms the input images with image manipulation and performs linear interpolation on the feature maps for Mixup. The method applies regularization to the model in two steps, enabling feature representations that are robust against noise and better generalization of the models.

III. METHOD
The proposed method consists of two steps. First, the image manipulation step combines the images based on patches to generate new images. Second, the feature map regularization step uses the generated images to perform a convolution network operation. Then, linear interpolation is performed on the feature maps generated in this process. The proposed method is outlined in Fig. 1.
The algorithm relevant to the proposed method is discussed in the following sections.

A. IMAGE MANIPULATION
The image manipulation step uses a ratio $\lambda$ for mixing two training samples, as in CutMix [12]. A training image $x_a \in \mathbb{R}^{W \times H \times C}$ is combined with a patch $P_x \in \mathbb{R}^{W_{1-\lambda} \times H_{1-\lambda} \times C}$, extracted in the ratio $1-\lambda$ from another training image with a different label, to generate a new image $\hat{x} \in \mathbb{R}^{W \times H \times C}$. The new image's label $\hat{y}$ is generated by combining $y_a$ with the patch's label $P_y$. The patch and the new training sample $(\hat{x}, \hat{y})$ are generated as follows:

$$
\begin{aligned}
P_x &= (1 - B) \odot x_b, \\
\hat{x} &= B \odot x_a + P_x, \\
\hat{y} &= \lambda y_a + (1 - \lambda) P_y.
\end{aligned} \tag{1}
$$

Here, $(x_a, y_a)$ is the training sample taken at the original mini-batch index, and $(x_b, y_b)$ is the training sample taken after shuffling the indices within the mini-batch. The patch $P_x$ is extracted from $x_b$, and $P_y$ is the label of the image from which the patch is extracted. As in the Mixup method, the combination ratio $\lambda$ is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$. For $(x_b, y_b)$, a patch is generated in the ratio $1-\lambda$. The patch coordinates are drawn from a uniform distribution, whereas the patch size is determined from the image size according to $\lambda$.

$B \in \{0, 1\}^{W \times H}$ is a binary mask in which the patch region is filled with 0 and the remainder with 1. The pixels at the patch position in $x_a$ are removed by the element-wise multiplication of $B$ and $x_a$. The element-wise multiplication of $1-B$ and $x_b$ extracts the pixels at the patch position from $x_b$. We then create the new image $\hat{x}$ by adding $x_a$, from which the pixels in the patch have been removed, to $x_b$, which contains only the pixels of the patch. $\hat{y}$ is created by combining $y_a$ with $P_y$.

For the training dataset $D = \{(x_1, y_1), \ldots, (x_k, y_k)\}$, we use (1) to generate a new training dataset $\hat{D} = \{(\hat{x}_1, \hat{y}_1), \ldots, (\hat{x}_k, \hat{y}_k)\}$. In the training step, we set $\alpha$ to 1, that is, $\lambda$ is sampled from the uniform distribution.
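The image manipulation step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the patch side lengths follow the CutMix convention of scaling by $\sqrt{1-\lambda}$, the patch is clipped at the image border, and $\lambda$ is recomputed from the clipped patch area so that the label weights match the pixels actually kept. Images are assumed to be arrays of shape (height, width, channels) and labels one-hot vectors.

```python
import numpy as np


def sample_patch_mask(height, width, lam, rng):
    """Return a mask that is 0 inside the patch and 1 elsewhere.

    Assumption: patch sides are scaled by sqrt(1 - lam), as in CutMix,
    so the patch covers roughly a fraction 1 - lam of the image.
    """
    patch_h = int(height * np.sqrt(1.0 - lam))
    patch_w = int(width * np.sqrt(1.0 - lam))
    # Patch centre drawn uniformly; the box is clipped to the image borders.
    cy = int(rng.integers(0, height))
    cx = int(rng.integers(0, width))
    top, bottom = max(cy - patch_h // 2, 0), min(cy + patch_h // 2, height)
    left, right = max(cx - patch_w // 2, 0), min(cx + patch_w // 2, width)
    mask = np.ones((height, width, 1), dtype=np.float32)
    mask[top:bottom, left:right, :] = 0.0
    return mask


def mix_images(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """Paste a patch of x_b into x_a and mix the labels accordingly."""
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))  # alpha = 1 gives a uniform draw
    mask = sample_patch_mask(x_a.shape[0], x_a.shape[1], lam, rng)
    # Keep x_a outside the patch and x_b inside the patch.
    x_hat = mask * x_a + (1.0 - mask) * x_b
    # Assumption: recompute lam as the fraction of pixels kept from x_a.
    lam = float(mask.mean())
    y_hat = lam * y_a + (1.0 - lam) * y_b
    return x_hat, y_hat, lam, mask
```

With `alpha=1.0`, the default here, `rng.beta(1.0, 1.0)` is a uniform draw on [0, 1], which corresponds to the setting used in training.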

B. FEATURE MAP REGULARIZATION
The images generated in the image manipulation step are used as inputs to the convolutional layers to generate feature maps, which are in turn linearly interpolated to create new feature maps. By inputting a generated image $\hat{x}$ into a convolutional model, we calculate the hidden state $f^l_{\hat{x}}$ derived from convolutional layer $l$:

$$
f^l_{\hat{x}} = C(\hat{x}) \tag{2}
$$

Here, $C$ is a convolutional model that has $l$ convolutional layers, and $f^l_{\hat{x}}$ is the feature map calculated by these $l$ convolutional layers for the input image $\hat{x}$.
We perform the regularization through linear interpolation, as given in (3), on the feature map generated in the convolutional layer:

$$
\hat{f}^l = \lambda f^l_{\hat{x}^k_a} + (1 - \lambda) f^l_{\hat{x}^k_b} \tag{3}
$$

$\hat{x}^k_a$ is the new image generated by combining images $x_a$ and $x_b$. $\hat{x}^k_b$ is the image in which $x_b$ is combined with another training image. $f^l_{\hat{x}^k_b}$ is the feature map generated at the $l$th layer for the image $\hat{x}^k_b$. The feature map of the newly generated image and the feature map of the patch image used to generate it are linearly interpolated in the mixing ratio $\lambda$ to generate a new feature map.
Equation (3) is applied to the new training dataset $\hat{D}$. Instead of simply passing the existing feature map through further convolution operations, the proposed method perturbs the model further by combining the feature map with the distribution of another label within the data; our experiments show that this yields a more robust feature representation. In our experiments, the layer $l$ was sampled from a uniform distribution. The ablation study (Section IV-F) examines which choice of layer $l$ is most effective for the regularization.
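The feature map regularization step can be illustrated with a short NumPy sketch. It is a toy under stated assumptions rather than the authors' code: the model is represented as a plain list of callables, the function names are hypothetical, and both mixed images are passed through the same layers up to the chosen layer, where their feature maps are interpolated with ratio $\lambda$ before the forward pass continues.

```python
import numpy as np


def interpolate_feature_maps(f_a, f_b, lam):
    """Linear interpolation of two feature maps of identical shape."""
    if f_a.shape != f_b.shape:
        raise ValueError("feature maps must have the same shape")
    return lam * f_a + (1.0 - lam) * f_b


def forward_with_interpolation(layers, x_hat_a, x_hat_b, lam, mix_layer):
    """Run both inputs through `layers` and mix them after layer `mix_layer`.

    `layers` is a list of callables standing in for convolutional blocks
    (an assumption of this sketch); `mix_layer` is a 1-based index.
    """
    if not 1 <= mix_layer <= len(layers):
        raise ValueError("mix_layer must be between 1 and len(layers)")
    h_a, h_b = x_hat_a, x_hat_b
    for index, layer in enumerate(layers, start=1):
        h_a = layer(h_a)
        if index <= mix_layer:
            # The second branch is only needed up to the mixing point.
            h_b = layer(h_b)
        if index == mix_layer:
            h_a = interpolate_feature_maps(h_a, h_b, lam)
    return h_a


# Toy usage: two stand-in "layers" and constant inputs.
toy_layers = [lambda h: 2.0 * h, np.square]
out = forward_with_interpolation(
    toy_layers, np.ones((2, 2)), np.zeros((2, 2)), lam=0.75, mix_layer=1
)
```

In the toy call, mixing after the first layer and mixing after the second layer give different outputs because the second stand-in layer is nonlinear, which is why the choice of layer matters.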

IV. EXPERIMENTS

A. DATASETS
For the experiments, the CIFAR-10, CIFAR-100 [19], and Tiny ImageNet [20] datasets were used. The CIFAR datasets, which are the most widely used benchmark datasets, were chosen to allow comparison with previous works. Tiny ImageNet was chosen because it has more images and labels than the CIFAR datasets. CIFAR-10 and CIFAR-100 are intended for image classification and each consist of 60,000 color images (50,000 training images and 10,000 test images) of size 32 × 32. The number of labels is 10 and 100, respectively. Tiny ImageNet is also intended for image classification and consists of 100,000 images of size 64 × 64 with 200 labels.

B. IMPLEMENTATION DETAILS
We used a GTX 1080 Ti GPU to train the models. The widely used ResNet [21] and its WideResNet [22] variant served as the models; specifically, 18-, 34-, and 50-layer ResNets and a 28-layer WideResNet were used. The batch size was set to 64 for the CIFAR data and 128 for the Tiny ImageNet data. The models were trained for 250 epochs on the CIFAR data and 300 epochs on the Tiny ImageNet data. We used stochastic gradient descent (SGD) [34] for optimization. For the CIFAR data, the learning rate was initially set to 0.25 and decayed by a factor of 0.2 at the 60th, 120th, 160th, and 200th epochs. For Tiny ImageNet, the learning rate was initially set to 0.1 and decayed by a factor of 0.1 at the 75th, 150th, 225th, and 300th epochs. Moreover, because the baseline overfits more easily than the other models, we used early stopping as necessary. We report the best performance achieved during training by our method and the other methods. We used accuracy and error rate to evaluate the classification task; these metrics measure how well the predictions of the trained model agree with the ground-truth labels.
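The two step-decay schedules described above can be written as a small pure-Python function. This is a sketch under one assumption that the text does not settle: "decayed at the 60th epoch" is read as the reduced rate applying from epoch index 60 onward, with epochs counted from 0. The function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr, gamma, milestones):
    """Learning rate after multiplying by `gamma` at each milestone reached.

    Assumption: a milestone m takes effect for every epoch index >= m.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Schedules as stated in the implementation details.
CIFAR_SCHEDULE = dict(base_lr=0.25, gamma=0.2, milestones=(60, 120, 160, 200))
TINY_IMAGENET_SCHEDULE = dict(base_lr=0.1, gamma=0.1, milestones=(75, 150, 225, 300))

cifar_lrs = [step_decay_lr(e, **CIFAR_SCHEDULE) for e in (0, 60, 120, 160, 200)]
```

Under this reading, the CIFAR learning rate takes the values 0.25, 0.05, 0.01, 0.002, and 0.0004 over the course of training (up to floating-point rounding).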

C. EXPERIMENT WITH CIFAR DATASET
Each dataset was used for an experiment in the baseline model and other different models. We compared our methods with the baseline, augmentation, and other regularization methods. The augmentation settings were random cropping and random flipping. Other regularization methods used for comparison were Cutout [24], DropBlock [14], Mixup [25], CutMix [12], and Manifold Mixup [9].
Each method was run with the optimal hyperparameter values reported in its original article. For Cutout, the learning rate was set to 0.1, the number of holes to 1, and the hole length to 16. For DropBlock, the keep-prob was set to 0.9 and the block size to 4. For Mixup, the learning rate was set to 0.1, α to 1.0, and decay to 1e-4. For CutMix, the learning rate was set to 0.25 and α to 1.0. The results are summarized in Table 1, Table 2, and Table 3. The results show the top-1 accuracy achieved by each method in the ResNet-18, ResNet-34, ResNet-50, and WideResNet-28 models.
For more detail, we visualized the error at each epoch for ResNet-34 and WideResNet-28 on the CIFAR-100 dataset. Fig. 2(a) shows that our model converges slightly more slowly than the baseline in the early stage but becomes more stable after the first learning rate decay. For the baseline, the error increases in some intervals, whereas for our model the error decreases gradually throughout training. A similar trend can be found in Fig. 2(b): convergence is again slightly slower in the early stages but becomes faster after the first learning rate decay. In particular, the baseline shows little change after the first learning rate decay in comparison with our method.
To test the performance of our proposed method, we compared it with state-of-the-art augmentation methods. For the CIFAR-10 data, our method outperformed the existing methods in ResNet-34. In contrast, CutMix and Mixup outperformed our proposed method in the ResNet-18 and ResNet-50 models, respectively, although the difference was marginal (approximately 0.2), which indicates that the proposed method is sufficiently effective.
Moreover, on CIFAR-100, the proposed method achieved the highest performance regardless of the model and demonstrated an improvement of approximately 2%, indicating a substantial effect on model generalization. Table 3 compares the performance against other state-of-the-art data augmentation and regularization methods on the CIFAR datasets in the WideResNet-28 model. Our method achieves a 97.28% top-1 accuracy on CIFAR-10 and an 84.21% top-1 accuracy on CIFAR-100. On CIFAR-10, our method outperforms CutMix and Manifold Mixup by 0.37% and 0.04%, respectively. On CIFAR-100, it surpasses CutMix and Manifold Mixup by 1.59% and 0.24%, respectively.
The experimental results indicate a larger performance improvement in the deeper ResNet and WideResNet models, which generate many feature maps, than in the shallow ResNet model. Indeed, with the CIFAR-100 data, ResNet-18 showed an approximately 0.2% performance improvement compared with other methods, whereas WideResNet-28 achieved an approximately 1% performance improvement.
According to the experimental results, our method improved performance for models that produce multiple feature maps. Moreover, the results confirm that performing regularization in both the input and hidden state phases has a robust regularization effect on the model.

D. EXPERIMENT WITH TINY IMAGENET DATASET
To examine whether our method works well with data that are larger than CIFAR and have more diverse labels, we used the Tiny ImageNet dataset. The dataset was used in experiments with the baseline model and the other methods. As in the aforementioned experiments, each method was applied with the optimal hyperparameters reported in its original article. Each method was tested in the ResNet-18, ResNet-34, and ResNet-50 models. The results are summarized in Table 4.
The experimental results on the Tiny ImageNet data were comparable to those of the earlier experiment. First, compared with the baseline, the top-1 accuracy improved by between 2.54% and 5.28%. Apart from the improvement itself, its trend is also comparable to the earlier experimental results. Our method outperforms the other methods in top-1 accuracy in ResNet-18, and especially in the deeper models.
In particular, in ResNet-50, our method shows more than a 2% performance improvement compared with CutMix. Hence, in models that are deeper and have more parameters to learn, our method is more effective for generalization.

E. CLASS ACTIVATION MAPPING VISUALIZATION
The proposed method combines image manipulation with linear interpolation of two feature maps. Because two strong regularization schemes are stacked, the model might fail to focus on the main information in the image. Therefore, we plotted class activation maps (CAMs) [35] to check visually whether the model properly captures the main information of the image.
The result of the CAM according to each label of input images is given in Fig. 3. As seen in Fig. 3, the CAM result of the baseline finds no important features and is widely activated on the entire images. In the case of our method, it can easily be seen that the important features for image classification are activated. The reason for this result is that our method effectively utilizes image manipulation and feature map regularization to learn important features in an image.
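For readers unfamiliar with CAM, the map for a class is the sum of the last convolutional feature maps weighted by that class's weights in the final fully connected layer, which presupposes global average pooling before the classifier. The NumPy sketch below illustrates this computation only; the function name, the array shapes, and the min-max normalization for display are assumptions of the sketch, not details taken from the experiments.

```python
import numpy as np


def class_activation_map(feature_maps, fc_weights, class_index):
    """Weighted sum of feature maps for one class, rescaled to [0, 1].

    feature_maps: array of shape (K, H, W) from the last convolutional layer.
    fc_weights:   array of shape (num_classes, K) from the final linear layer.
    """
    # Contract the channel axis: (K,) with (K, H, W) gives (H, W).
    cam = np.tensordot(fc_weights[class_index], feature_maps, axes=1)
    # Assumption: min-max normalization, used here only for visualization.
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy usage: two 2x2 feature maps and a two-class classifier.
toy_maps = np.stack([np.eye(2), np.ones((2, 2))])
toy_weights = np.array([[1.0, 0.0], [0.0, 1.0]])
cam_class0 = class_activation_map(toy_maps, toy_weights, class_index=0)
```

The resulting map has the spatial size of the last feature maps and is typically upsampled to the input resolution before being overlaid on the image.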

F. ABLATION STUDY

1) LAYER IN OUR METHOD
The proposed method involves linear interpolation on feature maps within hidden states. Here, the layer where the method was applied was randomly selected. Fig. 4 shows the result of fixing the layer onto which our method was applied.
The experimental details are the same as those described earlier (Section IV-B). In the experiment, ResNet-50 (Fig. 4(a)) and WideResNet-28 (Fig. 4(b)) were used for the Tiny ImageNet and CIFAR-100 data, respectively. Index 1 denotes linear interpolation after the first convolution operation, batch normalization, and activation function. Indices 2 to 5 denote linear interpolation after each stage, and index 6 denotes interpolation after average pooling. The results show that selecting the layer at random led to better performance than fixing the layer in advance.
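The difference between the fixed-layer setting of this ablation and the random selection used by the method can be sketched in a few lines of Python. The index-to-position table mirrors the description above, but the wording of each entry and the function name are hypothetical, and the numbering of stages for indices 2 to 5 is an assumption.

```python
import random

# Interpolation points indexed as in the ablation; labels are illustrative.
MIX_POINTS = {
    1: "after first convolution, batch normalization, and activation",
    2: "after stage 1",
    3: "after stage 2",
    4: "after stage 3",
    5: "after stage 4",
    6: "after average pooling",
}


def choose_mix_point(fixed_index=None, rng=random):
    """Return a fixed interpolation index, or draw one uniformly at random."""
    if fixed_index is not None:
        if fixed_index not in MIX_POINTS:
            raise ValueError("unknown interpolation index")
        return fixed_index
    return rng.choice(sorted(MIX_POINTS))


fixed_choice = choose_mix_point(fixed_index=3)            # ablation setting
random_choice = choose_mix_point(rng=random.Random(0))    # default setting
```

In a training loop, the random variant would be called once per mini-batch so that the interpolation point varies over the course of training; that per-batch granularity is also an assumption of this sketch.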
2) IMPACT OF HYPERPARAMETER α
Table 5 shows the impact of the hyperparameter α, which controls the mixing ratio sampled for the hidden states and the input step in our method. We used Tiny ImageNet with the ResNet-34 model under the settings described earlier (Section IV-B). As the results indicate, performance was better when the size of the patch and that of the original image were similar in the mixing step, and it improved across a range of choices of α.

3) COMPARISONS OF DIFFERENT INTERPOLATION METHOD
In our method, the feature maps are multiplied by a preset ratio in the feature map regularization step, and a new feature map is generated by linear interpolation. We also experimented with other ways of generating the new feature map. The results are shown in Table 6.
The linear layer method interpolates the two feature maps after each has passed through a linear layer. The concat linear layer method concatenates the two feature maps before passing them through a linear layer. The nonlinear layer method applies a nonlinear function; we used the hyperbolic tangent. The experimental results show that generating the new feature map by plain linear interpolation leads to better results than adding a linear layer or a nonlinearity.
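The four mixing variants compared in this ablation can be contrasted in a NumPy sketch. The description above leaves several details open, so the following are assumptions of the sketch: feature maps are flattened to vectors of length D, the linear layers are plain weight matrices without bias, and the hyperbolic tangent is applied to the output of the linear layer variant. All function names are hypothetical.

```python
import numpy as np


def mix_linear_interpolation(f_a, f_b, lam):
    """Plain linear interpolation of two feature vectors."""
    return lam * f_a + (1.0 - lam) * f_b


def mix_linear_layer(f_a, f_b, lam, weight):
    """Pass each feature vector through a linear layer, then interpolate.

    weight: array of shape (D, D); no bias term in this sketch.
    """
    return lam * (f_a @ weight) + (1.0 - lam) * (f_b @ weight)


def mix_concat_linear_layer(f_a, f_b, weight):
    """Concatenate the two vectors, then reduce back to D dimensions.

    weight: array of shape (2 * D, D).
    """
    return np.concatenate([f_a, f_b], axis=-1) @ weight


def mix_nonlinear(f_a, f_b, lam, weight):
    """Linear layer variant followed by tanh (placement is an assumption)."""
    return np.tanh(mix_linear_layer(f_a, f_b, lam, weight))


# Toy usage with identity-like weights so the outputs are easy to read.
f_a, f_b = np.ones(4), np.zeros(4)
square_w = np.eye(4)
concat_w = np.vstack([np.eye(4), np.eye(4)])
plain = mix_linear_interpolation(f_a, f_b, lam=0.25)
```

Only the first variant is parameter-free; the other three introduce weights that must be learned together with the network.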
The results of this experiment show that mixing two feature maps is effective, because every mixing method performs better than the baseline. Moreover, they confirm that our way of mixing achieves the best performance. When the two feature maps are concatenated and then reduced in dimension by a linear layer, the change from the baseline is insignificant, whereas our method shows about a 6% improvement in performance compared with that method as well as with the linear and nonlinear methods.

V. CONCLUSION & FUTURE WORK
This paper proposes a method for regularizing both input images and their feature maps. Unlike existing methods, the proposed method mixes a different image distribution, rather than random noise, into the feature map on a patch basis in the feature map regularization step.
As a result, the proposed method enabled the models to learn the distribution of images with different labels as well as to suppress noise, and it ultimately outperformed other methods. When applied to WideResNet-28 on the CIFAR data, the top-1 accuracy was 97.28% for CIFAR-10 and 84.21% for CIFAR-100; the improvement over other methods was between 0.04% and 8.03% for CIFAR-10 and between 0.24% and 18.6% for CIFAR-100. For the Tiny ImageNet dataset, ResNet-34 and ResNet-50 achieved a top-1 accuracy of 68.77% and 69.21%, respectively, and ResNet-34 showed an improvement of between 0.3% and 4.75% compared with other methods.
In practical applications of convolutional networks, several regularization techniques are typically used in combination (e.g., flip + crop, dropout + batch normalization) rather than a single technique. Hence, using our proposed method in combination with other regularization methods, rather than alone, is expected to strengthen the regularization effect and robustness further. In the future, we plan to extend the method to apply noise to the model and to study adversarial attacks.