An Efficient Data Augmentation Network for Out-of-Distribution Image Detection

Deep neural networks may classify out-of-distribution image data into in-distribution classes with high confidence scores, which can cause serious or even fatal hazards in certain applications, such as autonomous vehicles and medical diagnosis. Therefore, out-of-distribution detection (also called anomaly detection or outlier detection) for image classification has become a critical issue for the successful development of neural networks; a successful neural network needs to be able to distinguish anomalous data that are significantly different from the data used in training. In this paper, we propose an efficient data augmentation network to detect out-of-distribution image data by introducing a set of common geometric operations into training and testing images. The predicted probabilities of the augmented data are combined by an aggregation function to provide a confidence score that distinguishes between in-distribution and out-of-distribution image data. Different from other approaches that use out-of-distribution image data for training networks, the proposed data augmentation network uses only in-distribution image data. This advantage makes our approach more practical than other approaches, and it can be easily applied to various neural networks to improve security in practical applications. The experimental results show that the proposed data augmentation network outperforms the state-of-the-art approaches on various datasets. In addition, pre-training techniques can be integrated into the data augmentation network to provide substantial improvements on large and complex datasets. The code is available at https://www.github.com/majic0626/Data-Augmentation-Network.git.


I. INTRODUCTION
Deep neural networks have achieved very impressive results in various computer vision tasks [1]-[3]. When training a neural network, the training data are assumed to be independent and identically distributed, and are called in-distribution data. Data that do not belong to the in-distribution are called out-of-distribution data. For example, in Fig. 1, traffic signs, zebra crossings, and cars are in-distribution data while birds are out-of-distribution data. When a neural network is overconfident about its prediction results and gives excessively high confidence scores to out-of-distribution data, these erroneous results will bring security risks to safety-critical applications
such as autonomous vehicles [4], medical diagnosis [5] and sensor-fault detection for industrial safety [6], [7]. Therefore, out-of-distribution detection has become a very important research goal in artificial intelligence security issues [8].
The goal of out-of-distribution detection is to detect whether an input comes from in-distribution or out-of-distribution. To resolve the problem, many approaches have been proposed, and they can be categorized into three main types: softmax-based approaches [9]-[11], uncertainty-based approaches [12], and generative-model-based approaches [13], [14]. Softmax-based approaches use the maximum value of the softmax probability as a threshold to distinguish out-of-distribution data. Uncertainty-based approaches add an additional confidence branch to provide an uncertainty score for an input. Finally, generative-model-based approaches treat an input as out-of-distribution data when its corresponding output is poorly reconstructed.

FIGURE 1. In-distribution data (gray circle) and out-of-distribution data (red circle) in feature space.
Among the proposed approaches, softmax-based approaches are widely used because they can be easily combined with any neural network without modifying its original architecture or adding other models. Furthermore, they can detect out-of-distribution data without affecting the performance of the original tasks such as classification. Therefore, many softmax-based approaches have been effectively used with pre-trained models. Because softmax-based approaches use the maximum value of the softmax probability as a confidence score and compare it with a threshold, they can be regarded as performing a binary classification task. When the confidence score is higher than the threshold, the model predicts that the input data come from in-distribution; otherwise, it predicts that the input data come from out-of-distribution.
Although softmax-based methods are simple and their computational cost is low, they must rely on neural networks to effectively separate the confidence scores of in-distribution data and out-of-distribution data. That is, a model must have the ability to give in-distribution data high confidence scores while giving out-of-distribution data low confidence scores. However, distinguishing in-distribution and out-of-distribution data is very difficult if the confidence score is determined by only one output probability, especially for models that are easily confused by out-of-distribution data. In order to improve the accuracy of out-of-distribution detection, our idea is to introduce a set of common geometric operations into training images to generate a set of training samples. The idea comes from the assumption that data augmentation can enable the neural network to classify a set of augmented data from the same image into the same class, that is, to output similar distributions of predicted probabilities for the set of augmented images. On the contrary, when the input comes from out-of-distribution, the probability of obtaining similar distributions of predicted probabilities for the augmented images is relatively small. In other words, even if one of the enhanced images has a high predicted probability, the other probabilities will diminish its influence. Finally, these predicted probability distributions are combined by an aggregation function to obtain a confidence score, which is used to determine whether the input data come from in-distribution or out-of-distribution.
In this paper, we develop an effective data augmentation network to detect out-of-distribution data and improve its robustness without reducing classification accuracy. In order to make a fair comparison, we apply the proposed method to WideResNet [15] and evaluate its effectiveness on many common datasets. The proposed data augmentation network outperforms the state-of-the-art approaches and can be further improved on larger datasets such as TinyImageNet [16] through pre-training techniques. The first innovation of this paper comes from the observation that when the input image comes from out-of-distribution, the predicted probabilities of the augmented images may be inconsistent; we make full use of this feature to detect out-of-distribution image data. The second innovation is that only in-distribution data are used to train our framework, which makes our approach more practical than other approaches [10], [11], and it can be easily applied to various neural networks to improve security in practical applications.

II. RELATED WORKS
A. OVERCONFIDENCE IN NEURAL NETWORKS
Neural networks have achieved significant progress on many computer vision tasks. However, we care not only about the accuracy of a model's predictions, but also about how much we can trust them. For example, if a well-calibrated model outputs a maximum softmax probability of 0.9, approximately 900 out of 1000 such predictions should be correct. In other words, we can estimate how confident the model is about a given image from its maximum output value.
Nevertheless, neural networks are occasionally found to be overconfident for out-of-distribution data and classify it into a class with anomalously high scores. The MSP [9] claimed that overconfident predictions are produced by the softmax function in neural networks: because softmax probabilities are computed with the fast-growing exponential function, a small change to the softmax input causes a large change in the output distribution. In addition, the authors in [17] also pointed out that a neural network using ReLU [18] as an activation function may output an arbitrarily high confidence score for data not seen during the training phase; this problem can only be solved by changing the architecture and activation functions. In other words, a higher confidence score from a neural network does not necessarily mean that the classifier's result is more likely to be correct, as shown in [19]. These results can also be visualized by reliability diagrams [20], which plot the gap between mean prediction accuracy and confidence scores. Surprisingly, there exist huge gaps in modern neural networks, which means that they are poorly calibrated [21], [22]. To mitigate this miscalibration, the authors in [21] use temperature scaling to divide the logits by T before calculating softmax values. This regularization suppresses extremely high scores in the output probability while not affecting the original prediction accuracy. Moreover, multi-modal approaches are likely to reduce the overconfidence problem of deep neural networks, as shown in [22].
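As a concrete illustration, the following minimal numpy sketch (not the implementation used in [21]) shows how dividing the logits by a temperature T > 1 softens the output distribution without changing the predicted class:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling: divide the logits by T first."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])
p_plain = softmax(logits)                    # T = 1: sharply peaked
p_scaled = softmax(logits, temperature=4.0)  # T > 1: softer distribution
```

Because the division by T is monotonic, the argmax (and hence the classification accuracy) is unchanged; only the extreme confidence values are suppressed.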

B. OUT-OF-DISTRIBUTION DETECTION
In practical applications, deep neural networks often encounter out-of-distribution data. Due to overconfident predictions, out-of-distribution data will seriously damage the correctness of neural networks. To resolve the problem, many approaches [9]-[14] have been proposed to detect out-of-distribution data, and they can be divided into three categories: softmax-based approaches, uncertainty-based approaches, and generative-model-based approaches. Uncertainty-based approaches modify the architecture of neural networks to produce an uncertainty score for detecting out-of-distribution data. For instance, the authors in [12] constructed an auxiliary branch onto a pre-trained classifier and derived a new out-of-distribution score from this branch. Generative-model-based approaches assume that out-of-distribution data cannot be effectively reconstructed by generative models such as an autoencoder or a variational autoencoder. For example, the authors in [13] incorporated the Mahalanobis distance in latent space to detect out-of-distribution data by measuring the reconstruction error. In [14], the authors obtained a Mahalanobis distance-based score from the class-conditional Gaussian distributions of hidden features in neural networks. Softmax-based approaches are widely used because of their simplicity and low computational cost. The MSP [9] proposed a baseline method using the maximum value of the softmax distribution of the classifier to detect out-of-distribution samples. Several softmax-based approaches have been proposed based on this work to improve the detection performance. The ODIN [11] separated the softmax score distributions of in-distribution and out-of-distribution images using temperature scaling and small input perturbations, although it requires fine-tuning parameters for different testing data. Despite its low computational cost, its detection performance depends heavily on the pre-trained classifier.
To assist neural networks in learning to differentiate between in-distribution data and out-of-distribution data, the authors in [10] proposed using a Generative Adversarial Network (GAN) [23] to generate out-of-distribution data that form a boundary around the in-distribution data, and jointly training a classifier that should have low confidence on the generated samples outside the boundary. However, training such a model is computationally expensive. Moreover, tuning the hyperparameters with validation sets of out-of-distribution samples [10], [11] is often impossible, since the prior of out-of-distribution samples is unavailable. Unlike our work, which uses only in-distribution data, the OE [24] recently proposed leveraging diverse, large sets of real outlier images as auxiliary datasets to train anomaly detectors and improve out-of-distribution detection. Moreover, it has been shown that when neural networks are pre-trained on a large dataset such as ImageNet [25], the robustness of the model can be further improved [26], which can be integrated into our work.

III. DATA AUGMENTATION NETWORK
In this section, we propose an efficient data augmentation network which can distinguish between out-of-distribution and in-distribution image data. There are three main components in our method: data pre-processing, data augmentation training, and an aggregation function used during the testing phase. Fig. 2 shows the proposed data augmentation network, where we introduce a set of geometric transformations, e.g., rotation, into an image to generate a set of augmented data during the training and testing phases. The proposed approach requires only one CNN. When training or testing the model, the input image is rotated into N images which are sent to the CNN in turn. In the training phase, the N loss values are accumulated into the final loss, which is used to update the weights of the CNN through backpropagation, as shown in Fig. 2(a). In the testing phase, the input image is also rotated into N images and sent to the trained CNN model, and then the N predicted probabilities are aggregated to obtain the final confidence score, as shown in Fig. 2(b). When training the model, the objective is to classify the enhanced data from the same image into the same class. Algorithm 1 shows the training process of the proposed data augmentation network. Different from traditional training processes, the proposed network calculates the total loss after processing all N augmented images (four in our experiments), and then updates the weights through backpropagation.
After the training process, the data augmentation network will output a set of predicted probabilities for the enhanced data, and these predicted probabilities have similar distributions. We would like to mention that we only use in-distribution data to train the proposed network. This makes our method much more practical than the methods that require out-of-distribution data [10], [11], [24].
On the contrary, we assume that when input images come from out-of-distribution, the model will produce a set of predicted probabilities with inconsistent distributions. Based on this assumption, Algorithm 2 illustrates how the model detects out-of-distribution data during the evaluation phase. Given a trained model P_θ and an input image x, a set of augmented images x_i is generated from the input image by the rotation transformation R(·). The model takes in the enhanced data at multiple rotation angles and produces a set of predicted probabilities O_i. An aggregation function is then introduced to obtain the confidence score s from these distributions. Finally, if the confidence score is smaller than a given threshold λ, we conclude that the input image comes from out-of-distribution. In the following sections, we will

FIGURE 2. (a) In the training phase, a set of enhanced data is generated by rotating the input data by four angles. The model tries to classify them into the same class according to the objective function. (b) In the testing phase, a confidence score is derived from a set of predicted probabilities using aggregation functions.

discuss the details of data augmentation, model training, and how to obtain confidence scores from aggregation functions.

Algorithm 1 Training of the Data Augmentation Network
Input:
  P_θ: the model to be trained on the in-distribution dataset
  x: input images from the in-distribution
  R_i(·): rotates an image by 360·i/N degrees
  E: number of training epochs
  η: learning rate
for e = 1, ..., E do
  generate augmented images x_i = R_i(x), i = 0, ..., N − 1
  obtain a loss L_i for each predicted probability
  L = Σ_{i=0}^{N−1} L_i   ▷ calculate the final loss over all augmented images
  θ ← θ − η ∂L/∂θ   ▷ update weights through backpropagation
end for
return P_θ

A. DATA AUGMENTATION
When training neural networks for image classification, geometric transformations such as translation and rotation are often used for data augmentation [27]. However, convolutional neural networks are inherently translation-invariant, which contradicts our assumption that enhanced out-of-distribution data should yield inconsistent predicted probabilities: translated copies of an image tend to receive nearly identical predictions.
Hence, we generate a set of enhanced images for an input image x with a set of rotation transformations R_i, i ∈ {0, ..., N − 1}; each R_i rotates the image by 360·i/N degrees to get x_i = R_i(x). For example, when N = 4, an input image is converted into four enhanced images by rotating the original image by 0, 90, 180, and 270 degrees. Note that x_0 is the original image without augmentation. Moreover, training images are randomly flipped and cropped in the training phase to increase data diversity.
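The rotation step can be sketched as follows (a minimal numpy example; the function name is ours and the sketch is restricted to angle multiples of 90 degrees, which covers the N = 4 case used in the paper):

```python
import numpy as np

def augment_by_rotation(image, n=4):
    """Return [x_0, ..., x_{n-1}], where x_i is `image` rotated by 360*i/n
    degrees. Supports n in {1, 2, 4} so that every angle is a multiple of
    90 degrees and np.rot90 is exact (no interpolation needed)."""
    assert n in (1, 2, 4), "other angles would require interpolation"
    step = 4 // n                       # quarter-turns between consecutive views
    return [np.rot90(image, k=i * step) for i in range(n)]

img = np.arange(12).reshape(3, 4)       # toy "image" of shape (H, W)
views = augment_by_rotation(img, n=4)   # rotations by 0, 90, 180, 270 degrees
```

Note that x_0 (the first element) is the unrotated original, matching the convention above.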

B. MODEL TRAINING
The proposed data augmentation network can be integrated into various neural networks without modifying their architectures. Given a model P_θ and enhanced data from in-distribution images, i.e., x ∈ D_in, in order to classify them into the same class, the objective function is designed as (1)-(2):

L_i = CrossEntropy(P_θ(x_i), y)   (1)
L_total = Σ_{i=0}^{N−1} L_i   (2)

where y is the class label of x and L_total denotes the sum of the cross-entropy losses for all augmented images. In other words, the model learns to classify all augmented images generated from the same input into the same class.
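The objective can be sketched numerically as follows. This is a toy numpy example with a hypothetical linear classifier standing in for the CNN (the paper uses WideResNet); it shows only how the cross-entropy losses of all N augmented views of one image are summed into a single total loss before the weight update:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 3))   # toy linear classifier: 10 features -> 3 classes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, label):
    return -np.log(p[label] + 1e-12)      # L_i as in Eq. (1)

def total_loss(views, label, W):
    """Eq. (2): sum the cross-entropy over all N augmented views of one
    image, pushing the model to classify every view into the same class."""
    return sum(cross_entropy(softmax(W.T @ x), label) for x in views)

x = rng.normal(size=10)
views = [x, x[::-1]]                      # stand-ins for rotated copies
L = total_loss(views, label=1, W=W)       # single scalar loss to backpropagate
```

In the real network the gradient of this accumulated loss drives one backpropagation step, as in Algorithm 1.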

C. CONFIDENCE SCORES FROM AGGREGATION FUNCTIONS
In the testing phase, an input image x is transformed into a group of enhanced data R_i(x), i ∈ {0, ..., N − 1}, as in the training phase. The trained model then generates a set of predicted probability distributions P_θ(x_i), i ∈ {0, ..., N − 1}, P_θ(x_i) ∈ R^c, where c is the number of classes. An aggregation function A(·) is introduced to derive a confidence score by combining these distributions, as shown in Fig. 2(b). The following subsections present all candidate aggregation functions used in this work.

1) MEAN OF MAXIMUM VALUE (MeanMax)
Although out-of-distribution data have been statistically shown to yield a lower maximum softmax probability according to the baseline [9], some individual out-of-distribution samples still lead to relatively high confidence scores. Hence, to detect out-of-distribution data accurately, we aggregate the prediction distributions from the multiple augmented images and assume that an anomalously high confidence score from one image will be suppressed by the others. This assumption will be verified in the next section. Equation (3) shows the confidence score obtained by calculating the mean of the maximum values of all predicted probabilities, which we call MeanMax:

s = (1/N) Σ_{i=0}^{N−1} max(P_θ(x_i))   (3)
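A minimal numpy sketch of the MeanMax score (the function name is ours) shows the suppression effect: one highly confident view is averaged down by the less confident rotated views.

```python
import numpy as np

def mean_max(probs):
    """MeanMax, Eq. (3): average the maximum softmax value over the N views,
    so one anomalously confident view is damped by the others."""
    probs = np.asarray(probs)          # shape (N, c)
    return probs.max(axis=1).mean()

probs = [[0.95, 0.03, 0.02],           # the original view looks very confident
         [0.40, 0.35, 0.25],           # ...but the rotated views do not
         [0.34, 0.33, 0.33],
         [0.50, 0.30, 0.20]]
score = mean_max(probs)                # (0.95 + 0.40 + 0.34 + 0.50) / 4 = 0.5475
```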

2) MAXIMUM OF ALL VALUES (MaxMax)
It has been shown that in-distribution data tend to produce higher confidence scores than out-of-distribution data [9].
In addition to using the mean of the maximum values of all predicted probabilities, we also test the effectiveness of taking the maximum value over all predicted probability distributions as the confidence score. Equation (4) shows the confidence score obtained in this way, which we call MaxMax:

s = max{max(P_θ(x_0)), ..., max(P_θ(x_N−1))}   (4)

3) MEAN OF POSITIONAL MAXIMUM VALUE (MeanPos)
In the training phase, a neural network learns to classify all enhanced data from an in-distribution image into the same class. We assume that when the model encounters out-of-distribution data during the testing phase, it predicts inconsistent classes for the augmented views. Based on this assumption, (5)-(6) show the confidence score obtained by averaging the predicted probabilities at the index that has the maximum value in the predicted probability of the original image without augmentation:

k = arg max_j P_θ^j(x_0)   (5)
s = (1/N) Σ_{i=0}^{N−1} P_θ^k(x_i)   (6)

where P_θ^j(x_i) denotes the j-th value in the predicted probability for x_i.
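The MeanPos score can be sketched in numpy as follows (the function name is ours); when the rotated views disagree with the class predicted for the original image, the averaged probability at that class index drops sharply:

```python
import numpy as np

def mean_pos(probs):
    """MeanPos, Eq. (5)-(6): take the argmax class k of the un-augmented
    view x_0, then average the probability assigned to k across all N views."""
    probs = np.asarray(probs)          # shape (N, c); row 0 is the original image
    k = probs[0].argmax()              # Eq. (5)
    return probs[:, k].mean()          # Eq. (6)

probs = [[0.90, 0.05, 0.05],
         [0.10, 0.80, 0.10],           # the rotated views disagree with class 0,
         [0.20, 0.20, 0.60],           # so the score drops far below 0.90
         [0.30, 0.40, 0.30]]
score = mean_pos(probs)                # (0.90 + 0.10 + 0.20 + 0.30) / 4 = 0.375
```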

4) JENSEN-SHANNON DIVERGENCE (JSD)
In [10], the authors proposed a confidence loss by adding a confidence term based on the Kullback-Leibler (KL) divergence to the cross-entropy loss, under the assumption that the predicted probability of the model should be close to uniform when the data come from out-of-distribution. Therefore, we can detect out-of-distribution data by measuring the similarity between the prediction distribution and the uniform distribution. JSD [28] is chosen in this work because its output is between 0 and 1, which clearly indicates the confidence of the prediction with proper normalization. Equations (7)-(8) show how to calculate the JSD for two probability distributions P and Q:

JSD(P ∥ Q) = (1/2) KL(P ∥ M) + (1/2) KL(Q ∥ M)   (7)
KL(P ∥ M) = Σ_{j=1}^{c} P^j log(P^j / M^j)   (8)

where M = (P + Q)/2 and c is the number of classes. Equation (9) derives the confidence score from the JSD by setting P to the output probability of each augmented image and Q to the uniform distribution U:

s = (1/N) Σ_{i=0}^{N−1} JSD(P_θ(x_i) ∥ U)   (9)
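The JSD-based score can be sketched in numpy as follows. This is our reading of Eq. (9) (averaging the JSD of each view against the uniform distribution); base-2 logarithms are used so that the divergence is normalized to [0, 1]:

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits; a tiny epsilon avoids log(0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2((p + 1e-12) / (q + 1e-12))))

def jsd(p, q):
    """Eq. (7)-(8) with base-2 logs, so that JSD(P, Q) lies in [0, 1]."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_score(probs):
    """Our reading of Eq. (9): mean distance of each view's prediction from
    the uniform distribution; in-distribution inputs should score higher."""
    probs = np.asarray(probs, float)   # shape (N, c)
    u = np.full(probs.shape[1], 1.0 / probs.shape[1])
    return float(np.mean([jsd(p, u) for p in probs]))

peaked = [[0.97, 0.01, 0.01, 0.01]] * 4    # consistent, confident views
flat = [[0.25, 0.25, 0.25, 0.25]] * 4      # near-uniform (OOD-like) views
```

Confident, consistent predictions (far from uniform) yield a score near 1, while near-uniform predictions yield a score near 0, which is thresholded as described above.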

5) MAXIMUM VALUE IN PREDICTION PROBABILITY (MSP)
The baseline [9] proposed the MSP as the confidence score, which we also include as an aggregation function, as shown in (10). Instead of using a set of enhanced images, it derives the confidence score from the original image alone, without augmentation:

s = max(P_θ(x_0))   (10)

IV. EXPERIMENTAL RESULTS
In this section, we conduct a set of experiments to evaluate the effectiveness of our data augmentation network for out-of-distribution detection. Pre-training has been claimed to improve the robustness and uncertainty estimates of neural networks [26], although it is reported to have no significant impact on the classification accuracy of the model [29]. In [26], the baseline method is re-implemented using a 40-2 WideResNet for classifying the CIFAR-10, CIFAR-100, and TinyImageNet [30] datasets. To compare with their results, we choose these three datasets and their corresponding testing data as in-distribution samples, while various natural datasets including SVHN [31], LSUN [32], Texture [33], Places365 [34], and synthetic datasets such as Blob, Gaussian, and Rademacher are chosen as out-of-distribution samples.
Our data augmentation network can be regarded as a threshold-based detector: if the confidence score of a given input image x is lower than the threshold λ, the image is predicted to be an out-of-distribution sample. We evaluate the effectiveness of our framework with the following four metrics:
• False positive rate (FPR) at 95% true positive rate (TPR). Let TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. We evaluate FPR (FP/(FP+TN)) when TPR (TP/(TP+FN)) is 95%.
• Area Under the Receiver Operating Characteristic curve (AUROC) [35]. The Receiver Operating Characteristic (ROC) curve uses varying thresholds to plot the relationship between TPR and FPR. The larger the AUROC value, the better the performance; an ideal detector has an AUROC of 1.
• Area Under the Precision-Recall curve (AUPR) [36]. The Precision-Recall (PR) curve plots the relationship between Precision (TP/(TP+FP)) and Recall (TP/(TP+FN)) by varying the threshold. The larger the AUPR value, the better the performance.
• Detection error (DetErr). We evaluate the effectiveness of the detector by finding the minimum classification error over all thresholds: DetErr = P(x_in)P(err_in|x_in) + P(x_out)P(err_out|x_out). The lower the DetErr value, the better the performance. Note that err_in indicates that the confidence score of in-distribution data is lower than the threshold, while err_out indicates that the confidence score of out-of-distribution data is higher than the threshold. We assume that the priors of in-distribution data P(x_in) and out-of-distribution data P(x_out) are both 0.5.
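The FPR-at-95%-TPR and DetErr metrics can be computed directly from two arrays of confidence scores, as in this numpy sketch (the helper names are ours):

```python
import numpy as np

def fpr_at_95_tpr(in_scores, out_scores):
    """Set the threshold so that 95% of in-distribution (positive) samples
    are accepted, then report FPR = FP / (FP + TN) on the OOD scores."""
    thr = np.percentile(in_scores, 5)       # 95% of in-dist scores lie above thr
    return float(np.mean(np.asarray(out_scores) >= thr))

def detection_error(in_scores, out_scores):
    """DetErr = 0.5 * P(err_in) + 0.5 * P(err_out), minimized over all
    candidate thresholds, with equal priors P(x_in) = P(x_out) = 0.5."""
    ts = np.unique(np.concatenate([in_scores, out_scores]))
    errs = [0.5 * np.mean(np.asarray(in_scores) < t) +
            0.5 * np.mean(np.asarray(out_scores) >= t) for t in ts]
    return float(min(errs))

rng = np.random.default_rng(1)
in_scores = rng.uniform(0.7, 1.0, size=1000)    # well-separated toy scores
out_scores = rng.uniform(0.0, 0.3, size=1000)
fpr = fpr_at_95_tpr(in_scores, out_scores)
err = detection_error(in_scores, out_scores)
```

For perfectly separated score distributions, as in this toy example, both metrics are zero; overlapping distributions push both upward.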
In addition, we only compare certain metrics with other works, based on the metrics reported in their results, such as AUROC and AUPR. In this work, we train our model from scratch using SGD with Nesterov momentum and a cosine learning rate schedule. The initial learning rate is set to 0.1 and decays to 1e-6 over 100 epochs without restarting. The dropout rate is set to 0.3 to prevent overfitting. When we apply the proposed data augmentation network to a pre-trained network, dropout is not used and the learning rate is set to 0.01.

A. VALIDATION OF OUR ASSUMPTION
In this paper, the proposed data augmentation network is based on the assumption that when an input sample comes from out-of-distribution, the confidence score should be low. To validate our assumption, we choose in-distribution data from CIFAR-10, while out-of-distribution data come from Texture, SVHN, Places365, LSUN, CIFAR-100, Gaussian, Rademacher, and Blob. In Fig. 3, the x-axis and y-axis represent the confidence scores and the percentage of data, respectively. Blue lines indicate the distribution of confidence scores for in-distribution data, while orange lines represent the distribution of confidence scores for out-of-distribution data. This visualizes the confidence of the model in its predictions. Out-of-distribution data receive lower confidence scores because the model is less confident about them; their predicted probability distributions are close to uniform. On the contrary, in-distribution data receive higher confidence scores. Compared with the distribution of confidence scores for the baseline algorithm (MSP) shown in Fig. 3(a), data augmentation gives out-of-distribution data consistently lower confidence scores, as shown by the orange line in Fig. 3(b). Therefore, neural networks benefit from our data augmentation network, which can distinguish between data from in-distribution and out-of-distribution.

B. ABLATION EXPERIMENTS
Appropriate data augmentation and aggregation functions play an important role in our approach. Therefore, two ablation experiments are conducted on CIFAR-10 to determine the aggregation function and the number of rotation angles.

1) ROTATION ANGLES
In order to understand the influence of the rotation angles on the performance of out-of-distribution detection, we choose different numbers of rotations (N) from 1 to 6 for our experiments. As shown in Table 1, MSP (N = 1) has the worst performance because it uses only one predicted probability. As N increases, the confidence score is obtained from more predicted probabilities; in other words, an originally high confidence score may be suppressed by the other predicted probabilities. This result can be regarded as a voting mechanism. Furthermore, the out-of-distribution detection performance of our proposed model gradually improves as N increases up to 4. This result supports our initial hypothesis that when N is equal to 4, the four augmented samples of the original image are sufficiently different from each other, so if the input image comes from out-of-distribution, four inconsistent predicted probabilities can be generated. However, our method has poor results for images with symmetry, such as the Texture set. We think the reason is that the four augmented samples of the original image are not sufficiently different from each other.

TABLE 2. OOD detection performance with respect to different aggregation functions on CIFAR-10. The symbol ↑ indicates that the larger the value, the better the performance, and the symbol ↓ indicates that the lower the value, the better the performance.
In addition, we also find that when N is greater than 4, the performance begins to decrease. We infer that for N greater than 4, the differences between the augmented samples are not large enough, which reduces the performance. Because the amount of computation in both the training and testing phases of our method increases proportionally with N, we choose N = 4 as a compromise between performance and computational cost.

2) AGGREGATION FUNCTIONS
Table 2 shows the performance of out-of-distribution detection with respect to different aggregation functions, where the bold numbers indicate the method with the best performance. For most datasets, JSD is better than the other aggregation functions. As a result, JSD is selected as the aggregation function in the following experiments. However, JSD performs worse on the Texture dataset than on the other datasets. We infer that the symmetry in Texture causes the performance of the proposed method to degrade.

C. DIFFERENT DATASETS
After performing the above ablation experiments on CIFAR-10, we test our approach on more complicated datasets and compare it with the baseline and the state-of-the-art approaches. Table 3 shows that the proposed approach performs worse when the dataset has more classes. For example, the AUROC and AUPR scores of our method on CIFAR-10 are much better than those on CIFAR-100 and TinyImageNet.
Because our method is based on the softmax predicted probability, when the number of classes in the dataset increases, the predicted probability tends to be uniform. In other words, the confidence scores of in-distribution data and out-of-distribution data easily overlap and are difficult to distinguish.

D. COMPARE WITH STATE-OF-THE-ART APPROACHES
The ablation experiments determine the number of rotation angles N and the aggregation function A(·) to be 4 and JSD, respectively. The MSP [9] created a simple and effective softmax-based approach for detecting out-of-distribution data and established a strong baseline that serves as a foundation for many works. It has also been shown that the baseline can be further improved when a model is pre-trained on a large dataset such as ImageNet [26]. As a result, we integrate the pre-trained model into our method and compare it with the baseline and state-of-the-art approaches. Table 3 shows that our method is superior to the baseline. When using CIFAR-100 as the in-distribution data, the AUROC and AUPR scores increase by 21.0% and 65.6%, respectively. Compared with state-of-the-art approaches, our method can further improve the performance on a highly complicated dataset such as TinyImageNet, where the AUROC score increases by 6.3% and the AUPR score increases by 12.7%.
In addition, we also compare with a GAN-based approach [10], which generates samples on the low-density boundary around the in-distribution data space. The original classifier is trained together with the GAN to learn to differentiate between in-distribution data and out-of-distribution data. We re-implement our framework on VGGNet [37] to compare with the GAN-based approach [10], as shown in Table 4. The experimental results show that our method is superior to the GAN-based approach in AUROC and AUPR. Moreover, the GAN-based approach tunes its parameters to fit a specific out-of-distribution dataset, which is difficult in real-world applications because the prior of out-of-distribution data is unknown. Finally, training a classifier jointly with a GAN is computationally expensive.

V. CONCLUSION
We have proposed an efficient data augmentation network that assists neural networks in detecting out-of-distribution image data. We conducted several preliminary experiments to validate our assumption, with the parameters and aggregation functions determined by an ablation study. The experimental results show that the proposed data augmentation network achieves significant progress in out-of-distribution detection on various visual datasets. In addition, when the model has been pre-trained on ImageNet, the effectiveness of the proposed framework can be further improved. However, our method does not perform well on images with symmetry, such as the Texture set; we think the reason is that the four augmented samples of the original image are not sufficiently different from each other. Therefore, future work will focus on solving this problem and applying our approach to other computer vision tasks such as object detection and semantic segmentation. In addition, the proposed framework can provide effective anomaly detection in real-world applications where safety is the priority.