AdvGuard: Fortifying Deep Neural Networks Against Optimized Adversarial Example Attack

Deep neural networks (DNNs) provide excellent performance in image recognition, speech recognition, video recognition, and pattern analysis. However, they are vulnerable to adversarial example attacks. An adversarial example, an input to which a small amount of noise has been strategically added, appears normal to the human eye but is misrecognized by the DNN. In this paper, we propose AdvGuard, a method for resisting adversarial example attacks. This defense method prevents the generation of adversarial examples by constructing a robust DNN that provides random confidence values. The method does not require training on adversarial examples, the use of other processing modules, or the ability to perform input data filtering. In addition, a DNN constructed using the proposed scheme can defend against adversarial examples while maintaining its accuracy on the original samples. In the experimental evaluation, MNIST and CIFAR10 were used as datasets, and TensorFlow was used as the machine learning library. The results show that a DNN constructed using the proposed method can correctly classify adversarial examples with 100% and 99.5% accuracy on MNIST and CIFAR10, respectively.


I. INTRODUCTION
Deep neural networks (DNNs) [1] display good performance in recognition domains such as image recognition [2], speech recognition [3], intrusion detection [4], and pattern recognition [5]. However, DNNs have a vulnerability in that they misrecognize adversarial examples [6]. An adversarial example is an input sample to which a small amount of noise has been added, noise that is invisible to humans but is designed to induce misrecognition by the DNN. Such adversarial examples can cause harm through misinterpretation in medical settings or misperception by autonomous vehicles that use DNNs. Therefore, studies on various types of attack on, and methods of defense against, adversarial examples are being conducted.
Adversarial example attacks can be categorized according to the availability of model information [7] as white box attacks or black box attacks. In the case of a white box attack, the attacker knows the structure, the parameters, and the confidence values calculated by the model. In the case of a black box attack, only the confidence values, which are the result delivered by the model for a given input, are known; the structure of the model is not known. To generate an optimized adversarial example, an attacker must know the confidence values calculated by the target model, which is accomplished by accessing the softmax layer. The optimized adversarial example should satisfy the aim of being misrecognized by the model while having minimal distortion from the original sample; this is accomplished through feedback in the form of the confidence values delivered by the target model. This feedback is needed because the adversarial example generation process adds adversarial noise to the input data just until the confidence value for an incorrect class becomes the highest, in order to generate an adversarial example that lies only slightly outside the target model's decision boundary.
There are two main conventional methods for defending against adversarial examples: the classifier modification method [6], [8], [9] and the input data modification method [10], [11]. In these defense methods, a separate adversarial training process or a process for generating a separate module is required to change the classifier or modify the input data. In addition, in changing the classifier or the input data, these defense methods can degrade the model's accuracy on the original samples. In contrast, the method proposed in this paper can prevent the generation of an optimal adversarial example without affecting the class prediction, as it works by using a noise vector in the softmax layer without the need for a separate module or process.
The proposed method, called AdvGuard, constructs a DNN that is robust to adversarial examples. The proposed method prevents the creation of adversarial examples by adding new noise in the softmax layer, which is where the target model generates the confidence values. The noise vector obscures the confidence value for each class by making it noisy. The contributions of this paper are as follows.
We propose a method that can resist adversarial examples through a noise vector that causes noisy confidence values to be provided. We explain the systematic principles and construction of the proposed scheme. We analyze the model's accuracy and confidence values for adversarial examples after the proposed method has been applied. In addition, we analyze the performance of the proposed method by investigating the effect of the presence of the noise vector. To assess the performance of the proposed method, we evaluated it using the MNIST [12] and CIFAR10 [13] datasets.
The rest of the paper is organized as follows. Section II describes related work. Section III explains the conceptual approach. The proposed scheme is presented in Section IV. Section V describes the experimental setup, and Section VI presents the experimental results for the proposed method. A discussion of the proposed method is given in Section VII. Section VIII concludes the paper.

II. BACKGROUND AND RELATED WORK
Research on adversarial examples began with Szegedy et al. [6]. The purpose of an adversarial example, created by adding some noise to an original image, is to cause misclassification by a DNN while keeping the difference from the original image undetectable to the human eye.
The generation of adversarial examples is described in Section II-A. Types of adversarial example attack are described in Section II-B, and methods of defense are described in Section II-C.

A. BASIC ADVERSARIAL EXAMPLE GENERATION
A DNN model optimizes its parameter values so that training data will be correctly classified into their corresponding classes [14]. Then, when new data arrive as input, the pretrained DNN provides well-classified results with high accuracy on the new data. An adversarial example causes the pretrained DNN model to misinterpret manipulated test data.
A transformer receives the original sample x and original class y as input values. The transformer then creates a modified sample x* = x + δ, with noise value δ. The modified sample x* is provided as input to the target model. The target model provides the classification confidence results for the modified sample x* to the transformer. The transformer updates the noise value δ until the confidence value for the desired target class is higher than that for the original class of the modified sample x*, while maintaining minimal distortion between the modified sample x* and the original sample x.
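For illustration, the following is a minimal sketch of this feedback loop, assuming only black-box access to a `model` function that returns the confidence vector; the hill-climbing update, step size, and function names are hypothetical stand-ins for whichever optimizer the transformer actually uses.

```python
import numpy as np

def generate_adversarial(model, x, target, step=0.01, max_iters=5000, rng=None):
    """Illustrative transformer loop: grow a noise vector delta until the
    target-class confidence exceeds the original-class confidence.
    `model` is assumed to map one input to a softmax confidence vector."""
    rng = rng or np.random.default_rng(0)
    orig = int(np.argmax(model(x)))                  # original class y
    delta = np.zeros_like(x)
    for _ in range(max_iters):
        conf = model(np.clip(x + delta, 0.0, 1.0))   # feedback from the model
        if conf[target] > conf[orig]:                # just past the boundary
            return np.clip(x + delta, 0.0, 1.0)      # modified sample x* = x + delta
        # propose a small random change; keep it only if it raises the
        # target-class confidence (simple hill climbing on the feedback)
        trial = delta + step * rng.standard_normal(x.shape)
        if model(np.clip(x + trial, 0.0, 1.0))[target] >= conf[target]:
            delta = trial
    return None                                      # attack failed within budget
```

The loop stops as soon as the target class barely wins, which is why optimized adversarial examples land only slightly outside the decision boundary.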

B. CATEGORIZATIONS OF ATTACK METHODS
Attack methods can be categorized according to the misrecognition target, the method of distortion, and the method for generating the adversarial examples. In the first of these categorizations, attack methods are divided into targeted attacks [15], [16] and untargeted attacks [17]. In a targeted attack, the adversarial example is designed to be misclassified by the target model as a specific target class chosen by the attacker. In an untargeted attack, the adversarial example is designed to be misclassified by the target model as any class other than the original one. An untargeted attack has the advantage of requiring fewer iterations and producing less distortion than a targeted attack.
In the second categorization, attack methods are divided according to the distortion metric [18] used in generating the adversarial examples: $L_1$ [19], [20], $L_2$ [21], [22], and $L_\infty$ [23], [24], given as

$$L_1 = \sum_i |x^*_i - x_i|, \qquad L_2 = \sqrt{\sum_i (x^*_i - x_i)^2}, \qquad L_\infty = \max_i |x^*_i - x_i|,$$

where $x$ is the original sample and $x^*$ is the adversarial example. In all three metrics, the smaller the distortion value, the less the distortion between the original sample $x$ and the adversarial example $x^*$. In this study, the $L_2$ metric was used for generating adversarial examples.
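As a minimal sketch, the three distortion metrics can be computed directly from the pixel difference vector (NumPy; the function name is illustrative):

```python
import numpy as np

def distortions(x, x_star):
    """L1, L2, and L-infinity distortion between an original sample x
    and an adversarial example x_star."""
    d = (x_star - x).ravel()
    return {
        "L1":   float(np.sum(np.abs(d))),        # total absolute change
        "L2":   float(np.sqrt(np.sum(d ** 2))),  # Euclidean distance (used here)
        "Linf": float(np.max(np.abs(d))),        # largest single-pixel change
    }
```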
Finally, attack methods can be categorized according to the method used to generate the adversarial examples. There are five representative generation methods. The first is the fast gradient sign method (FGSM) [8], which finds $x^*$ under an $L_\infty$ constraint:

$$x^* = x + \epsilon \cdot \mathrm{sign}\left( \nabla_x \, \mathrm{loss}_{F,t}(x) \right),$$

where $F$ is an objective function, $t$ is a target class, and $x^*$ is the adversarial example. In FGSM, the adversarial example is generated from the input image $x$ through gradient ascent. FGSM, using a single gradient step, is simple but has excellent performance.
The second method is iterative FGSM (I-FGSM) [25], an improved version of FGSM. Instead of taking one step of size $\epsilon$ as in FGSM, a smaller step size $\alpha$ is applied at each iteration, and the accumulated perturbation is eventually clipped by the same $\epsilon$:

$$x^*_0 = x, \qquad x^*_{i+1} = \mathrm{clip}_{x,\epsilon}\left( x^*_i + \alpha \cdot \mathrm{sign}\left( \nabla_x \, \mathrm{loss}_{F,t}(x^*_i) \right) \right).$$

In terms of performance, I-FGSM outperforms FGSM. The third method is the DeepFool method [26], which creates untargeted adversarial examples. This method creates adversarial examples with an attack success rate higher than that of FGSM while minimizing distortion from the original sample. It uses the $L_2$ distortion metric and produces an adversarial example using a linearization approximation. However, when creating an adversarial example, DeepFool requires many iterations because a deep neural network is not linear, and it is more complex than FGSM.
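As a minimal TensorFlow sketch of these two gradient-based attacks, the following implements the untargeted forms that increase the loss on the true label y, assuming a Keras model that outputs logits; the values of epsilon, alpha, and the [0, 1] pixel range are illustrative.

```python
import tensorflow as tf

def fgsm(model, x, y, eps):
    """One-step FGSM: move x by eps in the sign direction of the loss gradient."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y, model(x), from_logits=True)
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)

def ifgsm(model, x, y, eps, alpha, steps):
    """I-FGSM: repeat small alpha-sized steps, clipping the accumulated
    perturbation back into the eps ball around the original x."""
    x0 = tf.convert_to_tensor(x)
    x_adv = x0
    for _ in range(steps):
        x_adv = fgsm(model, x_adv, y, alpha)
        x_adv = tf.clip_by_value(x_adv, x0 - eps, x0 + eps)  # stay within eps
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)            # stay a valid image
    return x_adv
```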
The fourth method is the Jacobian-based saliency map attack (JSMA) [27], which is a targeted attack and uses an iterative method. To minimize distortion from the original sample, this method perturbs only the elements with the highest saliency values; the saliency value measures how strongly an element of the input influences the model's output class. However, this method is computationally time consuming.
The fifth method is the Carlini & Wagner attack [28], which displays better performance than FGSM and I-FGSM. This method is designed to increase the attack success rate while minimizing distortion, by combining a distortion function and an attack success function in the objective:

$$\min_{x^*} \; D(x, x^*) + c \cdot f(x^*),$$

where $D(x, x^*)$ is the distortion function, $f(\cdot)$ is the attack success function, and $c$ is a constant found by binary search to balance the two terms. In addition, by adjusting the confidence value, the Carlini & Wagner attack can achieve a 100% attack success rate against defense methods such as the distillation structure [9]. FGSM, I-FGSM, DeepFool, JSMA, and the Carlini & Wagner method, described above, normally compute the gradient of the output of the target model to generate an adversarial example. The gradient is computed by back-propagation [29], under the assumption that the model's confidence values are known to the attacker, in order to generate the adversarial noise. This requires the ability to differentiate the output layer with respect to the input layer so as to generate the optimal adversarial noise.
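The following is a sketch of this objective for a batch of one, assuming the margin-style attack success function f from the Carlini & Wagner paper and L2 distortion; in the experiments of Section V-C such an objective is minimized with Adam.

```python
import tensorflow as tf

def cw_objective(model, x, x_adv, target, c, kappa=0.0):
    """Carlini & Wagner style objective: L2 distortion plus c times an
    attack-success term f that is <= 0 once the target class wins.
    Assumes x and x_adv are single samples with a leading batch axis of 1."""
    d = tf.reshape(x_adv - x, [-1])
    distortion = tf.reduce_sum(d ** 2)               # D(x, x*) with L2
    logits = model(x_adv)[0]
    mask = tf.one_hot(target, tf.shape(logits)[0])
    best_other = tf.reduce_max(logits - 1e9 * mask)  # best non-target logit
    f = tf.maximum(best_other - logits[target], -kappa)
    return distortion + c * f
```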
The Carlini & Wagner attack can achieve a 100% success rate even against distillation structures [9], which are a method of defense. The core principle of the Carlini & Wagner method is to control the attack success rate by adjusting the distortion weights. In this study, adversarial examples were generated using the Carlini & Wagner attack because of its excellent performance. In addition, an experiment was conducted to provide a comparison with DeepFool [26] and the JSMA method [27], which also generate optimized adversarial examples.

C. CATEGORIZATIONS OF DEFENSE METHODS
There are two well-known conventional methods for defending against adversarial examples: the input data modification method [10], [11], [30] and the classifier modification method [6], [8], [9]. The first, the defense method that works by adjusting the input data, can be divided into three types: the filtering module [10], [11], feature squeezing [31], and the magnet technique [32]. Shen et al. [11] proposed a filtering method to remove adversarial noise from an adversarial example using generative adversarial nets [33]. This method removes noise using the filtering module before the target model receives the input image. It maintains the model's accuracy on the original samples but has the disadvantage of requiring an additional process to create the filtering module. Xu et al. [31] proposed a feature squeezing method that manipulates the input image. This method reduces the depth of each pixel and reduces the difference between neighboring pixels by spatial smoothing. However, this method requires additional processing to manipulate the input data. Meng and Chen [32] proposed the magnet method, an ensemble method that combines several defense methods. The method consists of a detector and a reformer. The detector detects and removes adversarial examples having large amounts of distortion, and the reformer converts adversarial examples having less distortion to the nearest original sample. However, the magnet method also requires a separate process to create the detector and reformer modules, and it is vulnerable to white box attacks.
The second method, which works by modifying the classifier, can be divided into two types: the adversarial training technique [6], [8] and the defensive distillation technique [9]. Szegedy et al. [6] and Goodfellow et al. [8] proposed the adversarial training method. In this method, the target model trains not only on the original training data but also on adversarial examples, thereby making the target model robust to adversarial examples. Tramèr et al. [34] proposed an improved defense method that trains on adversarial examples generated from several models as an ensemble. The adversarial training method, however, can reduce the model's accuracy on the original samples, and it requires an additional training process. Papernot et al. [9] proposed the distillation method. This method constructs a deep neural network robust to adversarial examples by blocking the gradient computation used when adversarial examples are generated. To do so, the method uses two neural networks; the output class probabilities of the first neural network are used as the input labels for the second neural network. However, this method also requires an additional neural network, and it is vulnerable to white box attacks.
In contrast to these conventional adversarial example defense methods, the method proposed in this paper does not require a separate module or a separate process. Conventional defense methods that work by manipulating input data require training and generating a filter module. In addition, making the classifier robust requires an additional adversarial training process or separate neural networks. The proposed method, by contrast, can prevent the generation of an optimal adversarial example without affecting the class prediction, as it works by using a noise vector in the softmax layer, and thus it does not require a separate module or process.

III. CONCEPTUAL APPROACH
When an optimized adversarial example is being created, the confidence value for each class needs to be obtained for each input value. Under the proposed method, the confidence value provided for each class is randomized in order to restrict the generation of adversarial examples; at the same time, the class prediction is maintained according to these confidence values. In this section, the context for the approach is explained in detail through the softmax layer, the decision boundary of the model, and the classification score for each class.

TABLE 1. Classification scores and confidence values for an adversarial example: ''5'' → ''6''.
Figure 1 shows the softmax layer in the DNN structure. The softmax layer is a multi-class sigmoid; it is the last layer of the DNN and provides the probability value for each class.
In other words, the softmax layer is a squashing function that converts the score for each class into a probability value in the range from 0 to 1. Through the confidence values, which are the probability values provided in the softmax layer, the class values and probability values can be determined for the input provided to the DNN. The confidence values provided in the softmax layer are required for creating adversarial examples. Creation of an adversarial example is an optimization problem: finding a sample that satisfies two conditions, causing misclassification and having minimal distortion. Table 1 shows the classification scores and confidence values for an adversarial example (''5'' → ''6'') and for the corresponding original sample (''5''). As seen in the table, for the original sample, the class ''5'' classification score (10.16) is higher than that for any other class. For the adversarial example, however, the class ''6'' classification score (14.61) is only slightly higher than that for the original class, ''5'' (14.60). This is because the adversarial example generation process adds noise to the original sample only until the classification score for the target class chosen by the attacker becomes slightly higher than that for the original class. As the generation of an adversarial example is an optimization problem, the adversarial example should be located where the model will misclassify it, while its distance from the original sample should be minimized. Therefore, an adversarial example should be located slightly outside the decision boundary.
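The pattern in Table 1 can be made concrete with a small numeric sketch: the two scores 14.60 and 14.61 are taken from the table, while the remaining scores are illustrative.

```python
import numpy as np

def softmax(scores):
    """Squash raw classification scores into confidence values in [0, 1]."""
    e = np.exp(scores - np.max(scores))   # subtract the max for stability
    return e / e.sum()

# Classes 0-9; class "5" is the original, class "6" the attacker's target.
scores = np.array([2.0, 1.0, 3.0, 4.0, 1.5, 14.60, 14.61, 0.5, 2.5, 1.0])
conf = softmax(scores)
print(conf.argmax())       # 6: the target class wins by a hair
print(conf[5], conf[6])    # two near-equal confidences around 0.5
```

A 0.01 gap in the classification scores is enough to flip the prediction, which is exactly the minimal-distortion behavior described above.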
Thus, there is a pattern displayed by adversarial examples with regard to their classification scores and the decision boundary, as shown in Table 1 and Fig. 2. The classification score for the original class of an original sample is higher than that of any other class. However, the classification score for the incorrect class targeted by the adversarial example tends to be only slightly higher than that for the original class.
In order to generate adversarial examples slightly outside the decision boundary, the attacker must know the confidence value for each class. However, if those confidence values are set to random values, the generation of adversarial examples can be prevented. Our proposed method is designed to offer a robust defense against adversarial examples by using a noise vector to provide noisy confidence values for each class.

IV. PROPOSED METHOD
The proposed method constructs a robust model that prevents the generation of optimized adversarial examples by providing random confidence values. The proposed method exploits the characteristics of the process for generating an optimized adversarial example. Creation of an optimized adversarial example requires the confidence value for each class corresponding to the input value. Under the proposed method, while the class prediction is maintained according to the confidence values provided by the target classifier, the confidence value provided for each class is randomized in order to restrict the generation of adversarial examples. To assign random confidence values, the proposed method adds a noise vector in the softmax layer, which is where the model provides the confidence value for each class of input. With the new noise added via the noise vector in the softmax layer, the target model provides random confidence values for each class while maintaining the same class prediction. Figure 3 shows an overview of the proposed method. As shown in the figure, the new noise is added to the softmax layer of the pretrained classifier to provide noisy confidence values for each class; a toy numeric illustration follows.
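In the sketch below, the noise logits are chosen by hand (in the actual method they are found by the optimization of Section IV): the noisy confidences are much flatter, yet the argmax is unchanged.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

s = np.array([1.2, 0.3, 12.7, 0.8, 2.1])   # true logits; class 2 wins
w = np.array([4.0, 5.5, -6.2, 5.0, 3.9])   # hand-picked noise logits

r      = softmax(s)       # true confidences: ~1.0 for class 2
r_star = softmax(s + w)   # noisy confidences: ~0.33 for class 2
assert r.argmax() == r_star.argmax() == 2  # class prediction preserved
```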
The purpose of the noise vector is to make the confidence values of all classes as random as possible while maintaining the original class prediction. In mathematical terms, the final output layer of the deep neural network is a softmax layer, which maps the vector $s$ to the true confidence score vector $r$. The vector $s$ is the output of the second-to-last layer of the deep neural network and is called the logits. The true confidence score vector can be obtained as follows:

$$r = \mathrm{softmax}(s), \qquad \mathrm{softmax}(s)_j = \frac{e^{s_j}}{\sum_{k=1}^{n} e^{s_k}}.$$
The noisy confidence score vector $r^*$ is generated by adding the new noise $w$, produced by the noise vector, to the logits $s$:

$$r^* = \mathrm{softmax}(s + w),$$

where $w$ denotes the new noise vector, termed the noise logits.
The noisy confidence values $r^*$ are thus calculated by applying the softmax to the sum of the true logits and the noise logits. We can replace the true confidence score vector $r$ with $\mathrm{softmax}(s)$ and define the change vector $\delta = \mathrm{softmax}(s + w) - \mathrm{softmax}(s)$. We then obtain the following optimization problem.
$$\min_w \; \left\| \mathrm{softmax}(s + w) - \mathrm{softmax}(s) \right\|_1$$

subject to:

$$\arg\max_j (s_j + w_j) = \arg\max_j (s_j) \qquad (8)$$

$$g(r^*_l) = \frac{1}{n} + \epsilon \qquad (9)$$

with $\sum_{j=1}^{n} g(r^*_j) = 1$ and $g(r^*_j) \ge 0$ for all $j$, where $n$ is the number of classes and $\min_w$ seeks the $w$ that minimizes the distance between $\mathrm{softmax}(s)$ and $\mathrm{softmax}(s + w)$. For Equation 8 to be satisfied, the predicted class in $\mathrm{softmax}(s)$ and the predicted class in $\mathrm{softmax}(s + w)$ must be the same, because the new noise must be added within a range in which the recognized class is not changed by the noise vector. In Equation 9, $l$ denotes the predicted class (defined formally in Equation 10), and the epsilon $\epsilon$ is a very small value ($\le 0.001$). $g(\cdot)$ is the softmax activation function, whose outputs are normalized to $[0, 1]$ as probability values. In addition, each $g(r^*_j)$ is a noisy confidence value, i.e., a probability, so it must be nonnegative, and the sum of $g(r^*_j)$ over all classes must equal 1. By solving for $w$ in the above optimization problem, we obtain the noise vector $\delta$.
Since the two constraints (Equation 8 and Equation 9) are highly nonlinear, it is necessary to convert them into objective-function terms in order to solve this optimization problem.
In converting Equation 8 into an objective-function term, we denote the predicted label for the query data sample as $l$:

$$l = \arg\max_j (r_j) = \arg\max_j (s_j). \qquad (10)$$

The constraint of Equation 8 means that $s_l + w_l$ is the largest entry in $s + w$. Therefore, we enforce the inequality constraint $s_l + w_l \ge \max_{j \ne l} \{ s_j + w_j \}$. Moreover, we can convert this inequality constraint into the following loss function:

$$\mathrm{loss}_2 = \mathrm{ReLU}\left( \max_{j \ne l} \{ s_j + w_j \} - (s_l + w_l) \right),$$

where ReLU is defined as $\mathrm{ReLU}(v) = \max(0, v)$. When the inequality $s_l + w_l \ge \max_{j \ne l} \{ s_j + w_j \}$ is satisfied, the loss function $\mathrm{loss}_2$ becomes 0.
In converting Equation 9 into an objective-function term, note that the target model is a neural network whose output layer performs multi-class classification with a softmax activation function on the last layer. Therefore, we obtain

$$g(r^*_l) = \mathrm{softmax}(s + w)_l,$$

the probability value, normalized to $[0, 1]$, that the softmax activation assigns to the predicted class $l$ from the logits of the second-to-last layer. Therefore, the constraint in Equation 9 can be converted into the following loss function:
$$\mathrm{loss}_3 = \left| \, g(r^*_l) - \left( \frac{1}{n} + \epsilon \right) \right|,$$

where $n$ is the number of classes and the epsilon $\epsilon$ is a very small value ($\le 0.001$). As $g(r^*_l)$ gets closer to $\frac{1}{n} + \epsilon$, $\mathrm{loss}_3$ gets closer to 0. Converting the constraints into objective-function terms in this way yields an unconstrained optimization problem with the following total loss:
$$\mathrm{loss} = \mathrm{loss}_1 + c_1 \cdot \mathrm{loss}_2 + c_2 \cdot \mathrm{loss}_3,$$

where $c_1$ and $c_2$ adjust the balance among $\mathrm{loss}_1$, $\mathrm{loss}_2$, and $\mathrm{loss}_3$; their initial value is 1. The $\mathrm{loss}_1$ term is the $L_1$ distance between the true and noisy confidence vectors:

$$\mathrm{loss}_1 = \sum_{j=1}^{n} \left| \mathrm{softmax}(s + w)_j - \mathrm{softmax}(s)_j \right|.$$
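A minimal TensorFlow sketch of this total loss follows, assuming the absolute-deviation form of loss_3 given above; the 1e9 masking constant is an implementation convenience, not part of the formulation.

```python
import tensorflow as tf

def total_loss(s, w, c1, c2, eps=0.001):
    """Total objective for the noise logits w (a sketch).
    loss1: L1 distance between true and noisy confidence vectors.
    loss2: hinge term keeping the predicted label l unchanged.
    loss3: pulls the predicted-class confidence toward 1/n + eps."""
    n = tf.cast(tf.size(s), tf.float32)
    l = tf.argmax(s)
    r = tf.nn.softmax(s)                         # true confidences
    r_star = tf.nn.softmax(s + w)                # noisy confidences
    loss1 = tf.reduce_sum(tf.abs(r_star - r))
    mask = tf.one_hot(l, tf.size(s))
    best_other = tf.reduce_max((s + w) - 1e9 * mask)
    loss2 = tf.nn.relu(best_other - (s + w)[l])  # 0 when the label is preserved
    loss3 = tf.abs(r_star[l] - (1.0 / n + eps))
    return loss1 + c1 * loss2 + c2 * loss3
```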
To solve the unconstrained optimization problem, we propose a gradient descent-based algorithm. In Algorithm 1, we repeatedly search for a sufficiently large $c_1$ to find a noise vector $w$ with an even distortion of the confidence scores. For a given $c_1$, we use gradient descent to find a value of $w$ that satisfies Equation 8 and Equation 9. If we cannot find a $w$ that satisfies both equations, the search over $c_1$ stops. In particular, given $c_1$, $c_2$, and learning rate $\alpha$, we repeatedly update the vector variable $w$. Since Equation 8 and Equation 9 are converted into objective-function terms, it is not guaranteed that a $w$ satisfying both constraints will be found during the iterative gradient descent process. Therefore, at each gradient descent iteration, we check whether the two constraints are satisfied: the predicted label must not change, and $g(r^*_l)$ must equal $\frac{1}{n} + \epsilon$. If both constraints are satisfied, the gradient descent process is stopped. We used $g(r^*_l) = \frac{1}{n} + \epsilon$ to approximate Equation 9, with the epsilon set to the very small value 0.001. We find the vector $w$ by applying a small learning rate $\alpha$. Note that we could also iteratively search for $c_2$, but it is computationally inefficient to search for both $c_1$ and $c_2$.

Algorithm 1 AdvGuard Method
Input: true logits s, constants c_1 and c_2, learning rate α, epsilon ε, max iteration iter, number of classes n
Output: noise vector w

l ← argmax_j(s_j)                                   // predicted label
while true do
    w ← 0, i ← 0
    while i < iter and argmax_j{s_j + w_j} = l and g(r*_l) ≠ 1/n + ε do
        w ← w − α · ∇_w loss                        // gradient descent on the total loss
        i ← i + 1
    end while
    if argmax_j{s_j + w_j} ≠ l or g(r*_l) = 1/n + ε then
        // return the vector from the previous iteration if the predicted
        // label changed or g(r*_l) = 1/n + ε in the current iteration
        return w
    end if
    c_1 ← 10 · c_1
end while
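The following Python sketch of Algorithm 1 reuses the total_loss function from the previous sketch; the stopping tolerance 1e-3 and the Variable-based gradient descent are assumptions about implementation details the algorithm leaves open.

```python
import tensorflow as tf

def advguard(s, c1=1.0, c2=1.0, alpha=0.1, eps=0.001, max_iter=300):
    """Gradient-descent search for the noise logits w (Algorithm 1 sketch)."""
    s = tf.convert_to_tensor(s, tf.float32)
    n = int(tf.size(s))
    l = int(tf.argmax(s))                    # predicted label, must not change
    while True:
        w = tf.Variable(tf.zeros_like(s))
        for _ in range(max_iter):
            with tf.GradientTape() as tape:
                loss = total_loss(s, w, c1, c2, eps)
            w.assign_sub(alpha * tape.gradient(loss, w))
            if int(tf.argmax(s + w)) != l:   # label flipped: abandon this round
                break
            r_star = tf.nn.softmax(s + w)
            if abs(float(r_star[l]) - (1.0 / n + eps)) < 1e-3:
                return w.numpy()             # both constraints satisfied
        c1 *= 10.0                           # retry with a stronger c1
```

Note that, as in the paper, only c_1 is searched iteratively; searching for c_2 as well would be computationally inefficient.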

V. EXPERIMENT SETUP
Through experiments, we assessed the extent to which the proposed method can effectively resist adversarial examples. We used the TensorFlow library [35], which is widely used for machine learning, and a Xeon E5-2609 1.7-GHz server.

A. DATASETS
MNIST [12] and CIFAR10 [13] were used as the experimental datasets. MNIST is a standard dataset of handwritten digit images from 0 to 9. The number of pixels in an MNIST image is 28 × 28 × 1, or 784 pixels in total. It consists of 60,000 training samples and 10,000 test samples. CIFAR10 is composed of color images in 10 classes of objects: plane, car, bird, cat, deer, dog, frog, horse, ship, and truck. The number of pixels in a CIFAR10 image is 32 × 32 × 3, or 3072 pixels in total. It consists of 50,000 training samples and 10,000 test samples.

B. PRETRAINING OF CLASSIFIER AND NOISE VECTOR SETTINGS
The classifiers pretrained on MNIST and CIFAR10 were a common CNN [36] and a VGG19 network [37]. Their configurations and training parameters are shown in Tables 9, 10, and 11 in the Appendix. In the MNIST test, the pretrained classifier D correctly classified the original MNIST samples with over 99% accuracy. In the CIFAR10 test, the pretrained classifier D correctly classified the original CIFAR10 samples with over 91% accuracy. The noise vector was applied after the classifier had been pretrained. For the noise vector settings, the number of iterations was set to 300, and the noise vector was applied at the end of the softmax layer. In updating the noise vector, the constant c_1 was increased by a factor of 10 per round, as in Algorithm 1, to find an appropriate amount of noise. The noise vector set the confidence values of all classes at random without changing the class prediction provided as output.

TABLE 2. Classification scores and confidence values of an original sample in class ''9'' before and after application of the noise vector.

TABLE 3. Mean confidence values for the predicted class before and after application of the noise vector on MNIST and CIFAR10. ''SD'' is standard deviation.

C. GENERATION OF ADVERSARIAL EXAMPLES
For the experiments, it was assumed that an attacker attacks a pretrained classifier equipped with a noise vector, generating adversarial examples by the Carlini & Wagner method (a state-of-the-art method with a 100% attack success rate) using the L2 distortion metric. For MNIST, the number of iterations was 500, the Adam [38] algorithm was used as the optimizer, the learning rate was 0.1, and the initial value of the constant c was 0.01. For CIFAR10, the number of iterations was 10,000, the Adam algorithm was used as the optimizer, the learning rate was 0.01, and the initial value of the constant c was 0.01. For each dataset, we created 1000 randomly targeted and untargeted adversarial examples.

VI. EXPERIMENTAL RESULTS
The attack success rate refers to the proportion of samples for which the class recognized by the target classifier matches that intended by the attacker for the adversarial example. For example, if 95 of 100 samples are misidentified by the target classifier as the class intended by the attacker, the attack success rate is 95%. Accuracy refers to the proportion of input samples classified into their correct original classes. Distortion is defined as the square root of the sum of the squared differences between each pixel and the corresponding pixel in the original sample (the L2 distortion).
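As a minimal sketch, these three quantities can be computed from parallel arrays of predictions and images (the function and argument names are hypothetical):

```python
import numpy as np

def evaluate(preds_adv, targets, originals, x, x_adv):
    """Attack success rate, accuracy, and mean L2 distortion (Section VI)."""
    success = float(np.mean(preds_adv == targets))     # matched attacker's target
    accuracy = float(np.mean(preds_adv == originals))  # matched original class
    d = (x_adv - x).reshape(len(x), -1)
    mean_l2 = float(np.sqrt((d ** 2).sum(axis=1)).mean())
    return success, accuracy, mean_l2
```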
Table 2 shows the confidence values and classification scores for an original sample in class ''9'' before and after application of the noise vector. As shown in the table, before passing through the noise vector, the highest classification score (12.7) and confidence value (0.95) for the original sample were for class ''9''. Even after passing through the noise vector, although the classification scores were randomly distributed across the classes, class ''9'' maintained the highest classification score and confidence value. Although the confidence value for class ''9'' dropped to 0.315, this class maintained the highest confidence overall, and therefore the class prediction was not affected.
Table 3 shows the average confidence values for the predicted class before and after application of the noise vector for MNIST and CIFAR10. From the table, it can be seen that before passing through the noise vector, the average confidence values corresponding to the original class were 0.9686 and 0.97141 for MNIST and CIFAR10, respectively. After passing through the noise vector, which adds noise to the confidence values, the average confidence values for the predicted class fell to 0.293 and 0.305 for MNIST and CIFAR10, respectively. Lowering the confidence values via noise, within a range in which the class prediction does not change, causes the optimal adversarial noise to be improperly determined during the generation of an adversarial example.
Table 4 shows adversarial examples for MNIST before and after application of the noise vector when the attack success rate was 100%. ''Baseline'' denotes the method without a noise vector. The table shows adversarial examples generated through the baseline method and the proposed method. In the case of the baseline method, the target classifier without a noise vector classified the adversarial examples as incorrect classes. On the other hand, in the case of the proposed method, the adversarial examples were correctly recognized as their original classes by the target classifier with the noise vector.

TABLE 4. Adversarial examples for MNIST before and after application of the noise vector when the attack success rate was 100%. These samples are images having the average distortion shown in Table 6. ''Baseline'' means a method without a noise vector.

TABLE 5. Adversarial examples for CIFAR10 before and after application of the noise vector when the attack success rate was 100%. These samples are images having the average distortion shown in Table 6. ''Baseline'' means a method without a noise vector.

TABLE 6. Distortion of adversarial examples for the baseline method and the proposed method. ''SD'' is standard deviation.
After the decision boundary of the target classifier was changed through the addition of noise to the confidence values by the proposed method, the adversarial noise in the adversarial examples was no longer optimal, and so the target classifier correctly recognized the adversarial examples as their original classes.
Table 5 shows adversarial examples for CIFAR10 before and after application of the noise vector when the attack success rate was 100%. The table shows adversarial examples generated through the baseline method and the proposed method. In the case of the baseline method, as with MNIST, the target classifier without a noise vector classified the adversarial examples as incorrect classes. On the other hand, in the case of the proposed method, as with MNIST, the adversarial examples were correctly recognized as their original classes by the target classifier with the noise vector.
Table 6 shows the distortion for the 1000 adversarial examples generated for target models with and without the noise vector. As can be seen, in the case of the baseline method, considerable distortion was required to generate an adversarial example; this is because of the large difference in confidence values between the original class and the incorrect class. CIFAR10 required more distortion than MNIST because of the larger number of pixels. On the other hand, with the proposed method, it can be seen that less distortion was added to the adversarial examples; this is because there is little difference in confidence value between the original class and the incorrect class, owing to the noise in the confidence values caused by the noise vector. With the proposed method, however, the adversarial noise in the adversarial examples was not optimized for the true confidence values, and so the target classifier correctly recognized the adversarial examples as their original classes.
Table 7 shows the accuracy of adversarial example recognition for the baseline method and the proposed method. With the baseline method, the accuracy of adversarial example recognition was 0% and 0.5% on MNIST and CIFAR10, respectively. In other words, in the case of the baseline method, the attack success rate was 100% and 99.5% on MNIST and CIFAR10, respectively. On the other hand, with the proposed method, the accuracy of adversarial example recognition was 98% and 90% on MNIST and CIFAR10, respectively. The reason is that when an adversarial example is generated against the noise vector, its adversarial noise is not optimally determined, so only distortion that does not affect the class prediction is added to the adversarial example. Therefore, a DNN constructed using the proposed method can correctly recognize adversarial examples; its class predictions are not affected.
For the proposed method, a comparative analysis was also conducted using other types of adversarial examples. Because the proposed method is designed to defend against optimized adversarial examples generated by white box attack methods, the DeepFool method [26] and the Jacobian-based saliency map attack (JSMA) method [27], which generate optimized adversarial examples, were additionally tested. Table 8 shows the accuracy of a DNN constructed using the proposed method on adversarial examples generated by the DeepFool, JSMA, and Carlini & Wagner methods. As shown in the table, the DNN constructed using the proposed method had high accuracy on the adversarial examples generated by all three methods.

VII. DISCUSSION

A. ASSUMPTION
The assumed setting for use of the proposed method is a scenario in which the attacker knows the confidence values and can access the softmax layer of the DNN to generate an optimized adversarial example. Therefore, in the transfer from the logits of the second-to-last layer to the result value of the softmax layer, the proposed method adds distortion to the logits so that the result value of the softmax layer maintains the original class prediction and has an equal confidence value for each class. (In a scenario in which the attacker does not have access to the softmax layer, the proposed method can also change the softmax output directly.) Because the adversarial example is generated using the false confidence values produced by the noise vector rather than the true confidence values, the adversarial noise of the adversarial example will not be optimized, and so the target model can correctly recognize the adversarial example.
In addition, the reason that the model does not simply deliver 0 or 1 as the output value follows from the assumption made when an optimized adversarial example is generated and from the assumed setting for use of the proposed method. First, under the assumption for use of the proposed method, the attacker must know the confidence values and have access to the softmax layer of the DNN to generate an optimized adversarial example. When generating an optimized adversarial example, the attacker needs to know the confidence value for each class in the model. Second, an optimized adversarial example is a sample generated by adding a minimal amount of distortion to the original data so that the confidence value for an incorrect class will be slightly higher than that for the original class. If the model provided only 0 or 1 as output values and provided no confidence values, then an optimized adversarial example could not be generated, or it would be generated with excessive noise, resulting in a misrecognized sample having a high degree of distortion. Therefore, in order to generate an optimized adversarial example, the attacker must have access to the softmax layer of the model and must know the confidence values.

B. CONFIDENCE VALUES AND DECISION BOUNDARY
The noise vector produces confidence values with random noise overall while maintaining the highest confidence value for the predicted class. Because the confidence values provided via the noise vector are random, the apparent decision boundary of the target classifier is changed. Because the adversarial example is generated against this modified decision boundary, it is fitted to the changed boundary while minimizing its distortion from the original sample. However, because the modified boundary is not the true decision boundary of the target classifier, the adversarial example will be correctly recognized according to the true decision boundary of the target classifier.

C. A POSSIBLE SCENARIO
The proposed method restricts the creation of adversarial examples by providing random confidence values for all classes within a range that does not affect the class prediction.
In terms of image recognition, the DNN knows the true confidence values for all classes. Therefore, to select the predicted class, the DNN uses the true confidence values for all classes, before the noise vector is applied. As shown in Table 2, even though the second or third highest confidence value may be close to the highest one, this is irrelevant because the DNN does not choose the predicted class using the confidence values with the added noise.
The proposed method can be applied in a scenario in which the confidence values for all classes are not important to the end user. For example, in the case of autonomous vehicles, there could be a situation in which a sign is classified in real time using a DNN. In this case, the autonomous vehicle moves using the class predicted according to the highest confidence value, and in this process, there may be a situation in which the human user does not need to be provided with the confidence values for all classes. In other words, it may be necessary for the machine itself (without calling on human judgment) to act on the predicted class that has the highest confidence value and to provide confidence values that include noise for security reasons.

D. DEFENSE CONSIDERATIONS
The proposed method does not require a separate module or a separate process, unlike conventional adversarial example defense methods. For example, the conventional defense methods that work by manipulating input data require training and generating a filter module. In addition, making the classifier robust requires an additional adversarial training process or separate neural networks. In contrast, the proposed method can prevent the generation of an optimal adversarial example without affecting the class prediction by using a noise vector in the softmax layer, without the need for a separate module or process.
To achieve random confidence values, multiple iterations through the noise vector are needed. If the number of iterations is insufficient, the confidence values are unlikely to be randomly distributed; approximately 30 iterations are needed. When the number of iterations was set to approximately 30, adversarial examples were correctly classified as their original classes with high accuracy on MNIST and CIFAR10.

E. DATASET
The distortion in an adversarial example differs with the dataset. This is because of differences in the pixel count and channel dimensionality of the datasets. For example, MNIST contains single-channel monochrome images, and CIFAR10 contains three-channel color images. In terms of the number of pixels, MNIST images each have 784 (1 × 28 × 28), whereas CIFAR10 images have 3072 (3 × 32 × 32). Therefore, adversarial examples for CIFAR10 will have more distortion than those for MNIST. However, in terms of human perception, because CIFAR10 images are color images, adversarial examples for CIFAR10 appear to the human eye as more similar to their original samples than do adversarial examples for MNIST.

F. APPLICATIONS
The proposed method can be applied to autonomous vehicles or in military scenarios. In the case of an autonomous vehicle, if an attacker creates an adversarial example and applies it to a road sign, the proposed method allows the vehicle's model to correctly recognize the road sign by providing a random confidence value for each class. In particular, the proposed method can effectively defend against adversarial examples as part of an ensemble of defense methods, such as detection through angle and rotation, position sensing through GPS, and various detection sensors. In a military environment, if the proposed method is applied when identifying an object through a DNN, an adversarial example can be correctly recognized with high accuracy.

G. LIMITATION
The proposed defense method assumes a limited attack in which only the confidence values are known, not all information regarding the target classifier. If the attacker knows all information about the target classifier, including the noise vector, it will be possible to create an adversarial example that neutralizes the proposed method: by ignoring the noise vector and applying the Carlini & Wagner method to the true confidence values, the attacker can generate an adversarial example against which the proposed method offers no protection.

VIII. CONCLUSION
The method proposed in this paper defends against adversarial examples using an existing target classifier. The method is designed to defend against adversarial examples using a noise vector and does not need to train on adversarial examples, use other processing modules, or be equipped to perform input data filtering. The method defends against adversarial examples while maintaining the model's accuracy on the original samples. The experimental results show that the proposed method can correctly classify adversarial examples with 100% and 99.5% accuracy for MNIST and CIFAR10, respectively. Future research will extend the method to other image datasets such as ImageNet [39] and to the face domain [40]. In terms of image datasets, the proposed method can be extended to face datasets such as VGG-Face or to ImageNet with its 1000 image classes. In addition, it is possible to expand the study of adversarial examples to the audio domain [41], [42], the malicious code domain [43], and medical domains [44]. Another challenge will be to design a method for detecting adversarial examples based on correlations among various classifiers. Using such correlations, we can study detection methods that improve performance by combining multiple softmax thresholds across multiple classifiers instead of using a single classifier.

FIGURE 1. Overview of the softmax layer in the DNN architecture.

FIGURE 2. Decision boundary of model D for original sample and adversarial examples.

Figure 2 shows the decision boundary between adversarial examples and the original sample. The decision boundary represents the region within which the model correctly recognizes the classification of an input. For example, in the figure, original sample x is within the decision boundary, so its original class will be correctly recognized. Because the adversarial examples x* are outside the decision boundary, however, they will be mistakenly recognized as incorrect classes.

FIGURE 3. Overview of the proposed method.

TABLE 7. Accuracy of adversarial example recognition for the baseline method and the proposed method.

TABLE 8. Accuracy of a DNN constructed using the proposed scheme on adversarial examples generated by the DeepFool, JSMA, and Carlini & Wagner methods.