PuVAE: A Variational Autoencoder to Purify Adversarial Examples

Deep neural networks are widely used and exhibit excellent performance in many areas. However, they are vulnerable to adversarial attacks that compromise the network at inference time by applying elaborately designed perturbations to input data. Although several defense methods have been proposed to address specific attacks, other attack methods can circumvent these defense mechanisms. Therefore, we propose the Purifying Variational Autoencoder (PuVAE), a method to purify adversarial examples. The proposed method eliminates an adversarial perturbation by projecting an adversarial example onto the manifold of each class and determines the closest projection as the purified sample. We experimentally illustrate the robustness of PuVAE against various attack methods without any prior knowledge of them. In our experiments, the proposed method exhibits performance competitive with state-of-the-art defense methods, and its inference is approximately 130 times faster than that of Defense-GAN, the state-of-the-art purifier model.


Introduction
Deep learning has made significant progress in several areas including image recognition [He et al., 2016], disease prediction [Hwang et al., 2017], and autonomous driving [Yoo et al., 2017]. However, security issues are emerging because deep neural networks are especially vulnerable to adversarial attacks. The goal of adversarial attacks is to fool deep neural networks by applying elaborately designed perturbations to input data. Adversarial attacks make it hazardous to apply deep neural networks in real-world applications. In the case of autonomous driving [Akhtar and Mian, 2018], attacks can cause an accident by making an object detector recognize pedestrians as roads.
To address these attacks, several defense mechanisms have been proposed, falling into three categories. The first modifies the training dataset so that the classifier is robust against adversarial attacks [Szegedy et al., 2013]. The second blocks gradient calculation by changing the training procedure [Buckman et al., 2018; Guo et al., 2017]; however, these mechanisms are only effective against gradient-based attack methods. The third removes the adversarial noise from the sample fed into the classifier [Samangouei et al., 2018]. Our main focus is defense mechanisms that purify input data which may contain adversarial perturbation; such mechanisms can effectively address any attack. These methods mostly use a generative model to learn the data distribution p(x) and project the adversarial example onto the learned distribution. We term these generative models purifiers; MagNet [Meng and Chen, 2017] and Defense-GAN [Samangouei et al., 2018] are representative examples. In Figure 1, we show an overview of the defense mechanism using the purifier model.
Figure 2: Overview of the PuVAE algorithm. The green region represents the training process, and the blue region denotes the inference process of PuVAE. The dotted line is the gradient flow in the training process. The parameters of the source classifier are not updated.
Specifically, MagNet learns the original data using one or more autoencoders, termed reformer networks, and passes input data through the autoencoders, which move the input closer to the data manifold. The purified data are then supplied to the classifier. However, the method has the disadvantage of poor performance compared with Defense-GAN.
Defense-GAN uses the characteristics of generative adversarial networks (GANs) to defend a target model against adversarial attacks. It relies on the fact that optimizing the objective function of the GAN is equivalent to making the generator distribution p_g identical to the data distribution p_data. After training the GAN on the original data, Defense-GAN iteratively finds the generator input z through gradients that reduce the reconstruction error between the generated data G(z) and the input data x, which may contain adversarial noise. Subsequently, the data generated with the optimal z are supplied to the classifier as input. The model relies on the often unstable performance of GANs, and it occasionally reproduces adversarial noise by directly optimizing the error between the adversarial example x + δ and the generated sample G(z). In addition, because of Defense-GAN's iterative nature, it takes a long time to yield its maximum defense performance. In particular, real-time applications such as object detection must operate within a short period of time, so a fast defense algorithm needs to be developed.
In this paper, we aim to rapidly generate well-classified samples from adversarial examples. The purified samples are fed into the target classifier so that they are classified without being affected by adversarial attacks. To address the limitations of MagNet and Defense-GAN, we propose the Purifying Variational AutoEncoder (PuVAE), which purifies adversarial examples using a variational autoencoder (VAE). The proposed model uses variational inference to generate samples that provide comparable or better defense performance than the state-of-the-art model. In contrast to Defense-GAN, PuVAE generates clean samples in one feed-forward step. Therefore, our method is robust against adversarial attacks within a reasonable time limit.
In summary, our contributions are as follows:
• We propose a VAE-based defense method, PuVAE, to effectively purify adversarial examples. The proposed method shows remarkable performance compared with other defense methods.
• The proposed method significantly reduces the time to generate purified samples. Within a reasonable time limit, PuVAE outperforms state-of-the-art defense methods.
• Experimental results demonstrate that our method functions robustly against a variety of attack methods and datasets.

Variational Autoencoder (VAE)
A generative model is used to represent a data distribution. Most data are too complex to model directly, and thus relatively simple latent variables are typically used to represent the data distribution. Kingma [2013] introduced the Variational AutoEncoder (VAE), a method that combines neural networks and variational inference to learn a decoder that generates data from a normal distribution. The objective function of VAE is represented as follows:

log p(x) ≥ L(θ, ϕ, x) = E_{q_ϕ(z|x)}[log p_θ(x|z)] − D_KL(q_ϕ(z|x) || p_θ(z)),

where log p(x) denotes the marginal log-likelihood of the data, L(θ, ϕ, x) denotes the variational lower bound of the marginal likelihood, p_θ(x|z) denotes the output distribution of the decoder, q_ϕ(z|x) denotes the output distribution of the encoder, and p_θ(z) denotes a normal distribution. By maximizing the lower bound, the marginal likelihood of the data is maximized. In VAE, the latent space is assumed to be a multivariate Gaussian distribution. Doersch [2016] indicated that the conditional Variational Autoencoder (cVAE) is specifically used to learn a multimodal distribution via class information. The basic idea of cVAE is similar to that of VAE, which aims to learn the distribution of data. However, the encoder and the decoder of cVAE take a class label as an additional input.
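With a diagonal-Gaussian encoder and a standard-normal prior, the KL term of this lower bound has a well-known closed form. A minimal NumPy sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ),
    summed over the latent dimensions."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
```

When the posterior equals the prior (mu = 0, sigma = 1), the divergence is zero; any other setting yields a positive penalty, which the lower bound trades off against reconstruction quality.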
In this study, we select a cVAE structure to learn class-specific data distributions. We use the encoder as the mapping function from adversarial examples to the latent space of legitimate images, and the decoder as the reconstructor of images from the latent spaces. We confirm that the forwarding process through the encoder-decoder model effectively purifies adversarial noise from data.

Adversarial Examples
An adversarial example is a sample designed to be misclassified by the target classifier using intended noise that is not perceivable by humans. Prior to a study by Goodfellow [2015], attackers exploited the non-linearity of neural networks. However, the authors claimed that the cause of vulnerability to adversarial examples is the linear characteristic of neural networks and proposed the Fast Gradient Sign Method (FGSM), which uses the gradient from the objective function of the neural network. Although FGSM is a fast algorithm for crafting adversarial attacks, it is easy to defend against this one-step gradient-based approach. To overcome the problem, the iterative Fast Gradient Sign Method (iFGSM) was proposed by Kurakin [2016]. The method optimizes adversarial noise in several steps, each applying a small perturbation to the image, which allows a more accurate attack.
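As a sketch, FGSM and iFGSM can be written down for a toy logistic-regression model whose input gradient is available in closed form (the toy model and names are ours for illustration; the attacks in the paper target deep classifiers):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_loss_x(x, y, w):
    """Gradient of the binary cross-entropy of sigmoid(w @ x)
    with respect to the input x."""
    return (sigmoid(w @ x) - y) * w

def fgsm(x, y, w, eps):
    """One step of size eps in the sign of the input gradient."""
    return x + eps * np.sign(grad_loss_x(x, y, w))

def ifgsm(x, y, w, eps, steps):
    """Several small sign steps, clipped to the eps-ball around x."""
    x_adv, alpha = x.copy(), eps / steps
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_loss_x(x_adv, y, w))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

Both attacks increase the model's loss on the perturbed input while keeping the perturbation within an eps-ball in the infinity norm.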
RAND+FGSM [Tramèr et al., 2017] is a newer attack method that adds random Gaussian noise to an image and then computes FGSM on the perturbed image. For the FGSM-based methods, we used targeted adversarial attacks because we assume that targeted attacks are more difficult to defend against than untargeted attacks [Xu et al., 2019]. In our experiments, we randomly chose target labels among the classes, excluding the true class.
The Carlini and Wagner (CW) attack suggested by Carlini [2017] is the most powerful attack method among existing methods. By solving the following optimization problem in a gradient-descent manner, the adversarial perturbation is derived:

min_δ ||δ||_p + c · f(x + δ),

where f(x) denotes an objective function of a classifier and δ denotes the perturbation added to image x. In this study, we used p = 2 and created the CW attack with the open-source software CleverHans (https://github.com/tensorflow/cleverhans) by Papernot [2018] to verify whether our method can defend against the attack.
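A rough sketch of this optimization on a hypothetical linear classifier with logits z = W @ x, using the common CW objective f(x') = max(max_{i≠t} z_i − z_t, −κ); the toy model, learning rate, and constant c are our illustrative choices, not the paper's setup:

```python
import numpy as np

def cw_l2(x, target, W, c=10.0, lr=0.05, steps=300, kappa=0.0):
    """Plain gradient descent on ||delta||_2^2 + c * f(x + delta)
    for a linear classifier with logits z = W @ x (toy sketch)."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        z = W @ (x + delta)
        margin = z - z[target]
        margin[target] = -np.inf
        idx = int(np.argmax(margin))        # strongest competing class
        if z[idx] - z[target] > -kappa:     # f is active
            g_f = W[idx] - W[target]
        else:                               # f clipped at -kappa
            g_f = np.zeros_like(delta)
        delta -= lr * (2.0 * delta + c * g_f)
    return delta
```

On a linear model the attack quickly drives the target logit up to the strongest competing one, while the ||δ||² term keeps the perturbation small.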

Proposed Method
In this paper, we propose a VAE-based defense method, coined PuVAE, that purifies adversarial noise from data. We consider a dataset X_data that consists of data instances x_data ∈ R^d, where d denotes the dimension of the data space.
Corresponding class labels (one-hot vectors) are denoted by y_data ∈ R^c in a set of classes C, where c is the number of classes.
We then consider a target classifier M_t, which is the model an attacker wants to attack. We also assume a set X_adv that consists of adversarial examples x_adv ∈ R^d created from the target classifier. We define a set X which contains clean samples and adversarial examples. Instances x from the set X are used at inference time. We explain the procedures of training and generating purified samples using PuVAE. An overview of the proposed method is given in Figure 2.

Training process of PuVAE
PuVAE comprises an encoder and a decoder network. The encoder receives a data-label pair and outputs the mean µ and the standard deviation σ of the Gaussian distribution on the latent space corresponding to the input label:

(µ, σ) = Enc_ϕ(x, y).

Using µ and σ obtained from the encoder, the latent vector z on the latent space is sampled:

z = µ + σ_ϵ · σ ⊙ ϵ, ϵ ∼ N(0, I),

where ϵ denotes a random variable for the reparameterization trick, and σ_ϵ denotes a hyperparameter that is multiplied by the standard deviation to control the extent to which the latent vector is sampled. In the experiments, we used σ_ϵ = 1 at training time to ensure that the posterior latent distribution follows the normal distribution. In classification tasks, convolutional neural networks (CNNs) use pooling and strides to select useful features and to widen the receptive field. However, this selective nature of CNNs is a disadvantage for generative models, since the feature selection causes information loss. Therefore, we use a dilated convolutional neural network as the encoder to obtain the latent vector z. Dilated convolution inserts zeros into the filter, so that the receptive field is enlarged and information loss is effectively reduced.
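The sampling step above is the standard reparameterization trick with the extra scale σ_ϵ; a minimal sketch (names illustrative):

```python
import numpy as np

def sample_latent(mu, sigma, sigma_eps=1.0, rng=None):
    """z = mu + sigma_eps * sigma * eps with eps ~ N(0, I);
    sigma_eps controls how far from the posterior mean we sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma_eps * sigma * eps
```

Setting sigma_eps = 0 collapses the sample to the posterior mean, which gives deterministic behavior at inference.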
The sampled z enters the decoder with the label and produces an output instance x̂ with the same dimension d as the input:

x̂ = Dec_θ(z, y).

At training time, PuVAE is trained to maximize the variational lower bound in a manner similar to cVAE. The loss functions from the encoder and the decoder are:

L_RC = ||x − x̂||_2^2,  L_KL = D_KL(q_ϕ(z|x, y) || N(0, I)),

where L_RC denotes the reconstruction loss that minimizes the difference between the input and output instances, and L_KL denotes the Kullback-Leibler divergence between the output distribution of the encoder and the normal distribution. This process allows PuVAE to construct the mapping of legitimate data on the latent space. Additionally, we use the cross-entropy calculated from a classifier as a loss function for PuVAE. The classifier, called the source classifier M_s, learns the decision boundaries on the data space. Since neural networks performing the same task learn similar functions [Goodfellow et al., 2015], we use a fixed architecture for M_s. The trained M_s is then used to ensure that the output instance reflects the characteristics of the classes in C. The cross-entropy loss from M_s is:

L_CE = −Σ_i y_i log M_s(x̂)_i.

Finally, PuVAE is trained using stochastic gradient descent (SGD) on the total loss

L = λ_KL L_KL + λ_RC L_RC + λ_CE L_CE,

where λ_KL, λ_RC, and λ_CE are coefficients for each loss function.
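The weighted training objective can be sketched as follows; the loss forms below (squared error, diagonal-Gaussian KL, cross-entropy) are standard choices consistent with the text, and the coefficient defaults mirror the setting reported in the experiments:

```python
import numpy as np

def puvae_loss(x, x_hat, mu, sigma, probs, y_onehot,
               lam_kl=0.1, lam_rc=0.01, lam_ce=10.0):
    """Weighted sum of the three PuVAE training losses (sketch).
    probs are the source classifier's softmax outputs on x_hat."""
    l_rc = np.sum((x - x_hat) ** 2)                                 # reconstruction
    l_kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))  # KL to N(0, I)
    l_ce = -np.sum(y_onehot * np.log(probs + 1e-12))                # cross-entropy
    return lam_kl * l_kl + lam_rc * l_rc + lam_ce * l_ce
```

A perfect reconstruction, a posterior matching the prior, and a confident correct source-classifier prediction together drive the loss to (approximately) zero.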

Generating Purified Samples
At inference time, PuVAE projects an input sample onto the data manifolds of all classes in C:

x̂_{y_i} = PuVAE(x, y_i) for each y_i ∈ C,

where y_i denotes the i-th class label in C, used to guide the input to the corresponding latent space, and x̂_{y_i} denotes a candidate for the purified sample. The inference follows the same encoding, sampling, and decoding steps as in training, where σ_ϵ is used to sample the latent vector z. We performed a hyperparameter search on σ_ϵ among {0, 0.01, 0.1, 1, 10, 100} for inference; the optimal value of σ_ϵ is 0.1. Because PuVAE only learns the distribution of z from the training data, the input data is mapped to the learned latent spaces even if an adversarial example comes in; the adversarial noise is removed in the projection to the latent variable. Then, the class label corresponding to the closest projection, y*, is selected as follows:

y* = argmin_{y_i ∈ C} D(x, x̂_{y_i}),

where D denotes a distance measure to determine the closest projection. We use the root mean square error (RMSE) as the distance measure. Therefore, the candidate generated with label y* is the purified sample that goes into M_t:

x_purified = x̂_{y*}.

Finally, the purified sample is fed into the target classifier M_t:

ŷ = M_t(x_purified).

The complete process of generating the purified sample using PuVAE is illustrated in Figure 3.
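The inference procedure can be sketched end to end; here encode and decode stand in for the trained PuVAE networks (hypothetical callables), and RMSE picks the closest class-conditional reconstruction:

```python
import numpy as np

def purify(x, encode, decode, num_classes, sigma_eps=0.1, rng=None):
    """Project x onto every class manifold and return the
    reconstruction with the smallest RMSE to x (sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_rmse, best = np.inf, None
    for i in range(num_classes):
        y = np.eye(num_classes)[i]
        mu, sigma = encode(x, y)                 # class-conditional posterior
        z = mu + sigma_eps * sigma * rng.standard_normal(mu.shape)
        x_hat = decode(z, y)                     # candidate purified sample
        rmse = np.sqrt(np.mean((x - x_hat) ** 2))
        if rmse < best_rmse:
            best_rmse, best = rmse, x_hat
    return best
```

Because every class is tried in a single feed-forward pass each, this loop is what makes the method fast relative to iterative optimization over a latent code.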

Experiments
In this section, we determine the optimal settings for PuVAE and present its defense performance against adversarial attacks. We used TensorFlow (1.12.0) for the experiments, with an NVIDIA TITAN V GPU (12 GB) and an Intel Xeon E5-2690 v4 CPU (2.6 GHz). We used MNIST [LeCun et al., 1998], a hand-written digit dataset; Fashion-MNIST [Xiao et al., 2017], a clothing object image dataset; and CIFAR-10 [Krizhevsky and Hinton, 2009], a tiny natural image dataset. Each dataset consists of 50,000 training instances and 10,000 test instances. We normalized data between 0 and 1. We used FGSM, iFGSM, RAND+FGSM, and CW attacks for the experiments. FGSM, iFGSM, and RAND+FGSM were generated with an adversarial perturbation size of 0.3 for the MNIST and Fashion-MNIST datasets, and 0.06 for the CIFAR-10 dataset. We set the upper limit on the random noise of RAND+FGSM to 0.05 for the MNIST and Fashion-MNIST datasets, and 0.005 for the CIFAR-10 dataset. We set the number of iterations of the CW attack to 100 on all datasets. The performance of defense mechanisms is measured by the accuracy of the target classifier. The architectures of the encoder and the decoder of PuVAE are presented in Table 1. Dilated Conv(n, k × k, r) denotes a dilated convolution layer with n feature maps, filter size k × k, and dilation rate r. Deconv(n, k × k, s) denotes a deconvolution layer with n feature maps, filter size k × k, and stride s. FC(m) denotes a fully connected layer with m units. ReLU denotes the rectified linear unit. We use the first half of the last layer of the encoder, 32 output units, as µ, and the second half is passed through the softplus function to infer σ. We used the architecture of Defense-GAN and the reformer network of MagNet as suggested in [Samangouei et al., 2018] and [Meng and Chen, 2017], respectively. The architectures of M_t and M_s are shown in the Supplementary Materials (https://anonymous-puvae.github.io/).

Effect of Coefficients on Training PuVAE
Figure 4 demonstrates the characteristics of generated samples for combinations of the three coefficients λ_RC, λ_KL, and λ_CE. If λ_RC is larger than λ_KL, the constraint on the posterior distribution of the encoder is relaxed. Thus, the encoder easily maps input samples to low-likelihood areas of the latent space, which causes the decoder to generate strange images, as demonstrated in the first row of Figure 4.
Conversely, the typical form of each class is generated when λ_KL is large during training. Therefore, as shown in the last row of Figure 4, it is possible to generate samples that exhibit the distinctive characteristics of each class even if an input sample from another class comes in. The red box in Figure 4 illustrates that the purified sample x_purified is analogous to the input sample x, with the effect of adversarial noise removed. Additionally, the highest defense performance is obtained with the coefficient combination in the last row of Figure 4. Therefore, we set λ_RC, λ_KL, and λ_CE to 0.01, 0.1, and 10, respectively, in the experiments.

Defense Performance
In this section, we compare the defense performance of PuVAE with the maximum defense abilities of adversarial training, MagNet, and Defense-GAN. As shown in Table 2, the performance of PuVAE exceeds that of MagNet on all attacks and is comparable to that of Defense-GAN. Adversarial training is also comparable with our method on FGSM, iFGSM, and RAND+FGSM, albeit with very low performance on the CW attack. Since we used gradients from M_t for adversarial training, it is robust against gradient-based attacks (FGSM, iFGSM, and RAND+FGSM) but weak against the other attack.
As shown in Table 3, PuVAE shows the best performance against the iFGSM and CW attacks in both architectures. Even though iFGSM is the most difficult attack when there is no defense, the proposed model effectively defends against it. The purified images on the MNIST and Fashion-MNIST datasets are visualized in the Supplementary Materials.
As shown in Table 4, overall accuracy is low on the CIFAR-10 dataset. Although adversarial training exhibits the best performance in certain settings, its results are unstable depending on models and attacks. In contrast, PuVAE shows the best performance across various attacks and model architectures, and it remains robust in settings where it does not take first place. This indicates that the proposed method has general defense ability across various attacks.

Performance in a Reasonable Time Constraint
Defense mechanisms including Defense-GAN and PuVAE purify adversarial examples in a pre-processing manner. In contrast to PuVAE, Defense-GAN takes a significant amount of time to reach its maximum performance. While Defense-GAN takes approximately 14.8 seconds to purify a mini-batch of 128 MNIST images, PuVAE takes 0.114 seconds, allowing nearly 130 times faster inference, as shown in Table 5. The security issue of adversarial attacks becomes particularly prominent in autonomous driving because purifier models that take a considerable amount of time cannot plausibly be applied to real-time applications. This setback of Defense-GAN accentuates the need for time-efficient defense methods. Therefore, we measured performance under a reasonable time limit on the MNIST dataset. In the experiments, we set the time limit to one second.
In Figure 5, the solid lines show the performance of PuVAE, and the dotted lines show the performance of Defense-GAN. Each color denotes a different attack method. PuVAE performs purification with one inference, and thus its performance is superior to that of Defense-GAN within the time limit. Since Defense-GAN creates a hidden vector iteratively through gradient-based optimization, its performance increases as time elapses. However, it does not reach the performance of PuVAE within the time limit. Therefore, PuVAE is more efficient than the state-of-the-art method for real-time applications.
It is unfair to compare PuVAE and Defense-GAN without a time constraint because the inference time of Defense-GAN significantly exceeds that of PuVAE. We compared the two defense methods after setting the time limit to one second. As shown in Table 6, the performance of Defense-GAN is significantly lower than its maximum performance. Therefore, PuVAE is more practical in real-world scenarios because it exhibits the highest performance under a reasonable time condition.

Conclusion
In this paper, we propose PuVAE, a novel VAE-based defense method that effectively purifies adversarial examples. PuVAE is robust against various attacks and overcomes the disadvantages of adversarial training. The performance of PuVAE is also comparable to the best performance of Defense-GAN. In addition, PuVAE significantly outperforms Defense-GAN given a reasonable time limit. We demonstrate the advantages of the proposed method on various datasets and adversarial attacks. For future work, we plan to apply our method to real-time applications such as autonomous driving, face identification, and surveillance systems.