Evolving Architectures with Gradient Misalignment toward Low Adversarial Transferability

Deep neural network image classifiers are known to be susceptible not only to adversarial examples created for them but even those created for others. This phenomenon poses a potential security risk in various black-box systems relying on image classifiers. The reason behind such transferability of adversarial examples is not yet fully understood and many studies have proposed training methods to obtain classifiers with low transferability. In this study, we address this problem from a novel perspective through investigating the contribution of the network architecture to transferability. Specifically, we propose an architecture searching framework that employs neuroevolution to evolve network architectures and the gradient misalignment loss to encourage networks to converge into dissimilar functions after training. Our experiments show that the proposed framework successfully discovers architectures that reduce transferability from four standard networks including ResNet and VGG, while maintaining a good accuracy on unperturbed images. In addition, the evolved networks trained with gradient misalignment exhibit significantly lower transferability compared to standard networks trained with gradient misalignment, which indicates that the network architecture plays an important role in reducing transferability. This study demonstrates that designing or exploring proper network architectures is a promising approach to tackle the transferability issue and train adversarially robust image classifiers.


Introduction
Transferability of adversarial examples is a phenomenon where images that are slightly perturbed to fool an image classifier, known as adversarial examples, can also fool other image classifiers whose parameters, training hyperparameters, and even architectures are different from those of the originally targeted classifier [1,2,3,4,5,6,7,8,9,10]. The transferability of adversarial examples poses serious threats to many systems based on image classification (e.g., pedestrian detection in autonomous driving) because even when attackers do not have direct access to these systems, they may fool the systems with adversarial examples that are generated using other white-box systems. Therefore, training an image classifier that achieves high classification accuracy on unperturbed images and is also robust against adversarial examples generated from other classifiers (i.e., low transferability) is a fundamental problem.

Figure 1: Overview of the proposed framework in a binary classification task. (Left pane) The reference network properly classifies the samples into two classes (circles and squares). (Right pane) Our framework uses neuroevolution to evolve network architectures (only one network per generation is shown for illustration) in order to discover a network architecture with which one can train a classifier such that (i) it attains a classification accuracy as high as that of the reference network and (ii) it has a drastically different direction of the adversarial perturbation (i.e., low transferability). The middle row shows that although the shapes of the decision boundaries differ across networks, the classification of the samples is identical, which satisfies (i). The bottom row shows that through the course of evolution, the adversarial perturbation on the sample of interest changes, making it difficult to attack all the networks (in particular, the reference network and the final outcome network) simultaneously, which satisfies (ii).
The transferability of adversarial examples between models implies that when trained with the same dataset, image classifiers (particularly, deep neural networks; DNNs) with different training hyperparameters and architectures end up learning functions that have similar decision boundaries [1,5,11]. Although the reason behind this similarity in training convergence, and thus the transferability, is not yet fully understood, it is empirically observed that the extent of transferability correlates with the similarity of the network architectures; the more architectural similarity two DNNs share, the more adversarial examples may transfer between them [7,11]. Therefore, discovering a proper network architecture is also an important factor in realizing a DNN classifier that attains high clean accuracy as well as low transferability of adversarial examples from other classifiers. Nevertheless, only a few studies investigate and try to alleviate the transferability from the network architecture perspective. Differentiable Architecture Search (DARTS) [12] has been used to find robust networks but it only works well with weak adversarial attacks [13]. In another study, neuroevolution has been employed to find robust networks through an exhaustive search in a large search space [14]. Several works included adversarial training [15,16] in the network architecture search to improve network robustness and search efficiency. Liu and Jin [17] incorporated adversarial training into DARTS, which resulted in a more robust network. In addition, some studies have combined one-shot neural architecture search [18] with adversarial training to find networks that balance clean and adversarial accuracies [19,20]. Although they are successful in finding networks that reduce transferability, introducing adversarial training can be computationally expensive, especially when using strong adversarial attack methods.
In this paper, we propose a method that discovers network architectures with which one can train a network with low transferability of adversarial examples from a given pretrained network (reference network), while achieving a similar classification accuracy. To find such architectures, we design (i) neuroevolution [21], which yields unconventional network architectures that still perform well [22], and introduce (ii) gradient misalignment (GM), which encourages networks to have input gradient behavior dissimilar from that of the reference network. The overview of our framework is shown in Fig. 1. In the neuroevolution framework, networks evolve over generations; at each generation, networks are architecturally mutated, trained, and then selected. At the training step, the GM loss is incorporated to reduce the transferability from the reference network. Intuitively, the GM loss measures the dissimilarity of the gradient direction at each input point. As illustrated in Fig. 1, GM makes it difficult to generate adversarial examples that can deceive both the reference network and the evolved network simultaneously; when two networks have largely misaligned input gradients, the directions for adversarially perturbing a clean image to cross the decision boundary of each network are totally different, which prevents the adversarial example from being effective on both networks. Note that the idea of exploiting input gradients to obtain different functions (or networks) is itself not new [23,24,25,26,27]. However, the existing methods focus on a fixed function architecture, whereas in this study, we are interested in the effect of architecture on obtaining different functions. In the experiments, we show that GM becomes far more effective when it is combined with an evolved network architecture than when it is used with the network architecture of the reference network.
It is also worth noting that the neuroevolution designed in this study is more efficient than other neuroevolution-based architecture searches in the related studies. Kotyan and Vargas [14] proposed a powerful evolutionary method which includes block-wise, layer-wise, and model-wise evolution and pursues adversarially robust networks. Although their method can discover robust networks given a considerably large number of generations, the exhaustive evolution incurs a high computational cost. In contrast, our neuroevolution framework incorporates GM in evolution, which allows us to efficiently explore architectures with a small number of generations.
In the experiments, we show that the network architecture is indeed an important factor in alleviating the transferability of adversarial examples. Three standard datasets (CIFAR-10 [28], MNIST [29], and KMNIST [30]) are used to evaluate our method. Using a pretrained reference network (ResNet-34 [31]), we generate adversarial examples from the dataset. Then, we perform neuroevolution with GM to evolve networks using the CIFAR-10 dataset. The networks produced are also applied to the MNIST and KMNIST datasets to test their versatility. The results show that for all the datasets, the networks obtained by the proposed framework (neuroevolution with GM) give clean accuracy comparable with that of the reference network while achieving significantly higher adversarial accuracy. Moreover, we demonstrate that training a network with an evolved architecture using GM yields a classifier that achieves remarkably better adversarial accuracy than simply training a network with the same architecture as the reference network using GM, which highlights the contribution of the evolved architectures to the adversarial accuracy. The neuroevolution networks with GM maintain the adversarial accuracy even for the adversarial examples generated from other pretrained networks (ResNet-18, VGG [32], DenseNet [33], and SqueezeNet [34]), whereas the adversarial accuracy of other methods drops significantly due to the transferability of adversarial examples across hand-engineered network architectures [11,35]. To summarize, the experimental results strongly support our hypothesis that the architecture design of networks plays an important role in training networks dissimilar from the reference network and that our framework serves as a powerful tool for discovering such useful network architectures.
Related work

Transferability of adversarial examples
Adversarial examples generated to fool a certain network can be transferred to fool other networks. Extensive, though empirical, studies on such transferability have revealed numerous findings [1,2,3,4,5,6,7,8,9,10]. First, some studies attribute the transferability to similar decision boundaries, which suggests similarity of functions among the networks [1,5,9]. Second, complex networks are known to be more susceptible to transferability than simple networks [10], and the skip connection, which is essential in training very deep networks [31], can be utilized to cause higher transferability [35]. Third, it has been observed that networks with high transferability between them have aligned gradients [4,10] and that smoothing the gradients can increase transferability [36]. In contrast, some studies reduce transferability by decreasing the magnitude of input gradients [37,38] and by reversing their direction [23,24]. However, only a limited number of studies, all published very recently, take network architecture design into account to alleviate transferability [13,14,17,19,20]. Since the extent of transferability of adversarial examples between two models is observed to increase with their architectural similarity, we believe that architecture search has good potential for reducing transferability. This study provides a framework to efficiently find unique network architectures that reduce transferability and contributes to the emerging trend in the studies on transferability.

Architecture search
As datasets continue to grow, DNNs have become increasingly complex in order to learn important features for classification. However, it is not straightforward to hand-engineer an optimal DNN architecture for each task and dataset. Neural Architecture Search (NAS) algorithms are a line of methods that optimize the network architecture along with the standard parameter optimization [18,22,39]. One of the successful NAS algorithms is neuroevolution [21], which has been applied to many different tasks such as image classification, image detection, and reinforcement learning [22,40,41]. With neuroevolution, one can automatically produce unconventional but successful networks that compete with the best hand-engineered networks [22]. In some studies, neuroevolution is used for adversarial defense. In [14], neuroevolution finds networks that are robust against adversarial examples, although an exhaustive search in a large search space is required. Another work has employed a different NAS method called Differentiable Architecture Search (DARTS) [12] to find robust networks, which has been shown to be effective only against weak adversarial attacks [13]. Several studies have incorporated adversarial training into the architecture search, resulting in a less exhaustive search. For example, DARTS yields a more robust network when it is combined with adversarial training [17]. Another example is one-shot NAS [18], which includes adversarial training to find a family of robust networks [19,20]. Although adversarial training helps improve resistance to adversarial examples, it still incurs a significant computational cost, especially when it uses a strong adversarial attack method in the training. In this study, we instead propose to exploit input gradients through GM in the architecture search to find adversarially robust models. GM is more efficient than adversarial training and requires only a slight modification to the existing training process.

Input gradients
The gradients of the loss of a DNN with respect to the input, or input gradients of a DNN, have been utilized in various ways [3,23,36,37,38]. One of the useful applications of input gradients is to visualize the focused areas on an input image [42,43]. They have also been used to generate adversarial examples in standard methods such as the Fast Gradient Sign Method (FGSM) [1] and Projected Gradient Descent (PGD) [44]. Furthermore, the transferability of adversarial examples is enhanced by smoothing the input gradients of a network [36] or by averaging input gradients over an ensemble of networks [3,4,9]. A similar direction of the input gradients of two networks implies high transferability of adversarial examples between them [4,10]. This observation has recently been exploited in adversarial defense [23,24]. In [23], given a network, a new network with the same architecture is trained such that its input gradients are misaligned with those of the given network, which reduces transferability between them. In this paper, we exploit the input gradients in the same manner as in [23] to reduce the transferability of adversarial examples. The critical difference from their study is that, to maximize the gradient misalignment, we dynamically evolve networks through neuroevolution.

Method
Here, we describe the process of finding network architectures that reduce transferability of adversarial examples using neuroevolution with GM. Our idea is based on two empirical observations: (i) the extent of transferability of adversarial examples between networks increases along with their architecture similarity [7,11]; and (ii) networks with high transferability exhibit aligned input gradients [4,10]. We first describe GM to show how input gradients are exploited in the training to reduce transferability, and then present the overall neuroevolution with GM procedure to evolve networks that converge into functions different from the reference network.

Gradient misalignment
One of the observed characteristics of networks with high transferability is the alignment of their loss gradients with respect to the input, or input gradients [4,10]. Input gradients of networks pointing in the same direction imply similar decision boundaries, which makes an adversarial perturbation on an image effective on these networks. Therefore, changing the alignment of the input gradients can change the network decision boundaries such that an adversarial example cannot be simultaneously effective on networks whose input gradients point in opposite directions, as shown in Fig. 1. GM encourages a network to align its input gradients in the direction opposite to the input gradients of a reference network.
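Let $f_c$ and $f_r$ be the network to train and the reference network, respectively. Also, let $\ell(\cdot, \cdot)$ be the classification loss (e.g., cross-entropy loss). Given a dataset $D$, the GM loss $L_{GM}$, which measures how two input gradients are misaligned, is defined as the average cosine similarity of the input gradients of $f_c$ and $f_r$ as follows.

$$L_{GM} = \frac{1}{|D|} \sum_{(x, y) \in D} \frac{\langle \nabla_x \ell(f_c(x), y), \nabla_x \ell(f_r(x), y) \rangle}{\|\nabla_x \ell(f_c(x), y)\| \, \|\nabla_x \ell(f_r(x), y)\|} \tag{1}$$

where $|D|$ denotes the size of the dataset, $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors, and $\|\cdot\|$ is the Euclidean norm.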
Combining the GM loss with a standard cross-entropy loss, we can train a network to achieve high clean accuracy and low transferability from the reference network. Thus, the overall loss becomes

$$L = \frac{1}{|D|} \sum_{(x, y) \in D} \ell(f_c(x), y) + \lambda L_{GM} \tag{2}$$

where $\lambda$ is a hyperparameter. The hyperparameter $\lambda$ is carefully adjusted depending on the characteristics of a dataset. On one hand, if $\lambda$ is too large, the network fails to learn the correct labels of images because it simply focuses on input gradient misalignment. On the other hand, if $\lambda$ is too small, the training fails to encourage input gradient misalignment.
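As a minimal sketch of the loss described above, the following Python code computes the average cosine similarity between two sets of input gradients and adds it, weighted by λ, to a classification loss. It assumes the input gradients have already been obtained (in practice they would come from automatic differentiation) and are given as flat lists of floats. The function names and the small `eps` guard against zero-norm gradients are illustrative assumptions, not part of the original method description.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity of two flat gradient vectors (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def gm_loss(grads_c, grads_r):
    """Average cosine similarity over a dataset (or batch) of input gradients.

    grads_c: input gradients of the network being trained, one vector per sample.
    grads_r: input gradients of the reference network for the same samples.
    Lower values mean the gradients are more misaligned.
    """
    sims = [cosine_similarity(gc, gr) for gc, gr in zip(grads_c, grads_r)]
    return sum(sims) / len(sims)

def total_loss(ce_loss, grads_c, grads_r, lam):
    """Classification loss plus lambda-weighted GM term (assumed additive form)."""
    return ce_loss + lam * gm_loss(grads_c, grads_r)
```

With this sign convention, minimizing the total loss pushes the average cosine similarity toward -1, i.e., toward input gradients that point opposite to those of the reference network, while the classification term keeps the labels correct.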

Neuroevolution
The neuroevolution framework designed in this work borrows some ideas from the steady-state genetic algorithm [45,46]. Unlike typical neuroevolution, where at each generation the population is created from a batch of selected and mutated candidate networks, our framework replaces a candidate network at each generation only when it is outperformed by one of the children networks. With this idea, we can ensure that the fitness of each candidate network never degrades over the generations. We can also save computational cost because, unlike the typical method, it does not require a large population size to succeed. In particular, we keep only a few candidate networks (e.g., four or five networks) in a population and each network produces only two to three children. In our experiments, we empirically observed that such a small number of networks and children is sufficient to produce networks that substantially reduce transferability. Before neuroevolution with GM starts, a pretrained reference network is prepared, which serves as the baseline for the clean accuracy and adversarial accuracy improvement. It is also used to produce the adversarial counterparts of all the test set images for the adversarial accuracy evaluation.
The basic neuroevolution framework is composed of network mutation, training, fitness evaluation, and selection, which repeat through generations. Specifically, in our neuroevolution with GM method, at each generation, the following procedures are performed on each candidate network in the population:
Step 1. A candidate network in the population produces children by mutation. The children are trained with GM with respect to the reference network.
Step 2. Each child is evaluated for fitness and compared to the architecturally closest candidate network in the population.
Step 3. The children with better fitness replace their closest candidate network counterparts.
At initialization, the candidate networks are generated to have diverse architectures. Because there is a limited number of candidate networks in a population, it is important to make their architectures as diverse as possible. Having a diverse set of candidate networks lets neuroevolution explore more areas in the search space and find more potential solutions. To this end, we adopt a modified Spectrum-based Niching [45]. Each network is represented by a vector, called a spectrum, which contains the following convolutional neural network properties: number of convolutional blocks, number of pooling blocks, number of strided convolutional blocks, number of summation blocks, and number of concatenation blocks. The distance between two network architectures is measured using the Euclidean distance between their spectra. Formally, the distance between networks $N_1$ and $N_2$ is defined by

$$d(N_1, N_2) = \|\mathrm{spec}(N_1) - \mathrm{spec}(N_2)\|_2 \tag{3}$$

where $\mathrm{spec}(N_i)$ denotes the spectrum of network $N_i$.

Figure 2: The selection process is demonstrated with candidate network 1, which produces a child. The child is trained with GM and compared to the closest candidate network in terms of architecture using the network distance (i.e., Euclidean distance of spectra). If the fitness of the child is better than that of the closest candidate network, it replaces that candidate network in the population.

In Step 1, each candidate network is mutated by adding, editing, or deleting a convolutional block, a pooling block, or a strided convolutional block. In addition, a summation block, which sums the output channels of two prior blocks, and a concatenation block, which concatenates the output channels of two prior blocks, are added as skip connection blocks. In the summation block, if the two input blocks do not have the same number of output channels, the smaller output is padded with zeros to match the larger one before summing. To facilitate effective architecture search, aggressive mutation is employed by applying multiple mutations to a candidate network to produce a child.
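The spectrum-based network distance can be illustrated with a short Python sketch. It assumes, purely for illustration, that an architecture is encoded as a list of block-type strings; the names `SPECTRUM_KEYS`, `spectrum`, `network_distance`, and `closest_candidate` are hypothetical and not taken from the original implementation.

```python
import math
from collections import Counter

# Block types counted in the spectrum, in a fixed order (labels are assumptions).
SPECTRUM_KEYS = ("conv", "pool", "strided_conv", "sum", "concat")

def spectrum(blocks):
    """Count how many blocks of each type a network contains."""
    counts = Counter(blocks)
    return [counts[key] for key in SPECTRUM_KEYS]

def network_distance(blocks_a, blocks_b):
    """Euclidean distance between the spectra of two networks."""
    return math.dist(spectrum(blocks_a), spectrum(blocks_b))

def closest_candidate(child, population):
    """Index of the candidate closest to the child (first one wins on ties)."""
    distances = [network_distance(child, cand) for cand in population]
    return distances.index(min(distances))
```

For example, a child encoded as `["conv", "conv", "pool"]` has spectrum `[2, 1, 0, 0, 0]`, and its distance to `["conv", "pool", "sum", "concat"]` is the square root of 3. Note that the spectrum ignores block order and hyperparameters, so two different architectures can have a distance of zero.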
In Step 2, the children networks are trained with GM as explained in Section 3.1 and they are evaluated using a fitness function. The fitness is based on both the accuracy on clean images (i.e., clean accuracy) and the accuracy on the transferred adversarial examples (i.e., adversarial accuracy) in order not to reduce the transferability at the cost of clean accuracy. Specifically, the fitness of a network $N$ is defined by the minimum of the two accuracies,

$$F(N) = \min(A_{CL}, A_{AD}) \tag{4}$$

where $A_{CL}$ is the clean accuracy and $A_{AD}$ is the adversarial accuracy. By using the minimum of the clean accuracy and adversarial accuracy as the fitness, we ensure that the lower bound of the clean accuracy and adversarial accuracy never decreases over the course of evolution.

In Step 3, the spectrum of each child from every candidate network is obtained. Afterwards, the closest candidate network to a specific child is found by comparing the child spectrum to every candidate network spectrum and taking the one with the smallest network distance. The fitness values of the child and candidate network pair are compared. If the child has a better fitness, it replaces the candidate network in the population. This selection process using the spectrum is illustrated in Fig. 2.

Algorithm 1 Neuroevolution with GM
Input: s: size of population, n: number of children to produce, d: minimum network distance, g: number of generations
Output: F: evolved networks after neuroevolution
1: P = {N_1, N_2, ..., N_s} (initialize the population with s candidate networks N_1, ..., N_s with minimum network distance d)
2: for t = 1, 2, ..., g do
3:   for each N in P do
4:     {C_1, C_2, ..., C_n} = mutation(N)
5:     for j = 1, 2, ..., n do
6:       Train C_j with loss Eq. (2).
7:       N* = candidate network in P with the smallest network distance to C_j
8:       if F(N*) < F(C_j) then
9:         Replace N* with C_j.
10:      end if
11:    end for
12:  end for
13: end for
14: return P

The neuroevolution with GM method is summarized in Algorithm 1. Given the population size s, number of children n (of each candidate network), network distance value d, and number of generations g as inputs, a population P is initialized with candidate networks that have a minimum network distance of d to each other. In each of the g generations, each candidate network in P produces n children through aggressive mutation. Afterwards, each of the n children is trained with GM and its fitness is compared to that of the candidate network in the current P with the smallest network distance. If there are multiple candidate networks with the smallest network distance, the first closest network is used, although this is unlikely to occur due to the combination of aggressive mutation and spectrum-based niching. During the comparison, if the child has better fitness, it replaces that particular candidate network in the current P. At the end of the neuroevolution process, the population of evolved networks P is returned as output F. The evolved networks F are trained further for refinement.
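The steady-state selection loop described above can be sketched in Python as follows. This is an illustrative skeleton under stated assumptions, not the authors' implementation: `mutate`, `train_and_eval`, and `distance` are caller-supplied stand-ins for aggressive mutation, GM training plus evaluation (returning clean and adversarial accuracy), and the spectrum distance. The `toy_*` helpers are hypothetical placeholders that make the sketch runnable without training any network.

```python
import random

def fitness(clean_acc, adv_acc):
    """Fitness is the minimum of clean and adversarial accuracy."""
    return min(clean_acc, adv_acc)

def evolve(population, mutate, train_and_eval, distance, n_children, generations):
    """Steady-state neuroevolution: a child replaces its closest candidate
    only if the child's fitness is strictly higher."""
    population = list(population)
    scores = [fitness(*train_and_eval(net)) for net in population]
    for _ in range(generations):
        for i in range(len(population)):
            # Produce all children of the current candidate first.
            children = [mutate(population[i]) for _ in range(n_children)]
            for child in children:
                child_score = fitness(*train_and_eval(child))
                dists = [distance(child, cand) for cand in population]
                k = dists.index(min(dists))  # first closest candidate on ties
                if scores[k] < child_score:
                    population[k] = child
                    scores[k] = child_score
    return population, scores

# --- Hypothetical toy stand-ins (a "network" is just its 5-entry spectrum) ---
_rng = random.Random(0)

def toy_mutate(net):
    """Apply four random +/-1 changes to the spectrum (aggressive mutation)."""
    net = list(net)
    for _ in range(4):
        j = _rng.randrange(len(net))
        net[j] = max(0, net[j] + _rng.choice((-1, 1)))
    return tuple(net)

def toy_train_and_eval(net):
    """Fake (clean accuracy, adversarial accuracy) computed from the spectrum."""
    clean = 1.0 / (1.0 + abs(sum(net) - 8))
    adv = 1.0 / (1.0 + abs(net[3] + net[4] - 2))
    return clean, adv
```

Because a slot in the population is overwritten only by a strictly fitter child, the fitness stored for each slot can never decrease from one generation to the next, which is the property the steady-state design relies on.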

Experiments
NE+GM networks and baselines. In the experiments, we demonstrate that, by exploiting neuroevolution (NE) and GM, the proposed framework can produce architectures that attain both high clean accuracy and high robustness against adversarial examples transferred from the reference network. We compare four networks: (i) two NE+GM networks, which are the top two networks produced by our framework; (ii) the reference network, which is the network used to generate adversarial examples; and (iii) the reference network with GM, which is a network with the architecture of the reference network trained with GM. The networks (ii) and (iii) are referred to as baseline networks. In Section 4.2, we introduce a hand-engineered network as another baseline.

Implementation details. In the experiments, the NE population is initialized with four or five candidate networks that are randomly designed to have a network distance of at least four to each other. The batch size is fixed to 128 and the other hyperparameters (e.g., learning rate) are kept at the default settings of the PyTorch library [47] without any fine-tuning to establish an objective comparison between the baseline networks and NE+GM networks. To obtain NE+GM networks, NE runs for 50 generations and in every generation, each candidate network produces two children networks using aggressive mutation (i.e., four or five mutations are applied to produce a child). The children are trained with GM for 20 epochs and the number of epochs increases by five every 10 generations to account for the network complexity that grows with every generation. This implementation also helps decrease training cost and time, considering that simple networks can converge within a small number of epochs. After completing the whole process of NE, the candidate networks in the final population are given as NE+GM networks after they are refined by additional training for another 1,000 epochs. This ensures that the clean and adversarial accuracies of NE+GM networks converge to stable values.
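The epoch schedule for training children can be written as a small Python helper. The text states only that training starts at 20 epochs and increases by five every 10 generations; the exact boundary (generations 1 to 10 use 20 epochs, 11 to 20 use 25, and so on) is an assumption made here, as are the function names.

```python
def epochs_for_generation(generation, base_epochs=20, step=5, interval=10):
    """Training epochs for children produced in a given (1-indexed) generation."""
    if generation < 1:
        raise ValueError("generation must be >= 1")
    return base_epochs + step * ((generation - 1) // interval)

def total_child_epochs(generations, candidates, children_per_candidate):
    """Total training epochs spent on children over the whole search."""
    per_generation = candidates * children_per_candidate
    return sum(per_generation * epochs_for_generation(g)
               for g in range(1, generations + 1))
```

Under this assumed schedule, a run of 50 generations with four candidates and two children each would spend 12,000 child-training epochs in total, before the final refinement of the surviving networks.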
Outline of experiments. We demonstrate the effectiveness of the proposed framework through two experiments: (i) a full-dataset experiment and (ii) a reduced-dataset experiment. In experiment (i), we use the full CIFAR-10 [28], MNIST [29], and KMNIST [30] datasets and ResNet-18 [31], VGG [32], DenseNet [33], and SqueezeNet [34] as network models. We evaluate NE+GM networks and baselines using the datasets as well as adversarial examples generated by the reference network and the aforementioned four network models. We demonstrate that the NE+GM networks notably outperform the baseline networks in terms of adversarial accuracy while maintaining a good clean accuracy. We are also interested in whether the proposed framework remains effective for small datasets because there exist various applications where only limited data are available [41,48,49,50,51]. To this end, in experiment (ii), we use significantly reduced versions of the aforementioned datasets and architecturally small hand-engineered networks as baseline networks to compensate for the limited dataset. We show that even when the dataset size is limited, the proposed framework can find good architectures that attain high clean accuracy and good adversarial accuracy as in the full-dataset experiment. In experiment (ii), we also perform additional analysis to confirm two observations. First, the architectures of NE+GM networks evolved on a limited dataset are still useful even when they are trained on other datasets. Second, NE+GM networks outperform the reference network even when it is combined with several standard defense methods.

... using PGD with the $L_\infty$ norm ($L_\infty$-PGD). Table 1 reports the clean accuracy and adversarial accuracy of the networks of the proposed method (NE+GM networks) and the baselines. All of the NE+GM networks have slightly higher clean accuracy than the reference network. Notably, the NE+GM networks show a remarkable increase in adversarial accuracy; they outperform the reference network by 31%. It is worth noting that the NE+GM networks also outperform the reference network with GM by at least 11%, while maintaining a good clean accuracy. This indicates that NE+GM successfully discovers robust network architectures which make it easier to misalign input gradients by GM, resulting in lower transferability of adversarial examples from the reference network.
The effects of GM on networks can be visually explained using integrated gradients [43]. In Fig. 3, we provide the input gradient value on each pixel for three networks: the reference network, the reference network with GM, and the NE+GM network. Red pixels and green pixels indicate the negative and positive directions, respectively. One can see inside the upper right blue box that the color of the pixels in the reference network becomes the opposite in the reference network with GM and the NE+GM network. The change in color indicates the reversal in the direction of the input gradients. The same shift in input gradient direction can also be observed in the other boxes. This implies that GM can indeed encourage networks to have misaligned input gradients with respect to the reference network. Furthermore, when the total pixel distance from the reference network is calculated, the NE+GM network has a higher value than the reference network with GM (1.297 versus 0.687). Hence, the proposed framework can find architectures that misalign input gradients more than the architecture of the reference network does.

In contrast, the NE+GM network, even with skip connections, is specifically designed to have a distinct architecture that reduces transferability from the reference network. Therefore, it can reduce transferability from networks similar to the reference network and also from networks that do not have skip connections.

Results in other datasets
Here, the NE+GM network produced in Section 4.1.1 is retrained on the full MNIST and KMNIST datasets (i.e., trained from scratch) independently, and it still achieves notable clean accuracy and adversarial accuracy on each dataset. As seen in Table 3, the NE+GM network shows better clean and adversarial accuracies, which agrees with the results in Table 1. Despite not evolving a new NE+GM network specialized to these datasets, the NE+GM network still achieves higher accuracies than the reference network with GM in all comparisons. We attribute this performance to the NE+GM architecture having been created for the full CIFAR-10 dataset, which provided abundant features to learn from. Consequently, the well-developed NE+GM network architecture can adjust to the features of different datasets (i.e., MNIST and KMNIST) and perform well even though it was not designed for them.

Clean accuracy and adversarial accuracy
In this section, we mainly discuss the results on the reduced CIFAR-10 dataset. The results on the other datasets are reported in Section 4.2.4. To obtain the reduced CIFAR-10 dataset, the full CIFAR-10 training set and test set are cut down to 10,000 training images (1,000 images per label) and 1,000 test images (100 images per label), respectively. Accordingly, we generate the adversarial examples for each image in the reduced test set using the reference network and $L_\infty$-PGD. Here, the reference network is a simple hand-engineered network because a limited dataset cannot effectively train a deep network such as ResNet-34. In addition to the baseline networks (reference network trained with and without GM), a different simple hand-engineered network with GM is introduced as an additional network for comparison. The simple hand-engineered networks are each composed of convolutional blocks, pooling blocks, and skip connections, with no more than ten blocks in total (see Appendix A). As shown in Table 4, all of the networks trained with GM achieve better clean accuracy than the reference network. In particular, the NE+GM networks have better clean accuracy than all the other networks, which confirms the ability of the proposed framework to discover better network architectures. It is also worth noting that while the NE+GM networks and the reference network with GM have comparable adversarial accuracies, these adversarial accuracies are higher than their clean accuracies. This result is counter-intuitive, but the same behavior is not observed in the full-dataset experiment. Thus, we consider this behavior to be specific to reduced datasets. Finally, the higher adversarial accuracy of the NE+GM networks compared to the hand-engineered network+GM shows that the extent of transferability can be reduced by a properly designed network architecture.

The impact of the choice of adversarial attack methods
Here, we examine the impact of the choice of attack method on the performance of the baseline networks and the NE+GM networks produced in Section 4.2. We use PGD with the L∞ norm (L∞-PGD), PGD with the L2 norm (L2-PGD), and FGSM to generate adversarial examples on the reduced CIFAR-10 test set. We employ the fooling rate as the evaluation metric because, with limited data, the clean accuracy tends to be relatively low, and the fooling rate quantifies the behavior of networks under adversarial examples more clearly: it is the ratio of images whose labels change when adversarially attacked. For each adversarial attack method, the fooling rate of every baseline and NE+GM network is calculated as follows. First, we extract all clean images that are correctly classified by both the reference network and the network being tested. Next, the extracted images are attacked using the reference network and the given attack method to generate adversarial examples. Finally, the adversarial examples are evaluated by the network being tested. The results reported in Table 5 show that the fooling rates under L∞-PGD agree with the adversarial accuracy results in Table 4, as expected. Moreover, L2-PGD is known to produce weaker adversarial examples than L∞-PGD, and training the networks with GM yields a lower fooling rate under L∞-PGD than under L2-PGD because the input gradients of the reference network are more vulnerable to L∞-PGD. In other words, the stronger the adversarial attack is, the more effective GM becomes in reducing transferability. Nevertheless, among the networks, the NE+GM network still has the lowest fooling rate under L2-PGD. FGSM is the weakest of the three attacks. Thus, the FGSM perturbation level is adjusted to create stronger attacks, at the cost of more visible image perturbations, for easier comparison between networks. Unexpectedly, the hand-engineered network+GM has the lowest fooling rate under FGSM. One reason it has a lower fooling rate than the NE+GM network is that the adversarial examples used during the NE evolution are generated using L∞-PGD. Even so, the NE+GM networks are still better than the reference network with GM.
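The fooling-rate computation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: predictions are given as integer label arrays, the function and argument names are hypothetical, and adversarial predictions are supplied for every image and then masked, which gives the same ratio as attacking only the extracted images. An image counts as fooled when the network under test no longer outputs the true label.

```python
import numpy as np


def fooling_rate(y_true, ref_clean_pred, net_clean_pred, net_adv_pred):
    """Fraction of jointly correct clean images whose label changes under attack.

    ref_clean_pred / net_clean_pred: predictions on clean images by the
    reference network and by the network under test.
    net_adv_pred: predictions of the network under test on adversarial
    examples crafted on the reference network.
    """
    y_true = np.asarray(y_true)
    # Keep only images that both networks classify correctly when clean.
    both_correct = (np.asarray(ref_clean_pred) == y_true) & (
        np.asarray(net_clean_pred) == y_true
    )
    if not both_correct.any():
        raise ValueError("no image is correctly classified by both networks")
    # Since these images were correct when clean, a wrong label now means
    # the prediction changed under the attack.
    fooled = np.asarray(net_adv_pred)[both_correct] != y_true[both_correct]
    return float(fooled.mean())


# Toy example with six images (all values are made up for illustration).
y = np.array([0, 1, 2, 3, 4, 5])
ref_clean = np.array([0, 1, 2, 3, 4, 0])  # reference is wrong on image 5
net_clean = np.array([0, 1, 2, 3, 0, 5])  # tested network is wrong on image 4
net_adv = np.array([0, 2, 2, 1, 4, 5])    # images 1 and 3 change label
rate = fooling_rate(y, ref_clean, net_clean, net_adv)  # 2 of 4 images fooled
```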

Comparison with standard adversarial defense
Here, we show that the NE+GM networks can perform better than the reference network combined with several adversarial defense methods. We employ two image filtering methods (i.e., JPEG compression [52] and bilateral filtering [53]) and two adversarial training methods (i.e., free adversarial training [15] and fast adversarial training [16]) as the defense methods. Again, the fooling rate is used for evaluation. In contrast to Section 4.2.2, the adversarial examples here are obtained by extracting the clean images that are correctly classified by the reference network only and generating their corresponding adversarial examples using L∞-PGD. The same adversarial examples are used for all the networks tested. Note that the fooling rate defined here is modified to accommodate image filtering methods, which cannot classify images by themselves. For the image filtering and adversarial training methods, we use the reference network. The results in Table 6 show that image filtering techniques do not offer as strong an adversarial defense as the NE+GM network because they are passive procedures. The advantage of our proposed NE+GM method is that it actively searches for architectural solutions, which makes it effective even against a strong adversarial attack such as L∞-PGD. Fast and free adversarial training are good adversarial defense techniques. However, due to the limited dataset employed, balancing clean accuracy and adversarial accuracy with adversarial training produces fooling rates that are notably higher than those of the reference network with GM and the NE+GM networks.
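For readers unfamiliar with the attack, the L∞-PGD procedure used to craft adversarial examples on the reference network can be sketched as below. This is a generic illustration, not the configuration used in the experiments: a linear softmax classifier stands in for the reference network so that the input gradient has a closed form, and the values of `eps`, `step`, and `iters`, the random start, and all function names are assumptions.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_grad_wrt_input(x, y, W, b):
    """Gradient of the cross-entropy loss w.r.t. the input of a linear softmax model."""
    p = softmax(x @ W + b)              # shape (n, classes)
    p[np.arange(len(y)), y] -= 1.0      # dL/dlogits = p - onehot(y)
    return p @ W.T                      # shape (n, features)


def pgd_linf(x, y, W, b, eps=8 / 255, step=2 / 255, iters=10, seed=0):
    """L-infinity PGD: signed gradient ascent steps projected onto the eps-ball."""
    rng = np.random.default_rng(seed)
    # Random start inside the eps-ball (an assumption of this sketch).
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(iters):
        g = loss_grad_wrt_input(x_adv, y, W, b)
        x_adv = x_adv + step * np.sign(g)          # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid pixel range
    return x_adv


# Toy data: 16 "images" with 32 features in [0, 1] and 10 classes.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(0.0, 1.0, size=(16, 32))
y = data_rng.integers(0, 10, size=16)
W = data_rng.normal(size=(32, 10))
b = np.zeros(10)
x_adv = pgd_linf(x, y, W, b)
```

With a deep network, `loss_grad_wrt_input` would be replaced by automatic differentiation; the step and projection logic stays the same.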

Results in other datasets
In this experiment, we show that by retraining the NE+GM network produced in Section 4.2 on other reduced datasets, we can attain good clean accuracy and adversarial accuracy with the retrained networks. We use the MNIST, FMNIST [54], and KMNIST datasets and reduce them to the same training and test set sizes as the reduced CIFAR-10. As seen in Table 7, the reference network has the best clean accuracies on MNIST and FMNIST in exchange for the worst adversarial accuracy. In contrast, both the reference network with GM and the NE+GM network manage to balance clean accuracy and adversarial accuracy, except for the MNIST result, in which the NE+GM network has a notably higher adversarial accuracy and a slightly lower clean accuracy than the reference network with GM. We attribute the performance of the NE+GM network to its architecture having been designed for the reduced CIFAR-10, which has limited size and features. Unlike the NE+GM network in Table 3, whose architecture is well-developed, the NE+GM network here is relatively unrefined.

Conclusion
In this paper, we addressed, from an architectural perspective, the problem of discovering DNN classifiers that have low transferability with respect to adversarial examples of other models. We proposed a method that evolves DNN architectures to find such networks. In particular, we employed NE to produce unconventional network architectures and GM to encourage input gradients to be misaligned across networks so that dissimilar functions are learned, which is expected to lead to low transferability. In the experiments, the proposed method successfully discovered networks with the expected properties: they achieved clean accuracy comparable to that of the reference network as well as significantly lower transferability of the adversarial examples generated from the reference network. Importantly, the networks discovered by the proposed method outperformed the reference network even when it was trained with GM. The discovered networks were also robust to adversarial examples of other models (ResNet-18, VGG, DenseNet, and SqueezeNet), whereas the reference network (ResNet-34), trained with or without GM, was vulnerable to them.
These results indicate that evolving network architectures played an important role in reducing the transferability. We consider that our study contributes to a fundamental problem of training adversarially robust classifiers, particularly from the transferability and novel network architecture perspective, which is a promising avenue to explore.

A Simple hand-engineered networks
In Fig. 4, we describe the architecture details of the two simple hand-engineered networks used with the reduced dataset. The network in (a) is employed as the reference network. It has three convolutional blocks, three skip connection blocks (i.e., summation and concatenation blocks), and two fully-connected blocks. Meanwhile, the network in (b) is utilized as an additional baseline network. It has four convolutional blocks, one pooling block, two skip connection blocks, and two fully-connected blocks.
Figure 3: The integrated gradients [43] of the reference network, the reference network with GM, and the NE+GM network show the focus areas of each network using relative values in the range [−1, 1]. By contrasting the colors at the same pixel locations, one can see that the focus areas of the reference network with GM (lower left) and the NE+GM network (lower right) are both dissimilar to that of the reference network. The total pixel distance of the NE+GM network from the reference network (1.297) is higher than that of the reference network with GM (0.687). This indicates that the input gradient of the NE+GM network is more dissimilar from that of the reference network.
Figure 4: Architecture details of the simple hand-engineered networks for the reduced dataset. The network in (a) is the simple hand-engineered network used as the reference network, and the network in (b) is the simple hand-engineered network used as an additional baseline network.