Orthogonal Deep Models As Defense Against Black-Box Attacks

Deep learning has demonstrated state-of-the-art performance for a variety of challenging computer vision tasks. On one hand, this has enabled deep visual models to pave the way for a plethora of critical applications like disease prognostics and smart surveillance. On the other, deep learning has also been found vulnerable to adversarial attacks, which calls for new techniques to defend deep models against these attacks. Among the attack algorithms, the black-box schemes are of serious practical concern since they only need publicly available knowledge of the targeted model. We carefully analyze the inherent weakness of deep models in black-box settings where the attacker may develop the attack using a model similar to the targeted model. Based on our analysis, we introduce a novel gradient regularization scheme that encourages the internal representation of a deep model to be orthogonal to another, even if the architectures of the two models are similar. Our unique constraint allows a model to concomitantly endeavour for higher accuracy while maintaining near orthogonal alignment of gradients with respect to a reference model. Detailed empirical study verifies that controlled misalignment of gradients under our orthogonality objective significantly boosts a model's robustness against transferable black-box adversarial attacks. In comparison to regular models, the orthogonal models are significantly more robust to a range of $l_p$ norm bounded perturbations. We verify the effectiveness of our technique on a variety of large-scale models.


I. INTRODUCTION
Deep learning has enabled machines to achieve human-level performance in numerous computer vision tasks, including image classification [1], [2], object detection [3], [4], semantic segmentation [5], [6] and image captioning [7]. However, despite their impressive accuracy, deep models are vulnerable to adversarial inputs [8]. These inputs, a.k.a. adversarial examples, are crafted by a careful manipulation of the original input signals. The resulting adversarial examples appear natural to humans but can completely alter the output of a deep model. This intriguing brittleness of deep learning is currently being actively investigated by the research community [9].
Recent years have seen numerous algorithms to compute adversarial perturbations to inputs that can fool deep models. These techniques can be broadly categorized into gradient-based and gradient-free schemes [9]. Generally, the gradient-based approaches accumulate the raw gradients of a model w.r.t. input via iterative traversal of the model's loss surface. These gradients are subsequently used for targeted or non-targeted fooling of the model. Gradient-free techniques craft adversarial examples in a local gradient agnostic manner, e.g., by iteratively taking feedback from model predictions and exploiting it for model fooling [10]-[12]. It is claimed that gradient-based techniques pose a serious threat to deep learning [13].
Gradient-based perturbations exhibit high fooling rates under the white-box setting, where complete knowledge of the targeted model is available. Surprisingly, these perturbations also transfer well to models under the more pragmatic black-box setting, where no knowledge of the target model is available. This phenomenon enables the attackers to craft malicious examples using a surrogate model [14]. Such insidious nature of perturbations has raised serious concerns for the practical deployment of deep visual models in sensitive applications like self-driving cars, medical diagnosis, face recognition and many others.
The pivotal role of gradients in transferable adversarial examples naturally leads to the intuition of changing gradient relations among different models to induce immunity against attacks. Gradients of a model depend on its internal architectural layers that govern the flow of data from input to output. This inherent dependency of the gradients over the architecture allows the networks of varying architectures to have somewhat different gradients regardless of similar functional objective [14]. However, different models with the same architecture may still have highly correlated gradients.
Previous studies [14]-[16] demonstrate that perturbations generated from a model maximally transfer to models with similar architectures. We argue that the underlying similarity of gradients facilitates this phenomenon. This intuition is based on the fundamental dependency of perturbations on model gradients, which allows the proximity in gradients of different models to manifest in their respective perturbations. Models are, therefore, more susceptible to perturbations that are generated from a similar architecture. In this paper, we propose to mitigate the vulnerability of deep models to transferable (black-box) attacks by decreasing the correlation of model gradients.
For empirical investigation of the above arguments, we introduce a framework that allows explicit control over model gradients with respect to a reference model. Figure 1 illustrates the architectural view of our scheme. It highlights the novel gradient-based constraint that allows a given model (below) to strive for higher classification accuracy along with orthogonal alignment of its gradients to a reference model (top). In order to remove the disparity between the gradients due to architectural differences, we keep the architectures similar and study the impact of gradient similarity over adversarial transferability.
Fig. 1: Schematics of our framework to train orthogonal models. On the top, a trained reference model with frozen parameters is used for the computation of reference gradients. The model to be trained, shown at the bottom, exploits the reference gradients to align its gradients in an orthogonal manner. Control over the direction of gradients is explicitly enabled through our novel similarity loss objective.
arXiv:2006.14856v1 [cs.LG] 26 Jun 2020
Our detailed analysis based on a systematic treatment of the problem demonstrates that the robustness of models against transferable attacks increases as the alignment between the model gradients decreases. Our experiments establish the enhanced immunity of a variety of models trained over CIFAR-10 and ImageNet ILSVRC 2012 datasets against transferable attacks using our technique. The enhanced robustness is significantly higher for the imperceptible range of perturbations. For instance, robustness of the VGG-11 (CIFAR-10) model improves by 36.7% against FGSM, 67.8% against PGD, 66.8% against I-FGSM and 55.8% against the MI-FGSM generated perturbations with $\epsilon$ = 0.05.
The major contributions of this work are summarized as:
• We propose a unique adversarial defense that is based on the orthogonality of deep models w.r.t. a reference model.
• We propose a novel gradient regularization scheme that enables a model to adjust the orientation of its gradients.
• Our systematic empirical study analyzes the correlation between gradient similarity and adversarial transferability for independently trained models of similar architectures. It establishes that transferability of adversarial attacks between the models reduces as dissimilarity between their gradients increases.

II. RELATED WORK
Adversarial perturbations to inputs are primarily investigated in the related literature along the lines of attacking deep models and defending them against adversarial attacks [9]. We first discuss the key contributions along these lines and then also describe a few recent attempts that go beyond the adversarial perspective of input perturbations.

A. Adversarial attacks
Additive adversarial noise that can arbitrarily flip the prediction of a model made its first appearance in the seminal work of Szegedy et al. [8]. This work resulted in the development of numerous techniques to attack deep visual models. Goodfellow et al. [17] devised the Fast Gradient Sign Method (FGSM) to craft adversarial perturbations in a single gradient ascent step over the model's loss surface for the input. Later, Kurakin et al. [18] advanced this scheme by introducing a multi-step version called Iterative FGSM (I-FGSM). In continuation, Dong et al. [15] introduced momentum variables in the iterative traversal of the loss surface (MI-FGSM) and demonstrated enhanced transferability of adversarial perturbations. Similarly, Xie et al. [16] further improved I-FGSM and MI-FGSM by incorporating random differentiable transformations like resizing and padding of the input image. Their technique is called Diverse Input I-FGSM (DI$^2$-FGSM). Further instances of the follow-up iterative algorithms are Variance-Reduced I-FGSM (vr-IGSM) [19] and PGD [13].
The above-mentioned algorithms and other recent works [20]-[25] compute image-specific adversarial perturbations which appear as insignificant noise to the human eye but completely confuse the models. Moosavi-Dezfooli et al. [26] first demonstrated the possibility of fooling deep models simultaneously on a large number of images with Universal Adversarial Perturbations. Later, Akhtar et al. [27] devised a label-universal technique to fool a model on an entire category of object in a targeted manner. It is now well-established that adversarial examples computed by most techniques also transfer well across different deep learning models. This insidious nature of adversarial perturbations is considered a serious threat to practical deep learning [9] and is fueling a very high level of research activity in this area.
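As a minimal sketch of the single-step and iterative attacks described above, the following Python snippet applies them to a toy logistic-regression classifier whose input gradient is available in closed form. The toy model, the function names and all parameter values are illustrative assumptions and are not taken from the cited implementations; clipping to a valid pixel range is omitted for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad = [(p - y) * wi for wi in w]  # dL/dx for logistic regression
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """Single gradient-ascent step of size epsilon along the sign of the input gradient."""
    _, g = loss_and_input_grad(w, b, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, g)]

def iterative_fgsm(w, b, x, y, epsilon, steps=10, mu=0.0):
    """mu = 0 gives an I-FGSM-style attack; mu > 0 adds MI-FGSM-style momentum."""
    alpha = epsilon / steps
    x_adv = list(x)
    momentum = [0.0] * len(x)
    for _ in range(steps):
        _, g = loss_and_input_grad(w, b, x_adv, y)
        l1 = sum(abs(gi) for gi in g) or 1.0
        momentum = [mu * mi + gi / l1 for mi, gi in zip(momentum, g)]
        x_adv = [xa + alpha * sign(mi) for xa, mi in zip(x_adv, momentum)]
        # project back into the l_inf ball of radius epsilon around the clean input
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical model and sample, chosen only for illustration.
w, b, x, y = [2.0, -1.0, 0.5], 0.1, [0.5, 0.2, 0.8], 1
clean_loss, _ = loss_and_input_grad(w, b, x, y)
fgsm_loss, _ = loss_and_input_grad(w, b, fgsm(w, b, x, y, 0.1), y)
print(clean_loss < fgsm_loss)
```

In a black-box transfer attack, the same routines would be run on a surrogate model and the resulting inputs passed to a different target model.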
1) Black-Box VS White-Box: Adversarial attacks are broadly divided into white-box and black-box schemes [9]. This segregation is based on the amount of information available to the attacker to craft the perturbations.
In white-box attacks, the attacker assumes complete knowledge of the targeted model including its architecture, parameter values, training methods as well as the training data. The previously discussed schemes [13], [15]-[19], [26] fall under this category. These techniques craft the perturbations based on the image and model specific gradients and, therefore, require full access to the model parameters as well as its architecture. Usually such attacks incur high fooling rates and are further differentiated by the number of iterations, choice of projection method and norm constraint to quantify the perturbation magnitude [9].
In practice, intricate knowledge about the deployed model is seldom available. Therefore, attackers treat the target model as a black-box and utilize limited information made available in the public domain [28]. These black-box attacks are broadly crafted either via 'query feedback' or 'transferable attack' strategy [29]. In the query-based schemes, the attacker crafts the perturbations by repeatedly querying the targeted model [12] and analysing the output. On the other hand, in transferable attack schemes, the attacker trains a substitute model to emulate the targeted model and then learns perturbations for the substitute model in a white-box setting. Such perturbations are known to fool models that are of different architectures and trained with different datasets [21], [28], [30]. Protection against the transferable schemes is considerably challenging as these attacks exploit similar generalization capability of models [31].

B. Adversarial defenses
A variety of techniques have also been proposed to counter the adversarial attacks [32]-[34]. These schemes aim to defend deep models against both image-specific [34] as well as image-agnostic perturbations [33]. Broadly, the defense schemes are developed along four major lines.
Adversarial Training: These schemes aim to boost the robustness of models by specialized augmentation of the training dataset [8], [9], [13], [17], [20], [35], [36]. Adversarial examples are computed using the strongest available techniques and the model undergoes training on these examples along with the clean samples. Madry et al. [13] have systematically studied adversarial training over a variety of models and datasets. Recently, they publicly released robustly trained models. Interestingly, besides the significant performance degradation of these models, adversarial examples can still be computed for such models [20].
Network Modifications: Unlike adversarial training, these techniques rework some inner aspects of a model. This generally includes introduction of new layers and enhancement of the loss function via regularization of the gradients. For instance, auto-encoders have received special attention among the studies that introduce additional layers. Gu et al. [37] and Bai et al. [38] have explored auto-encoders to mitigate the adversarial noise in the spatial domain. Similarly, [39] introduced a masking layer to clean adversarial noise at the highest feature layer. Gradient regularization schemes [36], [40]-[42] are normally based on the observation that adversarial perturbations have smaller norms. Therefore, penalizing the degree of variation of output with respect to input can assist in detecting adversarial noise.
Network Add-ons: External networks have also been explored to aid in defense against adversarial attacks. Shen et al. [43] re-purposed generative adversarial networks (GANs) to rectify the perturbed image, while Lee et al. [44] customized them to generate adversarial examples. The generated examples along with clean samples are utilized to train the ordinary models. The latter setup has a strong resemblance to adversarial training; however, the fixed presence and larger role of the additional generator places this scheme in this category [9]. Similarly, Akhtar et al. [33] proposed a perturbation rectifying network (PRN). This sub-network pre-processes the input images before passing them on to the classification model. Interestingly, training of the PRN layers does not modify the weights of the classification model, and the PRN acts as an effective defense against universal adversarial perturbations.
Input Transformations: Apart from the above-mentioned schemes, the line of work based on inherent brittleness of adversarial patterns has recently gained momentum. These schemes are based on the observation that perturbed inputs do not stay adversarial in the presence of simple geometric transformations. Among the transformation-based defenses, Pixel deflection [45], BaRT [46] and [47] are notable.
An observation common among the mentioned studies is the general deterioration of classification accuracy in the presence of any defense strategy. It has been demonstrated by Carlini et al. [48], [49] and later by Athalye et al. [50] that it is often possible to break the adversarial defenses by stronger adversarial attacks. Among all the defenses, adversarial training is considered to be the strongest. However, it raises the computational budget considerably along with significant performance degradation [13].

C. Beyond attacks and defenses
Recently, a handful of works [27], [36], [51] have demonstrated the usefulness of perturbations beyond the simple task of fooling. In this respect, [51] and [36] showed that the perturbations generated for adversarially robust classifiers manifest the visual semantics of the target class. Interestingly, Santurkar et al. [51] cast a number of computer vision applications as adversarial attacks over the robust models. These include image generation, inpainting, interactive image manipulation and image translation. However, the presented visual results are far from acceptable. More recently, Jalwana et al. [52] demonstrated that attacks can also be a useful tool for model explanation. However, the authors still advocate the need for defense techniques in adversarial settings.

D. Orthogonality In Deep Models
Recently, the principle of orthogonality has been explored in the context of deep models. Zhang et al. [53] explored the orthogonal projections of features to devise a capsule projection network. Their work improves the classification accuracy of standard architectures over a variety of benchmark datasets. Similarly, [54], [55] have contributed towards the orthogonal regularization of model parameters. Jia et al. [55] demonstrated that besides the enhanced accuracy, these models also have a superior natural resistance to common perturbations like blur, weather, digitization etc. without any explicit training. However, their work did not explore the robustness towards engineered adversarial noise. In the next Section, we introduce and discuss our proposed technique for learning orthogonal models. Unlike existing works, our definition of orthogonality is based on the gradients rather than the parameters.

III. PROPOSED APPROACH
The prime objective of our technique is to misalign the gradients of a given model w.r.t a reference model. First, we introduce a metric that quantifies similarity between two models. We then reformulate the training objective to allow explicit control over the similarity. Detailed discussion on the proposed algorithm is presented before analyzing the qualitative aspects of the gradient disparity on model decisions.
We define the relative orientation of gradients as a measure of their similarity ('δ'). Similarity between two normalized gradients $g_1$ and $g_2$ is computed as the cosine of their mutual angle. For mathematical ease, the multi-dimensional gradients are cast as $d$-dimensional vectors. The metric, as given in Equation 1, has resemblance to identifying correlation between vectors.
$$\delta(g_1, g_2) = \frac{g_1 \cdot g_2}{\|g_1\|_2 \, \|g_2\|_2} \qquad (1)$$
where '$\cdot$' denotes the vector dot product. We briefly discuss the commonly used training of ordinary models before introducing our enhanced loss function for orthogonal models. For a distribution $D$ over images $i \in \mathbb{R}^d$ with corresponding labels $l \in [K]$ and selected loss function $L(\theta, i, l)$, the goal of ordinary training is to estimate the model parameters $\theta$ that minimize the empirical risk
$$\wp = \mathbb{E}_{(i,l) \sim D}\big[L(\theta, i, l)\big] \qquad (2)$$
where $\mathbb{E}[\cdot]$ is the Expectation operator. In this setting, gradients have complete liberty to follow any trajectory that leads to minimal risk or maximum accuracy. Nevertheless, with similar training data, architecture and hyper-parameter settings, gradients of different models induced by different initializations of parameters still have directional similarities. This can also be true for the models with slight architectural variations, which generally leads to good transferability of adversarial attacks across different models.
We address this issue by allowing control over the gradient orientation of a model during its induction. To that end, we enhance the commonly used training objective with the help of the similarity metric defined in Equation 1. For a given reference model with parameters $\theta$ trained over the distribution $D$, we train a misaligned model with parameters $\theta^*$ by minimizing the empirical risk mentioned below
$$\wp^* = \mathbb{E}_{(i,l) \sim D}\big[L(\theta^*, i, l) + \gamma \, \delta(g_{\theta^*}, g_{\theta})\big] \qquad (3)$$
where $g_{\theta^*}$ and $g_{\theta}$ are the normalized gradients of the loss w.r.t. the input for the model being trained and the reference model, respectively, and the hyper-parameter 'γ' (written 'λ' in Algorithm 1 and in our experiments) enables explicit control over the gradient orientation during minimization. As the value of this parameter increases, the gradients tend towards becoming orthogonal to those of the reference model with minimal loss of accuracy.
Before a detailed discussion of the proposed algorithm, we briefly compare it with the closest available scheme of Kariyappa et al. [56]. Similar to our work, they explored the role of gradient disparity in promoting adversarial robustness. However, their algorithm modifies the model gradients with strict adherence to linear dependency of the gradients.
This contrasts with the central theoretical argument of our method, i.e., the linear independence of gradients. Moreover, their formulation is particularly tailored to the robustness of model ensembles that requires the specific modification of all the nonlinearities that are present in the models. Our novel scheme improves the adversarial robustness of an individual model via novel orthogonal gradient regularization without the need of any architectural changes. The details of our scheme are presented next.
Algorithm 1 (excerpt, lines 11-16):
11: if epoch % 20 = 0 then
12: …
13: Acc$_t$ ← Find Accuracy on D
14: end if
15: end while
16: return
Our procedure to induce misalignment in model gradients is summarized in Algorithm 1. The algorithm solves the optimization problem in Equation 3 with a guided gradient descent strategy. Mini-batches of the training samples are employed for a multi-step traversal of the enhanced loss surface. Iterative stepping in the direction of decreasing '$\wp^*$' with gradient descent allows the model to gain classification accuracy along with orthogonal alignment of its gradients w.r.t. the reference model. Below we describe the procedure in detail, following the sequence in Algorithm 1.
We train an orthogonal model M (w.r.t. reference classifier K), expecting the inputs mentioned in Algorithm 1. Briefly ignoring the initialization on line 1 and the criterion on line 2, the algorithm first samples a mini-batch of size 'b'. On lines 4-5, gradients of the classification loss function w.r.t. the samples of the mini-batch are calculated for the training and reference models. The gradients are cast as $\mathbb{R}^d$ vectors and normalized by their $l_2$ norms, which not only helps in confining their dynamic range to the meaningful interval [0,1], but also allows direct similarity computation under Equation 1. Given the normalized gradients, we estimate the similarity 'δ' as the mean of the gradient dot products as given on line 6, where $\mathbb{E}[\cdot]$ is the Expectation operator. We scale 'δ' by a hyper-parameter 'λ' and update the general classification loss on line 7. This hyper-parameter enables explicit control over the degree of gradient misalignment. For λ = 0, the algorithm reduces to the usual training of a deep model, ignorant of any gradient alignment. For higher values (λ > 30), the model becomes nearly orthogonal to the reference model.
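The similarity estimate described above (normalize each per-sample gradient by its $l_2$ norm, take dot products, then average over the mini-batch) can be sketched in plain Python as follows. The function names are hypothetical, gradients are assumed to be already flattened into lists of floats, and returning zero for a vanishing gradient is a convention chosen only for this sketch.

```python
import math

def gradient_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # assumption: treat a vanishing gradient as uncorrelated
    return dot / (n1 * n2)

def batch_similarity(grads_a, grads_b):
    """Mean per-sample similarity over a mini-batch of gradient pairs."""
    sims = [gradient_similarity(a, b) for a, b in zip(grads_a, grads_b)]
    return sum(sims) / len(sims)

# Two samples: one pair of aligned gradients and one pair of orthogonal gradients.
print(batch_similarity([[1.0, 2.0], [1.0, 0.0]], [[2.0, 4.0], [0.0, 3.0]]))
```

A value near 1 indicates aligned gradients, a value near 0 indicates nearly orthogonal gradients, and negative values indicate opposing directions.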
Finally, the optimization algorithm is deployed for updating the model parameters with respect to the gradient of our enhanced loss function. The choice of optimization algorithm is not restricted in any manner. However, for a fair evaluation, we employ the same optimization algorithm as used in the training of our reference models. The algorithm continues to improve the validation accuracy using the training data. Due to the stochastic nature of deep models, we monitor the accuracy of the model after every 20 epochs, as indicated on lines 11-13. This is a purely empirical strategy to automate the procedure and can be replaced by manual monitoring.
In general, defenses against adversarial attacks are well known to reduce the performance of the original model [8], [9], [13], [17], [20], [35]-[42]. Hence, before we provide the actual quantitative results in Section IV, we find it necessary to discuss the intuition behind the attractiveness of our proposal of employing model orthogonality as a defense. Below, we give a simplified explanation that reveals how two models can still have very similar classification performance, despite their gradients being orthogonal.
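To make the interplay between the classification loss and the similarity penalty concrete, the sketch below trains a two-feature logistic-regression model against a frozen reference model. Everything in it is an assumption made for illustration: the toy data, the names `train` and `W_REF`, the step size, and the value `lam = 2.0`. For this model the input gradient is `(p - y) * w`, so the similarity of normalized input gradients reduces to the cosine between the two weight vectors, which allows analytic gradients. One deliberate deviation is that the sketch penalizes the squared similarity, because in this convex toy a penalty on the signed similarity would reward anti-alignment rather than orthogonality; it is not a reproduction of Algorithm 1.

```python
import math

# Toy data: both features separate the classes (labels are 0/1).
POS = [(1.0, 1.0), (1.0, 0.8), (0.8, 1.0)]
DATA = [(x, 1) for x in POS] + [((-a, -b), 0) for a, b in POS]
W_REF = (3.0, 0.0)  # frozen reference model (hypothetical): relies on feature 1 only

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def train(lam, steps=3000, lr=0.1):
    """Gradient descent on mean cross-entropy + lam * delta**2 (lam = 0: ordinary training)."""
    w = [0.5, 0.1]
    for _ in range(steps):
        # gradient of the mean binary cross-entropy w.r.t. the weights
        grad = [0.0, 0.0]
        for x, y in DATA:
            p = sigmoid(w[0] * x[0] + w[1] * x[1])
            for k in range(2):
                grad[k] += (p - y) * x[k] / len(DATA)
        # similarity of normalized input gradients; equals cos(w, W_REF) for this model
        delta = cosine(w, W_REF)
        norm_w, norm_r = math.hypot(*w), math.hypot(*W_REF)
        for k in range(2):
            d_delta = W_REF[k] / (norm_w * norm_r) - delta * w[k] / norm_w ** 2
            grad[k] += lam * 2.0 * delta * d_delta  # derivative of lam * delta**2
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w

def accuracy(w):
    hits = sum((w[0] * x[0] + w[1] * x[1] > 0) == bool(y) for x, y in DATA)
    return hits / len(DATA)

w_ordinary, w_orthogonal = train(lam=0.0), train(lam=2.0)
print(cosine(w_ordinary, W_REF), accuracy(w_ordinary))
print(cosine(w_orthogonal, W_REF), accuracy(w_orthogonal))
```

For a deep network the similarity term depends on the parameters through the input gradients themselves, so the analytic shortcut used here would have to be replaced by automatic differentiation of the penalized loss.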
Deep classification models can be viewed as a stack of nonlinear transformation layers where each internal layer sequentially transforms its input before feeding it to the next layer. Consequently, the model as a whole is able to project an input to the output, which is the class label in the case of classifiers. For an input image $i \in \mathbb{R}^d$, the output of a deep model, say $\hat{i}$, can be expressed as:
$$\hat{i} = T_N\big(T_{N-1}(\dots T_1(i))\big)$$
where 'N' is the number of layers and $T_n$ denotes the transformation applied by the n-th layer. Our argument is that despite large differences between the intermediate projections by the internal layers of two misaligned models, both models can still achieve a similar classification objective, thereby enabling both models to preserve the same desired classification accuracy.
We provide an intuitive explanation of this insight by considering two simple linear models composed of four transformation layers (T 1 . . . T 4 ). For the ease of understanding, we restrict the input and transformation spaces to 3 dimensions. In Fig. 2, we illustrate the transformations by the internal layers of the models by arrows of different colors. It can be observed that despite orthogonality between each pair of corresponding transformations, the input points are eventually mapped to the same output by both sets of transformations (i.e. models). In theory, this identifies the possibility of achieving the exact same performance by two models despite orthogonality between their internal layers.
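The following Python sketch gives one possible numerical instance of this argument; it is an illustrative construction and not the configuration shown in Fig. 2. The matrices `T1` to `T4`, the rotation `Q` and the input `x` are hypothetical values. The second model rotates every hidden representation by 90 degrees about the z-axis and undoes that rotation in the following layer, so its intermediate outputs are orthogonal to those of the first model for inputs in the x-y plane, while the final outputs coincide.

```python
import math

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def forward(layers, x):
    """Apply the layers in order and keep every intermediate output."""
    outputs = []
    for t in layers:
        x = matvec(t, x)
        outputs.append(x)
    return outputs

Q = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90-degree rotation about z
QT = transpose(Q)  # inverse of Q, since Q is orthogonal

T1 = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
T2 = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T3 = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T4 = [[1.0, -1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

model_a = [T1, T2, T3, T4]
# Rotate each hidden representation by Q and undo the rotation in the next layer.
model_b = [matmul(Q, T1), matmul(matmul(Q, T2), QT),
           matmul(matmul(Q, T3), QT), matmul(T4, QT)]

x = [1.0, 1.0, 0.0]  # input in the x-y plane
out_a, out_b = forward(model_a, x), forward(model_b, x)
hidden_cos = [cosine(a, b) for a, b in zip(out_a[:-1], out_b[:-1])]
print(hidden_cos)             # similarity of corresponding hidden outputs
print(out_a[-1], out_b[-1])   # final outputs of the two models
```

For inputs with a component along the rotation axis the hidden outputs are no longer exactly orthogonal, so the construction should be read as an existence example only.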
In the above discussion we consider the simple case of a 3-dimensional space with linear systems. For our argument, this is actually a more constrained setup as compared to higher dimensional spaces that also allow non-linear systems. In that setting, the flexibility of non-linearity and the inherent low correlation between random vectors provide a more conducive setup for our argument to hold. In high dimensional vector spaces, a slight variation in the components of a vector can lead to a drastic change in its orientation. In fact, in such spaces the probability of random vectors being correlated approaches zero as the dimensions become very large [57]. We refer interested readers to Gorban et al. [57] for more details regarding this phenomenon. Here, we constructively capitalize on this phenomenon, which is commonly seen as a curse of dimensionality.
Expanding on the implications of our argument, misaligned gradients naturally raise concerns about differences in the internal reasoning of the models. With the sheer volume of parameters and variety of non-linear layers, the direct analysis of causality between input and output of deep models is not possible. Hence, we analyze this reasoning for ordinary and orthogonal models via a popular explanatory algorithm called Grad-CAM [58]. This technique exploits the internal representation of a model to reliably localize the region of interest in the input image for the model [59]. In the presence of dissimilar gradients, one may anticipate a significant difference between these regions for ordinary and orthogonal models, given the same input.
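The tendency of random high-dimensional vectors to be nearly uncorrelated can be checked numerically. The short Python sketch below, written under the assumption of independent standard Gaussian components, estimates the average absolute cosine between random vector pairs for a few dimensions; the function name, the number of pairs and the chosen dimensions are illustrative.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cos| between pairs of random Gaussian vectors of size dim."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs

for dim in (3, 30, 3072):  # 3072 = 32 * 32 * 3, the size of a CIFAR-10 image
    print(dim, round(mean_abs_cosine(dim), 3))
```

The printed averages shrink as the dimension grows, which is the behaviour the argument above relies on.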
We show representative results of Grad-CAM for different VGG-16 models with varying degree of orthogonality in Fig. 3. We provide complete details of our experiments in the next Section. Here, we highlight with Fig. 3 that despite large dissimilarity between the gradients of models, they are generally able to identify the same region of interest in the images.
For the top row with the dog image, all three models consider the dog's face as the most salient region. Similarly, the eye and nose are considered important by the models for the middle row with the shark image. For the bird image, shown in the last row, despite significant visual differences in the Grad-CAM outputs, it is clearly identifiable that the head and some parts of the body are used by all the models for correct classification. These images ascertain that despite large dissimilarities between the gradients of different models (with the same architecture), they are still able to perform similar internal reasoning regarding the semantics of the inputs to perform correct classification. Hence, besides the potential to achieve the desired high accuracy of the reference model, orthogonal models are also able to preserve the intuitive internal reasoning of the models. The proposed models are able to significantly reduce the transferability of adversarial attacks while maintaining these desirable properties. In the next section, we focus on providing thorough empirical evidence that gradient misalignment can boost immunity against black-box adversarial attacks.

IV. EXPERIMENTS
We first outline the experimental setup before detailed discussion and analysis of empirical evaluation of our technique on different datasets.
We evaluate the robustness of gradient misaligned models against black-box adversarial attacks. These attacks assume that the internal gradients of target model are not available. Therefore, perturbations are crafted using surrogate models and transferred to the target models. Observing success of transferred perturbations (fooling ratio) provides a fair estimate of model immunity. We demonstrate the generalization of our approach by including several architectures trained over CIFAR-10 [60] and ILSVRC ImageNet 2012 dataset [61].
We consider two separate choices of source models to craft perturbations. First we keep similar source and target model architectures. This enables us to verify robustness under the more challenging condition as it is known that perturbations transfer maximally to similar architectures [14]. Later, we consider the case of varying source architecture to validate robustness against diverse perturbations. To evaluate our models, for each source-target model pair, we sample 1000 images from the validation sets that are correctly classified i.e. prior baseline accuracy for models is 100%. These images are subsequently perturbed and classified by the target model. Hence, the bias in fooling ratio due to already misclassified samples is completely avoided in our results.
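The evaluation protocol above can be sketched as follows. The helper `fooling_ratio`, the one-dimensional threshold classifier and the fixed-shift perturbation are hypothetical stand-ins for the target model and the surrogate-crafted attack, and the sketch takes the first `n_eval` correctly classified samples instead of sampling them at random.

```python
def fooling_ratio(target_predict, samples, perturb, n_eval=1000):
    """Fraction of correctly classified samples whose prediction flips after perturbation."""
    # keep only samples the target already classifies correctly (100% baseline accuracy)
    correct = [(x, y) for x, y in samples if target_predict(x) == y][:n_eval]
    if not correct:
        raise ValueError("no correctly classified samples to evaluate")
    fooled = sum(target_predict(perturb(x, y)) != y for x, y in correct)
    return fooled / len(correct)

def target(x):
    """Toy target model: a threshold on a scalar input."""
    return int(x > 0.0)

def surrogate_attack(x, y):
    """Toy perturbation: shift the input by 0.3 towards the other class."""
    return x - 0.3 if y == 1 else x + 0.3

# The last sample is misclassified by the target and is therefore excluded.
samples = [(-1.0, 0), (-0.2, 0), (0.1, 1), (0.9, 1), (0.5, 0)]
print(fooling_ratio(target, samples, surrogate_attack))
```

Restricting the evaluation to correctly classified samples means that every counted failure is caused by the perturbation rather than by an error the model would have made anyway.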
We craft adversarial perturbations using four well-known attack schemes, namely the single-step FGSM [17] and the multi-step PGD [13], I-FGSM [18] and MI-FGSM [15]. These techniques generate strong transferable perturbations and are often used to validate the robustness of deep models [13], [21]. We used their publicly available implementations in Foolbox [62] and AdverTorch [63]. Our experiments are based on the PyTorch framework [64] as its native dynamic graphs provide convenience for our gradient regularization.
The most salient qualitative feature of a perturbation is its visual footprint. Perturbed images that are easy to identify by human visual inspection are of little practical concern. Therefore, following the standard practice, we keep the perceptibility of perturbations in an acceptable range by controlling the step size ($\epsilon$) in attack schemes. In Figure 4, we illustrate the visual quality of distortions as $\epsilon$ is varied. It can be observed that perturbations become more visible with increasing $\epsilon$. In our evaluation, we vary this parameter from 0.5% to 8% to demonstrate the efficacy of our scheme. Our coverage of robustness evaluation is more comprehensive because, given the visual perceptibility of distortions, the existing literature generally uses 4-5% as the upper bound for perturbations [26], [27].

A. CIFAR-10 Models
We first demonstrate the effectiveness of our scheme for the visual models trained over CIFAR-10 dataset [60] which consists of 60,000 32×32 color images uniformly divided into 10 classes.
Low computational cost for training models on this dataset allows us to validate our scheme on three different architectures. This includes publicly available VGG-11 [65], ResNet-20 [66] and ResNet-32 [66]. The deliberate inclusion of two fundamentally different architectures (VGG and ResNet) permits us to analyze the architectural role in developing immunity against adversarial perturbations. Similarly, inclusion of two variants of ResNet enables us to analyze the role of depth for similar architecture. ResNets have a special place in the adversarial arena as recent works have demonstrated that skip connections facilitate the transfer of adversarial examples [67]. Hence, we include ResNets in our evaluation.
To train our model, we follow the standard practice of reserving 50,000 training samples and 10,000 test samples. Multiple randomly initialized ordinary models (λ = 0) and orthogonal models (λ = 30) were trained via our Algorithm 1. In Table I, we report the notable statistics of these models. We provide pairs of models such that 'Model-1' is the original model and 'Model-2' is its newly trained version. The orthogonal version is distinguished with a superscript 'o'. In the table, it can be observed that when orthogonality is not considered, two independently trained models have high correlation in their gradients. Our orthogonality constraint drastically reduces this correlation. It is also worth noting that the classification accuracy of the models with orthogonality constraint does not reduce much and stays within 5% of the reference model accuracy. This is a considerable improvement over the commonly adopted technique of adversarial training [50] that is known to result in 10-20% reduction in the original accuracy.
In Figure 5, we demonstrate the impact of gradient misalignment on adversarial attack transferability. In the first row of the figure, we use the VGG-11 architecture and train a 'source model' that is used to craft adversarial perturbations. These perturbations are used to attack three models with the same VGG-11 architecture. The first is the source model itself (red). The second is the 'ordinary target' model (black), obtained with a different initialization of the training process. This setup emulates the commonly encountered black-box attack in which the attacker is able to correctly guess the (standard) architecture of the target model. The third model is the 'orthogonal target' model (green) that has the VGG-11 architecture, but it is trained under our orthogonality constraint. In each plot, the vertical axis reports the fooling rate, while the horizontal axis represents the step size '$\epsilon$', expressed as a percentage of the image dynamic range, that is used to compute a perturbation.
The plots in Figure 5 clearly indicate that our orthogonal models generally have stronger resistance against attacks compared to the source and its retrained variants. The robustness is significantly higher for the VGG architecture than for the ResNets. We conjecture that this phenomenon results from skip connections, which are known to facilitate the transferability of perturbations. Nevertheless, the orthogonal ResNets do exhibit significant immunity against PGD, I-FGSM and MI-FGSM attacks. This demonstrates that our technique is able to bring robustness to ResNets as well, despite their inherent facilitation of black-box attacks. Among all the attacks, MI-FGSM is known to generate highly transferable perturbations [15]. However, our orthogonal training enhances the robustness by 55.8% for VGG-11, 22.4% for ResNet-20 and 17.4% for ResNet-32 against the attacks generated with $\epsilon$ = 0.05. These results are relative to the ordinary retrained model.
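The comparison between ordinary and orthogonal targets can be sketched as follows. We assume that the fooling rate is the percentage of samples whose predicted label changes under perturbation, and that a robustness gain is the difference between two fooling rates in percentage points; both definitions and the function names are assumptions for illustration, and the toy predictions below are not experimental data.

```python
def fooling_rate(clean_preds, adv_preds):
    """Percentage of samples whose prediction changes after perturbation."""
    if len(clean_preds) != len(adv_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        return 0.0
    flipped = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return 100.0 * flipped / len(clean_preds)


def robustness_gain(ordinary_rate, orthogonal_rate):
    """Reduction in fooling rate (percentage points) of the orthogonal target."""
    return ordinary_rate - orthogonal_rate


# Toy predictions on four samples for the two target models.
clean = [0, 1, 2, 3]
ordinary_adv = [1, 1, 0, 2]    # three of four labels flipped
orthogonal_adv = [0, 1, 2, 2]  # one of four labels flipped

ordinary_rate = fooling_rate(clean, ordinary_adv)
orthogonal_rate = fooling_rate(clean, orthogonal_adv)
gain = robustness_gain(ordinary_rate, orthogonal_rate)
```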

B. ImageNet ILSVRC2012
We extend the validation of our approach to the large-scale ImageNet ILSVRC 2012 dataset [61], which consists of 1.2 million color images with 224 × 224 resolution and 1000 classes of daily-life objects. It is a common practice in the adversarial defense literature to perform experiments only with CIFAR-10 and other small datasets, e.g. MNIST [69]. However, for a more comprehensive evaluation, we present results for this large-scale dataset despite its heavy computational requirements. We prefer this because results on ImageNet are more likely to generalize well to other models in the era of large-scale datasets.
Recent works [70] have established a one-to-one relationship between the dimensions of model inputs and adversarial susceptibility of the models. This makes ImageNet classifiers especially challenging to defend against adversarial perturbations as compared to MNIST and even CIFAR models. While the lower computational complexity of CIFAR-10 dataset permits us to train numerous models from scratch, ImageNet training of multiple models from scratch is computationally prohibitive.
For computational reasons, we restrict our evaluation for ImageNet to the VGG-16 architecture and employ the validation set of ImageNet for experiments. This set comprises 50,000 images. We follow the standard 80/20 split for training and testing our models. Ordinary models, i.e. the baselines, are prepared by fine-tuning the ImageNet pretrained models available in PyTorch [64]. Fine-tuning is performed by following the standard procedure of re-initializing and training the last fully connected layers. The initial layers are known to capture the common shared concepts, and relearning the last few layers suffices for the model to capture the underlying high-level features of the new distribution.
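The 80/20 partition of the 50,000 validation images can be sketched as below. The source only states the split ratio; the random shuffling, the fixed seed and the function name `split_indices` are assumptions added for reproducibility of the illustration.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into train and test subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


# 50,000 validation images -> 40,000 for fine-tuning, 10,000 for testing.
train_idx, test_idx = split_indices(50_000)
```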
We use the baseline model as the reference model and train numerous other models by varying the hyper-parameter 'λ' in Algorithm 1. Each experiment is repeated five times for which the important statistics are summarized in Table II. Here, the similarity of models for each 'λ' is w.r.t. the baseline model. It can be observed that in simple retraining i.e. λ = 0, strong correlation exists among the gradients. As 'λ' increases to 100, the correlation almost vanishes. It is worth noting that similar to CIFAR-10 models, the standard classification accuracy suffers slightly and at the near orthogonal alignment it stays within 5% of the original model's accuracy. We have included the training time of the models to show the computational complexity. Generally it takes 3 days to train an orthogonal model on ImageNet validation set using NVIDIA Titan V GPU with 12GB RAM.
In Figure 6, we illustrate the role of dissimilarity in boosting model robustness against transferable attacks. Perturbations are crafted on the source model, shown in red, and then evaluated on the orthogonal model, represented in green. Each column presents the results for a particular attack type and each row presents the results for a different correlation between the models, as controlled by the value of 'λ'. The vertical axes report the fooling rate of the classifiers and the horizontal axes represent the step size '$\epsilon$' as a percentage of the original dynamic image range. It can be observed that for each attack type (column), as we increase the misalignment of gradients (from top to bottom row), the robustness of the orthogonal model increases. Overall, orthogonal models show robustness to all types of attacks. However, maximal resistance is observed for PGD, which is considered to be the strongest iterative attack [13].
The orthogonal model increases the robustness by 67% for the PGD attacks generated with $\epsilon$ = 0.05.

C. Attacks from Other Models
The above results clearly establish the efficacy of our scheme against perturbations generated from similar architectures. We further demonstrate that our scheme enhances the robustness of the resulting orthogonal model to perturbations crafted from dissimilar architectures, which is a more realistic setting for black-box attacks.
In Fig. 7, we show how attacks computed from different source models transfer to our orthogonal VGG-16 model. Each row of the figure shows the fooling results of our orthogonal target for perturbations crafted by the architecture mentioned on the left. Each column shows the results for a single attack scheme, as indicated on the top. The plot axes follow our conventions from the previous figure. The red curve represents the fooling ratio on the source model and the green curve represents the same for the orthogonal VGG-16 model. For clear benchmarking, the fooling ratio for the ordinary VGG-16 model (not made orthogonal) is included as a black curve.
As can be seen, we use three diverse architectures to craft the perturbations, namely ResNet-34 [71], DenseNet-161 [72] and SqueezeNet [73]. We used the pretrained models of these architectures available in PyTorch [64]. In Fig. 7, the graphs illustrate the superior robustness of our orthogonal model in comparison to the ordinary one. It can be observed that the green curve reports a lower fooling ratio, staying reasonably below the black curve for all combinations of attack schemes and source models. These results support the claim that controlled misalignment of model gradients can boost the immunity of models against adversarial perturbations, especially in a black-box setting where the target model architecture is unknown.

D. Comparison with Other Techniques
In the previous discussion, we illustrated the empirical results of our proposed defense strategy against different adversarial attacks. In this Section, we extend the results by a detailed comparison of our technique with related defense algorithms that are available in the literature.
We compare our technique with four other schemes, including JPEG compression [47], Total Variance Minimization (TVM) [74], Bit Squeezing [75] and Bilateral Filtering [76]. We comprehensively evaluate existing defense strategies by using different settings of their hyperparameters in the ranges that are commonly used. Therefore, the results are reported for four different variants of each method, making our comparison considerably thorough. We refer to the original works for details on the significance of the hyperparameters. For JPEG compression, we sweep the 'quality' hyperparameter between 80% and 95%. Similarly, the 'weights' are changed from 3 to 9 for TVM, and the 'window size' is varied between 3 and 9 for bilateral filters. For Bit Squeezing defenses, the 'depth' is selected between 1 and 4.
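For readers unfamiliar with the role of the 'depth' hyperparameter, the sketch below shows one common formulation of bit-depth reduction. It assumes pixel intensities normalized to [0, 1] and uniform quantization to `depth` bits; the implementation in [75] may differ in its details, and the function name is hypothetical.

```python
def squeeze_bit_depth(pixels, depth):
    """Quantize intensities in [0, 1] to 2**depth uniformly spaced levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    levels = 2 ** depth - 1
    # int(x + 0.5) rounds half up for the non-negative values used here.
    return [int(p * levels + 0.5) / levels for p in pixels]


# With depth = 1 every pixel collapses to either 0.0 or 1.0.
binary = squeeze_bit_depth([0.0, 0.4, 0.6, 1.0], depth=1)
# With depth = 4 small perturbations fall within one quantization bin.
coarse = squeeze_bit_depth([0.50, 0.51], depth=4)
```

Lower depths remove more of the perturbation but also discard more image detail, which is consistent with the clean-accuracy cost of such input transformations discussed in this section.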
In Table III, we compare the performance of defense algorithms for the models trained over ImageNet and CIFAR-10, as reported in Tables I and II. The results are shown for 1000 randomly sampled images from the validation dataset, such that the clean versions of these images are correctly classified by the standard as well as the orthogonal model. The architecture under attack/defense is highlighted in the first column of Table III. Perturbations are computed for the source model via four different attack schemes for $\epsilon_1$ = 0.03 and $\epsilon_2$ = 0.05. The last four columns report the fooling ratios in the order $\epsilon_1$ / $\epsilon_2$. The choice of $\epsilon$ values is based on the literature, such that a higher value is usually perceptible [20]. The raw fooling accuracy (no defense) is shown in the first row for each model. These perturbations are then defended via different schemes and the results are reported in the subsequent rows. The best results for each attack type and model are indicated in bold.
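The sample selection described above can be sketched as a simple filter followed by random sampling. The function name, the seed and the toy labels are assumptions for illustration; only the requirement that both models classify the clean image correctly comes from the text.

```python
import random


def select_eval_indices(labels, preds_standard, preds_orthogonal, k=1000, seed=0):
    """Sample up to k indices that both models classify correctly when clean."""
    eligible = [
        i
        for i, (y, a, b) in enumerate(zip(labels, preds_standard, preds_orthogonal))
        if a == y and b == y
    ]
    rng = random.Random(seed)
    return sorted(rng.sample(eligible, min(k, len(eligible))))


# Toy example: only samples 0, 1 and 4 are correct for both models.
labels = [0, 1, 2, 3, 4]
preds_standard = [0, 1, 2, 0, 4]
preds_orthogonal = [0, 1, 0, 3, 4]
chosen = select_eval_indices(labels, preds_standard, preds_orthogonal, k=2)
```

Restricting the evaluation to such samples means that any misclassification after the attack can be attributed to the perturbation rather than to an error the model already made on the clean image.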
The results in Table III validate the superior performance of our technique against others by a significant margin for all the attacks. Only in the case of FGSM attacks against ResNet-20 has our algorithm shown inferior performance. This phenomenon is in line with the earlier discussed performance of the ResNet-20 model against the FGSM attacks in Figure 5. Besides the improvement in adversarial robustness, it is important to notice the relative clean sample accuracy of the defense strategies. All other techniques suffer from a noticeable drop in clean sample accuracy that limits the scope of their practical application. Our method maintains a zero fooling rate on clean samples for all models.

V. CONCLUSION
We introduced a model-agnostic gradient regularization scheme that allows orthogonal variants of a model to be trained. These orthogonal versions exhibit enhanced resistance to adversarial perturbations from black-box attacks, even in scenarios where the attacker is able to guess the architecture of the target model. In such a scenario, the different architectures trained over CIFAR-10 show a minimum enhancement of 18.5% against PGD, 9.1% against I-FGSM and 17.4% against the MI-FGSM attacks for $\epsilon$ = 0.05. Importantly, our regularization only results in a small degradation in the performance of the original models. This is in contrast to the conventional techniques for making models robust to adversarial attacks, which generally result in a significant loss of the original model accuracy. We presented our results for small-scale as well as large-scale datasets to establish the proposed technique. The achieved results highlight the potential of our scheme against the strongest available attacks.