Besting the Black-Box: Barrier Zones for Adversarial Example Defense

Adversarial machine learning defenses have primarily been focused on mitigating static, white-box attacks. However, it remains an open question whether such defenses are robust under an adaptive black-box adversary. In this paper, we specifically focus on the black-box threat model and make the following contributions: First, we develop an enhanced adaptive black-box attack which is experimentally shown to be ≥ 30% more effective than the original adaptive black-box attack proposed by Papernot et al. Second, we test 10 recent defenses using our new attack and propose our own black-box defense (barrier zones). We show that our defense based on barrier zones offers significant improvements in security over state-of-the-art defenses. This improvement includes greater than 85% robust accuracy against black-box boundary attacks, transfer attacks and our new adaptive black-box attack, for the datasets we study. For completeness, we verify our claims through extensive experimentation with 10 other defenses using three adversarial models (14 different black-box attacks) on two datasets (CIFAR-10 and Fashion-MNIST).


I. INTRODUCTION
There are many applications based on Convolutional Neural Networks (CNNs) such as image classification [1], [2], object detection [3], [4], semantic segmentation [5] and visual concept discovery [6]. However, it is well known that CNNs are highly susceptible to small perturbations η which are added to benign input images x. As shown in [7], [8], by adding visually imperceptible perturbations to the original image, adversarial examples x' can be created, i.e., x' = x + η. These adversarial examples are misclassified by the CNN with high confidence. Hence, making CNNs secure against this type of attack is an important task.
In general, adversarial machine learning attacks can be categorized as either white-box or black-box. This categorization depends on how much information about the classifier is necessary to run the attack. The majority of the literature has focused on white-box attacks [9]-[11], where the classifier/defense parameters are known. Likewise, the majority of defenses have been designed with the goal of thwarting white-box attacks [12]-[24]. In this paper, we focus on black-box attacks, where the classifier parameters are hidden or assumed to be secret. This type of adversary represents a more practical threat model than the white-box attacker [25]. This is in part because the adversary cannot access the classifier parameters but is still able to successfully create adversarial examples [25], [26]. Despite not having the defense parameters, the black-box adversary may still query the defense, be able to access X (the training dataset for the defense), or build a synthetic model to assist them in creating adversarial examples. By analyzing defenses through a black-box adversarial lens, we help complete the security picture by offering both new attack and defense perspectives to the community.
The associate editor coordinating the review of this manuscript and approving it for publication was Chunsheng Zhu.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Specifically, we make the following contributions:
1) Mixed Black-Box Attack: We develop an enhanced version of Papernot's black-box attack [26] by expanding the amount of data available to the attacker and changing the final attack generation method φ. These changes significantly improve the attack success rate, i.e., >30% improvement on CIFAR-10 and Fashion-MNIST.
2) Barrier Zone Defense: We develop a novel defense based on barrier zones (coined BARZ). We show that barrier zone based defenses can outperform all 10 other recent defenses studied in this paper. These defenses include Madry's Adversarial Training [27], Barrage of Random Transforms [22] and Ensemble Diversity [24], to name a few. A synopsis of our results is displayed in Figure 1, where we show the minimum robust accuracy of each defense under all 14 types of black-box attacks. If no bar is present for a defense in Figure 1, its robust accuracy is 0%.
3) The δ Metric (Minor Contribution): In adversarial machine learning, every defense comes with two distinct values to consider. These values are the cost of the defense (drop in clean accuracy) and the robustness/security (performance on adversarial data). We propose an intuitive way to help gauge this trade-off between robustness and cost in the form of the δ metric.
A. COMPARING DEFENSES
Figure 1 shows how the robust accuracy of the BARZ defense (defined as 1 − α, where α is the attack success rate of the best out of 14 types of black-box attacks) compares to 10 other recent defenses from the literature. The literature defines the attack success rate α as the fraction of adversarial examples that are misclassified by the defense. Here it is also important to precisely define the term adversarial example. In short, adversarial examples are clean images that are correctly identified by the classifier in their untampered form, and to which adversarial noise has then been added by the attacker. For this reason, the attack success rate α alone (and only α is shown in Figure 1) does not give the complete picture: α is measured only with respect to the original images which the defense classifier can correctly label. In essence, for any given defense d, α depends on the clean accuracy of the defense p_d and not on the state-of-the-art or best achievable clean accuracy p. Here p specifically refers to the accuracy measured on the clean images without any defense, i.e., the clean accuracy. When a defense is present, we denote the corresponding clean accuracy of the defense as p_d. So, to complete the story of Figure 1, we need to understand to what extent the defense itself lowers the clean accuracy of the vanilla scheme from p down to p_d. Comparing defenses along these two separate metrics of (a) robust accuracy 1 − α (how well the attacker is able to defeat the defense) and (b) clean accuracy p_d of the defense itself (without adversarial presence) leads to fuzziness. It is not clear which metric is considered more important or what combination is 'best'.
The first row in Table 1 depicts the non-malicious environment (i.e., no adversary) and shows the accuracy p of the vanilla scheme, which is the best we can achieve to date, and the accuracy p_d of the defense, which is less than p as explained above. For the malicious environment, the vanilla scheme cannot achieve any accuracy because α = 1 (see the black-box boundary attack in Figure 2). This type of attack can always successfully transform a correctly classified image into an adversarial example that is misclassified by the vanilla scheme. The probability of accurate classification by the defense in the presence of adversaries is equal to p_d · (1 − α), shown in the lower right corner of Table 1: the defense properly labels a fraction p_d of the images if no adversary is present, and out of these images a fraction α is successfully attacked if an adversary is present.
To avoid any fuzziness, we combine both metrics p_d and 1 − α into a single 'δ-metric'. We define δ as the drop in accuracy from the clean accuracy p of the vanilla scheme in the non-malicious environment (top left corner) to the accuracy of the defense in the malicious environment p_d · (1 − α) (bottom right corner):
δ = p − p_d · (1 − α).
When we analyze the non-malicious environment we are only interested in the clean accuracy of the defense, because we do not assume any attack. This gives Figure 2, where the y-axis corresponds to the accuracy p_d for the defense in the non-malicious environment and the x-axis corresponds to the accuracy for the defense in the malicious environment; that is, the x-axis represents the drop δ from the clean accuracy of the vanilla scheme in the non-malicious environment to the accuracy of the defense in the malicious environment (the price for resistance against adversarial examples). We notice that the x-axis and y-axis can be mapped in a straightforward way to the clean defense accuracy p_d and robust accuracy 1 − α themselves, which we could have reported as the x-axis and y-axis in our plots instead. But this would not make visually clear what combination (p_d, 1 − α) is the best in a malicious environment. We prefer to plot the δ-metric as this corresponds directly to the (drop in) accuracy of the defense classifier in the malicious environment.
In practice, when evaluating a defense, we not only take into consideration the accuracy p − δ of the defense in the malicious environment but also the accuracy of the defense in the non-malicious environment given by p_d in the top right corner of Table 1. From a pure machine learning perspective, we want a defense which does not affect p 'too much'; in other words, the drop γ = p − p_d should be small and limited to a couple of percentage points. However, security often does not come for free, and in order to minimize δ we may need to sacrifice much more than a couple of percentage points. This means that we need to study a trade-off between minimizing δ and an acceptable p_d. This paper presents such a study, and our defense BARZ is aimed at minimizing δ despite a possibly significant drop γ from p to p_d = p − γ in the non-malicious environment. It turns out that this leads to a robust accuracy for BARZ which outperforms those of other defenses, as depicted in Figure 1 and Figure 2.
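To make the trade-off concrete, the following minimal sketch computes δ and γ from the three quantities discussed above. It assumes δ = p − p_d · (1 − α), which follows from the statement that p − δ is the accuracy of the defense in the malicious environment. The function names and the numeric values are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def delta_metric(p: float, p_d: float, alpha: float) -> float:
    """Drop from vanilla clean accuracy p to defended accuracy under attack.

    p     : clean accuracy of the vanilla (undefended) classifier
    p_d   : clean accuracy of the defense
    alpha : attack success rate against the defense
    """
    for name, value in (("p", p), ("p_d", p_d), ("alpha", alpha)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return p - p_d * (1.0 - alpha)


def gamma_drop(p: float, p_d: float) -> float:
    """Cost of the defense in the non-malicious environment."""
    return p - p_d


# Hypothetical numbers, for illustration only.
undefended = delta_metric(p=0.93, p_d=0.93, alpha=1.0)   # attack always succeeds
defended = delta_metric(p=0.93, p_d=0.80, alpha=0.10)    # costly but robust defense
print(f"delta without defense: {undefended:.2f}")
print(f"delta with defense:    {defended:.2f}, gamma = {gamma_drop(0.93, 0.80):.2f}")
```

In this illustration the defended classifier gives up 13 percentage points of clean accuracy (γ) but ends with a much smaller δ than the undefended one, which is the kind of trade-off the δ metric is meant to expose.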

B. OUTLINE
The rest of the paper is organized as follows: In Section II we discuss black-box adversaries, why we focus on certain attacks and our new mixed black-box attack. In Section III we discuss the defenses we study, the security principles behind them and why we selected these defenses for analysis. In Section IV we introduce the mathematical intuition behind the security principles in the barrier zone defense, discuss how barrier zones are realized in practice and show empirical evidence of them. In Section V we explain how to concisely analyze the efficiency of a defense.
We give experimental results for all 11 defenses and 14 attacks in Section VI. Lastly we offer concluding remarks in Section VII.

II. ATTACKS
The general setup in adversarial machine learning for both white-box and black-box attacks is as follows [28]: We assume a trained classifier f with a correctly identified sample x with class label y. The goal of the adversary is to modify x by some amount η such that f(x + η) produces class label ŷ. In the case of untargeted attacks, the attack is considered successful as long as ŷ ≠ y. In the case of targeted attacks, the attack is only successful if ŷ ≠ y and ŷ = t, where t is a target class label specified by the adversary. For both untargeted and targeted attacks, typically the magnitude of η is limited [8] so that humans can still visually recognize the image.
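The success conditions above can be written as a small predicate. This is only an illustrative sketch: the function names are hypothetical, and the use of the L-infinity norm to bound η is an assumption, since the text only states that the magnitude of η is limited.

```python
from typing import Optional, Sequence


def attack_succeeds(y_hat: int, y: int, target: Optional[int] = None) -> bool:
    """Return True if predicted label y_hat counts as a successful attack.

    Untargeted (target is None): any label other than the true label y.
    Targeted: the label must differ from y and equal the adversary's target.
    """
    if target is None:
        return y_hat != y
    return y_hat != y and y_hat == target


def within_linf_budget(eta: Sequence[float], eps: float) -> bool:
    """Check that perturbation eta stays inside an L-infinity ball of radius eps.

    The choice of norm is an assumption made for this sketch.
    """
    return max(abs(v) for v in eta) <= eps
```

For example, `attack_succeeds(3, 5)` is a successful untargeted attack, while `attack_succeeds(3, 5, target=2)` is not a successful targeted one.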
The difference between white-box and black-box attacks lies in how η is obtained. In white-box attacks, η may be computed through backpropagation on the classifier or by formulating the attack as an optimization problem [7], [11], [29] which takes into account the classifier's trained parameters. The white-box adversary has access to the trained parameters which can be used to compute gradients -in essence, the white-box adversary has access to a gradient oracle (which when queried spits out gradient information).
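As a minimal illustration of a gradient oracle, the sketch below applies a single Fast Gradient Sign Method (FGSM) step to a toy logistic-regression classifier. The model, its weights and the input are hypothetical and stand in for a CNN; a real attack on images would obtain the input gradient by backpropagation and would also clip pixel values, which is omitted here.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x) -> int:
    """Hard label of the toy binary classifier."""
    return int(score(w, b, x) > 0)


def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this is (sigmoid(w.x + b) - y) * w.
    Computing it requires the parameters (w, b): this is the gradient oracle.
    """
    p = sigmoid(score(w, b, x))
    return [(p - y) * wi for wi in w]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each coordinate by eps in the direction that raises the loss."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


# Toy parameters and input (hypothetical values).
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1
x_adv = fgsm(w, b, x, y, eps=0.3)
print(predict(w, b, x), predict(w, b, x_adv))
```

A black-box adversary cannot call `input_gradient` on the defended classifier, which is why the attacks discussed in this section either train a synthetic model to differentiate or rely on hard-label queries.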
Black-box attacks, on the other hand, do not have access to the classifier's parameters when generating η and must rely on other information. The black-box adversary may have access to the classifier itself, which upon querying returns a score vector or the label for which the score is maximized; we call this a black-box oracle. Besides a black-box oracle, the black-box adversary may also have information about the training data that was used to train the classifier.
From a crypto perspective, a white-box adversary is strictly stronger than a black-box adversary and also has access to the black-box oracle. However, we often forget that the classifier parameters known to the white-box adversary can not only be used to compute a gradient oracle but also a black-box oracle. This is because we often think that gradient information leads to more powerful attacks, hence, we may not need to consider black-box attacks. A defense that demonstrates robustness to white-box attacks that only make use of a gradient oracle does not always imply robustness to black-box attacks. Gradient masking makes it possible for a defense to give a false sense of security [10] against a fully-equipped white-box adversary as it only thwarts white-box attacks based on the gradient oracle. This shows that there is a need to also separately test gradient free attacks, such as black-box attacks.
In this paper, we focus on black-box adversaries which utilize adaptive attacks [26]. A natural question is why we focus on adaptive black-box attacks. We do so for the following reasons:
1) State-of-the-art white-box attacks on published defenses have been extensively studied in the literature [9]-[11]. The level of attention given to black-box attacks in defense papers is significantly less. By focusing on black-box attacks, we seek to complete the security picture. This full security picture means that the current defenses we analyze have not only white-box attack results (from their own publications), but also adaptive black-box results (as reported in this paper). Future defenses can build upon the security concepts developed in this paper and our experiments when making their own analyses. This completed security spectrum brings us to our next point.
2) By completing the security picture (with black-box attacks) we allow readers to compare defense results. This comparison can be done because the same adversarial model, dataset and attack are used for each defense. This is completely different from adaptive white-box attacks, which may require different adversarial models and different security assumptions for each attack. For example, in [9], to break a detector defense (The Odds are Odd), a custom objective function must be employed to achieve a high attack success rate in the adaptive white-box attack. In contrast, creating an adaptive white-box attack on an ensemble model defense (ADP [24]) is much different: the only requirement is to increase the number of iterations used in a simple gradient based white-box attack to make the attack adaptive and effective. Although both adaptive attacks in our example are white-box, the latter (the adaptive white-box attack on ADP) technically only requires being able to backpropagate on the model. As noted in [30], it is improper to compare the robustness of two defenses under different adversarial models.
A. BLACK-BOX ATTACK VARIATIONS
1) PURE BLACK-BOX ATTACK [10], [31]-[33]
The adversary is only given knowledge of a training data set X_0.
2) ORACLE BASED BLACK-BOX ATTACK [26]
The attacker does not have access to the original training dataset, but may generate a synthetic dataset S_0 similar to the training data. The adversary can adaptively generate synthetic data and query the defense O to obtain class labels for this data. The synthetic dataset S_0 is then used to train the synthetic model. It is important to note that the adversary does not have access to the entire original training dataset X_0. In this paper, we propose a new version of this attack which we call the Mixed Black-Box Attack. In this attack, the adversary is given the entire original training dataset, the ability to generate synthetic data and query access to the defense to label the data. The adversary in our attack also has multiple different adversarial generation methods φ to choose from to create adversarial examples. In this way, the adversary can train a synthetic model whose behavior mirrors that of the defense more precisely. In short, the attacker adapts the synthetic model to the defense. It is important to note that the earlier version of this attack [26] did not allow full access to the training dataset X_0, and the adversarial generation method φ was fixed to be the Fast Gradient Sign Method (FGSM).
Experimentally, we show that the mixed black-box attack outperforms the original attack proposed by Papernot. Our experiments also show the mixed black-box attack works better on certain types of randomized defenses when compared to both boundary and pure black-box attacks [10], [25], [31]-[34]. The pseudo-code for the mixed black-box attack is given in Algorithm 1 and explained in Section II-B.
3) BOUNDARY BLACK-BOX ATTACK [35]
In this type of attack the adversary has query access to the classifier and only generates a single sample at a time. The main idea of the attack is to try to find the boundaries between the class regions using a binary search methodology and a gradient approximation for the points located on the boundaries.
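The binary search component of a boundary attack can be sketched as follows. This is a simplified illustration, not the implementation of any specific published attack: it covers only the search along the line between a clean sample and a point that is already misclassified, it omits the gradient approximation step, and the function name and tolerance are assumptions. Each call to `predict` corresponds to one hard-label query to the defense.

```python
def boundary_binary_search(predict, x_clean, x_adv, y, tol=1e-3):
    """Find a point near the decision boundary on the segment x_clean -> x_adv.

    predict : hard-label oracle, maps an input (list of floats) to a label
    x_clean : sample that the oracle labels as y
    x_adv   : any starting point that the oracle does NOT label as y
    Returns a misclassified point whose interpolation weight is within tol
    of the boundary crossing.
    """
    def blend(t):
        return [(1.0 - t) * c + t * a for c, a in zip(x_clean, x_adv)]

    lo, hi = 0.0, 1.0  # lo stays on the clean side, hi on the adversarial side
    if predict(blend(hi)) == y:
        raise ValueError("x_adv must be misclassified to start the search")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if predict(blend(mid)) != y:
            hi = mid   # still adversarial: move closer to the clean sample
        else:
            lo = mid   # crossed back into class y: back off
    return blend(hi)
```

The number of oracle queries grows only logarithmically in 1/tol, which is one reason query-based boundary attacks are practical against hard-label oracles.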

4) SCORE BASED BLACK-BOX ATTACKS
In the literature, these attacks are also called Zeroth Order Optimization based black-box attacks [36]. The adversary adaptively queries the defense to approximate the gradient for a given input based on a derivative-free optimization approach. This approximated gradient allows the adversary to directly work with the classifier of the defense. Another attack in this line is called SimBA (Simple Black Box Attack) [37]. Unlike all the previously mentioned attacks, this attack requires the score vector f (x) to mount the attack, instead of merely using the hard label.
The only type of black-box attack from the ones enumerated above that we do not consider in our analysis is the score based black-box attack. Just as white-box attacks are susceptible to gradient masking, score based black-box attacks can be neutralized by a type of masking [30]. This means defenses can appear to be secure against score based black-box attacks while actually not offering true black-box security. Furthermore, it has been noted that a decision (hard label) based black-box attack represents a more practical adversarial model [25]. Therefore, we narrow our scope to the three other black-box variants.
We implement the pure black-box attack and mixed black-box attacks. In both these types of attacks adversarial samples are generated from the synthetic model using six different methods: FGSM [8], BIM [38], MIM [39], PGD [27], C&W [11] and EAD [40]. We also consider boundary black-box attacks. Here we implement the original boundary attack, the Hop Skip Jump Attack (HSJA) [25], as well as the newly proposed Ray Searching Attack (RayS) [34]. In total these attacks represent fourteen different ways to generate black-box adversarial examples.

B. ATTACK SUCCESS RATE
For classifier C we define X(C) as the set consisting of image-label pairs (x_i, y_i) from the training data set X_0 that are correctly classified by C, i.e.,
X(C) = {(x_i, y_i) ∈ X_0 : C(x_i) = y_i}.
We say X(C) represents the set of clean images with respect to classifier C.
We broaden our description of a classifier C by allowing it to output a 'do not know' symbol ⊥. This may happen if C computes a score vector f(x) on input x where the scores do not clearly favor any label. Later we will also interpret ⊥ as the 'adversarial' symbol indicating that the input may be an adversarial example. We define the attack success rate α for classifier C with respect to a particular adversarial sample generation technique φ as
α(C, φ) = (1 / |X(C)|) · Σ_{(x_i, y_i) ∈ X(C)} Pr[C(φ(x_i, y_i)) ∉ {y_i, ⊥}].
Here, the probabilities are over the coin tosses used in φ and C. The attack success rate reflects when an adversarial example is successful, meaning that C will predict a legitimate label, that is, ≠ ⊥, which is not equal to the correct class label, that is, ≠ y_i. We note that φ is separately trained/modeled/generated using the information available to the black-box adversary. This information may consist of the sets X_0 and X(C) and, based on these sets, a self-generated synthetic model M(θ), where θ denotes the parameters of the synthetic model. Implicitly, φ incorporates a perturbation parameter ε indicating to what extent an adversarial example φ(x_i, y_i) may differ from the original image x_i.
The attack success rate estimates the fraction of clean images of C for which successful adversarial examples can be generated. Successful means C(φ(x_i, y_i)) ∉ {y_i, ⊥}, i.e., the adversarial example φ(x_i, y_i) is misclassified to an incorrect label even though it is close to the original image x_i (with respect to perturbation parameter ε). Here we consider so-called untargeted attacks, where the adversary is only interested in misclassification to some other legitimate but wrong label. (An adversarial example for a targeted attack is defined to be successful if the classifier labels it with a target class label specified by the adversary.) In practice we estimate α(C, φ) by taking a subset X_clean ⊆ X(C) and computing the fraction of adversarial examples φ(x, y), (x, y) ∈ X_clean, that are successful.
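The empirical estimate described above can be sketched in a few lines. In this illustration the 'do not know' symbol ⊥ is represented by `None`, and `classifier` and `phi` are placeholder callables supplied by the caller; both conventions are assumptions made for the sketch.

```python
BOTTOM = None  # stands in for the 'do not know' / 'adversarial' symbol


def attack_success_rate(classifier, phi, clean_pairs):
    """Estimate the untargeted attack success rate on a clean subset.

    classifier  : maps an input to a label, or to BOTTOM
    phi         : adversarial generation method, maps (x, y) to a perturbed input
    clean_pairs : list of (x, y) pairs that the classifier labels correctly
    An adversarial example counts as successful only if the classifier outputs
    a legitimate label (not BOTTOM) that differs from the true label y.
    """
    if not clean_pairs:
        raise ValueError("need at least one clean sample")
    successes = 0
    for x, y in clean_pairs:
        prediction = classifier(phi(x, y))
        if prediction is not BOTTOM and prediction != y:
            successes += 1
    return successes / len(clean_pairs)
```

Note that an adversarial example which the defense flags with ⊥ does not count toward the success rate, so a detection mechanism lowers α even when it cannot recover the correct label. For randomized defenses or randomized φ, repeating the estimate over several runs approximates the probabilities in the definition.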
The above applies to the mixed black-box attack, see Algorithm 1, as follows. By oracle O we denote the classifier with defense to which the adversary has access. The attacker starts with some subset of the training data X_0 as starting data; generally, we assume the worst case for the defender, i.e., the adversary uses all of the training data X_0 as a starting point. Data augmentation is used to recursively generate an augmented dataset S_e, where queries to oracle O are used to find labels. Some training method T (based on mathematical optimization for machine learning) learns new parameters θ_e for model M based on S_e with initial parameters θ_{e−1}. The final synthetic model M(θ_E) can be attacked by using a white-box attack method φ (this is possible because the black-box adversary knows the parameters θ_E; hence, a gradient oracle for its synthetic model M(θ_E) is available). At the final step adversarial examples are generated for X_clean and we can compute the fraction for which these are successful; this estimates α(O, φ(M(θ_E), ε; ·)).
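The overall flow of the attack can be sketched as a higher-order function. This is a schematic reading of the description above and not a transcription of Algorithm 1: the callables `augment`, `train` and `phi`, their signatures, and the decision to label the starting data through the oracle are all assumptions made for illustration.

```python
BOTTOM = None  # stands in for the 'do not know' symbol


def mixed_black_box_attack(oracle, x_start, augment, train, phi, x_clean, epochs):
    """Schematic substitute-model attack with oracle-labeled augmentation.

    oracle  : black-box access to the defense, maps an input to a label or BOTTOM
    x_start : starting inputs (in the worst case, all training inputs)
    augment : augment(theta, inputs) -> list of new synthetic inputs
    train   : train(theta_prev, labeled_pairs) -> new synthetic-model parameters
    phi     : phi(theta, x, y) -> adversarial example crafted on the synthetic model
    x_clean : (x, y) pairs that the defense classifies correctly
    Returns the final parameters and the estimated attack success rate.
    """
    inputs = list(x_start)
    labeled = [(x, oracle(x)) for x in inputs]           # label the starting data
    theta = train(None, labeled)                         # initial synthetic model
    for _ in range(epochs):
        new_inputs = augment(theta, inputs)              # synthetic data from current model
        labeled += [(x, oracle(x)) for x in new_inputs]  # oracle queries supply labels
        inputs += new_inputs
        theta = train(theta, labeled)                    # warm start from previous parameters
    successes = 0
    for x, y in x_clean:
        prediction = oracle(phi(theta, x, y))            # white-box attack on synthetic model
        if prediction is not BOTTOM and prediction != y:
            successes += 1
    return theta, successes / len(x_clean)
```

Because `phi` only needs the synthetic parameters `theta`, any of the white-box generation methods can be plugged in at the last step without additional queries to the defense, apart from the final evaluation.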

III. DEFENSES
The field of adversarial defenses is rapidly expanding, with multiple defense papers released almost every month. To examine every proposed defense is beyond the scope of this paper. Instead, we focus our analysis on ten recent, related and/or popular defenses. In this section we describe the related defenses, their common security elements and why we selected them for comparison. The related defenses we consider are Barrage of Random Transforms (BaRT) [22], The Odds are Odd (Odds) [23], Ensemble Diversity (ADP) [24], Madry's Adversarial Training (Madry) [27], Multi-model-based Defense (Mul-Def) [21], Countering Adversarial Images using Input Transformations (Guo) [20], Ensemble Adversarial Training: Attacks and Defenses (Tramer) [14], Mixed Architectures (Liu) [33], Mitigating adversarial effects through randomization (Xie) [18], Thresholding Networks (a basic proof of concept defense developed in this paper) and Barrier Zones (BARZ), the main technique proposed in this paper. In general, adversarial defenses can be divided based on several underlying defense mechanisms. We note this type of division is common in other defense papers as well [41]. While the definitions for categorization we provide here are by no means absolute, they give us a way to better understand and analyze the field.

1) Multiple Models - The defense uses multiple classifiers for prediction. The classifier outputs may be combined through averaging (i.e., ADP), randomly picking one classifier from a selection (Mul-Def) or through majority voting (Mixed Architecture).
2) Image Transformations - The defense applies image transformations before classification. In some cases, the transformation may be randomized (Xie and BaRT) or fixed (Guo).
[...] label if the sample is considered to be adversarially manipulated. Odds employs an adversarial detection mechanism, as does the vanilla thresholding network we consider as a proof of concept defense in this paper.
5) Randomization - The defense employs some form of randomization during prediction that is not known a priori to the attacker. BaRT and Xie both apply random image transformations at run time to the input.
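The three ways of combining classifier outputs mentioned for the multiple-model category can be sketched as follows. The function names are hypothetical, and the tie-breaking behavior of the majority vote (first label encountered wins) is an artifact of this sketch rather than a property of any of the cited defenses.

```python
import random
from collections import Counter


def combine_by_averaging(score_vectors):
    """Average the per-class scores of all classifiers and return the arg max."""
    n = len(score_vectors)
    averaged = [sum(column) / n for column in zip(*score_vectors)]
    return max(range(len(averaged)), key=averaged.__getitem__)


def combine_by_random_pick(labels, rng):
    """Return the label of one classifier chosen uniformly at random."""
    return rng.choice(labels)


def combine_by_majority(labels):
    """Return the most frequent label among the classifiers."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical outputs of three classifiers on one input.
scores = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
labels = [0, 1, 1]
print(combine_by_averaging(scores),
      combine_by_random_pick(labels, random.Random(0)),
      combine_by_majority(labels))
```

Averaging needs score vectors from every classifier, while the random pick and the majority vote only need hard labels.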

A. BARRAGE OF RANDOM TRANSFORMS (BaRT)
Barrage of Random Transforms (BaRT) by [22] is a defense that applies a set of image transformations i_1, ..., i_r to the input x before classification. There are ten types of image transformations that BaRT employs: JPEG compression, image swirling, noise injection, Fourier transform perturbations, zooming, color space changes, histogram equalization, grayscale transformations and denoising operations. For each input x, the number of transformations, the order of the transformations and the parameters in the transformations are randomly selected at run time. Why we selected it: As the defense we propose (BARZ) also uses image transformations, BaRT is a natural candidate to compare to. In building the defense, BaRT trains a single network on multiple image transformations. In contrast, our defense trains multiple networks, each on its own smaller set of image transformations. Comparing these two different ways of building image transformation based defenses is of interest.
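The run-time randomization described above (random number of transformations, random order, random parameters) can be sketched as follows. The toy transformations operate on a flat list of pixel values and are placeholders, not the BaRT transformation set; requiring at least one transformation per input is also an assumption of this sketch.

```python
import random

# Each entry is (sample_params, apply). These are toy stand-ins, not BaRT's transforms.
TOY_TRANSFORMS = [
    (lambda rng: rng.uniform(0.0, 0.1), lambda x, shift: [v + shift for v in x]),
    (lambda rng: rng.uniform(0.9, 1.1), lambda x, scale: [v * scale for v in x]),
    (lambda rng: rng.randint(1, 3), lambda x, digits: [round(v, digits) for v in x]),
]


def random_transform_pipeline(x, transforms, rng):
    """Apply a randomly sized, randomly ordered, randomly parameterized pipeline."""
    count = rng.randint(1, len(transforms))   # random number of transformations
    chosen = rng.sample(transforms, count)    # random subset in random order
    for sample_params, apply in chosen:
        x = apply(x, sample_params(rng))      # random parameters per transformation
    return x


image = [0.2, 0.5, 0.8]
print(random_transform_pipeline(image, TOY_TRANSFORMS, random.Random()))
```

Passing the random number generator explicitly makes the pipeline reproducible under a fixed seed for testing, while an unseeded generator at deployment keeps the chosen pipeline unknown to the attacker.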

B. THE ODDS ARE ODD (ODDS)
The Odds are Odd was first introduced in [23] as a statistical test for detecting adversarial samples. The concept behind the test is based on a simple observation: clean samples and adversarial samples have different values in the logits layer l(·). Here we define the logits layer as the layer before the soft-max layer. When given an input x, the test works by creating multiple copies of the input, each with random noise added: x_1, ..., x_p. The statistical test uses l(x_1), ..., l(x_p) as input to distinguish between adversarial and clean examples.
Why we selected it: In the black-box setting adversarial detection is one possible way to make the defense stronger, as the attacker has to produce a wrong class label and avoid the defense marking the input as adversarial (⊥). In the defense proposed in this paper (BARZ) we also employ detection by using a threshold voting method with multiple classifiers. As security through detection is precisely what Odds attempts to achieve, it makes sense to compare statistical detection methods to voting based detection defenses such as BARZ.

C. IMPROVING ADVERSARIAL ROBUSTNESS VIA PROMOTING ENSEMBLE DIVERSITY (ADP)
Using multiple classifiers in a defense is a straightforward concept based on the notion that it is more difficult to break an ensemble of classifiers than a single one. The authors of [24] further this notion by specifically training an ensemble of classifiers to avoid the case where the majority of classifiers simultaneously misclassify an adversarial example. In this defense, security is achieved during training, in which an adaptive diversity promoting (ADP) regularizer is used. The ADP regularizer pushes the non-maximal predictions of each ensemble classifier to be mutually orthogonal.
Why we selected it: ADP uses an ensemble of classifiers without image transformations or adversarial training. BARZ on the other hand, uses multiple classifiers with image transformations. If it were possible to achieve black-box robustness in an ensemble without image transformations (e.g. with only special training like in ADP) this would negate the need for special image transformations in a black-box defense. Therefore, testing ADP and comparing it to BARZ has important black-box security implications.

D. MADRY'S ADVERSARIAL TRAINING (MADRY)
Madry's adversarial training [27] is a widely used defense with clear security objectives. As CNNs misclassify adversarial examples, the authors in [27] proposed generating the adversarial examples and subsequently learning to classify them correctly during training. In general, adversarial training can be broken down into two steps. In the first step, for a given clean dataset and classifier, the defender uses a white-box adversarial attack φ to derive an adversarial dataset. In the second step, the classifier is trained with the adversarial examples and the original clean labels. These two steps are repeated multiple times during training to create a robust adversarially trained classifier.
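The two alternating steps can be written as a short loop. This is a generic skeleton under stated assumptions, not the training procedure of [27]: `attack` and `train_step` are placeholder callables, and details such as batching, the choice of φ and the number of rounds are left to the caller.

```python
def adversarial_training(theta, dataset, attack, train_step, rounds):
    """Alternate between crafting adversarial data and training on it.

    theta      : current classifier parameters
    dataset    : list of clean (x, y) pairs
    attack     : attack(theta, x, y) -> adversarial version of x (white-box)
    train_step : train_step(theta, labeled_pairs) -> updated parameters
    """
    for _ in range(rounds):
        # Step 1: derive an adversarial dataset against the current classifier.
        adversarial_set = [(attack(theta, x, y), y) for x, y in dataset]
        # Step 2: train on the adversarial examples with the original clean labels.
        theta = train_step(theta, adversarial_set)
    return theta
```

Because the adversarial dataset is regenerated against the updated parameters in every round, the classifier is always trained on examples that are adversarial for its current state rather than for an earlier one.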
Why we selected it: Madry's adversarial training is one of the most commonly accepted adversarial machine learning defenses due to its intuitive design and robust results. While the security principles that Madry's adversarial training is based on do not directly overlap with BARZ, it is nevertheless a standard defense to compare to.

E. MULTI-MODEL-BASED DEFENSE (MUL-DEF)
The authors of [21] proposed a defense against white-box attacks based on multiple networks, each with the same architecture. They developed their defense based on a specialized training technique. They first start with a classifier C_1 that has been trained on the clean dataset X. A white-box attack φ_{C_1} is done on C_1 to generate a set of adversarial examples S_1. A new training set is formed from the original dataset and adversarial examples: X ∪ S_1. This new set is used to train the next classifier C_2. This process is repeated such that classifier C_j is trained on X ∪ S_1 ∪ S_2 ∪ ... ∪ S_{j−1}. During prediction the final output is randomly selected from classifiers C_2, ..., C_m, where m is the number of specially trained classifiers in Mul-Def.
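The iterative training and randomized prediction can be sketched as follows. This reflects our reading of the description, in which each new classifier is trained on the clean data plus all adversarial sets generated so far; `train` and `attack` are placeholder callables and the function names are hypothetical.

```python
import random


def train_mul_def(clean_data, train, attack, m):
    """Train classifiers C_1, ..., C_m on a growing training set.

    train  : train(labeled_pairs) -> classifier (a callable on inputs)
    attack : attack(classifier, clean_data) -> list of adversarial (x, y) pairs
    """
    classifiers = [train(clean_data)]            # C_1 is trained on clean data only
    training_set = list(clean_data)
    for _ in range(1, m):
        adversarial_set = attack(classifiers[-1], clean_data)  # attack the latest classifier
        training_set = training_set + adversarial_set          # clean data plus all sets so far
        classifiers.append(train(training_set))
    return classifiers


def mul_def_predict(classifiers, x, rng):
    """Answer with one of the specially trained classifiers C_2, ..., C_m."""
    if len(classifiers) < 2:
        raise ValueError("need at least one specially trained classifier")
    return rng.choice(classifiers[1:])(x)
```

Since no voting or detection is involved, every query is answered with a label, which is the main structural difference from a thresholded ensemble.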
Why we selected it: Mul-Def has overlapping security concepts with BARZ. Both use multiple models in the defense and both try to create distinct classifiers (Mul-Def through special training and BARZ through training on transformed data). In the randomized form of BARZ, a random subset of model outputs is used, similar to Mul-Def. The main difference between the two defenses is that Mul-Def does not employ any voting among the models and does not implement any adversarial detection. If an ensemble defense could avoid having to implement detection, this would clearly boost the clean accuracy of the defense, because imperfect detection methods mark some clean samples as adversarial (false positives). Due to their similar security concepts, it is logical to compare Mul-Def to BARZ.

F. COUNTERING ADVERSARIAL IMAGES USING INPUT TRANSFORMATIONS (GUO)
In [20], the designer selects a set of possible image transformations for a single classifier and keeps the selection of the chosen image transformations secret. The main security idea in this defense (Guo) is that the image transformations will distort the adversarial noise enough that it no longer causes the classifier to misclassify the adversarial example.
Why we selected it: While we do not directly test the original Guo image transformations, the security concepts behind the Guo defense are the same as those of a single network in BARZ. Essentially, the security principles in the Guo defense (single network and image transformations) are a special case of BARZ when the number of classifiers m = 1. Since the Guo defense has already been proposed, it would be redundant to propose BARZ if BARZ-1 (i.e., the Guo defense) already offered substantial security. Therefore, it is necessary to experiment with the Guo defense.

G. ENSEMBLE ADVERSARIAL TRAINING: ATTACKS AND DEFENSES (TRAMER)
The authors in [14] propose another type of adversarial training method. In this defense, adversarial examples are generated by attacking multiple networks with multiple different attack methods. The designer then trains a new network with the generated adversarial examples. The authors in [14] argue that this adversarial training can make the adversarially trained network more robust against (pure) black-box attacks because it is trained with adversarial examples from different sources (i.e., pre-trained networks).
Why we selected it: The Tramer defense has natural security concepts parallel to BARZ. Both defenses rely on multiple models. In BARZ these models are used for consensus voting, in the Tramer defense they are indirectly relied on (for generating new adversarial examples). Both defenses are also designed with black-box adversaries in mind. Hence, the Tramer defense is a natural choice to test when considering black-box threat models.

H. MIXED ARCHITECTURE (LIU)
In [33], the authors studied the transferability between CNNs with different architectures for the ImageNet dataset. They found that adversarial samples do not always transfer between different architectures, i.e., adversarial samples misclassified by $C_1$ are not always misclassified by $C_2$. Based on this study one could propose a defense made up of different CNNs $C_1, \ldots, C_m$, each with a different structure.
Why we selected it: While not directly proposed in [33], the question of the viability of a mixed architecture defense arises from the results of [33]. As BARZ uses multiple models, would it make a significant difference in robustness if the architectures of the models are mixed? By testing the mixed architecture defense (Liu) we try and empirically answer this question.

I. MITIGATING ADVERSARIAL EFFECTS THROUGH RANDOMIZATION (XIE)
In [18] a defense is developed using a single classifier where a random image transformation $i_r$ is applied to the input x at run time. Unlike BaRT or BARZ, this method does not require retraining the classifier on the different image transformations $i_1, \ldots, i_p$.
Why we selected it: The Xie defense uses image transformations just like BARZ. Hence this defense presents a unique competing concept: achieve security through randomization without costly retraining. Whether gaining this robustness without retraining is possible under a black-box adversary is why we study the Xie defense in this paper.

J. THRESHOLDING NETWORK (VANILLAT)
The thresholding network is a simple defense demonstrated in this paper to highlight the challenging nature of creating robust barrier zones. The threshold network is a detection type of defense that uses a vanilla classifier C and threshold t. If the highest probability p from classifier C falls below threshold t, the sample is marked as adversarial: ⊥.
Why we selected it: When considering barrier zone defenses, the first intuition might be that simply thresholding a vanilla classifier could work. That would mean robustness could be achieved without multiple classifiers or image transformations. We develop the thresholding network defense to empirically demonstrate that a single classifier barrier zone is not sufficient.
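The thresholding rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `threshold_predict` and the use of a plain list of class probabilities are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence, Union

ADVERSARIAL = "⊥"  # null label assigned to inputs the defense rejects


def threshold_predict(probs: Sequence[float], t: float) -> Union[int, str]:
    """Return the most likely class, or ⊥ if its probability is below t.

    probs is assumed to be the softmax output of a vanilla classifier C.
    """
    top_class = max(range(len(probs)), key=lambda k: probs[k])
    if probs[top_class] < t:
        return ADVERSARIAL  # not confident enough: mark as adversarial
    return top_class
```

With `probs = [0.1, 0.85, 0.05]`, a threshold of 0.7 keeps the prediction (class 1), while a threshold of 0.95 rejects the same input as ⊥.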

IV. BARRIER ZONE DEFENSE (BARZ)
With so many different kinds of defenses, a natural question is why do we propose another? In short, the answer is because no current defense we analyze performs well against ALL types of black-box attacks and offers a flexible trade-off between security and clean accuracy. For example, adversarially trained networks like Madry perform poorly against pure black-box attacks (less than 65% robust accuracy on CIFAR-10 [27]). Randomized defenses like Xie and Mul-Def work well against boundary attacks but fail against mixed black-box attacks which can adapt to the randomization (we show results for this in section VI). If we want to increase their security, it is not immediately clear how much clean accuracy will be impacted. Likewise, if we want greater clean accuracy, without completely abandoning the defense, it is not obvious how this can be accomplished. In BARZ by adding more networks this trade-off between security and clean accuracy is transparent. BARZ is also one of the only defenses that performs well across all types of black-box attacks (pure, mixed and boundary).
We present full experimental results in section VI to support these claims and give an individual analysis of every defense with respect to black-box attacks in the appendix. Our main focus is to create a defense where the other proposed methods fall short. We strive to create a high fidelity defense (BARZ) that provides flexibility between security and clean accuracy.

A. SECURITY PRINCIPLES OF BARRIER ZONES
The BARZ defense is based on the concept of barrier zones. Barrier zones are the regions in between classes where if an input falls in this region, it is marked as adversarial. For any new defense the first question is why is it effective, or in this case why do barrier zones provide security? Here we give the mathematical intuition behind this concept.
Suppose we have m classifiers $C_j$ with corresponding attack success rates $\alpha_j = \alpha(C_j, \theta_j)$, where the adversarial sample generation technique $\theta_j$ is specific to classifier $C_j$. Let us construct a new classifier C which uses each $C_j$ to predict a label and outputs the majority decision. If more than one label has the same majority vote, then C outputs ⊥, representing that it does not know how to assign a label. To output a legitimate label, C needs to have a clear majority vote which is not shared by multiple labels.
Consider an adversarial sample generation technique φ tuned to C. Let vote $V_k$ be defined as
$$V_k = \bigl|\{\, j \in \{1, \ldots, m\} : C_j(\phi(x_i, y_i)) = k \,\}\bigr|$$
(assuming deterministic algorithms $C_j$ and φ for simplicity). Only if $V_{y_i} > V_k$ for all labels $k \neq y_i$ will classifier C output the correct label $y_i$. The adversarial example $\phi(x_i, y_i)$ is successful if a label different from $y_i$ and ⊥ is output. That is, there exists a label $\hat{y} \notin \{y_i, \perp\}$ such that $V_{\hat{y}} > V_k$ for all legitimate labels $k \neq \hat{y}$.
This shows that the difference
$$A(y, k) = V_y - V_k$$
represents the 'advantage' of choosing label y over k in classifier C. By using notation A(., .) and translating our characterization of successful adversarial examples, we have attack success rate
$$\alpha(C, \phi) = \Pr_{(x_i, y_i)}\bigl[\, \exists\, \hat{y} \in K \setminus \{y_i, \perp\} : A(\hat{y}, k) > 0 \ \text{for all}\ k \in K \setminus \{\hat{y}, \perp\} \,\bigr] \quad (1)$$
where K is the set of all legitimate class labels together with ⊥. This establishes the conditions for a successful attack on multiple standard classifiers when the output is determined by the majority. We now demonstrate how two security principles in BARZ increase the difficulty of the attack conditions.

1) ABSOLUTE CONSENSUS MAJORITY VOTING
Instead of using simple majority voting, in BARZ we use absolute consensus majority voting. This means that unless all classifiers agree on the same label, the sample is interpreted as adversarial/suspicious, labeled ⊥, and the attack fails. We can see that this specifically changes the threshold > 0 in (1) to ≥ m for a successful attack. Note that while the threshold is now higher, the base conditions for a successful attack, the advantages A(ŷ, k), did not change in value. Our next security principle deals with the base conditions.
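The difference between the two voting rules can be made concrete with a short sketch. The function names below are hypothetical, and each classifier is represented only by the label it outputs for a given sample.

```python
from collections import Counter
from typing import Hashable, Sequence

ADVERSARIAL = "⊥"  # null label: the defense refuses to assign a class


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Simple majority: the most voted label wins, ties yield ⊥."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ADVERSARIAL  # top vote is shared by several labels
    return counts[0][0]


def absolute_consensus_vote(labels: Sequence[Hashable]) -> Hashable:
    """Absolute consensus: a label is output only if every classifier agrees."""
    first = labels[0]
    return first if all(label == first for label in labels) else ADVERSARIAL


def attack_succeeds(output: Hashable, true_label: Hashable) -> bool:
    """An attack counts as successful only if the output is a wrong class label."""
    return output != true_label and output != ADVERSARIAL
```

For the votes `[3, 3, 5, 3]`, simple majority still outputs class 3, whereas absolute consensus outputs ⊥ because one classifier disagrees. An adversary therefore has to move all m votes to the same wrong label, not just more than any other label receives.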

2) INPUT TRANSFORMATIONS
In BARZ each classifier $C_j$ implements its own unique secret linear input transformation $\psi_j$. It is important to note that in this subsection we discuss the secret transformations $\psi_j$ abstractly without designating the specific type of transformation. Theoretically, this allows us to develop the mathematical formulation of the attack success rate of the adversary without assuming the type of transformation. However, for experimentation and defense implementation the image transformation is important and we discuss its choice further in Section IV-B. Once the secret linear input transformation $\psi_j$ is applied, classifier $C_j$ is executed on the transformed input, i.e., it evaluates $C_j(\psi_j(x))$. The reason for individual transformations is to further increase the difficulty of crafting the adversarial example $\phi(x_i, y_i)$. It has already been shown in the literature that vanilla classifiers have high transferability [33]. Therefore, using standard vanilla classifiers without transformations (for all k, $\psi_k$ is the identity function) does not significantly improve the security for the following reason: if the adversarial example causes one vanilla classifier to output a wrong label ŷ, then due to transferability there is a high probability that all standard vanilla classifiers $C_k$ output the same wrong label ŷ. This implies that the absolute consensus majority voting with vanilla classifiers yields a high attack success rate α. See the necessary condition in (1) with absolute consensus majority vote ≥ m.
We can rewrite $\phi(x_i, y_i)$ as the corresponding clean image and noise: $\phi(x_i, y_i) = x_i + \eta_i$. Under this formulation we can reformulate (by using linearity of $\psi_j$) the base condition
$$C_j(\psi_j(\phi(x_i, y_i))) = C_j(\psi_j(x_i + \eta_i)) = C_j(\psi_j(x_i) + \psi_j(\eta_i)). \quad (2)$$
There are several important takeaways from (2). While the transformation $\psi_j$ changes between classifiers, the noise $\eta_i$ the adversary crafts does not change. In essence, for a single sample $x_i$ the adversary must generate noise $\eta_i$ that is invariant to the set of transformations $\psi_1, \ldots, \psi_m$. Specifically, the condition for a successful attack is now:
$$C_j(\psi_j(x_i) + \psi_j(\eta_i)) = \hat{y} \quad \text{for all } j \in \{1, \ldots, m\}, \ \text{with } \hat{y} \notin \{y_i, \perp\}.$$
That is, noise $\psi_j(\eta_i)$ must fool classifier $C_j$, for all j simultaneously, while the adversary can only construct a single noise value $\eta_i$. When we combine (2) with absolute consensus majority voting, our final attack success rate for the adversary can be concisely written as:
$$\alpha(C, \phi) = \Pr_{(x_i, y_i)}\bigl[\, \exists\, \hat{y} \in K \setminus \{y_i, \perp\} : A(\hat{y}, k) \geq m \ \text{for all}\ k \in K \setminus \{\hat{y}, \perp\} \,\bigr],$$
where the votes are now counted over the outputs $C_j(\psi_j(x_i) + \psi_j(\eta_i))$. In the original multi-classifier attack formulation (1), only a majority of the classifiers had to misclassify the adversarial example $\phi(x_i, y_i)$ to a label ŷ such that A(ŷ, k) > 0 for any $k \neq \hat{y}$. Under the BARZ defense it is clear that the new conditions require ALL classifiers and each transformation to be bypassed.

B. REALIZING BARRIER ZONES
In practice barrier zones force the adversary to add noise η greater than a certain magnitude in order to overcome the barrier zone. Because an attack fails if the noise becomes visually perceptible to humans, the adversary is limited in terms of the magnitude of η. In many cases this means the adversary may not be able to overcome the barrier zone and therefore cannot fool the classifier. Barrier zones are shown both in a theoretical diagram and with actual experimental results in Figure 3. The natural question is how can barrier zones be implemented in classifiers? In this subsection we discuss different techniques that can be used to create barrier zones.

1) MULTIPLE CLASSIFIERS
Barrier zones can be created through the use of multiple classifiers. A naïve approach to this method would be to simply use CNNs with different architectures. However, we show that merely using different architectures does not yield security. Specifically, we test such a defense in our results by using one VGG16 and one ResNet56 with majority voting (we denote this as the Liu defense). This has also been shown in the literature in [33]. Other examples of architectural defenses not yielding security include ADP and Mul-Def (which we test in this paper). Instead, to break transferability between networks, we introduce secret image transformations for each classifier. Our defense composed of multiple classifiers (each with their own transformations) is depicted in Figure 4. Each CNN has two simple unique secret image transformations as shown in Figure 4. The first is a fixed linear transformation c(x) = Ax + b, where A is a matrix and b is a vector. After the linear transformation, a resizing operation i is applied to the image before it is fed into the CNN. The CNN corresponding to c and i is trained on clean data {i(c(x))}. Multiple CNNs are used, each with their own resizing operation and A and b components as shown in Figure 4.
From [22] we know adversarial examples are sensitive to image transformations which either distort the value of the pixels in the image or change the original spatial location of the pixels. It is important to note that in this paper we experimentally established that image resizing and linear transformations can reduce transferability. However, there may be other image transformations that can also accomplish this goal.
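The per-network preprocessing i(c(x)) described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the way A and b are drawn here (identity plus small Gaussian noise), the nearest-neighbour resizing, and the function name `make_secret_transform` are all assumptions made for the example. A tiny 8×8 single-channel image is used so that the dense matrix A stays small.

```python
import numpy as np


def make_secret_transform(in_size: int, out_size: int, channels: int, seed: int):
    """Build one fixed secret transformation x -> i(c(x)) for a single network.

    c(x) = A x + b acts on the flattened image; i resizes to out_size x out_size.
    The seed plays the role of the secret (hypothetical choice of A and b).
    """
    rng = np.random.default_rng(seed)
    d = in_size * in_size * channels
    A = np.eye(d) + 0.01 * rng.standard_normal((d, d))  # illustrative choice
    b = 0.01 * rng.standard_normal(d)                   # illustrative choice
    # Nearest-neighbour index map used for the resizing operation i.
    idx = (np.arange(out_size) * in_size) // out_size

    def transform(x: np.ndarray) -> np.ndarray:
        c = (A @ x.reshape(-1) + b).reshape(in_size, in_size, channels)
        return c[idx][:, idx]  # select rows, then columns

    return transform


# Two networks with different secrets and different input sizes.
t1 = make_secret_transform(in_size=8, out_size=8, channels=1, seed=1)
t2 = make_secret_transform(in_size=8, out_size=12, channels=1, seed=2)
x = np.ones((8, 8, 1))
print(t1(x).shape, t2(x).shape)
```

Each CNN would then be trained and evaluated only on the output of its own transform, so the same adversarial noise reaches every network in a differently distorted form.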

2) IMAGE TRANSFORMATION DEFENSES
A few simple questions arise when dealing with image transformations in security. For example, can only one network with image transformations be used without retraining? We test this concept using the defense by Xie (and we show it performs worse than BARZ under the mixed black-box attack). Can only a single network with image transformations and retraining work? In essence we test a single network, with one set of transformations (Guo) and a single network retrained on multiple random transformations (BaRT). Both of these defenses perform worse than BARZ for the mixed black-box attack.
Another valid question is whether detection of adversarial samples alone can be employed. We test this hypothesis in the following way: we use a vanilla network and a confidence threshold, i.e., any sample below a certain confidence score is marked as adversarial. We also test the Odds defense, which employs its own adversarial detection method. In section VI we show that neither thresholding nor the Odds defense is able to outperform BARZ.
It is important to note that it may be possible to further combine other defense techniques such as adversarial training, randomizing some of the image transformations or any number of other techniques. However, the goal of this paper is not to exhaustively test every possible defense combination. The goal is not to test every defense in the literature either. The objective of this work is to provide a defense framework against black-box adversaries that offers clear trade-offs between clean accuracy and security.

C. BARRIER ZONE GRAPHS
In Figure 3 we show barrier zone graphs for various defenses for a single image from CIFAR-10. These graphs are based on the decision region graphs originally presented in [33]. In our graphs, each point on the 2D grid corresponds to the class label of an image $I'$. Green represents that $I'$ has been classified correctly, while red and blue regions represent incorrect class labels. Gray represents that the null (adversarial) class label has been assigned. The image $I'$ is generated from the original image I:
$$I' = I + x \cdot g + y \cdot r \quad (3)$$
Here g represents the gradient of the loss function with respect to I. In (3), r represents a normalized random matrix that is orthogonal to I (note g is also normalized). Variables x and y represent the magnitude of each matrix, which is determined based on the coordinates in the 2D graph.
In essence the graph can be interpreted in the following sense: The origin is the classification of the original image without adversarial perturbations or random noise added. As we move along the x-axis in the positive direction, the magnitude x of the gradient matrix increases. Moving positively along only the x-axis is equivalent to the FGSM attack, where the image is modified by adding the gradient of the loss function (with respect to the input). If we move along the y-axis only, the magnitude y of the random noise matrix increases. This is equivalent to adding random noise to the image. Moving along the positive x-axis and any direction in the y-axis means we are adding an adversarial perturbation and random noise to the original image I. The further from the origin, the greater the magnitude of x and y and hence the larger the distortion that is applied to create $I'$.
In the case where a defense uses multiple networks m, each network i will have a different gradient matrix $g_i$. To compensate for this, we average the individual gradient matrices together before normalizing to get g. It is important to note that while the graphs shown in Figure 3 give experimental proof of the concept of barrier zones, they cannot be used to attack BARZ defenses in practice. When creating the graphs, we have knowledge of the individual gradient matrices $g_i$ for each individual network i. With a black-box adversary only the final output of the defense, O(x), is known. Individual network outputs are not obtainable. Hence, to the best of our knowledge, it is not possible to precisely estimate the individual gradients $g_i$ to construct a barrier zone graph under a black-box adversarial model.
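The construction of a barrier zone graph can be sketched as below. This is an illustrative outline only: `classify` stands in for the full defense output O(x), the gradient matrices are assumed to be supplied by the caller (the sketch does not compute them), and all function names are hypothetical.

```python
import numpy as np


def normalize(m: np.ndarray) -> np.ndarray:
    """Scale a matrix to unit Frobenius norm (assumes m is not all zeros)."""
    return m / np.linalg.norm(m)


def averaged_gradient(grads) -> np.ndarray:
    """Average the per-network gradients g_i, then normalize to obtain g."""
    return normalize(np.mean(grads, axis=0))


def decision_grid(image, g, r, classify, xs, ys) -> np.ndarray:
    """Label every grid point I' = I + x*g + y*r with the defense's output.

    Rows follow ys (random-noise magnitude), columns follow xs
    (gradient magnitude). Labels may be class ids or the null label.
    """
    grid = np.empty((len(ys), len(xs)), dtype=object)
    for row, y in enumerate(ys):
        for col, x in enumerate(xs):
            grid[row, col] = classify(image + x * g + y * r)
    return grid
```

Coloring each cell of the returned grid by its label (correct class, wrong class, or null label) gives the kind of picture described for Figure 3. Because the sketch needs the individual gradients, it reflects the designer's view rather than anything a black-box adversary could compute.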

V. MEASURING DEFENSE PERFORMANCE
In general, when building a defense, there are two primary aspects to consider. The first aspect is security. In the field of adversarial machine learning, security is represented by robust accuracy. When building a defense, the second aspect to consider is the cost. In adversarial machine learning, this cost usually comes in the form of a drop in clean accuracy, γ . In the ideal case, security would be free, i.e., γ = 0. In adversarial machine learning, it is well documented that robustness (security) is not free. There is an inherent trade-off between clean accuracy and robustness [42], [43]. Under these circumstances the natural question is, if a cost is always incurred how do we judge a defense?
In this paper, we answer this question by using a metric that measures this trade-off by taking into account both the robustness and clean accuracy. We introduce the δ-metric to properly understand the combined effect of:
1) A drop γ in clean accuracy from an original clean accuracy p to clean accuracy $p_d = p - \gamma$ for the defense. Here, clean accuracy p corresponds to a vanilla scheme without defense strategy in a non-malicious environment. Similarly, clean accuracy $p_d$ represents the accuracy for the defense measured in the non-malicious environment without adversaries. (We take "clean" to have the additional meaning of being in a non-malicious environment.)
2) The attacker's success rate α against the defense. If the defense recognizes an adversarially manipulated image as an adversarial example, then it outputs the adversarial label ⊥ and the attack is not considered successful. When defining α, we restrict ourselves to adversarial examples for those images which the defense (in their original non-attacked form) properly classifies by their correct labels. The attacker's success rate is then defined as the fraction of adversarial examples that manipulate these images in such a way that the defense produces labels different from the correct labels and different from the adversarial label ⊥. For completeness, the literature defines the robust accuracy or defense success rate as 1 − α. (We note that most defenses cannot recognize an adversarially manipulated image as an adversarial example and do not have an adversarial label as a possible output.)
Proper classification by the defense in the presence of adversaries is one of the following: An image (possibly after adversarial manipulation) is recognized by its correct label (implying the attack did not work). Or, an adversarially manipulated image is given the adversarial label ⊥ (if the defense offers this possibility).
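The definition of the attacker's success rate α can be written out as a short sketch. The function name and the list-based inputs are hypothetical; the logic follows the definition above: only images the defense classifies correctly in their clean form are counted, and an output of ⊥ is not a successful attack.

```python
from typing import Hashable, Sequence

ADVERSARIAL = "⊥"  # adversarial (null) label


def attack_success_rate(
    true_labels: Sequence[Hashable],
    clean_preds: Sequence[Hashable],
    adv_preds: Sequence[Hashable],
) -> float:
    """Fraction of eligible images whose adversarial version gets a wrong class.

    An image is eligible only if the defense labels its clean version correctly.
    """
    eligible = [
        (y, a)
        for y, c, a in zip(true_labels, clean_preds, adv_preds)
        if c == y
    ]
    if not eligible:
        raise ValueError("no correctly classified clean images to evaluate")
    successes = sum(1 for y, a in eligible if a != y and a != ADVERSARIAL)
    return successes / len(eligible)


alpha = attack_success_rate(
    true_labels=[0, 1, 2, 3],
    clean_preds=[0, 1, 2, 0],      # last image is misclassified when clean
    adv_preds=[0, ADVERSARIAL, 1, 1],
)
robust_accuracy = 1 - alpha
```

In this toy input three images are eligible and only one of them is pushed to a wrong class, so α = 1/3 and the robust accuracy is 2/3.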
The probability of proper/accurate classification by the defense in the presence of adversaries is equal to (p − γ)(1 − α) (since the defense properly labels a fraction p − γ if no adversary is present and out of these images a fraction α is successfully attacked if an adversary is present). In other words, (p − γ)(1 − α) is the accuracy of the defense in the presence of adversaries (malicious environment). Going from a non-malicious environment without defense to a malicious environment with defense gives a drop in accuracy of
$$\delta = p - (p - \gamma)(1 - \alpha) = \gamma + (p - \gamma)\alpha.$$
δ can be used to measure the effectiveness of different defenses, the smaller the better. If two defenses offer roughly the same δ, then it makes sense to consider their (γ, α) pairs and choose the defense that either has the smaller α or the smaller γ. From a pure ML perspective, in order for a defense to perform well in a non-malicious environment, we want γ very small or, equivalently, $p_d$ close to p. From a pure security perspective, in order for a defense to perform well in a malicious environment, we want δ to be small. Therefore, for properly comparing defenses we focus on tuples $(\delta = \gamma + (p - \gamma)\alpha,\ p_d = p - \gamma)$, where α corresponds to the best attacker's success rate across the best known attacks from the literature. Notice that the vanilla scheme can be considered in a malicious environment as well and this will correspond to some $(\delta_{van},\ p_d = p)$. Clearly defenses that result in $\delta \geq \delta_{van}$ do not improve over implementing no defense at all (which is the plain vanilla scheme).
In the ideal case δ = 0 when the attack always fails (α = 0) and there is no cost in using the defense (γ = 0). Due to adversarial attacks, α > 0 and, hence, this condition does not occur. Therefore, we look for a defense with the smallest δ, e.g. a defense that has both a low α and low γ . If two defenses have similar δ values, we may simply consider the one with the better clean accuracy, which is precisely what we do in this paper. It is important to note the δ metric is simply one way to understand the trade-off between robustness and clean accuracy. It is by no means the definitive or only way to do so. In this paper, we focus on measuring defenses using the δ metric due to its concise ability to capture both sets of information, α (security) and γ (cost). For those interested in other metrics, we provide all the accuracy measurements separately in graphs and tables in the appendix for all attacks and defenses covered in this paper.
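As a small illustration of how the δ metric combines cost and security, the sketch below computes δ = γ + (p − γ)α using the strongest attack (largest α) against a defense. The function name is hypothetical and the numbers in the usage lines are made up for illustration; they are not results from this paper.

```python
from typing import Iterable


def delta_metric(p: float, p_d: float, alphas: Iterable[float]) -> float:
    """Return delta = gamma + (p - gamma) * alpha for the strongest attack.

    p      : clean accuracy of the vanilla classifier
    p_d    : clean accuracy of the defense
    alphas : attack success rates of all attacks run against the defense
    """
    gamma = p - p_d            # cost: drop in clean accuracy
    worst_alpha = max(alphas)  # the attack the defense is weakest against
    return gamma + (p - gamma) * worst_alpha


# Hypothetical numbers, for illustration only.
delta_defense = delta_metric(p=0.90, p_d=0.80, alphas=[0.10, 0.25])
delta_vanilla = delta_metric(p=0.90, p_d=0.90, alphas=[0.50])
print(delta_defense < delta_vanilla)  # the defense helps only if this holds
```

Here the hypothetical defense gives up 0.10 clean accuracy but ends with δ = 0.30 against δ = 0.45 for the vanilla scheme, so it would count as an improvement under this metric.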

VI. EXPERIMENTAL RESULTS
In this section we provide experimental results to show the effectiveness of the BARZ defense. We also show the improvement our mixed black-box attack gives. We experiment with two popular datasets, Fashion-MNIST [44] and CIFAR-10 [45]. Unlike other reported results in the literature, for every defense, we construct it using the same network architecture whenever possible. We apply the defense to the same dataset and we run every defense under the same set of attacks. This allows us to provide an unprecedented comparison of adaptive black-box attack results. We also provide code related to our experiments on Github: https://github.com/MetaMain/BARZ.

A. THE MIXED BLACK-BOX ATTACK
As stated in Section III, our mixed black-box attack is an expansion of the Papernot attack. The original paper [26] experimented with only a single method for generating adversarial samples, the fast gradient sign method (FGSM). We compare results for the Papernot attack and mixed black-box attack in Figure 7 for the $l_\infty$ norm with maximum perturbation ε = 0.05 for CIFAR-10 and ε = 0.1 for Fashion-MNIST. The attack success rate is measured using 1000 samples from the test set. Overall, by providing the adversary with more data, the untargeted attack success rate on a vanilla network can increase by 49.4% for CIFAR-10 and by 31.1% for Fashion-MNIST. More experimental details for these results are given in the appendix. Some may argue against the practicality of an adversary that has training data access. However, as a defense designer we want to consider the strongest possible hard label black-box adversary. Hence, the mixed black-box attack is clearly necessary for defense validation.
FIGURE 6. Robust accuracies for the untargeted mixed black-box (top), untargeted pure black-box (middle) and untargeted boundary attack (bottom). Note that if a defense is listed but no bar is present, the defense has a 0% robust accuracy against the attack. That is, the attack works 100% of the time on the defense.

B. PURE BLACK-BOX AND BOUNDARY ATTACKS
In addition to the mixed black-box attack, we also consider the pure black-box and boundary attack. Each of these attacks can be further categorized based on how the adversarial samples are generated. For both the pure and mixed black-box attack (proposed in this paper) we use six different adversarial generation methods (FGSM, IFGSM, PGD, MIM, C&W and EAD). For pure black-box attacks we use the same set of generation methods (but the model used in conjunction with the attack is not adaptively trained). For the boundary attacks, we consider HSJA and RayS. In total this represents four types of black-box attacks and 14 different ways adversarial samples can be generated. For CIFAR-10, the maximum perturbation we allow is ε = 0.05 and for Fashion-MNIST the maximum perturbation is ε = 0.1. For RayS we allow 10,000 queries per sample and for HSJA we use a variable query style attack (which we explain in detail in the appendix). Note that in Table 2 some attacks are not applicable to certain defenses. This occurs only for boundary attacks for 2 defenses (BaRT and Odds). This is due to the computational cost of non-parallelizable prediction during the run time of the boundary attacks. We fully explain this in the appendix along with precise attack details for all the attacks.

C. DEFENSES
We experiment with 11 defenses (BARZ, vanilla thresholding, Guo, Liu, ADP, Xie, Madry, Tramer, Mul-Def, BaRT and Odds). In terms of network architecture, we use ResNet56 [46] for the networks in the CIFAR-10 defenses and VGG16 [47] for the networks in the Fashion-MNIST defenses. It is important to note that the results reported here do not always match the literature results identically. This is due to differences in architectures and datasets. For example, the authors of BaRT never published a CIFAR-10 version of their defense, so our BaRT implementation will have different accuracy than what they report for ImageNet. Likewise, Madry's original CIFAR-10 defense was trained using a Wide ResNet whereas we use ResNet56V2. We use the same base architecture for every defense (whenever possible) and the same dataset to make our comparisons as valid as possible. Due to the limited space, we cannot describe the full implementation details of every defense here. We encourage the reader to examine the appendix for further details if interested.

1) BARZ AND THRESHOLDING DEFENSES
In this paper we experiment with BARZ and also a naive defense which we call vanilla thresholding. A common misconception is that by merely thresholding the output of a vanilla classifier (i.e. marking a sample as adversarial if the network is not confident in its prediction) then all black-box attacks can be mitigated. We provide results for the 70%, 95% and 99% thresholding network to show this is simply not the case.
For BARZ, we realize the barrier zones through image transformations. Specifically, each network has an image transformation selected from mappings c(x) = Ax + b. We explain how we chose the randomized A and b based on the dataset in the appendix. We can consider an image transformation $c_j(x)$ as an extra randomly fixed layer added to the layers which form the j-th CNN. We tested three of these designs: One with 8 networks (BARZ-8), each using a different image resizing operation from 32 to 32, 40, 48, 64, 72, 80, 96 and 104. The second with 4 networks (BARZ-4), being the subset of the 8 networks that use image resizing operations from 32 to 32, 48, 72 and 96. The third with 2 networks (BARZ-2), being a subset of the 8 networks that use image resizing operations from 32 to 32 and 104.
We also consider a randomized version of BARZ which we denote as BARZ-xRy. In this version, a subset of y networks (selected from x networks) is used to do the absolute majority vote on a sample. For instance, in BARZ-8R2 every time a sample is submitted, two of the eight networks are randomly selected to classify the sample.
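The randomized variant can be sketched as follows. This is an illustration under assumptions: each network is modeled as a plain callable that returns a label (its own secret transformation is taken to be applied inside that callable), and the name `barz_x_r_y` is hypothetical.

```python
import random
from typing import Callable, Hashable, Sequence

ADVERSARIAL = "⊥"  # adversarial (null) label


def barz_x_r_y(
    networks: Sequence[Callable[[object], Hashable]],
    sample: object,
    y: int,
    rng: random.Random,
) -> Hashable:
    """Classify with y networks drawn at random from the x available ones.

    The y selected networks must all agree; otherwise the output is ⊥.
    """
    chosen = rng.sample(list(networks), y)  # fresh subset for every query
    labels = [net(sample) for net in chosen]
    first = labels[0]
    return first if all(label == first for label in labels) else ADVERSARIAL
```

With eight networks and y = 2 this corresponds to the BARZ-8R2 configuration: every submitted sample is judged by a newly drawn pair, so repeated queries on the same input need not be answered by the same networks.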

D. EXPERIMENTAL ANALYSIS
The main results for our paper are given in Table 2 for CIFAR-10 and Fashion-MNIST and the robust accuracy is visually shown in Figure 6. We compute the δ metric for every defense based on the attack that the defense is weakest to (i.e. has the lowest robust accuracy). For example, if the BARZ-8 defense has a robust accuracy of 60% against RayS (60% of the adversarial samples do not fool the defense) and a robust accuracy of 39% against HSJA, then HSJA is used to compute the BARZ-8 boundary δ metric. Visually the results for the worst case δ metric for the pure black-box attack, mixed black-box attack and boundary attack adversaries are given in Figures 5, 8 and 2.
In terms of performance, our proposed defense (BARZ) outperforms every other defense for both CIFAR-10 and Fashion-MNIST. On CIFAR-10, BARZ-4 gives the best trade-off between security and accuracy for δ mixed and δ pure, and BARZ-8 has the best robust accuracy (92.6% for mixed and 92.8% for pure). For boundary attacks, BARZ-8R6 gives the best trade-off for CIFAR-10 as well as the best robust accuracy (87%). Likewise, for Fashion-MNIST, BARZ-8 has the lowest δ for the mixed and pure black-box attacks. For Fashion-MNIST, BARZ-8 also has the best pure and mixed robust accuracy with 90.6% and 89.9% respectively. For the boundary attacks for Fashion-MNIST, we can see BARZ-8R2 gives the best trade-off but Madry gives slightly better robust accuracy (96% for Madry versus 92% for BARZ-8R2). For those interested in the conventional robust accuracy measurement, we give the overall result in Figure 1. This figure shows the minimum robust accuracy across all black-box attacks for each defense. We can only summarize the main results within this section. In the appendix, we compare the results for the 11 defenses in greater depth.

VII. CONCLUSION
In this paper, we advance the field of adversarial machine learning by providing a new black-box attack and a novel black-box defense based on barrier zones. Our new attack is experimentally shown to be stronger than the original Papernot attack. It also outperforms boundary and pure black-box attacks on defenses like Xie and Mul-Def. Second, and most importantly, we develop a new barrier zone based defense. Our defense outperforms all 10 other defense methods we tested under pure, mixed and boundary based black-box attacks. When comparing across all black-box attacks and datasets tested in this paper, our best defense configuration gives over 85% robust accuracy for CIFAR-10 and Fashion-MNIST, an improvement of over 30% compared to the next best defense. Overall we develop the first barrier zone defense (BARZ), experimentally shown to be robust against 14 different types of black-box attacks.

APPENDIX
A. EXPERIMENTAL DEFENSE RESULTS
In this section, we present our supplementary experimental results for
• the mixed targeted and untargeted black-box attacks,
• the pure targeted and untargeted black-box attacks and
• the boundary attacks (untargeted HopSkipJump [25] and RayS [34]).
We run these attacks on ten different defense strategies, Barrage of Random Transforms (BaRT) [22], The Odds are Odd (Odds) [23], Ensemble Diversity (ADP) [24], Madry's Adversarial Training (Madry) [27], Multi-model-based Defense (Mul-Def) [21], Countering Adversarial Images using Input Transformations (Guo) [20], Ensemble Adversarial Training: Attacks and Defenses (Tramer) [14], Mixed Architecture (Liu) [33], Mitigating adversarial effects through randomization (Xie) [18], Thresholding Networks (a basic proof of concept defense developed in this paper) and Barrier Zones (BARZ) with the CIFAR-10 [45] and Fashion-MNIST [44] datasets. The adversarial sample generation is done by running white-box attacks on synthetic models (a model obtained from either a pure or mixed black-box attack). The six white-box attacks used for adversarial sample generation are FGSM [8], BIM [38], MIM [39], PGD [27], C&W [11] and EAD [40]. We also test the defenses under boundary black-box attacks (HopSkipJump [25] and RayS [34]).
We start our section with a discussion on the robustness of defenses under the black-box attacks in this paper. Figures 6 and 9 represent the robust accuracies of the defenses under the different black-box attacks with the Fashion-MNIST and CIFAR-10 datasets. For targeted attacks, Figure 10 shows how the defenses perform in two dimensions, clean accuracy versus delta (δ). We have the following main observations from these figures.
1) Mixed black-box attacks are stronger than pure black-box attacks and untargeted attacks are more powerful than targeted ones. Compared to pure black-box attacks, mixed black-box attacks are given more information about the target model (original training data and query access to the target model to label generated synthetic data); for this reason mixed black-box attacks should be stronger than pure black-box attacks. Because targeted attacks can be considered as an optimization problem with more constraints than untargeted attacks, targeted attacks should take more effort to run than untargeted ones, and are therefore less powerful.
2) Targeted pure black-box attacks seem to not present a strong attack model. This is supported by the fact that the vanilla scheme (which implements no defense at all) already offers very good robustness (i.e., it already has a high defense accuracy against targeted pure black-box attacks). As a result, almost all considered defenses offer good robustness and clean accuracy under this threat model. This explains why the defenses are relatively close together in the plots for targeted pure black-box attacks in Figure 10.
3) As observed and discussed above, mixed black-box attacks are stronger than pure black-box attacks. This explains why a subset of the considered defenses can still significantly improve over the vanilla scheme for targeted mixed black-box attacks as shown in Figure 10.
4) For the untargeted boundary attacks, there are many defenses which have 0% robust accuracy. Hence, we do not see any bars for these in Figure 6; for example, Vanilla, VanillaT-0.7, etc. have 0% robust accuracy.
5) The most interesting and important observations from Figures 5, 8, 2, 6, 9, and 10 are as follows: a) There exists a group of defenses which enjoy a high robustness and clean accuracy, i.e., the defenses lie in the upper left corner with small delta value and high clean accuracy and b) BARZ defenses always belong to that group in any of the aforementioned scenarios.
These observations show that the BARZ family offers good robustness and clean accuracy compared to other defenses in all scenarios. We present more detailed attack and defense results in the next sections for Fashion-MNIST and CIFAR-10. Note that all the detailed results in the next two sections have been visualized in Figures 5, 8, 2, 6, 9, and 10, where the most important discussions and observations on these detailed results have been summarized above.

B. FASHION-MNIST: ATTACKS AND DEFENSES
The results for Fashion-MNIST are described in Tables 3, 4, 5, 6, and 7. Recall the formula for the δ metric:
$$\delta = \gamma + (p - \gamma)\,\alpha \qquad (6)$$
where p is the clean accuracy of the vanilla classifier (i.e., no defense at all and without any adversarial presence), γ is the drop in clean accuracy, i.e., γ = p − p_d for p_d representing the clean accuracy of the defense while no attacker is present, α is the attacker's success rate against the defense and β is the robust accuracy or defense success rate (also called defense accuracy) and is equal to 1 − α.
δ can be used to measure the effectiveness of different defenses; the smaller the better. If two defenses offer roughly the same δ, then it makes sense to consider their (γ, α) pairs and choose the defense that either has the smaller α or the smaller γ. For Fashion-MNIST and CIFAR-10, p = 0.9356 and 0.9278, respectively. The value of δ is computed by combining p of the vanilla classifier and p_d of the considered defense, and by looking at the strongest attack among all implemented attacks on the given defense (this corresponds to the maximum over the attacker's success rates α for the specific set of attacks considered or, equivalently, to the minimum over the various defense success rates β). For example, the δ metric for BARZ-8 in Table 3 is computed as follows: we substitute p = 0.9356, p_d = 0.7779, and the minimal β = 0.986 among all (currently known) targeted mixed black-box attacks (in this case corresponding to the FGSM-T attack) into formula (Eq. 6) for δ. This results in δ = 0.168591.
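The worked example above can be checked numerically. The following is a minimal sketch, assuming the δ metric has the form δ = γ + (p − γ)α (equivalently p − p_d·β), which reproduces the value quoted in the text; the function name `delta_metric` is hypothetical and only used for illustration.

```python
def delta_metric(p, p_d, beta_min):
    """Assumed form of the metric: delta = gamma + (p - gamma) * alpha."""
    gamma = p - p_d           # drop in clean accuracy caused by the defense
    alpha = 1.0 - beta_min    # success rate of the strongest attack considered
    return gamma + (p - gamma) * alpha


# Values quoted in the text for BARZ-8 on Fashion-MNIST
# (targeted mixed black-box attacks, worst case FGSM-T).
delta_barz8 = delta_metric(p=0.9356, p_d=0.7779, beta_min=0.986)
print(round(delta_barz8, 6))  # prints 0.168591
```

Under this form, a defense with no accuracy drop (γ = 0) and a fully successful attacker (α = 1) gets δ = p, while a perfect defense gets δ = 0.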

Discussion:
We have the following observations from the aforementioned tables:
1) The BARZ family achieves the smallest δ for any attack scenario. Figures 5, 8, 2 and 10 reflect this fact.
2) Many defenses (such as Guo, Liu, ADP, Tramer) have a very high clean accuracy (i.e., close to the clean accuracy of the vanilla classifier), but have a very large δ. If we have a close look at the results presented in Figures 6 and 9 or Tables 3, 5, 6 and 7, we can see that they are vulnerable to black-box attacks. In other words, they offer no security.
3) By combining the drop γ in clean accuracy and the increment in robust accuracy β, the δ metric can be used for understanding how well a defense performs in the presence of attackers. In order to have a further detailed evaluation, we need to separately look at the attack success rate α (or, equivalently, robust accuracy β) and clean accuracy of the defense p_d.
4) From Tables 3, 4, 5 and 6 we conclude that mixed black-box attacks are more efficient than pure black-box attacks and untargeted black-box attacks are stronger than targeted ones. When looking at Table 7, boundary attacks are much stronger than mixed and pure black-box attacks.
5) BARZ can realize different combinations of defender accuracy p_d and attacker's success rate α by tuning the number of classifiers in the defense.
6) BARZ-8R2, Madry and MulDef have the smallest δ values for boundary attacks. For the BARZ and MulDef defenses the reason is that for a given input x, for each evaluation, these defenses introduce some randomness. As a consequence, the output class label can be changed. This strongly affects the efficiency of boundary attacks which need to accurately estimate the gradients of many images (and due to the introduced randomness these estimates become less accurate).

C. CIFAR-10: ATTACKS AND DEFENSES
The results for CIFAR-10 are described in Tables 8, 9, 10, 11 and 12.

Discussion:
We have the following observations from the aforementioned tables (identical to Fashion-MNIST with a slight difference in item 6):
1) The BARZ family achieves the smallest δ for any attack scenario. Figures 5, 8, 2 and 10 reflect this fact.
2) Many defenses (such as Guo, Liu, ADP, Tramer) have a very high clean accuracy (i.e., close to the clean accuracy of the vanilla classifier), but have a very large δ. If we have a close look at the results presented in Figures 6 and 9 or Tables 8, 10, 11 and 12, we can see that they are vulnerable to black-box attacks. In other words, they offer no security.
3) By combining the drop γ in clean accuracy and the increment in robust accuracy β, the δ metric can be used for understanding how well a defense performs in the presence of attackers. In order to have a further detailed evaluation, we need to separately look at the attack success rate α (or, equivalently, robust accuracy β) and clean accuracy of the defense p_d.
4) From Tables 8, 9, 10 and 11 we conclude that mixed black-box attacks are more efficient than pure black-box attacks and untargeted black-box attacks are stronger than targeted ones. When looking at Table 12, boundary attacks are much stronger than mixed and pure black-box attacks.
5) BARZ can realize different combinations of defender accuracy p_d and attacker's success rate α by tuning the number of classifiers in the defense.
6) BARZ-8R6/2, Xie and MulDef have the smallest δ values for boundary attacks. The reason is that for a given input x, for each evaluation, these defenses introduce some randomness. As a consequence, the output class label can be changed. This strongly affects the efficiency of boundary attacks which need to accurately estimate the gradients of many images (and due to the introduced randomness these estimates become less accurate).

APPENDIX B EXPERIMENTAL ATTACK RESULTS
As we mentioned in the main body of the paper, the mixed black-box attack can be thought of as an extension of the Papernot attack. In this section we give experimental evidence with the CIFAR-10 dataset to support our claims. In Figure 11 we show a graphical representation of the attack success rate as a function of training data. The x-axis of the graph shows the percent of training data used at the start of the attack to build the synthetic model. The y-axis shows the success rate of the attack on a vanilla (undefended) model. For this experiment we fix several variables in order to make the comparison. We use the FGSM attack on the synthetic model with ε = 0.05 to create adversarial samples. We fix the number of iterations in the attack to be N = 4 for all the experiments and λ = 0.1. In Papernot's original attack on an MNIST classifier, 0.3% of the original training data is used. We show that as the amount of training data (and subsequent queries) increases, the attack success rate increases. When the percent of training data reaches 100% we have what we refer to as the mixed black-box attack. This represents a substantial increase in the success rate of the attack. In our experiment for CIFAR-10 the success rate increases from 24.7% to 66.6%, an increase of 41.9 percentage points.
On certain defenses the mixed black-box attack also outperforms other attacks. For example consider the randomized Xie defense. The robust accuracy for CIFAR-10 is 85% under untargeted boundary attacks. However, the robust accuracy is the lowest under the untargeted mixed black-box attack, at just 26.2%. Likewise, the mixed black-box attack outperforms the boundary attacks on MulDef-4 and MulDef-8 (although pure black-box attacks here are the strongest by a slim 1% margin). If we consider Fashion-MNIST we also can see defenses on which the mixed black-box outperforms the other attacks. On Fashion-MNIST the lowest robust accuracy is obtained under the mixed black-box attack for the Xie, MulDef and Madry defenses.
To conclude, the purpose of our analysis here is two-fold. First, through our experiments we show that when the conditions are held the same, the mixed black-box attack clearly outperforms the original Papernot attack. Second, we show the mixed black-box attack is the most effective attack against certain defenses. To be clear, we DO NOT claim to have the universally strongest black-box attack. We merely show that as different defenses employ different defense techniques, certain black-box attacks will be more effective than others. Thus, it is imperative to test a wide range of black-box attacks (as is done in this paper). Within this range of attacks, the mixed black-box attack is clearly necessary for validation of a defense.

APPENDIX C ADVERSARIAL ATTACK DESCRIPTIONS

D. PURE AND MIXED BLACK-BOX ATTACK
As we mentioned in the main paper, the mixed black-box attack is an extension of the original attack proposed by Papernot [26]. Here we denote g as the synthetic network for the oracle based black-box attack from [26]. The attacker uses an oracle O which represents black-box access to the target model f. The oracle access in this case provides a class label F(f(x)) for a query x (and not the score vector f(x)). Initially, the attacker has part of the training data set X, i.e., they know D = {(x, F(f(x))) : x ∈ X_0} for some X_0 ⊆ X. Notice that a single iteration (N = 1) reduces the attack to an algorithm which does not need any access to the oracle O to build the synthetic model; this reduced algorithm is the one used in the pure black-box attack [10], [33], [48]. In the mixed black-box attack we assume the most capable black-box adversary in Algorithm 1, with access to the entire training data set, X_0 = X (notice that this excludes the test data used for evaluating the attack success rate).
In order to construct a synthetic network the attacker chooses a-priori a substitute architecture G for which the synthetic model parameters θ_g need to be trained. The attacker uses known image-label pairs in D to train θ_g using a training method M (e.g., Adam [49]). In each iteration the known data is doubled using the following data augmentation technique: For each image x in the current data set D, black-box access to the target model gives label l = O(x). The Jacobian of the synthetic network score vector g with respect to its parameters θ_g is evaluated/computed for image x. The signs of the column in the Jacobian matrix that correspond to class label l are multiplied with a (small) constant λ; this constitutes a vector which is added to x. This gives one new image for each x and this leads to a doubling of D. After N iterations the algorithm outputs the trained parameters θ_g for the final augmented data set D.
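The augmentation step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes, as in the Jacobian-based augmentation of [26], that the derivative of the score vector is taken with respect to the input image (so that the resulting vector can be added to x), it uses a toy linear softmax substitute with parameters `W` and `b`, and the `oracle` function is a hypothetical stand-in for label-only access to the target model. Retraining of the substitute between rounds is omitted.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def input_jacobian(W, b, x):
    """Jacobian of the scores softmax(W x + b) w.r.t. x; shape (classes, dim)."""
    s = softmax(W @ x + b)
    return (np.diag(s) - np.outer(s, s)) @ W


def augment(data, oracle, W, b, lam=0.1):
    """One augmentation round: every x yields one new point, so the set doubles."""
    new_points = []
    for x in data:
        label = oracle(x)                  # black-box label query O(x)
        J = input_jacobian(W, b, x)
        # Row `label` holds the derivative of the score of class `label`.
        new_points.append(x + lam * np.sign(J[label]))
    return data + new_points


rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)      # toy substitute parameters
oracle = lambda x: int(np.argmax(x[:3]))         # hypothetical target model
data = [rng.normal(size=4) for _ in range(5)]
for _ in range(2):                               # two rounds: 5 -> 10 -> 20 points
    data = augment(data, oracle, W, b)
print(len(data))
```

In the full attack the substitute would be retrained on the enlarged data set after each round, so the Jacobian used in the next round reflects the updated parameters.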

E. ADVERSARIAL SAMPLE GENERATION
After the synthetic model is trained, adversarial samples need to be created from the synthetic model to attack the defense. Any white-box attack can be run on the synthetic model to create an adversarial example. The adversary can then check if this example fools the defense. To reiterate, in this paper we focus on a black-box adversary so running white-box attacks directly on any defense is not within the scope of our adversarial model. We briefly introduce the following commonly used white-box attacks that we use for adversarial sample generation:
Fast Gradient Sign Method (FGSM) - [8]: Computes $x' = x + \epsilon \times \mathrm{sign}(\nabla_x L(x, l; \theta))$ where L is a loss function (e.g., cross entropy) of model f.
Basic Iterative Methods (BIM) - [38]: $x'_i = \mathrm{clip}_{x,\epsilon}(x'_{i-1} + \frac{\epsilon}{r} \times \mathrm{sign}(\nabla_{x'_{i-1}} L(x'_{i-1}, l; \theta)))$ where $x'_0 = x$, r is the number of iterations, and clip is a clipping operation.
Momentum Iterative Methods (MIM) - [39]: This is a variant of BIM that uses a momentum trick to create the gradient $g_i$, i.e., $x'_i = \mathrm{clip}_{x,\epsilon}(x'_{i-1} + \frac{\epsilon}{r} \times \mathrm{sign}(g_i))$.
Projected Gradient Descent (PGD) - [27]: This is also a variant of BIM where the clipping operation is replaced by a projection operation.
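The gradient-sign attacks above can be illustrated on a toy model. The sketch below assumes a linear softmax classifier with cross-entropy loss (so the input gradient has a closed form), a step size of ε/r per iteration, and the L1-normalized momentum accumulation commonly used for MIM; the function names are hypothetical, and clipping to the valid pixel range is omitted for brevity.

```python
import numpy as np


def loss_grad(W, b, x, label):
    """Gradient of the cross-entropy loss of softmax(W x + b) with respect to x."""
    z = W @ x + b
    s = np.exp(z - z.max())
    s /= s.sum()
    s[label] -= 1.0
    return W.T @ s


def fgsm(W, b, x, label, eps):
    """Single step of size eps along the sign of the loss gradient."""
    return x + eps * np.sign(loss_grad(W, b, x, label))


def bim(W, b, x, label, eps, r, mu=None):
    """BIM with r steps of size eps / r; passing mu switches to the MIM update."""
    x_adv, g = x.copy(), np.zeros_like(x)
    for _ in range(r):
        grad = loss_grad(W, b, x_adv, label)
        if mu is None:
            g = grad                                          # plain BIM
        else:
            g = mu * g + grad / (np.abs(grad).sum() + 1e-12)  # momentum (MIM)
        x_adv = x_adv + (eps / r) * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)              # stay in the eps-ball
    return x_adv


rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 6)), np.zeros(3)   # toy stand-in for a synthetic model
x = rng.normal(size=6)
label = int(np.argmax(W @ x + b))
eps = 0.05
adv_examples = [
    fgsm(W, b, x, label, eps),
    bim(W, b, x, label, eps, r=10),
    bim(W, b, x, label, eps, r=10, mu=1.0),
]
```

Each adversarial example stays within an l∞ distance of ε from x, which is the constraint the clipping operation enforces.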
Carlini and Wagner Attack (C&W) - [11]: We define $x'(\omega) = \frac{1}{2}(\tanh \omega + 1)$ and $g(x) = \max(\max(s_i : i \neq l) - s_l, -\kappa)$ where $f(x) = (s_1, s_2, \ldots)$ is the score vector of input x of classifier f and κ controls the confidence on the adversarial examples. The adversary builds the following objective function for finding the adversarial noise:
$$\min_{\omega} \; \|x'(\omega) - x\|_2^2 + c \cdot g(x'(\omega))$$
where c is a constant chosen by a modified binary search.
Elastic Net Attack (EAD) - [40]: This is a variant of the C&W attack whose objective function adds an l 1 penalty on the perturbation (elastic-net regularization).
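The ingredients of the C&W attack described above can be written out as a short sketch. It assumes the standard l 2 form of the objective (squared distortion plus c times the margin term) and reads the label l as the class whose score the attacker wants to be largest; `score_fn` is a hypothetical callable returning the score vector, and the optimization over ω and the binary search over c are not shown.

```python
import numpy as np


def x_of_omega(omega):
    """Change of variables x'(w) = (tanh(w) + 1) / 2, which keeps pixels in [0, 1]."""
    return 0.5 * (np.tanh(omega) + 1.0)


def margin(scores, label, kappa=0.0):
    """g = max(max_{i != label} s_i - s_label, -kappa)."""
    others = np.delete(scores, label)
    return max(others.max() - scores[label], -kappa)


def cw_objective(omega, x, score_fn, label, c, kappa=0.0):
    """Squared l2 distortion plus c times the margin term."""
    x_adv = x_of_omega(omega)
    distortion = np.sum((x_adv - x) ** 2)
    return float(distortion + c * margin(score_fn(x_adv), label, kappa))


scores = np.array([0.1, 0.7, 0.2])
g_top = margin(scores, label=1)      # class 1 already has the top score
g_other = margin(scores, label=0)    # class 0 trails the top score by 0.6
```

A larger κ lets the margin term keep decreasing after the label has flipped, which is how the attack trades extra distortion for higher-confidence adversarial examples.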

APPENDIX D EXPERIMENTAL IMPLEMENTATION DETAILS AND MISC

F. IMPLEMENTATION OF BARZ
In BARZ, we use image transformations that are composed of a resizing operation i(x) and a linear transformation c(x) = Ax + b. In a CNN implementation one can think of i(c(x)) as an extra layer in the CNN architecture itself. We refer to this extra layer as the protected layer. An input image x at a protected layer in BARZ is linearly transformed into an image i(c(x)) before it enters the corresponding CNN network. For the resize operations i(·) used in each of the protected layers in BARZ, we choose sizes that are larger than or equal to the original dimensions of the image data. We do this to prevent the loss of information that downsizing would create (which would hurt the clean accuracy of BARZ). In our experiments we use BARZ with 2, 4, and 8 protected layers. Each protected layer gets its own resize operation i(·). When using 8 protected layers, we use image resizing operations from 32 to 32, 40, 48, 64, 72, 80, 96, 104. Each protected layer is differentiated from every other protected layer by how much resizing it implements. This leads to less transferability between the protected layers and as a result we expect to see a wider barrier zone which diminishes the attacker's success rate. When using 4 protected layers, we use a copy of the 4 protected layers from BARZ with 8 networks that correspond to the image resizing operations from 32 to 32, 48, 72, 96. When using 2 protected layers, we use a copy of the 2 protected layers from BARZ with 8 networks that correspond to the image resizing operations from 32 to 32 and 104.
For each protected layer, the linear transformation c(x) = Ax + b is randomly chosen from some statistical distribution (the distribution is public knowledge and therefore known by the adversary). Design of the statistical distribution depends on the complexity of the considered data set (in our case we experiment with Fashion-MNIST and CIFAR-10). For CIFAR-10 we take matrices A_i to be identity matrices (this also makes A the identity matrix in the vector representation of c(x)) and we use the same matrix b for each of the matrices b_i, i.e., b_1 = b_2 = b_3 = b. This means that we use the same random offset in the red, blue, and green values of a pixel. We made this design decision because for CIFAR-10 we found that a fully random A creates large drops in clean accuracy, even when the network is trained to learn such distortions. As a result, for data sets with high spatial complexity like CIFAR-10, we do not select A randomly. We choose A to be the identity matrix. Likewise for b we only randomly generate 35% of the matrix values and leave the rest as 0. For the randomly generated values, we choose them from a uniform distribution from −0.5 to 0.5.
For datasets with less spatial complexity like Fashion-MNIST, we equate matrices A = A_1 = A_2 = A_3 and b = b_1 = b_2 = b_3 and select A and b as random matrices. The values of A and b are selected from a Gaussian distribution with µ = 0 and σ = 0.1.
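To make the CIFAR-10 variant of a protected layer concrete, the sketch below builds the transformation i(c(x)) with A equal to the identity and a sparse random offset b shared across the three colour channels. It is an illustration under stated assumptions rather than the authors' code: the interpolation used for resizing is not specified in the text, so nearest-neighbour upsampling is assumed, and the function names and seeds are hypothetical.

```python
import numpy as np


def make_offset(height, width, frac=0.35, low=-0.5, high=0.5, seed=0):
    """Secret offset b: `frac` of the entries are uniform in [low, high), the rest 0."""
    rng = np.random.default_rng(seed)
    b = np.zeros(height * width)
    n_random = int(round(frac * height * width))
    idx = rng.choice(height * width, size=n_random, replace=False)
    b[idx] = rng.uniform(low, high, size=idx.size)
    return b.reshape(height, width, 1)   # broadcast: same offset for R, G and B


def resize_nearest(x, size):
    """Nearest-neighbour resize to size x size (interpolation is an assumption)."""
    h, w = x.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]


def protected_layer(x, b, size):
    """i(c(x)) with A = identity: add the offset, then resize."""
    return resize_nearest(x + b, size)


x = np.random.default_rng(42).random((32, 32, 3))   # stand-in for a CIFAR-10 image
sizes = [32, 40, 48, 64, 72, 80, 96, 104]           # resize targets for 8 layers
outputs = [protected_layer(x, make_offset(32, 32, seed=k), s)
           for k, s in enumerate(sizes)]
print([o.shape for o in outputs])
```

Each protected layer draws its own offset, and in the defense that draw is kept secret from the attacker; the fixed seeds here only make the sketch reproducible.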

G. ATTACK AND DEFENSE PARAMETERS
In order to implement a black-box attack we first run Algorithm 1 which trains a synthetic network g. Next, out of the test data (each dataset has 10,000 samples in our setup) we select the first 1000 samples correctly identified by the defense. For each of the 1000 samples we run a certain white-box attack to produce 1000 adversarial examples. The attacker's success rate is the fraction of adversarial examples which change l to the desired new randomly selected label l′ in a targeted attack or to any other label l′ ≠ ⊥ in an untargeted attack.
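The success-rate computation described above can be sketched as follows. This is a hedged illustration: the function name and arguments are hypothetical, and it assumes that in the untargeted case an adversarial example only counts as a success when the defense outputs a class label that differs from the original label and is not the special adversarial label (passed here as `adversarial_label`).

```python
def attack_success_rate(true_labels, adv_predictions, targets=None,
                        adversarial_label=None):
    """Fraction of adversarial examples that count as successful attacks."""
    successes = 0
    for i, (label, pred) in enumerate(zip(true_labels, adv_predictions)):
        if targets is not None:
            # Targeted: the defense must output the chosen target label.
            successes += pred == targets[i]
        else:
            # Untargeted: any wrong label counts, unless flagged as adversarial.
            successes += pred != label and pred != adversarial_label
    return successes / len(true_labels)


true_labels = [0, 1, 2, 3]
untargeted = attack_success_rate(true_labels, [0, 2, 2, "adv"],
                                 adversarial_label="adv")
targeted = attack_success_rate(true_labels, [1, 2, 2, 3], targets=[1, 2, 0, 0])
print(untargeted, targeted)  # prints 0.25 0.5
```

The robust accuracy β reported in the tables is then one minus this rate.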
The parameters for the adversarial generation techniques (white-box attacks) used in conjunction with our synthetic model for both the mixed black-box attack and pure black-box attack can be found in Table 13. For all attacks we use the l ∞ norm except for the Carlini and Wagner attack. For the Carlini and Wagner attack only the l 2 implementation (given by the authors) has a run time efficient enough for our current hardware setup (to test on 10 defenses and 2 datasets). Future work may include trying the mixed black-box attack with the l ∞ norm if efficient implementations of the Carlini and Wagner attack become available. The precise set-up for our experiments is given in Tables 14, 15, and 16. Table 14 details the training method T in Algorithm 1. For the evaluated data sets Fashion-MNIST and CIFAR-10 without data augmentation, we enumerate in Table 15 the amount |X_0| of training data together with parameters λ and N (λ = 0.1 and N = 6 are taken from the oracle based black-box attack paper of [26]; notice that a test data set of size 10,000 is standard practice; all remaining data serves training and this is entirely accessible by the attacker). Table 16 depicts the architecture G of the CNN network of the synthetic network g for the different data sets; the structure has several layers (not to be confused with a 'protected layer' in BARZ, which is an image transformation together with a whole CNN in itself). The adversary attempts to attack BARZ and will first learn a synthetic network g with architecture G that corresponds to Table 16. Notice that the image transformations are kept secret and for this reason the attacker can at best train a synthetic vanilla network. Of course the attacker does know the set from which the image transformations in BARZ are taken and can potentially try to learn a synthetic CNN for each possible image transformation and do some majority vote (like BARZ) on the outputted labels generated by these CNNs. However, there are exponentially many transformations, making such an attack infeasible.

H. BOUNDARY ATTACK COMPUTATIONAL COMPLEXITY AND TARGETED BOUNDARY ATTACKS
In the main body of the paper we mention that boundary attacks are not applicable to either the Odds are Odd (Odds) or the Barrage of Random Transforms (BaRT) defense. For pure and mixed black-box attacks we can efficiently parallelize the evaluation of many samples using either the GPU or multiple CPUs (in the case of image transformations). However, the boundary attacks require a large number of evaluations done sequentially (e.g., 10,000 queries) so we cannot take advantage of the previously mentioned parallelism. This causes the run time of boundary attacks for these defenses with our standard implementation to be on the order of weeks. These attacks are therefore not feasible with our current setup (28 core CPU machine and 2 Titan V GPUs).
It is also worth noting that in this paper we do not directly consider targeted boundary attacks. Although we do provide experimental details for some other targeted black-box attacks, in this paper our main focus is on the untargeted attack. As we already have 12 targeted attacks presented in this paper (6 mixed black-box and 6 pure black-box types), we leave the targeted boundary attack as potential future work.

I. FUTURE WORK
There are several promising directions for possible future work. From a security perspective, our paper has demonstrated the effectiveness of image transformations for black-box robustness. We experimented with a set of image transformations that we found to be effective in creating barrier zones. However, large scale studies on the transferability of single and fixed combinational image transformations have not yet been done, to the best of our knowledge. Determining exactly which image transformations are capable of distorting adversarial noise while maintaining robustness would bring the field much closer to establishing a set of image transformations as security primitives.
On the machine learning side, enhancement to the clean accuracy of the BARZ defense may be possible through the introduction of novel architectures. Specifically, the Big Transfer Models [50] are a class of CNNs that have shown remarkable performance on datasets like CIFAR-10 and CIFAR-100. Using these new architectures could be one possible way to improve the clean accuracy of the BARZ defense.
On the attacker side in this work, we only consider an adversary that is interested in misclassification (either targeted or untargeted). The attacker starts with a clean example and specifically tries to avoid having the sample marked with the correct label or marked with the adversarial label. To the best of our knowledge, work has not been extensively done on what might be considered the inverse of this problem i.e., the attacker tries to overwhelm the system with legitimate examples that are marked as adversarial. While an interesting problem in its own right, this is beyond the scope of our current work. It may be a problem future defense designers would want to take into account and try to mitigate.
Lastly, from the attacker side, optimizations can still be made to the adaptive black-box attack. In our paper, we found one simple CNN architecture (through experimentation) that was both simple to train and yielded highly transferable adversarial examples. However, it may still be possible to optimize the architecture in the attack, to potentially increase the attack success rate. In addition, as white-box attacks continue to improve, it may be possible to substitute the MIM adversarial generation method in the adaptive black-box attack with an even stronger technique.

THANH NGUYEN received the B.S. degree in computer science from the Hanoi University of Science and Technology, Vietnam, in 2012, and the Ph.D. degree in computer engineering from Iowa State University, USA, in 2020. He is currently working as a Researcher at Amazon AI. His research interests include machine learning, learning theory, generative modeling, and unsupervised learning.
MARTEN VAN DIJK (Senior Member, IEEE) is currently a Group Leader of the Computer Security Group, CWI, The Netherlands, with over 20 years of experience in both industry (Philips Research and RSA Laboratories) and academia (MIT and a Full Professor till June 2020 and a Full Research Professor after June 2020 at UConn). His work has been recognized by the A. Richard Newton Technical Impact Award in electronic design automation, in 2015, and has received several best (student) paper awards.