Mitigating Black-Box Adversarial Attacks via Output Noise Perturbation

In black-box adversarial attacks, adversaries query the deep neural network (DNN), use the output to reconstruct gradients, and then optimize the adversarial inputs iteratively. In this paper, we study the method of adding white noise to the DNN output to mitigate such attacks, with a unique focus on the trade-off analysis of noise level and query cost. The attacker's query count (QC) is derived mathematically as a function of noise standard deviation. With this result, the defender can conveniently find the noise level needed to mitigate attacks for the desired security level specified by QC and limited DNN performance loss. Our analysis shows that the added noise is drastically magnified by the small variation of DNN outputs, which makes the reconstructed gradient have an extremely low signal-to-noise ratio (SNR). Adding slight white noise with a standard deviation less than 0.01 is enough to increase QC by many orders of magnitude without introducing any noticeable classification accuracy reduction. Our experiments demonstrate that this method can effectively mitigate both soft-label and hard-label black-box attacks under realistic QC constraints. We also show that this method outperforms many other defense methods and is robust to the attacker's countermeasures.


Introduction
Along with the rapid development of deep neural networks (DNNs), many online services, such as the Clarifai API, Google Photos, advertisement detection, and fake news filtering, rely heavily on DNNs. Nevertheless, an intriguing issue is that DNNs are highly susceptible to small variations in input data [1]. Online DNN servers suffer from adversarial attacks, in which attackers slightly change the input data to make DNNs give false results or misclassifications [2].
Depending on the knowledge about the DNNs that the attackers have, adversarial attacks can be classified into white-box attacks [1, 3-5] and black-box attacks [6-13]. The former assumes that the attackers have complete knowledge of the deep network, while the latter assumes that the attackers have limited knowledge, typically some output information of the DNNs. Compared with white-box attacks, black-box attacks are more realistic threats to real-world applications.
In general, black-box attacks need to estimate gradients from the output information of the deep networks obtained through querying, and use these estimated gradients to optimize their adversarial inputs. The query cost is thus a critical constraint for attackers. In recent years, increasingly efficient black-box attack methods have been developed, and they can now generate adversarial samples with only a few hundred queries [7,14]. Considering this fast-increasing threat, it is the right time to develop effective defense methods [15,16]. Unfortunately, most existing defense techniques have been shown to provide a false sense of security [17].
In this paper, we study the performance of simple output noise perturbation as a defense against black-box attacks, where the defender (or the DNN) adds white noise to the DNN outputs. Since it is impossible to find a technique that can completely stop attackers with unlimited resources, we focus on mathematical analysis of the attack-defense trade-off in terms of noise level and query count (QC). We believe such theoretical analysis is critical for defense study because it is computationally intractable to guarantee defense with experiments alone. Specifically, we express QC as a function of the noise standard deviation $\sigma$, with which the defender can easily apply appropriate noise to prevent attacks up to given performance-loss and security levels. For example, our results demonstrate that small noise with $\sigma \le 0.01$ can prevent black-box attacks with a one-million-query budget on the MNIST, CIFAR10 and IMAGENET datasets without noticeable classification accuracy loss.
The major contributions of this paper are outlined as follows.
• We develop a new analysis framework to study the trade-off between noise level and QC mathematically, instead of relying only on a heuristic approach via experiments. The signal-to-noise ratio (SNR) of the noisy gradients is derived, and it shows that small noise is magnified by the small variation of DNN outputs. The attacker's QC is shown to increase by many orders of magnitude with an extremely small noise.
• We analyze the properties of the noise perturbation method and show that the proposed method is robust to the countermeasures of the attackers. We also observe that quantization and output-correlated noise do not perform well, which explains why output noise perturbation is better than other randomization or gradient obfuscation methods.
• We experiment with a list of representative black-box attack algorithms, including both soft-label and hard-label attacks. The results fit well with the analysis and demonstrate the effectiveness of this method against these attacks.
This paper is organized as follows. Related works are presented in Section 2. The noise perturbation method is studied in Section 3. Experiments are conducted in Section 4. Conclusions are given in Section 5.

Related work
Black-box attacks can be subdivided into three major classes: transfer-learning based attacks, soft-label attacks, and hard-label attacks [8]. Transfer-learning based attacks exploit the fact that an adversarial input to one deep network may also be adversarial to another deep network [2].
Soft-label attacks assume that the logit information is available to the attacker, either fully or partially. Narodytska et al. [18] used random perturbation and local searches to look for adversarial samples. Hayes et al. [19] trained a generator neural network to generate adversarial samples. Chen et al. developed the zeroth-order optimization (ZOO)-based attacks [6], where they reconstructed gradients from output logits using zeroth-order gradient estimators. Ilyas et al. [8] applied natural evolution strategies (NES) to estimate the gradients. Tu et al. [7] improved the ZOO-based attacks with the AutoZOOM algorithm, which used autoencoders to generate gradient search directions. Cheng et al. [20] combined the transfer-learning and ZOO-based attack techniques.
Hard-label attacks assume that only hard decisions of DNN outputs are available. Within this class, Brendel et al. [13] exploited large perturbation to generate adversarial samples and used fine-tuning to reduce adversarial image distortion. Ilyas et al. [8] picked a target image and fine-tuned it toward the original image. Cheng et al. [9,10] applied randomized gradient-free ZOO techniques.
On the defense side, the majority of existing studies focus on white-box attacks. Most existing black-box defense techniques are in fact borrowed from their white-box versions. A large number of defense techniques were proposed based on the idea of gradient masking or gradient obfuscation, e.g., defensive distillation [16], non-differentiable classifiers [21], input randomization [22], and network structure randomization [23]. Unfortunately, almost all of them were defeated shortly after their publication via the so-called expectation-over-transformation (EOT) technique [24,25,17]. Today, the most effective way is perhaps adversarial training, where adversarial samples are used to train the network [26-28], but the performance is not reliable for unknown attacks.
To the best of our knowledge, the simple output noise perturbation method has not been studied in depth. Dong et al. [14] experimented with a long list of white-box/black-box attack/defense algorithms, but without this one. All the other reported noise perturbation techniques injected noise into the input or the network, not the output [28-31]. The reason is perhaps that they were derived from white-box attacks, where adding noise to network outputs is of no use. Lee et al. [32] used output noise perturbation, but for model stealing attacks only and without mathematical analysis.

Analysis of Output Noise Perturbation
The objective of the adversarial attacker is to generate an image $x = x_0 + \Delta x$ such that the DNN classifies it as $t = \arg\max_i F_i(x) \neq c$. Another aim of the attacker is to keep the difference $\Delta x$ as small as possible. For black-box attacks, the attackers query the DNN to obtain the input-output pair $(x, F(x))$, as shown in Fig. 1, with which they can minimize the loss function (1) to search for the adversarial sample $x$ [7]. In (1), $D(\cdot, \cdot)$ is a distance function and $L(\cdot, t)$ is the loss function. Typical distance functions are the norms $\|x - x_0\|_p$. Typical loss functions include the cross-entropy [8] and the C&W loss [24].
In this paper, we consider that the DNN defends itself by adding noise $v$ to the logit and providing outputs $F(x) + v$. As a result, the attacker observes $F(x) + v$ instead of $F(x)$. We assume that $v$ is an independent and identically distributed (i.i.d.) Gaussian random vector with zero mean and standard deviation $\sigma$, i.e., $v \sim N(0, \sigma^2 I)$, where $I$ is an identity matrix. We also assume that the DNN satisfies $\|F(x) - F(y)\| \le L\|x - y\|$ with a local Lipschitz constant $L$. We consider low-magnitude (small) noise throughout our analysis.
Definition 1. Small noise is defined as the noise $v$ whose standard deviation $\sigma$ is small enough that $\log(1 + v) \approx v$ is valid almost surely.
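The defense defined above can be illustrated with a minimal sketch using only the Python standard library. The output vector `clean` is a hypothetical softmax output, not data from the paper, and the helper names are illustrative. The sketch adds i.i.d. Gaussian noise to each output element and counts how often the top-1 decision survives at a small noise level.

```python
import random

def add_output_noise(outputs, sigma, rng):
    """Return F(x) + v with v ~ N(0, sigma^2 I), drawn independently per element."""
    return [p + rng.gauss(0.0, sigma) for p in outputs]

def argmax(values):
    """Index of the largest element (the top-1 class)."""
    return max(range(len(values)), key=lambda i: values[i])

rng = random.Random(0)
clean = [0.90, 0.06, 0.04]   # hypothetical confident softmax output
trials = 1000

# Count the trials in which the top-1 class is unchanged by the noise.
kept = sum(
    argmax(add_output_noise(clean, 0.01, rng)) == argmax(clean)
    for _ in range(trials)
)
print(kept / trials)
```

For a confident output like this one, noise with standard deviation 0.01 is far smaller than the gap between the top two classes, so the decision is not expected to change; less confident outputs behave differently.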
The performance of a defense can be measured by three metrics: attack success rate (ASR), query count (QC), and input distortion. A strong and robust defense makes black-box attacks result in low ASR, high QC, and high distortion. In this paper, we derive QC as a function of $\sigma$ in the theoretical analysis and calculate ASR as a function of $\sigma$ under a pre-set QC limit in the experiments. To save space, detailed analysis is presented only for the soft-label targeted attack with the NES method [8]. Extension to other attack methods, including hard-label and untargeted attacks, is explained briefly in Section 3.4 because of the similarity in methods.

Output Noise Perturbation to Mitigate NES Targeted Attack
In [8], the soft-label NES targeted attack towards class $t$ is conducted by minimizing the softmax cross-entropy loss function $f(x) = -\log F_t(x)$, where $F_t(x)$ is the softmax value of the target class. We have omitted the distance term $D(x, x_0)$ from (1) in order to consider the most challenging defense situation, since it is easier for the attacker to find an adversarial sample without the distortion constraint. The NES algorithm uses gradient descent to minimize the loss function iteratively. In each iteration, with $J$ queries, it estimates the gradient as in (2), where $u_j$ is the random direction tensor and $\beta$ is the search variance. In particular, each term of the estimate is expressed as $g_j = a u_j$, i.e., the attacker-generated tensor $u_j$ multiplied by a deterministic scalar factor $a$.
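The gradient estimation step described above can be sketched as follows. This is an illustrative toy, not the attack code of [8]: the DNN is replaced by a small hypothetical linear-softmax model, the matrix `W`, input `x`, and all helper names are assumptions, and the normalization follows the common antithetic NES form rather than the exact constants of (2). Noise is added to the queried target-class output before the loss is formed.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def model(W, x):
    """Toy linear-softmax classifier standing in for the DNN F(x)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def query_loss(W, x, t, sigma, rng):
    """One query: the attacker observes F_t(x) + v and forms f = -log(.)."""
    observed = model(W, x)[t] + rng.gauss(0.0, sigma)
    return -math.log(max(abs(observed), 1e-12))  # keep the log argument positive

def nes_gradient(W, x, t, beta, J, sigma, rng):
    """Antithetic NES estimate from J queries (J/2 random directions u_j)."""
    grad = [0.0] * len(x)
    for _ in range(J // 2):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + beta * ui for xi, ui in zip(x, u)]
        x_minus = [xi - beta * ui for xi, ui in zip(x, u)]
        diff = query_loss(W, x_plus, t, sigma, rng) - query_loss(W, x_minus, t, sigma, rng)
        grad = [g + diff / (J * beta) * ui for g, ui in zip(grad, u)]
    return grad

def true_gradient(W, x, t):
    """Exact gradient of -log F_t(x) for the toy model: W^T (p - e_t)."""
    p = model(W, x)
    return [sum(W[k][i] * (p[k] - (1.0 if k == t else 0.0)) for k in range(len(W)))
            for i in range(len(x))]

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b)))

rng = random.Random(1)
W = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, -0.5], [0.0, 0.0, 1.0, 0.2]]
x = [0.2, -0.1, 0.3, 0.1]
g_true = true_gradient(W, x, 0)
g_clean = nes_gradient(W, x, 0, beta=1e-3, J=50, sigma=0.0, rng=rng)
g_noisy = nes_gradient(W, x, 0, beta=1e-3, J=50, sigma=0.01, rng=rng)
print("cosine to true gradient, no noise:", cosine(g_clean, g_true))
print("cosine to true gradient, sigma=0.01:", cosine(g_noisy, g_true))
```

Without noise the estimate stays aligned with the true gradient. With noise, the difference of the two queried losses is dominated by the noise term once the true difference (of order $\beta$) falls below $\sigma / F_t(x)$, which is the regime analyzed in this section.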
Theorem 1 states that, with the noise $v$, the factor $a$ becomes a random factor $A$ that, under small noise, is approximately Gaussian, $A \sim N(a, \sigma_Z^2/\beta^2)$, with $\sigma_Z^2$ given by (3). The detailed proof is shown in Appendix A. To understand the degree to which $v$ randomizes the estimated gradient, we evaluate the signal-to-noise ratio (SNR) of $A$, defined as $\mathrm{SNR} = E[a^2]/E[(A - a)^2]$, where $E[\cdot]$ denotes mathematical expectation. We also call it the SNR of the noisy gradient.
Lemma 1. Under small $\beta$ and small noise, the SNR of $A$ is given by (4).
The detailed proof is shown in Appendix B. We can see that the SNR is very small because the output variation $\Delta F_t = |F_t(x - \beta u_j) - F_t(x + \beta u_j)|$ and $\beta$ are very small in practice. A very small SNR gives $A$ the opposite sign of $a$ with high probability, which turns the gradient descent toward the wrong direction and thus prevents the attacker's optimization from converging.
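The link between a low SNR and a wrong descent direction can be made concrete with a short sketch. It assumes the Gaussian model $A \sim N(a, s^2)$ with $a > 0$ from Appendix A and takes the SNR for a fixed $a$ as $a^2/s^2$ (an assumption made here for illustration); under that model $P[A < 0] = \Phi(-\sqrt{\mathrm{SNR}})$. The function name is illustrative.

```python
from statistics import NormalDist

def sign_flip_probability(snr_db):
    """P[A < 0] for A ~ N(a, s^2) with a > 0, assuming SNR = a^2 / s^2."""
    snr = 10.0 ** (snr_db / 10.0)          # convert dB to a linear power ratio
    return NormalDist().cdf(-(snr ** 0.5))

for snr_db in (20.0, 0.0, -34.0, -66.0, -106.0):
    print(f"SNR = {snr_db:7.1f} dB -> P[A < 0] = {sign_flip_probability(snr_db):.4f}")
```

As the SNR drops far below 0 dB, the probability approaches 0.5, meaning that the sign of the estimated factor carries almost no information about the true descent direction.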
To derive QC as a function of the noise level $\sigma$, we consider the following alternative approach, since it is difficult to find QC expressions for deep networks. Consider the iterative gradient-descent minimization of a function $f$ of $F(wx)$, where $w$ is the weight of the input DNN layer and $x$ is the DNN input. We assume that the function $F(wx)$ is monotone between the starting point $wx_0$ and the optimal point $wx^*$, because otherwise there is no guarantee of convergence. The minimization is conducted by gradient descent with learning rate $a$. Our objective is to find the ratio $R$, i.e., the ratio of the number of iterations needed when using the random learning rate $A = a + \sqrt{\mathrm{SNR}}\,v$ with noise $v$ to that needed when using a constant learning rate $a$.
Theorem 2. If the learning rate $a$ is small such that $(1 - a\lambda)^n \approx 1 - na\lambda$, then the ratio $R$ is given by (5), where $\eta$ and $\epsilon$ are small probabilities, $\lambda$ and $v_0$ are constants related to $w$, $x_0$ and $x^*$, and $\Phi^{-1}(\epsilon)$ is the inverse of the standard normal cumulative distribution function.
The detailed proof is shown in Appendix C. From the proof, we also see that R can be used as an estimation of QC(noise)/QC(noiseless), i.e., the ratio of QCs between the case with noise perturbation and the case without noise perturbation.
The relationship between QC and noise σ can be readily analyzed based on (4) and (5).In particular, if σ is small, then R ∼ 1/SNR, i.e., increases with 1/SNR.As a rule of thumb, log(R), − log SNR, and log σ − log ∆F t change linearly with each other.
To better understand the trade-off between performance loss (specified by $\sigma$) and defense security (specified by QC), we need to know the output variation $\Delta F_t$. For this we trained a 5-layer CNN model for the MNIST dataset and a 6-layer CNN model for the CIFAR10 dataset, and used the Inception-V3 model for the IMAGENET dataset. First, using their validation datasets, we calculated the statistical parameters of the DNN outputs, which are shown in Table 1. We applied random $u_j$ to calculate the output variation $\Delta F_t$. Second, we added noise to the outputs and calculated the output SNR and the degradation of classification accuracy (ACC). The results in Fig. 2(a) clearly show that there is almost no ACC degradation when the noise $\sigma \le 0.02$. From Fig. 2(b) we also find that the output SNR is high.
Next, using the mean $\Delta F_t$ data in Table 1, we calculated the SNRs of $A$ and show them in Fig. 2(b). The SNRs drop drastically to very small numbers. At $\sigma = 0.01$, the SNRs are -66 dB, -34 dB and -106 dB for the MNIST, CIFAR10 and IMAGENET models, respectively. Such low SNRs make $A$ completely different from $a$.
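As a quick arithmetic check of the rule of thumb $R \sim 1/\mathrm{SNR}$ stated above, the reported SNRs at $\sigma = 0.01$ can be converted from dB to linear scale. This is an order-of-magnitude sketch only: it ignores the constants in (5), and the function name is illustrative.

```python
def inverse_snr(snr_db):
    """Return 1/SNR on a linear scale for an SNR given in dB."""
    return 10.0 ** (-snr_db / 10.0)

# SNR values of A at sigma = 0.01 as reported in the text.
reported_snr_db = {"MNIST": -66.0, "CIFAR10": -34.0, "IMAGENET": -106.0}

for name, snr_db in reported_snr_db.items():
    print(f"{name}: 1/SNR is about {inverse_snr(snr_db):.1e}")
```

The three values come out at roughly $4 \times 10^6$, $2.5 \times 10^3$ and $4 \times 10^{10}$, which indicates the scale of query-count inflation that such SNRs imply under this rule.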
Finally, to evaluate the QC ratio $R$ with (5), we assume $\eta = \epsilon = 0.01$, $a = 0.1$, $\lambda = 2$, $v_0 = 1$. The increase of $R$ as a function of $\sigma$ is shown in Fig. 3(a). At $\sigma = 0.01$, we have $R = 9 \times 10^6$, 5 × 10  10 10 for the three models, respectively. Considering that today's state-of-the-art attack methods need around $10^3$ queries to attack MNIST/CIFAR10 images and $10^5$ queries to attack IMAGENET images, small noise perturbation with $\sigma = 0.01$ would increase the number of queries to $10^6$ to $10^{15}$, which is prohibitively high for attackers. Note that the much smaller median $\Delta F_t$ values shown in Table 1 will lead to even higher QCs.
From Fig. 3(a), for attackers with a 1 million query budget, the defender can add noise with $\sigma = 0.001$, $0.01$, and $0.0001$ to mitigate them on the MNIST, CIFAR10 and IMAGENET datasets, respectively. Smaller noise, such as $\sigma \le 10^{-4}$, is effective for well-trained models (such as MNIST) or models with a large number of classes (such as IMAGENET) that have very small $\Delta F_t$. The defender can conveniently apply appropriate noise according to its output parameters and required security level.
Now we can summarize the reasons why the noise perturbation method is effective. First, from Lemma 1, the SNR of the estimated gradient becomes very low since the noise power $\sigma^2$ is amplified by the small $\beta$ and small $\Delta F_t$. Second, from Theorem 1, the gradient becomes so random that the search direction is reversed with high probability, which prevents the gradient search from converging. Finally, according to Theorem 2, the low SNR makes the attack QC prohibitively high.

Robustness to Attacker's Countermeasures
The output noise perturbation method is robust to various counter-defense techniques that the attacker may adopt. First, the attacker may try to increase $\Delta F_t$ and $\beta$ to reduce their noise amplification effects. However, $\Delta F_t$ is usually out of the attacker's control. A large $\beta$ worsens the gradient estimation accuracy, which in fact reduces the SNR of $A$.
Second, the attacker may adopt the EOT or gradient averaging strategy that has been shown to be effective in invalidating gradient obfuscation defenses in white-box attack scenarios [17]. Nevertheless, EOT is not as effective in our case as one would expect. In principle, EOT finds the average gradient $\bar g = \frac{1}{J}\sum_j g_j$, similar to (2). Transformed images that the attacker uses to query the DNN can be written as $x + \Delta x_j$, where $\Delta x_j$ denotes the difference caused by the random transform. The attacker still gets a noise-perturbed output $F(x + \Delta x_j) + v_j$ to construct $g_j$. The estimated gradient is still random, and it is made worse by the randomized $\Delta x_j$. In this scenario, the accuracy of $\bar g$ cannot be guaranteed, even if $J$ is large. In the worst case, independent $g_j$ may make $\bar g \to 0$.
Finally, the best countermeasure is perhaps to estimate $a$ by querying the DNN repeatedly with the same $x$. This is the optimal strategy for estimating a constant from noisy samples. Note that this is different from finding the average $\bar g$ with different $x$. Theoretically, the attacker can average out the noise and get a reliable estimate with a large number of repeated queries. We are interested in studying the QC in this case.
Theorem 3. If the attacker conducts $N$ repeated queries with the same data $x$, it gets $N$ noisy samples. The minimum number of samples $N$ required to estimate $a$ as $\hat a$ with $P[\hat a < 0] < \epsilon$ for some $\epsilon$ is given by an expression involving $\Phi^{-1}(\epsilon)$, the inverse of the standard normal cumulative distribution function.
The detailed proof is shown in Appendix D. We can see that a small $F_t(x)$ leads to a large $N$. As a numerical example, adopting the mean $F_t(x)$ and $\beta$ data in Table 1, $\epsilon = 0.3$ and $a = 1$, the $N$ values as a function of $\sigma$ are shown in Fig. 3(b). A huge number of repeated queries is needed to estimate each gradient $g_j$, which makes this countermeasure impractical. In particular, when $\sigma \ge 10^{-4}$, no realistic $N$ can be found to estimate the gradient $g_j$ in the correct direction with 70% probability.
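The scaling behind this countermeasure can be sketched with textbook statistics. The sketch below is not the paper's expression for $N$; it assumes that each repeated query yields one sample of the noisy factor distributed as $N(a, s^2)$ with $a > 0$, where `noise_std` ($s$) is a hypothetical per-sample standard deviation. Averaging $N$ samples gives $\hat a \sim N(a, s^2/N)$, so $P[\hat a < 0] < \epsilon$ requires $N > (\Phi^{-1}(\epsilon)\, s / a)^2$.

```python
import math
from statistics import NormalDist

def min_repeated_queries(a, noise_std, eps):
    """Smallest N with P[mean of N samples < 0] < eps.

    Assumes samples ~ N(a, noise_std^2), a > 0 and 0 < eps < 0.5.
    """
    z = NormalDist().inv_cdf(eps)                  # negative for eps < 0.5
    return math.floor((z * noise_std / a) ** 2) + 1

# eps = 0.3 and a = 1 follow the numerical example in the text;
# the noise_std values are illustrative only.
for noise_std in (1.0, 100.0, 1000.0):
    print(noise_std, min_repeated_queries(1.0, noise_std, 0.3))
```

The required $N$ grows with the square of the per-sample noise level relative to $a$, which is why a strongly amplified noise makes repeated querying costly even for a modest target such as 70% sign correctness.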

Properties of Output Noise Perturbation
In this subsection, we first show that our analysis framework and the noise perturbation method are general enough for many other black-box attack methods.For this we present the analysis results over a ZOO-based attack [7].Then, we show that quantization noise and output-correlated noise are not effective, which explains why the output noise perturbation method is better than other randomization or gradient obfuscation methods.
Within the black-box attack community and the gradient-free optimization community, NES and ZOO are two major gradient estimation approaches. For ZOO, we consider the AutoZOOM algorithm [7], which minimizes the C&W loss function $f(t) = \log(F_{\max}(x)/F_t(x))$, where $F_{\max}(x) = \{F_i(x) : i = \arg\max_j F_j(x), \forall j \neq t\}$, with its own gradient estimator.
Theorem 4. Under white Gaussian noise perturbation, the multiplication factor $a$ becomes a noisy factor $A$. In addition, when $\sigma$ is small, the SNR of $A$ satisfies $\mathrm{SNR} \le \frac{L^2\beta^2}{2\sigma^2}$.
The detailed proof is shown in Appendix E. This theorem tells us that noise perturbation randomizes AutoZOOM's gradient estimation in a similar way as it does for NES.
It can be readily seen that the analysis method shown in Theorems 1 to 4 can be applied to analyze other black-box attacks as well. For example, the $\mathcal{N}$Attack algorithm [33] uses the NES-estimated gradients to learn adversarial distributions. Assuming the adversarial samples have a certain distribution with mean $\mu$, the $\mathcal{N}$Attack algorithm finds $\mu$ via the optimization $\mu_{t+1} = \mu_t - \eta \bar g$. Obviously, the noise-perturbed $\bar g$ can hardly make the update converge. As another example, for the partial-information setting of [8], the authors propose to start from a target image. The NES algorithm is then applied to estimate the gradient used to modify the target image so that it becomes similar to the original image. Noise perturbation is still effective in randomizing the estimated gradients. Untargeted attacks can be analyzed similarly with just a change of loss functions.
An especially interesting case is the hard-label attack. The label-only NES attack [8] starts from a target image and uses its random variations to query the DNN. The binary query results are used to construct a measure similar to $F_t(x)$. Obviously, noise perturbation can change the hard label, which makes this measure very noisy and reduces the SNR of the estimated gradients. Our analysis framework can still be applied.
Next, an interesting question is whether output quantization noise can be used.Another interesting question is whether the noise must be white.
Lemma 2. Noise created by output quantization (to 2 or more bits) or noise highly correlated with the DNN outputs makes the random variable $Z$ have a very small $\sigma_Z^2$, and thus is not effective in mitigating black-box attacks.
The detailed proof is shown in Appendix F. The lemma gives a good explanation for the limited or failed defending performance of existing network randomization defenses.For example, Liu et al. [29] suggest adding noise to each convolutional layer but not the final output layer, whose net effect is to create output perturbations that are highly correlated with the true output logits.Its noise perturbation effect is in fact reduced by the network.The reduced randomization makes it more susceptible to the attacker's countermeasures such as EOT according to Fig. 3(b).

Experiments
From Section 3.2, the QC needed for generating an adversarial image under our noise perturbation defense can be $10^{15}$ or more, which is computationally prohibitive. Therefore, instead of looking for QC, we followed the common practice of looking for ASR under a pre-set realistic QC limit. To save space, only the ASR of targeted attacks is reported here. Experiment data for QC, distortion and untargeted attacks are presented in Appendix G.
We experimented with a list of state-of-the-art black-box attack algorithms, including both soft-label attacks and hard-label attacks. For fair comparison, we used the original source code of the attack algorithms with their default hyper-parameter settings (represented as no-noise results). Then we inserted our noise addition defense subroutine into the source code. In practice, we could not add truly i.i.d. noise. Since the DNN should have softmax outputs in [0, 1], we replaced negative elements with their absolute values and clipped the values over 1. To avoid ACC degradation at high noise levels (such as $\sigma = 0.1$), we made the top-1 pick in the noiseless case still the top-1 pick after noise perturbation. Specifically, if the original maximum element was no longer the maximum after the noise addition, we repeatedly added absolute Gaussian noise generated with half the variance to it until the original maximum element became the maximum again. Nevertheless, this was not conducted for hard-label attacks. The ACC of the noise-perturbed DNN is not shown because the model accuracy degrades very little, as shown in Fig. 2(a). All experiments were conducted on a machine with a single GPU.
Table 2: Targeted soft-label attacks: Attack success rate (ASR%) of ZOO [6], ZOO+AE, AZ+AE and AZ+Bi (AutoZOOM) [7], GenAttack [11], SimBA-pixel and SimBA-DCT [12], and NES [8]. NES/PI is the NES Partial Information attack where only the top-1 pick's confidence score is available [8]. The QC limit is the maximum number of iterations the attack algorithms run.
ASR of Targeted Attacks: For soft-label attacks, the ASR of all the attack methods was reduced significantly in the presence of the proposed defense. On MNIST, noise with standard deviation $\sigma \ge 0.001$ was enough to degrade ASR from 100% to below 20%. On CIFAR10, noise with $\sigma \ge 0.01$ was enough. Note that ASR cannot theoretically be smaller than 10% for these two datasets because a random guess among 10 classes results in 10% correctness. On IMAGENET, noise with $\sigma \ge 10^{-4}$ was enough. All these results fit well with our analysis (see the end of Section 3.2). In all cases, a QC on the order of $10^6$ to $10^{15}$ is needed, which is much higher than the preset QC limits in these attack algorithms.
For hard-label attacks, the experiment results in Table 3 show that a low noise standard deviation $\sigma = 0.001$ was effective and $\sigma = 0.01$ successfully reduced ASR to below 25%. Note that the observation in Fig. 2(a), i.e., that the ACC did not degrade for such small $\sigma$, holds only for normal images with high enough classification confidence. The added noise could change the top-1 labels when the classification confidence was not high enough, which happened frequently in the middle of the attacker's optimization procedure. This prevented the attack algorithm from converging. We should also note that in hard-label attacks, the attacker needs a large number of queries, over 2.5 million, to generate an adversarial image even when there is no noise perturbation. This is because the gradient is already very noisy with low SNR.
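The noise-addition subroutine described in the experimental setup of this section can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the function and parameter names are hypothetical, and the ordering of the steps (reflect negatives, restore the top-1 pick, then clip at 1) is an assumption, since the text does not fix it. Setting `keep_top1=False` corresponds to the hard-label case, where the top-1 restoration is not applied.

```python
import math
import random

def perturb_softmax(outputs, sigma, rng, keep_top1=True):
    """Noisy softmax outputs in [0, 1], optionally preserving the top-1 class."""
    top = max(range(len(outputs)), key=lambda i: outputs[i])
    # Add Gaussian noise and replace negative elements with absolute values.
    noisy = [abs(p + rng.gauss(0.0, sigma)) for p in outputs]
    if keep_top1 and sigma > 0.0:
        half_var_std = sigma / math.sqrt(2.0)   # std of noise with half the variance
        # Keep adding absolute Gaussian noise to the original winner
        # until it is the strict maximum again.
        while any(noisy[i] >= noisy[top] for i in range(len(noisy)) if i != top):
            noisy[top] += abs(rng.gauss(0.0, half_var_std))
    # Clip values over 1 (assumed here to be the final step).
    return [min(v, 1.0) for v in noisy]

rng = random.Random(0)
print(perturb_softmax([0.40, 0.35, 0.25], 0.1, rng))
```

Because the restoration only ever increases the original maximum element, the served top-1 label matches the noiseless one, while the other scores and the score gaps remain randomized.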
Comparison with Existing Defense Methods: In Table 4 we compare the output noise perturbation defense method with two existing defense algorithms: JPEG Compression [15] and Input Randomization [22]. As seen from the table, the proposed output noise perturbation method had the best defense result with the lowest ASR. Specifically, the two existing methods could not mitigate the NES and GenAttack attacks on IMAGENET data satisfactorily, while our output noise perturbation method could reduce the ASR to near 0%.
Table 3: Targeted hard-label attacks: Attack success rate (ASR%) of the hard-label OPT attack [9], Sign-OPT attack [10], and NES/Label-Only attack [8].
Quantization and Output-Correlated Noise: For quantization noise, we quantized the outputs from 32-bit to 2-, 4-, and 8-bit. For output-correlated noise, the noise was generated as $v = \alpha F(x) + \varepsilon$, where $\alpha$ was the correlation coefficient and $\varepsilon \sim N(0, 10^{-16} I)$ was the residual noise with a very small standard deviation of $10^{-8}$. The results in Table 5 clearly show that quantization did not mitigate the attacks. There was no change in ASR between the original (32-bit float) and the quantized cases. Similarly, correlated noise could not mitigate the attacks either. The slight reduction in ASR at $\alpha = 0.001$ and $\alpha = 0.1$ was solely caused by the small residual noise $\varepsilon$. In Appendix G.1, we present experiment results demonstrating that the output noise perturbation method is robust to the attacker's countermeasures with increased and repeated queries. Extra experiment results on QC, distortion, image samples and untargeted attacks are presented in Appendix G.
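Why output-correlated noise fails can be seen with a two-line calculation, sketched below under simplifying assumptions. The residual term is omitted, so the noise is exactly $\alpha F(x)$ and the observed output is $(1 + \alpha) F(x)$; the two output values are hypothetical, and the helper names are illustrative. The NES factor depends on the outputs only through the log ratio $\log h(x)$ of the two antithetic queries, and the common scale $(1 + \alpha)$ cancels in that ratio.

```python
import math

def nes_log_ratio(f_minus, f_plus):
    """log h = log(F_t(x - beta u) / F_t(x + beta u)), the quantity behind the factor a."""
    return math.log(f_minus / f_plus)

def with_correlated_noise(value, alpha):
    """Observed output F + v with v = alpha * F (residual noise omitted)."""
    return value + alpha * value

f_minus, f_plus = 0.1003, 0.0998          # hypothetical target-class outputs
clean = nes_log_ratio(f_minus, f_plus)
for alpha in (0.001, 0.1):
    noisy = nes_log_ratio(with_correlated_noise(f_minus, alpha),
                          with_correlated_noise(f_plus, alpha))
    print(alpha, noisy - clean)
```

The difference is zero up to floating-point rounding, so the attacker's estimated factor is unchanged; in this idealized case only the independent residual noise perturbs the gradient.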

Conclusions
In this paper, we studied the addition of white noise to the DNN's output as a defense against black-box adversarial attacks. The noisy gradient is analyzed theoretically, which shows that the added noise is drastically amplified by the small logit variation. The trade-off between the defender's noise level and the attacker's query count is analyzed mathematically. Extensive experiments verified the theoretical analysis and demonstrated that white noise perturbation can effectively mitigate black-box attacks under realistic query cost constraints.

A Proof of Theorem 1
As outlined in Section 3.2, the NES targeted attack algorithm minimizes the cross-entropy loss $f(x) = -\log F_t(x)$, assuming the label is one-hot coded, where $F_t(x)$ is the DNN output corresponding to the target class $t$. The NES algorithm minimizes the loss iteratively via gradient descent, and in each iteration the gradient is estimated as in (9), where $J$ queries with random direction tensors $u_j$ are conducted to obtain the DNN output $F_t(x + \beta u_j)$ as well as the loss $f(x + \beta u_j)$. Antithetic sampling is adopted in [8], which changes (9) to (2). Antithetic sampling means that both $x + \beta u_j$ and $x - \beta u_j$ are used to query the DNN.
To study the noise perturbation effect on the estimated gradient, it is sufficient to focus on just a single term $g_j$ of the estimate. To simplify notation, we can write $g_j$ as the attacker-generated tensor $u_j$ multiplied by a scalar multiplication factor $a$, i.e., $g_j = a u_j$ as in (11). With white Gaussian noise $v$ added to the DNN output $F(x)$, equation (11) becomes $g_j = A u_j$, where the noisy factor $A$ is obtained by replacing $h(x)$ with
$$\tilde h(x) = \frac{F_t(x - \beta u_j) + v_t(j + J/2)}{F_t(x + \beta u_j) + v_t(j)}.$$
The variables $v_t(j)$ and $v_t(j + J/2)$ are the noises added to the target class logits $F_t(x + \beta u_j)$ and $F_t(x - \beta u_j)$, respectively. Note that in antithetic sampling, we denote the noise added to the query $F_t(x - \beta u_j)$ as $v_t(j + J/2)$, where $j + J/2$ denotes the $(j + J/2)$th query.
The connection between the noiseless $h(x)$ and the noisy $\tilde h(x)$ is $\tilde h(x) = h(x) Z$, where we use the random variable $Z$ to include all the noise terms. As a result, we have (15). Since $Z$ is the ratio of two independent Gaussian random variables, from [34] we can readily see that it can be approximated as a single Gaussian random variable $Z \sim N(1, \sigma_Z^2)$ with unit mean and variance $\sigma_Z^2$ described by (3). Furthermore, let $Z = 1 + S$, where $S \sim N(0, \sigma_Z^2)$. From the small noise Definition 1, we have that $\sigma^2$ is small enough so that $\log Z = \log(1 + S) \approx S$. Therefore, from (15) we get $A \sim N(a, \sigma_Z^2/\beta^2)$. Theorem 1 is proved.
Remark 1: To understand why the noisy $g_j$ can prevent the NES attack, it is helpful to have some idea about the value distribution of $h(x)$, $\log h(x)$ and $Z$. Since $\beta$ is very small, we expect that $h(x)$ is near 1 due to the bounded local Lipschitz constant $L$. Then, $\log h(x)$ is around 0 and can be positive or negative. The value of $a$ can also be positive or negative, and $|a|$ is usually small. This means that the factor $a$ controls the gradient descent direction. The noise $Z$, and thus $A$, randomizes the estimated gradient $g_j = A u_j$, and in particular randomizes the gradient descent direction. For example, even if $a$ is positive, $A$ may become negative (see the numerical example in Remark 2). The random multiplication factor $A$ has an exact probability density function (16) according to (15). However, (16) is too complex for our subsequent SNR and QC analysis. Therefore, we have applied a further simplification to approximate $A$ as a Gaussian random variable.
Remark 2: As an example, let $\beta = 10^{-3}$ as in [8]. For a well-designed DNN, $F_t(x)$ is usually around $1/C$ for a total of $C$ classes. We consider $F_t(x) = 0.1$ and $0.01$, respectively. For positive $a$, we evaluate the probability that $A$ becomes negative, which means that the gradient search direction becomes opposite to the true direction. This probability is evaluated with the cumulative distribution function (CDF) of $A$. From the results shown in Fig. 4, we can see that a very small noise standard deviation $\sigma = 10^{-3}$ is enough to make $P[A < 0] \approx 0.5$.

B Proof of Lemma 1
From (2), i.e., the definition of $a$, we have the approximation (18). We have applied the approximation $\log(1 + x) \approx x$ for small $x$ when deriving (18). Because $|F_t(x - \beta u_j) - F_t(x + \beta u_j)| \le 2\beta \|u_j\| L$, under the assumption of small $\beta$ we can guarantee $|F_t(x - \beta u_j) - F_t(x + \beta u_j)| \ll F_t(x + \beta u_j)$ and thus the validity of (18). From (18), and utilizing the approximation $\log Z \approx Z - 1$, the SNR is then expressed in terms of $\sigma_Z^2$. Replacing $\sigma_Z^2$ with (3), after some straightforward deductions we get (4). Next, to derive the simplified upper bound in (4), consider the Lipschitz constraint assumption applied to the left-hand side of (4). Without loss of generality, assuming $\|u_j\| = 1$ and using $F_t(x - \beta u_j) \approx F_t(x + \beta u_j)$, we get the SNR upper bound on the right-hand side of (4). The lemma is proved.
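The first-order step described above can be written out as a hedged sketch. It assumes $a = \frac{1}{\beta}\log h(x)$ with $h(x) = F_t(x - \beta u_j)/F_t(x + \beta u_j)$, which is consistent with Appendix A; the constant factors in (2) and (18) may differ from what is shown here.

```latex
\begin{align*}
a &= \frac{1}{\beta}\log\frac{F_t(x-\beta u_j)}{F_t(x+\beta u_j)}
   = \frac{1}{\beta}\log\!\left(1+\frac{F_t(x-\beta u_j)-F_t(x+\beta u_j)}{F_t(x+\beta u_j)}\right) \\
  &\approx \frac{1}{\beta}\,\frac{F_t(x-\beta u_j)-F_t(x+\beta u_j)}{F_t(x+\beta u_j)},
  \qquad\text{so that}\qquad
  |a| \approx \frac{\Delta F_t}{\beta\,F_t(x+\beta u_j)} .
\end{align*}
```

Under this assumption, the signal power scales with $\Delta F_t^2$, which is why a small output variation translates directly into a small SNR.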
Remark 3: Note that the SNR can be calculated numerically without applying the approximation in (18). The reason we apply the approximation here is to get a simplified SNR expression that highlights the major contributing factor $\Delta F_t = |F_t(x - \beta u_j) - F_t(x + \beta u_j)|$. Note also that the assumption of small $\beta$ is not a severe constraint in practice. In most black-box attacks, such as [7], $\beta$ is selected (and proved) to be less than or equal to the inverse of the DNN input dimension $d$. Obviously, $d$ is much larger than the DNN output dimension $C$ (the number of classes). Since $F_t(x + \beta u_j)$ is on average around $1/C$, $\beta$ is thus much less than $F_t(x + \beta u_j)$ in most cases. This may be violated occasionally, but such occasional violations do not affect the SNR because the SNR is an average over all possible DNN outputs $F_t(x)$.

C Proof of Theorem 2
Consider the problem of minimizing with iterative gradient descent where a is a constant and small learning rate.In F (wx), F denotes the mapping of DNN, w denotes the weight of the input layer, and x denotes the input.For notation simplicity, w and x are treated as matrix and vector.x * denotes the optimal solution.In order for the gradient descent to converge to x * from a starting point x 0 so we can count the total number of iterations, we have to assume that F (wx) is a monotone function between the starting point wx 0 and the optimal point wx * .
To further simplify our notation, without loss of generality, we assume $F(wx)$ is a monotonically decreasing function from $wx_0$ to $wx^*$. We also assume $wx_n \le wx^*$ for $n = 0, 1, \cdots$, which can be guaranteed with a small enough learning rate $a$ and a starting point $wx_0 \le wx^*$. Our subsequent deduction can be easily extended to other cases, such as $F(wx)$ monotonically increasing, some elements of $F(wx)$ monotonically increasing and others decreasing, or some elements of $wx_n$ becoming greater than $wx^*$. In these cases, we just need to treat each element in each case individually.
The gradient is

where $F'$ denotes the derivative of $F$ with respect to its argument $wx_n$ and $w^T$ denotes the transpose of $w$. Then, the gradient update is

Next, we consider

instead, to exploit the assumption $wx_n \le wx^*$. From the Lipschitz assumption and monotonicity, we have $F(wx_n) - F(wx^*) \le L(wx^* - wx_n)$ for some constant $L$. Therefore,

Subtracting both sides from $wx^*$, we get

Denote the largest eigenvalue of the matrix $L w w^T F'$ as $-\lambda$, where $\lambda$ is a positive value. Note that the eigenvalues must be negative because otherwise (27) does not converge, which contradicts the convergence assumption. In this special case, $F'$ is negative because $F$ is assumed to be monotonically decreasing. Let $v_n = wx^* - wx_n$. From (27), we have

where $v_0 = wx^* - wx_0$ is the initial distance from $wx^*$. If $a$ is small so that $(1 - a\lambda)^n \approx 1 - na\lambda$, then, in order to guarantee $v_n \le \eta$, where $\eta$ is a small constant, the number of iterations $n$ must satisfy

Next, consider the case when the learning rate $a$ is replaced by the noisy learning rate $A = a + \sqrt{\mathrm{SNR}}\, v$ with noise $v \sim N(0, 1)$. Equation (27) becomes

where $A_i$ denotes the learning rate in the $i$th iteration. Similarly, (28) becomes

In order to guarantee

If both $a$ and SNR are small, then (32) can be simplified to

which leads to

Since $A_i \sim N(a, \mathrm{SNR})$ are independent Gaussian random variables, in order to make

for some small probability $\epsilon$, we need

Solving (36) for $n$, we get that the number of iterations needed when the learning rate becomes the random $A$ must satisfy

Using the lower bounds in (29) and (37), we can get the ratio of required iterations between the case of $a$ and the case of $A$ as

which is just (5). The theorem is proved.
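The linearization step $(1 - a\lambda)^n \approx 1 - na\lambda$ in the proof can be checked numerically. The following is a minimal sketch for the scalar, noiseless case only. It assumes the recursion $v_{n+1} = (1 - a\lambda) v_n$ and reads the resulting iteration bound as $n \ge (1 - \eta/v_0)/(a\lambda)$; both are inferred from the surrounding text rather than quoted from (28) and (29). The function names and parameter values are hypothetical.

```python
def iterations_to_converge(a_lambda, v0, eta, max_iter=10_000_000):
    """Count steps of v <- (1 - a*lambda) * v until v <= eta (scalar, noiseless)."""
    if not 0.0 < a_lambda < 1.0:
        raise ValueError("a_lambda must lie in (0, 1) for the recursion to contract")
    v, n = v0, 0
    while v > eta and n < max_iter:
        v *= 1.0 - a_lambda
        n += 1
    return n


def linearized_bound(a_lambda, v0, eta):
    """Iteration bound implied by (1 - a*lambda)^n ~ 1 - n*a*lambda (assumed form)."""
    return (1.0 - eta / v0) / a_lambda


if __name__ == "__main__":
    print(iterations_to_converge(1e-3, 1.0, 0.9))
    print(linearized_bound(1e-3, 1.0, 0.9))
```

Because $(1 - a\lambda)^n \ge 1 - na\lambda$ for $0 < a\lambda < 1$, the simulated count can never fall below the linearized value, so the latter acts as a lower bound on the number of iterations in this simplified setting.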
Remark 4: First, the proof is easier to understand if we consider $w$ as a row vector and $F$ as a scalar nonlinear monotone function. We present the general case with the matrix $w$ in the proof. One can actually treat each row of $w$ separately to get the same result. Second, although $R$ is defined as the ratio of iterations, it equals the ratio of query counts because a fixed number of queries is conducted to estimate the gradients in each iteration.
Third, we argue that $R$ can be used as an approximate estimate of QC(noise)/QC(noiseless), i.e., the ratio of QCs between the case with noise perturbation and the case without noise perturbation in our black-box attack and defense models. It is well known that the QC expression is hard or impossible to derive for black-box attacks on general DNNs because $F(x)$ is highly nonlinear/nonconvex and the black-box estimated gradient is not the true gradient. The key concept of our approach is that we consider a fixed optimization trajectory of the attacker from a starting input $x_0$ to the final adversarial input $x^*$. This trajectory is obtained by the attacker's gradient descent minimization without noise perturbation. Along this trajectory, the mapping $F(x)$ can approximately be assumed to be monotonically decreasing or piecewise monotonically decreasing from $x_0$ to $x^*$. The attacker's estimated gradients can also be regarded as true gradients with $a$ on this trajectory. The effect of noise perturbation is to change the value $a$ in each iteration to a random value $A$ with a certain SNR. As a result, the model and assumptions we made for deriving $R$ in this theorem are valid for analyzing the DNN attack-defense models. Therefore, it is reasonable to claim that if the attacker uses $N_a$ iterations to get the adversarial input $x^*$, it would need $R$ times as many iterations when the output noise perturbation changes $a$ to $A$.
Finally, based on the QC(noiseless) needed by the attackers when there is no noise perturbation (which can be obtained by experiments), we can estimate the QC(noise) needed when there is noise perturbation by multiplying QC(noiseless) by $R$. In this way, we can avoid the difficulty of finding QC(noise) directly with experiments. As shown by our analysis, noise perturbation may increase QC(noise) to a computationally prohibitive level, such as $10^{15}$ or more. When calculating $R$ numerically, we can use a very small $\eta/v_0$ (because $\eta$ is the desired small distance of the final result $wx_n$ to the target result $wx^*$ and $v_0$ is the initial distance), and a very small $\epsilon$ (because $1 - \epsilon$ is the confidence probability). We can use the average of $a$ defined in (2) as $a\lambda$ in $R$. As a matter of fact, because SNR is usually small, the value of $R$ is not very sensitive to these parameters. From (5), it can be easily seen that $R \approx C_0/\mathrm{SNR}$, where $C_0$ is a small constant.
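The scaling rule in this remark is simple arithmetic and can be sketched as follows. The snippet only uses the rough relation $R \approx C_0/\mathrm{SNR}$; the default $C_0 = 1$, the function name, and the example numbers are placeholders chosen for illustration and are not values reported in the paper.

```python
def estimate_noisy_query_count(qc_noiseless, snr, c0=1.0):
    """Scale a measured noiseless query count by R ~ c0 / SNR.

    c0 stands for the small constant C_0; its default value is a placeholder.
    """
    if qc_noiseless < 0 or snr <= 0 or c0 <= 0:
        raise ValueError("qc_noiseless must be >= 0, snr and c0 must be > 0")
    ratio = c0 / snr  # approximate ratio R of noisy to noiseless query counts
    return qc_noiseless * ratio


if __name__ == "__main__":
    # Hypothetical numbers: 1e4 noiseless queries and a gradient SNR of 1e-9.
    print(estimate_noisy_query_count(1e4, 1e-9))
```

Under these placeholder numbers the estimate is on the order of $10^{13}$ queries, which illustrates how a very low SNR translates into a computationally prohibitive query count.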

D Proof of Theorem 3
Let us reiterate the problem setting first. In order to improve the accuracy of the estimation of $g_j$, or specifically, the estimation of

the attacker can repeatedly query the DNN with inputs $x - \beta u_j$ and $x + \beta u_j$. The noisy outputs are $F_t(x - \beta u_j) + v_{t1,n}$ and $F_t(x + \beta u_j) + v_{t2,n}$ in the $n$th query, where $v_{t1,n}$ and $v_{t2,n}$ are independent Gaussian random variables $N(0, \sigma^2)$, $n = 1, \cdots, N$. The attacker uses the query results to calculate $y_n$ as

for each $n$. From Theorem 1, we have

where

To simplify notation, we let

because $F_t(x - \beta u_j) \approx F_t(x + \beta u_j)$. The problem is to estimate $a$ from $y_n$, $n = 1, \cdots, N$. We would like to find the $N$ that is needed for estimating $a$ reliably.
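To give a feel for how large $N$ has to be, the following sketch computes the number of repeated queries needed to get the sign of $a$ right with a given probability. It rests on assumptions that are not taken from the proof: the attacker's estimate $\hat{a}$ is the sample mean of the $y_n$, each $y_n$ is approximately Gaussian with mean $a > 0$, and $\mathrm{SNR} = a^2/\mathrm{Var}(y_n)$, so that $\hat{a} \sim N(a, a^2/(N \cdot \mathrm{SNR}))$. The function names are hypothetical; the threshold 0.3 mirrors the criterion $P[\hat{a} < 0] < 0.3$ stated in the caption of Figure 3(b).

```python
import math
from statistics import NormalDist


def sign_error_probability(snr, n_queries):
    """P[a_hat < 0] for a > 0 under the Gaussian sample-mean model described above."""
    return NormalDist().cdf(-math.sqrt(n_queries * snr))


def required_queries(snr, max_error_prob):
    """Smallest N whose sign-error probability is at most max_error_prob."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    if not 0.0 < max_error_prob < 0.5:
        raise ValueError("max_error_prob must lie in (0, 0.5)")
    z = NormalDist().inv_cdf(1.0 - max_error_prob)  # required value of sqrt(N * SNR)
    return max(1, math.ceil(z * z / snr))


if __name__ == "__main__":
    n = required_queries(1e-4, 0.3)
    print(n, sign_error_probability(1e-4, n))
```

In this model the required $N$ grows in proportion to $1/\mathrm{SNR}$, so each order of magnitude lost in SNR costs the attacker roughly an order of magnitude more repeated queries per gradient direction.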
where $u_j$ is the vector of gradient direction, which is pre-set and fixed, $\beta$ is the smoothing parameter, and $v_j$ is the noise added by the DNN when the attacker queries with $x + \beta u_j$.
The estimated gradient equals the vector $u_j$ multiplied by a scalar multiplication factor $A$, i.e.,

where

Define the noiseless term

Since the noise is small, the location of the maximum element almost surely does not change. We have

where $v_t$ and $v_{j,t}$ are the $t$th entries of the noise vectors $v$ and $v_j$, respectively. The random variables $v_{\max}$ and $v_{j,\max}$ are the noises added to the maximum-valued elements of $F(x)$ and $F(x + \beta u_j)$, respectively. Define

Each of $Z_1$ and $Z_2$ is the ratio of two independent Gaussian random variables and can be approximated as a single Gaussian random variable [34]. Specifically,

The probability density function (PDF) $p_Z(z)$ of the product $Z = Z_1 Z_2$ can be found based on [35].
Define $a = \frac{1}{\beta} \log h(x)$. Then from (58) and (61) we have

Therefore, we can see that noise perturbation randomizes AutoZOOM's gradient estimation in the same way as it does for the NES-based attack method.
To derive $A$'s distribution and SNR bound, when the noise variance $\sigma^2$ is small enough, we have

from which we can verify (7). In addition, the SNR bound can be proved by strictly following the proof of (4). The theorem is proved.
Remark 6: When deriving (61), we have assumed $[F(x) + v]_{\max} = F_{\max}(x) + v_{\max}$ and also $[F(x + \beta u_j) + v_j]_{\max} = F_{\max}(x + \beta u_j) + v_{j,\max}$, which means small noise does not change the index of the maximum-valued elements. This is true almost surely under small noise perturbation. On the other hand, the noise may accidentally change the index of the maximum-valued element. In this case, the two elements, the old $F_{\max}(x)$ and the new $F_{\mathrm{newmax}}(x)$, have similar (almost identical) values, since even tiny noise can switch their order. Therefore, (61) is still valid.

F Proof of Lemma 2
First, for output quantization, instead of outputting the full-precision 32-bit logit values, the DNN can output $Q \ge 2$ bit quantized logit values. Note that 1-bit quantization is actually hard-label output, for which only the special hard-label attack methods can work. It is well known that quantization introduces quantization noise. Under coarse quantization, attacks with the cross-entropy loss do not work because $a = \frac{1}{\beta} \log\left(F_t(x - \beta u)/F_t(x + \beta u)\right)$ quite often results in $a = 0$. However, the attacks with the C&W loss still work well. In other words, the quantization noise cannot mitigate such attacks. To explain this, let us look at the proof of Theorem 4 and consider the noise term $Z_2$. The noises are the quantization residues of $F_t(x)$ and $F_t(x + \beta u_j)$, whose quantized values are almost surely the same, say $Q$. This means $v_t = F_t(x) - Q$ and $v_{j,t} = F_t(x + \beta u_j) - Q$. We then have

Obviously, $\log Z_2$ is no longer randomly positive and negative. In other words, the variance of $Z_2$ is zero. Therefore, quantization noise cannot mitigate the attacks.
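The statement that coarse quantization often gives $a = 0$ for the cross-entropy loss can be illustrated with a toy calculation. The sketch below assumes a uniform rounding quantizer on outputs in $[0, 1]$, which is one possible scheme and not necessarily the one used in the experiments; the function names and the two sample output values are hypothetical.

```python
import math


def quantize(value, bits):
    """Round a value in [0, 1] to the nearest of 2**bits uniform levels (assumed scheme)."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels


def multiplication_factor(f_minus, f_plus, beta):
    """a = (1/beta) * log(F_t(x - beta*u) / F_t(x + beta*u)); both outputs must be positive."""
    if f_minus <= 0 or f_plus <= 0:
        raise ValueError("outputs must be positive to take the logarithm")
    return math.log(f_minus / f_plus) / beta


if __name__ == "__main__":
    f_minus, f_plus, beta = 0.2005, 0.2, 1e-3
    print(multiplication_factor(f_minus, f_plus, beta))  # full precision: nonzero
    # With 4 bits, both outputs fall on the same quantization level, so a is 0.
    print(multiplication_factor(quantize(f_minus, 4), quantize(f_plus, 4), beta))
```

The two nearby outputs collapse onto the same level under 4-bit quantization, so the ratio inside the logarithm is exactly 1 and the estimated factor vanishes, in line with the observation about cross-entropy attacks above.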
Second, for output-correlated noise, let us look at the proof of Theorem 1. If the noise $v$ is correlated with the output $F_t(x)$, then we have $v_t(j) = \sum_i \alpha_i F_t^i(x + \beta u_j) + \epsilon$ for a very small $\epsilon \to 0$, where $\alpha_i$ are correlation coefficients. From (14), it is easy to see that $Z$ is now randomized by $\epsilon$ only, which means a very small $\sigma_Z^2$ with a much-reduced noise perturbation effect. The attack mitigation effect is also reduced.

G Extra Experiment Results
G.1 Robust to Attacker's Countermeasure with increased query counts or repeated queries: We evaluated the performance of the output noise perturbation method under EOT-like countermeasures, where the attackers used $N$ repeated queries to average each $g_j$ (6), or used more non-repeated queries (large $J$) to look for better average gradients (2). First, for the countermeasure with $N$ repeated queries, the results in Table 6 show that our method was robust against this type of EOT-like countermeasure. There was no drastic change in ASR even when the number of repeated queries was increased to $N = 1000$, which means 3 orders of magnitude more QCs. Second, for the countermeasure with a larger $J$ value, the original NES and P-RGF attack algorithms both used $J = 50$. We experimented with $J = 100$, and the results are summarized in Table 7. From the simulation results, we can see that using a higher $J$ did not necessarily lead to better ASR. The results demonstrate that our noise perturbation method is also robust to this type of countermeasure. Note that the ASR of the untargeted attack (P-RGF) critically depends on the distortion threshold. We used the original relatively high distortion threshold for P-RGF, which resulted in relatively high ASR.

Table 6: Attack success rate (ASR) of the AutoZOOM targeted attack [7] under noise perturbation defense along with the countermeasure where the attacker used N repeated queries to average gradients.
$F_t(x)$ is the true logit value of the input $x$, and $F_{\max}(x)$ denotes the maximum logit value excluding the true logit. Our analysis and conclusions derived in Section 3 are still valid. Specifically, we still have the noisy multiplication factor $A = a + \beta^{-1} \log Z$ with an extremely low SNR, which leads to high QC and low ASR. This is demonstrated by our experimental results in Table 8 for soft-label attacks and Table 9 for hard-label attacks.

Table 9: Untargeted hard-label attacks: Attack success rate (ASR%) of the hard-label OPT attack [9], Sign-OPT attack [10], and Boundary attack [13].
Remark 7: Before proceeding, we would like to point out that this experiment demonstrated that the output noise perturbation defense was effective against the transfer-learning based attack [9] as well as the boundary-based attack [13], both of which were considered free of gradient estimation. The P-RGF attack [9] applied a transfer-learning model to assist gradient estimation, while the boundary-based attack [13] did not have explicit gradient calculation. Experimental data in Table 8 showed that P-RGF had its ASR reduced from 98% to 36% under $\sigma = 0.01$. The relatively high ASR of 36% was due to the strong transfer model and the larger $L_2$ distortion threshold used in the original source code. It is well known that untargeted attacks can always be successful as long as large distortion is allowed. The ASR would drop to a very low level if the transfer model did not fit well with the black-box DNN or a small $L_2$ distortion level was used. In contrast, the boundary attack data in Table 9 showed very low ASR at $\sigma = 0.1$. Although the ASR was relatively high at $\sigma = 0.01$, extremely high QCs were used (see Table 12). Limiting the QC realistically would reduce ASR and prevent this attack with our defense.
In Table 10 we show that our noise perturbation method was effective in mitigating the P-RGF untargeted attack algorithm with various IMAGENET classification models. The ASR decreased as the perturbation level increased. In Table 11 we compare the noise perturbation defense method with two existing defense algorithms: JPEG Compression [15] and Input Randomization [22]. As seen from the table, the proposed noise perturbation method had the best defense result.

Table 11 columns: Dataset; Attack Method; Without Defense; JPEG Compress [15]; Randomized Input [22]; Our Noise Perturbation.

G.3 Query Count and Distortion of Targeted/Untargeted Attacks: While Table 2 shows the ASR of the attack algorithms under our noise perturbation defense, we also obtained their average query counts and $L_2$ per-pixel distortion, which are shown in Table 12 and Table 13. The query counts were calculated with successful adversarial images under the $L_2$ distortion threshold. A query count of 0 means that no such successful adversarial images were available. We can see that, besides the ASR reduction, the query efficiency of all the attack algorithms dropped drastically due to noise perturbation. In general, the query count gradually increased with the noise perturbation level $\sigma$. However, after a certain noise level, when the number of successful adversarial images reduced to a certain level, we would have query counts lower than those of the noiseless case. This might be because the remaining successful adversarial images were those that were easy to attack.

There were some successful adversarial images depending on the level of noise perturbation. We were interested in checking whether these images were truly successful attacks. A successful attack requires low enough distortion. Therefore, we checked the $L_2$ distortion of these adversarial images. The comparison of $L_2$ distortion between the noiseless attacks and noise defenses is shown in Table 13. We can see that the $L_2$ distortion increased with the noise level $\sigma$. The distortion under $\sigma = 0.01$ was several times larger than that of the noiseless attack. This means that even if the attacks were considered successful, the adversarial samples had high distortion. On the IMAGENET dataset, the ZOO and AutoZOOM algorithms had no data in targeted attacks because their ASR was 0. P-RGF, NES, AutoZOOM, and ZOO had $L_2$ distortion thresholds ranked from large to small. Their ASR also ranked from high to low under noise perturbation.
G.4 Sample Adversarial Images: Fig. 5 shows some sample IMAGENET images generated by the adversarial algorithms. Heavier distortions can be seen when $\sigma \ge 10^{-4}$. In particular, some images obtained by the NES targeted attacks were blacked out but were still classified as successful attacks. Fig. 6 shows the visual effects of adversarial MNIST and CIFAR10 images. Similarly, the adversarial examples at higher $\sigma$ could no longer deceive human perception, especially in targeted attacks.

Figure 1: The model of black-box attack and the defense via output noise perturbation.

Figure 2: (a) Classification accuracy degradation due to noise perturbation. (b) SNR of noisy model outputs $F(x) + v$ (Output) and SNR of gradient multiplication factor $A$ (Grad).

Figure 3: (a) Ratio $R$ of query counts of noisy case to noiseless case. (b) Number of repeated queries $N$ required to estimate $a$ so that $P[\hat{a} < 0] < 0.3$ when $a > 0$.

Figure 4: Probability of noisy multiplication factor $A$ becoming negative when the true value $a$ is positive. $\beta = 10^{-3}$.
Table 11: ASR (%) comparison of three defense methods in untargeted attack setting.

G.3 Query Count and Distortion of Targeted/Untargeted Attacks: While Table
3.1 Model of Black-Box Attack and Output Noise Perturbation Defense

Consider a DNN that classifies an image $x_0$ into class $c$. The DNN outputs (softmax) logits $F(x_0)$, where $F$ is the DNN's nonlinear mapping function. The classification result is $c = \arg\max_i F_i(x_0)$, where $F_i$ denotes the $i$th element function of $F$.

Table 1: Statistical DNN output parameters obtained from validation datasets (without noise perturbation). ACC: classification accuracy. Mean $F_t(x)$: average softmax output values (excluding the top-1 pick). Mean/median/std $\Delta F_t$: mean, median, and standard deviation of output variation $\Delta F_t$.

Table 4: ASR (%) comparison of three defense methods: output noise perturbation method, and two input randomization methods. Soft-label targeted attack.

Table 5: Attack success rate (ASR) of the ZOO and AutoZOOM black-box attack algorithms under quantization noise and output-correlated noise. Soft-label targeted attack.

Table 6: Attack success rate (ASR) of the AutoZOOM targeted attack

Table 10: Attack Success Rate (ASR%) versus defense noise standard deviation $\sigma$ for P-RGF over IMAGENET for different pre-trained models. Untargeted attack.