Binary Classification Under ℓ0 Attacks for General Noise Distribution

Adversarial examples have recently drawn considerable attention in the field of machine learning due to the fact that small perturbations in the data can result in major performance degradation. This phenomenon is usually modeled by a malicious adversary that can apply perturbations to the data in a constrained fashion, such as being bounded in a certain norm. In this paper, we study this problem when the adversary is constrained by the ℓ0 norm; i.e., it can perturb a certain number of coordinates in the input, but has no limit on how much it can perturb those coordinates. Due to the combinatorial nature of this setting, we need to go beyond the standard techniques in robust machine learning to address this problem. We consider a binary classification scenario where d noisy data samples of the true label are provided to us after adversarial perturbations. We introduce a classification method which employs a nonlinear component called truncation, and show that, in an asymptotic scenario, as long as the adversary is restricted to perturb no more than √d data samples, we can almost achieve the optimal classification error in the absence of the adversary, i.e., we can completely neutralize the adversary's effect. Surprisingly, we observe a phase transition: using a converse argument, we show that if the adversary can perturb more than √d coordinates, no classifier can do better than a random guess.


I. INTRODUCTION
It is well known that machine learning models are susceptible to adversarial attacks that can cause classification errors. These attacks are typically in the form of a small norm-bounded perturbation to the input data that is carefully designed to incur misclassification; e.g., they can be in the form of an additive ℓp-bounded perturbation for some p ≥ 0 [1], [2], [3], [4], [5].
There is an extensive body of prior work studying adversarial machine learning, most of which has focused on ℓ2 and ℓ∞ attacks [6], [7], [8], [9]. To train models that are more robust against such attacks, adversarial training is the state-of-the-art defense method. However, the success of the current adversarial training methods is mainly based on empirical evaluations [5]. It is therefore imperative to study the fundamental limits of robust machine learning under different classification settings and attack models.
In this paper, we focus on the important case of ℓ0-bounded attacks, which has been less investigated so far. In such attacks, given an ℓ0 budget k, an adversary can change k entries of the input vector in an arbitrary fashion; i.e., the adversarial perturbations belong to the so-called ℓ0 ball of radius k. In contrast with ℓp balls (p ≥ 1), the ℓ0 ball is non-convex and non-smooth. Moreover, the ℓ0 ball contains inherent discrete (combinatorial) structures that can be exploited by both the learner and the adversary. As a result, the ℓ0-adversarial setting bears various challenges that are absent in common ℓp-adversarial settings. In this regard, it has recently been shown that any piecewise-linear classifier, e.g. a feed-forward deep neural network with ReLU activations, completely fails in the ℓ0 setting [10].
Perturbing only a few components of the data or signal has many real-world applications, including natural language processing [11], malware detection [12], and physical attacks in object detection [13]. There have been several prior works on ℓ0-adversarial attacks, including white-box attacks that are gradient-based, e.g. [4], [14], and [15], and black-box attacks based on zeroth-order optimization, e.g. [16] and [17]. Defense strategies against ℓ0-bounded attacks have also been proposed, e.g. defenses based on randomized ablation [18] and defensive distillation [19]. None of the above works have studied the fundamental limits of the ℓ0-adversarial setting theoretically. In our prior work, we studied the ℓ0-adversarial setting for the case of the Gaussian mixture model [20]. In this paper, we generalize our results to the case of binary classification with a general noise distribution.
We note that a line of work in distributed hypothesis testing has considered Byzantine attacks, where a fraction of compromised nodes may cooperatively transmit fictitious observations according to different arbitrary distributions. This is different from the ℓ0 attack setting, where k of the observations can be arbitrarily and adversarially changed (as opposed to their distributions being adversarially changed) [21], [22], [23].
The goal of this paper is to characterize the optimal classifier and the corresponding robust classification error as a function of the adversary's budget k. More precisely, we focus on the binary classification setting with general but i.i.d. noise distributions, where the input is generated according to the following model: x_i = yµ + z_i, where y ∈ {−1, 1} is the true label, the z_i are i.i.d. zero-mean noise samples, and µ is the mean parameter. We seek to find the robust classification error of the optimal classifier in this setting. In other words, we would like to study "how robust" a classifier we can design given a certain budget for an ℓ0 adversary. Specifically, we consider the asymptotic regime where the dimension of the input gets large, and ask the following fundamental question: What is the maximum adversary's budget for which the optimal error in the absence of an adversary (the standard error) can still be achieved, and how does this limit scale with the input's dimension?
The main contributions of the paper to answer the above questions are as follows.
• We prove an achievability result by introducing a classifier and characterizing its performance (Theorem 3).
Our proposed classification method finds the likelihood of each data sample, and applies truncation by removing a few of the largest and a few of the smallest values. This truncation phase effectively removes the "outliers" present in the input due to adversarial modification.
• We prove a converse result (Theorem 4) by finding a lower bound on the optimal robust error, and show that the two bounds asymptotically match as the dimension d → ∞; hence our proposed classification method is optimally robust against such adversarial attacks. The key idea behind the converse proof is to use techniques from optimal transport theory and to study the asymptotic behavior of the maximal coupling between the data distributions under the two labels +1 and −1. We use such a coupling to design a strategy for the adversary by making the distribution "look almost the same" under the two labels, hence removing the information about the true label.
• Surprisingly, we observe a phase transition for the optimal robust error in terms of the adversary's budget (Theorem 1). Roughly speaking, we observe that if the adversary's budget is below √d, we can asymptotically achieve the optimal standard error, which corresponds to the case where there is no adversary, while if the adversary's budget is above √d, no classifier can do better than a random guess. In other words, we can compensate for the presence of the adversary with respect to the performance metric specified in this paper as long as the adversary's budget is below √d. In this case, roughly speaking, we asymptotically achieve a performance as if there were no adversary. On the other hand, above the threshold √d, the adversary can perturb the data in such a way that the information about the true label is lost, and hence no classifier can do better than a random guess. Consequently, there is no trade-off between robustness and accuracy in this setting.
• We show that, as a result of the above phase transition, there is a classifier that asymptotically achieves the optimal standard error for the whole sub-√d regime. We also prove a bound on the convergence rate of the robust loss of the proposed classifier to the optimal standard error as the dimension d goes to infinity (Theorem 2).
• We have shown in a previous work [20] that truncation is effective in robustifying against ℓ0 attacks in a Gaussian mixture setting. In addition, we have observed that adding truncation as a component to neural network models improves robustness against sparse attacks in practical image classification tasks [24]. The current work aims to bridge the gap between our previous theoretical work, which was restricted to the Gaussian distribution, and practical applications such as the aforementioned image classification, where the data distribution is not Gaussian. Even though the purpose of the current work is to establish the fundamental limits of robustness and not to investigate algorithmic efficiency, our results reassure us that truncation can be a useful defense mechanism against sparse attacks in general. Truncation has been proved to be useful in robustifying learning algorithms against sparse attacks in other scenarios as well, such as sparse recovery [25] and learning of graphical models [26].
In Section II, we give the problem formulation, in Section III we discuss the main results, and in Section IV we conclude the paper. Proofs are discussed in the appendices.
We close this section by introducing some notation. We denote the set of integers {1, . . . , n} by [n]. Φ̄(x) := (1/√(2π)) ∫_x^∞ exp(−t²/2) dt denotes the complementary CDF of a standard normal distribution. N(µ, σ²) denotes a real-valued normal distribution with mean µ and variance σ². We write →(dist) and →(prob) for convergence in distribution and convergence in probability, respectively. X ∼ p(.) means that the random variable X has distribution p(.). We use boldface notation for vectors in the Euclidean space, e.g. x ∈ R^d.

II. PROBLEM FORMULATION
We consider the binary classification setting where the true label is Y ∼ Unif{±1} and, conditioned on a realization y, d independent real-valued data samples are generated as x_i = yµ_d + z_i for i ∈ [d], where z_1, . . . , z_d are i.i.d. samples of a zero-mean real-valued noise distribution which has a density q(.). We consider a high-dimensional setting where the dimension d → ∞, and µ_d can depend on the data dimension d. However, we assume that the noise density q(.) is fixed and known. Note that since the ℓ0 norm is invariant under scalar multiplication, we can arbitrarily normalize the quantities, and this assumption is made without loss of generality. We denote the vector of the input data samples by x^(d) = (x_1, . . . , x_d). Throughout this paper, the superscript (d) emphasizes the dependence on the dimension d. However, we may drop it from the notation whenever the dimension is clear from the context. A classifier is a measurable function C : x → {±1} which predicts the true label from the input x. We consider the 0-1 loss ℓ(C; x, y) := 1[C(x) ≠ y] as a metric for the discrepancy between the prediction of the classifier on the input x and the true label y.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
We assume that an adversary is allowed to perturb the input x within the ℓ0 ball of radius k, i.e. the set {x′ ∈ R^d : ∥x′ − x∥_0 ≤ k}. Effectively, the adversary can change at most k data samples. The parameter k is called the adversary's budget. Similar to the above, whenever the dimension d is clear from the context, we may denote the adversary's perturbed data samples as x′ = (x′_1, . . . , x′_d). In this setting, the robust classification error (or robust error for short) associated to a classifier C is defined to be

L^(d)_{µ_d,q}(C, k) := E[ max_{x′ : ∥x′ − x∥_0 ≤ k} ℓ(C; x′, y) ],

where the expectation is taken with respect to the above-mentioned distribution parametrized by d, µ_d, and q. The optimal robust classification error (or optimal robust error for short) is defined by optimizing the robust error over all possible (measurable) classifiers:

L*^(d)_{µ_d,q}(k) := inf_C L^(d)_{µ_d,q}(C, k).

In words, L*^(d)_{µ_d,q}(k) is the minimum error that any classifier can achieve in the presence of an adversary with an ℓ0 budget k. In other words, no classifier can obtain a robust error smaller than L*^(d)_{µ_d,q}(k) in this setting. Whenever the problem parameters are clear from the context, we may drop them from the notation and write L^(d)(C, k) or L(C, k), and L*^(d)(k) or L*(k).
In the absence of the adversary, or equivalently when k = 0, L*(0) reduces to the optimal standard error, which is the optimal Bayes error of estimating Y upon observing the noisy samples x_1, . . . , x_d. In order to fix the baseline, specifically to have a meaningful asymptotic discussion as d → ∞, we assume that µ_d is such that the optimal standard error L*^(d)_{µ_d,q}(0) remains constant as d → ∞. As we will see later (see Proposition 1 in Section III-A), this is achieved when µ_d = c/√d for some c > 0. Motivated by this, we study the setting where µ_d = c/√d for some constant c > 0 throughout this paper. When µ_d = c/√d with c < 0, similar results still hold after substituting c with |c|.
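As a numerical sanity check of this scaling (a sketch of ours for the Gaussian case, not part of the paper's formal development; the helper name and parameter values are arbitrary choices), the Monte Carlo estimate below shows the Bayes error staying essentially flat in d when µ_d = c/√d, matching the standard normal tail at c:

```python
import numpy as np
from math import erfc, sqrt

def bayes_error_mc(d, c, n_trials=200_000, seed=0):
    """Monte Carlo estimate of the Bayes error for x_i = y*mu_d + z_i,
    z_i ~ N(0, 1), mu_d = c/sqrt(d); the Bayes rule is sign(sum_i x_i)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n_trials)
    # sum_i x_i = y*c*sqrt(d) + N(0, d), simulated directly.
    s = y * c * sqrt(d) + rng.normal(0.0, sqrt(d), size=n_trials)
    return float(np.mean(np.sign(s) != y))

c = 1.0
theory = 0.5 * erfc(c / sqrt(2))          # standard normal tail at c, ~0.1587
errs = [bayes_error_mc(d, c) for d in (100, 1_000, 10_000)]
```

Each estimate lands near 0.159 regardless of d, illustrating why µ_d = c/√d is the right normalization for a nontrivial asymptotic baseline.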

III. MAIN RESULTS
In order to prove our main results, we need the following assumptions on the noise distribution q(.). We will show later (see Section III-D) that all of these assumptions are satisfied for a large class of distributions, including the exponential family of distributions with polynomial exponents, e.g. the normal distribution.
Assumption 1: We have q(z) > 0 for all z ∈ R, q(.) is three times continuously differentiable, and appropriate integrability conditions hold for q′(.) and q′′(.), where q′(.) and q′′(.) denote the first and second derivatives of q(.). Furthermore, the location family of distributions

q(z; θ) := q(z − θ),   θ ∈ R,   (4)

has well-defined and finite Fisher information {I_q(θ)}_{θ∈R}. The Fisher information of the parametric family of distributions q(z; θ), where z, θ ∈ R, is defined to be

I_q(θ) := ∫ ( (∂/∂θ) log q(z; θ) )² q(z; θ) dz.

See, for instance, [27] for more details. Since q(z; θ) = q(z − θ) is a location family, it turns out that I_q(θ) is independent of θ. The common value, which we denote by I_q by an abuse of notation, is given by

I_q = ∫_{−∞}^{∞} (q′(z))² / q(z) dz.   (5)
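As a numerical illustration of the last formula (a sketch of ours; the grid bounds and helper name are arbitrary), I_q can be evaluated directly on a grid; for the standard normal density the result is 1:

```python
import numpy as np

def fisher_information(q, lo=-10.0, hi=10.0, n=200_001):
    """Numerically evaluate I_q = integral of q'(z)^2 / q(z) dz
    for a location family q(z; theta) = q(z - theta)."""
    z = np.linspace(lo, hi, n)
    qz = q(z)
    dq = np.gradient(qz, z)        # finite-difference approximation of q'(z)
    return float(np.trapz(dq**2 / qz, z))

gauss = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
I_q = fisher_information(gauss)    # ~1.0 for the standard normal
```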
The following theorem formalizes the phase transition we discussed previously, i.e. if the adversary's budget is orderwise below √d, we can compensate for the presence of the adversary with respect to the asymptotic performance metric discussed here, while if the adversary's budget is orderwise above √d, no classifier can do better than a random guess. As we discussed previously, we assume that µ_d = c/√d for a constant c > 0 to ensure that the standard error is asymptotically constant (see Proposition 1 in Section III-A).
Theorem 1: Assume that µ_d = c/√d for some constant c > 0, and the Assumptions 1-4 are satisfied for the noise density q(.).
1) If lim sup_{d→∞} log_d k_d < 1/2, there exists a sequence of classifiers C^(d)_{k_d} such that

lim_{d→∞} L^(d)_{µ_d,q}(C^(d)_{k_d}, k_d) = lim_{d→∞} L*^(d)_{µ_d,q}(0).

In other words, the excess risk of this sequence of classifiers as compared to the optimal standard error (when there is no adversary) converges to zero.
2) If lim inf_{d→∞} log_d k_d > 1/2, we have

lim_{d→∞} L*^(d)_{µ_d,q}(k_d) = 1/2.

In other words, no classifier can asymptotically do better than a random guess. The proof of this result, which is given in Appendix D, essentially follows from Theorems 3 and 4 in Sections III-B and III-C below. More precisely, in Section III-B, we prove an achievability result by introducing a sequence of robust classifiers in the sub-√d regime (first part of the theorem), while in Section III-C, we prove a converse result by introducing a strategy for the adversary in the super-√d regime which perturbs the data in such a way that the information about the true label is asymptotically removed (second part of the theorem).
Even though in the first part of Theorem 1 above, the classifier C^(d)_{k_d} knows the adversary's budget k_d, it turns out that this assumption is not necessary. Intuitively speaking, the phase transition observed above suggests that if the classifier prepares for the worst-case recoverable adversary's budget ≈ √d, there is no asymptotic trade-off between accuracy and robustness, and such a classifier should be asymptotically uniformly optimal for the whole range of adversary's budgets below √d. Corollary 1 below, whose proof is given in Appendix D, formalizes this intuition.
Corollary 1: Assume that µ_d = c/√d for some constant c > 0, and the Assumptions 1-4 are satisfied for the noise density q(.). For any fixed ϵ > 0, there is a classifier C^(d) such that for any sequence k_d with lim sup_{d→∞} log_d k_d ≤ 1/2 − ϵ, we have

lim_{d→∞} L^(d)_{µ_d,q}(C^(d), k_d) = lim_{d→∞} L*^(d)_{µ_d,q}(0).

The first part of Theorem 1 states that in the sub-√d regime, the excess risk L^(d)_{µ_d,q}(C^(d)_{k_d}, k_d) − L*^(d)_{µ_d,q}(0) asymptotically goes to zero. The following result studies the rate of convergence in this regime. For technical convenience, we restrict ourselves to the Gaussian distribution in Theorem 2 below. This result bounds how fast the excess risk of the proposed classifier vanishes in the Gaussian case as d grows.
Theorem 2: Assume that µ_d = c/√d for some constant c > 0, and the noise is zero-mean Gaussian, i.e. q(z) = (1/√(2πσ²)) exp(−z²/(2σ²)).

A. Asymptotic Standard Error
Recall that in the absence of the adversary, or equivalently when the adversary's budget k is zero, the optimal robust error L*^(d)_{µ_d,q}(0) reduces to the optimal Bayes error of estimating Y upon observing the noisy samples x_1, . . . , x_d. With an abuse of notation, we write L*^(d)_{µ_d,q} (or L* for short) for this optimal Bayes error. Our goal in this section is to find the appropriate scaling of µ_d with d such that L*^(d)_{µ_d,q} converges to a constant as d → ∞.
In order to characterize L*, note that since there is no adversary, and the prior on Y is uniform, the optimal Bayes classifier is the maximum likelihood estimator that computes the log likelihood ratio

ℓ(x^(d)) := Σ_{i=1}^{d} log( q(x_i − µ_d) / q(x_i + µ_d) ),   (8)

and returns the estimate ŷ of y as

ŷ := sign( ℓ(x^(d)) ).   (9)

The following Proposition 1 shows that if µ_d = c/√d, then the optimal Bayes error converges to a constant. The proof of Proposition 1 is given in Appendix A.
Proposition 1: Assume that Assumptions 1 and 2 are satisfied for the noise density q(.). Then, if µ_d = c/√d for some constant c > 0, we have

lim_{d→∞} L*^(d)_{µ_d,q} = Φ̄(c√I_q).

Furthermore, in this case, as d → ∞, conditioned on Y = +1, the log likelihood converges in distribution to a normal N(2c²I_q, 4c²I_q), where I_q was defined in (5) and is the Fisher information associated to the location family defined in (4). Moreover, conditioned on Y = −1, the log likelihood converges in distribution to a normal N(−2c²I_q, 4c²I_q).
Remark 1: As we will see in Appendix A, if c < 0, we need to replace Φ̄(c√I_q) by Φ̄(|c|√I_q) in the above result.

B. Achievability: Upper Bound on the Optimal Robust Error
In this section, we introduce a classifier and study its robustness against ℓ0 adversarial perturbations. Recall that if k is the adversary's budget, the input to the classifier is x′ = (x′_1, . . . , x′_d), which is different from the original sequence x_1, . . . , x_d in at most k coordinates. Recall from Section III-A that in the absence of the adversary, the optimal Bayes classifier is the maximum likelihood estimator based on the sum of the per-sample log likelihood ratios, as was defined in (8). Motivated by this, we define

x̃′_i := log( q(x′_i − µ_d) / q(x′_i + µ_d) ),   i ∈ [d].   (10)

Note that if x̃′^(d) denotes the vector (x̃′_i : i ∈ [d]), then

Σ_{i=1}^{d} x̃′_i   (11)

is precisely the log likelihood ratio in (8) evaluated on the perturbed input x′^(d). We define the truncated classifier as follows. Given a vector u = (u_i : i ∈ [d]) ∈ R^d and an integer k ≥ 0, we define the truncated summation TSum_k(u) to be the summation of the coordinates of u except for the top and bottom k coordinates. More precisely, let s = (s_i : i ∈ [d]) = sort(u) be obtained by sorting the coordinates of u in descending order. We then define

TSum_k(u) := Σ_{i=k+1}^{d−k} s_i.   (12)

When k = 0, this indeed reduces to the normal summation. Motivated by (11), we replace the summation with its truncated version and define the truncated classifier

C^(d)_k(x′^(d)) := sign( TSum_k(x̃′^(d)) ).   (13)

This method essentially removes the "outliers" introduced by the adversary into the data.
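A minimal executable sketch of this construction for Gaussian noise (the parameter values, the planted attack, and the helper names are our own illustrative choices, not the paper's): each coordinate is mapped to its log-likelihood ratio as in (10), the top and bottom k values are dropped as in (12), and the predicted label is the sign of what remains as in (13).

```python
import numpy as np

def tsum(u, k):
    # TSum_k(u): sum every coordinate except the k largest and k smallest.
    s = np.sort(u)
    return s[k:len(u) - k].sum() if k > 0 else u.sum()

def truncated_classifier(x_prime, mu, log_q, k):
    # Per-sample log-likelihood ratio between labels +1 and -1,
    # then a truncated sign test on the sum.
    llr = log_q(x_prime - mu) - log_q(x_prime + mu)
    return 1 if tsum(llr, k) >= 0 else -1

# Demo with Gaussian noise: mu_d = c/sqrt(d), adversary overwrites k samples.
rng = np.random.default_rng(0)
log_q = lambda z: -z**2 / 2              # Gaussian log-density up to a constant
d, c, k = 10_000, 4.0, 50                # k is well below sqrt(d) = 100
mu = c / np.sqrt(d)

truncated_correct = 0
plain_correct = 0
for _ in range(30):
    y = 1
    x_prime = y * mu + rng.normal(size=d)
    x_prime[:k] = -100.0                 # k outliers pushing toward label -1
    truncated_correct += truncated_classifier(x_prime, mu, log_q, k) == y
    plain_correct += truncated_classifier(x_prime, mu, log_q, 0) == y
```

In this run the untruncated sum (k = 0) is dominated by the k planted outliers and essentially always mispredicts, while truncation discards them and recovers the label almost every time.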
The following theorem shows that this classifier is asymptotically robust against adversarial attacks with ℓ0 budget k_d as long as k_d grows orderwise slower than √d. A matching lower bound is provided in Section III-C. The proof of Theorem 3 below is given in Appendix B.
Theorem 3: Assume that Assumptions 1-4 are satisfied for the noise density q(.), µ_d = c/√d for some constant c > 0, and k_d ≤ d^(1/2−ϵ) for some ϵ > 0. Then we have

lim_{d→∞} L^(d)_{µ_d,q}(C^(d)_{k_d}, k_d) = Φ̄(c√I_q).   (14)

In particular, we have

lim_{d→∞} [ L^(d)_{µ_d,q}(C^(d)_{k_d}, k_d) − L*^(d)_{µ_d,q} ] = 0.   (15)

Note that L*^(d)_{µ_d,q}, as was defined in Section III-A above, is the optimal Bayes error in an ideal scenario when there is no adversary, and L^(d)_{µ_d,q}(C^(d)_{k_d}, k_d) − L*^(d)_{µ_d,q} is the excess error of our truncated classifier with respect to this ideal scenario. In fact, (15) implies that our truncated classifier is asymptotically optimal in the specified regime of adversary's budget. The truncated classifier manages to compensate for the presence of the adversary, and performs as if there is no adversary.
Remark 2: As we will see in Appendix B, if c < 0, we need to replace Φ̄(c√I_q) by Φ̄(|c|√I_q) in the above theorem.

C. Converse: Lower Bound on the Optimal Robust Error
In this section, we provide a lower bound on the optimal robust error. In Section III-B, we observed that, roughly speaking, if the adversary's budget is below √d, we can asymptotically compensate for its effect and recover the Bayes optimal error, as if no adversary is present. In this section, we show that, roughly speaking, if the adversary's budget is above √d, no classifier can asymptotically do better than a random guess, resulting in a robust error of 1/2. We do this by introducing an attack strategy for the adversary. In this strategy, the adversary, with a sufficiently large budget, perturbs the input data in such a way that all the information about the true label Y is lost, resulting in perturbed data which has a vanishing correlation with the true label. The proof of Theorem 4 below is given in Appendix C.
Theorem 4: Assume that Assumptions 1 and 3 are satisfied for the noise density q(.), and µ_d = c/√d for some c > 0. Then, if k_d is a sequence of adversary's budgets such that k_d > d^(1/2+ϵ) for some ϵ > 0, we have

lim_{d→∞} L*^(d)_{µ_d,q}(k_d) = 1/2.

D. Exponential Family of Distributions
In this section, we show that the Assumptions 1-4 are all satisfied for a large class of distributions, namely the exponential family of noise distributions of the form

q(z) = exp(−ψ(z)) / A,

where

ψ(z) = Σ_{i=0}^{2n} a_i z^i

is a polynomial in z with even degree 2n > 0 such that a_{2n} > 0. Here, A := ∫_{−∞}^{∞} exp(−ψ(z)) dz is the normalizing constant. Note that since the exponent −ψ(.) has an even degree with a negative leading coefficient, we have A < ∞.
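As a small numerical check (a sketch of ours; the particular polynomial ψ(z) = z⁴/4 + z²/2 is an arbitrary member of this family), the normalizing constant is indeed finite and the resulting density integrates to one and has zero mean:

```python
import numpy as np

# psi(z) = z^4/4 + z^2/2: even degree, positive leading coefficient.
psi = lambda z: z**4 / 4 + z**2 / 2
z = np.linspace(-10.0, 10.0, 400_001)
unnorm = np.exp(-psi(z))                 # exp(-psi) decays faster than a Gaussian
A = float(np.trapz(unnorm, z))           # normalizing constant, finite
q = unnorm / A
total_mass = float(np.trapz(q, z))       # should be ~1
mean = float(np.trapz(z * q, z))         # ~0 by symmetry of psi
```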

IV. CONCLUSION
We studied the binary classification problem in the presence of an adversary constrained by the ℓ0 norm. We introduced a robust classification method which employs truncation on the log likelihood. We showed that this classification method can asymptotically compensate for the presence of the adversary as long as the adversary's budget is orderwise below √d. Moreover, we showed a phase transition through a converse argument, in the sense that no classifier can asymptotically do better than a random guess if the adversary's budget is orderwise above √d.

APPENDIX A PROOF OF PROPOSITION 1
Proof of Proposition 1: To simplify the discussion and to avoid considering multiple cases, it turns out that it is more convenient to allow the constant c to be negative. Therefore, for the rest of the proof, we assume that µ_d = c/√d where c ∈ R and c ≠ 0. Note that even for negative c, the maximum likelihood estimator in (9) is still the optimal Bayes estimator. We focus on the first term. Using µ_d = c/√d, we may write the following. From Assumption 1, we know that q(.) is positive everywhere and three times continuously differentiable, hence log q(.) is
three times continuously differentiable. Therefore, writing the Taylor expansion, we get the following, where ϵ_i is random and only depends on z_i. Substituting this into (18), we get an expression which we now study term by term. T_1: Denoting (d/dz) q(z) by q′(z), we have the following, where the last equality uses Assumption 1. Therefore, I_q is the Fisher information associated to the location family of distributions q(z; θ) defined in (4). Note that Assumption 1 ensures that I_q is well-defined and finite.
Therefore, combining this with (20) and using the central limit theorem, we obtain the limiting distribution of T_1. T_2: Using Assumption 1 and [27, Lemma 5.3], we have E[(d²/dz²) log q(Z)] = −I_q. Substituting this into (23), we get the following.
Since I_q < ∞ from Assumption 1, using the law of large numbers in (22), we obtain the limit of T_2. T_3: Since µ_d = c/√d, we may bound T_3 as follows, where the last line uses the fact that ϵ_i ∈ (0, 2µ_d).
Using the law of large numbers together with Assumption 2, this together with (25) implies the limit of T_3. Using (21), (24), and (26) back into (19), and substituting c with −c in this result, we obtain (27) and (28). Using (27) and (28) back into (17) gives the claimed limit of the Bayes error. Moreover, since the left-hand side of (27) is precisely the log likelihood when Y = +1, we realize that conditioned on Y = +1, the log likelihood converges in distribution to a normal N(2c²I_q, 4c²I_q). Likewise, (28) implies that conditioned on Y = −1, the log likelihood converges in distribution to a normal N(−2c²I_q, 4c²I_q). This completes the proof. □

APPENDIX B PROOF OF THEOREM 3
The following lemma will be useful in our analysis.
Lemma 1 (Lemma 1 in [20]): In particular, for w being the all-one vector, we have the corresponding bound.
Proof of Theorem 3: It turns out that in order to simplify the discussion and to avoid considering multiple cases, it is more convenient to allow c to be negative. Therefore, in this proof we assume that µ_d = c/√d where c ∈ R and c ≠ 0.
Note that we still stick to the definitions in (10) and (13). Using Lemma 1 in (29), we get (30). We study each of the two terms separately.
Conditioned on Y = +1, using the Taylor expansion, we get (32). For T_1, note that using Assumption 4, there are constants γ > 0 and C_4 > 0 such that the tail bound holds. For T_2, let d be large enough so that, with the constant ζ in Assumption 3, the corresponding bound (33) holds. For such d, we may write the following.
Combining the bound from Assumption 3 with (33) and substituting into (32), we realize that conditioned on Y = +1, k_d ∥x^(d)∥_∞ converges to zero in probability as d → ∞. On the other hand, from Proposition 1, we know that conditioned on Y = +1, the log likelihood converges in distribution to a normal N(2c²I_q, 4c²I_q). Consequently, we obtain (35). Conditioned on Y = −1, comparing with (35), we realize that by replacing c with −c in the above discussion for Y = +1, we obtain the analogous limit. Combining this with (35) and substituting back into (30), we get (14), which completes the proof of (14).
In order to prove (15), note that the above results together with Proposition 1 imply the upper bound on the excess error. On the other hand, we have lim inf_{d→∞} of the excess error bounded below by 0, which completes the proof. □

APPENDIX C PROOF OF THEOREM 4
Consider the set of all joint distributions of random variables (X+, X−) where the marginal distribution of X+ is the same as the distribution of Z + µ_d where Z ∼ q(.), and the marginal distribution of X− is the same as the distribution of Z − µ_d. In other words, we consider the set of all couplings of Z + µ_d and Z − µ_d. In fact, the marginal distribution of X+ is the same as that of a data sample conditioned on Y = +1, and the marginal distribution of X− is the same as that of a data sample conditioned on Y = −1. Fix a maximal coupling (X+, X−) in this set, which is defined to be a coupling that maximizes P(X+ = X−), or equivalently minimizes P(X+ ≠ X−). We use such a maximal coupling to design an effective strategy for the adversary. To gain some intuition, an example of this strategy for the Gaussian mixture distribution is provided in Section C-A below. Note that maximal coupling is intuitively relevant to adversarial perturbations, since the adversary wants to change the data so that the samples conditioned on Y = +1 and Y = −1 "look almost the same", so that the classifier can extract minimal or no information about the true label upon observing adversarially perturbed samples. Given such a maximal coupling, let W be defined accordingly. Moreover, let Y ∼ Unif(±1) be independent from (X+, X−, W) and define X accordingly. It is easy to verify that (X, Y) has the same joint distribution as our true feature vector-label pair, i.e. for a ∈ R the corresponding identity holds. Keep in mind that the joint distribution of (X+, X−, W, Y, X) depends on µ_d and hence on d. However, we do not make such a dependence explicit, to simplify the notation.
Note that by definition, W is a function of (X+, X−) and hence is independent from Y. This suggests that W can be considered as a good candidate for the adversary's perturbation, since the adversary would ideally like to perturb the data in a way that the information about the true label is removed. More precisely, given the true label y and data samples (x_1, . . . , x_d), the adversary forms w^(d) = (w_1, . . . , w_d); these values do not bear any information about the label y, indicating that w^(d) is an ideal candidate for the adversary. However, ∥w^(d) − x^(d)∥_0 might be above the adversary's budget k_d. In order to address this, we define the final perturbed data vector x′^(d) as follows. This ensures that indeed ∥x′^(d) − x^(d)∥_0 ≤ k_d. The following lemma will be later useful to make this statement precise. The proof of Lemma 2 below is given at the end of this section.
Lemma 2: Assume that the Assumptions 1 and 3 are satisfied and µ_d = c/√d for some c > 0. Then for any δ > 0, the bounds (37) and (38) hold.
Proof of Theorem 4: Assume that the adversary employs the above strategy to perturb the input samples. In order to obtain a lower bound for the optimal robust error L*^(d)_{µ_d,q}(k_d), we consider any classifier C. Let I be the indicator of the event ∥w^(d) − x^(d)∥_0 > k_d. We assume that the classifier knows the adversary's strategy, and also observes I. This indeed makes the classifier stronger and results in a lower bound for the robust error. Note that if I = 0, then x′^(d) = w^(d) is independent from y, and no classifier can do better than a random guess, resulting in an error of 1/2. Since this holds for any classifier C, we obtain (39). Using the Markov inequality, we have P(I = 0|Y = −1) → 1 as d → ∞. Using these in (39), we realize that lim inf_{d→∞} L*^(d)_{µ_d,q}(k_d) ≥ 1/2, which completes the proof. □
Proof of Lemma 2: Let p+ and p− denote the distributions of X+ and X−, respectively. The total variation distance between p+ and p− is defined to be the supremum of |p+(B) − p−(B)| over all Borel sets B in R.
It is well known that (see, for instance, [28, Lemma 8.1]) if (X+, X−) is the optimal coupling that minimizes P(X+ ≠ X−), then P(X+ ≠ X−) equals the total variation distance between p+ and p−. We then have the chain of (in)equalities where (a) uses the fact that, by definition, conditioned on Y = +1, we have X = X+; in (b) we use the fact that Y is independent from (X+, X−) and W is a function of (X+, X−); in (c) we use the definition of W to conclude that if X+ = X−, we have W = X+; and finally (d) uses Pinsker's inequality (see, for instance, [28, Theorem 4.19]), where D(p+∥p−) is the Kullback-Leibler (KL) divergence between p+ and p−. With an abuse of notation, we may use q+ and q− for the densities of X+ and X−, respectively, so that q+(x) = q(x − µ_d) and q−(x) = q(x + µ_d). Writing the Taylor expansion, we get

log q(z + 2µ_d) = log q(z) + 2µ_d (d/dz) log q(z) + 2µ_d² (d²/dz²) log q(z + ϵ_z),

where |ϵ_z| < 2|µ_d|. Using this in (41), we get (42). Observe that from Assumption 1, the first-order term vanishes in expectation (43). Moreover, the second-order term is bounded by α, the resulting constant, which is finite from Assumption 3. Using this together with (43) in (42), we realize that for d large enough, we have D(q+∥q−) ≤ A/d. Using this in (40), we realize that for d large enough, the bound (37) follows. The proof of (38) is similar. This completes the proof. □
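To make the Gaussian instance of this argument concrete (a numerical sketch of ours; the values c = 1, d = 10 000 and the helper name are arbitrary choices): under a maximal coupling, P(X+ ≠ X−) equals the total variation distance between q(· − µ_d) and q(· + µ_d), so the expected number of coordinates the adversary must change is d · TV, which is of order √d when µ_d = c/√d — consistent with the super-√d budget in Theorem 4 sufficing.

```python
import numpy as np
from math import erf, sqrt, pi

def tv_gaussians(mu, sigma=1.0):
    """Numerical total variation distance between N(mu, sigma^2) and N(-mu, sigma^2)."""
    z = np.linspace(-12.0, 12.0, 200_001)
    norm = sigma * np.sqrt(2 * np.pi)
    p_plus = np.exp(-(z - mu)**2 / (2 * sigma**2)) / norm
    p_minus = np.exp(-(z + mu)**2 / (2 * sigma**2)) / norm
    return float(0.5 * np.trapz(np.abs(p_plus - p_minus), z))

c, d = 1.0, 10_000
mu = c / sqrt(d)
tv = tv_gaussians(mu)                    # closed form here: erf(mu / sqrt(2))
expected_changes = d * tv                # ~ c * sqrt(2*d/pi), i.e. order sqrt(d)
```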

A. Illustration of the Adversary's Strategy in the Gaussian Mixture Setting
In this section, in order to give some intuition on our proposed strategy for the adversary, we provide an example of the above adversary's strategy in the case where the data distribution is a Gaussian mixture. More precisely, let q(z) = (1/√(2πσ²)) exp(−z²/(2σ²)) be the zero-mean normal distribution with variance σ². To simplify the discussion, in this section we assume that µ = µ_d > 0. It is straightforward to verify that in this case, the optimal coupling (X+, X−) has the following form,
where a ∧ b is a shorthand for min(a, b).Furthermore, f X+,X− (x + , x − ) := 0 otherwise.It turns our that there is a more intuitive and helpful way of expressing this coupling, which is as follows.Define and It is easy to verify that q min and q ∆ are probability densities, and q ∆ is supported on the positive values.Let I be a binary random variable of the form Now, conditioned on I = eqal, we sample X + = X − from q min , and conditioned on I = unequal, we let It can be verified that (X + , X − ) has the same joint distribution as the optimal coupling in (44).In fact, X + = X − precisely when I = equal, clarifying the purpose of the indicator random variable I. Now, let us consider the joint distribution of (I, X + , X − , W, Y, X) as described previously, where and W = 0 otherwise.Equivalently, W = X + when I = equal and W = 0 when I = unequal.Now, we find the conditional distribution of W conditioned on X, Y .We may write Note that when I = equal, W = X + = X − , and X = X + = X − , therefore W = X.On the other hand, when I = unequal, W = 0. 
Therefore, we may write Now, using Bayes' rule, we may write Note that we have where (a) uses the independence of Y from I, (b) uses the fact that when I = equal, we have X = X_+ = X_−, (c) uses the fact that Y is independent of (I, X_+, X_−), and (d) uses the fact that, as discussed above, conditioned on I = equal we have X_+ = X_− ∼ q_min. On the other hand, as discussed previously, (X, Y) has the same joint distribution as the true feature vector and label pair, which is a Gaussian mixture in this example. Therefore, we have where q_y is interpreted as q_+ when y = +1 and q_− when y = −1. Substituting (47) and (48) back into (46), we arrive at Using this in (45), we get f_{W|X,Y}(w|x, y) = α(x, y)δ(w − x) + (1 − α(x, y))δ(w). Now, we give an intuitive explanation of this adversarial modification strategy. Assume that y = +1. If x < 0, then q_+(x) < q_−(x) and α(x, y) = 1, hence w = x and no perturbation is applied. On the other hand, when y = +1 and x > 0, we have α(x, y) = q_−(x)/q_+(x); with probability 1 − α(x, y) we set w = 0, and with probability α(x, y) we keep w = x. When y = −1, we follow a similar procedure with the roles of the two densities reversed. Intuitively speaking, this procedure "symmetrizes" the two shifted normal distributions by bringing them down to their common (minimum) distribution, hence removing the information about y. See Figure 1 for an illustration. It turns out that for the Gaussian mixture setting discussed in this section, our perturbation strategy is identical to the one proposed in our earlier work [20]. In other words, in this paper we have generalized this perturbation strategy beyond the Gaussian mixture setting.
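The symmetrizing perturbation described above is easy to simulate. The sketch below is our own illustration under illustrative parameter choices (µ = σ = 1, not taken from the paper); it applies the rule "keep x with probability α(x, y), otherwise output 0" and checks empirically that the perturbed sample W carries essentially no information about the label:

```python
import math, random

random.seed(0)

# Simulation of the symmetrizing perturbation for the Gaussian mixture,
# with illustrative parameters mu = 1, sigma = 1 (not taken from the paper).

def q(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def perturb(x, y, mu=1.0, sigma=1.0):
    # Keep x where its own density is already the smaller one; otherwise keep it
    # only with probability alpha(x, y) = q_other(x) / q_own(x), else output 0.
    q_own = q(x, y * mu, sigma)
    q_other = q(x, -y * mu, sigma)
    if q_own <= q_other:
        return x                                  # no perturbation applied
    return x if random.random() < q_other / q_own else 0.0

def mean_w(y, n=200_000, mu=1.0, sigma=1.0):
    return sum(perturb(random.gauss(y * mu, sigma), y, mu, sigma) for _ in range(n)) / n

m_plus, m_minus = mean_w(+1), mean_w(-1)
assert abs(m_plus - m_minus) < 0.03               # W is (nearly) uninformative about y
```

The kept mass has density q_min under either label, and the remaining mass collapses to the atom at 0, so the law of W is the same under y = +1 and y = −1 up to simulation noise.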

APPENDIX D PROOF OF THEOREM 1 AND COROLLARY 1
Subtracting L*^(d)_{µ_d,q}(0) from both sides and sending d → ∞, we get The proof is complete by noting that, from the first part of Theorem 1, the right-hand side is zero. □

APPENDIX E PROOF OF THEOREM 5
Here, we prove that Assumptions 1-4 are all satisfied for the noise density q(·) of the form (16). Before doing so, we need some lemmas. The proofs of Lemmas 3, 4, and 5 below are given at the end of this section.
Lemma 3: Assume that a degree-n polynomial p : ℝ → ℝ is given. Given ϵ > 0, we define p̄ : ℝ → ℝ as follows Then, there exists a polynomial r : ℝ → ℝ with degree n such that for all x ∈ ℝ, we have p̄(x) ≤ r(|x|). Lemma 4: Given the noise density q(·) as in (16), there exists a constant c₁ > 0 such that for all t ≥ c₁, if Z is a random variable with law q(·), we have Lemma 5: Given the noise density q(·) as in (16), there exists a constant c₂ > 0 such that = 0, where (Z_i : i ≥ 1) are i.i.d. random variables with law q(·).
Proof of Theorem 5: As in (16), let q(z) ∝ exp(−ψ(z)), where ψ(z) is a polynomial in z with even degree 2n > 0 such that a_{2n} > 0. We verify each of the four assumptions separately. Assumption 1: It is straightforward to check that q(z) > 0 for all z and that q(·) is three times continuously differentiable. In order to verify (3), note that where the last step follows from the fact that ψ′(z) is a polynomial in z and ψ(z) is a polynomial with even degree and positive leading coefficient. This implies that Since ψ′′(z) + (ψ′(z))² is a polynomial in z, similarly to the above we have Therefore, we get ∫_{−∞}^{∞} q′′(z)dz = 0 similarly to the above. This establishes (3).
On the other hand, for the family of densities q(z; θ) = q(z − θ), we have Hence, recalling the definition of the Fisher information, we have Note that ((d/dz) log q(z))² is a polynomial in z. This means that the above quantity is well defined and finite, and hence the Fisher information I(θ) is well defined and finite for all θ.
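To make this verification of Assumption 1 concrete, the following numerical sketch instantiates a density of the form (16) with the illustrative choice ψ(z) = z⁴ (the quadrature routine and all constants are our own assumptions, not part of the paper), and checks that q integrates to one, that the integrals of q′ and q′′ vanish, and that the Fisher information at θ = 0 is finite:

```python
import math

# Numerical sanity check of Assumption 1 for a concrete density of the form (16):
# q(z) = exp(-psi(z)) / Z with the illustrative choice psi(z) = z**4
# (even degree 2n = 4, positive leading coefficient a_{2n} = 1).

def psi(z):
    return z ** 4

def trapz(f, a=-10.0, b=10.0, n=100_000):
    # Composite trapezoidal quadrature; q and its derivatives vanish well before |z| = 10.
    h = (b - a) / n
    return h * ((f(a) + f(b)) / 2 + sum(f(a + i * h) for i in range(1, n)))

Z = trapz(lambda z: math.exp(-psi(z)))       # normalizing constant

def q(z):
    return math.exp(-psi(z)) / Z

def q1(z):                                   # q'(z) = -psi'(z) q(z)
    return -4 * z ** 3 * q(z)

def q2(z):                                   # q''(z) = (psi'(z)**2 - psi''(z)) q(z)
    return ((4 * z ** 3) ** 2 - 12 * z ** 2) * q(z)

assert abs(trapz(q) - 1) < 1e-4              # q integrates to one
assert abs(trapz(q1)) < 1e-4                 # integral of q' vanishes
assert abs(trapz(q2)) < 1e-4                 # integral of q'' vanishes
fisher = trapz(lambda z: (4 * z ** 3) ** 2 * q(z))
assert 0 < fisher < math.inf                 # Fisher information is finite at theta = 0
```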
Assumption 2: Note that d³/dt³ log q(t) is a polynomial in t; therefore, Lemma 3 implies that for ζ > 0, there exists a polynomial r : ℝ → ℝ such that Therefore, since all the moments of q(·) are finite and r(·) is a polynomial, the expectation of the left-hand side is finite.
Assumption 3: Similarly to the above case, since |d²/dt² log q(t)|² is bounded by a polynomial and all the moments of q(·) are finite, the expectation is indeed finite.
Assumption 4: Note that Therefore, for all z ∈ ℝ, and for all (z_i : Observe that there exists a constant C₄ > 0 such that for d large enough, we have Combining this with the above argument, we conclude that for d large enough Hence, for all x ∉ A, we have Furthermore, let Note that B is a compact set and p(·) is continuous. Therefore, we may define and β < ∞. Since for x ∈ A we have [x − ϵ, x + ϵ] ⊂ B, we may write Combining this with (50), we conclude that for all x ∈ ℝ, we have where r(·) is a polynomial of degree n. This completes the proof. □ Proof of Lemma 4: Recalling the polynomial form of ψ(·) and the assumption that a_{2n} > 0, we see that there exists c₁ > 0 such that if z > c₁, we have and if z < −c₁, we have Thereby, if t ≥ c₁, we have Similarly, using (53), for t ≥ c₁ we may write
Combining (54) and (55) and using the union bound, we arrive at the desired result. □ Proof of Lemma 5: Since a_{2n} > 0, we may choose c₂ large enough so that (a_{2n}/2) c₂^{2n} > 1.
Using the union bound, we get where Z ∼ q(·). Using Lemma 4, if d is large enough so that c₂(log d) Using (58) in (57), we conclude that for d large enough, We now focus on the term conditioned on Y = +1, since the term conditioned on Y = −1 is similar. Note that in this case, using the fact that q(·) is the Gaussian density, we get where w_i = z_i/σ ∼ N(0, 1) are i.i.d. standard normal random variables. Using this, we may write where w = (w₁, …, w_d) ∼ N(0, I_{d×d}). Using Lemma 6 with a = √(2(1 + ϵ) log d), where ϵ is a constant that will be specified later, we have Using this in (59), we conclude that where the last inequality uses the fact that the derivative of Φ̄ is negative and decreasing in absolute value. This yields where A := 16|Φ̄′(c/2σ)| 2(1 + ϵ) is a constant independent of d.
On the other hand, in the absence of the adversary, the optimal classifier is the maximum likelihood classifier. Therefore, we have where the last step uses (60) and calculations similar to the above. Now, combining (62), (63), and (64), we conclude that for d large enough, we have The proof is complete by putting (65) and (66) together. □

Given the data samples x^(d) = (x^(d)_i : i ∈ [d]), we generate the modified data samples w^(d) = (w^(d)_i : i ∈ [d]) such that the w^(d)_i are conditionally independent given y, and w^(d)_i is generated from the law of W conditioned on Y = y and X = x^(d)_i. As we discussed above, W is independent of Y; hence the modified samples w^(d)_i are independent of y as well.

Since |ϵ_z| ≤ 2|µ_d| and µ_d = c/√d → 0 as d → ∞, for d large enough we have |ϵ_z| < ζ for all z ∈ ℝ, where ζ is the constant in Assumption 3. Thereby, for d large enough, we have

Proof of Theorem 1: Note that if lim sup_{d→∞} log_d k_d < 1/2, there exists ϵ > 0 such that for d large enough, log_d k_d < 1/2 − ϵ, or equivalently k_d < d^{1/2−ϵ}. Therefore, the first part of the theorem follows from Theorem 3. On the other hand, if lim inf_{d→∞} log_d k_d > 1/2, there exists ϵ > 0 such that for

Fig. 1. The idea behind our proposed strategy for the adversary in one dimension for the Gaussian mixture model. Assume that the adversary observes a realization (x, y) such that y = +1, meaning that x is a realization of N(µ, σ²) (i.e., the blue curve). If x ≤ 0, the adversary leaves it unchanged, i.e., w = x. On the other hand, if x > 0, we compute the ratio between the two densities (which is precisely α(x, y) shown in the figure), and with probability 1 − α(x, y) we set w = 0. When y = −1, we follow a similar procedure with the roles of the densities reversed. Intuitively speaking, this procedure "symmetrizes" the two distributions by bringing them down to the common (minimum) distribution. It is easy to see that by doing so, the distribution of w is the same when y = +1 and y = −1; hence w bears no information about y.

d large enough, log_d k_d > 1/2 + ϵ, or equivalently, k_d > d^{1/2+ϵ}. Therefore, the second part of the theorem follows from Theorem 4. Proof of Corollary 1: Let C^(d) := C^(d)_{d^{1/2−ϵ}} be the truncated classifier defined in Section III-B with truncation parameter d^{1/2−ϵ}. Note that lim sup log_d k_d < 1/2 − ϵ means that log_d k_d ≤ 1/2 − ϵ for d large enough, or equivalently k_d ≤ d^{1/2−ϵ} for d large enough. On the other hand, the robust error defined in (1) is clearly nondecreasing in the adversary's budget; therefore, for d large enough we have
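For intuition on why truncation neutralizes an ℓ₀-bounded adversary, the following sketch implements a clipped-sum classifier in the spirit of the truncated classifier C^(d). The clipping threshold T and all parameters are our own illustrative choices (the precise definition lives in Section III-B, which is not reproduced in this appendix). Even when the adversary drives k ≈ √d/2 coordinates to arbitrarily extreme values, each perturbed coordinate can shift the clipped sum by at most about 2T, which is dwarfed by the signal:

```python
import math, random

random.seed(2)

# Illustrative clipped-sum classifier in the spirit of the truncated classifier of
# Section III-B. The threshold T and all parameters below are our own choices.

def truncated_classifier(x, T):
    # Clip each coordinate to [-T, T], sum, and output the sign.
    s = sum(max(-T, min(T, xi)) for xi in x)
    return 1 if s >= 0 else -1

d, mu, sigma, eps = 4000, 0.5, 1.0, 0.1
T = mu * d ** (0.5 - eps)          # hypothetical truncation level
k = int(math.sqrt(d) / 2)          # adversary's budget, below the sqrt(d) threshold

trials, errors = 200, 0
for _ in range(trials):
    y = random.choice([-1, 1])
    x = [y * mu + random.gauss(0.0, sigma) for _ in range(d)]
    x[:k] = [-y * 1e9] * k         # adversary drives k coordinates to an extreme value
    errors += truncated_classifier(x, T) != y

# Each perturbed coordinate shifts the clipped sum by at most ~2T; with
# k = O(sqrt(d)) this is dwarfed by the signal of order mu * d.
assert errors / trials < 0.1
```

Without the clipping step, a single coordinate set to ∓10⁹ would flip the sign of the plain sum, which is exactly the failure mode the nonlinear truncation prevents.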

≤ 1/(log d), which goes to zero as d → ∞ due to (56). This completes the proof. □ APPENDIX F PROOF OF THEOREM 2 Before giving the proof of Theorem 2, we state and prove the following standard Gaussian tail bound. Lemma 6: Let w₁, …, w_d be i.i.d. standard normal random variables. For a > 0, we have P(max_{1≤i≤d} w_i > a) ≤ (d/(√(2π) a)) exp(−a²/2). Proof of Lemma 6: Recalling that we use the notation Φ̄(·) for the complementary CDF of the standard normal distribution, we have Φ̄(a) ≤ (1/(√(2π) a)) exp(−a²/2). Using this together with the union bound, we get P(max_{1≤i≤d} w_i > a) ≤ d P(w₁ > a) = d Φ̄(a) ≤ (d/(√(2π) a)) exp(−a²/2), which is the desired bound. □ Now we are ready to prove Theorem 2. Proof of Theorem 2: Recall from the proof of Theorem 3 in Appendix B that 8k_d ∥x^(d)∥_∞ Y = −8k_d ∥x^(d)∥_∞ Y = −1. (59)
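The tail bound of Lemma 6 can be checked numerically. The sketch below (our own illustration; all parameter values are assumptions) compares the empirical probability that the maximum of d standard normals exceeds a = √(2(1+ϵ) log d) against the stated bound d exp(−a²/2)/(√(2π) a):

```python
import math, random

random.seed(1)

# Numerical check of the Gaussian tail bound in Lemma 6:
# P(max_i w_i > a) <= d / (sqrt(2*pi) * a) * exp(-a^2 / 2). Parameters are illustrative.

def tail_bound(d, a):
    return d / (math.sqrt(2 * math.pi) * a) * math.exp(-a ** 2 / 2)

def empirical_prob(d, a, trials=2000):
    # Fraction of trials in which the maximum of d standard normals exceeds a.
    hits = sum(max(random.gauss(0.0, 1.0) for _ in range(d)) > a for _ in range(trials))
    return hits / trials

d, eps = 200, 0.1
a = math.sqrt(2 * (1 + eps) * math.log(d))   # the choice a = sqrt(2(1+eps) log d) from the proof
assert empirical_prob(d, a) <= tail_bound(d, a) + 0.02
```

With this choice of a, the bound evaluates to d^{−ϵ}/(√(2π) a), which vanishes as d → ∞, as the proof of Theorem 2 requires.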