Relationship Between Nonsmoothness in Adversarial Training, Constraints of Attacks, and Flatness in the Input Space

Adversarial training (AT) is a promising method to improve robustness against adversarial attacks. However, its performance is still not satisfactory in practice compared with standard training. To reveal the cause of the difficulty of AT, we analyze the smoothness of the loss function in AT, which determines the training performance. We reveal that nonsmoothness is caused by the constraint of adversarial attacks and depends on the type of constraint. Specifically, the $L_\infty$ constraint can cause nonsmoothness more than the $L_2$ constraint. In addition, we found an interesting property of AT: the flatter the loss surface in the input space, the less smooth the adversarial loss surface in the parameter space tends to be. To confirm that nonsmoothness causes the poor performance of AT, we theoretically and experimentally show that smoothing the adversarial loss by EntropySGD (EnSGD) improves the performance of AT.


Sekitoshi Kanai, Masanori Yamada, Hiroshi Takahashi, Yuki Yamanaka, and Yasutoshi Ida, Member, IEEE

I. INTRODUCTION
While deep learning is starting to play a crucial role in modern data analysis applications [1], [2], [3], [4], deep learning applications are threatened by adversarial examples [5], [6], which are data perturbed to make models misclassify them. To prevent misclassification due to adversarial examples, many defense methods have been studied [6], [7], [8], [9], [10], [11], [12], [13]. In particular, adversarial training (AT) has attracted the most attention because of its good performance [7], [8], [11], [14]. AT is the following procedure: we generate adversarial examples of training data by maximizing the loss function with respect to perturbations under norm constraints, and we then minimize the loss function on them with respect to the parameters. As a result, models obtain robustness against adversarial examples. However, AT still has difficulty achieving robust accuracies comparable to the clean accuracies achieved by standard training. In particular, the generalization performance of AT is lower than that of standard training [15], [16].
To reveal the causes of this difficulty, several studies have investigated the loss landscape in the parameter space [17], [18], [19]. Liu et al. [17] investigated the smoothness of AT and showed that the loss function of AT (hereinafter called adversarial loss) is not a Lipschitz-smooth function: i.e., its gradient is not Lipschitz continuous [20].
Similarly, [21] has shown that AT has less uniform stability than standard training due to the nonsmoothness of adversarial loss. This lower stability causes the lower generalization performance of AT. However, these analyses fail to prove the nonsmoothness of AT under general conditions: the theoretical results in [17] do not consider the dependence of adversarial examples on the model parameters, and [21] only shows that there exists at least one model that suffers from less stability. Wu et al. [18] and Yamada et al. [19] investigated the flatness of adversarial loss. However, [18] did not provide any theoretical analyses, and [19] only provided a theoretical analysis for linear models. Thus, a more general and rigorous analysis of the characteristics of adversarial loss is required to reveal the causes of the difficulty of AT.
In this article, we rigorously analyze the smoothness of adversarial loss in the parameter space, considering the dependence of adversarial examples on the model parameters for general models. Our analysis reveals that the smoothness in the parameter space depends on the type of norm constraint and the flatness of the loss in the input space. To the best of our knowledge, this result is the first to bridge the studies investigating the adversarial loss surface in the input space [22], [23], [24] and in the parameter space [17], [18], [19], [21]. First, we analyze the Lipschitz continuity of gradients of adversarial loss for binary linear classification, since we can obtain the optimal adversarial examples in closed form. This reveals that adversarial loss can be a locally smooth function and that the smoothness depends on the constraints: the L∞ constraint can have more nonsmooth points than the L2 constraint. Next, we extend the analysis to general cases, including nonlinear models and multiclass classification, by using the optimal adversarial attack. By using the optimality condition, we reveal that the adversarial loss can be a locally smooth function if the optimal attack is inside the feasible region of the constraints (i.e., if the constraints do not affect the optimal attack). In addition, even if the optimal attack is on the boundary of the feasible region, the adversarial loss with the L2 norm constraint can be locally smooth. Furthermore, our analysis indicates that the Lipschitz constant of the gradient with respect to parameters can increase if we obtain a model parameter at which the loss function is flat in the input space, i.e., at which the largest eigenvalue of the Hessian matrix with respect to the input is small. These results explain why AT is more difficult than standard training: Lipschitz constants of gradients should be small for convergence and stability [25], [26], [27], whereas the loss landscape in the input space should be flat for robustness. Our smoothness analysis also indicates that tradeoff-inspired adversarial defense via surrogate-loss minimization (TRADES) [28] is superior to naïve AT [8], because it controls the smoothness of the objective function by adding the adversarial loss to the clean loss. Finally, to show that improvements in smoothness contribute to AT, we apply EntropySGD (EnSGD) [29], which modifies the objective function, to AT. We prove that the objective function of EnSGD becomes smooth and improves convergence and stability when the original loss function it is applied to is nonsmooth and nonnegative. Our experimental results demonstrate that smoothing by EnSGD improves robust accuracies. These results imply that nonsmoothness is a cause of the difficulty of AT.
Our main contributions are summarized as follows.
1) We reveal that the smoothness of adversarial loss depends on the type of constraint on adversarial examples. Specifically, AT with L∞ attacks has more nonsmooth points than AT with L2 attacks.

2) We reveal that if we obtain a parameter located on a flat loss surface in the input space for robustness, the Lipschitz constant of the gradient of adversarial loss in the parameter space increases. This can be one cause of the difficulty of AT.

3) To confirm that nonsmoothness is a cause of the difficulty, we show that smoothing the loss in AT improves robustness. We prove that even if the original loss function is nonsmooth, the loss of EnSGD is smooth, and we experimentally show that the adversarial loss smoothed by EnSGD improves robust accuracies on test data in AT.

The rest of this article is organized as follows. Section II briefly explains AT, the importance of smoothness, and EnSGD. Section III gives the analysis of the smoothness of AT. First, we analyze the smoothness by using simple problems (Section III-A) and then extend the analysis to the general case (Section III-B). We additionally investigate the smoothness of TRADES (Section III-C) and discuss nonsmoothness beyond our theoretical results (Section III-D). We show that EnSGD addresses nonsmoothness in AT in Section IV and verify its effectiveness by experiments in Section V. Section VI concludes this article.

II. PRELIMINARIES

A. Adversarial Training
An adversarial example x′ for a data point x ∈ Rᵈ with a label y is formulated as follows:

\[ x' = x + \delta^{*}, \qquad \delta^{*} = \underset{\|\delta\|_{p} \le \varepsilon}{\arg\max}\ \ell(x + \delta, y, \theta), \tag{1} \]

where ||·||_p is the L_p norm, θ ∈ Rᵐ is a parameter vector, ε is the magnitude of the perturbation δ, and ℓ is a loss function. AT on a dataset {(x_n, y_n)}_{n=1}^N is formulated as

\[ \min_{\theta} L_{\varepsilon}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell_{\varepsilon}(x_n, \theta), \qquad \ell_{\varepsilon}(x_n, \theta) = \max_{\|\delta_n\|_{p} \le \varepsilon} \ell(x_n + \delta_n, y_n, \theta), \tag{2} \]

where L_ε is the objective function of AT, and ℓ_ε is the adversarial loss for each data point. To solve the inner maximization problem, projected gradient descent (PGD) [7], [8] is widely used. PGD with the L∞ constraint iteratively updates the adversarial examples as follows:

\[ \delta^{t+1} = \Pi_{\varepsilon}\big(\delta^{t} + \eta_{\mathrm{PGD}}\, \mathrm{sign}(\nabla_{\delta}\, \ell(x + \delta^{t}, y, \theta))\big), \]

where η_PGD is a step size and Π_ε is the projection onto the feasible region {δ | ||δ||∞ ≤ ε}.
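As a concrete reference, the following is a minimal PyTorch sketch of the PGD update above (our illustration; the hyperparameter values are assumptions, not the paper's settings):

```python
import torch

def pgd_linf(model, loss_fn, x, y, eps=8/255, step=2/255, iters=10):
    """Minimal PGD sketch for the inner maximization under ||delta||_inf <= eps."""
    # Random start inside the feasible region {delta : ||delta||_inf <= eps}.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(iters):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascent step on the loss, then projection Pi_eps onto the L-inf ball.
        delta = (delta + step * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()
```

For image data, one would additionally clamp x + delta to the valid pixel range.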
We discuss the Lipschitz smoothness of adversarial loss: i.e., the Lipschitz continuity of the gradient of adversarial loss. We use the following standard definition: a function g(θ) is β-Lipschitz if ||g(θ1) − g(θ2)||₂ ≤ β||θ1 − θ2||₂ for all θ1 and θ2, and a differentiable function L(θ) is β-smooth if its gradient ∇L is β-Lipschitz.

B. Importance of Smoothness
The smoothness of the objective function L_ε(θ) in (2) is an important property for the convergence and generalization performance of gradient-based optimization [25], [26], [27], [31].
A gradient method converges to a stationary point of the loss function L(θ) if L(θ) is β-smooth on a bounded set Θ such that {θ | L(θ) ≤ L(θ0)} ⊂ Θ [31], where θ0 is the initial parameter of training. Naturally, smoothness is also important for stochastic gradient descent (SGD) methods and determines their convergence rate [25], [26], [32], [33]. For a β-smooth nonconvex objective function, the randomized stochastic gradient method converges at a rate of O(1/√T) in terms of the expected squared gradient norm when the step size is set on the basis of β, D, and σ, where T is the number of iterations, σ² is the variance of the gradients, and D = (2(L(θ0) − L*)/β)^{1/2} with L* the optimal value of L. For nonsmooth convex objective functions, [33] shows that the convergence rate of SGD can be O((log T)/√T). The smoothness also affects generalization performance, because it affects the uniform stability of SGD [21], [25], [26], [27]. The e_stab-uniform stability of an algorithm A is defined by sup_x E_A[ℓ(A(S), x) − ℓ(A(S′), x)] ≤ e_stab for datasets S and S′ that differ in at most one data sample x. For a β-smooth and one-Lipschitz convex function with fixed learning rate η, e_stab is bounded by 2ηT/N [25]. On the other hand, if the objective function is nonsmooth, e_stab is bounded by 4(η√T + ηT/N) [27]. Since uniform stability yields a generalization bound [25], [27], a nonsmooth loss function tends to have a larger generalization error than a smooth one. Xing et al. [21] have shown that the lower bound of the uniform stability of AT can be the same as that of a nonsmooth loss. However, they only prove the existence of such a model and loss function, and the model and loss in their proof are not common settings. Thus, a nonsmoothness analysis is needed to obtain the uniform stability of general AT in the general case.
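To make the gap between the two stability bounds concrete, here is an illustrative comparison under assumed values η = 0.01, T = 10⁴, N = 5 × 10⁴, and a Lipschitz constant of one (our numbers, not the paper's):

\[
e_{\mathrm{stab}}^{\text{smooth}} = \frac{2\eta T}{N} = \frac{2 \cdot 0.01 \cdot 10^{4}}{5 \cdot 10^{4}} = 4 \times 10^{-3},
\qquad
e_{\mathrm{stab}}^{\text{nonsmooth}} = 4\Big(\eta\sqrt{T} + \frac{\eta T}{N}\Big) = 4(1 + 0.002) \approx 4,
\]

i.e., the nonsmooth bound is three orders of magnitude larger in this setting.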

C. EntropySGD
To improve the smoothness of the objective function L(θ) in deep learning, [34] presented EnSGD. It modifies the objective function into the negative local entropy

\[ -F(\theta) = -\frac{1}{b}\log \int \exp\Big(-b\Big(L(\theta') + \frac{\gamma}{2}\|\theta - \theta'\|_2^2\Big)\Big)\, d\theta', \tag{11} \]

where γ is a hyperparameter that determines the smoothness of the loss, and b is usually set to one. While [34] maximizes F(θ), we consider the minimization problem of −F(θ) to treat it in the same way as (2). The gradient of EnSGD with b = 1 becomes

\[ -\nabla_{\theta} F(\theta) = \gamma\big(\theta - \mathbb{E}_{p_{\theta}(\theta')}[\theta']\big), \tag{12} \]

where p_θ is a probability density function as follows:

\[ p_{\theta}(\theta') = \frac{\exp\big(-L(\theta') - \frac{\gamma}{2}\|\theta - \theta'\|_2^2\big)}{\int \exp\big(-L(\theta'') - \frac{\gamma}{2}\|\theta - \theta''\|_2^2\big)\, d\theta''}. \tag{13} \]

Equation (12) is computed by using stochastic gradient Langevin dynamics. Chaudhari et al. [34] have shown that EnSGD improves the smoothness of a β-smooth loss: i.e., the Lipschitz constant of −∇θF(θ) is smaller than β when L(θ) is β-smooth. However, they do not show that EnSGD is effective for nonsmooth loss functions. In addition, though several studies [35], [36] apply Langevin dynamics to generate attacks, few studies apply it to optimize parameters in AT. Thus, its effectiveness for AT, which suffers from nonsmoothness, is still unclear. Note that we focus on EnSGD, since it is a basic method for smoothing the loss.

III. SMOOTHNESS OF AT

First, we show that the smoothness of adversarial loss depends on the smoothness of adversarial examples with respect to the parameters. Let θ be the model parameter, and let x′ be the optimal adversarial attack. Throughout, we use the following assumption.

Assumption 1: The gradient of ℓ with respect to θ is Lipschitz continuous in both arguments: ||∇θℓ(x, θ1) − ∇θℓ(x, θ2)||₂ ≤ β_θθ||θ1 − θ2||₂ and ||∇θℓ(x1, θ) − ∇θℓ(x2, θ)||₂ ≤ β_θx||x1 − x2||₂.

We derive the following lemma.

Lemma 1: Under Assumption 1, if the optimal adversarial example x′(θ) is a ξ-Lipschitz continuous function of θ, then ∇θℓ_ε(θ) = ∇θℓ(x′(θ), θ) is (β_θθ + ξβ_θx)-Lipschitz continuous.
All proofs are provided in the Appendix. This lemma indicates that the nonsmoothness of adversarial loss cannot be established when x′ is a smooth function of the parameter θ. Thus, we need to analyze the dependence of adversarial examples on the parameters. However, adversarial examples for deep neural networks (DNNs) cannot be obtained in closed form. Therefore, we first tackle this problem by using binary linear classification and then investigate the smoothness for general models using the optimal adversarial examples.

A. Binary Linear Classification
In this section, we investigate the smoothness of the adversarial loss for the following binary linear classification problem.

Problem Formulation for Section III-A: We have a dataset {(x_n, y_n)}_{n=1}^N, where x ∈ Rᵈ is a data point, y ∈ {−1, 1} is a binary label, and θ ∈ Rᵈ is a parameter vector. Let f(x, θ) = sign(θᵀx) be the model, and let δ be an adversarial perturbation whose L_p norm is constrained as ||δ||_p ≤ ε. We train f(x, θ) by minimizing the following adversarial logistic loss:

\[ L_{\varepsilon}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \max_{\|\delta_n\|_p \le \varepsilon} \log\big(1 + \exp(-y_n \theta^{\top} (x_n + \delta_n))\big). \]

For this problem, the relationship between x′ and θ can be investigated, because the optimal adversarial example x′ = x + δ can be derived in closed form. The relationship under the L2 constraint ||δ||₂ ≤ ε is as follows.
Lemma 2: Let x′(θ1) and x′(θ2) be adversarial examples around the data point x for θ1 and θ2, respectively. For AT with the L2 constraint ||δ||₂ ≤ ε, if there exists a lower bound θ_min ∈ R such that ||θ||₂ ≥ θ_min > 0, we have the following inequality:

\[ \|x'(\theta_1) - x'(\theta_2)\|_2 \le \frac{\varepsilon}{\theta_{\min}} \|\theta_1 - \theta_2\|_2. \]

Thus, adversarial examples are (ε/θ_min)-Lipschitz on a bounded set not including the origin θ = 0. This lemma indicates that adversarial examples with L2 constraints are Lipschitz continuous functions of θ on a bounded set excluding the origin. From Lemmas 1 and 2, we can derive the following theorem.
Theorem 1: For AT with the L2 constraint ||δ||₂ ≤ ε, the adversarial loss L_ε(θ) is (β_θθ + εβ_θx/θ_min)-smooth on a bounded set where ||θ||₂ ≥ θ_min > 0.

Theorem 1 indicates that the adversarial loss for binary linear classification with the L2 constraint is a smooth function for θ ≠ 0. As explained in Section II-B, a gradient method is effective on the bounded sublevel set {θ | L_ε(θ) ≤ L_ε(θ0)}, so it is effective if L_ε(θ0) < L_ε(0), i.e., if this set excludes the nonsmooth point θ = 0. Note that the Lipschitz constant (β_θθ + εβ_θx/θ_min) is larger than that for standard training, β_θθ.
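For reference, the closed-form optimal L2 attack behind Lemma 2 and Theorem 1, derived in the Appendix for y = 1 (the y = −1 case flips the sign), is

\[ x'(\theta) = x - \varepsilon \frac{\theta}{\|\theta\|_2}, \]

whose Jacobian has spectral norm ε/||θ||₂; bounding ||θ||₂ from below by θ_min yields the Lipschitz constant in Lemma 2.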
Next, we provide the same analysis for the L∞ constraint on adversarial examples as follows.
Lemma 3: For AT with the L∞ constraint ||δ||∞ ≤ ε, if there exists an index i such that sign(θ_{1,i}) ≠ sign(θ_{2,i}), adversarial examples are not Lipschitz continuous. If all signs are the same (∀i : sign(θ_{1,i}) = sign(θ_{2,i})), we have ||x′(θ1) − x′(θ2)||₂ = 0. Thus, adversarial examples are Lipschitz continuous on a bounded set that does not include θ_i = 0 for any i and on which no signs of elements change.
From Lemmas 1 and 3, we have the following theorem.

Theorem 2: For AT with the L∞ constraint ||δ||∞ ≤ ε, in a set where ∀i : sign(θ_{1,i}) = sign(θ_{2,i}), we have

\[ \|\nabla_{\theta} L_{\varepsilon}(\theta_1) - \nabla_{\theta} L_{\varepsilon}(\theta_2)\|_2 \le \beta_{\theta\theta} \|\theta_1 - \theta_2\|_2. \]

Thus, the adversarial loss L_ε(θ) is β_θθ-smooth on a bounded set that does not include θ_i = 0 for any i and on which no signs of elements change.
A comparison of Theorems 1 and 2 reveals that AT with L∞ constraints is more likely to reach nonsmooth points than AT with L2 constraints: i.e., the set on which the adversarial loss with L∞ constraints is Lipschitz smooth is smaller than the set on which that with L2 constraints is Lipschitz smooth.
Fig. 1(a) shows the intuition behind Lemmas 2 and 3. In the case of the L2 constraint, adversarial examples move continuously on the circle depending on θ. On the other hand, in the case of the L∞ constraint, even if the difference between θ1 and θ2 is small, the distance between x′(θ1) and x′(θ2) can be 2ε, which is the length of a side of the square, because the adversarial examples are located at corners of the square. Thus, the Lipschitz continuity of adversarial examples depends on the type of constraint.
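To make this jump concrete (an illustrative computation of ours, using the closed form x′(θ) = x − ε sign(θ) derived in the Appendix for y = 1): take θ1 = (1, t)ᵀ and θ2 = (1, −t)ᵀ for a small t > 0. Then

\[ \|x'(\theta_1) - x'(\theta_2)\|_2 = \varepsilon\, \|\mathrm{sign}(\theta_2) - \mathrm{sign}(\theta_1)\|_2 = 2\varepsilon, \qquad \|\theta_1 - \theta_2\|_2 = 2t, \]

so the ratio ε/t diverges as t → 0, and no finite Lipschitz constant exists across the sign change.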

B. General Case
For DNNs and multiclass classification, we cannot obtain adversarial examples in closed form, unlike binary linear classification. For this case, we investigate the local Lipschitz smoothness of adversarial loss at the local optimal adversarial examples by using the implicit function theorem. For this problem, we only add the assumption that ℓ is twice differentiable with respect to x and that the smallest singular value of the Hessian ∇²_xℓ(x′*, θ) at the local maximum point is bounded below by c > 0 on the parameter set U under consideration.
1) Adversarial Examples Inside the Feasible Region: We first consider the case in which the local maximum point x′* lies strictly inside the feasible region [Fig. 1(b)], so that ∇_xℓ(x′*, θ) = 0.

Lemma 4: Let x′* be a local maximum point of ℓ inside the feasible region, and let ∇²_xℓ(x′*, θ) be nonsingular with smallest singular value at least c > 0. Then, x′*(θ) is locally a (β_θx/c)-Lipschitz continuous function of θ.

This lemma is derived by applying the implicit function theorem [31] to ∇_xℓ(x′*, θ) = 0: locally,

\[ \frac{\partial x'^{*}}{\partial \theta} = -\big(\nabla^{2}_{x}\ell(x'^{*}, \theta)\big)^{-1} \nabla_{\theta}\nabla_{x}\ell(x'^{*}, \theta), \quad \text{so} \quad \Big\|\frac{\partial x'^{*}}{\partial \theta}\Big\|_2 \le \frac{\beta_{\theta x}}{c}. \]

From Lemmas 1 and 4, adversarial loss can be locally (β_θθ + β²_θx/c)-smooth on the parameter set such that the local maximum point x′* exists inside the feasible region. Note that ∇²_xℓ(x′*, θ) ≺ 0 is a sufficient condition for a local maximum point. This lemma also indicates that nonsmoothness is caused by the constraints: if the constraints on adversarial examples are inactive, the adversarial loss tends to be smooth under Assumption 1. Even so, the condition that the optimal adversarial examples exist inside the feasible region can easily be broken by a change of θ. Next, we investigate the case when the constraints are active.
2) Adversarial Examples on the Boundary of the Feasible Region: In Section III-B1, we showed the continuity of adversarial examples by applying the implicit function theorem to ∇_xℓ(x′*, θ) = 0. However, when the constraints on adversarial examples are active [Fig. 1(c)], we cannot prove it in the same manner, because ∇_xℓ(x′*, θ) ≠ 0. In this case, instead of ∇_xℓ(x′, θ), we use the gradient of the Lagrange function of adversarial examples for the implicit function theorem. Let μ ∈ R be a Lagrange multiplier. The Lagrange function J(x′, θ, μ) with the L_p constraint is given by

\[ J(x', \theta, \mu) = \ell(x', \theta) + \mu\big(\varepsilon - \|x' - x\|_{p}\big). \]

Let x̄ be x̄ = [μ, x′ᵀ]ᵀ. From the optimality condition, the maximum point x′* of (1) satisfies ∇_x̄J(x̄*, θ) = 0. By applying the implicit function theorem to ∇_x̄J(x̄*, θ), we obtain the local continuity of L2 attacks.
Lemma 5: Let x̄* = [μ*, x′*ᵀ]ᵀ be the local maximum point on the boundary of the feasible region of the L2 constraint, i.e., ||x′* − x||₂ = ε. If det(∇²_x̄J(x̄*, θ)) > 0, then x′*(θ) is a locally Lipschitz continuous function of θ.

From Lemmas 1 and 5, adversarial loss can be locally (β_θθ + β²_θx/c)-smooth under the L2 constraint even if the constraint is active. Lemma 5 uses the bordered Hessian

\[ \nabla^{2}_{\bar{x}} J(\bar{x}, \theta) = \begin{pmatrix} 0 & \nabla_{x'} g(x')^{\top} \\ \nabla_{x'} g(x') & \nabla^{2}_{x'} J(\bar{x}, \theta) \end{pmatrix}, \qquad g(x') = \varepsilon - \|x' - x\|_{p}. \]

To compute this matrix, ||x′ − x||_p should be twice differentiable, which is satisfied when p = 2. If we use the L∞ constraint, we cannot show the same results, because ||x′ − x||∞ is not twice differentiable. Thus, the Lipschitz continuity of attacks is difficult to show with the L∞ constraint. This result also implies that the nonsmoothness of adversarial loss is caused by the constraints.
3) Relationship Between Loss Landscapes in the Input Space and Parameter Space: Intriguingly, Lemmas 4 and 5 reveal a relationship between the flatness of the loss function with respect to input data and the smoothness of the adversarial loss with respect to parameters, because c = inf_{θ∈U} min_i σ_i(∇²_xℓ(x′, θ)) is related to the flatness with respect to input data. If we flatten the loss in the input space for robustness, the singular values of the Hessian matrix ∇²_xℓ with respect to x become small: i.e., c becomes small. As a result, the Lipschitz constant of the gradient (β_θθ + β²_θx/c) increases. The following explanation might help with interpretation: if the Hessian matrix contains zero eigenvalues, the solution is not isolated but spreads in the direction of the corresponding eigenvectors. In this case, the optimal attack is locally free to move in the direction of such an eigenvector after a parameter update. Because of this movement of the attack, the continuity of the gradient with respect to the parameters cannot be guaranteed, and the adversarial loss becomes nonsmooth.
Since large Lipschitz constants degrade the convergence and generalization of training, as explained in Section II-B, this relationship can explain why AT is more difficult than standard training even when the adversarial loss is smooth.
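Since c is governed by the spectrum of the input Hessian ∇²_xℓ, this relationship can be probed numerically. The following is a minimal PyTorch sketch (our illustration, not the authors' code) that estimates the dominant eigenvalue of ∇²_xℓ by power iteration with Hessian-vector products:

```python
import torch

def input_hessian_top_eig(model, loss_fn, x, y, iters=20):
    """Estimate the dominant eigenvalue of the input-space Hessian of the
    loss via power iteration with double backpropagation (a sketch)."""
    x = x.clone().requires_grad_(True)
    v = torch.randn_like(x)
    v /= v.norm()
    eig = None
    for _ in range(iters):
        loss = loss_fn(model(x), y)
        g, = torch.autograd.grad(loss, x, create_graph=True)
        hv, = torch.autograd.grad(g, x, grad_outputs=v)  # Hessian-vector product H v
        eig = (v * hv).sum().item()     # Rayleigh quotient with unit-norm v
        v = hv / (hv.norm() + 1e-12)    # power-iteration update
    return eig
```

A small dominant eigenvalue indicates a flat loss surface in the input space, which, by the analysis above, corresponds to a small c and thus a large Lipschitz constant β_θθ + β²_θx/c in the parameter space.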

C. Smoothness of TRADES
As a variant of AT, TRADES often outperforms naïve AT [18], [28]. TRADES [28] minimizes the following objective function:

\[ L_{\mathrm{TR}}(\theta) = \frac{1}{N}\sum_{n=1}^{N}\Big[\ell(x_n, y_n, \theta) + \frac{1}{\lambda}\max_{\|x'_n - x_n\|_p \le \varepsilon} \phi\big(f(x_n, \theta), f(x'_n, \theta)\big)\Big], \]

where ϕ is the surrogate loss function that evaluates the difference between the prediction for the clean example x_n and that for the adversarial example x′_n. The gradient of the objective function of TRADES becomes

\[ \nabla_{\theta} L_{\mathrm{TR}}(\theta) = \frac{1}{N}\sum_{n=1}^{N}\Big[\nabla_{\theta}\ell(x_n, y_n, \theta) + \frac{1}{\lambda}\nabla_{\theta}\phi_{\varepsilon}(x_n, \theta)\Big], \]

where ϕ_ε(x_n, θ) denotes the inner maximum. The Lipschitz constant of (1/N)Σ_n ∇θℓ(x_n, y_n, θ) is β_θθ under Assumption 1. Thus, the first term in the objective function of TRADES is smooth. On the other hand, the smoothness of the second term (1/N)Σ_n (1/λ)∇θϕ_ε(x_n, θ) is not obvious. If ϕ also satisfies Assumption 1, ϕ_ε and its adversarial examples can also satisfy Lemmas 1, 4, and 5. From this point, let the gradient of ϕ_ε have the Lipschitz constant β_θθ + β²_θx/c. Then, the gradient of the objective function of TRADES satisfies the following relation:

\[ \|\nabla_{\theta} L_{\mathrm{TR}}(\theta_1) - \nabla_{\theta} L_{\mathrm{TR}}(\theta_2)\|_2 \le \Big(\beta_{\theta\theta} + \frac{\beta_{\theta\theta} + \beta^{2}_{\theta x}/c}{\lambda}\Big)\|\theta_1 - \theta_2\|_2. \]

Thus, the Lipschitz constant of the gradient of L_TR(θ) is ((λ + 1)β_θθ + β²_θx/c)/λ. Therefore, the smoothness of TRADES can be controlled by tuning λ: even if c becomes small, the objective function is made smooth by enlarging λ. Though λ needs to be set to a small value for robustness, this controlled smoothness might explain why TRADES achieves better robustness than naïve AT.
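To make the objective concrete, here is a minimal PyTorch sketch of the TRADES loss with the commonly used KL-divergence surrogate ϕ (our illustration under assumed hyperparameters; 1/λ = 6 follows the experiments in Section V, but the rest is not the exact implementation of [28]):

```python
import torch
import torch.nn.functional as F

def trades_loss(model, x, y, eps=8/255, step=2/255, iters=10, inv_lambda=6.0):
    """Sketch of the TRADES objective: clean cross-entropy plus a KL
    surrogate on adversarial examples, weighted by 1/lambda."""
    model.eval()
    # Inner maximization: PGD on the KL divergence between clean and
    # perturbed predictions (the surrogate loss phi).
    p_clean = F.softmax(model(x), dim=1).detach()
    x_adv = x + 0.001 * torch.randn_like(x)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x_adv), dim=1), p_clean,
                      reduction="batchmean")
        grad, = torch.autograd.grad(kl, x_adv)
        x_adv = x_adv + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    x_adv = x_adv.detach()
    model.train()
    # Outer objective: clean loss + (1/lambda) * surrogate on x_adv.
    logits = model(x)
    loss_clean = F.cross_entropy(logits, y)
    loss_rob = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                        F.softmax(logits, dim=1), reduction="batchmean")
    return loss_clean + inv_lambda * loss_rob
```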

D. Nonsmoothness Beyond the Analysis
From the above results, adversarial loss can be locally smooth, especially when using the L2 constraint. However, there are still several causes of nonsmoothness besides L∞ constraints and flatness in the input space, and this section discusses them. Lemmas 4 and 5 indicate that the adversarial loss can be smooth if, at the tth parameter update, it always uses a local optimal attack near the attack used at the (t−1)th parameter update. For nonconvex-concave objective functions, the maximum point of attacks can be unique, and thus, the optimal attack is always near the optimal attack of the previous parameter update; i.e., adversarial loss is smooth under the conditions of Lemmas 4 and 5. However, for nonconvex-nonconcave objective functions, there can be several local maximum points of x′ due to nonconcavity, and adversarial loss can use a different local maximum point at each parameter update. As a result, adversarial loss can be nonsmooth. In addition, the optimal attacks are difficult to find due to nonconcavity, and we empirically use PGD to find them. We can conjecture that adversarial loss with L∞ constraints using PGD does not have Lipschitz continuous gradients, because the projection Π_ε in PGD is not continuous. Moreover, although the nonsingularity of the Hessian matrices (∇²_xℓ(x, θ) ≺ 0 and det(∇²_x̄J(x̄, θ)) > 0) is a sufficient condition for the local maximum point in Lemmas 4 and 5, it can be broken by a change in θ. For these reasons, adversarial loss tends to be nonsmooth more often than clean loss, especially under the L∞ constraint, and we can improve the performance of AT by addressing its nonsmoothness.

IV. ENSGD FOR AT
In Section III, we showed that AT increases the Lipschitz constant of the gradient of adversarial loss and can cause nonsmoothness. If the loss function is nonsmooth, gradient-based optimization is not very effective. To confirm that nonsmoothness in adversarial loss causes the difficulty, we show that EnSGD smoothens nonsmooth loss and can be used for AT. We prove the following theorem.
Theorem 3: Let Σ_θ′ be the variance-covariance matrix of p_θ(θ′) in (13). If we use EnSGD for a nonnegative loss function L(θ) ≥ 0, then −∇θF(θ) is Lipschitz continuous on an arbitrary bounded set of θ, with Lipschitz constant γ + γ² sup_θ ||Σ_θ′||_F.
Theorem 3 indicates that EnSGD smoothens nonsmooth loss functions. Many nonnegative loss functions (e.g., cross entropy for classification tasks) are used for training DNNs. Thus, we can use EnSGD for AT whose loss does not necessarily have a Lipschitz continuous gradient. The convergence and uniform stability of EnSGD are as follows.
Theorem 4: If we minimize (11) with a nonnegative loss L on an arbitrary bounded convex set Θ, we have the following, with β = γ + γ² sup_θ ||Σ_θ′||_F.

Convergence: The randomized stochastic gradient method [26] converges to a stationary point at the β-smooth rate of Section II-B.

Uniform Stability: If −F is a convex function, the uniform stability e_stab of SGD satisfies the smooth convex bound of [25] for learning rates η_t ≤ 2/(γ + γ² sup_θ ||Σ_θ′||_F) at the tth iteration. If −F is nonconvex, the uniform stability e_stab of SGD with a monotonically nonincreasing learning rate η_t = η_0/t satisfies the smooth nonconvex bound of [25].

This theorem indicates that EnSGD improves the convergence and uniform stability for nonsmooth loss functions, and both are controlled by γ. Note that EnSGD modifies the objective function, not the optimization algorithm: i.e., EnSGD itself is not an optimizer, so the convergence rate depends on the optimization algorithm. We present the convergence of the randomized stochastic gradient method as an example, because the training of a deep model is usually a nonconvex problem, and this method provides convergence guarantees for nonconvex optimization.
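As an illustration of how γ controls the smoothness constant in Theorems 3 and 4 (the value of sup_θ||Σ_θ′||_F below is an assumed number, not a measured one):

\[ \beta_{\mathrm{EnSGD}} = \gamma + \gamma^{2} \sup_{\theta}\|\Sigma_{\theta'}\|_{F}; \quad \text{e.g., } \gamma = 0.03,\ \sup_{\theta}\|\Sigma_{\theta'}\|_{F} = 10 \ \Rightarrow\ \beta_{\mathrm{EnSGD}} = 0.03 + 0.0009 \times 10 = 0.039. \]

Notably, this constant is finite regardless of how nonsmooth the original nonnegative loss L is.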
Algorithm 1 describes the training procedure for AT with EnSGD. We compute PGD attacks in the loop of EnSGD.
To estimate E_{p_θ(θ′)}[θ′] in (12), EnSGD requires L inner iterations (line 3 of Algorithm 1). This overhead can be negligible because of its fast convergence, as shown in the experiments in Section V-C.
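The following PyTorch sketch shows one parameter update of AT with EnSGD as we read Algorithm 1 (γ, η, and L follow Section V; the SGLD noise scale and the averaging are simplified, so this is not the exact implementation). The `attack` argument can be, e.g., the `pgd_linf` sketch from Section II-A.

```python
import torch

def ensgd_at_step(model, loss_fn, attack, x, y, gamma=0.03,
                  sgld_lr=0.1, eta=0.1, noise=1e-4, L=20, alpha=0.75):
    """One EnSGD parameter update for AT (sketch of Algorithm 1)."""
    theta = [p.detach().clone() for p in model.parameters()]
    mu = [t.clone() for t in theta]  # running estimate of E_{p_theta}[theta']
    for _ in range(L):
        # PGD attack is computed inside the EnSGD loop.
        x_adv = attack(model, loss_fn, x, y)
        loss = loss_fn(model(x_adv), y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, t, m in zip(model.parameters(), theta, mu):
                # SGLD on theta': drift -grad L(theta') - gamma*(theta' - theta), plus noise.
                d = p.grad + gamma * (p - t)
                p.add_(-sgld_lr * d + noise * torch.randn_like(p))
                m.mul_(alpha).add_(p, alpha=1 - alpha)  # exponential moving average
    with torch.no_grad():
        for p, t, m in zip(model.parameters(), theta, mu):
            # Local-entropy gradient step (12): theta <- theta - eta * gamma * (theta - mu).
            p.copy_(t - eta * gamma * (t - m))
```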

V. EXPERIMENTS
We first visualize the loss surface of adversarial loss to verify our theoretical results for small models: the smoothness depends on the type of constraint. We also visualize the loss of EnSGD to confirm that it smoothens nonsmooth adversarial loss. For deep models, we evaluate the norm of the gradient, since it is correlated with the local Lipschitz constants of gradients. Next, we demonstrate that smoothing the adversarial loss with EnSGD contributes to the performance of AT with L∞ attacks, which tends to have a nonsmooth loss. In addition, we evaluate the effectiveness of smoothing the loss with L2 attacks and TRADES.

A. Evaluation of Smoothness
Fig. 2 plots loss surfaces for one data point in standard training ℓ(θ), AT with the L2 constraint, and AT with the L∞ constraint when d is two. The adversarial losses of a linear model [Fig. 2(b) and (c)] follow Theorems 1 and 2: the adversarial loss with the L2 constraint is not smooth at θ = 0, and the adversarial loss with the L∞ constraint is not smooth where θ_i = 0.
Fig. 2(d)-(f) plots the loss surface for binary classification using a nonlinear model that has a swish activation [38] before the output of the model. Since we cannot obtain the optimal adversarial examples in closed form, we used PGD attacks to generate adversarial examples. The adversarial loss with the L2 constraint [Fig. 2(e)] has a larger set on which the loss is smooth than that with the L∞ constraint [Fig. 2(f)], which follows Lemma 5. In Fig. 2(f), though we use the L∞ constraint, the adversarial loss can be smooth in the region where θ1 > 1 and θ2 < −1, unlike the linear model [Fig. 2(c)]. This is because the optimal attacks are inside the feasible region and satisfy Lemma 4: there exists x′* satisfying ∇_xℓ(x′*) = (∂ℓ/∂z)(∂z/∂u)∇_xu(x′*) = 0 inside the region ||x − x′||∞ < 0.6, because the optimal input u* = θᵀx′* satisfying ∂z/∂u = 0 is in the interval [−2, −1]. Note that though we use a simple network to visualize the loss surface, Lemmas 4 and 5 are not limited to shallow networks and binary classification.
Visualizing the loss landscapes of deep models is infeasible, because they have many parameters, and the computation of the loss of EnSGD requires an integral over the whole parameter space. Instead of visualizing the loss, we evaluate the gradient norm, because [40] and [41] have shown that the gradient norm is strongly correlated with local Lipschitz smoothness. Fig. 3 plots the gradient norm at the last layer of WideResNet, which is directly affected by the loss function and is not affected much by the deep architecture, in AT with PGD on CIFAR10. The setup is provided in Section V-B. This figure compares the gradient norm of EnSGD with those of standard training [training on clean data (CLN)], AT [8], and adversarial weight perturbation (AWP) [18]. In addition, we evaluate TRADES (λ = 1/6) in this experiment. In Fig. 3, EnSGD decreases the gradient norm, whereas AT and AWP increase it compared with CLN. Though it is difficult to fairly compare the gradient norm of the L∞ attack with that of the L2 attack, the gradient norm of AT in Fig. 3(a) is smaller than that in Fig. 3(b). The gradient norm of TRADES is smaller than that of AT and greater than that of AWP before the 100th epoch. The gradient norm over all parameters is given in the Appendix.
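For completeness, a sketch of the gradient-norm measurement (ours; it assumes the classifier head is exposed as `model.fc`, which depends on the architecture):

```python
import torch

def last_layer_grad_norm(model, loss_fn, loader, device="cuda"):
    """Average gradient norm at the last layer over one pass of the data,
    used as a proxy for local Lipschitz smoothness."""
    total, batches = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss_fn(model(x), y).backward()
        total += model.fc.weight.grad.norm().item()
        batches += 1
    return total / max(batches, 1)
```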

B. Setup for Evaluation of Smoothing by EnSGD
This section outlines the experimental conditions for the evaluation of smoothing the adversarial loss by EnSGD; the details are provided in the Appendix. Our experimental codes are based on the source codes provided by Wu et al. [18], and our implementations of EnSGD are based on the codes provided by Chaudhari et al. [34]. The datasets were CIFAR10, CIFAR100 [42], and Street View House Numbers (SVHN) [43]. We compared the convergence of AT when using SGD and EnSGD. In addition, we evaluated the combination of EnSGD and AWP [18], which injects adversarial noise into the parameters to flatten the loss landscape. We used ResNet-18 (RN18) [44] and WideResNet-34-10 (WRN) [45] following [18]. We used PGD, and the hyperparameters for PGD were based on [18]. The L∞ norm of the perturbation is ε = 8/255 at training time. We additionally evaluated AT with ε = 12/255 on CIFAR10 with ResNet-18. For EnSGD, we set γ = 0.03, ε_E = 1 × 10⁻⁴, η = 0.1, and η′ = 0.1, and tuned the number of inner iterations L in {20, 30}. The learning rates of SGD and EnSGD are set to 0.1 and divided by 10 at the 100th and 150th epochs, and we used early stopping by evaluating test robust accuracies against PGD (20 iterations). The hyperparameter of AWP is tuned in {0.01, 0.005}. We trained models three times and show the average and standard deviation of test accuracies.

C. Results
Fig. 4(a) and (b) plots robust accuracies using WRN on CIFAR10 attacked by PGD against epochs and runtime. This figure shows that EnSGD accelerates the training and achieves higher accuracies than SGD. This is because EnSGD improves the convergence and stability by smoothing adversarial loss, as shown in Theorems 3 and 4. EnSGD alleviates overfitting because of the improvement of uniform stability: i.e., the training accuracy of AT + EnSGD is lower than that of AT, but the test accuracy of AT + EnSGD is higher than that of AT. When using EnSGD with AWP, EnSGD alleviates underfitting because of the improvement of convergence performance: the training accuracy of AWP + EnSGD is higher than that of AWP. In Fig. 4(b), the runtime of EnSGD is almost the same as that of SGD even though EnSGD requires L iterations in its gradient estimation for each parameter update, as shown in Algorithm 1. This is because EnSGD effectively reduces the training loss at each parameter update, as shown in Fig. 4(c), which plots the training loss against parameter updates. Though the number of updates of EnSGD is only 1/L of that of SGD due to the L iterations for each parameter update, EnSGD effectively reduces the training loss. Note that AT + EnSGD is faster than AWP, although they achieve similar robust performance against AutoAttack in Table I.

TABLE I ROBUST ACCURACIES AGAINST AUTOATTACK (L∞) ON CIFAR10, CIFAR100, AND SVHN
Since PGD sometimes fails to find adversarial examples, we used AutoAttack [46] for evaluation, which uses an ensemble of several attacks to find adversarial examples. Table I lists test robust accuracies against AutoAttack on CIFAR10, CIFAR100, and SVHN. For all cases, EnSGD outperforms SGD: i.e., AT + EnSGD outperforms AT, and AWP + EnSGD outperforms AWP. The results for ε = 12/255 are similar to those for ε = 8/255. The constraints with large ε are more likely to be inactive than those with small ε; in this case, the loss can be smooth by Lemma 4, contrary to intuition. On the other hand, Fig. 1(a) shows that the L∞ attack jumps between −ε and +ε at nonsmooth points when the constraints are active. Thus, large ε might cause a more abrupt change in the loss landscape at nonsmooth points than small ε. Due to these two characteristics, the improvement of EnSGD with ε = 12/255 is similar to that with ε = 8/255. Table I also lists clean accuracies and shows that EnSGD does not sacrifice clean accuracies for improving robustness. To investigate the effectiveness of EnSGD in detail, we additionally evaluate robustness against various L∞ attacks [fast gradient sign method (FGSM) [47], Carlini and Wagner (C&W) [48], PGD with 100 steps [8], and simultaneous perturbation stochastic approximation (SPSA) [49] (100 iterations with a perturbation size of 0.001, a learning rate of 0.01, and 256 samples for each gradient estimation)] in Table II. In this table, AWP + EnSGD outperforms the baselines against all attacks under almost all settings. Note that AutoAttack achieves the lowest accuracies among the various attacks, and AWP + EnSGD achieves the highest robust accuracy against AutoAttack in Table I. These results also support the idea that the difficulty of AT is caused by nonsmoothness, and improving smoothness contributes to the performance of AT.

D. Results of L2 Attack
Since nonsmoothness decreases the performance of AT with L∞ constraints more than with L2 constraints, we showed the results of L∞ attacks in Section V-C. To investigate the effectiveness of EnSGD for L2 constraints, we also evaluated EnSGD using AT with L2 constraints on CIFAR10 in Table III. In this experiment, we used AT with PGD (L2, ε = 0.5, and a step size of 15/255). Table III shows that EnSGD can improve the performance of AT. However, the improvements in robustness against L2 attacks are smaller than those against L∞ attacks. This is because AT with L2 attacks can already be smooth under certain conditions, as shown in Theorem 1 and Lemma 5, so EnSGD merely decreases the Lipschitz constant of the gradient of a smooth objective function.

E. Evaluation With TRADES
Recent methods achieve the highest performance when used with TRADES [18], [28]. In this section, we evaluate TRADES with EnSGD to investigate whether the smoothness also contributes to the performance of TRADES.

TABLE IV ROBUST ACCURACIES AGAINST AUTOATTACK ON CIFAR10 USING TRADES
We provide the experimental setup in the Appendix. Table IV lists the average test robust accuracies when using TRADES with EnSGD. In this table, TRADES with EnSGD outperforms TRADES with SGD, and TRADES with AWP and EnSGD outperforms TRADES with AWP and SGD. Thus, EnSGD is also effective in TRADES, because the Lipschitz constant of the gradient of TRADES is larger than that of the loss in standard training. However, the improvements of EnSGD in TRADES are smaller than those of AT + EnSGD in Table I. This might be because the objective function of TRADES tends to be smooth, as discussed in Section III-C. This result implies that the difficulty of AT is caused by nonsmoothness, and the good performance of TRADES seems to be caused by its smooth objective function.
VI. CONCLUSION

This article investigated the smoothness of the loss in AT. We proved that the smoothness of adversarial loss in the parameter space depends on the constraints of adversarial examples and the flatness in the input space. Since the smoothness of the loss is an important property for gradient-based optimization, we showed that EnSGD can improve the smoothness of adversarial loss. Our usage of EnSGD is still naïve and not specialized for AT, and we will explore smoothing methods based on our analysis of adversarial loss in future work.
Recent studies [50], [51], [52] use weight averaging. Parameter averaging is also an effective optimization method for addressing nonsmooth objective functions [33]. In future work, we will investigate the relation between the effectiveness of weight averaging in AT and nonsmooth adversarial loss.

A. Proof of Lemma 1
Proof: From Assumption 1 and the triangle inequality, we have

\[ \|\nabla_{\theta}\ell(x'(\theta_1), \theta_1) - \nabla_{\theta}\ell(x'(\theta_2), \theta_2)\|_2 \le \beta_{\theta\theta}\|\theta_1 - \theta_2\|_2 + \beta_{\theta x}\|x'(\theta_1) - x'(\theta_2)\|_2, \]

and the ξ-Lipschitz continuity of x′(θ) completes the proof. □

B. Proof of Lemma 2

Proof: First, we solve the following optimization problem to obtain the adversarial examples for the data point (x, y):

\[ \max_{\|\delta\|_2 \le \varepsilon} \log\big(1 + \exp(-y\theta^{\top}(x + \delta))\big). \]

We consider the case of y = 1; we can easily derive the same results for y = −1. By using a Lagrange multiplier λ ≤ 0, the stationarity condition of the Lagrange function implies that δ is proportional to θ with a nonpositive scalar coefficient: i.e., θ and δ have opposite directions. Thus, we can write δ = −kθ/||θ||₂, where k ≥ 0. Since the objective monotonically increases with k, the constraint is active, and we have k = ε. Therefore, δ = −εθ/||θ||₂ and x′(θ) = x − εθ/||θ||₂, and we compute the Lipschitz constant of x′(θ). Since the Lipschitz constant of a continuously differentiable vector-valued function is bounded by the operator norm of its Jacobian, we compute the Jacobian of x′(θ):

\[ \frac{\partial x'}{\partial \theta} = -\frac{\varepsilon}{\|\theta\|_2}\Big(I - \frac{\theta\theta^{\top}}{\|\theta\|_2^2}\Big). \]

The spectral norm of this matrix is ε/||θ||₂, because it is a normal matrix with eigenvalues −ε/||θ||₂ (multiplicity d − 1) and 0. Since ||θ||₂ ≥ θ_min, the Lipschitz constant is bounded by ε/θ_min, which completes the proof. □

D. Proof of Lemma 3
Proof: First, we solve the following optimization problem to obtain the adversarial examples for the data point (x, y):

\[ \max_{\|\delta\|_\infty \le \varepsilon} \log\big(1 + \exp(-y\theta^{\top}(x + \delta))\big). \]

We consider the case of y = 1; we can easily derive the same results for y = −1. Since log(1 + exp(−θᵀx − θᵀδ)) is a monotonically decreasing function of θᵀδ, the solution minimizes θᵀδ subject to ||δ||∞ ≤ ε. The solution is δ = −ε sign(θ), where sign is an elementwise sign function [47]. Thus, the optimal adversarial examples are x′ = x − ε sign(θ), and we investigate their Lipschitz continuity. Since sign(θ_i) for θ_i > 0 or θ_i < 0 is a constant function, its derivative is zero for θ_i ≠ 0. Thus, if all signs are the same (∀i : sign(θ_{1,i}) = sign(θ_{2,i})), we have x′(θ1) = x′(θ2). In contrast, if sign(θ_{1,i}) ≠ sign(θ_{2,i}) for some i, we have ||x′(θ1) − x′(θ2)||₂ ≥ 2ε no matter how small ||θ1 − θ2||₂ is; thus, the adversarial examples, and hence the gradient of adversarial loss, are not Lipschitz continuous, which completes the proof. □

H. Proof of Theorem 3
Proof: A twice-differentiable function is β-smooth if and only if the spectral norm of its Hessian is bounded by β:

\[ \sup_{\theta} \sigma_{1}\big(\nabla^{2}_{\theta}(-F)(\theta)\big) \le \beta, \]

where σ_i is the ith largest singular value, and σ1 is called the spectral norm. Thus, the loss of EnSGD is smooth if the spectral norm of its Hessian matrix is bounded above by a finite value, and we therefore investigate the Hessian matrix of EnSGD. The gradient of EnSGD is (12), where p_θ is the probability density function (13). To compute −∇²θF, we differentiate (12):

\[ -\nabla^{2}_{\theta} F(\theta) = \gamma I - \gamma \nabla_{\theta} \mathbb{E}_{p_{\theta}(\theta')}[\theta']. \]

Since ∇θ log p_θ(θ′) = γ(θ′ − E_{p_θ(θ′)}[θ′]), the second term becomes γ²Σ_θ′, where Σ_θ′ = E_p[θ′θ′ᵀ] − E_p[θ′]E_p[θ′]ᵀ. Therefore, we have

\[ -\nabla^{2}_{\theta} F(\theta) = \gamma\big(I - \gamma \Sigma_{\theta'}\big). \]

Thus, for the spectral norm of −∇²θF, we have σ1(−∇²θF) ≤ γ + γ²σ1(Σ_θ′). Since spectral norms of matrices are smaller than or equal to Frobenius norms, σ1(−∇²θF) ≤ γ + γ²||Σ_θ′||_F. Thus, if all elements of Σ_θ′ are finite, γ + γ²||Σ_θ′||_F becomes a Lipschitz constant. Therefore, we show that E_p[θ′_iθ′_j] and E_p[θ′_i] are finite, which holds if the corresponding improper integrals of θ′_iθ′_j exp(−L(θ′) − (γ/2)||θ − θ′||₂²) and θ′_i exp(−L(θ′) − (γ/2)||θ − θ′||₂²) converge; we can ignore the division by the normalizing constant e^F, because it is bounded as 0 < e^F < ∞. First, we consider E_p[θ′_iθ′_j]. Since θ′_iθ′_j can be negative, which makes the improper integral difficult to evaluate directly, we split the integral into the part over Θ+ = {θ′ | θ′_iθ′_j ≥ 0} and the part over Θ− = {θ′ | θ′_iθ′_j < 0}. On each part, since L is nonnegative, exp(−L(θ′)) ≤ 1, and the integrand is dominated by the Gaussian kernel exp(−(γ/2)||θ − θ′||₂²), whose second moments are finite. Thus, E_p[θ′_iθ′_j] converges to a finite value. Next, we consider E_p[θ′_i]. Similarly, we split the integral into the parts over {θ′ | θ′_i ≥ 0} and {θ′ | θ′_i < 0}; each is dominated by a part of the computation of the mean of the Gaussian distribution N(θ, γ⁻¹I) and thus converges, so E_p[θ′_i] also converges to a finite value. From the above, all elements of Σ_θ′ are finite on a bounded set of θ, and γ + γ²||Σ_θ′||_F is a finite Lipschitz constant of −∇θF, which completes the proof. □

I. Proof of Theorem 4

Proof: From Theorem 3, the objective function −F with nonnegative L is (γ + γ² sup_θ ||Σ_θ′||_F)-smooth on an arbitrary bounded set. First, we consider the convergence rate. Ghadimi and Lan [26] show the convergence rates of a randomized stochastic gradient method for convex globally β-smooth functions and for nonconvex β-smooth functions [see [26], eqs. (2.17) and (2.18)]. These results assume global smoothness, whereas EnSGD is smooth on an arbitrary bounded set, as shown in Theorem 3. However, the results can be relaxed to local smoothness and applied to EnSGD, because their derivations only use the smoothness inequality on the set containing the iterates, and this inequality is also satisfied on a bounded convex set Θ by a function that is smooth on Θ. By substituting the Lipschitz constant β = γ + γ² sup_θ ||Σ_θ′||_F into these results, we obtain the convergence part of the theorem. Next, we consider the uniform stability. Hardt et al. [25] show uniform stability bounds of SGD for smooth convex and smooth nonconvex loss functions. These results also assume global smoothness, but they are extended to functions that are smooth on an arbitrary bounded convex set in the same way as above. Thus, we can obtain the uniform stability bound of EnSGD if we compute the Lipschitz constant of F. Since F is almost everywhere differentiable, the smallest Lipschitz constant is the upper bound of the gradient: sup_θ ||∇θF(θ)||₂ = γ sup_θ ||θ − E_{p_θ(θ′)}[θ′]||₂. As shown in the proof of Theorem 3, E_{p_θ(θ′)}[θ′] is finite on an arbitrary bounded set Θ, so this quantity is a finite Lipschitz constant. By substituting this Lipschitz constant and β = γ + γ² sup_θ ||Σ_θ′||_F into the bounds of [25], we obtain the stability part of the theorem, which completes the proof. □

EXPERIMENTAL SETUP

This section gives the experimental conditions. Our experimental codes are based on the source codes provided by Wu et al. [18], and our implementations of EnSGD are based on the codes provided by Chaudhari et al. [34]. The datasets of the experiments were CIFAR10, CIFAR100 [42], and SVHN [43]. We compared the convergence of AT when using SGD and EnSGD. In addition, we evaluated the combination of EnSGD and AWP [18], which injects adversarial noise into the parameters to flatten the loss landscape.

J. Evaluation of the Smoothness in Section V-A
We used a linear model with two parameters, ℓ(θ) = log(1 + exp(−yθᵀ(x + δ))), and a nonlinear model with two parameters, ℓ(θ) = log(1 + exp(−yz)), where y = 1, x = [1, −1]ᵀ, and z = swish(θᵀ(x + δ)). Since we only used two parameters, θ = [θ1, θ2], we can directly compute and visualize the loss surface. Instead of training, we moved each parameter θ_i from −2 to 2 in 0.01 increments and computed the loss at each parameter point. For the linear model, we computed the optimal attacks obtained in closed form, and we used PGD to generate adversarial attacks for the nonlinear case. For computing the improper integral of EnSGD in (11), we used scipy. In the evaluation of the gradient norms, we used the same setup as the evaluation of EnSGD, as described in Appendixes K-M.
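A sketch of this grid evaluation for the L2 case (ours; eps = 0.6 is an assumed value taken from the L∞ discussion in Section V-A, and the closed-form attack is the one derived in the Appendix):

```python
import numpy as np

def adv_loss_surface(eps=0.6, lo=-2.0, hi=2.0, step=0.01):
    """Evaluate the adversarial logistic loss of the 2-parameter linear model
    on a grid, using the closed-form L2 attack x' = x - eps*theta/||theta|| (y = 1)."""
    x = np.array([1.0, -1.0])
    grid = np.arange(lo, hi + step, step)
    surface = np.zeros((len(grid), len(grid)))
    for i, t1 in enumerate(grid):
        for j, t2 in enumerate(grid):
            theta = np.array([t1, t2])
            norm = np.linalg.norm(theta)
            x_adv = x - eps * theta / norm if norm > 0 else x
            surface[i, j] = np.log1p(np.exp(-theta @ x_adv))
            # For the nonlinear model, apply swish to theta @ x_adv and
            # use PGD instead of the closed form.
    return grid, surface
```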

K. CIFAR10 [42]
We used ResNet-18 (RN18) [44] and WideResNet-34-10 (WRN) [45] following [18]. We used untargeted PGD, which is the most common white-box attack. The hyperparameters for PGD were based on [18]. The L∞ norm of the perturbation is ε = 8/255 at training time. For PGD, we randomly initialized the perturbation and updated it for ten iterations with a step size of 2/255 at training time for CIFAR10. At evaluation time, we used AutoAttack [46] for CIFAR10. In addition, we used PGD with 20 iterations and a step size of 2/255 for CIFAR10 for the visualization of the loss landscape and the evaluation of the convergence of AT. For ε = 12/255, we used a step size of 3/255. For AT + EnSGD, we set γ = 0.03, ε_E = 1 × 10⁻⁴, η = 0.1, and η′ = 0.1, with L = 20 for RN18 and L = 30 for WRN. Note that we coarsely tuned γ and η′ and found that the effect of tuning these parameters is smaller than that of L; thus, we use the settings of [34] for these parameters. For AWP + EnSGD, we set γ = 0.03, ε_E = 1 × 10⁻⁴, η = 0.1, η′ = 0.1, and γ_A = 0.005 for AWP. For AWP, we set γ_A = 0.01 following [18]. Note that we found that AWP with γ_A = 0.005 does not outperform AWP with γ_A = 0.01 when using SGD. Since the training accuracy of AWP contains the effect of the weight perturbation, the improvement in the convergence of the training accuracy of AWP (γ_A = 0.005) + EnSGD might be caused by the small weight perturbation. Thus, we plot AWP (γ_A = 0.005) with SGD in Fig. 5, in which the other results are the same as those in Fig. 4. This figure shows that AWP (γ_A = 0.005) + EnSGD outperforms AWP (γ_A = 0.005) with SGD in terms of convergence. In addition, the test accuracy of AWP (γ_A = 0.005) with SGD is lower than that of AWP (γ_A = 0.01), which is plotted in Fig. 4 of the main paper.
For the preprocessing, we standardized the data by using the means [0.4914, 0.4822, 0.4465] and the standard deviations [0.2471, 0.2435, 0.2616]. The gradient of the preprocessing is considered in the generation of PGD.
The learning rates of SGD and EnSGD are set to 0.1 and divided by 10 at the 100th and 150th epochs, and we used early stopping by evaluating test accuracies. We used a momentum of 0.9 and a weight decay of 0.0005. We trained the models three times and show the average and standard deviation of test accuracies. For evaluating test accuracies against runtime, we used NVIDIA Tesla V100 SXM2 32-GB GPUs and an Intel Xeon Silver 4110 CPU @ 2.10 GHz.

L. CIFAR100
We used RN18 [44] and untargeted PGD. The hyperparameters for PGD were based on [18]. The L∞ norm of the perturbation is ε = 8/255 at training time. For PGD, we randomly initialized the perturbation and updated it for ten iterations with a step size of 2/255 at training time for CIFAR100. At evaluation time, we used PGD with 20 iterations and a step size of 2/255 for CIFAR100. For EnSGD, we set γ = 0.03, ε_E = 1 × 10⁻⁴, η = 0.1, η′ = 0.1, and L = 30. The learning rates of SGD and EnSGD are set to 0.1 and divided by 10 at the 100th and 150th epochs, and we used early stopping by evaluating test accuracies. We used a momentum of 0.9 and a weight decay of 0.0005. For the preprocessing, we standardized the data by using the means [0.5070751592371323, 0.48654887331495095, 0.4409178433670343] and the standard deviations [0.2673342858792401, 0.2564384629170883, 0.27615047132568404]. The gradient of the preprocessing is considered in the generation of PGD. The hyperparameter γ_A of AWP is set to 0.01 for AWP and to 0.007 for AWP + EnSGD. We trained models three times and show the average and standard deviation of test accuracies.

M. SVHN [43]
We used RN18 and untargeted PGD. The hyperparameters for PGD were based on [18]. The L∞ norm of the perturbation is ε = 8/255 at training time. For PGD, we randomly initialized the perturbation and updated it for ten iterations with a step size of 1/255 at training time for SVHN. At evaluation time, we used PGD with 20 iterations and a step size of 1/255 for SVHN. For the training on SVHN, we did not apply EnSGD and AWP for the first five epochs following [18]. For EnSGD, we set γ = 0.03, ε_E = 1 × 10⁻⁴, η = 0.1, η′ = 0.1, and L = 30. For the preprocessing, we standardized the data by using the means [0.5, 0.5, 0.5] and the standard deviations [0.5, 0.5, 0.5]. The gradient of the preprocessing is considered in the generation of PGD. The learning rates of SGD and EnSGD are set to 0.01 and divided by 10 at the 100th and 150th epochs, and we used early stopping by evaluating test accuracies. We used a momentum of 0.9 and a weight decay of 0.0005. The hyperparameter γ_A of AWP is set to 0.01 for AWP and 0.005 for AWP + EnSGD. We trained models three times and show the average and standard deviation of test accuracies.

N. Evaluation With TRADES
We trained models by using TRADES with SGD (TRADES), TRADES with EnSGD (+EnSGD), TRADES with AWP and SGD (+AWP), and TRADES with AWP and EnSGD (+AWP + EnSGD). Since we observed that the training of EnSGD with AWP becomes slow after the 150th epoch, we switched the optimization method from EnSGD to SGD at the 150th epoch for AWP + EnSGD: we trained models by using EnSGD for 150 epochs and trained them by using SGD after the 150th epoch. Switching optimization methods is sometimes used for improving generalization performance [53]. In this experiment, we followed the setup in the code provided by [18]. For training TRADES + AWP and TRADES + AWP + EnSGD, we did not apply EnSGD and AWP for the first ten epochs following [18]. We used WideResNet-34-10 (WRN) [45] and CIFAR10. The hyperparameters for TRADES were based on [18]. The L∞ norm of the perturbation is ε = 0.031 at training time. We randomly initialized the perturbation and updated it for ten iterations with a step size of 0.003 at training time. At evaluation time, we used AutoAttack [46]. 1/λ is set to 6 following [28]. For EnSGD, we set γ = 0.03, ε_E = 1 × 10⁻⁴, η = 0.1, and η′ = 0.1; L in EnSGD is set to 20 for +EnSGD and 30 for +AWP + EnSGD, and we set α to 0.75. For AWP, we set γ_A = 0.005 following [18]. The learning rates of SGD and EnSGD are set to 0.1 and divided by 10 at the 100th and 150th epochs, and we used early stopping by evaluating test robust accuracies against untargeted PGD with 20 iterations. We used a momentum of 0.9 and a weight decay of 0.0005. We trained models three times and show the average and standard deviation of test accuracies. We normalized data but did not standardize data, following [18].

EVALUATION OF GRADIENT NORM

Fig. 6 plots the gradient norm computed over all parameters of WideResNet in AT on CIFAR10. In this figure, the gradient norm of EnSGD is smaller than that of standard training, whereas AT increases the gradient norm, which is consistent with the gradient norm at the last layer in the main paper. The gradient norm of EnSGD slightly increases as the epochs increase, unlike the gradient norm at the last layer. This might be due to the deep architecture; in fact, we observed that the gradient norms of some layers increase while those of others decrease with EnSGD.

Fig. 1. Illustrations of adversarial examples x′ for x ∈ R². A circle and a square are the feasible regions for the L2 and L∞ constraints, respectively. (a) Adversarial examples for binary linear classification. (b) and (c) Optimal adversarial examples inside and on the boundary of the feasible region for nonlinear cases, respectively.

Fig. 5. Accuracy against PGD versus epochs. Left: training accuracy. Right: test accuracy. AT and AT + EnSGD denote AT using SGD and AT using EnSGD, respectively. AWP and AWP + EnSGD denote AT with AWP using SGD and AT with AWP using EnSGD, respectively. Note that we use PGD with ten iterations for training accuracy and PGD with 20 iterations for test accuracy.

TABLE III ROBUST ACCURACIES AGAINST AUTOATTACK (||δ||₂ = 0.5)