Adversarial Deep Learning: A Survey on Adversarial Attacks and Defense Mechanisms on Image Classification

The popularity of adopting deep neural networks (DNNs) to solve hard problems has increased substantially. Specifically, in the field of computer vision, DNNs are becoming a core element in developing many image and video classification and recognition applications. However, DNNs are vulnerable to adversarial attacks, in which, given a well-trained image classification model, a malicious input can be crafted by adding subtle perturbations that cause the image to be misclassified. This phenomenon raises many security concerns about utilizing DNNs in safety-critical applications and has attracted the attention of academic and industry researchers. As a result, multiple studies have proposed novel attacks that can compromise the integrity of state-of-the-art image classification neural networks. The rise of these attacks urges the research community to explore countermeasures that mitigate them and increase the reliability of adopting DNNs in major applications. Hence, various defense strategies have been proposed to protect DNNs against adversarial attacks. In this paper, we thoroughly review the most recent and state-of-the-art adversarial attack methods by providing an in-depth analysis and explanation of their working process. In our review, we focus on explaining the mathematical concepts and terminologies of the adversarial attacks, which provides a comprehensive and solid survey for the research community. Additionally, we provide a comprehensive review of the most recent defense mechanisms and discuss their effectiveness in defending DNNs against adversarial attacks. Finally, we highlight the current challenges and open issues in this field as well as future research directions.

on the rapidly growing adversarial deep learning research. Also, we believe there is a need to survey the current and emerging advances in adversarial deep learning and provide an in-depth study of future research directions. Recently, different articles have reviewed various research works in this field [16], [17], [18], [19], [20], [21]. This survey differs from the existing surveys in several aspects. Compared to other surveys in the literature, this survey provides a comprehensive background review of the mathematical concepts that are vital to understanding the working process of adversarial attacks and defense mechanisms on image classification, and it provides a clear description of the terminologies and technical terms used in this domain based on the most recent research and advancements in the field. The distinctive part of this survey is that it provides a systematic and deep review of the working process of the most recent and state-of-the-art adversarial attacks, focusing on their mathematical terminology and foundations, which yields an accessible yet thorough description of these attacks. Besides, we provide a general overview of the effectiveness of these attacks in many respects, including but not limited to attack performance. Furthermore, we provide a comprehensive review of the well-known defense mechanisms, highlighting their effectiveness, their limitations, and the attacks they cover. This article mainly focuses on reviewing adversarial attacks and their defense methods in computer vision. However, we also provide a lightweight review of the well-known adversarial attacks in other contexts, such as audio, 3-D data, and software, which helps interested readers quickly explore adversarial attacks in those contexts.

The main contributions of this article can be summarized as follows:

• We provide an extensive study of state-of-the-art algorithms for generating deep learning adversarial attacks in computer vision.

• We provide an in-depth study of various defense mechanisms against adversarial attacks.

• We provide a systematic and comprehensive review of the adversarial threat model that covers the deep learning system attack surface, adversarial knowledge and capabilities, adversarial goals, and attack scenarios.

The remainder of this paper is organized as follows. Section II provides some technical terms related to adversarial deep learning. In Section III we provide an overview of different concepts of deep learning and adversarial attacks. Section IV describes the threat model of deep learning. We dedicate Section V to discussing the adversarial attacks. In Section VI we introduce the defense mechanisms. We discuss the future research directions of deep learning security in Section VII. Finally, Section VIII concludes this paper.

II. DEFINITIONS OF TERMS
In this section, we describe some of the technical terms used in this survey study.

As depicted in Fig. 2, the basic gradient descent algorithm consists of: (1) calculating the gradient ∇f of the objective function J(θ), (2) moving in the opposite direction of the gradient ∇f, which is the direction of steepest descent and leads to an improvement (i.e., toward the global minimum, see Fig. 2), and (3) selecting the learning rate β, which refers to the size of the steps taken toward the minimum. β is the most important parameter and must be tuned carefully to achieve a well-performing DNN model. Generally, a large β allows the ML model to learn faster. However, it could drastically decrease the model's performance, since the algorithm may end up on the other side of the valley (missing the optimal minimum). A small β allows the model to converge by finding the local minimum after many iterations, which, not surprisingly, requires a long run time. There is a trade-off between the accuracy of the results and the time required to perform parameter updates.

Gradient descent has three variants: batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent, which we discuss in the next sections.

1) BATCH GRADIENT DESCENT (BGD)
As shown in Eq. 1, BGD calculates the gradient of the cost function with respect to the model parameters θ over the whole training dataset. In other words, BGD computes the gradient over the full training dataset to perform one parameter update, which explains why it is called ''batch'' or, in some cases, ''full'' gradient descent. Depending on the size of the training dataset, batch gradient descent is time-consuming and requires a long processing time.
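For reference, the BGD parameter update referenced as Eq. 1 is commonly written as follows (a standard form; β is the learning rate and (X, Y) denotes the full training set):

\theta \leftarrow \theta - \beta \, \nabla_{\theta} J(\theta; X, Y)

The gradient is computed over the entire training set before every single parameter update.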

Despite the long processing time, BGD has several advantages. For example, when the cost function is convex, BGD with a fixed learning rate will converge to the global minimum. When the cost function is not convex, it will converge to a local minimum, since it follows a direct path toward the minimum value (see Fig. 3). Therefore, BGD guarantees convergence to a minimum.
Since BGD requires a long processing time to calculate the gradient for large-scale training datasets, SGD [23] was proposed to overcome BGD's limitations. SGD is an iterative technique for optimizing the cost function. As illustrated in Eq. 2, in every iteration SGD randomly selects an example from the training data, calculates the gradient, and then performs a parameter update using only the selected example. In contrast to BGD, where the actual gradient is calculated using the entire dataset, SGD works with an approximation of the gradient computed from a single example.

In adversarial settings, the similarity between a clean input and its perturbed counterpart can be evaluated using a norm function. A norm L_p is a function that measures the magnitude of a vector; formally, the L_p norm of x is the p-th root of the sum of the absolute values of its elements, each raised to the power p. Distance metrics are used within the process of generating adversarial attacks to quantify these similarities. The L_0, L_1, L_2, and L_∞ distance metrics have been widely adopted by state-of-the-art adversarial attack algorithms [12], [26], [27], [28] and are detailed below.

1) L_0 DISTANCE
As shown in Eq. 5, the L_0 distance calculates the size of a vector by measuring the total number of its non-zero elements. Arguably, L_0 is sometimes referred to as a ''norm'', which is not correct, since scaling a vector by a constant a does not change the number of non-zero elements; thus, it is more accurately classified as a cardinality function.
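For reference, the general L_p norm and the L_0 measure discussed above are commonly written as:

\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}, \qquad \|x\|_0 = \#\{\, i : x_i \neq 0 \,\}

where n is the dimensionality of x; L_0 simply counts the non-zero entries of the vector.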
The L_0 distance was used in [26] and [29] to generate adversarial attacks, since it corresponds to the number of altered pixels of an image.

The L_1 distance, also known as the Manhattan distance or the taxicab norm, is used when the difference between non-zero and zero elements is important. Essentially, when an element moves away from the origin (0,0) by a, L_1 increases by a. Therefore, L_1 measures the distance from the origin (0,0) to the point (x,y); formally, it is defined as the sum of the absolute values of the differences between the coordinates. The L_1 distance is utilized by the elastic net attack [30] to generate adversarial perturbations. More specifically, the L_1 distance functions as a regularization parameter, representing the perturbation's total variation. The L_1 distance improves the transferability of the elastic net attack by generating distinct adversarial images that fool DNN models.

The L_2 distance, also known as the Euclidean distance, is widely used in machine learning and is often denoted as ‖x‖_2.
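For reference, the L_1 and L_2 distances between an input x and its perturbed version x̂ take the standard forms:

\|x - \hat{x}\|_1 = \sum_{i=1}^{n} |x_i - \hat{x}_i|, \qquad \|x - \hat{x}\|_2 = \sqrt{ \sum_{i=1}^{n} (x_i - \hat{x}_i)^2 }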

As shown in Eq. 7, L_2 measures the shortest distance (i.e., the length of the straight line) between two vectors.
The L_2 distance has been employed by different researchers to generate adversarial attacks [31]. For example, in [27] the L_2 distance is used to measure the distance between the class labels of the original image x and the perturbed image x̂; it is also used to measure the size of the perturbations in the perturbed image x̂.
The L_∞ distance, also known as the ''max norm'', returns the maximum magnitude among the differences between the elements of two vectors. As shown in Eq. 8, the L_∞ norm can be described as the maximum of the absolute values of the differences in a set of numbers (e.g., a coordinate pair, an n-dimensional vector, etc.) [26].
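For reference, the L_∞ distance between x and x̂ is commonly written as:

\|x - \hat{x}\|_\infty = \max_{i} |x_i - \hat{x}_i|

That is, only the largest single-element change is measured, regardless of how many elements are perturbed.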
In adversarial settings, L_∞ can be used as a constraint over the size of the perturbations that may be added to generate the perturbed image [32].

In an artificial neuron, the weighted inputs are summed and passed through a non-linear activation or transfer function [33]. This process is shown in Fig. 4. The input layer is the initial layer in an ANN, and it is mainly responsible for feeding the data into the network. The inputs are then transferred to the hidden layer(s) for processing. The hidden layer(s) are where the network applies activation functions and weights to the inputs. The hidden layers process the inputs coming from the preceding layer and extract the required information from the data. As shown in Fig. 6, neural networks can have multiple hidden layers; such a network is known as a deep neural network (DNN). Based on the problem's complexity, multiple hidden layers can be used to increase the prediction accuracy of the network and extract more features from the data. For example, as shown in Fig. 7, a convolutional neural network (CNN) used for facial recognition cannot identify a human face with only one hidden layer. One layer that identifies eyes cannot recognize an entire face, but if it is combined with other layers that identify other features, such as noses or mouths, the network becomes stronger and can successfully recognize faces. The output layer is the final layer in an ANN and is responsible for aggregating the information and returning the outputs in the format required by the problem.

Forward propagation is the process of progressively moving through the layers of the ANN and is used in feed-forward neural networks [34]. The hidden layer(s) take the input data, process it, and then pass it on to the next layer. This is a necessary step for feed-forward networks to generate outputs; if the data travels backward at any point, the network is no longer purely feed-forward.

FIGURE 5. The general architecture of a simple neural network and its three layers: the input layer (green), the hidden layer (blue), and the output layer (red).

The activation function determines whether an artificial neuron should be activated. The main objective of the activation function is to transform the values in the node into an output value that can be accepted as input by the next function (e.g., a vector) while adding non-linearity to the output values. Activation functions map the resulting values from the summation function in the node to lie between [0, 1] or [−1, 1]. The result of the activation function forms the input for the next layer. Activation functions fall into two categories: linear and non-linear. The most widely used activation functions are non-linear, and the most popular ones include the hyperbolic tangent, sigmoid, and softmax.

Sigmoid functions are characterized by an S-shaped curve and can be categorized into three different functions: the logistic function, the hyperbolic tangent, and the arctangent. In the context of machine learning, the sigmoid function usually refers to the logistic sigmoid function [35]. The logistic sigmoid function, as defined in Eq. 9, takes any real value x and outputs a value S(x) that lies within the range [0, 1].

S(x) = \frac{1}{1 + e^{-x}}
Sigmoid functions are widely used as activation functions in deep learning because they add non-linearity to the network. Sigmoid functions are also used to convert real numbers into probabilities; a logistic sigmoid function placed in the last layer of an ANN converts the output into a probability score.

The softmax function can be viewed as a generalization of the logistic regression function, sharing similarities with the sigmoid function, as shown in Eq. 10. The softmax function transforms a vector of values into a vector of the same length whose values sum to one. The softmax function is commonly implemented as the penultimate layer of a neural network [36] because it transforms the outputs from the hidden layers into a normalized probability distribution.

The hyperbolic tangent, or tanh, function can be utilized as an alternative to the logistic sigmoid function in an ANN. The tanh function shares similarities with the logistic sigmoid.
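For reference, the softmax and tanh functions discussed here are commonly defined as:

\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

where z is a K-dimensional vector of raw scores (logits); the softmax outputs sum to one, while tanh maps its input to the range [−1, 1].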

Larger (more positive) input values result in outputs closer to 1, whereas smaller (more negative) inputs result in outputs closer to −1. tanh is preferable to the logistic sigmoid as it has unrestricted gradients [37].

The adversary's knowledge about the targeted machine learning system determines the attack setting. In the perfect-knowledge setting, the adversary knows the training data as well as the architecture and parameters of the targeted model. The adversary also knows the type of learning algorithm, which includes the type of activation function and loss function. In white-box settings, the adversary thus has access to the full knowledge of the targeted learning system [39], and the adversarial attacks generated under this setting are commonly known as white-box attacks [40], [41].

In contrast to the perfect-knowledge setting, in the black-box setting the adversary has no knowledge about the targeted system and no access to any surrogate model. The only available option for the adversary is to query the targeted learning system (i.e., the oracle). Given the adversary's lack of knowledge, the attacks generated in this setting are referred to as black-box attacks [41].

Machine learning threat models can also be categorized by the capabilities of the adversary. In cyber-security, the term ''capability'' refers to the adversary's level of access to the system resources (i.e., the learning model and data). Depending on the adversarial attack setting, the adversary's capabilities can be categorized as follows. In white-box settings, the adversary has read and write access to the training dataset of the targeted learning system. In black-box settings, the adversary can only query the targeted model and observe its outputs, which may help in crafting the adversarial attacks [43].

The severity of any threat to a system asset is measured by its potential impact on three objectives: confidentiality, integrity, and availability [44]. Depending on the business logic of the computer system, the integrity of the output (i.e., predictions and classifications) from a machine learning model is indispensable. For instance, an adversary can provide an adversarial example, yielding an incorrect output.

Based on the type of output incorrectness, the adversarial goals fall into three categories:

• Untargeted misclassification. The adversary tries to increase the misclassification ratio of the DNN model by using adversarial examples generated by an untargeted adversarial attack as input to produce an incorrect classification. In other words, the adversary tries to force the targeted model to assign any incorrect label to the adversarial examples.

• Confidence reduction. The adversary tries to reduce the prediction confidence by increasing the prediction ambiguity of the targeted model.

• Source/target misclassification. The adversary tries to craft perturbations that force the classification of an adversarial example to a specific label (i.e., assign a specific label to an adversarial input) [45]. To achieve this objective, the adversary may use targeted adversarial attacks.

Adversarial attacks against machine learning systems can be launched either at the training phase or at the testing phase, as explained below. At test time, adding perturbations to a clean image appears to be an effective attack method: the generated adversarial examples are almost identical to the clean images (see Fig. 9), which has attracted the research community. Adversarial examples were first introduced by Szegedy et al.
[12], who found that adding a small perturbation ρ to an image x would result in an adversarial image x̂ that could successfully fool a deep learning model. To compute the proper size of the perturbation, the authors attempted to solve the following optimization problem.
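A standard statement of this problem, following the common formulation of the L-BFGS attack (the exact notation may differ from the original article), is:

\min_{\rho} \ \|\rho\|_2 \quad \text{s.t.} \quad f(x + \rho) = l, \quad x + \rho \in [0, 1]^m

where l is the target label and the box constraint keeps the perturbed image a valid image.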
However, the equation above is difficult to solve exactly. As a result, box-constrained L-BFGS [54] was used to find an approximation of the solution, as shown in Eq. 13. This is done by finding the minimum value that satisfies the condition f(x + ρ) = l while calculating the loss of the classifier.
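The relaxed problem referenced as Eq. 13 is commonly written in the following form (a standard restatement; the weighting constant c is an assumption of this sketch and is typically found by line search):

\min_{\rho} \ c\,\|\rho\|_2 + \mathrm{loss}_f(x + \rho,\ l) \quad \text{s.t.} \quad x + \rho \in [0, 1]^m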
The authors observed that the adversarial examples generated by box-constrained L-BFGS appear almost identical to the original images (i.e., the perturbation is imperceptible). They also noted that the resulting adversarial examples can fool other DNN models (i.e., the adversarial examples are transferable). The results of their work triggered concerns about the security of deep learning systems and established a wide interest in researching adversarial machine learning.

More formally, given an image x, FGSM calculates the perturbation using the formula given below.

Adversarial images generated using BIM are obtained by iteratively applying the update given below, where x̂_i is the adversarial example at the i-th iteration; BIM computes the next image x̂_{i+1} and repeats for a number of iterations determined heuristically. The BIM algorithm therefore minimizes the computational cost while being strong enough to reach the edge of the decision boundary, yielding a misclassified x̂.
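For reference, the FGSM and BIM update rules referenced above are commonly written as (standard forms, not necessarily the exact notation of the surveyed papers):

\hat{x} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big)

\hat{x}_{i+1} = \mathrm{Clip}_{x,\epsilon}\Big\{ \hat{x}_i + \alpha \cdot \mathrm{sign}\big(\nabla_x J(\theta, \hat{x}_i, y)\big) \Big\}, \qquad \hat{x}_0 = x

where J is the training loss, ε bounds the total perturbation, α is the per-iteration step size, and Clip keeps each pixel within an ε-ball of the original image.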

The ILCM (iterative least-likely class method) attack further extends BIM to generate a targeted attack. ILCM differentiates itself from BIM by generating a perturbation toward the least likely class of x. Adversarial images generated using ILCM are created using the formula below.
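A common statement of this targeted, least-likely-class update (a standard form; the notation may differ slightly from the original article) is:

\hat{x}_{i+1} = \mathrm{Clip}_{x,\epsilon}\Big\{ \hat{x}_i - \alpha \cdot \mathrm{sign}\big(\nabla_x J(\theta, \hat{x}_i, y_t)\big) \Big\}, \qquad y_t = \arg\min_{y} \ p(y \mid x)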
where y, the class label used in Eq. 16, is replaced with the target label y_t, which corresponds to the least likely class, i.e., the class with the lowest confidence score predicted by the model. ILCM uses the same number of iterations and step size as BIM.
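Universal adversarial perturbations (UAPs) extend this idea from a single input to an entire data distribution μ: a single perturbation ρ is sought that fools the classifier on most inputs. A common statement of this objective, consistent with the constraint description below, is to find ρ such that:

\|\rho\|_p \le \xi \quad \text{and} \quad \Pr_{x \sim \mu}\big( f(x + \rho) \neq f(x) \big) \ge 1 - \delta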
where f is a classification function, ‖·‖_p is the L_p norm, δ is the desired fooling rate, and the parameter ξ is responsible for the magnitude of the perturbation ρ.

More specifically, generating a perturbation ρ that can fool most data points in an image set X = {x_1, . . . , x_n} is done by iterating over the images in X and gradually building up the UAP. The authors generate universal perturbations in a similar way to the DeepFool [27] algorithm: they gradually push a single data point towards the closest hyperplane.

FIGURE 10. An adversarial example generated using the DeepFool [27] attack. The perturbation image in the middle is magnified.

In this case, the UAP method consecutively pushes all the input data points towards their respective hyperplanes. The attack terminates when the fooling rate on the adversarial examples exceeds the desired threshold.

The authors of the projected gradient descent (PGD) attack argued that the adversarial robustness of DNN models can be viewed in terms of ''robustness optimization''. As shown in Eq. 19, they defined adversarial training as a formal optimization problem known as the saddle point problem.
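The saddle point problem referenced as Eq. 19 is commonly written as follows (a standard restatement of the robust-optimization objective; S denotes the set of allowed perturbations, e.g., an L_∞ ball of radius ε):

\min_{\theta} \ \mathbb{E}_{(x, y) \sim D} \Big[ \max_{\rho \in S} \ L(\theta, x + \rho, y) \Big]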
where E_D[L(·)] denotes the population risk, i.e., the expectation of the loss L over the data distribution D. The saddle point optimization problem is the composition of an inner maximization problem and an outer minimization problem [50]. The inner maximization finds an adversarial data point that maximizes the loss, while the outer minimization finds the model parameters such that the loss produced by the inner problem is minimized. Moreover, Eq. 19 also defines a goal for an ideal robust classifier as well as a measurable value of the classifier's robustness. PGD is a powerful first-order attack and has been shown to fool deep learning models efficiently and effectively [59].

The NewtonFool algorithm [60] decreases the probability of the original class label by utilizing Newton's method for solving nonlinear equations. This attack performs gradient descent with step size δ to find a perturbation ρ that produces an adversarial example x̂. The step size is determined adaptively, changing over time according to the change in the perturbation ρ. The step size δ is computed by solving Eq. 20, where the tuning parameter η controls the size of ρ, x_0 is the input image, and F_s^l represents the neural network with a softmax activation layer. The step size δ is then utilized to calculate the adversarial perturbation ρ, where x_i is the current image, δ is the step size calculated in Eq. 20, and ∇F_s^l is the gradient of the classifier. The authors extend the attack to work with multiple class labels: NewtonFool decreases the probability of all labels in a set of clean images L+ and increases the probability of all labels in a set of perturbed images L−. NewtonFool produces effective perturbations and significantly reduces the confidence probability of the correct class.
where α_k is the step size at the (k + 1)-th iteration and S_β denotes the attack's constraint set.

The goal of the optimization problem is to discover a step δ_k such that the new perturbed input x̂_k = x̂_{k−1} + δ_k has the minimum L_p distance to the clean input x. The new perturbed input x̂_k stays within the box constraints of a valid input range, and the perturbation is placed on the adversarial boundary.

The boundary attack is a strong method for targeting deep learning models, outperforming gradient-based white-box attacks such as FGSM [28] and DeepFool [27].

To ensure visual similarity to the clean images, the perturbation space of the spatial transformation attack is restricted to at most a 30° rotation and a 10% translation in every direction. The optimal perturbation is calculated via hyperparameter optimization, better known as grid search. Grid search is an exhaustive process in which a subset of the hyperparameter space is searched to find the optimal parameters, in this case the perturbation, for a given model. The combination of rotation and translation parameters is applied to the entire group of input images; in a sense, the perturbation found by the spatial transformation attack is universal. The spatial transformation attack achieves remarkably high results and fools multiple deep learning models trained on the MNIST [56], CIFAR-10 [55], and ImageNet [6] datasets.

The Houdini attack replaces the task loss with a surrogate loss function l̂(ŷ_θ(x), y), referred to as Houdini.

Houdini is composed of two parts. The first is a stochastic margin that reflects the probability that the difference between the score of the predicted target and the actual target is smaller than a given value; this represents the model's confidence. The second part of Houdini is the task loss, which is independent of the model and corresponds to the objective being maximized. Houdini is designed to generate effective adversarial images that fool a given model, but it has also been shown to be effective against speech recognition systems, and effective targeted and untargeted attacks have been generated against a DNN that estimates human poses.

The SimBA attack algorithm is an effective method for generating adversarial examples. The attack was tested on a deep learning model trained on the ImageNet [6] dataset and achieved success rates of 98.6% and 100% in the untargeted and targeted settings, respectively. SimBA also requires an extremely low average number of queries to the model compared to similar black-box algorithms: 1,665 queries in the untargeted setting and 7,899 queries in the targeted setting.

The threshold attack, also known as the L_∞ black-box attack, optimizes a constrained optimization problem using the L_∞ norm. This attack applies a small perturbation ρ to all pixels. The optimization problem is constrained by ‖ρ‖_∞ ≤ th, where th is a predefined threshold value. The threshold attack searches for variables in the algorithm search space R^k, which is the same as the input space; the variables can be any variation of the inputs as long as the threshold is not crossed.

The few-pixel attack attempts to minimize the number of perturbed pixels by optimizing a constrained optimization problem with the L_0 norm. The search space for the few-pixel attack is smaller than the input space; it searches for variables in the space R^{(2+c)·th}. The fundamental difference between this attack and the threshold attack is the use of a different L_p norm and a different search space.

The HSJA repeats for t iterations, or until the optimal adversarial perturbation is generated, and each iteration consists of three steps. In the attack's formulation, x denotes the input image and ρ the adversarial perturbation. The HSJA was also shown to minimize the number of queries used, and it was able to achieve a 70% success rate.

The ColorFool attack distinguishes sensitive image regions, whose colors humans expect to stay within a natural range, such as skin, sky, vegetation, and water, from non-sensitive regions, which are anything that does not fall into those categories. Sensitive regions must stay within a specific range of modification, while non-sensitive regions can be modified more freely and still look natural.

After identifying the different regions, ColorFool splits an image x into k semantic regions using binary masks that identify the positions of the pixels belonging to each region. The colors of each set are then modified in the Lab color space, which separates brightness from color. The natural color ranges a, b, and L are used to decompose the color values, where a ranges from red to green, b from blue to yellow, and L from black to white.

The colors of the sensitive regions are then modified after converting the image from RGB to the Lab color space. The adversarial perturbations in the color channels a and b are randomly chosen from the set of natural color ranges. The color ranges are determined by the true colors, the region semantics, and prior information on color perception. The colors are changed iteratively in small intervals until the perturbation fools the classifier. Then, the colors of the non-sensitive regions are modified in the same way as the sensitive ones, but the color values are drawn from the entire range of a and b in order to allow larger changes. Finally, the adversarial image x̂ is generated by combining the two modified color regions into one image. The adversarial image is then converted back from the Lab color space to RGB and is passed through a function that ensures the image remains in the original range of pixel values.

The ColorFool attack is a strong algorithm that effectively undermines deep learning models trained on the CIFAR-10 [55], ImageNet [6], and Places365 [78] datasets. For example, on CIFAR-10 models trained with a softmax activation function, ColorFool achieves a success rate of 99.4%. On models trained with the prototype conformity loss (PCL) [79] method, as well as PCL with adversarial training, ColorFool achieves success rates of 100% and 99.9%, respectively. Overall, this algorithm is an impressive and effective method of undermining deep learning models.

The square attack [52] is modeled on random search, an iterative optimization technique. This attack differentiates itself from other random search-based attacks by iteratively generating perturbations that lie on the L_2 or L_∞ boundaries before projecting them onto the image. As a result, the perturbation can be maximized at every iteration. The attack updates the image at each step by modifying a small percentage of neighboring pixels grouped into a square.

The square attack is initialized by choosing the side length h_i of the pixel square that will be updated; h_i decreases according to a fixed schedule. Then, a new perturbation ρ is found and added to the current iterate.

The loss value is then re-calculated, and if the resulting value is smaller than the previous loss value, the perturbation is kept; otherwise, it is discarded.

In some poisoning attacks, a secondary loss function serves as a penalty term that punishes the model when any difference between perturbed and non-perturbed inputs can be detected. Dual objective functions allow the adversary to retain high classification accuracy for the model while setting constraints that weaken the model's defenses. As the adversarial training converges, the distribution of backdoor inputs, as well as clean inputs, also converges, minimizing the differences that defense systems use for detecting poisoning attacks.

Algorithms such as the feature collision attack [86] fail when the feature extractor is unknown to the adversary. Thus, the convex polytope attack [87] was introduced to bypass the limitations of such algorithms. This attack creates a set of adversarial examples that contain the target class within their convex hull. The convex polytope attack exploits the association made by the linear classifier of the targeted network between the adversarial examples and the targeted class; the network will then classify any point within the convex hull as the targeted class. The attack is highly transferable because the convex polytope expands the attack area. The attack finds the optimal adversarial examples by iterating through a specialized non-convex optimization problem 4,000 times. The convex polytope attack has several inherent issues, such as scalability, robustness, and generalizability.

Due to its extremely slow execution time, the convex polytope attack is considered non-scalable. Notably, it has two time-consuming processes. First, while optimizing the coefficients, it checks at each iteration whether the new coefficients yield a smaller loss than the previous ones. Second, whenever the new coefficients satisfy this condition, the convex polytope attack projects them onto the probability simplex, a space in which each point represents a probability distribution. The convex polytope attack also faces other issues, specifically the robustness and generalizability of the attack. Once the target moves through the boundary into the convex polytope, there is no reason to continue the optimization process and move further into the attack area. For this reason, the target will remain close to the boundary of the adversarial polytope.

The bullseye polytope attack [88] is a more efficient, transferable, and robust adaptation of the convex polytope attack [87].

The adversary generates a mock dataset by first selecting a large set of images; these images can originate from the targeted network, or they can come from an unrelated image set.

The adversary selects the data collection method depending on the amount of access they have to the targeted model.

Model extraction attacks directly target the most secret parts of a given model: its architecture and parameters. Model extraction allows an adversary who was previously operating under a black-box threat model to effectively gain white-box access to the model; this is achieved by extracting an exact copy of the oracle. Model extraction is one of the most difficult adversarial goals, as the adversary attempts to generate a copy of the model while having access only to its inputs and outputs. Functionally equivalent extraction [90] seeks to construct a second oracle that produces the same output as the original oracle O for every input.

The functionally equivalent extraction method works on neural networks using the ReLU activation function. The algorithm is split into four steps. First, a critical point search determines inputs to the network such that one ReLU unit is at a critical point. This is accomplished by sampling two values and putting them through a linear function that computes the slopes and intercepts of the input vectors and then calculates the intersection of the two vectors. If there happen to be more than two linear factors, then it is unlikely that the true values will match the predicted values. Second, the next step in constructing a duplicate oracle is weight recovery. In order to form the weight matrix A^(0), the attack calculates the second derivative of the oracle O in each input direction at the critical points x_i. The second derivative is used to calculate the difference between adjacent linear regions. This is repeated until the entire matrix A^(0) is complete. Third, the algorithm determines the sign of every row vector A^(0)_j using global information about the matrix. Finally, the least-squares method is used to approximate the architecture of the hidden layer(s) of the neural network. When tested against MNIST [56], the functionally equivalent extraction method produces oracles with 100% accuracy that only begins to diminish around 100,000 parameters. When tested against CIFAR-10 [55], the accuracy dips below 100% after 200,000 parameters. The main issue with this method is that it cannot be extended to deeper neural networks and works sufficiently well only on two-layer models.

Model inversion [91] was developed to be a general-purpose universal attack. It works by utilizing the information available to the adversary from the model and using it to estimate the probabilities of a potential target. Rows from a candidate database that share characteristics with the target database are used as input and are processed by the model. The database rows are weighted depending on the accepted priors and the model's output for a given row.

TABLE 1. An overview of the white-box and black-box attacks.
The table is organized into: (1) the algorithm name; (2) the attack type, targeted or untargeted; (3) the scenario, white-box or black-box; (4) the learning mode, iterative or one-shot (N/A denotes attacks that are neither one-shot nor iterative); (5) the number of perturbations (high, low, fair); (6) the perturbation norm (N/A denotes attacks that do not use L_p norms to formulate perturbations); (7) the execution time (fast, slow, fair); (8) the transferability of the attack, either universal or model-specific; and (9) the attack strength, which is observed from first-hand experiments with the specific algorithms or is the perceived strength from the literature.

MI-Face then iteratively performs gradient descent, and after each gradient step the generated vector is put through a post-processor that performs image manipulations such as sharpening and de-noising. If the cost function fails to improve within a given number of iterations, or the cost value is close to a given threshold, the attack is terminated and the optimal cost value is returned. This reconstruction attack is an efficient and effective algorithm that was shown to have increased attack accuracy and precision compared to other attacks, and it is able to fool a deep learning image recognition model in both white-box and black-box settings.

Generally speaking, any system that utilizes a machine learning algorithm can be targeted by adversarial attacks [93]; hence, adversarial attacks are not limited to image classification. In this paper, we focus mainly on adversarial attacks in the context of image classification networks. However, we believe it is important to briefly review some of the well-known adversarial attacks in different contexts, such as audio, point clouds, and software.

Recently, deep learning methods have become the primary choice for developing audio systems, specifically voice recognition and voice-to-text systems. Researchers have shown that such systems can be compromised by adversarial attacks. Carlini and Wagner [94] demonstrated the existence of targeted audio adversarial examples that can target automatic speech recognition systems such as DeepSpeech [93]. They targeted the transcriptions generated by DeepSpeech: given a clean audio waveform x, an inaudible perturbation ρ is generated such that, when added to the original audio wave, the resulting waveform x̂ is recognized as any phrase of the adversary's choosing.

The authors use the same attack method as in [26] to generate the adversarial audio waves. The proposed attack is highly effective, with a success rate of 100%.
However, it was noticed that the generated adversarial perturbations can be easily detected by humans.

As depicted in Table 2, defenses for neural networks against adversarial attacks generally fall within one of four frameworks: (1) modifying the ANN, (2) modifying the training by including adversarial examples (e.g., adversarial training), (3) transforming the inputs, or (4) adding external models that serve as ANN add-ons. Defense methods that change the training or the input data are disconnected from the ANN model itself, whereas modified ANNs and ANN add-ons implement more layers, add subnetworks, change the loss function, or use external models to defend against attacks. In this section, we discuss the various methods used to protect a deep learning model from adversarial attacks.

Papernot et al. [29] introduced defensive distillation as a defense method for deep learning models against adversarial attacks. Defensive distillation builds upon the original distillation algorithm [104], which was originally introduced as a way to compress a large model into a smaller distilled model. Defensive distillation utilizes the distillation algorithm to increase the robustness of the model: instead of reducing the size of the model, it modifies the softmax activation function in the last layer of the neural network to include a temperature value T. This temperature value forces the model to make stronger and more confident predictions.
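For reference, the temperature-scaled softmax used in distillation is commonly written as:

F_i(x) = \frac{e^{z_i(x)/T}}{\sum_{j=1}^{K} e^{z_j(x)/T}}

where z(x) are the logits of the network and T is the distillation temperature; training is performed at an elevated temperature, while the distilled model is typically evaluated at T = 1.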

The defensive distillation algorithm operates as follows. First, a large network F is trained with the temperature T of the softmax function set during the model's training phase. Then, ''soft'' labels are generated by applying the network to every value in the training set X and recalculating the softmax with the temperature. Next, a new training set is generated using the soft labels. Then, using the new training data, another deep learning model is trained with the same architecture, and the temperature of the softmax function remains T. This new model is known as the distilled model F_d, and when run at test time the model will classify new input data.

The defensive distillation defense method works effectively against the L-BFGS [12] and DeepFool [27] attacks.

Metzen et al. [110] use subnetworks that augment the original network to detect adversarial perturbations. The subnetwork works by branching off the main network and producing a probability p_adv that weighs the chances of an image being adversarial. This subnetwork is known as the detector and is trained to classify inputs as clean or adversarial.

First, the classification network is trained on the regular (i.e., non-adversarial) data. Then, adversarial examples are generated for the entire dataset using the attack algorithms against which the network is trying to defend itself (e.g., FGSM and DeepFool). Once a dataset with an equal amount of clean and perturbed images is generated, the weights of the classification network are frozen and the detector network is trained such that the cross-entropy of the probability p_adv and the labels is minimized. The specifics of the detection subnetwork and how it connects to the classification network are specific to each dataset and classification network. The detector subnetwork defense works in detecting perturbations that are generated using FGSM, DeepFool, and BIM [57].

Adversarial training is one of the most effective ways to improve overall model robustness against adversarial examples. It corresponds to the process of adding adversarial examples, with their correct labels, to the training dataset of the DNN model [32]. This method requires an attack algorithm, an exposed model, and a large dataset; for these reasons, adversarial training is commonly known as brute-force training. It has also been shown that adversarial training can provide an added regularization to the network [28], which helps strengthen DNN models against adversarial attacks. Extensions of the methods in [28] and [27] have been proposed, such as stability training and virtual adversarial training.

The authors of [117] introduce a framework for defending against UAPs by adding a perturbation rectifying network (PRN) as a pre-input layer to the targeted model, which avoids having to alter the model itself. The PRN catches the perturbed images coming into the network and adjusts them so that they receive the same label as the original image. The perturbation rectifying network is trained using datasets that contain real and synthetic UAPs, without changing any of the model's parameters. Separately, a perturbation detector is trained on the cosine transform of the differences between the inputs and outputs of the PRN. As shown in Fig. 11, the images pass through the PRN and are then verified by the detector. When a perturbation is detected, the output of the PRN is used to predict the label instead of the actual image. The PRN shows promising results in defending DNNs against UAPs, with a 97.5% success rate.

Other attempts to defend neural networks from adversarial attacks investigate optimizing the model, which can be computationally expensive. Therefore, Xu et al. [118] proposed feature squeezing to strengthen DNNs by detecting perturbed images. The feature squeezing process minimizes the search space by consolidating examples that correspond to different feature vectors in the original search space into a single example. Although the feature squeezing process is quite general, the authors specifically explore two methods: spatial smoothing and the reduction of the color bit depth of every pixel. These techniques are simple, inexpensive, and can be combined with other defense strategies for more effective results.
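To make the feature-squeezing idea concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the detection procedure: the input is squeezed by bit-depth reduction and median smoothing, and the prediction on each squeezed input is compared with the prediction on the original input; a large disagreement flags the input as adversarial. The model callable and the threshold value are hypothetical placeholders.

import numpy as np
from scipy.ndimage import median_filter

def reduce_bit_depth(x, bits=4):
    # Squeeze color depth: keep only `bits` bits per channel (x assumed in [0, 1]).
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def spatial_smoothing(x, window=2):
    # Squeeze local variations with a median filter over the spatial dimensions
    # of an HxWxC image.
    return median_filter(x, size=(window, window, 1))

def is_adversarial(model, x, threshold=1.0):
    # `model` is assumed to map an HxWxC image in [0, 1] to a softmax vector.
    # `threshold` is a hypothetical value; in practice it is tuned on clean data.
    p_original = model(x)
    p_bits = model(reduce_bit_depth(x))
    p_smooth = model(spatial_smoothing(x))
    # L1 distance between the original prediction and each squeezed prediction.
    score = max(np.abs(p_original - p_bits).sum(),
                np.abs(p_original - p_smooth).sum())
    return score > threshold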
Input images first go through an external model before they are passed to the classification network.

Another defense analyzes the activations in the neural network to detect backdoors. Its working process can be described as follows. First, the neural network is trained using an untrusted dataset containing poisoned examples. Second, the neural network is queried using the training data and the activations of the last hidden layer are recorded. Third, once the activations of each sample are retained, they are segmented, with each segment corresponding to a label, and each segment is clustered individually. Fourth, using k-means clustering [120], the clusters are separated into two groups: poisoned and clean data. Finally, the poisoned data is identified either by exclusionary reclassification, relative size comparison, or silhouette score. Once the poisoned data is identified, the authors suggest that the fastest way to repair the backdoor is to ''re-label'' the poisoned data with its original class and continue training the model until convergence. Their method was tested using the LISA [121], MNIST [56], and Rotten Tomatoes [122] datasets. When the authors experimented with 10% poisoned data on MNIST, they were able to achieve an accuracy and F1 score of nearly 100% for each class label. Compared to a conventional clustering algorithm, their method outperforms it in every respect.

When this type of defense is applied, the attack's effectiveness drops to around 10%; in most cases, by utilizing spectral signals, all traces of corrupted data can be removed, minimizing the misclassification rate to around 1%.

• Medium transferability. At this level, the adversarial attack can fool different neural networks trained using the same dataset (i.e., performing the same task). FGSM is an example of medium transferability.

• High transferability. At this level, the adversarial attack can fool different neural networks of different architectures performing different tasks.

Currently, most of the existing adversarial attack research is focused on image classification, and very limited studies have focused on other applications [41]. Therefore, further research is required on adversarial attacks against deep neural networks in different applications. In addition, further investigation is required to evaluate the applicability, efficiency, and practical use of the current adversarial attacks in different applications.

Multiple defense methods have been proposed to counter adversarial attacks. However, it has often been shown that a defended model can still be successfully attacked by an existing attack or a zero-day attack. For example, the distilled neural network defense mechanism [29] has been defeated by the C&W attack [26], and the adversarial training defense technique has been shown to be ineffective in some settings [125]. Thus, further research is required on developing a universal adversarial defense method that covers the various aspects of adversarial attacks.

Currently, most adversarial attacks and defense mechanisms have been evaluated in limited environments. Also, in many cases, the source code and the configuration parameters of the experimental environment are not available to the research community to further evaluate the robustness of the defense method or the adversarial attack. Hence, having a deep learning robustness evaluation methodology is crucial. Different works have conducted initial studies emphasizing the importance of evaluating the robustness of neural networks [126], [127], [128]. However, several questions remain open, such as: (1) how should neural networks be stress-tested in different business domains? and (2) what are the general robustness parameters and acceptance scores of a neural network application? To answer these questions, further research is required.