RAILS: A Robust Adversarial Immune-inspired Learning System

Adversarial attacks against deep neural networks (DNNs) are continuously evolving, requiring increasingly powerful defense strategies. We develop a novel adversarial defense framework inspired by the adaptive immune system: the Robust Adversarial Immune-inspired Learning System (RAILS). Initializing a population of exemplars that is balanced across classes, RAILS starts from a uniform label distribution that encourages diversity and uses an evolutionary optimization process to adaptively adjust the predictive label distribution in a manner that emulates the way the natural immune system recognizes novel pathogens. RAILS' evolutionary optimization process explicitly captures the tradeoff between robustness (diversity) and accuracy (specificity) of the network, and represents a new immune-inspired perspective on adversarial learning. The benefits of RAILS are empirically demonstrated under eight types of adversarial attacks on a DNN adversarial image classifier for several benchmark datasets, including MNIST, SVHN, CIFAR-10, and CIFAR-100. We find that PGD is the most damaging attack strategy and that for this attack RAILS is significantly more robust than other methods, achieving improvements in adversarial robustness by $\geq 5.62\%$, $12.5\%$, $10.32\%$, and $8.39\%$ on these respective datasets, without appreciable loss of classification accuracy. Code for the results in this paper is available at https://github.com/wangren09/RAILS.


I. INTRODUCTION
The state of the art in supervised deep learning has dramatically improved over the past decade [1]. Deep learning techniques have led to significant advances in applications such as face recognition [2], object detection [3], and natural language processing [4]. Despite these successes, deep learning techniques are not resilient to evasion attacks (a.k.a. adversarial attacks) on test inputs and poisoning attacks on training data [5]-[7]. The adversarial vulnerability of deep neural networks (DNNs) has restricted their application, motivating researchers to develop effective defense methods. The focus of this paper is to develop a novel deep defense framework inspired by the mammalian immune system.
Current adversarial defense strategies can be divided into four classes: (1) detection of adversarial samples [8]-[10]; (2) robust training [11]-[14]; (3) data denoising and reconstruction [15], [16]; and (4) deep adversarial learning architectures [17], [18]. The first class of methods defends a DNN using simple outlier detection models for detecting adversarial examples. However, it has been shown that such adversarial detection methods can be easily defeated [19]. Robust training aims to harden the model against attacks such as evasion attacks. Known robust training methods are often tailored to a certain level of attack strength in the context of $\ell_p$-perturbation. Moreover, the trade-off between accuracy and robustness presents design challenges [12]. The data denoising and reconstruction class of methods is driven by the intuitive idea that adversarial examples can be mapped to the manifold of clean examples through data reconstruction by denoising. However, while denoising can reduce adversarial perturbations, it can also distort the inputs [20], providing new opportunities for an attacker to exploit the defense mechanism [21]. Deep adversarial learning architectures directly design the defense into the layers of the neural network, e.g., by robustifying them with k-NNs [17] or modifying a generative adversarial network (GAN) [18]. Despite these advances, current methods still have difficulty providing an acceptable level of robustness to novel attacks [22].
VOLUME 4, 2016 arXiv:2107.02840v2 [cs.NE] 21 Feb 2022
To design an effective defense, it is natural to consider a learning strategy that emulates mechanisms of the naturally robust biological immune system. The power of artificial immune system (AIS) models has been established in many other applications [23]-[25]. While AIS approaches to enhancing DNN adversarial robustness have been previously developed [26], [27], they have been restricted to simple emulations of the innate immune system. In this paper, we propose a new framework, the Robust Adversarial Immune-inspired Learning System (RAILS), that can effectively defend deep learning architectures against aggressive attacks based on a refined biological model of the adaptive immune system. Built upon a class-wise k-Nearest Neighbor (kNN) structure, RAILS finds, for a given test sample, an initial small population of proximal samples, balanced across different classes, with a uniform label distribution. RAILS then promotes the specificity of the label distribution towards the ground truth label through an evolutionary optimization. RAILS can efficiently correct adversarial label flipping by balancing label diversity against specificity. While RAILS can be applied to defending against many different types of attacks, in this paper we restrict attention to evasion attacks on the input. Figure 1 shows that RAILS outperforms existing methods on various types of evasion attacks.
FIGURE 1. Radar plot showing that RAILS has higher robust accuracy than the adversarially trained CNN [11] and Deep k-Nearest Neighbor (DkNN) [17] in defending against eight types of attacks: $\ell_\infty$-PGD attack and $\ell_2$-PGD attack [11], Fast Gradient Sign Method (FGSM) [5], Square Attack [28], Adversarial Patch [29], AutoAttack [30], Boundary Attack [31], and a (customized) ASK-Attack [32]. The benign accuracies of CNN, DkNN, and RAILS are 87.26%, 86.63%, and 82%, respectively. We refer readers to Section IV for more results.
Compared to existing defense methods, we make the following contributions:
• RAILS achieves better adversarial robustness by assigning a uniform label distribution to each input and evolving it to a distribution that is concentrated about the input's true label class (see Section II and Table 2).
• RAILS incorporates a life-long robustifying process by adding synthetic "virtual data" to the training data (see Section II and Table 5).
• RAILS evolves the distribution via mutation and crossover mechanisms and is not restricted to $\ell_p$ or any other specific type of attack (see Section III and Figure 1).
• We demonstrate that RAILS improves robustness of existing methods for different types of attacks (Figure 1). In particular, RAILS improves robustness against the highly damaging PGD attack by ≥ 5.62%/12.5%/10.32%/8.39% for the MNIST, SVHN, CIFAR-10, and CIFAR-100 datasets (Table 2). Furthermore, we show that the RAILS implementation of life-long learning with training data augmentation yields a 2.3% robustness improvement with only 5% augmentation of the training data (Table 5).
• RAILS is the first adversarial defense framework to be based on the biology of the adaptive immune system. In particular: (a) RAILS computationally emulates the principal mechanisms of the immune response (Figure 2); and (b) our computational and biological experiments demonstrate the fidelity of the emulation: the learning patterns of RAILS and the immune system are closely aligned (Figure 7).

A. RELATED WORK
After it was established that DNNs were vulnerable to evasion attacks [6], different types of defense mechanisms have been proposed. An intuitive idea is to eliminate the adversarial examples through outlier detection, including training an additional discrimination sub-network [8], [10] and using kernel density estimation [9]. The above approaches rely on the fundamental assumption that the distributions of benign and adversarial examples are easily distinguishable, an assumption that has been challenged in [19].
In addition to adversarial attack detection, other methods have been proposed that focus on robustifying the deep architecture during the learning phase [11]-[13]. One recent approach combines training with perturbed inputs and hierarchical feature alignment between the adversarial and clean domains to robustify the feature learning process [14]. Though such defenses are effective against adversarial examples with moderate levels of $\ell_p$ attack strength, they have limited power to defend against stronger attacks, and there is often a sacrifice in overall classification accuracy. In contrast, RAILS is developed to defend against diverse powerful attacks with less sacrifice in accuracy, and can improve any model's robustness, including robust models.
A different family of defenses models adversarial inputs as deviating from the manifold of clean data. This motivates the use of projection methods for denoising the inputs, where the inputs are mapped to the manifold [15], [16]. Examples include mapping adversarial examples to a high-resolution manifold using wavelet denoising and super resolution techniques [15] and to a low dimensional quasi-natural space with a sparse transformation layer [16]. Despite the simple and clear motivation, denoising methods have their own limitations. For example, they can introduce distortions into clean inputs [20] and have been shown to fail to be robust to many adversarial attacks [21], [34]. Instead of attempting to modify inputs, RAILS evolves a statistical population of clones of the input, which enhances resilience to attacks.
FIGURE 2. Simplified immune system (left) and RAILS computational workflow (right): Both systems are composed of a four-step process, which includes initial detection (sensing), recruiting candidates for diversity (flocking), enlarging population size and promoting specificity (affinity maturation) [33], and obtaining the final solution (consensus).
Another approach is to incorporate different architectures to robustify deep classifiers [17], [18]. An example is the deep k-Nearest Neighbor (DkNN) classifier [17] that robustifies against instance perturbations by applying kNNs to features extracted from each layer. However, a single kNN classifier applied on the whole dataset is easily fooled by strong attacks (Figure 4). In contrast, RAILS incorporates an evolutionary diversity-to-specificity defense mechanism which can provide additional robustness to existing DNNs.
Network defense mechanisms inspired by the natural immune system have been proposed for other applications, different from the deep learning application considered in this paper. Among these, artificial immune system (AIS) approaches [35] have been used to defend against wormhole attacks on mobile ad hoc networks [23], flooding attacks on software-defined networks [24], and denial of service attacks on the internet [25]. However, the closest point of tangency to our RAILS approach is recent work that borrows concepts from the innate immune system to detect adversarial examples in DNNs. The innate immune system, also known as the non-specific immune system, is nature's first line of defense that launches an immediate non-specific response to contain the pathogen using chemical, cellular, and extracellular mechanisms to prevent pathogen mobility and spread. Mechanisms of innate immunity that have been emulated in machine learning include negative selection algorithm approaches [26] and cellular mechanisms for early pathogen recognition [27]. Different from the innate immune system, the response of the adaptive immune system is long-lasting, specific, and sustained, using clonal expansion to produce B-cell and T-cell lymphocytes having antigen receptors specific to the pathogen. To the best of our knowledge, the proposed RAILS adversarial defense framework is the first to be based on the complex biology of the adaptive immune system.
Another line of research relevant to ours is adversarial transfer learning [36], [37], which aims to maintain robustness when there is covariate shift between training data and test data. We remark that covariate shift is naturally handled by the mutation mechanism in our adaptive immune system emulation of RAILS, which adapts the defense to novel attacks.

B. LEARNING STRATEGIES OF IMMUNE SYSTEM
Systems robustness is a property that must be intentionally designed into the architecture, and one of the greatest examples of this is the mammalian adaptive immune system [38]. The adaptive immune system is incredibly complex and not something that we can hope to replicate at this time. However, we can simplify its robust learning process into four steps: sensing, flocking, affinity maturation, and consensus (Figure 2, left) [39], [40]. The architecture of the adaptive immune system ensures a robust response to foreign antigens, splitting the work between active sensing and competitive growth to produce an effective antibody. Sensing of a foreign attack leads to antigen-specific B-cells flocking to temporary structures for affinity maturation [33]. In the affinity maturation phase, a diverse initial set of antigen-specific B-cells divides to populate the temporary structures. The genetic identity of each B-cell is encoded by the shape and sequence of its protein, which can change from generation to generation. The degree to which the encoding of the B-cell matches the antigen is called the affinity. The B-cells with the highest affinity to the antigen are selected to divide and mutate, which leads to new B-cells with higher affinity to the antigen [41]. B-cells that reach consensus, or achieve a threshold affinity against the foreign antigen, undergo terminal differentiation into plasma B-cells. Plasma B-cells represent the antigen-specific solutions. Memory B-cells are selected and stored to defend against similar attacks in the future.

C. NOTATION AND PRELIMINARIES
Given a mapping $f : \mathbb{R}^d \to \mathbb{R}^d$ and $x_1, x_2 \in \mathbb{R}^d$, we first define the affinity score $A(f; x_1, x_2)$ between $x_1$ and $x_2$. This affinity score can be defined in many ways, e.g., cosine similarity, inner product, or inverse $\ell_p$ distance, but here we use $A(f; x_1, x_2) = -\|f(x_1) - f(x_2)\|_2$, the negative Euclidean distance. In the context of DNNs, $f$ denotes the feature mapping from input to feature representation, and $A$ measures the similarity between two inputs. A higher affinity score indicates higher similarity.
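As a minimal sketch of this definition, the snippet below assumes the affinity is the negative Euclidean distance between feature representations, so that a higher score means higher similarity. The linear feature map `W` is a hypothetical stand-in for a DNN layer and is not taken from the paper.

```python
import numpy as np

def affinity(f, x1, x2):
    """Affinity A(f; x1, x2), assumed here to be -||f(x1) - f(x2)||_2."""
    return -float(np.linalg.norm(f(x1) - f(x2)))

# Hypothetical feature map for illustration only: a fixed linear projection
# that keeps the first two coordinates and ignores the third.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
f = lambda x: W @ x

x = np.array([0.0, 0.0, 0.0])
near = np.array([0.1, 0.0, 5.0])  # differs mostly in the coordinate f ignores
far = np.array([3.0, 4.0, 0.0])   # far from x in feature space

a_near = affinity(f, x, near)
a_far = affinity(f, x, far)
print(a_near, a_far)
```

Because the score is computed in feature space, two inputs that differ only in directions the feature map discards still receive a high affinity.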

II. RAILS: OVERVIEW
In this section, we give an overview of RAILS and provide a comparison to the natural immune system. The architecture of RAILS is illustrated in Figure 3. For each selected hidden layer $l$, RAILS builds class-wise kNN architectures on training samples $D_{tr}$. Then, for each test input $x$, a population of candidates is selected and goes through an evolutionary optimization process to obtain the optimal solution. In RAILS, two types of data are obtained after the evolutionary optimization: 'plasma data' for optimal predictions of the present inputs, and 'memory data' for the defense against future attacks. These two types of data correspond to plasma B cells and memory B cells in the biological system, and play important roles in static learning and adaptive learning, respectively. For each test input, a special type of data, plasma data, is generated by evolutionary optimization and contributes to predicting the class of the input (test) sample. Another type of data, memory data, is generated and stored to help defend against similar attacks in the future. Plasma data and memory data are analogous to plasma B cells and memory B cells in the immune system.

A. DEFENSE WITH STATIC LEARNING
Adversarial perturbations can severely affect deep classifiers, forcing the predictions to be dominated by adversarial classes rather than the ground truth. For example, a single kNN classifier is vulnerable to adversarial inputs, as shown in Figure 4. The purpose of static learning is to address this issue, i.e., to maintain or increase the prediction probability of the ground truth, $p^{l}_{y_{\text{true}}(x)}(x)$, when the input $x$ is manipulated by an adversary. The key components include (i) a label initialization via class-wise kNN search on $D_{tr}$ that guarantees labels across different classes are uniformly distributed for each input; and (ii) an evolutionary data-label optimization that promotes label distribution specificity towards the input's true label class. Our hypothesis is that the covariate shift of the adversarial examples from the distribution of the ground truth class is small in the input space, and therefore new examples inherited from parents of the ground truth class $y_{\text{true}}$ have a higher chance of reaching high affinity. The evolutionary optimization thus promotes the label specificity towards the ground truth. The solution consists of the data-label pairs of examples with high affinity to the input, which we call plasma data. After the process, a majority vote of plasma data is used to make the class prediction. We refer readers to Section III for more implementation details and Section IV-A for visualization. In short, static learning defenses seek to correct the predictions of current adversarial inputs and do not plan ahead for future attacks.

B. DEFENSE WITH ADAPTIVE LEARNING
Different from static learning, RAILS adaptive learning tries to use information from past attacks to harden the classifier to defend against future attacks. Hardening is done by leveraging another set of data, memory data, generated during evolutionary optimization. Unlike plasma data, memory data is selected from examples with moderate affinity to the input, which can rapidly adapt to new variants of the adversarial examples. This selection logic is based on maximizing coverage over the future attack space rather than optimizing for the current attack. Adaptive learning is a life-long learning process and can use hindsight to greatly enhance the resilience of $p^{l}_{y_{\text{true}}(x)}(x)$ to attacks. This paper will focus on static learning and single-stage adaptive learning that implements a single cycle of classifier hardening.

C. A BIOLOGICAL PERSPECTIVE
RAILS is inspired by and closely associated with the biological immune system. The architecture of the adaptive immune system ensures a robust response to foreign antigens to produce an effective antibody. Figure 2 displays a comparison between the immune system workflow and the RAILS workflow. Both systems are composed of a four-step process. For example, RAILS emulates flocking from the immune system by initializing a population of candidates that provide diversity, and emulates affinity maturation via an evolutionary optimization process to promote specificity. Similar to the functions of plasma B cells and memory B cells generated in the immune system, RAILS generates plasma data for predictions of the present inputs (the immune system defends against antigens by generating plasma B cells) and generates memory data for the defense against future attacks (the immune system continuously increases its degree of robustness by generating memory B cells). We refer readers to Appendix A for a table of correspondences between RAILS operands and mechanisms in the immune system. In addition, the learning patterns of RAILS and the immune system are closely aligned, as shown in Figure 7.

III. DETAILS ON RAILS WORKFLOW
Algorithm 1 shows the four-step workflow of RAILS. We explain each step below.

a: Sensing
This step performs an initial discrimination between adversarial and benign inputs to prevent the RAILS computation from becoming overwhelmed by false positives, i.e., the main steps of RAILS are only implemented on suspicious inputs. While there are many outlier detection procedures that could be used for this step [9], [42], we can exploit the fact that the DNN and the kNN applied on hidden layers will tend to make similar class predictions for benign inputs. Thus we propose using a cross-entropy measure to generate an adversarial threat score for each input $x$. In the main text results, we skip the sensing stage since the major benefit of sensing is to provide an initial detection. We refer readers to Appendix E for more details.
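The idea behind the threat score can be sketched as follows. This is only an illustration under stated assumptions: it takes the score to be the cross-entropy between a kNN label distribution from a hidden layer and the DNN class probabilities, whereas the exact definition used by RAILS is given in Appendix E. The function name `threat_score` and the example probabilities are hypothetical.

```python
import numpy as np

def threat_score(dnn_probs, knn_probs, eps=1e-12):
    """Cross-entropy -sum_c knn[c] * log(dnn[c]) (assumed form of the score).

    dnn_probs: class probabilities predicted by the DNN.
    knn_probs: label distribution of the nearest neighbors in a hidden layer.
    """
    dnn_probs = np.clip(np.asarray(dnn_probs, dtype=float), eps, 1.0)
    knn_probs = np.asarray(knn_probs, dtype=float)
    return float(-np.sum(knn_probs * np.log(dnn_probs)))

# DNN and kNN agree on class 0: low score, the input looks benign.
agree = threat_score([0.9, 0.05, 0.05], [1.0, 0.0, 0.0])
# DNN predicts class 0 but the neighbors are all class 1: high score.
disagree = threat_score([0.9, 0.05, 0.05], [0.0, 1.0, 0.0])
print(agree, disagree)
```

An input would then be passed to the later RAILS stages only if its score exceeds a chosen threshold; the threshold value is a design choice not specified here.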

b: Flocking
The initial population from each class needs to be selected with a certain degree of affinity, measured using the hidden representations, in order to satisfy our hypothesis, as illustrated in Figure 7. By constructing a class-wise k-Nearest Neighbor (kNN) architecture, we find the kNN that have the highest initial affinity score to the input data from each class and each selected layer. Mathematically, we select

$$N_l^c = \{ (x_i^c, y_c) \in D_c : R_c(i) \le K \}, \quad \forall c \in [C],\ l \in L, \quad (1)$$

where $x$ is the input, $L$ is the set of the selected layers, and $D_c$ is the training data from class $c$ with size $|D_c| = n_c$. Here $x_i^c$ denotes the $i$-th sample of $D_c$ and $y_c$ its label, and $R_c : [n_c] \to [n_c]$ is a ranking function that sorts the indices based on the affinity score $A(f_l; x_i^c, x)$, with the lowest rank assigned to the highest affinity. In the adaptive learning context, if the memory database has been previously populated, flocking will select the nearest neighbors using both the training data and the memory data. The immune system leverages the flocking step to find initial B cells and form temporary structures for affinity maturation [33]. Note that in RAILS, the kNN sets $N_l^c$ are constructed independently for each class, thereby ensuring that every class is fairly represented in the initial population.

c: Affinity maturation (evolutionary optimization)
As flocking brings diversity to the label distribution of the initial population, the affinity maturation step, in contrast, promotes specificity towards the ground truth class. Here we use evolutionary optimization to generate new examples (offspring) from the existing examples (parents) in the population. The evolution happens within each class independently, and newly generated examples from different classes are not affected by one another before the consensus stage. The first-generation parents in each class are the $K$ nearest neighbors found by (1) in the flocking step, where $K$ is the number of nearest neighbors. Given a total population size $TC$, the 0-th generation is obtained by copying each nearest neighbor $T/K$ times with random mutations.
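The class-wise neighbor search in the flocking step can be sketched as below. This is a brute-force illustration for a single layer, assuming the affinity is the negative Euclidean distance in feature space; the function name `flock` and the synthetic features are hypothetical, and a practical implementation would likely use an approximate kNN index.

```python
import numpy as np

def flock(features_by_class, query_feature, k):
    """For each class, return indices of the k points with highest affinity.

    features_by_class: dict mapping class label -> array of shape (n_c, d)
    query_feature:     feature vector of the test input, shape (d,)
    """
    neighbors = {}
    for c, feats in features_by_class.items():
        # Affinity assumed to be the negative Euclidean distance.
        aff = -np.linalg.norm(feats - query_feature, axis=1)
        # Sort indices so that the highest affinity comes first.
        order = np.argsort(-aff, kind="stable")
        neighbors[c] = order[:k]
    return neighbors

rng = np.random.default_rng(0)
# Synthetic features: class c is centered at c in every coordinate.
features_by_class = {c: rng.normal(loc=c, size=(20, 4)) for c in range(3)}
query = np.zeros(4)

neighbors = flock(features_by_class, query, k=5)
label_counts = {c: len(idx) for c, idx in neighbors.items()}
print(label_counts)  # every class contributes the same number of neighbors
```

Searching within each class separately is what makes the initial label distribution uniform, even when the query sits much closer to one class than to the others.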
Given the population of class $c$ in the $(g-1)$-st generation, $P_c^{(g-1)}$, the candidates for the $g$-th generation are selected as

$$\hat{P}_c^{(g-1)} = P_c^{(g-1)} Z_c^{(g-1)}, \quad (2)$$

where $\hat{P}_c^{(g-1)}$ denotes the set of randomly selected reproductions of the population at the previous generation $g-1$. These are computed before applying mutation and crossover operations to populate the $g$-th generation $P_c^{(g)}$. $Z_c^{(g-1)}$ is a binary selection matrix whose columns are independent and identically distributed draws from $\mathrm{Mult}(1, P_c)$, the multinomial distribution with probability vector $P_c \in [0, 1]^T$. The process can also be viewed as creating new nodes from existing nodes in a Preferential Attachment (PA) evolutionary graph [43]; details are given in Appendix D. After selection, RAILS generates new examples through the operations of mutation and cross-over, which will be discussed in more detail later. After new examples are generated, RAILS calculates each example's affinity relative to the input. The new examples are associated with labels that are inherited from their parents, which always come from the same class. According to our hypothesis in Section II, examples inherited from parents of the ground truth class $y_{\text{true}}$ have a higher chance of reaching high affinity, and thereby the population members with high affinity concentrate around the input's true class.
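The selection step described above can be sketched as a matrix product, assuming the population is stored as a matrix with one example per column and the selection matrix has one-hot columns drawn from the multinomial distribution. The names `select_candidates` and `Z`, and the toy population, are illustrative assumptions.

```python
import numpy as np

def select_candidates(population, probs, rng):
    """Resample a population with a binary selection matrix.

    population: array of shape (d, T), one example per column.
    probs:      length-T selection probability vector (sums to 1).
    Returns (candidates, Z), where Z is a T x T binary matrix whose columns
    are i.i.d. one-hot draws from Mult(1, probs), and candidates = population @ Z.
    """
    T = population.shape[1]
    draws = rng.multinomial(1, probs, size=T)  # shape (T, T), each row one-hot
    Z = draws.T                                # make the draws the columns
    return population @ Z, Z

rng = np.random.default_rng(0)
population = np.arange(12.0).reshape(3, 4)     # d = 3, T = 4
probs = np.array([0.1, 0.2, 0.3, 0.4])

candidates, Z = select_candidates(population, probs, rng)
print(Z)
print(candidates)
```

Each column of the result is a copy of one parent column, so parents with a larger selection probability tend to appear more often among the candidates.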

d: Consensus
Consensus is responsible for the final selection and predictions. In this step, RAILS selects generated examples with high-affinity scores to be plasma data, and examples with moderate-affinity scores are saved as memory data. The selection is based on a ranking function:

$$S_\gamma = \{ (\tilde{x}, \tilde{y}) \in P^{(G)} : R_g(\tilde{x}) \le \gamma |P^{(G)}| \}, \quad (3)$$

where $R_g$ is the same ranking function as $R_c$ except that the domain is a set having cardinality equal to that of the final population $P^{(G)}$. $\gamma$ is a proportionality parameter and is selected as 0.05 and 0.25 for plasma data and memory data, respectively. Note that the memory data can be selected in each generation. For simplicity, we select memory data only in the last generation. Memory data will be saved in the secondary database and used for model hardening.
Given that all examples in the population are associated with a label inherited from their parents, RAILS uses majority voting of the plasma data for prediction of the class label of x.
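The consensus step can be sketched as a ranking followed by a vote. The snippet follows the reading in which the top 5% of the final population by affinity forms the plasma data and the top 25% forms the memory data; the rounding rule, the function name `consensus`, and the toy affinities are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def consensus(affinities, labels, gamma_plasma=0.05, gamma_memory=0.25):
    """Select plasma/memory examples by affinity rank and predict by vote."""
    affinities = np.asarray(affinities, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-affinities, kind="stable")  # highest affinity first
    n = len(order)
    # Rounding to the nearest integer (at least 1) is an assumption.
    n_plasma = max(1, int(round(gamma_plasma * n)))
    n_memory = max(1, int(round(gamma_memory * n)))
    plasma_idx = order[:n_plasma]
    memory_idx = order[:n_memory]
    # Majority vote over the labels inherited by the plasma data.
    prediction = Counter(labels[plasma_idx].tolist()).most_common(1)[0][0]
    return prediction, plasma_idx, memory_idx

# Toy final population of 20 examples; index 0 has the highest affinity.
affinities = -np.arange(20, dtype=float)
labels = np.array([2, 2, 1, 1, 0] + [0, 1] * 7 + [2])

prediction, plasma_idx, memory_idx = consensus(affinities, labels)
print(prediction, plasma_idx, memory_idx)
```

Since every example carries the label of its parents, the vote only needs the labels of the selected examples, not the examples themselves.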
Algorithm 1
… with feature mapping $f_l(\cdot)$, $l \in L$; affinity function $A$.
First Step: Sensing
1: Check the threat score given by an outlier detection strategy to detect the threat of $x$.
Second Step: Flocking
2: for $c = 1, 2, \ldots, C$ do
3:   In each layer $l \in L$, find the k-nearest neighbors $N_l^c$ of $x$ in $D_c$ by ranking the affinity score.
4: end for
Third Step: Affinity Maturation
5: for each layer $l \in L$ do
6:   Generate P …
…
10:  … given C.
…
14: end for
Fourth Step: Consensus
15: Select the top 5% as plasma data $S_p^l$ and the top 25% as memory data $S_m^l$ based on the affinity scores, $\forall l \in L$; obtain the prediction $y$ of $x$ using the majority vote of the plasma data.
16: Output: $y$, the memory data

The computational cost of RAILS is dominated by the flocking and affinity maturation stages. kNN structure construction in flocking is a fixed setup cost that can be handled off-line with fast approximate kNN search [44], [45]. There are three strategies for reducing the computational cost of the affinity maturation stage. First, the evolutionary optimization can be replaced by a mean field approximation. Second, parallelization can be used to accelerate the computations since each sample is generated and utilized separately. Third, one can use a more stringent false positive threshold in the sensing step, which reduces the number of false positives and hence the downstream computational burden. Further discussion can be found in Appendix D.
f: Operations in the evolutionary optimization.
Three operations support the creation of new examples: selection, cross-over, and mutation. The selection operation is shown in (2). We compute the selection probability for each candidate through a softmax function,

$$P(x_i) = \frac{\exp\left(A(f_l; x_i, x)/\tau\right)}{\sum_{x_j \in S} \exp\left(A(f_l; x_j, x)/\tau\right)}, \quad (4)$$

where $S$ is the set of data points and $x_i \in S$. $\tau > 0$ is the sampling temperature that controls the sharpness of the softmax operation. Given the selection probability $P$, defined on the current generation in (4), the candidate set for the next generation is randomly drawn (with replacement).
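The effect of the sampling temperature can be illustrated with a short sketch of the softmax selection probabilities followed by sampling with replacement. The affinity values are made up for illustration, and `selection_probabilities` is a hypothetical helper name.

```python
import numpy as np

def selection_probabilities(affinities, tau):
    """Softmax over affinity scores with temperature tau > 0."""
    a = np.asarray(affinities, dtype=float) / tau
    a = a - a.max()           # shift for numerical stability (softmax is shift-invariant)
    w = np.exp(a)
    return w / w.sum()

affinities = np.array([-1.0, -2.0, -4.0])  # illustrative affinity scores

p_sharp = selection_probabilities(affinities, tau=0.5)   # low temperature
p_flat = selection_probabilities(affinities, tau=10.0)   # high temperature
print(p_sharp, p_flat)

# Draw the candidate set for the next generation, with replacement.
rng = np.random.default_rng(0)
picked = rng.choice(len(affinities), size=6, replace=True, p=p_sharp)
print(picked)
```

A small temperature concentrates the probability mass on the highest-affinity examples, while a large temperature approaches uniform sampling and so preserves more diversity.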
The cross-over operator combines two parents $x_c$ and $x'_c$ from the same class, and generates a new offspring $x_{os}$ by randomly selecting each of its elements (pixels) from the corresponding element of either parent. Mathematically,

$$x_{os}^{(i)} \in \{ x_c^{(i)},\ x_c'^{(i)} \} \ \text{(selected at random)}, \quad \forall i \in [d], \quad (5)$$

where $i$ represents the $i$-th entry of the example and $d$ is the dimension of the example. The mutation operation randomly and independently mutates an offspring with probability $\rho$, adding uniformly distributed noise drawn from a bounded range. The resulting perturbation vector is subsequently clipped to satisfy the domain constraint that examples lie in $[0, 1]^d$.
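The two operators can be sketched as follows. Several details are assumptions made only for illustration: each entry is taken from either parent with equal probability, and the mutation noise is uniform on a symmetric interval controlled by a hypothetical parameter `delta`. The probabilities and noise range actually used by RAILS are not reproduced here.

```python
import numpy as np

def crossover(parent_a, parent_b, rng):
    """Copy each entry from one of the two parents.

    Equal probability for either parent is an assumption of this sketch.
    """
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(x, rho, delta, rng):
    """With probability rho, add uniform noise in [-delta, delta], then clip.

    The symmetric noise range and the name delta are assumptions.
    """
    if rng.random() >= rho:
        return x.copy()                      # offspring left unchanged
    noise = rng.uniform(-delta, delta, size=x.shape)
    return np.clip(x + noise, 0.0, 1.0)      # keep the example in [0, 1]^d

rng = np.random.default_rng(0)
a = np.full(8, 0.2)   # parent 1 (same class as parent 2)
b = np.full(8, 0.9)   # parent 2

child = crossover(a, b, rng)
mutated = mutate(child, rho=1.0, delta=0.3, rng=rng)     # always mutates
unchanged = mutate(child, rho=0.0, delta=0.3, rng=rng)   # never mutates
print(child, mutated)
```

Because both parents come from the same class, the offspring can inherit that class label directly, which is what the consensus vote later relies on.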

IV. EXPERIMENTAL RESULTS
We conduct experiments in the context of image classification using several benchmark datasets. In addition to the PGD attack [11], we also implement seven other attacks (Figure 1 and Table 3).

A. PERFORMANCE IN SINGLE LAYERS
TABLE 1. Comparison of RAILS and DkNN in single layers (input, Conv1, and Conv2) on MNIST; adversarial examples use $\epsilon = 60$.

We first test RAILS in a single layer of the CNN model and compare the obtained accuracy with the results from the DkNN. Table 1 shows the comparisons in the input layer, the first convolutional layer (Conv1), and the second convolutional layer (Conv2) on MNIST. One can see that for both standard accuracy and robust accuracy, RAILS performs better than the DkNN in the hidden layers and also achieves better results in the input layer. The input layer results indicate that RAILS can also outperform supervised learning methods like kNN. The confusion matrices in Figure 5 show that RAILS has fewer incorrect predictions for those data that DkNN gets wrong. Each value in Figure 5 represents the percentage of intersections of RAILS (correct or wrong) and DkNN (correct or wrong).

Picking the top 5% of data points with the highest affinity in each generation, Figure 6 shows the evolution, over ten generations of RAILS samples, of the population (B-cell) proportion and the (exponentiated) affinity relative to two clean (non-adversarial) input examples taken from CIFAR-10. RAILS makes the correct "bird" predictions while the DkNN makes incorrect predictions for both examples. The second column depicts the proportion of the true class in the selected population of each generation. Data from the true class occupies the majority of the population as the generation number increases, which indicates that RAILS can obtain a correct prediction and a high confidence score simultaneously. Meanwhile, affinity maturation over multiple generations yields increasing affinity within the true class, as shown in the third column.

To visualize changes in feature distribution during the affinity maturation stage, we show in Figure 8 the two-dimensional t-distributed stochastic neighbor embedding (t-SNE) of the feature representations of adversarial CIFAR-10 inputs (antigens) and the associated populations (B-cells).
The features shown in the figure are those of convolutional layer three, and are representative of the feature behavior at other layers. As shown in Figure 8, the antigen is misclassified and B-cells are uniformly spread over the feature space at the beginning of the affinity maturation.
As the affinity maturation process progresses, the antigen's ground truth class B-cell population (colored in blue) converges to a cluster that covers the antigen.

b: In-vitro B-cell experiment confirms RAILS emulation
To demonstrate that the proposed RAILS computational system captures important properties of the actual (in-vitro) immune system, we compare the learning curve of RAILS to the learning curve of B-cell antigen recognition (see Appendix C for a description of the biological experiment we performed). For the biological experiment, the affinity between a population of actual B-cells and an antigen is measured experimentally over time (several hours). For RAILS, each test input (potentially an adversarial example) is treated as an antigen and the affinity is computed as RAILS iterates over time. Figure 7 shows that both the in-vitro immune system and RAILS have similar learning patterns. One can also see that the affinity increases again after the decrease, indicating that both the immune system and RAILS can escape from a local optimum under strong attacks. The difference between the green and red curves is that the initial population for the red curve is found based on another test input (antigen), which has lower correlation to the current input (antigen). The non-convergence of the red curve indicates that the initial population should be selected close to the input, and that flocking using kNN search emulates the natural flocking process. We refer readers to Appendix C for more details.

… (7) Adversarial Patch (Adv-P) [29], an attack with unified perturbations across different inputs; and (8) a (customized) ASK-Attack that is directly applied on the flocking step [32]. We refer readers to Appendix B for details of the threat models. The results of RAILS defending against these attacks can be viewed in Figure 1 and Table 3.

FIGURE 8. The random initialization of the population displays the B-cells as uniformly distributed over feature space. After six generations the affinity maturation process produces B-cells that cluster around the antigen and correctly identify its true class.

We test RAILS against disturbances visible to the naked eye using CIFAR-10 data.
We consider the $\ell_\infty$-PGD attack with $\epsilon = 28$. Figure 9 shows the benign examples and their adversarial counterparts with large disturbances. The differences can be clearly observed. Under the human-perceptible attack, the accuracies of RAILS, DkNN, and the adversarially trained CNN are 33.26%, 19.53%, and 0%, respectively. The results demonstrate that RAILS defends against human-perceptible perturbations more effectively than DkNN and CNN.

The DkNN finds a group of feature-space k-nearest neighbors that, at each layer, classify an input sample in a single shot. In contrast, starting from an initial uniform label distribution at each layer, RAILS constructs a classifier after maturation of several generations of feature-representing B-cells using an evolutionary optimization process. Results in Table 2 show that evolving a population of features from highly diverse to highly specific provides additional robustness with little sacrifice in benign accuracy.

e: Ablation study.
Using the CIFAR-10 dataset and the third convolutional layer of a VGG16 model, we perform an ablation study to clarify the relative influence of different RAILS components on performance. Our findings are summarized as follows: (i) increasing the number of nearest neighbors within a certain range improves performance; (ii) a higher mutation probability increases robust accuracy; (iii) the magnitude of mutation is sensitive to the input data, but may be optimized to increase robust accuracy. We refer readers to Appendix F for more details on the ablation study. We also show that when we turn off the affinity maturation stage, the robust accuracy drops from 59.2% to 55.65% (on 2000 test examples), indicating the importance of including the affinity maturation step in RAILS.
f: Single-Stage Adaptive Learning.
In the previous sections we demonstrated that static learning is effective in predicting the class of current adversarial inputs. Here we show that RAILS can be implemented with single-stage adaptive learning (SSAL) to further improve accuracy and robustness. While the idea is not pursued in this paper, our SSAL results suggest that RAILS may be gainfully extended to the on-line learning setting. SSAL is implemented as follows. We first train a RAILS classifier on the training data as described in previous sections. We then use RAILS to generate 3000 memory data (B-cells) by feeding a subset of the MNIST test data to the initially trained RAILS. We merge this new data with the population of training data, creating an augmented training set. Finally, we randomly select and adversarially modify another 1000 MNIST test samples and evaluate the adversarial classification accuracy of RAILS with its expanded training data. Table 5 shows that SSAL improves the RA of DkNN by 2.3% with no SA loss by augmenting the training data with only 3000 memory data samples (a 5% increase in the training data).
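The SSAL data flow described above can be illustrated with a minimal sketch. The helper names (`collect_memory`, `augment_training_set`, `toy_generate_memory`) are hypothetical and are not taken from the RAILS code base; in particular, `generate_memory` merely stands in for a trained RAILS model that emits memory B-cells for an input. The last line checks the 5% figure, assuming the standard 60000-image MNIST training set.

```python
def collect_memory(adaptation_inputs, generate_memory, budget):
    """Gather up to `budget` memory cells produced while classifying inputs.

    `generate_memory` is a hypothetical callable standing in for a trained
    RAILS model: it maps one input to a list of (feature, label) memory cells.
    """
    memory = []
    for x in adaptation_inputs:
        for cell in generate_memory(x):
            if len(memory) >= budget:
                return memory
            memory.append(cell)
    return memory


def augment_training_set(train_set, memory):
    """Merge the memory cells into the training population (single stage)."""
    return list(train_set) + list(memory)


# Toy stand-in (assumption): each input yields two memory cells with its label.
def toy_generate_memory(x):
    value, label = x
    return [(value + 0.01, label), (value - 0.01, label)]


toy_train = [(0.1, 0), (0.9, 1)]
toy_memory = collect_memory([(0.2, 0), (0.8, 1)], toy_generate_memory, budget=3)
toy_augmented = augment_training_set(toy_train, toy_memory)

# Relative growth reported in the text: 3000 memory samples added to the
# 60000 MNIST training images (the 60000 figure is the standard MNIST split).
mnist_increase = 3000 / 60000
```

The sketch performs the augmentation once, which is what distinguishes the single-stage setting from a fully on-line one in which memory would be collected and merged repeatedly.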

V. CONCLUSION
Inspired by the immune system, we proposed a new defense framework for deep learning models. The proposed Robust Adversarial Immune-inspired Learning System (RAILS) has a one-to-one mapping to a simplified immune system architecture, and its learning behavior aligns with in-vitro biological experiments. RAILS incorporates static learning and adaptive learning, which contribute to robustification of predictions and dynamic model hardening, respectively. The experimental results demonstrate the effectiveness of RAILS. We believe this work is fundamental and delivers valuable principles for designing robust deep models. In future work, we will dig deeper into the mechanisms of the immune system's adaptive learning (life-long learning) and covariate shift adjustment, which will be consolidated into our computational framework.
Table 6 provides a detailed comparison between the Immune System and RAILS. The top part shows the detailed explanations of some technical terms. The bottom part shows the four-step process of the two systems.

C. PARAMETER SELECTION
By default, we set the size of the population T = 100 and the mutation probability ρ = 0.15. In Figure 6, we set T = 100 to obtain a better visualization. The maximum number of generations is set to G = 50 for MNIST, and G = 10 for CIFAR-10 and SVHN. When the model is large, selecting all the layers would slow down the algorithm. We use all four layers for MNIST. For CIFAR-10 and SVHN, we test on a few (20) validation examples and evaluate the kNN standard accuracy (SA) and robust accuracy (RA) on each layer. We then select layer three and layer four with SA and RA over 40%.
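The layer-selection rule above (keep the layers whose kNN SA and RA on the validation examples both exceed 40%) can be written as a short filter. This is only an illustrative sketch: the function name and the per-layer scores below are hypothetical placeholders, not measured values from the paper.

```python
def select_layers(layer_scores, threshold=0.40):
    """Return layers whose kNN standard accuracy (SA) and robust accuracy (RA)
    on the validation examples both exceed `threshold`."""
    return [
        layer
        for layer, (sa, ra) in sorted(layer_scores.items())
        if sa > threshold and ra > threshold
    ]


# Hypothetical per-layer (SA, RA) values, for illustration only.
example_scores = {1: (0.35, 0.10), 2: (0.45, 0.30), 3: (0.60, 0.45), 4: (0.70, 0.50)}
selected = select_layers(example_scores)
```

With these placeholder scores the filter keeps layers three and four, mirroring the selection reported in the text.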
a: Mutation range.
The choice of mutation range depends on the dataset. For MNIST, whose features are well separated in the input space, the upper bound of the mutation range can be set to a relatively large value. For datasets that are low-resolution and sensitive to small perturbations, we should set a small upper bound. We also expect the mutation to bring enough diversity into the process, so we also set a lower bound on the mutation range. We set the mutation range parameters to $\delta_{\min} = 0.05$ (12.75) and $\delta_{\max} = 0.15$ (38.25) for MNIST, where the values in parentheses are on the 0-255 pixel scale. Since CIFAR-10 and SVHN are more sensitive to small perturbations, we set $\delta_{\min} = 0.005$ (1.275) and $\delta_{\max} = 0.015$ (3.825) for these datasets.
b: Sampling temperature.
The sampling temperature $\tau$ controls the sharpness of the softmax operation. The principle for selecting $\tau$ is to make sure that the high-affinity examples in one class do not dominate the affinity of the whole population at the beginning. We thus select $\tau$ to make sure that the top 5% of examples are not all from the same class. We find that our method works well over a wide range of $\tau$ once this principle is satisfied. For MNIST, the sampling temperature $\tau$ in each layer is set to 3, 18, 18, and 72. Similarly, we set $\tau = 1/10$ and $\tau = 300$ for the selected layers for CIFAR-10 and SVHN, respectively.
c: The hardware and our code.
We apply RAILS on one Tesla V100 with 64GB memory and 2 cores. The code is written in PyTorch.
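To make two of the parameter choices in this section concrete, the sketch below (i) converts mutation bounds from the [0, 1] scale to the 0-255 pixel scale and (ii) shows how a sampling temperature changes the share of selection probability held by each class. It is written under stated assumptions: the softmax is taken over affinity divided by $\tau$, and "dominance" is measured as per-class probability mass. Both are illustrative choices, and the function names are hypothetical.

```python
import math


def to_pixel_scale(delta, max_value=255):
    """Convert a perturbation bound on the [0, 1] scale to the 0-255 scale."""
    return delta * max_value


def selection_probabilities(affinities, tau):
    """Softmax over affinities with temperature tau.

    Dividing by tau is an assumption of this sketch; a larger tau then gives
    a flatter (more diverse) distribution.
    """
    scaled = [a / tau for a in affinities]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def class_share(probabilities, labels):
    """Total selection probability held by each class."""
    share = {}
    for p, y in zip(probabilities, labels):
        share[y] = share.get(y, 0.0) + p
    return share


# Hypothetical affinities: one class-0 exemplar is much closer to the input.
affinities = [5.0, 1.0, 1.0, 1.0]
labels = [0, 1, 1, 2]
sharp = class_share(selection_probabilities(affinities, tau=0.5), labels)
flat = class_share(selection_probabilities(affinities, tau=10.0), labels)
```

In this toy case the low temperature lets class 0 absorb nearly all of the selection probability, whereas the high temperature leaves most of the mass with the other classes, which is the behavior the selection principle for $\tau$ aims for at the start of the evolution.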

A. RAILS MIMICS THE BIOLOGICAL LEARNING CURVE
To demonstrate that the proposed RAILS computational system captures important properties of the immune system, we compare the learning curves of the two systems in Figure 7. In RAILS, we treat test data (potentially an adversarial example) as an antigen. Affinity in both systems measures the similarity between the antigen sequence and a potential matching sequence. The green and red curves depict the evolution of the mean affinity between the B-cell population and the antigen. The candidates selected in the flocking step (kNN) are close to a particular antigen. Two tests are performed to illustrate the learning curves when the same antigen (antigen 1) or a very different antigen (antigen 2) is presented during the affinity maturation step. When antigen 1 is presented (green curves), Figure 7 shows that both the immune system (left panel) and RAILS (right panel) have learning curves that initially increase, then decrease, and then increase again. This phenomenon indicates a two-phase learning process in which both systems can escape from local optima. On the other hand, when the different antigen 2 is presented, the flocking candidates converge more slowly to a high-affinity population during the affinity maturation process (dashed curves).

B. IN-VITRO IMMUNE RESPONSE EXPERIMENT
a: Details on in-vitro experiments in Figure 7.
We performed in-vitro experiments to evaluate the adaptive immune responses of mice to foreign antigens. These mice are engineered in a way which allows us to image their B cells during affinity maturation in an in-vitro culture. Using fluorescence of B cells, we can determine whether the adaptive immune system is effectively responding to an antigen, and infer the affinity of B cells to the antigen. In this experiment, we first immunized a mouse using lysozyme (Antigen 1). We then challenged the immune system in two ways: (1) reintroducing lysozyme and (2) introducing another very different antigen, ovalbumin (Antigen 2). We then measured the fluorescence of B cells in an in-vitro culture for each of these antigens, which we use as a proxy to estimate affinity. We use five fluorescence measurements over ten days to generate the affinity curves in Figure 7 (left). When plotting, we use a spline interpolation in MATLAB to smooth the affinity curves. For full experimental details, please refer to Figures 10, 11, and the following three sections.
In-vitro culture of Brainbow B cells. For the in-vitro culture of B cells, splenic lymphocytes from Rosa26Confetti+/+; AicdaCreERT2+/- mice were harvested and cultured following the protocol from [50]. Mice were individually immunized with lysozyme and ovalbumin. Three days post-immunization, the mice were orally administered Tamoxifen (50 µl of 20 mg/ml in corn oil) and left for three days to activate the Cre-induced expression of confetti colors in germinal center B cells. Six days post-immunization, whole lymphocytes from the spleen were isolated. $3\times10^5$ whole lymphocytes from the spleen were seeded into a single well of a 96-well dish along with $3\times10^4$ dendritic cells derived from bone marrow hematopoietic stem cells for in-vitro culture. The co-culture was grown in RPMI medium containing methyl cellulose (R&D Systems, MN) supplemented with recombinant IL-4 (10 ng/ml), LPS (1 µg/ml), 50 µM 2-mercaptoethanol, 15% heat-inactivated fetal calf serum, ovalbumin (10 µg/ml) for ovalbumin-specific B cells, and hen egg white lysozyme (10 µg/ml) for lysozyme-specific B cells. Antigens were also added vice versa as a nonspecific antigen control. The media was changed every two days. The cultures were imaged every day for 14 days.
Preparation of differentiated dendritic cells from bone marrow hematopoietic stem cells. Bone marrow cells from the femurs and tibiae of C57BL/6 mice were harvested, washed, and suspended in RPMI media containing GM-CSF (20 ng/ml; R&D Systems, MN), 2 mM L-glutamine, 50 µM 2-mercaptoethanol, and 10% heat-inactivated fetal calf serum. On days two and four after preparation, 2 mL of fresh complete medium with GM-CSF (20 ng/ml) was added to the cells. The differentiation of hematopoietic stem cells into immature dendritic cells was completed at day seven.
Cell imaging. Confocal images shown in Figure 11 were acquired using a Zeiss LSM 710. The Brainbow 3.1 fluorescence was collected at 463-500 nm in Channel 1 for ECFP (excited by 458 laser), 416-727 nm in Channel 2 for EGFP and EYFP (excited by 488 and 514 lasers, respectively), and 599-753 nm in Channel 3 for mRFP (excited by 594 laser). Images were obtained with 20× magnification.
We first apply RAILS to Antigen 1 ($A_1$) and obtain the average affinity of the true class as well as the initial B-cells, i.e., the nearest neighbors from all classes. The affinity vs. generation curve is shown as the green line in the right panel of Figure 7. One can clearly see the learning pattern, and the solution is finally reached with a high affinity. Then we apply the initial B-cells obtained from $A_1$ to Antigen 2 ($A_2$). The results show that $A_2$ cannot reach the solution using the given initial B-cells, as shown by the red line in the right panel of Figure 7.
The RAILS prototype implemented in this paper has a relatively high computational cost, primarily due to the need to generate and select generations of in-silico B-cells using the genetic algorithm. Specifically, the average prediction time per sample is less than 0.1 sec on CIFAR-10 with population size 100 and 20 generations. Note that all of our reported RAILS experiments were performed on a single GPU. RAILS can be dramatically accelerated by using multiple GPUs. We are currently investigating fast approximations to the genetic algorithm solution used by our prototype RAILS implementation.

b: Relations to preferential attachment
The process of selection can also be viewed as creating new nodes from existing nodes in a Preferential Attachment (PA) evolutionary graph generation process [43], where the probability of a new node linking to node $i$ is $\Pi(k_i) = k_i / \sum_j k_j$ and $k_i$ is the degree of node $i$. In PA models new nodes prefer to attach to existing nodes having high vertex degree.
antigens produce similar adaptive immune responses for those mice previously immunized with either antigen while they produce no adaptive immune response for mice that have not been immunized. The images show clearly that proliferation of B-cells in the adaptive immune response is strongest when the lymphocytes are re-exposed to the same antigen as in the immunization but still elicits an adaptive response when exposed to a similar but non-identical antigen. The decrease in adaptive response is inversely proportional to the similarity (affinity) between the antigens.
In RAILS, we use a surrogate for the degree, which is the exponentiated affinity measure, and the offspring are generated by parents having high degree.
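The analogy with preferential attachment can be sketched in code: each candidate's "degree" is replaced by its exponentiated affinity, and parents are drawn with probability proportional to that weight. The function names, the temperature argument, and the pairing of parents are assumptions made for illustration; they are not taken from the RAILS implementation.

```python
import math
import random


def selection_weights(affinities, tau=1.0):
    """Exponentiated affinity used as a surrogate for node degree.

    The maximum is subtracted before exponentiation for numerical stability;
    this rescales all weights by a common factor and leaves the sampling
    probabilities unchanged.
    """
    top = max(affinities)
    return [math.exp((a - top) / tau) for a in affinities]


def sample_parents(population, affinities, n_pairs, tau=1.0, rng=None):
    """Draw parent pairs with probability proportional to the surrogate degree."""
    rng = rng or random.Random(0)
    weights = selection_weights(affinities, tau)
    return [tuple(rng.choices(population, weights=weights, k=2)) for _ in range(n_pairs)]


# Hypothetical population of three exemplars with decreasing affinity.
pairs = sample_parents(["a", "b", "c"], [3.0, 2.0, 0.0], n_pairs=1000)
flat_draws = [p for pair in pairs for p in pair]
count_a = flat_draws.count("a")
count_c = flat_draws.count("c")
```

As in a PA graph, where high-degree nodes attract most new links, the high-affinity exemplar `"a"` is chosen as a parent far more often than the low-affinity exemplar `"c"`.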
c: Early stopping criterion.
Considering the fast convergence of RAILS, one practical early stopping criterion is to check whether a single class occupies most of the high-affinity population over multiple generations, e.g., by checking the top 5% of the population ranked by affinity. We empirically find that it takes fewer than 5 iterations to converge for most of the inputs from MNIST (CIFAR-10).
RAILS is a general framework that can be applied to any model. In particular, RAILS does not compete with robust training: it can improve the robustness of any model, including robustly trained ones, and it reaches higher robustness when built on a robustly trained model.
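As an illustration of the early stopping criterion in item c, the sketch below checks whether one class holds a majority of the top 5% highest-affinity individuals and stops once the same class has dominated for several consecutive generations. The majority threshold (`majority=0.5`) and the number of consecutive generations (`patience=2`) are assumptions chosen for the example; the text only specifies "most of" and "multiple generations".

```python
from collections import Counter


def dominant_class(affinities, labels, top_fraction=0.05, majority=0.5):
    """Return the class holding more than `majority` of the top-affinity
    individuals, or None if no class dominates."""
    k = max(1, int(len(affinities) * top_fraction))
    order = sorted(range(len(affinities)), key=lambda i: affinities[i], reverse=True)
    counts = Counter(labels[i] for i in order[:k])
    label, count = counts.most_common(1)[0]
    return label if count / k > majority else None


def should_stop(history, patience=2):
    """Stop when the same class has dominated the last `patience` generations."""
    if len(history) < patience:
        return False
    recent = history[-patience:]
    return recent[0] is not None and all(d == recent[0] for d in recent)


# Hypothetical generation: the five highest-affinity individuals share label 7.
gen_affinities = list(range(100))
gen_labels = [1] * 95 + [7] * 5
winner = dominant_class(gen_affinities, gen_labels)
```

In a generation loop, one would append `dominant_class(...)` to a history list after each generation and break out of the loop as soon as `should_stop(history)` returns `True`.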

APPENDIX E A SIMPLE SENSING STRATEGY
Sensing in the immune system aims to distinguish self from non-self, while RAILS leverages sensing to provide initial detection of adversarial examples. Sensing can also prevent the RAILS computation from becoming overwhelmed by false positives, i.e., benign examples recognized as adversarial examples. Once the input is detected as benign, there is no need to go through the subsequent process, and the neural network can directly produce the prediction. The sensing step provides the initial discrimination between adversarial and benign inputs, and we develop a simple strategy here.
The assumption we make here is that, compared with adversarial examples, benign examples have more consistency between the features learned in a shallow layer and the DNN prediction. This consistency can be measured by the cross-entropy between the DNN prediction and the layer-$l$ kNN predicted class probability scores. Specifically, we compute a cross-entropy (adversarial threat score) $ce(x)$ for each input $x$ from $F_c(x)$ and $v^l_c$, where $F_c$ denotes the neural network prediction score of the $c$-th class. The prediction score is obtained by feeding the output of the neural network to a softmax operation. $v^l_c$ is the $c$-th entry of the normalized kNN vector in layer $l$, which is defined as
$$v^l_c = \frac{|Q^l \cap D_c|}{k},$$
where $D_c$ represents the training data belonging to class $c$, and $Q^l$ denotes the $k$-nearest neighbors of $x$ in all classes, found by ranking the affinity score $A(f_l; x_j, x)$.
Note that we care more about TPR than FPR, since detecting a benign example as an adversarial example has no side effect on RAILS accuracy. Our goal is to select a relatively low FPR while still maintaining a high TPR. For example, we could keep a 95% TPR with 56% FPR using a threshold of 0.4 in layer three.
The details about the sensing algorithm are shown in Algorithm 2. We will select a threshold κ such that x is treated as benign example (adversarial example) if ce(x) ≤ κ (ce(x) > κ).
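For a single layer, the sensing rule can be sketched as follows. The sketch rests on two assumptions that are not fixed by the text here: the cross-entropy is taken as $-\sum_c v^l_c \log F_c(x)$ (the opposite ordering of the two distributions is equally conceivable), and a small constant guards against $\log 0$. All function names are hypothetical.

```python
import math


def knn_class_vector(neighbor_labels, num_classes):
    """Normalized kNN vector: fraction of the k neighbors in each class."""
    k = len(neighbor_labels)
    counts = [0] * num_classes
    for y in neighbor_labels:
        counts[y] += 1
    return [c / k for c in counts]


def threat_score(knn_vector, dnn_probs, eps=1e-12):
    """Cross-entropy between the kNN vector and the DNN softmax scores.

    The direction -sum_c v_c * log(F_c) is an assumption of this sketch.
    """
    return -sum(v * math.log(max(p, eps)) for v, p in zip(knn_vector, dnn_probs))


def sense(neighbor_labels, dnn_probs, kappa):
    """Flag x as adversarial if the threat score exceeds kappa; otherwise
    return the DNN prediction directly."""
    v = knn_class_vector(neighbor_labels, len(dnn_probs))
    ce = threat_score(v, dnn_probs)
    if ce > kappa:
        return {"is_adv": True, "score": ce, "prediction": None}
    prediction = max(range(len(dnn_probs)), key=lambda c: dnn_probs[c])
    return {"is_adv": False, "score": ce, "prediction": prediction}


# Hypothetical inputs: the DNN favors class 0 in both cases.
consistent = sense([0, 0, 0, 0, 0], [0.9, 0.05, 0.05], kappa=0.4)
inconsistent = sense([1, 1, 1, 1, 0], [0.9, 0.05, 0.05], kappa=0.4)
```

When the neighbors agree with the DNN, the score stays below the threshold and the network prediction is returned; when most neighbors belong to a different class, the score exceeds the threshold and the input would be passed on to the full RAILS procedure.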

Algorithm 2 Sensing: Adversarial Example Detection
Input: Test data point $x$; training dataset $D_{tr} = \{D_1, D_2, \cdots, D_C\}$; number of classes $C$; model $F$ with feature mapping $F_l(\cdot)$ in layer $l$, $l \in L_s$; affinity function $A$; a preset threshold $\kappa$
for all layers $l \in L_s$ in parallel do
  Find the $k$-nearest neighbors $Q^l$ of $x$ in all classes by ranking the affinity score $A(f_l; x_j, x)$.
  Obtain the normalized vector $v^l = (r_1, r_2, \cdots, r_C)/k$, $r_c = |\{x \mid x \in Q^l \cap D_c\}|$.
  Obtain the softmax prediction vector $F(x)$.
  Calculate the cross-entropy $ce(x)$.
end for
if $ce(x) > \kappa$ then
  $x$ is a potential adversarial example and return IsAdv = 1.
else
  $x$ is benign and return the prediction $\arg\max_c F_c(x)$
end if
We then select $L_s$ to only include layer three, and apply the threshold of 0.4 in the sensing step on CIFAR-10. The results show that the false positive rate can be reduced by 40%, while the SA remains the same and the RA decreases by only 0.2%.
Figure 14 provides the confusion matrices for benign example classifications and adversarial example classifications in Conv1 and Conv2 when $\epsilon = 60$. The confusion matrices in Figure 14 show that RAILS has fewer incorrect predictions for those data that DkNN gets wrong. Each value in Figure 14 represents the percentage of intersections of RAILS (correct or wrong) and DkNN (correct or wrong).

APPENDIX F ADDITIONAL EXPERIMENTS
A. ADDITIONAL COMPARISONS ON MNIST
In Table 7 Figure 15 shows the confusion matrices of the overall performance when $\epsilon = 60$. The confusion matrices indicate that RAILS' correct predictions agree with a majority of DkNN's correct predictions and disagree with DkNN's wrong predictions. We also show the SA/RA performance of RAILS under the PGD attack and the Fast Gradient Sign Method (FGSM) when $\epsilon = 76.5$. The results in Table 12 indicate that RAILS reaches higher RA than DkNN with similar SA.

B. ADDITIONAL COMPARISONS ON CIFAR-10
In this subsection, we test RAILS on CIFAR-10 under the PGD attack and FGSM with attack strength $\epsilon = 8/16$. The results are shown in Figure 8 and Figure 9. RAILS outperforms DkNN and CNN across attack types and strengths. We also find that the RA gap between RAILS and DkNN widens as $\epsilon$ increases, indicating that RAILS can defend against stronger attacks.
We then conduct experiments with Square Attack, a black-box attack. The results are provided in Table 7 and show that RAILS improves the robust accuracy of DkNN by 3% on CIFAR-10 with $\epsilon = 20$ and $\epsilon = 24$.
Increasing population size improves robustness when N is small, and does not yield significant improvement when N is large with low mutation magnitude and low mutation probability. We present an increased population coefficient κ, where the population T equals κ × (N neighbors). We present two cases: where N neighbors is small and where N neighbors is large. Small N: Increasing κ from one to two improves the robustness. However, further increasing κ does not bring significant improvement. Large N: Note that population size does not have an apparent impact on either standard or robust accuracy when N neighbors is large. This may suggest that the number of perturbed input 'exemplars' does not lead to more robust accuracy on adversarial inputs without sufficient mutation. This is consistent with the core principle in the adaptive immune system that mutation is necessary to converge on optimal solutions.
TABLE 17. Crossover is an important mechanism for improving performance. We observe better performance when we use crossover as opposed to mutation alone during training. We also note that population size alone does not necessarily contribute to better performance for either strategy.

C. DETAILS ON RAILS ABLATION STUDY
For this subsection, RAILS is trained on CIFAR-10 with VGG16 as the classifier. Results are evaluated using model classification accuracy. Accuracy is compared before and after a projected gradient descent (PGD) attack on the training data with $\epsilon = 8/255$. The baseline model performance on benign data (standard accuracy) was 87.26%. After the training data was adversarially attacked, the VGG16 accuracy (robust accuracy) fell to 32.57%. By implementing RAILS, we are able to achieve a robust accuracy of 54.3% using the parameterization described in Table 13. All the experiments are conducted on convolutional layer 3. Each experiment holds these parameters fixed while exploring a range of values for one parameter over independent training regimes. Both standard and robust accuracy are compared for each parameter choice. The purpose of this section is to investigate the sensitivity of RAILS to parameter choices. Details for each experiment are listed in the table captions.