Introduction
Deep learning has made major breakthroughs in recent years, being applied to object detection [48], face recognition [47], autonomous driving [2], and many other real-world applications. However, recent studies have shown that deep neural networks (DNNs) are vulnerable to backdoor attacks [7], [19], [26], [35], [40]. Backdoor attacks manipulate the DNN model to induce targeted or untargeted misclassification in the presence of malicious inputs that contain a specially-designed trigger. Having recognized the substantial threat of backdoor attacks, the Intelligence Advanced Research Projects Activity and the Army Research Office have solicited remedies to defend against backdoor attacks in AI systems1. Unfortunately, backdoor attacks are stealthy since backdoored models behave normally on clean inputs and are only activated by the trigger. With the emergence of invisible-trigger backdoor attacks [31], [45], [51], detecting backdoor attacks becomes even more challenging.
Fig. 1. Overview of SAGE. SAGE is based on the self-attention distillation mechanism, where the model learns from itself by acting as both the teacher and the student. Specifically, SAGE consists of three key modules, i.e., attention representation, loss calculation, and learning rate update. "AT-GEN" denotes attention map generation. C1 and C2 represent the two designated conditions for the learning rate update.
As shown in Table I, many approaches have been proposed to defend against backdoor attacks [14], [32], [37], [39], [59], [65], including detecting and purifying backdoors. Since backdoor attacks rig both the model and the input, it is possible to identify the attack by examining the input sample (data-based detection) [5], [9], [14], [57], [58], [66] or the prediction model (model-based detection) [5], [6], [23], [39]. Detection methods can only detect the presence of backdoors but cannot recover the clean model (they find the bug but cannot fix it). Purification-based methods aim to cleanse the model of the backdoor to restore a usable clean model. Similarly, either the input or the model can be purified to deter backdoor attacks. Data purification erases potential triggers from inputs. Februus [12] surgically removes triggers based on GradCAM and then restores clean samples through a generative adversarial network. Since backdoored inputs occur infrequently, data purification methods may undermine many clean samples and result in relatively low prediction accuracy. Model purification approaches aim to eliminate backdoors in the model, including pruning, fine-tuning, fine-pruning [37], NeuralCleanse (NC) [59], MCR [69], ANP [62], and NAD [32]. Pruning is a model compression technique that eliminates neurons that are inactive on clean samples, and fine-tuning is a transfer learning technique that retrains some or all layers of a model with a small task-specific dataset. Fine-pruning is a combination of pruning (first step) and fine-tuning (second step). Nonetheless, neither pruning nor fine-tuning was originally designed for purifying backdoors, so their defense capacity is greatly compromised by adaptive attacks where the attacker injects backdoors at neurons that are less likely to be pruned or modified by fine-tuning [18], [61]. NC [59] tries to reverse-engineer the trigger by computing the modification needed to divert the classification result to a specific class. A modification that is smaller than a threshold is regarded as the real trigger. The reversed trigger can then be used to identify and prune infected neurons. However, the assumption that the trigger size is small makes NC ineffective against large-trigger attacks. MCR [69] employs mode connectivity in loss landscapes to remove backdoored neural paths from the model with a set of clean data. ANP (adversarial neuron pruning) [62] rectifies the backdoored model by pruning sensitive neurons under adversarial neuron perturbations. Neural Attention Distillation (NAD) [32] is a state-of-the-art defense that fine-tunes the backdoored model to obtain a teacher model, which is then used to supervise the retraining of the backdoored model via attention distillation. However, the fine-tuning process incurs extra overhead, and the teacher model may not be able to correct the deep layers of the backdoored student model (the backdoor trigger affects the deep layers more than the shallow layers). MCR, ANP, and NAD are shown to be ineffective against more advanced backdoor attacks [16], [17].
In this paper, we present an effective model purification defense against backdoor attacks, named SAGE, as shown in Fig. 1. Rather than relying on general-purpose techniques (e.g., pruning or fine-tuning) or extra supervision (e.g., a teacher model), we remove the backdoor based on the corrective power of the learning model itself, performing layer-wise and top-down attention distillation to realize self-purification. The key idea is to distill attention knowledge from different layers of the model and rectify deep layers according to the attention knowledge of shallow layers. The intuition is that shallow layers usually extract coarse-grained structural information from the inputs while deep layers focus on fine-grained details [28]. As backdoor attacks try to establish a strong relationship between the subtle trigger and the misclassification label, the deep layers are more likely to absorb the fine-grained trigger features during learning. Meanwhile, as backdoor attacks preserve high prediction accuracy on clean samples, the shallow layers are more likely to extract the correct structural information. Via self-distillation, the deviant deep layers are rectified by the correct shallow layers through self-healing. To realize this goal, we first distill attention knowledge from the neurons in the model and then retrain the model with the intention of aligning the attention of deep layers with that of shallow layers. To further improve the performance of SAGE, we design a learning rate adjustment mechanism that dynamically tunes the learning rate according to the defense effect.
We conduct extensive experiments to evaluate the defense performance of SAGE. We have implemented various backdoor attacks, including invisible-trigger attacks (e.g., HB [51], WaNet [45], and ISSBA [34]), anti-distillation attacks (e.g., Ingrain [16]), and special attacks (e.g., multi-trigger attacks, class-specific trigger attacks, large-trigger attacks, and transparent-trigger attacks), on four datasets, i.e., MNIST, CIFAR-10, CIFAR-100, and ImageNette. We also explore whether SAGE is robust to adaptive attacks. We compare SAGE with 6 model purification defenses in terms of attack success rate reduction and model prediction accuracy maintenance. The experiment results show that SAGE can reduce the attack success rate by as much as 90% more than the baselines for advanced backdoor attacks. With only 1% clean data, SAGE reduces the attack success rate of ATTEQ-NN from 99.99% to 44.57% on the CIFAR-100 dataset, while the best baseline can only decrease it to 99.24%.
To conclude, we make the following contributions:
We develop a defense framework named SAGE, which utilizes layer-wise, top-down self-attention distillation (SAD) to purify backdoored models. The SAD mechanism corrects toxic deep layers with innocent shallow layers through attention map alignment, and is made more effective for backdoor removal by normalization and a dynamic learning rate adjustment strategy.
We propose a sophisticated learning rate adjustment strategy that carefully tracks the prediction accuracy on clean samples to guide the learning rate update. This dynamic learning rate update makes SAGE less likely to be trapped in local sub-optima than fixed-schedule learning rate adaptation methods.
Extensive experiments show that SAGE outperforms state-of-the-art model purification defense methods by greatly reducing the attack success rate of advanced backdoor attacks. It is shown that SAGE is also robust against various special backdoor attacks and adaptive backdoor attacks.
Background
A. Backdoor Attacks
Most backdoor attacks target deep neural network (DNN) models. A DNN model can be represented as a parameterized function $f_{\Theta}: \mathcal{X} \to \mathcal{Y}$ that maps an input $x \in \mathcal{X}$ to a label $y \in \mathcal{Y}$, where $\Theta$ denotes the model parameters. Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and a loss function $\mathcal{L}$, the parameters are learned by solving \begin{equation*}\mathop {\min }\limits_\Theta \frac{1}{n}\sum\limits_{i = 1}^n {\mathcal{L}\left( {{f_\Theta }\left( {{x_i}} \right),{y_i}} \right).} \tag{1}\end{equation*}
Despite their good performance, training a sophisticated DNN model usually requires prohibitive investments of time and money. For example, GPT-3 has more than 175 billion parameters and requires 3.14 × 10^23 FLOPs to train, i.e., 355 GPU-years on a Tesla V100, one of the fastest GPUs on the market [3]. Furthermore, collecting large-scale, high-quality training samples is also cumbersome. For example, the accumulation of the ImageNet dataset [10] for object recognition and classification took several years. Expert knowledge is also needed to choose appropriate model structures and hyperparameters. These challenges incentivize resource-limited users to outsource the model training procedure to a cloud server (e.g., Google, Amazon, and Microsoft) or to directly download pre-trained models from online model zoos (e.g., Caffe Model Zoo2). However, in these cases, an adversary may manipulate the training process and provide a backdoored model to the user.
Backdoor attack (a.k.a. trojan attack) is a form of training-phase attack where the attacker manipulates the training dataset or the model training process [8], [19], [40]. The backdoored (trojaned) model behaves normally on benign samples but misclassifies any sample stamped with the backdoor trigger to the target false label (targeted attack) or any false label (untargeted attack). The triggers may be model-dependent [18], [40], [61] or model-independent [7], [19], [27], [33], [36], visible [7], [18], [19], [27], [40], [44], [52], [59] or invisible [31], [35], [45], [51]. Apart from the centralized learning scenario, backdoor attacks can also be launched in the federated learning scenario [1], [38], [43], [60], [64].
B. Backdoor Defenses
1) Backdoor Detection
As both the model and the inputs are tampered with in backdoor attacks, defenders may inspect whether the model [39], [59] or the input sample is backdoored or not [9], [14], [66].
Backdoored input detection
STRIP [14] makes multiple copies of the suspicious input and perturbs each copy with a different sample. A concentrated distribution of the prediction results over the perturbed copies indicates the potential presence of the trigger of a targeted backdoor attack. SentiNet [9] searches for the contiguous region that is most influential on the classification result; such a region is considered to contain a trigger with high probability. The region is carved out and patched onto other images. If most of the patched samples are misclassified into the same false label, the input sample is regarded as malicious. Based on the observation that the activations of the last hidden layer reflect the high-level features used by the DNN to obtain the prediction result, Activation Clustering (AC) [5] detects whether a batch of inputs contains malicious ones by checking whether their activations can be divided into two clusters.
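As a rough illustration of the STRIP idea described above (a minimal sketch, not the authors' implementation; `model`, `suspicious_x`, and `clean_pool` are hypothetical placeholders, and the blending ratio is an assumption):

```python
import torch
import torch.nn.functional as F

def strip_entropy(model, suspicious_x, clean_pool, n_copies=16, alpha=0.5):
    """Superimpose the suspicious input with randomly chosen clean samples and
    measure the average prediction entropy over the blended copies.
    Consistently low entropy (concentrated predictions) hints at a
    trigger-dominated input."""
    model.eval()
    idx = torch.randperm(clean_pool.size(0))[:n_copies]
    blended = alpha * suspicious_x.unsqueeze(0) + (1 - alpha) * clean_pool[idx]
    with torch.no_grad():
        probs = F.softmax(model(blended), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    # flag as suspicious if below a threshold calibrated on clean inputs
    return entropy.mean().item()
```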
Backdoored model detection
Inspired by the Electrical Brain Stimulation (EBS) theory, ABS [39] scans the target model to judge whether it is malicious. DeepInspect [6] uses reverse engineering to recover the training samples, and then uses a conditional generative model to try to obtain the probabilistic distribution of the potential backdoor triggers. Meta Neural Trojan Detection (MNTD) [65] trains a meta-classifier to predict whether the DNN model is backdoored or not. The meta-classifier is trained with jumbo learning to differentiate various types of backdoored models.
2) Backdoor Purification
Similar to backdoor detection, the defender may aim to remove the backdoor effect from either the model [32], [37] or the input sample [12] with or without backdoor detection.
Input purification
As far as we know, there is very little work on backdoor input purification. Februus [12] purifies the input sample by surgically removing the potential trigger and restoring the benign input. Februus determines the influential region that is most likely to contain a trigger using GradCAM [53]. The region is replaced with a neutralized-color box, after which a generative adversarial network restores the benign input.
Model purification
It is shown that model pruning, a model compression technique, may remove the backdoor from a model [18], [32], [37]. The intuition is that infected neurons are dormant on clean samples and are only activated by backdoored samples [37]. Therefore, neurons that have low activations on clean samples may be the backdoored neurons and can be removed. Model pruning inevitably decreases the model's prediction accuracy and may be ineffective against adaptive backdoor attacks [18]. Fine-tuning, a widely-used transfer learning strategy, may also be used to remove the backdoor. The defender can fine-tune the target model with a small set of benign samples to try to remove the backdoor [32]. However, various existing attacks [61], [67] are resistant to such a defense. Model pruning and fine-tuning can also be combined to remove the backdoor [37].
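A minimal sketch of the pruning intuition above, assuming a PyTorch convolutional model (`model`, `layer`, and `clean_loader` are hypothetical placeholders; the cited defenses differ in their exact criteria):

```python
import torch

def rank_dormant_channels(model, layer, clean_loader, device="cpu"):
    """Rank the output channels of `layer` by their mean absolute activation
    on clean samples; the lowest-ranked (most dormant) channels are candidate
    backdoored neurons to prune."""
    activations = []
    handle = layer.register_forward_hook(
        lambda m, inp, out: activations.append(out.detach().abs().mean(dim=(0, 2, 3))))
    model.eval()
    with torch.no_grad():
        for x, _ in clean_loader:
            model(x.to(device))
    handle.remove()
    mean_act = torch.stack(activations).mean(dim=0)  # one score per channel
    return torch.argsort(mean_act)                   # ascending: most dormant first
```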
NeuralCleanse (NC) [59], MCR [69], ANP [62], and NAD [32] are recently proposed backdoor removal approaches. NC [59] first utilizes reverse engineering to generate potential triggers for all labels and then detects whether there exists a label that needs much smaller modifications to achieve misclassification. If the smallest potential trigger is smaller than a threshold, the model is deemed backdoored. The reversed trigger can help identify and prune infected neurons. However, NC cannot detect backdoor attacks that adopt large (possibly invisible) triggers or multiple triggers. MCR [69] uses mode connectivity in loss landscapes to remove potentially backdoored neural paths from the model. ANP [62] is based on the intuition that a backdoored model tends to predict the target label on benign samples when its neurons are adversarially perturbed.
NAD [32] utilizes attention distillation [28] to remove the backdoor. NAD first fine-tunes the model to obtain a teacher model, and then combines the teacher model and the original model (student model) through attention distillation, such that the intermediate-layer attention of the original model aligns with that of the teacher model. However, since the teacher model is derived from the original model and may inherit backdoored layers (especially intermediate and deep layers) of the original model, NAD has been shown to be ineffective against more advanced backdoor attacks, such as anti-distillation backdoor attacks [16] and ATTEQ-NN [17]. Compared with NAD, SAGE uses the shallow layers of the original model to correct the deep layers, as the shallow layers mainly capture the high-level structural information of the input and are less affected by the backdoor trigger. In this way, SAGE does not need to establish another teacher model and can defend against anti-distillation backdoor attacks.
C. Knowledge and Attention Distillation
Knowledge distillation (KD) was proposed to transfer the knowledge of a large teacher network to a small student network [21], [42], [46], where the student network imitates the intermediate and deep layers of the teacher network. KD was first proposed by Hinton et al. [21], which minimizes the Kullback–Leibler (KL) divergence between the output probability vectors of the student model and the teacher model:
\begin{equation*}{\theta _s} = \arg \mathop {\min }\limits_{{\theta _s}} {{\text{E}}_{x \in X}}\left[ {KL\left( {F_s^h\left( x \right),F_t^h\left( x \right)} \right)} \right],\tag{2}\end{equation*} where $F_s^h(x)$ and $F_t^h(x)$ denote the output probability vectors of the student and the teacher model, respectively.
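As a rough PyTorch-style illustration of this objective (a minimal sketch under common KD conventions, not the formulation of any specific cited work; the temperature `T` is an assumed hyperparameter from standard practice):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the softened teacher and student output
    distributions, in the spirit of Eq. (2)."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```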
Recent works have applied the attention mechanism to knowledge distillation to supervise the training of the student model [28], [54]. The attention mechanism guides the student network to learn higher-quality intermediate representations, thus improving the distillation effect [28], [50]. A widely-used attention distillation scheme is proposed in [28], which includes two kinds of attention distillation, activation-based and gradient-based, each of which guides the student model to learn the attention maps of the teacher model. More recently, Hou et al. [22] proposed a self-attention distillation mechanism where the student learns from itself, thus eliminating the need for a teacher model. The attention distillation is performed layer-wise and top-down within the model, where the attention knowledge is propagated layer by layer from the shallow layers to the intermediate layers and finally to the deep layers. In this paper, we leverage self-attention distillation to erase the backdoor from a target model.
Threat Model
Defender
Following the threat models of existing purification defenses [32], [37], we assume that the defender obtains a trained model from an untrusted third party. The defender has a small set of clean samples to help validate the model. Note that this dataset is much smaller than the original training dataset used for training the model. The goal of the defender is to erase the backdoor from the received model while maintaining the model prediction accuracy on benign samples.
Attacker
We consider a powerful attacker. The attacker provides the trained model to the defender. The attacker has all the internal information of the model and its training dataset. The attacker can manipulate the model in any way to generate a backdoored model. The trigger can be in any shape, location, and size. The backdoor attack can be a traditional backdoor or an adaptive backdoor attack.
SAGE: Detailed Construction
A. Design Rationale
Interpretability studies of neural networks have shown that shallow layers mainly extract macro features (global structural information) while intermediate and deep layers mainly extract micro features (fine-grained details) [28]. To ensure concealment, triggers are usually designed as micro perturbations3 that have little effect on the global structure of the sample, thus mostly affecting deep layers rather than shallow layers. Neural Attention Distillation (NAD) fine-tunes a teacher model from the backdoored student model, so the teacher model potentially has good shallow layers but bad deep layers (perhaps only slightly better than the deep layers of the student model). NAD makes the good shallow layers of the student model learn from the good shallow layers of the teacher model, and the bad deep layers of the student model learn from the bad deep layers of the teacher model. In this way, the deep layers of the teacher model cannot effectively correct the deep layers of the student model. In contrast, we adopt self-attention distillation (SAD), which makes the bad deep layers learn from the good shallow layers through attention map alignment within the model itself, such that the deep layers are potentially corrected.
To fulfill the design goals, SAGE performs three key steps, i.e., attention representation, loss calculation, and learning rate update, as shown in Fig. 1. In the attention representation module, we distill the attention of each neuron based on its contribution to the prediction results. In the loss calculation module, we rectify the weights of deep layers according to the attention knowledge of shallow layers while maintaining the model prediction accuracy. In the learning rate update module, rather than using the existing adaptation method (divide the learning rate by 10 every 2 epochs [32]), we design a novel learning rate adaptation strategy that carefully tracks the prediction accuracy of clean samples to guide the learning rate adjustment.
B. Attention Representation
Attention representation is critical for the success of correcting backdoors as we want to distill the essential attention knowledge to guide the self-purification process.
Given a backdoored model $F_B$, the activation tensor at the $l$-th layer is denoted as $F_B^l \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $C_l$, $H_l$, and $W_l$ are the number of channels, the height, and the width of the tensor, respectively. An attention function $\mathcal{G}$ maps the 3D activation tensor to a 2D attention map by aggregating over the channel dimension. We consider the following attention functions: \begin{equation*}\begin{array}{c} {\mathcal{G}_{sum}}\left( {F_B^l} \right) = \sum\limits_{i = 1}^{{C_l}} {\left| {F_{B,i}^l} \right|}, \quad \mathcal{G}_{sum}^p\left( {F_B^l} \right) = \sum\limits_{i = 1}^{{C_l}} {{{\left| {F_{B,i}^l} \right|}^p}}, \\ \mathcal{G}_{max}^p\left( {F_B^l} \right) = \mathop {\max }\limits_{i = 1,\ldots,{C_l}} {\left| {F_{B,i}^l} \right|^p}, \quad \mathcal{G}_{mean}^p\left( {F_B^l} \right) = \frac{1}{{{C_l}}}\sum\limits_{i = 1}^{{C_l}} {{{\left| {F_{B,i}^l} \right|}^p}}, \end{array} \tag{3}\end{equation*} where $F_{B,i}^l$ is the $i$-th channel slice of $F_B^l$ and $p$ is a power hyperparameter.
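A minimal PyTorch-style sketch of these attention functions (`feat` is assumed to be a (B, C, H, W) activation tensor; this is an illustration of Eq. (3), not SAGE's released code):

```python
import torch

def attention_map(feat, mode="sum", p=2):
    """Map an activation tensor of shape (B, C, H, W) to a (B, H, W) attention
    map by aggregating over the channel dimension, following Eq. (3)."""
    a = feat.abs()
    if mode == "sum":
        return a.sum(dim=1)
    if mode == "sum_p":
        return a.pow(p).sum(dim=1)
    if mode == "max_p":
        return a.pow(p).amax(dim=1)
    if mode == "mean_p":
        return a.pow(p).mean(dim=1)
    raise ValueError(f"unknown attention function: {mode}")
```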
C. Loss Calculation
The distilled attention knowledge is used to guide the purification of the model. Unlike traditional attention distillation performed across a teacher model and a student model [28], our proposed self-attention distillation (SAD) is conducted within a single model in a top-down and layer-wise manner. The key idea is to utilize the attention maps of shallow layers as a form of supervision for those of deep layers. The self-attention distillation loss between layer $i$ and layer $j$ ($i < j$) is defined as \begin{equation*}\begin{array}{c} {\mathcal{L}_{SAD}}\left( {F_B^i,F_B^j} \right) = {\mathcal{L}_d}\left( {\psi \left( {F_B^i} \right),\psi \left( {F_B^j} \right)} \right), \\ {\text{s}}.{\text{t}}.\quad {\mathcal{L}_d}\left( {{\psi _1},{\psi _2}} \right) = {\left\| {{\psi _1} - {\psi _2}} \right\|_2}, \quad \psi ( \cdot ) = N(U(\mathcal{G}( \cdot ))), \quad N\left( {F_B^l} \right) = \frac{{F_B^l}}{{{{\left\| {F_B^l} \right\|}_2}}}, \end{array} \tag{4}\end{equation*} where $U(\cdot)$ resizes the attention maps of different layers to a common spatial size and $N(\cdot)$ is the L2 normalization.
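A possible PyTorch-style realization of Eq. (4) (a sketch under our reading of the notation: `U` is taken to be bilinear resizing to a common spatial size and `N` is L2 normalization of the flattened map; it is not SAGE's official implementation):

```python
import torch
import torch.nn.functional as F

def psi(feat, out_size, p=2):
    """psi(.) = N(U(G(.))): channel-wise attention map -> resize to a common
    spatial size -> L2 normalization of the flattened map."""
    att = feat.abs().pow(p).sum(dim=1, keepdim=True)          # G(.)
    att = F.interpolate(att, size=out_size, mode="bilinear",  # U(.) (assumed bilinear)
                        align_corners=False)
    att = att.flatten(1)
    return att / (att.norm(p=2, dim=1, keepdim=True) + 1e-12) # N(.)

def sad_loss(feat_shallow, feat_deep, p=2):
    """L_SAD between a shallow and a deep layer: L2 distance between their
    normalized attention maps. Whether gradients are blocked through the
    shallow map is an implementation choice not specified here."""
    size = feat_shallow.shape[-2:]
    diff = psi(feat_deep, size, p) - psi(feat_shallow, size, p)
    return diff.norm(p=2, dim=1).mean()
```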
Using only the SAD loss does not guarantee the prediction accuracy on clean samples. Therefore, the overall loss of SAGE combines the cross-entropy loss on the clean data with the SAD losses over selected layer pairs: \begin{equation*} \mathcal{L} = {\mathcal{L}_{CE}}\left( {{{\hat y}^{(t)}},{y^{(t)}}} \right) + \sum\limits_{\left\langle {{i_k},{j_k}} \right\rangle \in paths} {{\beta _k}{\mathcal{L}_{SAD}}\left( {F_B^{{i_k}},F_B^{{j_k}}} \right)} ,\tag{5}\end{equation*} where $\mathcal{L}_{CE}$ is the cross-entropy loss between the prediction $\hat{y}^{(t)}$ and the ground-truth label $y^{(t)}$ of a clean sample, and $\beta_k$ weighs the SAD loss of the $k$-th path.
The SAD loss is computed on selected pairs of layers in the model (each pair is referred to as a path). For instance, given a 4-layer model, the SAD loss is computed between each pair of adjacent layers, i.e., paths = {< 1, 2 >, < 2, 3 >, < 3, 4 >}. Note that extra paths may be added, e.g., < 1, 3 >, < 2, 4 >, and < 1, 4 >. The number of possible paths in an $l$-layer model is as high as $\binom{l}{2} = \frac{l(l-1)}{2}$.
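Combining the pieces, a sketch of the total loss of Eq. (5) over adjacent-layer paths, reusing `sad_loss` from the previous sketch (`feats` is a hypothetical list of per-layer activations ordered from shallow to deep):

```python
import torch.nn.functional as F

def total_loss(logits, labels, feats, betas, p=2):
    """Eq. (5) with adjacent-layer paths {<i, i+1>}: cross-entropy on clean
    samples plus weighted SAD losses. betas[k] weighs the k-th path."""
    loss = F.cross_entropy(logits, labels)
    for k in range(len(feats) - 1):
        loss = loss + betas[k] * sad_loss(feats[k], feats[k + 1], p)
    return loss
```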
D. Learning Rate Update
The learning rate is a key configurable hyperparameter in optimization algorithms. It usually takes a small positive real value between 0 and 1 and controls the step size of each iteration to guarantee steady convergence to the minimum of the loss function. Setting a proper learning rate is non-trivial: a large learning rate induces significant weight modifications in each iteration, leading to unstable performance oscillation during training, whereas a small learning rate results in slow convergence and may be trapped in a local optimum [4].
Traditional attention distillation strategies either used a fixed learning rate [22] or adjusted the learning rate heuristically (e.g., divided the learning rate by 10 every 2 epochs) [32]. As SAGE tries to achieve the goals of backdoor removal and prediction accuracy preservation simultaneously, the learning rate plays an important role in controlling the distillation process. Therefore, we propose a new strategy that carefully tracks the prediction accuracy of clean samples to guide the learning rate adjustment.
We designate two conditions under which the learning rate is reduced. The first condition C1 is that the loss on the clean data does not drop for a sufficient number of epochs within an interval. The second condition C2 is that the maximum loss on the clean data does not drop (remains the same or increases) over an interval. If either condition holds, we divide the learning rate by 2, as the distillation process needs to be curbed to restrict the degradation of the prediction accuracy on clean data.
Let P denote the number of epochs in an interval. At the j-th interval, i.e., the jP-th epoch, if either of the following conditions is satisfied, the learning rate η is adjusted; otherwise, the learning rate stays unchanged:
\begin{equation*} {\begin{cases} {{\mathbf{C1}}:\sum\nolimits_{k = (j - 1)P}^{jP - 1} {\mathbb{1}\left( {{\mathcal{L}^{(k + 1)}} < {\mathcal{L}^{(k)}}} \right) < \rho \cdot P,} } \\ {{\mathbf{C2}}:{\eta ^{(j - 1)P}} \equiv {\eta ^{jP}}\ {\text{and}}\ \mathcal{L}_{max}^{jP} \geq \mathcal{L}_{max}^{(j - 1)P},} \end{cases}} \tag{6}\end{equation*} where $\mathcal{L}^{(k)}$ denotes the loss on the clean data at the $k$-th epoch, $\mathbb{1}(\cdot)$ is the indicator function, $\rho \in (0, 1)$ is a threshold ratio, and $\mathcal{L}_{max}^{jP}$ is the maximum clean-data loss within the $j$-th interval.
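A sketch of this update rule (the combination of C1 and C2 with a logical OR follows the prose above; `losses`, `max_losses`, `rho`, and `rate_unchanged` are hypothetical bookkeeping variables, not names from the paper):

```python
def update_learning_rate(eta, losses, max_losses, j, P, rho, rate_unchanged):
    """Apply Eq. (6) at the end of the j-th interval. losses[k] is the
    clean-data loss after epoch k; max_losses[i] is the maximum clean-data
    loss of interval i; rate_unchanged indicates that eta was not modified
    at the previous interval."""
    # C1: the clean loss dropped in fewer than rho * P epochs of the interval
    drops = sum(1 for k in range((j - 1) * P, j * P) if losses[k + 1] < losses[k])
    c1 = drops < rho * P
    # C2: the learning rate stayed fixed and the interval's maximum loss did not drop
    c2 = rate_unchanged and (max_losses[j] >= max_losses[j - 1])
    return eta / 2 if (c1 or c2) else eta
```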
The overall algorithm of SAGE is summarized in Algorithm 1.
Evaluations
A. Experimental Setup
Model and datasets
In this paper, we conduct extensive evaluations on various deep learning tasks, including multiple datasets and deep neural networks. More concretely, we use four datasets, i.e., MNIST, CIFAR-10, CIFAR-100, and ImageNette. We utilize LeNet-5, ResNet-18, ResNet-50, and ResNet-18 structures to train DNN models for these datasets, respectively4.
Evaluation metrics
To evaluate the defense performance, we employ both attack success rate (ASR) and model prediction accuracy (MPA) in the experiments. Attack success rate (ASR) is calculated as the percentage of backdoored samples that are misclassified by the target model to the targeted false label5. Model prediction accuracy (MPA) is calculated as the ratio of accurately labeled benign samples in the test dataset, which does not intersect with the clean data used by SAGE.
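For concreteness, both metrics reduce to accuracy computations over the appropriate test sets, e.g. (a hedged sketch; `loader` and `target_label` are hypothetical placeholders):

```python
import torch

@torch.no_grad()
def accuracy(model, loader, target_label=None, device="cpu"):
    """With target_label=None and a clean test loader, this returns the MPA;
    with a loader of trigger-stamped samples and the attack's target label,
    it returns the ASR."""
    model.eval()
    hits, total = 0, 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        ref = y if target_label is None else torch.full_like(y, target_label)
        hits += (pred == ref).sum().item()
        total += y.numel()
    return hits / total
```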
Backdoor attack/defense methods
We apply SAGE and baseline defenses to eight state-of-the-art backdoor attacks, i.e., BadNets [19], TrojanNN [40], HB [51], RobNet [18], WaNet [45], Ingrain [16], ISSBA [34], and ATTEQ-NN [17]. Besides, we compare SAGE with six state-of-the-art backdoor model purification methods, i.e., fine-tuning, pruning, fine-pruning [37], NAD [32], MCR [69], and ANP [62]. We implement the baselines using their original source codes and ensure convergence with enough epochs.
More details of the datasets, model structures, and setup about these backdoor attacks and defenses are described in the Appendix. All experiments are implemented in Python and run on a 14 core Intel(R) Xeon(R) Gold 5117 CPU @2.00GHz and NVIDIA GeForce RTX 2080 Ti GPU machine running Ubuntu 18.04 system.
If the model consists of blocks instead of layers, we treat blocks as layers to compute attention maps and SAD loss. In particular, a block is a repetitive integrated structure composed of multiple layers in a model, such as the inception block in Inception-v3 [55] and the building block in ResNet [20]. For simple models without blocks such as LeNet [30], experiments show that applying SAGE directly on convolutional layers is enough to achieve the defense goal.
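A sketch of how block-level activations could be collected with forward hooks, using a torchvision ResNet-18 purely as an illustration (the paper's own ResNet variants and training code may differ):

```python
import torch
import torchvision

def collect_block_activations(model, stages, x):
    """Register forward hooks on the given residual stages (treated as
    'layers') and return their activation tensors for a batch x; these feed
    the attention-map and SAD-loss computations."""
    feats, handles = [], []
    for stage in stages:
        handles.append(stage.register_forward_hook(
            lambda module, inputs, output: feats.append(output)))
    model(x)
    for h in handles:
        h.remove()
    return feats

# Illustration with a torchvision ResNet-18 (four residual stages):
model = torchvision.models.resnet18(num_classes=10)
feats = collect_block_activations(
    model, [model.layer1, model.layer2, model.layer3, model.layer4],
    torch.randn(2, 3, 32, 32))
```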
B. Comparison with Baseline Defenses
We compare the defense performance of SAGE with the above-mentioned six state-of-the-art backdoor model purification methods6. We apply these defenses to the above-mentioned eight state-of-the-art backdoor attacks. We run these attacks and baseline defenses with the original settings of their open-source codes. We set the threshold of model pruning to 5%, i.e., once the decrease of the prediction accuracy on clean samples is larger than or equal to 5%, the pruning operation is terminated. Since fine-pruning is the combination of model pruning (first step) and fine-tuning (second step), we use the same pruning rate for model pruning and fine-pruning. Note that since HB [51], ISSBA [34], and ANP [62] are only applicable to three-channel datasets, we only evaluate these attack/defense strategies on CIFAR-10, CIFAR-100, and ImageNette. The comparison results are shown in Fig. 4-7 (appendix). Clean data denotes the ratio of data samples used by the defense methods to the training data samples [32], [37].
After applying SAGE and the baseline defenses, the purified models of SAGE achieve the lowest attack success rates on all datasets in most cases. Taking CIFAR-10 with a 1% clean dataset as an example, SAGE brings the ASR down from 81.52% to 10.11% (BadNets), from 99.64% to 19.42% (TrojanNN), from 98.85% to 10.83% (RobNet), from 97.25% to 6.79% (WaNet), from 73.20% to 29.83% (HB), from 100% to 65.12% (Ingrain), from 90.77% to 8.06% (ISSBA), and from 100.0% to 0.00% (ATTEQ-NN), while the lowest ASR achieved by any baseline defense is still as high as 10.20% (BadNets), 38.21% (TrojanNN), 74.47% (RobNet), 29.84% (HB), 8.96% (WaNet), 70.76% (Ingrain), 8.11% (ISSBA), and 100.0% (ATTEQ-NN). For the high-resolution dataset ImageNette (5% clean dataset), SAGE significantly reduces the average ASR of the eight state-of-the-art backdoor attacks to 18.87%, while fine-tuning, model pruning, fine-pruning, NAD, MCR, and ANP can only reduce the average ASR to 58.53%, 68.29%, 49.78%, 36.01%, 56.13%, and 60.26%, respectively. Although MCR can reduce the ASR of BadNets and TrojanNN more than SAGE does in some cases on ImageNette, its MPA is much lower than that of SAGE (by ≥ 8%).
In terms of MPA, all model-purification defenses (including SAGE) have a negative effect on MPA, but the drops incurred by SAGE are less than 3% in almost all cases. The reason is that SAGE adapts self-attention distillation with normalization and a dynamic learning rate adjustment strategy; our ablation studies verify that these two strategies significantly improve prediction accuracy. In some cases, SAGE even raises the prediction accuracy of backdoored models above that of the benign ones. In comparison, the baseline defenses either fail to maintain high prediction accuracy or fail to effectively reduce the attack success rate of advanced backdoor attacks (RobNet, Ingrain, ISSBA, and ATTEQ-NN).
To further investigate whether SAGE successfully cleanses the backdoor, we use NC [59] to detect the backdoored model purified by SAGE. The threshold of MAD (Median Absolute Deviation) in NC for anomaly detection is set as 2.0, over which the model is deemed as a backdoored model. As shown in Table VI, the MAD values of all purified models by SAGE are below 2.0, meaning that the model is not considered as being backdoored by NC. The results verify the effectiveness of SAGE in backdoor purification.
We apply SAGE to the benign models of the four datasets. The purified models have an accuracy of 98.75% (MNIST), 91.18% (CIFAR-10), 78.32% (CIFAR-100), and 85.95% (ImageNette), respectively. The maximum accuracy drop is less than 0.7%. The results show that SAGE hardly affects the prediction accuracy of benign models.
C. Computational Costs
To assess the efficiency of SAGE, we compare the computational costs of SAGE with baselines in Table VII. Compared with other defenses, NAD and SAGE have higher computational costs mainly due to the time-consuming distillation process. SAGE has a lower computational cost than NAD since NAD needs to train a teacher model. Overall, SAGE has the best purification performance with reasonable computational costs.
D. Ablation Study
Impact of normalization function
We adapt the self-attention distillation process with normalization to make it more effective for backdoor removal. To explore the efficacy of normalization, we compare the defense performance with and without normalization against different backdoor attacks. The results are shown in Table II and Table XXII (appendix). SAGE with the normalization function significantly improves the MPA and degrades the ASR of the attacks. Without normalization, the prediction accuracy on clean samples drops significantly.
Impact of dynamic learning rate update
We propose a more sophisticated strategy that carefully tracks the prediction accuracy on clean samples to guide the learning rate adjustment. To explore the effectiveness of our dynamic adjustment strategy, we compare the defense performance of SAGE when using a fixed learning rate (0.01), a simple learning rate adjustment (dividing the learning rate by 10 every 2 epochs [32]), and our proposed dynamic learning rate update. The comparison results are shown in Table III and Table XXI (appendix). Our dynamic learning rate update not only improves the defense effectiveness but also improves the prediction accuracy on clean samples. Moreover, the dynamic learning rate update of SAGE is less likely to be trapped in local sub-optima than the simple learning rate adaptation method of NAD.
Fig. 2. Attention maps of different layers in the purified model of SAGE, compared with those in benign models, backdoored models, and purified models of NAD.
Impact of attack model structures
By default, the backdoored models for CIFAR-10, CIFAR-100, and ImageNette use ResNet model structures. In this part, we explore whether SAGE is also effective for backdoored models trained with other model structures, such as SqueezeNet [25], DenseNet [24], and ShuffleNet [68].
We generate backdoored models following the settings of BadNets [19] attack and then apply SAGE to purify the backdoored models. The results are shown in Table XIV (appendix). It is shown that SAGE is robust to the structures of the backdoored model. SAGE can effectively remove the backdoor from the model, regardless of the structure of the backdoored model.
Impact of clean data size
We explore the impact of the clean data size on the defense performance of both SAGE and the baseline defenses. As shown in Fig. 4-7 (appendix), in general, as the clean data size increases, both SAGE and the baseline defenses become stronger. Even with 3% clean data samples, SAGE can successfully defend against the eight attacks on CIFAR-10, while even with 20% clean data samples, all the baseline defenses remain ineffective against ATTEQ-NN [17]. Through experiments, we also find that SAGE converges faster than the baseline defenses.
Impact of attention functions
We investigate the defense performance of SAGE when adopting the various attention functions in Eq. (3), i.e., $\mathcal{G}_{sum}$, $\mathcal{G}_{sum}^p$, $\mathcal{G}_{max}^p$, and $\mathcal{G}_{mean}^p$.
Impact of distillation paths
The SAD loss is computed on pairs of layers in the model, referred to as paths. We explore the impact of different distillation paths on the defense performance in Table V and Table XX (appendix). It is shown that paths = {< i, i + 1 >} achieves the best defense effect for CIFAR-10, CIFAR-100, and ImageNette in most cases. The success of paths = {< i, i + 1 >} is due to the fact that learning the attention maps of adjacent layers can better capture the distributional features of clean data.
Impact of weight parameters
We explore the impact of the weight parameters βk on the defense performance, where k indexes the distillation paths (between adjacent residual blocks) of the purified model. The value of βk balances the attack success rate and the clean data accuracy. In this paper, we utilize ResNet structures with four stages for the CIFAR-10, CIFAR-100, and ImageNette datasets, so there are three weight parameters in the purified model. In general, as βk increases, the attack success rate decreases, i.e., the defense becomes more effective against the existing backdoor attacks. However, an overly large weight value will decrease the clean data accuracy. As shown in Table VIII and Table XVII (appendix), when the weights between adjacent residual blocks are set to the same value, i.e., (300, 300, 300), SAGE achieves the best defense performance in most cases.
Impact of trigger shapes
We explore whether the defense performance of SAGE is impacted by the trigger shape of backdoor attacks. We test SAGE under seven different trigger shapes of TrojanNN [40], i.e., triangle, square, hexagon, circle, parallelogram, quadrant, and semicircle. As shown in Table IX, we can see that the attack performance without defense is high with various trigger shapes, but SAGE is able to thwart attacks with different trigger shapes by reducing the ASR to less than 20% in most cases.
E. Attention Maps
To corroborate our design rationale that SAGE is able to purify the backdoor in the deep layers of the model with the help of the shallow layers, we plot attention maps [32], [49] of different layers in the purified model of SAGE, compared with those in benign models, backdoored models, and purified models by the baseline NAD. The attention map of a specific layer given a certain input illustrates the focal area of the layer on the input, which contributes to the final prediction result.
Due to page limitations, we only discuss some of the advanced attacks that can evade the NAD defense. Since the trigger of ISSBA is global, we do not specifically mark the trigger. As shown in Fig. 2, the focal areas in the shallow layers of the backdoored model are similar to those of the benign model (both on the right area), while the focal areas in the deep layers of the backdoored model differ from those of the benign model (the former focuses on the trigger area). SAGE-purified models correct the focal areas to be close to those of benign models. In contrast, for malicious samples of advanced backdoor attacks, NAD-purified models cannot re-align the focal areas of deep layers with those of benign models. Thus, NAD cannot successfully purify the backdoors from these backdoored models, and the attack success rate of the purified models can still exceed 30%. These observations confirm our design intuition.
Robustness against Special Backdoor Attacks
Recent studies show that some special backdoor attacks are able to evade certain defenses. For example, since some defense methods [9], [14], [59] are ineffective for large triggers and class-specific triggers, the attacker may use a large but less visible trigger or a class-specific trigger for the attack.
A. Multi-trigger Same-Label Attacks
This backdoor attack variant considers an attacker who injects multiple distinct triggers into the target model, aiming to force the backdoored model to misclassify any input sample stamped with any one (or a combination) of these triggers to the same target false label.
To evaluate whether SAGE is still effective in erasing the backdoors injected by such attacks, we apply SAGE to the multi-trigger same-label attack of RobNet [18]. The attacker is assumed to inject 1, 3, and 5 triggers into the target model, and we follow the settings in RobNet [18] to generate these triggers. Note that the target label is label 3 for all datasets. As shown in Table XIII (appendix), SAGE is robust to multi-trigger same-label attacks for all datasets. After purification, the attack success rate of all attacks drops significantly, while the clean data accuracy of the model hardly changes.
B. Multi-trigger Multi-label Attacks
In multi-trigger multi-label attacks, the attacker also injects multiple different triggers into the target model, but each trigger leads to misclassification into a different target label.
To evaluate whether SAGE is effective in such cases, we apply SAGE to the multi-trigger multi-label attack of RobNet [18]. The attacker is assumed to inject 1, 3, and 5 triggers to the target model, and we follow the settings in RobNet [18] to generate these triggers. The target label of the single-trigger attack is label 3. The target labels of the 3-trigger attack are labels 3, 4, and 5, respectively. The target labels of the 5-trigger attack are labels 3, 4, 5, 6, and 7, respectively. As shown in Table XVI (appendix), we can see that SAGE is also robust to multi-trigger multi-label attacks.
C. Large Trigger Attacks
To explore the impact of trigger size, we vary the trigger size of TrojanNN [40] from 3 × 3 to 20 × 20 for MNIST, CIFAR-10, and CIFAR-100. For the high-resolution dataset ImageNette (224×224), we vary the trigger size of TrojanNN [40] from 32 × 32 to 128 × 128. For all the datasets, the poisoning rate is set as 10%, the transparency value is 0.5, and the trigger shape is square.
As shown in Table X, for low-resolution datasets (32 × 32), SAGE can also effectively reduce the attack success rate to less than 22% even if the trigger size is as large as 20 × 20. In terms of the high-resolution dataset ImageNette, we can see that even with a trigger size of 128 × 128, SAGE is still effective in reducing the ASR of the backdoored model to 27.15%.
D. Transparent Trigger Attacks
Most of the existing backdoor attacks [7], [18], [19] set the trigger transparency as 0%, i.e., the trigger is opaque. Although such a setting can enhance the attack success rate, the obvious trigger may also attract attention from the defenders. A more transparent trigger will improve the concealment of the attack.
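For illustration, a transparent trigger can be realized by alpha-blending the trigger into the stamped region (a hedged sketch of one common convention; the exact blending used by TrojanNN may differ):

```python
import torch

def stamp_trigger(x, trigger, mask, transparency=0.5):
    """Blend a trigger into image x within the masked region. transparency=0
    yields an opaque trigger; larger values keep more of the original pixels,
    making the trigger fainter and harder to notice."""
    return x * (1 - mask) + mask * ((1 - transparency) * trigger + transparency * x)
```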
We test SAGE under ten different trigger transparency values of TrojanNN [40], i.e., 0∼0.9 at an interval of 0.1. Note that a high transparency value (more transparent) will reduce the attack success rate. As shown in Table XVIII (appendix), as the transparency value increases, the attack success rate of backdoor attacks will decrease, i.e., the backdoored model is closer to the clean model. Even in such cases, we find that SAGE can further decrease the attack success rate to an even lower level. Interestingly, SAGE can improve the model prediction accuracy in some cases. For instance, when the transparency value is 0.9 (CIFAR-100), the ASR and MPA of TrojanNN are 13.45% and 65.40%, respectively. After applying SAGE on the backdoored model, the ASR falls to only 1.64%, while the MPA goes up to 69.37%.
E. Class-Specific Trigger Attacks
Traditional backdoor attacks are class-agnostic [19], [40], [61], i.e., the effect of the backdoor trigger is independent of the source classes: the backdoored model misclassifies trigger-stamped samples from any class to the target false label, and the attack is dominantly determined by the backdoor trigger. Unlike class-agnostic attacks, in class-specific attacks, the backdoor trigger only works on samples of a specific class. The backdoored model only misclassifies trigger-stamped samples from that specific class to the target label. In this case, the attack is determined by both the backdoor trigger and the specific class. Recent studies [56] have shown that most of the existing defenses are ineffective against class-specific backdoor attacks.
Fig. 3. Attention maps of the original Ingrain and the adaptive Ingrain attacks before and after SAGE purification.
As shown in Table XI, SAGE can effectively reduce the ASR of class-specific trigger attacks to less than 10%. Take ImageNette as an example. After applying SAGE, the ASR of the backdoored model decreases from 91.14% to 1.01%, which demonstrates the defense efficacy of SAGE.
Robustness Against Adaptive Attacks
Apart from the above-mentioned five backdoor variants, we construct an adaptive attack that is designed specifically to evade SAGE and evaluate whether SAGE is effective against it. SAGE adopts a self-attention distillation mechanism to erase the backdoor, so the attacker may incorporate the same self-attention distillation loss into the training of the backdoored model in an attempt to make the backdoor survive the purification.
We construct a powerful adaptive attack based on the most recent state-of-the-art anti-distillation attack methodology, i.e., Ingrain [16]. Specifically, following the key structure of Ingrain (i.e., shadow model, optimizable trigger, and teacher model), we introduce the self-attention distillation loss $\mathcal{L}_{SAD}$ into the training objective of the attack, so that the injected backdoor is optimized to withstand attention-map alignment.
We apply SAGE to the adaptive Ingrain. As shown in Table XII, the ASR of the adaptive Ingrain decreases to less than 20%, while the MPA remains high. To understand why SAGE is effective against the adaptive attack, we compare the attention maps of Ingrain and adaptive Ingrain before and after being purified by SAGE. As shown in Fig. 3, the attention maps of the shallow and deep layers of SAGE-I (SAGE for Ingrain) and SAGE-A (SAGE for adaptive Ingrain) are similar to those of the benign model, indicating that SAGE can purify the backdoored models generated by adaptive attacks. Nonetheless, the adaptive Ingrain still follows the pattern of "good shallow layers and bad deep layers", which may be due to the fundamental goal of backdoor attacks (high prediction accuracy on benign samples and a high attack success rate on backdoored samples). In this case, SAGE remains effective in purifying the backdoor injected by the adaptive attack.
Conclusion and Discussion
This paper presents an effective defense approach for erasing backdoors from deep neural networks based on self-attention distillation without the need to refer to an extra teacher model. Extensive experiments on 8 state-of-the-art attacks for 4 datasets have verified the superiority of SAGE when compared with 6 baselines.
SAGE has several limitations to be addressed in future work. First, the self-attention distillation adopted by SAGE is only available in the visual domain but not in other domains, e.g., voice, video, and text. To purify potentially backdoored models in other domains, it is essential to develop corresponding SAD frameworks. Second, SAD is designed for neural networks with more than 3 layers. Therefore, SAGE cannot purify backdoored models of other popular structures, e.g., simple neural networks (fewer than 3 layers), decision trees, logistic regression, and support vector machines. Third, it is possible that the attacker may deliberately compromise shallow layers to evade SAGE, which, however, would degrade the performance on clean samples. It is worth exploring adaptive attacks that can bypass SAGE while keeping acceptable prediction accuracy on clean samples. Last, we borrow attention maps from adversarial example analysis in an attempt to interpret the effectiveness of SAGE. However, attention maps do not constitute a rigorous theoretical analysis, and there is a lack of well-established theoretical tools for analyzing backdoor attacks. Therefore, it is crucial to explore theoretical methods for examining backdoor attacks.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their valuable comments. Qian Wang’s work was partially supported by the National Key R&D Program of China (2020AAA0107701) and the NSFC under Grants U20B2049 and U21B2018. Yanjiao’s research is partially supported by the National Natural Science Foundation of China under Grant 61972296.
Appendix
Datasets and Models
MNIST [11] contains 70,000 28 × 28 gray-scale images of digits 0 to 9, i.e., 10 classes. We randomly choose 60,000 samples as the training set, and the remaining 10,000 samples are used as the test set. We train a LeNet-5 model on the training set. The learning rate is 0.01, and the trained benign model has a prediction accuracy of 99.12% on the test set.
CIFAR-10. CIFAR-10 [29] contains 60,000 images belonging to 10 classes. We randomly select 50,000 samples as the training set and the remaining 10,000 samples as the test set. We train a ResNet-18 on the training set for 300 epochs. The learning rate is 0.001, and the momentum of stochastic gradient descent is 0.9. The trained benign model has a prediction accuracy of 90.90% on the test set.
CIFAR-100. CIFAR-100 [29] includes 60,000 images belonging to 100 classes. Each sample has a dimension of 32 × 32. We train a ResNet-50 model for 200 epochs. We set the learning rate to 0.1 and the momentum of stochastic gradient descent to 0.9. To further improve performance, a MultiStepLR scheduler with a γ of 0.2 is applied at the 60th, 120th, and 160th epochs. The trained benign model has a prediction accuracy of 78.95% on the test set.
ImageNette. ImageNette [13] is a subset of ImageNet that is widely used in the research community [41], [63]. ImageNette includes 9,469 training samples and 3,925 test samples. Each image has a high resolution with a dimension of 224 × 224. We train a ResNet-18 network for 150 epochs. We set the learning rate to 0.001, the batch size to 16 (due to limited GPU resources), the momentum of stochastic gradient descent to 0.95, and the weight decay to 0.0005. The trained benign model has a prediction accuracy of 86.15% on the test set.
Backdoor Attack/Defense Methods
We apply SAGE and baseline defenses to eight state-of-the-art backdoor attacks. More details of these attack settings (i.e., trigger size, poisoning rate, transparency value, trigger shape, and target label) are shown in Table XV.
Besides, we compare SAGE with six state-of-the-art backdoor model purification methods. We implement the baselines using their original source codes and ensure convergence with enough epochs.
Fine-tuning the model is shown to be able to remove the backdoor of the model [32], [37]. We run 50 epochs of fine-tuning.
Pruning can also be used for disabling the backdoor [18]. We run 50 epochs of pruning.
Fine-pruning [37] combines fine-tuning and model pruning to purify the backdoored model. We run 50 epochs of fine-pruning.
NAD [32] first fine-tunes the backdoored model using a clean set of data to obtain a teacher model, and then uses the teacher model to correct the backdoored model (student model) through an attention distillation process. We run 50 epochs to establish the teacher model and 50 epochs to purify the student model.
MCR [69] proposed to employ mode connectivity [15] in loss landscapes to delete the backdoored neural paths from the model with a set of clean data. We run 50 epochs of fine-tuning, 100 epochs of curvenet training, and 100 epochs of model updating.
ANP (adversarial neuron pruning) [62] rectifies the backdoored model by pruning sensitive neurons under adversarial neuron perturbations. We run 50 epochs of mask optimization and 1,000 epochs of pruning.