Detecting Backdoors in Neural Networks Using Novel Feature-Based Anomaly Detection

This paper proposes a new defense against neural network backdooring attacks, in which networks are maliciously trained to mispredict in the presence of attacker-chosen triggers. Our defense is based on the intuition that the feature extraction layers of a backdoored network embed new features to detect the presence of a trigger, and the subsequent classification layers learn to mispredict when triggers are detected. Therefore, to detect backdoors, the proposed defense uses two synergistic anomaly detectors trained on clean validation data: the first is a novelty detector that checks for anomalous features, while the second detects anomalous mappings from features to outputs by comparing with a separate classifier trained on the validation data. The approach is evaluated on a wide range of backdoored networks (with multiple variations of triggers) that successfully evade state-of-the-art defenses. Additionally, we evaluate the robustness of our approach to imperceptible perturbations, its scalability to large-scale datasets, and its effectiveness under domain shift. This paper also shows that the defense can be further improved using data augmentation.

Among the obstacles to training deep neural networks (DNNs) in-house are the difficulty of obtaining large, high-quality labeled datasets and the cost of maintaining or renting the computational resources needed to train a complex model, which can take weeks to months. Hence, users often outsource DNN implementation and training to third-party clouds or download pre-trained models from online model repositories. This, however, exposes the user to training-time attacks Chen et al. In this paper, we seek to defend against so-called "backdoor" attacks Gu et al. (2019), wherein the attacker trains a malicious model that mis-predicts if its inputs contain attacker-chosen backdoor triggers. The DNN training includes "poisoned" data containing the backdoor trigger (i.e., a specific pattern/feature) so that the trained backdoored DNN outputs attacker-chosen labels when presented with poisoned input data. Specifically, by choosing proper training hyper-parameters, trigger patterns, backdoor labels, quantity of poisoned data, and embedding approach, the attacker can make the backdoored model output specific labels on the poisoned data while preserving high accuracy on the clean data (i.e., data without the trigger). Such a backdoored DNN may cause severe security risks, financial harm, and safety implications for the end-user (e.g., misclassifying traffic signs in autonomous-vehicle applications, as shown in Fig. 1). Detecting and defending against backdoor attacks is therefore of critical importance.
Detection of backdoors is challenging for multiple reasons. First, the data used for training the DNN (especially the poisoned data) might not be available to the defender. Even if the training data is available, the dataset might be too large to admit human examination, and the way in which a backdoor trigger influences the DNN might not be analyzable due to the complexity and lack of explainability of neural networks. Second, the information is asymmetric: the user/defender has little knowledge about the backdoor attack, including the triggers and the attacker-chosen labels, while the attacker has complete access and control. Existing defenses tackle this asymmetry by introducing restrictive assumptions on the trigger size and shape and on the functioning of the backdoor attack Tran et al. (2018); Chen et al. (2018); Wang et al. (2019); Liu et al. (2019), which, however, limits their usefulness.
In this paper, we propose a novel feature-based anomaly detection approach that requires the defender to make minimal assumptions about the backdoor operation. In particular, the approach does not require prior information on the number, sizes, shapes, locations, or colors of backdoor triggers, or indeed even whether the backdoors are embedded in the pixel space. Furthermore, unlike existing defenses, our approach addresses input-feature-output behavior rather than internal neural network structure, and can therefore be applied even beyond neural network-based models. Specifically, instead of studying the neuron behavior of the backdoored network, we analyze the backdoored network from a macro view: the feature extraction layers of a backdoored network embed new features to detect the presence of a trigger, and the subsequent classification layers learn to mispredict when triggers are detected. Therefore, to detect backdoors, we use two synergistic anomaly detectors: the first checks for anomalous features, while the second detects anomalous mappings from features to outputs by comparing with a separate classifier trained on validation data. We demonstrate the efficacy of our approach on a wide range of backdoored networks with multiple variations of triggers.
Our results show that our defense can detect poisoned inputs with high accuracy while retaining high classification accuracy on clean inputs, whereas existing methods fail in some cases. Additionally, we evaluate the robustness of our approach to imperceptible perturbations, its scalability to large-scale datasets, and its effectiveness under domain shift. Lastly, we show that the attack success rate can be further reduced using data augmentation.

Related Work
Backdoor attack: Neural network backdooring attacks were first proposed by Gu et al. (2019) and independently by Liu et al. (2017). Defense-aware backdoor attacks have also been studied by Liu et al. (2018). Backdoor defense: Backdoor detection has been addressed in several works under various sets of assumptions. For example, Tran et al. (2018); Chen et al. (2018) assume that the user/defender has access to both the backdoored DNN and the training dataset (including the poisoned data); under these assumptions, singular-value decomposition (SVD) and clustering techniques were employed to separate poisoned from clean data. Liu et al. (2018) proposed fine-pruning to defend against pruning-aware backdoor attacks: the method first prunes the backdoored DNN by de-activating the neurons that are least activated by clean validation data and then fine-tunes the network. Neural Cleanse Wang et al. (2019) reverse-engineers triggers and uses an outlier detection algorithm to find the attacker-chosen labels; however, Neural Cleanse assumes small trigger sizes. Another reverse-engineering-based defense, ABS Liu et al. (2019), assumes that stimulating only a single neuron is sufficient to increase the backdoored output activation, and the methodology fails if the backdoor is activated by a combination of neurons. In this paper, we show that the attacker can circumvent both defenses. Guo et al. (2019) and Qiao et al. (2019) are two other reverse-engineering-based defenses, but these also make assumptions on the trigger size, shape, and impact. In contrast, our defense does not make such restrictive assumptions.

Background and Problem Description
Threat Model Scenario: The user wishes to train a DNN F for a classification task on the training dataset S sampled from the data distribution D. The user outsources the training task to a third party (the attacker), providing F and S. The third party returns a backdoored DNN F_b to the user. The attacker's goal: The attacker trains F_b to output desired target label(s) l* on poisoned inputs x*. The poisoned inputs x* are generated by injecting trigger(s) into clean inputs x. The information is asymmetric (i.e., only the attacker knows the trigger patterns and the way they are embedded). Additionally, F_b should have high classification accuracy on clean inputs x while evading detection by the user. Attack model: The attacker has full control over the training process and the dataset S. For example, the attacker can choose an arbitrary portion of the training inputs to inject the triggers into, and can determine the trigger shape and size, the target label(s), and the training hyper-parameters (e.g., the number of epochs, batch size, learning rate, etc.). However, the attacker has neither access to the user's validation dataset nor the ability to change the model structure after training.

Backdoor Attack
Setup: The attacker determines the backdoor injection function f(·) and the target label(s) l* to generate poisoned data (x*, l*) from the clean data (x, l):

x* = f(x).   (1)

The attacker next selects a portion Ω ⊂ S of the clean training dataset in which to inject the triggers, creating the poisoned version of Ω as:

Ω* = {(f(x), l*) : (x, l) ∈ Ω}.

Finally, the attacker mixes Ω* with S to generate the training dataset S_b given by:

S_b = (S \ Ω) ∪ Ω*.

D* denotes the distribution corresponding to the poisoned data.
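The poisoning procedure above can be sketched as follows. The patch-stamping form of f(·), the array shapes, and the poisoning rate are illustrative assumptions, since the paper leaves the injection function generic:

```python
import numpy as np

def inject_trigger(x, trigger, mask):
    """Backdoor injection f(.): stamp a trigger patch onto a clean image.

    x, trigger: H x W x C float arrays; mask: H x W x C binary array
    marking the trigger region. All names here are illustrative.
    """
    return x * (1 - mask) + trigger * mask

def poison_dataset(X, y, target_label, trigger, mask, rate, seed=0):
    """Build S_b: apply f(.) to a fraction `rate` of (X, y) and relabel to l*."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(rate * len(X)), replace=False)
    Xb, yb = X.copy(), y.copy()
    for i in idx:
        Xb[i] = inject_trigger(X[i], trigger, mask)  # x* = f(x)
        yb[i] = target_label                         # label becomes l*
    return Xb, yb, idx
```

The clean samples outside the selected indices are left untouched, mirroring the mixing of Ω* with the rest of S.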
Attacker's objectives: The attacker trains a backdoored network F_b with dataset S_b. On one hand, a well-trained F_b should have classification accuracy comparable to F on clean inputs x ∼ D with corresponding labels l, i.e.,

Pr_{x∼D}[F_b(x) = l] ≥ 1 − ε₁,

with small ε₁ ≥ 0 (ideally, ε₁ = 0). On the other hand, F_b should also have a high attack success rate (i.e., output the attacker-chosen target label l*) on poisoned inputs x* ∼ D*, i.e.,

Pr_{x*∼D*}[F_b(x*) = l*] ≥ 1 − ε₂,

with small ε₂ ≥ 0 (ideally, ε₂ = 0).

Problem Description
The defender's goal and capacity: Given F_b, the defender wants to lower the attack success rate while maintaining the classification accuracy. The defender has a small set of clean validation data V from the data distribution D. We assume that V is sufficiently representative of D. The defender has no prior information about the backdoor triggers and the attacker-chosen target label(s). Problem formulation: Given F_b, the defender wishes to construct a binary detection function g(·) that outputs 0 with high likelihood for clean inputs x ∼ D and outputs 1 with high likelihood for poisoned inputs x* ∼ D*, i.e.,

Pr_{x∼D}[g(x) = 0] ≥ 1 − ε₃ and Pr_{x*∼D*}[g(x*) = 1] ≥ 1 − ε₄,

with small ε₃, ε₄ ≥ 0 (ideally, ε₃ = ε₄ = 0).

Detection Algorithm Overview
We consider a DNN as comprised of a feature extractor C_b (e.g., the convolutional layers) and a decision function G_b (e.g., the fully connected layers). C_b can be viewed as pulling out and characterizing features at higher abstraction levels, while G_b determines which combinations of the features should result in which outputs. When a backdoor is introduced, our hypothesis is that the "logic" specifying that the features of the backdoor trigger should result in the attacker-chosen label is encoded in G_b. Therefore, we propose to detect backdoors by verifying (a) how plausible the extracted features are, and (b) the validity of the mapping from the features to the output. To verify (a), we use a novelty detector N. To verify (b), we train a new decision function G_n as a replacement for G_b. The conceptual structure of our approach is shown in Fig. 2. To detect whether an input x is poisoned, the features extracted from x by C_b are first evaluated by N. If N flags the feature vector as non-novel, G_n is run on the extracted features to predict the most likely output label. A mismatch between the outputs of G_n and G_b points towards a poisoned input, even if the extracted feature vector itself appeared non-novel. On the other hand, if the extracted feature vector is flagged as novel (i.e., low-likelihood under the distribution learned by N from clean validation data), then the input is flagged as likely poisoned without even having to check consistency between the outputs of G_b and G_n.
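The two-stage decision logic above can be summarized in a short sketch; C_b, G_b, G_n, and N are passed in as generic callables, since the paper defines them abstractly at this point:

```python
def detect(x, C_b, G_b, G_n, N):
    """Fusion detector g(.): return 1 if x looks poisoned, else 0.

    C_b: backdoored feature extractor; G_b: backdoored decision function;
    G_n: decision function retrained on clean validation data;
    N:   novelty detector returning 1 for novel feature vectors.
    All four callables are placeholders for the components described above.
    """
    features = C_b(x)
    if N(features) == 1:
        return 1  # anomalous features: flag without checking G_b vs. G_n
    if G_n(features) != G_b(features):
        return 1  # plausible features but inconsistent feature->label mapping
    return 0      # plausible features and consistent mapping: clean
```

Note the short-circuit order: the consistency check between G_b and G_n only runs when the feature vector itself passes the novelty check, matching Fig. 2.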

Fusion Method
Novelty detector: Novelty detection is a strategy for detecting whether a new input is from the same distribution as previous data; many detection algorithms have been proposed, e.g., Schölkopf et al. (2001). The novelty detector N, trained with the clean validation data V, is desired to output 0 for clean inputs x ∼ D and 1 for poisoned inputs x* ∼ D*, i.e.,

Pr_{x∼D}[N(C_b(x)) = 0] ≥ 1 − ε₅ and Pr_{x*∼D*}[N(C_b(x*)) = 1] ≥ 1 − ε₆,

with small ε₅, ε₆ ≥ 0 (ideally, ε₅ = ε₆ = 0). New decision function: G_n, trained using only the clean validation data V, is more likely to behave differently from G_b on poisoned inputs x* ∼ D* than on clean inputs x ∼ D. Denoting F_n(·) = G_n(C_b(·)) and F_b(·) = G_b(C_b(·)), we expect poisoned inputs x* to be detected by disagreements between F_n and F_b. Fusion method: As shown in Fig. 2, our overall architecture combines verification of the plausibility of the extracted features using N with verification of the feature-output mapping using G_n. This "fusion method" combines the merits of N and G_n. Mathematically, the fusion function g(·) is defined by:

g(x) = 1 if N(C_b(x)) = 1 or G_n(C_b(x)) ≠ G_b(C_b(x)), and g(x) = 0 otherwise.

Training Method
Training overview: The clean validation data v ∈ V is fed into C_b, and the feature vectors C_b(v) and the corresponding labels are recorded. The novelty detector N is trained using a low-dimensional summary of the feature vector C_b(v) in terms of its L1 norm, L2 norm, and L∞ norm. Configuration of N: N consists of multiple local outlier factor (LOF) detectors provided by the Scikit-Learn (2020) library. Specifically, given V, each class i has its own novelty detector N_i, which determines whether the feature vector C_b(x) of a new input x is novel relative to the distribution learned from the class-i validation data. N outputs 1 if an input is determined not to belong to any of the classes (i.e., it is flagged as novel by every per-class detector N_i); otherwise, it outputs 0:

N(C_b(x)) = 1 if N_i(C_b(x)) = 1 for all classes i, and 0 otherwise.

This gives a tighter decision boundary than a single novelty detector trained with data from all classes.
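A minimal sketch of the per-class novelty detector using Scikit-Learn's LocalOutlierFactor follows; the neighbor count and the toy feature dimensions are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def norm_summary(F):
    """Low-dimensional summary of feature vectors: (L1, L2, Linf) norms."""
    F = np.asarray(F, dtype=float)
    return np.stack([np.abs(F).sum(axis=1),
                     np.sqrt((F ** 2).sum(axis=1)),
                     np.abs(F).max(axis=1)], axis=1)

class PerClassNovelty:
    """N: one LOF per class; output 1 iff every per-class detector says novel."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
        self.lofs = {}

    def fit(self, F_val, y_val):
        """Fit one LOF (novelty mode) on each class's validation features."""
        Z = norm_summary(F_val)
        for c in np.unique(y_val):
            lof = LocalOutlierFactor(n_neighbors=self.n_neighbors, novelty=True)
            lof.fit(Z[y_val == c])
            self.lofs[c] = lof
        return self

    def predict(self, F):
        """Return 1 for inputs novel w.r.t. every class, else 0."""
        Z = norm_summary(F)
        novel_all = np.ones(len(Z), dtype=bool)
        for lof in self.lofs.values():
            novel_all &= (lof.predict(Z) == -1)  # sklearn: -1 means outlier
        return novel_all.astype(int)
```

An input is declared novel only if all per-class detectors reject it, which yields the tighter per-class decision boundary described above.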
Configuration of G_n: The decision function G_n is chosen to be a neural network with two hidden layers. The number of neurons in G_n varies across datasets, since the dimensionality of the feature vectors and the number of output classes differ between datasets. The training inputs and targets are the recorded features {C_b(v) : v ∈ V} and the corresponding labels. Empirically, we found that two-hidden-layer networks achieved higher classification accuracy on the clean validation dataset than simpler networks, while deeper networks could be prone to over-fitting (although we did not observe any in our studies). The loss function for training G_n is the cross-entropy loss.
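For instance, G_n can be sketched with Scikit-Learn's MLPClassifier, which optimizes cross-entropy loss for classification; the hidden-layer sizes, iteration budget, and toy data here are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_Gn(F_val, y_val, hidden=(160, 160), seed=0):
    """Train the replacement decision function G_n on the recorded clean
    validation features {C_b(v)} and their labels.

    `hidden` gives the two hidden-layer widths (dataset-dependent in the
    paper); MLPClassifier minimizes cross-entropy loss for classification.
    """
    gn = MLPClassifier(hidden_layer_sizes=hidden, activation='relu',
                       max_iter=500, random_state=seed)
    gn.fit(F_val, y_val)
    return gn
```

At detection time, `gn.predict(C_b(x))` supplies the G_n output that is compared against G_b in the fusion function.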

Experimental Setup
This work uses five datasets: MNIST LeCun and Cortes (2010), GTSRB, CIFAR-10, YouTube Face, and ImageNet, summarized in Table 1. The training dataset, partially poisoned, was used for training the backdoored networks. The clean validation dataset was used for training N and G_n. The testing dataset was used to evaluate our method and the baseline methods.

MNIST
Two backdoored networks were trained on MNIST: 1) case a) - the trigger is the pixel pattern shown in the second column of Fig. 3(a), and the attacker-chosen label l* was determined as a fixed function of the ground-truth label l. The architecture of F_b is from Gu et al. (2019), shown in Table 9 in the Supplementary Material. The classification accuracy is 97.24% and the attack success rate is 95.17%. 2) case b) - the trigger is the background pattern shown in the third column of Fig. 3(a), and l* is 0. F_b is from Liu et al. (2020), shown in Table 10 in the Supplementary Material. The classification accuracy is 89.1% and the attack success rate is 100.0%.

GTSRB
Three backdoored networks were prepared on the GTSRB dataset: 1) case c) - the trigger is the white box shown in the second column of Fig. 3(b), and l* is 33. F_b is from Wang et al. (2019), with the architecture shown in Table 11 in the Supplementary Material. The classification accuracy is 96.51% and the attack success rate is 97.40%. 2) case d) - the trigger is a moving square shown in the third column of Fig. 3(b), with l* = 0. The architecture is shown in Table 3. The classification accuracy is 95.15% and the attack success rate is 99.78%. 3) case e) - F_b uses the same architecture shown in Table 3. Inputs passed through a Gotham filter activate F_b, as shown in the fourth column of Fig. 3(b). The attacker-chosen label is 35. The classification accuracy is 94.70% and the attack success rate is 90.26%.

CIFAR-10
Three backdoored networks were trained: 1) case f) - the trigger is the combination of a box and a circle, as shown in the second column of Fig. 3(c), meaning that F_b outputs the attacker-chosen label 0 only when both shapes appear in the input; either the box or the circle alone will not activate F_b. The architecture is Network in Network (NiN) Lin et al. (2014), shown in Table 12 in the Supplementary Material. The classification accuracy is 88.5% and the attack success rate is 99.84%. 2) case g) - similar to the first, except that the trigger is the combination of a triangle and a square, as shown in the third column of Fig. 3(c). 3) The third network also uses the same architecture, but the trigger is a small perturbation, as shown in the last column of Fig. 3(c). The attacker-chosen label is 0. The classification accuracy is 82.44% and the attack success rate is 91.99%.

YouTube Face
Four backdoored models were trained with the same architecture Sun et al. (2014), shown in Table 13 in the Supplementary Material. 1) case h) - the trigger is sunglasses, as shown in the last column of Fig. 3(d), and l* is 0. The classification accuracy is 97.77% and the attack success rate is 99.99%. 2) case i) - the trigger is lips with red lipstick, as shown in the second column of Fig. 3(d), and l* is 0. The classification accuracy is 97.18% and the attack success rate is 91.46%. 3) case j) - F_b has all three triggers: lipstick, eyebrow, and sunglasses, as shown in Fig. 3(d), with l* = 4 for all the triggers. The classification accuracy is 95.90% and the attack success rates are 92.1%, 92.2%, and 100% for lipstick, eyebrow, and sunglasses, respectively. 4) case k) - F_b has all three triggers as well; l*, however, is 1 for lipstick, 5 for eyebrow, and 8 for sunglasses. The classification accuracy is 95.94% and the attack success rates are 91.5%, 91.3%, and 100% for lipstick, eyebrow, and sunglasses, respectively.

ImageNet
To demonstrate scalability to large datasets, the final backdoored network was trained on ImageNet with the red box trigger shown in the second column of Fig. 3(e). In Table 4, "Clean" shows classification accuracy and "Poison" shows attack success rate; the baseline defenses are given oracular knowledge.

Experimental Results
The clean validation datasets were used to train the novelty detector N and the new decision function G_n for each case. The performance of N and G_n is shown in Table 2.

MNIST
The architecture of G_n is: 512 → 160, ReLU, 160 → 160, ReLU, and 160 → 10. The results are shown in Table 4, cases a) and b). Our approach reduces the attack success rate to a low value with a small drop in classification accuracy. Neural Cleanse, however, mis-identifies F_b as a non-backdoored network in case a). For case b), Neural Cleanse detects multiple attacker-chosen labels, so we give Neural Cleanse oracular knowledge of the correct attacker-chosen label(s). Fine-Pruning leaves a high attack success rate in both cases, and STRIP has a high False Acceptance Rate (FAR) in both cases. Therefore, for the MNIST cases, our approach outperforms the prior works.

GTSRB
The architecture of G_n is: 512 → 64, ReLU, 64 → 64, ReLU, and 64 → 43. The results are shown in Table 4, cases c), d), and e). For cases c) and d), our approach reduces the attack success rate to 1.5% and 0%, respectively, whereas the other approaches are far less effective. For case e), our approach leaves a high attack success rate, but we show in a later experiment that it can be further reduced to 7.75% with data augmentation, as shown in Table 7. In the GTSRB cases, our approach outperforms the other approaches by maintaining a consistently low attack success rate.

CIFAR-10
The architecture of G_n is: 640 → 160, ReLU, 160 → 160, ReLU, and 160 → 10. The results are shown in Table 4, cases f) and g). Neural Cleanse fails in both cases, either mis-identifying F_b as a non-backdoored network or leaving a high attack success rate. Similarly, Fine-Pruning and STRIP fail on case g), with a high attack success rate and a high FAR, respectively. Our approach, however, reduces the attack success rate to almost zero while maintaining high classification accuracy. Case f) was also used to test ABS Liu et al. (2019), wherein ABS mis-identified F_b as a clean network.

YouTube Face
The architecture of G_n is: 160 → 1600, ReLU, 1600 → 4800, ReLU, and 4800 → 1283. The results are shown in Table 4, cases h), i), j), and k). Fine-Pruning fails in all cases, leaving a high attack success rate. After applying STRIP, the attack success rate is still high (> 10%). Neural Cleanse performs well only on case j), given oracular knowledge; for the other cases, the attack success rate remains high. Our approach, however, reduces the attack success rate in all cases while maintaining reasonable classification accuracy. Note that the classification accuracy of our approach drops more on this dataset than on the others because its validation dataset is small, with only 9 images for each of the 1283 labels.

Comparison with RTLL
We further compared our approach with Re-Training the Last Layer (RTLL) Adi et al. (2018) on several cases. The difference between RTLL and our approach is that RTLL neither uses a novelty detector nor changes the network structure; our results show that both changes are necessary. The results are shown in Table 5. Although RTLL decreases the attack success rate in cases h) and i), the classification accuracy drops substantially. Additionally, RTLL fails on cases d) and f). Our approach outperforms RTLL by reducing the attack success rate to almost 0 with only a small drop in classification accuracy.

Robustness to Imperceptible Trigger
We trained a backdoored network with small perturbations (only one pixel at each corner) as the trigger (i.e., the third F_b in the CIFAR-10 cases).

Large-Scale Dataset
We also tested our method on the full ImageNet dataset (1000 classes) by creating a backdoored ImageNet model based on DenseNet-121. The dataset and trigger are shown in Fig. 3(e). G_n is simply a linear layer from 1024 to 1000. We observe that our defense reduces the attack success rate to 12%, while the classification accuracy drops from 72% to 63%. This is similar to the drop in classification accuracy on the YouTube Face dataset, which also has more than 1000 classes (and therefore a small amount of clean validation data per label). However, ImageNet is a much larger and more complex dataset than YouTube Face. Thus, we conclude that the size and complexity of the dataset are not limiting factors for our method, although the number of classes does impact classification accuracy.

Training with Augmented Validation Data
From Table 4, although our approach significantly reduces the attack success rate in all the cases, the reduction is smaller for case e) (only down to 69%). We observe that by adding standard Gaussian noise to the clean validation dataset and training the novelty detector N and the new decision function G_n with the augmented data, the attack success rate can be reduced substantially, as shown in Table 7: it falls below 6% in every case except e), where it is 8%, with only a small drop in classification accuracy. However, as we demonstrate next, the drop in accuracy can be mitigated if we update the novelty detectors online.
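The augmentation step can be sketched as follows; the noise scale, the number of noisy copies, and the [0, 1] pixel clipping are illustrative assumptions:

```python
import numpy as np

def augment_with_noise(X, y, copies=1, sigma=0.1, seed=0):
    """Augment the clean validation set with Gaussian-noise-perturbed
    copies before retraining N and G_n.

    X: array of images scaled to [0, 1]; `copies` noisy copies are appended
    to the originals; sigma is an illustrative noise scale.
    """
    rng = np.random.default_rng(seed)
    noisy = [X] + [np.clip(X + rng.normal(0.0, sigma, X.shape), 0.0, 1.0)
                   for _ in range(copies)]
    return np.concatenate(noisy), np.concatenate([y] * (copies + 1))
```

N and G_n are then retrained on the returned arrays exactly as before, just with the larger augmented validation set.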

Retraining with Poisoned Data
With the aid of a human expert to relabel the poisoned data, we can retrain a new decision function (i.e., only the classifier) on the poisoned test inputs identified during the online implementation of our defense. We used the detected poisoned inputs in the first 20% of the test data, as well as the clean validation data, for retraining; the remaining 80% of the test dataset was used to evaluate the retrained models. After retraining, classification accuracy improves while a low attack success rate is maintained (Table 8). We empirically observed that retraining with a small portion of the poisoned input data (10-20%) was sufficient to achieve better classification accuracy, removing the need to continually retrain the network. This online defense works only with the aid of a human expert labeler; nevertheless, even our offline defense outperforms the state-of-the-art.
For the ImageNet case, if 8% of the training dataset (50 images per label) is utilized as the validation dataset, the classification accuracy is ≈ 63% and the attack success rate drops to 0.14% with retraining.

Conclusions
A novel feature-based anomaly detection method with minimal assumptions on the backdoor attack was proposed. The approach requires only a small clean validation dataset and is computationally efficient. Several experiments were conducted and the performance was compared with state-of-the-art algorithms. The results show that our approach outperforms the state-of-the-art by achieving a lower backdoor attack success rate on poisoned inputs while maintaining high classification accuracy on clean inputs. Additionally, our defense is formulated in terms of features and outputs rather than internal network structure, and therefore also applies to models other than neural networks.