Cassandra: Detecting Trojaned Networks from Adversarial Perturbations

Deep neural networks are being widely deployed for many critical tasks due to their high classification accuracy. In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models. These malicious behaviors can be triggered at the adversary's will and hence pose a serious threat to the widespread deployment of deep models. We propose a method to verify whether a pre-trained model is Trojaned or benign. Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients. Inserting backdoors into a network alters its decision boundaries, which are effectively encoded in its adversarial perturbations. We train a two-stream network for Trojan detection from its global ($L_\infty$ and $L_2$ bounded) perturbations and the localized region of high energy within each perturbation. The former encodes the decision boundaries of the network and the latter encodes the unknown trigger shape. We also propose an anomaly detection method to identify the target class in a Trojaned network. Our methods are invariant to the trigger type, trigger size, training data and network architecture. We evaluate our methods on MNIST, NIST-Round0 and NIST-Round1 datasets, with up to 1,000 pre-trained models, making this the largest study to date on Trojaned network detection, and achieve over 92\% detection accuracy to set the new state-of-the-art.


Introduction
Deep neural networks (DNNs) are the main driving force behind the current success of Artificial Intelligence. However, training DNN models requires enormous amounts of data and computational resources. Hence, many users prefer to source and deploy pre-trained models in their, often security critical, applications such as drug discovery [1,2], facial recognition [3], autonomous driving [4], and surveillance [5]. It is well known that DNNs easily learn any bias that is present in the training data. Vendors of DNN models with malicious intentions can exploit this vulnerability and intentionally inject Trojan behavior into the network during the training process. This is generally achieved by inserting a trigger into some of the samples and then training the DNN to exhibit malicious behavior for data that contains the trigger and normal behavior for data without the trigger. With full control over the DNN training process, the adversary is able to choose any trigger shape. Triggers are chosen such that they do not appear suspicious to a human observer, e.g., a yellow rectangular sticker on a stop sign can be used to trigger a DNN to classify it as a speed limit sign. Since only the adversaries have knowledge of the trigger, they can initiate malicious behaviour at will, and without knowledge of the trigger, users of pre-trained models may not even suspect the presence of backdoors. This poses a serious threat to the widespread deployment of pre-trained models. Note that attacking Trojaned DNNs is much easier than, and different from, adversarial attacks on clean DNNs, since the former assumes access to the DNN training process itself, while the latter only exploits intrinsic vulnerabilities of neural networks [6].
Given that noise-based adversarial attacks are inherent to CNN models, it is no surprise that trigger-based Trojan attacks also exist [7,8]. Trojans are generally inserted into a deep model during training or transfer learning [9,10,11,12]. A backdoor is typically inserted into a network [13] to make the CNN mis-classify some specific class or classes. Instead of training a model with a dataset poisoned with triggers, another way the adversary can Trojan a network is by modifying the weights of selected neurons so that the model responds maliciously to a specific trigger [10].
Current challenges for Trojan (backdoor) detection in practice are: 1) the lack of a deep learning-based model trained on a large-scale dataset for Trojan detection; 2) the unavailability of trigger information for a suspected Trojan infected model, since usually only a limited amount of clean training data is available; 3) the very limited information that can be obtained from the query model's predictions, since Trojaned DNNs achieve normal test accuracy on clean inputs; and 4) the unknown target class of the infected model, which makes it computationally expensive to search over all possible targeted attacks when the output labels number in the hundreds.
To address these challenges, we propose the first deep learning-based Trojan Detection Network (TDN). Our method has two stages: the first is a two-stream neural network that outputs the probability of a model containing a Trojan, and the second predicts the target class of a Trojaned model. Our contributions are summarized as follows. First, we propose a deep neural network for Trojan detection that requires only a few clean samples. To the best of our knowledge, we are the first to use a DNN classifier, trained on a large-scale dataset of benign and Trojaned models, for Trojan detection. Second, we propose a method for target class prediction in a Trojaned model. We introduce a new variable (γ) that quantifies the difficulty of attacking a model; this variable is a critical indicator of the target class of a Trojan infected model.
Theoretical Justification: Inserting Trojan behaviour into a network essentially puts an additional constraint on model optimization during training. The model must exhibit normal behavior and achieve the expected high classification accuracy on clean training/validation samples, but exhibit the chosen malicious behaviour on samples containing a trigger, i.e., a localized pattern. This has two important consequences. Firstly, the decision boundaries of the model must adjust to allow such behavior. Secondly, the model must become more responsive to local patterns (the trigger). Our hypothesis is that if we can encode these two aspects, we will be able to detect Trojaned models accurately. For the former, we use universal adversarial perturbations [14] which, being image agnostic, capture a reasonable fingerprint of the decision boundaries. For the latter, we look for a localized region of high energy in the adversarial perturbation. In addition, we hypothesize that Trojaned models can be fooled with less universal perturbation energy than clean models. Our proposed method capitalizes on these three factors to detect Trojaned networks and the target class of such networks.

Related Work
Adversarial attacks on CNNs have focused on the phenomenon of noise-based adversarial examples [15,16], which are visually almost indistinguishable from the original images but can mislead DNN classifiers into making incorrect predictions. Universal adversarial perturbations [14] have even been discovered that are image agnostic and, when added to any image of any class, cause the DNN to mis-classify it. By computing singular vectors of the Jacobian matrices of hidden layers, universal perturbations can be constructed from very few images [17]. Adversarial attacks generally do not assume access to the training process of deep models; a comprehensive survey of such methods is given in [6]. In this paper, we focus on defending against Trojan attacks, where the attacker disrupts the training pipeline of the DNN to insert a backdoor.
The risk of Trojan models arises when the training process of a DNN is outsourced or a pre-trained model from an untrusted source is deployed. This security risk was first investigated in Badnets [13]. It was shown that backdoors in networks infected with Trojans can remain a threat even after transfer learning. Chen et al. [12] proposed a backdoor attack algorithm that uses poisoned data to contaminate the CNN model. Trojaning attack [10] introduced a way to generate triggers and maximize the activation of some specific neurons to insert a backdoor. The embedded backdoors are stealthy and the unexpected malicious behavior is activated only by triggers, making them extremely challenging to detect with only clean data samples.
Defense methods were first developed to detect adversarial images [18,19,7,20,21,22]. Metzen et al. [23] detect adversarial perturbations with a target classification network. Feinman et al. [24] also use a binary classifier to detect adversarial perturbations. Magnet [25] trains a classifier on manifolds of normal examples to discriminate adversarial perturbations without any prior knowledge of the attack. SafetyNet [26] is designed to detect adversarial-noise based attacks and exploits the distinct adversarial perturbations they produce to train an SVM classifier.
Methods for detecting and defending against Trojan attacks have also been proposed. Liu et al. [27] proposed a pruning and fine-tuning procedure to suppress backdoor attacks. Chen et al. [28] proposed an Activation Clustering methodology for detecting and removing backdoors from DNNs. SentiNet [29] uses the adversarial misclassification behavior of poisoned networks to detect an attack. However, all these methods fail in the realistic setting where access to poisoned data is not available. Neural Cleanse [30] was the first method to detect Trojan infected models from clean samples by reverse engineering the trigger. It employs the Median Absolute Deviation (MAD) technique to detect Trojaned models from anomalies in the $L_1$ norm of the reversed triggers. However, the trigger must be reverse engineered for each class, which does not scale in practice to DNNs with hundreds or thousands of classes. DeepInspect [31] uses a conditional GAN to reconstruct trigger patterns for Trojan detection. NeuronInspect [32] detects backdoors from output features, such as the sparsity, smoothness, and persistence of saliency maps obtained by back-propagating the confidence scores. Tabor [33] proposes metrics to measure the quality of reversed triggers and achieves better performance than Neural Cleanse by introducing several regularization terms to refine the generated triggers.
The above methods [30,31,32] are sub-optimal because they are not learning-based and rely on the MAD technique with manually tuned anomaly thresholds to detect outliers among the reverse engineered triggers. More importantly, none of these techniques report results on a large-scale dataset of benign/Trojaned models, and none can predict the target class of a Trojaned model. To address these challenges, we propose Cassandra, a Trojan detection method that exploits universal adversarial perturbations [17] generated from a very limited number of clean samples. Given their image-agnostic nature, we compute universal adversarial perturbations from a batch of clean samples, where a batch can contain as few as 5 samples. Note that this holds even if the number of classes is in the thousands, unlike prior work such as Neural Cleanse, where one perturbation per class is necessary. Our method also provides the target class of a Trojan infected model.

Detecting Trojan Infected Models
During training, a neural network simultaneously learns a feature representation and decision boundaries that partition the feature space into the respective classes. When an adversary inserts a backdoor into a network, the decision boundaries are altered. Our hypothesis is that Trojan infected networks exhibit decision boundaries that differ from those of typical, benign classification networks. Our approach exploits this fact by retrieving fingerprints of the decision boundaries of a network and then training a classifier on these fingerprints to label a query network as benign or Trojan infected. We use adversarial perturbations to retrieve the fingerprints of the decision boundaries of the query network. In contrast to image specific perturbations, universal perturbations [14] are image agnostic: the generated perturbation, when added to any input image, sends it across the decision boundary and changes its label. The success of a universal perturbation is measured by its fooling rate, the proportion of images that are mis-classified after the perturbation is added. Since universal adversarial perturbations capture the geometry of the decision boundaries [14], the perturbations for benign and Trojaned models are expected to differ significantly in character.

Fingerprinting Decision Boundaries with Adversarial Perturbations
We formulate Trojan detection as a classification problem. For a query neural network model $f$, we define a Trojan detection classifier $F$ as
$$F(f) = \begin{cases} 1, & \text{if } f \text{ is Trojan infected} \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$
where $f$ outputs a prediction $y$ for each input image $x$ drawn from the distribution $\mu$ of images, and $\hat{f}$ denotes a Trojaned classifier with corresponding prediction $\hat{y}$ for $x$. For a desired threshold $\delta$, we obtain universal perturbations $v$ and $\hat{v}$ for classifiers $f$ and $\hat{f}$, respectively, such that the following holds:
$$\mathop{\mathbb{P}}_{x \sim \mu}\big(f(x+v) \neq f(x)\big) \geq 1-\delta \quad \text{and} \quad \mathop{\mathbb{P}}_{x \sim \mu}\big(\hat{f}(x+\hat{v}) \neq \hat{f}(x)\big) \geq 1-\delta. \tag{2}$$
Note that the observed fooling rate $\eta$ can go much higher than $1-\delta$ during the generation of universal adversarial perturbations. We define the perturbation energy $E$ as
$$E = \lVert v \rVert_1, \quad v = h(\cdot), \tag{3}$$
where the perturbation $v$ is produced by the generation process $h$, parameterized by the attack algorithm and its settings. Let $E_{BA}$ denote the perturbation energy required to push all data samples of class B across the decision boundary into class A for a benign model, and $E_{AB}$ the reverse. Similarly, $\hat{E}_{BA}$ and $\hat{E}_{AB}$ denote the same quantities for a Trojan infected model. In an infected model the decision boundary is altered such that backdoors are created close to other classes. Due to these changes in the decision boundary, $\hat{E}_{AB} < E_{AB}$ and $\hat{E}_{BA} < E_{BA}$ for a given fooling rate (see Fig. 1a,b), and the attack difficulty $\gamma_{AB}$ is proportional to $E_{AB}$, and so on.
We define the notion of attack difficulty, for both universal perturbations and targeted attacks, as
$$\gamma = \frac{E}{S}, \tag{4}$$
where $S$ is the fooling rate ($\eta$) for universal perturbations and the attack success rate for targeted attacks. Universal adversarial perturbations of clean and Trojan infected models are distinguishable above a given fooling rate $\eta$, both visually and in terms of energy, as shown in Figure 1c.
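To make these definitions concrete, the following minimal sketch computes the attack difficulty of a universal perturbation for a given model. It assumes a PyTorch classifier and uses the $L_1$ norm of the perturbation as the energy $E$ (consistent with Eq. 3 and Algorithm 1); the function and variable names are our own, not the released implementation.

```python
import torch

def attack_difficulty(model, images, perturbation):
    """gamma = E / S: L1 energy of the universal perturbation divided by
    the observed fooling rate (eta) on a batch of clean images."""
    with torch.no_grad():
        clean_pred = model(images).argmax(dim=1)
        adv_pred = model(images + perturbation).argmax(dim=1)
    fooling_rate = (adv_pred != clean_pred).float().mean().item()  # S (eta)
    energy = perturbation.abs().sum().item()                       # E = ||v||_1
    return energy / max(fooling_rate, 1e-8)                        # gamma = E / S
```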

Trojan Detector

Figure 2 shows the schematic overview of our proposed Trojan detector, referred to as Cassandra. The query network, along with the clean labelled training data, is used to generate two types of universal perturbations, i.e., those bounded by the $L_\infty$ and $L_2$ norms. Note that we do not assume the presence of triggered images in the training data, since triggers are unknown in a realistic scenario. The $L_\infty$ universal perturbations are fed to one stream of the network together with their corresponding attack difficulty, $\gamma_\infty$. Similarly, the $L_2$ norm bounded universal perturbations and their attack difficulty, $\gamma_2$, are fed to the second stream of the Trojan detection network. The feature extractor in Fig. 2 is described below.

Perturbation Generator: Since the target class of the (potentially Trojan infected) query network is unknown, we compute universal adversarial perturbations (Eq. 2) [14] that cause mis-classification of any input image. The DeepFool [34] kernel is used for perturbation generation. A batch of training images is passed to the query network, the direction of the nearest decision boundary is computed, and this is back-propagated to compute a small $L_\infty$ or $L_2$ bounded perturbation for the input. By iteratively refining the perturbation over different mini-batches, a universal (image agnostic) adversarial perturbation is obtained. We stop the iterations when the universal perturbation achieves the fooling-rate threshold $1-\delta$ or a maximum number of iterations is reached. The generated perturbations are then sent to their respective feature extractor streams for further processing; a simplified sketch of this loop is given below.
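The sketch assumes a PyTorch model and data loader; the single-step sign-of-gradient update, the 0.01 step size, and the `epsilon` bound are simplifications of the DeepFool-based kernel, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, loader, epsilon=0.1, delta=0.2,
                           max_iters=20, norm="linf"):
    """Iteratively refine a universal perturbation v until the fooling rate
    reaches 1 - delta or max_iters passes over the data are exhausted."""
    v = None
    for _ in range(max_iters):
        fooled, total = 0, 0
        for images, _ in loader:
            if v is None:
                v = torch.zeros_like(images[:1])           # one shared perturbation
            with torch.no_grad():
                clean_pred = model(images).argmax(dim=1)
            x = (images + v).clone().requires_grad_(True)
            loss = F.cross_entropy(model(x), clean_pred)    # push away from current labels
            grad = torch.autograd.grad(loss, x)[0].mean(dim=0, keepdim=True)
            v = v + 0.01 * grad.sign()                      # simplified ascent step
            if norm == "linf":                              # project onto the norm ball
                v = v.clamp(-epsilon, epsilon)
            else:
                v = v * min(1.0, epsilon / (v.norm(p=2).item() + 1e-12))
            with torch.no_grad():
                fooled += (model(images + v).argmax(dim=1) != clean_pred).sum().item()
                total += images.size(0)
        if total and fooled / total >= 1 - delta:           # fooling-rate stopping rule
            break
    return v.detach()
```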
Feature Extractor contains two parallel modules, one for each perturbation stream (top right in Fig. 2), whose features are combined to predict the probability that the query model is Trojaned. To fully capture properties of the complex decision boundaries, we divide the training data into 10 batches and obtain 10 probabilities for each query model. The final score is computed as the mean of these 10 probabilities, as sketched below.
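Here `perturb_fn` stands in for the perturbation generator (returning a perturbation and its attack difficulty) and `trojan_detector` for the trained two-stream network; both names are placeholders rather than the released implementation.

```python
import torch

def trojan_probability(query_model, clean_data, trojan_detector, perturb_fn,
                       n_batches=10):
    """Average P(Trojan) over perturbations generated from n_batches
    batches of clean data (the final score described above)."""
    probs = []
    for batch in torch.chunk(clean_data, n_batches):
        v_inf, gamma_inf = perturb_fn(query_model, batch, norm="linf")
        v_l2, gamma_l2 = perturb_fn(query_model, batch, norm="l2")
        with torch.no_grad():
            p = trojan_detector(v_inf, gamma_inf, v_l2, gamma_l2)  # per-batch P(Trojan)
        probs.append(float(p))
    return sum(probs) / len(probs)                                  # final score
```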

Target Class Prediction
We propose targeted attack difficulty as a metric for outlier class prediction in a Trojan infected model. An outlier class is one that is easier to launch a targeted attack against than the other classes, and is hence most likely the target class of the Trojan infected model.
Attack difficulty is defined as γ = E/S, where E is the perturbation energy (Eq. 3) and S is the attack success rate of the targeted attack, i.e., the proportion of images whose predictions change to the target label. Attack difficulty (or its reciprocal, attack efficiency) measures the perturbation energy normalized by the success rate of the attack. Given a query model, we use the Fast Gradient Sign Method (FGSM) [36], chosen for its fast execution time, to compute targeted adversarial perturbations for each class. For example, NIST-Round0 data contains five class labels and the Trojaned models classify triggered images of any class as class 0. In this case, class 0 is the target class and the attack is called an "any-to-one" targeted attack. Fig. 3 shows that the proposed attack difficulty correctly detects the target class of the Trojan attack as an outlier, whereas the $L_1$ norm used for Trojaned model detection in Neural Cleanse [30] fails. We finalize our target class prediction with a two-stage method. The first stage is our Trojan detection network, which outputs the probability of the model being infected with a Trojan; the second stage is outlier detection based on the Median Absolute Deviation (MAD) [37,30] for predicting the target class. The anomaly index for outlier detection is defined as the absolute deviation of a data point from the median, normalized by the median absolute deviation, and measures the dispersion of the data distribution. For Trojaned models, the second stage selects the label whose anomaly index exceeds a threshold as the predicted target class. A sketch of the per-class attack difficulty computation is given below.
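The sketch uses a single-step targeted FGSM perturbation per candidate class and assumes a PyTorch classifier; the function names, loss choice and `epsilon` value are ours, not the paper's exact attack configuration.

```python
import torch
import torch.nn.functional as F

def per_class_attack_difficulty(model, images, num_classes, epsilon=0.05):
    """For each candidate target class, run a targeted one-step FGSM attack and
    return gamma = L1 norm of the perturbation / attack success rate."""
    difficulties = []
    for target in range(num_classes):
        target_labels = torch.full((images.size(0),), target, dtype=torch.long)
        x = images.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), target_labels)
        grad = torch.autograd.grad(loss, x)[0]
        perturbation = -epsilon * grad.sign()            # step towards the target class
        with torch.no_grad():
            success = model(images + perturbation).argmax(dim=1) == target_labels
        success_rate = success.float().mean().item()      # S
        energy = perturbation.abs().sum().item()          # E = L1 norm
        difficulties.append(energy / max(success_rate, 1e-8))
    return difficulties    # anomalously low value -> likely Trojan target class
```

Note that with a fixed-size single step the energy term is the same for every candidate class, so the ranking above is effectively driven by the success rate; an iterative attack that stops as soon as each image reaches the target class would make the energy term vary per class as well.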

Experiments
For all experiments, we perform 5-fold cross validation and report average results. In Eq. 2, δ is set to 0.2, i.e., the desired fooling rate is at least 0.8. We use the Adam optimizer with a learning rate of 0.001. The constant estimator for the MAD outlier detector is 1.4826, so that any data sample with an anomaly index larger than 2 has > 95% probability of being an outlier. We use an anomaly index threshold of 2, such that class labels with an anomaly index larger than 2 are considered the target class. For training we use a server with 6 Nvidia RTX 2080 Ti GPUs. Perturbation generation and training for the NIST-Round1 data take around 12 hours. Inference for each model takes about 560 s on an Nvidia RTX 2080 Ti, and inference for 200 models finishes within 24 hours. The MAD-based anomaly index used in the second stage can be computed as in the sketch below.
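The 1.4826 consistency constant and the threshold of 2 follow the values stated above; the per-class difficulty values in the usage example are made up for illustration.

```python
import numpy as np

def anomaly_indices(values, consistency=1.4826):
    """Absolute deviation from the median, normalized by the scaled MAD."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = consistency * np.median(np.abs(x - med))
    return np.abs(x - med) / (mad + 1e-12)

# Per-class attack difficulties (illustrative): class 2 is far easier to attack.
scores = anomaly_indices([5.1, 4.8, 0.9, 5.3, 5.0])
# The target class is the one with an anomalously LOW difficulty whose
# anomaly index exceeds the threshold of 2.
target = int(np.argmax(scores)) if scores.max() > 2 else None
print(scores.round(2), target)   # class 2 is flagged as the Trojan target
```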

Datasets
We evaluate our proposed approach on a dataset of trigger infected models for classifying images from MNIST, and on the public NIST-Round0 and NIST-Round1 datasets. We refer to the dataset of trigger infected MNIST classification models as the Triggered MNIST dataset throughout. Code to generate the Triggered MNIST dataset was taken from the TrojAI GitHub repository, and the NIST datasets were obtained from the TrojAI challenge website.
Triggered MNIST Dataset: Two types of triggers, Type I and II (see Fig. 4), are inserted into clean MNIST images to generate the triggered data. A total of 900 models with 3 architectures (ModdedBadNet, BadNet and ModdedLeNet5) are generated: 300 benign models, 300 trained for an any-to-any attack, and 300 trained for an any-to-one targeted attack. Details of the models and their performance on clean and triggered data are given in the supplementary material.
NIST Datasets: The NIST datasets consist of traffic sign classification models (half benign and half Trojaned) with 3 possible architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created images of artificial traffic signs superimposed on road background scenes. The Trojan infected models are poisoned with an unknown embedded trigger.
NIST-Round0 and NIST-Round1 datasets are both drawn from the same distribution; the main difference is that Round0 consists of 200 models, while Round1 has 1,000 models. Details of the models in the NIST datasets, including their accuracy and attack success rates, are provided in the supplementary material. Clean data samples used to train the NIST models are shown in Figure 5.

Results
In Table 1, we report the classification accuracy for a variety of training and test set model configurations on the Triggered MNIST dataset. The classification accuracy is consistently high for both Type I (93.3%) and Type II (91.7%) triggers. Even when training and test models are infected by different trigger types, the method still achieves a high classification accuracy of 91.7% (or 90%), which shows that it is independent of the trigger type. We achieve 94.4% accuracy for the configuration where both trigger types are present, in equal proportions, in the training and test sets (Table 1, last row). The NIST datasets are more challenging than Triggered MNIST, not only in terms of the trigger types, color and size of the data used to train the infected models, but also because the NIST models are much deeper. Our method obtains high classification accuracies of 92.5% on NIST-Round0 and 92.0% on NIST-Round1.

Table 2 shows results of our method on the Triggered MNIST, NIST-Round0 and NIST-Round1 datasets and compares them to Neural Cleanse [30]. Our proposed Trojan Detection Network outperforms Neural Cleanse on all three datasets by large margins of 17.8%, 25% and 18%, respectively. This can be attributed to two reasons. Firstly, it is difficult for Neural Cleanse to find an optimal anomaly index threshold. Secondly, reverse engineering the trigger does not perform well when the triggers are complex.

Table 3 shows our target class prediction results. The proposed two-stage prediction algorithm, based on the attack difficulty and the predicted P(Trojan), improves the classification accuracy significantly over the baseline (without P(Trojan)), from 76.1%, 72.5% and 70.0% to 90.0%, 94.7% and 88.1% on the Triggered MNIST, NIST-Round0 and NIST-Round1 datasets, respectively. Using the ground truth P(Trojan) further improves classification accuracy, which demonstrates that attack difficulty is a critical indicator of the target class.

Ablation Study
Trojan Detector Network Modules: In Table 4, we explore different network architectures and the contribution of individual modules of our method. Using only universal perturbations computed from the complete training data of NIST-Round0, we achieve 77.5% classification accuracy. After dividing the training data into 10 batches (these are different from the training mini-batches), we generate 10 perturbations for each model; with these 10 perturbations, the accuracy improves to 85%. Adding the attack difficulty further improves the classification accuracy in all cases. Finally, with the multi-batch and two-stream architecture we achieve 92.5% classification accuracy.

Table 4: Effects of using multiple perturbations and attack difficulty. Trojan detection accuracy on the NIST-Round0 validation data improves significantly after using multiple perturbations (n=10) calculated from different batches of training data. Using $L_\infty$ and $L_2$ perturbations in a two-stream architecture combined with attack difficulty (γ) further improves the accuracy.

Universal Perturbation Generator Hyper-parameters: The choice of hyper-parameters may impact the effectiveness of the generated universal adversarial perturbations. However, our experiments show that the proposed method is robust to these parameters. We compare the mean classification accuracy when using different numbers of iterations and magnitudes for the $L_2$ and $L_\infty$ bounded universal adversarial perturbations, and find that the Trojan detection accuracy varies only slightly, as shown in Table 5.

Conclusion
We proposed the first deep learning-based method for detecting Trojan infected models that is trained on a large-scale dataset of Trojaned and clean models. We exploit universal adversarial perturbations to retrieve fingerprints of Trojans in DNNs and train the proposed TDN on features of these perturbations together with the attack difficulty to discriminate benign from Trojaned models. We also proposed a simple variable, coined attack difficulty (γ), that measures the energy needed to achieve an average unit fooling rate. Based on the attack difficulty, we proposed a two-stage target class prediction method that predicts the target class of a Trojaned model in addition to the Trojan probability. This provides further information on the type of malicious behaviour embedded in a Trojan infected model, e.g., which identity is being impersonated in a Trojaned face recognition model.

Supplementary Material
TrojAI Leaderboard Results on NIST-Round0 Dataset

Figure 6 shows a snapshot of the TrojAI Leaderboard for NIST-Round0 (see Section 7 for the dataset description). These results are compiled by the NIST server using a held-out test set that is not publicly available. The snapshot was taken at 21:30 hours on 10 June 2020 and adjusted to fit on this page without altering the results. Cassandra outperforms all other competitors by a significant margin. The best results in terms of Cross-Entropy Loss and ROC-AUC for each method are repeated in Table 6. Notice that we have the lowest loss and the highest ROC-AUC.

MNIST Model Generation

Clean Model Generation
The data is split into a training set of 60,000 images (6,000 per class) and a test set of 10,000 images (1,000 per class). The clean data is used to train 300 benign models of three architecture types (ModdedBadnet, Badnet and ModdedLenet5net), with 100 models each (see Table 7).

Trojaned Model Generation
Clean Data: The MNIST dataset has 10 classes with 70,000 clean images (without triggers).
Models: In addition to the 300 benign models, another 600 Trojaned models of the same three architectures (ModdedBadnet, Badnet and ModdedLenet5net) were generated: 300 trained for an any-to-any attack and 300 for an any-to-one targeted attack. The Trojaned models were trained on a mix of Triggered MNIST data and clean data, where the proportion of triggered data was varied as 10%, 15% and 20%. Table 7 shows the details of both clean and infected models trained for the any-to-any Trojan attack; any-to-one attack models were generated in the same way.
Evaluations of the ModdedBadnet, Badnet and ModdedLeNet5 models are shown in Table 8 for the any-to-any attack and in Table 9 for the any-to-one targeted attack. Both clean and Trojaned models have high classification accuracy on clean test data. The clean models also have high classification accuracy on triggered test data; since there is no Trojan in a clean model, the triggered image samples are correctly classified. However, for the Trojaned models, the classification accuracy on triggered data (100 − Attack Success Rate) is low, since the triggered images are mis-classified. The tables report only the Attack Success Rate on the triggered data, which is very high. These results imply that the Trojan (backdoor) was successfully inserted into the models.

NIST Round0 and NIST Round1 Datasets
The NIST datasets consist of CNN classification models for traffic signs. Half of the models are benign and half are Trojaned. The models have three architectures, namely Inception-v3, DenseNet-121, and ResNet50, and were trained on synthetically created images of artificial traffic signs superimposed on road background scenes. The Trojaned models have been poisoned with triggers of different color, size and shape. The Round0 dataset consists of 200 models, while the Round1 dataset has 1,000 models. NIST also holds a sequestered test dataset to evaluate detection methods; for that, models must be uploaded to the TrojAI Leaderboard website. Section 7 and Table 6 discuss our results on the TrojAI Leaderboard. Table 10 and Table 11 show the model details and the performance of the three architecture types present in the NIST Round0 and Round1 datasets. Notice that the Trojan infected models have accuracy on par with the clean models and yet have a very high attack success rate on triggered data.

Target Class Detection Algorithm

The procedure for target class prediction is given in Algorithm 1.

Data: Query model
Result: P(Trojan) and Target Class

Stage One: use the Trojan Detection network to obtain P(Trojan);
if P(Trojan) >= 0.5 then
    for C_i ← 0 to C do
        use FGSM to compute an adversarial perturbation with C_i as the target class;
        compute the attack difficulty γ_i = L1Norm / FoolingRate for this perturbation;
    end
    TargetClass ← outlier detection over the attack difficulties γ_i;
    output P(Trojan) and the predicted target class;
else
    output P(Trojan) and target class = None;
end

Algorithm 1: Two-stage method to detect a Trojan infected model and predict its target class using only clean image samples.