ML-Based Trojan Classification: Repercussions of Toxic Boundary Nets

Machine learning (ML) algorithms were recently adapted for testing integrated circuits and detecting potential design backdoors. Such testing mechanisms mainly rely on the available training dataset and the extracted features of the Trojan circuit. In this letter, we demonstrate that this method is attackable by exploiting a structural problem of classifiers for hardware Trojan (HT) detection in gate-level netlists, called the boundary net (BN) problem. There, an adversary modifies the labels of those BNs connecting the original logic to the Trojan circuit. We show that the proposed adversarial label-flipping attacks (ALFAs) are potentially highly toxic to the accuracy of supervised ML-based Trojan detection approaches. The experimental results indicate that an adversary needs to flip only 0.09% of all labels to achieve an accuracy drop of over 9%, demonstrating one of the most efficient ALFAs in the HT detection research domain.


I. INTRODUCTION
Over recent years, the domain of hardware Trojan (HT) insertion and detection has received increasing attention. Such a threat is considerable and pivotal, as the modern semiconductor supply chain increasingly relies on outsourced elements developed by third parties, spanning all design-process steps, including tools and intellectual property (IP) cores. Therefore, trustworthy testing of integrated circuits (ICs) or electrical/electronic (E/E) products should be paramount for every designer. The key idea is to test and evaluate the trustworthiness of ICs already inside the design house, not later [1]. Nevertheless, a Trojan circuit is essentially designed to evade detection by automated testing tools [2]. A Trojan circuit, consisting of trigger and payload parts, can be injected during pre- and post-silicon IC production stages. HTs inserted into netlists are a severe issue [2], and IC verification engineers must develop efficient detection techniques as countermeasures. Machine learning (ML) methods have shown high accuracy in detecting HTs [3]. For instance, support vector machines (SVMs) [4], random forest classifiers (RFCs) [5], and further algorithms [3] have been employed to detect HTs. In these approaches, the ML model of choice is trained to classify a given wire as either Trojan-free (Normal) or as part of a Trojan circuit. Multiple metrics have been proposed to evaluate the quality of such ML models [3]. In [6], the best results of all reviewed ML approaches for HT detection exhibit a true positive rate (TPR) between 72.5% and 85.3%, while the TPR of supervised ML approaches ranges from 68.2% to 99.9%.
Most of the proposed works on HT classifiers in the netlist-testing domain do not cover and evaluate the classifiers' security. Recently, an adversarial example (AE) attack was successfully performed in [7] and [8] against the multilayer neural networks for HT detection at gate-level netlists proposed in [9]. An AE attack aims to add some noise, a so-called perturbation, to evade the HT classifier. The results show that the TPR drops by at most 30.15%. An adversarial label-flipping attack (ALFA) was performed against the categorical boosting (CatBoost)-based HT classifier in [10]. ALFA is considered a causative attack, where the attacker aims at degrading the accuracy of the HT classifier by influencing and altering the training process [11]. The attack results show that the classifier's average accuracy drops by 58.5% if the attacker flips 20% of the sample labels.

A. Article Contributions and Organization
This letter investigates a structural problem of HT classifiers called the boundary Trojan net labeling problem (BTP). In particular, we demonstrate that labeling boundary nets (BNs) is practically toxic to HT-classifier quality, and an adversary can exploit the BTP's impact to perform ALFAs against HT classifiers. Thus, we carry out four different experiments to prove the attack's efficiency by using the benchmarks introduced in [12] and [13] and made available on Trust-Hub for research use [14]. To the best of our knowledge, this work is the first proposal exploiting the BTP to perform successful ALFAs against HT classifiers.
The remainder of this letter is organized as follows. Section II outlines the structural problem of supervised ML classifiers to verify the integrity of gate-level netlists when BN labels are applied incorrectly. Four attack scenarios are presented in Section III, followed by an analysis of the attack's impact on multiple classifiers in Section IV. Section V provides a conclusion with a short result discussion.

II. STRUCTURAL PROBLEM OF HT CLASSIFIERS

A. Construction of HT Classifier
The construction of an HT classifier consists of a training and an inference phase. The training phase comprises several stages: a verification engineer first prepares a set of infected netlists. All features of every net in each netlist are extracted in the feature-extraction stage. Then, every net is labeled as either Trojan or Normal in the labeling stage. Finally, an ML model is trained on all extracted features, resulting in an HT classifier.
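The training and inference phases described above can be sketched with a standard library such as scikit-learn; the two net features and all values below are illustrative placeholders, not the actual features of [5]:

```python
# Sketch of HT-classifier construction: feature extraction, labeling,
# training, and inference. Data is hypothetical example material.
from sklearn.ensemble import RandomForestClassifier

# Feature-extraction stage: one row per net, columns are net features
# (placeholders); labeling stage: 0 = Normal net, 1 = Trojan net.
X_train = [
    [2, 1], [3, 2], [2, 3],   # nets labeled Normal
    [6, 9], [7, 8], [5, 9],   # nets labeled Trojan
]
y_train = [0, 0, 0, 1, 1, 1]

# Training stage: the fitted model is the HT classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Inference phase: classify the nets of a new, untested netlist.
print(clf.predict([[2, 2], [6, 8]]))  # predicted labels for two unseen nets
```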
When a new netlist needs to be tested during the inference phase, the verification engineer uses this HT classifier to verify whether the provided netlist is Trojan-free.

B. Boundary Trojan Net Problem: Definition
All nets only connected to Trojan cells (logic gates and registers) are called internal Trojan nets. All nets only connected to Trojan-free cells are called Normal nets. The fuzzy area at the edge of a Trojan circuit is called the boundary area. There, all nets are attached to both the Trojan circuit and the Normal circuit. Such nets are called BNs [15], [16]. Fig. 1 shows those BNs (green) that lie between the Trojan (red) and Normal (black) circuit.
In the context of supervised ML, the location of a BN determines its label, i.e., whether it is a Trojan or a Normal net, formalizing the boundary Trojan net problem (BTP). Correct labeling of BNs is highly ambiguous [15]. For instance, if the BNs are neglected and considered Normal nets as in [5], this labeling procedure leads to BN misclassification. Therefore, a correction mechanism of the classification is needed as an extra step [16] to overcome the BTP. This makes the HT classifier inefficient in terms of time complexity. Another labeling procedure considers all BNs as Trojan nets [15], which results in better HT-classification accuracy. However, it significantly increases the size of the Trojan circuit and imprecisely labels some Normal nets as Trojan nets. Consequently, the BTP is a structural problem of HT classifiers due to the unclear and non-obvious way to label BNs. We therefore exploit the BTP to introduce and perform ALFAs on HT classifiers.
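The three net categories can be captured in a few lines; the tiny netlist and the set of Trojan cells below are hypothetical example data, not drawn from the benchmarks:

```python
# Minimal sketch of the net categories defined above. A net is modeled
# as the set of cells it connects; `trojan_cells` is assumed known from
# the infected reference netlist (hypothetical example data).
trojan_cells = {"T1", "T2"}

netlist = {                 # net -> cells it connects
    "n1": {"G1", "G2"},     # only Trojan-free cells -> Normal net
    "n2": {"T1", "T2"},     # only Trojan cells      -> internal Trojan net
    "n3": {"G2", "T1"},     # both                   -> boundary net (BN)
}

def categorize(cells, trojan_cells):
    infected = cells & trojan_cells
    if not infected:
        return "normal"     # no Trojan cell attached
    if infected == cells:
        return "trojan"     # all attached cells are Trojan cells
    return "boundary"       # mixed attachment: lies in the boundary area

labels = {net: categorize(cells, trojan_cells) for net, cells in netlist.items()}
print(labels)
```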

III. BTP EXPOSING ALFAS AGAINST HT CLASSIFIERS
This section introduces several ALFA scenarios against HT classifiers.

A. Threat Model
In this work, the adversary has complete knowledge of the labeling stage. We assume the adversary attacks the HT classifier in this stage by flipping only BN labels, as shown in Fig. 2. The adversary hence relabels the BNs manually, via malicious software [10], or even via an untrusted EDA tool [3], [17] to cause a misclassification of some nets in the later inference phase. A verification engineer could then mistakenly classify Trojan nets as Normal, so that the malicious circuit will not be removed from the netlist. Consequently, the resulting IC will have severe security issues.

B. Possible ALFA Scenarios Based on BTP
We propose four BN label-flipping scenarios: the first two are deterministic procedures, while the third and fourth are random. Particularly, these label-flipping scenarios are equivalent to performing ALFAs on HT classifiers.

1) ALFA.1:
The attacker flips all BNs to be Trojan Nets.

2) ALFA.2:
The attacker flips all BNs to be Normal Nets.

3) ALFA.3:
The attacker deploys some random rules of BN labeling, such as: every BN linked to a flip-flop from the trigger part is labeled as Normal.

4) ALFA.4:
The attacker introduces plain random BN labeling without any rules. This attack will be used to compare the performance of the proposed labeling strategies against random labeling.

In the following section, we evaluate the impact of the proposed ALFAs on several HT classifiers.
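As a sketch, the four scenarios can be expressed as a single relabeling routine over the BN labels; `is_trigger_ff` is an assumed helper predicate standing in for the example rule of ALFA.3 and is not part of the original attack description:

```python
import random

def alfa(bn_labels, scenario, is_trigger_ff=None, seed=0):
    """Relabel boundary nets according to the four ALFA scenarios (sketch).

    bn_labels: dict mapping BN name -> 'trojan' or 'normal'.
    is_trigger_ff: hypothetical predicate for the ALFA.3 example rule.
    """
    rng = random.Random(seed)
    flipped = dict(bn_labels)
    for net in flipped:
        if scenario == 1:                       # ALFA.1: all BNs -> Trojan
            flipped[net] = "trojan"
        elif scenario == 2:                     # ALFA.2: all BNs -> Normal
            flipped[net] = "normal"
        elif scenario == 3:                     # ALFA.3: rule-based, e.g. BNs
            if is_trigger_ff(net):              # on trigger flip-flops -> Normal
                flipped[net] = "normal"
        elif scenario == 4:                     # ALFA.4: plain random labels
            flipped[net] = rng.choice(["trojan", "normal"])
    return flipped

bns = {"b1": "trojan", "b2": "normal"}
print(alfa(bns, 1))  # {'b1': 'trojan', 'b2': 'trojan'}
```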

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In the following, we briefly outline the most-used procedure of net-feature extraction proposed in [5] and introduce baseline HT classifiers that do not aim at overcoming the BTP but will be used to compare the HT-classification quality before and after being attacked.

A. Net-Feature Extraction and Basic Labeling Procedure
In the following, we deploy a well-established net-feature extraction procedure introduced by [5]. Following this work, 51 features of every net n can be extracted, where k denotes the number of logic levels, as shown in Table I. A table of extracted net features characterizes the netlist. For a set of netlists with their corresponding tables, the construction of the HT classifier consists of training and inference phases based on all tables, deploying leave-one-out cross-validation [5].
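As an illustration of one k-level net feature in the spirit of [5] (not a reimplementation of all 51 features), the following counts the predecessor nets reachable within k logic levels on the fan-in side of a net; the small fan-in graph is hypothetical:

```python
from collections import deque

# Hypothetical fan-in graph: net -> nets one logic level behind it.
fanin = {
    "n4": ["n2", "n3"],
    "n3": ["n1"],
    "n2": [],
    "n1": [],
}

def fan_in_within_k(net, k):
    """Count distinct predecessor nets reachable within k logic levels
    (breadth-first traversal of the fan-in graph)."""
    seen, frontier = set(), deque([(net, 0)])
    while frontier:
        cur, lvl = frontier.popleft()
        if lvl == k:          # do not expand beyond k logic levels
            continue
        for pred in fanin[cur]:
            if pred not in seen:
                seen.add(pred)
                frontier.append((pred, lvl + 1))
    return len(seen)

# Feature values of net n4 for k = 1..3.
print([fan_in_within_k("n4", k) for k in range(1, 4)])  # [2, 3, 3]
```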
1) Basic Labeling Procedure: The HT detection procedure introduced in [16] consists of two stages: 1) net classification and 2) classification-result correction. Here, we modify this procedure and combine both stages into a single labeling procedure, called the basic labeling (BL) procedure, illustrated in Procedure 1.

TABLE I NET FEATURE EXTRACTION AS PROPOSED IN [5], WHERE 1 ≤ k ≤ 5
2) Four Baseline HT Classifiers: A set of netlists from [14] is randomly selected. Then, we apply RFC, SVM, decision tree (DT), and decision stump (DS) classifiers to this set of netlists. Typically, the so-called F-measure is employed to evaluate the quality of almost all HT classifiers in this domain. However, the F-measure can dangerously show over-optimistic, inflated results [18]. Therefore, we additionally use the Matthews correlation coefficient (MCC), as the MCC depends on all basic ML-evaluation metrics, true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP), and exhibits several advantages over the F-measure for binary classifiers [18].
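Both metrics follow directly from the four counts. The sketch below uses made-up counts for a trivial classifier that marks every net as Trojan on a 90/10 imbalanced set, a case where the F-measure looks excellent while the MCC (set to 0 by convention when its denominator vanishes) reveals no predictive power:

```python
import math

def f_measure(tp, tn, fp, fn):
    # F-measure (F1) on the Trojan (positive) class.
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; 0 by convention if undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts: "always Trojan" classifier, 90 Trojan / 10 Normal nets.
tp, tn, fp, fn = 90, 0, 10, 0
print(round(f_measure(tp, tn, fp, fn), 3))  # 0.947: looks excellent
print(mcc(tp, tn, fp, fn))                  # 0.0: no real predictive power
```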

B. Reliability Analysis of HT Classifiers
Fig. 3 shows a comparison of the TPR, TN rate (TNR), precision (PRE), F-measure, and MCC of all proposed classifiers. RFC and DT exhibit high TNR and precision, indicating a reliable Trojan-free verification. Both classifiers perform well in Trojan detection with acceptable TPR and F-measure. In the SVM case, Trojan detection is slightly worse in comparison, and the TNR indicates less reliable accuracy in identifying Trojan-free nets. Furthermore, DS wrongly predicts most of the nets as Trojan, which causes a high TPR but a low TNR. According to Fig. 3, the DS classifier is unreliable due to the huge difference between its F-measure and MCC. This indicates a case where the F-measure provides over-optimistic, misleading results. Since the RFC and DT classifiers exhibit similar metrics, we choose only RFC and SVM as examples to study the impact of our proposed attacks on HT classifiers in the following.

The proposed ALFAs show a strong impact on SVM, especially its TNR, since BL is not the optimal labeling strategy.

D. Boundary Nets as Toxin: Discussion
Table II compares our results to the state of the art. The presented prior works on different applications, and even on HT classification such as CatBoost [10], show that the adversary must target 20% of the instances in the dataset with ALFA to achieve a 58.5% misclassification ratio. Our work indicates that ALFAs exploiting the BTP target only 0.09% of the sample size to cause a 9.19% SVM and a 9.15% RFC misclassification ratio, respectively. This reflects the severity of the BN impact on HT classifiers, where flipping less than 0.1% of the labels in the dataset causes an almost 10% accuracy drop. This table, together with the results of the ALFAs on RFC and SVM, explains and demonstrates the toxic nature of the BTP for HT classifiers.
Further, when a verification engineer uses an ML-based HT detection tool, the BTP will generally be toxic for the tool, and the verification engineer will not be able to remove Trojan nets completely with high confidence. Consequently, a verification engineer should apply several combinations of labeling scenarios and HT classifiers to achieve an acceptable classification outcome, though the number of such combinations is enormous: considering a practical example with just n_c = 4 classifiers and a set of benchmarks with a total number of n_BN = 254 BNs, where the verification engineer wants to find one classifier out of four, the number of all possible combinations of labeling scenarios and HT classifiers is n_c × 2^(n_BN) = 4 × 2^254 = 2^256.
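The count follows since each of the 254 BNs can be labeled two ways and one of four classifiers is chosen, and 4 = 2^2; a quick sanity check:

```python
# Sanity check of the combination count: n_c classifiers times 2^n_BN
# possible BN labelings.
n_c, n_bn = 4, 254
combinations = n_c * 2 ** n_bn
print(combinations == 2 ** 256)  # True: 2^2 * 2^254 = 2^256
```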

V. CONCLUSION
This letter investigated the labeling impact of the BTP on supervised ML-based HT-classification methods for gate-level netlists. BNs are a structural problem for these HT classifiers, where labeling a small number of nets greatly impacts HT classification. Therefore, BN labeling can be considered toxic for any HT classifier. We illustrated this toxicity with four labeling scenarios attacking HT classifiers, where we clearly see strong effects on classification quality: while the true-negative ratio remains high in all cases, the true-positive ratio shows large deviations with corresponding effects on MCC and F-measure. Theoretically, this can be mitigated by applying several labeling scenarios; practically, however, this is impossible due to the large number of to-be-covered combinations. This calls for novel approaches, as with established ones, proper detection and removal can no longer be guaranteed. We will hence target a different class of ML and evaluation methods to identify and correct maliciously flipped labels.
Saleh Mulhem, Felix Muuss, Christian Ewert, Rainer Buchty, and Mladen Berekovic

Index Terms—Gate-level netlist, hardware Trojan (HT), integrated circuit (IC) testing, machine learning (ML).

Procedure 1: Basic Labeling (BL) Procedure
Input: Table of unlabeled net features.
Output: Table of fully labeled nets.
1) Every internal net of a Trojan is labeled as a Trojan net.
2) Every boundary net linked to the payload is labeled as follows: (a) Trojan net, if the wire is linked to the payload output; (b) Trojan net, if the wire is connected to a logic gate or complex gate*; (c) all other boundary nets linked to the payload are Normal nets.
3) Every boundary net linked to the trigger part is labeled as follows: (a) Trojan net, if the wire is connected to a logic gate or complex gate* with four or more input ports and is somehow linked to a flip-flop; (b) all other boundary nets linked to the trigger are marked as Normal nets.
4) Every other net is labeled as Normal.
*A complex gate indicates a combination of basic gates integrated into a single cell by CMOS vendors.

December 2023; date of current version 30 August 2024. This work was supported in part by the German Ministry of Education and Research (BMBF) via the Project VE-Jupiter under Grant 16ME0234. This manuscript was recommended for publication by F. Merchant. (Corresponding author: Saleh Mulhem.) The authors are with the Institute of Computer Engineering, Universität zu Lübeck, 23562 Lübeck, Germany (e-mail: saleh.mulhem@uni-luebeck.de).

TABLE II COMPARISON BETWEEN OUR WORK AND THE STATE OF THE ART