Data Augmented Hardware Trojan Detection Using Label Spreading Algorithm Based Transductive Learning for Edge Computing-Assisted IoT Devices

IoT devices handle a large amount of information, including sensitive information pertaining to the deployed application. Such a scenario makes IoT devices susceptible to various attacks. In addition to securing IoT devices, it is equally important to secure communication among devices and with the outside world. RS232 is a common communication protocol used in IoT and embedded devices. Hence, ensuring Trojan detection in RS232 plays a major role in providing secured communication among edge-assisted IoT devices. The inclusion of malicious circuits known as hardware Trojans can occur at any stage of IC design and manufacturing. Existing pre-silicon detection schemes with static features are limited by the number of features that are learned by the detection scheme. In contrast, machine learning allows enhanced Trojan space exploration. Existing machine learning-based Trojan detection consists primarily of supervised algorithms that rely on high-quality labeled datasets for efficient Trojan detection. Unsupervised methods, on the other hand, underperform due to limited training data and severe imbalance within the available data. To handle such a situation, a semi-supervised hardware Trojan detection scheme is proposed. In this work, permutation importance-guided principal component analysis, correlation-aware data augmentation, and hyper-parameter optimization using a genetic algorithm aid in optimal dataset and model generation. Pseudo-label generation using semi-supervised schemes is utilized to handle partially labeled datasets. For the Trust-HUB benchmarks, the proposed methodology achieves an average of 88.48% true positive rate and 95.77% true negative rate, which clearly indicates the effectiveness and feasibility of semi-supervised hardware Trojan detection.


The rapid advancement in microelectronic technologies has led to the exploration of cloud computing, big data, artificial intelligence, embedded systems, 5G communication, and the internet of things (IoT). IoT extends from smart cities to smart healthcare, including many mission-critical systems. An IoT framework consists of sensors, actuators, and embedded electronic devices that receive, store, and transmit data. As per forecast, the number of connected smart devices will reach 75 billion by 2025 [1]. As the number of connected devices grows, there is a multi-fold increase in the data to be handled. In such a scenario, quality of service (QoS) is affected by high network traffic and delay in time-sensitive applications. Edge computing (EC)-assisted IoT devices address the problem of degraded QoS by sharing data processing and enabling self-storage, which reduces the load on the cloud servers [2]. As shown in Fig.1

The associate editor coordinating the review of this manuscript and approving it for publication was Byung-Seo Kim.

Machine learning algorithms extract useful information or patterns from the input data for Trojan identification, facilitating the development of reusable and scalable models for hardware Trojan detection (HTD). Among the existing machine learning-based detection schemes, most methods apply supervised learning, but golden reference circuits are not always available in a real-time scenario. On the other hand, unsupervised strategies use functional features, targeting Trojans with low controllability and transition probability owing to their stealthy nature. Such methods can be evaded by redesigning Trojans to satisfy the conditions of a normal circuit [21]. Moreover, the methods that depend on structural features underperform in true positive rate (TPR) due to the limited Trojan space exploration in the training phase.
To be precise, existing machine learning-based Trojan detection approaches suffer from the following limitations: the requirement of a labeled dataset for supervised algorithms, limited learning of the Trojan space in the unsupervised case, and the model's inability to deal with design-specific bias, data imbalance, and/or the requirement of light-weight machine learning models. To overcome these limitations, the proposed work uses semi-supervised algorithms for hardware Trojan detection to deal with a partially labeled dataset. Moreover, a dynamic method that can adapt to new Trojan designs is the need of the hour. The proposed semi-supervised approach uses transductive learning, leveraging structural information from graph-based algorithms to perform label predictions effectively on unseen Trojan data. Furthermore, the method incorporates correlation-aware data augmentation schemes to address the problem of data imbalance. In addition, the method employs a permutation importance-based principal component analysis (PI-PCA) algorithm for feature selection. Finally, the XGBoost model's hyper-parameters are optimized using a genetic algorithm for improved Trojan detection.

Among the existing machine learning-based detection schemes, the majority of the methods fall in the supervised category, which presumes labeled reference data that is often unavailable in a real-time scenario. In addition, there is no unified method of labeling the nets, leading to discrepancies in result interpretation. Unsupervised strategies, in general, adopt testability measure-based features targeting Trojans that have low controllability and low observability [30]. Such methods can be circumvented by redesigning the Trojans to satisfy the conditions of a normal circuit, as mentioned in [21].
Furthermore, strategies that adopt structural features underperform in true positive rate (TPR) due to the limited Trojan space learned in the training phase, causing poor generalization capability. The performance of supervised algorithms relies on the availability of high-quality labeled data. Manual labeling of data for the complete circuit is tedious and time-consuming. The problem is further aggravated by the increase in the complexity of circuits. On the other hand, unsupervised algorithms require vast amounts of data to accurately infer patterns that reveal Trojan characteristics. Hence, a mechanism that overcomes the limitations of both methods becomes essential, considering the diversified threat conditions.

VOLUME 10, 2022

III. PROBLEM FORMULATION
in the labeled data to work with unlabeled data.

in the input space is captured to pass the information through the graph, which aids label assignment. It is performed using a symmetrically normalized weight matrix. The algorithm dynamically assigns labels depending on the regularization term α, which sets the balance between the information adopted from neighboring nodes and the information retained from the initial set of labels. This adaptive nature makes it suitable for handling unknown Trojans.
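The label-assignment step can be illustrated with a minimal sketch. It assumes scikit-learn's `LabelSpreading` as the implementation, and the synthetic feature matrix, the 30% labeling ratio, and the `gamma` value are hypothetical stand-ins rather than the paper's data or settings. In scikit-learn, `alpha` is the share of information a sample adopts from its neighbors, with the remainder retained from its initial label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelSpreading

# Hypothetical stand-in for a net-level feature matrix: 300 nets, 7 features,
# with roughly 10% of the nets in the "Trojan" class (label 1).
X, y_true = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0,
    weights=[0.9, 0.1], class_sep=2.0, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Partially labeled dataset: keep about 30% of the labels and mark the rest
# as -1, which is how scikit-learn denotes unlabeled samples.
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y_true)) > 0.3
y_partial = y_true.copy()
y_partial[unlabeled] = -1

# alpha is the clamping factor: the fraction of information a node adopts
# from its neighbors; (1 - alpha) is retained from its initial label.
model = LabelSpreading(kernel="rbf", gamma=0.2, alpha=0.8, max_iter=100)
model.fit(X, y_partial)

# transduction_ holds the label assigned to every sample, labeled or not.
pseudo_labels = model.transduction_
agreement = (pseudo_labels[unlabeled] == y_true[unlabeled]).mean()
print(f"pseudo-label agreement on unlabeled nets: {agreement:.2f}")
```

Because the method is transductive, the unlabeled nets take part in building the graph and receive their labels during `fit`; there is no separate prediction pass over them.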
The small Trojan footprint causes a high degree of imbalance between normal nets and Trojan nets [19]. Correlation-aware data augmentation balances the data by generating synthetic samples that are coherent with the original data distribution. For synthetic data generation, the proposed scheme uses the adaptive synthetic sampling (ADASYN) [52] algorithm, which considers the density of the data when generating synthetic samples of the minority class. This means that ADASYN produces more data samples for harder-to-learn data points. The proposed method captures linear and nonlinear relationships among data using Pearson's correlation coefficient [53] and Spearman's correlation coefficient [54], respectively. The Pearson correlation coefficient (r) effectively captures the linear relationship between two continuous variables x and y. Its value ranges from -1 to 1. It is calculated using (1).
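For reference, the two coefficients are commonly written as follows. These are the standard textbook definitions, given here on the assumption that they correspond to (1) and (2); the sample means $\bar{x}$ and $\bar{y}$ are notation introduced for this sketch, and the rank-difference form of Spearman's coefficient holds when there are no tied ranks.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
\qquad (1)

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
\qquad (2)
```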
where d_i is the difference in the ranks of the observations and n is the number of observations. The coherence of the generated data with the original data is verified by analyzing the correlation parameters. Correlation values in the range of 0.7 to 0.9 help the model maximize Trojan detection.
where n is the number of observations, p is the number of variables, and R is the correlation matrix.

The proposed methodology is illustrated in Fig.2. As the first step, the design is converted into a netlist using Synopsys DC [58]. A total of 78 circuit- and net-related features are extracted from the netlist. The permutation importance-based principal component analysis algorithm is applied to the extracted features. It produces an optimal set of uncorrelated and contributive features that maximize the predictive performance of the underlying model. The XGBoost model tackles the problem of overfitting due to limited data by applying regularization. It converges faster by analyzing the feature distribution. Data imbalance in the produced dataset is handled using a correlation-aware data augmentation scheme. It produces synthetic data that is coherent with the original data by imposing correlation constraints on the ADASYN algorithm. The scheme removes uncorrelated samples and ensures the coherence of synthetic samples with the original data.
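One possible reading of the correlation-aware filtering step is sketched below. Several things here are assumptions rather than the paper's procedure: the interpolation is a simplified SMOTE-style stand-in for ADASYN (the density-based weighting is omitted), each synthetic sample is compared against the sample it was generated from, and the acceptance threshold of 0.7 is taken from the lower end of the range mentioned in the text. All function names are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson's r: covariance normalized by both standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman(x, y):
    """Spearman's rho via 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); assumes no ties."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    d = (rank_x - rank_y).astype(float)
    n = len(x)
    return float(1.0 - 6.0 * (d ** 2).sum() / (n * (n ** 2 - 1)))

def interpolate(minority, n_new, k, rng):
    """SMOTE-style interpolation between a seed sample and one of its k
    nearest minority neighbors. ADASYN would additionally choose seeds in
    proportion to their share of majority-class neighbors; omitted here."""
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    seeds = rng.integers(0, len(minority), size=n_new)
    picks = neighbors[seeds, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))            # position along the segment
    synthetic = minority[seeds] + gaps * (minority[picks] - minority[seeds])
    return seeds, synthetic

def keep_coherent(minority, seeds, synthetic, threshold=0.7):
    """Keep synthetic samples whose feature profile correlates, both linearly
    and by rank, with the seed sample they were generated from."""
    mask = np.array([
        pearson(minority[s], z) >= threshold
        and spearman(minority[s], z) >= threshold
        for s, z in zip(seeds, synthetic)
    ])
    return synthetic[mask]

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 7))          # hypothetical Trojan-net features
seeds, synthetic = interpolate(minority, n_new=50, k=5, rng=rng)
coherent = keep_coherent(minority, seeds, synthetic)
print(f"kept {len(coherent)} of {len(synthetic)} synthetic samples")
```

Checking both coefficients means a synthetic sample is discarded if it departs from its seed either linearly or in rank order, which is the sense in which the filter removes uncorrelated samples.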

A pseudo-label generation algorithm is adopted to make label predictions on the partially labeled dataset. The available labeled data and the generated pseudo labels are combined to form the final training dataset. During training, hyper-parameter optimization is performed. The performance of the model is evaluated on test data using the leave-one-out cross-validation method. The adopted testing process ensures that each circuit considered for testing is unknown to the trained XGBoost model.

Table.1

The adopted feature set helps to tackle the problem of design-specific bias. Trojans can exhibit different characteristics depending on the design into which they are inserted. For example, consider a combinational Trojan with eight trigger inputs inserted in the S38417 and RS232 circuits. Although the Trojan is similar in structure in both cases, the Trojan in S38417 is harder to activate than the one in RS232. Hence, it is important to consider both net-based and circuit-based features for effective Trojan identification. The proposed

In order to remove offsets created by correlated and less con-

It is assigned a random value, after which parent chromosomes are randomly selected for child chromosome generation. Child chromosomes are produced through crossover and mutation. In crossover, random parts of the parent chromosomes form the new chromosome. In mutation, the value assigned to a gene is changed to a new random value. F-measure is chosen as the fitness criterion to address the data imbalance problem. Chromosomes with the highest fitness values are chosen as parent chromosomes in the succeeding generations, and the process continues. The procedure returns the chromosome with the highest F-measure score upon reaching the user-defined convergence criterion.
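The loop of selection, crossover, and mutation can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the search space, population size, mutation rate, and the use of elitism are hypothetical choices, parents are taken from the fittest chromosomes from the first generation onward, and `toy_fitness` is a placeholder for training XGBoost with a given configuration and returning its F-measure.

```python
import random

# Hypothetical search space; the gene names mirror common XGBoost
# hyper-parameters, but the candidate values are illustrative only.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 5, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "n_estimators": [50, 100, 200, 400],
    "reg_lambda": [0.0, 0.5, 1.0, 2.0, 5.0],
}

def random_chromosome(rng):
    """Assign every gene a random value from its candidate list."""
    return {gene: rng.choice(values) for gene, values in SEARCH_SPACE.items()}

def crossover(parent_a, parent_b, rng):
    """Child takes a random leading part of parent_a and the rest of parent_b."""
    genes = list(SEARCH_SPACE)
    cut = rng.randint(1, len(genes) - 1)
    return {g: (parent_a if i < cut else parent_b)[g] for i, g in enumerate(genes)}

def mutate(chromosome, rng, rate=0.2):
    """Replace each gene by a new random value with probability `rate`."""
    return {
        g: rng.choice(SEARCH_SPACE[g]) if rng.random() < rate else v
        for g, v in chromosome.items()
    }

def genetic_search(fitness, generations=30, population_size=12, n_parents=4, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(population_size)]
    for _ in range(generations):
        # The fittest chromosomes become parents of the next generation.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(population_size - n_parents)
        ]
        population = parents + children      # parents survive unchanged
    return max(population, key=fitness)

def toy_fitness(c):
    # Placeholder objective with a known best configuration. In the setting
    # described above, this would be the F-measure of a model trained with `c`.
    return -(abs(c["max_depth"] - 4) + 10 * abs(c["learning_rate"] - 0.1)
             + abs(c["n_estimators"] - 200) / 100 + abs(c["reg_lambda"] - 1.0))

best = genetic_search(toy_fitness)
print(best)
```

Stopping after a fixed number of generations keeps the cost predictable, which matters when each fitness evaluation involves training a model.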
In the proposed work, the maximum number of generations, set to 30, is used as the convergence criterion. The corresponding chromosome gives the optimal hyper-parameter configuration of the XGBoost algorithm. XGBoost effectively addresses the problem of overfitting due to the limited training data through regularization. In addition, the XGBoost algorithm considers feature distribution for faster convergence. The efficacy of the proposed algorithm is validated on the Trust-HUB benchmark circuits.

The data pre-processing stage consists of permutation importance-based principal component analysis (PI-PCA) for feature selection and correlation-aware data augmentation. Redundant and less contributive features are removed using the PI-PCA algorithm. The PCA algorithm selects 21 prominent features that are uncorrelated and exhibit maximum variance from the initial set of 78 features. However, PCA considers only global information and ignores local information that can be discriminative for the model predictions. To tackle such a scenario, a permutation importance-guided PCA algorithm is developed. It ensures the retention of the seven most influential features from the pruned set of 21 features, as depicted in Fig.3. The correlation plot of the pruned set of features is depicted in Fig.4. Thus, the proposed algorithm

characteristics (ROC) and precision-recall (PR) curves. The capability of the model to perform accurate Trojan detection is reflected in the increased area under the curve (AUC) score. Fig.6 depicts the impact of data imbalance on model performance, quantified using the AUC score of the Trojan class. The small Trojan footprint, intended to evade standard verification schemes, causes a severe data imbalance in the generated dataset. To tackle this problem, ADASYN is used to create synthetic data.
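The two-stage reduction from 78 to 21 to 7 features performed by PI-PCA can be sketched as below. The details are assumptions made for illustration: the data is synthetic, the first stage ranks original features by their absolute PCA loadings weighted by explained variance (the text does not state the scoring rule, and this rule does not by itself guarantee uncorrelated features), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 400 nets described by 78 features.
X, y = make_classification(n_samples=400, n_features=78, n_informative=10,
                           n_redundant=20, random_state=0)
X = StandardScaler().fit_transform(X)

# Stage 1 (assumed scoring rule): score each original feature by its absolute
# loadings on the retained components, weighted by explained variance, and
# keep the 21 highest-scoring features. This captures global structure only.
pca = PCA(n_components=0.95).fit(X)
scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
stage1 = np.argsort(scores)[::-1][:21]

# Stage 2: permutation importance measures how much the held-out F-measure
# drops when one feature is shuffled, which reflects its value to the model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X[:, stage1], y, test_size=0.3, stratify=y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(clf, X_te, y_te, scoring="f1",
                                n_repeats=5, random_state=0)

# Keep the seven features whose permutation hurts the model the most.
stage2 = stage1[np.argsort(result.importances_mean)[::-1][:7]]
print("selected feature indices:", sorted(stage2.tolist()))
```

Computing the importance on held-out data rather than on the training split avoids rewarding features that the model has merely memorized.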
Analysis of the 210 generated synthetic

FIGURE 9. PR curve of RS232-T1500 for correlation-aware data augmented dataset.

Fig.10 exhibits the confusion matrix generated using the aforementioned notation for the RS232-T1500 test circuit. In the field of Trojan detection, the efficacy of the model relies on its ability to improve Trojan recognition and reduce the normal net misclassification rate. In effect, this translates to minimizing the generation of false positives and false negatives.
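The relation between the confusion-matrix entries and the reported rates can be made concrete with a short sketch. It assumes that Trojan nets are the positive class, and the counts in the example are hypothetical, not results from the paper.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def detection_metrics(tp, fn, tn, fp):
    """Derive the usual rates from confusion-matrix counts.

    tp: Trojan nets flagged as Trojan      fn: Trojan nets missed
    tn: normal nets passed as normal       fp: normal nets flagged as Trojan
    """
    tpr = _ratio(tp, tp + fn)              # true positive rate (recall)
    tnr = _ratio(tn, tn + fp)              # true negative rate
    precision = _ratio(tp, tp + fp)
    f_measure = _ratio(2 * precision * tpr, precision + tpr)
    return {"tpr": tpr, "tnr": tnr, "precision": precision, "f_measure": f_measure}

# Hypothetical counts for a circuit with 10 Trojan nets and 100 normal nets.
metrics = detection_metrics(tp=8, fn=2, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With heavily imbalanced data, a high TNR can coexist with low precision, which is why the F-measure is a more informative single score than accuracy in this setting.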

Label propagation and label spreading algorithms have been applied to the pre-processed data to generate pseudo labels. The dynamic nature of the label generation process of the label spreading algorithm makes it suitable for the application of Trojan detection. It is observed that the value of alpha, which denotes the ratio of information inferred from the neighboring nodes to that from the initial labels, impacts model performance. The TNR value increases as the contribution of initial label information decreases, and the highest TNR is reached by adopting an alpha of 0.8 to 0.9 on average. Labeled data and generated pseudo labels are combined to form the final dataset, which is then applied to the optimized

FIGURE 11. Impact of each process on Trojan detection for RS232-T1500.

recall, and f-measure as indicated in Fig.11. The exploitation of structural information and the available prior information by the graph-based transductive approaches in Experiment.4 results in optimal model performance, as indicated by the improved f-measure. Semi-supervised algorithms are realized using the scikit-learn library. Upon experimenting with the available kernels, namely the radial basis function (RBF) kernel and the k-nearest neighbors (kNN) kernel, the former obtained optimal Trojan detection results, as illustrated in Fig.12. The dynamic nature of label prediction adopted by the label spreading algorithm makes

Table.5 and Table.6. The valuable prior information in the labeled data has been exploited in the proposed semi-supervised algorithm to enhance the TPR when compared to [43]. The improved TPR values can be attributed to the utilization of initial cluster information by the label spreading algorithm, which reveals significant relationships among data samples within the dataset. It is observed from

unknown Trojan data needs to be addressed, which forms

datasets.
Permutation importance-guided principal component analysis has been adopted to capture both global and local information for efficient feature reduction. Correlation-aware data augmentation tailors the ADASYN algorithm to generate data coherent with the underlying data distribution for optimal data balancing. In addition, genetic algorithm-based hyper-parameter optimization maximizes Trojan detection by attaining a hyper-parameter configuration that results in a global optimum. Furthermore, a graph-based semi-supervised scheme that utilizes transductive learning effectively uses prior information in the partially labeled dataset and the structural information from the generated graphs for enhanced detection performance. The efficiency and feasibility of the proposed work have been established by comparison with existing supervised, unsupervised, and few-shot learning-based schemes of hardware Trojan detection. The proposed methodology achieves an 88.48% average true positive rate and a 95.57% average true negative rate for the Trust-HUB benchmark circuits. Specifically, RS232 benchmark test circuits are chosen to validate the proposal. Ensuring Trojan detection in the RS232 circuit plays a major role in providing secured communication among edge computing-assisted IoT devices. In the era of the connected world, the high vulnerability of edge computing to the security threats faced by IoT devices compels this choice.

Experimentation and analysis on the test circuits indicate the effectiveness and feasibility of a semi-supervised approach for hardware Trojan detection. The computational complexity of graph creation for pseudo-label generation increases linearly with the circuit size and has to be optimized. Suggested future work includes exploiting explainable machine learning to avoid manual intervention in result analysis, extending the approach to a wider variety of Trojan designs, and optimizing pseudo-label generation.