Leveraging Gametic Heredity in Oversampling Techniques to Handle Class Imbalance for Efficient Cyberthreat Detection in IIoT

In recent years Cyber-Physical Systems (CPS) and Industrial Internet of Things (IIoT) have gained significant attraction; however, it remains a vulnerable target for cyberattacks. Machine learning techniques have garnered interest in security applications due to their rapid processing capabilities and real-time predictions. However, imbalanced data distribution is a prevalent issue in IIoT environments, adversely affecting ML-based attack detection systems. In this work, we present a novel gametic heredity-based oversampling technique for addressing imbalanced data challenges in cybersecurity applications, specifically targeting IIoT systems. The proposed model enhances diversity in the minority classes by generating unique synthetic minority samples, creating diverse synthetic data while restricting instances to the minority class region. The proposed model outperforms complex and conventional methods in terms of precision, recall & F-Score while mitigating over-generalization by evenly distributing newly generated samples within minority class boundaries and regions. To validate the proposed model and verify its efficacy in identifying cyber threats, we used the UNSW-NB15 dataset. Simulation results demonstrate that the proposed model efficiently detects attacks with high performance compared to state-of-the-art techniques. Our research contributes to developing robust & efficient machine learning models for enhancing the security of IIoT systems while handling class imbalance issues.


I. INTRODUCTION
T HE RAPID development and integration of the Industrial Internet of Things (IIoT) have revolutionized the manufacturing and automation industries, driving efficiency, productivity, and cost savings.However, the widespread adoption of these interconnected systems has also increased their vulnerability to cyber threats.Cybercriminals are continually devising sophisticated attack strategies, targeting IIoT to cause disruption, data breaches, and damage to critical assets.Hence, there is a growing need to develop robust & efficient methods for detecting and mitigating cyber threats.
Machine Learning (ML) and Deep Learning (DL) approach shows promise to detect cyber threats in IIoT and Industrial Control Systems (ICS) environments.However, these techniques face challenges while encountering imbalanced data [1], a common issue in the context of cybersecurity.In realistic scenarios, the amount of normal (benign) data instances is typically higher than the attacks (malicious) instances, leading to a skewed distribution [2].This imbalance could significantly affect the ML and DL model performance, due to biasing towards majority instances (i.e., normal data points) and resulting in poor detection of cyber threats [3].Because these algorithms usually consider datasets having equal classes [4], [5].
Handling imbalanced data is a critical aspect of developing effective ML models for cybersecurity applications, especially in the context of IIoT and ICS.Several techniques and methodologies are developed to tackle the data imbalance problem, including data preprocessing techniques, algorithm-based techniques, and cost-sensitive learning methods [6], [7].Data preprocessing methods, such as oversampling and undersampling, modify the dataset before training the model.Oversampling, in particular, produces synthetic data for the minority class instances, potentially improving the model's performance in detecting cyber threats.The majority of data sampling techniques used till date are synthetic in nature [8], [9], [10], [11].The Synthetic Minority Oversampling Technique (SMOTE) [8] is well-known among them, which intelligently introduces fresh synthetic or new data instances to the class having low samples.Whereas algorithm-based techniques modify the learning algorithm itself, and cost-sensitive learning methods assign additional misclassification costs to the minority and majority classes [7].
Oversampling techniques offer several advantages over other approaches for handling imbalanced data in cybersecurity applications.They enable the model to understand the characteristics of the minority class instances in a better way and are universally applicable to any dataset.
However, while oversampling techniques like SMOTE and Adaptive Synthetic (ADASYN) improves classifier performance in imbalanced data scenarios, but there are some issues associated with their use.According to Barua et al. [11], these synthetic methods can unanimously enlarge the minority class region, leading to the misclassification of majority class samples.This occurs as synthetic instances are sometimes introduced into majority-class regions.Additionally, these approaches may generate non-diverse and large amounts of similar data samples, as they undertake only nearest-neighbor samples.
In summary, while oversampling techniques have shown promise in addressing imbalanced data issues, they still have some limitations.Erroneous enlargement of the minority class region, generation of non-diverse data samples, data sample duplication and increased false alarm rates are some of the challenges associated with these methods.It's crucial to cautiously consider the trade-offs and pick the most appropriate method for handling imbalanced data in cybersecurity applications.
In this paper, we proposed a novel genetics-based oversampling method that achieves high precision and recall with low false positives.In the proposed approach for the minority attack samples, clusters are created using the Gaussian mixture model.Then within the cluster, the gametic hereditical oversampling technique which is inspired by the field of genetics biology for producing synthetic data samples is applied.The generated samples are unique and not the copy of old samples but lie within the boundary of the class.The significant contributions of this paper are described as: 1) A novel gametic heredity based oversampling technique inspired by the field of genetics biology is proposed to enhance the cyberthreat detection in IIoT.The proposed model is capable to deal with imbalanced data situations as well, thereby providing more robust and secure architecture.2) Our proposed model enhances the diversity within the minority class by generating unique synthetic minority samples inspired by the diversity observed within populations of humans or other living organisms despite being of the same species.This approach creates as much diverse synthetic data as possible while maintaining the instances within the region of the minority class.3) By generating data instances from two dissimilar instances, the proposed model ensures that overfitting is avoided as it does not create duplicated samples and stops the generation of new instances after balancing the classes.4) The proposed model demonstrates superior precision, recall, and F-Score compared to other complex and conventional methods while also addressing the overgeneralization problem for detecting cyber threats in IIoT.The rest of the paper is organized as: Section II presented the related work, whereas Section III describes the theory of inheritence and diversity and Section IV presents the proposed approach.Section V discuss the result evaluation and Section IV concludes the work and gives future direction.

A. Intrusion Detection Systems
A significant amount of scrutinization and research to achieve a network that is secure, using ML methods is done in both academia and industry because of their high potential benefits.The majority of traditional ML methods employed in intrusion detection are grounded in supervised learning models [12], [13], [14].Liang et al. [15] presents a data clustering optimization model.Formally, it sorts data according to the weighted distance and safety factor based on each node's data features properties and the priority threshold.Chang et al. [16] explored the applicability of using the Forest-RI (Random Input), a technique of Random Forest (RF) which uses feature selection, and combined it with an SVM to classify the selected efficient attributes.
Bhattacharya et al. [17] devised a model that firstly employs a hybrid PCA-firefly algorithm to reduce data dimensions before using the XGBoost algorithm for the classification of the reduced data.After the introduction of DL theory, its excellent feature learning capabilities have attracted significant attention from researchers.As a result, several scholars have started incorporating DL methods for intrusion detection [18], [19], [20].Shone et al. [21] put forth a non-symmetric deep autoencoder (NDAE) for feature learning and created a new classification model by integrating NDAE with the RF classifier in a unsupervised way.
However, existing Intrusion Detection Systems (IDS) face challenges when handling class imbalance in IIoT environments.Imbalanced data could tend to build models that prioritize majority classes, resulting in decreased detection rates of cyber threats.This issue also causes increased false alarms, inefficient resource allocation, and reduced overall security.Additionally, traditional performance metrics may not accurately reflect the model's capability to detect threats, & the model may struggle to adapt to different or emerging cyber threats due to insufficient minority class representation.

B. Data Imbalance Processing Methods
As stated in the beginning, this paper handles the issue of the class imbalance problem.In the context of cyber threat detection for IIoT, researchers have focused on tackling the class imbalance problem to increase the performance of ML models in detecting cyber threats.The solutions for data imbalance problems in cyber threat detection for IIoT are majorly classified as data based techniques, algorithm based technique, and cost-sensitive learning based techniques [22], [23].Several studies have investigated the application of oversampling, undersampling, and hybrid resampling techniques to balance the class distribution in IIoT intrusion detection datasets.By modifying the representation of minority and majority classes, these techniques could enhance the performance of ML models in detecting cyber threats in IIoT environments.
In addressing the class imbalance scenario, the most straightforward technique is random oversampling.This method entails selecting minority samples at random and duplicating them until a desired size is reached.However, this can result in overfitting due to the strong similarities between original and duplicated instances.To tackle this challenge, Chawla et al. introduced the Synthetic Minority Oversampling Technique [8], which generates synthetic samples between randomly chosen minority instances and their NN-nearest neighbors, where NN is determined by the user.Despite its advantages, SMOTE may lead to over-generalization, as it generates new instances not taking majority instances into account, thereby increasing the overlap between classes [24], [25], [26].This issue can become more pronounced in datasets with high imbalance ratios, as sparse minority instances may end up within the majority class after oversampling, further impairing overall performance [27].
Several methods are developed for overcoming overgeneralization.One approach, Safe-level SMOTE [28], calculates a "safe-level" value for each minority class and synthesizes samples nearer to the largest safe level.This safelevel value is determined by the amount of minority samples among its nearest neighbors.However, this method may result in overfitting since synthetic samples are created further away from the decision boundary.
Another approach, Borderline-SMOTE (B.SMOTE) [10], finds the boundary in-between the two classes and focuses on oversampling of the samples which exists close to the boundary line.In contrast, ADASYN [9] focuses on the minority class by generating synthetic samples based on their density distribution.ADASYN adaptively generates more synthetic examples for the minority class in regions with fewer instances, effectively balancing the dataset.However, both the techniques may not detect all minority samples in proximity within the boundary of decision.
Alternative sampling methods involve the use of clusters to segment the dataset before applying any sampling strategy.Barua et al. [11] presents a technique called Majority Weighted Minority Oversampling Technique (MWMOTE), that partitions data with the help of clustering before employing the Euclidean distance similarity to get closely related class instances.Synthetic instances are then produced depending on weights allocated for minority class instances.Various researchers showed a preference for oversampling instead of under-sampling [29], [30], [31], as under-sampling eliminates class samples and may lead to the loss of essential information necessary for constructing an efficient predictive model.Thus, undersampling is ruled out from this study.
While our methodology is also developed on synthetic data generation, it distinguishes itself by creating diverse data samples.In comparison to other methods that generate new data instances from highly similar parent samples according to their similarity distance measure values, proposed model creates different data samples originating with help of 2 distinct parent instances.

III. THEORY OF INHERITANCE AND DIVERSITY
Genes, the fundamental building block of heredity, are thought to be located in chromosomes and are transferred equally to offspring from each parent after the fertilization of the gametes (egg and sperm).Sutton [32] proposed this theory in 1902 and is known as the Chromosomal Theory of Inheritance (CTI).According to the hypothesis, new kids inherit 50% of each parent's chromosomes, making them both identical to and different from both parents at the same time.The sex (or gender) of the species is crucial for maintaining diversity within species because it aids in choosing two opposing members that can procreate.Two unique groups of chromosomes population S = (s 1 ; s 2 ; s 3 ;. ..;s r ) and T = (t 1 ; t 2 ; t 3 ;. ..;t r ) are generated with same size r.Every chromosome Z = (z j1 ; z j2 ; z j3 ;. ..;z jp ) represents a d-dimensional vector comprised of genes, where z jk is the k th gene value (k = 1; 2;. ..; d) of the j th chromosome (j = 1; 2;. ..; p) in each of the population.Two chromosomes, one from each set, are united to create a new offspring, with every parent randomly providing half of the genes the offspring would inherit.Researchers have used CTI to solve problems in different domains such as agriculture and animal research as well [33], [34].An unbalanced defect dataset is subjected to the basic principles of selective animal and plant mating for generating more data instances of defective instances.Methods including outcrossing, inbreeding, and line-breeding [35] were employed in selective breeding to separate the pool of animals and plants.Similarly to this, by taking into account a similarity metric to help us separate our data samples, we can logically oversample the minority groups.In order to create new samples, we plan to develop an oversampling method based on the CTI and the partitioning approach that can combine various samples to create synthetic data samples which are as unique from their producers as possible & add to the diversity of the minority class distribution.This method will be able to identify minority defective samples and examine their sample defects.

IV. PROPOSED APPROACH
We exploited the challenge of using the chromosomal theory of inheritance for generating synthetic data using the Gaussian clustering-based gametic oversampling technique.Taking Walter Sutton's theory [32] into consideration we used the features of the attack dataset as the chromosomes of the parents taking part in the reproduction mechanism for generating diverse instances of data.We aim to produce new minority instances of attack labels that inherit the features from two different data instances.These features produced are also unique and not a copy of the existing sample.
Based upon the same attack label of two data samples, the child class is produced that will have the same attack label as its parents.The basic intuition behind this could be explained with an example as, let's say both the male and female have brown iris then the chances of their child having brown iris are high as compared to having blue, black, or green iris.Such a concept is known as inheritance, as explained in Walter Sutton's chromosomal theory of Inheritance [32].The proposed approach seeks to produce synthetic data that has both unique and common features with the help of three fundamental phases.The overall proposed approach is shown in Figure 1 and a list of all notations used in the proposed model is shown in Table I.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Before generating the synthetic instances for the minority dataset, data pre-processing is applied to the complete dataset.As a part of this phase, the feature selection technique is also applied to the preprocessed dataset to reduce the overall number of attributes of the dataset.After feature selection comes to the outlier detection step using Isolation Forest.Then, the preprocessed data is divided into train and test sub-datasets.In the second phase, the minority samples of the training dataset are divided into various clusters based on the clustering algorithm.The idea behind clustering is to increase the accuracy of the offspring falling into the vicinity of its parent.
As a part of the third Phase, Mahalanobis Distance (MD) says M q for every data sample of cluster C i is calculated.The data samples are then arranged in descending order based on their MDs.Then, the data samples of the cluster C i are halved into upper and lower segments from its midpoint.The data elements of the upper and lower sections are arranged in a specific order (either ascending or descending) based on their attack label.In the last step, the parents with the same attack label are chosen from upper and lower segments respectively to breed and produce offspring which fall into the same class label as their parents.The breeding part includes taking the average of the feature values of its parents to get the features of the child.Algorithm 1 describes the entire process of the proposed oversampling technique.

A. Phase 1: Data Pre-Processing and Dimensionality Reduction
In data pre-processing, we applied, label encoding to convert categorical data into numeric and Min-Max Scaling to bring the data in one range.Then further we applied dimensionality reduction to the pre-processed data.Since, for effective clustering lower number of features works well so, Where C jq = Object vector, j = cluster no.∈ {0, 1, 2 ..., k−1}, q = row or data point of C j , μ = mean vector, −1 = covariance matrix 9: Sort C jq according to M 2 q value in decreasing order 10: Say, mid = length (C j )/2, Divide (C j ) into upper and lower divisions as X u = (C j ).elements from [0 to mid] X l = (C j ).elements from [mid to length (C j )] 11: Sort X l and X u on basis of attack labels.12: Synthetic Data (SD) j = SDG (X l and X u ) total_count = total_count + len((SD) j ) 13: If total_count < required_count repeat steps 7 to 11 for the remaining clusters 14: Else; End we applied principal component analysis (PCA) on the scaled data to reduce its number of features.But due to the human factor of assigning the number of components and reducing the human error a method called permutation test is used on the mined dataset to identify the true number of features that could effectively represent the whole features of the dataset.
After gathering the number of attributes, we applied PCA to the dataset, described in Algorithm 2. Let W ij be the weight of feature j giving PCA feature i on the reduced and pre-processed dataset, train-test-split is applied to divide data into train and test datasets.The training data is again scrutinized for separating the majority & minority classes.Here, minority class generally falls under the criterion of attack.The data samples of these classes are separated from the majority class and are sent to the next step for synthetic data generation.

B. Phase 2: Clustering
Clustering is an unsupervised learning methodology used to split the data into clusters (say groups) to classify them uniquely.In a broad sense, the cluster is a collection of data instances which share more similarities with each other than with those in different clusters.Before performing oversampling in our data, we are interested in creating clusters of attack data representing the minority class to oversample.
We preferred the Gaussian clustering method over k-means clustering as k-means relies on distance measures to assign data points to clusters, resulting in circular-shaped clusters.This occurs as the cluster centroids are iteratively updated using the mean value.Circular shape of clusters makes this technique less effective when applied to datasets where the attack data may not be distributed in a circular pattern.If the data points form a non-circular pattern, K-means would struggle to identify the correct clusters, and consequently, the accuracy of the synthetic data would be compromised due to Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.gmm.fit(D t ) Silhouette_score.append(Silhouette_score(D t ), gmm.predict(D t )) 3: Plot Silhouette_score v/s i graph 4: Return max[Silhouette_score] ←− i th value this inefficiency.Figure 2 shows the creation of a cluster using k-means and GMM.
In this work, the silhouette measure (or silhouette coefficient) is employed to determine the quality of clustering and to evaluate the optimal number of clusters 'k' for a given dataset as mentioned in Algorithm 3. It provides a way to measure how effectively every data point is assigned to its respective cluster by taking into account both the cohesion within the cluster and the separation between different clusters.Silhouette coefficient is formulated as: where, b i = min.average distance from the i th point to points in different cluster, a i = average distance from i th point to other points within same cluster.The silhouette coefficient ranges from −1 to 1.Its values close to 1 indicates a high quality clustering assignment, values near 0 imply that the data point could belong to either its current cluster or a neighboring cluster, and negative values suggest that the data point could have been assigned to the incorrect cluster.
After getting the number of clusters 'k', Gaussian Mixture Model (GMM) is used to assign elements to the clusters.With GMM each cluster is represented by a Gaussian distribution.GMM estimates the parameters of these Gaussian distributions and assigns data instances to clusters depending on their probability of belonging to each distribution.
When implementing GMM to cluster, the goal is to partition the data into groups or clusters, where every cluster is depicted by a Gaussian distribution.The steps for creating clusters using GMM are as follows: 1) Initialization: Select the 'k' number of clusters (given by Silhouette Score) for the dataset.Initialize the parameters of the Gaussian distributions (means μ i , covariance matrices i ) and the mixing coefficients w i using K-means++ initialization. 2

) Expectation
Step (E-Step): Given the current estimates of parameters, calculate the probabilities of all data points hailing to every Gaussian distribution using Bayes' theorem: where: z i is cluster assignment of data instance x n , P(z i = k|x n ) represents the probability that x n belongs to cluster k.N(x n ; μ k , k ) is the Gaussian distribution with mean μ k and covariance matrix k for cluster k. w k is the mixing coefficient for cluster k.The summation is over all clusters (j = 1, 2, . . ., K).

3) Maximization Step (M-Step):
Update the estimates of the Gaussian parameters (μ, ) and the mixing coefficients (w) based on the probabilities computed in the E-step: where, the summations are over all data points (n = 1, 2, . . ., N). N is the total no. of data points in the dataset.
5) Cluster Assignment: After convergence, add every data point to a cluster with the highest probability:

C. Phase 3: Synthetic Data Generation
First, we calculate the MD and sort data points of cluster C i in descending order based on MD.Then we divide the sorted data samples of cluster C i from the previous step into two groups X l and X u by identifying a mid-value, which is half of the total number of points in the cluster.After several iterations of simulation trials, using the upper half and lower half yields the best and most universal outcome of synthetic data.This strategy still functions even when we have just a less number of attack samples.We think that using two parents is practical and is in favor of inheritance theory.One partition say X u is made up of all data samples with MD greater than or equal to the center data point, and the second partition say X l is made up of all other data samples.Samples are progressively tagged and then paired within the two partitions.To actively identify 2 separate examples from both segments which have been methodologically and symmetrically matched as "parents" are taken from X l and X u with the same attack category.Before the pairing process, the data sample in each upper and lower half are again sorted on the basis of their attack label.The pairing is then carried out sequentially using SDG, described in Algorithm 4. These parents can be recognized by the "attack labels" they have.This is done to make sure that there are no samples that overlap and that the resulting samples created by the partitioning process are located inside the cluster decision boundary.The child, hence produced, is generated by taking the average value from the feature values of parents.
The decision to use MD instead of Euclidean distance (ED) is motivated by the limitations of ED when dealing with correlated attribute data.MD, measures how far a point is from its distribution (such as a cluster), is better suited for capturing proximity information.By considering the diversity between two data examples using MD, we can overcome drawbacks associated with the Euclidean metric.ED fails to differentiate strongly correlated or duplicate samples, impeding the classifier training process.MD, with its unitless nature, provides a relative measurement of sample distance from a reference point, facilitating outlier detection and similarity identification between known and unknown datasets.MD accounts for correlation and scale, making it a robust measure.To address scale and correlation concerns in ED, covariance within the data instances is also taken into consideration during distance calculation.
The mathematical term for the same is: where M 2 q denotes the squared MD, the mean of independent variables is represented by m, x represents the data samples, and C −1 denotes the inverse covariance matrix of independent variables.
Considering two samples of the defective class x = (x 1 , x 2 , . . ., x n ) T and y = (y 1 , y 2 , . . ., y n ) T , the MD is evaluated as; where S −1 is the covariance matrices.By aggregating and calculating the average of two paired samples from both partitions, new data is generated which is then added to the data X_new, known as the final phase of the proposed model.

A. Dataset Used
For experimental validation, we employ the widely recognized UNSW-NB15 [36] dataset which is majorly utilized in the field of attack detection.It has been extensively used in research & provides a diverse range of network traffic scenarios to evaluate the efficiency of attack detection systems.UNSW-NB15 dataset is also highly imbalanced in nature thus it is also used to evaluate the solutions for the imbalance data problem in network intrusion detection systems.The dataset was created to simulate real-world network traffic scenarios.The dataset comprises network traffic data collected in a controlled environment, including both benign and nine types of attacks with 49 features with the class label.It serves as a valuable resource to evaluate and develop intrusion detection systems in diverse network environments.

B. Evaluation Metrics
When evaluating an intrusion detection approach, several metrics are commonly used to assess its performance.These metrics provide insights into the efficacy of the approach in detecting and classifying intrusions.Some of the key metrics used for evaluating intrusion detection approaches include accuracy, recall, precision, F-Score, Area Under Curve (AUC), and False Positive Rate (FPR).Accuracy measures the overall correctness of the intrusion detection approach for classifying attack and normal samples.Precision presents the ratio of correctly classified attack requests over total Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
samples classified as an attack.Recall (or Sensitivity or True Positive Rate) represents the correctly classified attack over actual attack samples.F-Score gives the harmonic mean of precision and recall, which provides a balanced measure of both metrics.AUC quantifies the model's ability to discriminate between positive and negative instances across various threshold settings, providing an overall performance measure.FPR represents the incorrect prediction of attack samples.
where w is true positive, x is true negative, y is false positive and z is false negative respectively.

C. Experimentation Procedure
We used Keras1 and TensorFlow2 frameworks to develop the model and conduct experimental testing.As described in Section IV, the original data is processed from preprocessing step, and then outlier detection is performed using an isolation forest.On UNSW-NB15 label encoding is applied for categorical data, making the data values numeric.Next, we used PCA to reduce the dimensionality of the features.We determined the number of PCA components to be used for feature selection by testing the final classification effect of RF.The dimension of data in the UNSW-NB15 data set initially was 47 (+2 of output), and we reduced it to 14 features to be used.
In the proposed model-based oversampling, to identify the number of clusters, the silhouette score method is used which determined that 4 number of clusters suits best for attack dataset.Then GMM was initialized with the parameters as n_init = 50, init_params = kmeans + +, and reg_covar = 1 to create the clusters.Once the clusters are created then we calculated MD between the data points within each cluster.Next every data value was arranged in descending order depending on their calculated MD and the cluster was divided into upper and lower half.Then the parents are chosen from the upper and lower half with the same attack labels to generate the child by averaging the parent's feature values.After the oversampling process, an RF classifier is applied for model training on the combined dataset of old traces with newly generated samples.

1) Results of Binary Classification:
This section presents the comparison of the proposed model for the classification results of benign and attack instances on the UNSW-NB15 Figure 3 shows that on the UNSW-NB15 data set, precision, recall, and F-Score enhanced to 6%, 8%, and 7% respectively in comparison to the existing technique achieving the highest results.The high metrics of the proposed approach are owed to the fact that the generated minority data is the subset of the Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.original data.It is built upon the foundation that the produced data should fall under the same cluster with its domain value in the range of its parents.Here, the data values are generated by considering the average of the values of its parent which fall inside the parent dominion.A bold view in the Table III indicates the outcome of the presented approach indicating that the proposed model along with RF achieves 97.3% of accuracy which is greater than the highest accuracy of 91.8% shown by the random oversampling technique.Along with accuracy proposed model achieved a 6% increased precision over all existing techniques by showing a precision of 97.7%.In addition to precision, a boost of 8% is achieved in recall value with 99.1% being the actual number.Whereas F-Score show a significant value of 98.1% with an increased value of 7% in comparison to the highest value of other oversampling technique.These findings demonstrate the effectiveness of the proposed model in detecting attacks, thereby enhancing the security of the IIoT system against potential cyber threats.
Techniques like B. SMOTE, SMOTE, ADASYN, and random oversampling are seen to perform lower to provide accurate synthetic data in comparison to proposed model.This accounts for various facts such as in the case of SMOTE and random oversampling, they may generate samples that might not actually represent the underlying distribution and introduce noise in the dataset [11].Randomly duplicating samples leads to dilution of the unique existing patterns of the dataset [11], whereas in the case of B.SMOTE, it may generate comparatively better samples than SMOTE and random technique but the process of selection of correct borderline samples for oversampling adds challenges to it and incorrect selection of such leads of suboptimal performance.Also, the challenges in estimating the density distribution ratio between majority and minority classes account for the suboptimal performance of ADASYN in comparison to the proposed model.Since the data samples are generated based on the specificity of data points within each cluster, it easily handles the concept of density distribution, no borderline samples need to be selected thus reducing the noise in the dataset as compared to SMOTE and allowing the system to preserve its nature making it less prone to dilution.
2) Multiclass Classification Results: Handling multiclass classification is a bit more arduous than binary.As in the case of multiclass the imbalance ratio gradually tends to increase in comparison to binary data values.
Table IV presents the results for multi-class classification of proposed oversampling in comparison to classification without oversampling (original) on selected features using PCA and RF as classifiers.Unlike binary classification, multiclass classification involves more complex data, often with potential class overlaps among different pairs of classes.This  Further, the proposed model is also compared with other states of art oversampling methods against precision, recall, and F-Score for the multi-classification of different attacks present in the UNSW-NB15 dataset.Figure 4 shows that the proposed model consistently outperforms the other sampling techniques across 8 out of 10 classes.It achieves higher precision values for all classes except for classes A and J. Similarly, recall of the proposed model outperforms the other sampling techniques across 7 out of 10 classes and 8 out of 10 classes for F-Score as shown in Figure 5 and 6 respectively.
This indicates that the proposed model improves the precision, recall, and F-Score of the minority classes and is effective in accurately detecting instances from these classes.The proposed model is effective for detecting attack samples of classes even with low training samples, thus securing the IIoT from these rare cyber threats as well.
Similarly, among the other sampling techniques, Random, SMOTE, ADASYN, and B SMOTE show varying levels of improvement compared to the no oversampling (original).Here, to maintain the same level of comparison, the actual imbalanced dataset is passed through PCA to generate the dataset with the reduced number of features and then used RF as for classification purposes.However, their precision, recall, and F-Score values are generally lower than those achieved by   proposed, indicating that the proposed model provides better precision for most classes.
It's worth noting that the performance of the different techniques varies across the classes.Some techniques may perform better for certain classes while underperforming for others.This suggests the importance of selecting an appropriate sampling technique based on the specific class characteristics and the desired performance.
Figure 7 presents a comparison of mean results for multiclass classification on the UNSW-NB15 dataset.The proposed method shows substantial improvement in all metrics.It achieves the best results for average recall, precision, and F-Score among all the techniques, indicating its effectiveness in detecting instances from multiple classes accurately.
Among the other sampling techniques, SMOTE, ADASYN, and B.SMOTE demonstrate relatively better performance compared to the original dataset but still, fall short when compared to the proposed method.The random oversampling technique shows the lowest performance among all the techniques.The better performance of the proposed model accounts here for the fact that the proposed technique uniquely focuses on the specific classes and tries to produce an exact amount of samples of them to make them balanced.These samples are generated in such a way that they fall under the same space or distribution thus making those samples closely similar to their class.
Overall, the results highlight the superiority of the proposed method in terms of average recall, precision, and F-Score, suggesting its potential to enhance multi-class intrusion detection on the UNSW-NB15 dataset.

VI. CONCLUSION
Ensuring network security relies heavily on effective intrusion detection.However, the presence of imbalanced data poses a significant challenge to the performance of intrusion detection systems.Imbalanced learning, where one class dominates the dataset, can lead to biased models and lower accuracy in detecting intrusions.Therefore, addressing the issue of imbalanced learning is crucial for enhancing the overall effectiveness of intrusion detection systems and enhancing network security.This work proposed a gametic hereditical oversampling technique that successfully tackles this challenge by generating diverse synthetic minority instances inspired by genetic biology principles.The evaluation using the UNSW-NB15 dataset validates the effectiveness of the proposed model in accurately detecting cyber threats, by achieving a high precision value of 0.977, and recall value of 0.991, and an F-Score value of 0.981.Importantly, the proposed model prevents over-generalization by ensuring the spread of synthetic samples remains within the boundaries of the minority class.The superiority of the proposed model over conventional methods highlights its potential for developing robust and efficient machine-learning models that enhance the security of IIoT systems.It enhances the precision, recall, and F-Score values by 6%, 8%, and 7% respectively as compared with the conventional methods, thus establishing its supremacy in securing the IIoT from the cyber threats.
However, the limitation of the proposed approach is that it would require a certain amount of samples of each class to produce synthetic data belonging to that class.If all the samples fall in either upper or in lower division of the class then it would be difficult to produce new samples for that specific class.In the future, we would work on improving the proposed approach to focus more on individual attack labels and try to address the above-mentioned limitation of the proposed approach.

Leveraging
Gametic Heredity in Oversampling Techniques to Handle Class Imbalance for Efficient Cyberthreat Detection in IIoT Priyanka Verma , John G. Breslin, Donna O'Shea, Nakul Mehta, Nitesh Bharot , and Ankit Vidyarthi Abstract-In recent years Cyber-Physical Systems (CPS) and Industrial Internet of Things (IIoT) have gained significant attraction; however, it remains a vulnerable target for cyberattacks.Machine learning techniques have garnered interest in security applications due to their rapid processing capabilities and real-time predictions.However, imbalanced data distribution is a prevalent issue in IIoT environments, adversely affecting MLbased attack detection systems.In this work, we present a novel gametic heredity-based oversampling technique for addressing imbalanced data challenges in cybersecurity applications, specifically targeting IIoT systems.The proposed model enhances diversity in the minority classes by generating unique synthetic minority samples, creating diverse synthetic data while restricting instances to the minority class region.The proposed model outperforms complex and conventional methods in terms of precision, recall & F-Score while mitigating over-generalization by evenly distributing newly generated samples within minority class boundaries and regions.To validate the proposed model and verify its efficacy in identifying cyber threats, we used the UNSW-NB15 dataset.Simulation results demonstrate that the proposed model efficiently detects attacks with high performance compared to state-of-the-art techniques.Our research contributes to developing robust & efficient machine learning models for enhancing the security of IIoT systems while handling class imbalance issues.

Algorithm 1
Gaussian Clustering-Based Gametic Hereditical Oversampling Technique Input value: An Imbalanced Dataset D I Output value: A Balanced Dataset D B 1: Use label encoder to add a label to the dataset; 2: D I.new_features = PCA.fit(DI.features ), as mentioned in Algorithm 2; 3: (D Train and D Test ) = train-test-split(D I ); 4: Now in D Train : if y[i] = attack_label then Attack_D.append(DTrain .iloc[i,:]),where i = {0....len(D Train )} 5: Define k = best of Silhouette Score(Attack_D) to get the number of clusters, as mentioned in Algorithm 3 6: Create cluster C j = GMM(Attack_D,k) as described in Phase 2 C j = j th cluster, j ∈ {0,1,2...,k−1} & k ≈ l (for best results) l is the total number of attack labels 7: Initialize total_count=0 & required_count = no. of attack label needed to balance dataset 8: In a cluster C j , for each data point C jq , obtain Mahalanobis distance
Permutation Test and PCA for Feature Selection Input: Preprocessed data D I with n no. of features Output: Reduced feature dataset 1: Define N as no. of permutation and correlation dataset; (i) X_aux = data.copy()(ii) For each Column in data.columnX_aux[col] = data[col].samples(len(DI )).value (iii) return X_aux 2: Run PCA for D I and save variance by each P i 3: Plot a graphical view for analysis of explained variance v/s permuted versions 4: The knee point in the plot would be the desired number of features 'f 1 ' 5: D I = PCA(D I , f 1 ) 6: Return D I 1, 2, . . .f 1 } where f 1 = Optimum no. of features and P is the principal components j = {0, 1, 2, . . .dt}.The next step is to remove the outliers, so we opted for an isolation forest to remove the outliers from dataset.Then Algorithm 2

TABLE II CLASSIFIERS
AND THEIR CONFIGURATIONS dataset.Table III shows the comparison of proposed with other oversampling methods such as SMOTE, ADASYN, Random, and Borderline SMOTE (B.SMOTE) on different classifiers with Table II describing the configuration of each classifier.The evaluation results for different sampling techniques and classifiers in Table III show varying performance in the context of attack detection.

TABLE III EXPERIMENTAL
RESULTS OF BINARY CLASSIFICATION ON DIFFERENT PERFORMANCE METRICS FOR UNSW-NB15 DATASET BETWEEN PROPOSED AND OTHER OVERSAMPLING TECHNIQUES Fig. 3. Comparison of results for binary classification.

TABLE IV COMPARISON
OF RESULTS FOR MULTI-CLASSIFICATION OF ATTACK CATEGORY IN UNSW-NB15 WITH AND WITHOUT PROPOSED MODEL