STLGBM-DDS: An Efficient Data Balanced DoS Detection System for Wireless Sensor Networks on Big Data Environment

Wireless Sensor Networks (WSNs) are vulnerable to a variety of unique security risks and threats in their data collection and transmission processes. One of the most common attacks on WSNs, able to target all layers of the protocol stack, is the DoS attack. In this study, a new DoS Detection System (DDS) is proposed to detect DoS attacks specific to WSNs. The proposed system, an ensemble intrusion detection system called STLGBM-DDS, is developed on the Apache Spark big data platform in the Google Colab environment, combining the LightGBM machine learning algorithm with data balancing and feature selection processes. In order to reduce the effects of data imbalance on system performance, a data balancing step called STL, consisting of the Synthetic Minority Oversampling Technique (SMOTE) and Tomek-Links sampling methods, was used. In addition, the Information Gain Ratio was used as the feature selection technique in the data preprocessing stage. The effects of both the data balancing and feature selection stages on the detection performance of the system were investigated. The results obtained were evaluated using Accuracy, F-Measure, Precision, Recall, the ROC Curve and the Precision-Recall Curve. As a result, the proposed method achieved an overall accuracy of 99.95%; it also achieved 99.99%, 99.96%, 99.98%, 99.92% and 99.87% accuracy for the Normal, Grayhole, Blackhole, TDMA and Flooding classes, respectively. According to these results, the proposed method achieves very successful DoS attack detection in WSNs compared to current methods.

ensemble method that combines the LightGBM machine learning algorithm, data balancing and feature selection on the Apache Spark big data platform in the Google Colab environment. The WSN-DS dataset was used in the study. Since the WSN-DS dataset is imbalanced, the LightGBM machine learning method is combined with STL (SMOTE + Tomek-Links) data imbalance processing. In addition, the Information Gain Ratio feature selection technique is used to both increase detection performance and reduce processing load. The Apache Spark environment is preferred because speed is important in attack detection and the data used is large. The studies were carried out using PySpark, which provides Python support. In the study, the classes labeled as Normal, TDMA, Grayhole, Blackhole and Flooding in the WSN-DS dataset were classified.

The main contributions of this study can be summarized as follows:

1) A classification-based DoS intrusion detection system specific to WSNs was developed, and it was verified that it works effectively in a big data environment.

2) Another contribution of the study is that deep learning approaches were verified to be more successful in intrusion detection systems than traditional machine learning methods.

3) The LightGBM machine learning technique has been shown to be more successful in detecting WSN-specific intrusions than the hybrid deep learning approaches that have been popular in recent years.

4) Feature selection was performed on the WSN-DS dataset in order to both reduce the computational complexity and increase the classification accuracy. As a result of this process, more meaningful features were used for attack detection. In addition, a faster IDS was obtained, since less data needs to be processed. The performance improvement is confirmed by the results obtained.

5) The SMOTE oversampling and Tomek-Links undersampling algorithms are combined for data balancing. Thanks to this combination, the disadvantages of both oversampling and undersampling techniques are eliminated. As a result, the classification performance of the intrusion detection system is improved, and the improvement is confirmed by the results obtained.

6) The proposed method is compared with nine different machine learning and deep learning classification techniques. The results show that the proposed method outperforms the current and hybrid methods in the literature.

The remainder of the work is organized as follows. Related studies are discussed in Section 2.
approach. Two datasets, the WSN-DS and KDD Cup network attack datasets, were used to evaluate the proposed approaches. Jiang et al. [11] proposed an intrusion detection system designed for WSNs called SLGBM. In the study, feature selection was performed using the sequence backward selection method, and the proposed method has been tested on the WSN-DS dataset.

The proposed method has shown very successful results in detecting and classifying attacks. Liu et al. [12] proposed a network intrusion detection system based on the adaptive synthetic (ADASYN) oversampling technique and LightGBM.

Data imbalance was also addressed in the study. The proposed method was tested on the NSL-KDD, UNSW-NB15 and CICIDS2017 datasets and showed accuracy performances of 92.57%, 89.56% and 99.91%, respectively. Yao et al. [13] proposed a feature-engineering-based AutoEncoder (AE)-LightGBM intrusion detection system for SDN. The proposed system first uses Borderline-SMOTE to optimize the data distribution, then AE is used for feature engineering to extract key features. Finally, LightGBM is trained to detect attacks using the extracted features. The proposed method has been tested on the KDDCup99 and NSL-KDD datasets. Ismail et al. [14] presented a comparative study and performance analysis of different machine learning classification techniques for the detection of cyber attacks in WSNs. They investigated the performance of three techniques: GBM, LightGBM and CatBoost. Their performances were compared with three machine learning methods: Gaussian NB, KNN and RF. Feature selection and dimensionality reduction were also performed on the WSN-DS dataset in the study. Ismail et al. [15] presented a lightweight, multi-layered machine learning detection system to mitigate cyberattacks targeting WSNs. The multi-layer detection system consists of monitor nodes and two machine learning models deployed in the Base Station (BS). A Naive Bayes algorithm is used for binary classification in the first layer, and a LightGBM algorithm is used for multiclass classification in the second layer. The proposed system was able to detect the four DoS attacks observed in the WSN-DS dataset.

Ashwini and Manivannan [16] compared the performance of different machine learning algorithms on the NSL-KDD dataset for intrusion detection. Al and Dener [17] presented a hybrid deep learning approach for intrusion detection, in which the problem of data imbalance was also addressed. The proposed method was tested on the CIDDS-001 and UNSW-NB15 datasets and showed very successful results in detecting and classifying attacks. Souza et al. [18] proposed a hybrid DNN-kNN method on the NSL-KDD and CICIDS2017 datasets for IoT security. The proposed approach reached 99.77% accuracy on the NSL-KDD dataset and 99.85% on the CICIDS2017 dataset. In another study on IoT attacks [19], a deep learning approach against DoS attacks was suggested by Susilo and Sari. Liu et al. [20] proposed another intrusion detection system for IoT, in which a particle swarm optimization-based LightGBM (PSO-LightGBM) is used. In the study, PSO-LightGBM was used to extract the features of the data, and the extracted features were given as input to a one-class SVM (OCSVM). The UNSW-NB15 dataset was used to validate the proposed intrusion detection model. Tang et al. [21] proposed an intrusion detection system based on LightGBM and AE. The proposed LightGBM-AE model consists of three steps: data preprocessing, feature selection and classification. The LightGBM-AE model uses the LightGBM algorithm for feature selection, then an autoencoder for training and detection. The proposed method has been tested on the NSL-KDD dataset. Alqahtani et al. [22] proposed a new intrusion detection system based on a genetic algorithm and an extreme gradient boosting (XGBoost) classifier, called the GXGBoost model. In the study, the data imbalance problem is also addressed for performance improvement.
The proposed method has been tested on the WSN-DS dataset.

In addition, feature selection has been ignored in most studies. In this study, a comparison of machine learning and deep learning approaches in WSN-specific intrusion detection systems is made. In addition, the effects of data balancing and feature selection techniques on intrusion detection performance are evaluated. Beyond these, all the proposed work is carried out in a big data environment, to highlight the need for big data environments given the day-by-day increasing volume of WSN data. Table 1 presents relevant studies focusing on intrusion detection using deep learning and machine learning algorithms, organized by models, datasets, features and accuracy parameters.

Since the KDD Cup'99 and NSL-KDD datasets are out of date, the UNSW-NB15, CIDDS-001 and CICIDS2017 datasets have been used frequently in recent years. Although these datasets are not created specifically for WSNs, they are used both in intrusion detection systems designed for WSNs and in intrusion detection systems designed for traditional networks. For these reasons, the WSN-DS dataset was used in this study, due to its more recent attacks, greater amount of data and WSN specificity.

Intrusion detection systems are generally divided into two groups according to the detection method: signature-based and anomaly-based. In a signature-based system, attacks are detected by matching against previously known attacks. In anomaly-based systems, attacks are detected from unusual behavior of the systems. An anomaly-based IDS approach is presented in this study. The basic IDS structure is shown in Fig. 2.

In WSNs, IDSs should be installed in places with more resources, such as base stations, from which sensor nodes can be monitored, in order to defend against threats to the network. The IDS structure specific to WSNs is shown in Fig. 3. IDSs have three basic components: data collection, analysis-detection and alarming. The data collection component is used to monitor the node itself or neighboring nodes. The main component of an IDS is the analysis and detection component, which is responsible for monitoring network behavior and activities and then analyzing them to decide whether there is any abnormal behavior. The alarm component is responsible for alerting administrators when an intrusion is detected.
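The three components described above can be sketched as a minimal structure (an illustrative Python sketch with hypothetical names, not a real WSN deployment):

```python
class SimpleIDS:
    """Minimal IDS skeleton: data collection, analysis-detection, alarming."""

    def __init__(self, detector, alert):
        self.detector = detector   # analysis-detection component (callable)
        self.alert = alert         # alarm component (callable)

    def collect(self, node):
        """Data collection: monitor a node (here, just read its traffic log)."""
        return node.get("traffic", [])

    def run(self, node):
        events = self.collect(node)
        intrusions = [e for e in events if self.detector(e)]
        for e in intrusions:
            self.alert(e)          # notify administrators of each detection
        return intrusions
```

In a real WSN deployment the detector would be the trained classifier running at the base station; here it is any predicate over collected events.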

In this study, a new DoS intrusion detection system called STLGBM-DDS is proposed. The main purpose of the proposed system is to detect DoS attacks specific to WSNs, whose use is increasing day by day, which interact with each other more and more, and whose network size keeps growing. For this purpose, the LightGBM machine learning algorithm is combined with feature selection and data imbalance processing in the proposed system. The proposed system consists of data preprocessing, data splitting, data balancing, classification and evaluation sections, as shown in Fig. 4. In the data preprocessing stage, the raw dataset is made ready for the classification algorithms. In addition, with the feature selection in the data preprocessing stage, the feature dimension is adjusted to maximize algorithm performance. In the dataset splitting phase, the dataset is divided into a training dataset and a test dataset, in accordance with the training and testing purposes.

In this study, before applying the classification algorithms to the dataset, the categorical values in the dataset were converted to numerical values with a One-Hot Encoding process, and then the normalization process shown in Equation 1 was performed. As a result of the normalization process, all the numerical values in the dataset were scaled to values between 0 and 1.

The Pearson correlation matrix shown in Fig. 5 was used for feature analysis, to observe the relationships of each feature in the WSN-DS dataset with the other features in the dataset. The Pearson correlation coefficient is a test statistic that measures the statistical relationship between two continuous variables; equivalently, it is a measure of linear correlation between two data sets [31]. Since it is based on the covariance method, it is known as one of the best methods of measuring the relationship between the variables of interest. It gives information about the magnitude of the relationship as well as its direction, always produces a value between −1 and 1, and essentially amounts to a normalized measure of covariance. It is formulated as:

r_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( √(Σ_{i=1}^{n} (x_i − x̄)²) · √(Σ_{i=1}^{n} (y_i − ȳ)²) )
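For illustration, the min-max normalization of Equation 1 and the Pearson coefficient above can be sketched in plain Python (a minimal sketch; function names are ours, not the paper's):

```python
import math

def min_max_normalize(values):
    """Scale a list of numeric values into [0, 1], Equation 1 style."""
    lo, hi = min(values), max(values)
    if hi == lo:                          # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In the actual pipeline these operations would run on Spark DataFrames via PySpark's feature transformers rather than on Python lists.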

As can be seen in Fig. 5, although many features in …

As can be seen from Table 6, …

SMOTE was proposed [34] to solve the problem of class imbalance in datasets. In this method, synthetic data is produced by oversampling the data in the minority class. By generating synthetic data, SMOTE also overcomes the overfitting problem caused by random oversampling methods; it has been widely used in the field of class imbalance in recent years, as it significantly improves the overfitting situation caused by non-heuristic random sampling [23]. SMOTE increases the number of minority class samples by adding randomly generated new samples between minority class samples and their neighbors, thereby improving the class distribution. A synthetic sample is generated as

s_new = s + rand(0, 1) × d,

where s_R is the randomly selected neighbor of s according to the nearest neighbor number, and d = s_R − s is the difference between the two samples.

Tomek-Links undersampling, in contrast, is based on pairs of samples that are nearest to each other in the dataset but belong to different classes.

These data pairs are called Tomek links. The basic idea is to separate the minority and majority classes from each other. Let x be an instance of one class and y an instance of another class, with d(x, y) the distance between them; the pair (x, y) forms a T-link if there is no sample z such that d(x, z) < d(x, y) or d(y, z) < d(x, y). T-links separate the two classes, and the data samples on such a link are considered noise. Deleting majority-class noise increases the class separation and stabilizes the data distribution. It should be noted here that the noise samples are deleted from the majority class. Fig. 6 shows the dataset resulting from the Tomek-Links undersampling process.

LightGBM is a gradient boosting framework that uses a fast, distributed and high-performance tree-based learning algorithm [38]. The amount of data produced by various information systems is increasing day by day. While this situation creates the necessity of processing data quickly, it becomes difficult for traditional data science algorithms to give fast results. LightGBM is named "Light" because of its high speed. Thanks to this feature, it can process large data quickly and requires less memory. Another important feature of LightGBM is its focus on the accuracy of the results produced. LightGBM supports GPU learning, and therefore data scientists widely use it for data science application development [39].
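As a sketch of the STL balancing stage described above, the SMOTE interpolation step and Tomek-link detection can be written in plain Python (a minimal sketch with illustrative names and a simple squared-Euclidean distance, not the paper's implementation):

```python
import random

def smote_synthetic(s, neighbors, rng=random):
    """One SMOTE sample: s_new = s + rand(0, 1) * (s_R - s)."""
    s_r = rng.choice(neighbors)            # randomly selected nearest neighbor s_R
    gap = rng.random()                     # rand(0, 1)
    return [si + gap * (ri - si) for si, ri in zip(s, s_r)]

def tomek_links(points, labels):
    """Index pairs that are mutual nearest neighbors with different labels."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    links = set()
    for i in range(len(points)):
        j = nearest(i)
        if nearest(j) == i and labels[i] != labels[j]:
            links.add(tuple(sorted((i, j))))
    return sorted(links)
```

In STL, synthetic minority samples are first added via the interpolation step, then the majority-class member of each detected Tomek link is removed.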

Another advantage of LightGBM is that it supports the optimal splitting of categorical features, using a grouping method [40]. In this way, the sparse data caused by numerical transformation is avoided. In addition to these advantages, an important disadvantage is that it is sensitive to overfitting on small datasets.

Fig. 7 shows the difference between LightGBM and other tree-based algorithms. While other algorithms grow trees horizontally (level-wise), LightGBM grows the tree vertically (leaf-wise). The leaf with the maximum delta loss is selected for growing the tree structure. When growing the same leaf, a leaf-wise algorithm can reduce the loss more than a level-wise algorithm. As can be seen in Fig. 7, LightGBM typically consists of fewer decision trees and fewer leaves per decision tree, which makes it time-efficient. LightGBM rests on two algorithms: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). LightGBM adopts an advanced histogram algorithm for the feature selection of the decision tree. Here, the number of features is reduced by the EFB algorithm, while the number of samples in the training phase is reduced by the GOSS algorithm. These two algorithms form the core of LightGBM and are combined in it.

For the GOSS algorithm, a denotes the proportion of larger-gradient samples and b ∈ (0, 1−a) denotes the proportion of smaller-gradient samples to be randomly selected. The values of a and b are predetermined. GOSS randomly samples the data samples with small gradients and scales them in the data distribution by a constant factor of (1−a)/b. In this way, GOSS reduces the data size while keeping accuracy high, without changing the distribution of the original dataset too much.
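The GOSS subsampling described above can be sketched in plain Python (a minimal illustration of the idea, not LightGBM's internal implementation; names are ours):

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, rng=random):
    """Gradient-based One-Side Sampling (sketch).

    Keeps the top `a` fraction of samples by |gradient| and randomly keeps a
    `b` fraction of the rest, re-weighting the kept small-gradient samples by
    (1 - a) / b so the overall gradient distribution is roughly preserved.
    Returns (sample index, weight) pairs.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(a * n)
    large = [(i, 1.0) for i in order[:top_k]]          # large-gradient set A
    sampled = rng.sample(order[top_k:], int(b * n))    # sampled small-gradient set B
    small = [(i, (1.0 - a) / b) for i in sampled]
    return large + small
```

The weights returned for the small-gradient samples are exactly the (1−a)/b factor used when the variance gain of a split is estimated on the reduced set.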
Thus, the final information gain is calculated by Equation (7):

Ṽ_j(d) = (1/n) · [ ( Σ_{x_i∈A_l} g_i + ((1−a)/b) Σ_{x_i∈B_l} g_i )² / n_l^j(d) + ( Σ_{x_i∈A_r} g_i + ((1−a)/b) Σ_{x_i∈B_r} g_i )² / n_r^j(d) ]   (7)

Here, A is the retained set of large-gradient samples, B is the randomly sampled set of small-gradient samples, g_i is the gradient of sample x_i, and the subscripts l and r denote the samples falling to the left and right of the candidate split point d of feature j. As a result, to determine the split point, the information gain Ṽ_j(d) of a smaller subset of the data is calculated instead of the information gain of the entire dataset, and the computational load is significantly reduced.

EFB is mainly used to effectively reduce the number of features. EFB aims to reduce the number of features without harming the accuracy rate, and accordingly to increase the efficiency of model training. EFB has two basic processing steps: creating bundles, and combining the features in the same bundle. High-dimensional data are often very sparse, and in a sparse feature domain many features are mutually exclusive. EFB can safely collect exclusive features into a single feature; thus, EFB combines sparse features to create denser features. If two features are not completely mutually exclusive, a conflict ratio is used to measure their degree of non-mutual exclusion; when this value is small, the two features can be combined without affecting the final accuracy. EFB then generates histograms from the obtained feature bundles in the same way as for individual features. Accordingly, the complexity is reduced, the accuracy level is maintained, and the training process becomes faster, with lower memory consumption.

… compared to the methods suggested in the literature. From the results obtained, it has been observed that deep learning algorithms achieve better results than traditional machine learning algorithms.
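Returning to the EFB bundle-creation step described above, it can be sketched as a greedy conflict-counting procedure (an illustrative sketch of the idea only, not LightGBM's actual implementation, which also orders features by conflict degree and merges bundled features via value offsets):

```python
def efb_bundles(feature_columns, max_conflicts=0):
    """Greedy Exclusive Feature Bundling (sketch).

    Two features 'conflict' on a row where both are nonzero.  Each feature is
    placed into the first bundle where its total conflict count with that
    bundle stays within `max_conflicts`; otherwise it starts a new bundle.
    Returns bundles as lists of feature indices.
    """
    bundles = []   # each bundle: list of feature indices
    nonzero = [{i for i, v in enumerate(col) if v != 0} for col in feature_columns]
    for f, rows in enumerate(nonzero):
        for bundle in bundles:
            conflicts = sum(len(rows & nonzero[g]) for g in bundle)
            if conflicts <= max_conflicts:
                bundle.append(f)
                break
        else:
            bundles.append([f])
    return bundles
```

With `max_conflicts=0` only strictly mutually exclusive features are bundled; allowing a small positive conflict budget corresponds to the tolerated conflict ratio mentioned above.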

The remarkable point in the results obtained is that the CNN-LSTM hybrid approach individually performs worse … (Table 9).

From the ROC and Precision-Recall curves shown in Fig. 10 and Fig. 11, respectively, it can be seen that the proposed algorithm is quite successful for all classes.

As can be seen from the figure, the correct detection rate of DoS attacks increased from 99.70% to 99.95%. Data balancing significantly improves the performance of the DoS intrusion detection system.
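For reference, the per-class Precision, Recall and F-Measure used in the evaluation follow the standard confusion-matrix definitions; a minimal helper (ours, not from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class Precision, Recall and F-Measure from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```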

Finally, the effect of feature selection on algorithm performance is evaluated. As can be seen from Fig. 14, the feature selection process has an impact on the accuracy of DDS.

At this evaluation stage, the proposed algorithm without feature selection achieved 99.91% accuracy. As a result of the feature selection process, the accuracy of the proposed algorithm increased to 99.95%. Although this may seem like a numerically small increase, in intrusion detection systems, where each detection matters, even the slightest increase in the correct detection rate is important, because each attack can have serious consequences.

In this study, a new classification-based DoS intrusion detection system is proposed to detect DoS attacks specific to WSNs. The proposed STLGBM-DDS approach combines the LightGBM machine learning algorithm with data balancing and feature selection operations. In the study, the STL (SMOTE + Tomek-Links) ensemble algorithm was used for data balancing, and the LightGBM machine learning algorithm was used for the classification process. Experimental studies were performed on the WSN-DS dataset. All experimental studies were carried out on the Apache Spark big data platform in the Google Colab environment.

VOLUME 10, 2022