Intrusion Detection System Based on Gradient Corrected Online Sequential Extreme Learning Machine

Intrusion detection systems (IDSs) are an active research topic in machine learning. A single-hidden-layer feedforward neural network (SLFN) trained with the extreme learning machine (ELM) approach is commonly used for IDS. Its appeal lies in its fast learning and in the support for sequential learning offered by its online sequential extreme learning machine (OSELM) variant. An issue with OSELM that researchers have addressed is the random nature of the input-hidden layer weights. Most approaches use metaheuristic optimisation to determine the optimal weights of OSELM and thereby resolve the random weight issue. However, metaheuristic approaches require many trials to find the optimal weights, which raises concerns about convergence and speed. This article proposes a novel approach for finding the optimal weights of the input-hidden layer by integrating OSELM with back-propagation (BP), designated OSELM-BP. After integration, BP changes the random weights iteratively and uses an iterated evaluation of the generated error for feedback correction of the weights. The approach is evaluated under various activation functions for OSELM on the one hand and various numbers of BP iterations on the other. An extensive evaluation and a comparison with the original OSELM reveal the superiority of OSELM-BP in reaching optimal accuracy with a small number of iterations.


I. INTRODUCTION
Intrusion detection is the task of observing, analysing and identifying activities that aim to violate a network's security policy. Identifying such activities successfully relies on appropriate monitoring of the network, in which its usage is diagnosed continuously [1]. The conventional approach used in the past to prevent suspicious activities depended on an authentication framework in which organisations defined specific authentication policies with various access levels and gave users restricted network access based on their role. However, such an approach does not guarantee full prevention of unauthorised activities, as attacks on a network's privacy have become more advanced [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Alawneh.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Intrusions take various forms, and those who carry them out may have purposes other than damaging a network or degrading its performance. It is worth mentioning the types of intrusions that might be encountered in a network. The most common intrusion type is denial of service (DoS), which aims to degrade a network's performance by sending massive amounts of information to it [3]. Another type is probing, which scans a network in search of a valid IP address in order to gather information [4]. The third type is usually called compromising, in which an attacker exploits a weakness in a network to gain privileged access to it [5]. Additionally, there are attacks that rely on predefined malicious software, such as viruses, worms and Trojan horses [6]. A common feature of the aforementioned intrusion types is the significant change that they may cause in the network's usage [3], [6]. Therefore, the information security community has reacted to such attacks by proposing methods that sniff network packets and report an extensive analysis of what is running on the network. In the literature, this analysis has been divided into two main categories: misuse-based analysis and anomaly-based analysis [7].
With the emergence of publicly available intrusion datasets, such as KDD-CUP99 [8] and NSL-KDD [8], the research community has tended to use machine learning techniques (MLTs) for the detection task. Such techniques rely heavily on historical data to build a model that learns the features of both legitimate and intrusion connections. The model is then used for future detections. Yet there has been considerable debate in the literature about which MLT constitutes best practice. The criteria examined in this debate were the training time needed to build a model and its detection accuracy. Recently, deep learning techniques (DLTs) have caught several researchers' attention because of their high classification accuracy [9], [10]. Most DLTs are based on multilayer neural network architectures that provide better learning and understanding of the intrusion features. However, these architectures have proved to be time-consuming because of their long training times. In this regard, some researchers have attempted to use single-layer neural network (NN) architectures, or so-called shallow NNs, which have a relatively similar classification accuracy but consume notably less time. Back-propagation (BP) is one of the best-known approaches for training neural networks. It uses the gradient of the error to update the weights of the network until the point of zero gradient is reached, at which the weights no longer change and the network is fully trained. However, some researchers have criticised this approach for being slow and prone to local minima, which has motivated competing approaches for shallow networks, such as ELM, which uses the Moore-Penrose inverse to obtain the least-squares solution in a single training iteration [11].
Whether ELM is superior to BP in terms of accuracy for shallow networks is an open research problem. While many researchers have criticised BP for its low training speed and the possibility of falling into local minima, ELM has been criticised for its non-optimal weights in the input-hidden layer, which result from the random initialisation of the weights, as well as for the need to define the network's optimal structure [12]. This criticism implies that randomisation obstructs high classification accuracy. In this regard, this paper aims to overcome this drawback by modifying OSELM so that BP updates the input-hidden weights while the hidden-output weights are retained from OSELM, which would theoretically improve classification accuracy in a short time.
The main contributions of our work are as follows:
1) We propose an integrated OSELM-BP method for IDS to overcome the randomisation of the input-hidden weights, from which OSELM suffers.
2) The proposed method uses BP to update the input-hidden weights while retaining the hidden-output weights inherited from OSELM.
3) We evaluate the performance of the proposed OSELM-BP method for five activation functions, with and without relying on the characterisation model for setting the number of neurons, and for various numbers of BP iterations, and we show its superiority in reaching optimal performance more frequently than OSELM alone.
4) Three IDS datasets are used for evaluation, namely CICIDS-2017, KDD 99 and NSL-KDD.

II. LITERATURE SURVEY
Although there are many approaches for intrusion detection in network traffic, such as clustering-based techniques or support vector machines (SVM), they mostly have the disadvantage of long training times. Moreover, they normally need parameter tuning [13]-[16] and do not perform satisfactorily in multiclass classification. Hence, in this section, we focus on machine learning based intrusion detection techniques. The authors in [17] concentrate on ELM and OSELM techniques used for IDSs. These methods have several attributes that motivate their use for building IDSs, including (i) easy assignment of parameters, (ii) strong generalisation and (iii) online and fast training. The results indicate that the methods can be applied to large amounts of data without considerable loss of generalisation. In [18], an OSELM-based technique is provided for intrusion detection. The proposed technique uses alpha profiling to reduce time complexity, while irrelevant attributes are discarded using an ensemble of filtered, correlation-based and consistency-based attribute selection techniques. Instead of sampling, beta profiling is used to reduce the size of the training dataset. The standard NSL-KDD 2009 dataset is used to evaluate the performance of the proposed technique. The authors in [19] also provide an intrusion detection technique based on OSELM and evaluate it on the KDD-CUP99 dataset. In that study, three attribute subset evaluation techniques are used to remove redundant attributes: filtered evaluation, CFS subset evaluation and consistency subset evaluation. Two techniques of network traffic profiling are used. Alpha profiling is performed to reduce time complexity, and beta profiling is used to remove redundant connection records, thus decreasing the dataset size.
In [20], an OSELM-based intrusion detection system is used to detect attacks in advanced metering infrastructure (AMI) and is compared with other algorithms. The simulation results indicate that the OSELM-based method is better than other intrusion detection methods in terms of detection speed and accuracy. In [21], the dual adaptive regularised online sequential extreme learning machine (DA-ROS-ELM) is provided for detecting network intrusion. A ridge regression factor based on Tikhonov regularisation is defined to handle ill-posed problems and over-fitting. A dual adaptive mechanism is designed to select, respectively, the accurate update of the output weight β for the data arriving at every update step and the regularisation parameter C for all of the data available so far. The performance of the proposed algorithm is evaluated using the NSL-KDD dataset. The results indicate that DA-ROS-ELM can achieve better generalisation, higher accuracy, lower rates of false positives and false negatives and faster training than other network intrusion detection algorithms. In [22], as in batch ELM, the input weights of the single-hidden-layer feedforward neural network (SLFN) in OSELM-RLS are produced randomly, whereas the output weights are obtained by the recursive least-squares (RLS) solution. In [23], OSELM is provided with an efficient sample updating mechanism. Old and new samples are assigned different weights. The effect of new training samples on the algorithm is increased, which further improves the regression prediction ability of ELM. In addition, an improved artificial bee colony algorithm is provided and used to optimise the parameters of the adaptive OSELM. The stability and convergence of the proposed prediction method are proved. Real short-term wind-speed time series are used as the object of research to confirm the prediction performance of the proposed method, and a multi-step short-term wind-speed prediction simulation is carried out. The simulation results indicate that, compared with other prediction methods, the proposed approach is reliable and achieves higher prediction accuracy and better performance indicators.
In the work of [20], a simple OSELM-based IDS was applied to detect intrusion attacks in a smart grid. The authors did not tackle the issue of random weights between the input and hidden layers. In the work of [24], an ensemble of OS-ELM (EOSELM) with feature selection was proposed to predict the post-fault transient stability status of power systems in real time. OSELM was used as a weak classifier and integrated with an online boosting algorithm as the ensemble learning algorithm. In the work of [25], the existing centralized cloud intelligence is distributed to local fog nodes to detect attacks at a faster rate for IoT applications, and OS-ELM was used for this purpose. Some researchers have focused on the implementation aspect of ELM for IDS in fog networks. For example, in the work of [26], a distributed ELM classification for fog networks is proposed, in which each fog node is trained on a sample of the entire data, on the assumption that this sample represents the data accessible to that node. The authors derived a classical ELM model indexed by fog node and required the training process to be repeated until a minimum required training error, which they called a performance index, was reached. This work also suffers from the issue of random weights in ELM. ELM for IDS has also been combined with probabilistic algorithms. This is shown in the work of [1], where a probability density function is learned from flow features for frequent communications. The authors used a hierarchical heavy hitters algorithm for clustering network statistics and learned the probability density function of each feature using ELM. However, this model also does not deal with the random weights of ELM.
Some researchers have integrated ELM with feature reduction algorithms, namely principal component analysis (PCA), to boost performance and reduce computational time. This is done in the work of [27], where an adaptive PCA was used with ELM. However, this is regarded as a direct implementation of ELM without any handling of its random weights issue. Other methods of using ELM for IDS propose various classification architectures. In the work of [28], a cascade architecture based on a set of individual ELM classifiers was proposed to counter the imbalance of IDS datasets, in which the majority of samples are normal and the minority are attacks.
Metaheuristic-based ELM optimization has also been used extensively. For example, in the work of [29], particle swarm optimisation (PSO) was used to maximise an objective function representing the training accuracy of the network over a solution space. The solution space contains the candidate weights of the connections between the input and hidden layers and the biases of the hidden neurons. This approach provides better accuracy than arbitrary weights in the input-hidden layer. However, there is a concern about computational complexity, because considerable searching with an adequate number of particles and iterations is needed before convergence is reached. The literature contains numerous attempts to optimise the weights of the neural network in ELM using metaheuristic optimisation algorithms, such as differential evolution [30], [31], cuckoo search [32], the firefly algorithm [33], dolphin swarm optimisation [34], genetic optimisation [35] and ameliorated teaching-learning-based optimization [36]. Additionally, the work of [37] attempts to optimise the number of hidden neurons in ELM: a greedy search is performed between a candidate minimum and maximum number, and the training error is used as the metric for selecting the number with the best performance. Such work is subject to local minima, as the performance is not necessarily a convex function of the number of hidden neurons. Hence, another attempt to optimise the number of hidden neurons was based on a metaheuristic approach instead of greedy searching, as in the work of [38].
Overall, ELM has been used for IDS in both its offline and online learning modes. The lightweight nature of this model makes it appealing for deployment in IDS. However, researchers aim to improve the accuracy of both ELM and OSELM when they are used for IDS in order to reduce the number of false alarms. This has motivated researchers to focus on the issue of random weights in the input-hidden layer of the model. Most studies have concentrated on metaheuristic searching for this purpose, with the aim of finding optimal weights. However, there is concern about the convergence performance of metaheuristic searching. The essential difference between ELM training and BP training lies in the use of the error gradient, which gives BP the ability to converge gradually towards the minimum error point. The ELM approach lacks this behavior because of its one-shot calculation based on the least-squares error. It would therefore be interesting to propose a novel approach that leverages the advantages of both. The goal of this article is to combine the lightweight nature of ELM and its avoidance of over-fitting and local optima with the gradual convergence of BP to the optimal point.

III. METHODOLOGY
This section presents the methodology of the integrated OSELM-BP learning. It starts with the model formulation of ELM and the back-propagation correction in sub-section III.A. Next, we present an overview of the classical OSELM model in sub-section III.B, followed by our integrated OSELM-BP in sub-section III.C. The computational complexity is discussed in sub-section III.D. The datasets that are used for evaluation are provided in sub-section III.E, and the evaluation metrics are presented in sub-section III.F. Table 1 lists the notation used.

A. MODEL FORMULATION
Assume that we have $N$ arbitrary distinct samples $(x_j, t_j) \in \mathbb{R}^d \times \mathbb{R}^m$, where $d$ denotes the number of features (attributes) and $m$ denotes the number of outputs (targets). In addition, assume that we have a single-layer feed-forward neural network (SLFN) composed of $L$ hidden neurons. Then, we determine the weights using the model in (1):

$$\sum_{i=1}^{L} \beta_i \, g(a_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N \tag{1}$$

where $i = 1, \ldots, L$ denotes the index of the hidden neuron, $j = 1, 2, \ldots, N$ denotes the index of the sample, $g$ is the activation function, $(a_i, b_i)$ are the input weights and bias of neuron $i$, and $\beta_i$ is its output weight. A more compact way to write the previous equation is (2):

$$H \beta = T \tag{2}$$

where $H$ is the $N \times L$ hidden-layer output matrix with entries $H_{ji} = g(a_i \cdot x_j + b_i)$, $\beta = [\beta_1, \ldots, \beta_L]^T$ and $T = [t_1, \ldots, t_N]^T$. The number of columns of $H$ (that is, $L$) must equal the number of rows of $\beta$, which makes the equation valid. The training procedure that Huang suggested in his ELM article starts with a random initialization of $(a_i, b_i) \in [-1, 1]$ and then finds the value of $\beta$ based on (3):

$$\beta = H^{-1} T \tag{3}$$

Unfortunately, $H$ is not guaranteed to be square because $N \gg L$. Hence, we use the Moore-Penrose generalized inverse of the matrix, $H^{\dagger} = (H^T H)^{-1} H^T$, so that the complete equation becomes $\beta = (H^T H)^{-1} H^T T$. Some researchers have suggested adding a positive value $\frac{1}{C}$ to the equation in order to make the solution more stable and to generalize better according to ridge regression theory. This value is called the regularization factor, as shown in (4):

$$\beta = \left(\frac{I}{C} + H^T H\right)^{-1} H^T T \tag{4}$$

After training, the prediction for any new sample $x$ is based on (5):

$$y = h(x) \beta \tag{5}$$

where $h(x)$ is the hidden-layer output vector of the new sample. The equation is generalized further by using the kernel variant in the form of (6):

$$y = k \left(\frac{I}{C} + \Omega_{ELM}\right)^{-1} T \tag{6}$$

where $\Omega_{ELM}$ denotes the kernel matrix and $k$ denotes the kernel of the new sample with respect to the training data.
Regardless of the ELM variant used, there is an issue with the weights of the input-hidden layer $w_{ij} = (a_{ij}, b_{ij})$, namely their random generation. Such random generation causes unstable performance as well as sub-optimal solutions. To solve this, the concept of back-propagation is adopted. Assume that $C(\cdot, \cdot)$ is a loss function (or cost function) that measures the number of misclassifications in the neural network after it has been trained with traditional ELM. In addition, assume that the overall network output is given as $y(x_j) = h(x_j) H^T \left(\frac{I}{C} + H H^T\right)^{-1} T$ for a training set composed of pairs $(x_j, t_j)$, where the $C$ inside the inverse is the regularization factor rather than the cost function. Hence, the loss of the model on the pair $(x_j, t_j)$ is the cost of the difference between the output $y(x_j)$ and the target $t_j$, namely $C(y(x_j), t_j)$. We consider the cost to be represented by the error or misclassification $E$. We calculate the derivative of the error with respect to the weights and change the weights using (7):

$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \tag{7}$$

where $\eta$ denotes the learning rate. We conduct a set of MaxIt iterations, in each of which the value $w_i$ is updated using (8):

$$w_i \leftarrow w_i + \Delta w_i \tag{8}$$
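The gradient correction of the input-hidden weights described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a sigmoid activation, a single output, a squared-error cost E = 0.5 (y - t)^2 in place of the misclassification count, and output weights β that are held fixed while the input weights are corrected. The function names (`forward`, `bp_step`, `squared_error`) and all numerical values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, A, b, beta):
    """Return hidden activations h and scalar output y = sum_i beta_i * h_i."""
    h = [sigmoid(sum(a_k * x_k for a_k, x_k in zip(A[i], x)) + b[i])
         for i in range(len(b))]
    y = sum(beta_i * h_i for beta_i, h_i in zip(beta, h))
    return h, y

def bp_step(x, t, A, b, beta, eta):
    """One gradient step on input weights A and biases b (beta is not changed)."""
    h, y = forward(x, A, b, beta)
    err = y - t                                   # dE/dy for E = 0.5 * (y - t)^2
    new_A, new_b = [], []
    for i in range(len(b)):
        # Chain rule through the sigmoid: dE/dz_i = err * beta_i * h_i * (1 - h_i)
        delta = err * beta[i] * h[i] * (1.0 - h[i])
        new_A.append([a_ik - eta * delta * x_k for a_ik, x_k in zip(A[i], x)])
        new_b.append(b[i] - eta * delta)
    return new_A, new_b

def squared_error(x, t, A, b, beta):
    _, y = forward(x, A, b, beta)
    return 0.5 * (y - t) ** 2

# Toy example (assumed values): 2 features, 3 hidden neurons.
x, t = [0.5, -1.0], 1.0
A = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]]   # stands in for the random input weights
b = [0.1, -0.2, 0.05]
beta = [0.6, -0.3, 0.8]                      # stands in for the ELM output weights
eta, max_it = 0.5, 50                        # learning rate and MaxIt

e_before = squared_error(x, t, A, b, beta)
for _ in range(max_it):
    A, b = bp_step(x, t, A, b, beta, eta)
e_after = squared_error(x, t, A, b, beta)
```

With a sufficiently small learning rate, each step moves the weights against the error gradient, so the error on the sample shrinks over the MaxIt iterations instead of staying at the value fixed by the random initialization.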

B. ONLINE SEQUENTIAL EXTREME LEARNING MACHINE OSELM
In our algorithm, we adopt the online variant of the extreme learning machine for single-hidden-layer feed-forward networks, called OSELM [39]. The online variant enables the knowledge of the NN to be updated according to the provided chunks. Given an activation function g and L hidden neurons, the learning procedure consists of the two phases described below.

1) BOOSTING PHASE USING THE INITIAL CHUNKS
Given a small initial training set {X 0 , Y 0 }, the learning algorithm is first boosted through the following procedure.
1) Initialize arbitrary input weights w i and biases b i based on a random variable with center u i and standard deviation σ i , i = 1, . . . , L.
2) Calculate the initial hidden layer output matrix in (9).

2) SEQUENTIAL LEARNING PHASE
For each further incoming observation:
1) Calculate the hidden layer output vector by using (11).
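The two phases can be sketched as follows, assuming the standard recursive least-squares form of OSELM: P0 = (H0^T H0)^-1 and β0 = P0 H0^T T0 in the boosting phase, then P ← P - P h^T (1 + h P h^T)^-1 h P and β ← β + P h^T (t - h β) for each new sample. This is a hedged illustration rather than the authors' code: chunks are taken to be single samples, the input weights are fixed illustrative numbers instead of random draws, and all helper names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def hidden(X, A, b):
    """Sigmoid hidden-layer outputs, one row per sample."""
    return [[1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(A[i], xs)) + b[i])))
             for i in range(len(b))] for xs in X]

def oselm_boost(H0, T0):
    """Boosting phase: P0 = (H0^T H0)^-1 and beta0 = P0 H0^T T0."""
    H0t = transpose(H0)
    P = inverse(matmul(H0t, H0))
    beta = matmul(P, matmul(H0t, T0))
    return P, beta

def oselm_update(P, beta, h, t):
    """Sequential phase for one sample: h has L entries, t has m entries."""
    Ph = matmul(P, transpose([h]))                 # L x 1 column P h^T
    hP = matmul([h], P)[0]                         # row h P
    denom = 1.0 + matmul([h], Ph)[0][0]            # scalar 1 + h P h^T
    n = len(P)
    P_new = [[P[i][j] - Ph[i][0] * hP[j] / denom for j in range(n)] for i in range(n)]
    prediction = matmul([h], beta)[0]              # h beta with the old weights
    residual = [t_k - y_k for t_k, y_k in zip(t, prediction)]
    gain = matmul(P_new, transpose([h]))           # L x 1 column P_new h^T
    beta_new = [[beta[i][k] + gain[i][0] * residual[k] for k in range(len(t))]
                for i in range(len(beta))]
    return P_new, beta_new

# Toy run (assumed values): 2 features, 3 hidden neurons, 1 output.
A = [[0.5, -0.2], [-0.3, 0.8], [0.9, 0.4]]
b = [0.1, -0.1, 0.2]
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 0.3], [0.2, 0.7]]
T = [[1.0], [0.0], [1.0], [0.0], [1.0], [0.0]]

H = hidden(X, A, b)
P, beta = oselm_boost(H[:4], T[:4])        # boosting on the first four samples
for h, t in zip(H[4:], T[4:]):             # remaining samples arrive one at a time
    P, beta = oselm_update(P, beta, h, t)

_, beta_batch = oselm_boost(H, T)          # batch least squares on all samples
```

The sequential result `beta` agrees with `beta_batch` up to rounding, which is why OSELM can absorb new chunks without retraining on the whole dataset.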

C. INTEGRATED BACK-PROPAGATION ONLINE SEQUENTIAL EXTREME LEARNING MACHINE-OSELM-BP
This section presents the integrated back-propagation OSELM (OSELM-BP). Similar to OSELM, it consists of a boosting phase and an iterative phase. A general flowchart for the algorithm is presented in Fig. 1. As depicted in the pseudocode, the algorithm uses the boosting data {X 0 , Y 0 }, the chunks x j , t j , the activation function g, the number of iterations MaxIt, and the learning rate η. The algorithm first boosts the neural network in the initial stage. Then, for each new chunk, it calls OSELM for an initial update of the weights and calls back-propagation to update the weights iteratively with the factor η for MaxIt iterations.

E. DATASETS
This section presents examples of the best-known datasets in IDS. For each one, we provide its statistical information in terms of the number of records, the classes and their distribution.

1) CICIDS-2017
The first dataset is CICIDS-2017 [40]. Its details are provided in Table 4, which shows the name of the file used, the day of activity and the attacks found.
In our experiments, we merge all the traffic data within the five days (as shown in Table 4) in a single dataset. Table 5 depicts the details of the merged dataset.

2) KDD 99
Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods [41]. This dataset is built from the data captured in the DARPA'98 IDS evaluation program. DARPA'98 consists of about 4 gigabytes of compressed raw (binary) TCP dump data from 7 weeks of network traffic, which can be processed into about 5 million connection records of about 100 bytes each. The two weeks of test data contain around 2 million connection records. The KDD 99 training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. The simulated attacks fall into one of the following four categories:
Denial of Service Attack (DoS): an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
User to Root Attack (U2R): a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
Remote to Local Attack (R2L): occurs when an attacker who has the ability to send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
Probing Attack: an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
The distribution of the classes according to their sample sizes is provided in the pie chart in Fig. 2. As we observe, the dataset is imbalanced. This imbalance makes the problem of classification or clustering very challenging.

3) NSL-KDD
Statistical analysis showed that the KDD 99 dataset has important issues that strongly affect the performance of the evaluated systems and result in a very poor estimation of anomaly detection approaches. To solve these issues, a new dataset, NSL-KDD, was proposed, which consists of selected records of the complete KDD 99 dataset [42], [43]. The advantages of the NSL-KDD dataset are as follows:
There are no redundant records in the training set, so the classifier will not produce biased results.
There are no duplicate records in the test set, which gives better reduction rates.
The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD 99 dataset.
The training dataset is made up of 21 different attacks out of the 37 present in the test dataset. The known attack types are those present in the training dataset, while the novel attacks are the additional attacks in the test dataset, i.e., those not available in the training dataset.

F. EVALUATION METRICS
This section presents the evaluation measures used for quantifying the performance of the proposed OSELM-BP and the comparison with OSELM. TP denotes true positive, TN denotes true negative, FP denotes false positive, and FN denotes false negative.

1) ACCURACY
Accuracy represents the number of true predictions divided by all cases of prediction [24]. The formula is given in (14).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$

2) PRECISION (PPV)
Positive predictive value (PPV) represents the number of true positives predicted by the classifier divided by the number of all predicted positive records [25]. The formula is given in (15).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$

3) RECALL (TPR)
TPR represents the number of TP predicted by the classifier divided by the number of all tested positive records [26]. The formula is given in (16).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{16}$$

4) G-MEAN
This measure is calculated based on precision and recall [25]. The formula is given in (17).

$$\text{G-mean} = \sqrt{\text{Precision} \times \text{Recall}} \tag{17}$$

5) F-MEASURE
This measure is the harmonic mean of precision and recall [25]. The formula is given in (18).

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$
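The five measures can be computed directly from the confusion counts, as the following sketch shows. It follows the verbal definitions in this section; taking G-mean as the geometric mean of precision and recall is an assumption drawn from the statement that it is based on these two quantities. The function name `classification_metrics` and the example counts are hypothetical, and the sketch assumes that no denominator is zero.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, G-mean and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)            # true positives over predicted positives
    recall = tp / (tp + fn)               # true positives over actual positives
    g_mean = math.sqrt(precision * recall)                       # assumed form
    f_measure = 2 * precision * recall / (precision + recall)    # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }

# Illustrative counts only (not taken from the experiments).
m = classification_metrics(tp=40, tn=45, fp=10, fn=5)
```

For these counts the accuracy is 0.85 and the precision is 0.8, while the recall is 40/45, so the F-measure (80/95) and G-mean fall between precision and recall.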

IV. EXPERIMENTAL DESIGN AND RESULTS
The simulations for the OSELM and OSELM-BP algorithms are carried out in the MATLAB 2019b environment running on an Intel Core i5 CPU with a speed of 1.4 GHz.
The experimental design starts with building the characterization model, which relates the testing accuracy to the number of neurons in the hidden layer. For each of the three datasets, we generate the characterization model that is used to find the best number of neurons for operating OSELM and OSELM-BP. We find that each characterization model reaches its peak at a different number of neurons, as presented in Fig. 2.
The main parameter is the number of neurons, which was determined from the characterization model given in Fig. 2 for each of the three datasets: (a) CICIDS2017, (b) KDD99 and (c) NSL. The number of neurons selected from the characterization model is 100, 300 and 100 for CICIDS2017, KDD99 and NSL, respectively. For evaluation, the data was partitioned into 60% for training and 40% for testing. The proposed OSELM-BP is compared with the original OSELM with respect to six evaluation metrics, namely accuracy, precision, recall, F-measure, G-mean and time. The first five metrics are to be maximized, while the last one is to be minimized. Thus, we show the reciprocal of the time and normalize it to one. Each of the two models is evaluated with each of five activation functions: sigmoid, hardlim, rbf, tansig and sin. In addition, for OSELM-BP we consider four numbers of iterations: 9, 24, 99 and 199. The experiments were repeated for two separate sets: in the first, the number of neurons was taken from the characterization model, while in the second, the number of neurons was two thirds of the number of features. Each set is repeated for the three datasets: CICIDS-2017, KDD99 and NSL-KDD. We show the testing results for the characterization-model-based selection of the number of neurons in Figures 3,17a. Observing the figures, we see that OSELM behaved better for the two activation functions sin and sigmoid, while OSELM-BP was better for RBF, tansig and hardlim. The poor performance of OSELM-BP in the case of sin and sigmoid is interpreted as over-fitting because the optimal number [...] In Fig. 16a, 10 iterations for OSELM-BP were adequate to bring the model to its highest performance, whereas the performance declined as the number of iterations increased; however, in all cases OSELM-BP was superior to OSELM.
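Two of the setup rules above lend themselves to a short sketch: turning a run time into a score that is maximized like the other metrics, and setting the number of hidden neurons to two thirds of the number of features. Both functions are assumptions about details the text leaves open: the time score is taken as the reciprocal of each time divided by the largest reciprocal, so that the fastest run scores 1, and the two-thirds rule is rounded to the nearest integer. The names `normalised_speed` and `hidden_neurons_two_thirds` are hypothetical.

```python
def normalised_speed(times):
    """Reciprocal of each run time, scaled so that the fastest run scores 1.0."""
    inverse_times = [1.0 / t for t in times]
    fastest = max(inverse_times)
    return [v / fastest for v in inverse_times]

def hidden_neurons_two_thirds(num_features):
    """Two thirds of the feature count, rounded to the nearest integer (assumed)."""
    return max(1, round(2 * num_features / 3))

# Illustrative run times in seconds (not measured values).
scores = normalised_speed([1.0, 2.0, 4.0])

# KDD 99 connection vectors contain 41 features.
kdd_neurons = hidden_neurons_two_thirds(41)
```

With this convention, a run that takes twice as long as the fastest one receives a score of 0.5, which places time on the same higher-is-better scale as accuracy, precision, recall, F-measure and G-mean.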
Another observation is that the OSELM-BP model depends on the number of iterations for reaching the optimal performance. For example, the optimal performance was reached at 9 iterations for the NSL-KDD dataset and the hardlim activation function in Fig. 16.a, whereas it was reached at 199 iterations for the NSL-KDD dataset and the tansig activation function in Fig. 15.a. Also, we observe from the figures that in all datasets when activation functions [...] selecting the number of neurons in the hidden layer, OSELM-BP was superior in 11 cases and equivalent in one case, while OSELM was superior in only 3 cases. Such results support the hypothesis that using back-propagation with OSELM is effective for fine-tuning the model after OSELM is performed. The lack of superiority in some cases arises from the over-fitting that occurs in those cases. The number of over-fitting cases is higher when the characterization model is used, because the suitable number of neurons in the hidden layer was determined prior to the training.
Based on the above analysis, it can be stated that OSELM-BP has high scalability. The evaluation reveals the superiority of OSELM-BP in reaching optimal accuracy with a small number of iterations even with large datasets such as KDD 99, which makes it a highly scalable approach. As an overall summary of the performance differences between OSELM and OSELM-BP, we show the numerical values of the various performance metrics in Table 8. The best value was attained by OSELM-BP. A statistical test of the differences gave an overall t-test p-value of less than 0.05, which indicates statistical significance.

V. CONCLUSION
This article has presented a novel variant of the extreme learning machine to solve the problem of random weights in the input-hidden layer. The variant uses the gradient of the error as feedback to correct the weights in the input-hidden layer and in the hidden-output layer for a pre-defined number of iterations. Hence, it is designated the back-propagation online sequential extreme learning machine (OSELM-BP). The variant was developed for IDS because IDS is a critical type of classification system owing to its security role. Hence, countering the random behaviour of the input-hidden layer is crucial to prevent many types of false classifications.
The evaluation of the developed OSELM-BP was conducted on three datasets: CICIDS-2017, KDD-99 and NSL-KDD. The configurations were based on changing the activation function and the number of hidden neurons of OSELM and the number of iterations of BP. Two sets of results were produced: the first using the number of hidden neurons generated from the characterization model, and the second using the lowest possible number of hidden neurons, defined as two thirds of the number of features in the model. The finding is that OSELM-BP outperforms OSELM when the number of neurons is at its minimum; however, the performance degrades when the number of neurons is higher because of over-fitting. Future work is to incorporate an adaptive algorithm for selecting the appropriate number of hidden neurons to achieve the best possible performance and to enable automatic selection of the number of iterations of OSELM-BP.