Implementing a Deep Learning Model for Intrusion Detection on Apache Spark Platform

The evolution of the Internet has produced a connected world with a massive amount of data. This connectivity came at the price of more complex and advanced attacks. An Intrusion Detection System (IDS) is an essential component for security in modern networks. IDS methodology is either signature-based detection or anomaly-based detection. Recently, researchers have adopted Deep Learning (DL) because it performs better than traditional machine learning algorithms. Using DL to produce a model for the IDS may take a long time because of the computational complexity and the large number of hyperparameters. This article implements different DL models for IDS on Apache Spark. It uses the well-known Network Security Lab - Knowledge Discovery and Data Mining (NSL-KDD) dataset and presents a computation delay comparison between Apache Spark and a regular implementation. Moreover, an enhanced model is used to improve attack detection accuracy.


I. INTRODUCTION
Computer networks have proliferated over the years, adding to social and economic growth. The Internet Security Threat Report (ISTR) states that 1 in 13 Web requests leads to malware. The spam rate in e-mails had increased to 55%, ransomware had risen to 46%, and other Internet threats had grown as well [1]. Cybercrime and threat actions have grown and have become a critical threat. This growth has increased the importance of network security. By analyzing packets captured from the network, an IDS helps to detect threats [2].
There are many threats, such as Denial of Service (DoS), which denies legitimate users access to resources on a network by introducing undesired traffic. Also, malware is malicious software that exploits a vulnerability in the machines of a computer network to gain some advantage [2]. IDSs were developed to counter these attacks.
The conventional IDS suffered from false detections, which are categorized as false positives or false negatives. These false detections are a burden on the network administrator. This burden made researchers try to develop an IDS with high detection accuracy and a low false detection rate [3]. Signature-based IDS identifies only known attacks, which makes it unable to detect unknown attacks. Anomaly-based IDS trains on a dataset of regular and abnormal traffic to identify an attack. Diverse machine learning models have been presented to perform the IDS functions, but they showed numerous imperfections, viz. low throughput and high false detection rates.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
Machine learning model training has issues that slow down the process, such as the size of the dataset and the optimization parameters to build the most fitting model. These difficulties made the researchers look for a more appropriate approach. A possible solution to these problems is the use of the Apache Spark tool, which is one of the fastest cluster computing frameworks, and it is an open-source distributed programming tool for clusters. Also, Spark executes the operations in memory [4].
The development in computational capabilities expedited Deep Learning (DL) for various applications in many areas such as image processing, natural language processing, computer vision, and the focus of this article, the IDS [5].
The article workflow follows these steps: 1) The NSL-KDD dataset is investigated rigorously; two features need treatment before the data enters model training. 2) The NSL-KDD dataset suffers from a class imbalance problem, and a hybrid solution is presented for it. 3) This article presents the use of Apache Spark in the IDS implementation process. 4) For the sake of comparison, this article implements traditional machine learning algorithms, which are the Decision Tree classifier, the K-Neighbor classifier, and the Support Vector Machine (SVM). 5) Three Apache Spark cluster configurations are used in the implementation to study the reduction in the computational delay. 6) The IDS is implemented on Apache Spark using three DL models, which are Multilayer Perceptron (MLP), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM). 7) This article presents different arrangements for each network type. 8) The selected model, which has the highest detection accuracy, outperforms many previously proposed schemes in terms of accuracy and time.
The main contributions of this article can be summarized as follows. 1) Propose a deep learning-based intrusion detection system with high accuracy compared with previously developed systems. 2) Use Spark cluster configurations to reduce the training process time while implementing the IDS with different hyperparameters. 3) Solve problems related to the selected dataset, NSL-KDD, such as class imbalance.
The paper is organized as follows. The next section provides a brief review of previously developed IDS algorithms. Section 3 gives the details of the system model for this article. Section 4 describes the experimental setup. The proposed model performance analyses are presented in Section 5. Finally, Section 6 concludes the paper.

II. RELATED WORK
Many papers proposed the use of Machine Learning (ML) in IDS, and there are two methods. The first is the traditional machine learning algorithms, and the second is Deep Learning algorithms. For the first method, the authors in [6] presented a K-nearest neighbor (KNN) classifier to build an IDS for a Wireless Sensor Network (WSN). The training dataset was not provided, but the primary function of that IDS was to prevent flooding attacks. The Random Forest (RF) classifier is used in [7] to implement an IDS. The authors used the NSL-KDD dataset to evaluate their model and presented the detection accuracy on the training dataset, but their evaluation did not contain any test data. A hybrid system was introduced in [8]. The model had two classifiers; the first is the Support Vector Machine (SVM), and the second is a Decision Tree. The hybrid system offered better detection accuracy.
The work in [9] summarizes the accuracy of the traditional IDS ML algorithms. They presented the intrusion detection accuracy of six algorithms. The reported accuracies were as follows: (1) 74.6% for J48, which is an open-source Java implementation of the C4.5 decision tree algorithm, (2) 74.40% for Naive Bayes, (3) 75.40% for the C5.0 decision tree algorithm, (4) 74.00% for Random Forest, and (5) 74.00% for SVM. Another IDS ML method is proposed in [10], which is called an ensemble because it is built from many weak learners. Weak learners are classifiers that have poor detection accuracy. The weak learners used in the paper were J48, C5.0, Naïve Bayes, and Rule-Based classifiers (PART). The accuracy reached by this algorithm is 78.14%.
The second method, which is DL, has shown that its accuracy effectively exceeds traditional approaches [11]. In [12], the authors use self-taught learning (STL) on the NSL-KDD dataset for anomaly detection, and the accuracy of the outcome was 79%. A Restricted Boltzmann Machine (RBM) model was proposed in [13]; this model performed feature selection using one hidden layer and utilized the KDD Cup '99 dataset. An artificial neural network (ANN) consisting of two hidden layers, each with one hundred neurons, is proposed in [14]. The IDS accuracy for this system was 78.51%.
The authors in [15] proposed a Deep Neural Network (DNN) with 100 hidden units. They used a GPU to enhance the performance and used the KDD Cup '99 dataset. The authors suggested that both Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models are better for improving detection accuracy.
The authors in [16] presented a convolutional neural network (CNN). The model has two convolutional layers, two pooling layers, and two fully connected MLP layers. The paper tested their model against KDD 99' dataset, and the accuracy was 99.84%.
A CNN is also implemented on the NSL-KDD dataset in [17]. The CNN needs one more step than other DL models. Since the primary purpose of a CNN is image recognition, it has a stage that converts the 41 features into matrices that can be treated as an image. The CNN reached a detection accuracy of 79.48%. On the other hand, an RNN is introduced in [9] to create an IDS. A regular processor was used, which increased the training time, so the authors suggested the use of a GPU and TensorFlow to reduce the training time. This model had a good accuracy of 81.29%.
A deep learning method, which is a self-taught learning-based IDS that combines a sparse autoencoder and an SVM, is presented in [18]. The primary function of the sparse autoencoder (SAE) is to reduce the features of the NSL-KDD dataset before it is classified. The SVM classifies the output of the SAE. The accuracy achieved with this method was 80.48%.
The authors in [19] presented a model based on an autoencoder followed by a Multilayer Perceptron (MLP). The model is structured as one input layer, five hidden layers, and one output layer; the input layer has 122 neurons, while the output layer has five neurons. The output layer uses the softmax function. The model accuracy is 79.74%.
A hybrid model introduced using RNN and LSTM is given in [20]. The dataset used for evaluation was the KDD 99' dataset. The authors did not mention the duration of the training time, but a long time is expected for this model.
Another approach presented for LSTM is the hierarchical LSTM [21]. The authors produced a three-layer network, which contains two LSTM layers and one fully connected layer. This network reached an accuracy of 83.5% for the NSL-KDD test dataset and 69.73% for KDDTest-21.
The main drawback of using DL is that the training needed to get the best model takes a long time [9]. The authors in [22] and [23] used Spark as a platform for machine learning. They showed that Spark reduces the execution delay of the training process for machine learning, but they did not use it for DL. The authors in [24] used Spark for MLP implementation and presented substantial work. They considered the implementation process confidential, so it was not given. They tested many datasets, including NSL-KDD. The detection accuracy for NSL-KDD reached 78.5%.
From the above, it can be concluded that Apache Spark can reduce the training time process, and the DL algorithms improved intrusion detection accuracy. RNN and LSTM algorithms perform better than other algorithms, and the NSL-KDD dataset is well known to test the new models. Therefore, we build our model on Apache Spark with the aim to solve dataset problems and to achieve better accuracy.

III. SYSTEM MODEL
This article proposes an IDS system based on deep learning algorithms for the attacks included in the NSL-KDD dataset. The training process will be implemented on Apache Spark. For short, we refer to this model as DLS-IDS (Deep Learning Spark Intrusion Detection System). The DLS-IDS solves the NSL-KDD dataset related problems, defines the best model arrangement and model elements to produce a high intrusion detection accuracy, as well as determines the best Apache Spark cluster configurations to reduce the implementation process time.
The DLS-IDS workflow consists of four main blocks: the first block is to choose and explore the dataset, the second block is dataset preprocessing, the third block is the class imbalance solution, and the last block is model training using Apache Spark, as shown in Figure 1. The model training is performed on three different networks: MLP, RNN, and LSTM.
The last block in Figure 1 is the Apache Spark cluster. Figure 2 illustrates the architecture and the workflow within the Spark cluster that is used in the DLS-IDS.
The Spark architecture contains three main parts: the driver, the cluster manager, and the workers. The driver includes the Spark context, the cluster manager distributes the workload between the worker nodes, and the worker nodes perform the tasks, as shown in Figure 2. The workflow of the Spark cluster components goes as follows. The user sends the code with the data and the number of workers. The Spark context receives the task from the user. It uses the cluster manager to distribute the workload between the workers, then sends the data, the model arrangement, and the initial parameters to each worker. It also sends the number of Resilient Distributed Datasets (RDD). Each worker performs the feedforward pass, then computes the gradients to update the parameters. After completing the training process, each worker generates a partial model, which the Spark driver receives. The Spark driver averages the parameters of all partial models from the workers to obtain the deep model.
The next subsections illustrate the DLS-IDS model workflow in detail.
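The final averaging step on the driver can be sketched in a few lines of plain Python. This is a minimal illustration only, not the Spark or Elephas implementation: the function name `average_partial_models`, the layer names, and the flat-list weight layout are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def average_partial_models(partials: List[Weights]) -> Weights:
    """Element-wise mean of the parameters returned by each worker."""
    if not partials:
        raise ValueError("at least one partial model is required")
    averaged: Weights = {}
    for name in partials[0]:
        # Group the same parameter position across all workers.
        columns = zip(*(p[name] for p in partials))
        averaged[name] = [sum(col) / len(partials) for col in columns]
    return averaged


# Two workers, each returning a tiny made-up partial model.
worker_a = {"dense_1": [0.2, 0.4], "dense_2": [1.0]}
worker_b = {"dense_1": [0.4, 0.0], "dense_2": [3.0]}
deep_model = average_partial_models([worker_a, worker_b])
```

Each parameter of `deep_model` is the mean of the corresponding parameters across workers, so a worker with an unusual partition of the data influences the result only in proportion to one share.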

A. DATASET EXPLORATION
The first block in the DLS-IDS, shown in Figure 1, is the training dataset. This block has two steps: choosing the dataset and exploring it.
As stated before, there are two well-known datasets: KDD Cup '99 and NSL-KDD. KDD Cup '99 has a tremendous amount of redundant data points. The NSL-KDD dataset was built by removing these repeated points from the KDD Cup '99 dataset [25], [26]. Therefore, we use the NSL-KDD dataset for training and testing.
The second step in the first block is data exploration of the chosen training dataset. The NSL-KDD dataset includes four files, two of them for training the model, and the other two for testing the model. These four files are the primary training dataset KDDTrain+, the smaller training dataset KDDTrain+_20%, the primary testing dataset KDDTest+, and KDDTest-21, which is a smaller testing dataset [27].
The attacks that exist in the NSL-KDD dataset belong to one of the following four types:
• Denial of Service Attack (DoS), which is an attack that targets service availability by consuming computing and memory resources.
• User to Root Attack (U2R), which is an attack that begins with access as a legitimate user on the network and then endeavors to exploit a vulnerability to obtain root access.
• Remote to Local Attack (R2L), which is an attack in which a user signs in as a remote user, then tries to detect the system vulnerabilities and exploit the privileges as if it were a local user.
• Probe Attack (Probe), which is an attempt to collect data about computer networks to use these data in later attacks.
In the NSL-KDD dataset, each data point is composed of 41 features and a label that is either normal or an attack [27]. Table 1 shows the NSL-KDD dataset features.
The dataset exploration process resulted in three discoveries. The first is that feature number 15, "su_attempted", which indicates an attempt to log in as a superuser, has three values (0,1,2). The value 2 is not possible because this feature is binary: the user either tried to log in as a superuser or did not. The second is that feature number 20, "num_outbound_cmds", which is the number of outbound commands in a File Transfer Protocol (FTP) session, has only one value, which is 0. Figure 3 shows the third discovery, the class imbalance problem: the U2R attack has only 52 elements and the R2L attack has 995, while the normal class has 67343.

B. DATA PREPROCESSING
The second block in Figure 1 is the preprocessing, which contains two steps; the first is feature preparing, and the second is the feature scaling.

1) FEATURE PREPARING
The first step is to restrict the values of "su_attempted" to (0,1) by converting the value 2 to 0. The "num_outbound_cmds" feature is dropped because it has a single value and therefore does not affect the model. There are 37 numeric features and three nominal features in the NSL-KDD dataset. The input of any model must be numeric, so the nominal features, namely "protocol_type", "service", and "flag", must be converted into a numeric form. The feature "protocol_type" has three attribute values: TCP, UDP, and ICMP. These are converted to the binary vectors [1 0 0], [0 1 0], and [0 0 1]. Furthermore, the service feature has 67 attribute values, and the flag feature has 11. So, the 40 features become 118 features after conversion (37 + 3 + 67 + 11 = 118), which is fewer than the 122 features used by the model in [9].
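The three feature-preparing steps can be illustrated on a single record. The sketch below is plain Python under stated assumptions: the helper names `one_hot` and `prepare_record`, the generated column names, and the sample record are hypothetical, and only "protocol_type" is encoded to keep the example short.

```python
from typing import Dict, List

# Category order follows the binary vectors given in the text.
PROTOCOLS = ["tcp", "udp", "icmp"]


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Return a binary indicator vector with a single 1 at the category index."""
    return [1 if value == c else 0 for c in categories]


def prepare_record(record: Dict[str, object]) -> Dict[str, object]:
    """Apply the three feature-preparing steps to one data point."""
    out = dict(record)
    # Step 1: su_attempted is treated as binary, so the value 2 becomes 0.
    if out["su_attempted"] == 2:
        out["su_attempted"] = 0
    # Step 2: num_outbound_cmds is always 0 and carries no information.
    out.pop("num_outbound_cmds", None)
    # Step 3: replace the nominal protocol_type with binary indicator columns.
    protocol = str(out.pop("protocol_type")).lower()
    for name, bit in zip(PROTOCOLS, one_hot(protocol, PROTOCOLS)):
        out[f"protocol_type_{name}"] = bit
    return out


# Made-up record with only the fields this sketch touches.
sample = {"duration": 0, "protocol_type": "udp", "su_attempted": 2, "num_outbound_cmds": 0}
prepared = prepare_record(sample)

# Feature count after conversion: 37 numeric + 3 protocol + 67 service + 11 flag.
total_features = 37 + 3 + 67 + 11
```

The "service" and "flag" features would be handled the same way as "protocol_type", with 67 and 11 indicator columns respectively.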

2) FEATURES SCALING
If some features have much larger values and variance than others, the model will be biased toward these features, so they must be scaled. Three features have a notable deviation in their values, which are duration, source bytes (src_bytes), and destination bytes (dst_bytes). Min-Max normalization is used to scale all features. Min-Max normalization is given by equation (1).
$$x'_{i,j} = \frac{x_{i,j} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \tag{1}$$
where $x_{i,j}$ is the value of feature $j$ for sample $i$, $x'_{i,j}$ is the scaled value, Min is the lowest value of the feature over all samples, and Max is the highest value of the feature over all samples.
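As a small worked example of Min-Max normalization, the sketch below scales one feature column in plain Python. The function name `min_max_scale`, the sample values, and the handling of a constant column are assumptions of this illustration.

```python
from typing import List


def min_max_scale(column: List[float]) -> List[float]:
    """Scale one feature column to [0, 1] using its own minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: assumption of this sketch, map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


src_bytes = [0.0, 491.0, 146.0, 982.0]  # made-up values for illustration
scaled = min_max_scale(src_bytes)
```

After scaling, the smallest value in the column maps to 0 and the largest to 1, so a wide-ranging feature such as src_bytes no longer dominates features with small ranges.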

C. CLASS IMBALANCE
The third block, in Figure 1, is the proposed class imbalance solution for the NSL-KDD dataset. The NSL-KDD dataset suffers from class imbalance distributions. Some researchers use oversampling, which is duplicating the minority class points, but this method has the disadvantage of overfitting on these points.
Others use undersampling, which removes some points from the majority class. The problem with this methodology is that some removed points may be critical to represent the class. There is a hybrid solution that duplicates minority class points and removes some majority class points. This method enhances the model but inherits the problems of the two procedures.
A newer technique, called Synthetic Minority Over-Sampling Technique (SMOTE), was introduced in [28]. This technique is a combination of oversampling and undersampling. However, the oversampling is done by creating new points of the minority class rather than duplicating existing ones, which reduces the effect of overfitting, as shown in Figure 4.
The NSL-KDD data points of all categories (normal and attacks) are equal after applying the SMOTE, as shown in Figure 5.
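The idea of creating new minority points instead of duplicating them can be sketched as interpolation between a minority point and one of its nearest minority neighbours. This is a simplified illustration in plain Python, not the implementation used in this work or in any library; the function name `smote_like`, the parameters `k` and `seed`, and the sample points are hypothetical.

```python
import math
import random
from typing import List

Point = List[float]


def smote_like(minority: List[Point], n_new: int, k: int = 2, seed: int = 0) -> List[Point]:
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


u2r = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # made-up minority points
new_points = smote_like(u2r, n_new=5)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies while not being exact copies.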

D. SPARK MODEL
The last block in Figure 1 is the use of a Spark cluster to train the Deep Learning (DL) models.
The model training process has three main steps. The first is to input the training data points to the feedforward network to produce a predicted output. The second is to compare the predicted output against the actual output to compute the loss. The third is to use the loss to optimize the weights of the feedforward network. The process is then repeated.

1) FEEDFORWARD NETWORKS
DL has many network models. It has been reported that the CNN model has low detection accuracy [17], so in this work, we test three different architectures: MLP, RNN, and LSTM. These architectures are described in the following subsections.

a: MULTILAYER PERCEPTRON
MLP is a class of feedforward artificial neural networks. An MLP is constructed from an input stage, an output stage, and a hidden stage [29]. The hidden stage contains one or more layers. Each node in the hidden stage is a neuron that sums all inputs with weights and then applies a nonlinear activation function, as in equation (2).
$$a_k = \mathrm{Relu}(W_{xh}\, a_{k-1} + b_k) \tag{2}$$
where $a_k$ is the activation output of layer $k$, $a_{k-1}$ is the output of the previous layer (the input vector for the first hidden layer), $W_{xh}$ is the weight between the input and the hidden layer, $b_k$ is the bias value of the current layer, and Relu is the activation function of the neuron. Relu has been widely used in DL training because it reduces the vanishing gradient problem. Moreover, Relu is faster than other activation functions [24]. Relu can be evaluated using equation (3).
$$\mathrm{Relu}(x) = \max(0, x) \tag{3}$$
The output stage is used for all three feedforward techniques in the DLS-IDS model. Equation (4) explains the output stage operation.
$$y_k = f(W_{hy}\, a_{k-1} + b_k) \tag{4}$$
where $y_k$ is the predicted output, $W_{hy}$ is the weight between the hidden layer and the output layer, $a_{k-1}$ is the output of the previous layer, and $f$ is the activation function, which could be sigmoid or Softmax. The activation function for binary classification is sigmoid, and for multiclass classification it is Softmax. Both functions are given by equations (5) and (6).
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{5}$$
$$\mathrm{softmax}(x_{ij}) = \frac{e^{x_{ij}}}{\sum_{m=1}^{M} e^{x_{im}}} \tag{6}$$
where $i$ represents the sample, $j$ represents the class, and $M$ is the number of classes.
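The forward pass through one hidden Relu layer and a Softmax output layer can be sketched as follows. This is an illustrative plain-Python version with toy sizes; the helper names (`affine`, `mlp_forward`) and all weight values are made up for the example.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(z: Vector) -> Vector:
    return [max(0.0, v) for v in z]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def softmax(z: Vector) -> Vector:
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x: Vector, W_xh: Matrix, b_h: Vector, W_hy: Matrix, b_y: Vector) -> Vector:
    """One hidden Relu layer followed by a Softmax output layer."""
    a = relu(affine(W_xh, x, b_h))
    return softmax(affine(W_hy, a, b_y))


# Toy sizes (3 inputs, 2 hidden units, 2 classes); the basic DLS-IDS arrangement is 118-80-5.
x = [1.0, -2.0, 0.5]
W_xh = [[0.5, 0.1, -0.3], [-0.2, 0.4, 0.8]]
b_h = [0.0, 0.1]
W_hy = [[1.0, -1.0], [-1.0, 1.0]]
b_y = [0.0, 0.0]
probs = mlp_forward(x, W_xh, b_h, W_hy, b_y)
```

The entries of `probs` are non-negative and sum to one, so they can be read as class probabilities; for binary classification the `sigmoid` helper would replace the Softmax output.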

b: RECURRENT NEURAL NETWORK (RNN)
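It is called recurrent because each node depends on the previous computation. RNN treats the input as a time series [30].
The hidden state and the output can be evaluated by equations (7) and (8), written here in the standard form.
$$h_t = g(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h) \tag{7}$$
$$y_t = f(W_{hy}\, h_t + b_y) \tag{8}$$
where $W_{hh}$ are the weights from the hidden layer of the previous computation to the present hidden layer, $h_{t-1}$ is the output of the last calculation, $x_t$ is the input at time $t$, $y_t$ is the output at time $t$, $g$ is the hidden activation function, $f$ is the output activation function, and $b_h$ and $b_y$ are bias vectors. Figure 6 shows the RNN architecture.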
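The recurrence can be made concrete with a short plain-Python sketch. The choice of tanh as the hidden activation, the zero initial state, and all weight values are assumptions of this illustration; the function names `rnn_step` and `rnn_forward` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def rnn_step(x_t: Vector, h_prev: Vector, W_xh: Matrix, W_hh: Matrix, b_h: Vector) -> Vector:
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h); tanh is an assumed activation."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(i + s + b) for i, s, b in zip(from_input, from_state, b_h)]


def rnn_forward(inputs: List[Vector], W_xh: Matrix, W_hh: Matrix, b_h: Vector) -> List[Vector]:
    """Run the recurrence over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


W_xh = [[0.5, -0.5], [0.25, 0.75]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0]]
states = rnn_forward(sequence, W_xh, W_hh, b_h)
```

The second hidden state depends on the first through `W_hh`, which is the dependence on the previous computation that gives the network its name.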

c: LONG SHORT TERM MEMORY
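The RNN model suffers from the vanishing gradient problem, which led to the creation of LSTM. The LSTM architecture is composed of a cell (the memory part of the LSTM unit) and three gates. The three gates are the input gate, the output gate, and the forget gate [31]. The forget gate discards inessential details, while the input gate modifies the memory according to the input. Finally, the output gate determines the output based on the input and the memory. In the standard formulation, the LSTM unit is described by equations (9) to (14).
$$f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f) \tag{9}$$
$$i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i) \tag{10}$$
$$\tilde{c}_t = \tanh(W_c\,[h_{t-1}, x_t] + b_c) \tag{11}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{12}$$
$$o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o) \tag{13}$$
$$h_t = o_t \odot \tanh(c_t) \tag{14}$$
where $x_t$ is the input vector to the LSTM unit, $f_t$ is the forget gate's activation vector, $i_t$ is the input gate's activation vector, $\tilde{c}_t$ is the cell input activation vector, $c_t$ is the cell state vector, $o_t$ is the output gate's activation vector, $h_t$ is the hidden state vector, also known as the output vector of the LSTM unit, $W$ are the weight matrices, $b$ are the bias vectors, $\sigma$ is the sigmoid function, and '$\odot$' denotes the Hadamard product.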
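A single LSTM step in the standard formulation can be sketched in plain Python as below. The concatenated-input form, the dictionary layout of the weights, and the function names `gate` and `lstm_step` are assumptions of this illustration; the zero weights are chosen only so that the result is easy to verify by hand.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gate(W: Matrix, b: Vector, z: Vector, act) -> Vector:
    """Apply act(W z + b) element-wise, where z = [h_{t-1}, x_t]."""
    return [act(sum(w * zi for w, zi in zip(row, z)) + bi) for row, bi in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              W: Dict[str, Matrix], b: Dict[str, Vector]) -> Tuple[Vector, Vector]:
    """One LSTM step; W and b are keyed by gate name: 'f', 'i', 'c', 'o'."""
    z = h_prev + x_t                               # concatenation [h_{t-1}, x_t]
    f_t = gate(W["f"], b["f"], z, sigmoid)         # forget gate
    i_t = gate(W["i"], b["i"], z, sigmoid)         # input gate
    c_hat = gate(W["c"], b["c"], z, math.tanh)     # cell input activation
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    o_t = gate(W["o"], b["o"], z, sigmoid)         # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# One hidden unit and one input feature; all weights are zero so the result is easy to check.
zeros = [[0.0, 0.0]]
W = {name: zeros for name in "fico"}
b = {"f": [0.0], "i": [0.0], "c": [0.0], "o": [0.0]}
h_t, c_t = lstm_step([1.0], [0.0], [2.0], W, b)
```

With zero weights every gate outputs 0.5 and the cell input is 0, so the new cell state is half of the previous one; this shows how the forget gate scales the memory independently of the current input.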
The following subsection will discuss the overfitting problem.

d: OVERFITTING SOLUTION
A deep neural network tends to overfit the decision boundary on the training dataset. In this work, we use two methods to reduce the effect of overfitting; the first is the SMOTE technique, which is described earlier in the class imbalance section, and the second is network regularization.
One of the most effective techniques for neural network regularization is the dropout layer [32]. The dropout layer generates a mask $s$ that samples the output of the previous layer. The sampling follows a Bernoulli distribution with probability $p$:
$$s_k \sim \mathrm{Bernoulli}(p) \tag{15}$$
where $s_k$ is the dropout mask in layer $k$. This mask is applied to the activation output:
$$\tilde{a}_k = s_k \odot a_k \tag{16}$$
where '$\odot$' denotes the Hadamard product.
All the mentioned network elements make up the feedforward propagation used in the DLS-IDS model, and the following sections describe the backpropagation in the DLS-IDS model.
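The masking operation can be sketched in plain Python as follows. Two points are assumptions of this sketch rather than statements from the text: $p$ is interpreted as the probability of keeping a unit, and the surviving activations are not rescaled. The function name `dropout` and the sample activations are hypothetical.

```python
import random
from typing import List


def dropout(activations: List[float], p: float, rng: random.Random) -> List[float]:
    """Multiply each activation by a Bernoulli(p) mask entry (Hadamard product).

    Assumption of this sketch: p is the probability of keeping a unit, and no
    rescaling of the surviving activations is applied.
    """
    mask = [1.0 if rng.random() < p else 0.0 for _ in activations]
    return [s * a for s, a in zip(mask, activations)]


rng = random.Random(42)
a_k = [0.7, 1.3, 0.2, 2.1]  # made-up activations of one layer
dropped = dropout(a_k, p=0.5, rng=rng)
```

Each entry of `dropped` is either the original activation or zero, so a different random subset of neurons contributes to each training pass.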

2) LOSS COMPUTATION
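After the feedforward pass, the predicted output is used to calculate the loss. Then an update algorithm is applied to the weights to decrease the loss and eventually increase the accuracy. Categorical cross-entropy and binary cross-entropy are used in the DLS-IDS model to evaluate the loss in the case of multiclass and binary classification, respectively [33]. The categorical cross-entropy is given by
$$L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(\hat{y}_{ij}) \tag{17}$$
where $L(y, \hat{y})$ is the loss function, $y_{ij}$ and $\hat{y}_{ij}$ represent the actual and predicted output of sample $i$ for class $j$, respectively, $N$ is the number of samples, and $M$ is the number of classes. The binary cross-entropy is given by
$$L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right] \tag{18}$$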
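Both losses can be evaluated directly from labels and predicted probabilities. The following plain-Python sketch averages the loss over the samples and clips probabilities with a small `eps` to keep the logarithm finite; both choices, like the function names and sample values, are assumptions of this illustration.

```python
import math
from typing import List


def categorical_cross_entropy(y_true: List[List[float]], y_pred: List[List[float]],
                              eps: float = 1e-12) -> float:
    """Mean over samples of -sum_j y_ij * log(y_hat_ij)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(truth, pred))
    return total / len(y_true)


def binary_cross_entropy(y_true: List[float], y_pred: List[float], eps: float = 1e-12) -> float:
    """Mean over samples of -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(y_true)


# One-hot labels for two samples over three classes, with made-up predictions.
y = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
y_hat = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
cce = categorical_cross_entropy(y, y_hat)
bce = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
```

In both toy cases the correct class receives probability 0.5 for every sample, so each loss equals log 2; the loss falls toward zero as the probability assigned to the correct class approaches one.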

3) ADAM OPTIMIZERS
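The next step after calculating the loss is using an optimizer to update the weights. The optimizer used in DLS-IDS is Adaptive moment estimation (Adam), which is an adaptive learning rate method [34]. Adam is a combination of the RMSprop and Momentum algorithms. Adam stores $m_t$ and $v_t$, which are exponential moving averages of the past gradients and the past squared gradients, that is, estimates of the first and the second moment of the gradient, respectively. The Adam algorithm has six computations, which are as follows:
$$g_t = \nabla_W J(W_{t-1}) \tag{19}$$
where $g_t$ is the gradient, $J$ is the loss function, and $\nabla$ is the gradient operator.
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \tag{20}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \tag{21}$$
where $\beta_1$, $\beta_2$ are decay terms for the first and second moments. The next step is to calculate the bias-corrected first and second moment estimates.
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{22}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{23}$$
where $\hat{v}_t$, $\hat{m}_t$ are the bias-corrected estimates. The last step is to update the weights, which is given by equation (24).
$$W_t = W_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{24}$$
where $\eta$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero.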
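The Adam computations can be traced on a toy problem. The sketch below is plain Python; the function name `adam_step`, the default hyperparameter values (commonly used defaults), and the quadratic loss are assumptions of this illustration and are not taken from the DLS-IDS configuration.

```python
import math
from typing import List, Tuple


def adam_step(w: List[float], grad: List[float], m: List[float], v: List[float], t: int,
              lr: float = 0.001, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """Apply one Adam update at step t (t starts at 1) and return (w, m, v)."""
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g_i          # first-moment average
        v_i = beta2 * v_i + (1.0 - beta2) * g_i * g_i    # second-moment average
        m_hat = m_i / (1.0 - beta1 ** t)                 # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


# First step from w = 1.0 on J(w) = w^2, whose gradient is 2w.
first_w, _, _ = adam_step([1.0], [2.0], [0.0], [0.0], t=1, lr=0.05)

# Repeating the update moves w toward the minimum at 0.
w, m, v = [1.0], [0.0], [0.0]
for t in range(1, 201):
    grad = [2.0 * w[0]]
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

On the first step the bias-corrected ratio is close to one, so the weight moves by approximately the learning rate regardless of the gradient magnitude, which illustrates the adaptive step size.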

IV. EXPERIMENTAL DESIGN
The training of the models was implemented on Google Cloud Dataproc. Dataproc has Spark version 2.4.4 over a Hadoop version 2.9. The training dataset will be divided into the training dataset and the validation dataset. The validation dataset is essential to make sure that the model will perform well on the test dataset.

A. SPARK CLUSTER CONFIGURATION
This article presents three different Spark cluster configurations. The use of these configurations illustrates the impact of Spark on reducing the training process time in the DLS-IDS model. The main advantage of using Spark is that the Spark cluster can be built from commodity hardware. Although this article uses Google Cloud Dataproc, which gives the ability to choose powerful machines, commodity hardware configurations were chosen. The first configuration contains one master node with two workers, and each node has two processors with 7.5 GB memory. The second configuration consists of one master node and two worker nodes, where each node has four processors with 15 GB memory. The last configuration is one master and four workers, where each node has two processors with 7.5 GB memory. All nodes are within the same rack.
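The worker resources of the three configurations can be tallied with a short plain-Python sketch. Counting only worker nodes and excluding the master is an assumption of this illustration, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    name: str
    workers: int
    cores_per_node: int
    memory_gb_per_node: float

    @property
    def worker_cores(self) -> int:
        return self.workers * self.cores_per_node

    @property
    def worker_memory_gb(self) -> float:
        return self.workers * self.memory_gb_per_node


CONFIGS = [
    ClusterConfig("first", workers=2, cores_per_node=2, memory_gb_per_node=7.5),
    ClusterConfig("second", workers=2, cores_per_node=4, memory_gb_per_node=15.0),
    ClusterConfig("third", workers=4, cores_per_node=2, memory_gb_per_node=7.5),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: {cfg.worker_cores} worker cores, {cfg.worker_memory_gb} GB worker memory")
```

Under this tally the second and third configurations offer the same total worker cores and memory, but the third spreads them over twice as many machines.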

B. MODEL ARCHITECTURE SETTINGS
Nine different model settings are trained on each Spark cluster configuration. The hyperparameters that are adjusted are the number of hidden layers and the network type. As mentioned earlier in the feature preparing section, the 41 features have been converted to 118 features. So, the basic model arrangement has 118 nodes for the input layer, 80 for the hidden layer, and 5 for the output layer. This experiment implements the three neural network types mentioned earlier, i.e., MLP, RNN, and LSTM, each with three different arrangements: one hidden layer, two hidden layers, and three hidden layers. TABLE 3 shows the different model arrangements that are used for the three network types.
The confusion-matrix counts, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are used in the model performance evaluation metrics, which are defined and calculated below [24], [35].
1) Accuracy: It is the ratio of the correctly classified packets (normal or attacks) to the total dataset. It can be calculated as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
2) Precision: It is the ratio of correctly classified attacks to the total number of identified attacks. It can be calculated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
3) Recall: It is the ratio of accurately classified attacks to the total number of attacks in the test dataset. It can be calculated as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
4) F1-Score: It is the harmonic mean of the precision and the recall. It can be calculated as:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
This section presents and discusses the results of the experiments. It is divided into two parts: the results of the DL algorithms on Apache Spark and a rigorous analysis of the selected model.

A. DEEP LEARNING ALGORITHM ON APACHE SPARK
The delay cost for the three configurations mentioned earlier is investigated, along with the accuracy of the model settings. Each configuration trains nine different models for one hundred epochs. Finally, some failed implementation scenarios are presented. The authors in [24] used a five-layer MLP model, and the output accuracy was less than 78.6%.

1) MODEL SETTINGS ACCURACY
The RNN accuracy is 81.88%, 81.371%, and 80.897%. The authors in [9] used a two-layer RNN model and found that the best accuracy was 81.29%. The RNN determines the output using the previous state and the current input, which is why it performs better than MLP: there is a relation between the attacks and different fields. For example, the ping of death attack, which lies in the DoS category, has a protocol type of ICMP and a lengthy payload. This reason drives the use of LSTM for intrusion detection. The LSTM accuracy is 82.440%, 83.57%, and 81.535%. LSTM has a better performance than MLP and RNN. Figure 8 illustrates the enhancement due to the use of the SMOTE technique. A test has been made for LSTM with two hidden layers without applying SMOTE to measure its effect, and the accuracy result is 82.24%.
From the above, it can be concluded that the LSTM with two hidden layers is the best model.

2) DELAY COST FOR EACH CLUSTER CONFIGURATION
Each cluster treats the memory in all workers as memory containers. The second and third configurations have the same number of containers. In the second configuration, the containers are on two workers, while in the third configuration they are spread over four workers, which adds communication overhead; this explains the difference in training time between the second and the third configuration. The nine models are trained sequentially in the following order: RNN, LSTM, then MLP. After the training of RNN and LSTM has finished, the data is converted from the RDD form to the PySpark (Python for Apache Spark) data frame form to be suitable as the MLP input.
Since the first RNN arrangement is the first model to train, it takes more time than expected. This delay is caused by the workers' initialization and by setting up the memory containers on each worker.
The authors in [9] trained their model in 11444 seconds, while the delay cost for the DLS-IDS model to train nine models is only 1758.21 seconds, roughly 6.5 times less. Another platform that may be considered for the training process is Hadoop. However, Spark is reported to be almost 100 times faster than Hadoop. This advantage arises because Hadoop executes the operations in storage, while Spark executes them in memory, as stated by Apache. This considerable difference demonstrates that the use of Spark is better than conventional training techniques. For these reasons, Spark is the most suitable platform for the training process in the DLS-IDS model.
Spark can train on data in three forms: RDD, dataset, or data frame. Spark has no built-in libraries for DL. Developers have made a library called Elephas, which enables the use of Spark in DL. The library supports training the MLP on the PySpark data frame, while RNN and LSTM failed to train on the PySpark data frame since they require input in a three-dimensional structure. RNN and LSTM use the RDD form as input to satisfy the input structure requirement.
One of the main features of the DLS-IDS is the use of Spark to speed up the training process. The implementation code runs all model arrangements in sequence to present all the results at once. Since Spark performs its operations in memory, the first configuration failed to train all models in one run. This failure happened due to the lack of memory, which was only 7.5 GB, while the other configurations were able to train all the arrangements.
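The three-dimensional input requirement can be illustrated with a small plain-Python sketch that wraps each feature row as a sequence. Treating every record as a sequence of length one is purely an assumption of this illustration (the time-step layout is not specified here), and the helper names `to_sequence_input` and `shape` are hypothetical.

```python
from typing import List

Row = List[float]


def to_sequence_input(rows: List[Row]) -> List[List[Row]]:
    """Wrap each feature row as a length-1 sequence: (samples, features) -> (samples, 1, features)."""
    return [[list(row)] for row in rows]


def shape(nested) -> tuple:
    """Report the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


flat = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]   # two samples, three features
sequences = to_sequence_input(flat)
```

Here `shape(flat)` is two-dimensional (samples, features), while `shape(sequences)` has the extra middle axis that recurrent layers expect for time steps.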

B. RIGOROUS ANALYSIS FOR THE SELECTED MODEL
We select the LSTM model with two hidden layers because it has the highest accuracy among all models, and the analysis covers the binary test and the multiclass test. In the binary test, the model determines only whether the packet is an attack or normal. In the multiclass test, the model determines which class the attack belongs to. Then, this model is applied to the KDDTest-21 dataset. Finally, a comparison between the resulting model of the DLS-IDS approach and the attack detection accuracy of previously presented IDSs is presented. Figure 10 illustrates the accuracy of binary classification on the train and test dataset for one hundred epochs. The training accuracy reached 99.61%, and the test accuracy reached 85.44%.

1) BINARY TEST ANALYSIS
The model output has been evaluated against the KDDTest+ dataset. The output of the confusion matrix is TP = 9846, TN = 9417, FN = 2987, and FP = 294. The statistical evaluation of the model, determined by the metric equations, is shown in TABLE 6. Figure 12 illustrates the accuracy of binary classification on the train and test dataset for one hundred epochs.
The overall performance has increased with SMOTE. However, the detection accuracy of the major classes, which are normal and DoS, is reduced because SMOTE has added new points to the minority classes, which affected the model bias toward the majority class. This reduction explains the increase in the false positive rate (FPR). The difference in G-Mean after using SMOTE shows the reduction of the overfitting of the model. A comparison of the multiclass accuracy is presented in Figure 13. The comparison illustrates the decrease in the accuracy in the dominant classes and the enhancement in the minor classes.
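As a check on the binary confusion-matrix counts reported above for KDDTest+, the evaluation metrics can be recomputed with a few lines of plain Python. The function name `binary_metrics` is hypothetical; the formulas are the standard definitions of accuracy, precision, recall, and F1-score.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts reported for the binary test on KDDTest+.
metrics = binary_metrics(tp=9846, tn=9417, fp=294, fn=2987)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The four counts sum to 22544 samples, and the resulting accuracy of about 85.4% agrees with the reported binary test accuracy. The high precision together with a lower recall reflects the small number of false positives relative to false negatives.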

2) MULTICLASS TEST ANALYSIS
The model has been tested on the KDDTest-21 dataset; Figure 3 shows the dataset information, and the same analysis is used. The confusion matrix is built for both the class-imbalanced case and the SMOTE-applied case, as shown in TABLE 10 and TABLE 11.
The confusion matrix is used to generate TP, TN, FN, and FP. The equations will determine the statistical evaluation of the model shown in Table 12. Figure 14 presents a chart of the accuracy of the multiclass. The graph illustrates the decrease in the accuracy in the dominant classes and the enhancement in the minor classes.

3) COMPARISON BETWEEN DLS-IDS AND PREVIOUSLY PRESENTED IDS
A comparison between the state-of-the-art intrusion detection algorithms and the DLS-IDS model is given below. Figure 15 shows the accuracies of the traditional machine learning algorithms on KDDTest+ and KDDTest-21, while Figure 16 shows the accuracies of the deep learning algorithms on the same datasets. In Figure 16, the DNN and Deep-MLP models were not tested by their authors against KDDTest-21. Figure 15 and Figure 16 show that the DLS-IDS model of this article enhances the overall attack detection accuracy.

VI. CONCLUSION
This article presented a new intrusion detection system based on deep learning. This system is called Deep Learning Spark Intrusion Detection System, or DLS-IDS for short. The DLS-IDS model has four main building blocks, and we use the NSL-KDD dataset for training and testing purposes. The NSL-KDD dataset has a class imbalance problem. Therefore, the four system blocks are choosing and exploring the dataset, preprocessing, the class imbalance solution, and training over Apache Spark. The DLS-IDS showed that the use of Spark is better than a regular implementation for DL. The Spark cluster enables model training with different hyperparameters, such as the type of model elements and the number of hidden layers. Since Spark uses memory to execute its operations, memory size must be taken into consideration in the design process of new models to avoid a system halt. When the Spark cluster contains many workers, there is a communication overhead delay, but this delay is less than the overall computation delay. When dealing with a dataset that contains class imbalance, it is better to use the Synthetic Minority Over-Sampling Technique (SMOTE) as a preprocessing step to enhance the detection accuracy of the model and reduce the overfitting effect of DL. The DLS-IDS found that the use of LSTM with SMOTE improves the detection accuracy to reach 83.57%. In future work, we will consider the use of more datasets to cover more types of attacks and hence train the model on these new attacks. Also, the use of the Kafka Hadoop tool to test the proposed model in a real-time configuration will be considered in the future.