Efficient BiSRU Combined With Feature Dimensionality Reduction for Abnormal Traffic Detection

Abnormal traffic detection is an important network security technology to protect computer systems from malicious attacks. Existing detection methods are usually based on traditional machine learning, such as Support Vector Machine (SVM), Naive Bayes, etc. They rely heavily on manual design of traffic features and perform only shallow feature learning, which yields low accuracy on high-dimensional traffic. Although methods based on Long Short-Term Memory (LSTM) have an excellent ability to detect abnormal traffic, the sequence-dependent structure of LSTM cannot realize parallel computation, which leads to slow model training and limits its applicability. To address the above problem, we propose an efficient Bidirectional Simple Recurrent Unit (BiSRU) combined with feature dimensionality reduction for abnormal traffic detection. Specifically, in order to perform feature dimensionality reduction on the original high-dimensional network traffic, we design a stack Sparse Autoencoder (sSAE) to extract compressed high-level features. For the purpose of realizing efficient parallel computation and accurate feature extraction, a BiSRU is utilized to extract the bidirectional structural features of the traffic. Finally, the experimental results show that our proposed method significantly outperforms existing methods in terms of accuracy and training time. The proposed method can timely and accurately detect various kinds of abnormal traffic and achieve effective network security protection.

Abnormal traffic detection runs in a network intrusion detection system. It is divided into three parts: information collection, information analysis and result processing. The information collection part covers the collection of network data and of user status and behavior. The information analysis part generates an alert and sends it to the console when it detects a misuse pattern. The result processing part takes the appropriate action according to the predefined response triggered by the alarm. Detecting abnormal traffic faces two difficulties. First, not every piece of feature information is directly related to anomalies. Second, in terms of feature dimensions, one-dimensional features may be unable to characterize the traffic fully, while multidimensional features impose a massive computational cost.
There have been many studies on abnormal traffic detection, mainly including traditional machine learning methods and deep learning methods [9]. The traditional machine learning methods perform only shallow feature learning and have clear limitations when processing complex data. The feature processing that traditional machine learning requires is time-consuming and demands specialized knowledge, and the performance of most machine learning algorithms depends on the accuracy of the extracted features. Deep learning methods reduce the manual design effort of feature extractors by automatically retrieving advanced features directly from the original data, and they provide excellent feature representation learning capabilities [10]. They are self-learning and meet the needs of accurate real-time detection. Besides, the large amount of network data makes deep learning methods more effective than traditional machine learning methods.
To address the above problems and challenges, we apply the excellent feature learning capabilities of deep learning to achieve highly accurate network abnormal traffic detection. The main contributions of this article can be summarized as follows:
• First, to extract compressed high-level features from the raw high-dimensional data, we propose a stack Sparse Autoencoder (sSAE) to perform feature dimensionality reduction on the input traffic data. It maps the data in the original high-dimensional space to a low-dimensional space and extracts comprehensive features. The sSAE reduces the computation amount and running time of the model.
• Second, to achieve parallel computing and accurate feature learning, we propose a BiSRU based on Simple Recurrent Unit (SRU) and apply it into abnormal traffic detection. It can extract the bidirectional structural features of the traffic and realize accurate detection.
• Third, we compare the proposed method with traditional machine learning methods and other deep learning methods. Experimental results show that our proposed method outperforms state-of-the-art methods in terms of accuracy and training time.
The rest of this article is organized as follows: We discuss related work in Section II and introduce the basic theory in Section III. In Section IV, we introduce our proposed detection model, and in Section V we present the experimental comparative analysis. Finally, we conclude in Section VI.

II. RELATED WORK
This section provides an account of previous studies on abnormal traffic detection using traditional machine learning and deep learning.

A. TRADITIONAL MACHINE LEARNING
Most of the previous studies are based on traditional machine learning methods, such as Support Vector Machine (SVM) [11], Naive Bayes [12] and Decision Tree [13]. Naive Bayes is an important machine learning algorithm that is widely used in the field of classification. Panda and Patra [14] used the Naive Bayes algorithm for abnormal traffic detection. Their algorithm was tested on the KDD Cup99 dataset and achieved good results in terms of false alarm rate and calculation time. Ashraf et al. [15] applied Naive Bayes to network intrusion detection. Their idea is to use a Bayesian algorithm to select the most likely category under the premise of feature independence. Farid et al. [16] proposed a hybrid IDS of Naive Bayes and Decision Tree, and the detection rate on the KDD Cup99 dataset reached 99.63%. Nevertheless, KDDCup99 is an outmoded dataset, which contains much redundancy and cannot fully reflect the characteristics of abnormal traffic in modern networks. K. Rai et al. [17] used the Decision Tree C4.5 to perform intrusion detection experiments on the NSL-KDD dataset. In this work, 16 attributes were selected as the detection features of the dataset, but the accuracy was only 79.52%. In [18], Hebatallah et al. performed different feature selection schemes and used a Decision Tree J48 classifier with a Gain Ratio (GR) filter for detection on the UNSW_NB15 dataset. In [19], Tian et al. proposed a hybrid method of shallow and deep learning. They used a stacked Autoencoder to reduce feature dimensionality; then the SVM and the Artificial Bee Colony algorithm were combined for classification. The accuracy of the experiment reached 89.62% on the UNSW_NB15 dataset. However, traditional machine learning requires feature selection to pick suitable features, which takes much time and does not necessarily achieve good performance.
Moreover, these traditional machine learning methods are just simple shallow feature learning, which has poor performance for high-dimensional network traffic data [8].

B. DEEP LEARNING
Deep learning is a branch of machine learning. With the continuous development of big data and computing power, deep learning methods are rapidly emerging and have been widely used in various fields, such as image detection and speech recognition [20]. At the same time, many studies apply deep learning to abnormal traffic detection. The Recurrent Neural Network (RNN) is a deep learning model widely used for learning from time-series data. In 2017, Yin et al. [21] proposed an RNN-IDS algorithm for intrusion detection systems. This method was applied to the NSL_KDD dataset and proved to be superior to traditional machine learning methods. LSTM is a variant of RNN that overcomes the gradient vanishing and gradient exploding problems of RNN. In [22], Roy and Cheung proposed a Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM RNN) for intrusion detection in the Internet of Things (IoT). This model achieves a detection accuracy of approximately 97% on the UNSW_NB15 dataset. However, LSTM has the defect of not being able to parallelize calculations: when it computes the current state, it must wait for the previous state's computation to complete. In 2018, Lei et al. proposed the Simple Recurrent Unit (SRU) model based on LSTM [23]. This model uses a simpler structure to remove the previous sequence dependency and implements parallel computing. SRU achieves 5-8 times faster training than LSTM on classification and conversational-system tasks.
Based on the above works, traditional machine learning methods typically used in abnormal traffic detection often fail to catch many known and new security threats, largely because those approaches pay little attention to accurate feature selection and classification, and they are often inefficient for large-scale network flows. Current deep learning methods such as LSTM focus on improving the model but are very time-consuming. To address the above problems, we propose an efficient BiSRU model combined with feature dimensionality reduction for abnormal traffic detection. Specifically, we propose an sSAE to perform feature dimensionality reduction on the original high-dimensional network traffic data. The sSAE not only helps to represent the input data but also improves calculation efficiency. BiSRU learns the bidirectional structural features of network traffic from the forward and backward inputs and achieves accurate detection with less model training time. Finally, we conducted adequate experiments on the UNSW_NB15 network intrusion detection benchmark dataset to verify our proposed model.

III. BASIC THEORY

A. SPARSE AUTOENCODER
The Sparse Autoencoder (SAE) is an unsupervised learning method. It provides powerful non-linear generalization and is widely used in image denoising [24] and dimensionality reduction [25]. As shown in Fig. 2, the SAE model consists of three parts: an input layer, a hidden layer and an output layer. It achieves feature extraction by reconstructing the input data while setting the number of hidden neurons to less than the number of input-layer neurons. The learning process of SAE includes two stages: encoding and decoding. Encoding realizes the non-linear transformation from the high-dimensional space to the low-dimensional space. The encoder converts the input data X = {x_1, x_2, ..., x_i} ∈ R^n into a more abstract eigenvector H by the activation function f(·) as follows:

H = f(WX + b). (1)

The decoder reconstructs the input from the eigenvector; the reconstructed vector Y = {y_1, y_2, ..., y_i} is obtained by:

Y = g(W′H + b′), (2)

where W denotes the weight matrix between different layers and b is the bias. As a conventional autoencoder, the loss function J(W, b) in the decoding phase is calculated as:

J(W, b) = (1/n) Σ_{i=1}^{n} (1/2)‖y_i − x_i‖² + (λ/2) Σ_{l=1}^{n_l−1} Σ_{j=1}^{s_l} Σ_{i=1}^{s_{l+1}} (W_ji^{(l)})², (3)

where the first term is the reconstruction Mean Squared Error (MSE) [26] and the second term is a regularization for avoiding over-fitting. λ is the weight decay coefficient, n_l is the number of layers, s_l denotes the number of neurons in layer l, and W_ji^{(l)} denotes the connecting weight between neuron i in layer l + 1 and neuron j in layer l.
The SAE incorporates the Kullback-Leibler (KL) divergence to overcome feature redundancy; the selective activation of neurons helps to make the encoded data more characteristic and easier to recognize [27]. The KL divergence is defined as:

KL(ρ‖ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)), (4)

where ρ is the target proportion of activated neurons, ρ̂_j is the average activation degree of hidden unit j, and M is the number of hidden neurons. Therefore the loss function of the SAE is calculated as:

J_sparse(W, b) = J(W, b) + μ Σ_{j=1}^{M} KL(ρ‖ρ̂_j), (5)

where μ is the weight used to control the sparsity penalty term.
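As an illustration, the sparsity-penalized loss above can be sketched in NumPy (a minimal sketch; the function names and the default values of λ and μ are illustrative, not taken from the paper):

```python
import numpy as np

def kl_divergence(rho, rho_hat):
    """KL divergence between the target sparsity rho and the mean activation rho_hat."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_loss(x, y, rho_hat, weights, lam=1e-4, mu=3.0, rho=0.04):
    """Reconstruction MSE + L2 weight decay + KL sparsity penalty."""
    mse = 0.5 * np.mean(np.sum((y - x) ** 2, axis=1))          # reconstruction error
    decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)   # weight-decay term
    sparsity = mu * np.sum(kl_divergence(rho, rho_hat))        # sparsity penalty
    return mse + decay + sparsity
```

The penalty vanishes exactly when every hidden unit's average activation equals the target ρ, which is what drives the selective activation described above.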

B. LONG SHORT-TERM MEMORY
Long Short-Term Memory (LSTM) is a variant of RNN which overcomes the problems of gradient vanishing and gradient exploding [28]. LSTM has been widely used in Natural Language Processing [29], speech recognition [30] and time-series prediction [31]. The structure of an LSTM cell is shown in Fig. 4. The LSTM states are updated as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),
i_t = σ(W_i · [h_{t−1}, x_t] + b_i),
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C),
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,
o_t = σ(W_o · [h_{t−1}, x_t] + b_o),
h_t = o_t ⊙ tanh(C_t),

where f_t is the forget gate, i_t is the input gate, o_t is the output gate, C_{t−1} represents the state of the cell at the previous moment, C̃_t is the candidate state, C_t represents the current state of the cell, h_t represents the output of the current cell, and h_{t−1} represents the output of the cell at the previous moment. σ and tanh are activation functions, W is a weight matrix, b is a bias, and ⊙ denotes element-wise multiplication.
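The update equations above can be condensed into one forward step (a minimal NumPy sketch in which the four gate pre-activations are packed into a single matrix W; this packing layout is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    d = h_prev.size
    f_t = sigmoid(z[:d])          # forget gate
    i_t = sigmoid(z[d:2 * d])     # input gate
    o_t = sigmoid(z[2 * d:3 * d]) # output gate
    c_tilde = np.tanh(z[3 * d:])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Note that h_t depends on c_t, which in turn depends on c_{t-1}; this chain is exactly the sequence dependency that prevents parallelization across time steps.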

IV. THE PROPOSED DETECTION MODEL

A. STACK SPARSE AUTOENCODER
The model of an sSAE is formed by stacking multiple single SAEs [32]. It can extract features efficiently and reduce their dimensionality [33]. As shown in Fig. 3, training an sSAE mainly includes two phases: pre-training and fine-tuning.

1) PRE-TRAINING
A large amount of unlabeled data is used to train each SAE individually with an unsupervised layer-by-layer greedy learning method. The number of neurons in the output layer is consistent with the input layer. The process is as follows: Step 1: use unlabeled data as input, and randomly initialize the weight matrix W and bias vector b.
Step 2: set hyperparameters, such as epoch n, sparsity penalty term µ, learning rate λ and dropout rate η, etc.
Step 3: train the SAE in an unsupervised manner and calculate the average activation degree ρ̂_j and the loss function J_sparse.
Step 4: update the weights W and biases b in each epoch until the loss falls within the set threshold.
Step 5: take the output of the hidden layer of the previous SAE as the input of the next SAE, and repeat Steps 3 and 4 with the layer-by-layer greedy learning method until all the SAEs are trained.
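The five steps above can be sketched as a greedy layer-by-layer loop. This is a simplified NumPy illustration: each layer is trained with plain gradient descent on the reconstruction MSE only (the sparsity and weight-decay terms are omitted for brevity), and all names and hyperparameters are illustrative, not the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_sae(X, n_hidden, epochs=30, lr=0.05):
    """Train one autoencoder layer by gradient descent on the reconstruction MSE."""
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)           # encode
        Y = H @ W2 + b2                    # decode (linear output)
        err = Y - X                        # reconstruction error
        gW2 = H.T @ err / len(X); gb2 = err.mean(0)
        dH = (err @ W2.T) * (1 - H ** 2)   # backprop through tanh
        gW1 = X.T @ dH / len(X); gb1 = dH.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1

def pretrain_stack(X, layer_sizes):
    """Greedy pre-training: each SAE is trained on the previous hidden output."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_sae(H, n_hidden)
        params.append((W, b))
        H = np.tanh(H @ W + b)             # hidden output feeds the next SAE
    return params, H

X = rng.normal(size=(64, 16))
params, codes = pretrain_stack(X, [8, 4])
```

The decoder of each layer is discarded after training; only the encoder parameters `params` are carried into the fine-tuning phase.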

2) FINE TUNING
A small amount of labelled data is used for supervised learning, and the backpropagation algorithm is used to optimize the initial parameters. The process is as follows: Step 1: connect the trained hidden layers of the SAEs in order to form the whole sSAE, and initialize it with the weight matrices W and bias vectors b obtained from the pre-training phase.
Step 2: add the softmax activation function to the last layer of the sSAE for supervised classification.
Step 3: compare the predicted value with the label value to obtain the loss, and use the backpropagation algorithm to fine-tune the parameters of the whole sSAE network.

B. SIMPLE RECURRENT UNIT
Much of the current progress in deep learning has come from increased modelling power and the associated computation, which often involves larger, deeper neural networks [34]-[36]. However, although deep neural networks bring obvious improvements, they also take a great deal of training time.
To address the computational cost of model training, parallelized GPU-accelerated training has been widely adopted in deep learning. Using GPUs to accelerate the Convolutional Neural Network (CNN) clearly improves training speed [37]. However, LSTM cannot be parallelized this way, because the calculation of h_t must wait until the previous h_{t−1} calculation is completed [38]. SRU was proposed to remove this limitation. In existing deep learning libraries, an SRU implementation is about five times faster than an LSTM implementation [23]. In SRU, the calculation of h_t no longer depends on the calculation at the previous moment, and the training speed after parallel processing is faster than LSTM's. The SRU structure is shown in Fig. 5.
The complete architecture of an SRU decomposes into two sub-components: a light recurrence and a highway network. Thanks to enhanced parallelization and gradient propagation, the combination of the two components makes the entire architecture simple, expressive and easy to expand. First, a linear transformation is performed on the input x_t:

x̃_t = W x_t.

The light recurrence is similar to an LSTM cell: it reads the input vectors x_t in sequence and calculates the state vector C_t to obtain sequence information. Specifically, a forget gate f_t controls the information flow as follows:

f_t = σ(W_f x_t + v_f ⊙ C_{t−1} + b_f),

where ⊙ denotes element-wise multiplication. The state vector C_t is the adaptive average, weighted by f_t, of the previous state C_{t−1} and the current observation W x_t:

C_t = f_t ⊙ C_{t−1} + (1 − f_t) ⊙ (W x_t).

In previous gated recurrent architectures, C_{t−1} is multiplied by a full parameter matrix to calculate f_t, e.g.,

f_t = σ(W_f x_t + V_f C_{t−1} + b_f).

However, the matrix product V_f C_{t−1} makes it difficult to parallelize the state calculations, because each dimension of C_t and f_t then depends on all entries of C_{t−1}, and the calculation must wait until C_{t−1} is completely calculated.
By replacing the matrix product V_f C_{t−1} with the point-wise product v_f ⊙ C_{t−1}, the light recurrence makes each dimension of the state vector independent of the others, so the states can be computed in parallel.
The highway network is helpful for gradient propagation during training. It uses a reset gate r_t to adaptively combine the input x_t and the state C_t from the light recurrence:

r_t = σ(W_r x_t + v_r ⊙ C_{t−1} + b_r),
h_t = r_t ⊙ g(C_t) + (1 − r_t) ⊙ x_t,

where (1 − r_t) ⊙ x_t is a skip connection [39] that allows the gradient to propagate directly to the previous layer.
To relieve the computational bottleneck of matrix multiplication, the multiplications can be batched over all time steps, which significantly improves computation intensity and GPU utilization. The three matrix multiplications are combined into one, and the subsequent element-wise processing addresses the result by index:

U^T = (W, W_f, W_r) [x_1, x_2, ..., x_n],

where U ∈ R^{n×3d}, d is the hidden state size and n is the input sequence length. The element-wise operations over the sequence can be compiled and merged into one kernel function and parallelized along the hidden dimension.
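Putting the light recurrence, the highway connection and the batched matrix multiplication together, one SRU layer can be sketched as follows (a minimal NumPy illustration that assumes the input and hidden dimensions are equal; it is not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, Wf, Wr, vf, vr, bf, br):
    """SRU over a sequence X of shape (n, d) with hidden size d.
    The three matrix products are batched over all time steps first;
    only the cheap element-wise recurrence remains sequential."""
    U = X @ np.concatenate([W, Wf, Wr], axis=1)       # (n, 3d): one big matmul
    n, d = X.shape[0], W.shape[1]
    c = np.zeros(d)
    H = np.zeros((n, d))
    for t in range(n):
        wx, wfx, wrx = U[t, :d], U[t, d:2 * d], U[t, 2 * d:]
        f = sigmoid(wfx + vf * c + bf)                # forget gate (element-wise)
        c = f * c + (1 - f) * wx                      # light recurrence
        r = sigmoid(wrx + vr * c + br)                # reset gate
        H[t] = r * np.tanh(c) + (1 - r) * X[t]        # highway skip connection
    return H
```

Because the loop body contains only element-wise operations, each hidden dimension can be processed independently, which is what a fused GPU kernel exploits.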

C. THE DETECTION MODEL

1) sSAE FOR FEATURE DIMENSIONALITY REDUCTION
Each SAE consists of an input layer, a hidden layer, and an output layer. The output of the previous SAE's hidden layer serves as the input of the next SAE's input layer. Through stacking, multi-level abstraction of the original traffic data and effective extraction of features are completed. First, the original UNSW_NB15 traffic is preprocessed to obtain a vector with an input dimension of 196 as the input of the sSAE. Then, a network structure with three hidden layers (whose neuron counts are {128, 32, 32}) is adopted to represent the non-linear relationships in the data. Finally, the 32-dimensional feature data after dimensionality reduction is taken as the output. As shown in the lower part of Fig. 6, training the sSAE means finding the best parameters θ = (W, b) by minimizing the difference between the input traffic examples X = (x_1, x_2, ..., x_n), n ∈ N*, and their reconstructions x̂_n. After obtaining the optimal parameters θ, the sSAE yields a function f(W x_n + b) = x̂_n ≈ x_n, which converts the input network traffic into a new feature representation.

In order to train the sSAE feature dimensionality reduction model, we chose to stack three single SAEs. The motivation is that a deeper network structure can more accurately reconstruct the input data, ensuring that the high-level features of the encoded data are not lost during feature dimensionality reduction. The key steps of the proposed sSAE are illustrated in Algorithm 1. It is worth noting that we apply Batch Normalization (BN) [40] behind each hidden layer of the sSAE. BN keeps the distribution of the input data in each layer of the network relatively stable, accelerating model learning. More importantly, the normalization operation makes the input of the activation function fall in the gradient-unsaturated zone, alleviating the gradient vanishing problem caused by network depth and complexity. For a mini-batch of m samples x = {x_1, ..., x_m}, each feature is processed as follows:

μ_B = (1/m) Σ_{i=1}^{m} x_i,
σ_B² = (1/m) Σ_{i=1}^{m} (x_i − μ_B)²,
x̂_i = (x_i − μ_B) / √(σ_B² + ε),
y_i = γ x̂_i + β,

where ε is a small constant for numerical stability and γ, β are learnable scale and shift parameters.
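The BN transform described above is straightforward to sketch (training-time statistics only; the γ, β and ε defaults are illustrative):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """BN over a mini-batch x of shape (m, features): normalize each feature
    to zero mean and unit variance, then apply a learnable scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta
```

At inference time an implementation would use running averages of μ_B and σ_B² rather than per-batch statistics.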

2) BiSRU FOR CLASSIFICATION
We propose a BiSRU model based on SRU, which is more efficient for abnormal traffic detection. We take the network traffic after feature dimensionality reduction as the high-level feature representation of the original data and use it as the input vector of the BiSRU-based abnormal traffic detection model, expressed as X̃ = (x_1, x_2, ..., x_n). Assuming that the predicted result is y(x) ∈ {0, 1}, the detection model is trained by comparing the loss between the predicted value y(x) and the actual label value Y. The input data of the model must be time-series data, so we first use TimeseriesGenerator to turn the input data into sequence data with a time step. The key steps of BiSRU are shown in Algorithm 2: compute the forget gate f_t; compute the current cell state C_t from the previous cell state C_{t−1} and the forget gate f_t; compute the SRU output h_t = r_t ⊙ g(C_t) + (1 − r_t) ⊙ x_t; form the BiSRU output H = (h_t, ←h_t); and execute the sigmoid classifier. After calling TimeseriesGenerator to generate the time-series traffic, we set a learning-rate decay schedule to control the learning rate η in stages, so as to achieve more efficient learning and train the neural network in different phases. In particular, BiSRU is trained using mini-batch Stochastic Gradient Descent (SGD) [41]. It summarizes the forward information h_t and the backward information ←h_t to enhance its feature extraction ability.
For the binary classification problem, the label of the k-th network traffic sample is y(k) ∈ {0, 1}, where 1 and 0 represent abnormal traffic and normal traffic, respectively. After completing the advanced feature learning process, a sigmoid classifier is utilized in the output layer to encode the learned representation as:

f_{W^(s)}(x) = 1 / (1 + e^{−W^(s) x}),

where f_{W^(s)} is the sigmoid activation function with parameters W^(s).
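The bidirectional combination and the sigmoid output layer can be sketched as follows (a minimal illustration; `run_sru` stands in for any trained SRU layer and is a hypothetical placeholder, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bidirectional(run_sru, X):
    """Concatenate forward and backward hidden states: H = (h_t, <-h_t)."""
    h_fwd = run_sru(X)                # process the sequence left-to-right
    h_bwd = run_sru(X[::-1])[::-1]    # process right-to-left, re-align in time
    return np.concatenate([h_fwd, h_bwd], axis=1)

def classify(H, Ws, bs):
    """Sigmoid output layer on the last time step: 1 = abnormal, 0 = normal."""
    p = sigmoid(H[-1] @ Ws + bs)
    return int(p >= 0.5)
```

Because the two directions are independent, both passes can run in parallel before their states are concatenated.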

V. EXPERIMENT RESULTS AND ANALYSIS

A. DATASET DESCRIPTION
In order to evaluate our proposed method for abnormal traffic detection, the commonly used public standard intrusion detection dataset UNSW_NB15 is used for verification. UNSW_NB15 is a newer public dataset for intrusion detection. It was created by the Cyber Security Research Team of the Australian Centre for Cyber Security (ACCS) to solve the data redundancy problem in the KDDCup99 and NSL_KDD datasets [42]. UNSW_NB15 has a reasonable number of records in its training and testing datasets: the training dataset contains 175341 records, and the testing dataset contains 82332 records. Each traffic record contains 42 features and one class label. As shown in Table 1, the features can be categorized into five types: basic, flow, content, time, and additional generated features. The UNSW_NB15 dataset has nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms, as shown in Table 2. We label all attack-type records as abnormal traffic and all normal records as normal traffic samples. The training dataset is used for model training, and the testing dataset for model verification.
We also convert "state" and "service" in the same way. After all the transformations, the 42-dimensional features of the UNSW_NB15 dataset are mapped into 196-dimensional features.
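One-hot encoding of a nominal feature such as "state" or "service" can be sketched as follows (an illustrative helper with a made-up category list, not the authors' preprocessing code):

```python
import numpy as np

def one_hot_columns(records, categories):
    """Map each nominal value to a one-hot vector; unseen values map to zeros."""
    index = {v: i for i, v in enumerate(categories)}
    out = np.zeros((len(records), len(categories)))
    for row, value in enumerate(records):
        if value in index:
            out[row, index[value]] = 1.0
    return out

# Illustrative example on a protocol-like column
protos = ["tcp", "udp", "tcp", "icmp"]
encoded = one_hot_columns(protos, ["tcp", "udp", "icmp"])
```

Each nominal feature expands into as many binary columns as it has categories, which is how the 42 original features grow to 196 dimensions.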

2) NORMALIZATION
The features of the UNSW_NB15 dataset span very different minimum-maximum ranges, making them incomparable and unsuitable for direct processing. We therefore normalize the data and scale it to a small specific interval. Min-Max normalization maps all features into the range [0, 1] by the following equation:

x′ = (x − x_min) / (x_max − x_min), (23)

where x_min and x_max are the minimum and maximum values of the corresponding feature.
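Equation (23) applied column-wise can be sketched as follows (an illustrative helper; the handling of constant columns is our own assumption):

```python
import numpy as np

def min_max(x):
    """Scale each feature column of x into [0, 1]; constant columns map to 0."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (x - lo) / span
```

In practice the minima and maxima would be computed on the training split only and reused for the testing split, to avoid information leakage.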

C. PERFORMANCE METRICS
The performance metrics for abnormal traffic detection are derived from the confusion matrix [43]. The values in the confusion matrix are defined as follows: • True Positive (TP): abnormal traffic instances correctly classified as abnormal.
• False Negative (FN): abnormal traffic instances wrongly classified as normal.
• False Positive (FP): normal traffic instances wrongly classified as abnormal.
• True Negative (TN): normal traffic instances correctly classified as normal.
These four items are used to generate the following performance evaluation metrics:
• Accuracy: evaluates the overall success rate of the model in detecting normal and abnormal traffic, calculated as Accuracy = (TP + TN) / (TP + TN + FP + FN).
• Precision: evaluates the proportion of instances classified as abnormal that are actually abnormal, calculated as Precision = TP / (TP + FP).
• Recall: evaluates the ratio of correctly classified abnormal traffic instances to the total number of abnormal traffic instances, calculated as Recall = TP / (TP + FN).
• False Positive Rate (FPR): evaluates the proportion of normal instances misclassified as abnormal traffic among all normal instances, calculated as FPR = FP / (FP + TN).
- BLSTM RNN: a Bidirectional Long Short-Term Memory RNN for intrusion detection. Referring to the parameters in the original paper, the number of hidden layers is 3, the batch size is 132, the time step is 60, the dropout rate is 0.8, the learning rate is 0.001, and the number of epochs is 100.
- Our proposed method: As shown in Table 3, the nodes in each layer are set to {196, 128, 32}. ρ and the learning rate are set to 0.04 and 0.001, respectively. In BiSRU, the time step is 10; the last decoder layer adopts the sigmoid activation function, the remaining layers are followed by the ReLU activation function, and the optimizer is mini-batch SGD.
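The four metrics can be computed directly from the confusion-matrix counts (a small illustrative helper):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and FPR from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall success rate
    precision = tp / (tp + fp)                  # correctness of abnormal alarms
    recall = tp / (tp + fn)                     # coverage of abnormal traffic
    fpr = fp / (fp + tn)                        # normal traffic flagged abnormal
    return accuracy, precision, recall, fpr
```

For example, with TP = 90, FP = 10, TN = 80 and FN = 20, the helper yields an accuracy of 0.85 and a precision of 0.9.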

2) METHOD COMPARISON
In order to evaluate the performance of our proposed model, we conduct all experiments on the UNSW_NB15 dataset. Table 4 lists the performance comparison between our proposed method and the state-of-the-art methods. All models are trained on the training dataset and verified on the testing dataset. In each set of experiments, we evaluate eight methods, including our proposed method, on four performance indicators (Accuracy, Precision, Recall and FPR). The experimental results in Table 4 show that, compared with other traditional machine learning and deep learning methods, our proposed method achieves the highest Accuracy of 0.9941, Precision of 0.9904, Recall of 0.9964, and FPR of 0.0077, which means it performs better than the other methods in detecting abnormal traffic. Fig. 7 compares the training and testing accuracy and the loss convergence of our proposed sSAE+BiSRU with those of two deep learning methods. All models were trained for 100 epochs, with the performance indicators evaluated after each epoch. The proposed method has a much faster loss convergence rate during training and testing and reaches the best results sooner, which is clearly better than the other methods. We further plot the Receiver Operating Characteristic (ROC) curves of our proposed sSAE+BiSRU and the state-of-the-art methods on UNSW_NB15 in Fig. 8. The ROC curve shows the performance at every threshold value and allows selecting the best diagnostic threshold. The closer the ROC curve is to the upper-left corner, the more accurate the detection; the point on the ROC curve closest to the upper-left corner is the best threshold, with the fewest total false positives and false negatives. The ROC curve of sSAE+BiSRU is the closest to the upper-left corner, indicating better generalization ability than the other methods.
All the results reported above demonstrate that sSAE+BiSRU outperforms its competitors. We can conclude that sSAE+BiSRU is capable of efficient intrusion detection.
In the following, we further examine the robustness of sSAE+BiSRU under different training ratios on the UNSW_NB15 dataset. Fig. 9 reports the detailed performance of the model under different training ratios. We can see that sSAE+BiSRU achieves the highest Accuracy and Recall when the training ratio is 80%, while a training ratio of 70% yields the highest Precision and lowest FPR. In short, it is indeed necessary to control the training ratio to bring performance improvement to abnormal traffic detection. For our proposed sSAE+BiSRU model, the results obtained with different training ratios differ, but all remain excellent within this range. This also means that the proposed method has a certain robustness to the dataset, showing high detection accuracy and low FPR on training datasets of different scales.

3) COMPUTATION COMPARISON
To deepen this investigation, Table 5 reports the number of training parameters and the running time required by the proposed method and the state-of-the-art methods. We use GPUs to accelerate the training of all models. It can be noticed that, when training on the UNSW_NB15 dataset, the proposed method has fewer trainable parameters and lower training and testing time. We use as few parameters as possible in the structure, which enables efficient parallel computation.

E. RESULT OF DIMENSIONALITY REDUCTION
In the first stage, to further improve the Accuracy of abnormal traffic detection and effectively reduce the feature dimensionality, we analyze the dimensionality reduction effect of the sSAE. Four sSAE models with different structures are used, varying the number of stacked layers, the number of nodes and ρ. The performance comparison of the different structures is shown in Fig. 10. The parameters of the sSAE have an important influence. After data preprocessing, two-layer and three-layer sSAEs are set up for data dimensionality reduction. The first hidden layer performs the first reduction of the original data, which determines the ability to extract features from the input layer. On the one hand, increasing the number of nodes can strengthen the learning ability of the network; on the other hand, it affects the generalization ability of the network, causing over-fitting, and greatly increases the computation. Therefore, the number of nodes in the first hidden layer is set to 128. The softmax activation function is combined with an output layer to estimate the class probabilities. When the number of hidden neurons is large, a compressed representation of the input cannot be obtained without sparsity. The sparsity factor ρ makes the average activation of hidden neurons extremely small. ρ is usually a small value close to 0, so we set it to {0, 0.02, 0.04, 0.06} respectively. Finally, the performance indexes of sSAEs with different structures are compared under the different ρ settings.
In order to prevent over-fitting and reduce computational overhead, we obtain results by setting different numbers of layers (two and three layers) and the number of hidden nodes to obtain different dimensionality reduction effects. By comparing the sSAE performance indicators of four different structures, we found that: (1) when the number of stacked layers of sSAE is three, the overall performance of the four indicators is better than that of two layers; (2) compared with 64 nodes in the second hidden layer, a better result can be achieved when the number of nodes is 32; (3) among the four model results, the best results are obtained when the structural model is {128, 32, 32}.

F. EVALUATION FOR BiSRU
In the second stage, the encoded data obtained from sSAE dimensionality reduction is input to time-related models based on RNN variants. The four variants are SimpleRNN, LSTM, GRU and SRU. We use the tanh activation function and the Adam optimizer. The Adam optimizer adapts the learning rate, reduces the parameter-setting effort and is computationally efficient. In order to verify the validity of the time-series model, all samples in the dataset are used to conduct experiments with different time steps.
In Fig. 11 (a), it is clear that as the time step increases, the FPR on the training dataset gradually decreases. Compared with the other models, SRU has the fastest training-FPR decline, which means that as the time step increases, the SRU model converges faster, achieving higher Accuracy and lower FPR. As shown in Fig. 11 (b), the model training time increases with the time step. Among them, the training times of LSTM and GRU are much longer than SRU's. It is worth noting that SimpleRNN has the shortest training time due to its simplest internal structure; however, SimpleRNN is prone to gradient vanishing and gradient explosion. It can be summarized from the above results that, as the time step increases, SRU obtains the lowest training FPR among the models, and its training time is much less than that of LSTM and GRU. Fig. 11 (c) and (d) show the FPR and time on the testing dataset as the time step increases. Under the same time step, the SRU model obtains the lowest FPR, and its testing time is much less than that of the LSTM and GRU models.

G. ABLATION STUDY
In order to verify the effectiveness of our proposed two-stage combination method, we conduct a model ablation study. Specifically, to verify that the improvement comes from each component, we remove the component of each stage in turn from the proposed sSAE+BiSRU and compare the result with the complete sSAE+BiSRU. For the individual sSAE, we add a softmax activation function for the classification test. The BiSRU we propose is a bidirectional structure based on SRU; to verify the superiority of BiSRU, we also compare it with a single SRU. We name these models as follows: • Model w/o sSAE: The model without the sSAE component.
• Model w/o SRU: The model without the SRU component.
• sSAE + SRU: The model replaces BiSRU with an SRU cell.
The ablation study results are shown in Table 6. It can be seen that whenever a component of either stage is removed from our proposed model, all four performance indicators decline. The Accuracy of our proposed sSAE+BiSRU model is 0.9941, its Precision is 0.9904, its Recall is 0.9964, and its FPR is 0.0077. Compared with the conventional SRU, the performance is slightly improved. Without the sSAE component, all performance indicators become worse. When the BiSRU part is removed, the results drop significantly and become the worst, reaching an Accuracy of only 0.8898. This shows that BiSRU can significantly improve the Accuracy of abnormal traffic detection, thanks to its efficient automatic extraction of the bidirectional features of network traffic. BiSRU learns the bidirectional structural features of network traffic from the forward and backward inputs and achieves accurate detection with less model training time.

VI. CONCLUSION
In this article, an efficient abnormal traffic detection method based on bidirectional feature extraction with BiSRU and compression with sSAE was proposed. Our main contribution is to introduce an efficient in-depth feature extraction method combined with dimensionality reduction. Firstly, we propose an sSAE to perform feature dimensionality reduction on the high-dimensional network traffic and extract the compressed high-level features. Then a BiSRU with parallel computing is proposed for abnormal traffic detection, which adopts a bidirectional structure to better extract the context features of network traffic. This leads to a significant speedup in model training compared with LSTM and GRU. The experimental results show that our method maintains competitive performance compared with the state-of-the-art methods in terms of accuracy and computational efficiency. In general, this article analyzes the shortcomings of modern abnormal traffic detection technology and, based on deep learning, proposes BiSRU combined with a feature dimensionality reduction algorithm for intrusion detection systems. Network intrusion behavior can be detected correctly from the traffic data in the intrusion detection system, further ensuring the security of the network environment.
As future work, we will jointly utilize a swarm intelligent optimization algorithm for training BiSRU and sSAE to potentially further increase the performance due to the automatic hyperparameter tuning.