NADS-RA: Network Anomaly Detection Scheme Based on Feature Representation and Data Augmentation

Network anomaly detection aims to identify network anomalies, and supervised classification techniques have achieved considerable success in this task. Because a supervised classifier depends on prior data, it is difficult to classify rare anomalies accurately when they account for only a small share of the training set. Data augmentation can tackle the imbalanced training set problem by creating artificial rare anomaly samples. However, existing data augmentation methods ignore either the data distribution or the spatial knowledge between features. This article addresses the issue by proposing a Network Anomaly Detection Scheme based on feature Representation and data Augmentation (NADS-RA). A Re-circulation Pixel Permutation strategy is first designed as the feature representation strategy to construct images; it rotates the feature vector left repeatedly, as many times as there are features, to maintain the spatial knowledge between the original network traffic features. An image-based augmentation strategy is then designed to produce augmented images according to the distribution characteristics of rare network anomaly images with the help of a Least Squares Generative Adversarial Network, which alleviates the effect of the imbalanced training set and avoids over-fitting. NADS-RA is then implemented on a Convolutional Neural Network classification model. We conduct experiments on five public benchmark datasets, including NSL-KDD and UNSW-NB15, and compare against 12 detection methods and 17 data generation methods. The experimental results demonstrate that our work is more effective than state-of-the-art methods and is generally applicable in different scenarios.


I. INTRODUCTION
With the fast development of the Internet, network security has become increasingly challenging. Network anomaly detection, as an effective scheme to identify anomalous behavior, has achieved good results with supervised classification methods [1], [2]. Most supervised classification models depend on prior data, and they assume an equal distribution of training data [3]. However, real-world situations are usually not in line with such an assumption. For example, in the nine weeks of network connectivity data collected from a simulated US Air Force LAN [4], the number of normal samples is more than 60,000, but that of user-to-root (U2R) attacks is less than 100. In this case, U2R can be seen as a kind of rare anomaly. When the number of rare anomalies is less than that of the normal samples [5], [6], supervised methods are usually limited in classifying rare anomalies [7]. Generally speaking, when the data of one class in the training set are significantly outnumbered by the data of at least one other class, the training set can be considered imbalanced [8]. Classifiers are generally not prepared for an imbalanced training dataset; they are likely to predict newly arriving samples as the majority class [9] and miss the real minority class that might be harmful to the network, such as the U2R attack [10]. Hence, it is essential to classify rare anomalies from the imbalanced data. Data augmentation is a well-known way to tackle imbalanced classification by creating artificial rare data, but it is difficult to create realistic-looking data [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The commonly used Random Over-Sampling (ROS) and Synthetic Minority Oversampling Technique (SMOTE) [12] methods replicate rare data or produce data along the line segment that joins rare data [13]. They over-sample data based on local information rather than the overall distribution of the rare class, so the data generated by these methods might disturb the global distribution of the original data and further weaken the training effectiveness. The Least Squares Generative Adversarial Network (LSGAN) [14] has been successfully applied in image processing to produce similar images by learning the data distribution. It was first applied in network security to produce similar network traffic [15] by learning the data distribution of each feature independently, which loses the spatial knowledge among the features. Representing network traffic features as images might overcome this issue, but the existing network traffic feature representation strategies [10], [16]-[18] first encode the features by One-Hot and then transform the encoded vectors into pixel values, which disrupts the feature unity and the spatial knowledge of some adjacent features. Moreover, this approach is prone to over-fitting [16]. Therefore, producing realistic-looking data while maintaining both the spatial knowledge of features and the distribution of data is especially challenging [19].
To solve this problem, we design an image-based data augmentation strategy. To the best of our knowledge, it is the first time that LSGAN is used to produce augmented network anomalies on the basis of network traffic feature images. We rotate the feature vector left one position at a time, as many times as there are features, to construct a circulant matrix whose entries serve as pixel values [20]. The raw network traffic features are thus represented as images, thereby maintaining the spatial knowledge. Subsequently, a data augmentation strategy based on LSGAN is designed to over-sample the rare anomalies automatically according to the imbalance ratio, thus balancing the imbalanced training set and avoiding over-fitting. The Convolutional Neural Network (CNN) is trained on the resulting balanced dataset to learn the spatial knowledge of the training data. Finally, a complete network anomaly detection scheme based on representation and augmentation (NADS-RA) is constructed in this article.
We conduct experiments on five public benchmark datasets: two well-known network anomaly detection datasets, namely NSL-KDD [4] and UNSW-NB15 [21], one credit card fraud detection dataset [22], and two software defect detection datasets [23], namely JM1 and PC5. The experimental results validate the effectiveness of NADS-RA: it not only improves the overall accuracy, but also decreases the False Negative Rate (FNR) of rare anomalies [24]. The results also suggest that NADS-RA outperforms other state-of-the-art methods and is generally applicable to other areas.
The contributions of this study are summarized as follows:
(1) We design an image-based rare network anomaly augmentation strategy that not only maintains the spatial knowledge of network anomaly features, but also keeps the distribution of rare network anomalies.
(2) The experimental results of comparing with state-of-the-art methods suggest that the designed data augmentation strategy can alleviate the effect of the imbalanced training set and further avoid over-fitting.
(3) The proposed NADS-RA is validated on five public benchmark datasets, including network traffic anomalies, software defects, and credit card frauds. The experimental results suggest that NADS-RA is adaptable in different areas.
The remainder of this article is structured as follows. Section II introduces the related work and Section III states the problem. Section IV illustrates the main methodology. The experimental results are described in Section V. Section VI concludes the article and discusses future work.

II. RELATED WORK
Table 1 summarizes related work on CNN-based network anomaly detection published in the recent five years. All of these works are implemented on real public datasets. According to their experimental analyses and described future work, as well as the real situation in which the number of anomalies is less than that of normal behavior, the main challenges of network anomaly detection based on deep learning can be summarized as feature representation and imbalanced classification. How to represent network traffic as images while maintaining the spatial knowledge, and how to train an effective classifier on an imbalanced training dataset to classify the rare anomalies, are the main problems in the reported works. Therefore, the related works are introduced from two perspectives, feature representation and imbalanced data classification, which are marked in the second column ''Representation'' and the fifth column ''Balance'' of Table 1, respectively.

A. FEATURE REPRESENTATION
Deep learning techniques have been widely used in image-related areas since their layer-by-layer processing ability can automatically mine the hidden and high-level characteristics of the input images. To utilize deep learning techniques to improve network anomaly detection, transforming the raw network traffic features into images is the first step. Many visualization techniques [31] have been designed, and they can be divided into three types: data filtration and transformation, graph theory, and pixel-based representation.
(1) Among the data filtration and transformation techniques, principal component analysis (PCA) [32] is often used, but PCA is an orthonormal linear transformation because it assumes that all basis vectors are orthonormal, so it is not recommended for analyzing categorical data [31]. (2) Graph theory is suitable for connection networks that consist of nodes and links, and it is often used in network-communication-related scenarios. (3) In contrast, the pixel-based representation technique opens the opportunity to apply deep learning techniques in network anomaly detection [33] and can be used to analyze categorical data. It aims to change the feature elements into colored pixels, and it supports fixed-size data conversion.
This article investigates anomaly detection based on network traffic features, so the pixel-based representation is deployed. To represent the network features as pixel-based images, different feature encoding methods have been explored. They can be deployed on the extracted features [4], on the raw payload contents [9], or on the combination of both [34]. Payload contents are often treated as natural language, and all the entities within the payload are encoded by word embedding methods. To obtain encoded vectors of the same length, short payloads are usually padded with zeros to the length of the longest payload [25]. The extracted features are generally continuous or discrete values [26], [27]. Encoding the discrete features [10], [17], [30] or all features [16], [18] by One-Hot also generates many zeros in the encoded vector. For both encoding methods, the encoded vectors are represented as a sequence of 0s and 1s, and every eight bits are then transformed into a decimal pixel value. In this case, the representation process can be seen as a sequential operation of decomposition and combination. The raw features or raw contents are encoded into the binary space and then transformed into the decimal space, which might destroy the unity of the raw feature or weaken the spatial relationship between the raw features. The related research is marked in the fourth column ''Unity'' of Table 1. Besides, there are also some representation methods that reshape the long vector of original features into a pixel matrix directly [28], [29]. Though this kind of method retains the unity of each feature, it breaks the relationship between some adjacent features.
Hence, a pixel-based image conversion strategy to retain the original spatial characteristics and the unity of features is required.

B. DATA AUGMENTATION
In the real world, anomalies generally occur less frequently than normal behaviour, so the numbers of anomalous and normal samples will be imbalanced in the collected data. For example, in the nine weeks of network connectivity data collected from a simulated US Air Force LAN [4], the number of normal samples is more than 60,000, but that of user-to-root (U2R) attacks is less than 100. In this case, supervised classifiers are unable to learn the characteristics of rare anomalies, so they are likely to predict the rare anomalies as normal and might miss the real attack, which puts the system in a dangerous state. As shown in Table 1, anomaly detection is studied in the form of binary classification or multi-classification. For datasets that include many different attack labels, all the attack labels are generally treated as the anomaly class to perform binary classification. No matter how many classes are to be classified, the imbalance problem exists. The fifth column ''Balance'' indicates whether the corresponding research has addressed the imbalance issue. It can be seen that the issue has been addressed in only a few studies.
To discover rare anomalies from imbalanced data, the commonly used strategies are data re-sampling methods [35] and classifier modification methods [10]. The classifier modification methods aim to make the classifiers sensitive to the rare classes. A sequential classifier [7] containing five classifiers was proposed to identify a specific attack in sequence. In spite of its good performance, it needs expert knowledge to judge the intermediate classification results. In contrast, the data re-sampling methods aim to balance the imbalanced training dataset to enhance the characteristics of the rare anomalies before classification. Since most classifiers are generally not prepared for an imbalanced training dataset, how to balance the training dataset in advance and then adapt the classifiers flexibly has attracted much academic attention.
Over-sampling strategies [3], [12], [13], which aim to generate similar data to increase the proportion of rare classes, have been widely used when the number of rare-class samples is very small. For example, in NSL-KDD, the number of U2R samples is less than 100. These strategies produce similar data according to the distance between samples, but they ignore the data distribution. Generative Adversarial Networks (GAN) [36] and Least Squares GAN (LSGAN) [14] have proved effective at learning the data distribution and producing similar data. They were first applied in network security to generate network traffic data for enhancing the rare labeled raw packet streams, and the enhanced data were then used to train a classifier to classify TCP streams [15], [37]. They provide a promising guide for imbalanced classification, but they mimic each feature independently, which might lose the spatial knowledge between features. Inspired by the above over-sampling methods applied in network anomaly detection, we adopt LSGAN to learn the distribution characteristics of rare anomalies. Combined with the feature representation, an image-based augmentation method is designed. It not only keeps the characteristics of the data distribution, but also maintains the spatial knowledge of features. The NSL-KDD [4] and UNSW-NB15 [21] datasets are two well-known and frequently used benchmark datasets, so they are also utilized in this article.

III. PROBLEM DESCRIPTION
According to the definition in [8], any dataset that exhibits an unequal distribution between its classes can be considered imbalanced. Analyzing recent research [7], [9], [16]-[18], [25], [28]-[30], it can be found that the main challenge faced by many experiments is that some rare anomalies are difficult to discover even though the overall accuracy of imbalanced classification is high. Our goal is to precisely classify rare anomalies: given a training set $D = \{N_1, \ldots, N_n, A_1, \ldots, A_a\}$, where $n$ is the number of normal data and $a$ is the number of abnormal data, and assuming that $n$ is far larger than $a$ (a ratio $n/a$ on the order of 1000 is considered in this article), the problem is to train a classifier $C$ so that, when a new abnormal sample arrives, $C$ can accurately predict whether this sample is abnormal or not.
To achieve this goal, constructing a balanced set through over-sampling technique has been widely used [3], [12], [13], [15]. Inspired by them, we focus on designing an effective data augmentation strategy through over-sampling the minority data. Thus our task can be summarized as follows: given an imbalanced training set, how to generate high-quality samples for augmenting the raw imbalanced training set.
Most traditional data synthesis approaches either ignore the characteristics of the data distribution [12], [13] or disrupt the spatial features within the data [15], [37], and these two issues have not been solved well in state-of-the-art works. It is therefore necessary to provide a new solution for data augmentation. To overcome the influence of imbalanced data on the supervised classifier in network anomaly detection, the main problem to be solved in this article is how to produce augmented data that further enhance the characteristics of rare network anomalies while maintaining the spatial knowledge within each anomaly sample and keeping the distribution of rare anomalies. Besides, once the augmented network anomaly data are obtained, how to control the mixing ratio of each class in the augmented training set is also explored in this article.

IV. NADS-RA: A SCHEME FOR NETWORK ANOMALY DETECTION
In this article, we design a network anomaly detection scheme, NADS-RA, to decrease the FNR of rare anomalies. The implementation details of NADS-RA are illustrated in Fig. 1. Features are first extracted from the collected network packets captured by tcpdump; we validate NADS-RA on public benchmark datasets that include the extracted features. Then, all the features are pre-processed by feature encoding, reduction, normalization and representation. Afterwards, data augmentation is used to balance the training dataset. Finally, the CNN classifier is trained on the balanced dataset and then evaluated on newly arriving test data.

A. DATA PRE-PROCESSING

1) FEATURE ENCODING
Some discrete or symbolic features cannot be directly accepted as classifier input, for example, protocol (tcp, udp, http). We first encode the discrete features into numerical vectors using a One-Hot encoder. Each discrete feature is transformed into an N-bit vector that contains exactly one 1 and N-1 zeros, where N is the number of unique values of this feature. Thus, a discrete feature is transformed into a sequence of binary digits. For example, in Table 2, ''tcp'' is encoded into (1 0 0), ''udp'' is encoded into (0 1 0) and ''http'' is encoded into (0 0 1). In this case, one symbolic feature (''protocol'') is represented as three features (''f1'', ''f2'' and ''f3'').
If one feature contains too many unique values (a larger N), the encoded vectors are sparse, with many zeros and few ones, which will influence the convolution and optimization effect of the CNN. Besides, the feature dimension increases after One-Hot encoding. Take the example in Fig. 2: assuming that there are five discrete features with different numbers of unique values, the dimension of the encoded feature vectors obtained from the One-Hot encoder is 16. Obviously, the feature dimension is increased from 5 to 16. Therefore, the binary vectors are bundled to avoid this increase in feature dimension. We assume that each discrete feature has an equal weight and do not consider the order of the features. The bundling process then transforms every eight binary digits into one decimal value. Continuing the example in Fig. 2, the 16 binary digits obtained from One-Hot are transformed into two decimal values, so the feature dimension is reduced from 16 to 2. During bundling, we try to avoid splitting the binary vector of one discrete feature across two decimal values, so that the integrity of each feature is maintained.
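The encoding and bundling steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `one_hot` and `bundle`, the greedy packing order, the zero padding of unused bits, and the five feature cardinalities (3, 3, 2, 4, 4, chosen only so that they sum to 16) are all hypothetical.

```python
def one_hot(value, categories):
    """Return the one-hot bit list of `value` over the ordered `categories`."""
    return [1 if value == c else 0 for c in categories]


def bundle(bit_groups, width=8):
    """Greedily pack per-feature bit groups into `width`-bit integers.

    A group is never split across two integers: a group that does not fit
    into the remaining space starts a new integer, and unused trailing bits
    are left as zeros. Groups longer than `width` are rejected in this sketch.
    """
    packed, current = [], []
    for group in bit_groups:
        if len(group) > width:
            raise ValueError("a feature needs more bits than one bundle holds")
        if len(current) + len(group) > width:
            packed.append(current)
            current = []
        current = current + group
    if current:
        packed.append(current)
    # Pad each bundle with zeros on the right and read it as a binary number.
    return [int("".join(str(b) for b in bits + [0] * (width - len(bits))), 2)
            for bits in packed]


protocol_bits = one_hot("udp", ["tcp", "udp", "http"])  # (0 1 0) as in Table 2

# Hypothetical record: five discrete features with 3, 3, 2, 4 and 4 unique values.
groups = [[0, 1, 0], [1, 0, 0], [0, 1], [0, 0, 1, 0], [1, 0, 0, 0]]
pixels = bundle(groups)  # 16 bits become two decimal values: [81, 40]
```

Here the first three features fill one byte (01010001) and the last two fill the other (00101000), so no feature is split between the two decimal values.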

2) FEATURE REDUCTION
Later, to optimize the remaining continuous features, a feature filter is designed to remove the useless ones. As the dimensions of the features are different, the standard deviation is inappropriate for comparing the dispersion of features, so the coefficient of variation $C_v$, a classical statistic, is introduced; it is computed as in Equation 1,
$$C_v^{i} = \frac{\sigma_i}{\mu_i} \tag{1}$$
where $\sigma_i$ and $\mu_i$ are the standard deviation and mean of the $i$-th feature. Generally, a higher $C_v$ indicates higher dispersion, and a feature with a higher $C_v$ plays a more important role. In particular, when the mean $\mu_i$ of the $i$-th feature is zero, this feature is regarded as relatively unimportant.
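As an illustration of the feature filter, the following sketch ranks features by their coefficient of variation and keeps those above a cut-off. It rests on several assumptions that the text does not state: the population standard deviation is used, the absolute value of the mean is taken so that the score is non-negative, and the threshold value and feature names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(column):
    """C_v = sigma / |mu|; zero-mean features score 0 (treated as unimportant)."""
    mu = mean(column)
    if mu == 0:
        return 0.0
    # abs() is an assumption of this sketch, used to keep the score non-negative.
    return pstdev(column) / abs(mu)


def select_features(columns, threshold):
    """Keep the names of the features whose C_v exceeds `threshold`."""
    return [name for name, values in columns.items()
            if coefficient_of_variation(values) > threshold]


# Hypothetical continuous features (one list of values per feature).
columns = {
    "duration": [1, 2, 3, 4],
    "constant": [5, 5, 5, 5],
    "zero_mean": [-1, 1, -1, 1],
}
kept = select_features(columns, threshold=0.1)
```

With these toy values only `duration` survives: the constant feature has no dispersion, and the zero-mean feature is discarded by the special case described above.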

3) DATA NORMALIZATION
Normalization can eliminate the differences among data of diverse dimensions, so it is widely used in machine learning. Because features of different scales make the trained model unreliable, we normalize them into the same range. Rescaled min-max normalization is used in this article, as in Equation 2,
$$x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} \tag{2}$$
where $x_{max}$ and $x_{min}$ represent the maximum and minimum values of feature $x_i$, respectively, and $x_i$ and $x_i'$ represent the raw feature and the normalized feature, respectively. To avoid too many zeros within the feature matrix, we rescale the range of the normalized feature from $[0, 1]$ to $[a, 1]$, where $a \in (0, 1)$ is a predefined nonzero parameter. Afterwards, the minimum value of the normalized variable becomes $a$.
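One simple way to realize the rescaling to $[a, 1]$ is an affine map applied on top of min-max normalization. The sketch below assumes that form; the exact mapping, the default value of `a`, and the handling of constant features are assumptions rather than details given in the text.

```python
def rescale_min_max(values, a=0.1):
    """Map `values` linearly so that the minimum becomes `a` and the maximum 1."""
    if not 0 < a < 1:
        raise ValueError("a must lie in (0, 1)")
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale, so return the lower bound.
        return [a for _ in values]
    return [a + (1 - a) * (v - lo) / (hi - lo) for v in values]


scaled = rescale_min_max([0, 5, 10], a=0.2)  # approximately [0.2, 0.6, 1.0]
```

Because the smallest value maps to `a` instead of 0, the image built from these features contains no zero pixels that come from normalization alone.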

4) IMAGE REPRESENTATION
To learn the deep characteristics of traffic features automatically, we first convert traffic feature vectors into 2-dimensional (2D) pixel-based images and then construct 3-channel images [20]. A Re-circulation Pixel Permutation (RPP) strategy is designed as in Equation 3 to convert a long vector into a circulant matrix,
$$x_i' = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{iM} \\ x_{i2} & x_{i3} & \cdots & x_{i1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{iM} & x_{i1} & \cdots & x_{i(M-1)} \end{bmatrix} \tag{3}$$
where $x_i$ is the $i$-th sample, an original long vector with $M$ elements. $x_i'$ is obtained by moving every element $x_{ij}$ ($j = 1, 2, \cdots, M$) of $x_i$ one unit forward each time, and $x_i'$ is then used to represent the pixel values of the transformed image, whose dimension is $M \times M$. Afterwards, every pixel value is extended to an RGB pixel by scaling the value at the same position to a different percentage for each of the 3 channels.
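The RPP construction can be sketched as repeated left rotation of the feature vector. The function names are hypothetical, and the per-channel percentages in `to_three_channels` are placeholders, since the text does not give the actual values.

```python
def rpp_matrix(x):
    """Return the M x M circulant matrix whose k-th row is x rotated left k times."""
    m = len(x)
    return [x[k:] + x[:k] for k in range(m)]


def to_three_channels(matrix, weights=(1.0, 0.75, 0.5)):
    """Build 3 channels by scaling every pixel with a per-channel percentage.

    The weights are hypothetical; the source only states that each channel
    uses a different percentage of the same pixel value.
    """
    return [[[w * v for v in row] for row in matrix] for w in weights]


image = rpp_matrix([1, 2, 3])
# image == [[1, 2, 3],
#           [2, 3, 1],
#           [3, 1, 2]]
rgb = to_three_channels(image)  # 3 channels, each 3 x 3
```

Every row is a rotation of the original vector, so horizontally neighbouring pixels are always neighbouring features (with wrap-around), and each feature value appears once in every row and column.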
Compared with the representation approach that reshapes a long vector into a square matrix directly, RPP retains the original spatial structure of the sample, and every sub-image of the converted image consists of adjacent elements of the original vector.

B. DATA AUGMENTATION
From the perspective of data sampling, data augmentation aims to increase the number of minority samples so that the imbalance ratio becomes closer to 1. The imbalance ratio is defined in Equation 4,
$$\gamma_i = \frac{N_{min}^{i}}{N_{max}} \tag{4}$$
where $N_{min}^{i}$ and $N_{max}$ represent the number of samples of the $i$-th minority class and the number of samples of the largest majority class, respectively.
A dynamic data synthesis strategy is designed in this article with consideration of the data distribution. As shown in Fig. 3, it includes two roles, a generator and a discriminator. The generator first draws a random vector $z$ that obeys the distribution $z \sim p_z$ and produces generated data from it. The discriminator judges whether the generated data are real by comparing them with the real data, which obey the distribution $x \sim p_{data}$. If the discriminator judges the generated data to be fake, the generator refines its generating algorithm to make the generated data more similar to the real data, until the discriminator can no longer tell the generated data from the real data; the generated data are then output. To ensure that the distribution of the generated samples is close to that of the real samples, original samples of the same class are fed into the generation model in a batch every time. The generation model is executed multiple times dynamically until the number of generated samples is not less than that of the majority. The termination condition is $\gamma_i > r$, where $r$ is a threshold that controls the amount of generated samples.
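The imbalance ratio and the termination condition $\gamma_i > r$ translate directly into a count of how many synthetic samples each class needs. The sketch below assumes $0 < r < 1$ and uses hypothetical class counts; the function names are illustrative.

```python
from collections import Counter
from math import floor


def imbalance_ratios(labels):
    """gamma_i = (samples of class i) / (samples of the largest class)."""
    counts = Counter(labels)
    n_max = max(counts.values())
    return {cls: n / n_max for cls, n in counts.items()}


def samples_to_generate(labels, r):
    """Smallest number of synthetic samples per class so that gamma_i > r."""
    counts = Counter(labels)
    n_max = max(counts.values())
    target = floor(r * n_max) + 1  # smallest class size with size / n_max > r
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical label counts, for illustration only.
labels = ["normal"] * 1000 + ["dos"] * 600 + ["u2r"] * 50
ratios = imbalance_ratios(labels)            # u2r: 0.05, dos: 0.6, normal: 1.0
needed = samples_to_generate(labels, r=0.5)  # u2r: 451, dos: 0, normal: 0
```

With `r = 0.5`, only the rarest class is augmented: 451 generated samples raise its ratio from 0.05 to 0.501, just past the threshold.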
The data augmentation process is illustrated in Algorithm 1. It first trains the discriminator D for D_steps steps while the parameters of the generator G are fixed, and then trains the generator G for G_steps steps while the parameters of the discriminator D are fixed. In most cases, D_steps is greater than G_steps in order to obtain a better G. Finally, the generated data from G are output.
Algorithm 1 Data Augmentation Algorithm
1: Input: Fake sample: random noise data z (dim_z); Real sample: x (dim_x); Parameters: batch size (mb_size), training steps (G_steps, D_steps), training times (t_n); Loss function: least square loss function; Optimization solver: Adam optimizer;
2: for iteration in t_n do
3:   while i < D_steps do
4:     Train D in the unit of mb_size
5:     Minimize discriminator's loss function in Equation 5
6:   end while
7:   while j < G_steps do
8:     Train G in the unit of mb_size
9:     Minimize generator's loss function in Equation 6
10:   end while
11: end for
12: Output: generative data from G
The data augmentation strategy increases the proportion of the minority class in the dataset and tries to maintain the data distribution within the same class. The enlarged training dataset is nearly balanced and is used to train the classification model to perform the final anomaly detection task.
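To make the alternating schedule of Algorithm 1 concrete, the following toy sketch trains a one-dimensional generator and discriminator with least-squares losses. It is deliberately simplified and is not the authors' model: both networks are single linear units, plain gradient descent replaces the Adam optimizer, the target labels 1 (real) and 0 (fake) follow the common LSGAN coding and are an assumption here because Equations 5 and 6 are not reproduced in this text, and the data are hypothetical.

```python
import random


def d_loss(d_real, d_fake):
    """Least-squares discriminator loss: real scores -> 1, fake scores -> 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def g_loss(d_fake):
    """Least-squares generator loss: fake scores are pushed towards 1."""
    return 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def train_toy_lsgan(real, t_n=200, d_steps=2, g_steps=1, mb_size=16,
                    lr=0.01, seed=0):
    """Alternate D and G updates on 1-D data with linear D and G."""
    rng = random.Random(seed)
    wd, bd = 0.1, 0.0  # discriminator D(x) = wd * x + bd
    wg, bg = 1.0, 0.0  # generator G(z) = wg * z + bg
    for _ in range(t_n):
        for _ in range(d_steps):  # G is fixed while D is trained
            xs = rng.choices(real, k=mb_size)
            gs = [wg * rng.gauss(0.0, 1.0) + bg for _ in range(mb_size)]
            err_real = [wd * x + bd - 1.0 for x in xs]
            err_fake = [wd * g + bd for g in gs]
            # Gradients of d_loss with respect to wd and bd.
            grad_w = (sum(e * x for e, x in zip(err_real, xs))
                      + sum(e * g for e, g in zip(err_fake, gs))) / mb_size
            grad_b = (sum(err_real) + sum(err_fake)) / mb_size
            wd, bd = wd - lr * grad_w, bd - lr * grad_b
        for _ in range(g_steps):  # D is fixed while G is trained
            zs = [rng.gauss(0.0, 1.0) for _ in range(mb_size)]
            errs = [wd * (wg * z + bg) + bd - 1.0 for z in zs]
            # Gradients of g_loss with respect to wg and bg (chain rule via D).
            grad_w = sum(e * wd * z for e, z in zip(errs, zs)) / mb_size
            grad_b = sum(e * wd for e in errs) / mb_size
            wg, bg = wg - lr * grad_w, bg - lr * grad_b
    return (wg, bg), (wd, bd)


# Hypothetical rare-class feature values, only to exercise the loop.
data_rng = random.Random(1)
rare = [0.7 + 0.05 * data_rng.gauss(0.0, 1.0) for _ in range(64)]
(gen_w, gen_b), (dis_w, dis_b) = train_toy_lsgan(rare)
synthetic = [gen_w * data_rng.gauss(0.0, 1.0) + gen_b for _ in range(10)]
```

The sketch only shows the control flow (D_steps discriminator updates, then G_steps generator updates, repeated t_n times); a linear generator cannot reproduce the distribution of image data, so a practical version would use deep networks over the RPP images.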

C. CLASSIFICATION MODEL
After feature representation and data augmentation, a classification model is trained on the balanced training dataset and further used to validate NADS-RA. The Convolutional Neural Network (CNN), a type of deep learning algorithm, has achieved great classification performance by learning the spatial knowledge of images. This article uses a CNN to extract the spatial characteristics of network traffic features and then compares it with other methods. The architecture of a commonly used CNN is shown in Fig. 4. A complete CNN model contains multiple convolution layers and pooling layers. In Fig. 4, a convolution layer mines the local spatial knowledge using a moving convolution kernel whose size is set to 5 × 5, and a pooling layer reduces the dimension of the images using a pooling kernel whose size is set to 2 × 2. By repeating the convolution and pooling operations, the spatial knowledge is obtained through layer-by-layer processing. Finally, one or more fully connected layers accept the learned knowledge for decision making, and the predicted label is output.
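The effect of one convolution layer followed by one pooling layer on the image size can be illustrated with a small pure-Python sketch. Stride 1, no padding, max pooling, and square single-channel inputs are assumptions of this sketch; the text only fixes the 5 × 5 and 2 × 2 kernel sizes.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over a square image (stride 1, no padding).

    As in most CNN libraries, this is a cross-correlation (no kernel flip).
    """
    k = len(kernel)
    out = len(image) - k + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(k) for v in range(k))
             for j in range(out)]
            for i in range(out)]


def max_pool(image, size=2):
    """Non-overlapping max pooling with a size x size window."""
    out = len(image) // size
    return [[max(image[i * size + u][j * size + v]
                 for u in range(size) for v in range(size))
             for j in range(out)]
            for i in range(out)]


image = [[1] * 12 for _ in range(12)]          # hypothetical 12 x 12 input
kernel = [[1] * 5 for _ in range(5)]           # 5 x 5 kernel
feature_map = conv2d_valid(image, kernel)      # 8 x 8
pooled = max_pool(feature_map, size=2)         # 4 x 4
```

Under these assumptions an M × M RPP image shrinks to (M - 4) × (M - 4) after the convolution and to half of that after pooling, which bounds how many convolution and pooling stages a given feature count can support.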

D. OVERALL WORKFLOW
The execution of NADS-RA is illustrated in Algorithm 2. There are four main steps: data preprocessing, feature representation, data augmentation and classification model training. Data preprocessing includes feature encoding, feature reduction and normalization. Feature representation is performed on the training, validation and test datasets. If the imbalance ratio indicates that the training dataset is imbalanced, data augmentation is executed as in Algorithm 1 to generate synthetic data. By mixing the original training dataset with the synthetic data, a balanced dataset is constructed to train the classification model.
5: $x'(1, :) = x = [x_1, x_2, \cdots, x_n]$
6: for $1 \le i \le n$ do
7:   $x'(i + 1, :) = [x_{i+1}, x_{i+2}, \cdots, x_n, x_1, \cdots, x_i]$
8: end for
9: end while
10: An image dataset D' is represented.
11: Augmentation: Calculate the imbalance ratio: $\gamma_i$;
12: if $\gamma_i < 1$ then
13:   perform Data augmentation Algorithm 1 → produced samples
14:   update training data ← training data + produced samples
15:
optimizer is utilized, learning rate is 1e-3 and cross entropy is used as the cost function. All datasets used in this article are collected from different real-world application scenarios, and they have been labeled as normal and abnormal or with a specific attack type. Except for NSL-KDD and UNSW-NB15, which have already been divided into a training set and a test set, 80% of each of the other datasets is selected randomly for training, and the remainder is used for testing. Unless otherwise stated, the experimental results are averaged over 100 groups of experiments.

1) DATASETS
We evaluate NADS-RA on five public datasets. NSL-KDD [4] and UNSW-NB15 [21] are two well-known network datasets and are used as the main datasets. JM1 and PC5 [23] are two software defect detection datasets; together with the Credit card [22] dataset, they are used to validate the general applicability of NADS-RA in other scenarios.

a: NSL-KDD DATASET
There are four subsets in NSL-KDD [4], namely KDDTrain+, KDDTrain+_20Percent, KDDTest+ and KDDTest-21. There are 41 features and 5 labels, comprising one normal type and four attack types: Denial of Service (DoS), Probe, User-to-Root (U2R) and Remote-to-Local (R2L). As shown in Table 3, normal traffic accounts for more than half, but U2R and R2L account for only 0.04 and 0.79 percent of the KDDTrain+ set.
Fig. 5, it can be found that NSL-KDD and these four datasets all have serious class imbalance problems. Therefore, they are all used to evaluate the effectiveness of NADS-RA. NSL-KDD and UNSW-NB15 are the main datasets, and the other three datasets are used to evaluate the general applicability of NADS-RA.

2) METRICS
A good anomaly detection approach requires a high true rate as well as a low false rate. The metrics are calculated using the confusion matrix in Table 4, where TP (True Positive) and TN (True Negative) are the numbers of positive instances (Anomaly) and negative instances (Normal) that are correctly classified, and FN (False Negative) and FP (False Positive) are the numbers of positive instances and negative instances that are incorrectly classified.
We evaluate the performance of our approach by the following metrics: Precision, Recall, F1, False Positive Rate (FPR), False Negative Rate (FNR), Accuracy, Gmean and AUC. We report nearly all of these metric values since it is widely agreed that accuracy alone is unable to provide an accurate evaluation of classification performance, especially for imbalanced datasets.
Precision is the ratio of true positive samples to the samples that are labeled by the system as positive. It represents the confidence of retrieval. Thus, it should be as high as possible.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
Recall, also called Detection Rate (DR), is the ratio of true positive samples to the real positive samples. It represents the completeness of retrieval, and it is a core metric commonly used to measure the quality of the anomaly detection under consideration. Thus, it should be as high as possible.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
F1 is defined as the harmonic mean of Precision and Recall. It represents a synthesis of the performance of retrieval. A higher value of F1 indicates that the approach performs better on both Recall and Precision. Thus, it should be as high as possible.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
False Negative Rate (FNR) is the ratio of false negative samples to the real positive samples. It represents the inability to detect the real positives. If this value is high, real attacks will be missed, which exposes the system to malicious users and puts it in a dangerous state. Thus, it should be as low as possible.
$$\mathrm{FNR} = \frac{FN}{TP + FN} \tag{10}$$
False Positive Rate (FPR), also termed False Alarm Rate (FAR), is the ratio of false positive samples to the real negative samples. If this value is consistently elevated, the security analysis operator will intentionally disregard the system warnings, which puts the system in a dangerous state [38]. Thus, it should be as low as possible.
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{11}$$
Accuracy is the most used metric from the overall view. It is the ratio of correctly classified samples to the total samples. It represents the confidence of the classification. Thus, it should be as high as possible.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$
Area under Curve (AUC) reflects the ability to avoid false classification. It can be approximately seen as the arithmetic mean of DR (Recall) and TNR (1 - FPR), as in Equation 13, and it represents a good compromise between the DR (Recall) and FPR metrics [39]. It is effective in measuring the performance of classifiers for imbalanced data [40]. Thus, it should be as high as possible.
$$\mathrm{AUC} = \frac{\mathrm{Recall} + (1 - \mathrm{FPR})}{2} \tag{13}$$
Gmean is the geometric mean of sensitivity and specificity, where $\mathrm{sensitivity} = \frac{TP}{TP + FN}$ and $\mathrm{specificity} = \frac{TN}{TN + FP}$. It can also be seen as a comprehensive measurement of Recall and FPR, so it should be as high as possible.
$\mathrm{Gmean} = \sqrt{\mathrm{sensitivity} \times \mathrm{specificity}}$ (14)
In the binary-classification task, these metrics are used directly. In the multi-classification task, the overall value of each metric is computed as a weighted average over the classes, so as to judge the overall effectiveness of multi-type attack detection comprehensively.
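As an illustration, the metric definitions above can be computed from the four confusion-matrix counts. The following is a minimal sketch, not the evaluation code used in this article: the function names `binary_metrics` and `weighted_average` are hypothetical, and weighting each class by its number of test samples is an assumption, since the text only states that a weighted average is used.

```python
import math


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics of this section from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)          # Detection Rate (DR), sensitivity
    specificity = _ratio(tn, tn + fp)     # True Negative Rate (TNR)
    fpr = _ratio(fp, tn + fp)             # False Alarm Rate (FAR)
    fnr = _ratio(fn, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": fpr,
        "fnr": fnr,
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        # Approximation of Equation 13: mean of DR and TNR = 1 - FPR.
        "auc_approx": (recall + (1.0 - fpr)) / 2.0,
        # Equation 14: geometric mean of sensitivity and specificity.
        "gmean": math.sqrt(recall * specificity),
    }


def weighted_average(per_class_values, per_class_support):
    """Assumed multi-class aggregation: weight each class by its sample count."""
    total = sum(per_class_support)
    return sum(v * s for v, s in zip(per_class_values, per_class_support)) / total


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these hypothetical counts, Recall is 0.8 and FPR is 0.1, so the approximate AUC is 0.85 and Gmean is the square root of 0.72.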

3) OUTLINES
We conduct four groups of experiments in total, as shown in Fig. 6.
Experiment 1: After feature representation, the raw data are represented as images (marked as Image Dataset1). Because the number of samples of each specific attack and that of the normal class are imbalanced, we label all attacks as Anomaly to avoid the influence of imbalance on the representation. A binary-classification task that identifies anomalies among normal traffic is then used to validate the effectiveness of the representation strategy of NADS-RA. The experiments include comparisons with state-of-the-art representation methods and with different detection algorithms.
Experiment 2: Detecting multiple types of attack simultaneously requires a multi-classification task. To improve the detection accuracy of rare anomalies in the raw imbalanced dataset, the imbalanced Image Dataset1 is rebuilt with the data augmentation strategy of NADS-RA, and the resulting balanced dataset is marked as Balanced Dataset2. Balanced Dataset2 is used to evaluate the effectiveness of augmentation through a multi-classification task that classifies the known attacks. The experiments include comparisons with state-of-the-art methods, with the non-augmented method, with different data synthesis methods, and across different mixing ratios of the training set.
Experiment 3: NADS-RA focuses on network anomaly detection, and its effectiveness is validated on two public network datasets, namely NSL-KDD and UNSW-NB15. In addition, we apply it to other scenarios, such as software defect detection and credit card fraud detection, to evaluate its general applicability.
Experiment 4: Statistical significance tests are conducted to compare the performance of the various approaches on multiple datasets.

B. REPRESENTATION ANALYSIS
We first construct the image datasets from the raw NSL-KDD [4] and UNSW-NB15 [21] datasets using the feature representation strategy of NADS-RA, and then treat anomaly detection as a binary classification problem on the image datasets. The comparison includes two experiments: comparing with other representation methods, and comparing with different detection algorithms.

1) COMPARING WITH OTHER REPRESENTATION METHODS
We compare our approach with results reported in other studies. Among them, supervised methods, including convolutional neural networks (CNN) [16], [17] and deep neural networks (DNN) [41], and unsupervised methods, including clustering [42]-[46], are state-of-the-art. CNN and DNN are recent methods based on feature representation. Our approach represents the original feature vectors as pixel-based images while retaining their spatial knowledge. Other deep learning methods can also represent feature vectors as images, but they cannot maintain the spatial knowledge and feature unity. We implement the baseline methods according to the descriptions in the respective papers [16], [17], [41]-[46] and compare NADS-RA with them using the following metrics: Accuracy, Precision, Recall, F1, Gmean, FPR, FNR, and AUC.
The results on NSL-KDD and UNSW-NB15 are shown in Tables 5 and 6. The overall classification measurements of our approach are relatively better than those of the other methods. Although previous methods [16], [17] also use a CNN as the classifier, they encode all features [16] or only the symbolic features [17] with a One-Hot encoder and feed the encoded vectors directly to the network. The resulting sparse vectors contain too many zeros, which degrades optimization and the effect of convolution. In contrast, we encode only the symbolic features with One-Hot and then bundle the binary bits into a decimal value, which alleviates the influence of the many zeros on optimization and convolution. In particular, on the test−21 set, the method of [16] obtains the highest Accuracy, Recall, and F1 and the lowest FNR. However, its FPR is 0.998. Taken together, its measurements indicate over-fitting: nearly all test samples are detected as anomalies, which yields biased results. Because anomaly samples account for more than 80% of this test set, its accuracy is still about 0.816. Hence, these biased results do not reflect generalization ability. The clustering methods [42]-[46], which do not involve representation, obtain almost the lowest results, which further suggests the superior performance of our approach.
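The effect described for the method of [16] can be reproduced with simple arithmetic. The sketch below uses hypothetical counts (820 anomalies and 180 normal samples, chosen only so that anomalies exceed 80% of the set) and a degenerate detector that labels every sample as an anomaly. It is not a model of the method in [16].

```python
# Hypothetical test set: anomalies are the positive class.
n_anomaly, n_normal = 820, 180

# Degenerate detector: every sample is predicted as an anomaly.
tp, fn = n_anomaly, 0   # all anomalies are caught
fp, tn = n_normal, 0    # all normal samples raise a false alarm

accuracy = (tp + tn) / (n_anomaly + n_normal)
recall = tp / (tp + fn)
fnr = fn / (tp + fn)
fpr = fp / (fp + tn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} fnr={fnr:.3f} fpr={fpr:.3f}")
```

Under these assumed counts, Recall is perfect and FNR is zero, yet the accuracy merely equals the share of anomalies in the test set and every normal sample is a false alarm, which is why a high Recall combined with an FPR near 1 should be read as a biased result.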
We additionally use the full NSL-KDD and UNSW-NB15 datasets to evaluate the generalization ability of NADS-RA. The Receiver Operating Characteristic (ROC) curves of the 5-fold cross-validation test are shown in Fig. 7. The detection results of the five folds are close to one another, which indicates that NADS-RA generalizes well.
Considering all metrics, we conclude that our approach has clear advantages in feature representation. It maintains the spatial knowledge of the original feature vectors, which in turn contributes to training an effective CNN classifier. The comparison with results from other state-of-the-art works further demonstrates its superior performance.

2) COMPARING WITH DIFFERENT TRADITIONAL MACHINE LEARNING ALGORITHMS
NADS-RA trains ResNet50 on the represented images and then detects anomalies. To assess the general adaptability of the feature representation method, various detection algorithms are tested on the NSL-KDD dataset. Since the representation methods combined with state-of-the-art deep learning classifiers were compared in the previous subsection, here we implement different traditional machine learning classifiers and evaluate their performance. Fig. 8 compares the experimental results obtained on the two test sets of the NSL-KDD dataset.
The X-axis lists seven approaches: Support Vector Machine (SVM), k-Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), and NADS-RA. The Y-axis shows seven metrics. Our approach yields the highest AUC, Accuracy, Precision, Recall, and F1 and the lowest FNR, while its FPR is the third lowest. When fitting data, deep learning models can extract more complex features than traditional machine learning models and mine the hidden characteristics of the samples. Hence, deep learning models have better representation ability than shallow learning models [9]. Considering all metrics used in these experiments, our approach performs better overall than the traditional machine learning algorithms.

C. AUGMENTATION ANALYSIS
To validate the data augmentation effect of NADS-RA, we conduct four groups of multi-classification experiments on two real imbalanced datasets, namely NSL-KDD [4] and UNSW-NB15 [21]. They comprise a comparison with state-of-the-art works, an evaluation of the necessity of data augmentation, a comparison of different data augmentation methods, and a comparison of different mixing ratios of the training set.

1) COMPARING WITH OTHER WORKS
We first compare our approach with other state-of-the-art works. Tables 7 and 8 show the multi-classification results on the NSL-KDD full set and the UNSW-NB15 dataset. Since the reported works use only some of the metrics, we present the same metrics here. The data augmentation strategy of NADS-RA creates new, similar samples and injects them into the original training set, so that the characteristics of rare anomalies become more evident while the distribution stays the same or nearly the same. In contrast, clustering [42] and deep learning methods [18], [47] do not involve data augmentation, so the detection model cannot learn enough characteristics from the original rare data. The sNDAE method proposed in [47] obtains the best Precision and F1, but its FPR is more than 14.6%, whereas ours is only 1.1%. Consequently, our approach works well in learning knowledge from the rare anomalies. The comparison results show that it has an obvious advantage in overall classification performance.

2) COMPARING WITH NON-AUGMENTATION METHOD
To confirm the necessity of augmentation for the detection method, this subsection trains the same classifier on the augmented balanced dataset and on the raw imbalanced dataset (marked as ''After augmentation'' and ''Before augmentation'', respectively) using the NSL-KDD [4] and UNSW-NB15 [21] datasets. The multi-classification results are examined in terms of AUC, Accuracy, Precision, Recall, F1, FNR, and FPR. Fig. 9 plots the results obtained on the two test sets of NSL-KDD and on the UNSW-NB15 dataset. Red stars and blue circles indicate the results of NADS-RA trained after augmentation and before augmentation, respectively. For the experiment on UNSW-NB15, 80% of the full set is selected for training, and the remainder is used for testing. On all test sets, the AUC, Accuracy, Precision, Recall, and F1 values improve, and the FNR and FPR values both decrease. These trends demonstrate that data augmentation is effective in improving the true rates and decreasing the false rates compared with the non-augmented method. Therefore, it is necessary to augment the imbalanced training dataset to obtain a better detection result.
The detailed detection results for each class of NSL-KDD and UNSW-NB15 are shown in Table 9. Each cell contains two values separated by ''/'', which indicate the values obtained after and before augmentation, respectively. For all classes, and especially for the attacks, the global metrics, such as F1, AUC, and Accuracy, improve after augmentation. The most obvious changes for NSL-KDD appear in the first three rows, which give the detection results for the ''U2R'', ''R2L'', and ''Probe'' attacks. Their FNR decreases by 14.5%, 29.4%, and 15.2%, respectively, on both test sets, while the other classes are almost unaffected. This can be explained by the data augmentation, which increases the proportion of ''U2R'' and ''R2L'' in the training set without information loss and simultaneously clarifies the distribution of the rare classes, which helps the classification model learn their characteristics during training.
From a global view of all metrics on UNSW-NB15, the effectiveness of augmentation is most evident for the detection of the ''Analysis'' and ''Worms'' attacks. ''Analysis'' has nearly the lowest F1 and nearly the highest FNR before augmentation; after augmentation, its F1 improves by nearly 60 percent and its FNR decreases by nearly 50 percent. ''Worms'' has the lowest F1 and the highest FNR before augmentation; after augmentation, its F1 improves by 76 percent and its FNR decreases by nearly 61 percent.
In all, the data augmentation of NADS-RA is necessary for detecting rare anomalies, and it is promising for alleviating the influence of imbalance on the FNR of rare anomaly detection.

3) COMPARING WITH DIFFERENT DATA SYNTHESIS METHODS
The effectiveness of data augmentation depends highly on the quality of the synthetic data, so we evaluate the quality of the data produced by different data synthesis methods. Imbalanced-learn is a common Python package that offers a number of re-sampling techniques for datasets with strong between-class imbalance [35]. We use these re-sampling techniques as the baseline methods. They fall into three categories: over-sampling, under-sampling, and hybrid methods.
• Over-sampling techniques generate more samples that are similar to the minority data in order to increase the proportion of the minority class(es). They include five classical methods: Random minority Over-Sampling (ROS), Synthetic Minority Over-sampling Technique (SMOTE) [12], Borderline SMOTE (bSMOTE) [50], SMOTE for Nominal and Continuous features (SMOTE-NC), and Adaptive Synthetic sampling (ADASYN) [51].
• Hybrid-sampling techniques combine over-sampling and under-sampling: they generate more samples that are similar to the minority class and simultaneously discard part of the majority class(es) to balance the proportions of both. They include two classical methods [54]: SMOTEtomek (SMOTE + Tomek) and SMOTEenn (SMOTE + ENN).
The quality of the synthetic data is measured by seven metrics: Precision, Recall, F1, FNR, FPR, AUC, and Accuracy. Comparison results on the test+ and test−21 sets of NSL-KDD are shown in Table 10, where method names beginning with ''O_'', ''U_'', and ''H_'' indicate over-sampling, under-sampling, and hybrid-sampling methods, respectively. For each group of comparison results, the best metric value in each column is marked in bold. Generally, the higher the true rates and the lower the false rates, the better the quality of the generated data. Considering all metric values together, most of the data re-sampling methods cannot maintain high true rates and low false rates simultaneously, because the imbalance problem is not well solved.
Most of the over-sampling methods duplicate the original minority samples or produce new samples according to distance [12], which ignores the data distribution, so the generated data can blur the inter-class margin. In contrast, we produce data with the help of LSGAN, which learns the distribution of the minority samples and then generates similar samples that follow the same or a similar distribution. Therefore, the augmented training dataset constructed by our augmentation strategy is more effective than those of the other over-sampling methods.
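To make the contrast concrete, the two baseline ideas mentioned above, duplication and distance-based interpolation, can be sketched in a few lines. This is a simplified illustration under stated assumptions: `random_over_sample` and `smote_like` are hypothetical helper names, the code is neither the imbalanced-learn implementation nor the LSGAN-based strategy of NADS-RA, and `smote_like` assumes at least two minority samples.

```python
import random


def random_over_sample(minority, n_new, rng):
    """ROS-style: duplicate minority samples chosen uniformly at random."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style: interpolate between a sample and a near minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)
# Hypothetical two-feature minority samples, for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
duplicates = random_over_sample(minority, 3, rng)
interpolated = smote_like(minority, 3, k=2, rng=rng)
```

Both helpers only look at existing points and their pairwise distances: duplication adds no new information, and interpolation places new points on line segments between neighbours regardless of how the class is actually distributed, which is the limitation that the distribution-learning approach is meant to address.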
Most of the under-sampling and hybrid-sampling methods address the imbalanced training set by randomly discarding part of the majority class samples. They shrink the training set, and thus reduce training time and resource consumption, but they ignore the distribution and may lose characteristic information that is useful for the majority class [8]. In contrast, we keep all the original samples of the training set and avoid information loss. Furthermore, we insert similar samples into the raw training set to enhance the characteristics of the rare samples. Although the FPR of the U_IHT method is lower than ours, its FNR is the worst, and it exhibits over-fitting. Our FPR is 0.126 and 0.098 on the two test sets, both the second best among all methods. Hence, our augmentation strategy can be regarded as the best method overall.
Overall, the metric values on the test−21 set are lower than those on the test+ set, because the test−21 set contains many unknown attacks that do not occur in the test+ set, which makes anomaly detection more difficult. In conclusion, NADS-RA outperforms the other data re-sampling methods.

4) COMPARING DIFFERENT MIXING RATIOS OF AUGMENTED DATASET
After producing high-quality samples, we examine how to control the proportion of each class in the augmented training set. According to Table 3, there are five classes in the NSL-KDD dataset, and U2R and R2L account for 0.04% and 0.79% of the raw training set, respectively. The main challenge faced in the experiments of state-of-the-art works [18], [42], [47] is the low detection rate of U2R and R2L. In an ideal balanced set, every class accounts for the same proportion, which is 20% per class in NSL-KDD. We keep the proportions of U2R and R2L equal and increase them from their original proportion of less than 1% up to 30%. AUC and ROC curves have been shown to be effective for evaluating overall classification performance on imbalanced datasets [40], so we use the average AUC obtained from 100 groups of experiments. Fig. 10 shows the ROC curves on the NSL-KDD test+ and test−21 sets. The AUC is lowest for the raw imbalanced training set, and it improves on all augmented training sets. In general, the AUC tends to increase as the proportion increases. Comparing the AUC for different proportions of U2R and R2L shows that a proportion of 20% for each class yields more stable and effective classification performance. The classification details of each class are shown in Figs. 11 and 12. For U2R and R2L detection, the AUC improves on the augmented training set, while that of the other classes remains almost unchanged or increases. This suggests that the generated rare data not only enhance the characteristics of the rare classes but also help improve the classifier's global learning ability. Therefore, the augmented training set helps improve the detection rate of rare anomalies, and a balanced training set appears more promising for training a globally effective classifier than an imbalanced one.
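The mixing ratio translates directly into the number of synthetic samples that must be generated. The sketch below works this out under two assumptions that are not taken from this section: the function name `samples_to_add` is hypothetical, and the class counts are the commonly cited sizes of the NSL-KDD training set, which are consistent with the 0.04% and 0.79% quoted above. Only the rare classes are enlarged; all original samples are kept.

```python
def samples_to_add(class_counts, rare_classes, target_fraction):
    """Synthetic samples needed so that each rare class reaches target_fraction
    of the augmented training set, keeping every original sample."""
    n_rare = len(rare_classes)
    if not 0 < n_rare * target_fraction < 1:
        raise ValueError("rare classes together must take a share between 0 and 1")
    others = sum(c for name, c in class_counts.items() if name not in rare_classes)
    # The untouched classes fill the remaining share of the augmented set.
    total_after = others / (1 - n_rare * target_fraction)
    per_rare = round(total_after * target_fraction)
    return {name: max(0, per_rare - class_counts[name]) for name in rare_classes}


# Assumed class counts of the raw training set (not taken from this section).
train_counts = {"Normal": 67343, "DoS": 45927, "Probe": 11656, "R2L": 995, "U2R": 52}
needed = samples_to_add(train_counts, ["U2R", "R2L"], target_fraction=0.20)
print(needed)
```

Under these assumed counts, bringing U2R and R2L to 20% each requires roughly 41,000 generated samples per class, which also illustrates why the enlarged training set increases the training cost.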

D. GENERAL APPLICABILITY ANALYSIS
To demonstrate the general applicability of NADS-RA, we also apply it to two other scenarios: credit card fraud detection and software defect detection. Three public benchmark datasets are used: one credit dataset [22] and two software defect datasets (JM1 and PC5) from the NASA MDP project [23]. In each case, 30% of the dataset is selected as test samples and the remainder is used as training samples. To reflect the average classification effect, we run 100 groups of experiments with randomly sampled test sets and present the average results.

1) CASE STUDY 1: CREDIT CARD FRAUD DETECTION
In this case, a binary-classification task is formulated to identify fraudulent transactions among legitimate ones. Since Recall, F1, Gmean, and FPR are used by the state-of-the-art methods, these four metrics are used to evaluate the effectiveness of NADS-RA. An ideal fraud detection system should precisely identify fraudulent transactions and prevent financial loss, while at the same time reducing the number of false positive transactions, which require costly human review. Table 11 lists the comparison results of different detection methods before and after augmentation (marked as ''Method'' and ''Method+'', respectively), which show the superior performance of augmentation. Compared with the over-sampling method used in [55], we additionally represent the original feature vectors as images, which contributes to better learning of the hidden spatial knowledge. Therefore, NADS-RA provides promising support for credit card fraud detection.

2) CASE STUDY 2: SOFTWARE DEFECT DETECTION
Since Recall, F1, Gmean, and FPR are used by the state-of-the-art methods, we present these metric values obtained on the JM1 and PC5 datasets in Table 12. The detection results obtained before and after augmentation are marked as ''Method'' and ''Method+'', respectively. A high Recall ensures that defects are detected accurately, and a low FPR requires fewer human investigators. Taking these metrics together, our method has an advantage over the others. Although the method of [56] obtains a high Recall and F1, its FPR is so high that it would require many human resources. According to the statistics in [57], the accuracy is less than 50% when there is no repeated data in JM1. In contrast, on the same dataset without repeated modules, the accuracy of NADS-RA is about 67%. Overall, the comparison results suggest that NADS-RA has general adaptability and applicability in different scenarios. Hence, NADS-RA has great potential to be applied in other security fields.

E. SIGNIFICANCE TEST ANALYSIS
To substantiate our results, statistical significance tests are conducted to compare the performance of the various approaches on multiple datasets. The Friedman test and the post-hoc Nemenyi test are used to analyze whether our approach differs from the others in a statistically significant way. Table 13 lists the AUC values of SVM, RF, DT, and our approach on the NSL-KDD, credit card, JM1, and PC5 datasets. In the Friedman test, the null hypothesis (the performances of all approaches are equivalent) is rejected at α = 0.05, since the p-value is 0.0194. This result indicates that the performances of the compared approaches are not all equivalent.
A post-hoc test is then needed to measure how significant the performance differences among the considered approaches are, and the Nemenyi test is adopted. The critical difference (CD) computed at α = 0.05 is 2.3452. For the AUC metric, the Friedman average ranks of SVM, RF, DT, and ours are 3.75, 2.25, 3, and 1, respectively. The lower the rank, the better the performance of the approach. In Table 13, the best value is indicated in bold. Our approach ranks best among the benchmark approaches, so it is chosen as the control algorithm against which the remaining approaches are compared. Of the rank differences SVM-ours, RF-ours, and DT-ours, the first is larger than the CD and the latter two are smaller. It can therefore be accepted at a confidence level of 0.95 that SVM is statistically different from ours, whereas RF and DT show no statistically significant difference from ours in terms of AUC, even though our method wins on most of the datasets. This suggests that deep learning classifiers and ensemble classifiers are more powerful for network anomaly detection on big data.
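The reported statistics can be cross-checked from the average ranks alone. The following sketch assumes the standard chi-square form of the Friedman statistic and the usual Nemenyi critical-difference formula; the tabulated critical value 2.569 for four approaches at α = 0.05 is an assumption taken from standard tables and is not stated in this article, and the function names are hypothetical.

```python
import math


def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference of the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


ranks = {"SVM": 3.75, "RF": 2.25, "DT": 3.0, "Ours": 1.0}  # average ranks from the text
N_DATASETS = 4       # NSL-KDD, credit card, JM1, PC5
Q_ALPHA_K4 = 2.569   # assumed tabulated value for k = 4 approaches, alpha = 0.05

chi2 = friedman_chi2(list(ranks.values()), N_DATASETS)
p_value = chi2_sf_df3(chi2)  # k - 1 = 3 degrees of freedom
cd = nemenyi_cd(Q_ALPHA_K4, len(ranks), N_DATASETS)
differs_from_ours = {
    name: (rank - ranks["Ours"]) > cd for name, rank in ranks.items() if name != "Ours"
}
print(f"chi2={chi2:.2f} p={p_value:.4f} CD={cd:.4f} {differs_from_ours}")
```

Under these assumptions the statistic is 9.9, the p-value and CD agree with the reported 0.0194 and 2.3452, and only the SVM rank difference of 2.75 exceeds the CD, while those of RF (1.25) and DT (2.0) do not.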

F. DISCUSSION
Compared with the state-of-the-art works, we obtain better results in imbalanced network anomaly detection. The AUC of our representation strategy is improved by an average of 10 percent compared with 12 detection methods, because we combine the advantages of feature representation and data augmentation and propose an image-based data augmentation strategy for network data. The existing feature representation methods either disrupt the feature unity or lose the spatial knowledge of some adjacent features, so classifiers trained on the resulting images do not perform well because of the information loss. In addition, our AUC is improved by at least 10 percent compared with 17 data generation methods, because conventional data generation methods produce data according to distance or density, which disrupts the distribution of the original data and can even blur the margin between classes. In contrast, we use the Re-circulation Pixel Permutation (RPP) strategy, which retains the feature unity by bundling the discrete features and keeping the original continuous features. It not only maintains the spatial structure of the raw features but also enhances the spatial knowledge of adjacent features. Furthermore, with the help of LSGAN's ability to learn the data distribution, we produce augmented image data to enrich the rare classes, which improves their detection rate and avoids over-fitting. Therefore, our superior performance can be explained by the fact that we maintain the spatial features within each sample and also preserve the distribution characteristics of the rare classes, so that the augmented training set trains an effective classifier.
Meanwhile, this work has limitations that cannot be ignored. The larger represented images and the enlarged training set may require more training time and resources. Although we have explored training sets with different proportions of rare classes, this was only to find the optimal mixing ratio; a refined training set and a fast training process are still needed, and they should be combined with online incremental learning in big data environments. We will address this problem in future work.

A. SUMMARY
In summary, we study how to represent network traffic features as images and how to balance an imbalanced training dataset in order to improve the classification accuracy of rare anomalies. The proposed NADS-RA produces augmented data based on feature images, which maintains the spatial knowledge between features and preserves the data distribution of each class. Experiments on five public benchmark datasets, including NSL-KDD and UNSW-NB15, show that the design of NADS-RA is in good agreement with the experimental observations and illustrate the advantages of feature representation and data augmentation. These two components contribute to learning the high-level characteristics and hidden knowledge of the data, which makes the classifier more powerful. Overall, NADS-RA opens opportunities for improving imbalanced classification outside the image-processing area and provides a general deep-learning-based detection scheme for imbalanced classification in different scenarios.

B. FUTURE WORK
Our current work focuses on over-sampling each class of rare data to balance the imbalanced training dataset. In the future, we will study a more intelligent data generation method that maintains the intra-class distribution and the inter-class margin and produces multiple classes of data simultaneously, as well as a fast training model to address the additional training time and computing resources required by the enlarged training set.
WEIYOU LIU received the B.E. degree in communication engineering from the Changchun University of Science and Technology, China, in 2018, where he is currently pursuing the master's degree in computer science and technology. His research interests include network security, big data, information security, anomaly detection, and artificial intelligence.
HUI QI received the Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2015. He is currently an Associate Professor and a Master Student Supervisor with the Changchun University of Science and Technology. His research interests include network security, access control, and vehicular networks.