A Standardized ICS Network Data Processing Flow With Generative Model in Anomaly Detection

Industrial control systems (ICS) are now commonly connected to wireless sensor networks and the Internet, which exposes them to security threats from cyber-attacks. Detecting such attacks is a non-trivial task, and the high-dimensional network data pose significant challenges for security anomaly detection. In this work, we propose a network flow data processing method that makes complex network data more standardized and unified to assist security anomaly detection. A data generation method is then applied to obtain enough training data, and we propose an evaluation method for the generated data. Finally, a bidirectional recurrent neural network with an attention mechanism is proposed to extract latent features and to give explainable results by identifying the dominant attributes. Empirical results show that our method outperforms the state-of-the-art models.


I. INTRODUCTION
The core of the Industry 4.0 strategy is the deep integration of information systems and physical systems, which aims at promoting the intelligence, informatization and digitization of industrial production. However, while enjoying the convenience of this integration, the ICS also faces threats caused by the complex network environment. ICS networks are more vulnerable than the traditional Internet, and once an ICS network is attacked, the harm is much greater than on the traditional Internet.
Traditional software security has relatively mature protection solutions [1], [2], but attacks and defenses on ICS have always been a concern of the academic community. Gelenbe carried out an attack by draining the battery energy of sensor nodes and evaluated the effect of the attack through a mathematical model [3], [4]. Domanska described SerIoT, which can optimize information security in IoT platforms and networks in a holistic, cross-layered manner [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Nishant Unnikrishnan.
With the interconnection of ICS networks and the traditional Internet, network security has become the core issue of ICS security. In view of the reliability and stability of anomaly detection systems (ADS) in traditional IT networks [6], [7], much research on ADS for ICS networks has gradually emerged. Morris designed a Snort-based analysis and detection system, which could effectively detect illegal data packets in Modbus traffic. However, the proposed system could only detect known attacks, and the missing report rate (MRR) was high once an unknown attack occurred [8]. Ponomarev proposed checking session duration to verify whether the ICS network was attacked, having noticed that communication periods exceeded a given threshold when injection attacks happened [9], [10]. Keliris designed a process-aware defense and mitigation strategy using the support vector machine (SVM) model. The SVM model trained with anomaly data was used to monitor malicious behavior in real time [11]. Brun used dense random neural networks for online detection of IoT network attacks [12]. However, these studies did not really target data features and cannot capture the latent characteristics of network flows. With the advent of the KDD 99 dataset, researchers paid more attention to high-dimensional network flow data features in ADS [13]. Some researchers gradually adopted artificial intelligence techniques to analyze network flow data and studied the general rules existing in it [14]-[17].
Unlike the KDD 99 dataset, real ICS network data is not only small in size but also skewed in its class distribution. Since the number of anomaly samples may be much smaller than the number of normal samples, the dataset is imbalanced. If such imbalanced data is used directly for IDS model training, the detection results will be unreliable, because the lack of representative data makes learning difficult [18]. There is much research on imbalanced learning [19]-[22], but barely any on ADS, especially on ICS network anomaly detection.
Because of its strength in feature learning and large-scale data processing, CNN is widely used in computer vision, speech synthesis, natural language processing, and information filtering. Deep learning has seldom been tested for ADS, perhaps mainly because ICS data is scarce and existing work is limited to low-dimensional data. Therefore, this paper proposes a standardized processing and data generation method for high-dimensional ICS network data.
This paper is organized as follows: Section II describes the proposed network data feature processing methods. Section III discusses the data generation methods and proposes an evaluation method. Section IV describes the proposed data selection methods and attention mechanism. Section V discusses the experiments and results. Section VI presents conclusions.

II. NETWORK DATA FEATURE MINING
A. NETWORK FLOW DATA ACQUISITION
ICS networks and the traditional Internet have much in common in terms of the means of network attack. Attacks such as portscan, ping-sweep, and denial of service, which are common on the traditional Internet, can also be used against ICS networks. There are also differences, because ICS is more specialized: many behaviors that are normal in general terms are attacks on ICS systems, such as ''Sending a fake command'', ''Uploading exe files'', and so on. ADS is a practical defense that is easy to implement, but an accurate ADS requires detailed network data. At the data link layer, an Ethernet frame is composed of the Ethernet header and up to 1500 bytes of payload. This payload contains many attributes of the network connection, and analyzing it can help to defend against attacks. Data from ICS networks can be acquired through an API called PCAP. Libpcap and WinPcap are the front-end packet capture software libraries for many network tools. Tcpdump, Wireshark, Snort, and Nmap are popular sniffing programs that work with pcap data [23]. The data acquisition of ICS networks is shown in Fig. 1.
Due to the lack of data in the field of ICS network security research, Lemay et al. used the SCADA sandbox to build an ICS environment similar to that in Fig. 1. The MTU polls the controllers to generate data, which is collected as datasets that can be used for ICS network intrusion detection [24]. The datasets have a certain universality and can be used as typical representatives of ICS network data. The ICS network datasets involved in this paper are shown in Table 1.

B. NETWORK DATA PROCESSING
Whether in the traditional Internet or in an ICS network, network flows often carry multiple attributes, such as destination address, packet length, protocol type, and so on. The original data are PCAP packets, which are parsed so that the data in PCAP format becomes structured data. After data cleaning, a dataset with 54-dimensional features is obtained. Among these features, some feature subsets jointly define a single attribute of the data stream. Therefore, these feature subsets are processed so that the basic information is retained while redundancy is reduced.
For example, the feature subset {0, C, 29, 3C, 11, 3F} represents the source MAC address of the network packet, which is ''0:C:29:3C:11:3F''. The MAC address is a 48-bit value and is the unique identifier of each hardware device. Every 4 bits are usually expressed as one hexadecimal digit, which gives the MAC address form above. In other words, radix conversion of the MAC address does not change its physical meaning. Therefore, in order to reduce the dimension of the data while retaining the physical meaning of the MAC address, we convert it to decimal, which gives ''0, 12, 41, 60, 17, 63''. We then treat it as a 6-dimensional vector $x_{MACs} = (0, 12, 41, 60, 17, 63)$ and calculate its 2-norm by

$$\|x_{MACs}\|_2 = \sqrt{\sum_{i=1}^{6} x_i^2}$$

The result is 98.4, and we use it to replace the original 6-dimensional feature subset. The same method is applied to the destination MAC address.
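The MAC address reduction can be checked with a few lines of Python. This is a minimal sketch of the arithmetic described in the text; the function name `mac_to_norm` is hypothetical and not taken from the paper.

```python
import math


def mac_to_norm(mac: str) -> float:
    """Collapse a colon-separated MAC address into the 2-norm of its six octets."""
    # Each field is a hexadecimal octet, e.g. "3C" -> 60.
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    return math.sqrt(sum(o * o for o in octets))


print(round(mac_to_norm("0:C:29:3C:11:3F"), 1))  # 98.4
```

The mapping is lossy: any permutation of the same six octets yields the same norm, so the scalar summarizes an address rather than identifying it uniquely.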
The feature subset {192, 168, 1, 100} represents the source IP address of the network packet. We define a 4-dimensional vector $x_{IPs} = (192, 168, 1, 100)$ to quantify the IP address and a 4-dimensional vector $z = (192, 168, 1, 1)$ to represent the gateway address. Then we calculate the Euclidean distance between the two vectors:

$$d(x_{IPs}, z) = \sqrt{\sum_{i=1}^{4} (x_i - z_i)^2}$$

The result is 99, and we use it to replace the original 4-dimensional feature subset. The same method is applied to the destination IP address.
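The IP address reduction follows the same pattern. The sketch below assumes dotted-decimal IPv4 strings and uses the gateway 192.168.1.1 from the example as a default; the function name `ip_to_gateway_distance` is hypothetical.

```python
import math


def ip_to_gateway_distance(ip: str, gateway: str = "192.168.1.1") -> float:
    """Euclidean distance between an IPv4 address and the gateway address,
    with each address treated as a 4-dimensional integer vector."""
    a = [int(part) for part in ip.split(".")]
    b = [int(part) for part in gateway.split(".")]
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected dotted-decimal IPv4 addresses")
    return math.dist(a, b)


print(ip_to_gateway_distance("192.168.1.100"))  # 99.0
```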
In this way, the initial 54-dimensional data is reduced into 36-dimensional data, which is shown in Fig. 2.
The acquired network packets have high-dimensional attributes that contain a large number of characteristics. These mainly include the basic characteristics, traffic characteristics, and content characteristics of the network connection. The basic characteristics include all the attributes that can be extracted from the TCP/IP connection. The traffic characteristics are information such as the IP address and the services of the network connections. The content characteristics are the attributes that describe the states of the network connection. Fig. 2 shows part of the data extracted from the ICS network datasets. Each row represents one network flow record, each column represents an attribute, and the last column is the label corresponding to each record.
Faced with such high-dimensional network data, it is difficult to directly measure the impact of each attribute on network attack behavior analysis. The data types between the various attributes of network data are quite different. According to the data types, we divide all the attributes of network data into three sub-attribute sets: Descriptive fields, Proportional fields and Statistical fields. The specific divisions of ICS network data are shown in Table 2.

III. THE BALANCED METHODS BASED ON DATA GENERATION AND EVALUATION
In fact, the ICS network is in a normal state in most cases. Therefore, anomaly data makes up only a small part of the collected network data while normal data is the overwhelming majority, which results in a sample imbalance problem. As shown in Table 1, the number of anomaly samples in "Sending a fake command modbus 6RTU with operate" is less than one thousandth of the total sample size. Training with such data will seriously affect the performance of the anomaly detection model. In order to represent the distribution of samples more intuitively, we use the T-SNE dimensionality reduction method to visualize these 4 datasets. T-SNE is a manifold learning method that attempts to find a low-dimensional representation of the data [25]-[27]. The visualization results are shown in Fig. 3.
Common methods for solving sample imbalance problems include oversampling, undersampling, and cost-sensitive learning. Due to the small scale of ICS network data, this paper uses data generation methods to generate new samples of the minority (anomaly) class. These methods are based on the idea of oversampling but do not simply repeat minority class samples. They not only make the numbers of normal and anomaly samples more balanced but also increase the total amount of training samples.
A. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is an improved scheme based on the random oversampling algorithm [28]. New samples are generated by interpolating between samples of the minority class. Assuming that there are a total of n samples in the minority class, a sample $x_i$ is chosen randomly and its k nearest neighbors among these samples are found. Then one of the k nearest neighbors, $x_{ij}$, is selected arbitrarily, and the interpolation between it and $x_i$ gives a new sample

$$x_{new} = x_i + rand(0, 1) \times (x_{ij} - x_i)$$

where rand(0, 1) represents a random number between 0 and 1. The anomaly samples are expanded in this way to even out the ratio of positive to negative samples. Fig. 4 shows the visualization results with a positive to negative sample ratio of 1:1.
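The SMOTE interpolation step can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the experiments: the function name `smote_sample`, the default `k=5`, and the use of Euclidean distance for the neighbor search are assumptions.

```python
import math
import random


def smote_sample(minority, k=5, rng=random):
    """Generate one synthetic sample by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    i = rng.randrange(len(minority))
    x_i = minority[i]
    # Rank the remaining minority samples by Euclidean distance to x_i
    # (assumed metric) and keep the k closest ones.
    others = [x for j, x in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
    x_ij = rng.choice(neighbours)
    gap = rng.random()  # plays the role of rand(0, 1)
    return [a + gap * (b - a) for a, b in zip(x_i, x_ij)]


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_sample(minority, k=2, rng=random.Random(0)))
```

Because every new point lies on a segment between two existing minority samples, the generated data stays inside the region already covered by the minority class, which is consistent with the clustering the paper observes for SMOTE.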

B. MAHAKIL
MAHAKIL is a novel synthetic oversampling approach based on the theory of inheritance and the Mahalanobis distance [29]. It takes inspiration from biology, where offspring inherit traits from their parents by obtaining chromosomes from each parent in equal quantity. The dataset N is split into arrays of the minority class $N_{min}$ and the majority class $N_{maj}$, and the Mahalanobis distance is computed for each sample in $N_{min}$:

$$D_i = \sqrt{(x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu)} \tag{4}$$

where $D_i$ is the Mahalanobis distance for each sample $x_i$, $\mu$ is the mean of $N_{min}$, and $\Sigma$ is the covariance matrix of $N_{min}$. Next, $N_{min}$ is sorted by Mahalanobis distance and divided at the midpoint into two parts, $N_{bin1}$ and $N_{bin2}$. A pair of instances $y_a$ and $y_b$ is selected from $N_{bin1}$ and $N_{bin2}$ in order, and the mean of the pair is taken as a new instance. The 4 datasets are processed by MAHAKIL to a 1:1 ratio of positive to negative samples, and the results are visualized by T-SNE in Fig. 5.

TABLE 2. Dividing all attributes into three parts depending on the data type. The symbolic data is divided into Descriptive fields, the floating-point data is divided into Proportional fields, and the integer data is divided into Statistical fields.
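One round of the MAHAKIL pairing can be sketched with NumPy as below. Several details are assumptions made for illustration: the descending sort order, the use of a pseudo-inverse to tolerate a singular covariance matrix, and the function names. The full method repeats the pairing over further generations until the target class ratio is reached, which is omitted here.

```python
import numpy as np


def mahalanobis_distances(samples: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of every row to the mean of the sample set."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse: assumption, guards singular cov
    diff = samples - mu
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))  # clip tiny negative rounding errors


def mahakil_one_generation(minority: np.ndarray) -> np.ndarray:
    """Sort by distance, split at the midpoint, and average paired samples."""
    order = np.argsort(mahalanobis_distances(minority))[::-1]  # descending: assumption
    ranked = minority[order]
    mid = len(ranked) // 2
    bin1, bin2 = ranked[:mid], ranked[mid:2 * mid]
    return (bin1 + bin2) / 2.0  # each child is the mean of one pair


rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 3))
print(mahakil_one_generation(minority).shape)  # (5, 3)
```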

C. GAN
GAN (Generative Adversarial Nets) is based on game theory and consists of two feed-forward neural networks [30]. One of the neural networks is a Generator G and the other is a Discriminator D. They compete against each other, with G producing new candidates and its adversary D evaluating their quality. G and D play the following two-player minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $p_{data}$ is the data distribution and $p_z$ is the prior distribution of the noise fed to the generative network. The discriminator aims to maximize the probability of distinguishing real from generated data, whereas the generator tries to keep the difference between real and generated data to a minimum so as to trick the discriminator into believing that generated examples are real. Starting from the given noise, GAN can generate diverse data that matches the real situation through this self-game. The process of applying it to the ICS network data is shown in Fig. 6.
For each of the 4 datasets, minority class data is generated by GAN until its ratio to the majority class reaches 1:1, and the results are visualized by T-SNE in Fig. 7.
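To make the adversarial objective concrete, the sketch below evaluates the standard GAN value function, the mean of log D(x) over real samples plus the mean of log(1 - D(G(z))) over generated samples, from lists of discriminator outputs. It does not train any network; the function name `gan_value` and the example probabilities are illustrative assumptions.

```python
import math


def gan_value(d_real, d_fake):
    """Empirical estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    All values must lie strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated outputs 0.5 everywhere,
# which gives V = -log(4).
print(gan_value([0.5, 0.5], [0.5, 0.5]))
# A discriminator that separates the two sets well scores higher.
print(gan_value([0.9, 0.8], [0.1, 0.2]))
```

The discriminator pushes this value up, while the generator pushes it back down towards the -log(4) level at which its samples are indistinguishable from real data.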

D. THE EVALUATION METHOD BASED ON MAHALANOBIS DISTANCE OF INTRA-CLASS AND INTER-CLASS SAMPLES
The above three data generation methods can perform sample expansion and balancing on ICS network data, but it is unknown whether the generated data is valid and usable. Therefore, an appropriate strategy must be established to evaluate the effectiveness of the generated data. According to the previous analysis of the visualization results, when the generated samples are too similar to the original samples of the same class, they cluster and amount to little more than repeated copies of the original data. If the generated samples are too different from the original samples of the same class, they blur the boundary between the anomaly data and the normal data. This paper proposes that the quality of the generated data be evaluated against three criteria: (i) whether the generated data inherits the characteristics of the original data well; (ii) the diversity of the generated data; (iii) the speed at which the data is generated. The visualization results allow a rough analysis of the three methods. The new samples generated by SMOTE preserve the characteristics of the original data but give rise to clustering. The new samples generated by MAHAKIL are diverse but ignore the boundaries between samples of different classes. GAN combines the advantages of the former two methods: the data it generates is not only diverse but also inherits the characteristics of the original data well.
In order to further evaluate the performance of the three methods, we propose an evaluation method based on the Mahalanobis distance of intra-class and inter-class samples. The Mahalanobis distance, introduced in equation (4), can measure the similarity between two sample sets. Inspired by this, the similarity between intra-class samples is used to illustrate the diversity of the generated data, and the difference between inter-class samples is used to illustrate the inheritance of the data. For majority class samples x and minority class (pending generation) samples y, the similarity and the inheritance of the generated data are evaluated by two measures, $I_{x \sim x}$ and $I_{x \sim y}$, where $I_{x \sim x}$ is the similarity measure between intra-class samples, $I_{x \sim y}$ is the inheritance measure between inter-class samples, $\mu_x$ and $\mu_y$ are the means of the two classes, and $\Sigma_x$ and $\Sigma_y$ are the covariance matrices of the two classes. These two measures evaluate criteria (i) and (ii) for the data generated by each of the three methods. The larger $I_{x \sim x}$ is, the greater the difference between intra-class samples, which means the data is more diverse. The larger $I_{x \sim y}$ is, the greater the difference between inter-class samples, which means the new data inherits the characteristics of the original data well. Criterion (iii) is evaluated by specific experiments.

IV. NETWORK FLOW DATA SELECTION AND ATTENTION
A. NETWORK FLOW DATA FEATURE ANALYSIS AND SELECTION
Descriptive fields are all symbolic data. These data are discrete, and it is difficult to characterize and measure them directly, so we quantify them with numbers. For example, if the protocol types in the ICS network data are [TCP, UDP, Modbus], they can be quantified as [1, 2, 3]. In this way, all discrete data are quantized to facilitate feature analysis. The quantized features are represented as $F_D$.
Proportional fields are all floating-point data. From information such as the IP address and the MAC address of the network connection, it is possible to obtain the number of connections with the same target host or the same service over a period of time, from which multiple proportional values can be derived. Some of these values lie in [0, 1], and others are greater than 1. We use the Sigmoid function to map all of them into (0, 1). The Sigmoid function is defined as

$$S(x_{ij}) = \frac{1}{1 + e^{-x_{ij}}}$$

where $x_{ij}$ is the j-th sub-attribute value of the i-th piece of data in the Proportional fields, and $S(x_{ij})$ is the value after being mapped by the Sigmoid function. The processed features are represented as $F_P$.
Statistical fields are all integer data. Each attribute has a different interval and cannot be analyzed directly, so we use a normalization method to process the Statistical fields. Let there be a total of n pieces of data; for the i-th piece of data, the j-th attribute value in the Statistical fields is $a_{ij}$, and the normalization process yields $Z_{ij}$, where $Z_{ij}$ represents the normalized result and m is the total number of attributes in the Statistical fields. The processed features are represented as $F_S$. With this method, the data ranges of the attributes are brought as close together as possible for horizontal comparison and analysis, while the original data distribution is kept as much as possible.
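The three preprocessing steps can be sketched in plain Python as follows. The symbol encoding and the Sigmoid mapping follow the text directly. The normalization formula is not reproduced in this section, so column-wise min-max scaling is used here purely as an assumed stand-in; all function names are hypothetical.

```python
import math


def encode_symbols(values, vocabulary):
    """Descriptive fields: map each symbol to its 1-based position in the vocabulary."""
    codes = {name: idx for idx, name in enumerate(vocabulary, start=1)}
    return [codes[v] for v in values]


def sigmoid(x):
    """Proportional fields: squash a non-negative ratio into the unit interval."""
    return 1.0 / (1.0 + math.exp(-x))


def min_max_normalize(column):
    """Statistical fields: ASSUMED min-max scaling of one attribute column."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


print(encode_symbols(["UDP", "TCP", "Modbus"], ["TCP", "UDP", "Modbus"]))  # [2, 1, 3]
print(sigmoid(0.0))                                                        # 0.5
print(min_max_normalize([10, 20, 30]))                                     # [0.0, 0.5, 1.0]
```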

B. THE ATTENTION FOR ICS NETWORK DATA
In order to find out which attributes of the ICS network data have a greater impact on the anomaly classification, this paper uses the attention mechanism [31], [32]. Consider one ICS network record $\{F_1, F_2, \ldots, F_n\}$, where n is the number of features. Some of these features are related to their neighbors, such as IP source and IP destination, or MAC source and MAC destination. Therefore, a Bi-RNN (bidirectional RNN) is used to encode them. As shown in Fig. 8, the annotations $h_i$ produced by the Bi-RNN are combined as

$$S = \sum_{i=1}^{n} \alpha_i h_i \tag{10}$$

where S is the attention output that summarizes all the information of the features in the ICS network data. The attention output S is a high-level representation of the ICS network data and can be used as the feature vector for anomaly classification by Softmax. The attention value $\alpha_i$ of each annotation $h_i$ is computed by

$$\alpha_i = \frac{\exp(u_i^{\top} u_k)}{\sum_{j=1}^{n} \exp(u_j^{\top} u_k)}$$

where $u_i$ is a hidden representation of $h_i$ and $u_k$ is the nearby attributes vector. The value $\alpha_i$ reflects the importance of the annotation $h_i$ with respect to the previous hidden state $u_{i-1}$ in predicting the network data categories. This paper uses attention to analyze the importance of each feature in ICS network data for anomaly classification.
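The attention pooling step can be sketched with NumPy as below. The Bi-RNN itself is omitted: random vectors stand in for its annotations. The form u_i = tanh(W h_i + b) is a common choice for the hidden representation and is an assumption here, as are the names `W`, `b`, and `attention_pool` and the dimension 8.

```python
import numpy as np


def attention_pool(h, W, b, u_k):
    """Softmax attention over annotations.

    h:   (n, d) annotations h_i, one per feature
    W,b: parameters of the assumed hidden layer u_i = tanh(W h_i + b)
    u_k: (d,) context vector that the hidden representations are scored against
    Returns the attention weights alpha (n,) and the pooled output S (d,).
    """
    u = np.tanh(h @ W + b)          # hidden representation of each annotation
    scores = u @ u_k                # u_i^T u_k
    scores = scores - scores.max()  # stabilise the exponentials
    alpha = np.exp(scores) / np.exp(scores).sum()
    S = alpha @ h                   # weighted sum of annotations
    return alpha, S


rng = np.random.default_rng(0)
h = rng.normal(size=(36, 8))        # 36 features, stand-in annotations
alpha, S = attention_pool(h, rng.normal(size=(8, 8)), np.zeros(8), rng.normal(size=8))
print(alpha.argsort()[::-1][:9])    # indices of the 9 most attended features
```

Sorting the weights in this way is what allows a per-feature importance ranking to be read off, as in the heat map analysis of the experiments.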

V. EXPERIMENTS AND RESULTS
A. DATASET
This paper uses the data generation methods to expand the minority samples in the ICS network datasets, so that the originally imbalanced data becomes more balanced. We then propose an evaluation strategy for the effects of the three data generation methods. The datasets in Table 1 were collected in a complete and realistic scenario and provide valuable data for ICS network security research [24]. For the 4 original datasets in Table 1, the three data generation methods are used to expand the original data so that the normal and anomaly samples reach a certain proportion. The total number of samples and the number of anomaly samples are shown in Table 3, after which the classification tests and evaluations are performed.

B. THE CLASSIFICATION RESULTS OF NEW ICS NETWORK DATA
Convolutional Neural Network (CNN) has become an industry standard technology [33], [34]. In academia, CNN is widely used in computer vision, speech recognition, speech synthesis, image synthesis, and natural language processing [35]-[37]. In industrial applications, CNN is widely used in autonomous driving, medical image recognition, and information filtering [38]-[40]. CNN also performs well in extracting and classifying network traffic [41]. We use CNN to classify ICS network anomaly behaviors. CNN is well suited to matrix operations and can extract object features in depth, which is why it performs so well in image recognition. Inspired by this, we transform the network data by folding the sequence data into a matrix. Each piece of data in the ICS network datasets includes 2 Descriptive fields, 4 Proportional fields, and 30 Statistical fields. The features {$F_D$, $F_P$, $F_S$} are transformed into a 6 × 6 matrix, and any vacant part is filled with 0. The main parameters of the CNN model designed in this paper are shown in Table 4.
Finally, the CNN model is applied to the ICS network data, and Pre (Precision Rate), Rec (Recall Rate), and $F_1$-Score are used to evaluate the classification effect. The results on the original datasets are shown in Table 5. Due to the serious imbalance between normal and anomaly samples in the datasets, the model over-fits seriously on three of the datasets. Then we test the datasets augmented with new data generated by each of the three methods; the results are shown in Tables 6, 7, and 8.
Comparing the experimental results of the three methods, the data generated by GAN gives the best classification results overall. In other words, the data generated by GAN assists the original data more effectively, so the classification model trains better. The SMOTE method is slightly inferior to GAN in terms of classification results, followed by MAHAKIL. For the first three ICS network datasets, the CNN model over-fits seriously before data generation, and the classification results cannot be measured accurately. After data expansion and augmentation, these three datasets can be classified effectively, with good results on Pre, Rec and $F_1$-Score. In addition, GAN is a process of self-game, and training the generator takes a long time. However, once training is completed, it takes less time to generate the data than the other two methods.
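Folding one record into the CNN input can be sketched as follows. The row-major ordering and the function name `fold_to_matrix` are assumptions; the text only states that the features are arranged in a 6 × 6 matrix with vacant cells set to 0.

```python
def fold_to_matrix(features, side=6):
    """Fold a flat feature vector into a side x side matrix, zero-padding the tail."""
    if len(features) > side * side:
        raise ValueError(f"at most {side * side} features fit, got {len(features)}")
    padded = list(features) + [0.0] * (side * side - len(features))
    # Row-major layout (assumed): the first `side` features form the first row.
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# 2 Descriptive + 4 Proportional + 30 Statistical = 36 values per record.
record = list(range(36))
matrix = fold_to_matrix(record)
print(len(matrix), len(matrix[0]))  # 6 6
```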
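The three metrics can be computed from predicted and true labels as in the sketch below, which treats the anomaly class as the positive label 1 (an assumption about the label encoding). The second call illustrates why these metrics are preferred on imbalanced data: a model that labels everything as normal scores zero on all three, even though most of its predictions are correct.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, [1, 1, 0, 1, 0, 0, 0, 0]))  # each value is 2/3
print(precision_recall_f1(y_true, [0] * 8))                   # (0.0, 0.0, 0.0)
```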

C. THE EVALUATION OF GENERATED DATA
For the above three data generation methods, after analyzing the anomaly classification of the data, the experiments are further carried out by using the evaluation method proposed in this paper. For each data generation method, we measure the generated data by I x∼x and I x∼y , and the results are shown in Table 9.
Table 9 shows that the data generated by MAHAKIL has the maximum $I_{x \sim x}$, which indicates that its generated data has the largest intra-class difference and is the most diverse, followed by GAN and then SMOTE. For $I_{x \sim y}$, the data generated by GAN has the maximum value, SMOTE has a smaller value, and MAHAKIL falls far behind the other two methods. This result illustrates that the data generated by GAN inherits the original data features best, SMOTE is slightly inferior, and MAHAKIL does poorly in this respect. In general, MAHAKIL achieves the highest diversity at the expense of inheritance, while GAN is the most balanced method, achieving good diversity while retaining the original data characteristics.

D. ATTENTION RESULT ON NETWORK DATA
For an anomaly record, the attention mechanism is used to explore which features in the 36-dimensional data have the greatest effect on the anomaly classification. Fig. 9 shows an attention result visualized as a heat map.
Through the analysis by attention, it can be seen from Fig. 9 that 9 attributes receive the most attention in the ICS network data. These 9 attributes play an important role in ICS network anomaly classification, although the other attributes also contribute to some degree. From this, we can see that only 1/4 of the attributes of the 36-dimensional ICS network data play a key role.

VI. CONCLUSION
This paper aims to conduct intelligent anomaly detection on ICS networks but finds a serious sample imbalance problem in ICS network data. Therefore, this paper uses three data generation methods to expand the ICS network data and proposes an innovative evaluation method for the generated data. Experiments show that this method can effectively evaluate the usability of generated data in terms of diversity and inheritance, and they further indicate that GAN is the best of the three methods for generating ICS network data.