Multivariate Abnormal Detection for Industrial Control Systems Using 1D CNN and GRU



I. INTRODUCTION
Industrial control systems (ICSs), such as SCADA, DCS and PLC, have been widely used in industry, energy, transportation, water conservancy, municipal services, and other national key infrastructure fields and are the core of industrial automation [1]. In recent years, with the increasing integration of industrialization and informatization, the number of interconnections between industrial equipment and the Internet has increased sharply, and malicious attacks targeting industrial control systems through the Internet have followed [2]. A number of high-impact cyberattacks have been reported in recent years, including the infamous Stuxnet malware, which appeared in 2010 and targeted nuclear centrifuges in Iran [3]; an attack on Ukraine's power grid in 2015 that left millions of people spending Christmas in the dark [4]; and, since 2016, a series of industrial control security incidents, such as the Israeli power system invasion [5] and Russian hackers' public release of the SCADA Pass default password list of industrial control hardware and software equipment [6]. This series of events has sounded the alarm for industrial control system security.

(The associate editor coordinating the review of this manuscript and approving it for publication was Mitra Mirhassani.)
Therefore, improving the ability to detect attacks on industrial control systems has become an urgent problem for academia and industry. One approach follows the traditional network-based intrusion detection system: collect and analyze the network traffic in the industrial control system, match it against known abnormal traffic models (traffic that fails to match is regarded as unknown abnormal traffic), and analyze it to determine whether there is a suspicious intrusion or attack [7]. To provide intrusion detection rules with detailed information on how to prevent intrusions and thereby improve a system's ability to carry out intrusion detection and prevention, Morris et al. [8] refined the requirements of network intrusion detection systems for industrial control systems by analyzing four types of intrusion vulnerability, namely, denial of service, command injection, response injection and system reconnaissance. As increasingly many wireless smart devices are applied to ICS networks, conventional encryption methods and security patches cannot improve the security level of older devices in ICS networks. Shen et al. [9] exploited the simplicity of the program flow and the stability of the hardware configuration to measure the interlayer data response processing time and analyze the network traffic, which enhanced the traditional intrusion detection mechanism in an ICS network.
Another approach is to model the underlying physical process, predict the data, and compare the prediction with the actual data to determine whether an exception has occurred [10]. To protect an industrial control system against false data injection attacks, Lu and Feng [11] proposed a system state estimation method based on a particle filter to estimate the status of the system's nodes, introduced a voting scheme to identify malicious nodes, estimated the actual system state accurately by building a local estimator and removing the voted-out malicious nodes, and finally used a prediction algorithm with the estimated actual status to predict the actual measured values of the damaged nodes. Kurt et al. [12] modeled the power grid system under study as a discrete-time linear dynamic system and proposed a distributed dynamic state estimation algorithm based on a Kalman filter to estimate the system state, which greatly reduced the detector's response time in detecting structural and random attacks and improved the detection accuracy. Although these methods can greatly improve the accuracy of anomaly detection, the models are difficult to build, requiring an in-depth understanding of the physical system and familiarity with its control process [12].
With the development of machine learning and neural networks, some scholars have gradually combined deep learning with the intrusion detection of industrial control systems [13] and achieved certain results. Bangbing et al. [14], based on industrial control intrusion detection technology using the long short-term memory (LSTM) network, selected the relatively optimal LSTM model through cross-validation and achieved higher accuracy than traditional intrusion detection methods. To further improve the detection accuracy of the LSTM-based anomaly model, Wang and Jin [15] combined LSTM with linear discriminant analysis (LDA), using LSTM to learn the context representation of the event log of the industrial control network and LDA to classify the features output by LSTM to achieve anomaly detection. However, the detection rates for unknown or new attacks are unsatisfactory, and attacks on the physical layer are even more difficult to detect. For this reason, Liu et al. [16] proposed a two-level anomaly detection framework: the first level extracts features through a convolutional neural network (CNN) model, and the second level takes the extracted features as the input of a process state transition algorithm to construct the normal process transition model of an ICS, realizing the detection of unknown or 0-day attacks. Kim et al. [17] applied time series anomaly detection to edge computing in the industrial Internet of things (IIoT) with a stacked compression variational autoencoder (SCVAE), which reduced the size of the model and the detection time while ensuring a similar accuracy level. Because there are many controllers and sensors in industrial control systems, the data corresponding to different processes or scenarios differ. To effectively detect cross-scene network attacks, Wang et al. [18] proposed extracting nonlinear and nonstationary features from power load data with a stacked autoencoder and then using these features to improve the accuracy of power load data prediction and thus the accuracy of network attack detection.
Generally, when evaluating a model, we should consider not only the precision but also the recall rate. The F1 score [19] reflects both indicators at once: the higher the F1 score, the better the detection result of the model. Because the above methods produce low F1 scores when detecting multitype attacks, they cannot achieve good detection when the attack types are diverse. Therefore, we propose an intrusion detection system model based on the autoencoder, 1D_CNN and GRU and use it to detect cyberattacks on the SWaT dataset. The contributions of this paper are as follows:
• An anomaly detection method for multivariate industrial process time series data;
• A model for predicting sensor/controller parameters in industrial control systems (the spatiotemporal correlations and other dependencies between the values of the sensors and the controllers at each moment are fully learned by combining the 1D_CNN with the GRU);
• A higher F1 score for anomaly detection on the SWaT dataset when using our method.

II. CORRELATION PRINCIPLE OF AUTOENCODER
The autoencoder [20] is an unsupervised learning method consisting of a feedforward symmetric network with multiple layers of neurons; the structure is shown in figure 1. Note that the number of hidden layers can be greater than one. The process by which the network converts the input data v into the code h is called encoding. To train the autoencoder, the code h obtained from the hidden layers is mapped to the output layer through the decoding layers to obtain the reconstructed vector v̂. The encoding process can be expressed as

h_l = f_l(W_l h_{l-1} + b_l), with h_0 = v for the first hidden layer,

and the reconstructing process can be expressed as

ĥ_l = f_l(W'_l ĥ_{l-1} + b'_l), with ĥ_0 = h for the first decoding layer and ĥ_L = v̂ for the last,

where f_l is the nonlinear mapping of the lth layer, which can be a sigmoid function [21], a tanh function, and so on. The target of the autoencoder is to minimize the reconstruction error

L = ‖v − v̂‖²,

which makes the hidden layers learn a feature representation of the input data. Recent studies have found that the stacked denoising autoencoder (SDA) [22] performs better at feature extraction. The input of the SDA is randomly damaged by setting some input units to zero through the stochastic mapping ṽ ∼ q_D(ṽ|v). The SDA then tries to reconstruct the original input from the damaged version, so the mean squared error can be used as the loss function.
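As a concrete illustration of the corruption step, the masking noise q_D(ṽ|v) and the reconstruction loss can be sketched in a few lines (a toy NumPy sketch; the identity mapping stands in for a trained encoder/decoder, which would of course be learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(v, p=0.3, rng=rng):
    """Masking noise q_D(v~|v): randomly zero a fraction p of input units."""
    mask = rng.random(v.shape) >= p
    return v * mask

def mse_loss(v, v_hat):
    """Reconstruction error L = ||v - v_hat||^2, averaged over units."""
    return float(np.mean((v - v_hat) ** 2))

v = rng.random(8)       # a toy input vector
v_tilde = corrupt(v)    # damaged version fed to the SDA encoder
# A trained SDA would map v_tilde -> h -> v_hat; the identity stands in here.
v_hat = v_tilde
print(mse_loss(v, v_hat))
```

The SDA is trained to drive this loss down while reconstructing from ṽ rather than v, which forces the code h to capture robust structure in the input.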

III. ABNORMAL DETECTION METHOD
We propose an abnormal state detection method based on the statistical deviation between the predicted and observed values. The main idea is to predict the value at the next moment (or the sequence over the next period) from the existing flow in the preceding period and to judge whether an abnormality has occurred from the deviation between the predicted and actual values. The steps are as follows: let X = (x_0, x_1, ..., x_{n−1}) and Y = (y_n, y_{n+1}, ..., y_{n+m}) be the input and output flow sequence vectors at time t, respectively. Since (y_n, y_{n+1}, ..., y_{n+m}) is a prediction, it will deviate from the actual observed values; thus, the deviation e_t at time t can be set as

e_t = |y − y_a|,

where y_a is the value of a dimension's feature in the actual observed value corresponding to the predicted value y.
The deviation is then standardized as

p_t = (e_t − μ_{e_t}) / σ_{e_t},

where μ_{e_t} is the mean of the prediction errors e_t from time i to time t and σ_{e_t} is the corresponding standard deviation. We calculate the mean and variance of each predicted and observed characteristic value in Y and finally set a threshold T on the deviation. If

p_t > T,

then the state is regarded as abnormal; otherwise, it is normal.
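The thresholding step above can be sketched as follows (a minimal pure-Python sketch; the threshold value and the toy series are illustrative, not the paper's settings):

```python
from statistics import mean, stdev

def anomaly_flags(predicted, observed, threshold=3.0):
    """Flag time steps whose standardized prediction error exceeds T.

    e_t = |y_t - y_t^a|; each deviation is standardized by the mean and
    standard deviation of the error series, then compared against T.
    """
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    mu, sigma = mean(errors), stdev(errors)
    if sigma == 0:                       # constant errors: nothing stands out
        return [False] * len(errors)
    return [(e - mu) / sigma > threshold for e in errors]

predicted = [1.0, 1.1, 0.9, 1.0, 1.0, 1.05, 0.95, 1.0]
observed  = [1.0, 1.1, 0.9, 1.0, 5.0, 1.05, 0.95, 1.0]  # spike at index 4
flags = anomaly_flags(predicted, observed, threshold=2.0)
print(flags)
```

Only the step where the observation jumps away from the prediction is flagged; the threshold T trades false positives against missed detections.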

IV. ICS ANOMALY DETECTION MODEL BASED ON 1D_CNN AND GRU
First, the features of the historical data are obtained through the SDA, and then the predicted features in each window are obtained through the neural network combined with the 1D_CNN and GRU. Finally, the statistical deviation between the predicted features and the observed features is calculated and compared with the set threshold to determine whether the system enters an abnormal state.
In the stage of data preprocessing, the one-hot encoder (OHE) [23] is used to convert the original flows into a vector. Since the dimensions of the vector are very high, an autoencoder is used to reduce the dimensions of the vector. To obtain a better feature vector with low dimensionality and give the model a certain anti-disturbance ability relative to an anomaly or an attack on the system, we use the SDA to extract the flow features and reduce the dimensions of the features; however, the dependencies between some features may be lost due to the dimensionality reduction.
To extract better features effectively, the network should learn higher-level and more abstract operation and parameter features so that it can learn industrial control system characteristics from low-level common features. To add potential information about feature interdependencies, we add a fully connected layer before the CNN, which expands the number of features provided to the CNN, as shown in figure 2. We set the dimension i of the extended vector to a value between the output dimension m of the SDA and the dimension n of the initial vector produced by the OHE, i.e., m < i < n. The CNN naturally integrates low-, medium- and high-level features and classifiers in an end-to-end, multilayer manner and can enrich the "hierarchy" of features through the number of stacked layers, so deeper networks can learn richer, higher-level features. However, a deeper CNN has not only strong feature learning ability but also its own defects. First, the gradient is calculated by chained backpropagation; as the number of layers increases, the gradient easily decreases or increases exponentially. Deeper CNNs therefore often suffer from vanishing or exploding gradients, and their training becomes more difficult. Second, network degradation is another major problem, leading to an increase in training error. Therefore, we use a 1D_CNN with residual modules to avoid these problems. The 1D_CNN introduces the idea of residual learning into the traditional 1D CNN: the 1D residual block effectively addresses the training difficulty and performance degradation of deeper CNNs. In addition, the introduction of pooling layers further enhances the feature learning ability of CNNs in noisy environments, and the adoption of batch normalization further improves the performance of the network.
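The residual pattern described above (batch normalization, ReLU, 1D convolution, plus an identity shortcut) can be sketched for a single channel (a NumPy sketch with untrained stand-in parameters; the real model uses multiple learned filters and trained batch-norm statistics):

```python
import numpy as np

def conv1d_same(x, w):
    """'same'-padded 1D convolution over a single channel."""
    return np.convolve(x, w, mode="same")

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Standardization of the signal, a stand-in for trained batch norm."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block_1d(x, w1, w2):
    """Pre-activation residual block: (BN -> ReLU -> conv) twice, plus skip."""
    out = conv1d_same(relu(batch_norm(x)), w1)
    out = conv1d_same(relu(batch_norm(out)), w2)
    return out + x   # identity shortcut keeps gradients flowing in deep nets

rng = np.random.default_rng(1)
x = rng.standard_normal(120)                 # one feature over a 120-step window
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
y = residual_block_1d(x, w1, w2)
print(y.shape)
```

The shortcut means that even if the two convolutions learn nothing (zero weights), the block reduces to the identity, which is exactly what protects deep stacks from degradation.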
The extended vectors are taken as the input of the 1D convolutional neural network with the residual module, as shown in figure 3. After the expanded vectors are normalized through batch normalization [24], the ReLU function [25] is adopted as the activation function for nonlinear mapping and then processed through a 1D convolution layer. After the above steps are carried out many times, the residual module is used for the output.
However, using only the 1D_CNN to learn the features and their dependencies will lead to deterioration of the accuracy and robustness of system prediction, which will result in a high false positive rate of multitype attack detection.
On the other hand, because the conventional RNN architecture is too sensitive to interference from adjacent time periods, the problem of vanishing error flow remains. Compared with the original RNN and LSTM, the gated recurrent unit (GRU) overcomes the vanishing gradient problem through multiplicative gates. Generally, both the GRU and LSTM are suitable for processing time series data, but the GRU converges and updates its weights more efficiently, and its internal gate structure is more concise than that of LSTM. Our experiments also show that the GRU outperforms LSTM in this model.
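A single GRU step with its multiplicative update and reset gates can be written out directly (a NumPy sketch with random untrained weights; bias terms are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step. The update gate z and reset gate r are the
    multiplicative gates that control how much of the previous state
    h survives, which is what mitigates vanishing gradients."""
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # interpolate old and new state

rng = np.random.default_rng(2)
n_in, n_hid = 4, 8
params = [rng.standard_normal((n_hid, d)) for d in (n_in, n_hid) * 3]
h = np.zeros(n_hid)
for t in range(5):                            # run over a short sequence
    h = gru_cell(rng.standard_normal(n_in), h, *params)
print(h.shape)
```

The GRU keeps one state vector and two gates, versus the LSTM's separate cell state and three gates, which is the conciseness (and the efficiency gain) referred to above.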
By providing a series of data points to the above network structure, a single subsequent data point containing many features can be obtained. After we obtain the output vector of the 1D_CNN [26], we let it pass through a dropout layer [27] to increase the generalization capability of the network. Then, the output of the dropout layer is used as the input of the GRU neural network [28] (as shown in figure 4) to predict the subsequent flow features, and the predicted eigenvector is obtained symmetrically through a fully connected layer. For the whole network mentioned above, the mean squared error (MSE) [29] is adopted as the model's loss function to train our model.
L = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,

where ŷ_i is the predicted feature and y_i is the actual feature.
We use the method proposed in section III for flow anomaly detection. First, the eigenvector Y_t^p of the traffic at the next time t is obtained with the above network model, and the eigenvector of the actual traffic at that time is Y_t^a. According to the formula, the deviation of the eigenvalue of dimension i of the flow at time t is e_t^i, and the mean μ_{e^i} and standard deviation σ_{e^i} of that eigenvalue are calculated. After standardizing the deviation, the deviation value p_t^i of dimension i is obtained; we compute this for each dimension. Then, we set the threshold T and the minimum length of the time window w as hyperparameters and multiply the deviation values of all dimensions to obtain an outlier score, which is used to judge whether an abnormal state has occurred:

O_t = ∏_{i=1}^{m} p_t^i,    (9)

where m is the number of dimensions of the eigenvector. Finally, we map the outlier score to a level between 0 and 10 to obtain the outlier level OS_t at time t.
Finally, compared with the set threshold T , if OS t > T , the system at time t is abnormal and can be regarded as attacked; otherwise, it is normal.
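The scoring steps above can be put together in a short sketch (the exact mapping of the raw score onto the 0-10 range is not specified in the text, so the linear min-max mapping, its bounds, and the toy deviations are assumptions):

```python
def outlier_score(deviations):
    """Multiply the per-dimension standardized deviations (eq. 9)."""
    score = 1.0
    for d in deviations:
        score *= d
    return score

def to_level(score, lo, hi):
    """Map a raw outlier score into a 0-10 level (assumed linear mapping,
    clipped to the [lo, hi] calibration range)."""
    if hi == lo:
        return 0.0
    return 10.0 * min(max((score - lo) / (hi - lo), 0.0), 1.0)

devs_normal = [0.2, 0.3, 0.1]   # hypothetical standardized deviations
devs_attack = [2.5, 3.1, 4.0]
os_normal = outlier_score(devs_normal)
os_attack = outlier_score(devs_attack)
T = 5.0                         # illustrative threshold on the 0-10 level
print(to_level(os_attack, 0.0, 40.0) > T)   # flagged as abnormal
```

Multiplying the per-dimension deviations makes the score spike only when several dimensions deviate at once, which suits attacks that disturb multiple sensors/actuators together.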

Algorithm 1 The Proposed Anomaly Detection Method
Input: training data, testing data
Output: anomaly score OS_t
1: Initialize the weight parameters in the model
2: Transform the normalized data into the initial vector
3: Extract and reduce the features of the initial vector with the SDA
4: Predict the features of the next window with the 1D_CNN and GRU
5: Compute the statistical deviation between the predicted and observed features
6: Compute the anomaly score OS_t and compare it with the threshold T

V. EXPERIMENT AND ANALYSIS

A. EVALUATION CRITERIA
In this paper, the F1 score, which is often used in multiclassification problems, is adopted to evaluate the performance of the model. The relevant definitions are as follows:

precision = TP / (TP + FP),
recall = TP / (TP + FN),
F1 = 2 × precision × recall / (precision + recall),

where TP (true positive) is the number of abnormal flows detected correctly, FN (false negative) is the number of abnormal flows that occurred but were not detected, and FP (false positive) is the number of normal flows incorrectly marked as abnormal.
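These definitions translate directly into code (the counts are toy numbers chosen for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts: 90 attacks detected, 10 missed, 5 false alarms.
p, r, f1 = precision_recall_f1(tp=90, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because F1 is the harmonic mean of precision and recall, it penalizes a model that trades one metric away for the other, which is why it is the headline metric here.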

B. DATASET
This paper uses the SWaT dataset, collected from the Secure Water Treatment test-bed [30] designed by researchers in Singapore, to test and verify the performance of the model. The SWaT dataset was collected from a test-bed water treatment system, a smaller version of the large modern water treatment plants used in large cities. Because the operational test platform was codesigned with the relevant departments, the physical process and control system are very similar to a real system. As shown in figure 5, the water purification process consists of six subprocesses, P1 through P6 [31]. Each stage is controlled by a corresponding pair of programmable logic controllers (PLCs), one primary and one standby. Between the different stages, the PLCs communicate through a separate network called the L1 network. In addition, on the SWaT platform, all network communications between the PLCs, sensors, and actuators use the industrial Ethernet/IP (ENIP) and common industrial protocol (CIP) stacks. The first process involves raw water supply and storage; P2 is pretreatment, in which the water quality is assessed; unwanted material is removed by ultrafiltration (UF) backwashing in P3; the remaining chlorine is destroyed during dechlorination (P4); the water from P4 is then pumped into a reverse osmosis (RO) system (P5) to reduce inorganic impurities; and finally, P6 stores the water for distribution. The test platform ran 24 hours a day for 11 days. The collected data contain 51 tags (25 sensors and 26 actuators), whose names define their roles: for example, MV represents a motorized valve, P represents a pump, FIT a flow meter, and LIT a level transmitter. The dataset contains a total of 36 types of attacks. The target of one of them was to reduce the performance from a nominal level (for example, 5 gallons per minute) to a low value.
This attack can be initiated by damaging the LIT401 sensor, which measures the water level in the P4 RO feed tank. By attacking LIT401, the attacker lowered the level of the RO feed tank from 800 mm to 200 mm, causing PLC-4 to stop pump P401 and reducing the water pumped into P5. Finally, the attack on sensor LIT401 had a negative effect on the output water flow of the RO device (measured by FIT501 in P5). According to the system specifications, this flow must be maintained at approximately 1.2 m³/h, which yields nearly 5 gallons/min of treated water. Thus, during the observation period, the amount of treated water decreased. Table 1 summarizes some general information about this dataset and the KDDCUP99 dataset, which is commonly used in other studies. It is worth noting that the SWaT dataset differs significantly from the KDDCUP99 dataset: a) Depending on the scenario, different attacks may last for different durations. In fact, some attacks did not take effect immediately, and the time the system needs to stabilize also varies from attack to attack. Simple attacks, such as those aimed at changing flows, allow the system to stabilize in a short period of time, while attacks with a greater impact on system dynamics require more time. b) Attacks on one sensor/actuator may affect the performance of other sensors/actuators or the entire system after a certain time delay. c) Similar types of sensors/actuators tend to respond to attacks in similar ways. For example, attacks on the LIT101 sensor (a water-level sensor in stage P1) and on both LIT101 and LIT301 (another water-level sensor in stage P3) caused significant abnormal spikes but had a relatively small impact on the readings of other sensors and actuators (such as flow sensors and electricity meters).
Therefore, compared with KDDCUP99, the SWaT dataset is closer to a real industrial control system, and the spatiotemporal relationships in the dataset are more complicated. It is thus very important to adopt a multivariate method for anomaly detection in system modeling, because the overall performance changes of the different subprocesses can help better identify attacks. In other words, the potential associations between the sensors and the actuators may help detect system behavior anomalies caused by an attack.

C. SETUP
The model was trained and tested on two Intel i7-6700K workstations (32 GB of RAM) with an NVIDIA 1080 GPU (8 GB). The model and anomaly detection algorithm were implemented using Google's TensorFlow framework [32]. We used low-level APIs to facilitate fine-grained control of the network architecture.

D. DATA PREPROCESSING
Generally, in machine learning, the data are normalized to a Gaussian distribution with a mean of 0 and a standard deviation of 1. However, to ensure that the distribution of most tags in the normal portion of the SWaT dataset is neither distorted nor split into multiple spikes, we chose min-max normalization, with minimum and maximum values of 0 and 1, respectively. Min-max normalization also gives our model structure generalizability across different types of data. We therefore normalized the original data into standard 51-dimensional data. After normalization, the normal data in the dataset were used for training (80% as training data and 20% as validation data), and the abnormal data containing attacks were used as test data.
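Min-max normalization of one feature column can be sketched as follows (the level-transmitter readings are hypothetical; in the real preprocessing this is applied to each of the 51 tags):

```python
def min_max_normalize(column):
    """Scale a feature column into [0, 1] using its min and max."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant tags map to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

lit101 = [500.0, 520.0, 810.0, 805.0, 495.0]   # hypothetical readings (mm)
print(min_max_normalize(lit101))
```

Note that the min and max must be taken from the training data only and reused on the test data; otherwise, attack-induced extremes would leak into the scaling.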

E. SYSTEM ARCHITECTURE
For the model proposed in this paper, the first layer of the 1D_CNN network structure uses kernel = 2 and 32 filters, and each subsequent 1D convolutional layer has twice as many filters as the previous layer. GRU or LSTM networks with a depth of 3 and 100 internal hidden units were used. Since the data in the SWaT dataset were recorded once per second, we set the window length to T = 120 (that is, the data collected within 2 minutes). To capture the relevant dynamics of the data, the window (T = 120) was applied to the normal dataset and the test dataset, with a slide length of 10 on the normal dataset and 120 on the test dataset.
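The windowing scheme can be checked with a small sketch (the record count of 1200 is illustrative; the real dataset has hundreds of thousands of one-second records):

```python
def sliding_windows(n_records, window=120, stride=10):
    """Start indices of fixed-length windows over a record stream.

    The paper uses T = 120 with stride 10 on the normal set (overlapping
    windows, more training samples) and stride 120 on the test set
    (non-overlapping windows).
    """
    return list(range(0, n_records - window + 1, stride))

train_starts = sliding_windows(1200, window=120, stride=10)
test_starts = sliding_windows(1200, window=120, stride=120)
print(len(train_starts), len(test_starts))
```

The overlapping stride on the normal data multiplies the number of training windows roughly twelvefold, while the non-overlapping test stride means each test record is scored exactly once.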

F. TRAINING
To avoid the occasional local minimum problem in training the neural network, we conducted multiple independent training runs and selected the best result. Since the neural network was trained by stochastic gradient descent and was mostly initialized randomly, the results of each run differed. We used a batch size of 4096 during model training. The experimental results in this paper are all based on the optimal results of the corresponding model. First, we analyzed the difference between the monolayer GRU and LSTM in model training after each was combined with the 1D_CNN. The results are shown in figure 6 and figure 7. These analyses gave us the loss function and accuracy curves of the two models on the training and validation sets during the training stage.
The experiment shows that in the monolayer case, the loss function of the model using LSTM is not significantly different from that of the model using the GRU in the first few rounds of training. As the number of training iterations increases, LSTM and the GRU reach convergence at almost the same time. Therefore, for the model combining a single-layer GRU or LSTM with the 1D_CNN, the difference in training is not very large, and the GRU can achieve the same training effect as LSTM over multiple training runs.
Experiment 2: Multilayer LSTM and GRU combined with a single-layer 1D_CNN

Then, we increased the GRU and LSTM to three layers to analyze the impact of the number of layers on model training. The results are shown in figure 8 and figure 9.
The experiment shows that the multilayer model trains faster than the single-layer model and converges faster. Under optimal conditions, the loss function values of the multilayer GRU and LSTM models were similar, but the GRU model's loss function declined faster than the LSTM model's. By the fourth round, the GRU model had reached relative stability, while the LSTM model still showed small fluctuations in practice. In addition, in terms of training accuracy, the model using the GRU was more stable during training than that using LSTM and achieved higher accuracy.

Experiment 3: Multilayer LSTM and GRU combined with a multilayer 1D_CNN
After analyzing the multilayer GRU and LSTM, we tried to increase the number of layers of the 1D_CNN to analyze whether the increase in the number of layers of the convolutional layer would affect the above results. We also wanted to know whether increasing the number of 1D_CNN layers would have a significant positive effect on the model. The results are shown in figure 10 and figure 11.
The above experimental results show that increasing the number of 1D_CNN layers can indeed improve the model. Specifically, it enhances the stability of the model, and the loss function becomes smoother during training. Even so, the GRU's loss curve remains smoother than LSTM's, with smaller fluctuations. Since multilayer networks are usually used when designing models, GRU networks are more suitable for the training phase in this paper.
Experiment 4: Selection of the number of 1D_CNN layers

As shown in the above experiments, a multilayer 1D_CNN yields better anomaly detection models than a single-layer 1D_CNN. Therefore, we further determined how many 1D_CNN layers should be used to build the detection model. The results are shown in figure 12 and figure 13. With a 2-layer 1D_CNN, although the loss function declines more stably, the convergence of the model is significantly worse than with a 4-layer or 8-layer 1D_CNN. The loss function and accuracy of the model with the 8-layer 1D_CNN achieve the best results, but its stability is worse than that of the 4-layer model, with the loss function and accuracy fluctuating significantly. In the training phase, the performance of the 4-layer 1D_CNN is not much different from that of the 8-layer 1D_CNN. Therefore, this paper used a 4-layer 1D_CNN to build the abnormal state detection model of the industrial control system.

Experiment 5: Experimental results for different types of networks
We tested currently popular CNN-based industrial control system anomaly detection methods on the SWaT dataset and compared the time spent by the different models in training and testing on the corresponding datasets for the different processing stages. Tests were also performed to compare the F1 scores of the different models during the different processing stages of SWaT.
Looking at figure 14 and figure 15, we can see that the model using the 1D_CNN combined with the GRU requires more training and testing time than the 1D_CNN model alone; however, the difference in testing time is small. The testing time of the model using the 4-layer 1D_CNN combined with a 3-layer GRU is less than 20 minutes longer than that of the model using the 4-layer 1D_CNN alone. On the other hand, Table 2 shows that the model using the 2-layer 1D_CNN has the smallest F1 score in the five process stages on SWaT and the worst detection effect. The F1 score of the model with the 8-layer 1D_CNN in these 5 process stages is higher than that of the models with the 2-layer and 4-layer 1D_CNN. However, in the P2, P3, P4 and P5 stages, its F1 scores are lower than those of the model combining the 4-layer 1D_CNN and 3-layer GRU, and its detection effect is not as good; only in the P1 stage is its performance better. As for the attacks in the P2 stage, there are only four such attacks in the dataset (numbers 6, 24, 29 and 30). According to the dataset description, attack 29 did not achieve the desired effect, and attack 24 did not have the expected impact. Thus, we can conclude that the relatively low F1 score for the P2 stage is caused by the test setup and that our model successfully detected the attacks that were actually carried out.
Experiment 6: Results of the model in the testing stage

To test the model proposed in this paper, we calculated its precision, recall rate and F1 score when applied to attack detection in industrial control systems and compared it with several traditional abnormal state detection methods. The results are shown in Table 3.
The experiment shows that the model based on the SDA and 1D_CNN greatly improves the precision and recall rate compared with the traditional SVM: the precision reaches 99.6% with a recall rate of 85.3%, so the F1 score is increased by 11.7% over the SVM. Compared with the DNN-based model, the precision of the SDA and 1D_CNN model is increased by 1.3%, and the recall rate is much better, increasing by 17%, so the F1 scores of the two models differ by nearly 11%. For TABOR and the MLP, an improvement in one of precision and recall always comes with a decrease in the other, so their F1 scores are not obviously improved. Compared with the sequence-to-sequence neural network proposed in [35], the precision, recall rate and F1 score increase by at least 2%, 6% and 4%, respectively. In addition, when the GRU is adopted on top of the SDA and 1D_CNN model to predict the traffic, the precision and recall rate improve compared with LSTM, and in terms of time consumption, the model using the GRU is also superior to the model using LSTM.
We also estimated the AUC of our anomaly detection method using per-record detection metrics. The reason for using per-record labels for the AUC is that they define the number of negative instances, which is problematic for per-attack detection (it is impossible to measure how many non-attacks there are). Figure 16 shows that the AUC of the anomaly detection method is high, reaching 0.976. Table 4 shows that the detection performance of the models based on the SVM, DNN, TABOR or MLP is very poor for some attacks (such as attacks 1, 3 and 12), while our method performs better on most attack detection tasks. The recall rates of our method and of the method proposed in [35] are higher than those of the other methods in detecting most attacks, but our method performs better on some attacks, such as attacks 3, 14, 16 and 19. The water level sensors (e.g., LIT101, LIT301 and LIT401) have the best anomaly detection performance, because, besides the direct attacks on those sensors, most attacks applied to the other sensors/actuators also affected the water levels indirectly. For attack 14, the chemical accumulation in P5 is relevant for anomaly detection after chlorine has been added to the system in P3 and P4, so the recall rate for attack 14 is higher. The targets of attack 16 are MV101 and LIT101. For attack 19, although the targets are P203 and P205, it impacts MV101 and LIT401, so we can detect it through MV101. In particular, our method achieves a high recall rate for attacks that have low recall rates under SVM- or DNN-based detection (such as attacks 1, 2 and 3), thus improving the recall rate of multitype attack detection.
We found that only four attacks were consistently undetected: attacks 4, 10, 11 and 24. Reviewing the descriptions of these tests, we found that these attacks failed to have their expected impact on the system, so it is reasonable that they went undetected. Attack 4 opens MV-504, which does not exist in the dataset, and the SWaT attack description indicates that this attack had no impact. Attack 10 attempted to close MV-304, but MV-304 was scheduled to close later than when the attack occurred; in the SWaT dataset, MV-304 did not change. According to the dataset description, attack 11 failed because Tank 301 was already full. For attack 24, P-201, P-203 and P-205 reportedly did not start because of mechanical interlocks, and nothing changed in the dataset. We therefore conclude that the low recall for these particular attacks (Nos. 4, 10, 11 and 24) is caused by the test setup and that our model detected the attacks that were actually carried out.
The SDA achieves better feature learning and dimensionality reduction of the eigenvectors, eliminating some features, and the 1D_CNN performs well on time series. Combined with LSTM or the GRU, the features can be predicted more accurately. However, compared with the complex gate structure of the LSTM model, the GRU is simpler and achieves better results than LSTM in this paper. Therefore, the model combining the SDA, 1D_CNN and GRU can be applied to the abnormal state detection of industrial control systems and achieves a better detection effect, greatly improving the precision and recall compared with traditional methods.
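To make the gate comparison concrete, here is a minimal scalar GRU step in plain Python with toy weights (not the trained model): the GRU needs only an update gate and a reset gate, whereas LSTM maintains three gates plus a separate cell state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU step for scalar input x and hidden state h.
    `w` holds the six weights; biases are omitted for brevity."""
    z = sigmoid(w["wz"] * x + w["uz"] * h)                 # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)                 # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde                       # interpolate

# Illustrative weights and a toy input sequence -- not trained parameters.
w = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 1.0, "uh": 1.0}
h = 0.0
for x in [0.2, -0.1, 0.4]:
    h = gru_step(x, h, w)
print(round(h, 4))
```

The single interpolation `(1 - z) * h + z * h_tilde` replaces LSTM's separate forget/input/output gating, which is why the GRU has fewer parameters and trains faster.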

VI. CONCLUSIONS
This paper studied the use of autoencoders and neural networks to detect attack behaviors in an ICS and tested the effects of a series of model structure changes. To fully exploit the spatiotemporal correlations and other dependencies among the multiple variables (sensors/actuators) of a system, a model combining the SDA with the 1D_CNN and GRU was introduced. Applied to abnormal state detection on the SWaT dataset, it successfully detected multiple types of attacks at the physical control level of the industrial control system and achieved higher precision and recall, thus improving the F1 score of the detection system. Using the SDA for feature learning served two purposes: the data features were fully learned, with some irrelevant features eliminated, and the eigenvectors originally obtained through the OHE were dimensionally reduced to avoid the dimension explosion caused by a large amount of data. To let the 1D_CNN fully learn the potential dependencies between features, the eigenvector output by the SDA was expanded through a fully connected layer. In the 1D_CNN structure, a 1D convolution layer processed each feature along the time dimension, and the features were predicted better by combining the 1D_CNN with the GRU.
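The one-hot encoding (OHE) step above, which the SDA then compresses, can be sketched as follows; the actuator state names are hypothetical examples, and the dataset's actual categories may differ:

```python
def one_hot(value, categories):
    """Map a categorical actuator state to a one-hot vector."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

# Hypothetical valve states for illustration.
states = ["CLOSED", "TRANSITION", "OPEN"]
print(one_hot("OPEN", states))  # -> [0, 0, 1]
```

Concatenating such vectors for every actuator is what inflates the input dimension, motivating the SDA's dimensionality reduction.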
Although the anomaly detection system proposed in this paper achieves a good detection effect, it does not consider the situation in which the detection system itself may be deceived by an adversarial attack. Designing an industrial control anomaly detection system that not only achieves a good detection effect but also resists a higher degree of data disturbance, with a stronger optimization strategy to effectively resist such attacks, will therefore be an interesting direction for future work. In addition, because of the use of the SDA, the number of model parameters was relatively large, which makes training difficult; a better method or model for learning features could be sought. For the optimization of data features, the relationships between networks could also be considered during learning. Finally, the structure obtained after training could be interpreted more thoroughly, enabling a more principled design of the model and of rule-based detection.