Siamese Neural Network Based Few-Shot Learning for Anomaly Detection in Industrial Cyber-Physical Systems

With the increasing popularity of Industry 4.0, AI and smart techniques have been widely applied and have become hotly discussed topics in industrial cyber-physical systems (CPS). Intelligent anomaly detection for identifying cyber-physical attacks, so as to guarantee work efficiency and safety, is still a challenging issue, especially when only a few labeled data are available for cyber-physical security protection. In this article, we propose a few-shot learning model with Siamese convolutional neural network (FSL-SCNN) to alleviate the over-fitting issue and enhance the accuracy of intelligent anomaly detection in industrial CPS. A Siamese CNN encoding network is constructed to measure distances between input samples based on their optimized feature representations. A robust cost function design including three specific losses is then proposed to enhance the efficiency of the training process. Finally, an intelligent anomaly detection algorithm is developed. Experiment results based on a fully labeled public dataset and a few labeled dataset demonstrate that the proposed FSL-SCNN can significantly reduce the false alarm rate (FAR) and improve the F1 score when detecting intrusion signals for industrial CPS security protection.


I. INTRODUCTION
CYBER-PHYSICAL system (CPS), which can usually be divided into three layers, namely the physical layer, transmission layer, and application layer, is a multidimensional complex system integrating computation, physical processing, and networking. With the rapid development of Industry 4.0, signals and messages exchanged through networks based on the industrial Internet of Things (IIoT) empower the functionality and efficiency of CPS in industrial environments [1], including real-time perception, dynamic control, and information services for large-scale engineering systems. However, the diversity of CPS applications deployed across IIoT networks makes them vulnerable to both cyber and physical attacks at different levels of the system, especially for message transmissions in smart manufacturing processes.
Currently, due to the new characteristics of different attacks in industrial CPS, it has become necessary to develop advanced intelligent computing, communication, and control technologies to deal with cyber-physical security issues [2]. Typically, the possibility of an industrial CPS being compromised by various attacks grows with the number of physical sensors and I/O interfaces. For example, in 2015, the Ukrainian State Electric Power Department suffered a malicious code attack that resulted in a power outage, which has been viewed as a typical case of a cyber security shortcoming [3]. The openness of modern information and communication technology makes cyber-physical security a significant issue in developing industrial CPS. In particular, intelligent anomaly detection has become a significant way to identify both cyber and physical attacks across the whole network for security protection.
Modern AI technologies, including intelligent sensing, smart control, etc., are widely used for behavior monitoring in smart manufacturing. However, there are still several challenges when detecting abnormal signals in industrial CPS. First, the hybrid cyber-physical environment constructed with a cloud infrastructure is a large and complicated distributed system, so a large volume of industrial data streams (e.g., instruction, accelerometer, video, image, etc.) is generated by a variety of physical systems and sensors. To alleviate the damage caused by malicious attacks in industrial CPS, real-time anomaly detection with high accuracy and timeliness is required, to facilitate the surveillance of overall performance based on the data streams obtained and transferred from different levels of distributed nodes across the system. Another critical issue in industrial CPS is that such abnormal events occur rarely in the real world. The low occurrence probability of these anomalous activities results in a lack of well labeled data for model training. In addition, missing surveillance data, which may be caused by different factors such as sensor failure or data transfer errors, is a common problem in most industrial systems, and makes automatic data collection and model training for intelligent anomaly detection even more difficult. Since conventional learning techniques mainly depend on a large labeled training database, it becomes more challenging to handle the above problems in real-time surveillance and anomaly detection tasks [4]. Therefore, intelligent strategies need to be designed and developed to deal with the time-consuming issues, especially when multiple sensors are used to extract samples at different frequencies during a more complex data fusion process in industrial CPS [5].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Few-shot learning is an emerging learning paradigm aiming at tackling issues on the lack of training data, which enables models to identify novel categories with only a few sample data provided for them. The key point is that the training sample needs to be carefully selected in order to perfectly match the inference during the testing phase. Each step is designed to simulate a small-sample learning task by subsampling classes and data points (e.g., sampling five classes at one time, each of which is with five labeled samples). To complete the few-shot learning task, a well-trained feature extractor should be devised, and an effective classifier is essential to mine rich information from a small number of labeled samples.
In this article, we propose a Siamese neural network based few-shot learning model to deal with the cyber-physical security protection issue with few labeled data. In particular, a Siamese convolutional neural network (CNN) is constructed to improve high-dimensional feature learning, and to further facilitate the identification of novel classes, based on an optimized relative-feature representation. A robust cost function with three specific losses is designed to enhance the training efficiency, and an improved algorithm is developed for intelligent anomaly detection in industrial CPS. The main contributions of this article can be summarized as follows.
1) A few-shot learning model based on Siamese CNN is designed with a relative-feature representation scheme, which can alleviate the over-fitting issue especially when coping with few labeled data in industrial CPS. This model is later referred to as the few-shot learning model with Siamese convolutional neural network (FSL-SCNN).
2) A robust cost function is introduced, which combines the transforming loss in relative-feature representation, the encoding loss during the CNN encoding process, and the prediction loss based on the distances between the anchor sample and the positive and negative samples, and which may significantly enhance the efficiency of the training process.
3) An intelligent detection algorithm is developed based on a transformed lower dimensional feature representation, which can be applied for anomaly detection from a large amount of industrial CPS data with few labeled samples.
The rest of this article is organized as follows. Section II presents an overview of related works on few-shot learning techniques for malicious attack detection. The proposed FSL-SCNN and the anomaly detection implementation are discussed in Section III. Section IV demonstrates the results of the performance evaluation for the proposed method. Section V concludes this article.

II. RELATED WORK
In this section, several issues relating to this article in intelligent industrial systems, including analytics on anomaly detection techniques for CPS, and models on few-shot learning in industrial applications, are discussed respectively.

A. Anomaly Detection Techniques for CPS
In recent years, researchers have made great efforts to tackle the vulnerability and security issues of CPS, which have been implemented in a variety of applications including data acquisition, surveillance, and industrial control systems. There are various kinds of cyber and physical attacks which may affect the reliability and security of CPS. For example, to explore vulnerabilities in industrial CPS, Alan et al. [6] considered a covert attack for service degradation, and introduced a backtracking search optimization algorithm to deal with the system identification attack in cyber-physical control systems. Beg et al. [7] focused on the false-data injection attack, and designed a detection framework to identify changes based on a set of candidate invariants inferred from Simulink/Stateflow diagrams in cyber-physical dc microgrids. To cope with a typical type of denial-of-service (DoS) attack in CPS, Sun et al. [8] proposed a resilient control strategy with a dual-mode algorithm, which could be used for the optimization problem in model predictive control without the consideration of model uncertainties and measurement noises.
Particularly, anomaly detection has been analyzed extensively for purposes of cyber security and system reliability. It becomes crucial to develop appropriate security protection frameworks or control systems to tackle vulnerabilities in industrial CPS under different attack scenarios. Several AI-based approaches have been investigated for cyber-physical security protection including attack identification, fault detection, and tolerant control [9]. Kim et al. [10] analyzed the cyber-physical vulnerability, and presented a software-defined networking-based architecture for man-in-the-middle attack. They applied it in a specific communication-based train control system, to improve the resiliency for attack detections. Li et al. [11] built a dual deep learning (DL) model with an energy auditing mechanism, to monitor and identify cyber and physical attacks in IoT environments. They designed a disaggregation-aggregation structure to learn the system behaviors for the attack detection. The disaggregation part was used to analyze the energy consumption for cyber-attack identification, and the aggregation part was used to measure the power consumption for physical attack identification. Pearce et al. [12] introduced a framework to prevent different kinds of cyber-physical attacks based on the runtime enforcement, in which the bidirectional timed policies were specified in an industrial CPS application.
Previous research has shown the success of applying DL techniques to identify a variety of cyber-physical attacks. However, conventional supervised learning models rely heavily on prior knowledge and well labeled training samples, so they may have difficulty handling the over-fitting issue, and may even perform poorly when detecting new categories from only a few samples in intelligent industrial environments.

B. Few-Shot Learning in Industrial Applications
Few-shot learning is an emerging type of transfer learning technique. By reusing the transferable knowledge of existing models, a classifier can be built to identify a novel category using only a few labeled training samples [13]. Along with the popularity of DL, few-shot learning has increasingly drawn attention in modern industrial applications. Gu et al. [14] built a recognition network to deal with the few-shot density problem for industrial safety and environmental protection. They employed the model-agnostic meta-learning algorithm to optimize the initial parameters, which could achieve better classification results with only a small number of gradient steps in flare soot applications. Sun et al. [15] constructed a feature fusion model based on the so-called focus-area location and high-order integration for few-shot tasks. Few-shot learning was utilized to identify similar regions and extract more discriminative features. Perez-Cabo et al. [16] proposed a deep metric learning method for the generalized presentation attack detection problem, in which a triplet focal loss was defined to regularize a new "metric-softmax" loss. They used few-shot learning to improve the feature representation and distinguish attacks using only the image data. Huang et al. [17] developed a few-shot learning model for imbalanced data problems. They designed a gated network structure to analyze the known types and unknown types in anomaly detection, and tested their method in identifying new anomaly types for few-shot learning tasks. Chowdhury et al. [18] introduced a DL approach to few-shot intrusion detection. A CNN model, a linear support vector machine, and a one-nearest neighbor classifier were integrated in a training model for new feature representations. They argued that their method could be used to identify some minority attack types. Shen et al. [19] presented a machine learning-based framework for resource management in wireless communications. The idea of few-shot learning was used in a self-imitation mechanism, which could optimize a new task with a few unlabeled samples based on a pre-trained learning model. Lu et al. [20] defined two types of outliers, the representation outlier and the label outlier, and constructed an attentive profile network model for outlier suppression based on few-shot learning using user-provided data.
Compared with previous research, in this article, we construct a few-shot learning model to overcome the overfitting issue, in which a Siamese CNN structure is designed and constructed to alleviate the loss of key features. The proposed model can be applied to enhance the anomaly detection performance for security protection in industrial CPS.

III. FEW-SHOT LEARNING MODEL WITH SIAMESE CNN IN CPS
In this section, we first introduce the system architecture for cyber-physical security protection in industrial CPS. A few-shot learning framework is constructed and presented with a CNN-based Siamese network. An intelligent anomaly detection algorithm is then developed based on a relative-feature representation scheme and a robust cost function design.
Fig. 1 illustrates a typical architecture for security protection in AI-enhanced industrial CPS. Usually, attackers may hack into the CANbus network and send malicious code to compromise systems. The supervisory control and data acquisition (SCADA) system is involved to monitor and collect signals (e.g., vibration, temperature, and TX&RX packet data) generated across the cyber network, in which the DL-based anomaly detection module is deployed to identify anomalies. Since it is costly to collect a large enough set of anomaly samples for traditional model training, a Siamese CNN encoding network model is designed to facilitate real-time analysis based on few-shot learning, which can improve intelligent anomaly detection with higher efficiency and accuracy in industrial CPS.

A. Problem Definition
Given an anomaly detection problem in industrial CPS, two general datasets, $D_{nor}$ and $D_{ano}$, are taken into consideration to indicate the normal and anomaly samples, respectively. $D_{nor} = \{(x_i^{nor}, y_i^{nor}) \mid i = 1, 2, \ldots, N_{nor}\}$ contains $N_{nor}$ labeled normal samples, in which $x_i^{nor}$ is the data sample and $y_i^{nor}$ is the corresponding class label. Likewise, $D_{ano} = \{(x_i^{ano}, y_i^{ano}) \mid i = 1, 2, \ldots, N_{ano}\}$ contains $N_{ano}$ labeled anomaly samples, in which $x_i^{ano}$ is the data sample and $y_i^{ano}$ is the corresponding class label. We assume $N_{nor} \gg N_{ano}$ to describe the few-shot learning scenario. Thus, a set of samples from $D_{ano}$ is selected to form the support set in each training episode, and the corresponding query set $Q$, which is used to indicate the unobserved samples of novel classes between different episodes, can be described as $Q = \{(x_j^{ano}, y_j^{ano}) \mid j = 1, 2, \ldots, N_q\}$. In summary, in each episode, we randomly select $K$ malicious attack classes, each of which includes $C$ labeled samples, to form the $K$-way $C$-shot learning problem, aiming to enhance the generalization of the detection capability of our model, especially for novel attack identification.
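The episode construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_episode`, the dictionary layout of the anomaly pool, the number of query samples per class, and the choice to draw the query samples from the same classes as the support set are all assumptions made for the example.

```python
import random

def sample_episode(data_by_class, k_way, c_shot, n_query, rng):
    """Build one K-way C-shot episode from {class_label: [samples]}.

    Returns (support, query) as lists of (sample, label) pairs.
    Hypothetical helper: query samples are drawn from the same K classes
    and never overlap with the support samples.
    """
    classes = rng.sample(sorted(data_by_class), k_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(data_by_class[label], c_shot + n_query)
        support += [(x, label) for x in picked[:c_shot]]
        query += [(x, label) for x in picked[c_shot:]]
    return support, query

# Toy anomaly pool: 4 hypothetical attack classes, 10 random 3-D samples each.
rng = random.Random(0)
pool = {
    f"attack_{m}": [[rng.random() for _ in range(3)] for _ in range(10)]
    for m in range(4)
}

# A 2-way 5-shot episode with 3 query samples per class.
support, query = sample_episode(pool, k_way=2, c_shot=5, n_query=3, rng=rng)
print(len(support), len(query))  # 10 6
```

Resampling the classes in every episode is what exposes the model to many small classification tasks, which is the mechanism the K-way C-shot formulation relies on.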

B. Few-Shot Learning Framework for Anomaly Detection
The proposed FSL-SCNN is designed to tackle the lack of adequate labeled anomaly samples in our detection tasks. Differing from conventional classification models, our FSL-SCNN does not predict the class of an input sample directly, but calculates the distance between input samples in terms of their optimized feature representations. In particular, a CNN-based Siamese network is constructed to cope with the few-shot learning problem, so that novel classes can be identified even with only a few supporting samples. The framework of FSL-SCNN for anomaly detection in CPS is illustrated in Fig. 2.
As shown in Fig. 2, to train this DL model, two input samples (i.e., one from the support set and one from the query set) for each class are sent into two identical CNNs simultaneously. A relative-feature representation scheme is applied to transform their original features into a lower dimensional representation, which can help the neural network alleviate the overfitting issue, and consequently enhance the detection performance. In the Siamese network, two combinations of a convolution layer and a pooling layer are introduced to extract feature embeddings. During the testing process, the distance between these two feature embeddings is calculated to identify whether the two input samples belong to the same class.
Given $x_i$ as one input sample sent to the FSL-SCNN, the feature embedding $f(x_i)$ extracted by the Siamese CNN can be represented as follows:

$$f(x_i) = \mathrm{CNN}(x_i; \theta_{encoding}) \tag{1}$$

where $\theta_{encoding}$ is the encoding parameter of the CNN. The distance between two feature embeddings from two input samples $x_i$ and $x_j$ is defined and calculated based on the pairwise Euclidean distance, which can be described as follows:

$$d(f(x_i), f(x_j)) = \lVert f(x_i) - f(x_j) \rVert_2 \tag{2}$$

Finally, the output of the FSL-SCNN is generated based on the fully connected layer and SoftMax layer, which can be expressed as follows:

$$P(x_i, x_j) = \mathrm{SoftMax}(\mathrm{FC}(d(f(x_i), f(x_j)))) \tag{3}$$

where $\mathrm{SoftMax}(\ast)$ indicates the SoftMax function and $\mathrm{FC}(\ast)$ indicates the function of the fully connected layer. $P(x_i, x_j)$ represents the probability that $x_i$ and $x_j$ belong to the same class.
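To make the distance-then-SoftMax output step concrete, the sketch below takes two already-computed embeddings and turns their Euclidean distance into a same-class probability. The CNN encoder itself is omitted, and the two-unit fully connected layer with the weights `W` and biases `B` is a hypothetical stand-in chosen only for illustration, not the trained layer of the model.

```python
import math

def euclidean(u, v):
    """Pairwise Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def softmax(z):
    """Numerically stable SoftMax over a list of logits."""
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def same_class_probability(emb_i, emb_j, weights, biases):
    """SoftMax(FC(d)): a 2-unit FC layer applied to the scalar distance.

    Index 0 of the SoftMax output is read as "same class".
    """
    d = euclidean(emb_i, emb_j)
    logits = [w * d + b for w, b in zip(weights, biases)]
    return softmax(logits)[0]

# Hypothetical FC parameters: the "same" logit falls with distance,
# the "different" logit rises with it.
W, B = [-2.0, 2.0], [1.0, -1.0]

p_near = same_class_probability([0.1, 0.2], [0.1, 0.25], W, B)
p_far = same_class_probability([0.1, 0.2], [2.0, -1.5], W, B)
print(p_near > 0.5, p_far < 0.5)  # True True
```

Because the decision depends only on the distance between embeddings, a class never seen during training can still be compared against a handful of support samples at test time.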

C. Robust Cost Function Design
To ensure high prediction accuracy and a low false alarm rate (FAR) for anomaly detection from a large volume of industrial CPS data with few labeled samples, three losses are considered in our cost function design. As shown in Fig. 2, the transforming loss $L_{rel}$ arises in the relative-feature representation. The encoding loss $L_{ecd}$ is generated during the CNN encoding process, and is designed to measure the variance between the transformed relative-features and the extracted feature embeddings. The prediction loss $L_{pre}$ is a triplet loss based on the distances between the anchor sample and the positive and negative samples.
Considering that we only have a few samples in the support set, the dimensionality of the original features of the input samples is relatively large compared to the total number of support samples, which usually leads to overfitting and poor generalization performance of the model. Motivated by [21], a relative-feature representation for input samples is applied to reduce the dimensionality of the original features. Specifically, the transformed features of an input sample, which have a relatively low dimensionality, are calculated based on the distances between the sample itself and all the other samples using (2). Given $n$ samples as the input for the model, $n(n-1)/2$ pairwise distances need to be calculated, from which the relative-feature is obtained as in (4). For example, given four samples, $x_1, x_2, x_3, x_4$, the $4 \times 3 / 2 = 6$ pairwise distances among them are calculated. Therefore, for each input $(x_i, y_i)$, the loss in relative-feature representation, $L_{rel}$, is defined and calculated as in (5), where $d(\ast)$ is the Euclidean distance. $p_m$ is calculated by averaging the samples of class $m$ for relative-feature representations, while its counterpart in the loss is calculated for the corresponding representations in each training episode.
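The dimensionality reduction achieved by a relative-feature representation can be illustrated with a small sketch. It assumes, for illustration only, that the relative-feature of a sample is the vector of its Euclidean distances to every other sample in the episode (the self-distance is dropped); the function name `relative_features` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two raw feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relative_features(samples):
    """Map each sample to its distances to all other samples.

    Each of the n(n-1)/2 unordered pairs is computed once and reused.
    Assumed form: an n-sample episode yields (n-1)-dimensional features.
    """
    n = len(samples)
    dist = {
        (i, j): euclidean(samples[i], samples[j])
        for i, j in combinations(range(n), 2)
    }
    feats = [
        [dist[(min(i, j), max(i, j))] for j in range(n) if j != i]
        for i in range(n)
    ]
    return feats, dist

# Four 5-dimensional samples become four 3-dimensional relative-features.
samples = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.0],
]
feats, dist = relative_features(samples)
print(len(dist))   # 6 unique pairs, i.e. 4 * 3 / 2
print(feats[0])    # [5.0, 1.0, 2.0]
```

With only a handful of support samples, the transformed dimensionality is bounded by the episode size rather than by the raw feature width, which is the property the text relies on to limit overfitting.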
The loss for CNN encoding is employed to measure whether any key information is lost within the Siamese network. Following the encoding process described in (1), the decoding function based on the Siamese CNN is defined as in (6). Since it is difficult to observe the information loss directly during the encoding process, motivated by [22], relative entropy theory is used to measure the loss of information based on the real distribution and the theoretical distribution. Thus, given a probability distribution of an input sample $x_i$, the encoding loss $L_{ecd}$ is designed to minimize the number of feature embeddings, while retaining the key information of features in the original data. It is calculated based on the Kullback-Leibler (KL) divergence as in (7), where $p(x_i \mid f(x_i))$ indicates the real distribution of the sample data, and $q(x_i \mid f(x_i))$ is the calculated distribution, which can be treated as an approximation to $p(x_i \mid f(x_i))$.
Furthermore, distances between the anchor sample and the positive and negative samples are considered to measure the prediction loss $L_{pre}$ based on the Siamese network, as formulated in (8), where $x_i^a$, $x_i^p$, and $x_i^n$ indicate the anchor, positive, and negative samples, respectively. $\alpha \in (0, 1)$ is a coefficient to adjust the FAR in anomaly detection. Usually, it is empirically set as $\alpha > 0.5$ during the training process. The maximum function is used to ensure a minus loss for $L_{pre}$, so that the anchor sample can be more similar to the positive sample than to the negative one, based on this adversarial design.
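The two loss ingredients named above, a KL divergence between a real and a calculated distribution and a triplet loss with coefficient $\alpha$, can be sketched in their standard textbook forms. Both forms are assumptions here: the function names, the direction KL(p || q), the discrete distributions, and the use of $\alpha$ as an additive margin are illustrative choices and may differ in detail from the formulation used in the model.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q[k] > 0 wherever p[k] > 0; terms with p[k] == 0 contribute 0.
    """
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def triplet_loss(d_anchor_pos, d_anchor_neg, alpha):
    """Standard margin-based triplet loss (assumed form), clipped at zero."""
    return max(d_anchor_pos - d_anchor_neg + alpha, 0.0)

# Toy "real" and "calculated" distributions (hypothetical values).
p_real = [0.7, 0.2, 0.1]
q_calc = [0.6, 0.3, 0.1]
l_ecd = kl_divergence(p_real, q_calc)

# Easy triplet: the negative is already farther than the positive by > alpha.
l_pre_easy = triplet_loss(0.2, 1.5, alpha=0.6)
# Hard triplet: the negative is closer to the anchor than the positive.
l_pre_hard = triplet_loss(1.0, 0.5, alpha=0.6)

print(l_ecd > 0.0, l_pre_easy, l_pre_hard > 0.0)
```

In this standard form, the maximum with zero makes the loss vanish once the negative sample is farther from the anchor than the positive sample by at least $\alpha$, so only triplets that violate the margin contribute gradient.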

D. Intelligent Anomaly Detection for Cyber-Physical Security Protection
To pursue an efficient training performance, the cost function $L_{FSL\text{-}SCNN}$ of the FSL-SCNN is composed of a combination of the three losses discussed above, as expressed in (9), where $\tau$ is a balance coefficient to control the encoding loss $L_{ecd}$ during the training process. Specifically, $L_{FSL\text{-}SCNN}$ is designed to tackle the following challenges during the few-shot learning process: retaining the critical information when transforming the original high-dimensional features into a relatively low dimensionality during the relative-feature representation; and enabling the learning model to present reasonable feature embeddings in the Siamese network, thereby alleviating the overfitting problem when the training data is insufficient. The concrete anomaly detection algorithm is shown in Algorithm 1.
The training process via the proposed FSL-SCNN is divided into three steps: feature transformation, feature encoding, and distance comparison. In each training episode, the raw data $x$ is first transformed to $x_i$ based on the relative-feature representation scheme. $x_i$ is then formalized into the structured feature embedding $f(x_i)$ through the CNN encoder. According to a selected anchor sample $x_i^a$, a positive sample $x_i^p$, and a negative sample $x_i^n$, the corresponding classes $y_i^a$, $y_i^p$, and $y_i^n$ are predicted via the constructed Siamese network, respectively. The losses generated during the relative-feature representation, CNN encoding, and prediction processes are calculated using the designed cost function as addressed by (5), (7), and (8). Consequently, the model $M$ is finalized by minimizing the total loss $L_{FSL\text{-}SCNN}$.
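The text states that the total cost combines the three losses and that $\tau$ balances the encoding term. One plausible reading, given here as an assumption rather than as the exact published formula (in particular, the unit weights on $L_{rel}$ and $L_{pre}$ are assumed), is a weighted sum:

```latex
L_{FSL\text{-}SCNN} = L_{rel} + \tau \, L_{ecd} + L_{pre}
```

Under this reading, increasing $\tau$ penalizes information loss in the encoder more heavily relative to the transforming and prediction terms, while $\tau = 0$ would train on the other two losses alone.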

IV. EXPERIMENT AND ANALYSIS
In this section, evaluations are conducted to demonstrate the performance of our proposed method for anomaly detection, compared with other similar mechanisms, based on two different datasets.

A. Dataset and Experiment Design
Algorithm 1
3: for each episode do
4:   Choose k classes with c samples from $D_{nor}$ and $D_{ano}$ to build the support set
5:   Choose k classes from $Q$ to build the query set
6:   for $x_i$ in the support set do
7:     Transform $x_i$ into its relative-feature representation
8:     Calculate the transforming loss by Eq. (5)
9:     Transform $x_i$ into the feature embedding $f(x_i)$ via the CNN encoder by Eq. (1)
10:    Calculate the encoding loss by Eq. (6)
11:    Select an anchor sample $x_i^a$ and predict $y_i^a$ based on $f(x_i^a)$ by Eq. (3)
12:    Select another positive sample $x_i^p$ and negative sample $x_i^n$, predict $y_i^p$ and $y_i^n$ by Eq. (3)
13:    Calculate the prediction loss by Eq. (8)
14:    Update the network to minimize $L_{FSL\text{-}SCNN}$ by Eq. (9)
15:   end for
16: end for
17: end while
18: return M

To investigate the effectiveness of the proposed FSL-SCNN, both a fully labeled public dataset and a few labeled dataset are considered in our experiment evaluation. The fully labeled public dataset UNSW-NB15, generated by the Australian security laboratory for CPS [23], is applied to evaluate the general prediction performance of the proposed method. This dataset is composed of network traffic packets created using the IXIA PerfectStorm tool, including realistic modern normal activity and synthetic contemporary attack behavior packets. It contains nine categories of cyber-physical attacks: analysis, fuzzers, DoS, generic, backdoor, exploit, reconnaissance, worm, and shellcode. The few labeled dataset used in the experiment is generated in an intelligent CPS for smart manufacturing as illustrated in Fig. 1, in which the network transmission packets are collected via the SCADA system, and contain a small number of randomly generated signals with abnormally high or low transmission rates. The average packet amount per second fluctuates around a normal level of 0.05 KB/s. Specifically, the former dataset is used to evaluate the training efficiency and anomaly detection performance of the proposed method, while the latter one is used to investigate the effectiveness of our method in a cyber-attack scenario.
We selected several classical and widely used machine learning methods, and a Siamese model for anomaly detection in CPS, as the baseline methods. Specifically, time series analysis (TSA), which is introduced as a non-machine learning technique, and classical machine learning methods including Naïve Bayes (NB), random forest (RF), and one-shot support vector machine (OS-SVM), are compared in this article. It is noted that OS-SVM is a kernel-based variation of the SVM method with only a one-shot data sample for each class, and is thus selected for comparison with the proposed FSL-SCNN. In addition, a Siamese convolutional autoencoder (SCAE) model [24], comprising twin convolutional autoencoders, is involved in the comparison evaluations as well.
Four widely used metrics, precision, recall, F1, and FAR, are calculated according to whether normal/anomaly signals have been identified correctly or not, in order to demonstrate the performance of the mentioned methods on the fully labeled public dataset. In particular, FAR is an important metric for evaluating anomaly detection performance in CPS, especially on imbalanced datasets. The lower the FAR, the better the model performs in practical scenarios.
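As a worked illustration of the four metrics, the sketch below computes them from binary labels using their standard definitions, with FAR taken as the fraction of normal samples that are wrongly flagged, FP / (FP + TN). That this is the exact FAR definition used in the evaluation is an assumption; the function name `detection_metrics` and the toy labels are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Precision, recall, F1 and false alarm rate for binary labels.

    Convention: 1 = anomaly (positive class), 0 = normal.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "far": far}

# Imbalanced toy example: 8 normal samples and 2 anomalies.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'far': 0.125}
```

The example shows why FAR is informative on imbalanced data: a single false alarm halves the precision here, yet FAR stays at 0.125 because it is normalized by the much larger number of normal samples.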

B. Anomaly Detection Performance Evaluation
We chose stochastic gradient descent (SGD) as the optimizer to train the model. The learning rate was set to 0.1 and we iterated 800 times to investigate the training process in the experiment. The transforming loss, the encoding loss, and the prediction loss obtained in each iteration using UNSW-NB15 are shown in Fig. 3 respectively.
As shown in Fig. 3, all three losses decline quickly and then become relatively stable. The transforming loss and the prediction loss fluctuate considerably during the learning process according to Fig. 3(a) and (c), while the encoding loss drops sharply and tends to stabilize after 200 iterations according to Fig. 3(b). This training result indicates the applicability and suitability of our model in few-shot learning.
Furthermore, to evaluate the feature embedding effect based on the relative-feature representation and CNN encoding in the Siamese network, we investigate all six methods based on the principal component analysis (PCA) result. The visualization comparisons based on UNSW-NB15 are shown in Fig. 4.
The distinct difference in data distributions shown in Fig. 4 demonstrates the imbalance in the dataset, as well as in the corresponding features. In other words, the number of normal samples is much larger than the number of attack samples. It can be observed that feature embeddings based on our proposed FSL-SCNN result in a better clustering performance. The better the clustering performance, the better the effect of the feature extraction. Moreover, compared with the other five methods, the proposed method generates a clear clustering result with few overlaps between the features of the two classes. This result indicates the effectiveness of the combination of relative-feature representation and CNN encoding in reducing dimensionality and retaining the key feature information during the learning process within our Siamese network.
We go further to evaluate the overall performance of anomaly detection, based on precision, recall, F1, and FAR on an imbalanced dataset. The results are compared and given in Table I. According to Table I, we observe that the proposed FSL-SCNN achieves the best results in F1 score and FAR, at 0.936 and 0.047 respectively. Since FAR is an important indicator for evaluating the performance of anomaly detection in CPS, as discussed earlier, this result shows that, owing to the relative-feature representation scheme and the robust cost function designed in our model, the FSL-SCNN can not only distinguish the anomaly signals from the normal ones efficiently, but also reduce the false detection rate in the few-shot learning scenario.
In addition, we investigate the effectiveness of the method in terms of anomaly detection in a real-world cyber-attack scenario. The comparison experiment was conducted based on the few labeled dataset collected in a real CPS as illustrated in Fig. 1. We compared the true attacks and detected anomalies according to the network throughput (bytes per second) captured in the CPS. The evaluation results are illustrated in Fig. 5.
As shown in Fig. 5, we observe the true attacks and detected anomalies respectively, based on the continuous signals generated in the CPS across the timeline from 0 to 1600 s. Anomalies are detected via the proposed CNN-based Siamese network. This can be viewed as a few-shot learning problem because there are only a few attacks within the timeline, as depicted in Fig. 5(a). Compared with the detected anomalies in Fig. 5(b), it is found that most of the cyber-attacks have been effectively identified, which indicates the usefulness of the proposed FSL-SCNN in a real few-shot learning scenario for anomaly detection in industrial CPS.

V. CONCLUSION
In this article, to enhance the cyber-physical security protection in intelligent industrial systems, we proposed the FSL-SCNN to deal with the few labeled and imbalanced dataset generated in industrial CPS for intelligent anomaly detection.
A Siamese CNN encoding network was constructed to measure the distance between input samples based on their optimized feature representations, instead of returning the prediction result directly. The Siamese network structure was capable of identifying novel classes of cyber-physical attacks, even with only a few labeled training samples. To alleviate the overfitting issue, a relative-feature representation scheme was utilized to transform the original features into a lower dimensional representation. A robust cost function design was introduced, in which three specific losses, including the transforming loss in relative-feature representation, the encoding loss during the CNN encoding process, and the prediction loss based on the distances between the anchor sample and the positive and negative samples, were integrated to enhance the training efficiency. An intelligent anomaly detection algorithm was then developed to deal with the few labeled data generated in industrial CPS. Experiments and evaluations based on a fully labeled public dataset and a few labeled dataset demonstrated that the method could significantly improve the F1 score and reduce the FAR compared with other related methods, which indicated the effectiveness of the proposed model in detecting intrusion signals with few labeled samples in industrial CPS environments.
In future studies, we will go further to conduct more evaluations in different situations to improve the algorithm with better accuracy and efficiency.