Least Squares Generative Adversarial Networks-Based Anomaly Detection

Multivariate statistical process control (MSPC) is a technique for detecting anomalies by monitoring several quality characteristics simultaneously. For the MSPC problem, the Hotelling's $T^{2}$ control chart has been widely used as a typical method. Recently, researchers have converted the MSPC problem into a classification problem using approaches such as the artificial contrast (AC) and the one-class classification (OCC). Previous studies have shown that these methods outperform the Hotelling's $T^{2}$ chart when the data do not follow a multivariate normal distribution. However, the AC and the OCC cannot work properly unless a sufficient amount of process data is available. To tackle this problem, in this paper, we propose a novel anomaly detection (AD) approach. The proposed method adopts the least squares generative adversarial network (LS-GAN) to estimate the probability distribution of the training data and generates new training samples from the learned probability distribution. Classifiers such as the random forest (RF) and the one-class support vector machine (OC-SVM) are adopted for the AC and the OCC, respectively. The numerical experiments demonstrate that the proposed approach outperforms the existing methods in terms of the area under the receiver operating characteristic (ROC) curve (AUC).


I. INTRODUCTION
Statistical process control (SPC) is widely used in various industries to monitor and improve output quality. Univariate control charts typically used in the SPC include the Shewhart control chart, the cumulative sum (CUSUM) chart, and the exponentially weighted moving average (EWMA) chart. However, these charts cannot account for the correlation among two or more related quality characteristics, and applying several independent univariate control charts still fails to capture this correlation.
Hence, in order to monitor the correlation between the quality characteristics effectively, multivariate statistical process control (MSPC) has been suggested. The MSPC is a widely known technique for simultaneously monitoring several quality characteristics. Traditionally, the Hotelling's $T^{2}$ control chart, the multivariate CUSUM chart, and the multivariate EWMA chart have been used for the MSPC. The Hotelling's $T^{2}$ statistic is defined as $T^{2} = (X - \bar{X})^{T} S^{-1} (X - \bar{X})$, where $\bar{X}$ indicates the sample mean vector and $S$ is the covariance matrix estimated from the normal data [1]. Following the Hotelling's $T^{2}$ control chart, there have been many attempts at a comparative analysis of diverse multivariate control techniques [2]. However, although the Hotelling's $T^{2}$ control chart has been widely used for many years, it has a limitation in that false alarms are frequently raised if the process data do not follow a multivariate normal distribution. In order to remedy this limitation, the artificial contrast (AC) and the one-class classification (OCC) were proposed.
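As a reference point for the comparisons below, the $T^{2}$ statistic can be computed directly from in-control data. The following is a minimal Python sketch; the simulated data and variable names are illustrative, not from the paper.

```python
import numpy as np

def hotelling_t2(X_train, X_new):
    """Compute Hotelling's T^2 statistics for new samples using the
    sample mean vector and covariance matrix of in-control data."""
    x_bar = X_train.mean(axis=0)                          # sample mean vector
    S_inv = np.linalg.inv(np.cov(X_train, rowvar=False))  # inverse covariance
    diff = X_new - x_bar
    # T^2 = (x - x_bar)^T S^{-1} (x - x_bar), evaluated row by row
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
X_new = np.array([[0.1, 0.2], [3.0, -3.0]])   # second sample is far out of control
print(hotelling_t2(X_train, X_new))           # a large T^2 signals an anomaly
```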
The AC is one of the monitoring approaches for the MSPC problem. The key concept of the AC is to generate artificial data from uniform distributions and create labels (classes) to build a binary classification model. Numerous AC-based monitoring approaches have been proposed over the past few years. Hwang et al. [3] proposed the AC that converts the MSPC problem into a binary classification problem. Hwang and Lee [4] proposed a novel approach that can be applied to extremely imbalanced data with system failures by shifting the artificial data. The cluster-based AC was also proposed for monitoring inhomogeneous multivariate processes without the normality assumption [5].
Besides, the OCC is considered as another method for process monitoring in which the training data contain only the target group (or normal group), and it is used to determine the degree of abnormality of new samples [6]. First introduced by Moya and Hush [7], the OCC, which uses only normal samples to train the model, has been widely used to this day.
The one-class support vector machine (OC-SVM) has been noted as a typical SVM-based example for the OCC. The SVM [8] has been widely used for the binary classification problem, aiming to find the optimal control boundary that maximizes the generalization ability by enlarging the margin between the two classes. Similar to the SVM, the main purpose of the OC-SVM is to construct a control boundary around the positive samples in order to differentiate the outliers (non-positives) from the positive data. SVM-based algorithms to deal with the OCC were proposed by Tax and Duin [9] and Schölkopf et al. [10]. The support vector data description (SVDD), a data description method using kernel functions to solve the OCC, was proposed in [9]. This method differentiates between the two classes using a hyper-sphere, not a hyper-plane, around the positive class samples. As an extension of the classifiers for the OCC, deep learning-based methods for solving the OCC have also been introduced in recent years. The one-class convolutional neural network (CNN), which is inspired by the OC-SVM, was suggested [11]. Also, the deep SVDD, which aims to find the smallest hyper-sphere in the feature space learned by a CNN, was proposed [12].
The AC and the OCC are useful for building control boundaries with classifiers, but they do not work properly when the number of samples is limited. In this case, we assume that newly generated samples from a generative model can improve the prediction performance of the AC and the OCC. So, in this paper, to enhance the detection performance of the AC and the OCC with a limited number of samples, we generate new training samples using the GAN [13]. As a deep learning-based generative model, the GAN is an effective tool for generating new samples that did not exist in the original data set. Beyond the image generation field, the GAN has also been applied to time series [14], [15] and has surged in popularity in recent years.
The GAN consists of two adversarial networks (a generator and a discriminator) that compete with each other in order to generate realistic samples. Through adversarial training, the generator becomes capable of generating new samples by accurately learning the distribution of the input data set. Taking advantage of the GAN's characteristics, Douzas and Bacao proposed a GAN-based oversampling method [16]. They adopted the conditional GAN (cGAN), a variant of the GAN, as an oversampling method and demonstrated that it achieves better prediction performance than other oversampling methods such as the synthetic minority oversampling technique (SMOTE). However, the cGAN is not appropriate for the AC and the OCC because it requires at least two classes.
Hence, we propose the least squares generative adversarial network (LS-GAN)-based anomaly detection (AD) approach to handle data consisting only of the positive class. In other words, our proposed approach is based on an unsupervised learning method in which no target class labels exist, while the cGAN is based on a supervised learning method. The proposed method leverages the LS-GAN [17], a variant of the GAN, to improve the prediction performance of the AC and the OCC. According to the previous study [17], the LS-GAN can generate samples more similar to real samples than the regular GAN [13]. It also has improved learning stability over the regular GAN [17].

Fig. 1 shows an overview of the LS-GAN-based AD. The proposed method consists of three steps. The first step is the training of the LS-GAN; the goal in this step is to estimate the distribution of the data within the input space. The second step is data augmentation to establish control boundaries. Additional samples generated by the trained LS-GAN may contribute to improved anomaly detection performance; in this step, we use the LS-GAN trained in the previous step to generate new samples that do not exist in the training data. The third step is the construction of control boundaries. The control boundary is the boundary that separates normal from abnormal; referring to the second row of Fig. 1, test samples located outside the control boundary are considered anomalous. The proposed method combines off-line and on-line operation and can be seen as a two-stage process of Phase I and Phase II. The goal of Phase I is to establish a control boundary; thus, the proposed method is off-line in Phase I, when the control boundary is not yet established. The goal of Phase II is to monitor on-line data and quickly detect anomalies in the process using the control boundary established in Phase I; at this stage, the proposed method is on-line. The AC control boundary looks better than the OCC control boundary because the AC control boundary can capture the quadratic pattern, while the OCC control boundary cannot. The reason why the AC control boundary outperforms the OCC control boundary is that the AC considers the artificial data as well as the process data.

The definition of the AD encompasses the MSPC, binary classification, and outlier detection. In this study, however, we consider only the MSPC with the simulation data sets and binary classification with the real data sets. Both simulation and real data sets are used for the performance comparison. For the simulation data sets, we apply three sizes of shift (small, medium, and large) in three different directions (x-axis, y-axis, and both). To compare with the proposed method, the kernel density estimation (KDE) and the Gaussian mixture model (GMM) are also applied as generative models. With the three generative models above (the GMM, the KDE, and the LS-GAN), samples enlarged to two, five, and ten times the original sample size are generated. Then, the OCC and the AC are used for detecting anomalies with the generated samples. As a result, we found that the AC with the LS-GAN demonstrates the best performance in terms of the AUC score.
The contributions of this study are as follows. First of all, we improve the prediction performance of the AC as well as the OCC using the LS-GAN in order to tackle the lack of training samples. In our experiments, we find that the AD approaches with the LS-GAN perform better than those with the other generative models. Second, we construct the LS-GAN model to mimic the real training data by setting appropriate parameters. Lastly, we find that the performance of the AC is better than that of the OCC in most of the results. Also, the experimental results show that the AC with the LS-GAN trained on the largest number of training samples has the highest average ranking.
This paper is organized into seven sections. Section II explains the details of the AC and the OCC. The existing generative models are illustrated in Section III. In Section IV, the proposed LS-GAN-based AD is explained. Section V elaborates the experimental settings with the 9 simulation and 5 real data sets, and the comparison results are described in Section VI. We conclude in Section VII.

II. THE AC AND THE OCC
A. THE AC
Proposed by Hwang et al. [3], the AC is one of the nonparametric monitoring approaches for the MSPC problem. Given that the MSPC is designed for simultaneously monitoring several process variables, the AC learns a classifier for defining a control boundary by generating the artificial data. The following is the overall procedure of how the AC works.
The first step is generating the artificial data. When the number of in-control process samples is $n$, the samples are denoted as $\{x_1, x_2, \ldots, x_n\}$. With $n$ samples and $m$ process variables, the process data are represented as an $n \times m$ matrix whose rows are the samples. In addition, a vector of the responses $Y$ is represented as $Y = (y_1, \cdots, y_n)^T$. Then, for each process variable $X_j$, an artificial data set of sample size $t$ is generated from a uniform distribution over the range $[\min(X_j) - s_j, \max(X_j) + s_j]$, where $s_j$ is the sample standard deviation of $X_j$. When $x_i$ is a sample from the process data, $y_i$ is equal to 1, while $y_k$ is 0 when $x_k$ is a sample from the artificial data ($k = 1, \ldots, t$). Finally, the training data can be obtained by combining the process data and the artificial data, as illustrated in Fig. 1, Step 3 (a). After generating the artificial data, a classifier should be determined. According to Hwang et al. [3], two specific classifiers, the random forest (RF) and the regularized least squares classifier, are introduced. As the classifier $V: \{x\} \to \{0, 1\}$ plays the role of a control boundary, the MSPC problem is converted into an inseparable binary classification problem. Here, the Type-I error is defined as the probability that the process is declared out of control even though it is in control. By adjusting the cut-off value, the Type-I error can be controlled.
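To make the procedure concrete, the following is a minimal sketch of the artificial-data generation step, assuming a NumPy array `X` of in-control samples; the ratio $t = 2n$ follows the setting used later in Section V, and the example data are illustrative.

```python
import numpy as np

def generate_artificial_data(X, t):
    """Draw t artificial samples: each variable X_j is sampled uniformly
    from [min(X_j) - std(X_j), max(X_j) + std(X_j)]."""
    s = X.std(axis=0, ddof=1)                     # sample standard deviations
    low, high = X.min(axis=0) - s, X.max(axis=0) + s
    rng = np.random.default_rng(0)
    return rng.uniform(low, high, size=(t, X.shape[1]))

# Combine process data (label 1) and artificial data (label 0)
X = np.random.default_rng(1).normal(size=(100, 2))   # illustrative process data
X_art = generate_artificial_data(X, t=2 * len(X))
X_train = np.vstack([X, X_art])
y_train = np.concatenate([np.ones(len(X)), np.zeros(len(X_art))])
```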
Since the RF has been used to deal with the binary classification problem in order to determine the in-control boundaries [3]-[5], we follow the same approach. The RF represents an ensemble of individual decision trees based on a bootstrap technique [18]. An individual decision tree is a tree-like model that classifies samples; however, overfitting is a major limitation of individual decision trees. In order to mitigate this problem, an ensemble of individual decision trees is suggested. The RF involves three steps, with a sketch of the resulting classifier given below. 1) A bootstrap sample of $n$ observations is drawn from the training data set, and an individual decision tree is constructed for each bootstrap sample. 2) As illustrated in Fig. 2, each decision tree generates an output. 3) The final output is determined by majority voting; that is, by aggregating the votes from the individual decision trees, the final class of the test object is decided.
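Continuing the sketch above, the RF is then fit on the combined data; `X_train` and `y_train` are the arrays built in the previous snippet, and the 0.5 cut-off is an illustrative default that would be tuned to the desired Type-I error.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: combined process (1) / artificial (0) data from the sketch above
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# The predicted probability of the process class serves as the monitoring statistic;
# raising the cut-off flags more points and hence raises the Type-I error
X_test = np.array([[0.0, 0.1], [4.0, 4.0]])   # second point lies far outside
cutoff = 0.5
is_anomaly = rf.predict_proba(X_test)[:, 1] < cutoff
```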

B. THE OCC
The major idea of the OCC is to construct a control boundary that surrounds normal samples, treating the samples outside of the control boundary as abnormal. The detailed concept of this methodology can be easily grasped by comparing it with multi-class classification and binary classification [19]. While the multi-class classification contains training data from several classes, two classes (a positive class and a negative class) are included in binary classification. Unlike these, the OCC has training samples from only the positive class. Therefore, the control boundary formed by the learned classifier in the OCC encircles the positive samples [19]. Fig. 3 illustrates the differences among the three types of classification. As shown in Fig. 3, the OCC is described as classifying properly-characterized positive samples with no negative samples [20]. Since the negative class of binary classification is difficult to obtain in real situations, the OCC has attracted much attention for several decades. A number of OCC classifiers for the SPC have been proposed [21], [22]. For example, a novel control chart for the OCC based on a k-nearest neighbor algorithm was introduced [6]. Another OCC-based control chart using the k-means data description (KMDD) algorithm, called the KM-chart, was proposed by Gani and Limam [23]. Moreover, the OC-SVM, which adapts the SVM to the OCC, was suggested [10].
The OC-SVM is a method designed to solve the OCC [7]. The OC-SVM aims to classify positively-labeled samples by the hyper-plane furthest away from the origin after mapping the original data to a feature space. For a better understanding, the OC-SVM method is illustrated in Fig. 4 [24], [25]. As shown in Fig. 4, the hyper-plane separates the original data set from the origin. The upper part of the hyper-plane is classified as the normal data, and the lower part of the hyper-plane is classified as the abnormal data. Finding the optimal hyper-plane that separates the normal data from the origin is required [24]. The optimization problem can be defined as:

$$\min_{w, \xi, \rho} \ \frac{1}{2}\|w\|^{2} + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \quad \text{s.t.} \quad w \cdot \Phi(x_i) \geq \rho - \xi_i, \ \xi_i \geq 0,$$

where the first term is for regularization to decrease variability, $\rho$ is the distance between the origin and the hyper-plane, $\xi_i$ is the slack variable penalized in the objective function when the $i$th training sample is located inside the boundary, $n$ is the number of training samples ($i = 1, 2, \ldots, n$), and $\nu$ is a trade-off parameter ranging from 0 to 1 that determines the proportion of the penalty. Thus, the second term is the summation of penalties given to the normal data located closer to the origin than the distance $\rho$. $\Phi$ is a mapping function that maps the original data $x_i$ to a kernel space using a kernel function $K(\cdot, \cdot)$. The proper choice of a kernel function is dependent on the number of features (e.g., linear, sigmoid, polynomial, and radial basis kernels) [24]. When the hyper-plane is determined after the optimization problem is solved, a sample is classified as normal when it is above the hyper-plane and as abnormal when it is below it, using the condition $\text{sign}(w \cdot \Phi(x_i) - \rho)$.
By introducing Lagrange multipliers $\alpha_i$, the optimization problem above can be simplified to its dual form:

$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j \, \Phi(x_i) \cdot \Phi(x_j) \quad \text{s.t.} \quad 0 \leq \alpha_i \leq \frac{1}{\nu n}, \ \sum_{i=1}^{n}\alpha_i = 1.$$

The maximization problem obtained from the Lagrangian is altered to this minimization problem by switching the sign. Using the kernel function $K(\cdot, \cdot)$, the inner product of the mapping function, $\Phi(x_i) \cdot \Phi(x_j)$, can be substituted with $K(x_i, x_j)$.
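As a quick illustration of how such a boundary is fit in practice, the following sketch uses scikit-learn's OneClassSVM; the RBF kernel and the data are illustrative assumptions, while $\nu = 0.05$ matches the setting reported later in Section V.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))      # normal (positive-class) samples only

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_train)

X_test = np.array([[0.1, -0.2], [5.0, 5.0]])
print(ocsvm.predict(X_test))             # +1 = inside boundary (normal), -1 = anomaly
print(ocsvm.decision_function(X_test))   # signed distance to the learned boundary
```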

III. THE EXISTING GENERATIVE MODELS
In general, increasing the number of training samples is one of the primary methods for enhancing the performance of classification models. Besides the GAN, the KDE and the GMM, which estimate a probability density function from the observed data, are also used for data augmentation.

A. THE KDE
The KDE is a non-parametric technique for estimating the probability density function of a random variable with a kernel function [26]. In order to alleviate the disadvantage of a traditional histogram (e.g., discontinuity between each bin of a histogram, and fluctuation by the size of bins, etc.), a kernel function is adopted and is able to produce the smooth estimate of the probability density function. The formula for the KDE method is as follows.
$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n}K_h(x - x_i),$$

where $K_h(x) = \frac{1}{h^{d}}K\!\left(\frac{x}{h}\right)$, $n$ is the number of training samples, and $K$ is a kernel function, generally a symmetric function such as a Gaussian [26]. $K_h$ is defined as the kernel function with the scale transformation, since the result depends on the bandwidth $h$, and $d$ indicates the number of dimensions of the feature vectors. Since the scarcity of data in a high-dimensional feature space is the main challenge of the KDE, it is generally used with low-dimensional data.
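For data augmentation, new samples can be drawn from the fitted density. A minimal sketch with scikit-learn's KernelDensity follows; the bandwidth value and the training data are illustrative, as the paper does not report them.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))       # illustrative training data

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X_train)
X_fake = kde.sample(n_samples=2 * len(X_train), random_state=0)  # 2x augmentation
X_augmented = np.vstack([X_train, X_fake])
```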

B. THE GMM
While the KDE is a non-parametric technique, the GMM is a parametric method, where the data are assumed to come from prescribed models determined by parameters. As one of the probabilistic models, the GMM estimates a parametric probability density function in which samples are generated from a mixture of multiple Gaussian distributions. The GMM is able to represent features of a distribution precisely that cannot be captured by a single normal distribution. The GMM can be defined as

$$p(x) = \sum_{i=1}^{n} w_i \, g(x \mid \mu_i, \Sigma_i),$$

where $w_i$ ($i = 1, \ldots, n$) are the mixture weights, and $g(x \mid \mu_i, \Sigma_i)$, $i = 1, \ldots, n$, are the component Gaussian densities [27]. Since the GMM is a mixture of multiple Gaussian distributions, a parametric estimation problem arises in calculating the various mixture components, including the weights, means, and variances. In general, the maximum likelihood estimation (MLE) is used for parametric estimation; in the mixture model, however, the expectation maximization (EM) algorithm is the key to solving the parametric estimation problem [28]. Mainly used with incomplete data or data with missing values, the EM algorithm provides the MLE of the parameters of an underlying distribution from the given data set.
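Analogously to the KDE case, the following sketch fits a GMM by the EM algorithm and samples from it; the number of components and the training data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))       # illustrative training data

# scikit-learn fits the weights, means, and covariances via the EM algorithm
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_train)
X_fake, _ = gmm.sample(n_samples=2 * len(X_train))   # 2x augmentation
X_augmented = np.vstack([X_train, X_fake])
```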

C. THE GAN AND LS-GAN
The GAN is a generative model that generates samples close to the real samples using two adversarial networks: a generator, which generates fake samples, and a discriminator, which distinguishes the original samples from the generated samples. Adversarial training improves the generator over time, until the discriminator can no longer distinguish between real and fake samples. The GAN shows a powerful ability to learn high-dimensional and complex data distributions without any parametric assumptions.
Formally, the generator randomly takes noise samples from a noise distribution, such as a Gaussian or uniform distribution, and maps them to the same data space as the real input data, $G(z): \mathbb{R}^{d} \to \mathbb{R}^{m}$. The discriminator $D(x): \mathbb{R}^{m} \to [0, 1]$ outputs the probability that $x$ is a real sample rather than a sample produced by the generator. The GAN is based on a non-cooperative game, and training the GAN means optimizing the following minimax objective function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$
If the generator learns the distribution of the data successfully, the generated samples become so realistic that they are indistinguishable from the real data, and the discriminator outputs 0.5 everywhere [13].
Although the GAN can learn the distribution of the real data without any assumptions, it has some unsolved problems, such as gradient vanishing and gradient explosion. To tackle these problems, the LS-GAN, which uses a least squares loss function rather than the sigmoid cross-entropy loss for the discriminator, has been proposed [17]. The objective functions of the LS-GAN are expressed as:

$$\min_{w_D} V(D) = \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[(D(x) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}\left[D(G(z))^2\right],$$
$$\min_{w_G} V(G) = \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}\left[(D(G(z)) - 1)^2\right],$$

where $w_G$ and $w_D$ are the parameters of the generator and the discriminator, respectively.
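A minimal PyTorch sketch of these two least-squares losses is shown below; it assumes a generator `G` and a discriminator `D` are already defined, and uses the 0/1 target coding from the equations above.

```python
import torch

def d_loss_fn(D, G, x_real, z):
    """Least-squares discriminator loss: push D(x) toward 1 and D(G(z)) toward 0."""
    d_real = D(x_real)
    d_fake = D(G(z).detach())                 # do not backpropagate into G here
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def g_loss_fn(D, G, z):
    """Least-squares generator loss: push D(G(z)) toward 1."""
    return 0.5 * ((D(G(z)) - 1) ** 2).mean()
```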

IV. THE LS-GAN APPLICATION TO ANOMALY DETECTION
The aim of this paper is to improve the prediction performance of the AC and the OCC by combining them with the LS-GAN. The proposed LS-GAN-based AD proceeds as follows. 1) The first step is the training of the LS-GAN to learn the distribution of the training data set. 2) The second step is generating new samples using the trained LS-GAN.
3) The AC and the OCC are trained with both the existing and the generated samples. Fig. 1 depicts the procedure of the proposed method. We conduct a comparison experiment between the regular GAN and the LS-GAN in order to verify their performances. We consider the number of outliers as a comparison measure because even a very small number of outliers can distort the control boundary. For the performance comparison between the regular GAN and the LS-GAN, we use the banana-shaped simulation data set that follows a non-normal distribution [3]. Fig. 5 shows the performance results of the regular GAN and the LS-GAN. As shown in Fig. 5 (a), the extreme outliers are marked in the dashed box.
Therefore, in this study, since the LS-GAN generates fewer extreme outliers than the regular GAN, we select the LS-GAN as the generative model instead of the regular GAN for stable training. The hyper-parameters of the LS-GAN are determined by leveraging Douzas and Bacao's approach [16], as shown in Table 1. Both the generator and the discriminator adopt rectified linear units as the activation function for their five hidden layers. The last activation of the generator is the hyperbolic tangent function, and the last activation of the discriminator is the sigmoid function. Each model is trained by the Adam optimizer with a different learning rate: the learning rates of the generator and the discriminator are 0.0001 and 0.001, respectively. The training epoch is set to 10,000 and the mini-batch size is 20.
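The description above translates into the following PyTorch sketch; the hidden-layer width and noise dimension are illustrative assumptions, while the layer counts, activations, learning rates, and batch size follow the settings just listed.

```python
import torch.nn as nn
import torch.optim as optim

def mlp(in_dim, out_dim, out_act, hidden=64, n_hidden=5):
    """Five ReLU hidden layers, as described for both networks."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers += [nn.Linear(d, out_dim), out_act]
    return nn.Sequential(*layers)

noise_dim, data_dim = 32, 2                  # illustrative dimensions
G = mlp(noise_dim, data_dim, nn.Tanh())      # tanh output (data scaled to [-1, 1])
D = mlp(data_dim, 1, nn.Sigmoid())           # sigmoid output probability

opt_G = optim.Adam(G.parameters(), lr=1e-4)  # learning rates from the text
opt_D = optim.Adam(D.parameters(), lr=1e-3)
# training loop: 10,000 epochs with mini-batch size 20
```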

V. EXPERIMENTAL SETTINGS
A. DATA SETS
In this paper, we consider both simulation and real data sets for evaluating the effectiveness of the proposed approach. To generate the simulated data, we consider a bivariate banana-shaped distribution. For testing, normal samples generated under the same conditions as the training data set are combined with abnormal samples. Here, the samples with no shift ($\delta_0 = 0$) are considered as normal samples. On the other hand, we consider not only the shift size but also the shift type for the abnormal samples. The shift sizes are small ($\delta_1 = 0.1$), medium ($\delta_2 = 0.3$), and large ($\delta_3 = 0.5$), and three types of shift are considered: x-only, y-only, and both x and y. The resulting 9 simulation data sets (Simulated 1 ∼ Simulated 9) are summarized in Table 2 and shown in Fig. 6.
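The exact banana-shaped generator follows [3] and is not reproduced here; purely as a hypothetical illustration of how shifted abnormal samples could be produced, one common construction places points on an arc with Gaussian noise and adds a mean shift $\delta$:

```python
import numpy as np

def banana_data(n, delta=(0.0, 0.0), seed=0):
    """Hypothetical banana-shaped generator: an arc plus Gaussian noise,
    shifted by delta = (shift_x, shift_y) for abnormal samples."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.25 * np.pi, 0.75 * np.pi, size=n)   # arc segment
    x = np.cos(theta) + rng.normal(0, 0.1, n) + delta[0]
    y = np.sin(theta) + rng.normal(0, 0.1, n) + delta[1]
    return np.column_stack([x, y])

X_normal = banana_data(200)                       # delta_0 = 0 (in control)
X_abnormal = banana_data(50, delta=(0.3, 0.0))    # medium shift along x only
```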
Additionally, in order to recheck the effectiveness of the proposed method, five real data sets are borrowed from the UCI Machine Learning Repository. The AC and the OCC may not be ideally suited to the binary classification problems of these five data sets because they are alternatives to the Hotelling's $T^{2}$ control chart. Only the Wine data set consists of three classes; the rest of the real data sets consist of binary classes. Samples belonging to the majority class are considered as the normal samples in this paper. For the Wine data set, we consider the two major classes as normal and the remaining minor class as abnormal. 80% of the randomly chosen normal samples are reserved for training, while the remaining 20% of the normal samples are combined with the abnormal samples for testing. Finally, the principal component analysis (PCA) is applied to all the real data sets for dimensionality reduction. In this paper, the first $d$ principal components accounting for more than 90% of the total variance are used. Besides, all of the data sets are normalized to the range between −1 and 1. Table 3 presents the description of the data sets.
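A minimal sketch of this preprocessing follows, assuming an illustrative feature matrix `X`; scikit-learn's PCA can select the smallest number of components explaining 90% of the variance directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X = np.random.default_rng(0).normal(size=(150, 13))   # illustrative raw features

# Keep the first d principal components covering >90% of the total variance
X_pca = PCA(n_components=0.90, svd_solver="full").fit_transform(X)

# Normalize every feature to [-1, 1] (matching the generator's tanh output)
X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_pca)
```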

B. THE SETTINGS FOR THE GENERATIVE MODELS
The generative models (the KDE, the GMM, and the LS-GAN) are used to increase the amount of training samples. The number of training samples of each data set is increased to 2, 5, or 10 times the original sample size. So, with the three different generative models, a total of 9 augmented training data sets are generated for each original data set. Illustrative examples of each original data set and the two-times-increased fake data sets made by the generative models are shown in Fig. 8 in the Appendix. Fig. 8 shows that the proposed method can generate new samples that are not observed in the training data set. Gray dots denote observed samples in the training data, and colored dots denote samples generated by the proposed model. For the real data sets, the first principal component is on the x-axis and the second principal component is on the y-axis.

C. THE SETTINGS FOR THE CLASSIFIERS
Appropriate settings for the two classifiers are also required. For the RF, the number of artificial samples is twice the number of training samples for each case, as shown in Table 4. In addition, to enhance the performance of the OC-SVM, the hyper-parameter $\nu$ is set to 0.05.
We use the area under the curve (AUC) as the performance measure [29]. The AUC is the area under the receiver operating characteristic (ROC) curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values. The FPR and the TPR are defined as

$$\text{FPR} = \frac{FP}{FP + TN}, \qquad \text{TPR} = \frac{TP}{TP + FN},$$
where FP (false positive) is the number of normal samples falsely predicted as abnormal, TP (true positive) is the number of abnormal samples correctly detected, TN (true negative) is the number of normal samples correctly predicted, and FN (false negative) is the number of abnormal samples falsely predicted as normal.
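In practice, the AUC can be computed directly from anomaly scores; a minimal sketch using scikit-learn, with illustrative labels and scores:

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 = abnormal, 0 = normal; y_score: higher = more anomalous
y_true = [0, 0, 0, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.8, 0.6]
print(roc_auc_score(y_true, y_score))   # 1.0 here: scores rank anomalies perfectly
```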

VI. EXPERIMENTAL RESULTS
As mentioned in the previous section, we increase the training samples using the generative models. Each generative model increases the number of training samples to 2, 5, and 10 times the size of the original training data. As a result, AUC scores for 20 classification models are calculated for each data set.
The detailed results of the experiments are summarized in Table 6 in the Appendix. The AUC rankings for each data set are in parentheses, and boldface indicates the best-performing model for each data set. Table 5 shows the average rankings of the AD approaches with each generative model, based on Table 6 in the Appendix. Average ranking values close to 1 indicate the best overall performance, while values close to 20 indicate the worst. The experimental results demonstrate that the average rankings of the LS-GAN as a generative model are better than those of the baselines regardless of the AD approach. In particular, the AC performs best when the LS-GAN increases the number of training samples to ten times the original size. From the perspective of the average rankings, the AC performs better than the OCC for every generative model. Regardless of the AD approach, the LS-GAN ranks first on all the real data sets.
Only the best performances of the three generative models are compared for the respective real and simulation data sets in Fig. 7. As shown in Fig. 7, the LS-GAN-based AD performs best on the majority of the simulation data sets.

VII. CONCLUSION
In this study, we propose the LS-GAN-based AD to increase the prediction performance of the AC as well as the OCC under circumstances of limited training samples. The AD procedure is as follows. The first step is the training of the LS-GAN to learn the distribution of the training data set. Second, in order to increase the number of training samples, the LS-GAN generates new training samples, which are then combined with the original samples.
Third, the new training data sets are used to detect anomalies with the AC and the OCC. Finally, we evaluate the prediction performance of the proposed method using 5 real and 9 simulated data sets. Although the LS-GAN-based AC as well as the LS-GAN-based OCC shows the most successful performance, training the LS-GAN requires more effort and time compared with the other existing generative models. Also, whether the proposed method is applicable to time series data remains unresolved. Therefore, a combination of generative and anomaly detection models for time series data [30], [31] can be considered in future studies. In addition, systematic methods such as cross-validation with evaluation measures for training LS-GANs in the context of the AC can be considered.

APPENDIX
See Fig. 8 and Table 6.