Hybrid Discriminator With Correlative Autoencoder for Anomaly Detection

Advances in deep neural networks (DNNs) have led to impressive results, and in recent years many works have exploited DNNs for anomaly detection. Among others, generative/reconstruction model-based methods have been frequently used for anomaly detection because they do not require any labels for training. The anomaly detection performance of these methods, however, varies considerably with changes in intra-class variance and with differences in the complexity of input samples. In addition, most previous state-of-the-art works on anomaly detection have empirically adjusted several hyperparameters to improve their anomaly detection performance. Such procedures are known to be impractical and create obstacles in real-world anomaly detection. To solve these problems, we propose a hybrid discriminator with a correlative autoencoder for anomaly detection. In the proposed framework, the discriminator implicitly estimates the conditional probability density function, and the autoencoder has an improved ability to control the reconstruction error. We provide a theoretical foundation for our method and verify it through various experiments. We also confirm the practical benefits of our interpretation of the conditional expectation and of the proposed framework by comparing our results with other state-of-the-art methods.


I. INTRODUCTION
The recent explosive growth in machine learning and big data applications has increased the need for techniques that can detect anomalies. During machine learning, samples that are observed during training are typically called inliers, in-distribution, or normal samples, while samples not observed during training are called outliers, out-of-distribution (OOD), novel, or abnormal/anomalous samples. Except when comparing with prior references, in the remaining sections we use the term ''in-distribution'' for samples observed during training, and the term ''OOD'' for samples not observed during training.
The purpose of anomaly detection is simple: to distinguish OOD samples from in-distribution samples. However, the execution can be fairly complex. Some previous works on anomaly detection have assumed there is only one class of in-distribution samples. From that perspective, if numerical samples labeled as '0' are in-distribution, then all other numerical samples are OOD. This type of problem has been called novelty/outlier detection or one-class classification (OCC). OOD detection was introduced in other works [11]-[13]. In OOD detection, the in-distribution and OOD samples are from different data sets. For instance, if facial samples are in-distribution, then numerical samples are OOD. Another way of categorizing anomaly detection is to consider multiple class samples within the same data set as either in-distribution or OOD. For instance, if numerical samples labeled as '0', '1', and '2' are in-distribution, then the remaining numerical samples are OOD. So far no one has reported experimental results under this setting, but we have found that increasing the number of classes for in-distribution samples dramatically decreases the anomaly detection performance, especially in previous generative/reconstruction model-based approaches. It can be presumed that using multiple classes for in-distribution samples without any labels will increase the intra-class variance of in-distribution samples. (The associate editor coordinating the review of this manuscript and approving it for publication was Okyay Kaynak.)
Most previous works detect anomalies by mapping higher-dimensional samples to a lower-dimensional space. In this way, if the intra-class variance of samples increases, the distributions of in-distribution samples and OOD samples can overlap or move closer to each other in the lower-dimensional space. As a result, if the intra-class variance of samples increases, the anomaly detection performance significantly decreases. Generative/reconstruction model-based approaches [14]-[16] for OOD detection have an advantage because they do not require any labels for training. However, instances [1], [5], [19] have been found where the probability density likelihood of untrained OOD samples is higher than that of trained in-distribution samples, which is inconsistent with the idea [17], [18] that generative models can be used for anomaly detection. For instance, Glow [14], a state-of-the-art normalizing flow, trained on FashionMNIST or CIFAR10 assigns higher likelihood to untrained MNIST or SVHN samples than to the FashionMNIST or CIFAR10 samples it was trained on. Unfortunately, in generative models, it has been experimentally observed that the probability density likelihood depends not only on whether samples were seen during training but also on the difference in complexity of input samples (input complexity) [6], [7]. Estimating the probability density is still an open issue, and it is a significant factor in the anomaly detection performance of generative model-based approaches.
In order to solve the performance degradation problem due to changes in intra-class variance and differences in input complexity, we propose a framework for anomaly detection which is capable of conditionally discriminating and selectively controlling the reconstruction error by using a hybrid discriminator with a correlative autoencoder. We focus on the anomaly detection problem which uses only in-distribution samples without any labels during training, and verify the proposed method through experiments for OCC, OOD detection, and multiple class anomaly detection (MCAD). In particular, the proposed method does not use any data-set-dependent empirical hyperparameters except for essential ones such as the mini-batch size or the learning rate.
The rest of the paper proceeds in the following order. First, we investigate previous works for OCC and OOD detection in Section II. In Section III, we explain the details of the proposed method, including the background, the proposed framework, the hybrid discriminator, the correlative autoencoder, and the procedure for training and testing. In Section IV, we provide details of the experiments, including the experimental setup, experimental results, ablation study, and analysis/discussion. Finally, we conclude the paper in Section V.

II. RELATED WORKS
We categorize previous DNN-based approaches for anomaly detection according to whether they require labels for training or not. Classifier-based approaches [1]-[4], [11], [12] require labels for training, while self-supervision-based approaches [13], [20] generate pseudo-labels for samples and perform pre-text tasks for training. Generative model-based approaches [1], [5]-[7] and reconstruction model-based approaches [8]-[10] do not require labels for training. In this section, we mention only works closely related to our research. For a broader survey of methodologies and applications for OCC and OOD detection, we refer the reader to [21], [22].
In OOD detection, generative/reconstruction model-based approaches have a great advantage because they do not require labels for training, but they suffer from the performance degradation problem as shown in Table 1.

FIGURE 1. Conceptual block diagrams of previous and our methods for anomaly detection. Yellow blocks are used for both training and testing, and white blocks are used only for training. (a) The conceptual block diagram of the generative/reconstruction model-based approaches for anomaly detection [8], [9]. (b) The conceptual block diagram of the autoregressive reconstruction model-based approach for anomaly detection [10]. (c) The conceptual block diagram of the proposed method.
Most previous generative/reconstruction model-based approaches mitigate the performance degradation problem caused by input complexity in the following ways.
• They distinguish OOD samples from in-distribution samples by estimating the Watanabe-Akaike information criterion (WAIC) of the samples [5].
• They first separate image samples into background components and semantic components, and then distinguish OOD samples from in-distribution samples using the likelihood ratio of these components [6].
• They measure input complexity using a compression tool, and distinguish OOD samples from in-distribution samples based on the measured values [7].
Unlike OOD detection, in OCC, samples from a certain single class are considered in-distribution, and commonly the rest of the samples within the same data set are considered OOD. Labels are not given in OCC, so a generative/reconstruction model which produces larger reconstruction errors for OOD samples than for in-distribution samples is often required. Most previous generative/reconstruction model-based approaches for OCC used the following ways to increase the anomaly detection performance.
• The probability distribution of in-distribution samples is trained to follow a predetermined probability distribution through an adversarial autoencoder (AAE) [23]. A discriminator, which implicitly estimates the probability distribution of the reconstructed samples, discriminates OOD samples from in-distribution samples during testing [9].
• Reconstructed samples are generated through an autoencoder (AE) trained with the mean squared error (MSE) loss and the log-likelihood loss. Both the likelihood of autoregression [24], [25] and the MSE [10] are used to discriminate OOD samples from in-distribution samples during testing.
However, the anomaly detection performance of previous generative/reconstruction model-based approaches [8]-[10] degrades considerably when samples from multiple classes within the same data set are used as either in-distribution or OOD. In other words, the anomaly detection performance of previous generative/reconstruction model-based approaches is significantly affected by the intra-class variance.

III. PROPOSED METHOD
Here, we provide a high-level overview of the proposed method by comparing it with previous methods. We present the overall schema of previous and our methods in Fig. 1.

A. BACKGROUND
It can be seen in Fig. 1(a) and 1(b) that most of the previous generative/reconstruction model-based approaches for anomaly detection have, since GAN [26] was proposed, used a discriminator (D) and a generator (G) in the form of an AE, which consists of an encoder (G_E1) and a decoder (G_D). They pit G and D against each other: G is trained to reconstruct an output G(x) similar to an input sample x which can fool D, while D is trained to distinguish the real sample x from the generated (reconstructed) G(x). As training proceeds, both D and G improve each other, ultimately resulting in G reconstructing realistic samples which satisfy p_d(x) = p_g(G(x)), where p_d(x) is the probability density function (PDF) of real samples x, and p_g(G(x)) is the PDF of generated samples G(x). In most previous generative/reconstruction model-based approaches for anomaly detection, only in-distribution samples are available to train the given models, so G introduces reconstruction errors for OOD samples during testing. These works considered that D can distinguish the distribution of OOD samples from the distribution of in-distribution samples because p_d(x) ≠ p_g(G(x)) for OOD samples. D discriminates OOD samples from in-distribution samples by observing the reconstruction ability of G on x.
To exploit generative/reconstruction models for anomaly detection, the previous methods simultaneously applied more than one auxiliary non-parametric loss, such as the mean squared error (MSE) and/or the cross-entropy (CE) loss, in addition to the log-likelihood loss of GAN, because G from the original GAN cannot reconstruct G(x) directly from real samples x. These non-parametric losses result in less reconstruction error than the log-likelihood loss of GAN alone. Therefore, the ratios between these non-parametric losses and the log-likelihood loss have to be adjusted empirically in order to change the reconstruction ability of G. Consequently, D implicitly estimates the reconstruction error of G(x) to detect anomalies.
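As a concrete illustration, the combined objective used by these prior approaches can be sketched as follows. This is a minimal NumPy sketch, not any specific prior work's implementation; the weighting hyperparameter `lam` and the helper name are ours.

```python
import numpy as np

def combined_generator_loss(x, g_x, d_gx, lam):
    """Generator loss typical of prior GAN-AE anomaly detectors:
    an adversarial (log-likelihood) term plus an auxiliary MSE term.

    x    : real samples, shape (m, d)
    g_x  : reconstructions G(x), shape (m, d)
    d_gx : discriminator outputs D(G(x)) in (0, 1), shape (m,)
    lam  : empirical ratio between the MSE and adversarial terms
    """
    adv = -np.mean(np.log(d_gx))   # non-saturating adversarial term
    mse = np.mean((x - g_x) ** 2)  # auxiliary reconstruction term
    return adv + lam * mse

# The reconstruction ability of G (and hence detection quality) hinges on
# lam, which prior works tuned per data set -- the practice this paper
# aims to remove.
rng = np.random.default_rng(0)
x = rng.random((4, 8))
g_x = x + 0.1 * rng.random((4, 8))
loss_small = combined_generator_loss(x, g_x, np.full(4, 0.9), lam=0.1)
loss_large = combined_generator_loss(x, g_x, np.full(4, 0.9), lam=10.0)
```

Raising `lam` makes the MSE dominate and drives reconstructions closer to the input, which is exactly why this ratio had to be re-tuned per data set.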
Previously, the hyperparameters used to control the ratio between these non-parametric losses and the log-likelihood loss have heavily affected the anomaly detection performance. As can be seen in Fig. 1(a) and 1(b), the previous methods controlled the reconstruction error of G(x) by empirically adjusting these hyperparameters. Some works [8], [9] measured the reconstruction error using the MSE, while [10] estimated the probability density of latent vectors z using autoregression (AR). In this way, the MSE and AR were used to detect anomalies. Through our experiments we found that these generative/reconstruction model-based approaches are vulnerable not only to the input complexity of samples but also to large intra-class variance.

B. PROPOSED FRAMEWORK
The proposed anomaly detection framework is composed of two main modules, a Hybrid Discriminator (HD) and a Correlative AutoEncoder (CAE), as can be seen in Fig. 1(c). The former performs the discrimination (detection) process, while the latter performs the generation (reconstruction) process. These two networks are trained in an adversarial and unsupervised manner within an end-to-end setting. Rather than using the previous discriminators in Fig. 1(a) and 1(b), we combine a conditional discriminator D_2(x, ·) and the original discriminator D_1(·) from GAN [26] to form the HD.
The proposed CAE (G) is composed of two encoders, G_E1 and G_E2, and a decoder G_D. In the proposed G, G_E1 is trained adversarially against D_1(·), and G_E2 is trained adversarially against D_2(x, ·). G_D is trained adversarially against both D_1(·) and D_2(x, ·). Consequently, G_E1, G_D, and D_1(·) form the first adversarial network, while G_E2, G_D, and D_2(x, ·) form the second adversarial network.
The proposed conditional discriminator D_2(x, ·) estimates the conditional probability density of x and G(x) under the condition that a real sample x is given. In the next section, we will show that D_2(x, ·) not only replaces the function of the MSE in Fig. 1(a) and 1(b), but also distinguishes OOD samples from in-distribution samples. D in Fig. 1(a) discriminates OOD samples from in-distribution samples by implicitly estimating the probability density of G(x). On the other hand, D_2(x, ·) in Fig. 1(c) discriminates OOD samples from in-distribution samples by implicitly estimating the conditional probability density of G(x) given x. In addition, the output of D_2(x, ·) is estimated relative to the input complexity of the samples. The CAE produces generated (reconstructed) samples G(x) considering the correlation between the outputs of G_E1 and G_E2.
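At a high level, the data flow through the CAE can be sketched as follows. This is a toy NumPy sketch with randomly initialized linear maps standing in for the actual networks; the combination rule `z1n * z2n + z2n` (an elementwise product term plus a base term) reflects our reading of the normalization and summation described in Sec. III-D and is an assumption, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_Z = 16, 4  # toy input and latent dimensions

# Two encoders and one shared decoder, each reduced to a single linear map.
W_e1 = rng.standard_normal((D_IN, D_Z))
W_e2 = rng.standard_normal((D_IN, D_Z))
W_d = rng.standard_normal((D_Z, D_IN))

def cae_forward(x):
    """Forward pass of a simplified correlative autoencoder.

    G_E1 and G_E2 encode the same sample x into z1 and z2; the shared
    decoder G_D reconstructs from their normalized combination."""
    z1 = x @ W_e1  # z1 = G_E1(x), adversarial w.r.t. D_1
    z2 = x @ W_e2  # z2 = G_E2(x), adversarial w.r.t. D_2
    # Standardize per latent dimension over the mini-batch (cf. Eq. (3)).
    z1n = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2n = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    z = z1n * z2n + z2n  # correlative term plus base reconstruction term
    return z @ W_d       # G(x) = G_D(z)

x = rng.standard_normal((8, D_IN))  # a mini-batch of 8 toy samples
g_x = cae_forward(x)
```

The point of the sketch is the wiring, not the layers: two encoders feed one shared decoder, and the decoder never sees raw latents, only their batch-normalized combination.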

C. HYBRID DISCRIMINATORS
The proposed hybrid discriminator (HD) consists of two different discriminators: one is the discriminator D_1(·) as in [26], and the other is the conditional discriminator D_2(x, ·). A previous work [27] suggested the use of a conditional discriminator to conditionally generate an image. Our D_2(x, ·) is somewhat similar to the conditional discriminator in that work, but unlike it, our D_2(x, ·) has no relation to conditional generation. We combine D_2(x, ·) and D_1(·) into the HD to detect anomalies, as well as to implicitly control the reconstruction of G during training.
We first define the minimax objective for D_1(·) and G. Since only G_E1 and G_D are adversarially trained with D_1(·), we denote the generator model for D_1 as G_1, where G_1 is a differentiable function represented by a multilayer perceptron with parameters θ_G_E1 and θ_G_D:

min_G_1 max_D_1 V(D_1, G_1) = E_x∼p_data(x)[log D_1(x)] + E_x∼p_data(x)[log(1 − D_1(G(x)))]. (1)
Eq. (1) is identical to replacing D(G(z)) in the minimax objective from GAN [26] with D_1(G(x)). Thus, D_1(·) and G_1 reach the global optimum when p_data(x) = p_g(G(x)), where p_data(x) is the PDF of the given data samples x, and p_g(G(x)) is the PDF of the generated samples G(x).
Since only G_E2 and G_D are adversarially trained with D_2(x, ·), we denote the generator model for D_2 as G_2:

min_G_2 max_D_2 V(D_2, G_2) = E_x∼p_data(x)[log D_2(x, x)] + E_x∼p_data(x)[log(1 − D_2(x, G(x)))]. (2)

In Eq. (2), p_d(x|x) is the conditional PDF of x given x, and p_g(G(x)|x) is the conditional PDF of G(x) given x. We found that both D_2(x, ·) and G_2 reach the global optimum when p_d(x|x) = p_g(G(x)|x). Since real samples x are i.i.d. and the sample x is given to G(x) as a condition, x = G(x) must hold in order to satisfy p_d(x|x) = p_g(G(x)|x). Please refer to Appendix A for further details, including the proof. D_2(x, ·) distinguishes between x and G(x) by using the conditional PDF p_g(G(x)|x) and not the PDF p_g(G(x)). By using x in D_2(x, ·), we expect that anomaly detection using D_2(x, ·) is better than anomaly detection using D_1(·), especially when samples are diverse and/or complex. Unlike D_1(·), D_2(x, ·) is not affected by the number of classes of samples because the real sample x is always given as a condition. Because D_2(x, ·) compares the PDFs of the real sample x and the generated sample G(x) relative to each other, D_2(x, ·) is less affected by input complexity.
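A conditional discriminator of this form can be sketched by feeding the condition x alongside the candidate sample. In this minimal NumPy sketch, the condition is injected by simple concatenation (one common choice, assumed here rather than taken from the paper), and the weights are random placeholders for an untrained network.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN = 16
W = rng.standard_normal((2 * D_IN, 1)) * 0.1  # toy discriminator weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def d2(cond_x, y):
    """D_2(x, .): score y in (0, 1) given the condition x.

    Called as d2(x, x) for real pairs and d2(x, g_x) for reconstructed
    pairs; the condition is injected by concatenation along features."""
    return sigmoid(np.concatenate([cond_x, y], axis=-1) @ W).ravel()

x = rng.standard_normal((4, D_IN))
g_x = x + 0.05 * rng.standard_normal((4, D_IN))
real_scores = d2(x, x)    # used in the log D_2(x, x) term of Eq. (2)
fake_scores = d2(x, g_x)  # used in the log(1 - D_2(x, G(x))) term
```

Because x always rides along as the condition, the discriminator scores each pair relative to its own input rather than against the whole training distribution.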
The MSE between x and G(x) trained with D_2(x, ·) can be expressed as E[(x − G(x))^2 | x] [28] because D_2(x, ·) learns the conditional PDF of samples. On the other hand, the MSE between x and G(x) trained with D_1(·) can be expressed as E[(x − G(x))^2]. According to [29], the reconstruction error of G(x) trained with D_1(·) is greater than that of G(x), or rather G(x)|x, trained with D_2(x, ·). In other words, the variance of real samples within the mini-batch is eliminated for the conditional PDF (conditional expectation) because a real sample x is given for G(x). Moreover, the MSE in Fig. 1(a) and 1(b) corresponds to E[(x − G(x))^2]. Therefore, we can use D_2(x, ·) instead of the MSE in Fig. 1(a). D_2(x, ·) can implicitly estimate not only the conditional PDF of G(x) but also that of x. While D_2(x, G(x)) is trained with adversarial learning, D_2(x, x) is trained with the maximum log-likelihood, as in a generative model which explicitly estimates the probability density. Although D_2(x, ·) implicitly estimates the probability density, D_2(x, x) can relatively measure the difference in input complexity. When the input complexity of the samples is similar, D_2(x, x) for in-distribution samples is higher than that for OOD samples, like generative models such as Glow [14] or PixelCNN++ [15]. Accordingly, D_2(x, ·) basically functions as the discriminator of a GAN, D_2(x, G(x)) discriminates OOD samples from in-distribution samples, and D_2(x, x) can be used to measure input complexity.
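The claim that conditioning on x removes the mini-batch variance of real samples can be made precise with a standard identity about mean-square-optimal predictors, stated here in our notation:

```latex
% Without conditioning, the best constant mean-square predictor of x is
% its mean, and the attainable error is the full variance:
\min_{c}\; \mathbb{E}\big[(x - c)^2\big] = \operatorname{Var}(x).
% Once x itself is supplied as the condition, the best predictor is
% E[x | x] = x, so the attainable conditional reconstruction error is
\min_{g(\cdot)}\; \mathbb{E}\big[(x - g(x))^2 \,\big|\, x\big] = 0.
```

This is why the conditional objective can drive G(x) toward x itself, whereas the unconditional objective can only drive G(x) toward matching the distribution of x.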

D. CORRELATIVE AUTOENCODER
As we would like G to increase reconstruction errors for OOD samples regardless of input complexity, we propose a correlative autoencoder (CAE) which works cooperatively with the HD. The CAE consists of two encoders, G_E1 and G_E2, and a decoder G_D. In the proposed CAE, G_E1 is trained adversarially against D_1(·), and G_E2 is trained adversarially against D_2(x, ·). G_D is trained adversarially against both D_1 and D_2. G_E1 generates z_1 (= G_E1(x)) and G_E2 generates z_2 (= G_E2(x)) from the same sample x. Since G_1 results in more reconstruction error than G_2, as mentioned in the previous section, and both G_1 and G_2 share the same decoder G_D, we can say that G_E1 results in more reconstruction error than G_E2. We normalize both z_1 and z_2, and G_D takes their summation to produce a generated sample G(x).
In multi-domain learning [32], [33] and one/few-shot learning [34], [35], it is already known that there are common features which can be expressed as the covariance/correlation between domains or categories. Therefore, we can assume that the correlation between z_1 and z_2 will be high if there are common features between z_1 and z_2 for in-distribution samples. We apply the Pearson correlation to both z_1 and z_2 because we would like them to be consistent regardless of the data set. Thus, z_1 and z_2 are normalized as follows:

z^N_1 = (z_1 − E[z_1]) / σ_z1,  z^N_2 = (z_2 − E[z_2]) / σ_z2. (3)
For E[z_1] and E[z_2], we use the means of z_1 and z_2 from the training mini-batch, respectively. In the same way, we use the standard deviations of z_1 and z_2 from the training mini-batch for σ_z1 and σ_z2. This policy does not change between training and testing; in other words, all of the statistical parameters in Eq. (3) are set only during training. The CAE generates G(x) whose reconstruction error varies depending on whether x was observed during training or not. To match the order of magnitude of z^N_12, z_2 is normalized as z^N_2 using the expectation and variance of z_2. We already know that the reconstruction error of z_2 is less than that of z_1; thus, the reconstruction error of z^N_2 will be less than that of z^N_12.
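The normalization of Eq. (3) and the freeze-at-training-time policy for its statistics can be sketched as follows (a minimal NumPy sketch; the function names and the small epsilon guard are ours):

```python
import numpy as np

def fit_norm_stats(z):
    """Compute E[z] and sigma_z from a TRAINING mini-batch (Eq. (3)).
    Per the stated policy, these statistics are set only during training
    and reused unchanged during testing."""
    return z.mean(axis=0), z.std(axis=0) + 1e-8  # epsilon avoids /0

def normalize(z, mean, std):
    """Pearson-style standardization: z_N = (z - E[z]) / sigma_z."""
    return (z - mean) / std

rng = np.random.default_rng(2)
z1_train = rng.standard_normal((32, 4)) * 3.0 + 5.0  # toy latents
mean1, std1 = fit_norm_stats(z1_train)

z1n_train = normalize(z1_train, mean1, std1)
# At test time the SAME training statistics are applied to new latents,
# so shifts in test-time latents show up as shifted z_N values.
z1_test = rng.standard_normal((8, 4)) * 3.0 + 5.0
z1n_test = normalize(z1_test, mean1, std1)
```

Reusing the training statistics at test time is what lets OOD latents land away from the standardized range the decoder was trained on.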
If G_D uses only z^N_12, samples rarely seen during training will introduce large reconstruction errors, and this will lead to false negatives during testing. On the other hand, if G_D uses only z^N_2, generalization becomes too high, which leads to false positives during testing. By using z^N = z^N_12 + z^N_2, z^N maintains the basic reconstruction function of z^N_2, and at the same time it introduces a reconstruction error from z^N_12 which varies depending on whether x was observed during training or not. Thus, G_D trained with z^N does not need any empirical hyperparameters to adjust the reconstruction error.

Algorithm 1: Training of the proposed framework
for each training iteration do
  Sample a mini-batch of m normal (in-distribution) samples {x_1, ..., x_m} and compute E[z_1], E[z_2], σ_z1, and σ_z2 for the mini-batch.
  • Update θ_D_2(x,·) (parameters of D_2(x, ·)) by ascending its stochastic gradient:
      (1/m) Σ_i [log D_2(x_i, x_i) + log(1 − D_2(x_i, G(x_i)))].
  • Update θ_G_E2 (parameters of G_E2) and θ_G_D (parameters of G_D) by ascending its stochastic gradient:
      (1/m) Σ_i log D_2(x_i, G(x_i)).
  • Update θ_D_1(·) (parameters of D_1(·)) by ascending its stochastic gradient:
      (1/m) Σ_i [log D_1(x_i) + log(1 − D_1(G(x_i)))].
  • Update θ_G_E1 (parameters of G_E1) and θ_G_D by ascending its stochastic gradient:
      (1/m) Σ_i log D_1(G(x_i)).
end
// The gradient-based updates can use any standard gradient-based learning rule.
// We used Adam with betas = (0.9, 0.999), eps = 10^-8, and weight decay = 0.

E. TRAINING, AND TESTING
The proposed framework, including the CAE and the HD, is trained with the following loss:

L(D_1, D_2, G) = V(D_1, G_1) + V(D_2, G_2). (4)
As shown in Eq. (4), the total loss L(D_1, D_2, G) is the sum of Eq. (1) and Eq. (2). In the proposed framework, note that the parameters of G_E2 are not updated when G_1 and D_1(·) are trained. Likewise, the parameters of G_E1 are not updated when G_2 and D_2(x, ·) are trained. The proposed framework is composed of two adversarial networks, the G_1 and D_1(·) network and the G_2 and D_2(x, ·) network, which are trained alternately. We describe the detailed training procedure in Algorithm 1.
During testing, we exploit the proposed conditional discriminator D_2(x, ·) for anomaly detection. We know that D_2(x, G(x)) implicitly estimates the conditional PDF of the generated G(x) under the condition that the real sample x is given. We can expect that D_2(x, G(x)) depends on whether a sample was observed during training or not, so it can be used to discriminate OOD samples from in-distribution samples using the following policy: a sample x is classified as in-distribution if D_2(x, G(x)) ≥ λ_th, and as OOD otherwise, where λ_th is a threshold for binary classification.
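The test-time decision rule is a simple threshold on the conditional discriminator's score. A minimal sketch (the score values and threshold below are illustrative, not measured ones):

```python
import numpy as np

def detect_ood(d2_scores, threshold):
    """Flag a sample as OOD when D_2(x, G(x)) falls below lambda_th.
    Per the analysis in Sec. IV-C, D_2(x, G(x)) concentrates near 0
    for OOD samples, so low scores indicate anomalies."""
    return np.asarray(d2_scores) < threshold

# Illustrative scores: in-distribution reconstructions score high,
# OOD reconstructions score near zero.
scores = [0.91, 0.87, 0.04, 0.79, 0.02]
flags = detect_ood(scores, threshold=0.5)
print(flags.tolist())  # [False, False, True, False, True]
```

In practice λ_th is swept over the score range to trace the ROC curve rather than fixed to a single value.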

IV. EXPERIMENTS
To evaluate the anomaly detection performance of the proposed method, we compare it with the previous methods mentioned in Sec. II and Table 1. We set various test conditions for anomaly detection in Table 2. To verify that the proposed method eliminates the performance degradation problems caused by differences in the input complexity of samples, we use combinations of data sets that have been reported to cause this problem.

A. EXPERIMENTAL ENVIRONMENT
In one-class classification (OCC), samples from a single class (or category) are used as in-distribution during training, and samples from another single class or other multiple classes are used as OOD during testing. All samples except the in-distribution samples are used as OOD for OCC on MNIST, FashionMNIST, and CIFAR10. For OCC on COIL100, we followed the experimental setting from [9], [31]. OOD detection evaluates the anomaly detection performance across data sets. We use the entirety of one data set for in-distribution samples during training and the entirety of another data set for OOD samples during testing. Commonly, the input complexity of the in-distribution samples is higher than that of the OOD samples. For instance, using FashionMNIST as in-distribution samples and MNIST as OOD samples is very common for OOD detection.
In multiple class anomaly detection (MCAD), we use multiple classes for in-distribution samples during training, as well as other multiple classes for OOD samples during testing. Both the in-distribution and OOD samples are from the same data set. The number of classes for in-distribution samples and the number of classes for OOD samples are configured to be the same. We randomly select 50% of all classes in a given data set and allocate them to in-distribution; the remaining 50% are allocated to OOD. Only in-distribution samples, with no class labels, are used during training. Both in-distribution and OOD samples are used to evaluate the anomaly detection performance during testing. The samples for training and testing are completely separated, so there are no common samples between them. We assume that both the in-distribution samples used for training and those used for testing are from the same distribution (p_d(x)). We use the area under the receiver operating characteristic (ROC) curve (AUC) as the metric for anomaly detection performance. Commonly, the anomaly detection performance for MCAD is lower than that for OCC and OOD detection because in the MCAD experiment the samples have larger intra-class variance and higher input complexity.
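The AUC used throughout can be computed directly from the two populations of anomaly scores via the rank-sum (Mann-Whitney) formulation, which is equivalent to integrating the ROC curve. A self-contained sketch (function name and toy scores are ours):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC = probability that a randomly drawn positive (in-distribution)
    sample scores higher than a randomly drawn negative (OOD) sample,
    counting ties as 0.5."""
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.asarray(scores_neg, dtype=float)
    greater = (pos[:, None] > neg[None, :]).sum()  # pairwise wins
    ties = (pos[:, None] == neg[None, :]).sum()    # pairwise ties
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0; an uninformative detector
# whose score distributions coincide gives AUC = 0.5.
print(auc([0.9, 0.8, 0.7], [0.2, 0.1]))  # 1.0
```

The pairwise formulation is O(n*m) and fine for evaluation-sized score sets; for very large sets a rank-based implementation is preferable.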
To the best of our knowledge, no one, including the previous works listed in Table 1, has reported experimental results for MCAD so far. Therefore, we tested [8]-[10] for MCAD using their source code, and compare their results with ours. The proposed framework is implemented in the PyTorch 0.3.1 environment on multiple NVIDIA GTX 1080Ti and Quadro RTX 8000 GPUs. If a data set has its own preset split for training and testing, we follow it. If a data set has no preset split, such as COIL100, we randomly draw samples for training and testing with no duplication. Consequently, our experimental setting is identical to that of previous works [9], [31] for fair comparison. We present details of the data sets used in our experiments in Table 2.
Although we confirmed that changing the number of layers (or other arguments such as the kernel size, the number of output channels, etc. of the convolution layers) affects the anomaly detection performance only slightly (within a range of 1% on average), we decided to use the same network architectures, each with five layers as listed in Table 6, for OCC, MCAD, and OOD detection. The proposed framework shows superiority under various anomaly detection scenarios, including OCC, MCAD, and OOD detection, on various data sets.

B. EXPERIMENTAL RESULTS
We conduct experiments for OCC, MCAD, and OOD detection using various works of anomaly detection including ours.
Our results in all the tables are the average of at least five test runs. We plotted all the figures by selecting AUCs and ROC curves close to the average. Since the difference between OCC and MCAD is the number of classes for OOD samples, we present a comparison of the AUCs for both OCC and MCAD in Table 3. In Table 3, we report the AUCs of previous methods for OCC by referring to their original references, and we also report the AUCs of previous methods obtained by running their code [8]. The overall ROC curves of previous and our methods on MNIST can be seen in Fig. 3. The AUCs of the proposed method are all higher than those of the previous methods, regardless of OCC (except FashionMNIST) or MCAD. In particular, since FashionMNIST has many visually similar samples between classes, such as the samples labeled 'T-shirt/top' and 'Shirt', which have small inter-class variance, the anomaly detection performance of the proposed method for MCAD on FashionMNIST is higher than that for OCC.
We provide a comparison of the AUCs for OOD detection in Table 4. The results of previous methods are from their original references. We use the relatively complex data set as in-distribution samples; for instance, if FashionMNIST samples are used as in-distribution, then MNIST samples are used as OOD, which is very common for OOD detection. In Table 4, FashionMNIST samples are used as in-distribution, MNIST samples are used as OOD, and the anomaly detection performance degrades due to the difference in input complexity. Note that the previous generative/reconstruction model-based approaches used pre-processing, compression tools, and adjusted hyperparameters to improve their performance, but the anomaly detection performance of the proposed method is comparable to theirs without any pre-processing or other minor adjustments. This means that the proposed method can solve the performance degradation problem caused by differences in input complexity.

C. ANALYSIS AND DISCUSSION
As can be seen in Table 3, the proposed method shows high AUCs for OCC and MCAD. In particular, the proposed method exhibits higher AUCs than previous methods for MCAD, which means that it is less affected by the intra-class variance. In Fig. 3(a), the AUC is lowest for samples labeled '8'; we believe samples labeled '8' have many features that other samples also share. In Fig. 3(b), the AUCs are low when samples labeled 'T-shirt/top' and 'Shirt' are used as in-distribution because they have small inter-class variance. In Fig. 3(c), the AUCs are low when samples with complex backgrounds are used as in-distribution. In Fig. 3(d), the AUC of the proposed method is much higher than that of previous methods.
In order to verify that the proposed method solves the performance degradation problem due to intra-class variance and input complexity, we plot histograms of D_2(x, ·) values in Fig. 4, along with images of x and G(x) in Fig. 5. The histograms of D_2(x, x) and D_2(x, G(x)) of the proposed method for OOD detection are shown in Fig. 4(a) and 4(b), respectively, where FashionMNIST samples are used as in-distribution and MNIST samples are used as OOD. Although the input complexity of MNIST is lower than that of FashionMNIST and D_2(x, ·) is trained on FashionMNIST samples, it can be seen in Fig. 4(a) that the D_2(x, x) values for MNIST are located near the high points of D_2(x, x) for FashionMNIST. Since D_2(x, x) implicitly estimates the probability density log-likelihood of samples, it barely discriminates OOD samples from in-distribution samples. In Fig. 4(b), however, there is little overlap between D_2(x, G(x)) for reconstructed samples of MNIST and D_2(x, G(x)) for reconstructed samples of FashionMNIST. The histograms of D_2(x, x) and D_2(x, G(x)) of the proposed method for MCAD are shown in Fig. 4(c) and 4(d), respectively, where MNIST samples from five classes are used as in-distribution and the rest of the MNIST samples are used as OOD. Although D_2(x, x) did not observe any OOD samples during training, D_2(x, x) for in-distribution and OOD samples have similar distributions. This means that discriminating OOD samples from in-distribution samples by simply estimating the probability density log-likelihood of samples is impossible because of the large intra-class variance and the difference in input complexity. In Fig. 4(d), however, D_2(x, G(x)) for OOD samples is mostly concentrated near 0. Thus, D_2(x, G(x)) can discriminate OOD samples from in-distribution samples. Fig. 5(a) and 5(b) show visualizations of the real samples x and the reconstructed samples G(x) for in-distribution and OOD, respectively.
The reconstruction error of G(x) for OOD samples (bottom) is larger than that for in-distribution samples (top) because the model was trained to generate G(x) considering the correlation between z_1 and z_2. Note that the proposed method can optimally set the reconstruction error of samples to detect OOD samples without empirically adjusting hyperparameters for the loss ratio between D_1(·) and D_2(x, ·), whereas previous generative/reconstruction model-based approaches using the MSE commonly rely on such hyperparameters. This means the proposed method can consistently control the reconstruction errors of all samples using Eq. (3), regardless of the data set.

TABLE 5. Comparison of the proposed framework (the CAE and the HD), the framework with the CAE, the MSE, and D_1(·), and the framework with the AE and the HD, for MCAD and OOD detection. Higher is better. Best results are presented in bold font.

D. ABLATION STUDY
In order to verify the superiority of the proposed CAE and HD, we conduct several experiments that substitute the MSE for D_2(x, ·) and substitute an AE for the CAE. The overall results of this ablation study are presented in Table 5. If both G_E1 and G_E2 in Eq. (4) are replaced by a single encoder G_E, the total loss for the AE-with-HD framework takes a form similar to Eq. (4), where β is an empirical hyperparameter that controls the reconstruction error. Since the CAE is replaced by the AE, β must be set empirically, as in other SOTA approaches [8]-[10]. In this ablation study, β was empirically set to 0.1 for OCC and MCAD, and 0.02 for OOD detection.
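For reference, a β-weighted objective of the kind used in this ablation can be sketched as follows. This is an illustrative stand-in, not the paper's exact Eq. (4): the adversarial term and the MSE term are generic, and the function names and β value are hypothetical:

```python
import numpy as np

def ae_hd_loss(x, x_recon, d1_fake, beta=0.1):
    """Ablation-style total loss: a generator-side adversarial term from
    D1's outputs on reconstructions, plus a beta-weighted MSE reconstruction
    term. beta must be tuned empirically once the CAE is replaced by an AE."""
    adv = -np.mean(np.log(d1_fake + 1e-8))   # adversarial loss (non-saturating form)
    recon = np.mean((x - x_recon) ** 2)      # plain MSE reconstruction error
    return adv + beta * recon

# Toy usage with placeholder tensors and discriminator outputs.
x, x_rec = np.ones(4), np.zeros(4)
loss = ae_hd_loss(x, x_rec, d1_fake=np.full(4, 0.9))
```

The point of the ablation is that β couples two terms on different scales, so its best value shifts with the data set (0.1 vs. 0.02 above), which is exactly the dependence the CAE removes.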
We present the AUC traces over training steps in Fig. 6 in order to confirm the stability of the proposed framework. Substituting the CAE for the AE and using D_2(x, ·) as the anomaly detector instead of D_1(·) yields the best performance: the traces remain flat and high across all training steps. Note that the framework composed of the CAE, D_1(·), and the MSE still exhibits decent performance, although not as good as the proposed framework with the CAE and the HD. Since D_2(x, ·) acts as a parametric MSE, merely substituting the MSE for D_2(x, ·) does not degrade the anomaly detection performance significantly. Once the AE is substituted for the CAE, however, the anomaly detection performance degrades substantially, and the AUC traces no longer remain flat and high across training steps.
We present the estimated Pearson correlation coefficients for in-distribution and OOD samples in Fig. 7 in order to verify the validity of the correlations for z^N_12 discussed in Sec. III-C. E[z^N_{12,k}], the mean vector of the Pearson correlation coefficients over a batch, is given by

E[z^N_{12,k}] = (1/m) * sum_{i=1}^{m} z^N_{12,i,k},

where m is the batch size, i is an index over the samples in the batch, and k is an index over the latent dimensions. E[z^N_12] is the mean of E[z^N_{12,k}] over the latent dimensions,

E[z^N_12] = (1/K) * sum_{k=1}^{K} E[z^N_{12,k}],

where K is the dimension of the latent vector. In Fig. 7 it can be seen that E[z^N_{12,k}] for in-distribution samples is almost always higher than that for OOD samples, and E[z^N_12] for in-distribution samples is substantially higher than that for OOD samples, because samples can be characterized by certain dominant latent dimensions. Each E[z^N_{12,k}] and E[z^N_12] in Fig. 7 is taken from the training step with the highest AUC, and similar observations hold in almost all of the results. Consequently, in-distribution samples have a greater correlation between z_1 and z_2 than OOD samples.
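The two batch statistics above can be computed directly. A NumPy sketch, where `z12` is a hypothetical (m × K) array of per-sample, per-dimension correlation values z^N_12; the batch size, latent dimension, and random values are illustrative:

```python
import numpy as np

m, K = 64, 128                          # batch size and latent dimension (illustrative)
rng = np.random.default_rng(0)
z12 = rng.uniform(-1.0, 1.0, (m, K))    # hypothetical z^N_12 correlation values

e_k = z12.mean(axis=0)   # E[z^N_12,k]: mean over the batch, one value per latent dim k
e_bar = e_k.mean()       # E[z^N_12]: mean of E[z^N_12,k] over all K dimensions
```

Because every latent dimension is averaged over the same m samples, the scalar E[z^N_12] equals the grand mean of the whole batch.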
To maximize the effect of the correlation between z_1 and z_2, we use encoders G_ER, G_EG, and G_EB, assigned to the red, green, and blue channels of color images such as CIFAR10 and COIL100 (please refer to Table 6 in Appendix B for details).
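The per-channel routing can be sketched as a simple split of an NCHW batch; the encoders themselves (G_ER, G_EG, G_EB) are omitted, and the batch shape is a hypothetical CIFAR10-sized example:

```python
import numpy as np

def split_channels(batch):
    """Split an NCHW color batch into per-channel inputs, one for each of
    the channel-specific encoders (G_ER, G_EG, G_EB in the paper).
    Slicing keeps the channel axis so each piece is still NCHW."""
    r = batch[:, 0:1]  # red channel
    g = batch[:, 1:2]  # green channel
    b = batch[:, 2:3]  # blue channel
    return r, g, b

x = np.zeros((8, 3, 32, 32))   # e.g. a CIFAR10-sized batch
r, g, b = split_channels(x)
```

Each slice is a view of the original batch, so the split adds no copy overhead before the encoders.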

V. CONCLUSION
In this work we proposed the CAE and the HD, which solve the problems of previous generative/reconstruction model-based approaches to anomaly detection. Based on the theoretical analysis and experimental results, we confirmed that the proposed CAE and HD are superior to the previous autoencoder, discriminator, and MSE despite large intra-class variance, while remaining free of hyperparameters. In addition, the proposed method was comparable to state-of-the-art generative/reconstruction model-based approaches for OOD detection without requiring any dataset-dependent, empirically tuned hyperparameters. Although the CAE and the HD were implemented here with a naive autoencoder and discriminator, they can be applied to other generative and discriminative models. We believe the proposed CAE and HD can be employed not only for anomaly detection but also in other fields where selective reconstruction error and conditional/unconditional discrimination are important.

APPENDIX A PROOF OF GLOBAL OPTIMUM FOR D_2
The losses for the networks including D_2(x, ·) and G are given in Eq. (2). The concatenation of a given sample x and a generated sample G(x) can be modeled as a conditional distribution. The conditional PDF for the first term in Eq. (2) is p_d(x|x), and the conditional PDF for the second term in Eq. (2) is p_g(G(x)|x). The optimal discriminator D*(x, ·) is arg max_D V(D, G).
Substituting Eq. (9) into Eq. (8) yields Eq. (10). If we change p_g(G(x)|x) in Eq. (10) to p_g(x), Eq. (10) is equal to V(D*, G) in [26]. Therefore, the global optimum that simultaneously satisfies arg max_D V(D, G) and arg min_G V(D, G) is achieved at p_d(x) = p_g(G(x)|x).
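Following the standard GAN analysis of [26], and assuming the usual cross-entropy form of V(D, G), the optimal discriminator for a fixed G would take the familiar form with the conditional densities above substituted for the marginals (a sketch of the usual argument, not a reproduction of the paper's Eq. (9)):

```latex
D^{*}(x,\cdot) \;=\; \frac{p_d(x \mid x)}{\,p_d(x \mid x) + p_g(G(x) \mid x)\,}
```

At the global optimum p_d(x) = p_g(G(x)|x), this expression reduces to 1/2, mirroring the unconditional result of [26].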

APPENDIX B IMPLEMENTATION DETAILS
We present the network architecture details of the proposed framework in Table 6. For other hyperparameters and algorithm details, please refer to Algorithm 1.