Introduction
The objective of unsupervised anomaly detection is to identify anomalous samples from data. Unsupervised anomaly detection assumes that only normal samples are present while anomalous samples are absent in the training dataset. This formulation is useful when it is difficult to collect sufficient anomalous samples in advance or to obtain all possible anomaly patterns. Real-world examples of such scenarios include video surveillance [1], medical diagnosis [2], equipment failure detection [3], and manufacturing inspection [4].
There have been many research attempts to investigate unsupervised anomaly detection using deep neural networks. Among them, reconstruction-based anomaly detection [5]–[7] using autoencoders (AEs) is an intuitive and promising method to detect anomalies in the image domain. An AE is trained on normal samples to reconstruct them through a bottleneck layer, so that an anomalous sample is extremely distorted whereas a normal sample is not. For a new sample, the difference between the input and its reconstruction is used as the anomaly score. A sample with a score higher than a predefined threshold is rejected as an anomaly.
However, owing to the bottleneck architecture of an AE, even the reconstructions of normal samples are mildly distorted. The performance of reconstruction-based anomaly detection is strongly influenced by the size of the bottleneck in the AE. If the bottleneck is excessively large, the reconstruction quality improves, but anomalous samples are also restored well, defeating the purpose of the AE. By contrast, if the bottleneck is too small, the reconstructions of both normal and anomalous samples are significantly distorted, so the anomaly scores of some normal samples can be higher than those of anomalous samples. This phenomenon causes the over-detection issue, which deteriorates the overall anomaly detection performance.
In this paper, we introduce an outlier-exposed style distillation network (OE-SDN). We identify two components of the distortions by the AE: style translation and content translation. The OE-SDN has an extensive architecture and is trained based on knowledge distillation and outlier exposure regularization to mimic style translation while suppressing content translation. To detect anomalies, we measure the difference between the outputs of the AE and OE-SDN to capture the degree of content translation while style translation is canceled out, thereby alleviating the over-detection issue.
Fig. 1 shows examples that compare the outputs of the AE and OE-SDN. In Fig. 1a, the AE blurs the bristles and changes the overall tone from yellow to red. We regard these mild distortions as style translation. In Fig. 1b, the AE transforms abnormal areas such that they resemble normal areas by generating some bristles to replace the missing ones. These extreme distortions are regarded as content translation. Fig. 1a shows that the OE-SDN blurs the bristles and changes the color as the AE does. However, as shown in Fig. 1b, the OE-SDN does not replace the missing bristles, unlike the AE.
Fig. 1. Examples of distortions arising from the AE and OE-SDN in normal and anomalous regions. In (a) and (b), the left, middle, and right images show the input image, the output of the AE, and the output of the OE-SDN, respectively.
Related Work
A. Reconstruction-Based Unsupervised Anomaly Detection
A prevalent choice for anomaly detection is reconstruction-based anomaly detection using such models as an AE [5], a variational autoencoder (VAE) [6], and a generative adversarial network (GAN) [8]. It identifies a sample as an anomaly if the reconstruction error is above a certain threshold. The anomaly detection performance degrades if the reconstruction error of an anomaly is lower than the threshold. Gong et al. [7] used an AE with a memory module to calibrate the reconstruction error of anomalous samples. Zong et al. [9] considered both the distance of features and the reconstruction error to detect anomalous samples.
Unlike previous studies [7], [9] that attempted to detect anomalous samples with a low reconstruction error, this study targets normal samples with a high reconstruction error.
B. Out-of-Distribution Detection on Labeled Data
The aim of out-of-distribution (OOD) detection on labeled data is to construct a classifier to identify whether input data were sampled from the distribution of a training set or from a novel distribution [10]–[13]. Hendrycks et al. [13] suggested that the confidence can be attributed to samples based on the maximum prediction value by the classifier and that samples with a confidence value less than a fixed threshold can be rejected and regarded as OOD samples. Liang et al. [12] used adversarial perturbation [14] and temperature scaling to lower the confidence of a classifier when OOD samples were inferred. Some previous studies used regularization techniques to calibrate the confidence of the classifier. Lee et al. [10] set cross-entropy loss as a penalty term. Hendrycks et al. [11] employed margin ranking loss in a similar manner.
Without the notion of a classifier, OOD detection is highly similar to anomaly detection. We borrow ideas from OOD detection to address unsupervised anomaly detection.
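As a rough illustration of the maximum-prediction confidence described above for [13], the following NumPy sketch scores each sample by its largest softmax probability and rejects samples whose confidence falls below a fixed threshold. The logits, the threshold value of 0.5, and the function names are illustrative assumptions, not values taken from the cited work.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def max_softmax_confidence(logits):
    """Confidence of each sample = its maximum predicted class probability."""
    return softmax(logits).max(axis=1)

# Hypothetical classifier logits for two inputs and three classes.
logits = np.array([[5.0, 0.0, 0.0],    # one class dominates: confident prediction
                   [0.1, 0.0, 0.0]])   # nearly flat: low confidence
confidence = max_softmax_confidence(logits)
is_ood = confidence < 0.5              # threshold chosen only for this example
print(confidence, is_ood)
```

The second input has an almost uniform prediction, so it is the one rejected as OOD under this assumed threshold.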
C. Knowledge Distillation
Knowledge distillation is a method of transferring knowledge from a teacher network to a student network. Its applications mainly entail network compression. Hinton et al. [15] employed the predictions of a teacher network as soft labels and trained a smaller student network with these labels for a classification task. Chen et al. [16] and Fukuda et al. [17] applied knowledge distillation for object detection and speech recognition tasks, respectively.
The aforementioned applications use knowledge distillation to distill the knowledge of heavy ensemble models, achieving state-of-the-art performance with a lighter, faster network. To minimize the loss of accuracy, as much knowledge as possible should be transferred.
Unlike the conventional knowledge distillation methods, the objective of knowledge distillation for the proposed method is not compression but style mimicking. Thus, instead of transferring all knowledge from the teacher network, the proposed method aims to extract and distill only a small portion of knowledge that corresponds to style translation.
Method
A. Overview
The proposed anomaly detector comprises two neural networks, as illustrated in Fig. 2. The first network is the autoencoder (AE), which reconstructs the input using a bottleneck structure. The second network is the outlier-exposed style distillation network (OE-SDN), which imitates the output of the AE with an extensive non-bottleneck structure. Given a test sample, the proposed method calculates the anomaly score by comparing the AE and OE-SDN outputs.
B. Autoencoder
1) Training
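Suppose that the training set, consisting only of normal data, is denoted by $\mathcal{X}_{\text{normal}}$. The AE $f_{AE}$ is trained to minimize the reconstruction objective \begin{equation*} \mathcal {J}_{AE}(f_{AE}) = \sum _{x\in \mathcal {X}_{\text {normal}}}{ d\left ({x,f_{AE}(x) }\right)},\tag{1}\end{equation*} where $d(\cdot,\cdot)$ measures the difference between an input and its reconstruction.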
2) Anomaly Detection
Given a test sample $x_{\text{test}}$, the anomaly score is the reconstruction error \begin{equation*} \epsilon (x_{\text {test}}) = ||{x_{\text {test}}-f_{AE}(x_{\text {test}})}||.\tag{2}\end{equation*}
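The score in Eq. (2) and the thresholding rule from the Introduction can be sketched as follows. This is a minimal NumPy illustration: the reconstructions are hand-written stand-ins for the output of a trained AE, and the threshold of 1.0 is an arbitrary assumption.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Eq. (2): norm of the residual between input and reconstruction, one score per sample."""
    residual = (x - x_hat).reshape(len(x), -1)
    return np.linalg.norm(residual, axis=1)

def flag_anomalies(scores, threshold):
    """A sample is rejected as an anomaly when its score exceeds the threshold."""
    return scores > threshold

# Toy inputs and stand-in reconstructions (not produced by a real AE).
x = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
x_hat = np.array([[0.0, 0.0, 0.0, 0.0],    # reproduced exactly
                  [1.0, 1.0, 4.0, 5.0]])   # heavily distorted
scores = reconstruction_error(x, x_hat)
flags = flag_anomalies(scores, threshold=1.0)  # threshold is an assumption
print(scores, flags)
```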
The anomaly detection performance is affected by the size of the bottleneck layer. If the size of the bottleneck layer is small, the AE will considerably distort anomalous samples. However, the AE will also distort normal samples, especially those with infrequent or complex patterns. The unintended distortions of normal samples will deteriorate the overall anomaly detection performance.
C. Outlier-Exposed Style Distillation Network
1) Knowledge Distillation
To make the OE-SDN imitate the style translation of the AE, we adopt knowledge distillation [15]. We distill the knowledge of the AE and provide it to the OE-SDN $f_{OS}$. Given a training dataset $\mathcal{X}_{\text{normal}}$, the knowledge distillation loss is defined as \begin{equation*} \mathcal {L}_{KD}(f_{OS}) = \sum _{x\in \mathcal {X}_{\text {normal}}}{ d\left ({f_{OS}(x),f_{AE}(x) }\right)}.\tag{3}\end{equation*}
2) Outlier Exposure Regularization
Knowledge distillation from the small AE to the large OE-SDN results in the OE-SDN learning the style translation of the AE. However, the trained OE-SDN may also imitate some extreme distortions of the AE, as shown in Fig. 3. Hence, we adapt the concept of outlier exposure [11] as a regularizer for the OE-SDN. We define the outlier exposure regularization (OER) term over an auxiliary dataset $\mathcal{X}_{\text{aux}}$ as \begin{equation*} \mathcal {L}_{OER}(f_{OS}) = \sum _{\tilde {x}\in \mathcal {X}_{\text {aux}}}{ d\left ({\tilde {x},f_{OS}(\tilde {x}) }\right)}.\tag{4}\end{equation*}
Fig. 3. The AE, SDN, and OE-SDN are trained on the class “3” subset of the MNIST dataset, where the SDN is the network trained without the OER term. Due to the bottleneck, the AE can only reconstruct the class “3” samples, and it transforms anomalous samples (“0” in this case) to the normal class “3.” (a) Both the SDN and OE-SDN successfully imitate the blurring style of the AE. (b) The output of the SDN is the same as the output of the AE even for anomalous data; this means that the SDN mimics extreme distortion as well as mild distortion. On the other hand, the OE-SDN successfully reproduces anomalous data without extreme distortion.
The auxiliary dataset
3) Training
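The objective function for the OE-SDN is defined as \begin{equation*} \mathcal {J}_{OS}(f_{OS}) = (1-\lambda) \cdot \mathcal {L}_{KD}(f_{OS}) + \lambda \cdot \mathcal {L}_{OER}(f_{OS}),\tag{5}\end{equation*} where $\lambda$ is a hyperparameter that weights the OER term against the knowledge distillation term.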
Given the training dataset
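To make the interplay of Eqs. (3) to (5) concrete, the sketch below evaluates the combined objective for two toy candidate networks. Everything here is an illustrative assumption: the distance `d` is taken to be a per-sample mean squared difference, the "teacher" AE is a stand-in that simply dims its input, and `lam` plays the role of the weighting hyperparameter in Eq. (5).

```python
import numpy as np

def d(a, b):
    """Per-sample mean squared difference (assumed form of the distance d)."""
    return np.mean(((a - b) ** 2).reshape(len(a), -1), axis=1)

def oe_sdn_objective(f_os, f_ae, x_normal, x_aux, lam):
    """Eq. (5): (1 - lam) * L_KD + lam * L_OER."""
    l_kd = d(f_os(x_normal), f_ae(x_normal)).sum()   # Eq. (3): mimic the AE on normal data
    l_oer = d(x_aux, f_os(x_aux)).sum()              # Eq. (4): reproduce auxiliary data as-is
    return (1.0 - lam) * l_kd + lam * l_oer

# Hypothetical stand-ins: the "AE" dims every input by half.
f_ae = lambda x: 0.5 * x
identity = lambda x: x          # candidate that copies its input
copy_of_ae = lambda x: 0.5 * x  # candidate that mimics the AE everywhere

x_normal = np.ones((2, 4))
x_aux = np.full((3, 4), 2.0)

loss_identity = oe_sdn_objective(identity, f_ae, x_normal, x_aux, lam=0.5)
loss_copy = oe_sdn_objective(copy_of_ae, f_ae, x_normal, x_aux, lam=0.5)
loss_copy_no_oer = oe_sdn_objective(copy_of_ae, f_ae, x_normal, x_aux, lam=0.0)
print(loss_identity, loss_copy, loss_copy_no_oer)
```

With `lam=0.0` the objective reduces to the knowledge distillation loss alone, so a network that copies the AE everywhere incurs no penalty; a positive `lam` penalizes that same network on the auxiliary data.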
4) Anomaly Detection
To detect anomalies, we adopt an alternative anomaly score instead of the reconstruction error. Given a test sample $x_{\text{test}}$, the anomaly score is defined as the difference between the outputs of the OE-SDN and the AE: \begin{equation*} \epsilon '(x_{\text {test}})=||{f_{OS}(x_{\text {test}})-f_{AE}(x_{\text {test}})}||.\tag{6}\end{equation*}
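The following toy comparison illustrates why the score in Eq. (6) can cancel style translation while retaining content translation. The network outputs are hand-written arrays rather than outputs of trained models: the assumed AE dims every pixel (style) and fills in a missing pixel (content), while the assumed OE-SDN dims every pixel but leaves the defect in place.

```python
import numpy as np

def l2_per_sample(a, b):
    """Norm of the difference between two batches, one value per sample."""
    return np.linalg.norm((a - b).reshape(len(a), -1), axis=1)

# Row 0: normal sample. Row 1: anomalous sample with a "missing" pixel at index 2.
x = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 1.0]])
ae_out = np.array([[0.8, 0.8, 0.8, 0.8],    # dimmed (style)
                   [0.8, 0.8, 0.8, 0.8]])   # dimmed and defect filled in (content)
os_out = np.array([[0.8, 0.8, 0.8, 0.8],    # dimmed (style)
                   [0.8, 0.8, 0.0, 0.8]])   # dimmed, defect preserved

eps = l2_per_sample(x, ae_out)             # Eq. (2): reconstruction error
eps_prime = l2_per_sample(os_out, ae_out)  # Eq. (6): difference of the two outputs
print(eps, eps_prime)
```

In this constructed example the normal sample receives a nonzero reconstruction error purely from dimming, whereas its Eq. (6) score is zero; the anomalous sample keeps a clearly positive score under both definitions.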
Experiments
This section describes the evaluation of the proposed method for two unsupervised anomaly detection tasks: classification and segmentation. Each task has different benchmark datasets and baseline methods. All the experiments were implemented using PyTorch [20].
The common configurations for the two tasks are as follows. We used DSSIM [18] and MSD for the loss functions of the AE and OE-SDN, respectively. We set the hyperparameter
We evaluated the performance of each method in terms of the area under the receiver operating characteristic curve (AUROC), which is calculated independently of the threshold. For the anomaly classification task, the AUROC is calculated using image-wise anomaly scores. For the anomaly segmentation task, the AUROC is calculated using pixel-wise anomaly scores.
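As an illustration of how a threshold-free AUROC can be obtained from image-wise or pixel-wise scores, the sketch below uses the pairwise-ranking definition: the probability that a randomly chosen anomalous item scores higher than a randomly chosen normal item, with ties counted as one half. The scores and masks are made up, and the quadratic pairwise comparison is only practical for small inputs; a rank-based or library implementation would be preferable for full-resolution score maps.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Classification: one score and one label (1 = anomaly) per image.
image_scores = [0.1, 0.4, 0.35, 0.8]
image_labels = [0, 0, 1, 1]
image_auroc = auroc(image_scores, image_labels)

# Segmentation: one score per pixel, compared against ground-truth masks.
score_maps = np.array([[[0.1, 0.9], [0.2, 0.8]],
                       [[0.1, 0.2], [0.3, 0.1]]])
masks = np.array([[[0, 1], [0, 1]],
                  [[0, 0], [0, 0]]])
pixel_auroc = auroc(score_maps, masks)
print(image_auroc, pixel_auroc)
```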
A. Unsupervised Anomaly Classification
We evaluated our method on the MNIST [22] and CIFAR-10 [23] datasets for the classification task of unsupervised anomaly detection. Both datasets had 10 classes, from which we created 10 setups similar to those created by Ruff et al. [24]. In each setup, one class was chosen as the normal class and the remaining classes were treated as anomalous. Every setup had approximately 6,000 training images in the MNIST dataset or 5,000 in the CIFAR-10 dataset. The number of test images was 10,000 for both datasets.
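The one-class setups described above can be sketched as follows: training uses only the images of the chosen normal class, while the whole test set is kept and every other class is labeled anomalous. The label arrays are tiny made-up examples and the helper name `one_class_setup` is hypothetical.

```python
import numpy as np

def one_class_setup(train_labels, test_labels, normal_class):
    """Return indices of normal training images and a boolean anomaly label per test image."""
    train_idx = np.flatnonzero(train_labels == normal_class)
    test_is_anomaly = test_labels != normal_class
    return train_idx, test_is_anomaly

# Toy labels standing in for the class labels of a 10-class dataset.
train_labels = np.array([0, 1, 2, 1, 0, 2])
test_labels = np.array([2, 1, 1, 0])

train_idx, test_is_anomaly = one_class_setup(train_labels, test_labels, normal_class=1)
print(train_idx, test_is_anomaly)

# One setup per class, as in the 10-setup protocol.
setups = {c: one_class_setup(train_labels, test_labels, c) for c in np.unique(train_labels)}
```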
For both the MNIST and CIFAR-10 datasets, we implemented an AE with a simple architecture composed of fully-connected layers. The encoder consisted of four layers of 128, 64, 32, and 10 units. The decoder was designed symmetrically to the encoder. Leaky rectified linear units (LReLUs) with a slope of 0.005 were applied after each layer except for the output layer. Regarding the OE-SDN, we constructed the architecture using the residual attention block of residual non-local attention networks (RNAN) [25], which has proven to be effective in general image translation tasks. Fig. 4 shows the detailed architecture of the OE-SDN. We set an initial learning rate of 0.001 for the AE and 0.0001 for the OE-SDN. We trained the AE in 50k iterations and the OE-SDN in 10k iterations, which was sufficient for the loss of each network to converge. The batch size was 20 for both networks.
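The fully-connected AE described above can be sketched as a plain NumPy forward pass. The layer widths and the leaky slope come from the text; the flattened input size (784 for MNIST, 3072 for CIFAR-10), the weight initialization, and the use of NumPy instead of the PyTorch implementation mentioned earlier are assumptions made to keep the example self-contained.

```python
import numpy as np

ENCODER_UNITS = [128, 64, 32, 10]  # encoder widths from the text; 10 is the bottleneck

def layer_sizes(input_dim):
    """Encoder widths followed by their mirror image, ending at the input size."""
    enc = [input_dim] + ENCODER_UNITS
    return enc + enc[-2::-1]

def init_params(input_dim, rng):
    """Small random weights and zero biases (initialization scheme is an assumption)."""
    sizes = layer_sizes(input_dim)
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def leaky_relu(z, slope=0.005):
    return np.where(z > 0, z, slope * z)

def forward(params, x):
    """Apply each affine layer, with LReLU after every layer except the output layer."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = leaky_relu(h)
    return h

rng = np.random.default_rng(0)
params = init_params(784, rng)    # 784 = 28 * 28 flattened MNIST image
batch = rng.random((20, 784))     # batch size of 20, as in the text
out = forward(params, batch)
print(layer_sizes(784), out.shape)
```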
Fig. 4. Architecture diagram for the OE-SDN. The architecture adapts the RNAN [25], in which we set the number of filters to 32 in all convolution layers.
We compared our method with four baseline methods: one-class support vector machine (OC-SVM) [26], isolation forest (IF) [27], GAN for anomaly detection (AnoGAN) [8], and deep support vector data description (DeepSVDD) [24]. The experimental results for these four baseline methods were obtained from [24]. We also used the AE and SDN (OE-SDN with
The results are shown in Table 1. The proposed OE-SDN outperformed the baseline methods in terms of the average AUROC for the MNIST and CIFAR-10 datasets. For the setups with Dog, Horse, or Truck as the normal class in the CIFAR-10 dataset, the AE performed worse than DeepSVDD, whereas the OE-SDN yielded superior results. This demonstrates the effectiveness of the OE-SDN. The SDN performed worse than the AE in many cases for both datasets. The SDN learned not only style translation but also extreme distortions from the AE, as shown in Fig. 3. The OER mitigated this problem, resulting in a higher AUROC. For the setups with Airplane or Bird as the normal class in the CIFAR-10 dataset, the OE-SDN performed worse than the AE. In these setups, more normal images were easily reproduced by the OE-SDN but reconstructed with distortions by the AE, leading to more false positives than in other setups. An image of an airplane in flight, rather than on the ground, is an example.
B. Unsupervised Anomaly Segmentation
We used the MVTec-AD dataset [4] to assess unsupervised anomaly segmentation performance. It contained 5354 images corresponding to 10 object categories and 5 texture categories that represent real-world inspection scenarios. The training set of the MVTec-AD dataset had only normal images, and the test set contained defect images with a pixel-wise ground-truth mask.
Considering that the MVTec-AD dataset contained relatively large images, we adopted a modern convolutional neural network (CNN) architecture to implement the AE. The detailed architecture is shown in Fig. 5. We used the same architecture for the OE-SDN as in the classification experiment.
Fig. 5. Architecture diagram for AE for segmentation experiment. Except for the last convolution layer, all convolution filters have a kernel size of
Texture images were cropped to
We investigated the effectiveness of the proposed method in comparison with three unsupervised anomaly segmentation methods that were used as baselines in [4]: AnoGAN [8], a method based on CNN feature similarity [28], and an AE [4] with an alternative architecture. The experimental results for these three baselines were obtained from [4]. We also used our AE and SDN as baseline methods to evaluate the validity of the components of our method similar to the classification experiment.
As shown in Table 2, the OE-SDN and SDN achieved the best performance on average for both textures and objects. The OE-SDN obtained a considerable AUROC gain for categories in which the AE had a low AUROC, such as Tile, Wood, Capsule, Metal Nut, and Zipper. On the other hand, the advantage of the OE-SDN was insignificant for categories in which the AE already detected anomalies well, such as Carpet, Grid, Hazelnut, Screw, and Toothbrush. The OE-SDN yielded a slightly lower AUROC than the AE in the Pill category, which contained defect patterns that the OE-SDN could not reproduce. For example, a slightly changed color was not reproduced by the OE-SDN, which resulted in missed abnormal pixels.
Fig. 6 shows examples of cases in which the OE-SDN exhibited improved performance and those in which it did not. In successful cases, shown in Fig. 6a, the high anomaly scores generated by the AE for the normal regions were significantly suppressed by using the OE-SDN. However, there were also cases in which the OE-SDN was not effective, as shown in Fig. 6b. In the first, second, and fourth rows, the AE detected abnormal areas appropriately without over-detecting the normal areas, so the OE-SDN did not provide further performance improvement. In the third and fifth rows, the AE did not produce significant anomaly scores for abnormal areas in the first place. Because the OE-SDN suppresses anomaly scores for normal areas while maintaining anomaly scores for abnormal areas, a performance improvement could not be expected in such cases.
Fig. 6. Successful and failed cases corresponding to our method on texture categories of the MVTec-AD dataset. Each row shows the results of each category in the order Carpet, Grid, Leather, Tile, and Wood. Each column represents a raw image, ground-truth mask, pixel-wise difference of raw input and reconstruction from the AE, and pixel-wise difference of the outputs of the AE and OE-SDN.
C. Effect of Auxiliary Dataset and Hyperparameter
We investigated the effect of the auxiliary dataset
1) Comparative Experiment for Auxiliary Dataset
We performed a comparative experiment with regard to the auxiliary dataset
2) Sensitivity Analysis of Hyperparameter
We used the MNIST and CIFAR-10 datasets to study the robustness of the hyperparameter
Conclusion
In this study, we presented the OE-SDN to overcome the over-detection issue of conventional reconstruction-based anomaly detection methods. Considering an AE, the OE-SDN was trained with two objectives: knowledge distillation and outlier exposure regularization. Consequently, the OE-SDN preserved the style translation and suppressed the content translation of the AE. We introduced an alternate anomaly score defined as the difference between the outputs of the AE and OE-SDN. Experiments on real datasets showed that our method outperforms existing methods, including reconstruction-based anomaly detection using AEs.
In future work, we will investigate various regularization and learning methods used in style transfer studies to train the OE-SDN effectively. Further, because the OE-SDN can be used with any other reconstruction-based anomaly detection method, we will apply the OE-SDN to other recent methods to improve the anomaly detection performance.