Spatial Contrastive Learning for Anomaly Detection and Localization

With the development of deep learning, anomaly detection methods have been widely proposed to improve performance in various applications, including visual inspection systems. However, it remains difficult to apply them directly to real-world applications, which often suffer from a lack of abnormal samples and diversity. This paper proposes contra embedding, which combines a progressive autoencoder with contrastive learning to address these difficulties. The autoencoder is trained progressively to reproduce the details of the original images, and a modified CutPaste augmentation helps it learn to recover normal images. In particular, contrastive learning based on normal embedding vectors effectively reduces false positives caused by the autoencoder. The proposed method is also helpful when normal data have complex shapes, sizes, and colors. In experiments, the MVTec AD dataset is used to show the generalization ability of the proposed method in various real-world applications. It achieves AUROCs over 98.0% in detection and 97.7% in localization, respectively, without using an ImageNet pre-trained model as in previous methods.


I. INTRODUCTION
Recently, large performance improvements have been achieved in the unsupervised anomaly detection task, which aims to detect unusual events in test data by training only on unlabeled data [1], [2]. Unsupervised anomaly detection assumes that the training dataset usually contains no abnormal samples, and it can be used when normal samples occur far more frequently than abnormal ones. It can therefore be applied to cases where it is difficult to collect abnormal data, such as medical and industrial applications [3]–[6].
In anomaly detection and localization, there are basically two categories of methods: reconstruction-based methods [7]–[12] and embedding similarity-based ones [13]–[15]. The Autoencoder (AE) structure is one of the reconstruction-based methods; it finds the differences between the input image and the reconstructed one. The AE is an intuitive, interpretable, and effective solution for extracting important features from images with various characteristics [16], [17]. In anomaly detection, the quality of the reconstructed image is one of the key issues for improving performance, because generated images usually look blurred due to the bottleneck structure of the model. To avoid this drawback, Gong et al. [18] used an AE with a memory module to correct the reconstruction error on abnormal samples. On the other hand, Chung [19] addressed the problem by restoring the image using knowledge distillation and outlier exposure regularization.
(The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Cheng.)
Recently, embedding similarity-based methods [1], [13]–[15] have shown good performance by using feature vectors from a pre-trained network to detect abnormality. However, they require an appropriate model pre-trained on extra data such as ImageNet, and such a model does not guarantee that it will properly extract feature vectors for every dataset.
Our approach, based on the reconstruction paradigm, is inspired by the recent success of self-supervised models such as SimCLR [20] and MoCo [21]. Spatial contrastive learning is applied to learn local features for fine-grained datasets. The method is particularly suitable for extracting and adjusting features when sufficient normal data exists. This paper proposes a novel method called contra embedding for unsupervised anomaly detection trained on normal data with various characteristics such as shape, size, and color. The proposed contra embedding combines a Progressive Autoencoder (PAE) with a spatial contrastive learning approach. The PAE is used to detect subtle abnormalities in detail, and it is robust to small changes in the size and shape of normal instances. Additionally, a modified CutPaste augmentation reduces the parameter dependency of the autoencoder: it helps the autoencoder learn denoising and reduces over-detection when the autoencoder has large compression vectors. Patch-level contrastive learning is proposed to extract local embedding vectors, which compensate for the anomaly map of the PAE. Figure 1 shows the anomaly map generated by the proposed method. The Structural Similarity Index (SSIM) from the autoencoder detects abnormalities effectively but also flags high-frequency regions of normal images as defects. This problem is compensated by a mask generated through contrastive learning. The final anomaly map shows the effectiveness of our proposed method.
The contributions of our method can be summarized as follows:
• We address the drawback of over-detection by reconstruction-based methods through patch-level contrastive learning.
• The proposed approach is more useful in realistic conditions like non-aligned datasets.
• The method improves the reconstruction performance of the autoencoder by using progressive learning and modified CutPaste augmentation.

II. RELATED WORK
A. RECONSTRUCTION-BASED METHODS
Reconstruction-based methods encode or reconstruct normal images using neural network architectures such as the Autoencoder (AE) [7], [8], Variational Autoencoder (VAE) [9], or Generative Adversarial Network (GAN) [3], [10]–[12]. The autoencoder is based on the idea that the model cannot properly reconstruct an anomalous image, since it concentrates only on the frequently occurring normal ones. The anomaly score measures the difference between input images and reconstructed ones.
To improve the quality of reconstructed images, SSIM-AE [7], [8] trains an autoencoder with a structural similarity loss that compares luminance, contrast, and structural information between local image regions. MemAE [18] and OE-SDN [19] use an additional mechanism, such as a memory module or knowledge distillation, to complement the reconstructed images. These two methods mainly focus on classification-style anomaly detection benchmarks such as MNIST and CIFAR-10. Recently, InTra [22] suggested a Transformer model with masked patches, which tries to improve performance using global attention.

B. EMBEDDING SIMILARITY-BASED METHODS
Embedding similarity-based methods extract meaningful vectors for anomaly detection. Patch-SVDD [23] extends Support Vector Data Description (SVDD) to the patch level. The anomaly score is measured using Nearest Neighbor Search (NNS) based on the distance between an embedding vector from a test image and all embedding vectors from training images. Patch-SVDD improves performance using patch-level features, but its inference time increases linearly with the size of the training data. Recently, other works [1], [13]–[15] suggested using feature vectors of a pre-trained network to detect abnormality. In particular, PaDiM [1] proposed a patch-based approach that preserves the correspondence between image space and feature maps. The feature vectors are extracted for each patch by a pre-trained network such as ResNet [24], WideResNet [25], or EfficientNet [26]. Although the method is simple, it shows outstanding performance. However, the performance is limited by the pre-training data, and the approach cannot easily learn additional features when performance is insufficient.
C. SELF-SUPERVISED LEARNING
Self-Supervised Learning (SSL) has developed remarkably over the past few years to reduce the dependency on labeled data. Representation learning is a core issue of SSL in computer vision. A series of methods have been proposed to learn a representation of an image without labels [20], [21], [27]–[30]. These methods help neural networks learn meaningful features by training on a pretext task with unlabeled data. Several such tasks have been suggested, including predicting rotations of images, invariance to image transformations, and context prediction. Contrastive learning is one of the most recent and effective approaches, using mutual information for training [20], [31]–[33].
In recent research, SSL in computer vision [34]–[37] has shown good performance for anomaly detection. These methods extract features using geometric changes such as a rotation or a flip of the entire image. However, they are suited to discriminative features derived from the overall information of an image, as in CIFAR-10 or ImageNet, and are not suitable for fine-grained datasets such as MVTec AD [7]. CutPaste [2] learns self-supervised representations by classifying augmented images as abnormal ones, and GradCAM is used to build an anomaly map. The method shows the similarity between features of abnormal data and augmented normal data in classification, but its localization outputs are limited by the CAMs, which mainly highlight the most discriminative points of the images.

FIGURE 2. Inference process of the proposed method. The contra embedding map is generated from contrastive learning features, and the progressive autoencoder makes an anomaly map using the structural similarity index. The final anomaly map is generated by pixel-wise multiplication.

D. PROGRESSIVE LEARNING
In general, image reconstruction in an AE requires non-negligible computational cost for extracting features, and the performance is highly affected by hyperparameters such as the learning rate. SinGAN [38], Progressive GAN [39], and Pioneer [40] address this problem by gradually learning images from low to high resolution, using progressive learning sequences in their generators and discriminators to increase the resolution of the generated images. We apply this concept of progressively increasing the size to the autoencoder.

III. METHOD
In this section, we present the anomaly detection algorithm. As shown in Figure 2, the method uses two neural networks: one is the PAE, which reconstructs normal images, reducing the parameter dependency of the autoencoder and improving detection performance; the other is a convolutional network for feature extraction. The feature extractor uses a self-supervised method that contains two tasks, context prediction and similarity comparison. At inference, the two anomaly maps are multiplied to construct the final map. The anomaly map of the PAE is calculated based on the Structural Similarity Index (SSIM) [41] to get pixel-level features.

A. PROGRESSIVE AUTOENCODER
Figure 3 presents the learning process of the PAE used in our method. At the initial step, the autoencoder is trained using small images of size 14 × 14, and the trained network is then extended by adding convolution and transpose layers to the encoder and the decoder. Through transfer learning, the modified network is trained repeatedly using larger images than in the previous step. For this training, the L2 loss is adopted to compare input images and outputs at the pixel level:

L_rec = ‖X − f_AE(X)‖_2^2. (1)
Here, X denotes a training image, and f_AE(·) denotes the autoencoder.
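As a minimal sketch, the pixel-level L2 loss and a progressive resolution schedule might look as follows in NumPy. The starting size of 14 × 14 is from the paper; the doubling schedule and the final size of 224 are illustrative assumptions, since the actual per-category sizes are given in Table 2, and all function names are ours.

```python
import numpy as np

def l2_loss(x, x_hat):
    """Pixel-level L2 reconstruction loss ||X - f_AE(X)||^2, averaged over pixels."""
    return float(np.mean((x - x_hat) ** 2))

def progressive_resolutions(start=14, final=224):
    """Resolution schedule for progressive training: double the image size
    each stage until the final resolution is reached (illustrative only)."""
    sizes = [start]
    while sizes[-1] * 2 <= final:
        sizes.append(sizes[-1] * 2)
    return sizes
```

At each stage, the grown autoencoder would be fine-tuned at the next resolution in the schedule before new layers are added.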
The quality of the restored images is the key issue for the autoencoder to detect abnormalities. The restoration quality, in terms of shape or noise, depends on the number of compressed vectors. If the number of vectors is too large, the autoencoder reconstructs overly detailed images, including unwanted factors such as defects. To address this issue, we adopt an autoencoder with an augmentation, modified CutPaste (M-CutPaste), which allows sufficiently large vectors to be used while preventing the training of defective objects, as shown in Figure 4. A detailed procedure of M-CutPaste is described in Algorithm 1. First, random augmentations such as contrast, brightness, hue, 90° rotation, and resize are applied to a patch image to produce I_aug. Next, I_aug is masked by an inverted cutout mask, where the cutout mask M is a random-size white square. The mask M is applied to the input image to generate I_cutout. Finally, the modified CutPaste image I_mcutp is made by combining I_cutout and the masked I_aug.
The autoencoder is trained to reproduce I_input from the generated image I_mcutp as input. This method treats the M-CutPaste area as noise and helps the autoencoder learn denoising.
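A rough NumPy sketch of the M-CutPaste procedure in Algorithm 1, under simplifying assumptions: the photometric augmentation is reduced to a random brightness shift, the pasted patch is assumed to be at least as large as the cutout mask, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def modified_cutpaste(img, patch, mask_size):
    """Sketch of M-CutPaste: img and patch are float arrays in [0, 1],
    shape (H, W, C); patch is at least mask_size x mask_size."""
    h, w, _ = img.shape
    # 1) random photometric augmentation of the pasted patch (stand-in:
    #    a brightness shift; the paper also uses contrast/hue/rotation/resize)
    aug = np.clip(patch + rng.uniform(-0.1, 0.1), 0.0, 1.0)
    # 2) cutout mask M: a random square region set to 1, elsewhere 0
    y = rng.integers(0, h - mask_size)
    x = rng.integers(0, w - mask_size)
    m = np.zeros((h, w, 1), dtype=img.dtype)
    m[y:y + mask_size, x:x + mask_size] = 1.0
    # 3) I_cutout: erase the masked region of the input image
    i_cutout = img * (1.0 - m)
    # 4) paste the augmented patch into the hole and combine
    i_aug = np.zeros_like(img)
    i_aug[y:y + mask_size, x:x + mask_size] = aug[:mask_size, :mask_size]
    return i_cutout + i_aug * m
```

The autoencoder would then be trained to map the output of `modified_cutpaste` back to the clean input image.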

B. CONTRA EMBEDDING LEARNING
Contrastive learning generally uses a positive pair obtained by applying different augmentations to one image, while the other images are used as negative pairs. We utilize this process at the patch level, not the image level. In addition, instead of using differently augmented versions of an image as positives, we use the patch at the exact same location in the image restored by the autoencoder as the positive, and patches at different locations as negatives. Figure 5 shows the training process of the proposed contrastive learning. The batch size is N, and we define the contrastive prediction task on same-position pairs of input patches P_i (i = 1, …, N) and reconstructed patches Q_j (j = 1, …, N), where P_i and Q_j share the same position when i = j. Let the feature extractor network be g(f_con(·)), where g(·) is the projection head, constructed with two dense layers, and f_con(·) is the encoder network, constructed with convolution layers. It produces representations z_i and z'_j from P_i and Q_j, respectively. For each i, the patch P_i has one positive sample (j = i) and N − 1 negative samples. The cosine similarity of the positive sample is trained toward 1, and that of the negative samples toward 0. The loss for a positive pair example is defined as

ℓ(P_i, Q_i) = −log [ exp(sim(z_i, z'_i)/τ) / Σ_{j=1}^{N} exp(sim(z_i, z'_j)/τ) ], (2)

where τ denotes a temperature parameter and sim(·, ·) is the cosine similarity between two vectors. Equation 2 is computed over the whole batch of size N, and the contrastive loss is

L_con = (1/N) Σ_{i=1}^{N} ℓ(P_i, Q_i). (3)

Additionally, the relative locations of image patches are used to learn global image structure by solving context prediction [30]; this loss leads to more stable training. For randomly selected patches r_i1 (i = 1, …, N) and s_i1 (i = 1, …, N) in an input and a reconstructed image, the corresponding patch r_i2 or s_i2 is selected from one of its eight neighborhoods in a 3 × 3 grid. With the relative position label y ∈ {0, …, 7}, a dense layer C is trained to predict y from the pair of encoded patches, giving the context prediction loss

L_pred = (1/N) Σ_{i=1}^{N} [ CE(y_ri, C(f_con(r_i1), f_con(r_i2))) + CE(y_si, C(f_con(s_i1), f_con(s_i2))) ], (4)

where CE(·) indicates the cross entropy. The total loss over a mini-batch is

L_total = L_con + λ L_pred, (5)

where λ is a hyper-parameter balancing the losses.
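The patch-level contrastive loss described above — same-position pairs as positives, all other reconstructed patches in the batch as negatives — can be sketched in NumPy as follows. The projection head and the context prediction branch are omitted, and all names are ours.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (N, D) and b (N, D)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def patch_contrastive_loss(z_p, z_q, tau=0.1):
    """NT-Xent-style loss over same-position patch pairs.
    z_p: (N, D) embeddings of input patches P_i,
    z_q: (N, D) embeddings of reconstructed patches Q_j.
    The positive pair is (P_i, Q_i); all Q_j with j != i are negatives."""
    s = cosine_sim(z_p, z_q) / tau                  # (N, N) similarity matrix
    # row-wise log-softmax; the diagonal entries are the positive pairs
    log_prob = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

When the same-position embeddings coincide and distinct positions are orthogonal, the loss approaches zero, which is the behavior the training objective pushes toward.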

C. INFERENCE
At inference, we generate two anomaly maps, as shown in Figure 2. The first is M_ssim, generated through the autoencoder, and the second is M_con, generated through contra embedding. The anomaly map M_ssim is generated using the SSIM to get pixel-level differences. Let two w × w window-size image patches be denoted as k_i (i = 1, …, n) and k̂_i (i = 1, …, n), which represent the patches of the input image and the reconstructed image, respectively, where n denotes the number of pixels in the image.
The pixel-level SSIM score between two patches k_i and k̂_i is

SSIM(i) = [(2 μ_{k_i} μ_{k̂_i} + c1)(2 σ_{k_i k̂_i} + c2)] / [(μ_{k_i}² + μ_{k̂_i}² + c1)(σ_{k_i}² + σ_{k̂_i}² + c2)], (6)

where μ_{k_i} and μ_{k̂_i} are the mean intensities of k_i and k̂_i, σ_{k_i}² and σ_{k̂_i}² are the variances, and σ_{k_i k̂_i} is the covariance. We set the numerical stability constants to c1 = 0.01 and c2 = 0.03. Computing SSIM(i) over the entire image yields the first anomaly map, M_ssim. The second anomaly map, M_con, is calculated using the feature network f_con(·) trained with contrastive learning. First, an input set of m cropped images with patch size p_h × p_w is extracted with half-patch stride from the original image of size I_h × I_w, as in Figure 2. The patch set of the original image is denoted p_i (i = 1, …, m), and the patch set of the image restored by the autoencoder is denoted q_j (j = 1, …, m). Then, two embedding vector sets are generated as f_con(p_i) and f_con(q_j). Here, f_con(q_j) serves as a query set and, for a given query vector, the anomaly score of a patch is the minimum L2 distance to all vectors f_con(p_i):

M_con(j) = min_i ‖ f_con(q_j) − f_con(p_i) ‖_2. (7)
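For a single pair of patches, the SSIM score described above can be sketched as follows. This is a minimal NumPy version using the constants c1 = 0.01 and c2 = 0.03 stated in the text; a full M_ssim map would evaluate this over a sliding 15 × 15 window, and the function name is ours.

```python
import numpy as np

def ssim_patch(k, k_hat, c1=0.01, c2=0.03):
    """SSIM between two same-size patches: luminance, contrast, and
    structure terms combined as in the standard SSIM formula."""
    mu1, mu2 = k.mean(), k_hat.mean()
    v1, v2 = k.var(), k_hat.var()
    cov = ((k - mu1) * (k_hat - mu2)).mean()
    return float(((2 * mu1 * mu2 + c1) * (2 * cov + c2)) /
                 ((mu1 ** 2 + mu2 ** 2 + c1) * (v1 + v2 + c2)))
```

Identical patches score exactly 1; dissimilar patches score below 1, and the anomaly map highlights regions where the score drops.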
The values M_con(j) (j = 1, …, m) form a small anomaly map, which is resized to the same size as the original image. The final anomaly map M is constructed by pixel-wise multiplication of the two maps, and the image-level anomaly score is defined as the maximum value of the final anomaly map.
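Putting the inference pieces together, a minimal NumPy sketch of the contra-embedding scores and the final map combination is given below. Resizing the small M_con map up to image resolution is omitted, and all names are ours.

```python
import numpy as np

def contra_anomaly_scores(feat_p, feat_q):
    """M_con(j): for each reconstructed-patch embedding f_con(q_j),
    the minimum L2 distance to all input-patch embeddings f_con(p_i)."""
    d = np.linalg.norm(feat_q[:, None, :] - feat_p[None, :, :], axis=-1)  # (m, m)
    return d.min(axis=1)

def final_anomaly(m_ssim, m_con):
    """Pixel-wise product of the two (same-size) anomaly maps;
    the image-level score is the maximum of the final map."""
    m = m_ssim * m_con
    return m, float(m.max())
```

When reconstruction and input embeddings match perfectly, all contra-embedding scores are zero, so the product map suppresses false positives from the SSIM map alone.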

IV. EXPERIMENTS
A. IMPLEMENTATION DETAILS
The training process of the proposed method is performed in two steps. First, the input images are trained using PAE, and reconstructed images are generated. Second, contrastive learning is performed at the patch level using the reconstructed and input images. During the second step, trained PAE weights are fixed and feature extractor f con , classifier C, and projector g are trained simultaneously using equation 5.
The PAE was trained with the following augmentation operations: rotation (−10°, +10°), brightness (0.1), contrast (0.8, 1.2), hue (0.1), saturation (0.9, 1.1), and M-CutPaste. M-CutPaste used the same hyper-parameters for brightness, contrast, hue, and saturation; additionally, 90° rotation, zoom-in (1.1), and the cutout mask sizes in Table 2 were used. The Adam optimizer and a four-step learning rate schedule {0.0005, 0.00025, 0.000125, 0.00006} were used with a batch size of 128. MVTec AD consists of two subsets, normal training data and test data. We randomly choose 80% of the images from the training set for training; the remaining 20% are used to monitor the training loss. The image size of the training data is selected depending on the type of data, texture or object. For the zipper image set, the image size is set the same as for the textures instead of the objects. Detailed information about the autoencoder structure, the size of training images, and the epochs of the PAE is given in Table 1 and Table 2.
The feature extractor was trained with random crop (12 − 128) augmentation. The Adam optimizer and an initial learning rate of 0.0002 with a cosine decay policy were adopted. The loss hyper-parameter λ is set to 1 for all experiments. The network was trained for 2k epochs with a batch size of 128. The feature extractor network structure is described in Table 3.

TABLE 1. Architecture for the PAE. All convolution layers have a kernel size of 3 or 4. A Conv block has two convolution layers with the same filter size; each layer has a kernel size of 3 and 4 and a stride of 1 and 2, respectively. A Transpose block has a transpose layer and a convolution layer with a kernel size of 4 and 3, a stride of 2 and 1, and the same filter size.
The experimental results are reported as the average Area Under the Receiver Operating Characteristic curve (AUROC) over three different random seeds. At inference, SSIM with a window size of 15 × 15 is used to calculate the anomaly scores for M_ssim. The patch size p_h × p_w in M_con is set to 32 × 32 for all evaluations.

B. DATASET DESCRIPTION
Our proposed method is evaluated on MVTec AD, commonly used to test anomaly detection and localization in manufacturing. The dataset consists of five texture and ten object categories, and each category has a different quantity of normal training data, from 60 to 391 images. Most images are well center-aligned, except for the screw category. The training set has only normal images, while the test data contains both normal and abnormal samples.

TABLE 3. Feature extractor network structure, based on [24]. The two dense layers are called the projector and are only used for training.

1) COMPARISON WITH THE STATE-OF-THE-ART
The detection and localization results on MVTec AD are shown in Table 4 and Figure 6. Patch-SVDD [23], CutPaste [2], and InTra [22] are compared with our solution; each method is trained only on the given 60 to 391 training images of MVTec AD. Additionally, PaDiM [1] is compared, although it needs additional data; PaDiM is a representative embedding similarity-based technique with different characteristics. The proposed method presents improved results for the average of all categories in detection and localization without using extra data. In particular, the cable category is difficult for reconstruction-based methods, because the images contain many high-frequency components such as stripped wires. InTra, a reconstruction-based method, shows particularly low performance on cable, but our method overcomes this drawback and shows the best performance. Although our method shows better average AUROC, it is less effective in specific categories. The localization results on wood and toothbrush are lower than those of the state-of-the-art methods; in these categories, the images contain many dot-like noise patterns, which the patch-level contrastive learning detects as abnormal points, so it cannot compensate the autoencoder adequately. Figure 6 presents success and failure cases. The success cases show accurate localization even when defects are small and have only tiny color changes. However, the method struggles to localize defects whose texture variance is very small.

2) DETECTION AND LOCALIZATION ON A NON-ALIGNED DATASET
To estimate the robustness of the anomaly localization method, a non-aligned MVTec AD is used. The dataset is not officially released but was proposed by PaDiM: MVTec AD images are modified with (−10, 10) rotation and random cropping with proportion 0.875 (e.g., cropping from 256 to 224). In this setting, detection results are not reported, because many defects are removed during the cropping process. Table 5 shows that the proposed method is robust to translation and rotation of the images compared with the PaDiM method. Especially for the texture categories, the results on non-aligned images are as good as on aligned images, which shows that our method is highly robust to misalignment.

TABLE 4. Comparison of detection and localization results with state-of-the-art methods on MVTec AD (AUROC %). We report mean and standard error over 3 random seeds. The highest AUROC value is in bold and the lowest is underlined. PaDiM only reports mean AUROC for detection.

FIGURE 6. Success (a) and failure (b) cases of the proposed method (contra embedding) in localization. Each column shows one category's results: carpet, grid, leather, tile, wood, bottle, cable, capsule, hazelnut, metal nut, pill, screw, toothbrush, transistor, and zipper. From top to bottom, each row displays the original image, the ground-truth mask, and the anomaly map.

D. ABLATION STUDY
We conduct several experiments to analyze our method, contra embedding. The first experiment examines the importance of each component: the autoencoder structure, the denoising augmentation, and the contrastive learning part. Second, we study the impact of the losses and of training convergence in contrastive learning. The third experiment shows the validity of the progressive autoencoder by comparing the performance of PAE and SSIM-AE. Additionally, we analyze the effects of modified CutPaste augmentation in the PAE.

1) COMPONENT ANALYSIS
We present the impact of the different components added to or removed from our framework. For the autoencoder (AE) structure, SSIM-AE [7], trained using equation 6, and the PAE are considered; the denoising augmentation M-CutPaste and contrastive learning are the other components. The last row of Table 6 shows that removing the contra embedding part degrades performance the most, so contrastive learning is the essential component. In the texture categories, M-CutPaste is also important for detecting and localizing defects. The method using SSIM-AE shows performance similar to PAE, with a difference of 0.3–1.3 in detection.

2) EFFECTIVENESS OF CONTRASTIVE LEARNING
a: ANALYSIS OF LOSS FUNCTION
In self-supervised learning, we trained the feature extractor using two losses: the contrastive loss and the context prediction loss in equation 5. Table 7 reports performance for the different loss functions: the left column shows the PAE result without contrastive learning, the middle one contrastive learning without the prediction loss, and the right one the total loss. In most cases, contra embedding with the context prediction loss performs well; the context prediction loss helps the feature extractor learn features suitable for anomaly detection. The results on grid and screw show that the autoencoder alone, even without further compensation, can achieve excellent performance, because the normal images are simple and have uncomplicated patterns without large variance. The context prediction loss can, unfortunately, be ineffective in the average result for the texture categories; however, it is helpful where the input has distinguishable objects, as in cable and capsule.

b: ANALYSIS OF TRAINING CONVERGENCE
To analyze the convergence of the self-supervised learning, we compared the average AUROC over training epochs. The detailed results differ by category, but most of the training losses converge after 2k epochs. Figure 7 shows that the performance saturates after the 2k epoch.

3) RECONSTRUCTION PERFORMANCES OF AUTOENCODER
In this ablation, we experiment with the autoencoder under different settings to understand its properties. The structure of the autoencoder and different augmentations are tested to compare reconstruction performance.

a: EFFECTIVENESS OF PROGRESSIVE AUTOENCODER
We analyze the effect of the autoencoder structure by comparing detection and localization performance. In Table 8, PAE shows 5.2% and 3.6% higher performance than SSIM-AE in texture detection and object localization, respectively. The reconstruction results of SSIM-AE and PAE on leather and metal nut are presented in Figure 9: SSIM-AE cannot restore the details of the original image, while PAE restores them accurately. For a more specific analysis, we measure the PSNR between normal and reconstructed images. Figure 8 shows the PSNR of SSIM-AE and PAE by category; in all categories except hazelnut and screw, PAE shows higher PSNR than SSIM-AE.

b: EFFECTIVENESS OF M-CUTPASTE
Table 9 compares the performance according to the use of M-CutPaste. The M-CutPaste case shows 1.3% and 0.8% higher AUROC than the baseline in detection and localization, respectively. CutPaste and M-CutPaste show similar performance in the object categories, but M-CutPaste is better in the texture ones. Figure 10 shows that the abnormal region of image (b) remains as it is, whereas with M-CutPaste the abnormal region of image (c) is completely removed. These figures show that M-CutPaste makes the autoencoder learn to recover diverse abnormalities to normal. As a result, when an abnormal image has a large difference in structure or color, the autoencoder reconstructs a better normal image than one trained without the augmentation.

V. CONCLUSION
This paper has proposed an unsupervised anomaly detection and localization algorithm using a PAE combined with self-supervised learning. The autoencoder can be a proper solution when there is a large volume of normal data with great diversity. In the proposed method, the autoencoder is trained progressively to reproduce the details of the original images, and M-CutPaste augmentation helps it learn to recover normal images. Moreover, contrastive learning is suggested to compensate the autoencoder and cope with the diversity of normal data. Cosine similarity is used to find differences between the images recovered by the autoencoder and the original images at the patch level, which is helpful for extracting patch-level features. With this method, the anomaly detection AUROC reaches 98.0% on the widely used MVTec AD dataset, showing high efficiency without the need for extra training data. Moreover, the method proved effective when the input data were perturbed by translation and rotation.