SA-PatchCore: Anomaly Detection in Dataset With Co-Occurrence Relationships Using Self-Attention

Various unsupervised anomaly detection methods using deep learning have recently been proposed, and the accuracy of detecting local anomalies has improved. However, no existing anomaly detection dataset includes co-occurrence-related anomalies, that is, anomalies defined by a combination of parts rather than by a single local region. Thus, detection accuracy for co-occurrence-related anomalies has not progressed. Therefore, we propose SA-PatchCore, which introduces self-attention into the state-of-the-art local anomaly detection model, PatchCore. Self-attention, first introduced in the natural language processing field to capture context between distant words, allows SA-PatchCore to detect anomalies in co-occurrence relationships as well as anomalies in local areas. As no anomaly detection dataset includes anomalies in co-occurrence relationships, we prepared a new dataset called the Co-occurrence Anomaly Detection Screw Dataset (CAD-SD) and performed anomaly detection experiments on it. SA-PatchCore achieves higher anomaly detection performance than PatchCore on the CAD-SD. Moreover, our proposed model shows almost the same anomaly detection performance as PatchCore on the MVTec Anomaly Detection dataset, which is composed of anomalies in local areas. As a contribution to the anomaly detection task, we have released the CAD-SD to the public. The code and dataset are publicly available at https://github.com/IshidaKengo/SA-PatchCore


I. INTRODUCTION
An anomaly detection task that identifies a sample as normal or anomalous is essential in various fields, such as industry, medical care, and security. In the industrial field, visual inspection has so far been conducted by humans for the quality assurance of products. However, human visual inspection has problems, such as a shortage of inspectors and variability between individuals. Therefore, automation of appearance inspection using image recognition is expected to alleviate these problems. In recent years, deep learning has achieved outstanding results in image recognition, and various anomaly detection models using deep learning have been actively studied.
The associate editor coordinating the review of this manuscript and approving it for publication was Oguzhan Urhan.
The MVTec Anomaly Detection (MVTecAD) dataset [1] is used as a benchmark for deep learning-based anomaly detection techniques. The dataset was created assuming visual inspection of products in real environments. MVTecAD [1] includes normal and abnormal images of 15 product categories. The anomalies in the dataset are local anomalies, such as scratches, stains, and cracks, where part of the image is anomalous. Among various anomaly detection methods, the state-of-the-art PatchCore [2] achieves an area under the receiver operating characteristic curve (AUROC) score of 99.6%. Many of the highly accurate methods for MVTecAD [1] use convolutional neural networks (CNNs) pre-trained on ImageNet [3] to extract image features and distinguish normal from anomalous samples based on the distribution of these features in feature space. However, because they extract image features with convolutional layers, the existing detection models for MVTecAD [1] are unable to detect anomalies in the relationships between distant pixels, that is, anomalies in co-occurrence relationships. A co-occurrence relationship anomaly, which MVTecAD [1] does not contain, is determined based on the relationship between distant pixels (Fig. 1). For example, if a product with a hex nut attached to one side of the screw rod is assumed to be normal, then it becomes abnormal if a hex nut is attached to both ends of the screw rod or if no hex nut is attached to either side. For such co-occurrence relationship anomalies, the high-precision anomaly detection models proposed for MVTecAD [1], such as PatchCore [2], cannot demonstrate sufficient anomaly detection performance.
Thus, we focus on using self-attention in image recognition to enable anomaly detection of co-occurrence relationships. Self-attention was proposed as an operation that can consider the relationship between words in the translation task of natural language processing [4]. Recently, it has been increasingly applied in the image recognition field, as in the Vision Transformer [5]. Because self-attention takes the entire image as input and calculates features based on the relationships between pixels, it can consider the relationship between distant pixels in the image. We constructed an anomaly detection method that can detect anomalies in co-occurrence relationships by capturing the relationship between distant features using self-attention. In this study, we propose SA-PatchCore, which incorporates self-attention into PatchCore [2], a state-of-the-art model for MVTecAD [1], to identify anomalies in co-occurrence relationships (Fig. 2). The proposed model is valid for both anomalies in local regions and those in co-occurrence relationships. In SA-PatchCore, the local features extracted using a pre-trained CNN and the global features based on the relationship between distant pixels, obtained by applying self-attention to those features, are mapped onto the feature space, and normal and abnormal data are distinguished based on the distribution of the features. The contributions of this study are as follows:
1) We propose SA-PatchCore, which incorporates self-attention into PatchCore [2] to detect anomalies in local regions and co-occurrence relationships.
2) SA-PatchCore can calculate relationships between features without the linear transformation, and the training it requires, that is included in the conventional self-attention model.
3) SA-PatchCore applies self-attention to feature maps compressed by the CNN so that the large computational complexity of the self-attention model does not become a bottleneck.
4) We constructed a new dataset called the Co-occurrence Anomaly Detection Screw Dataset (CAD-SD) for anomaly detection, including anomalies in local regions and co-occurrence relationships.
5) SA-PatchCore achieves almost the same anomaly detection accuracy as PatchCore [2] on MVTecAD [1], which consists only of anomalies in local areas, while achieving high anomaly detection performance on the CAD-SD.

1) RECONSTRUCTION-BASED METHOD
The reconstruction-based method is based on generative models, such as the autoencoder [17] and the generative adversarial network (GAN) [18]. These techniques are based on the hypothesis that a generative model trained to reconstruct only normal images is unable to properly reconstruct the abnormal areas of abnormal images. In the simplest autoencoder-based case, Zhou et al. [6] performed anomaly detection by comparing the input and output of an autoencoder. Bergman et al. [7] proposed AE-SSIM, which replaces the reconstruction error with SSIM. DRAEM [8], SMAI [9], and NSA [10] created pseudo-anomaly images and used them for self-supervised learning. Among the GAN-based methods, Schlegl et al. [11] detected anomalies by comparing evaluation and generated images, and Song et al. [16] proposed AnoSeg using self-supervised learning. In recent years, most high-performance anomaly detection methods are representation-based rather than reconstruction-based. This is because improved generative models can successfully reconstruct even abnormal images, and methods using self-supervised learning with pseudo-anomaly images are biased toward the pseudo-anomalies.

2) REPRESENTATION-BASED METHOD
The representation-based method detects anomalies based on the distribution of encoded features obtained by passing images through a network. It includes methods [19], [20], [21] that train a neural network to perform statistical reasoning based on one-class classification, methods [22], [23] that use the latent variable space of an autoencoder, and methods [25], [26], [27] that use the discriminator of a GAN to classify anomalies. In recent years, however, several methods have employed a CNN pre-trained on large-scale external datasets, such as ImageNet, to extract image features. DifferNet [30], CS-Flow [31], and FastFlow [32] are representation-based methods that use normalizing flows. SPADE [33] uses feature maps at various levels of the network for fine-grained anomaly detection and localization based on the k-NN method. The model of Rippel et al. [34] models encoded features with a multivariate Gaussian distribution and calculates anomaly scores using the Mahalanobis distance. PaDiM [35] applies this approach at the patch level to multi-scale feature maps. Several structures of SPADE and PaDiM are related to PatchCore [2], which is the current state-of-the-art anomaly detection model on the MVTecAD benchmark [1].
VOLUME 11, 2023
PatchCore [2] uses the Wide-ResNet50 [36], pre-trained on ImageNet, as a feature extractor and applies average pooling to aggregate the feature maps extracted from the middle layers of Wide-ResNet50 [36] into patch-level features. During training, the features computed from normal data are stored in a memory bank. During inference, the k-NN method is used to find the features in the memory bank that are closest in feature space to the features computed from the unknown data. The distance is used as the patch-level anomaly score, and the maximum patch-level anomaly score is the image-level anomaly score. PatchCore [2] reduces the loss of normal and abnormal information by considering neighboring pixels when computing patch-level features, and Greedy Coreset Subsampling reduces computational costs. PatchCore [2] can detect anomalies in local areas with high accuracy on datasets such as MVTecAD [1]. However, PatchCore [2] is weak against anomalies in co-occurrence relationships because it relies solely on a pre-trained CNN to extract features. The proposed SA-PatchCore solves this problem of PatchCore [2] by applying self-attention to the extracted features, so that it can also detect anomalies in co-occurrence relationships.

B. SELF-ATTENTION
Self-attention was proposed for translation tasks in natural language processing [4] and can consider the context between distant words. Specifically, the input sequence is linearly transformed to generate three variables: query, key, and value. The inner product of the query and key is normalized using softmax to obtain the relevance of the key (search destination) to the query (search source). The sum of the values weighted by this relevance is the output of the self-attention. Because the self-attention module can consider relevance across the entire input sequence, it solved the problem of the recurrent neural networks used in conventional machine translation, in which relevance vanishes as the distance within the input sequence grows. The module thus obtains features based on global relevance regardless of the distance within the input sequence.
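The query, key, and value computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation of [4]: the projection matrices `w_q`, `w_k`, and `w_v` are random placeholders standing in for learned weights, the sequence length and dimensions are arbitrary, and the division by the square root of the key dimension is the scaling used in the Transformer, which is assumed here.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Self-attention over a sequence x of shape (n, d_in)."""
    # Linear transformations producing query, key, and value.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Relevance of every key to every query (assumed scaling by sqrt(d_k)).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    # Output: values weighted by the relevance.
    return weights @ v, weights

# Hypothetical sizes: 5 tokens, 8 input dimensions, 4 projected dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to one, so every output row is a convex combination of all value vectors, regardless of how far apart the corresponding positions are in the sequence.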
In recent years, the use of self-attention has been actively studied in the image recognition field as well. SASA [37], LRNet [38], SANet [39], and Axial-SASA [40] proposed models in which self-attention layers replace the convolution layers in ResNet, as a simple approach to using self-attention in image recognition. Each of these models uses a different form of self-attention for the replacement. The Vision Transformer [5] proposes a model structure that divides the input images into patches and feeds these patches into several transformer blocks. It shows performance comparable to or better than that of conventional CNNs. DETR [41], VideoBERT [42], VIL-BERT [43], CCNet [44], AA-CN [45], and BoTNet [46] are models using both convolution and self-attention. The computational complexity of self-attention grows quadratically with the length of the input sequence, so it becomes enormous when high-resolution images are input. BoTNet [46] solves this problem by applying self-attention to feature maps whose resolution has been reduced using convolution. Our proposed SA-PatchCore has a similar construction and prevents the computational complexity from increasing because it applies self-attention to feature maps compressed using a pre-trained CNN.
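To make the quadratic growth concrete, the following sketch counts the entries of the attention matrix for two input sizes. The 224 × 224 size matches the input resolution used in the experiments; the 14 × 14 size is an assumption, namely the spatial size a standard ResNet-style Layer 3 (total stride 16) would produce for that input.

```python
def attention_matrix_entries(height, width):
    # Every position attends to every position: (h * w) ** 2 entries.
    n = height * width
    return n * n

pixels = attention_matrix_entries(224, 224)    # attention over raw pixels
feature_map = attention_matrix_entries(14, 14)  # assumed Layer 3 resolution
ratio = pixels // feature_map
```

Under these assumptions, attending over the compressed feature map needs 65,536 times fewer attention weights than attending over raw pixels.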

III. METHOD
Our proposed model is based on PatchCore [2], a state-of-the-art anomaly detection model on MVTecAD [1], and adds a self-attention module to it. We named the proposed model SA-PatchCore. SA-PatchCore retains the high anomaly detection performance of PatchCore [2] for local anomalies, and the introduction of the self-attention module enables highly accurate anomaly detection in co-occurrence relationships. Fig. 3 depicts the model structure of SA-PatchCore.

A. PatchCore-BASED STRUCTURE
SA-PatchCore is based on PatchCore [2] and is composed of several parts.

1) FEATURE EXTRACTION
SA-PatchCore uses the WideResNet50 [36] pre-trained on ImageNet to extract features of input images. The final output of each hierarchy of the convolutional network is extracted as a feature map and used for anomaly detection. Generally, the deeper the hierarchy, the more global the captured features become and the more they are specialized for the pre-training task. SA-PatchCore uses feature maps of the middle layers of the WideResNet50 [36] because local features of the unknown data are crucial in the industrial anomaly detection task. Specifically, SA-PatchCore uses Layers 2 and 3 of the WideResNet50 [36]. Layer 2 has a more local feature representation than Layer 3; therefore, the PatchCore algorithm that aggregates features in the neighborhood is applied to Layer 2 to detect local anomalies. Let $\phi_2(h, w, c)$ be the feature map of Layer 2 with height $h$, width $w$, and $c$ channels. The patch-level features that aggregate local features in the neighborhood are expressed as follows:

$$P_2 = f_{agg}(\phi_2(h, w, c)) \tag{1}$$

where $f_{agg}$ is the aggregation function over the neighborhood. As in PatchCore [2], SA-PatchCore uses average pooling with a kernel size of 3, stride 1, and padding 1. As Layer 3 has a more global feature representation than Layer 2, its features are used as input to the self-attention module for detecting anomalies in co-occurrence relationships. Let the feature map of Layer 3 be $\phi_3(h, w, c)$ and let the self-attention module be a transformation function $f_{SA}$ that produces features with the information necessary for anomaly detection of co-occurrence relationships. The features considering relationships obtained from Layer 3 are expressed as follows:

$$P_3 = f_{SA}(\phi_3(h, w, c)) \tag{2}$$

$P_2$, which aggregates features in the neighborhood to detect local anomalies, and $P_3$, which contains the information necessary for anomaly detection of co-occurrence relationships, are concatenated and stored in a memory bank $M$. Since $P_3$ has a lower resolution than $P_2$, it is resized to match the resolution of $P_2$ before concatenation.
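The aggregation, resizing, and concatenation steps can be sketched as follows. This is an illustrative NumPy sketch, not the released implementation: the feature-map shapes (28 × 28 × 512 for Layer 2 and 14 × 14 × 1024 for Layer 3) are assumed values for a 224 × 224 input, the feature maps are random stand-ins, zero padding with a fixed divisor of 9 is assumed for the average pooling, and nearest-neighbor interpolation is an assumed choice because the text does not name the resizing method. The array `p3` plays the role of the self-attention output.

```python
import numpy as np

def avg_pool_3x3(fmap):
    """3x3 average pooling, stride 1, zero padding 1, on an (h, w, c) map."""
    h, w, _ = fmap.shape
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(fmap, dtype=float)
    # Sum the nine shifted views that make up each 3x3 neighborhood.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of an (h, w, c) map (assumed interpolation)."""
    h, w, _ = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[rows][:, cols]

def build_patch_features(phi2, p3):
    """Concatenate aggregated Layer 2 features with resized Layer 3 features."""
    p2 = avg_pool_3x3(phi2)
    p3_up = resize_nearest(p3, p2.shape[0], p2.shape[1])
    patches = np.concatenate([p2, p3_up], axis=-1)
    # One row per spatial position: these rows go into the memory bank.
    return patches.reshape(-1, patches.shape[-1])

rng = np.random.default_rng(0)
phi2 = rng.normal(size=(28, 28, 512))   # stand-in for the Layer 2 feature map
p3 = rng.normal(size=(14, 14, 1024))    # stand-in for the self-attention output
features = build_patch_features(phi2, p3)
```

With these assumed shapes, each image contributes 784 patch-level feature vectors of dimension 1536 to the memory bank.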

2) CORESET SUBSAMPLING
The required memory bank becomes large and the inference time increases significantly as the number of stored features grows. PatchCore [2] solves this problem by subsampling the features using greedy coreset subsampling, and SA-PatchCore uses the same mechanism. Coreset subsampling finds a subset $S \subset A$ such that the solution to a problem computed on $S$ closely approximates the solution computed on the full set $A$ [47]. Because PatchCore [2] relies on nearest neighbor computation, the coreset $M_c$ of the memory bank $M$ is chosen so that its coverage of the patch-level feature space is approximately the same as that of the original memory bank $M$ [48], [49]. PatchCore [2] uses the iterative greedy approximation proposed in [49] because the exact computation of $M_c$ is NP-hard.
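A common way to realize such a greedy approximation is farthest-point (k-center greedy) selection: repeatedly add the feature that is farthest from the features already selected. The sketch below illustrates that idea under stated assumptions; it is not claimed to match [49] or the released code step by step, and the memory bank is random data with hypothetical dimensions.

```python
import numpy as np

def greedy_coreset(memory_bank, n_select, seed=0):
    """Farthest-point greedy selection of n_select rows from memory_bank."""
    rng = np.random.default_rng(seed)
    first = int(rng.integers(len(memory_bank)))
    selected = [first]
    # Distance from every feature to its nearest selected feature.
    min_dist = np.linalg.norm(memory_bank - memory_bank[first], axis=1)
    for _ in range(n_select - 1):
        # Pick the feature that is currently covered worst.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(memory_bank - memory_bank[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)

rng = np.random.default_rng(1)
memory_bank = rng.normal(size=(1000, 16))               # hypothetical features
coreset_idx = greedy_coreset(memory_bank, n_select=10)  # 1% sampling rate
coreset = memory_bank[coreset_idx]
```

Because each step minimizes the largest distance from any stored feature to the coreset, nearest neighbor distances computed against `coreset` stay close to those computed against the full memory bank.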

3) ANOMALY DETECTION
SA-PatchCore selects $m^*$, the nearest neighbor of the patch-level feature $m^{test}$ of the test data, from among the patch-level features $m \in M$ of the training data stored in the memory bank. It estimates the patch-level anomaly score $s$ of the test image $X^{test}$ from the distance between the patch-level features $m^{test}$ and $m^*$.
The image-level anomaly score $S$ for the test image $X^{test}$ is obtained as the maximum patch-level anomaly score $s$ in $X^{test}$.
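The scoring rule can be sketched as a nearest neighbor search followed by a maximum. The Euclidean distance is an assumption made for illustration, since the text above only refers to "the distance", and the tiny two-dimensional memory bank is made-up data.

```python
import numpy as np

def patch_anomaly_scores(test_patches, memory_bank):
    """Distance from each test patch feature to its nearest memory-bank feature."""
    diffs = test_patches[:, None, :] - memory_bank[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)  # assumed Euclidean distance
    return dists.min(axis=1)

def image_anomaly_score(test_patches, memory_bank):
    """Image-level score: the maximum patch-level score in the image."""
    return patch_anomaly_scores(test_patches, memory_bank).max()

memory_bank = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
test_patches = np.array([[0.0, 0.0], [3.0, 4.0]])
patch_scores = patch_anomaly_scores(test_patches, memory_bank)
image_score = image_anomaly_score(test_patches, memory_bank)
```

In this toy example the first test patch coincides with a stored feature and scores zero, while the second is far from every stored feature and therefore determines the image-level score.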

B. SELF-ATTENTION MODULE
SA-PatchCore introduces a self-attention module (Fig. 4) to detect co-occurrence anomalies. This module is applied to the feature maps obtained from Layer 3 of the WideResNet50 [36], and it is used as a transformation module to obtain feature maps $X_{SA}$ with the information required for the anomaly detection of co-occurrence relationships. Once a feature map of Layer 3, $\phi_3(h, w, c)$, is obtained, max pooling with a kernel size of 3, stride 1, and padding 1 is applied to emphasize the nearby features, and the result is flattened into vectors $X \in \mathbb{R}^{hw \times c}$. $X$ is replicated three times to serve as the query, key, and value of the self-attention in the Transformer [4]. $X_{SA}$ is expressed as follows:

$$X_{SA} = \mathrm{softmax}\left(\frac{X X^{\top}}{\sqrt{d_X}}\right) X$$

where $d_X$ is the depth of $X$. The vector $X_{SA}$, which considers the relationship between distant features, is obtained by calculating the relationship between pixels as the product of the query and key, normalizing it with softmax to obtain weights, and multiplying these weights by the value. By reshaping the obtained $X_{SA}$ to the size of the original feature map, a feature map $P_3$ with the information necessary for detecting anomalies of co-occurrence relationships is obtained. Unlike the Transformer [4], the self-attention module does not calculate keys, queries, and values by linear transformation but uses the max-pooled features directly, because its only purpose is to generate feature maps based on the relationships necessary to detect anomalies in co-occurrence relationships. Furthermore, since the computational complexity of self-attention grows quadratically with the input sequence length, the high computational complexity can be a problem when high-resolution images are input into the self-attention. However, SA-PatchCore has the advantage that the computational complexity does not become a bottleneck because it inputs feature maps compressed using a pre-trained CNN to the self-attention module.
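The module described above has no trainable parameters, so it can be sketched directly in NumPy. This is a hedged illustration, not the released code: the 14 × 14 × 1024 input is an assumed Layer 3 shape filled with random values, and padding the borders with negative infinity before max pooling is an assumption chosen so that padded positions never win the maximum.

```python
import numpy as np

def max_pool_3x3(fmap):
    """3x3 max pooling, stride 1, padding 1, on an (h, w, c) map."""
    h, w, _ = fmap.shape
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(windows, axis=0)

def sa_module(phi3):
    """Projection-free self-attention: query = key = value = pooled features."""
    h, w, c = phi3.shape
    x = max_pool_3x3(phi3).reshape(h * w, c)
    # Relationship between every pair of positions, scaled by sqrt(depth).
    scores = x @ x.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of values, reshaped back to the feature-map layout.
    return (weights @ x).reshape(h, w, c)

rng = np.random.default_rng(0)
phi3 = rng.normal(size=(14, 14, 1024))  # stand-in for the Layer 3 feature map
p3 = sa_module(phi3)
```

Since the query, key, and value are the same tensor, nothing has to be learned, which is consistent with storing only features of normal data in a memory bank and training no network.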

IV. EXPERIMENTS
To verify the effectiveness of SA-PatchCore, we created the CAD-SD, which includes both local anomalies and co-occurrence relationship anomalies. Then, we conducted anomaly detection experiments on the dataset.

A. CO-OCCURRENCE ANOMALY DETECTION SCREW DATASET (CAD-SD)
MVTecAD [1] is a typical dataset for evaluating anomaly detection methods; however, it contains only local anomalies, in which the abnormal part exists only in a portion of the image, such as scratches and dirt. Currently, there is no dataset for anomalies of co-occurrence relationships, which are anomalies of combinatorial relationships. Therefore, we created the CAD-SD, which includes both local anomalies and co-occurrence relationship anomalies in images of products consisting of screw rods and hex nuts. The images in the dataset were taken at random angles using a camera. Table 1 shows the imaging environment of the dataset. The camera used was a DFK33UX183 manufactured by Argo Corporation. The aperture and shooting distance were set to 16 and 25 cm, respectively. The images were cropped from 5472 × 3648 to 700 × 700. An HPR2-75SW manufactured by CCS Corporation was used for the lighting, and a PD2-5024 (A) was used for the power supply. Figure 5 shows examples of the images in the dataset. The CAD-SD includes normal images of the product with a hex nut attached to one side of the screw rod. The abnormal images in the dataset are roughly divided into local anomalies and co-occurrence relationship anomalies. The local anomalies are "Scratch," in which a portion of the product is scratched, and "Paint," in which paint adheres to a part of the product. The co-occurrence relationship anomalies are "Over-coupling," where hex nuts are coupled on both sides of the screw rod, and "Lacking," where hex nuts are not

B. EXPERIMENTAL CONDITION
We experimented with anomaly detection using the CAD-SD. The images in the dataset were resized to 224 × 224 and used as input to the model. The CPU was an Intel Core i9-9900K @ 3.60 GHz with 32 GB of memory. The GPU was an NVIDIA GeForce RTX 3090 with 24 GB of memory. The batch size was 1, and the sampling rate of Greedy Coreset Subsampling was 1%. PatchSVDD [20], PaDiM [35], PatchCore [2], and CS-Flow [31] were used as comparison methods. The AUROC was used as the evaluation metric for image-level anomaly detection; it was calculated over all test images and for each anomaly type.
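For reference, the image-level AUROC equals the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal image, with ties counted as one half. The sketch below computes it that way; the labels and scores are made-up values for illustration, not results from the experiments.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via pairwise comparison of anomalous (1) and normal (0) scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1]                 # hypothetical ground truth
scores = [0.1, 0.4, 0.35, 0.8, 0.38]     # hypothetical anomaly scores
value = auroc(labels, scores)
```

Here five of the six anomalous-normal pairs are ranked correctly, so the AUROC is 5/6.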

C. RESULTS
Table 2 shows the results of image-level anomaly detection on the CAD-SD. Table 3 shows the evaluation for each type of anomaly in the CAD-SD. SA-PatchCore achieved the best overall performance. SA-PatchCore is slightly less accurate than PatchCore [2] in detecting the local anomalies "Scratch" and "Paint," but it is more accurate than the other methods. For the co-occurrence anomalies "Over-coupling" and "Lacking," it is on average about 30% more accurate than PatchCore [2] and achieves the highest accuracy among all methods. This result indicates that SA-PatchCore, by introducing the self-attention module, significantly improves the detection of co-occurrence anomalies while retaining the effectiveness of PatchCore [2] for anomalies in local regions. Figure 6 shows the results of localizing the anomaly area. The heatmap is normalized based on the patch-wise anomaly scores of all test images, and the lower limits are set to appropriate values. Red indicates a higher anomaly score. SA-PatchCore is able to identify both local anomalies and co-occurrence anomalies. Table 4 shows the inference speed for a single image on the CAD-SD. SA-PatchCore achieves almost the same inference speed as PatchCore [2], which is faster than the other methods. This indicates that SA-PatchCore achieves high detection accuracy by introducing self-attention while maintaining a high inference speed.

V. DISCUSSION
This section discusses SA-PatchCore from two viewpoints. First, we evaluated the anomaly detection performance on several anomaly detection datasets including MVTecAD [1]. Next, we examined the optimization of the model structure by focusing on the hierarchy of feature extraction and the pooling in the self-attention module.

A. ANOMALY DETECTION ON OTHER DATASETS
We experimented with MVTecAD [1], a widely used anomaly detection dataset that excludes co-occurrence anomalies, to investigate the anomaly detection performance of SA-PatchCore on local anomalies. Table 5 shows that the anomaly detection performance of SA-PatchCore on MVTecAD [1] was slightly lower than that of PatchCore [2] and CS-Flow [31] but better than that of PatchSVDD [20] and PaDiM [35]. Table 6 shows the results on the BeanTech Anomaly Detection dataset (BTAD) [50] and the AITEX dataset [51]. SA-PatchCore achieves higher detection accuracy than PatchCore [2] on these datasets. These existing datasets exclude co-occurrence anomalies and consist mainly of local anomalies. The results therefore show that SA-PatchCore has sufficient anomaly detection performance even for datasets consisting only of local anomalies, while the self-attention module improves the detection of co-occurrence anomalies.

B. OPTIMIZATION OF THE MODEL STRUCTURE

1) HIERARCHY OF FEATURE EXTRACTION
The proposed SA-PatchCore feeds Layer 2 of the WideResNet50 [36] into the average pooling for local feature extraction and Layer 3 into the self-attention module for feature extraction of co-occurrence relationships. To evaluate the validity of this structure, we also conducted anomaly detection experiments on the CAD-SD with a model structure in which Layers 2 and 3 are combined and input into both the average pooling and the self-attention module. This alternative structure directly incorporates the self-attention module into PatchCore [2]. Table 7 shows that the structure of SA-PatchCore is more effective than the structure that follows the original PatchCore [2], which uses Layers 2 and 3 jointly. This confirms that SA-PatchCore is a suitable model structure for detecting anomalies in local regions and co-occurrence relationships.

2) POOLING IN THE SELF-ATTENTION MODULE
We investigated the suitability of max pooling in the self-attention module of SA-PatchCore by comparing it, on the CAD-SD, with variants that use average pooling or no pooling instead. The results in Table 8 show that anomaly detection performance is best when max pooling is used, indicating that max pooling is effective for detecting anomalies in co-occurrence relationships.

VI. CONCLUSION
In this study, we proposed SA-PatchCore, which extends the current state-of-the-art PatchCore [2] to detect anomalies in co-occurrence relationships by introducing a self-attention module. This module is a transformation module that obtains feature maps by considering the relationship between features without the linear transformation of conventional self-attention or the training it requires. SA-PatchCore prevents the computational complexity of self-attention from becoming a bottleneck by inputting feature maps compressed using a pre-trained CNN into the self-attention module. Furthermore, since no anomaly detection dataset includes co-occurrence anomalies, we prepared the CAD-SD, which includes both local and co-occurrence anomalies. SA-PatchCore has sufficient anomaly detection performance on MVTecAD [1], which is composed only of local anomalies, and it achieves state-of-the-art anomaly detection performance on the CAD-SD.