Extensive Knowledge Distillation Model: An End-to-End Effective Anomaly Detection Model for Real-Time Industrial Applications

Detecting anomalies is an essential task in many industries. Current state-of-the-art methods rely on a large number of parameters to reach high accuracy, which may not be suitable for cost-effective real-time applications. Additionally, developing robust detection models is difficult due to limited anomaly samples. To address these issues, we propose an end-to-end anomaly detection method that combines effective data generation with comprehensive knowledge distillation. In particular, the proposed approach first employs a highly effective generative model to generate realistic anomaly images. It then transfers the knowledge of the pre-trained master network's essential intermediate layers and final layer to a novice network using the knowledge distillation technique. In experiments on four real-life datasets, the proposed model outperforms its counterparts, including state-of-the-art models, by 0.6% on the MNIST and CIFAR-10 datasets and 0.2% on the custom dataset, and stays competitive on the MVTec AD dataset. Additionally, the proposed model outperforms all of its peers in the number of trainable parameters, having only 0.17 M, fewer than half as many as the smallest baseline model. Overall, the proposed approach offers an efficient solution to anomaly detection that achieves high accuracy despite limited anomaly samples and fewer parameters.


I. INTRODUCTION
The goal of anomaly detection (AD) is to identify inputs during testing that appear unusual or novel to the model based on the normal samples observed during training. AD has been a crucial and challenging task in computer vision, with diverse applications such as using images to ensure the quality of industrial products [1], [2], monitoring health processes [3], [4], the industrial internet of things [5], or network monitoring systems [6]. Nevertheless, there are two primary challenges in developing an anomaly detection model. First, it is difficult to acquire enough anomaly images to test the network, since defects occur infrequently in the application area. Second, the types of anomalies are unexpected and diverse, making it difficult for the classifier to make correct predictions. (The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.)
In manufacturing enterprises, the elimination of defective products is strongly linked to the satisfaction level of product customers. It is therefore imperative for big industrial companies to maintain stable, high-standard defect monitoring. Traditionally, human labor is employed to tackle this task. Nonetheless, there are numerous drawbacks to this approach.
These drawbacks include the time and money needed to train workers and the human factors that lead to errors. Hence, artificial intelligence (AI)-based models are widely used to automate defect monitoring systems in manufacturing, since they show more stability in this area [7], [8], [9], [10].
Defective products are widely inspected using automated visual inspection (AVI). In AVI applications, machine learning (ML) and deep learning (DL) are common AI approaches [11], [12], [13]. In light of the relevance of AVI techniques, extensive research has been conducted using computer vision applications to study this topic. The existing DL-based techniques can be classified into four main categories: generative adversarial network (GAN)-based, autoencoder (AE)-based, distillation-based, and self-supervised learning approaches. Although these DL-based AVI approaches demonstrate promising results in some aspects, they also have several limitations. In particular, GAN-based approaches can learn from unlabeled data but face difficulties in detecting uncommon anomalies [14], [15]. AE-based approaches can be trained efficiently on large datasets but suffer from computational complexity and fail on intricate datasets [16], [17]. Existing distillation-based and self-supervised learning approaches excel in detecting previously unseen anomalies but require enormous training time and do not resolve the imbalanced-dataset problem [18], [19].
It is critical to note that both quality and speed are crucial during automated outlier detection in industries, since both are closely correlated with a company's profit [20]. A lightweight model demonstrating high accuracy is therefore required, yet achieving high accuracy with lightweight models is challenging. To tackle this problem, there have been several efforts employing the knowledge distillation (KD) technique to detect possible anomalies [18], [21]. The KD method uses a master network to teach an inexperienced novice network to distinguish normal images through distillation during training. As a result, a novice network possessing only normal-image features behaves differently when confronted with anomalous images. Lastly, classification is performed using the anomaly score resulting from the discrepancy between the novice network and the master network. In this way, the most important expertise of a master network with a large number of parameters can be transferred to a lightweight novice network, yielding a lightweight model with high accuracy.
The previous efforts that utilized the last layer output of the master network to teach the novice network neglect valuable knowledge of intermediate layers [18]. It is proven that the intermediate layers of neural networks provide high-resolution features, whereas the last layers produce semantically rich features. Zhang et al. [22] in their research showed that the activation values of the middle layers of neural networks provide a reliable way of representing the input images. Therefore, by solely using the final layer, only a small fraction of the master's knowledge is shared with the novice.
To remedy the aforementioned problems, an end-to-end anomaly detection system is proposed that first employs contrastive unpaired translation (CUT) model [23] to generate real-life-like anomaly images on a custom dataset. The CUT model is a recent state-of-the-art (SOTA) image translation algorithm that does not rely on cycle consistency, and one-sided images can be translated. Therefore, this method generates better-quality synthetic images compared to existing methods. Following this, a KD-based approach with fewer parameters is incorporated to achieve high accuracy. In particular, a pre-trained powerful model is utilized as a master network. For knowledge transfer, multiple crucial intermediate and final layers are used to gather both semantically rich and multilevel resolution features of the input images. This research results in the development of a lightweight novice network. This is capable of achieving a competitive accuracy level compared to the widely used accurate models available at present.
To summarize, this paper contributes fourfold:
• Propose an end-to-end anomaly detection system that cures the imbalanced-dataset problem and classifies anomalies with high accuracy.
• Construct a KD-based model that comprehensively transfers a master network's knowledge to a smaller novice network through both essential intermediate layers and the final layer.
• Owing to comprehensive knowledge distillation, the proposed model has surpassed the performance of previous extensively used models or achieved competitive performance on benchmark and custom datasets.
• Compared to previous methods, our approach is less computationally expensive owing to an elaborately designed model architecture.

The rest of the paper is organized as follows. A review of related research in AD is presented in Section II. A detailed description of the proposed methodology, as well as the network architecture, is provided in Section III. Following this, Section IV evaluates the experimental results and compares the performance of the proposed method with that of other recent studies. The discussion of the results is provided in Section V. Conclusions and future work are discussed in Section VI.

II. RELATED WORKS
In this section, the different approaches to the AD task are reviewed and their specifics are discussed.
A. GAN-BASED APPROACHES
These approaches rely on a discriminator network: anomalous data points that are significantly different from normal data are assigned a low probability of being real data.
Schlegl et al. [14] proposed AnoGAN, an unsupervised learning model that involves mapping images from their original form to a latent space. When applied to newly collected data, the model identifies anomalies and assigns scores to image patches based on how well they fit into the previously learned distribution.
Akcay et al. [15] presented a model that uses encoder-decoder-encoder sub-networks to create a lower-dimensional vector that is used to reconstruct the generated image. By minimizing the distance between these images and their latent vectors during training, the model learns the data distribution of normal samples. At inference time, a larger distance from this learned distribution indicates an anomaly.
More recently, Chen et al. [25] have proposed EAL-GAN which is a type of conditional GAN that uses a single generator and multiple discriminators to identify anomalies through an auxiliary classifier. It also incorporates an innovative ensemble learning loss function that addresses class imbalance by compensating for individual discriminators' weaknesses.
However, all of the aforementioned approaches face significant difficulties on two fronts. First, they are not effective in one-class scenarios. Second, they struggle to detect anomalies that are significantly different from normal data but occur only rarely.

B. AUTOENCODER-BASED APPROACHES
These approaches rely on the principle that learning typical latent features enables the model to reconstruct normal inputs more accurately than abnormal ones. As a result, anomalies produce greater reconstruction errors than normal inputs.
Bergmann et al. [26] introduced a method that uses similarity loss as a replacement for the traditional mean-squared error (MSE) to train the model. As a consequence, their proposed model demonstrated superior results in defect segmentation. However, the method experiences difficulties in one-class settings.
Zong et al. [27] proposed a deep autoencoding Gaussian mixture model (DAGMM) that can learn complex representations of data and perform unsupervised anomaly detection. Moreover, this approach does not require labeled data and can be trained on large amounts of unlabeled data. Nevertheless, this approach may not be suitable for high-dimensional data due to the use of a Gaussian mixture model. In addition, training can be time-consuming and computationally intensive.
Dehaene et al. [17] introduced a method that projects abnormal data onto a learned normal-data manifold using gradient descent on an energy derived from the AE's loss function. The method can be improved by adding regularization terms to the model prior to the desired projection. By iteratively updating the AE input, it can produce higher-quality images without losing high-frequency information to the AE bottleneck.
However, this approach may struggle with identifying anomalies in one-class settings. Moreover, the approach relies on a large number of parameters which makes this method computationally expensive.
Abati et al. [16] proposed a broad system in which a deep AE is given a parametric density estimator that uses an autoregressive process to learn the probability distribution of its latent representations. The main contribution of their approach is that it does not assume any specific characteristics of the anomalies, which makes it easily adaptable to different situations. Nevertheless, these techniques fail when used on industrial or intricate datasets.

C. DISTILLATION-BASED AND SELF-SUPERVISED LEARNING APPROACHES
Caruana et al. [28] were the first to propose the idea of training a smaller, more compact model to mimic the behaviour of a larger one. Using ensemble methods, they solved problems such as memory storage deficits and test set challenges. Hinton et al. [29] proposed a method for training the novice model to learn from the soft targets provided by the master model. These targets are probability distributions over classes instead of one-hot vectors. However, both of these approaches require careful tuning of the temperature parameter used to soften the master model output probabilities. In addition, they are very sensitive to hyperparameters and require enormous training time.
Bergmann et al. [18] proposed an uninformed-students method that develops discriminative latent embeddings learned by the student network to represent input data in a lower-dimensional space, which are sensitive to anomalous data while preserving normal data structure. However, this method requires a teacher network to be trained on a larger dataset that includes both normal and anomalous data. This may be time-consuming and require a significant amount of labeled data.
Salehi et al. [19] demonstrated the effectiveness of using intermediate activation values to better exploit the master network knowledge. By combining MSE and the angle between the output values they taught the novice network to mimic the master network's actions. However, their method does not solve the problem of imbalanced datasets in the anomaly detection domain.
Li et al. [30] proposed a two-step approach to creating anomaly detectors from regular training data. They initially used self-supervised deep learning to train representations and then constructed a one-class classifier based on those representations. Their method classifies standard data using the CutPaste technique, which involves cutting and pasting image patches into an image at random locations. Nevertheless, this method suffers from a large number of parameters since it relies on a large pre-trained model to classify.

III. PROPOSED METHODOLOGY
In this section, a detailed explanation of the proposed technique is provided, including the different stages involved. These stages include data preprocessing, data learning, and inference. The diagram in Fig. 1 illustrates an end-to-end framework for the proposed model.

A. DATA PREPROCESSING
As discussed in Section I, the amount of anomalous data available in the real world is limited. In the context of anomaly detection, generating more anomalous examples when only a few are available can benefit the development of an accurate and robust detection model. To remedy the imbalanced-dataset problem, we employ the powerful CUT model to create real-life-like anomaly images for our custom dataset, provided by a private company in Daegu, South Korea. Before being fed to the CUT model, the abnormal images of the imbalanced dataset are extracted from their directories and transformed into tensors, which are better suited to representing multidimensional data and reduce computation during training. The resulting tensor is four-dimensional, denoted as X ∈ R^(N×C×H×W), where N represents the total number of images, C the number of channels, H the image height, and W the image width. Afterward, the images are resized to 256 × 256. The resized images are then standardized to follow a standard normal distribution using (1).
Equation (1) relates the original data X and the standardized data X_std:

X_std^(i) = (X^(i) − μ) / σ,   i = 1, . . . , N,   (1)

where i represents a specific data sample, N the total number of data samples in the dataset, and μ and σ the mean and standard deviation of the data. This synthetic data generation technique enables the translation of images without paired examples. It utilizes contrastive learning, where the aim is to make patches in the input and output images correspond to similar points in a learned feature space. The method achieves this by maximizing the mutual information between the patches. Furthermore, CUT draws negatives from within the input image itself rather than from the rest of the dataset. Hence, this method is faster and less memory-intensive than existing methods and produces high-quality results even when only the source domain is available during training. Balanced datasets with a sufficient number of anomaly images do not undergo this preprocessing step. Fig. 2 shows five random samples of both real and generated anomaly images.
As illustrated in Fig. 2, the generated anomaly images are visually similar to real ones. Therefore, synthetic anomaly images can be generated to address the problem of limited anomaly images while developing an anomaly detection model.
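The core of CUT's patchwise contrastive objective can be illustrated with a small, self-contained sketch. This is not the authors' implementation; the embedding size, temperature value, and random features below are illustrative assumptions. A query patch embedding from the output image is pulled toward its corresponding input patch and pushed away from negatives drawn from within the same image:

```python
import numpy as np

def patch_nce_loss(query, positive, negatives, tau=0.07):
    """InfoNCE-style loss for one patch: pull `query` (an output-image
    patch embedding) toward `positive` (the corresponding input patch)
    and away from `negatives` (other patches of the SAME input image)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # correct class is index 0

rng = np.random.default_rng(0)
q = rng.normal(size=16)
pos_aligned = q + 0.05 * rng.normal(size=16)     # embedding close to query
negs = [rng.normal(size=16) for _ in range(8)]   # internal negatives

loss_good = patch_nce_loss(q, pos_aligned, negs)
loss_bad = patch_nce_loss(q, rng.normal(size=16), negs)
print(loss_good, loss_bad)  # aligned patches yield the smaller loss
```

Minimizing this loss over many patches is what encourages corresponding input/output patches to map to nearby points in the learned feature space.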
Then the generated images X are transformed into tensors and resized to 256 × 256, whereas the images in the MVTec AD [31] dataset are resized to 128 × 128. The resized images undergo the standardization process described in (1).
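The preprocessing steps above (tensor conversion, resizing, and the standardization of (1)) can be sketched as follows. The nearest-neighbor resize and the toy batch are stand-ins for a real image-loading pipeline such as torchvision's:

```python
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbor resize for one (C, H, W) image (a stand-in for
    the interpolation a library such as torchvision would provide)."""
    c, h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[:, rows[:, None], cols]

def standardize(x):
    """Equation (1): zero-mean, unit-variance standardization."""
    return (x - x.mean()) / x.std()

# Toy batch standing in for images loaded from their directories:
# X in R^(N x C x H x W) with N=4 images, C=3 channels, 64x64 pixels.
rng = np.random.default_rng(42)
batch = rng.uniform(0, 255, size=(4, 3, 64, 64))

resized = np.stack([resize_nn(img, 256) for img in batch])
x_std = standardize(resized)

print(resized.shape)                                   # (4, 3, 256, 256)
print(round(x_std.mean(), 6), round(x_std.std(), 6))   # approx. 0.0 and 1.0
```

After this step each batch follows a standard normal distribution, matching the tensor shape X ∈ R^(N×C×H×W) described above.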

B. DATA LEARNING
During the data learning process, a dataset X_train = {x_1, . . . , x_n} containing only normal images is fed to both the master and novice models. Afterward, using the combination of angle loss and weighted distance loss E_total = E_dir + αE_dis discussed in Section III-B2, we calculate the error between the master network M and the novice network N based on the outputs of the essential intermediate layers and the final layer. The loss value is then used to optimize the novice network's parameters. The overall algorithm for the data learning stage is provided in Algorithm 1.
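A minimal sketch of the data learning loop is given below. It stands in for Algorithm 1 under heavy simplifications: the master and novice are tiny linear maps rather than VGG-style networks, and numerical gradients replace the Adam optimizer of the actual PyTorch implementation; only the structure (frozen master, trainable novice, combined loss E_total = E_dir + αE_dis) is faithful:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "master" and trainable "novice": tiny linear feature extractors
# standing in for the VGG16 master and the lightweight novice network.
M = rng.normal(size=(4, 8))          # master weights (fixed, pre-trained)
W = rng.normal(size=(4, 8)) * 0.1    # novice weights (to be optimized)
X_train = rng.normal(size=(32, 8))   # normal images only

ALPHA = 0.05                         # loss-balancing hyperparameter

def e_total(W, X):
    """E_total = E_dir + alpha * E_dis over one 'layer' of activations."""
    a_m, a_n = X @ M.T, X @ W.T
    e_dis = np.mean((a_m - a_n) ** 2)
    cos = np.sum(a_m * a_n, axis=1) / (
        np.linalg.norm(a_m, axis=1) * np.linalg.norm(a_n, axis=1))
    e_dir = np.mean(1.0 - cos)
    return e_dir + ALPHA * e_dis

def num_grad(f, W, eps=1e-5):
    """Central-difference gradient (autograd in the real implementation)."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps; Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

loss_before = e_total(W, X_train)
for _ in range(300):  # plain gradient descent on the novice only
    W -= 0.05 * num_grad(lambda w: e_total(w, X_train), W)
loss_after = e_total(W, X_train)
print(loss_before, '->', loss_after)  # loss falls as the novice mimics the master
```

The key point is that only the novice's parameters are updated; the master stays frozen throughout training.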

1) NETWORK ARCHITECTURE
As the master network, the VGG16 network [32] pre-trained on ImageNet [33] is used, since it has demonstrated excellent results in the anomaly detection domain [34]. For the novice network, we formulated a new model similar to VGG16 but with far fewer channels, since the two networks' layers must match for the error calculation. Fig. 3 provides a general overview of the proposed method. As shown in Fig. 3, both the master and novice networks consist of 5 convolutional blocks. However, the number of channels in each block is significantly lower in the novice network; as a result, the novice network has only a small fraction of the master network's trainable parameters (0.17 M versus 138.3 M). A more detailed investigation of this is provided in Section V. The numbers in each convolutional block represent the output channels of the convolutional layers, and 'M' corresponds to the max-pooling layer. The error between several essential intermediate layers and the final layers of the master and novice is computed and combined to obtain the total error E_total. Fig. 4 illustrates the contents of the convolutional blocks of the networks.
As can be seen in Fig. 4, the first and second convolutional blocks contain two convolutional layers, while the third to fifth blocks each have three convolutional layers.

VOLUME 11, 2023
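To see how shrinking the per-block channel counts shrinks the parameter count, the 3×3 convolution parameters of a VGG16-style configuration can be tallied against a slimmer layout. The novice channel counts below are hypothetical (the paper does not list them); only the VGG16 configuration is standard:

```python
# VGG16-style configuration: numbers are output channels, 'M' = max-pool.
MASTER_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
              512, 512, 512, 'M', 512, 512, 512, 'M']
# Hypothetical novice layout with each block's channels cut by 8x
# (illustrative only; not the paper's published channel counts):
NOVICE_CFG = [8, 8, 'M', 16, 16, 'M', 32, 32, 32, 'M',
              64, 64, 64, 'M', 64, 64, 64, 'M']

def conv_params(cfg, in_ch=3, k=3):
    """Count weights + biases of the 3x3 convolutional layers."""
    total = 0
    for c in cfg:
        if c == 'M':                  # max-pooling has no parameters
            continue
        total += k * k * in_ch * c + c
        in_ch = c
    return total

master = conv_params(MASTER_CFG)
novice = conv_params(NOVICE_CFG)
print(f"master conv params: {master:,}")  # 14,714,688 (the well-known VGG16 value)
print(f"novice conv params: {novice:,}  ({master / novice:.0f}x fewer)")
```

Because each 3×3 layer's cost scales with in_channels × out_channels, cutting every block's width by a constant factor reduces the convolutional parameter count roughly quadratically.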

2) LOSS FUNCTIONS
As our objective is to match the outputs of the essential intermediate layers of the novice network to those of the master network, both distance and direction errors are employed. The Euclidean distance is used as the distance metric, while the cosine similarity metric is employed for the direction error. We denote the k-th essential layer of the network as EL_k, the activation values of the master network as a_m^{EL_k}, and those of the novice network as a_n^{EL_k}. The distance error focuses on reducing the Euclidean distance between the activation values and is calculated as

E_dis = (1/M_EL) Σ_{k=1}^{M_EL} (1/M_k) Σ_{l=1}^{M_k} (a_m^{EL_k}(l) − a_n^{EL_k}(l))²,   (2)

where M_k represents the number of neurons in layer EL_k, a^{EL_k}(l) is the value of the l-th activation in layer EL_k, and M_EL indicates the overall number of essential layers. However, minimizing the distance between the activation values may not ensure that all vectors with similar Euclidean distances will activate the next neurons similarly; this phenomenon occurs especially often when the activation function is ReLU. Considering this, we also employ a direction error, computed as

E_dir = Σ_{k=1}^{M_EL} (1 − vec(a_m^{EL_k}) · vec(a_n^{EL_k}) / (‖vec(a_m^{EL_k})‖ ‖vec(a_n^{EL_k})‖)),   (3)

where vec(x) is a function that converts a matrix x, regardless of its dimensions, into a one-dimensional vector. Utilizing the two previously described losses, the total loss is constructed as

E_total = E_dir + α E_dis,   (4)

where α indicates the hyperparameter that scales the distance error to be comparable to the direction error. An optimal α value is found through experiments; a more detailed elaboration is provided in Section V.
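The distance and direction errors described above can be sketched as follows, with randomly generated activations standing in for real network outputs (shapes and layer count are illustrative):

```python
import numpy as np

def e_dis(master_acts, novice_acts):
    """Squared Euclidean distance error averaged over essential layers."""
    return np.mean([np.mean((a_m - a_n) ** 2)
                    for a_m, a_n in zip(master_acts, novice_acts)])

def e_dir(master_acts, novice_acts):
    """Direction error: 1 - cosine similarity of flattened activations."""
    total = 0.0
    for a_m, a_n in zip(master_acts, novice_acts):
        v_m, v_n = a_m.ravel(), a_n.ravel()       # vec(.)
        total += 1.0 - v_m @ v_n / (np.linalg.norm(v_m) * np.linalg.norm(v_n))
    return total

def e_total(master_acts, novice_acts, alpha=0.05):
    return e_dir(master_acts, novice_acts) + alpha * e_dis(master_acts, novice_acts)

rng = np.random.default_rng(1)
# Activations of three essential layers (arbitrary shapes for illustration).
acts_m = [rng.normal(size=(8, 16, 16)), rng.normal(size=(16, 8, 8)),
          rng.normal(size=(32, 4, 4))]
acts_match = [a.copy() for a in acts_m]                 # novice mimics master
acts_off = [a + rng.normal(size=a.shape) for a in acts_m]

print(e_total(acts_m, acts_match))   # approx. 0 for identical activations
print(e_total(acts_m, acts_off))     # > 0: mismatch raises the loss
```

Combining both terms penalizes activations that are close in magnitude but point in different directions, which the distance term alone would miss.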

C. INFERENCE
Once the training process is complete and the novice network N has obtained its optimal parameters, predictions are made on new, unseen data. Test samples X_test = {x_1, . . . , x_n} that contain abnormal images are fed to both the master network M and N, after undergoing the same preprocessing as the training data. Because N possesses knowledge only about normal images, whereas M also recognizes abnormal samples, there is a discrepancy in their behavior on anomalous inputs. Using (4), this behavioral difference is computed to produce a score, which is thresholded to identify anomalous samples.
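The inference step can be sketched as follows: per-sample master/novice discrepancy scores are computed in the same form as the training loss and then thresholded. The feature vectors, the threshold value, and the injected divergence on the last two samples are all illustrative assumptions:

```python
import numpy as np

def anomaly_scores(master_feats, novice_feats, alpha=0.05):
    """Per-sample score: discrepancy between master and novice outputs
    (the same form as the training loss, evaluated sample by sample)."""
    dis = np.mean((master_feats - novice_feats) ** 2, axis=1)
    cos = np.sum(master_feats * novice_feats, axis=1) / (
        np.linalg.norm(master_feats, axis=1) *
        np.linalg.norm(novice_feats, axis=1))
    return (1.0 - cos) + alpha * dis

rng = np.random.default_rng(7)
m_feats = rng.normal(size=(6, 32))             # master outputs on X_test
n_feats = m_feats.copy()                       # novice agrees on normal data
n_feats[4:] += rng.normal(size=(2, 32)) * 5.0  # novice diverges on anomalies

scores = anomaly_scores(m_feats, n_feats)
threshold = 0.5                                # would be chosen on validation data
preds = scores > threshold                     # True = flagged as anomalous
print(preds)   # only the last two samples exceed the threshold
```

In practice the threshold is tuned on held-out data; the AUROC metric used in Section IV evaluates the scores across all possible thresholds.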

IV. EXPERIMENTS AND RESULTS
In this section, we highlight the specifics of the experiments performed and present the results by comparing the performance of our proposed system with that of the SOTA methods from each approach discussed in Section II.
A. DATASETS
There are ten classes of image data in the MNIST and CIFAR-10 datasets. Following the general experimental settings in one-class classification, both were split into normal and abnormal sets: among the ten classes, one was considered normal and used as training data, while the other nine were treated as abnormal. Both normal and abnormal images were included in the test dataset. The MNIST dataset contains digits 0 to 9, with approximately 6,000 images per class in the training data; 80% of the data was used to train the model and to verify its performance during training. A total of 10,000 images, including both normal and abnormal ones, were used for testing. The CIFAR-10 dataset contains training images of ten objects with 5,000 images per class; 10,000 images spanning both normal and abnormal classes were used for testing.
The MVTec AD dataset comprises image data for 15 products, which are categorized into object and texture classes. The object class includes ten products, while the texture class contains five. The dataset consists of 3,629 normal data and 1,725 abnormal data. Each product class contains both normal and abnormal images, enabling the model to be trained and tested for anomaly detection performance on a per-class basis. The last dataset contains real-life fabric images, provided by a private company in Daegu, South Korea. Table 1 provides overall information about selected datasets.
As described in Table 1, the datasets are divided into train and test sets for training and evaluation, respectively. The custom dataset initially comprised 1,140 normal and 30 anomaly images. After the data generation stage explained in Section III, the number of anomaly images is equalized to that of normal images.

B. BASELINE MODELS
To prove the effectiveness of our proposed system, we selected several of the most resilient and precise SOTA methods from the approaches discussed in Section II. Our aim is to compare the proposed method with these baselines and showcase its superior performance in both accuracy and efficiency. In correspondence with the original papers, we selected ARAE [37], AnoGAN [14], LSA [16], and U-Std [18] as baseline models for the MNIST and CIFAR-10 datasets, while for the MVTec AD and custom datasets we employed AnoGAN [14], AE-SSIM [26], VAE-grad [17], AE-L2 [26], and CutPaste (standard) [30]. More detailed information about these approaches was provided in Section II. Both the baseline models and the proposed method were trained and evaluated under the same conditions using the evaluation metric described in Sections IV-C and IV-D.

C. EXPERIMENTAL SETUP
We implemented the baseline and proposed methods using Python 3.8 and the PyTorch library, version 1.6.0. We applied E_total (discussed in Section III-B2) as the objective function for the proposed method, and the α hyperparameter was selected between 0 and 1 for the datasets. The experiments were performed on one 8 GB NVIDIA GeForce RTX 2060 SUPER GPU with CUDA 10.0, using a mini-batch size of four for the proposed method and all baseline methods. We trained on the MVTec AD dataset for 100 epochs and on the other datasets for 50 epochs, since the evaluated techniques reached convergence at that point and no further improvement was observed with additional training. For optimizing the parameters, we used the Adam optimizer [38] with a learning rate of 10^−3.

D. EVALUATION METRICS
To evaluate the performance of our proposed model against the baseline methods, similar to prior studies, we employ the area under the receiver operating characteristic curve (AUROC). AUROC captures both the sensitivity (true positive rate) and specificity (true negative rate) of the model across different decision thresholds. This is critical in anomaly detection, since the number of true anomalies (positive cases) is often much smaller than the number of normal cases (negative cases), and the cost of missing an anomaly can be very high.
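For reference, AUROC can be computed directly from anomaly scores via its pairwise-ranking interpretation (equivalent to the Mann-Whitney U statistic); the scores and labels below are illustrative:

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: the fraction of (anomaly, normal)
    pairs in which the anomaly receives the higher score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)   # ties count as half a win
    return wins / (len(pos) * len(neg))

# 1 = anomalous (positive), 0 = normal; higher score = more anomalous.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auroc(scores, labels))   # 0.75: three of four anomaly/normal pairs ranked correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why AUROC is robust to the class imbalance typical of anomaly detection.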

E. RESULTS OF THE CONDUCTED EXPERIMENTS
In this subsection, we present a quantitative and efficiency comparison of the considered models across several factors: the number of trainable parameters, the size of the trainable parameters, and AUROC. Table 2 displays the AUROC comparison of the considered models on the MNIST dataset.

As highlighted in Table 2, the proposed model demonstrated the highest overall average AUROC score on the MNIST dataset compared to the baseline models. Particularly, the proposed model scored 99, whereas the next-highest-scoring model scored 98.4. The remaining baseline models trail the proposed model by at least 3 points. Table 3 demonstrates the performance of the considered models on the CIFAR-10 dataset.

TABLE 3. Comparison of the considered models in terms of average AUROC in % every 10 epochs on the CIFAR-10 dataset.

As shown in Table 3, the proposed model outperformed its peers in overall average AUROC on the CIFAR-10 dataset with a score of 81.8, which is 0.6 higher than that of the second-best model, U-Std. The other baseline models are far behind, almost 20 points below the proposed model. The results of the considered models on the MVTec AD dataset are presented in Table 4. As illustrated in Table 4, the proposed method demonstrated competitive performance on the MVTec AD dataset. Specifically, it showed top performance in the Carpet, Wood, Bottle, Cable, Hazelnut, Screw, Zipper, and Tile classes, achieving scores of 76, 95, 99, 88, 99, 77, 96, and 96, respectively. In the other classes, the proposed method remained competitive. With regard to the mean value, however, the proposed method showed the second-best performance with an 88.7 score, only 0.6 below the highest-performing model, CutPaste (standard), at 89.3. Table 5 demonstrates the comparison of the models' performance on the custom dataset.

As shown in Table 5, the proposed model outperformed all considered baseline models on the custom dataset in terms of AUROC in %. In particular, the proposed model achieved an overall average score of 99.8, which is 0.2 and 1.0 higher than those of the second- and third-highest-performing models, CutPaste (standard) and VAE-grad, while the other three baseline models were outperformed by a significant margin. In addition, the proposed model reached a score of 100 after only 10 epochs, whereas the baseline models needed at least 20 epochs to do so.
Overall, taking into account the combined results from the MNIST, CIFAR-10, MVTec AD, and custom datasets, the proposed model consistently demonstrates superior performance across multiple datasets. It outperforms baseline models in terms of AUROC scores, achieving the highest or second-highest scores in all cases. In this way, it indicates that the proposed model has great robustness and effectiveness when it comes to anomaly detection tasks, which makes it a promising approach for real-world use cases. A comparison was made between the proposed method and baseline models by analyzing their network design and efficiency, as shown in Table 6.
From Table 6, the proposed method outperforms the considered models on both evaluated factors. In particular, the proposed method is more efficient in terms of trainable parameters, requiring only 0.17 million, fewer than half as many as the smallest baseline model, ARAE. The other baseline models have far more trainable parameters than the proposed model. Concerning trainable-parameter size, the proposed method is remarkably lightweight thanks to its elaborately formulated network architecture: it is less than half the size of the ARAE model and more than six times smaller than the AE-L2 and AE-SSIM models, while the remaining baseline models are much larger still.
The proposed model outperforms the baseline models in terms of efficiency, resulting in a more compact, lighter, and more efficient solution without sacrificing performance. Given this combination of efficiency and effectiveness, the proposed method has clear advantages for real-time industrial anomaly detection. Table 7 compares, on the MNIST dataset, knowledge distillation using the outputs of several essential intermediate layers together with the final layer against distillation using only the final layer's outputs.

V. DISCUSSION

A. THE EFFECT OF EMPLOYING DIFFERENT LAYERS FOR KNOWLEDGE DISTILLATION
As shown in Table 7, the model that employed various layers to distill the knowledge of the master network outperformed its counterpart that used only the last layer's output. The average AUROC score per 10 epochs is approximately 3-6 points higher during the first 40 epochs and about 2 points higher in the remaining epochs for the former model. The mean score for the network using both multiple intermediate layers and the final layer is 3.8 points higher than for the network using only the final layer (99 versus 95.2). We conclude that using different layers of the master network distills knowledge more comprehensively than using only the last layer's output. Table 8 compares the master and novice networks in terms of trainable parameters, computational complexity, and trainable-parameter size.

B. THE EFFICIENCY OF THE NOVICE
As shown in Table 8, the novice network, with only 0.170 M trainable parameters, has over 800 times fewer trainable parameters than the master network's 138.3 M. In terms of multiply-add operations, the novice network is about 50 times more efficient, requiring only 0.309 billion mult-adds versus the master network's 15.61 billion. The novice network's trainable-parameter size of 0.65 MB is likewise far smaller than the master network's 527.79 MB, which is roughly 811 times larger. In conclusion, the novice network is small and has the few trainable parameters required by real-time applications in the anomaly detection domain, while still demonstrating competitive accuracy.

C. LOSS FUNCTION HYPERPARAMETER
The results for different values of the hyperparameter α in the loss function E_total on the datasets are presented in Table 9. For the MVTec AD dataset, the average AUROC score in % after 100 epochs is provided, while for the other datasets the number of epochs is 50.
As shown in Table 9, α=0.05 is best suited for the MNIST and custom datasets, yielding high scores of 99 and 99.8, respectively. The CIFAR-10 dataset achieved its best score of 81.8 with α=0.01, while the MVTec AD dataset obtained its best score of 88.7 with α=0.5.

D. ANOMALOUS DATA GENERATION MODELS COMPARISON
We compared the selected anomalous-image generative model to widely used SOTA generative models to validate its effectiveness. Table 10 exhibits the results of the quantitative comparison of two widely used SOTA models, namely CycleGAN [39] and StarGAN [40], with the CUT model used in this method.
As shown in Table 10, CUT demonstrated strong performance when assessed using four evaluation metrics: pixel accuracy (PA), MSE, structural similarity measure (SSIM), and Frechet inception distance (FID). Specifically, the selected method showed top performance on MSE, SSIM, and FID, achieving scores of 2,312, 0.91, and 24.11, respectively. In terms of PA, the selected method ranked second, scoring only 0.06 lower than the best-performing StarGAN, which achieved 0.74.

VI. CONCLUSION
A comprehensive end-to-end anomaly detection method was introduced. The method begins by generating real-life-like anomaly image samples using a highly effective generative model. The proposed method then employs a knowledge distillation technique that comprehensively transfers the master's knowledge to the novice through both essential intermediate layers and the final layer. As a result, despite having only a small number of parameters owing to an elaborately formulated network architecture, the proposed method outperformed its powerful counterparts on three datasets. This study's theoretical contribution lies in the introduction of an effective KD strategy that transfers the essential knowledge of the master network to the novice network through multiple important layers. This method is straightforward to train and demonstrates high accuracy. The practical significance of this study is the provision of an effective approach for solving data-imbalance problems and detecting anomalies in industry. The proposed method can therefore be applied to any anomaly detection task with a limited number of abnormal samples. Furthermore, the proposed model is well suited for real-time anomaly detection, as it is very efficient compared to existing methods. The researchers anticipate that this algorithm has the potential to scale well for future applications. To improve the proposed model, two aspects can be extended in the near future: 1) employing a more powerful master network; 2) including an effective anomaly localization algorithm after anomaly detection.