Generative and Contrastive Combined Support Sample Synthesis Model for Few-/Zero-Shot Surface Defect Recognition

Surface defect detection is one of the most important vision-based measurements (VBMs) for intelligent manufacturing. Existing detection methods mainly require massive numbers of defect samples for training. However, as highly automated and stable production lines are increasingly deployed, fewer and fewer defective products are produced, so inadequate defect samples and labels are inevitable in industrial data environments. Consequently, once an unseen defect emerges, manual intervention is required to analyze the abnormal sample, which significantly decreases productivity. To this end, this article proposes a novel few-/zero-shot compatible surface defect detection method that requires few or even no defect samples. First, a novel contrastive generator is proposed that uses defects’ text descriptions to synthesize “fake” visual features for rare defects. Then, the synthesized visual features (for support samples) are fused with “real” visual features (for query samples) into a similarity graph to align the relationships between support samples and query samples. Afterward, a class center optimization (CCO) method is proposed to iteratively update the similarity matrix of the graph to obtain the classification probabilities for the query samples. In this way, the proposed method addresses both the lack of defect samples and the inability of few-shot learning-based methods to recognize unseen classes. Extensive experiments on eight fine-grained datasets show that our method gains an average improvement of +8.29% on few-shot recognition tasks and +8.23% on zero-shot recognition tasks compared with the state-of-the-art (SOTA) method. Moreover, the proposed method is deployed in a real-world prototype system to demonstrate its feasibility.
The core code of the proposed method is available at: https://github.com/NDYBSNDY/AsC.

Yuran Dong, Cheng Xie, Member, IEEE, Luyao Xu, Hongming Cai, Senior Member, IEEE, Weiming Shen, Fellow, IEEE, and Haoyuan Tang

Index Terms: Contrastive learning, few-shot learning, generative learning, graph embedding (GE), surface recognition, vision-based measurement (VBM), zero-shot learning.

I. INTRODUCTION
AUTOMATED industrial production can reduce labor costs and increase productivity. Recent research has used Bayesian techniques for manufacturing methods [1] to significantly reduce production process cost by producing input parameters for the desired outcome. Similarly, collective robotic systems for constructing multistory buildings [2] reach state-of-the-art (SOTA) construction speeds. With the rapid development of vision computing in recent years, vision-based measurement (VBM) has become one of the most critical and influential methods for automated industrial production [3], [4]. Surface defect recognition, as an essential part of automated industrial production, plays a critical role in improving production efficiency and reducing labor costs. Compared with manual defect recognition, VBM-based machine inspection methods are more objective and efficient. However, due to the environmental constraints on defect sample collection, the types of defects occurring in the production process are uncertain and random. This requires VBM-based defect recognition models to be able to fit a few samples and to adapt flexibly to complex production environments.
However, the remaining surface defects, i.e., the rarely seen or unseen defects, are still hard to detect since there are not enough such defect samples for training. To solve the problem, existing few-shot models [11], [12], [13], [14], [15] require only very few support samples to prompt the model to detect the query samples. The core idea of these methods is to pretrain a model in advance on other related training samples (samples that are relatively common and easy to collect). Then, the pretrained model extracts the features from the support samples (rarely seen and hard-to-collect samples) and tries to update itself to learn the defects. Afterward, the updated model extracts the features from the query samples and tries to infer their defects. These methods require at least one defect sample to conduct the query inference. The more defect samples there are, the higher the accuracy of these detection models. However, in industrial detection practice, some defect samples are hard to collect in advance or have never appeared because the well-optimized smart manufacturing environment has further reduced the defect rate. Consequently, once an unseen defect emerges, manual intervention is required to analyze the abnormal sample, which significantly decreases productivity. Therefore, enabling defect inference without using any support samples becomes one of the most critical challenges in the surface defect detection field.
To solve the nonsample problem, the zero-shot mechanism is a natural choice. The basic idea of zero-shot learning for visual computing is to train a cross-modal network to synthesize visual features from the corresponding semantic features [16], [17], [18], [19], [20]. Based on this cross-modal network, the model can synthesize unseen visual features without actually learning from the sample, i.e., only from a text description of the object. Then, the synthesized visual features are matched with the visual features of the query sample to infer the category of the query sample. However, existing zero-shot learning methods are designed for general image classification tasks and do not perform well in surface defect detection. This is because these methods directly match the synthesized visual features with query samples without considering the support information in the industrial data environment, which can hardly guarantee detection accuracy.
To this end, this article proposes a few- and zero-shot compatible model that considers both synthesized samples and support samples. The proposed method can detect surface defects in both few-sample and nonsample data environments. First, a novel contrastive generator model is proposed to synthesize visual features according to the semantic features. Then, the synthesized visual features are filtered and used as support samples to augment the real support samples. Afterward, a graph-based center feature update method is proposed to iteratively match the visual query features to the synthesized support visual features. The experimental results on extensive real-world surface defect datasets show that the proposed method significantly outperforms SOTA methods in both few-shot and zero-shot surface defect detection. Moreover, the proposed method is deployed in a prototype manufacturing scenario, an automated hot-rolled steel surface detection line, to demonstrate its feasibility and applicability. In summary, this work makes the following contributions.
1) Compared with deep-learning-based methods, the proposed method can be decoupled into two phases, sample generation and class inference, and only the class inference phase needs to be deployed in the application, which can significantly reduce model complexity. Meanwhile, the graph-based class inference method builds different feature space distributions and graphs for different query samples, which makes it more adaptable to complex and changing industrial environments.
2) Compared with few-shot learning-based methods, we integrate zero-shot learning, so the types of defects that can be recognized are no longer limited to known classes with support samples, and support samples for unknown classes are obtained through the proposed contrastive generator instead of being collected.
3) Compared with zero-shot learning-based methods, our approach uses inference rather than a fixed model for sample prediction, which allows different samples to have different spatial distributions. Seen and unseen class predictions do not affect each other, and the proposed method focuses more on unlabeled query samples than on labeled seen samples or generated unseen samples. Since there is no need to trade off the focus between seen and unseen classes, we achieve optimal performance on both simultaneously instead of a compromise.
4) Compared with SOTA, the proposed method gains an average improvement of +8.29% on few-shot defect recognition tasks and +8.23% on zero-shot defect recognition tasks. The proposed method is deployed in a real-world prototype system to evaluate its feasibility and practical implementation.

A. Different Methods of Defect Recognition
In the latest research, different methods (based on deep learning, few-shot learning, and zero-shot learning) are used for surface defect recognition; their advantages and disadvantages are summarized in Table I.
The core idea of deep-learning-based methods is to train a fixed classifier to recognize defects through many samples. However, the lack of defect samples leads to the inability to train an accurate convolutional neural network (CNN). Some recent studies [5], [6] have utilized the relevant parameter information of defects to compensate for the misrecognition of some defect types caused by the lack of samples. However, extra information often leads to labeling noise. Yu et al. [7] dealt with labeling uncertainty through knowledge transfer and collaborative learning. Since defect datasets often suffer from data imbalance, deep stochastic chain [8] and gradient-based [9] methods can deal with the difference between defect samples of the same class. The latest research has theoretically solved some existing problems in defect recognition, but in real industrial environments, existing deep-learning-based methods inevitably have some disadvantages.
For few-shot learning-based methods, the pretrained model extracts the features from the support samples (rarely seen and hard-to-collect samples) and tries to update itself to learn the defects. Afterward, the updated model extracts the features from the query samples and tries to infer their defects. In recent studies, in order to better learn the local features of defects and reduce background interference, Zhou et al. [10] designed a feature extractor with a class-agnostic mask to extract the defect features, and Zhenyu et al. [11] developed a multiresolution-based cropping enhancement method to enhance the unlabeled defect images. By borrowing the idea of multiscale feature extraction, a novel backbone network, ResMSNet, was proposed [12], which realizes cross-domain few-shot learning with the training set and the target defect dataset coming from different domains. Since few-shot learning-based methods often perform poorly with very few support samples (e.g., shot = 1), some researchers have also attempted to solve this problem through additional information fusion. Zhao et al. [13] fused semantic information based on feature relationships to effectively obtain high-dimensional feature information from a few images. Song et al. [14] generated distinguishable class features by learning affine parameters from the original features, making the model more portable. Effective inference methods often play a crucial role in model performance, and Xiao et al. [15] optimized the inference process through graph embedding (GE) and optimal transport to improve model flexibility. Few-shot methods undeniably have advantages when only a few defect samples are available, but the following limitations make them difficult to apply.
1) These methods require at least one defective sample (shot ≥ 1) for inference. As a result, once an unseen defect appears unexpectedly (shot = 0), the few-shot learning-based recognition method breaks down outright, and manual intervention is required to analyze the abnormal sample.
2) Detectable defect types are limited to known dataset classes. To use the method in production environments with a large number of classes, it is therefore necessary to build at least one support sample for each possible defect type. However, due to the limitations of production environments, collecting comprehensive support samples of all types is almost impossible.
3) Some methods require training different models for different numbers of support samples (different shots), e.g., FaNet [13], which complicates model deployment.
In order to detect novel defect types (classes with no support set) that arise unexpectedly in real production environments, a few studies have attempted to apply zero-shot learning to defect recognition [16], [17], [18]. The basic idea of zero-shot learning for visual computing is to train a cross-modal network to synthesize visual features from the corresponding semantic features [19], [20], [21], [22]. Based on this cross-modal network, the model can synthesize unseen visual features without actually learning from the sample, i.e., only from a text description of the object. Then, the synthesized visual features are matched with the visual features of the query sample to infer the category of the query sample. However, zero-shot learning has received little attention in the field of surface defect recognition, mainly for the following reasons.
1) Zero-shot learning-based methods often train fixed models with a mixture of seen classes and generated samples of unseen classes. Since the seen and unseen classes are not differentiated, their accuracies affect each other, and the model must trade off the attention paid to each to obtain a compromise performance.
2) Existing zero-shot learning models in the field of defect recognition usually have many hyperparameters that need to be selected and optimized, and manual parameter tuning is time-consuming and laborious.
3) Zero-shot learning methods in vision are only applicable to benchmark datasets (e.g., CUB, SUN, and AwA) with more than 15 000 samples, whereas defect datasets have no more than 1000 samples.
4) Zero-shot learning methods in the visual domain try to associate local features with attributes [21], e.g., a bird includes a head, a beak, wings, and feet. This is entirely inapplicable to surface defects, where it is difficult to disentangle local features.

B. Development of Few-/Zero-Shot Learning in Different Fields
One-shot learning was first proposed by Fei-Fei et al. [23]. Since the method can quickly learn new knowledge from a few training samples and generalize, it has developed rapidly in fields where training data are rare.
In natural language processing, relational classification tasks provide a basis for constructing structured knowledge (e.g., knowledge graphs) by judging the predefined relationship between two target entities in an utterance. However, development has been slow due to the lack of training data. Xu et al. [24] introduced few-shot learning into the relational classification task for the first time and constructed the FewRel dataset. Many researchers built on this basis [25], [26], [27], [28], and the introduction of few-shot learning has continuously improved the performance of the relation classification task [29], [30].
In medical image processing, due to the difficulty of acquiring biopsy labels, Qinghua et al. [31] first attempted to introduce few-shot learning into the ultrasound breast tumor diagnosis system and achieved excellent performance. In recent years, the few-shot method has been widely used in the medical field, including the recognition of COVID-19 from rare chest images [32], human cell categorization in rare datasets [33], autism facial feature categorization [34], skin image categorization [35], and healthcare safety monitoring [36].
Palatucci et al. [37] proposed the concept of zero-shot learning, which can detect rare or unseen objects in an image. The zero-shot method has since been introduced in some industrial application scenarios.
In remote sensing scene classification, satellite images are prone to contain new classes of objects beyond the expected scene, which leads to the collapse of deep-learning-based methods. Li et al. [38] introduced zero-shot learning into remote sensing scene classification and proposed a new method for recognizing images from unseen classes. Further studies combined knowledge graphs with zero-shot learning and achieved better performance [39]. The latest methods have also continued to apply zero-shot learning to remote sensing scene classification [40], [41], remote sensing image defogging [42], and remote sensing image super-resolution [43].
In intelligent manufacturing scenarios, due to the diversity and randomness of industrial faults, some real fault samples are difficult to obtain or never occur, so zero-shot learning methods have been widely used in the field of industrial fault diagnosis in recent years [44], [45], [46], [47].

III. METHODS
The general framework of the few-/zero-shot visual inspection method is shown in Fig. 1, which consists of two parts: the contrastive generator (see Section III-B) and the graph-based few-/zero-shot inference (see Section III-C).

A. Problem Formulation
Let $X$, $Y$, and $D = \{X, Y\}$ denote the raw visual feature space, the corresponding image labels, and the dataset, respectively. Assume that $D$ is divided into a seen-class set $D^s$ and an unseen-class set $D^u$. At the same time, class-level text features $A = A^s \cup A^u$ are provided, where $A^s$ corresponds to the seen classes in $D^s$ and $A^u$ corresponds to the unseen classes in $D^u$. For the $N$-way $K$-shot task, $N$ unseen classes are selected from $D^u$ as the test set, in which $K$ labeled samples are reserved for each selected class as the support set $D^t$, and the unlabeled samples in the test set form the query set $D^q$. $K$ is usually small or even zero (e.g., $K = 0$, $K = 1$, or $K = 5$). Unlike the common task, the final support set of the proposed task is $D^t \cup D^a$, where $D^a$ is a text prompt extracted from $A^u$ corresponding to the $N$ classes.
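The episode construction described above can be illustrated with a minimal sketch. It assumes the unseen-class data are available as a dictionary from class name to sample identifiers; the function name `sample_episode` and the `n_query` parameter are hypothetical and are not taken from the released code.

```python
import random


def sample_episode(samples_by_class, n_way, k_shot, n_query, seed=0):
    """Draw one N-way K-shot episode from unseen-class data.

    samples_by_class: dict mapping class name -> list of sample ids.
    Returns (support, query), both dicts keyed by the selected classes.
    With k_shot == 0 (zero-shot) the support lists are empty, and only
    class-level text prompts would stand in for the support set.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        ids = list(samples_by_class[c])
        rng.shuffle(ids)
        support[c] = ids[:k_shot]                  # K labeled samples
        query[c] = ids[k_shot:k_shot + n_query]    # unlabeled at test time
    return support, query
```

For example, `sample_episode(data, n_way=5, k_shot=1, n_query=15)` would yield a five-way one-shot episode in which support and query samples of each class are disjoint.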

B. Contrastive Generator
1) Visual Feature Synthesizing: Let $a^s \in A^s$ be a text feature of a seen class and $x^s \in X^s$ be the visual feature of the corresponding class. The input to the conditional generation network $G$ is obtained by concatenating the text feature $a^s$ with Gaussian noise $\epsilon \sim \mathcal{N}(0, 1)$. $G$ outputs the synthetic visual sample $\tilde{x}^s = G(a^s, \epsilon)$. Meanwhile, the discriminator network $D$ is used to discriminate a real pair $(x^s, a^s)$ from a synthetic pair $(\tilde{x}^s, a^s)$. The feature generator network $G$ and the discriminator network $D$ are learned by optimizing the adversarial objective in (1).
Here, $L_G$ is the loss function of the generator $G$; it consists of a discriminator error term and a class classification loss $L_{cls}$. $L_D$ is the loss function of the discriminator $D$; it consists of a discriminating error on synthesized visual features, a discriminating error on real visual features, and a class classification loss $L_{cls}$.
2) Contrastive Loss for Real Features: Let the embedding of a visual sample $x^s$ be denoted as $f^s = E(x^s)$, where $E$ is an embedding function that maps the raw visual sample $x^s$ into the embedding space. To learn the embedding function $E$, for each data point $f^s$ embedded from real or synthetic features, one sample $f^{s+}$ of the same class as $f^s$ is randomly taken as a positive sample, with $f^{s+} \neq f^s$. Then, $N$ samples are randomly taken as negative samples $f^{s-}_j$ from the set of all samples whose class differs from that of $f^{s+}$. The positive sample $f^{s+}$ is mixed with the $N$ negative samples $f^{s-}_j$ into an unlabeled sample set $f^s_j$, and the correlation scores between the real embedding and the other real embedding samples are obtained by calculating the dot-product similarity between $f^s$ and $f^s_j$. Finally, the known labeled sample $f^s$ is used to distinguish the only positive sample in $f^s_j$. For example, as shown in Fig. 2, if the class of the embedded real sample $f^s$ is Am, a randomly selected positive sample $f^{s+}$ also has class Am, but $f^s$ and $f^{s+}$ are different pictures. Meanwhile, samples of classes different from Am (i.e., convexity, In, blister, and bump) can be selected as negative samples $f^{s-}_j$.
It is worth noting that since the known labeled sample $f^s$ needs to distinguish the only positive sample among the set of $N + 1$ positive and negative samples, the size of $N$ (the number of negative samples) determines the classification difficulty. If $N$ is small, it is not easy to learn discriminative class features; if $N$ is too large, training time and overhead become high. At the same time, overly discriminative class features may cause the real embedded feature distribution to deviate significantly from the synthetic feature distribution, which degrades performance. Thus, by weighing model accuracy against training overhead, the number of negative samples ($N$) is set to 25% of the total number of samples whose classes differ from that of $f^{s+}$.
In summary, a contrastive loss function called InfoNCE¹ is used to compute the expected contrastive embedding loss $L_{CR}$ for the real embedding samples $f^s$ and $f^s_j$, as given in (2).
Here, $N$ denotes the number of negative samples $f^{s-}_j$ ($f^{s-}_j$ and $f^s$ belong to different seen classes); $f^s \neq f^{s+}$, but they belong to the same seen class. $\tau > 0$ is the temperature hyperparameter, which is used to control the convergence rate of the model.
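As a rough illustration of the contrastive objective described above, the sketch below computes the standard InfoNCE loss for a single anchor embedding using dot-product similarity and temperature $\tau$. It is the generic form of InfoNCE, assumed here for illustration; the exact expression the authors use in (2) should be taken from the paper. The function names and the default `tau` value are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE loss for one anchor (assumed form, not the paper's (2)).

    loss = -log( exp(f.f+ / tau) / (exp(f.f+ / tau) + sum_j exp(f.f-_j / tau)) )
    """
    logits = [dot(anchor, positive) / tau]
    logits += [dot(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

The loss is small when the anchor is much more similar to its positive than to any negative, and it grows as negatives become harder to tell apart, which matches the role of $N$ discussed above.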
3) Contrastive Loss for Synthesized Features: Analogously, a contrastive loss is applied to the synthesized samples $\tilde{x}^s$ to make them fit the real embedding space and to increase the distribution distance between different classes of generated samples. The positive and negative samples of the synthetic features are shown in Fig. 2; they are selected in the same way as for the real features, and the number of negative samples is also taken as 25% of the total number of samples that are not of the same class as $\tilde{f}^s$. The positive samples are also taken randomly from among the samples of the same class as $\tilde{f}^s$. Referring to (2), let $\tilde{f}^s = E(\tilde{x}^s)$; the contrastive loss $L_{CS}$ of the synthesized features is defined in (3).
During contrastive generator training, only seen visual features $X^s$, seen semantic features $A^s$, and seen labels $Y^s$ are used. During few-/zero-shot prediction, the generator $G(A^u, \epsilon)$ is used to generate the synthesized visual features $\tilde{X}^u$, which are then mapped to the embedding space by the embedding function $E$: $\tilde{F}^u = E(G(A^u, \epsilon))$, which includes only the features of the unseen classes. Raw features of unseen classes are also mapped to the embedding space: $F^u = E(X^u)$.
¹ https://arxiv.org/abs/1807.03748

C. Graph-Based Few-/Zero-Shot Inference
1) For Zero-Shot Inference: In the industry-specific zero-shot visual inspection process, the similarity matrix $S$ is first obtained by calculating the feature similarities among the support features $\tilde{F}^t$ synthesized by the contrastive generator (see Section III-B) and the query features $F^q$. Then, the similarity graph is constructed from the adjacency similarity matrix $S$. The class center $T^{(0)}_i$ is obtained by initialization from the support nodes $T$. Finally, the final classification probability matrix $M_{i,j}$ is obtained by alternately updating the class center $T^{(k+1)}_i$ and the classification probability matrix $M^{(k+1)}_{i,j}$. The predicted label $\hat{Y}_i$ is obtained by selecting the maximum probability in $M_{i,j}$.
In the above process, all support embedding samples are synthesized by the proposed generator $G$ and the embedding function $E$ (see Section III-B), all query samples are processed by the embedding function $E$, and all query and support samples belong to unseen classes. Fig. 1 provides an overview of the zero-shot inference process.
Let $S$ be the adjacency similarity matrix, where $S_{i,j}$ stores the similarity value of feature $i$ and feature $j$. Equation (4) provides the definition of $S_{i,j}$. Here, $\tilde{F}^t$ is the synthesized visual feature embedding space of the support set, while $F^q$ is the real visual feature embedding space of the query set, with $\tilde{F}^t \in \tilde{F}^u$ and $F^q \in F^u$. $\tilde{f}^t$ denotes the synthesized visual embedding features, $f^q$ denotes the real visual embedding features, and $w$ denotes a parameter matrix. In the experiment, for each node in $S$, only the Top-k most similar neighbors are retained; the similarities of the remaining neighbors are set to 0.
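The Top-k sparsification step can be sketched as follows. Because the definition in (4) involves a parameter matrix $w$ that is not reproduced here, plain cosine similarity is swapped in as a stand-in, and self-similarity is set to 0; both choices are assumptions made for illustration, as is the function name `topk_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def topk_similarity(features, k):
    """Build a dense similarity matrix, then keep the k largest entries per row.

    features: list of embedding vectors (support and query nodes together).
    Entries outside each row's top-k are set to 0, so the result is a sparse
    (and not necessarily symmetric) adjacency matrix.
    """
    n = len(features)
    S = [[cosine(features[i], features[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        order = sorted(range(n), key=lambda j: S[i][j], reverse=True)
        keep = set(order[:k])
        S[i] = [S[i][j] if j in keep else 0.0 for j in range(n)]
    return S
```

Keeping only the strongest neighbors limits the influence of weakly related nodes when information is propagated over the graph.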
Before inference on query set categories, it is crucial to construct a relational network containing support set labeling information and unlabeled query set information. The proposed method constructs interrelationships between query samples and support samples through GE to fully utilize the known label information. The graph-based inference process usually needs to initialize a center for each class and continuously optimize the class centers to achieve class differentiation during inference. Different methods of class center selection [48], [49], [50] often affect the quality of inference results and the iteration efficiency. In order to obtain more reasonable class centers, the self-attention (SA) mechanism is introduced. By further correlating feature information between samples, the proposed method obtains class center points with rich defect feature information, which are also more globally representative of the support sample distribution.
Finally, the SAGE module, which comprises (5) and (6), is constructed to further improve the performance of the class center optimization (CCO) module through sample information integration and class center initialization.
Given a diagonal matrix $D$ with $D_{i,i} = \sum_j S_{i,j}$, the adjacency matrix $S$, a normalization function $\mathrm{Norm}(\cdot)$, an SA function $\mathrm{Self}(\cdot)$, and a one-layer learnable weight matrix $W$, the GE is defined in (5) and (6).
Here, $T$ is the GE for all support samples and $Q$ is the GE for all query samples. $E$ is the node self-connection matrix, and $\xi$ is the weight parameter that balances the importance of the neighboring node and self-node information. $\theta$ is the embedding ratio parameter.
Based on the support sample feature matrix $T$, the center feature matrix of the support classes can be calculated by (7), where $K$ is the number of support samples for a class, $T_k$ denotes the $k$th support sample feature, and $T_i$ represents the center feature of the $i$th class. Here, the superscript $(0)$ denotes the initial center feature, and $T^{(k+1)}_i$ denotes the center feature of the $i$th class after $(k+1)$ iterations. $\alpha$ is an updating rate parameter: the larger $\alpha$ is, the faster the update, and vice versa. In our experiments, $\alpha$ is set to 0.2. $M_{i,j}$ represents the classification probability of the $j$th query sample $Q_j$ belonging to the $i$th class. It is calculated by measuring the distance between the class center feature $T^{(k)}_i$ and the query feature $Q_j$, as defined in (8).
Here, $\lambda$ is a regularization parameter that will be discussed in the experiments. The settings of the Sinkhorn function follow [53].
Finally, based on the iterated probability matrix $M$, zero-shot inference is conducted by selecting the maximum value of $M_j$ for the given query sample $j$; the predicted label is $\hat{Y}_i$, as shown in (9).
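The iterative procedure described above (initialize class centers from support features, compute Sinkhorn-normalized soft assignments from center-to-query distances, move the centers with rate $\alpha$, and finally take the most probable class) can be sketched as below. This is a generic soft-assignment center update under stated assumptions, not the authors' exact (7)-(9): squared Euclidean distance, mean initialization, equal class mass in the Sinkhorn step, and the values of `lam`, `steps`, and `n_iter` are all illustrative choices; only $\alpha = 0.2$ comes from the text.

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sinkhorn(cost, lam, n_iter=50):
    """Soft assignments from a (classes x queries) cost matrix.

    Assumes classes receive equal total mass; each query column sums to 1.
    Very large lam * cost values could underflow to 0 and break the
    normalization, so features are assumed to be reasonably scaled.
    """
    n_cls, n_q = len(cost), len(cost[0])
    M = [[math.exp(-lam * c) for c in row] for row in cost]
    for _ in range(n_iter):
        for i in range(n_cls):              # balance mass across classes
            s = sum(M[i])
            M[i] = [m * (n_q / n_cls) / s for m in M[i]]
        for j in range(n_q):                # each query distributes unit mass
            s = sum(M[i][j] for i in range(n_cls))
            for i in range(n_cls):
                M[i][j] /= s
    return M


def class_center_optimization(support, queries, alpha=0.2, lam=10.0, steps=20):
    """support: per-class lists of feature vectors; queries: list of vectors."""
    dim = len(queries[0])
    # Initial centers: mean of each class's (real or synthesized) support features.
    centers = [[sum(x[d] for x in feats) / len(feats) for d in range(dim)]
               for feats in support]
    for _ in range(steps):
        cost = [[sq_dist(c, q) for q in queries] for c in centers]
        M = sinkhorn(cost, lam)
        for i, c in enumerate(centers):
            mass = sum(M[i])
            target = [sum(M[i][j] * queries[j][d] for j in range(len(queries))) / mass
                      for d in range(dim)]
            # Move the center toward the probability-weighted query mean.
            centers[i] = [(1 - alpha) * c[d] + alpha * target[d] for d in range(dim)]
    M = sinkhorn([[sq_dist(c, q) for q in queries] for c in centers], lam)
    preds = [max(range(len(centers)), key=lambda i: M[i][j])
             for j in range(len(queries))]
    return preds, M
```

A smaller `alpha` makes the centers drift more slowly toward the query distribution, which is consistent with the description of $\alpha$ as an updating rate.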
2) For Few-Shot Inference: The key challenge of few-shot inference is that the number of support samples is too small (normally only one to five samples) compared with the training and query samples, causing a serious distribution skewness problem. To handle the problem, based on the proposed zero-shot inference, we augment the support set $F^t$ by adding extra synthesized samples $\tilde{F}^t$ from the feature generator (see Section III-B). However, it was observed in the experiment that simply adding $\tilde{F}^t$ into $F^t$ does not noticeably improve the classification accuracy. The reason is that some synthesized samples $\tilde{f}^t$ might deviate from the real visual feature distribution, and these deviated samples disturb the model inference.
To guarantee the quality of the synthesized samples, we do not directly add $\tilde{F}^t$ into $F^t$. Instead, $\tilde{F}^t$ is filtered in advance by a classifier based on $F^t$, and only the correctly classified samples are added into $F^t$. The inference process is similar to the zero-shot inference [see (5)-(9)]. The only difference is the similarity graph construction, as defined in (10). Let $\tilde{F}^{t\prime}$ be the filtered synthesized sample set. The augmented support set is then the union of $F^t$ and $\tilde{F}^{t\prime}$. Based on the augmented support set and the query set $F^q$, the similarity graph $\hat{S}$ for few-shot inference is reconstructed [refer to (4)].
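The filtering step can be sketched as follows. The text only states that a classifier based on the real support features is used; a nearest-centroid classifier is assumed here purely for illustration, and the function names are hypothetical.

```python
def nearest_class(x, centroids):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def augment_support(real_support, synthesized):
    """Add only correctly classified synthesized features to the support set.

    real_support, synthesized: dict mapping class -> list of feature vectors.
    The classifier (nearest centroid, an assumption) is built from the real
    support features only.
    """
    centroids = {}
    for c, feats in real_support.items():
        dim = len(feats[0])
        centroids[c] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    augmented = {c: list(feats) for c, feats in real_support.items()}
    for c, feats in synthesized.items():
        for f in feats:
            if nearest_class(f, centroids) == c:  # discard deviated samples
                augmented[c].append(f)
    return augmented
```

Synthesized features that fall closer to another class's real support are dropped, so they cannot pull the class centers away from the real feature distribution.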

IV. EXPERIMENTS
A. Preliminaries 1) Datasets: To verify the effectiveness of the proposed method for surface defect recognition, we validated it on eight different datasets, which mainly include three surface defect datasets (MSD-Cls [15], FSC-20 [13], and MT-CF) and five fine-grained datasets DTD, 2 EuroSAT,3 RESISC45, 4MED-3 (consists of a blood cell image database, 5 multisource dermoscopic images of pigmented lesions HAM10000, 6 and optical coherence tomography (OCT) images 7 ), and GTSRB. 8SD-Cls [15] is a metal surface defect dataset that contains aluminum and steel with different defect types.In MSD-Cls, only a few training data are about steel defects.However, the test data are all about aluminum defects that cause a serious cross-domain problem, making it hard to detect accurately.FSC-20 MT-CF dataset consists of the oil pollution defect database, 9 the annotated road crack image database Crack-Forest, 10 and the magnetic tile surface defect database. 11oting that MSD-Cls, MT-CF, and MED-3 are crossdomain datasets consisting of more than three different datasets from the same industrial domain, the significant data differences are extremely challenging.RESISC45, GTSRB, and DTD datasets are fine-grained multicategory datasets with insignificant class characteristics compared to conventional few-shot visual inspection datasets.Extra experiments, including ablation study, hyperparameter study, and base generator discussions, are conducted on the MSD-Cls dataset.
2) Dataset Splits: To simulate the few-sample data environment, all the above datasets are reduced by randomly selecting 10-50 samples for each class. Then, with reference to the PS-split, each dataset is divided into the training and validation set D s (seen classes) and the test set D u (unseen classes).
3) Experimental Setups: In the few-shot inference comparison experiments, we follow the different backbone network settings of the SOTA methods (i.e., ResNet-12 [62], ResNet-18 [62], and WRN [63]). In the zero-shot inference comparison experiments, CLIP [64] was used to extract visual features X and corresponding text features A of the seen classes for all methods (using only class names as text cues).
4) Evaluation Metrics: For few-shot tasks, accuracy (acc) and way-shot metrics are applied. Here, the way denotes the number of classes in D u, while the shot means the number of support samples for each class. For example, five-way one-shot means five to-be-classified classes with one support sample for each class during testing.
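To make the way-shot terminology concrete, the following minimal sketch draws one five-way one-shot episode. The function name `sample_episode`, the 15 query samples per class, and the placeholder image identifiers are assumptions for illustration only and are not taken from the experimental setup.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode: K support and n_query query samples per class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(data_by_class[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query


rng = random.Random(0)
# Placeholder dataset: 8 classes with 20 sample identifiers each.
data = {f"class_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(8)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=15, rng=rng)
```

Here the support set holds one labeled sample for each of the five classes, and the query set holds the samples to be classified.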
For the generalized zero-shot learning (GZSL) task, following the standard metrics, Top-1 classification accuracy on seen classes (S) and unseen classes (U) is evaluated. The harmonic mean (H) of S and U is used to represent the final performance of zero-shot visual inspection, where H = 2 × S × U/(S + U).
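As a small worked example of the H metric, the sketch below evaluates the formula for two illustrative (S, U) pairs; the numbers are made up and are not results from the paper. Both pairs have the same arithmetic mean, but H penalizes the imbalance between seen and unseen accuracy.

```python
def harmonic_mean(s, u):
    """H = 2 * S * U / (S + U); returns 0 when both accuracies are 0."""
    return 0.0 if s + u == 0 else 2.0 * s * u / (s + u)


h_balanced = harmonic_mean(60.0, 60.0)  # S and U equal
h_skewed = harmonic_mean(90.0, 30.0)    # same arithmetic mean, but unbalanced
```

With these inputs, `h_balanced` is 60.0 and `h_skewed` is 45.0, which is why H is a stricter summary than the plain average of S and U.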
To report stable results, 10 000 random draws are conducted for each evaluation, and the average accuracy is reported with a 95% confidence interval.
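One common way to turn per-draw accuracies into an average with a 95% confidence interval is sketched below. The text does not state how the interval is computed, so the normal approximation (1.96 standard errors) is an assumption, and the per-episode accuracies are simulated placeholders rather than measured values.

```python
import math
import random
import statistics


def mean_with_ci95(accuracies):
    """Return (mean, half-width) using the normal approximation 1.96 * s / sqrt(n)."""
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, half_width


rng = random.Random(0)
# Placeholder per-episode accuracies; real values would come from evaluating each draw.
episode_acc = [rng.gauss(0.75, 0.10) for _ in range(10_000)]
mean, half_width = mean_with_ci95(episode_acc)
```

With 10 000 draws the half-width shrinks with the square root of the number of draws, which is why such a large number of episodes gives stable averages.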

B. Evaluations
1) Few-Shot Inference Comparison: Table II provides the way-shot results of few-shot visual inspection.
For the zero-shot comparison, the competing few-shot methods are adjusted to zero support samples if their source codes are available. Otherwise, "−" in Table II denotes that zero-shot inference cannot be reproduced for the corresponding method. On the dataset with mixed seen and unseen classes, our method achieves an average improvement of +25.4% to +37.25% compared with PTNET and GTnet. Notably, this is the first attempt to apply few-/zero-shot compatible models in industry-specific visual inspection domains, and it achieves a significant improvement. For one-shot tasks, our method obtains +5.87%, +4.72%, +4.12%, and +7.93% improvements on the MSD-Cls, MT-CF, EuroSAT, and MED-3 datasets, respectively, compared to the second-best method, while there is a 4.01 decrease on the GTSRB dataset. Notably, our method obtains significant improvements of +12.79% and +11.12% over the second-best method on the RESISC45 and DTD datasets, respectively. For five-shot tasks, the highlight of the comparison results is that our method obtains significant improvements of +10.5%, +10.98%, +13.65%, and +9.94% on the MSD-Cls, RESISC45, MED-3, and DTD datasets, respectively, compared to the second-best method (the average improvement was >10%). On the other datasets (MT-CF and EuroSAT), our method obtains +4.18% and +3.95% improvements, while there is a 3.34 decrease on the GTSRB dataset.
Table II also shows that the proposed method achieves significant improvements on MSD-Cls, MED-3, RESISC45, and DTD. In detail, improvements of +5.9% to +13.4% are achieved on the MSD-Cls and MED-3 datasets (whose training and testing classes do not intersect) in few-shot inference. Improvements of +10.1% to +13.2% are obtained on the RESISC45 and DTD datasets (which have relatively larger numbers of classes for the few-shot task) in few-shot inference. This indicates that the proposed method can capture more critical class differences in nontrivial datasets.
On the large-scale few-shot classification dataset FSC-20, the proposed method improves by +2.73% and +1.09% on one- and five-shot tasks, respectively, compared to FaNet, the method applied to the FSC-20 dataset. It is worth mentioning that the proposed method obtained a significant improvement of +41.53% on zero-shot.
Furthermore, the qualitative result of one-shot retrieval is provided in Fig. 3. In Fig. 3, each row represents a class such as "Am," "bump," and "damage." Each cell in each row is a to-be-retrieved sample. The green frames denote correct retrievals, while the red frames denote wrong retrievals. The last row indicates steel surface defects, and the other four rows indicate aluminum surface defects. It can be seen that relatively high acc is obtained for aluminum damage, bump, and convexity defect retrieval. However, relatively low acc is observed on steel defect retrieval. This is because of the unbalanced data distribution of the dataset (very few steel samples in the training set), which needs to be considered in future research.
To evaluate the sensitivity of the model to the initial number of query samples, 1-15 query samples are applied to the proposed method and the competitor GTnet, as shown in Fig. 4. It is observed that the proposed method remains stable regardless of the initial number of query samples. On the contrary, GTnet requires a larger initial number (greater than 9) of query samples to reach a fair performance. This demonstrates that the proposed method is insensitive to the initial number of query samples. This is mainly because the proposed method uses synthesized samples to augment the query samples, which reduces the dependency on the initial number of query samples.
2) Zero-Shot Inference Comparison: Table III provides the comparison results of zero-shot visual inspection. It is observed that our method significantly outperforms all competitors on all datasets. In particular, on the H metric, our method achieves +10.64%, +21.92%, +11.28%, +20.17%, +11.55%, and +2.72% improvements on MSD-Cls, MT-CF, EuroSAT, GTSRB, DTD, and FSC-20, respectively, compared with the highest records. On the H metric, our method significantly surpasses all SOTA methods. This demonstrates that the proposed method can effectively balance the performance between unseen and seen visual inspection. On the U metric, our method obtains +9.54%, +16.83%, +6.21%, +1.65%, +0.17%, +11.62%, and +11.56% improvements over SOTA methods on the MSD-Cls, MT-CF, EuroSAT, RESISC45, MED-3, GTSRB, and DTD datasets, respectively. This indicates that the proposed method can synthesize "fake" features that are very similar to the real features, and that the synthesized features can represent the real sample space distribution. On the S metric, our method obtains +8.87%, +3.44%, +1.76%, +6.83%, and +2.86% improvements on the MSD-Cls, MT-CF, EuroSAT, RESISC45, and GTSRB datasets, respectively, while there are 17.81 and 11.28 decreases on the MED-3 and DTD datasets, respectively. Interestingly, existing zero-shot methods have lower U scores than S scores in industry-specific data environments. This is because, in industry-specific data environments, there are not enough training samples for the existing zero-shot methods to learn a stable network for predicting unseen samples. On the contrary, instead of using augmented samples for model training, our method uses synthesized samples for model inference. This significantly decreases the requirements for the number of training samples.
To further reveal the performance of the method, three representative methods, LisGAN, CE-GZSL, and CvcZSL, are selected to construct the confusion heat maps, as shown in Fig. 5. The x-axis represents the predicted defect classes, while the y-axis refers to the real defect classes. A dark color represents a high probability given by the model for predicting the class label. The more the dark colors concentrate along the diagonal of the heat map, the more accurate the model is.
The color distribution of Fig. 5(a) is chaotic, which means the corresponding method fails to conduct the task. Fig. 5(b) and (c) have similar color distributions that are close to the diagonal of the heat map. However, there are still many dark colors that deviate from the diagonal of the heat map, which represent wrong predictions. Fig. 5(d) has the clearest color distribution close to the diagonal and thus the best prediction performance. This further demonstrates the superiority of the proposed method on unseen defect prediction compared with the existing methods.
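A row-normalized confusion matrix of the kind visualized in such heat maps can be computed as in the following minimal sketch. The class names and labels are made-up placeholders, not data from the paper, and `confusion_matrix` and `mean_diagonal` are hypothetical helper names.

```python
def confusion_matrix(true_labels, pred_labels, classes):
    """Row-normalized confusion matrix: rows are real classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix


def mean_diagonal(matrix):
    """Average per-class accuracy, i.e., the mean of the diagonal entries."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


classes = ["bump", "damage", "scratch"]
y_true = ["bump", "bump", "damage", "damage", "scratch", "scratch"]
y_pred = ["bump", "damage", "damage", "damage", "scratch", "bump"]
cm = confusion_matrix(y_true, y_pred, classes)
per_class_acc = mean_diagonal(cm)
```

Mass on the diagonal corresponds to correct predictions, so a model whose heat map is dark only along the diagonal has a mean diagonal value close to 1.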

C. Discussions
1) Ablation Study: The proposed method consists of the RFFC, SAGE, and CCO modules (see Section III-B). To ensure a fair comparison and demonstrate the performance of each proposed module more clearly, the baseline combines a traditional few-shot learning model and a traditional generative zero-shot learning model, similar to S2M2_R [51] and GAZSL [61]. The result of the ablation study is provided in Table IV. Interestingly, adding the synthesized feature contrast module (RFFC) resulted in a significant improvement in accuracy (+13.67%, +5.2%, and +3.04%) on fewer shots (zero-shot, one-shot, and five-shot). This indicates that the RFFC can effectively generate unseen features for few-/zero-shot prediction.
Furthermore, compared with the baseline, using the SAGE module alone improved the accuracy on different shots (+1.31%, +3.07%, and +1.47%). This demonstrates that the SAGE module can improve model performance through sample information fusion. Compared with the RFFC module alone, using the combination of the RFFC, SAGE, and CCO modules significantly improved the accuracy on different shots (+9.04%, +12.26%, and +10.26%). This shows that SAGE serves as feature preprocessing for the CCO module and that the combination of the two performs better.
As shown in Fig. 6, we visualize the different classes of features generated by the proposed model (the contrastive generator RFFC module), with different colors representing different classes of the MSD-Cls dataset. Compared with GAZSL (the milestone model for zero-shot learning) and CE-GZSL (the representative model for zero-shot learning), after the first embedding space optimization based on contrast learning, our method shows a significant distance between different classes of synthesized features, a clear boundary between the generated classes, and a significant decrease in biased samples.
2) Hyperparameters Discussion: The hyperparameters used in our method are k, θ, and λ. For the maximum similarity retention k [k ≥ 1, see (4) and (10)] and the embedding graph ratio θ [see (5)], the metric evaluated is the accuracy of one-shot [Fig. 7(a)] and five-shot [Fig. 7(b)] classifications. The dataset for the evaluations is MSD-Cls.
It is observed from Fig. 7(a) that the accuracy decreases when θ increases and the maximum accuracy is obtained when θ = 1. The accuracy first increases when k increases from 2 to 4 but then gradually decreases when k ≥ 6. Thus, all things considered, for θ and k, the best settings at one-shot are θ = 1 and k = 6. Similarly, according to Fig. 7(b), for θ and k, the best settings at five-shot are θ = 1 and k = 4.
For the regularization parameter λ [see (8)], one- to five-shot classification experiments are conducted. As shown in Fig. 7(c), the accuracy increases quickly as λ increases from 0 to 5. After that, the accuracy of one-shot, three-shot, and five-shot classifications tends to be stable as λ continues to increase. Thus, a reasonable setting of the regularization parameter is λ = 5.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Since the proposed method uses samples generated by the RFFC module as synthesized support samples when inferring the categories of the query samples, different numbers of synthesized samples lead to different accuracy rates. As shown in Fig. 7(d), in the zero-shot condition, the model performance is optimal when the number of synthesized samples is 6 and then gradually decreases. This may be because some biased synthesized samples misguide the inference process, which lowers the accuracy.
Similarly, the model performance is optimal in the one- and five-shot conditions when the number of synthesized samples is 7 and 6, respectively.
The memory consumption of the model's inference process on the MSD-Cls dataset with different numbers of synthesized support samples is shown in Fig. 8. A clear trend is that the larger the number of synthesized support samples, the larger the memory consumption of the model inference process, while the number of query samples remains constant.
In summary, by weighing the relationship between model size and accuracy, the number of synthesized support samples of the proposed method is uniformly set to 6 in practical deployment, so that the model obtains the maximum performance while consuming as little memory as possible.
3) Inference Time Evaluation: To further understand the performance of the proposed method, the model inference time is evaluated on the MSD-Cls dataset, as shown in Table V. Since the filtering phase of the few-shot inference process can be completed before inferring the query sample categories, the inference process of the proposed method is decoupled into two phases (filtering + inference). Both the combined filtering and inference time before decoupling and the inference time without the filtering phase are measured here.
A clear trend is that the inference time becomes longer as the number of support samples increases. Meanwhile, the inference time of the proposed method alone is much smaller than the combined time of the inference and filtering phases. Therefore, filtering operations are performed before model deployment to improve real-time prediction.
4) Backbone Network Discussion: To evaluate the impact of different backbone networks on the proposed method, it is evaluated on the MSD-Cls dataset using several mainstream backbone networks (i.e., WRN-28-10, ResNet-12, and ResNet-18). The test results are shown in Table VI, where the accuracy of the proposed method is 46.69% and 78.92% on WRN-28-10 for shot = 0 and shot = 1, respectively. The highest accuracy is achieved on ResNet-18 for shot = 5. Overall, WRN-28-10 is more suitable for handling the MSD-Cls dataset with fewer samples.

D. Prototype Scenario
1) Hot-Rolled Steel Sample Collection: To evaluate the performance of the method in practical applications, we collected 15 types of hot-rolled steel surface defects from our partner manufacturers. Some of the defect samples are shown in Fig. 9(a), and the types of defects include contaminants (Co), inclusions (In), scratches (Sc), oxides (Ox), and so on. This included 150 defect samples (10 for each defect class) and 210 normal samples, and a hot-rolled steel surface
2) Prototype Scene Building: A prototype manufacturing defect detection scenario was established to evaluate the model's performance in a realistic scenario. The prototype scenario consists of a production environment, an IoT middleware (our previous work [65]), and a cloud server (Huawei kAi1s accelerated cloud server).
The production environment is shown in Fig. 9(b), where three industrial cameras at different angles (top-camera, left-camera, and right-camera) were used to capture defect samples of hot-rolled steel in parallel on the conveyor belt (running at 10 m/s, with a length of 600 mm). The top-camera is placed 230 mm from the samples, and the cameras have a resolution of 2594 × 1944 pixels. The images were resized to 64 × 64 pixels before being passed to the server to speed up model execution.
Considering that in the realistic application environment, multiple production lines may be monitored in real-time with multiple cameras, which will lead to a large number of product images being captured at the same time, it is not easy to expand the devices and transmit image data quickly by connecting the cameras directly to the server.Thus, to integrate a large amount of image data quickly, the obtained image data and the control signals of other devices (e.g., reject devices) are integrated into a cloud server via the IoT middleware.The reject device uses a programmable vision robot arm with five degrees of freedom and a vision resolution of 640 × 480, and the microprocessor is a Quad-core ARM A57 + 128-core NVIDIA Maxwell.
The proposed method is deployed to perform defect detection response (controlling the reject device and conveyor belt) and result visualization (displaying on an all-in-one machine) on a cloud server with a Kunpeng 920 2.6 GHz processor.
It is worth mentioning that to perform real-time defect detection, the proposed method is decoupled into three phases in the deployment, including feature generation based on contrast learning (see Section III-B), support sample filtering (see Section III-C2), and defect class inference (see Section III-C1).Only the inference phase is deployed on the server, and the feature generation and sample filtering phases are preprocessed before deployment.First, the synthesized features of seen/unseen classes are generated using the class prompts provided by the experts.Then, the generated seen class features are filtered.Finally, the unseen class synthesized features, the filtered seen class synthesized features, and the real support features are combined to form a feature support library for use in the defect class inference stage.
In practical applications, to realize defect detection (including classification and segmentation), a defect segmentation model with segmentation and object detection functions is introduced, which crops out the detected defects and then passes them to our proposed classification model.More accurate defect locations and smaller image sizes help improve the proposed method's accuracy and speed.The defect segmentation model segments the image when the proposed method finishes classifying the defects.The final visualization is shown in Fig. 9(c).
3) Evaluation Results: Each experiment was repeated ten times to obtain the average value. Combining the parameter analyses and inference time evaluations from Section IV-C, and weighing model size, classification accuracy, and run time, all experimental parameters were fixed to θ = 1, k = 4, and λ = 5, and the number of synthesized support samples was set to four.
The experimental results are shown in Table VII, which includes the average recognition time for K-shot classification, the time to send visual features over the network, and the time to receive the results. Based on the experimental results, it can be observed that the proposed method achieves an average accuracy of 100% for five-shot classification. This result indicates that, for the classification of hot-rolled steel surface defects in real manufacturing, the performance of our method is relatively high for both zero- and few-shot settings. The time cost of classification is also acceptable for manufacturing applications. This work will be continued in collaboration with Metallurgical Research Institute Company Ltd. and promoted in industrial production lines.

V. CONCLUSION
In the field of surface defect recognition, our work focuses on solving three problems: the lack of training samples and model complexity in deep-learning-based methods, the recognition of defect types limited to known classes in few-shot learning-based methods, and the inability to balance attention between seen and unseen classes in zero-shot learning-based methods. A novel few-/zero-shot compatible surface defect classification method is proposed. Extensive experiments on eight fine-grained datasets show that our method improves by an average of 8.29% on the few-shot recognition task and 8.23% on the zero-shot recognition task compared to SOTA methods. The prototype scenario evaluation demonstrates that the proposed method can recognize defect types in real time. Meanwhile, the average accuracy of the five-shot classification of hot-rolled steel defects reaches 100%, demonstrating the adaptability of the proposed method in industrial environments.
The limitations of this method are as follows. 1) Compared with the existing zero-shot learning methods, the accuracy of the proposed method has significantly improved, but it has not yet reached the expected accuracy for industrial applications. The next step will explore associating the seen class sample information with the unseen classes and further optimizing the recognition of the unseen classes using the few-shot learning idea. 2) Experiments show that although the classification performance of the proposed method exceeds that of the methods based on few-shot learning on surface defect datasets (i.e., MSD-Cls, FSC-20, and MT-CF), the model size and inference time are not optimal. The next step is to introduce model compression methods, such as knowledge distillation, to make the model more adaptable to real-time industrial production environments.

Fig. 1. Overview framework of the proposed method for few-/zero-shot visual inspection.

Fig. 2. Example of positive and negative samples of real and synthesized features, with synthesized sample resolution of 64 × 64 pixels.

Fig. 3. Qualitative results of one-shot retrieval. Correct and incorrect retrieved instances are shown in green and red, respectively.

Fig. 4. Influence of the initial number of query samples q on our method, compared with the previous SOTA method GTnet. (a) GTnet. (b) Ours.

Fig. 5. Confusion heat maps of the representative methods. The x-axis represents the predicted defect classes, while the y-axis refers to the real defect classes. (a) LisGAN. (b) CE-GZSL. (c) CvcZSL. (d) Ours.

Fig. 6. Comparison of RFFC module effects. The visual analysis for different classes of features synthesized by different zero-shot learning models, with different colors representing that the synthesized features belong to different classes. (a) GAZSL. (b) CE-GZSL. (c) Proposed.

Fig. 8. Memory consumption of the inference process with different synthesized support samples.

Fig. 9. Prototype manufacturing scenario for hot-rolled steel defect classification based on the proposed method. (a) Hot-rolled steel defect samples. (b) Production environment. (c) Cloud server results in feedback.

TABLE I COMPARISON OF STATE-OF-THE-ART DEFECT RECOGNITION TECHNIQUES

TABLE II COMPARISON RESULT OF FEW-SHOT VISUAL INSPECTION

TABLE III COMPARISON RESULT OF ZERO-SHOT VISUAL INSPECTION

TABLE V EVALUATION OF INFERENCE TIME WITH DIFFERENT SUPPORT SAMPLES

TABLE VI PERFORMANCE EVALUATION OF DIFFERENT BACKBONE NETWORKS

TABLE VII PROTOTYPE SCENARIO EXPERIMENTAL RESULTS