Attention-Guided Multitask Learning for Surface Defect Identification

Surface defect identification is an essential task in the industrial quality control process, in which visual checks are conducted on a manufactured product to ensure that it meets quality standards. The convolutional neural network (CNN)-based surface defect identification method has proven to outperform traditional image processing techniques. However, the real-world surface defect datasets are limited in size due to the expensive data generation process and the rare occurrence of defects. To address this issue, this article presents a method for exploiting auxiliary information beyond the primary labels to improve the generalization ability of surface defect identification tasks. Considering the correlation between pixel-level segmentation masks, object-level bounding boxes, and global image-level classification labels, we argue that jointly learning features of the related tasks can improve the performance of surface defect identification tasks. This article proposes a framework named Defect-Aux-Net, based on multitask learning with attention mechanisms that exploit the rich additional information from related tasks with the goal of simultaneously improving robustness and accuracy of the CNN-based surface defect identification. We conducted a series of experiments with the proposed framework. The experimental results showed that the proposed method can significantly improve the performance of state-of-the-art models while achieving an overall accuracy of 97.1%, Dice score of 0.926, and mean average precision of 0.762 on defect classification, segmentation, and detection tasks.


I. INTRODUCTION
AUTOMATED visual inspection plays an important role in industrial-informatics-based decision-making systems in various industries, including steel manufacturing, automotive, electronics manufacturing, and pharmaceuticals. Correct, consistent, and early detection of surface defects makes it possible to identify defective products early in the manufacturing process, which saves time and cost. Inspection procedures for detecting such defects are usually performed using nondestructive testing (NDT) methods. An NDT procedure is a combination of inspection steps used to identify discontinuities or defects in a product without damaging its usability. The most frequently used industrial NDT methods are visual optic testing, radiography, X-ray vision, ultrasonic imaging, dye penetrant testing, magnetic particle testing, and infrared thermal imaging. The testing procedure for each of these methods involves several steps, most of which can be easily automated. However, the final step of visual inspection is more complex to automate and remains primarily a manual process performed by operators.
The traditional machine-vision system relies on hand-crafted features, such as color, contrast, texture, edges, and foreground-background statistics, followed by machine learning classifiers, such as support vector machines, decision trees, or K-nearest neighbors. Consequently, hand-crafted feature extraction plays an important role in classical approaches. However, these features are neither robust nor well suited to different tasks, which leads to long development cycles. Deep learning methods, on the other hand, learn the relevant features directly from the raw data, without the need for hand-crafted feature representations. In recent years, convolutional neural networks (CNNs) have matched and even surpassed human-level performance on computer vision tasks such as image classification. The key difference between CNNs and traditional machine-vision algorithms is that a CNN automatically detects significant features without any human supervision, which has made it the most widely used approach. A useful property of CNNs is their ability to take advantage of the spatial or temporal correlation of image data. There are three main problem categories for image recognition tasks using CNNs: 1) classification, 2) segmentation, and 3) object detection. The classification task aims to assign an image to a certain category. Starting with AlexNet [1], the winning architecture of the ImageNet Large Scale Visual Recognition Challenge, a series of increasingly complex architectures including ResNet [2], Inception [3], DenseNet [4], and EfficientNet [5] have been proposed in the literature for the classification task. Object detection is a task that localizes an object using a bounding box. Notable object detection algorithms include Fast R-CNN [6], Faster R-CNN, Mask R-CNN [7], single shot detection (SSD) [8], and You Only Look Once (YOLO) [9]. Segmentation is the task of performing pixel-by-pixel classification.
Several segmentation algorithms have been proposed in the literature including fully convolutional networks, encoder-decoder-based approaches [10], multiscale and pyramid architectures [11], etc.
However, industrial visual inspection systems have barely utilized the potential of those complex architectures, for several reasons [12]. One of the main reasons is that the continuous improvement of industrial processes has resulted in fewer and fewer defective samples, so the number of defective samples available for training is very limited [13]. This problem of learning from a limited number of samples is usually referred to as the small sample problem, which can easily lead to poor generalization ability of the trained model [14]. In addition, the target surface defects have different scales, making it even more challenging for deep learning models to identify small defects. On the one hand, the visual appearance of real-world surface defects varies with the type of material, imaging conditions, and camera position. On the other hand, it is challenging to distinguish tiny defects from the noise or non-defect components within an image (as shown in Fig. 1). Hence, the appearance of false positives in a defect-free image is an inevitable circumstance. Furthermore, real-time applications of complex CNN models are extremely limited due to the long inference time and the resulting higher computational resource and power consumption.
To address these limitations, we present a novel universal architecture that integrates classification, segmentation, and detection of surface defects in a single network. Our architecture, Defect-Aux-Net, is primarily motivated by a multitask learning (MTL) scheme that exploits useful information from related learning tasks to help mitigate the problem of data scarcity. The proposed architecture is based on FPN-semantic-segmentation [11] with the additional tasks of defect classification and detection to improve the generalization ability by utilizing the image-level information as an inductive bias. Specifically, we developed a new MTL network based on FPN, where the classification task is carried out in the bottom-up pathway of the network and segmentation is performed in the top-down pathway of the network. To create a bounding box, we employ two subnetworks in the top-down pathway, where one subnet determines the class associated with the bounding box and the other performs the regression to adjust the bounding box position.
The FPN-based feature extractor in the proposed network allows surface defects to be recognized at vastly different scales by efficiently sharing features between image regions. We further introduce positional and channel attention mechanisms that focus on learning the features of small surface defects to improve the robustness of detecting small defects surrounded by a complex background. We evaluate our model on the TekErreka and Severstal [15] surface defect datasets on defect classification, segmentation, and detection tasks. Experimental results demonstrate that jointly
4) The proposed model is compact and efficient, with state-of-the-art performance that meets the computational resource requirements of real-time inference.

II. RELATED WORK
A large and growing body of literature has explored the use of CNNs for surface defect identification. Kim et al. [16] adopted a few-shot learning technique with a Siamese neural network using CNN, which aims to classify surface defects with a limited number of training images. Lin et al. [17] employed a class activation mapping technique in a CNN to simultaneously achieve defect classification and localization in the LED chip defect inspection process. Tao et al. [18] designed a cascaded autoencoder (CASAE) architecture to segment and localize defect regions. The proposed architecture transforms the input image into a mask prediction, and then the defect regions of the segmented mask are classified into their specific classes. Jing et al. [19] combined an autoencoder with a fully connected network to distinguish keyboard light leakage defects from mere dust. Jian et al. [20] leveraged a generative adversarial network to exaggerate the tiny defects within the images to improve the accuracy of different classifiers. Zheng et al. [21] proposed a three-stage model for rail surface and fastener defect detection. In the first stage, the YOLOv5 framework is employed to localize the rail and fasteners. Then, an object detection model based on Mask R-CNN is used to detect defects on the rail surface. In the final stage, the ResNet architecture is utilized to classify defects of the fasteners. To detect defects at different scales, Xu et al. [22] used a pretrained ResNet model to extract multiscale features and fused them using a multilevel feature fusion network. In [23], U-Net and residual U-Net architectures were used for the fine-grained segmentation of surface defects on a steel sheet. The main drawback of these methods is that the models need a large amount of annotated data, and hence the localization of defects is very coarse in real-time scenarios.

A. Network Architecture
Our proposed network is inspired by two deep learning architectures that are widely used: 1) feature pyramid network (FPN) and 2) ResNet-50. Recognizing surface defects at vastly different scales is a fundamental challenge in the industrial machine vision system. For this reason, we use FPN that uses a pyramidal hierarchy of convolutional filters to extract feature pyramids at different scales. FPN consists of two pathways: 1) bottom-up and 2) top-down. The bottom-up pathway also known as the encoder is the typical CNN, which can be any image classifier for feature extraction. As we go up, the encoder gradually decreases the spatial resolution while building high-level feature maps. The top-down pathway is connected to the bottom-up pathway through lateral connections for efficient multiscale feature fusion. It is designed to enhance the feature maps from the bottom-up pathway and build semantically strong feature maps at multiple scales by double upscaling. As a result, the feature pyramid has rich semantics at all levels because the lower semantic features are interconnected to the higher semantics.
1) Bottom-Up Pathway: We tested several standard image classification architectures to select the core model and finally chose ResNet-50 as the backbone. ResNet-50 has shown great performance for surface defect classification, segmentation, and detection tasks. The ResNet-50 architecture has the advantage of using a stride of two for each scale reduction, which makes it easier to incorporate ResNet-50 into FPNs when we need to upscale feature maps in a top-down pathway. Furthermore, ResNet-50 is a relatively small network by modern standards; therefore, it is suitable for our limited labeled data problem. However, existing ResNet-50 feature pyramids have two problems in the way they apply convolution operations to the input features. First, the receptive field of the encoder has information only about the local region, so the global information is lost. Second, the feature maps constructed from the learned weights are given equal importance, although some feature maps are more important for the next layers than others. For instance, a feature map that contains edge information of the defects might be more important than another feature map that has background texture information (as shown in Fig. 2). Thus, to incorporate channel attention, we adopt the squeeze-and-excitation (SE) module [24] in the encoder. The SE module consists of three components: 1) squeeze, 2) excite, and 3) scale.
The main goal of the squeeze component is to extract global information from each channel $c$ of a feature block $U$. The global information is acquired by applying global average pooling across the spatial dimensions ($H \times W$) of each channel $U_c$ of $U$ to obtain global statistics of size $1 \times 1 \times C$. Mathematically, the squeeze operation can be represented as
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j) \tag{1}$$
After obtaining global information from the squeeze component, the excite component generates a weight for each channel. It uses a fully connected multilayer perceptron (MLP) bottleneck structure to dynamically calibrate the weights. This MLP bottleneck has two fully connected layers, with sigmoid activation at the output layer. The output of the excitation component can formally be represented by the following equation:
$$s = \sigma\left(W_2 \, \rho\left(W_1 z\right)\right) \tag{2}$$
where $\sigma$ is the sigmoid operation, $\rho$ is the ReLU operation, $z$ is the output of the squeeze component, and $W_1$ and $W_2$ refer to the weights of the two fully connected layers. Subsequently, each channel in the feature map is scaled by a simple elementwise multiplication of the input feature map and the weights obtained from the excite component (as shown in Fig. 3).
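The three SE components described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' TensorFlow implementation: the function name `squeeze_excite`, the nested-list tensor layout, and the omission of bias terms are assumptions made for readability.

```python
import math


def squeeze_excite(U, W1, W2):
    """Illustrative SE block (hypothetical helper, biases omitted).

    U  : feature block as a list of C channels, each an H x W list of lists.
    W1 : first fully connected layer, shape (C_reduced x C).
    W2 : second fully connected layer, shape (C x C_reduced).
    """
    # Squeeze: global average pooling over H x W gives one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in U]
    # Excite: bottleneck MLP, ReLU after the first layer, sigmoid after the second.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in W1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in W2]
    # Scale: multiply every element of channel c by its weight s[c].
    scaled = [[[v * s[c] for v in row] for row in ch] for c, ch in enumerate(U)]
    return scaled, s


# Two 2 x 2 channels with toy weights chosen only for illustration.
U = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
scaled, s = squeeze_excite(U, W1=[[1.0, 1.0]], W2=[[0.0], [1.0]])
```

With these toy weights the second channel receives a weight close to 1 while the first is halved, which is the channel recalibration effect the SE module is meant to provide.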
Surface defects appear only in some parts of the image, not in the whole image. Unlike the conventional ResNet-50 architecture, which gives equal importance to each region in an image, spatial attention reduces background interference by assigning a weight to each pixel in the feature map.
The spatial attention (SA) module focuses on the most relevant parts of the feature maps in the spatial dimension. The working principle of our spatial attention mechanism is as follows.
Given a feature block $U$, we apply average-pooling and max-pooling operations along the channel axis and concatenate the results to generate an efficient feature map summary $M$. A convolutional layer followed by a sigmoid operation is then applied to $M$ to produce a spatial attention map (as shown in Fig. 4).
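The spatial attention steps described above (channelwise pooling, concatenation, convolution, sigmoid) can be illustrated with the following minimal sketch. The function name `spatial_attention`, the zero padding, the absence of a bias term, and the final reweighting of the input by the attention map are assumptions for illustration; the kernel size used by the authors is not stated in the text.

```python
import math


def spatial_attention(U, kernel):
    """Illustrative spatial attention (hypothetical helper).

    U      : list of C channels, each an H x W list of lists.
    kernel : 2 x k x k convolution weights (k odd); slice 0 is applied to
             the average map and slice 1 to the max map.
    """
    C, H, W = len(U), len(U[0]), len(U[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Pool along the channel axis to obtain two H x W maps.
    avg = [[sum(U[c][i][j] for c in range(C)) / C for j in range(W)]
           for i in range(H)]
    mx = [[max(U[c][i][j] for c in range(C)) for j in range(W)]
          for i in range(H)]
    M = [avg, mx]  # concatenated two-channel summary
    att = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:  # zero padding
                            acc += kernel[m][di][dj] * M[m][y][x]
            att[i][j] = 1.0 / (1.0 + math.exp(-acc))  # sigmoid
    # Reweight every channel by the per-pixel attention value.
    out = [[[U[c][i][j] * att[i][j] for j in range(W)] for i in range(H)]
           for c in range(C)]
    return out, att


# Toy block with a single bright "defect" pixel at position (1, 1).
U = [[[0.0, 0.0], [0.0, 4.0]], [[0.0, 0.0], [0.0, 2.0]]]
out, att = spatial_attention(U, kernel=[[[1.0]], [[1.0]]])
```

In this toy case the bright pixel receives an attention weight close to 1, while the empty background pixels receive the neutral value 0.5.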
ResNet uses four modules consisting of residual blocks, each of which uses two types of blocks, 1) identity (ID) blocks and 2) convolution blocks, depending on whether the input and output dimensions are the same or different. We arrange the SE and SA modules in series and integrate them into a residual block (as shown in Fig. 5).
2) Top-Down Pathway: Deep features from the bottom-up pathway are upsampled by convolutions and bilinear upsampling operations until all the feature maps reach one-fourth scale. Attention module outputs from the bottom-up pathway $\{C_2, C_3, C_4, C_5\}$ are fused into the top-down pathway through lateral connections for efficient multiscale feature fusion. First, a 1 × 1 convolutional filter is applied to the feature maps $\{C_2, C_3, C_4, C_5\}$ to get a fixed number of channels, and each result is then merged with the corresponding top-down feature map by elementwise addition. Finally, the outputs are summed and then transformed into a pixelwise output (as shown in Fig. 6).
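To make the "one-fourth scale" concrete, the sketch below computes the spatial size of each pyramid level and the upsampling factor needed to bring it to one-fourth of the input resolution. It assumes the standard ResNet output strides of 4, 8, 16, and 32 for the four levels and an input size divisible by 32; the helper name `pyramid_shapes` is hypothetical.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return {level: (h, w, upsample_factor_to_quarter_scale)}.

    Assumes the usual ResNet strides for C2..C5 and input sizes that are
    divisible by the largest stride.
    """
    shapes = {}
    for level, stride in zip(("C2", "C3", "C4", "C5"), strides):
        shapes[level] = (height // stride, width // stride, stride // 4)
    return shapes


# Example with a 128 x 800 input (the resized Severstal resolution).
shapes = pyramid_shapes(128, 800)
for level, (h, w, factor) in shapes.items():
    print(f"{level}: {h} x {w}, upsample x{factor} to reach 1/4 scale")
```

Under these assumptions, C2 is already at one-fourth scale, whereas C5 has to be upsampled by a factor of eight before the levels can be summed.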
3) Segmentation Branch: The segmentation branch of the top-down pathway aims at classifying pixels into a set of predefined classes. In real-world datasets, the pixels corresponding to the background are far more numerous than the pixels of surface defects, which biases the model toward the background. To address this pixelwise class imbalance, we employ the Dice loss, which uses the Dice coefficient to measure the overlap between the pixels of the predicted mask and the ground truth label. Mathematically, the Dice loss function is defined as
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i} y_i \hat{y}_i}{\sum_{i} y_i + \sum_{i} \hat{y}_i} \tag{3}$$
where $y_i$ is the ground truth label and $\hat{y}_i$ is the predicted label of the $i$th pixel. The value of the Dice coefficient ranges from 0 to 1, where 1 indicates perfect and complete overlap of pixels.
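As a small worked example, the Dice loss can be computed as one minus the Dice coefficient of two flattened masks. The function name `dice_loss` and the smoothing constant `eps` (which avoids division by zero on empty masks) are assumptions of this sketch and are not taken from the paper.

```python
def dice_loss(y_true, y_pred, eps=1e-7):
    """Dice loss for flattened masks (values in [0, 1]); eps is an assumed
    smoothing term that keeps the ratio defined when both masks are empty."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


perfect = dice_loss([1, 1, 0, 0], [1, 1, 0, 0])   # full overlap
partial = dice_loss([1, 1, 0, 0], [1, 0, 0, 0])   # one of two defect pixels found
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])  # no overlap
```

Because the background pixels that both masks agree on do not enter the ratio, the loss is driven by the defect pixels, which is why it is less sensitive to the background imbalance than a per-pixel loss.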

4) Classification Branch:
The output of the bottom-up pathway encodes rich abstract feature representations of the input image. Hence, we take the spatial average of the feature maps from the bottom-up pathway via a global average pooling layer, and the resulting feature vector is then fed into a sigmoid or softmax layer, depending on the classification type. We employ binary cross-entropy (BCE) as the classification loss function. Mathematically, our classification loss is defined as
$$L_{\mathrm{cls}} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{CE}(y_i, \hat{y}_i) \tag{4}$$
where $y_i$ is the ground truth label and $\hat{y}_i$ is the predicted label of the $i$th sample, $k$ is the total number of samples, and CE is the binary cross-entropy function.
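A minimal sketch of the averaged binary cross-entropy described above is given below. The function name `bce_loss` and the clipping constant `eps` (used to avoid taking the logarithm of zero) are assumptions for illustration.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over k samples; predictions are clipped to
    [eps, 1 - eps] (an assumed safeguard) before taking logarithms."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


uncertain = bce_loss([1, 0], [0.5, 0.5])    # equals ln 2
confident = bce_loss([1, 0], [0.95, 0.05])  # much smaller loss
```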

5) Object Detection Branch:
We extract bounding boxes and their associated classes by employing box regression and classification subnets at each level of the top-down pathway. The classification subnet predicts the probability of defect presence at each spatial location of an input image. The box regression subnet is attached to the top-down pathway in parallel with the classification subnet to regress the offset from each anchor box to the ground truth bounding boxes. To handle class imbalance, we adopt the focal loss [25], an improved version of cross-entropy that focuses learning on hard negative examples. It is defined as
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{5}$$
where $p_t$ is the predicted probability of the ground truth class, $\alpha_t$ is the weight parameter per class, and $\gamma$ is the hyperparameter that focuses learning on hard negative samples. We choose $\alpha_t = 0.25$ and $\gamma = 4$ as suggested in [26].
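The effect of the focusing parameter can be seen in the following sketch of a binary focal loss, using the values α = 0.25 and γ = 4 quoted above. The function name `focal_loss`, the clipping constant `eps`, the use of 1 − α for the negative class, and averaging over samples are assumptions of this sketch; the normalization used by the authors is not stated in the text.

```python
import math


def focal_loss(y_true, y_pred, alpha=0.25, gamma=4.0, eps=1e-7):
    """Binary focal loss averaged over samples (averaging is an assumption).

    p_t is the predicted probability of the true class; alpha_t is alpha for
    positives and 1 - alpha for negatives.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


easy = focal_loss([1], [0.9])  # well-classified example, strongly down-weighted
hard = focal_loss([1], [0.3])  # misclassified example, dominates the loss
```

The modulating factor (1 − p_t)^γ shrinks the contribution of well-classified examples, so the many easy background anchors do not overwhelm the few defect anchors.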

B. Loss Function
Our proposed method combines three loss functions from the classification, segmentation, and detection tasks, which provide mutual sources of inductive bias for each task. Specifically, the segmentation and detection losses backpropagate through the entire model (bottom-up and top-down pathways), while the classification loss backpropagates only through the bottom-up pathway. We combine and weight the three losses into a multitask loss $L_M$ to leverage the heterogeneous annotations and jointly optimize multiple tasks as follows:
$$L_M = \beta L_{\mathrm{cls}} + \beta_1 L_{\mathrm{seg}} + \beta_2 L_{\mathrm{det}} \tag{6}$$
where $L_{\mathrm{cls}}$, $L_{\mathrm{seg}}$, and $L_{\mathrm{det}}$ denote the classification, segmentation, and detection losses, and $\beta$, $\beta_1$, and $\beta_2$ are weight parameters. We tested different combinations of weight parameters and found that $\beta = \beta_1 = \beta_2 = 1$ yields the best result for all the tasks.
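The weighted combination of the three task losses amounts to a simple weighted sum, sketched below. The function name `multitask_loss` and the assignment of each weight to a particular task are assumptions; with all weights equal to 1, as reported above, the assignment does not affect the result.

```python
def multitask_loss(l_cls, l_seg, l_det, beta=1.0, beta1=1.0, beta2=1.0):
    """Weighted sum of classification, segmentation, and detection losses.

    Which weight belongs to which task is assumed here for illustration.
    """
    return beta * l_cls + beta1 * l_seg + beta2 * l_det


# Illustrative loss values only, not results from the paper.
total = multitask_loss(l_cls=0.2, l_seg=0.3, l_det=0.5)
```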

A. Datasets
In this article, we evaluate our framework on real-world surface defect identification problems. We use two challenging datasets with increasing resolutions and complexities: 1) the Severstal steel sheet dataset [15] and 2) the TekErreka steel fastener defect dataset. Severstal, the largest steel and steel-related mining company, has recently published the largest industrial steel sheet surface defect dataset, which contains pixelwise masks annotated by their technical experts. The dataset contains 12 568 grayscale images of size 1600 × 256. Each image in the dataset may have no defects, a single defect, or multiple defects, divided into four classes. Fig. 7 shows examples of steel defect images from the Severstal dataset. We randomly select 10% and 20% of the 12 568 original images as the validation and test data, respectively. The main challenge with this dataset is that the interclass similarities between defective and defect-free examples are very high. The TekErreka dataset is a self-collected steel fastener surface defect dataset based on a magnetic particle inspection procedure. Magnetic particle inspection is an excellent method for investigating near-surface defects in steel fasteners. The basic principle is to magnetize a steel fastener parallel to its surface. If the fastener is free from defects, the magnetic field lines run within the fastener and parallel to its surface. In the case of a magnetic inhomogeneity, for instance, near cracks, the magnetic field lines locally leave the surface and a leakage field occurs. When a suspension of ferromagnetic particles is applied to the test piece surface, the magnetic particles run off at defect-free areas. At the locations of leakage fields, the magnetic particles are attracted and cluster together, thus indicating the location of the defect. The surface defects then become visible under ultraviolet light.
We acquired the TekErreka dataset from a magnetic particle inspection apparatus located at the Erreka fastening solutions. The defects in the TekErreka dataset differ in their size, shape, location, and material type and thus cover several scenarios in real-time defect detection. The difficulty of this dataset lies in the similarity between defects and the noise caused by magnetic particle deposition on the defect-free surface of the fasteners. Many factors contribute to the noise component, including the magnetic particle size, the amount of magnetic particles used, and the ultraviolet light present. The original examples are stored directly in a database as RGB images of size 2464 × 2056. The dataset has 450 positive and 1200 negative examples. We split the TekErreka dataset into training and testing sets: 80% for training and 20% for evaluating model performance.

B. Preprocessing
We resized the images of the Severstal dataset to 128 × 800 and those of the TekErreka dataset to 600 × 600. To keep the pixel values on the same scale, we normalized the images using min-max normalization, which rescales raw pixel values to the range [0, 1]. This helps the optimizer avoid taking steps that are too large in one dimension or too small in another.
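Min-max normalization maps each pixel value v to (v − min) / (max − min). The sketch below applies it to a single-channel image stored as a list of rows; computing the minimum and maximum per image (rather than using the fixed 0 to 255 range) and the handling of constant images are assumptions of this example, as is the function name.

```python
def min_max_normalize(image):
    """Rescale a single-channel image (list of rows) to [0, 1] using the
    per-image minimum and maximum (an assumed choice)."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: avoid division by zero
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


normalized = min_max_normalize([[0, 128], [255, 51]])
```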

C. Data Augmentation
To improve the diversity of the training set, we apply random but realistic data augmentation such as rotation, vertical/horizontal flips, zoom, shear, and channel shifts.
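For the segmentation task, geometric augmentations have to be applied identically to the image and its mask so that the labels stay aligned with the pixels. The following sketch shows this for random horizontal and vertical flips only; the function name `random_flip`, the flip probability of 0.5, and the joint treatment of image and mask are assumptions for illustration rather than details given in the paper.

```python
import random


def random_flip(image, mask, rng):
    """Randomly flip an image and its mask together (lists of rows).

    rng is a random.Random instance; each flip is applied with an assumed
    probability of 0.5.
    """
    if rng.random() < 0.5:  # horizontal flip: reverse every row
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    if rng.random() < 0.5:  # vertical flip: reverse the row order
        image = image[::-1]
        mask = mask[::-1]
    return image, mask


image = [[1, 2], [3, 4]]
mask = [[0, 0], [0, 1]]  # marks the pixel with value 4 as defective
aug_image, aug_mask = random_flip(image, mask, random.Random(0))
```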

D. Training Details
Defect-Aux-Net is implemented using the TensorFlow framework. All the experiments are run on Google Cloud TPU v2 infrastructure, which contains 8 cores with 64 GB of memory. The network is optimized with the Adam optimizer and trained with a batch size of 128 for 50 epochs. We adopt the one-cycle policy [27] to find an optimal learning rate.
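One common form of a one-cycle schedule raises the learning rate linearly from a small base value to a maximum and then lowers it again over the remaining steps. The sketch below illustrates that shape only; the function name `one_cycle_lr`, the linear ramps, and the values of `div_factor` and `pct_start` are illustrative assumptions, since the paper does not report the schedule parameters it used.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_start=0.3):
    """Linear one-cycle schedule (all parameter values are assumptions).

    The rate rises from max_lr / div_factor to max_lr during the first
    pct_start fraction of training and then falls back linearly.
    """
    base_lr = max_lr / div_factor
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        return base_lr + (max_lr - base_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr - (max_lr - base_lr) * progress


schedule = [one_cycle_lr(s, total_steps=100, max_lr=1e-3) for s in range(101)]
```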

E. Evaluation Metrics
The classification results are evaluated using precision, recall, F1-score, and binary accuracy:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{10}$$
where TP, TN, FP, and FN denote true positives (correctly identified surface defects), true negatives (correctly identified non-defect images), false positives (images erroneously classified as surface defects), and false negatives (images erroneously classified as non-defect). Precision measures the fraction of images classified as defective that actually contain surface defects, while recall is the ratio of correctly classified images with surface defects to all images with surface defects. The F1-score is the harmonic mean of precision and recall. The overall performance of the classification task is measured by its accuracy. The segmentation results are evaluated using the Dice score and Intersection-over-Union (IoU), which quantify the overlap between the predicted and target binary masks. To evaluate defect detection results, we use the mean average precision (mAP), which compares the detected bounding boxes to the ground truth bounding boxes and returns a score.
TABLE I: PERFORMANCE OF THE PROPOSED APPROACH ON LOSS VARIANTS FOR THE DEFECT SEGMENTATION TASK
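The classification and segmentation metrics described in this section can be computed directly from confusion-matrix counts and binary masks, as in the sketch below. The function names and the convention of returning 0 when a denominator is zero are assumptions of this example; the counts in the usage lines are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (precision, recall, f1, accuracy) from confusion counts.

    Undefined ratios (zero denominators) are reported as 0.0 by assumption.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


def dice_and_iou(mask_true, mask_pred):
    """Dice score and IoU for flattened binary masks (lists of 0/1)."""
    intersection = sum(t * p for t, p in zip(mask_true, mask_pred))
    size_sum = sum(mask_true) + sum(mask_pred)
    union = size_sum - intersection
    dice = 2 * intersection / size_sum if size_sum else 0.0
    iou = intersection / union if union else 0.0
    return dice, iou


precision, recall, f1, accuracy = classification_metrics(tp=6, tn=10, fp=2, fn=2)
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For the same pair of masks the Dice score is never lower than the IoU, which is worth keeping in mind when comparing the two.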

F. Experiments on Defect Segmentation
We performed a series of experiments on the TekErreka dataset to test the effectiveness of different loss functions. First, we trained Defect-Aux-Net using either BCE alone or Dice loss alone as the segmentation loss. Then, it was trained using a combination of the two loss functions. The results are shown in Table I.
Using Dice loss alone yielded more accurate results than using a combination of losses. Additionally, the Dice loss function helped our model converge faster. We use the Dice loss function throughout the rest of the experiments.
To verify the effectiveness of the segmentation task using the MTL strategy, we compared the proposed MTL network (Defect-Aux-Net) against the following networks with the same bottom-up backbone (ResNet-50 + SE + SA attention modules).
1) FPN [11]: This is the original FPN architecture without the MTL strategy and serves as our baseline.
2) UNet [10]: This network uses an encoder for multilevel feature extraction and a decoder that scales the features up and combines multilevel features through stacking.
3) LinkNet [28]: This is similar to UNet, except that the stacking operation in the skip connections is replaced with addition.
4) PSPNet [28]: The pyramid scene parsing network uses a pyramid pooling module for multiscale feature extraction.
Based on the experimental results, we observed that the proposed multitask learning strategy achieves better segmentation performance than the state-of-the-art segmentation models. The Dice and IoU scores of the various segmentation models on the Severstal dataset are depicted in Figs. 8 and 9. We observe that Defect-Aux-Net achieves higher scores for all classes than the other segmentation models. Table II shows the performance of the various networks on the TekErreka dataset. The results in Table II show that the proposed multitask learning can improve the performance of its corresponding single-task model. Taking advantage of the classification-guidance module, Defect-Aux-Net avoids the oversegmentation of defects in a complex background.

G. Experiments on Defect Classification
We evaluated the classification performance of the proposed approach and compared it with state-of-the-art deep learning architectures. While evaluating the classification task, the other two modules, segmentation and detection, are removed from the network. The results of the experiments are summarized in Table III. It can be noted that most errors are false positives, which are caused by the visual similarity between defects and surface noise. Notably, Defect-Aux-Net obtains an overall accuracy of at least 92.9% and at most 99.4% across all defect types on the Severstal dataset. Based on the experimental results, we observe that the proposed MTL approach outperforms the other models. It is also evident that incorporating the segmentation task improves the performance of the classification task and vice versa.
To assess the effectiveness of the proposed approach against the limited data problem, we removed part of the training data and conducted a series of experiments retaining 90%, 75%, and 50% of the training data. The effect of training data size on accuracy is shown in Fig. 10. The proposed Defect-Aux-Net showed consistent performance even when only 50% of the original training data was used in training. As seen, the proposed multitask loss function greatly improves the performance of the classification task by taking image-, pixel-, and map-level optimization into consideration.
To verify the importance of the attention mechanisms in Defect-Aux-Net, we compared the accuracy of the network with and without the spatial and channel (squeeze-and-excitation) attention mechanisms on the TekErreka dataset, as shown in Table IV. Furthermore, we experimented with inserting a combination of both spatial and channel attention mechanisms.

H. Experiments on Defect Detection
The proposed model is compared with other object detection algorithms on the TekErreka dataset. The comparative models include SSD [8], RetinaNet [25], and Cascade R-CNN [30]. Fig. 11 shows the mAP scores of the various detection models for the TekErreka dataset. We observe that Defect-Aux-Net achieves a higher mAP score than the alternative networks. The mAP of the proposed algorithm is 17.95%, 43.77%, and 26.03% higher than that of RetinaNet, SSD, and Cascade R-CNN, respectively.

I. Inference Time
In addition to the model performance, we examine the effect of the MTL framework on the inference time. We compared the inference time of the proposed approach with that of a conventional single-task network, where each task requires a separate pass through the network during inference. All inference times were measured using a computer with an Intel Core processor. The CPU specification is summarized in Table V. From Table VI, we can see that our proposed framework allows for a 57.1% reduction in the model size by solving different tasks jointly rather than independently. Compared to the single-task network, the inference time of our proposed network is reduced by 45.5%.

V. DISCUSSION
By incorporating the MTL strategy, our proposed Defect-Aux-Net improves the performance of defect classification, segmentation, and detection tasks. Intuitively, the multitask deep learning system can provide regularization effects to the multiscale feature learning and thus improve the performance as opposed to the single-task algorithms. Also, the MTL framework can save computational inference time as only a single network needs to be evaluated for three different tasks. The experimental results show that our proposed algorithm greatly improves the performance of the surface defect identification tasks compared to other state-of-the-art deep learning algorithms.

VI. CONCLUSION
In this article, we described an attention-guided MTL scheme, which combines classification, segmentation, and detection for automated surface defect identification. Specifically, we proposed an extended FPN architecture with ResNet-50 incorporated as the encoder section of the model. A hybrid loss function is introduced to enhance the performance of the model. An overall accuracy of 97.1%, a Dice score of 0.926, and an mAP of 0.762 on the classification, segmentation, and detection tasks of the TekErreka dataset were achieved with Defect-Aux-Net.

ACKNOWLEDGMENT
This work was undertaken in the context of DIGIMAN4.0 project ("Digital Manufacturing Technologies for Zero-Defect," https://www.digiman4-0.mek.dtu.dk/). DIGIMAN4.0 is a European Training Network supported by Horizon 2020, the EU Framework Programme for Research and Innovation under Project 814225.