An Enhanced YOLOv4 Model With Self-Dependent Attentive Fusion and Component Randomized Mosaic Augmentation for Metal Surface Defect Detection

Metal surface quality control plays a significant role in the production line of metal products. Detecting metal surface defects is challenging due to the wide variety of defect types and morphological patterns. Recent advances have witnessed deep learning-based automated optical inspection (AOI) systems as a promising solution. This paper presents an enhanced YOLOv4 model for metal surface defect detection (MSDD). Specifically, we integrate three boosting components into the original YOLOv4: 1) a self-dependent attentive fusion (SAF) block, placed within the model neck, to enhance inter-path and cross-layer feature fusion; 2) a component randomized Mosaic augmentation (CRMA) scheme that strategically prevents over-transformed images from participating in training; and 3) a perturbation agnostic (PA) label smoothing method that keeps the model from making over-confident predictions and thus acts as a means of regularization. The proposed method has been validated on a self-developed MSDD dataset. It is shown that each boosting component leads to an impressive mAP gain, and the final model outperforms the baselines, namely, Faster R-CNN, YOLOv4, YOLOv5, and YOLOX, by 7.85%, 6.51%, 3.76%, and 3.57%, respectively.

The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.

Utilizing computational algorithms to identify defects on the metal surface has presented unique merits but also comes with challenges. Essentially, the metal surface defect detection (MSDD) problem is an object detection task. In other words, given an image taken by an industrial camera, the AOI system needs to identify the location of each individual defect instance within the image. Therefore, the main research focus over the years has been improving the detection accuracy. AOI techniques for MSDD have gone through three generations: 1) image processing-based algorithms that utilize various image filters to highlight the defect areas and adjust relevant parameters to improve the accuracy [3], [4], [5], [6]; 2) machine learning (ML)-based algorithms that rely on hand-crafted features extracted by certain image filters [7], [8], [9], [10]; these features are then fed into ML algorithms like support vector machine (SVM) and k nearest neighbors (k-NN) for further detection; 3) deep learning-based algorithms that learn features end-to-end from data.

We compare the proposed method against strong baselines, including Faster R-CNN [15], which have achieved state-of-the-art results on various benchmarks in prior studies. Results show that our method has outperformed the baselines and presented the highest mAP of 0.9465.

The rest of this paper is organized as follows. Section II reviews the related work and highlights the novelty of this study. Section III covers the dataset and a detailed, module-by-module description of the proposed method. Section IV describes the experimental design and reports the key results. Lastly, Section V summarizes the paper and points out future research directions.

A. DNN-BASED OBJECT DETECTION

DNN-based object detection models can be divided into two categories: one-stage and two-stage networks. The latter, represented by Faster R-CNN [15], employs a region proposal network to generate a collection of regions of interest (RoIs), which are then fed into 1) a classifier to predict an object class with a confidence score and 2) a regressor to predict the offsets of the bounding boxes for object localization. On the other hand, a one-stage model is proposal-free, meaning that object classification and bounding box regression are done without using pre-generated RoIs.

The YOLO family has undergone active development since the inception of YOLO in 2015. The first version of YOLO works by splitting an image into an S × S grid, and the cells in the grid directly detect objects in the image via training, thus eliminating the large amount of computation taken by the region proposals required in a two-stage model. Since YOLO is proposal-free, it tends to make numerous duplicate predictions, which can be addressed by a scheme called Non-Maximal Suppression (NMS). YOLO's backbone DNN is called DarkNet, which consists of 24 convolutional layers followed by two dense layers as the detection head.
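To make the duplicate-removal step concrete, the following is a minimal sketch of greedy NMS as commonly implemented; function names and the IoU threshold are illustrative, not taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop duplicates that
    overlap it above the threshold, and repeat on the remainder."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

Here two near-identical predictions of the same defect collapse into one, while a box elsewhere in the image survives.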

• ML-based algorithms rely on hand-crafted features such as the discrete Fourier transform [9] and gray-level co-occurrence matrices [10]. ML models that have appeared in the literature include support vector machine (SVM) [7], [9], [34], K-Means [8], and k nearest neighbors (k-NN) [10]. ML-based methods can achieve satisfactory performance and are more efficient in training and inference compared to DNN models. However, a limitation is that features need to be manually designed and gathered.

• Deep learning-based algorithms have become the mainstream for any object detection task in the past ten years. The case also applies to metal surface defect detection.

A. DATASET

We have developed a metal surface defect dataset that consists of a total of 1064 images. There are two types of defects, namely, scratch and crash, that were manually made during the image collection. Table 1 shows the statistics of the dataset, split into training (800 image samples), validation (100 samples), and test (164 samples) sets. Each identified defect object has been marked with a bounding box, serving as a label for training. A total of 3,765 scratch and 460 crash objects have been annotated across the dataset. The number of defect objects varies from image to image. We intentionally created fewer crash defects than scratch defects since the latter are easier to make and thus more commonly seen in the real world. Figure 1 shows two image samples in the dataset. Subfigure (a) shows an image with five scratch defects, and subfigure (b) is an image with only one crash defect. It can be seen from subfigure (a) that the scratch defects vary in size, shape, and orientation. Also, to facilitate annotation, several scratches that are close to each other are marked with the same bounding box, indicating a single scratch instance, as shown in the top left bounding box of subfigure (a).

The original PANet utilizes a simple addition operation to perform inter-path and cross-scale feature fusion, while YOLOv4 replaces the addition with a concatenation. We argue that not only should information from different levels of the paths be preserved, but the relative importance of pixels should also be captured and utilized as feature maps are fused. To further enhance feature fusion, we design a Self-dependent Attentive Fusion (SAF) block (marked in red blocks in Figure 2) to replace the concatenation fusion in YOLOv4.

An SAF block takes as input two feature maps, denoted by F_a and F_b, one (say, F_a) from the same path as the SAF block and the other (say, F_b) from the adjacent path. In the neck of YOLOv4, we place four SAF blocks, two for each path, as shown in Figure 2. Let both feature maps be of size (W, H, C), where W, H, and C refer to the width, height, and depth; the output of SAF, namely, the fused feature, denoted by F_o, is of size (W, H, 2C). The internal design of an SAF block is depicted in Figure 3 and can be formally described in Equations (1) and (2).
In Equations (1) and (2), + and ⊗ refer to pointwise addition and multiplication, respectively, and [;] refers to concatenation.

We provide several design considerations for the SAF block. First, an SAF block only relies on F_b in the calculation of the attention score, making it self-dependent. Second, the reason why F_b is used for attention calculation is that F_b is closer to the backbone network and the original image, retaining more semantic information, while F_a undergoes more layers such as up/down sampling, which may cause information loss. Third, our empirical results show that it is sufficient to make the attention score s a one-channel tensor, which also reduces computational cost.

E. COMPONENT RANDOMIZED MOSAIC AUGMENTATION
The original Mosaic data augmentation strategy employed in YOLOv4 can be summarized in three steps: 1) four images from the training set are randomly selected; 2) for each image, a random transformation is selected and applied to obtain a transformed image; 3) the four transformed images are stitched together to fit the pre-defined scale of an input image.

After passing through a transformation algorithm, the size of an image may be changed, and the objects marked within a bounding box may suffer distortion or be cropped, resulting in information loss and affecting the prediction accuracy. It has been pointed out that a certain degree of augmentation does enhance generalization, while over-transformation can hurt it. In CRMA, an over-transformed component is replaced by a gray image of the same size; whether a component counts as over-transformed is decided based on the ratio of its actual area to the original area, combined with the aspect ratio. The gray image uses (127, 127, 127) RGB values to fill all pixels of the image without specific semantic information. This way, none of the pixels of the gray image provide any useful information to the learning algorithm. As such, an over-transformed image becomes a gray image that does not participate in training, and the noise it would otherwise introduce is avoided.

FIGURE 2. Overall workflow of the proposed method. The original input images are augmented via the CRMA strategy to generate more diversified but not over-transformed images, which are fed into a YOLOv4 model enhanced with SAF blocks. During training, one-hot encoded labels are converted to soft labels via the label smoothing strategy as a means of regularization.
A weighting coefficient, set to 0.01 in our experiments, is used to adjust the relative importance between the two terms, so that the area ratio remains the main reference.
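The replacement rule can be sketched as below. The exact scoring formula is an assumption inferred from the description (area ratio as the dominant term, aspect-ratio change weighted by 0.01); the threshold value and function names are hypothetical.

```python
import numpy as np

GRAY = 127  # RGB fill value for the semantics-free replacement image

def over_transform_score(orig_shape, new_shape, weight=0.01):
    """Hypothetical score of how strongly a transform altered a component.

    Combines the area ratio with an aspect-ratio change term; `weight`
    (0.01, per the paper) keeps area as the dominant criterion. The
    precise combination is an assumption, not the paper's formula.
    """
    oh, ow = orig_shape[:2]
    nh, nw = new_shape[:2]
    area_ratio = (nh * nw) / (oh * ow)
    aspect_change = abs((nw / nh) - (ow / oh))
    return area_ratio - weight * aspect_change

def maybe_replace_with_gray(img, orig_shape, threshold=0.2):
    """Replace an over-transformed component with an all-gray image so
    none of its pixels contribute information during training."""
    if over_transform_score(orig_shape, img.shape) < threshold:
        return np.full_like(img, GRAY)
    return img
```

For instance, a component shrunk to a small fraction of its original area falls below the threshold and is swapped for the gray placeholder, while a lightly transformed component passes through unchanged.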

In this section, we first define the performance metrics used for evaluation and then present the experimental results.

For each individual class, we can compute an AP value.

The mAP is then calculated as the mean of APs across all classes, as shown in Equation 9.

In addition to the quantitative results, we also display some qualitative results in Figure 7. Three images with detected defects are shown. Subfigure (a) shows a mix of scratch and crash defects, subfigure (b) only has scratches, and subfigure (c) only contains crash instances. It is observed that our model can detect the majority of defects with accurate bounding boxes. However, minor detection issues (marked by the numbers) are also present and include the following.
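The computation described above can be sketched as follows. The mAP step mirrors Equation 9 (mean of per-class APs); the AP step uses all-point interpolation of the precision-recall curve, which is an assumption since the excerpt does not state the interpolation scheme.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the interpolated precision-recall curve.
    All-point interpolation is assumed (not stated in the paper)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum area over segments where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: the mean of per-class AP values (Equation 9)."""
    return sum(ap_per_class) / len(ap_per_class)
```

With the two defect classes here (scratch and crash), the reported mAP is simply the average of the two class-wise APs.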

• Numbers 1 and 4 appear to be small scratches that were not detected. After checking the GT, we noticed that these two defects were not annotated and thus also missed by our model. Annotation mistakes happen from time to time. In this case, the two defects are hard to notice even with the human eye. The quantity of these hard instances is not sufficient for our model to learn useful patterns during training.

• Number 2 is the extension of the scratch beneath it. The GT includes the extension while our model missed it.

• Number 3 is a shallow crash that is marked by a GT but missed by our model.

The study has the following limitations that also suggest our future directions. First, the CRMA scheme demonstrates decent performance on the self-developed dataset, while its usage and effect on other object detection tasks remain to be explored. Second, it would be interesting to investigate different learning paradigms such as consistency training and knowledge distillation. The former brings in weakly supervised learning with image augmentation. The latter involves a teacher and a student model, which can be of different neural architectures (e.g., CNN and vision transformer) to enhance model diversity; the teacher model is trained first, and the student can then be trained by taking into account a distillation loss that allows the teacher to transfer knowledge to the student. This way, the student can incorporate features from both the teacher and itself, leading to a potential performance gain. Lastly, the dataset can be further enhanced with more defect classes that commonly appear in industry.