LFF-YOLO: A YOLO Algorithm With Lightweight Feature Fusion Network for Multi-Scale Defect Detection

The detection of defects is indispensable in industrial production. Surface defects have different scales: both minimal flaws and significant scratches may appear on the same product. The standard method uses a multi-scale feature fusion network, which introduces many parameters that may reduce the inference speed. In actual industrial production scenarios, inference speed and accuracy are equally important. Therefore, we propose an algorithm that improves detection speed while also improving detection accuracy. The proposed model is called “YOLO with lightweight feature fusion network” (LFF-YOLO). First, we use ShuffleNetv2 as the feature extraction network to reduce the number of parameters. Then, to improve the efficiency of multi-scale feature fusion, we propose the lightweight feature pyramid network (LFPN). A fixed receptive field adapts poorly to defects of different scales, which may make the model difficult to converge and seriously degrade detection performance. Therefore, we propose the adaptive receptive field feature extraction (ARFFE) module, which weights the multi-receptive field channels to generate multi-receptive field information. In addition, focal loss is used to address the imbalance between positive and negative samples. Finally, we conducted experiments on NEU-DET (79.23% mAP), the Peking University printed circuit board defect dataset (93.31% mAP), and GC10-DET (59.78% mAP). Extensive experiments show that our proposed method achieves the fastest detection speed among the prevailing methods compared, and its detection accuracy is also highly competitive. We open-source our code at the following URL: https://github.com/syyang2022/LFF-YOLO


I. INTRODUCTION
Defect detection is an indispensable part of industrial production. Manual defect detection is inefficient, and subjective factors affect its accuracy. Recently, defect detection methods based on computer vision technology have gradually replaced manual defect detection. Traditional computer vision surface defect detection methods are mainly feature-based. This approach relies on manually designed algorithms to extract defect features, resulting in poor robustness and generalization of the model. Deep learning methods compensate for this shortcoming. Convolutional neural networks can capture high-level semantic features, so the resulting models are more robust and generalize better than traditional methods. Convolutional neural networks have become a very important method in industry [1], [2], [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Marco Martalo.
With the rapid development of deep learning techniques, many excellent target detection algorithms have emerged, and they have been applied in the field of defect detection. Two-stage target detection algorithms such as R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6] are based on region proposals, which improves detection accuracy at the expense of detection speed. One-stage target detection algorithms such as YOLO [7], [8], [9], [10], SSD [11], and RetinaNet [12] simplify the network design by using only one network to classify and localize the target, thus significantly improving the detection speed. One-stage detection algorithms are therefore better suited to real-time detection tasks. YOLOv3 [9], as a classical one-stage network, is the most widely used in industrial scenarios due to its stability. Therefore, we propose an inspection method based on YOLOv3 and improve it to make it easier to deploy on terminal devices. Metal surface defects, such as steel surface defects (Figure 1), are more difficult to detect because their scales vary widely. In the feature extraction network, features extracted by the shallow layers have richer fine-grained information, such as color, texture, and other details, which can help the model identify the type of defect. Features in deeper layers mainly contain semantic information, which is the critical information for locating the defect. YOLOv3 uses the FPN [13] network to fuse multi-scale information, utilizing three feature layers for detecting small targets but not fusing the upper-layer feature information for detecting large targets. PANet [14] uses a bidirectional link structure so that each detection head utilizes information from multiple feature layers, which improves the detection accuracy of large targets.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, neither FPN nor PANet was designed for industrial inspection, so inference speed was not a design consideration. Because of their many parameters, most existing detection models have high deployment costs. We aim to design a model with fewer parameters and better performance based on a mature, widely used framework. In this paper, we propose LFPN, a more efficient feature fusion network for defect detection tasks requiring real-time performance, which improves detection accuracy while introducing few parameters.
Lightweight networks such as ShuffleNet [15], [16], MobileNet [17], [18], and GhostNet [19] have enabled feature extraction networks to reduce the number of parameters without reducing accuracy. ShuffleNetv2 improves on ShuffleNetv1 by considering both the number of model parameters and the impact of memory access on inference speed. Therefore, compared with other lightweight networks, it has faster inference speed in practical applications. In this paper, ShuffleNetv2 is chosen as the feature extraction network of the model, and an adaptive receptive field feature extraction module is designed to improve the feature extraction capability.
We propose a more lightweight multi-scale feature fusion network. An adaptive receptive field feature extraction module is used to increase the feature extraction capability of the network, and the k-means algorithm is used to cluster the anchor boxes to obtain a more reasonable anchor design. In addition, focal loss is used to address the imbalance between positive and negative samples. The main contributions of this article are as follows.

1) We propose a new model to address two problems in industry: metal surface defects span an extremely wide range of scales, and defect detection is inefficient. The baseline model is the most widely used object detection model, YOLOv3.
2) To make the model easier to deploy to the detection terminal, the more lightweight network ShuffleNetv2 is used to extract defect features, but this introduces the problem that the network's receptive field is fixed. Therefore, we propose the ARFFE module to increase the receptive field of the feature extraction network so that the model can extract defect information more effectively under an appropriate receptive field.
3) Considering the importance of detection speed in practical application scenarios, we propose a novel lightweight feature fusion network, LFPN, for multi-scale defects on metal surfaces. It effectively fuses multi-scale features while introducing only a few parameters, thus improving detection accuracy.
4) Experiments on the open defect detection dataset NEU-DET [20] show that the proposed method detects defects quickly and effectively and offers a better balance of speed and accuracy than the other methods compared. The generalization ability of the proposed method is verified on the printed circuit board defect dataset [21] and the steel plate surface defect dataset GC10-DET [22].
The rest of the article is structured as follows. Related work is presented in Section 2. Our proposed approach, including the lightweight feature fusion network and the adaptive receptive field feature extraction module, is presented in Section 3. Experiments on NEU-DET, the printed circuit board open dataset, and the steel plate surface defect dataset GC10-DET are reported in Section 4 to verify the method's validity. Finally, conclusions are given in Section 5.

II. RELATED WORK
The primary defect detection methods currently used in the industry include traditional methods and deep learning-based methods.

A. TRADITIONAL METHODS
Traditional defect detection methods mainly extract image features through image preprocessing, such as histogram equalization, grayscale binarization, filtering, and denoising. Defects are then classified and detected using morphology, Fourier transforms, Gabor transforms, and machine learning methods. For example, Prasitmeeboon [23] used color histograms and SVM to detect particle board defects and used thresholding and smoothing techniques to localize the faults. Chang et al. [24] implemented defect detection on the camera lens surface based on polar coordinate transform, Hough circle transform, weighted Sobel filter, and SVM. Wang and Zuo [25] used Fourier and Hough transforms to reconstruct the magnet surface image and detected defects by comparing the grayscale difference between the reconstructed image and the original image. These traditional methods require manual feature extraction and have poor robustness.

B. DEEP LEARNING METHODS
In recent years, the rapid development of deep learning has led to its increasingly widespread application in defect detection. Compared with traditional methods, deep learning methods do not require manual feature extraction. Instead, they update their parameters by learning from data, extract features automatically, and feed them into subsequent networks for classification and localization prediction. This avoids the complex process of manually designing algorithms and yields high robustness and accuracy.
The currently available target detection algorithms can be divided into one-stage and two-stage networks. Among two-stage networks, Faster R-CNN was proposed in 2016 to improve on R-CNN and Fast R-CNN. It uses a region proposal network (RPN) instead of the previous selective search: the RPN is trained on the input feature map to output a series of candidate regions with initial object classification probabilities for more accurate localization of objects, resulting in improved network speed and accuracy. Zhao et al. [26] improved the traditional Faster R-CNN by reconstructing the network structure using multi-scale feature fusion and replacing part of the convolution with deformable convolution. The network was used for steel surface defect detection and reached 75.2% mAP on the NEU-DET public dataset. Cha et al. [27] applied Faster R-CNN to concrete crack and steel corrosion defect detection. Su et al. [28] designed a complementary attention network that exploits the advantages of spatial location features and channel features while suppressing background noise features, and embedded it into Faster R-CNN to detect defects in solar cell electroluminescence images. Two-stage networks achieve satisfactory detection accuracy. However, their limited detection efficiency makes it challenging to meet the requirements of real-time scenarios such as industrial defect detection, so single-stage target detection networks have received more attention.
Single-stage target detection networks such as YOLOv3, SSD, and RetinaNet have an advantage in inference speed because their structure is simpler than that of two-stage networks. With the development of single-stage networks in recent years, the gap in detection accuracy relative to two-stage networks has essentially closed. Yin et al. [29] used YOLOv3 to detect sewer pipe defects and achieved 85.37% mAP. Zhang et al. [30] improved the original YOLOv3 by introducing a new transfer learning method for detecting concrete bridge defects, improving performance by 13% compared to the original YOLOv3. Yu et al. [31] improved YOLOv4-CSP to address the problem of small targets in industrial defect detection. They proposed an efficient stepped pyramidal network for fusing multi-scale features, thus improving the detection accuracy of small objects. Wang and Cheung [32] improved a CenterNet-based model by adding a count loss for detecting defects generated in the additive manufacturing process. Since the overall performance of single-stage detectors is higher than that of two-stage detectors, they are more widely used in industrial defect detection. In this paper, we choose YOLOv3 as the benchmark model and improve it to make it more suitable for industrial defect detection scenarios.

III. METHOD
In this section, we describe the method we used in detail, and the network structure is shown in Figure 2. ShuffleNetv2 is used as the backbone feature extraction network. An adaptive receptive field feature extraction module is inserted into the backbone network to obtain different receptive fields for defects of various sizes. Then a lightweight feature pyramid network is constructed to fuse defect features of different scales more efficiently. In addition, we use the K-means algorithm to cluster the size of anchor frames and focal loss to solve the problem of positive and negative sample imbalance.
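The two auxiliary techniques named above, k-means anchor clustering and focal loss, can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation. It assumes the clustering distance is 1 - IoU between width-height pairs (the usual choice for YOLO anchors), a deterministic initialisation chosen only for reproducibility, and the binary focal loss with the commonly used defaults alpha = 0.25 and gamma = 2. The function names and the example boxes are hypothetical.

```python
import math


def iou_wh(box, anchor):
    """IoU of two boxes given as (width, height) and aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (w, h) pairs with distance 1 - IoU; return k anchors sorted by area."""
    # Deterministic initialisation (assumption): evenly spaced boxes after sorting by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    anchors = [ordered[int(step * i + step / 2)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: iou_wh(box, anchors[j]))
            clusters[best].append(box)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append((sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members)))
            else:
                updated.append(anchors[j])  # keep the anchor of an empty cluster
        if updated == anchors:
            break
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted probability p (0 < p < 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical ground-truth box sizes in pixels: small spots and long scratches.
example_boxes = [(10, 10), (12, 11), (11, 12), (200, 40), (210, 38), (190, 42)]
print(kmeans_anchors(example_boxes, k=2))
print(focal_loss(0.1, 0), focal_loss(0.9, 0))
```

With these definitions, an easy negative (predicted 0.1 for a background sample) contributes far less loss than a hard negative (predicted 0.9), which is how focal loss counteracts the large number of easy background samples.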
FIGURE 2. Model structure diagram. $C_i$ denotes the feature layer of the backbone network, $P_i$ denotes the feature layer stacked with up-sampling, and $T_i$ represents the feature layer fused with multi-scale features. The input image is fed into the backbone feature extraction network after the ARFFE module, and the deepest feature layer is stacked with the upper layer after channel compression. $P_0$ is the feature layer stacked with $C_0$-$C_3$. Then feature fusion and weighting are performed by the downsampling and channel attention modules, and the result is finally output to the detection head for detection.

A. LIGHTWEIGHT FEATURE PYRAMID NETWORK
We have verified the inference speed of different lightweight networks on the NEU-DET dataset, and the experimental results are shown in Table 1. The experiments show that with GhostNet as the feature extraction network, the actual inference speed on GPU decreases even though the number of parameters is reduced. ShuffleNetv2 is designed with practical inference speed in mind, replacing part of the grouped convolution with ordinary convolution and using concatenation instead of addition to reduce element-wise operations. ShuffleNetv2 significantly reduces the number of parameters and achieves faster inference in actual operation. Hence, it is the most suitable feature extraction network for industrial defect detection models with demanding real-time requirements. Its structure is shown in Table 2.

B. LIGHTWEIGHT FEATURE PYRAMID NETWORK
As the network layers deepen, the feature map resolution decreases, and the features of smaller targets disappear. At the same time, the richer semantic information in the deep layers helps locate the target. The original YOLOv3 uses an FPN network structure to fuse multi-scale information and predicts objects of different sizes from three feature maps with different resolutions. It is worth noting that the original YOLOv3 does not fuse the feature information of the upper layers when using the bottom-layer feature map for prediction. Similarly, only the feature information of the bottom layer is fused into the middle layer. PANet uses a top-down and bottom-up bidirectional fusion network to connect the features of all prediction layers before prediction, but it also introduces more parameters, making inference slower. Therefore, this paper proposes a network structure that fuses features more quickly. First, the bottom-layer feature map $C_3$ is downsampled in the channel direction by grouped channel compression (as shown in Figure 3), then upsampled and stacked with $C_2$ to obtain $P_2$, whose number of channels per layer can be calculated by the formula.
Here, $k$ is the number of sub-groups, which is set to 4 in this paper, and $i$ denotes the feature layer of the feature extraction network, $i \in \{0, 1, 2\}$. Similarly, the $C_1$ and $C_0$ feature layers are stacked to finally obtain $P_0$. In the expression for this process, CAP denotes channel average pooling and $G_p^n$ denotes the grouped feature map, with $n = C_{P_{i+1}}/k$, $k = 4$, and $i \in \{0, 1, 2\}$. It is worth noting that this step introduces no parameters. Although channel compression requires computing a mean, the computation time is much lower than that of a convolution. A few convolutional layers are used after $P_0$ to learn how to fuse the stacked feature layers. At the same time, a channel attention mechanism is added because targets at different scales have different sensitivities to individual channel information. Only a small number of parameters are introduced for downsampling, which improves the inference speed of the whole model, and the detection accuracy is also improved because each prediction layer uses a feature map that fuses all the feature layers.
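The parameter-free stacking step can be illustrated with a short plain-Python sketch. It rests on several assumptions that the text does not confirm: channels are grouped as runs of $k$ adjacent channels, each group is averaged to one channel (so a layer with $C$ channels is compressed to $C/k$), upsampling is nearest-neighbour by a factor of 2, and the channel count therefore follows $C_{P_i} = C_{C_i} + C_{P_{i+1}}/k$ with $P_3 = C_3$. The function names and the channel counts in the example are hypothetical and are not ShuffleNetv2's actual widths.

```python
def channel_average_pool(feature, k=4):
    """Average every run of k adjacent channels, reducing C channels to C // k.

    feature is a list of C channels, each an H x W nested list of floats.
    """
    channels = len(feature)
    if channels % k != 0:
        raise ValueError("channel count must be divisible by k")
    height, width = len(feature[0]), len(feature[0][0])
    pooled = []
    for start in range(0, channels, k):
        group = feature[start:start + k]
        pooled.append([[sum(ch[y][x] for ch in group) / k for x in range(width)]
                       for y in range(height)])
    return pooled


def upsample_nearest(feature, scale=2):
    """Nearest-neighbour spatial upsampling of every channel (assumed mode)."""
    return [[[row[x // scale] for x in range(len(row) * scale)]
             for row in channel for _ in range(scale)]
            for channel in feature]


def stack(c_i, p_next, k=4):
    """P_i = concat(C_i, upsample(CAP(P_{i+1}))) along the channel axis."""
    return c_i + upsample_nearest(channel_average_pool(p_next, k))


def stacked_channels(backbone_channels, k=4):
    """Channel counts of P_i for i = 2, 1, 0 given the counts of [C_0, C_1, C_2, C_3]."""
    counts = {}
    previous = backbone_channels[-1]  # the deepest layer C_3 plays the role of P_3
    for i in range(len(backbone_channels) - 2, -1, -1):
        previous = backbone_channels[i] + previous // k
        counts[i] = previous
    return counts


# Hypothetical backbone widths, chosen so that every division by k is exact.
print(stacked_channels([32, 64, 128, 256]))
```

No weights appear anywhere in these functions, which matches the statement that the compression and stacking introduce no parameters; learning only starts in the convolutional layers applied after $P_0$.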

C. ADAPTIVE RECEPTIVE FIELD FEATURE EXTRACTION
The receptive field is the size of the region on the previous layer's feature map to which each location of a convolutional layer's output feature map is mapped. A large receptive field improves the network's performance on classification tasks. However, for the target detection task, the receptive field size should correspond to the anchor settings to get better performance. With an overly large receptive field, a small target occupies too small a part of the perceived area and is ignored as background, resulting in poor detection of small objects. An overly small receptive field captures too much local information and loses global information, which affects the recognition of objects. In the defect detection task, anchor sizes differ greatly because of the multi-scale nature of the defects. This paper proposes an adaptive receptive field feature extraction module (shown in Figure 4), which can be easily inserted at any position of the feature extraction network. The specific process can be expressed as follows.
Here, GAP denotes global average pooling. Features are extracted from the input $P_0$ by 3×3 convolutions with different receptive fields and then stacked together. Then the number of channels is reduced to that of the input by 1×1 convolution. The convolutions with three different receptive fields correspond to the subsequent predictions on the feature layers of three resolutions. $P_2$ contains the feature information of the different receptive fields. Since targets of different scales are not equally sensitive to the feature information of different receptive fields contained in the channels, channel attention is used to weight the feature information. Finally, a shortcut connection preserves the information of the original feature map to prevent information loss. The final output $P_3$ contains the original information and the weighted multi-receptive field information.
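Two parts of the module description lend themselves to a small plain-Python sketch: how a dilation rate changes the area covered by a 3×3 kernel, and how channel weighting is combined with the shortcut. The sketch is illustrative only. The dilation rates 1, 2, and 3 are hypothetical, the convolutions themselves are not implemented, and the attention weights are taken as a sigmoid of GAP($P_2$) with the learned layers of a real channel-attention block left out.

```python
import math


def effective_kernel(kernel=3, dilation=1):
    """Side length of the area covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def global_average_pool(feature):
    """One scalar per channel: the mean over all spatial positions."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature]


def weight_and_shortcut(p0, p2):
    """P_3 = P_0 + w * P_2 with per-channel weights w = sigmoid(GAP(P_2)).

    Assumption: the learned layers between GAP and the sigmoid are omitted,
    so this shows the data flow only, not a trainable attention block.
    """
    weights = [1.0 / (1.0 + math.exp(-g)) for g in global_average_pool(p2)]
    return [[[a + w * b for a, b in zip(row0, row2)]
             for row0, row2 in zip(ch0, ch2)]
            for ch0, ch2, w in zip(p0, p2, weights)]


# Hypothetical dilation rates for the three 3x3 branches.
for rate in (1, 2, 3):
    print(rate, effective_kernel(3, rate))
```

Because the weighted features are added to $P_0$ rather than replacing it, the output keeps the same shape as the input, which is what allows the module to be inserted at any position in the backbone.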

IV. EXPERIMENTS

A. EXPERIMENTAL ENVIRONMENT
FIGURE 4. Schematic structure of the adaptive receptive field feature extraction module. $P_0$ represents the input feature layer, whose features are concatenated along the channel direction after dilated convolutions with different rates. The channels are compressed by 1×1 convolution, and finally the output $P_3$ is obtained by adding the channel-weighted features from the channel attention module to the original input. Output $P_3$ and input $P_0$ have the same dimensionality.
The experimental hardware platform is an i5-10400F CPU and an NVIDIA GeForce RTX 3060 Ti GPU. We build our model with PyTorch 1.11 and CUDA 11.6, and experiments are conducted on Windows 10 using PyCharm. We use mAP as the evaluation index for model accuracy. Model parameters, FLOPs, and FPS are the evaluation indexes for model speed. See Table 3 for the model parameter settings and Table 4 for a comparison of model parameters.
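For readers unfamiliar with the accuracy metric, the following sketch shows one common way to compute average precision (AP) for a class and the mean over classes (mAP). It is an assumption that the all-point interpolated area under the precision-recall curve is used, since the text does not state the variant. The matching of detections to ground truth (for example by an IoU threshold) is assumed to have been done beforehand, and the function names are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve with all-point interpolation.

    scored_hits: list of (confidence, is_true_positive) pairs for one class.
    num_gt: number of ground-truth boxes of that class.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Hypothetical detections for one class with two ground-truth boxes.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```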

B. DATASETS
We use three datasets, each divided in the same way. The ratio of the training set to the validation set is 9:1, and the ratio of the training and validation sets combined to the test set is 8:2, as detailed in Table 5. The images are randomly augmented before being input into the network. The data augmentation methods include random flipping and color gamut transformation.
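The two nested ratios can be turned into split sizes with a few lines of Python. The sketch below is only a worked example of the arithmetic: 8:2 followed by 9:1 gives 72% training, 8% validation, and 20% test. The total of 1,800 images is the commonly cited size of NEU-DET and is used here as an illustrative input. Table 5 remains the authoritative source for the actual counts, and the function name is hypothetical.

```python
def split_counts(total, trainval_parts=(8, 2), train_parts=(9, 1)):
    """Return (train, val, test) sizes for nested integer ratios."""
    trainval = total * trainval_parts[0] // sum(trainval_parts)
    train = trainval * train_parts[0] // sum(train_parts)
    return train, trainval - train, total - trainval


# Illustrative total: 1,800 images.
print(split_counts(1800))
```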

1) MODEL PERFORMANCE COMPARISON
To validate the effectiveness of the model, we first compared it on the NEU-DET dataset with conventional target detection networks, including the one-stage detection networks YOLOv4, EfficientDet [33], RetinaNet, and SSD and the two-stage network Faster R-CNN, as well as with other steel surface defect detection models such as ES-Net, DCC-CenterNet [34], DEA_RetinaNet [35], and Improved Faster R-CNN [26]. Then, to verify the generalization performance of the model, we conducted experiments on the PCB defect dataset and GC10-DET. All experiments were performed on the same hardware platform, and the experimental results are shown in Tables 6-8. As can be seen from Table 9, our model has the fastest inference speed and the lowest computational complexity. Its mAP is not the best, but the gap to the best model, DCC-CenterNet, is only 0.18%, which is almost negligible in practical application scenarios. At the same time, our model has only 60.51M parameters, less than half of DCC-CenterNet, and its inference speed is the fastest at 63.24 FPS, which indicates that our model is more valuable for practical applications.
YOLOv3, YOLOv4, and EfficientDet use FPN, PANet, and BiFPN as feature fusion networks, respectively. YOLOv4, which uses the bidirectional fusion network PANet, improves on YOLOv3 by 3.9%, which indicates that fusing feature layers with multi-scale features before prediction can improve detection accuracy. However, the number of parameters increases, and the inference speed decreases. Our model uses the proposed LFPN as its feature fusion network and improves by 9.29% compared to the benchmark model YOLOv3 and by 5.39% and 9.13% compared to YOLOv4 and EfficientDet, respectively. This confirms that our model structure is superior to the above three models in terms of parameter count and detection performance.
Compared with other one-stage detection models such as ES-Net, RetinaNet, DEA_RetinaNet, and SSD, our model's mAP improves by 0.13%, 19.02%, 0.98%, and 12.16%, respectively. Compared with the two-stage networks Faster R-CNN and Improved Faster R-CNN, our model has a massive advantage in inference speed, with 4.39 times higher FPS than Faster R-CNN, while the detection accuracy is improved by 15.81% and 4.03%, respectively. Our model thus performs well even against two-stage networks, which are known for their detection accuracy.
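The gaps quoted in the two paragraphs above can be cross-checked with simple arithmetic. The sketch below assumes that every quoted improvement is an absolute difference in mAP percentage points relative to the 79.23% reported for LFF-YOLO. The implied values are derived only from the numbers in the text, not read from Table 9, and the helper name is hypothetical.

```python
OURS_MAP = 79.23  # NEU-DET mAP reported for LFF-YOLO

# Quoted gaps in mAP percentage points (LFF-YOLO minus the other model).
GAPS = {
    "YOLOv3": 9.29,
    "YOLOv4": 5.39,
    "EfficientDet": 9.13,
    "ES-Net": 0.13,
    "RetinaNet": 19.02,
    "DEA_RetinaNet": 0.98,
    "SSD": 12.16,
    "Faster R-CNN": 15.81,
    "Improved Faster R-CNN": 4.03,
}


def implied_map(model):
    """mAP implied for a model by the gap quoted in the text."""
    return round(OURS_MAP - GAPS[model], 2)


# Check 1: the YOLOv4 over YOLOv3 gain should match the quoted 3.9.
print(round(implied_map("YOLOv4") - implied_map("YOLOv3"), 2))

# Check 2: Improved Faster R-CNN should match the 75.2% cited in related work.
print(implied_map("Improved Faster R-CNN"))
```

Both checks agree with figures stated elsewhere in the article (a 3.9 gain for YOLOv4 over YOLOv3, and 75.2% mAP for the improved Faster R-CNN of Zhao et al. [26]), which supports reading the percentages as absolute mAP differences rather than relative changes.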
To investigate the detection capability of our model for different kinds of defects, we compared the per-class results with those of other models on the NEU-DET dataset, and the results are shown in Table 10. It can be seen that the detection of each type of defect is improved compared with the benchmark model. Crazing has relatively fuzzy boundaries, which makes it challenging for the detection model to locate the defect. The original YOLOv3 has an AP of only 28.14% for Crazing, which means such defects are almost undetectable. With the addition of the ARFFE module, the feature extraction capability of the backbone network is enhanced, the detection of defects such as Crazing improves markedly, and the AP increases by 16.97%. Compared with DEA_RetinaNet, there is still a gap of 15.82%. This is because DEA_RetinaNet adds a difference extraction block between the backbone network and the feature fusion network to reduce the loss of information, whereas our approach has no advantage for defects whose scale varies little and whose boundaries are hard to distinguish. However, for defects such as scratches, which vary greatly in scale, our model has a more flexible receptive field and efficient information fusion capability, which makes its detection capability significantly higher than that of DEA_RetinaNet: the AP improves by 22.08% and is also the best value among all models. The model can easily classify defects such as Patches because of their distinct characteristics. The LFPN network incorporates more feature map layers, allowing the model to better utilize global and local information for defect localization, so we also achieve the best detection results for Patches. However, the 8.99% gap to the best-performing DCC-CenterNet on defects such as Rolled-in_scale leads to a final model mAP slightly lower than that of DCC-CenterNet.
Finally, we ran experiments on the PCB defect dataset and GC10-DET to verify the model's generalization ability. The defects in the PCB defect dataset are characteristically small. Our model obtained 93.31% mAP, which is only 4.19% lower than the best-performing ES-Net and still better than the other models. Even though our model is not designed explicitly for small target detection, it can still detect such small defects accurately. This is due to the excellent feature fusion capability of LFPN, which effectively fuses global and local information to identify and locate small defect targets. Our model achieves 59.78% mAP on the GC10-DET dataset, which is only 2.15% away from the best-performing DCC-CenterNet, and still has a significant advantage over other mainstream models. Across the different datasets, our model shows solidly competitive detection performance while maintaining the fastest inference speed, which indicates that it has strong generalization capability. At the same time, it achieves the best balance between inference speed and detection performance.

2) ABLATION EXPERIMENT
To evaluate the contribution of each module to the model, we set up ablation experiments to assess and analyze the backbone network ShuffleNetv2, the proposed lightweight [...] of targets with varying sensitivities to each feature channel. Comparing the data in rows 4 and 5 of the table, the results show that adding the channel attention module during downsampling can improve the model performance while introducing only a small number of parameters.
4) ARFFE module: The ARFFE module is added to the shallow layer of the feature extraction network, as shown in Figure 2. Comparing the last two rows of data in the table, after adding the ARFFE module, the model parameters only increased from 60.46M to 60.51M, whereas the mAP rose from 76.73% to 79.23%, which verifies the effectiveness of the ARFFE module. Since the ARFFE module can be inserted at any position in the network, we also tried placing it in the deeper layer of the feature extraction network. Although the model's accuracy improved, the inference speed dropped significantly because more parameters are introduced in the deeper layers, which have more channels. The ARFFE module is therefore placed in the shallow layer of the feature extraction network in this paper.
[...] Figure 9 represent the visualized detection results of our model on the NEU-DET dataset, PCB defect dataset, and GC10-DET dataset, respectively. Due to the high resolution of PCB defect images, only a portion containing defects is shown in the figure. We set a threshold value of 0.5, and a prediction box is drawn only when its score exceeds 0.5. It can be seen that our model makes accurate predictions for defects of various scales in the NEU-DET and GC10-DET datasets. The GC10-DET dataset has a large image resolution, and the scale of its defects spans a wide range, from tiny targets such as punching-hole with only a few tens of pixels to those like welding-line that span the entire picture.
For defects spanning such a wide range of scales, our model has better detection capability because LFPN incorporates features of different scales before prediction, and the ARFFE module enriches the feature map with information from multiple receptive fields. In addition, the defects in the PCB defect dataset are very small, and our model still detects such small-scale targets well.

V. CONCLUSION
In this paper, we aim to solve the industrial defect detection problem by improving detection accuracy while guaranteeing inference speed. For this purpose, we designed our model based on YOLOv3. First, we used a faster feature extraction network, ShuffleNetv2, to replace the original DarkNet53. To accommodate defects at different scales, we designed the ARFFE module to obtain features with adaptive receptive fields. Then, to improve the fusion efficiency of multi-scale features, we proposed the LFPN network, which enhances the detection accuracy of the network while introducing few parameters. The experimental results show that our model reaches 79.23% mAP on the NEU-DET dataset, which is 9.29% higher than the benchmark model. Moreover, we validated the generalization ability of our model on the PCB defect dataset and the GC10-DET dataset, reaching 93.31% and 59.78% mAP, respectively. Meanwhile, our model has the fastest inference speed, reaching 63.24 FPS on the NEU-DET dataset, which suggests that our model will be beneficial in real industrial application scenarios. Our future work will target model size compression, e.g., using model distillation and pruning methods.