Trident-LK Net: A Lightweight Trident Structure Network With Large Kernel for Multi-Scale Defect Detection

Identifying defects at different scales is a challenge in industrial defect detection. To address this problem, many multi-scale feature fusion networks have been proposed that improve multi-scale detection accuracy by fusing fine-grained information from shallow layers with semantic information from deep layers, at the cost of extra parameters. Approaching the problem from another direction, we ask: can multi-scale detection accuracy be improved by fusing feature information obtained under different receptive fields? To this end, we designed a three-layer network structure called Trident-LK Net. Our model uses convolutional kernels of different sizes (31×31, 25×25, 1×1) in the feature extraction phase and establishes cross-fusion connections between the layers. This omits a separate feature fusion stage and greatly reduces the number of network parameters while maintaining good detection accuracy. Finally, we perform experiments on the NEU-DET and GC10-DET datasets to verify the feasibility of our idea. While keeping the number of parameters to a minimum, our model achieves competitive detection results on the NEU-DET dataset (76.9% mAP) and the best result on the GC10-DET dataset (63.55% mAP). Our code will be publicly available at https://github.com/syyang2022/Trident-LK-Net.


I. INTRODUCTION
In the industrial production process, defect detection of products is an indispensable step [1], [2]. Influenced by the production environment, equipment, and other factors, various defects appear on the surface of metal products; common examples are shown in Figure 1. Some of these defects are very small, covering only a few dozen pixels, while others are very large, covering hundreds or even thousands of pixels. In the deeper layers of convolutional neural networks, the receptive field increases and information about smaller defect targets is diluted, which poses a challenge to automated detection [3].
The development of defect detection has gone through a process from manual inspection to traditional computer vision. Manual inspection methods are often less reliable and more costly due to subjective factors. Traditional machine vision methods such as [4] and [5] require manually designed algorithms to extract features, resulting in poor model robustness and generalization. The development of convolutional neural networks has brought new solutions to defect detection, and researchers have found that trained convolutional neural networks are better able to extract local and global features, which makes the models more expressive [6]. Classical target detection networks such as Faster R-CNN [7] and YOLOv3 [8] do not require manual feature extraction; they automatically extract features and output detection boxes after training on the input data, which improves the generalization ability of the model. These two models are representative of two-stage and one-stage detectors, respectively. Compared to two-stage detection algorithms, one-stage algorithms do not require the generation of region proposals and are faster, but they still do not meet the requirements of industrial defect detection due to their huge number of parameters [9]. Therefore, many researchers have improved these general-purpose networks, reducing their computational complexity through various techniques to meet the real-time requirements of industrial scenarios [10], [11]. In recent years, the attention-based Transformer [12] has been used more and more widely in computer vision, and the classification model ViT [13] and the detection model DETR [14], both built on the Transformer, have achieved good results; DETR has shown particularly good performance in detecting large targets.

Recently, detection models based on attention mechanisms have also been applied to the field of defect detection. For example, [15] introduced dynamic anchor boxes and an improved multi-scale deformable attention module into DETR to improve the detection accuracy of small targets. These DETR-based defect detection models have achieved good results in terms of detection accuracy [16], [17], [18]. It has been shown that the success of DETR depends on dense pixel computation, which gives the model a large receptive field [19]. However, it is exactly these dense computations that make the computational complexity of DETR-based models enormous. Can similarly good results be achieved by using large convolution kernels instead of dense pixel computation to obtain large receptive fields? The experiments in [19] suggest so: with a very large 31×31 convolutional kernel, the model is able to detect targets more accurately. In other words, a larger receptive field can supply global feature information when detecting large objects. By the same reasoning, the detection of small objects should rely on finer-grained information and thus use smaller convolutional kernels to extract it. Therefore, we designed a three-layer network structure in which each layer stacks modules built from convolutional kernels of a different size, and cross links are established between the layers for information fusion [20], so that the model can identify targets of different sizes more accurately. Considering the real-time requirements of industrial inspection, we use depthwise separable convolution [21], [22] to build our modules and add channel shuffle [23], [24] to allow information exchange between channel groups. This keeps the number of parameters in our model small.

The main contributions of this paper are as follows:
1) A novel lightweight network is proposed, using depthwise separable convolution and a design without a separate feature fusion network, in order to meet the real-time requirements of industrial inspection.
2) A method combining convolutional kernels of different sizes is proposed. The very large 31×31 convolutional kernel in the top-layer network increases the receptive field, and fusing feature information between the layers improves the accuracy of multi-scale defect detection. Experiments are designed to verify that the combined use of large and small convolutional kernels improves detection performance.
3) Experiments were conducted on the NEU-DET [25] and GC10 [26] steel surface defect datasets. Compared with other models, the proposed method detects defects quickly and effectively, demonstrating its superiority.

The rest of the paper is structured as follows. Section II describes related work and introduces existing defect detection methods. The new lightweight detection network, Trident-LK Net, is presented in Section III. Section IV reports experiments on the NEU-DET and GC10 datasets to verify the effectiveness of the method. Finally, conclusions are given in Section V.

(The associate editor coordinating the review of this manuscript and approving it for publication was Bo Pu.)

II. RELATED WORK
The defect detection methods currently used in industry fall into two main categories: traditional methods and deep learning-based methods.

A. TRADITIONAL METHODS
Traditional defect detection methods are mainly feature-based. Ren et al. [27] proposed an electrical resistance tomography (ERT) method for detecting defects in cementitious materials, using a color-histogram similarity measure for evaluation. Song et al. [28] proposed a method based on image-block percentile color histograms and texture feature classification for detecting wood surface defects. Chang et al. [29] implemented defect detection on camera lens surfaces based on the polar coordinate transform, the Hough circle transform, a weighted Sobel filter, and an SVM. All these traditional methods rely on manually designed algorithms to extract features and exhibit poor robustness.

B. DEEP LEARNING METHODS
In recent years, deep learning has developed rapidly, and many excellent target detection methods have been proposed. Target detection algorithms are divided into one-stage and two-stage algorithms. A two-stage network first preprocesses the image and outputs a series of candidate regions and classification probabilities to be fed to the next stage of the network, which allows the network to localize objects more accurately; examples include R-CNN [30], Fast R-CNN [31], and Faster R-CNN. However, the more complex network structure tends to make them slower. One-stage algorithms simplify the network structure by using a single network to process the input image and directly output the prediction boxes; examples include YOLOv3, YOLOv4 [32], SSD [33], and RetinaNet [34].
131074 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In the field of defect detection, many networks improved from the above architectures have been proposed. Zhao et al. [35] used deformable convolution to improve the traditional Faster R-CNN and applied it to steel surface defect detection, achieving 75.2% mAP on the public NEU-DET dataset. Cha et al. [36] applied Faster R-CNN to concrete crack and steel corrosion defect detection. Yin et al. [37] used YOLOv3 to detect pipeline defects and obtained 85.37% mAP. Zhang et al. [38] improved YOLOv3 using a transfer learning approach and applied it to concrete bridge defect detection, obtaining a 13% performance improvement. Yu et al. [39] improved YOLOv4-CSP to increase the detection accuracy of small objects. Wang et al. [40] added a count loss to an improved CenterNet-based model for detecting defects generated during additive manufacturing.

III. METHOD

A. BASIC UNIT
Trident-LK Net consists of the basic modules shown in Figure 2. To keep the computational complexity of the model as low as possible, we extract features in the shallow layers of the network by repeatedly stacking ShuffleNetv2 units, as shown in Table 1. In the different layers of the three-layer structure, the convolution blocks in Fig. 2(c) are replaced by convolutions of different sizes to form the basic blocks of each layer, and the final network is formed by repeatedly stacking these blocks. To make the network lighter, depthwise convolution is used for the 31×31 and 25×25 convolution blocks, and channel shuffle is added to allow better information interaction between channels. A 1×1 convolution is used to vary the number of channels and further reduce the number of parameters. Finally, residual connections are added to improve model convergence.
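The channel shuffle operation can be sketched as a reshape-transpose-reshape over the channel axis (a minimal NumPy stand-in for illustration, not the actual implementation; the array sizes below are hypothetical):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Regroup channels of an NCHW feature map so information
    mixes across groups after a grouped/depthwise convolution."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # split channels into groups, swap the group and per-group axes,
    # then flatten back to (n, c, h, w)
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

# 8 channels in 2 groups: [0..3], [4..7] become interleaved
x = np.arange(8).reshape(1, 8, 1, 1)
print(channel_shuffle(x, 2).flatten())  # [0 4 1 5 2 6 3 7]
```

This is also why the very large kernels stay affordable: a depthwise 31×31 convolution over C channels needs only 31·31·C weights, versus 31·31·C² for a standard convolution with C input and C output channels.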

B. TRIDENT-LK NET
Figure 3 shows the network structure of Trident-LK Net. After the image is fed into the network, it is downsampled three times, reducing the resolution and increasing the number of channels for subsequent detection. The downsampled feature map then enters the three-layer trident structure. The top layer uses this feature map directly as input, while the feature map is downsampled once more to serve as input to the middle layer. The top layer uses a 31×31 basic convolutional block and the middle layer a 25×25 block, so the resulting feature maps carry information from different receptive fields. The middle-layer feature maps are upsampled by bilinear interpolation to the same resolution as the top layer; at the same time, the top-layer feature maps are downsampled again, and the corresponding feature maps are stacked together. The purpose of this is to enhance the information interaction between layers so that information from multiple receptive fields is available when detecting targets of different sizes. Next, the middle-layer feature maps are downsampled and features are extracted with a 1×1 convolution block to form the bottom layer, and the feature maps of the three layers are fused using the same method. Finally, the three layers are fed into the detectors, so that rich multi-receptive-field information can be used for target detection.
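The cross-fusion between the top and middle layers can be sketched as follows (a schematic NumPy stand-in: the resolutions and channel counts are hypothetical, nearest-neighbour repetition stands in for bilinear interpolation, and strided slicing for convolutional downsampling):

```python
import numpy as np

def downsample(x):
    # stride-2 subsampling as a stand-in for convolutional downsampling
    return x[:, :, ::2, ::2]

def upsample(x):
    # nearest-neighbour repetition as a stand-in for bilinear interpolation
    return x.repeat(2, axis=2).repeat(2, axis=3)

c0 = np.zeros((1, 64, 52, 52))   # top-layer input (hypothetical shape)
c1 = downsample(c0)              # middle-layer input, 26x26
# cross-fusion: each layer is stacked with the resampled other layer
p0 = np.concatenate([c0, upsample(c1)], axis=1)    # top:    (1, 128, 52, 52)
p1 = np.concatenate([downsample(c0), c1], axis=1)  # middle: (1, 128, 26, 26)
print(p0.shape, p1.shape)
```

Each fused map thus contains channels produced under both the 31×31 and the 25×25 receptive fields, without a separate feature-fusion network.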

C. CIOU LOSS
In object detection, the IOU loss [34] is usually used as the loss function between the ground truth and the region proposal. IOU represents the degree of overlap between the region proposal and the ground truth, which is usually expressed as

IOU = |B ∩ B_gt| / |B ∪ B_gt|

where B_gt = (x_gt, y_gt, w_gt, h_gt) denotes the ground-truth box and B = (x, y, w, h) denotes the prediction box. The IOU loss can therefore be expressed as L_IOU = 1 − IOU. However, the IOU loss only provides a useful signal when the boxes overlap, which severely limits it as a loss function for prediction boxes. Therefore, we use the CIOU loss [35] as the loss function for our prediction boxes. The CIOU loss takes into account three important geometric factors: overlap area, center distance, and aspect ratio. It is defined as

L_CIOU = 1 − IOU + ρ²(b, b_gt) / c² + αυ

where b and b_gt denote the centers of the prediction and ground-truth boxes, ρ(·) is the Euclidean distance, c is the diagonal length of the smallest box enclosing both boxes, α is a positive trade-off parameter, and υ measures the consistency of the aspect ratio. The latter two can be expressed as

υ = (4/π²) (arctan(w_gt/h_gt) − arctan(w/h))²
α = υ / ((1 − IOU) + υ)
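A minimal, self-contained sketch of the CIOU computation for two boxes in (center x, center y, width, height) form (illustrative only, not the paper's implementation):

```python
import math

def ciou_loss(box, gt):
    """CIOU loss for axis-aligned boxes given as (cx, cy, w, h):
    combines IOU, normalized center distance, and an aspect-ratio term."""
    x, y, w, h = box
    xg, yg, wg, hg = gt
    # corner coordinates
    x1, y1, x2, y2 = x - w/2, y - h/2, x + w/2, y + h/2
    xg1, yg1, xg2, yg2 = xg - wg/2, yg - hg/2, xg + wg/2, yg + hg/2
    # IOU term
    iw = max(0.0, min(x2, xg2) - max(x1, xg1))
    ih = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    cw = max(x2, xg2) - min(x1, xg1)
    ch = max(y2, yg2) - min(y1, yg1)
    rho2 = (x - xg) ** 2 + (y - yg) ** 2
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 0.0
```

Unlike the plain IOU loss, the center-distance term still produces a nonzero, decreasing loss when the two boxes do not overlap at all.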

D. LOSS FUNCTION
The loss function we use consists of three components: the bounding-box regression loss, the confidence loss, and the classification loss. The total loss is the weighted sum of these three components, accumulated over the S² grid cells, where S denotes the grid size.
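The aggregation can be sketched as follows (the per-cell loss values and the weights are hypothetical placeholders, not values from the paper):

```python
# hypothetical per-cell loss components on an S x S grid
S = 2
box_l  = [[0.10, 0.00], [0.20, 0.00]]  # bounding-box regression (CIOU) loss
conf_l = [[0.05, 0.01], [0.02, 0.01]]  # confidence loss
cls_l  = [[0.30, 0.00], [0.00, 0.00]]  # classification loss
w_box, w_conf, w_cls = 1.0, 1.0, 1.0   # illustrative weights

total = sum(w_box * box_l[i][j] + w_conf * conf_l[i][j] + w_cls * cls_l[i][j]
            for i in range(S) for j in range(S))
print(round(total, 2))  # 0.69
```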

IV. EXPERIMENTS

B. DATASETS
We used two metal defect datasets and divided them in the same ratio: the ratio of training set to validation set is 9:1, and the ratio of training set plus validation set to test set is 8:2. We use only basic data augmentation, including random flips and chromaticity transformations. The datasets are described as follows:
1) NEU-DET: A dataset of steel surface defects from Northeastern University containing six types of surface defects: Rolled-in scale (Rs), Patches (Pa), Crazing (Cr), Pitted surface (Ps), Inclusions (In), and Scratches (Sc). There are 300 images for each defect type, for a total of 1800 images. The images are 200×200 pixels, and we resize them to 416×416 before feeding them into the network.
2) GC10-DET: A dataset of defects from a real industrial scenario, containing ten defect types: Punch (Pu), Weld (Wl), Crescent Gap (Cg), Water Stain (Ws), Oil Stain (Os), Silk Stain (Ss), Injury (In), Roll Pit (Rp), Crease (Cr), and Waist Fold (Wf). The dataset contains a total of 2280 images of 2048×1000 pixels, which are resized to 512×512 before being fed into the network.
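The stated ratios imply the following split sizes (a small illustrative helper assuming simple rounding; the exact per-split counts are not given in the paper):

```python
def split_counts(n_images):
    """Apply (train+val):test = 8:2, then train:val = 9:1."""
    n_test = round(n_images * 0.2)
    n_trainval = n_images - n_test
    n_val = round(n_trainval * 0.1)
    return n_trainval - n_val, n_val, n_test

print(split_counts(1800))  # NEU-DET  -> (1296, 144, 360)
print(split_counts(2280))  # GC10-DET -> (1642, 182, 456)
```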

1) NEU-DET COMPARISON EXPERIMENT
To verify the performance of the proposed model, we select some mainstream target detection models for comparison.
According to the data in Table 3, our model has an advantage in detection accuracy over the popular one-stage networks: its mAP is 6.96% and 3.06% higher than YOLOv3 and YOLOv4, respectively, and 6.8% higher than EfficientDet. These models typically use a feature extraction network to extract features from the image and then fuse shallow and deep features to improve multi-scale detection accuracy, which illustrates the effectiveness of our alternative of fusing features from different receptive fields. Moreover, thanks to the simplified network structure, our model has faster inference and far fewer parameters than EfficientDet. We also compare against Faster R-CNN, a representative two-stage network, over which our proposed model improves mAP by 13.48%. DCC-CenterNet and DEA_RetinaNet are both models specifically improved for the NEU-DET dataset, and our model is 2.51% and 1.35% lower than them in accuracy, respectively. However, model complexity is as important as accuracy in practical application scenarios, and the parameter counts of our model are only 7.01% and 5.46% of theirs, respectively. Table 4 compares the per-category results of our model with those of other models; we hope to validate our idea by analyzing each category. From the data in the table, we can see that our model's mAP falls below that of DEA_RetinaNet and DCC-CenterNet mainly because of a gap in detecting crazing, which is especially large against DEA_RetinaNet. We conclude that crazing has a relatively fuzzy boundary, which makes it difficult to distinguish the defect from the background, and our model has no advantage in detecting such defects. For defects such as scratches, which cover a relatively large area, our detection accuracy is the best among all models, because the large convolutional kernel gives the model a large receptive field.

2) GC10-DET COMPARISON EXPERIMENT
Table 5 shows the results of our model on the GC10-DET dataset compared with other models. The GC10-DET images are larger and contain more pixels, which poses a greater challenge for multi-scale target detection. From the data in the table, we can see that our model achieves the best detection accuracy and inference speed. Its accuracy is 11.82% higher than Faster R-CNN, and 11.59%, 9.37%, and 14.31% higher than YOLOv3, YOLOv4, and RetinaNet, respectively. It is also 1.62% higher than the improved model DCC-CenterNet. Taken together, the results on the NEU-DET and GC10-DET datasets show that our model achieves a good balance between accuracy and inference speed, which also means it performs better in practical application scenarios.

3) ABLATION EXPERIMENT
To verify the effectiveness of the proposed method, we conducted ablation experiments on the NEU-DET dataset; the results are shown in Table 6 and Table 7. First, we used convolutional kernels of the same size in all three layers. The results show that these models perform worse than the model using convolutional kernels of different sizes; the best uniform-kernel configuration uses 25×25 kernels throughout, but still trails the best model by 1.99%. The data in Table 6 also show that as the convolutional kernel size increases, the receptive field grows and the performance of the model gradually improves. This indicates that for the target detection task a larger receptive field tends to yield better model performance, and supports our use of large convolutional kernels in the design. Comparing the first four rows of Table 7, we can conclude that small convolutional kernels are more effective in the third layer of the network, i.e., for detecting small targets. This confirms our proposed idea of using large convolutional kernels to obtain a larger receptive field when detecting large targets, and small convolutional kernels to obtain finer information for small targets.

V. CONCLUSION
For the problem of metal surface defect detection, we propose a novel lightweight network. Considering the effect of the receptive field on defect detection at different scales, we designed a three-layer network structure in which each layer uses convolutional kernels of a different size to extract defect features.
We designed comparative experiments to compare the proposed model with mainstream defect detection models. On the NEU-DET dataset, our model performs 2.51% below the best-performing DCC-CenterNet but achieves the best result in terms of computational complexity, with 14 times fewer parameters. On the GC10-DET dataset, our model achieves the best performance, with an mAP 1.62% higher than DCC-CenterNet. Finally, to verify the validity of the proposed idea, we designed ablation experiments. The results show that using convolutional kernels of different sizes in different layers performs better than using kernels of the same size, and that the best performance is achieved by using larger kernels to detect large targets and smaller kernels to detect small targets. This confirms the effectiveness of the proposed method.

FIGURE 3. Model structure diagram. After downsampling by ShuffleNetv2 blocks, the feature maps are fed into the three-stage network. C0 is convolutionally downsampled to obtain C1. C0 and C1 pass through convolution blocks composed of kernels of different sizes and are then fused to obtain P0 and P1. The final output feature maps T0, T1, and T2 fuse feature information from different receptive fields. The target image is then obtained from the three detectors.

Figures 4 and 5 show visualized detection results on the NEU-DET and GC10-DET datasets, respectively.

TABLE 1.
Feature extraction network architecture.

TABLE 2.
Initialization parameters of our method.

TABLE 3.
Comparison of the detection results and speed of other methods and the proposed method on the NEU-DET dataset.

TABLE 4.
Comparison of the detection results of other methods and the proposed method for each category on the NEU-DET dataset.

TABLE 5.
Comparison of the detection results of other methods and the proposed method on the GC10-DET dataset.

TABLE 6.
Effect of using the same size convolution kernels for all three layers on model performance.

TABLE 7.
Effect of using different size convolution kernels in three layers on model performance.