An Improved Dilated-Transposed Convolution Detector of Weld Proximity Defects



Abstract:

Weld proximity defects are characterized by mutually occupied areas, high density, and small target sizes, which makes accurate inspection of industrial sheet welding challenging. Existing detectors are limited by the fixed size of their feature receptive fields or by their ability to process imbalanced positive-negative samples, so they cannot be applied to detecting weld proximity defects. To solve this problem, we propose an industrial detector based on the Decoupled-Dilated-Transposed You Only Look Once v5 (DDT-YOLOv5). First, the Dilated-Transposed Convolution Cross-Stage Partial Network (DT-CSPnet) is designed in the DDT-YOLOv5, replacing the conventional cross-stage partial structure, to extract defect features effectively. The two middle Resblock Bodies in the backbone are each designed as two DT-CSPnet blocks, which can accelerate the convolutional feature extraction process. Second, we provide an improved Sigmoid Linear Unit function to reduce the loss of precision and optimize the gradient of the detection network. Third, the Squared-Focal Loss with the $\alpha$-balanced method is explored in the decoupled head, which includes a Maxpooling layer. It aims at the accurate detection and classification of weld proximity defects in imbalanced samples. Finally, we compare the proposed DDT-YOLOv5 with other related models on a real dataset of weld proximity defects. Experimental results show that the proposed model achieves superior performance in detecting weld proximity defects.
Published in: IEEE Access ( Volume: 12)
Page(s): 157127 - 157139
Date of Publication: 22 October 2024
Electronic ISSN: 2169-3536


SECTION I.

Introduction

Industrial sheet welding joins multiple steel sheets by welding and is widely applied in industrial fields such as large architectures (e.g., buildings, bridges, and highways), aerospace (e.g., airplanes, rockets, and satellites), and intelligent manufacturing (e.g., vehicles and robots). Owing to the welding technology and environmental factors, welding a steel sheet may produce many small, dense defects that occupy common areas (i.e., weld proximity defects). The types of weld proximity defects include weld tumor, black-gray oxidation, burning-through, and shrinkage cavity, as shown in FIGURE 1. These surface defects are difficult to recognize with the naked eye and may affect the operation of the building or equipment, even leading to safety accidents. Therefore, advanced vision technologies are of great importance for achieving accurate detection of weld defects.

FIGURE 1. - Weld proximity defects (steel sheet size: 10 cm wide and 16 cm long).

Currently, many intelligent detection models have been proposed in industrial welding fields. According to the target localization principle, detectors can be classified into one-stage detectors (e.g., SSD (Single Shot Detection) [1] and YOLO (You Only Look Once) [2], [3], [4]) and two-stage detectors (e.g., Faster R-CNN (Region-based Convolutional Neural Networks) [5]). The two-stage methods first generate a series of region proposals and then predict the target classes and locations via an expansive convolutional neural network [6]. As a result, these models have high detection accuracy but require considerable training time or many parameters.

One-stage detectors can quickly regress to the target classes and locations using an end-to-end framework. They directly adopt the backbone convolutional neural network to achieve target detection of weld proximity defects [7], [8]. Further, the coupled head takes feature maps from the multi-kernel convolution layers and then generates the classification and location outputs [9]. Since a single homogeneous convolution layer processes multiple loss values and the region proposal network is dropped, these detectors obtain weak performance in industrial multi-target detection [10]. Namely, the principles of these structures are similar to multi-threaded parallel computing in computers.

Further, some researchers have introduced the decoupled head into the prediction head of YOLOv5 [11], [12]. This head extracts the target features and predicts pixel classes by one or more additional branches in the backbone. Namely, it obtains the classification and location information separately, improving the accuracy of edge segmentation and detail retention. However, the improved YOLOv5 detectors utilize fixed-size convolutional kernels (i.e., the $1\times 1$ or $3\times 3$ sized kernels) to construct feature extraction, which limits the receptive field mapping of pixel-image defects. Further, the YOLOv5 with the decoupled head cannot handle the unbalanced positive and negative samples in each category well [13]. This is because both the Sigmoid Linear Unit (SiLU) activation function for the cross-stage partial network (CSPnet) and the Binary Cross Entropy (BCE) loss function for the confidence branch have weak accuracy in steady-gradient calculation for the samples of weld proximity defects.

In short, although many one-stage YOLO-based detectors have emerged (e.g., YOLOx, YOLOv7 [4], and YOLOv8 [14]), weld feature detection is still mainly based on YOLOv5 (i.e., not the state-of-the-art model). Thus, we try to achieve suitable performance by simply modifying the structure of the existing YOLOv5. Two key problems in the detection of weld proximity defects should be considered as follows.

  • What wide-receptive field structure in YOLOv5 should be constructed to improve the feature extraction for detecting weld proximity defects.

  • How to improve the above essential functions in Decoupled-YOLOv5 to accurately process the unbalanced positive and negative samples of weld proximity defects.

Based on the above analysis, we propose a decoupled-dilated-transposed you only look once v5 (DDT-YOLOv5) for the accurate detection of weld proximity defects, mainly based on the Dilated-Transposed Convolution Cross-Stage Partial Network (DT-CSPnet) and Decoupled-improved head (Di-head). Compared to the conventional YOLOv5 models, the novelty is that we adopt the DT-CSPnet with a wide receptive field and the SiLU+ activation function to extract the features of weld proximity defects. Further, we redesign the decoupled head via the additional Maxpooling layer to accurately predict the classification and location. Experimental results demonstrate that the DDT-YOLOv5 model achieves superior performance in detecting weld proximity defects, with a comprehensive evaluation metric (mAP) of 89.5%. The main contributions are as follows.

  • $ {\mathbf {\it SiLU}}^{\mathbf {+}}$ activation function for effectively optimizing the gradient of the detection network. We provide an improved SiLU activation function with the additional term $\Delta \textrm {SiLU}^{+}=1/(1+e^{-x})$ , called SiLU+. This activation function aims to reduce the loss of precision and optimize the gradient of the detection network.

  • The novel DT-CSPnet for the feature extraction of weld proximity defects. We provide novel dilated Conv_BN_SiLU+ layers via the dilated convolution, replacing the convolution layer in CSPnet. The $3\times 3$ transposed convolution is added between two dilated Conv_BN_SiLU+ layers. Then, the output of the DT-CSPnet is the stacking of multiple bottleneck blocks with dilated rates (i.e., $d=2 ^{\mathrm {n}}$ , n = 1, 2, ...). Further, the number of DT-CSPnet blocks in the two middle Resblock Bodies of the Backbone is changed to 2, accelerating the feature extraction process.

  • Di-head for the accurate prediction of weld proximity defects. We adopt a $3\times 3$ kernel-sized Conv_BN_SiLU+ layer to replace the original $1\times 1$ Conv_BN_SiLU layer, and then adopt the Maxpooling layer in prediction heads. Further, we provide an improved confidence-loss function (i.e., Squared-Focal Loss) to help the accurate detection and classification of imbalanced samples.

The remainder of this work is organized as follows: Section II reviews some related works of the YOLOv5 detector. Section III introduces the structure of the DDT-YOLOv5 model. Section IV gives the experimental results and some theoretical analysis, and finally, the concluding remarks and future directions are given in Section V.

SECTION II.

Related Work

A. YOLO Detectors for the Welding Defects

Considering the type of prediction head, the YOLO detectors for welding defects can be divided into two groups: the conventional YOLO detectors and the decoupled YOLO detectors. The conventional YOLO detectors adopt the coupled head with a filter size $(L+C+T)$ , where L is the location parameters x, y, w, h, C is the confidence parameter, and T is the detection type. For example, Shi et al. [15] provided a detection algorithm based on YOLOv3 for solving the leakage of solder joints. Li et al. [16] proposed a YOLOv4-based detector to detect the welding defects of the solid rocket engine shell, which obtained the best performance on the high-level X-ray film. Zhou et al. [17] combined the CSPlayer and self-attention mechanism with the YOLOv5s model, which could detect various small defects of metal materials effectively. This coupled head has a simple structure and feeds the feature map directly into the convolutional layer, generating the output of the target location and category.

However, it requires many parameters and computational resources and leads to the overfitting problem. To reduce the computational complexity and enhance the generalization ability, the decoupled YOLO detector was proposed to improve the extraction accuracy of weld defects. Liao et al. [7] adopted the improved YOLOx to detect surface defects on metal welding boards, which had the advantages of high detection accuracy and efficiency. Li et al. [18] proposed an enhanced feature-detection YOLOv4 model via a decoupling structure for steel surface defect detection in industrial welding. Further, the YOLO with decoupling heads was designed to obtain the classification and location information separately. Wang et al. [12] proposed a multi-task simultaneous monitoring YOLOv5 model for melt inert-gas welding. Zhou et al. [19] presented a Mobilenet-decoupled head-YOLOv5 to detect welding leaks in the PCB manufacturing process, which achieved the high coupling of different feature distributions. Liang et al. [20] provided an Enhanced YOLOv5 with a self-attentive mechanism for the detection of large-metal stamping defects. However, these detectors are not well suited to the detection of weld proximity defects in industrial welding.

B. Improved Losses of YOLOv5 Models

The conventional YOLOv5 adopts the Sigmoid-Weighted Linear Unit (SiLU) as the activation function. The losses of the classification branch and the confidence branch are the Binary Cross-Entropy (BCE) function, and the loss of the location branch is the Complete Intersection over Union (CIoU) function [21]. The improvements of the expansive YOLOv5 model mainly include two parts: the loss branch and the activation function. For example, Sun et al. [22] provided an improved YOLOv5 based on the varifocal loss and the squeeze-excitation attention to detect inner wall defects. Wang et al. [23] adopted a YOLOv5 to achieve indoor occupancy detection, which solved the imbalance of classes and positive-negative samples with the varifocal loss. Further, Yi et al. [24] proposed the YOLOv5 with the ReLU6 activation function to detect circuit board faults accurately. Ye et al. [25] constructed the YOLOv5 via the conventional activation functions, and the SiLU-based YOLOv5 performed well in plant disease recognition.

Based on the above analysis, the existing YOLO series is not well suited to the detection of weld proximity defects. Thus, we explore the DDT-YOLOv5 model via the DT-CSPnet and the Di-head in Sect. III.

SECTION III.

Proposed DDT-YOLOv5 Model

In this section, we introduce the main improved structures of the proposed DDT-YOLOv5, including the SiLU+, DT-CSPnet, and YOLO Di-head.

The conventional YOLOv5l consists of four parts: input, backbone, neck, and prediction head. The input adopts the adaptive anchor box, the Mosaic, and the Image Focus method for data enhancement. The backbone is an image-feature extraction network that adopts the residual cross-stage partial network (CSPnet) and the SPPBottleneck structure to enhance the network extraction field. The neck is mainly composed of the feature pyramid network, which can collect effective feature maps with different pooling sizes from multiple Conv_BN_SiLU layers. Further, the output performs the small-target prediction and loss calculation through three YOLO prediction heads.

Unlike the conventional YOLOv5l, the proposed DDT-YOLOv5 model changes the activation function, the CSPnet structure, and the YOLO prediction heads, as shown in FIGURE 2. First, we improve the SiLU activation function (called SiLU+) and use it to build the Conv_BN_SiLU+ layer in the YOLOv5l model. Second, the two normal Conv_BN_SiLU layers in CSPnet are replaced by the dilated-transposed convolution CSPnet framework, called DT-CSPnet. Since the dilated convolution rate is an integer power of 2, we select the series (i.e., $d=2 ^{0}$ , $2^{1}$ , $2^{2}$ ,..., $2^{\mathrm {n}}$ ) to achieve the stacking of multiple bottleneck blocks. The number of DT-CSPnet blocks in the middle Resblock Bodies is changed to 2, which removes one block from each module to accelerate the feature extraction process. Finally, we change the convolutional kernel size of the Conv_BN_SiLU+ layer and adopt the Maxpooling operation in the Decoupled-improved prediction head (called Di-head). In addition, the improved confidence-loss function is used in the prediction heads to achieve small-target detection and classification tasks.

FIGURE 2. - The detailed architecture of DDT-YOLOv5.

A. SiLU+ Activation Function

The SiLU activation function [26] plays an essential role in the backbone of the conventional YOLOv5. The SiLU is expressed as the sigmoid function multiplied by its input; it is bounded on one side (lower bounded), smooth, and non-monotonic, as denoted in Eq. 1. The derivative of the SiLU is denoted in Eq. 2.

Compared with the conventional activation functions (e.g., Rectified Linear Unit (ReLU), sigmoid), the SiLU can retain more input information to speed up the nonlinear convergence. It integrates the low computational complexity of the ReLU and the weighted gradient smoothing of the sigmoid function [27]. However, when the inputs of the activation function are too large or too small, the SiLU may cause gradient problems in the back-propagation process, which affects the accuracy of weld proximity defect detection.

Fortunately, the Hard-Swish function (H-swish) is developed by a combination of common operators (i.e., the ReLU6 and SiLU). This piecewise activation function is suitable for lightweight networks [28]. It adopts linear and nonlinear approximations (i.e., output = x$\cdot $ ReLU6(x), ReLU6 = min(ReLU, 6)) to form the hard part, which improves the model’s detection and classification ability. Further, the H-swish function adopts input bias and proportional scaling (i.e., ReLU6(x+3)/6) to approximate the SiLU, as denoted in Eq. 3. It can finally optimize the potential gradient accuracy.

\begin{align*}
\text{SiLU} &= x\cdot \text{sigmoid}(x)=\frac{x}{1+e^{-x}} \tag{1}\\
\text{SiLU}' &= [x\cdot \text{sigmoid}(x)]'=\frac{1}{1+e^{-x}}+\frac{x\cdot e^{-x}}{(1+e^{-x})^{2}} \tag{2}\\
\text{H-swish} &= x\cdot \text{ReLU6}(x+3)/6 \tag{3}\\
\text{SiLU}^{+} &= (1+x)\cdot \text{sigmoid}(x)=\frac{1}{1+e^{-x}}+\text{SiLU} \tag{4}\\
(\text{SiLU}^{+})' &= \text{SiLU}'+\frac{e^{-x}}{(1+e^{-x})^{2}} \tag{5}
\end{align*}

where sigmoid($\cdot $) is an activation function of a neural network that maps its input into the range (0, 1).

Inspired by this idea, we propose an improved sigmoid-weighted linear unit activation function, called SiLU+. The SiLU+ activation function is the product of the linear term ($x+1$ ) and the sigmoid function, as denoted in Eq. 4. The additional term $\Delta \text { SiLU}^{+}=1/(1+e^{-x})$ stays bounded as the input tends to positive or negative infinity, so the SiLU+ activation function retains the lower-bounded and non-monotonic properties. The additional derivative term $\Delta (\text {SiLU}^{+}{)}'=e^{-x}/(1+e^{-x})^{2}$ in Eq. 5 converges to zero at both extremes. When the network input approaches infinity, it can help the SiLU+ reduce the loss and preserve the gradient value. The SiLU+ and SiLU functions are shown in FIGURE 3a, and their derivatives are shown in FIGURE 3b. The SiLU+ is smoother than the conventional SiLU when the inputs are slightly below zero, and the derivatives converge to similar final values. This suggests that the proposed SiLU+ can effectively optimize the network and improve the convergence speed.
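Eqs. 1 to 5 can be checked numerically with a few lines of code. The following is a minimal scalar sketch, not the authors' implementation: the function names are illustrative, and the finite-difference helper is only an assumption used to sanity-check the closed-form derivatives.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    """Eq. 1: SiLU(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Eq. 2, using e^{-x} / (1 + e^{-x})^2 = s * (1 - s) with s = sigmoid(x)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def h_swish(x: float) -> float:
    """Eq. 3: H-swish(x) = x * ReLU6(x + 3) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def silu_plus(x: float) -> float:
    """Eq. 4: SiLU+(x) = (1 + x) * sigmoid(x) = sigmoid(x) + SiLU(x)."""
    return (1.0 + x) * sigmoid(x)


def silu_plus_grad(x: float) -> float:
    """Eq. 5: (SiLU+)'(x) = SiLU'(x) + s * (1 - s)."""
    s = sigmoid(x)
    return silu_grad(x) + s * (1.0 - s)


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative, used only to sanity-check the closed forms."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
        print(f"x={x:+.1f}  SiLU={silu(x):+.4f}  SiLU+={silu_plus(x):+.4f}  "
              f"dSiLU={silu_grad(x):+.4f}  dSiLU+={silu_plus_grad(x):+.4f}")
```

For large positive or negative inputs the extra derivative term s(1 - s) vanishes, so the two derivatives printed above approach each other, in line with the convergence behaviour described for FIGURE 3b.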

B. DT-CSPnet Structure

In this section, we will describe the structure of the dilated-transposed convolution cross-stage partial network (DT-CSPnet) of the proposed DDT-YOLOv5.

In conventional YOLOv5l, the CSPnet first extracts image features by parallel Conv_BN_SiLU layers. Second, the $1\times 1$ and $3\times 3$ Conv_BN_SiLU layers with residual structures are stacked. This light residual network is easy to optimize with a small number of parameters. Finally, it avoids duplicated gradient information and mitigates vanishing gradients as the network depth increases [29]. However, the fixed-kernel convolution limits the receptive field for image feature extraction, which ultimately affects the accuracy of small-target classification and detection [30]. In addition, the successive batch normalization operations make insignificant changes to the local input gradient, which may drop some of the essential pixel features and lead to hidden network over-regularization.

Notably, dilated convolution has been developed for data analysis tasks, especially for time series [31]. It increases the receptive field while maintaining features, which expands the ability to capture historical features at low computational cost. Based on the above, we redesign the CSPnet framework based on the dilated convolution, as shown in FIGURE 4.

FIGURE 3. - Activation functions of the detector model.

FIGURE 4. - DT-CSPnet structure.

In the proposed DT-CSPnet, a $1\times 1$ Conv_BN_SiLU+ layer is first constructed as the initial part of the main branch. Second, we provide a $1\times 1$ dilated Conv_BN_SiLU+ layer to extract input features of the main branch. To avoid the over-regularization problem, we add a $3\times 3$ transposed convolution between the $1\times 1$ dilated Conv_BN_SiLU+ layer and the next $3\times 3$ dilated Conv_BN_SiLU+ layer. The final output of a bottleneck block is the sum of internal residual branches. Third, we select the dilated rates (i.e., $2^{0}$ , $2^{1}$ , $2^{2}$ ,..., $2^{\mathrm {m}}$ ) to achieve the stacking of multiple bottleneck blocks. The related receptive fields can be calculated by Eq. 6 and Eq. 7. Finally, to match the concatenated dimensions of the main branch, we perform a Conv_BN_SiLU+ operation in the shortcut path.

\begin{align*}
r_{c} &= r_{c-1}+\left[(k_{c}-1)\cdot \prod\limits_{l=1}^{c-1} s_{l}\right] \tag{6}\\
\omega &= 1+\sum\limits_{i=0}^{m-1} (k-1)\cdot d^{i} = 1+(k-1)\cdot \frac{d^{m}-1}{d-1} \tag{7}
\end{align*}

where $r_{c}$ is the receptive field of the c-th convolutional residual block, $k_{c}$ is the kernel size of the current c-th layer or the pooling layer size, and $\prod {s_{l}}$ is the product of the convolutional strides of the previous (c-1) layers. Further, $\omega $ denotes the receptive field size of the stacked dilated residual layers, k denotes the kernel size of these layers, m denotes the number of stacked layers, and d denotes the dilated factor.
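As a worked illustration of Eq. 6 and Eq. 7, the sketch below evaluates both formulas for small layer stacks. It is only a sketch under stated assumptions: the function names are hypothetical, the recursion in Eq. 6 is started from $r_{0}=1$ (a single input pixel), and the kernel sizes and strides in the usage lines are illustrative values rather than the exact DT-CSPnet configuration.

```python
def receptive_field(kernel_sizes, strides):
    """Eq. 6: r_c = r_{c-1} + (k_c - 1) * prod_{l<c} s_l, assuming r_0 = 1."""
    r, stride_product = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * stride_product
        stride_product *= s  # strides of layers 1..c feed the next step
    return r


def stacked_dilated_field(k, d, m):
    """Eq. 7, sum form: omega = 1 + sum_{i=0}^{m-1} (k - 1) * d**i."""
    return 1 + sum((k - 1) * d ** i for i in range(m))


def stacked_dilated_field_closed(k, d, m):
    """Eq. 7, closed form; valid for d != 1 (the division is always exact)."""
    return 1 + (k - 1) * (d ** m - 1) // (d - 1)


if __name__ == "__main__":
    # Three plain 3x3 layers with stride 1.
    print(receptive_field([3, 3, 3], [1, 1, 1]))
    # 3x3 kernels with dilation base d = 2 and m = 1..4 stacked layers.
    for m in range(1, 5):
        print(m, stacked_dilated_field(3, 2, m), stacked_dilated_field_closed(3, 2, m))
```

With k = 3 and d = 2, the value of Eq. 7 roughly doubles with each additional stacked layer, which is the exponential growth of the receptive field that motivates limiting the number of dilation factors.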

C. YOLO Di-Head Structure

In this section, we will redesign the YOLO-Head via the decoupled convolutional layer, the MaxPooling layer, and the improved Focal Loss function.

The conventional YOLO Head integrates the target outputs through the same coupled convolutional layer. So, it will result in overfitting and low performance of model classification and regression. Fortunately, the decoupled head has been provided to solve these problems. It extracts the location and category information separately through different network branches and further fuses pixels’ multi-category features [32], [33]. The decoupled head can effectively improve the detail segmentation accuracy, reduce network parameters, and enhance the model’s robustness.

Notably, the differences between the proposed Decoupled-improved head (Di-head) and the decoupled head in YOLOv5 fall into three parts, as shown in the red boxes of FIGURE 5. First, an original $1\times 1$ Conv_BN_SiLU layer of the decoupled head is replaced by a $3\times 3$ Conv_BN_SiLU+ layer, which improves the feature convolution accuracy while slightly increasing the parameters. After a series of convolution, normalization, and SiLU+ activation operations, the integrated prediction output has a dimension of (num_classes $+ 4+1$ ). Second, we perform a MaxPooling operation on the concatenated outputs of the Di-head. It increases the receptive field while maintaining the invariance of the scale and displacement features. Finally, we use the Squared-Focal Loss with the $\alpha $ -balanced method (called I-FL) for accurate gradient calculation, replacing the BCE as the confidence loss function. Further, the location loss is calculated by the CIoU function, and the classification loss function is the BCE function. The related descriptions are as follows.

\begin{align*}
\text{I-FL}(p,y) &= -\alpha y(1-p)^{\gamma}\log(p) \\
&\qquad\quad -(1-\alpha)(1-y)p^{\gamma^{2}}\log(1-p) \tag{8}\\
\text{CIoU} &= 1-\frac{\vert B\cap B^{gt}\vert}{\vert B\cup B^{gt}\vert}+\frac{\rho^{2}(b,b^{gt})}{c^{2}}+\beta v \tag{9}\\
\beta &= \frac{v}{(1-\text{IoU})+v}, \\
&\qquad\quad v=\frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\right)^{2} \tag{10}\\
\text{BCE} &= -y\log p-(1-y)\log(1-p) \\
&= \begin{cases} -\log p, & y=1 \\ -\log(1-p), & y=0 \end{cases} \tag{11}
\end{align*}

where y is the corresponding label (1 for a positive sample, 0 for a negative sample), and p is the predicted probability that the input sample is positive. B is the predicted box, and $B^{gt}$ is the ground-truth box. $\rho (b,b^{gt})$ is the Euclidean distance between the center points b and $b^{gt}$ of the two boxes, $\alpha > 0$ is the hyperparameter, $w/h$ is the width-to-height ratio, $\gamma \in \{0,1,2,3,4,5\}$ is an adjustable factor, and c is the diagonal length of the smallest enclosing box covering the two boxes.
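The loss terms in Eqs. 8 to 11 can be sketched for a single prediction as follows. This is a hedged illustration rather than the authors' training code: the function names, the corner-coordinate box format (x1, y1, x2, y2), and the small clamp `EPS` are assumptions introduced here, and the defaults $\alpha=0.5$ and $\gamma=2$ are simply one of the settings studied in Sect. IV-C.

```python
import math

EPS = 1e-7  # illustrative clamp that keeps log() and divisions finite


def bce(p, y):
    """Eq. 11: binary cross-entropy for one prediction p and label y."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


def i_fl(p, y, alpha=0.5, gamma=2.0):
    """Eq. 8: alpha-balanced focal loss with exponent gamma**2 on the negative term."""
    p = min(max(p, EPS), 1.0 - EPS)
    positive = -alpha * y * (1.0 - p) ** gamma * math.log(p)
    negative = -(1.0 - alpha) * (1 - y) * p ** (gamma ** 2) * math.log(1.0 - p)
    return positive + negative


def ciou_loss(box, box_gt):
    """Eqs. 9-10 for boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1
    # Intersection over union of the two boxes.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    iou = inter / (union + EPS)
    # rho^2: squared distance between the two box centres.
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    # c^2: squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2
    # Aspect-ratio consistency term v and its weight beta (Eq. 10).
    v = (4.0 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    beta = v / ((1.0 - iou) + v + EPS)
    return 1.0 - iou + rho2 / (c2 + EPS) + beta * v


if __name__ == "__main__":
    print(bce(0.1, 0), i_fl(0.1, 0))                # confident, correct negative
    print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))    # partially overlapping boxes
```

In this sketch, a negative sample that is already classified confidently (small p, y = 0) contributes far less to I-FL than to BCE, because of the $p^{\gamma^{2}}$ factor in Eq. 8.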

FIGURE 5. - YOLO prediction head and Di-head structure.

SECTION IV.

Experiments

A. Experimental Dataset Preparation

We collect real steel sheets from the daily industrial welding process. Since the weld proximity defects are small and difficult to distinguish, we locally zoom in on the defect clusters with a high-resolution camera and collect these images as the experimental dataset. The dataset contains about 10000 weld defect images, as shown in FIGURE 6. The number of weld proximity defects in the dataset is 29200, and the dataset contains four types of weld proximity defects: 8115 Black-Gray Oxidation (BO) defects, 6298 Burning-Through (BT) defects, 36642 Weld Tumor (WT) defects, and 17466 Shrinkage Cavity (SC) defects.

FIGURE 6. - Example of weld proximity defects.

B. Experimental Setting & Metrics

Considering the target detection efficiency and actual industrial welding needs, the input images are resized to $640\times 640$ . The number of training epochs is 150 and the batch size is 16. Further, we select Adam as the model optimizer with a weight decay of 5e−4. The initial learning rate of the training process is 1e−2, and the nms_iou is 0.5. All experiments run in the Python 3.9 programming environment with the TensorFlow framework, an NVIDIA RTX 3090 GPU, an AMD R7-5800x CPU, and 32 GB of memory.
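For reproducibility, the settings above can be gathered in one configuration object. The sketch below is only an illustration: the class name, field names, and the `steps_per_epoch` helper are hypothetical, and the image count in the usage line is the approximate dataset size quoted in Sect. IV-A, not a reported training split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as stated in the text; field names are illustrative."""
    img_size: int = 640          # inputs resized to 640 x 640
    epochs: int = 150
    batch_size: int = 16
    optimizer: str = "adam"
    weight_decay: float = 5e-4
    initial_lr: float = 1e-2
    nms_iou: float = 0.5


def steps_per_epoch(n_images: int, batch_size: int) -> int:
    """Number of mini-batches needed to see every image once."""
    return math.ceil(n_images / batch_size)


if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg)
    # Hypothetical example: about 10000 images, all used for training.
    print(steps_per_epoch(10000, cfg.batch_size))
```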

Further, we select the Average Precision (AP50, AP75, AP50:95), Average Recall (AR), Parameters (Params), Frames Per Second (FPS), and the Training time/epoch (Time) as the metrics of the ablation experiments. We also select the mean AP (mAP), Precision, F1 score, Recall, FPS, Params, and Computational time (C-Time) as the metrics to evaluate the compared models. Precision is the percentage of correct detections among all detected targets. Recall is the percentage of correct detections among all true targets. The higher the mean Average Precision (mAP), the better the performance of the model.
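The relation between Precision, Recall, and the F1 score can be made concrete with a short sketch. It uses the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN); the function names are illustrative, the counts in the usage lines are made-up numbers rather than results from this work, and mAP is taken here simply as the mean of per-class AP values.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_average_precision(ap_per_class):
    """mAP as the arithmetic mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Made-up counts: 80 correct detections, 20 false alarms, 10 missed defects.
    print(precision_recall_f1(tp=80, fp=20, fn=10))
    # Made-up per-class AP values for four defect types.
    print(mean_average_precision([0.9, 0.8, 0.85, 0.95]))
```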

C. Key Parameter Selection

In DDT-YOLOv5, some key parameters will affect the effect of feature extraction and detection accuracy of weld proximity defects. We think the receptive fields of each bottleneck block depend on the dilation factor, which presents different feature extraction abilities of the DT-CSPnet. Further, the confidence loss function (i.e., I-FL) in Di-head mainly depends on the hyperparameter and adjustable factor by analyzing the loss function elements. Here are the experimental results for key parameter selection.

1) Dilation Factor Selection

In residual blocks, the dilation factor of the dilated convolution is usually set to [1, 2, $2^{2}$, $2^{3}$, $2^{4}$]. However, as the dilated rate increases, the receptive field of the dilated convolution grows exponentially, which may lead to overfitting. Considering the parameter quantity of the proposed DDT-YOLOv5, we perform limited comparison experiments with the dilation factors [1,2], [1,2,4], and [1,2,4,8]. Further, we adopt the initial parameter settings (i.e., $\alpha =0.25$ , $\gamma =2$ ) in the Di-head of DDT-YOLOv5, and the number of training epochs is 150.

TABLE 1 shows that the DDT-YOLOv5 based on the dilation factor [1,2,4] achieves superior performance (i.e., AP50: 89.7%, AP75: 53.2%, AP50:95: 51.7%, AR: 63.5%). The other metrics (Parameters, FPS, and Time) are at a medium level and are acceptable for the training process of DDT-YOLOv5. Further, the dilation factor [1,2,4,8] increases the network parameters substantially, by 13.9M compared with the dilation factor [1,2,4]. The reason is that overly large receptive fields in the bottleneck block cause the overfitting problem and then reduce the accuracy of feature extraction. When the dilation factor is [1,2], the extraction ability is weaker than that of the dilation factor [1,2,4], which depends on the receptive field. Further, TABLE 2 lists the maximum receptive field of each bottleneck block in DT-CSPnet.

TABLE 1 Ablation Experiments for DDT-YOLOv5 Head
TABLE 2 Maximum Receptive Fields of DT-CSPnet

Based on the above analysis, we finally select the series [1,2,4] as the key dilation factors in the DT-CSPnet to extract the features of weld proximity defects. The maximum receptive field of the DT-CSPnet for feature extraction is 7.

2) Hyperparameter and Adjustable Factor

To select the superior hyperparameter and adjustable factor of the I-FL loss function, we perform a parameter comparison. In general, the hyperparameter factor $\alpha $ is chosen from [0.25,0.5,0.75], the adjustable factor $\gamma $ is chosen from [1,2,3,4,5], and the number of training epochs is 150. Considering the range of the squared adjustable factor, we select $\gamma $ from [1,2] as the compared parameter. We also set the dilation factors of the bottleneck blocks to [1,2,4], based on the results of TABLE 1. The results of the hyperparameter and adjustable factor are shown in TABLE 3, indicating the evaluation metrics of each factor combination. When the hyperparameter and adjustable factors are $\alpha =0.5$ , $\gamma =2$ or $\alpha =0.25$ , $\gamma =2$ , the proposed model’s performance improves significantly.

TABLE 3 Results of the Hyperparameter and Adjustable Factor

FIGURE 7 shows the network loss of each factor combination (i.e., including training loss and validation loss). The hyperparameter and adjustable factors $\alpha =0.5$ , $\gamma =2$ or $\alpha =0.25$ , $\gamma =2$ yield lower loss than the other factor combinations. Further, although the combination ($\alpha =0.25$ , $\gamma =2$ ) achieves the best accuracy and the lowest training and validation loss, its curve over all epochs is not smooth enough and fluctuates violently. Further, considering the training time of the two parameter combinations (i.e., 4.8s vs. 4.7s), we finally select $\alpha =0.5$ and $\gamma =2$ as the key hyperparameter and adjustable factors of the DDT-YOLOv5 model.

FIGURE 7. - The loss of the DDT-YOLOv5 model.

D. Ablation Experiments

In this section, we verify through ablation experiments whether key parameters affect the detection accuracy of the model. Experimental results in TABLE 4 show that when the $1\times 1$ Conv_BN_SiLU+ is adopted in the first layer, the AP50 of the DDT-YOLOv5 is reduced by about 0.7%, while without the MaxPooling operation it is significantly reduced by about 2.5%. The FPS represents the number of images that can be processed in a second. Based on the FPS metric, the added $3\times 3$ Conv_BN_SiLU+ can process more images than the $1\times 1$ convolutional layer (i.e., 21.3 vs. 19.7). Further, the training time via the $1\times 1$ Conv_BN_SiLU is about 5.3s, which is more than that of the proposed $3\times 3$ Conv_BN_SiLU+ (i.e., about $\Delta \textrm {Time}=+0.4\textrm {s}$ ). The reason is that the small-sized convolutional kernel may take more time to extract the image features [34], [35]. When we drop the MaxPooling layer, the training time per epoch decreases slightly, but the performance of the DDT-YOLOv5 model weakens.

TABLE 4 Ablation Experiments for DDT-YOLOv5 Head

In TABLE 5, we explore the effects of the DDT-YOLOv5 with different confidence loss functions. The evaluation metrics of the Focal loss function (i.e., an AP50 of 86.1%, an AP75 of 50.3%, and an AR of 58.8%) are better than those of the original BCE loss function. Further, the I-FL loss function via the Squared-Focal Loss and the $\alpha $ -balanced method achieves the best performance. It means that the hyperparameter and adjustable factors of the I-FL function can enhance the effect of negative samples and then improve the detection and classification accuracy of imbalanced samples.

TABLE 5 Ablation Study of Final Loss in Di-Head

Experimental results in TABLE 6 show that each part of the DT-CSPnet in DDT-YOLOv5 improves the detection performance. When we drop the DT-CSPnet completely and replace it with the original CSPnet layer, the performance of the model decreases significantly (i.e., AP50: 4.8%, AR: 7.0%). Replacing the SiLU with the SiLU+ in the DT-CSPnet also brings an improvement: $\Delta \textrm {AP}_{50} $ is 2.1%, $\Delta \textrm {AP}_{75}$ is 2.8%, and $\Delta \textrm {AR}$ is 4.5%. This suggests that the additional term in SiLU+ helps reduce the loss of precision and alleviate gradient problems. Because the SiLU activation has fewer parameters, the FPS of the DT-CSPnet without SiLU+ is higher than that of the full DT-CSPnet (i.e., 25.3 vs. 21.3).
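As a rough illustration of the additional term, the sketch below compares the plain SiLU with one possible reading of SiLU+. The conclusion of this paper gives the term as $1/(1+e^{-x})$ ; how it is combined with SiLU is not stated in this section, so the purely additive form used in `silu_plus_assumed` is an assumption made only for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + e^{-x})."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x):
    """Standard SiLU: x * sigmoid(x)."""
    return x * sigmoid(x)


def silu_plus_assumed(x):
    """ASSUMED form of SiLU+: SiLU plus the additional sigmoid term.

    The additive combination is a guess for illustration only;
    the paper's exact definition may differ.
    """
    return silu(x) + sigmoid(x)


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  SiLU={silu(x):+.4f}  SiLU+(assumed)={silu_plus_assumed(x):+.4f}")
```

Under this assumed form the output at $x=0$ is 0.5 rather than 0, so the activation stays slightly positive around the origin.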

TABLE 6 Ablation Experiments for DT-CSPnet

Further, both the transposed convolution and the dilated convolution in the DT-CSPnet improve the feature extraction ability. The DDT-YOLOv5 without the dilated convolution gets lower evaluation metrics than the corresponding transposed-convolution ablation (i.e., AP50: 86.2%/85.9% vs. 86.6%, AP75: 49.5%/48.0% vs. 51.7%, AR: 58.9%/58.1% vs. 59.6%). That is, the $3\times 3$ dilated convolution plays a major role in the detection of weld proximity defects (i.e., $\Delta $ AP50: −3.8%, $\Delta $ AP75: −5.2%, $\Delta $ AP50:95: −4.1%, $\Delta $ FPS: +8.5).
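The effect of the two operations on feature-map geometry can be checked with the usual size formulas. The sketch below is a minimal Python illustration, assuming the standard definitions of dilation and transposed convolution (the same output-size formula used by common deep learning frameworks); the example sizes are hypothetical and not taken from the DT-CSPnet configuration.

```python
def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def transposed_conv_output_size(size, kernel, stride=1, padding=0,
                                output_padding=0, dilation=1):
    """Output length of a transposed convolution along one axis."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


if __name__ == "__main__":
    # A 3x3 kernel with dilation factors 1, 2, and 4 covers 3, 5, and 9 pixels.
    for d in (1, 2, 4):
        print(f"dilation={d}: effective kernel = {effective_kernel_size(3, d)}")
    # Hypothetical example: a 2x2 transposed convolution with stride 2
    # doubles a 20-pixel feature map to 40 pixels.
    print(transposed_conv_output_size(20, kernel=2, stride=2))
```

A dilated $3\times 3$ kernel therefore widens the receptive field without adding weights, while the transposed convolution enlarges the feature map.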

Based on the above analysis, the efficiency metrics (e.g., FPS: 21.3, Time: 4.9s) are acceptable given the model’s performance improvements (e.g., AP50: 89.7%, AP75: 53.2%, AP50:95: 51.6%, AR: 63.5%).
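For readers less familiar with the AP50 and AP75 notation, the following minimal Python sketch shows the intersection-over-union (IoU) test that underlies such thresholds. It assumes the common convention that AP50 and AP75 count a prediction as correct when its IoU with a ground-truth box reaches 0.5 and 0.75 respectively; the box format `(x1, y1, x2, y2)` and the example boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred_box, gt_box, threshold):
    """True if the prediction overlaps the ground truth at the given IoU threshold."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    pred = (0.0, 0.0, 10.0, 6.0)   # covers 60% of the ground-truth box
    print(iou(pred, gt))            # IoU of the two boxes
    print(matches(pred, gt, 0.50))  # counted at the AP50 threshold
    print(matches(pred, gt, 0.75))  # rejected at the AP75 threshold
```

The stricter threshold explains why AP75 values are consistently lower than AP50 values for the same model.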

E. Comparison Experiments

In this section, we compare the YOLOv3 [15], the YOLOv4 [3], the YOLOv5 [4], the YOLOv5(Relu6) [24], the FL-YOLOv5 (Focalloss), the D-YOLOv5 (Decoupled-BCE) [11], the DF-YOLOv5 (Decoupled-Focalloss), the VFL-YOLOv5 (Vari-Focalloss) [23], and the E-YOLOv5 [20] with the proposed model. The evaluation metrics of the comparison experiment are those listed in Sect. IV-B. The key parameters (dilation factor, hyperparameter, and adjustable factor) are [1, 2, 4], $\alpha =0.5$ , and $\gamma =2$ . The specific results are shown in TABLE 7. Firstly, the comprehensive evaluation metrics of the DDT-YOLOv5 (i.e., mAP: 89.5%, Precision: 88.6%, and F1: 84.3%) are significantly better than those of the other contrast models.

TABLE 7 Ablation Experiments for DT-CSPnet

Although the FPS and computational time of the proposed model are not the best (e.g., it is slower than the VFL-YOLOv5 and the E-YOLOv5), the proposed model achieves the best recognition precision and recall. The conventional YOLO models (i.e., YOLOv3, YOLOv4, and YOLOv5) have fewer parameters than the proposed DDT-YOLOv5, but their mAP is significantly lower (i.e., −8.5%, −8.2%, −7.4%).
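The relation between the reported precision and F1 score can be checked with a short calculation. The sketch below assumes the usual definition $F1 = 2PR/(P+R)$ and inverts it to estimate the recall implied by the reported Precision of 88.6% and F1 of 84.3%. The resulting recall is a derived estimate rather than a value reported in this paper, and it would differ if the F1 score were averaged per class.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def recall_from_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2.0 * precision - f1)


if __name__ == "__main__":
    precision, f1 = 0.886, 0.843  # values reported for DDT-YOLOv5
    recall = recall_from_f1(precision, f1)
    print(f"implied recall: {recall:.3f}")
    print(f"round-trip F1:  {f1_score(precision, recall):.3f}")
```

Under these assumptions the implied recall is roughly 80%, which is lower than the precision and consistent with an F1 score that falls below the precision.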

Secondly, the computational time is proportional to the number of model parameters; the C-Time of DDT-YOLOv5 is 0.042s. Taken together with the accuracy metrics, the proposed DDT-YOLOv5 can meet the real-time requirements of industrial detection. Further, we perform visual detection experiments with the VFL-YOLOv5, the YOLOx, and the proposed DDT-YOLOv5, as shown in FIGURES 8 and 9. The models produce different detection results for weld proximity defects. In FIGURE 8, the original image with calibration labels includes three types of weld proximity defects (i.e., WT, SC, & BO). The VFL-YOLOv5 recognizes the different defects with high evaluation values but carries a risk of incorrect detection, which makes its evaluation metric lower than that of the proposed DDT-YOLOv5 (e.g., mAP: 86.8% vs. 89.5%). Further, the E-YOLOv5 shows a significant risk of incorrect and missed detections, which leads to lower precision and recall than the DDT-YOLOv5. Compared against the calibration labels of the original images, the DDT-YOLOv5 achieves more accurate recognition and classification of weld proximity defects.

FIGURE 8. - Real effects for the detections of weld proximity defects (WT, SC, & BO).

FIGURE 9. - Real effects for the detections of weld proximity defects (WT, SC, & BT).

FIGURE 9 also confirms that the proposed DDT-YOLOv5 performs better than the other models. Both the E-YOLOv5 and the VFL-YOLOv5 produce incorrect detections of weld proximity defects. The E-YOLOv5 over-detects, labeling a common area as type WT with low precision values. The VFL-YOLOv5 detects the type SC as the two types SC & WT.

Finally, the network parameter count of our model is 67.3M, which is about 16M larger than those of the VFL-YOLOv5 and the E-YOLOv5. Nevertheless, the computational time of the DDT-YOLOv5 is comparable to that of these models. Moreover, because the welding and inspection of most industrial sheets are themselves time-consuming and resource-intensive, the proposed DDT-YOLOv5 model can meet real industrial detection scenarios (e.g., welding and inspecting one steel-structure piece in a real factory takes about 15min, and inspecting each welding point with the industrial camera takes about 2min, as shown in FIGURE 10).
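The margin between the model's per-image time and the inspection window can be illustrated with a back-of-envelope calculation. The sketch below uses only the C-Time of 0.042s and the roughly 2min inspection time per welding point quoted above; the function name is hypothetical, and the single-image throughput it prints need not coincide with the FPS values in the ablation tables, which may have been measured under different conditions.

```python
def inferences_per_window(window_seconds, seconds_per_image):
    """How many single-image inferences fit into one inspection window."""
    return int(window_seconds // seconds_per_image)


if __name__ == "__main__":
    c_time = 0.042        # seconds per image for DDT-YOLOv5
    window = 2 * 60       # about 2 min per welding point
    print(f"throughput: {1.0 / c_time:.1f} images/s")
    print(f"inferences per welding point: {inferences_per_window(window, c_time)}")
```

Even with a wide safety margin, the inference step occupies a very small share of the inspection time per welding point.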

FIGURE 10. - The steel-structure welding and detection scenario.

F. Actual Detection Effects

Based on the above experiments, we use the testing set from the weld defect dataset (i.e., about 10000 weld defect images) to confirm the actual detection effects of the proposed DDT-YOLOv5 model. The details of each defect combination are shown in FIGURE 11. The proposed DDT-YOLOv5 can accurately detect different combinations of weld proximity defects, including the combinations WT&BT&SC (precision: 92%, 93%, and 84%), WT&BO&SC (precision: 82%, 96%, and 95%), and BO&BT&WT (precision: 95%, 96%, and 80%).

FIGURE 11. - Actual detection effects of DDT-YOLOv5 model (e.g., WT&BT&SC; WT&BO&SC; BO&BT&WT).

SECTION V.

Conclusion

In this work, we propose an improved DDT-YOLOv5 model based on the DT-CSPnet and the Di-head to achieve effective detection of weld proximity defects. The dilated convolution and transposed convolution used in the DT-CSPnet capture wide-receptive-field features well. Further, we design the SiLU+ activation function by introducing the additional term $\Delta \textrm {SiLU}^{+}=1/(1+e^{-x})$ , which improves the feature extraction ability of the network. Based on the MaxPooling layer and the improved confidence loss (i.e., the Squared-Focal Loss with the $\alpha $ -balanced method), the redesigned Di-head accurately predicts the class and location of weld proximity defects in imbalanced samples. Experimental results demonstrate that the proposed DDT-YOLOv5 achieves better detection ability than the other contrast models, with an mAP of 89.5% and an F1 score of 84.3%.

However, the computational cost of the proposed DDT-YOLOv5 model is not superior to that of the compared models. In the future, we will optimize the structure of DDT-YOLOv5 with attention mechanisms to make it more lightweight and apply it to more industrial welding fields.
