Bi-directional Convolutional Recurrent Reconstructive Network for Welding Defect Detection

Nowadays, the welding process is essential in various manufacturing industries, such as aerospace, vehicle production, and shipbuilding. Welding defects arising during the process need to be monitored, as they can cause serious accidents and losses. Traditional computer vision methods in industrial applications are inefficient when the detection targets vary in shape, scale, and color, because the detection performance depends on hand-crafted features. To overcome this limitation, deep learning models, such as the convolutional neural network (CNN), are applied to industrial defect detection. These CNN-based models, however, are trained on static images and show low performance that cannot meet industrial requirements. To address this challenge, the bidirectional Convolutional Recurrent Reconstructive Network (bi-CRRN) is proposed for welding defect detection and localization based on welding video. Spatio-temporal data, in particular the forward and backward sequences, are considered in our bi-CRRN to achieve high detection performance. Moreover, automatic defect detection equipment is developed to weld a material and monitor the welding bead simultaneously. We demonstrate that the proposed bi-CRRN outperforms other segmentation network models in welding defect detection.


I. INTRODUCTION
Deep learning has shown significant progress in various fields, including image classification, semantic segmentation, and object detection. In industrial applications, these advanced algorithms have led to a dramatic increase in performance, resulting in improved productivity and efficiency. One of the major improvements has been achieved in the field of defect detection. Defect detection has been a daunting task for engineers due to high accuracy demands, periodic examinations, and immense examination areas. To deal with such challenges, automated defect detection systems have been developed using deep learning based approaches. They can reduce human labor and enhance accuracy and efficiency. In welding inspection, automation based on deep learning algorithms increases the reliability and reproducibility of the task. Furthermore, it speeds up the process and decreases labor costs and human errors.
The automated defect detection system employs acquisition equipment to obtain images that are used as the defect detection input. This acquisition equipment contains various measurement devices such as RGB cameras, depth cameras, and ultrasonic devices. The RGB camera is the most widely used owing to its similarity to human visual inspection. Furthermore, RGB camera-based systems achieve high accuracy and provide an intuitive understanding of images during the process [1], [2].
After the acquisition of RGB images, the defect detection algorithm highlights the defective areas within the images. Such algorithms are classified into image-wise and pixel-wise methods. The image-wise defect detection method determines the existence of a defect within the entire image [3], while the pixel-wise defect detection method determines the specific defect locations at the pixel level [4], [5]. The advantage of the image-wise defect detection method is the reduced network size. The latent-to-image decoder is not necessary, so designing the network is less complex than for the pixel-wise method. On the other hand, pixel-wise defect detection indicates not only the presence of a defect but also its location. Knowing the location of the defect can help optimize the process of the industrial field line. In addition, the pixel-wise defect detection result is an important factor in evaluating the quality of the product. In this light, the pixel-wise method is generally preferred in industrial defect detection problems, where the location of product defects is required.
However, to utilize the pixel-wise technique at the industrial level, the following three issues need to be resolved. Firstly, pixel-wise networks tend to be large, which increases inference time and limits real-time detection. Secondly, spatial information needs to be preserved and employed throughout the network architecture. Lastly, to achieve a higher prediction score in a harsh industrial field, which is difficult to attain with static images, time-sequential information should also be utilized.
A recent line of work has attempted to deal with such issues. Networks considering spatial or spatio-temporal information within images have been developed [6], [7]. Furthermore, efforts to reduce the network size while maintaining high performance have also been made [8]. Most of these studies, however, implemented only unsupervised learning architectures [9], [10] and did not target defect detection. Methods based on unsupervised learning typically provide lower performance than supervised learning algorithms.
In this paper, we propose the bidirectional convolutional recurrent reconstructive network (bi-CRRN) for real-time pixel-wise defect detection, which utilizes spatio-temporal information in videos. Three major contributions are presented here. Firstly, we develop automatic defect detection equipment to obtain videos as input and detect welding defects. The equipment also includes a setup for acquiring the training data manually. Secondly, we design the bi-CRRN algorithm to utilize the spatio-temporal information from the relationship between input images in both forward and backward directions by adopting the bi-directional LSTM [11] structure. Finally, we compare the performance of the proposed bi-CRRN with recent defect detection algorithms on the acquired welding datasets in terms of accuracy at both the frame and pixel levels, along with computation time. The proposed bi-CRRN outperforms them in both defect detection accuracy and computation speed.
This paper is organized as follows. In Section II, we describe related works, including automated defect detection systems and spatio-temporal networks. Section III briefly reviews the mechanism of CRRN and presents the proposed bi-CRRN, followed by the experimental validation in Section IV. Finally, concluding remarks are given in Section V.

II. RELATED WORKS
A. AUTOMATED DEFECT DETECTION SYSTEMS
At industrial sites, the presence of defects in manufactured products can cause several losses, such as degradation in production quality, exposure to dangerous materials, and even catastrophic accidents. Various studies on defect detection have been conducted to prevent such losses. Traditional methods for defect detection are manual inspections by highly trained human experts. These methods, however, require high labor costs and are highly prone to human errors due to inattention [12]. Thus, automation of defect detection has been widely studied to reduce these errors and operation costs.
An automated defect detection system requires various sensory equipment such as vision cameras, ultrasonic sensors, and radar sensors. The vision-camera-based imaging system is widely used due to its high performance and similarity to human visual inspection [13]-[15]. However, traditional vision-based defect detection algorithms suffer from performance robustness issues. Their performance is highly volatile and can be easily affected by small changes in image features such as illumination, scale variation, or object shape.

B. DEEP LEARNING BASED DEFECT DETECTION
Recent studies attempt to overcome the difficulties mentioned above by developing various networks with diverse characteristics. The most basic approach for image processing is to utilize a CNN [16], owing to its high computational efficiency and preservation of spatial information. Semantic segmentation networks classify image pixels into object classes, but they are excessively bulky. Class activation mapping (CAM) based on a CNN [17] provides spatial reasoning for classification results. However, it does not consider the spatio-temporal relationship between input images. Networks for video inputs need to consider the spatio-temporal characteristics of their inputs for higher accuracy and improved efficiency.
In recent years, several other machine learning techniques have been applied to many industrial applications for robust performance [18]-[20]. In the case of railways, an automated rail defect detection system was studied using a deep convolutional neural network (DCNN) [21]. Cha et al. [22] cropped building surface images into patches and detected defects with the help of a CNN. Hu et al. [23] implemented defect detection on radiography images using bilinear class activation maps (Bi-CAM) and attention mechanisms. Kang et al. [24] performed high-speed railway insulator defect detection using Faster R-CNN [25].
Likewise, automated visual inspection with machine learning techniques has also been applied to welding defect detection. Welding is an essential and commonly used technique in various mechanical industrial fields, including automobiles, aerospace, and shipbuilding. Hence, there have been diverse studies on the automation of welding defect detection. Lee et al. [26] showed that an artificial neural network (ANN) offered better prediction performance than multiple regression analysis for back-bead prediction in gas metal arc welding (GMAW). Feng et al. [27] utilized an ensemble model incorporating multiple object detection networks for gas tungsten arc welding (GTAW) defect detection. Sassi et al. [28] monitored welding defects in fuel injectors using transfer learning.

C. NETWORKS CONSIDERING SPATIO-TEMPORAL INFORMATION
Further research has focused on the development of spatio-temporal pixel-wise networks. The Convolutional LSTM (ConvLSTM) network [6] preserves spatial information and considers the relationship between input images by applying convolutional operators to an LSTM-based structure. Spatio-Temporal LSTM (ST-LSTM) [7] is designed to facilitate the flow of spatio-temporal information by adding a spatio-temporal memory cell. In addition to the spatio-temporal memory, in which the memory cell is updated in the time domain, ST-LSTM also adds a memory structure that is updated vertically for each layer within the same time step. Thus, it requires twice as many parameters as the ConvLSTM. The Convolutional Recurrent Reconstructive Network (CRRN) [8] simplifies the network and reduces the number of network parameters while maintaining performance similar to ST-LSTM. CRRN is utilized as an anomaly detection algorithm based on unsupervised learning. In the case of the industrial defect detection problem, however, supervised learning models tend to be more accurate than unsupervised ones.

III. PROPOSED APPROACH
A. WELDING DEFECT DETECTION FRAMEWORK
The proposed welding defect detection framework is shown in Fig. 1 and is divided into three phases: 1) automatic welding and obtaining input videos simultaneously; 2) defect detection and localization at the pixel level by applying a deep learning network; 3) defect detection at the frame level on the basis of the pixel-level detection.
Firstly, while the automatic welding machine proceeds with automated welding, the welding bead is captured by the automatic welding defect detection system installed behind the welding machine. To improve the processing speed of the network, images are resized to smaller dimensions. Secondly, the deep learning network detects and localizes defects at the pixel level. For enhanced defect detection performance, we design two bi-CRRN models. Lastly, the input images are classified into defective and normal classes. If an input image is classified as defective, a speaker attached to the equipment generates an alarm signal to notify the operator about the possible defects. The predicted outputs with possible defects are saved automatically to a computer together with the corresponding input images.

B. CRRN
The automatic welding defect detection is performed based on an RGB vision camera. Since images are continuously captured over time, they contain not only spatial information of the welding bead but also temporal information. Therefore, we adopt CRRN [8] as the basic architecture, which is a convolutional recurrent autoencoder based on the Convolutional Spatio-Temporal Memory (CSTM) for anomaly detection in spatio-temporal data. CRRN is an encoder-decoder combination consisting of a spatial encoder (S-Encoder), a spatio-temporal encoder-decoder (ST-Encoder-Decoder), and a spatial decoder (S-Decoder). The S-Encoder extracts spatial features from the input, and the ST-Encoder extracts spatio-temporal features from a sequence of the spatial features. In a similar manner, the ST-Decoder decodes spatial features at each time step, and the S-Decoder generates reconstructed outputs. These outputs are used to measure the reconstruction error between the input and the reconstructed output, from which it is determined whether the input is normal or abnormal.
To extract spatio-temporal patterns efficiently in CRRN, the CSTM is designed to deliver spatial information to other CSTMs without increasing the number of parameters, as shown in Fig. 2. In the CSTM, the cell states of the previous time step and the previous layer are first concatenated in a channel-wise manner. Then, the cell is updated by restoring the original number of channels through a one-by-one (1×1) convolution. Therefore, fewer parameters are used compared to ST-LSTM [7], while both spatial and temporal patterns can still be extracted; the detailed CSTM update equations are given in [8].
In bi-CRRN, the S-Encoder extracts the spatial feature of the image. Then, the spatial and temporal information is processed by the ST-Encoder, which is composed of CSTM modules. If the ST-Decoder is used as a component of the network, it likewise exploits spatial and temporal patterns. Finally, the processed information is passed to the S-Decoder, which predicts the final output.
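For illustration, the channel-fusion step at the core of the CSTM described above can be sketched in PyTorch as follows. This is a minimal sketch covering only the channel-wise concatenation and the 1×1 convolution, not the full gating of [8]; all class and variable names are ours.

import torch
import torch.nn as nn

class CSTMCellFusion(nn.Module):
    # Sketch: the cell states from the previous time step and the
    # previous layer are concatenated channel-wise, then a 1x1
    # convolution restores the original channel count.
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, cell_prev_time, cell_prev_layer):
        # Both inputs: (batch, channels, H, W)
        merged = torch.cat([cell_prev_time, cell_prev_layer], dim=1)
        return self.fuse(merged)  # (batch, channels, H, W)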
To encode and decode the sequential information, only the temporal forward direction is considered in CRRN. Yet, the detection performance could be enhanced if both directions of the sequential information were considered. Taking advantage of this presumption, we design two kinds of bi-CRRN, bi-CRRN-E and bi-CRRN-ED, both capable of processing forward and backward time sequences. To avoid degrading the defect detection performance, bi-CRRN is trained within a supervised learning framework, since supervised learning methods tend to be more accurate than unsupervised ones in general. The labeled data, which serves as the target in supervised learning, is prepared by a welding expert at the pixel level.

1) bi-CRRN-E
Fig. 3 shows bi-CRRN-E, which is composed of an S-Encoder, an ST-Encoder, and an S-Decoder. The S-Encoder extracts the spatial feature $\hat{X}_t \in \mathbb{R}^{N_c \times N_h \times N_w}$ from the original image $X_t$, where $N_c$, $N_h$, and $N_w$ are the number of channels, height, and width of $\hat{X}_t$ ($X_t$), respectively. Then, these spatial features are passed through the ST-Encoder layers, each of which generates CSTM hidden states. We calculate the final ST-Encoder output from the top-layer ST-Encoder outputs in the forward and backward directions, $\overrightarrow{E}^L_t$ and $\overleftarrow{E}^L_t$:
$$\hat{X}^{out}_t = W_{1\times 1} \ast \left[\overrightarrow{E}^L_t ; \overleftarrow{E}^L_t\right],$$
where $W_{1\times 1} \in \mathbb{R}^{N_c \times 2N_c}$ represents the weight matrix of the one-by-one convolutional operation and $[\,\cdot\, ; \,\cdot\,]$ denotes channel-wise concatenation. With the spatio-temporal feature $\hat{X}^{out}_t \in \mathbb{R}^{N^{out}_c \times N^{out}_h \times N^{out}_w}$, the S-Decoder generates the predicted output $\tilde{X}_t \in \mathbb{R}^{N_c \times N_h \times N_w}$.
The main advantage of the bi-CRRN-E network is its reduced computation time owing to the simpler network architecture. Therefore, it is well suited for adoption in industrial applications. The absence of the ST-Decoder, however, may result in low defect detection performance due to the lack of temporal decoding. To mitigate this issue, bi-CRRN-E is designed as a fully connected CSTM architecture, in which each pair of forward and backward CSTM layers at time t is fully connected.
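To make the bidirectional data flow concrete, the following is a minimal sketch of a forward and backward pass over a window of spatial features, with the top-layer outputs fused by a 1×1 convolution. The stand-in cell, function names, and tensor sizes are illustrative only and do not reproduce the authors' implementation.

import torch
import torch.nn as nn

def bidirectional_encode(spatial_feats, fwd_cell, bwd_cell, fuse_1x1):
    # spatial_feats: list of T tensors (B, C, H, W) from the S-Encoder.
    # fwd_cell / bwd_cell: recurrent cells mapping (input, hidden) -> hidden.
    # fuse_1x1: nn.Conv2d(2C, C, kernel_size=1).
    T = len(spatial_feats)
    B, C, H, W = spatial_feats[0].shape
    h_fwd, h_bwd = [None] * T, [None] * T

    h = torch.zeros(B, C, H, W)
    for t in range(T):                     # forward time direction
        h = fwd_cell(spatial_feats[t], h)
        h_fwd[t] = h

    h = torch.zeros(B, C, H, W)
    for t in reversed(range(T)):           # backward time direction
        h = bwd_cell(spatial_feats[t], h)
        h_bwd[t] = h

    # Channel-wise concatenation + 1x1 convolution per time step
    return [fuse_1x1(torch.cat([h_fwd[t], h_bwd[t]], dim=1)) for t in range(T)]

# Trivial stand-in cell for demonstration (a real model would use a CSTM):
cell = lambda x, h: torch.tanh(x + h)
fuse = nn.Conv2d(128, 64, kernel_size=1)
feats = [torch.randn(1, 64, 90, 160) for _ in range(10)]
outputs = bidirectional_encode(feats, cell, cell, fuse)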

2) bi-CRRN-ED
We denote the forward and backward hidden states of the ST-Decoder at the $l$-th layer and time frame $t$ by $\overrightarrow{D}^l_t$ and $\overleftarrow{D}^l_t$, respectively. The hidden states of the ST-Decoder are updated through the CSTM in the forward and backward temporal directions:
$$\overrightarrow{D}^l_t = \mathrm{CSTM}\!\left(\overrightarrow{D}^{l-1}_t, \overrightarrow{D}^l_{t-1}\right), \qquad \overleftarrow{D}^l_t = \mathrm{CSTM}\!\left(\overleftarrow{D}^{l-1}_t, \overleftarrow{D}^l_{t+1}\right).$$
The hidden states of the top ST-Decoder layer contain forward and backward sequential information. The predicted output is formulated as
$$\tilde{X}_t = W_{1\times 1} \ast \left[\overrightarrow{D}^L_t ; \overleftarrow{D}^L_t\right],$$
where the top-layer hidden states $\overrightarrow{D}^L_t$ and $\overleftarrow{D}^L_t$ are concatenated in a channel-wise manner.
In addition, spatio-temporal attention (ST-Attention) is used to further improve the long-term dependency. Denoting the hidden state of the ST-Encoder by $E_t$, the ST-Attention map is calculated as
$$A_t = W_A \ast E_t,$$
where $W_A \in \mathbb{R}^{1 \times N_c \times N_k \times N_k}$ is the weight matrix of the convolutional operation and $N_k$ denotes the kernel size of the convolution. $A_t$ is replicated in a channel-wise manner to match the number of channels of the hidden state $E_t$. The replicated ST-Attention map is added to the ST-Encoder hidden state and subtracted from the ST-Decoder hidden state, so that $A_t$ acts as a shortcut path between the encoder and the decoder.
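A minimal sketch of this shortcut path follows. The sigmoid squashing and the padding choice are our assumptions; the paper specifies only the convolution $W_A$, the channel-wise replication, and the add/subtract shortcut.

import torch
import torch.nn as nn

class STAttention(nn.Module):
    # Sketch: a single-channel attention map A_t is computed from the
    # ST-Encoder hidden state E_t, replicated channel-wise, added to the
    # encoder hidden state, and subtracted from the decoder hidden state.
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        # W_A in R^{1 x Nc x Nk x Nk}: Nc input channels -> 1 output channel
        self.conv = nn.Conv2d(channels, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, enc_hidden, dec_hidden):
        a_t = torch.sigmoid(self.conv(enc_hidden))  # (B, 1, H, W); sigmoid assumed
        a_t = a_t.expand_as(enc_hidden)             # channel-wise replication
        return enc_hidden + a_t, dec_hidden - a_t   # shortcut between enc/dec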

3) Supervised bi-CRRN
The two designed bi-CRRNs are optimized within a supervised learning framework owing to the high performance demands of industrial applications. In contrast, the original CRRN was developed for unsupervised anomaly detection.
In the case of unsupervised learning, the network can be trained with a normal-image dataset only. Therefore, it does not require a dataset collected from anomalous welding targets.
On the other hand, supervised learning models tend to be more accurate than unsupervised learning models. Thus, we design the bi-CRRN in a supervised learning framework with labeled binary images as ground truth. For the designed network, the loss value is calculated by comparing the output with the ground truth, where the pixel-wise binary cross entropy (BCE) loss function is used in the learning process. The whole process is summarized in Algorithm 1. Note that $D_{tr}$ and $D_{te}$ stand for the training and validation datasets, respectively. Also, $G_\theta$ is a bi-CRRN model parameterized by $\theta$.

Algorithm 1 bi-CRRN training and validation algorithm
Input: Image datasets $D_{tr}$, $D_{te}$
Output: An optimized bi-CRRN model $G_{\theta^*}$ trained on $D_{tr}$
Phase 1 - Pre-processing
  Slice the images in $D_{tr}$ and $D_{te}$ with a $T$-sized window, since the network input consists of $T$ sequential images.
Phase 2 - Training
  for each mini-batch $(x, y)$ in $D_{tr}$ do
    Calculate the forward propagation $\hat{y} = G_\theta(x)$
    Compute the pixel-wise BCE loss between $\hat{y}$ and the ground truth $y$
    Update $\theta$ via back-propagation with the Adam optimizer
  end for
Phase 3 - Validation
  for each mini-batch $(x, y)$ in $D_{te}$ do
    Calculate the forward propagation $\hat{y} = G_\theta(x)$
    Calculate accuracy and $F_1$ score
  end for
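For illustration, Algorithm 1 maps naturally onto a standard PyTorch training loop. The sketch below assumes a model emitting per-pixel probabilities and data loaders yielding $T$-frame windows with binary masks; all names are ours, not the authors'.

import torch
import torch.nn as nn

def train_bi_crrn(model, train_loader, val_loader, epochs=150, lr=1e-4):
    criterion = nn.BCELoss()                      # pixel-wise BCE loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:                 # Phase 2: training
            y_hat = model(x)                      # forward propagation
            loss = criterion(y_hat, y)
            optimizer.zero_grad()
            loss.backward()                       # back-propagation
            optimizer.step()
        model.eval()
        with torch.no_grad():                     # Phase 3: validation
            for x, y in val_loader:
                pred = (model(x) > 0.5).float()   # binarize the output
                # accuracy and F1 are computed from pred vs. y here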

IV. EXPERIMENTS
A. EXPERIMENTAL SETUP
1) Hardware setup
Two types of equipment were designed and manufactured by DSEC, a marine engineering company in Korea. A manual training dataset acquisition equipment was manufactured as shown in Fig. 5(a). This equipment consists of a camera, an LED light, rollers moving along the welding bead, and a handle mounted on the top side. The operator pushes the handle to collect the welding images. Based on the 3D drawing (Fig. 5(b)), the automatic defect detection equipment (Fig. 5(c)) was manufactured and attached to an automatic welding machine. This equipment is composed of the same components as the manual acquisition equipment.
Since the welding bead that the proposed equipment should monitor is at least 10 meters long, the defect detection equipment is attached to the automatic welding machine rather than fixed in a specific place. This means that the welding process and the monitoring of the welding bead are carried out simultaneously.

2) Datasets
We captured video clips of the welding bead because of the harsh experimental environment. The welding bead is at least 10 meters long, and its height is not fixed. Also, the experimental environment vibrates due to the movement of the automatic welding machine. In this environment, it is difficult to apply a single-image-based defect detection network with static images. Therefore, a video-input-based defect detection network was applied to monitor the welding defects. The training dataset was obtained by the camera installed on the manual data acquisition equipment. In contrast, the validation dataset was obtained using the camera installed on the automatic defect detection equipment. Both cameras capture 20 frames per second, with each frame having 1,280 × 720 resolution. Considering the learning speed and storage capacity of our learning system, the image size was reduced to 160 × 90 resolution.
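As an illustration, this pre-processing can be sketched as follows, assuming OpenCV for resizing. The one-frame window step is our assumption; the paper specifies only the $T$-sized window itself.

import cv2  # OpenCV, assumed here for frame resizing

def preprocess_clip(frames, T=10, size=(160, 90)):
    # Resize each 1,280x720 frame to 160x90, then slice the clip into
    # windows of T sequential frames. cv2.resize takes (width, height).
    small = [cv2.resize(f, size) for f in frames]
    return [small[i:i + T] for i in range(len(small) - T + 1)]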
We generated the ground-truth images by labeling the actual locations of defects at the pixel level, under the guidance of a welding expert in the field. The annotated images of the dataset are shown in Fig. 6 (bottom row), where the green pixels indicate the defective region. Defect areas of various shapes can be seen. If there is no defect, the annotated image is the same as the input image. The entire dataset consists of 310 video clips, for a total of 75,420 frames. The dataset was divided into training and validation sets at a ratio of 80% to 20%.

3) Implementation details
In the network architecture, the S-Encoder and S-Decoder are composed of two convolutional layers with batch normalization and ReLU. The ST-Encoder and ST-Decoder consist of two CSTM layers. The kernel size was set to 5 × 5. The sequential image frames were obtained by slicing image sequences with a window of size 10, and thus the number of sequential image frames, T, was 10. The input and output layers were set to two channels, and the remaining layers were set to 64 channels.
For the network training, the number of epochs, batch size, and learning rate were set to 150, 20, and 0.0001, respectively. In addition, Adam [29] was used as the optimizer for back-propagation to learn the network parameters. The workstation was equipped with an Intel Core i9-9900K CPU, 32 GB of RAM, and four GTX 1080Ti GPUs.

B. EVALUATION METRICS
To test our proposed network, we used accuracy, precision, recall, and the $F_\beta$ score as the evaluation metrics. $tp$ (true positive) is the number of cases in which the network correctly detects an actual defect, $fp$ (false positive) is the number in which it misclassifies normal as defect, $tn$ (true negative) is the number in which it correctly detects normal as normal, and $fn$ (false negative) is the number in which it misclassifies defect as normal. From these counts, precision, recall, and the $F_\beta$ score are defined as
$$\mathrm{precision} = \frac{tp}{tp + fp}, \qquad \mathrm{recall} = \frac{tp}{tp + fn},$$
$$F_\beta = (1 + \beta^2)\,\frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}},$$
where the parameter $\beta$ determines the weight of recall in the score.
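For concreteness, the $F_\beta$ computation can be sketched as below; the counts in the example are purely illustrative.

def f_beta(tp, fp, fn, beta=1.0):
    # Metrics exactly as defined above; larger beta weights recall more.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Illustrative counts: tp=900, fp=100, fn=300 give precision 0.90,
# recall 0.75, an F1 of about 0.818, and an F2 of about 0.776.
print(f_beta(900, 100, 300), f_beta(900, 100, 300, beta=2.0))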

FIGURE 6: Examples of defect and non-defect image frames. The middle part of each frame with a comb pattern is the welding spot. The first column shows the non-defect frames, and the remaining columns show frames containing defects. The first row contains the original images, and the second row shows the labeled versions of the first row.

C. PIXEL-LEVEL PERFORMANCE EVALUATION
To test the performance of the setup, an experiment on real-time defect detection at the pixel level was carried out. We verified that the proposed bi-CRRN could successfully detect and localize the welding defects. The performance of the proposed bi-CRRN was compared with recent defect detection algorithms such as Mask R-CNN [30], U-Net [31], DeepLabv3 [32], 3D-CNN [33], ConvLSTM, and CRRN. The network architectures of 3D-CNN and ConvLSTM, which exploit spatial and temporal information, were implemented based on the CRRN architecture. In the case of the ConvLSTM network, the CSTMs of the ST-Encoder-Decoder were substituted with ConvLSTMs. For the 3D-CNN, an autoencoder architecture was implemented instead of the ST-Encoder-Decoder. We denote CRRN and bi-CRRN-ED with the ST-Attention mechanism as CRRN w/attn and bi-CRRN-ED w/attn, respectively. Table 1 reports that Mask R-CNN, U-Net, and DeepLabv3, which consider only spatial information without recurrent connections, present low $F_1$ scores. Since the dataset has a temporal property, ConvLSTM, CRRN, and bi-CRRN, which process both spatial and temporal information, show better performance, with the exception of 3D-CNN. The bi-CRRN-ED, which handles the correlations across the entire temporal sequence, presents the best $F_1$ score. Even though bi-CRRN-E is designed without the ST-Decoder, it performs better than CRRN, mainly due to the bidirectional memory connection and the fully connected memory cells at each time t. Also, the models with the ST-Attention mechanism, which strengthens the long-term dependency, provide better detection performance than those without it.
Additionally, we compared the precision-recall curves of the proposed bi-CRRN and the other networks by sweeping over decision thresholds. Output pixels are determined to be defective when their values are greater than the decision threshold. As shown in Fig. 7, bi-CRRN-ED w/attn outperforms the other deep learning models. The qualitative comparison is summarized in Fig. 8, which shows the input images along the time axis together with the pixel-level defect detection results. Pixels determined to be defective are shown in white, whereas the rest are shown in black. The results of 3D-CNN, ConvLSTM, and CRRN contain some false positives among the non-defective pixels. In contrast, the proposed bi-CRRN-E and bi-CRRN-ED present more accurate defect detection results.
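A minimal sketch of this threshold sweep, with illustrative names, is given below.

import numpy as np

def pr_curve(scores, labels, thresholds=np.linspace(0.0, 1.0, 101)):
    # scores: per-pixel network outputs in [0, 1]; labels: binary ground
    # truth of the same shape. Pixels above the decision threshold are
    # counted as defective, as in the evaluation described above.
    points = []
    for th in thresholds:
        pred = scores > th
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp + fp == 0 or tp + fn == 0:
            continue  # skip thresholds with undefined precision/recall
        points.append((tp / (tp + fp), tp / (tp + fn)))
    return points  # list of (precision, recall) pairs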
Meanwhile, we investigated imbalance ratios at the pixel level to address the issues related to the imbalanced dataset [34]. Random masks were generated to adjust the imbalance ratio: normal pixels were masked to verify how this imbalance affects the defect detection performance. We compared the performance of bi-CRRN-ED w/attn by sweeping the imbalance ratio from 3:1 to 30:1. As shown in Table 3, since the imbalance ratio of the validation dataset is about 30:1, the proposed network reports the best performance at an imbalance ratio of 30:1.
In the industrial field, recall is often more important than precision because a missed fault can lead to significant losses. Table 2 shows the $F_\beta$ scores for pixel-level welding defect detection. The $F_\beta$ scores of bi-CRRN are higher than those of the other models, indicating that the proposed model attains higher recall.

D. FRAME-LEVEL PERFORMANCE EVALUATION
The frame-level defect detection was also performed based on the results of the bi-CRRN pixel-level defect detection.
The operator can immediately recognize whether the welding bead is defective through frame-level defect detection. The frame-level defect detection was calculated from the sequential image frames, where a $T$-sized window was employed. The image group is classified as defective if the sum of all the defective pixels from the pixel-level output is larger than a threshold value:
$$\sum_{t=1}^{T} \sum_{i=1}^{n} p_{t,i} > \theta_{thres},$$
where $n$ is the number of pixels in one image frame, $p_{t,i}$ is the pixel-level binary detection value of pixel $i$ in frame $t$, and $\theta_{thres}$ controls the sensitivity of the defect detection decision making. In this experiment, $n$ and $\theta_{thres}$ were set to 14,400 and 1,000, respectively. Note that the number of sequential image frames, $T$, was set to 10.
Fig. 9 shows the performance of defect detection at the frame level. The frame-level defect detection using the proposed bi-CRRN-ED w/attn provides the best accuracy compared to the other networks, as reported in Table 4. Similar to the pixel-level experimental results, the models with the ST-Attention mechanism show better performance than those without it. Table 4 also shows the computation time for the individual networks. By reducing the number of parameters, CRRN and CRRN w/attn take computation time similar to the ConvLSTM. On the other hand, since our proposed bi-CRRN-E network is designed without the ST-Decoder, the simplicity of the network architecture further reduces the computation time. Thus, this network can be used in fields that prioritize fast computation over the highest accuracy. Although bi-CRRN-ED and bi-CRRN-ED w/attn require more computation time than the other algorithms, these defect detection networks can still be used at industrial sites owing to their outstanding performance.
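A minimal sketch of this decision rule follows; names and shapes are illustrative.

import numpy as np

def is_defective_window(pixel_preds, theta_thres=1000):
    # pixel_preds: binary pixel-level outputs of shape (T, H, W), e.g.
    # (10, 90, 160) so that n = 14,400 pixels per frame as above. The
    # window is flagged as defective when the total defective-pixel
    # count across all T frames exceeds theta_thres.
    return int(np.sum(pixel_preds)) > theta_thres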

V. CONCLUSION
In this paper, we proposed a novel deep learning network, bi-CRRN, for spatio-temporal defect detection. We focused on two industrial demands: high detection accuracy and a lightweight architecture for reduced computation time. Thus, we designed two kinds of bi-CRRN architecture. Firstly, the bi-CRRN-E network was designed to reduce the computation time. To maintain the defect detection performance, each memory cell is fully connected, considering both forward and backward time sequences. The other network, bi-CRRN-ED, was designed to achieve high prediction performance. The efficiency of the designed networks was tested on a custom dataset collected with hardware equipment developed exclusively for this purpose. The experimental results confirmed that both bi-CRRN-E and bi-CRRN-ED demonstrated higher accuracy at the pixel level as well as the frame level. Also, the computation time was verified through the experiments for practical applications in industrial fields. The proposed network can be applied to other defect detection environments with video as input. When the network is applied to building or plant surface crack detection, where collecting static images is difficult, higher performance can be expected than with single-image-based defect detection models.