A Deep Learning-Based Fine Crack Segmentation Network on Full-Scale Steel Bridge Images With Complicated Backgrounds

Automatic defect detection of steel infrastructure in structural health monitoring (SHM) remains challenging because of complicated backgrounds, non-uniform illumination, irregular crack shapes, and interference in images. Conventional defect detection relies mainly on manual inspection, which is time-consuming and error-prone. In this study, a deep learning-based fine crack segmentation network, termed FCS-Net, is proposed based on ResNet-50 and the fully convolutional network (FCN), with structural modifications including Batch Normalization (BN) and Atrous Spatial Pyramid Pooling (ASPP). On full-scale steel girder images with complicated backgrounds and fine foregrounds, the proposed FCS-Net achieves an MIoU of 0.7408, outperforming benchmark algorithms such as LinkNet, DeepLab V3, and CrackSegNet. Moreover, ablation experiments were performed that justify the contribution and necessity of each modification.


I. INTRODUCTION
After an infrastructure is constructed and put into service, its quality is gradually challenged by problems such as erosion [1] and damage resulting from external forces and natural factors [2], causing hidden dangers [3]. Detecting defects in civil infrastructure requires regular inspections that often rely on human labor [2], [4]-[8]; such inspections are susceptible to strong subjectivity, in addition to the drawbacks of low accuracy, high labor cost, and dangerous working environments. With the rapid development of computer technology, methods for infrastructure defect detection based on computer vision and image processing are gradually emerging and stimulating continuous research [1], [6], [9]-[13]. Existing machine learning-based algorithms achieve reasonable recognition performance on simple crack images but require intervention and empirical judgment by experts [4], [14]-[19]. Moreover, underground infrastructure presents harsh working conditions for electronic devices, such as high temperature, cold weather, and high humidity, and images collected on site may suffer from uneven illumination, which introduces a significant amount of noise on surfaces with complex textures. These factors make it difficult to achieve higher defect recognition accuracy and to meet industrial requirements for field application. Recently, deep learning has come to represent the state of the art in artificial intelligence, achieving great success in areas such as image recognition, text translation, and natural language processing. Conventional machine learning-based image detection and recognition methods have gradually been replaced by more intelligent and effective deep learning algorithms [20]-[23].
Deep learning is a further development of the artificial neural network. By pre-training the neural network layer by layer, the feature expression of different levels can be learned, and the feature expression of each layer is obtained through the previous expression propagation, then all the layers are combined to form a deep convolutional neural network. Compared with the conventional machine-learning and image classification algorithms, deep convolutional neural networks have demonstrated superior performance in parameter prediction and image classification.
In this paper, atrous spatial pyramid pooling (ASPP) and batch normalization (BN) modules are used in collaboration with the original ResNet-50. Meanwhile, various loss functions are analyzed to determine the best fit for the model so that segmentation performance can be improved. The remainder of this paper is organized as follows. Section 2 reviews previous techniques for crack segmentation. Section 3 elucidates the specific architecture of the improved FCS-Net and the loss function of the network. Section 4 explains the generation of the datasets, the quantitative metrics used to evaluate the prediction results on the benchmark datasets, and the comparison of the proposed network with other methods. Section 5 presents the ablation experiments evaluating the improvement contributed by each module. Section 6 summarizes the study and discusses limitations requiring further improvement.

II. RELATED WORK
This section reviews studies related to deep learning-based segmentation of cracks in civil infrastructure. Segmentation frameworks enabled by deep learning can be roughly classified as two-step or one-step. In the two-step pipeline, an object detection network that localizes regions of interest (ROIs) is usually followed by digital image processing (DIP) algorithms that extract crack pixels. For example, a modified tubularity flow field (TuFF) was applied to segment cracks within bounding boxes proposed by a trained Faster R-CNN and achieved high performance on concrete structure images [24]. Although the DL-DIP framework runs quickly, rule-based DIP methods are applicable only to images with relatively clean backgrounds and may generate noise when the ROI contains interference. The double-DL framework was proposed to fill this gap by replacing the DIP post-processing with a segmentation network, such as the sequential use of Faster R-CNN and U-Net [25]. However, it is labor-intensive to train two deep networks separately and time-consuming to conduct the object detection and crack segmentation tasks individually.
Hence, efforts have been made toward one-step segmentation networks, which are normally enabled by FCN and implicitly integrate the detection and segmentation tasks into one shot. One initial trial used an eight-layer CNN to localize crack regions with small bounding patches that approximate segmentation results in large-scale images [6]. Pixel-level detection of pavement cracks was realized by CrackDet, a five-layer network that has become a benchmark study in crack segmentation [26]. With the development of computing facilities and progress in DL algorithms, more studies have been initiated to update these baselines. For instance, an FCN-based structure surpassed CrackNet in segmentation precision [10], SDD-Net was designed for real-time crack segmentation [27], and CrackNet was modified into CrackNet II for faster segmentation [28]. However, the above studies targeted pavement or concrete surfaces, which are less contaminated by noise such as the handwriting and welding joints found on steel structures. Moreover, cracks in steel materials are more challenging to capture because of a more extreme foreground-to-background ratio, especially in large-resolution photographs. A restricted Boltzmann machine was applied to locate cracks in steel infrastructure with high accuracy [29]. A deep fusion CNN was proposed to segment fine cracks in steel girders and achieved satisfactory performance [30], although its sliding-window scanning with 65 × 65 sub-patches may slow detection in a full-scale image with a resolution of nearly 5000 × 4000. In this study, a one-step framework is proposed based on a CNN specialized for fine crack segmentation in 512 × 512 patches and tested on full-scale images.

III. PROPOSED METHOD

A. OVERALL WORKFLOW
The proposed end-to-end method contains a pre-trained CNN model as its core component to perform crack segmentation on steel girder images. The method generates the predicted crack mask from the original image through a series of processing steps, with the foreground (crack) marked as zero and the background as one. Fig. 1 demonstrates the workflow of the proposed method, which can be summarized as follows.
(1) The original full-scale image I, with a size of 4928 × 3264 or 5152 × 3864, is resized to image II so that its width and height are multiples of 512, giving a specific size of 4608 × 3072 or 5120 × 3584; (2) image II is cropped into a batch of 512 × 512 images III; (3) the trained FCS-Net segmentation model takes the image batch III as input and produces the predicted mask batch IV, with crack pixels labeled 1 (white) and background labeled 0 (black); (4) the mask batch IV is merged and color-inverted to generate mask V; (5) mask V is resized back to the size of the original full-scale image to produce mask VI.
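The tiling and reassembly in steps (2) and (4) can be sketched as follows. This is a minimal NumPy illustration under the assumption of an exact 512 × 512 grid; the function names are illustrative and not taken from the released code.

```python
import numpy as np

def crop_to_patches(image, patch=512):
    """Split an H x W image (H and W multiples of `patch`) into a
    row-major list of square patches, returning the grid shape too."""
    h, w = image.shape[:2]
    assert h % patch == 0 and w % patch == 0, "resize to multiples of 512 first"
    rows, cols = h // patch, w // patch
    patches = [image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
               for r in range(rows) for c in range(cols)]
    return patches, (rows, cols)

def merge_patches(patches, grid, patch=512):
    """Reassemble predicted masks into the resized full-scale mask."""
    rows, cols = grid
    out = np.zeros((rows * patch, cols * patch), dtype=patches[0].dtype)
    for i, p in enumerate(patches):
        r, c = divmod(i, cols)  # same row-major order as crop_to_patches
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = p
    return out
```

Cropping a 5120 × 3584 image II this way yields a 7 × 10 grid of 70 patches, which are predicted individually and merged back losslessly.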
FCS-Net is an improved semantic segmentation network based on FCN [31] and inspired by PSPNet [32] and U-Net [33]. The core idea of FCS-Net is that introducing more global information into the segmentation layer improves recognition accuracy. The main structure of FCS-Net can be broadly divided into three parts: the ResNet-50 backbone [34] with Batch Normalization [35], Atrous Spatial Pyramid Pooling [36]-[38], and the FCN output layer. Batch normalization helps retain more feature details in the ResNet module, while the pyramid pooling module extracts deep and shallow image features separately. The features are then fused to reduce the probability of false segmentation, while dilated convolution enlarges the receptive field. Compared with the original PSPNet and FCN, these architectural modifications enhance feature extraction and increase segmentation accuracy.

B. FCS-NET ARCHITECTURE
First, a feature map is extracted by the pre-trained residual network, and the pyramid pooling module transforms it into a smaller map carrying global information. After up-sampling, the smaller map is restored to the size of the feature map and combined with the pre-pooling features; the final output is obtained after the last fully convolutional module. The following introduces the individual feature extraction modules and explains their incorporation into the overall model.
Residual Neural Network [34], referred to as ResNet, was proposed to solve the problem of decreased training-set accuracy in deeper networks as layers are added, a phenomenon unrelated to overfitting, since it neither produces extremely high model accuracy nor stems from the vanishing gradients that halt the training process. The basic unit of ResNet is the residual block. Compared with a conventional plain network, residual networks add skip connections between every two layers, forming residual blocks so that later layers can learn residuals directly from earlier ones. This structure allows a very deep residual network to be formed without the accuracy degradation otherwise observed during training. There are two mappings in ResNet: the identity mapping, which refers to the input data x itself and is represented as a curve in the figure, and the residual mapping, which refers to the rest of the network. Because the network structure contains skip connections, gradients do not vanish, the network trains normally, the convolutional network can be deeper, and the training error rate does not increase with depth.
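The two mappings described above combine as y = F(x) + x. The tiny sketch below illustrates the idea with NumPy in place of real convolutional layers (an assumption for clarity, not the network's actual implementation): when the optimal mapping is close to the identity, the block only needs to drive the residual F(x) toward zero, which is easier than forcing a stack of layers to fit the identity directly.

```python
import numpy as np

def residual_block(x, residual_fn):
    """y = F(x) + x: the skip connection adds the input (identity
    mapping) to the learned residual mapping F(x)."""
    return residual_fn(x) + x

x = np.array([1.0, 2.0, 3.0])

# If F(x) -> 0, the block passes the input through unchanged.
y = residual_block(x, lambda v: np.zeros_like(v))

# A non-trivial residual simply adds a correction on top of x.
z = residual_block(x, lambda v: 2.0 * v)
```

During backpropagation, the gradient of the sum flows through the identity path untouched, which is why depth no longer degrades training.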
The Pyramid Pooling Module is inspired by the success of the R-CNN Spatial Pyramid Pooling method, which shows that regions of any scale can be classified accurately and effectively by resampling convolutional features extracted at a single scale. The Spatial Pyramid Pooling Network (SPP-Net) is an algorithm proposed by He et al. [36] to address the repetitive computation in the R-CNN architecture. It adds a spatial pyramid pooling structure between the convolutional layers and the fully connected layer, replacing the R-CNN procedure in which candidate regions were clipped and scaled to a uniform size before being fed to the convolutional neural network. The spatial pyramid pooling structure effectively avoids the incomplete clipping and shape distortion caused by the R-CNN algorithm; more importantly, it eliminates repetitive feature extraction by the convolutional neural network, greatly accelerates the generation of candidate regions, and reduces the total amount of computation.
To avoid the large number of crack-like interferences in complex backgrounds and to extract target features more accurately, a network with deeper layers is adopted. However, the intensive feature accumulation of traditional convolution kernels can cause receptive fields to overlap, which increases the complexity of semantic information and wastes computation [38]. A balance is therefore needed between a large receptive field and maintaining the resolution of the feature map. The dilated convolution technique uses an atrous kernel that has weights only at certain positions, with the remaining positions filled with zeros [40]. By geometrically increasing the dilation rate over consecutive convolution layers, the receptive field can be extended while coverage is preserved. Compared with conventional convolution, the atrous kernel appears 'dilated', extracting features sparsely yet effectively, which reduces the complexity of semantic information and improves the accuracy of image segmentation.
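The receptive-field growth from geometrically increasing dilation rates can be computed directly: each stride-1 layer with kernel size k and dilation d widens the receptive field by (k − 1) · d pixels. A short sketch (the specific rate schedule below is illustrative, not the paper's exact configuration):

```python
def receptive_field(kernel=3, dilations=(1, 2, 4, 8)):
    """Receptive field of stacked stride-1 convolutions: each layer
    with dilation d adds (kernel - 1) * d pixels to the field."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# Four plain 3x3 layers cover 9 pixels; the same four layers with
# dilations 1, 2, 4, 8 cover 31 pixels at identical parameter cost.
plain = receptive_field(3, (1, 1, 1, 1))
dilated = receptive_field(3, (1, 2, 4, 8))
```

This is why dilation widens context without the resolution loss of extra pooling: the parameter count and feature-map size stay fixed while coverage grows geometrically.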
In ASPP, parallel dilated convolutions with different rates are applied to the input feature map and their outputs are fused. Since objects of the same kind may appear at different proportions in an image, ASPP accounts for these different scales, which improves accuracy. ASPP is a further development of dilated convolution that combines it with the Spatial Pyramid Pooling (SPP) module. By applying dilated convolutions with different sampling rates to the first feature map and adding an additional Batch Normalization layer, the model aggregates contextual information from different regions to improve its ability to capture global information [40], while collecting more feature details under the same receptive field; the segmentation accuracy is thereby improved.
Batch normalization (BN) [35] is widely used across deep learning. It addresses the convergence problems of the modified network during training: when the statistical distribution of the feature vectors changes, a covariate shift occurs [35]. Batch normalization also effectively prevents vanishing gradients and the neuron inactivation problem [35]. The BN layer is deployed in the network in the same way as the convolution and pooling layers. The BN transform is

y = γ (x − E[x]) / √(Var[x] + ε) + β    (1)

where x is the feature vector to be normalized; E[x] is the mean; Var[x] is the variance; γ and β are the scaling coefficient and shift parameter; and ε is a small value that prevents the denominator from vanishing to zero.
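Eq. (1) can be implemented in a few lines; a minimal NumPy sketch of the normalization itself (omitting the running statistics that a real BN layer maintains at inference time):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """y = gamma * (x - E[x]) / sqrt(Var[x] + eps) + beta, per Eq. (1).
    Statistics are computed per feature over the batch axis."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

After this transform each feature has approximately zero mean and unit variance across the batch, which keeps the distribution seen by the next layer stable and lets γ and β restore whatever scale and shift training finds useful.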
As shown in Fig. 2, after three convolutional layers with batch normalization and ReLU activation, the input features are passed through a max pooling layer before entering sixteen residual layers with different convolution specifications for feature extraction. This structure enables better feature extraction by deepening the network, while the skip connections in the residual network prevent vanishing and exploding gradients [34]. Atrous spatial pyramid pooling is then applied to obtain representations of different sub-regions [36] while expanding the receptive field through dilated convolution [1], [38], [40], [41]. Finally, dense pixel-wise prediction is obtained by feeding the feature maps into two convolutional layers.

C. DATASET PREPARATION
After processing the raw images, a training set of 9268 images of 512 × 512 pixels was established, comprising 8338 background images and 928 crack images, while the test set consists of 274 crack-containing images and 2160 background-only images of the same size. The 40 original full-size images before cropping were also added to the test set. Considering the significant imbalance between foreground (crack) and background samples, and that crack pixels account for only a small part of the images, background-only images are removed from the test set while all crack-containing images are retained. Among the training images, 10% are randomly selected as the validation set to validate the training of the model. This allows the network to learn object features more efficiently during training and converge faster, improving overall training efficiency. Fig. 3 illustrates the preparation and composition of the dataset, and Fig. 4 shows representative full-scale and crack images from the dataset.
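The 10% validation hold-out described above can be sketched with the standard library alone; the fixed seed here is an assumption added for reproducibility, not a detail from the paper:

```python
import random

def split_train_val(paths, val_ratio=0.10, seed=42):
    """Randomly hold out `val_ratio` of the training image paths as a
    validation set; the seed makes the split reproducible."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_ratio)
    return paths[n_val:], paths[:n_val]  # (train, val)
```

Shuffling before the split matters: if crack and background images are stored in separate directories, a sequential cut would give a validation set with a very different class mix from the training set.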

D. TRAINING SPECIFICATIONS
The deep learning software environment for the crack identification framework proposed in this paper comprises Windows 10, Python 3.7.4, and Keras 2.2.5; the hardware platform is an Intel i7-9800X CPU with 32 GB of memory and one NVIDIA RTX 2080Ti graphics card with 11 GB of video memory, with CUDA 10.0 and NVIDIA cuDNN 7.4.2 used for GPU acceleration. Because this graphics card is not designed for deep learning tasks of large data size, the network structure parameters of the proposed model were adjusted appropriately to avoid running out of memory. The annotated crack images are used to train the model with the backpropagation algorithm. Each deep learning model discussed in this study was trained for 2000 steps and 80 epochs with a batch size of one. The hyperparameters of the model are adjusted according to the validation loss reported after each epoch of training; in particular, the learning rate is adaptively reduced from 1E-4 to 1E-6 based on the validation loss, to achieve higher training efficiency and faster convergence. Table 1 summarizes the specific hyperparameters of the model.
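The adaptive learning-rate schedule (from 1E-4 down to 1E-6 on a validation-loss plateau) could be sketched as below. This is an assumed plateau rule written from scratch for illustration; in Keras the built-in `ReduceLROnPlateau` callback plays the equivalent role, and the `patience` and `factor` values here are illustrative, not the paper's settings.

```python
def adjust_lr(lr, val_losses, patience=3, factor=0.1, min_lr=1e-6):
    """Reduce the learning rate by `factor` when validation loss has not
    improved for `patience` epochs, clamped to `min_lr`."""
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        lr = max(lr * factor, min_lr)  # plateau detected: decay, but clamp
    return lr
```

Called once per epoch with the running history of validation losses, this steps the rate 1E-4 → 1E-5 → 1E-6 as the loss stops improving, matching the range reported above.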

A. EVALUATION METRICS
There are four kinds of outcomes in a pixel-wise identification task: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). In this case specifically, TP and TN refer to correctly identified crack and background pixels, respectively; FP indicates background pixels wrongly labeled as crack by the model, while FN are omitted crack pixels, which are more undesirable than the others from the perspective of safety. Taking the proportions of the four outcomes into integrated consideration, Mean Intersection over Union (MIoU) was used as the main metric to measure the accuracy of the models in this study. MIoU can be interpreted as the average, over classes, of the ratio between the intersection and the union of the prediction and the ground truth, measuring their degree of overlap:

MIoU = (1 / (k + 1)) Σ_{i=0}^{k} p_ii / (Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji − p_ii)    (2)

where k + 1 is the number of classes, p_ii is the number of pixels of class i correctly predicted as class i (TP for the crack class), p_ij is the number of pixels of class i predicted as class j (FP), and p_ji is the number of pixels of class j predicted as class i (FN).
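For the two-class case used here (crack vs. background), the MIoU of Eq. (2) reduces to averaging the per-class IoU; a minimal NumPy sketch:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """MIoU: average over classes of |intersection| / |union|, where
    union = TP + FP + FN for each class (Eq. (2))."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union else 1.0)
    return sum(ious) / num_classes
```

Because the background class dominates crack images, its IoU is usually high even for poor models; MIoU averages it with the much harder crack-class IoU, which is why it separates the benchmarked networks.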

B. BENCHMARK PERFORMANCE
The proposed FCS-Net was benchmarked against LinkNet [42], DeepLab V3 Plus [43], and CrackSegNet [1] with regard to the segmentation accuracy of fine cracks (Table 2). LinkNet and DeepLab V3 Plus are state-of-the-art semantic segmentation networks that have achieved high performance on multi-scale and large-scale object datasets such as Cityscapes and Pascal VOC. However, the accuracy of LinkNet and DeepLab V3 declined significantly when detecting fine cracks in full-scale images because of the severe data imbalance. On the fine crack dataset, the proposed FCS-Net improved on the MIoU of LinkNet and DeepLab V3 by 9.8% and 15.4%, respectively. Compared with CrackSegNet, which is designed for pixel-wise crack identification on concrete surfaces, FCS-Net achieved a performance enhancement of around 2.3%. Segmentation results on representative crack patches are shown in Fig. 6, in which LinkNet and DeepLab V3 failed to fully extract the crack skeleton and output more invalid positive predictions than CrackSegNet and FCS-Net. Being specially modified to identify fine cracks, CrackSegNet outperformed LinkNet and DeepLab V3 by successfully recognizing small crack samples. However, this dataset has more complicated backgrounds than concrete surfaces, with handwriting and crack-like welding joints, which reduced the performance of CrackSegNet in this study.
Confusion matrix results are plotted in Fig. 7, where both LinkNet and DeepLab V3 tend to generate much more noise (FP samples) in background images than the fine crack-specialized networks, CrackSegNet and FCS-Net. For example, around 5000 background pixels were improperly classified as crack by CrackSegNet and FCS-Net each, whereas the corresponding values for LinkNet and DeepLab V3 are nearly 24,000 and 19,000, which is the main cause of their performance reduction. The segmentation results of the four networks look almost identical to the naked eye; however, the results of LinkNet and DeepLab V3 indicate that both models generate a relatively higher proportion of background noise, whereas CrackSegNet and the proposed FCS-Net achieved relatively better segmentation performance, correctly identifying the entire cracks without excessive mispredictions in the background. Moreover, there are discontinuities in the crack skeleton extracted by DeepLab V3, which does not conform to the principles of crack generation and propagation. In detail, DeepLab V3 achieves the highest TN but also the highest FN, indicating that it pays more attention to the background than the foreground under the extreme imbalance of positive and negative samples. Compared with CrackSegNet, the proposed FCS-Net recognizes about 2000 more crack pixels, improving the MIoU from 0.7502 to 0.7601 (Fig. 7).

C. ABLATION EXPERIMENTS
The ablation experiment was first proposed by Ren et al. [44] in Faster R-CNN to certify the necessity of different modules in a deep network by removing each of them and observing the variations. In this study, the core modules (BN and ASPP) were removed in sequence from the proposed FCS-Net until only the ResNet-50 backbone remained. With the contributions of ASPP and BN, the MIoU of ResNet-50 was improved from 0.6565 to 0.7408 by the proposed FCS-Net (Table 3). It should be noted that the performance of ResNet-50 decreased after adding only the ASPP module. This may be caused by the enlarged receptive fields and additional extracted features enabled by the atrous convolution in ASPP. Details are depicted in Fig. 8, where the original ResNet-50 generated some mispredictions around the ruler in the input picture.
With the BN and ASPP modules present, mispredictions still occur at the location of the handwriting, and TP and FP increase to varying extents, which also verifies the previous inference about the increased number of extracted features. The different modules serve the main goal in different ways. The BN module improves the training efficiency of the model, accelerates its convergence, and reduces cost through normalization. The SPP algorithm can process input images of different sizes and aspect ratios, which may improve the scale invariance of the features and reduce over-fitting during model training.
With the expansion of the receptive field contributed by the dilated convolution in ASPP, the model performs better on large-scale images. Therefore, the overall MIoU is further improved when the BN and ASPP modules are combined.

V. CONCLUSION
To segment fine cracks from complicated large-scale images of steel girders, this study proposed a deep FCN-based network integrating ResNet-50, ASPP, and BN, termed FCS-Net. The proposed FCS-Net was benchmarked against LinkNet, DeepLab V3, and CrackSegNet with regard to the ability to identify fine cracks under severe background interference and sample imbalance. The networks specialized for fine crack detection (CrackSegNet and FCS-Net) achieved higher MIoU than LinkNet and DeepLab V3, which are better suited to segmenting multi-scale and large-scale objects. Specifically, the MIoU was enhanced by around 12% by the proposed FCS-Net compared with LinkNet, indicating its applicability to pixel-wise detection of fine cracks.

CODE AVAILABILITY
The source code and data used in this article can be found at https://github.com/Monash-Civil-CV-Team/FCS-Net.