Road Crack Detection Using Deep Neural Network Based on Attention Mechanism and Residual Structure

Intelligent detection of road cracks is crucial for road maintenance and safety. Because of interference from illumination and varying background factors, the road cracks extracted by existing deep learning methods are incomplete, and the extraction accuracy is low. We designed a new network model, called AR-UNet, that introduces a convolutional block attention module (CBAM) in the encoder and decoder of U-Net to effectively extract global and local detail information. The input and output CBAM features of the model are connected to increase the transmission path of features. BasicBlock is adopted to replace the convolutional layers of the original network to avoid the network degradation caused by vanishing gradients and increasing network depth. We tested our method on DeepCrack, the Crack Forest Dataset, and our own labeled road image dataset (RID). The experimental results show that our method focuses more on crack feature information and extracts cracks with higher integrity. The comparison with existing deep learning methods also demonstrates the effectiveness of the proposed method. The code is available at: https://github.com/18435398440/ARUnet.


I. INTRODUCTION
Cracks are the most common kind of road distress. If cracks are not repaired in time, they can seriously endanger traffic safety. Therefore, finding and repairing cracks in time is an important responsibility of the transportation department. In recent years, with the development of image-based and computer vision methods for road crack detection [1], deep learning has been widely used for crack detection [2], [3], [4]. Zhang et al. [5] first used deep learning for road crack extraction and proposed and trained a supervised shallow neural network to find cracks. CrackForest [6] combined multi-level complementary features using structural information in crack patches to find and extract cracks. Yao et al. [7] proposed a convolutional neural network for crack recognition that suppressed the interference of background factors and considerably improved detection accuracy. Liu et al. [8] proposed a pixel-level classification network combining local and global information to obtain richer multi-scale feature information and improve crack detection accuracy. Dorafshan et al. [9] reduced the interference of background factors on crack extraction by connecting edge detectors and deep convolutional neural networks. Li et al. [10] enhanced and extracted multi-scale crack features using dense connections; the feature maps at different scales were then fused so that features at different levels complement each other during crack extraction. However, these methods are less able to extract fine cracks in pavement images with many interfering factors. Lin et al. [11] proposed the LEDNet neural network for defect detection of LED chips and achieved high detection accuracy. Wu et al. [12] generated small blocks centered on a pixel at several different scales and fed the blocks into different convolution operations. The experimental results show that the method can learn more realistic crack characteristics and that the detection results are highly precise.
Ronneberger et al. [13] proposed the U-Net medical image segmentation method, which captures contextual semantics with a contracting path and determines location with a symmetric expanding path. The encoder and decoder sub-networks of U-Net++ are connected by nested and dense skip paths [14] to reduce the semantic gap between the encoder and decoder feature maps, and its Intersection over Union (IoU) is higher than that of the original U-Net. Cheng et al. [15] treated the crack images as a whole; they also introduced a cost function based on the distance transform to improve the detection performance of the network. Fan et al. [16] proposed U-HDN, an encoder-decoder-based neural network that integrates crack context information into a multi-dilation module to obtain more crack features. Drozdzal et al. [17] studied the importance of skip connections and introduced short skip connections in the encoder. In another approach, the ResNet34 residual network [18] was used, its original convolutions were replaced with dilated convolutions [19] to extract crack information, and an attention mechanism was introduced to obtain the final crack detection results. However, these methods have poor detection accuracy in the presence of many disturbing background factors. Bang et al. [20] proposed a pixel-level detection method using an encoder-decoder to identify road cracks. The encoder consists of the convolutional layers of a residual network for extracting crack features, and the decoder consists of deconvolution layers for locating cracks in the input image. The experimental results are better than those of VGG-16, ResNet-50, ResNet-101, and ResNet-200.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongjie Li.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The U-Net neural network is an encoder-decoder structure that can be trained end-to-end with few images to detect road cracks quickly. However, road images contain many distracting factors, and U-Net has a limited ability to extract the fine cracks in them. When CBAM is introduced into U-Net, the structure becomes more complex and the number of network layers increases, and the model shows network degradation. To solve these issues, the work in this paper focuses on the following aspects:
1) We design a new network model called AR-UNet by introducing the convolutional block attention module (CBAM) into the U-Net neural network. CBAM applies both global average pooling and global max pooling along the channel and spatial dimensions of the input features to focus on more global and local detail information, which improves the performance of the network in detecting fine cracks.
2) The input and output features of CBAM are connected using shortcut connections to increase the transmission path of crack features, so that the network model can learn more about crack features.
3) BasicBlock replaces the convolutional layers of U-Net to avoid the network degradation caused by the increase in the number of network layers and to further improve the accuracy of crack extraction.

II. RELATED WORK
Traditional pavement crack detection mainly falls into the following categories: 1) manual detection, 2) thresholding, 3) wavelet transform, 4) morphological image processing and classification, 5) path-based methods, and 6) edge detection. In manual detection, a pavement inspector drives along the road and records the location of cracks, the degree of damage, and the type of distress. This approach is careful and comprehensive, but it consumes a large amount of labor and material resources and is inefficient.
Thresholding-based image segmentation methods appeared early and are widely used. The thresholding method detects cracks by exploiting the fact that the gray value of crack pixels is lower than that of the background [21]. Kirschke et al. [22] proposed a histogram-based threshold segmentation method, which can only be used to identify relatively obvious cracks. Removal algorithms [23] that use binary segmentation, morphological operations, and removal of isolated points and regions are prone to leaving gaps in the detected cracks. Segmentation with an improved adaptive iterative thresholding algorithm [24] can also yield crack images. Zhang et al. [25] took advantage of the significant difference between cracks and background to mark contours using FAST feature point recognition and used PYNQ for crack identification. However, the accuracy of these methods is poor when there is a great deal of noise in the background.
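The thresholding idea described above can be illustrated with a minimal sketch. It is not the algorithm of any cited method: the function name `threshold_cracks`, the sensitivity parameter `k`, and the rule "darker than mean minus k standard deviations" are assumptions chosen only to show why dark crack pixels separate from a brighter background.

```python
from statistics import mean, pstdev


def threshold_cracks(gray, k=1.0):
    """Mark pixels darker than (mean - k * std) as crack (1), else background (0).

    `gray` is a 2-D list of gray values; `k` is a hypothetical sensitivity
    parameter, not a value taken from the cited methods.
    """
    pixels = [value for row in gray for value in row]
    threshold = mean(pixels) - k * pstdev(pixels)
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# A bright pavement patch crossed by a dark diagonal crack.
image = [
    [200, 198, 40, 201],
    [199, 45, 202, 200],
    [50, 197, 203, 199],
]
mask = threshold_cracks(image)
for row in mask:
    print(row)
```

Any other dark region, such as a shadow or an oil stain, would fall below the same threshold, which is consistent with the sensitivity to background noise noted above.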
Ju et al. [26] used an illumination compensation model (ICM) and the k-means clustering algorithm to detect cracks: after shadows are removed from the image, k-means clustering extracts the crack area from the road background. The method performs well in terms of average precision, recall, and F-measure.
Wavelet-based pavement crack detection algorithms [27], [28] use the wavelet transform to convert cracks and noise into different wavelet coefficients. These methods place high demands on equipment and suffer from disadvantages such as over-segmentation and susceptibility to interference from external factors.
Other approaches include histogram statistics and shape analysis algorithms [29], morphological image processing with logistic regression statistical classification [30], and free-form path calculation methods [31], which combine brightness and connectivity to detect cracks. Their detection is not practical under complex backgrounds with many interfering factors. The median filtering algorithm [32] enhances grayscale pavement images using four structural element reconstructions and combines the morphological gradient operator and the morphological closing operator to extract crack edges. However, this method identifies crack pixels with noticeable contrast changes in the crack image, and its extraction accuracy is poor for cracks with inconspicuous features.
Shah and Wang et al. [33], [34] studied crack segmentation based on edge detection. However, the natural properties of road distress were not considered, and the applicability of the algorithms was less than ideal. Edge-detection segmentation generally identifies crack edges from local grayscale and gradient information, so it is applicable only to cracks with complete edge information. It easily misjudges background with strong edge information as crack points, and when there is more noise, the effect of edge detection is poor.
In traditional methods, feature extraction depends mainly on hand-designed extractors, which require professional knowledge and a complicated parameter adjustment process [35]; each method is tailored to a specific application and generalizes poorly. Deep learning is mainly data-driven: by learning from a large number of samples, it obtains deep, dataset-specific feature representations that describe the dataset more efficiently and accurately. The extracted abstract features are more robust and generalize better, and the network can be trained end-to-end without complex parameter tuning. Deep learning detection of road cracks can not only free people from tedious work but also achieve the accuracy of manual detection. Therefore, it is very important to realize automatic detection of road cracks by deep learning.

III. METHOD

A. OVERALL NETWORK STRUCTURE
The U-Net neural network is split into three parts: encoder, decoder, and prediction module. The encoder reduces the image size and extracts the initial image features by convolution and max pooling. The decoder obtains the deep features of the image by convolution (a ReLU function follows every convolution). Finally, pixel classification is completed by a 1×1 convolution.
The proposed network structure is shown in Figure 1. It mainly consists of a feature extraction network, a residual module, and a CBAM module. The BasicBlock module replaces the convolutional layers of U-Net and effectively mitigates the model degradation and vanishing gradients that occur as the number of network layers increases. The network introduces CBAM and sums the input and output of CBAM; this module is termed Res-CBAM. Res-CBAM makes the network pay more attention to crack information in the channel and spatial dimensions and assigns larger weights to the corresponding network coefficients.

B. CONVOLUTIONAL BLOCK ATTENTION MODULE (CBAM)
CBAM is a lightweight module that contains spatial attention and channel attention. The module derives attention weights sequentially along two independent dimensions, channel and space, and then multiplies the resulting attention map with the input feature map for adaptive feature refinement. Since CBAM is a lightweight, general module, it can be seamlessly integrated into any CNN architecture and trained end-to-end with the underlying CNN. Compared with attention modules that focus on only one aspect, CBAM takes both aspects into account and extracts more information about the target.
As shown in Figure 2, given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, the CBAM module computes in turn the one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and the two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, and finally outputs features weighted along both the channel and spatial dimensions. The overall attention is calculated as follows:
$$F' = M_c(F) \otimes F$$
$$F'' = M_s(F') \otimes F'$$
where $\otimes$ denotes element-wise multiplication, $F'$ denotes the features after the channel attention operation, and $F''$ is the final refined output.
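The sequential refinement, together with the input-output sum that defines Res-CBAM, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are plain nested lists indexed as `feat[c][h][w]`, the two attention maps are supplied by caller-provided functions, and all function names are hypothetical.

```python
def scale_by_channel(feat, m_c):
    """Multiply every pixel of channel c by the scalar m_c[c] (a C x 1 x 1 map)."""
    return [[[v * m_c[c] for v in row] for row in chan]
            for c, chan in enumerate(feat)]


def scale_by_position(feat, m_s):
    """Multiply every channel at (h, w) by m_s[h][w] (a 1 x H x W map)."""
    return [[[v * m_s[h][w] for w, v in enumerate(row)]
             for h, row in enumerate(chan)]
            for chan in feat]


def cbam_refine(feat, channel_attention, spatial_attention):
    """Channel attention first, then spatial attention on the channel-refined map."""
    channel_refined = scale_by_channel(feat, channel_attention(feat))
    return scale_by_position(channel_refined, spatial_attention(channel_refined))


def res_cbam(feat, channel_attention, spatial_attention):
    """Shortcut variant: element-wise sum of the CBAM input and the CBAM output."""
    refined = cbam_refine(feat, channel_attention, spatial_attention)
    return [[[a + b for a, b in zip(row_in, row_out)]
             for row_in, row_out in zip(chan_in, chan_out)]
            for chan_in, chan_out in zip(feat, refined)]


# Toy feature map with C = 2, H = 1, W = 2 and fixed (hypothetical) attention maps.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
fixed_channel = lambda f: [0.5, 1.0]
fixed_spatial = lambda f: [[1.0, 0.0]]
print(cbam_refine(features, fixed_channel, fixed_spatial))
print(res_cbam(features, fixed_channel, fixed_spatial))
```

In the second output the position that spatial attention suppressed still carries the original input values, which illustrates how the shortcut keeps an extra path for feature transmission.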

C. CHANNEL ATTENTION MODULE (CAM)
The structure of the channel attention module is shown in Figure 3. Two 1 × 1 × C feature maps are obtained by feeding the input features into global max pooling and global average pooling, respectively. Each is then passed through a shared two-layer fully connected neural network: the first layer has C/r neurons (r is the compression rate) with ReLU as the activation function, and the second layer has C neurons. The two outputs of the fully connected network are then summed and passed through the sigmoid activation function to generate the channel attention features ($M_c$). The channel attention is calculated as follows:
$$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$$
where $\sigma$ denotes the sigmoid function.

D. SPATIAL ATTENTION MODULE (SAM)
The structure of the spatial attention module is shown in Figure 4. The input features of size C × H × W are average-pooled and max-pooled along the channel axis to obtain $F_{avg}$ and $F_{max}$. The two feature maps are then concatenated along the channel dimension. After a 7 × 7 convolution operation, the result is compressed into H × W × 1, and the sigmoid activation function generates $M_s$. Finally, the output of this module is multiplied by the input feature map to obtain the final feature map. The spatial attention is calculated as follows:
$$M_s(F) = \sigma(f^{7 \times 7}([F_{avg}; F_{max}]))$$
where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ denotes the convolution operation with a filter size of 7 × 7.
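The two attention maps can be computed as in the following dependency-free sketch. It assumes feature maps stored as nested lists `feat[c][h][w]`, a bias-free shared perceptron with weight matrices `w0` (C/r × C) and `w1` (C × C/r), and zero padding in the convolution; the function names and the weights in the example are hypothetical, and a real model would learn the weights and use a tensor library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w0, w1):
    """Two-layer perceptron C -> C/r (ReLU) -> C, without biases."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def channel_attention(feat, w0, w1):
    """One weight per channel from global average and global max pooling."""
    avg = [sum(map(sum, chan)) / (len(chan) * len(chan[0])) for chan in feat]
    mx = [max(map(max, chan)) for chan in feat]
    summed = zip(shared_mlp(avg, w0, w1), shared_mlp(mx, w0, w1))
    return [sigmoid(a + m) for a, m in summed]


def spatial_attention(feat, kernel):
    """One weight per pixel from channel-wise average and max pooling.

    kernel[i][dy][dx] weights pooled map i (0 = average, 1 = max); zero
    padding keeps the H x W size. A 7 x 7 kernel matches the text.
    """
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    f_avg = [[sum(feat[c][y][x] for c in range(n_ch)) / n_ch
              for x in range(width)] for y in range(height)]
    f_max = [[max(feat[c][y][x] for c in range(n_ch))
              for x in range(width)] for y in range(height)]
    pooled = [f_avg, f_max]
    size = len(kernel[0])
    pad = size // 2
    attention = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for i in range(2):
                for dy in range(size):
                    for dx in range(size):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernel[i][dy][dx] * pooled[i][yy][xx]
            row.append(sigmoid(acc))
        attention.append(row)
    return attention


# Example with C = 2 and r = 2, so the hidden layer has one neuron.
feat = [[[1.0, 3.0]], [[0.0, 0.0]]]
print(channel_attention(feat, w0=[[1.0, 0.0]], w1=[[1.0], [0.0]]))
zero_kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
print(spatial_attention(feat, zero_kernel))
```

With an all-zero kernel every spatial weight is 0.5, which is a quick sanity check that the padding and output size behave as intended.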

E. STRUCTURE DETAILS OF THE ENCODER
As shown in Figure 5, the input features enter the channel attention of CBAM after two convolution operations of size 3 × 3 to get the channel attention weight M c . M c is multiplied by the input feature map to get the input features required by the spatial attention module. Next, the spatial

F. STRUCTURE DETAILS OF THE DECODER
The residual-connected Res-CBAM is also introduced in the decoder, as shown in Figure 6. The feature map of size C × H × W is deconvolved; the corresponding CBAM input feature map of the encoder is copied, cropped, and concatenated with the deconvolved feature map to obtain a feature map of size C × 2H × 2W. The concatenated feature map is fed to the attention mechanism as its input. The output feature map is connected with the input feature map and then convolved with a 3 × 3 convolution kernel to obtain the final feature map of size C/2 × 2H × 2W.
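The size bookkeeping of one decoder stage can be traced with a short sketch. The text gives only the sizes after concatenation and after the final 3 × 3 convolution; the intermediate split, in which the deconvolution output and the cropped encoder map each carry half of the channels, is an assumption borrowed from the standard U-Net, and the function name is hypothetical.

```python
def decoder_stage_shapes(c, h, w):
    """Trace (channels, height, width) through one decoder stage.

    Assumes the deconvolution halves the channels while doubling H and W,
    and that the cropped encoder map supplies the other half of the channels.
    """
    if c % 2:
        raise ValueError("channel count must be even")
    upsampled = (c // 2, 2 * h, 2 * w)   # deconvolution output (assumed)
    skip = (c // 2, 2 * h, 2 * w)        # cropped encoder feature map (assumed)
    stitched = (upsampled[0] + skip[0], 2 * h, 2 * w)  # channel concatenation
    attended = stitched                  # Res-CBAM preserves the shape
    output = (c // 2, 2 * h, 2 * w)      # after the 3 x 3 convolution
    return {
        "upsampled": upsampled,
        "stitched": stitched,
        "attended": attended,
        "output": output,
    }


print(decoder_stage_shapes(512, 32, 32))
```

Because attention only reweights features, the Res-CBAM step leaves the shape unchanged, so the element-wise sum of its input and output is always well defined.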

G. RESIDUAL NETWORK
The residual network comes from [36]. Typically, as the number of layers increases, the training loss gradually decreases and then saturates; in practice, however, the training loss increases when the network depth is increased further. This is not overfitting because, in overfitting, the training loss continues to decrease.
The deeper the network is, the harder it is to train. Therefore, it is essential to integrate shortcut connections into U-Net to reduce network degradation. Because the original convolutional layers are computationally expensive and unsuitable for pixel-level prediction, they are replaced by BasicBlock, whose structure is shown in Figure 7.
After the input feature map is passed through two convolutional layers and the ReLU function, it is summed with the original input features to obtain the final output feature map. A residual block can be expressed as:
$$x_{l+1} = h(x_l) + f(x_l, w_l)$$
The residual block is divided into two parts: the direct mapping part and the residual part. $h(x_l)$ is the direct mapping, which corresponds to the curve on the right in Figure 7; $f(x_l, w_l)$ is the residual part, which consists of two convolution operations and corresponds to the part containing the convolutions on the left in Figure 7. The shortcut connections between the input and output feature maps transfer the crack information extracted by the previous layer of the network to the next layer. Information loss is thus avoided to a greater extent, and the network degradation caused by increasing the number of neural network layers is effectively prevented.
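The residual formulation can be made concrete with a small sketch. To stay short and dependency-free it uses an identity direct mapping and replaces the two convolutions of BasicBlock with two dense layers acting on a vector; that substitution, the omission of normalization layers, and the function names are assumptions made for illustration only.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]


def linear(vec, weight):
    """Dense layer without bias; stands in for a convolution in this sketch."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def residual_block(x, w1, w2):
    """Output = direct mapping (identity) + residual branch (layer, ReLU, layer)."""
    residual = linear(relu(linear(x, w1)), w2)
    return [a + b for a, b in zip(x, residual)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With zero residual weights the block passes its input through unchanged.
print(residual_block([1.0, -2.0], zeros, zeros))
# With identity weights the residual branch adds relu(x) to x.
print(residual_block([1.0, -2.0], identity, identity))
```

The first call shows why stacking such blocks need not hurt training: a block whose residual branch contributes nothing reduces to the identity, so added depth can at worst leave the features as they were.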

A. ROAD IMAGE DATA SET
The datasets used for the experiments are DeepCrack [37], the Crack Forest Dataset [38], and our annotated onboard road image dataset, which we named RID. DeepCrack contains 537 concrete pavement images of 544 × 384 pixels with multi-scene and multi-scale pavement cracks. The Crack Forest Dataset consists of asphalt pavement images; it contains 118 images of 480 × 320 pixels with background noise such as white markers and shadows. These two datasets have few images and were augmented using rotate, flip, and mirror operations. After augmentation, 2148 and 708 images were obtained from the DeepCrack and Crack Forest datasets, respectively. We also built a dataset of 548 images from road images acquired by a mobile LiDAR mapping system. The labels in all three datasets were annotated manually. To validate the neural network models, we selected 80% of each dataset as training data and 20% as test data.
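The augmentation and the 80/20 split can be sketched as follows. The reported counts correspond to factors of 4 (537 to 2148) and 6 (118 to 708), but the exact operations, the rounding rule, and the shuffling seed are not given, so the three operations, the `seed` parameter, and the function names below are assumptions.

```python
import random


def augment(image):
    """Return the image plus three transformed copies (a factor of 4).

    The specific operations are an assumption chosen for illustration.
    """
    mirrored = [row[::-1] for row in image]        # left-right mirror
    flipped = image[::-1]                          # top-bottom flip
    rotated = [row[::-1] for row in image[::-1]]   # 180-degree rotation
    return [image, mirrored, flipped, rotated]


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then split into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


print(augment([[1, 2], [3, 4]]))
train, test = split_dataset(range(2148))
print(len(train), len(test))
```

Splitting the original images before augmenting them keeps transformed copies of one image from landing in both subsets; the paper does not state which order was used.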

B. EXPERIMENTAL SETTINGS
1) ANALYSIS OF INITIAL LEARNING RATE AND OPTIMIZERS
In the first experiment, to obtain a suitable initial learning rate and optimization method, we set different learning rates and optimizers and analyzed the training loss of the model. Figure 8(a) shows the results with the Adam optimizer: the training loss fluctuates strongly on all three datasets, and the loss values are large. Figure 8(b) shows the results with the SGD optimizer: the training loss values on the three datasets are small and stable. Therefore, we chose SGD as the network optimizer. The learning rate is set to 1e-1 for training on the RID and Crack Forest datasets and to 3e-3 for the DeepCrack dataset, because the corresponding loss values are the smallest.

2) OTHER EXPERIMENTAL SETTINGS
We implemented all tests in Python 3.6 with PyTorch 1.10.1 and CUDA 11.1, and used an NVIDIA GeForce RTX 2080 GPU for training. The model uses the SGD optimization method, which updates the parameters using randomly selected mini-batches of samples, with the momentum set to 0.9. The ReLU activation function suppresses vanishing gradients during training, which accelerates the convergence of the model and keeps training stable.
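For readers unfamiliar with momentum, a single parameter update can be sketched as below. The sketch assumes the common formulation in which the velocity accumulates raw gradients and the learning rate is applied afterwards; the function name and the toy objective f(p) = p^2 are hypothetical and are not taken from the paper.

```python
def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """One update: v <- momentum * v + g, then p <- p - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Two steps on f(p) = p**2, whose gradient is 2 * p.
params, velocity = [1.0], [0.0]
for _ in range(2):
    grads = [2.0 * p for p in params]
    params, velocity = sgd_momentum_step(params, grads, velocity, lr=0.1)
print(params, velocity)
```

In a real training loop the gradients would come from a randomly drawn mini-batch rather than from a closed-form expression.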

C. EXPERIMENTAL EVALUATION INDEXES
Neural network segmentation accuracy is evaluated using the commonly used metrics DICE (D), precision (P), recall (R), and F1-score. DICE indicates the ratio of the area where the predicted and true results intersect to the total area, and its value for a perfect segmentation is 1. The F1-score balances precision and recall. DICE and the F1-score are calculated as follows:
$$D = \frac{2|X \cap Y|}{|X| + |Y|}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$
where $X$ and $Y$ denote the predicted and ground-truth crack regions. Precision indicates the proportion of pixels detected as cracks that are truly crack pixels:
$$P = \frac{TP}{TP + FP}$$
where TP indicates the number of correctly classified crack pixels and FP indicates the number of background pixels incorrectly classified as cracks. Recall indicates the proportion of correctly detected crack pixels among all crack pixels:
$$R = \frac{TP}{TP + FN}$$
where FN indicates the number of crack pixels incorrectly classified as background.
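The four metrics can be computed from a pair of binary masks as in the sketch below. It assumes flattened masks with 1 for crack and 0 for background and uses the standard overlap definition of DICE; the function name and the conventions for empty masks are assumptions.

```python
def crack_metrics(pred, truth):
    """Pixel-wise precision, recall, F1 and DICE for flat binary masks (1 = crack)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = sum(pred) + sum(truth)           # |X| + |Y|
    dice = 2 * tp / total if total else 1.0  # two empty masks agree perfectly
    return {"precision": precision, "recall": recall, "f1": f1, "dice": dice}


print(crack_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a single pair of binary masks DICE and F1 coincide. Differences between the two, as in the reported tables, can arise when scores are averaged per image or computed on soft predictions; the paper does not specify which applies.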

D. THE RESULTS OF ABLATION EXPERIMENTS
1) VISUAL ANALYSIS OF EXPERIMENTAL RESULTS
To examine the effect of introducing Res-CBAM and BasicBlock into the neural network on crack feature extraction, we validated them with ablation experiments on each of the three datasets. Figure 9 shows the visualization results. Rows 1-2 show the detection results on the DeepCrack dataset: the cracks extracted by the original neural network are incomplete and the extraction accuracy is poor, whereas after the introduction of Res-CBAM and BasicBlock the network model focuses more on the crack region and the cracks are more complete. Rows 3-4 show the results on the Crack Forest dataset, where the extracted cracks are more realistic. Rows 5-6 show the results on RID, where the fine cracks are extracted more completely.

E. RESULTS OF ABLATION EXPERIMENTS
1) RESULTS ON DEEPCRACK
We explored the contribution of each component on DeepCrack's test set. As shown in Table 1, introducing Res-CBAM improved DICE from 65.39% to 68.72% and the F1-score from 67.26% to 75.64%. Integrating BasicBlock into the original network improved DICE and the F1-score further, to 83.91% and 83.67%. When Res-CBAM and BasicBlock were added to the neural network simultaneously, DICE and the F1-score reached 84.09% and 85.82%, respectively. The improved encoder and decoder structures yield higher extraction accuracy than U-Net.

2) RESULTS FOR THE CRACK FOREST DATASET
For the Crack Forest dataset, DICE and the F1-score improve to 67.2% and 68.85%, respectively, after the introduction of Res-CBAM and BasicBlock into U-Net. The network achieves better precision when Res-CBAM is introduced alone and better recall when BasicBlock is introduced alone, but in both cases the F1-score is lower than that of the network with both components. The experimental results on the Crack Forest dataset show that the simultaneous introduction of Res-CBAM and BasicBlock can effectively improve the crack detection ability of U-Net.

3) REGARDING THE RESULTS OF RID
On RID, the network achieves the best performance when both the attention and residual structures are introduced. DICE and the F1-score reach 50.39% and 55.47%, respectively. However, this performance is lower than that on the other datasets because the road image dataset (RID) has uneven illumination and oblique shooting angles. In addition, the ground-truth labels of this dataset are only one or a few pixels wide, which is one of the reasons for the low detection results.

A. EFFECTIVENESS OF SHORTCUT CONNECTIONS
We further verified through ablation experiments whether adding shortcut connections in CBAM positively affects the extraction of cracks. The experimental results are shown in Table 2. Adding shortcut connections improved the crack extraction accuracy of the network because the shortcut connections add paths for feature information to propagate. The neural network learned more global and local crack information, which supports the feasibility of our method.
Since Res-CBAM plays a vital role in the network structure, its position could affect the neural network performance. We compare two ways of placing Res-CBAM in the decoder, as shown in Figure 10 (a) and (b), and discuss the effects of introducing Res-CBAM at the convolution and at the deconvolution on the neural network. The neural networks with the two arrangements were tested separately in the same experimental environment. Table 3 summarizes the test results of the two arrangements. The results show that the neural network with Res-CBAM introduced at the convolution performs better because the input features of Res-CBAM include features from the encoder, which makes the input information richer. When Res-CBAM is introduced at the position shown in Fig. 10(b), DICE and the F1-score are lower because some feature information is lost after the input features pass through two convolution operations, which degrades the detection performance of the network.

B. NETWORK DEGRADATION IN TRAINING PROCESS
In addition, we verified the network degradation during training through ablation experiments and recorded the changes in the training loss on the three datasets. As shown in Figure 11 (a), (b), and (c), the U-Net with Res-CBAM shows network degradation due to the increased number of network layers. The figure shows that the loss values of the original U-Net are unstable and fluctuate greatly during training, and the neural network converges slowly. After the introduction of Res-CBAM, the neural network pays more attention to the crack features and converges faster. However, due to the increase in network layers, its performance was slightly worse than that of the original network, and network degradation occurred. We therefore connected the input and output features of CBAM and replaced the convolutional layers of the original network with BasicBlock. The improved neural network converged faster and with higher accuracy.

C. COMPARISON WITH TRADITIONAL DEEP LEARNING ALGORITHMS
The comparison with other commonly used methods is shown in Table 4. Our method has higher accuracy than SegNet [39], RCF [40], DeepCrack [37], and the methods in [41] and [42]. On the DeepCrack dataset, the F1-score is 10.2% higher than that of SegNet, and the precision and recall are 15.7% and 4.5% higher, respectively. On the Crack Forest Dataset, the F1-score is improved by 18.1% compared to DeepCrack, and the precision and recall are improved by 16.5% and 19.7%, respectively. On the RID dataset, our network outperforms the other networks, with a 10.7% improvement in F1-score compared to RCF and improvements of 18.3% and 2.5% in precision and recall, respectively. The experimental results show that integrating CBAM and the residual structure into U-Net improves its crack detection performance and increases detection accuracy.

D. COMPARISON WITH TRANSFORMER ALGORITHM
To further demonstrate the advantages of the proposed method, we also compare it with the recently published Vision Transformer (ViT) [43], Swin-UNet [44], and TransUNet [45] algorithms, over which our method also has some advantages. The comparison results are shown in Table 4. For the DeepCrack dataset, our method's overall accuracy is 87.2%, and the precision and recall are 88.9% and 85.7%, respectively. For the Crack Forest Dataset, the precision of our method is 0.6% lower than that of TransUNet, but our overall accuracy is 0.2% higher. For the RID dataset, our method also outperforms the other algorithms, with an overall precision of 55.4%. Compared with Transformer-based methods, our method integrates the channel and spatial location information of cracks in the feature extraction stage, and the attention weights are tilted toward cracks. Transformers focus more on global information and ignore local information. Because crack pixels make up only a small proportion of the image, ignoring local information leads to lower detection accuracy.

VI. CONCLUSION
We introduced Res-CBAM and BasicBlock into U-Net to build a neural network model for crack detection. The experimental results show that the introduction of CBAM enhances the attention of the neural network to the crack region, improves its ability to extract fine cracks, and suppresses the interference of background factors. Meanwhile, the shortcut connections of Res-CBAM and the replacement of the convolutional layers in the network structure by BasicBlock ensure that crucial information is transmitted as efficiently as possible and effectively suppress network degradation. The resulting neural network learns more crack features, which improves the model's ability to detect fine cracks. Compared with many other neural network methods, the neural network built in this study has a considerably better ability to extract cracks. Its accuracy and robustness were verified through extensive experiments on different datasets.
PENG JING was born in Datong, Shanxi, China, in 1994. He is currently pursuing the master's degree with the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo. His current research interests include deep learning object detection and semantic segmentation.
HAIYANG YU was born in Linyi, Shandong, China, in 1978. He received the Ph.D. degree from the China University of Geosciences. He is currently a Professor with the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo. He is the author or coauthor of more than 50 papers published in academic journals and conferences. His main research interests include remote sensing theory and application and LiDAR data processing and application.
ZHIHUA HUA was born in Zhoukou, Henan, China, in 1998. He is currently pursuing the master's degree with the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo. His current research interests include remote sensing image processing and change detection.
CAOYUAN SONG was born in Xuchang, Henan, China, in 1997. He is currently pursuing the master's degree with the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo. His current research interest includes deep learning-based point cloud filtering.