An Improved Algorithm for Wind Turbine Blade Defect Detection

With the increase in wind power generation, wind turbine blades require regular inspections to ensure they continue to operate safely. You only look once (YOLO) is one of the most widely used object detection algorithms and is easy to deploy on drone devices. To enhance the real-time detection of small target defects in wind turbine blades, this paper proposes an improved attention and feature balanced YOLO (AFB-YOLO) algorithm based on YOLOv5s. Specifically, AFB-YOLO improves the feature pyramid network by using weighted feature fusion and cross-scale connections. The improved feature pyramid network solves the problem that most previous feature pyramid networks treat all input features equally, and obtains richer feature information. Furthermore, the coordinate attention (CA) module is introduced into the network to augment the representations of the objects of interest. Finally, the loss function is redesigned with the efficient intersection over union (EIoU) loss so that the model obtains a better localization effect. Experimental results on imagery of wind turbine blade defects indicate that our method yields significant performance gains. The mean average precision (mAP50) of AFB-YOLO is 83.7%, an improvement of 4.0% over the original YOLOv5s model. The experiments in this paper demonstrate that AFB-YOLO is more effective and robust than state-of-the-art detectors.


I. INTRODUCTION
To promote the positive development of the world's energy-saving and emission-reduction economy, wind power has developed rapidly. However, the earliest wind turbines are nearing the end of their lifespan, and some are already out of warranty. Defect detection is therefore of great significance to the normal operation of wind turbines. Yang et al. [1] proposed a two-stage detection framework to detect bolt loosening, combining the traditional manual torque methods commonly used in engineering with deep learning models to reduce manual detection costs and missed detection rates. This work addresses the defect detection of wind turbine blades. As the most critical and core component of a wind turbine, the integrity of the blade is an important factor in ensuring the stable operation of the wind turbine. Wind turbine blades are the component responsible for many failures and high maintenance costs [2]. Due to the long-term harsh open-air working environment of wind turbine blades, various kinds of damage will inevitably occur on their surfaces. Failure to detect and repair this damage in time will lead to serious failure consequences [3], [4], [5]. Adopting effective detection methods to find surface defects of wind turbine blades as soon as possible, and repairing them in time, is the most effective way to reduce accident risks and economic losses. Before unmanned aerial vehicles (UAVs) were widely used, most research on wind turbine blade inspection focused on the acquisition and processing of sensor signals. Commonly used methods include vibration detection [6], acoustic emission technology [7], ultrasonic flaw detection [8], strain detection [9], and infrared thermal imaging [10]. With the maturity of UAV technology equipped with high-definition cameras and the wide application of deep learning in the field of object detection [11], deep learning-based defect detection methods for wind turbine blades have emerged as the times require.
(The associate editor coordinating the review of this manuscript and approving it for publication was Hongwei Du.)
The task of object detection is to identify all targets of interest in an image and determine their categories and positions; it is one of the core problems in the field of computer vision. Object detection technology has reduced the burden on human beings to some extent and changed the way people live and work. Its accuracy and real-time performance are important criteria for measuring an entire computer vision system. Research on detection methods falls into two categories: traditional object detection algorithms [12], [13] and deep learning algorithms [14].
In the field of computer vision, traditional object detection algorithms are still popular. Traditional object detection algorithms extract handcrafted features based on prior knowledge and then use machine learning algorithms to classify them. The histogram of oriented gradients (HOG) is a commonly used feature extraction method that still performs well in object detection. Kassani et al. [15] improved the HOG feature and proposed soft HOG, a more discriminative histogram that fully utilizes symmetric shapes. Xiao et al. [16] integrated HOG and convolutional neural networks to achieve the highest recall rate. Although traditional object detection algorithms are relatively mature, it is difficult to design robust features manually. With the continuous breakthroughs of deep learning-based algorithms in the field of image recognition, many representative object detection algorithms have been proposed and widely used in industry, agriculture, medicine, and other fields [17], [18], [19], [20], [21]. Object detection algorithms based on deep learning can be divided into two categories: ''two-stage detection'' and ''one-stage detection''. A two-stage deep learning algorithm generates candidate regions through selective search and then combines the candidate regions with convolutional neural networks to extract features and perform regression and classification. Because this kind of detection algorithm needs to be completed in two steps, it is called a two-stage deep learning algorithm [22], [23]. Two-stage deep learning algorithms have high accuracy but relatively slow speed. Typical detection algorithms based on candidate regions include regions with CNN features (R-CNN) [24], fast region-based convolutional network (Fast R-CNN) [25], faster region-based convolutional network (Faster R-CNN) [26], and mask region-based convolutional network (Mask R-CNN) [27].
The above two-stage deep learning algorithms have the disadvantage of poor real-time performance, while one-stage detection algorithms greatly improve running speed. A one-stage detection algorithm recasts object detection as a single regression problem, mapping directly from image pixels to bounding box coordinates and class probabilities [22], [23]. Typical one-stage detection algorithms include the single shot multibox detector (SSD) [28], [29] series and the you only look once (YOLO) [30], [31], [32], [33], [34] series. Figure 1 illustrates some representative object detection algorithms.
The main contribution of this paper is to modify the operating mechanism of YOLOv5s and propose AFB-YOLO on the basis of YOLOv5s. AFB-YOLO draws inspiration from the bidirectional feature pyramid network (BiFPN) and the coordinate attention module. The improved feature pyramid network enables the network to learn more important input features. The CA attention mechanism acquires more details about the target and suppresses other useless information to enhance the representation of objects of interest. The improved model can detect low-resolution and unclear wind turbine blade features, greatly improving the detection of small target defects. Although the CIoU loss of YOLOv5s considers the aspect ratio of the predicted box and the ground truth box, which speeds up the regression of the predicted box to a certain extent, once the aspect ratio of the predicted box and the ground truth box becomes linearly proportional during regression, the w and h of the predicted box cannot be increased or decreased at the same time, and the regression cannot be optimized further. AFB-YOLO therefore redesigns the loss function with the efficient intersection over union (EIoU) loss, which enables the model to obtain better localization results.
The structure of this paper is as follows. Section 2 introduces the advantages of YOLOv5s and various techniques to improve the performance of YOLOv5s. Section 3 presents AFB-YOLO, including the motivation and structure of AFB-YOLO. In Section 4, experimental results and discussions are presented, where Precision, Recall, F1-Score, mean average precision (mAP), real-time performance in frames per second (FPS), and the model size are compared. Furthermore, Section 4 presents the comparison results of AFB-YOLO with other state-of-the-art lightweight models. Finally, the conclusion is drawn in Section 5.

II. RELATED WORKS
A. WHY YOLOV5S
The YOLOv5 algorithm has good network portability and is one of the most widely used lightweight object detection models available today. YOLOv5 has four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s has the smallest network depth and width, with a weight file of only 14.4 MB. YOLOv5s is very easy to deploy on drones due to its small size and fast computing speed [35]. The network structure of YOLOv5s is shown in Figure 2. The convolution-batch normalization-sigmoid weighted linear unit (CBS) module, cross stage partial (CSP) module, spatial pyramid pooling fusion (SPPF) module, and other modules make up the network structure. The CBS module sequentially performs a convolution computation, batch normalization, and the SiLU activation function. The green CSP modules are CSP modules with residual structures: the input tensor is evenly divided into two branches for convolution operations; one branch passes through a CBS module and then through residual units [36], while the other branch is directly convolved, after which the two branches are concatenated. The orange CSP module differs from the green one by replacing the residual units with multiple CBS modules. The SPPF module concatenates four paths to enlarge the receptive field of the network.
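As an illustration of the CBS module described above, the following is a minimal PyTorch sketch; the class name, default kernel size, and channel counts are illustrative assumptions, not YOLOv5's actual source code:

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution -> BatchNorm -> SiLU, the basic YOLOv5s building block."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # sigmoid weighted linear unit

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 64, 64)   # a dummy 3-channel input
y = CBS(3, 16)(x)               # stride 1 preserves the 64x64 spatial size
```

With stride 2 the same block would halve the spatial resolution, which is how networks of this family typically downsample between stages.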

B. MULTISCALE FEATURE REPRESENTATIONS
Processing multiscale features is one of the main difficulties in object detection. Table 1 shows related work on feature pyramid networks in recent years. The feature pyramid network (FPN) method proposed by Lin et al. makes the information of the shallow network and the deep network complement each other and improves the expressive ability of deep convolutional neural networks [37]. The feature layers of a deep convolutional neural network correspond to different information: the shallow network has higher resolution and can learn details of the target's texture and shape, while the deep network learns the semantic information of the target. However, since FPN does not fully utilize the fusion of the detailed information of the shallow network and the semantic information of the deep network, Liu et al. [38] proposed the path aggregation network (PANet) method. PANet adds a bottom-up path to the FPN structure to further aggregate feature information of the shallow and deep layers and enhance the structure of the feature layers of the entire neural network. Ghiasi et al. [39] proposed the neural architecture search feature pyramid network (NAS-FPN), the first method to use NAS techniques to find the optimal FPN structure in a search space. The NAS-FPN search process uses reinforcement learning to train a controller, which uses the accuracy of the submodel in the search space as a reward signal to update its parameters. The controller learns by trial and error to find the optimal FPN structure; the newly discovered FPN structure is called the NAS-FPN structure. Although NAS-FPN achieves better performance, it requires thousands of GPU hours of searching. Recently, BiFPN has addressed the problem that input features have different resolutions and contribute unevenly to output features. BiFPN introduces learnable weights that enable the network to learn the varying importance of input features [40].

C. ATTENTION MECHANISM
The importance of attention in human perception cannot be overstated. An important property of the human visual system is that it does not attempt to analyze the entire image at once; our attention is first drawn to outstanding features and salient parts. The attention mechanism is a model proposed by Bengio et al. that simulates the attention mechanism of the human brain [41]. It can be regarded as a combination function in which the influence of a key input on the output is highlighted by calculating the probability distribution of attention. Attention mechanisms have proven helpful in a range of object detection tasks [42], [43], [44]. Channel attention and spatial attention are the two main types of attention mechanisms: the channel attention module focuses on channel information, while the spatial attention module focuses on positional information. The squeeze-and-excitation (SE) block [38] is a well-known channel attention mechanism that allows the model to learn the importance of different channel characteristics. Among later works, the convolutional block attention module (CBAM) [45], [46], [47], [48] combines spatial and channel attention modules, using convolution to compute spatial attention. However, convolution can only capture local relations, not the long-range dependencies required for object detection. In this work, coordinate attention (CA) [49], [50] is introduced into the network. The CA attention mechanism decomposes channel attention into two feature-encoding processes that capture long-range dependencies and precise location information along different spatial directions.

III. IMPROVED OBJECT DETECTION ALGORITHM
A. BALANCE OF FEATURES
The sizes of the wind turbine blade defects vary widely in the images captured by drones, so handling multiscale features is critical. The traditional top-down FPN does not fully utilize the fusion of the detailed information of shallow networks and the semantic information of deep networks. To address this, the PANet structure of YOLOv5s adds a bottom-up path, as shown in Figure 3(a). In this work, AFB-YOLO achieves two optimizations for cross-scale connectivity by adopting an efficient weighted bidirectional feature pyramid network, as shown in Figure 3(b). First, the feature pyramid network of AFB-YOLO adds extra edges between input and output nodes in a manner similar to residual connections, while borrowing the bidirectional path proposed by the PANet structure as the base layer for more efficient feature fusion. Second, the AFB-YOLO feature pyramid network introduces learnable weights to learn the importance of different input features. AFB-YOLO does not simply add or concatenate input features, which would lead to feature mismatch; instead, the model achieves more competitive performance by using fast normalized fusion with learnable weights. The improved weighted feature pyramid network of AFB-YOLO better balances the feature information of wind turbine blades, so that the network can learn more important input features.
Taking into account the different importance of feature maps at different scales, bidirectional concat (BiConcat) adopts weighted fusion to fuse feature layers. The two-branch and three-branch weighted fusion methods are respectively defined as follows:

p_out = (w_1 · p_1^in + w_2 · p_2^in) / (w_1 + w_2 + ε)

p_out = (w_1 · p_1^in + w_2 · p_2^in + w_3 · p_3^in) / (w_1 + w_2 + w_3 + ε)

where w_i represents the input weight of the i-th branch, p_i^in represents the input feature of the i-th branch, and p_out represents the output feature. To avoid numerical instability, ε is a small value. At the same time, the ReLU activation function [51] is applied to each weight to ensure that every weight value is not less than zero.
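The weighted fusion above can be sketched numerically in plain Python; scalar values stand in for whole feature maps, and the function name and epsilon value are illustrative:

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted fusion: out = sum_i (w_i / (eps + sum_j w_j)) * p_i."""
    w = [max(0.0, wi) for wi in weights]   # ReLU keeps every weight >= 0
    total = sum(w) + eps                   # eps avoids division by ~zero
    return sum(wi / total * pi for wi, pi in zip(w, features))

# Two-branch fusion: the second input carries twice the weight.
out2 = fast_normalized_fusion([1.0, 3.0], [1.0, 2.0])      # ~ 2.333
# Three-branch fusion with one weight driven negative during training:
out3 = fast_normalized_fusion([1.0, 2.0, 3.0], [0.5, -1.0, 0.5])  # ~ 2.0
```

Note how the negative weight is clipped to zero by the ReLU step, so the corresponding branch simply stops contributing rather than destabilizing the fusion.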

B. INTRODUCE ATTENTION MECHANISM
The attention module originates from the way the human brain processes image information. By observing the global information of an image, humans can use attention to lock onto the focus area and automatically ignore redundant background information. We applied the CA attention mechanism to the model. The improved model can obtain more detailed information about defects in wind turbine blades and suppress other useless information.
The CA attention mechanism is a novel attention mechanism. It can capture not only inter-channel relationships but also dependencies with precise location information, helping the model locate and identify important information more accurately. As shown in Figure 4, the CA module consists of two main steps: coordinate information embedding and coordinate attention generation. Given the input feature map X, pooling layers encode the channels along the horizontal and vertical coordinates, producing a pair of direction-aware feature maps. This enables the module to capture long-range dependencies along one spatial direction while preserving precise location information along the other. Global pooling compresses global spatial information into channel descriptors, enabling the embedding of coordinate information. The outputs of the c-th channel at height h and at width w are represented as:

z_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i)

z_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w)

The CA attention mechanism first transposes the horizontally pooled output, then concatenates the two feature maps and applies a 1 × 1 convolution to obtain an intermediate feature map. After this operation, BatchNorm and a non-linear activation are used to encode spatial information in the vertical and horizontal directions. The features are then split into two separate tensors along the spatial dimension, with the horizontal tensor transposed back. Finally, the two tensors are convolved along the horizontal and vertical coordinates respectively, and the sigmoid activation function produces two attention weights, from which the output Y of the coordinate attention block is obtained. Coordinate attention generation can accurately capture regions of interest and effectively model inter-channel relationships. Experiments show that embedding the CA module in YOLOv5s benefits the extraction of network feature information.
f = δ(F_1([z^h, z^w]))

where [·, ·] represents the concatenation operation along the spatial dimension, F_1 is the 1 × 1 convolution transformation function, and δ is the non-linear activation function. After the above operations, the intermediate feature map f is obtained. Then f ∈ R^(C/r×(H+W)) is decomposed into two separate tensors f^h ∈ R^(C/r×H) and f^w ∈ R^(C/r×W) along the spatial dimension. Here, r is the reduction ratio for controlling the block size, as in the SE block.
g^h = σ(F_h(f^h)), g^w = σ(F_w(f^w))

Here σ is the sigmoid activation function, and F_h and F_w are two further 1 × 1 convolution transformation functions. The outputs g^h and g^w are then expanded and used as attention weights, and finally the output Y of the coordinate attention block is obtained:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)
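Assuming a PyTorch implementation, the two-step coordinate attention described above could be sketched as follows; the module name, reduction ratio, and the use of ReLU for δ are assumptions (the original CA paper uses h-swish):

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of CA: pool along H and W, encode jointly, reweight x."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)          # bottleneck width C/r
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height
        self.conv1 = nn.Conv2d(channels, mid, 1)       # shared 1x1 transform F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU()                           # non-linearity delta
        self.conv_h = nn.Conv2d(mid, channels, 1)      # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)      # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        xh = self.pool_h(x)                         # (n, c, h, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)     # (n, c, w, 1), transposed
        f = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        fh, fw = torch.split(f, [h, w], dim=2)      # split back into h and w parts
        fw = fw.permute(0, 1, 3, 2)                 # (n, mid, 1, w)
        gh = torch.sigmoid(self.conv_h(fh))         # attention along height
        gw = torch.sigmoid(self.conv_w(fw))         # attention along width
        return x * gh * gw                          # y_c(i,j) = x * g^h * g^w

x = torch.randn(2, 32, 16, 20)
y = CoordinateAttention(32)(x)   # reweighted output, same shape as x
```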
To accurately identify and locate defects in wind turbine blades, this work uses a CA module to replace the CSP structure. The CA module can improve the network's global receptive field and its ability to precisely pinpoint the target.
The improved network accommodates large changes in the frame size of wind turbine blade image detection by introducing a CA module. The whole improved detection architecture is shown in Figure 5.

C. IMPROVED LOSS FUNCTION
Bounding box regression is a key step in predicting bounding boxes to locate wind turbine blade defects, and the loss function of bounding box regression enables faster iterative convergence of the network model. In the YOLOv5s training process, the loss function and backpropagation are used to continuously update the model parameters to improve the detection accuracy. The loss function of YOLOv5s consists of three parts: localization loss, confidence loss, and class loss [18].
Loss_total = Loss_CIoU + Loss_conf + Loss_cls (9)

Here, Loss_CIoU is the localization loss of YOLOv5s; it handles the case in which one box contains the other by calculating the Euclidean distance between the boxes. Loss_conf is the confidence loss; it uses cross-entropy to represent whether the anchor box contains an object. Loss_cls represents the object category loss: when the anchor box is predicted to contain a real object, the generated box contributes to the class loss. YOLOv5s adopts the CIoU loss for bounding box regression, which considers the overlapping area, center point distance, and aspect ratio of the bounding boxes and achieves high detection accuracy. However, once the aspect ratios of the predicted and ground truth boxes are linearly proportional, the predicted box's w and h cannot be increased or decreased at the same time.
Loss_CIoU = Loss_iou + Loss_dis + Loss_asp

where IoU represents the proportion of the intersection over the union of the bounding boxes; b and b^gt represent the center points of the predicted box and the ground truth box, respectively; ρ(·) represents the Euclidean distance; c represents the diagonal distance of the smallest rectangle enclosing the two bounding boxes; and υ is a weight function that measures the similarity of the aspect ratios.
In the loss function of AFB-YOLO, the class loss and confidence loss are the same as in YOLOv5s; only the localization loss uses the EIoU loss [52] instead of the CIoU loss [53]. Although the CIoU loss considers the overlapping area, center point distance, and aspect ratio in bounding box regression, the aspect ratio difference it reflects is not the real width and height difference, which sometimes hinders effective optimization of the model. In response to this problem, the EIoU loss splits the aspect ratio term of the CIoU loss. The EIoU loss is defined as follows:

Loss_EIoU = Loss_iou + Loss_dis + Loss_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²

where b, w, and h represent the center point, width, and height of the predicted box, respectively; b^gt, w^gt, and h^gt represent the center point, width, and height of the ground truth box, respectively; ρ(·) indicates the Euclidean distance; c is the diagonal length of the smallest enclosing box covering the two boxes; and c_w and c_h are the width and height of the minimum bounding rectangle of the predicted box and ground truth box. The EIoU loss divides the bounding box regression (BBR) loss into three parts: the overlap loss between the predicted box and the ground truth box (Loss_iou), the center distance loss between the predicted and ground truth boxes (Loss_dis), and the width and height loss between the predicted and ground truth boxes (Loss_asp). The first two parts of the EIoU loss follow the CIoU loss, while the width and height loss directly minimizes the difference between the width and height of the predicted box and the ground truth box, making convergence faster. Figure 6 shows the iterative process of the predicted box under the CIoU loss and the EIoU loss. The red and green boxes show the regression process of the predicted boxes, the blue box is the target box, and the black box is the preset anchor box.
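The three EIoU terms can be sketched in plain Python for a single box pair; boxes are in (x1, y1, x2, y2) corner format, and the function name and the small stabilizing constants are illustrative:

```python
def eiou_loss(box, gt):
    """EIoU = 1 - IoU + center term + width term + height term."""
    # Overlap term (IoU)
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + 1e-9)
    # Smallest enclosing box (gives c, c_w, c_h)
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # Center distance term rho^2(b, b_gt) / c^2
    bx, by = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    dist2 = (bx - gx) ** 2 + (by - gy) ** 2
    # Width/height terms -- EIoU's replacement for CIoU's aspect-ratio term
    dw2 = ((box[2] - box[0]) - (gt[2] - gt[0])) ** 2
    dh2 = ((box[3] - box[1]) - (gt[3] - gt[1])) ** 2
    return 1 - iou + dist2 / c2 + dw2 / (cw ** 2 + 1e-9) + dh2 / (ch ** 2 + 1e-9)

perfect = eiou_loss((0, 0, 2, 2), (0, 0, 2, 2))   # ~0 for identical boxes
shifted = eiou_loss((0, 0, 2, 2), (1, 1, 3, 3))   # > 0 for a displaced box
```

Because the width and height terms penalize the raw size differences directly, two boxes with matching aspect ratios but different sizes still receive a non-zero gradient, which is exactly the CIoU failure case discussed above.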
The CIoU loss of YOLOv5s cannot effectively measure the difference between bounding boxes, resulting in slow model convergence and inaccurate positioning. The Loss_asp term of the CIoU loss measures the aspect ratio of the predicted box and the ground truth box, which can speed up the regression of the predicted box to a certain extent. However, during regression, once the aspect ratio of the predicted box and the ground truth box becomes linearly proportional, the w and h of the predicted box cannot be increased or decreased at the same time, and the regression cannot be optimized further.
AFB-YOLO redesigns the localization loss using the EIoU loss to minimize the width and height differences between the target box and the anchor box for better localization results. The EIoU loss splits the aspect ratio loss term into the differences between the predicted and ground truth widths and heights relative to the minimum enclosing box, which accelerates the convergence of the predicted box and improves its regression accuracy. The improved loss function effectively increases the convergence speed and accuracy of model training.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
Table 2 lists the software and hardware platforms and parameters used in this work. The operating system is Windows 10, and the neural network framework is PyTorch, an open-source Python machine learning toolkit based on Torch that is widely used for various machine learning algorithms. The processor used is an Intel Core i5-11400H @ 2.70 GHz, and the graphics card is an NVIDIA GeForce RTX 3050.

A. EXPERIMENTAL PLATFORM AND DATASET
In applied research on wind turbine blade defect detection, there is a lack of publicly available datasets suitable for deep learning training. In this work, we used a drone to photograph wind turbine blades from different angles and collected a total of 559 images. Since an insufficient number of samples would lead to problems such as overfitting during model training, this work uses a random cropping method to expand the initial wind turbine blade dataset. After removing pure-background images from the cropped images, 2995 images were obtained. Data augmentation by random cropping helps improve the robustness of the feature detector in the convolutional network while discarding a large number of background images. The images contain two categories of defect: damage and dirt. The training input image size is 640 × 640, and the batch size is set to 8. We divide the training set and test set roughly 4:1; the specific dataset division is shown in Table 3.
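A minimal sketch of the random-cropping idea used to expand the dataset, in pure Python with a nested list standing in for an image array; the actual augmentation pipeline and crop sizes used in this work are assumptions not stated in the text:

```python
import random

def random_crop(img, crop_w, crop_h):
    """Return a random crop_h x crop_w window from a 2-D image (list of rows)."""
    h, w = len(img), len(img[0])
    x0 = random.randint(0, w - crop_w)   # random top-left corner, kept in bounds
    y0 = random.randint(0, h - crop_h)
    return [row[x0:x0 + crop_w] for row in img[y0:y0 + crop_h]]

# Dummy 480x640 "image" whose pixels record their own (row, col) coordinates.
img = [[(r, c) for c in range(640)] for r in range(480)]
crop = random_crop(img, 320, 240)        # one augmented sample per call
```

In a real pipeline, crops whose content is pure background would then be filtered out, as the paragraph above describes, and bounding box labels would be shifted by the same (x0, y0) offset.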

B. PERFORMANCE EVALUATION FOR DEFECT DETECTION
If the IoU of the predicted box and the ground truth box is greater than 0.5, the two boxes are matched. For binary classification problems, examples can be divided into TP, FP, and FN according to their predicted categories, and their concepts are provided below.
(i) True positives (TP): the number of defects that the model correctly predicted.
(ii) False positives (FP): the number of samples in which a background area is mistakenly identified as a defect.
(iii) False negatives (FN): the number of defects that the model failed to predict.

Precision, Recall, F1-Score, average precision (AP), and mean average precision (mAP) are commonly used metrics to evaluate model performance in object detection tasks [54]. They are respectively defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × Precision × Recall / (Precision + Recall)
AP = ∫₀¹ P(R) dR
mAP = (1/C) Σ_{i=1}^{C} AP_i

where AP_i represents the AP value of the i-th class and C represents the total number of classes. The mAP reflects the average classification performance of the model across all classes. In addition, the F1-Score is a metric for classification problems and is often used as the final metric for multi-class problems; it is the harmonic mean of precision and recall.
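The metric definitions above translate directly into code. Here is a small self-contained sketch; the TP/FP/FN counts and per-class AP values are made-up numbers for illustration only:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and their harmonic mean (F1) from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def mean_ap(ap_per_class):
    """mAP = (1/C) * sum of AP_i over the C classes."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts for one defect class at an IoU threshold of 0.5:
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)   # p=0.8, r~0.667, f1~0.727
# Hypothetical APs for the two classes in this dataset, 'damage' and 'dirt':
m = mean_ap([0.85, 0.82])                           # 0.835
```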

C. ABLATION EXPERIMENT
We combine several improvements to boost YOLO's performance: the improved feature pyramid network enables the network to learn more important input features; the CA attention mechanism acquires more details about the target and suppresses other useless information; and the EIoU loss achieves precise regression of the bounding box. To verify the three improvements proposed in this work for YOLOv5s, ablation experiments were carried out on the wind turbine blade defect dataset to judge the effectiveness of each improvement. The weighted feature pyramid network, CA attention mechanism, and improved loss function were added to the original model in turn. On our experimental platform, 300 epochs of training were performed with the same parameter configuration, and the results are shown in Table 4. After introducing the weighted feature pyramid network, the recall rate increased by 3.7% compared to YOLOv5s, and the average precision improved by 2.5%. However, after introducing the weighted feature pyramid network and the attention mechanism, the additional improvement from the EIoU loss is small. Analysis suggests that the weighted feature pyramid network and the attention mechanism enhance the network's ability to extract features, attending to semantic information that is otherwise easily submerged. Compared with YOLOv5s, the proposed AFB-YOLO algorithm improves performance with gains of 3.8% in Recall, 2.2% in F1-Score, and 4.0% in mAP50. The results show that the detection results of AFB-YOLO are relatively comprehensive and that it performs well on images of wind turbine blade defects. Table 5 shows the performance of AFB-YOLO compared with other state-of-the-art models. Compared with the two-stage deep learning algorithms Faster R-CNN and Mask R-CNN, AFB-YOLO achieves a large improvement in accuracy and is nearly 12 times faster.
Compared with the classic one-stage object detection detector SSD, AFB-YOLO has a great improvement in various indicators. Compared with YOLOv3 and YOLOv4, AFB-YOLO also has great improvements in accuracy and speed. AFB-YOLO is 24.1%, 13.8%, and 10.7% higher than YOLOv3 in the detection accuracy indicators Recall, F1-Score, and mAP50, respectively. When considering only speed, YOLOv4-tiny has the best detection speed. However, YOLOv4-tiny prunes the network structure, which greatly reduces the accuracy. In this work, the improved AFB-YOLO achieves higher performance compared to YOLOv5s. AFB-YOLO improves F1-Score by 2.2% and mAP50 by 4.0% while maintaining the detection speed.

D. PERFORMANCE COMPARISON OF THE DIFFERENT MODELS
Small object detection has always been a hot and difficult topic in the field of object detection. The main challenge is that small target defects occupy few pixels, making it difficult to extract effective feature information [55]. Figure 7 shows the results of different algorithms on the wind turbine blade defect dataset. On large target defects, all of the object detection algorithms perform well. However, compared with large targets, there is still a large gap in small target detection performance. In the second and third rows of images, most algorithms seriously miss small target defects. Small target defects carry few features, so it is difficult to extract effective defect features. AFB-YOLO achieves good detection performance on small target defects through the improved weighted feature pyramid network, attention mechanism, and loss function. For example, in (e2), YOLOv5s detected small target defects, but with low confidence; AFB-YOLO improves on this problem and achieves better detection accuracy. Furthermore, AFB-YOLO detected small target defects in (e3) that were not detected in (d3). The results show that the proposed model has a good detection effect on small target defects.

V. CONCLUSION
Inspired by BiFPN and the coordinate attention module, this work proposes a YOLOv5s-based defect detection algorithm for wind turbine blades. Experimental verification shows that AFB-YOLO offers high real-time performance and stable accuracy on the wind turbine blade dataset. The accuracy of AFB-YOLO is similar to that of the two-stage detection algorithm Faster R-CNN, but the detection speed is significantly higher. Compared with YOLOv3, AFB-YOLO improves the detection accuracy indicators Recall, F1-Score, and mAP50 by 24.1%, 13.8%, and 10.7%, respectively. Compared with the recent one-stage object detection algorithm YOLOv5s, the detection accuracy is still greatly improved. The good performance of the proposed model can be attributed to its learning of important feature information. The improved model is capable of detecting low-resolution and unclear features of wind turbine blades, greatly improving the detection of small target defects. In future work, we plan to use more powerful matching strategies to improve our detection results.