MA-YOLO: A Method for Detecting Surface Defects of Aluminum Profiles With Attention Guidance

Aluminum Profiles (APs) are aluminum materials obtained by hot melting and extruding aluminum rods. They are low in cost, highly plastic, easy to process, and recyclable, and therefore play an important role in industrial production. However, defects such as Non-Conductive (NC), Scratch, Orange Peel (OP), and Dirty Point (DP) often occur during the production and processing of APs and can seriously affect their quality. In addition, surface defects of APs exhibit fuzzily defined regions, large scale variation, imbalanced aspect ratios, and high inter-class similarity, which makes defect detection more challenging. To solve these problems, this paper proposes an attention-guided object detection algorithm called MA-YOLO, designed specifically for Surface Defect Detection (SDD) of APs. The algorithm is based on YOLOv5s. Firstly, the K-Means++ clustering algorithm is used to optimize the anchor boxes, which alleviates the problem of aspect ratio imbalance. Secondly, the multi-scale Feature Fusion Network (FFN) is improved, which raises detection performance on defects with unbalanced aspect ratios and enhances the adaptability of the model to defects of different scales. Finally, a novel Max Pooling Average Pooling (MA) attention module is proposed to improve the overall detection performance of the model, especially for small-scale defects. Experimental results on the aluminum profile surface defect dataset show that MA-YOLO outperforms current mainstream object detection algorithms; compared with the baseline YOLOv5s, the mAP50 and F1 score are increased by 2.9% and 2.2%, respectively, while the model remains lightweight. This indicates that MA-YOLO has broad application prospects in the surface defect detection of APs.


I. INTRODUCTION
Aluminum Profiles (APs) are an important raw material in industrial production, with good corrosion resistance, thermal conductivity, plasticity, and recyclability. They are widely used in high-end manufacturing industries such as automobile manufacturing, equipment manufacturing, and rail transportation [1], [2]. Furthermore, the low cost and processability characteristics of APs make them a preferred alternative to expensive raw materials such as copper. In recent years, with the continuous development and upgrading of the industrial manufacturing industry, there have been higher requirements for the overall quality of APs [3], [4]. However, improper human operations, uneven production equipment, and low-quality raw materials in the production process of APs can lead to various types of defects on the surface of APs, which affect aluminum profiles' overall quality and cause economic losses [5]. Moreover, some APs with severe defects can pose serious hidden dangers to the performance and quality of the product. The detection of surface defects is a crucial step in guaranteeing the quality of industrial products [6]. Therefore, to improve the overall quality of APs and meet the production requirements of the modern manufacturing industry, a highly efficient detection method needs to be designed to achieve precise detection of surface defects on APs.
The development of Surface Defect Detection (SDD) technology is closely related to the progress of science and technology. Its history can be divided into three stages: the manual detection stage, the machine device detection stage, and the Machine Vision (MV) detection stage [7]. In areas where production is not highly automated, most factories still rely on traditional manual detection methods for detecting metal surface defects. However, manual detection is affected by subjective factors, missed and false detections occur frequently, and the cost of training inspectors is high. With the spread of automated production, manual detection methods are gradually being eliminated [8]. Machine device detection methods mainly use signal collection devices to collect specific optical, electrical, magnetic, or ultrasonic signals to detect defects on metal surfaces [9]. Guo et al. [10], using eddy current (EC) testing on Inconel 738LC alloy, derived the relationship between the defect EC signal and influencing factors such as excitation frequency, lift-off distance, defect depth and size, residual heat, and surface roughness. D'Accardi et al. [11] studied the pulsed thermography technique, comparing the performance of Pulsed Phase Thermography (PPT), Thermal Signal Reconstruction (TSR), and Principal Component Thermography (PCT) for the detection of surface defects in APs, and systematically listing the advantages, disadvantages, and sensitivity of the various thermographic algorithms. Lou et al. [12] proposed a non-destructive detection method based on low-frequency (20-50 Hz) electromagnetic technology to detect internal defects in steel plates. The results show that the proposed method can detect internal defects buried to a depth of 6 mm in a 12-mm thick 20# steel plate or pipeline, achieving the best results among contemporary methods.
Although these methods solve the problem of metal SDD to a certain extent, machine device-based detection methods suffer from high detection costs, high learning costs, and limited detection accuracy, which are not conducive to large-scale use in industrial production.
Currently, with the rapid development of MV technology, SDD methods based on MV are beginning to be widely used in various industrial sectors, including automotive parts [13], solar panels [14], printed circuit boards [15], electronic displays [16], and steel and APs [17]. Traditional MV detection methods first acquire defect images with an industrial camera and preprocess the images, then apply manually designed feature extraction methods based on specific defect features, and finally perform dimensionality reduction on the feature information and input it into a classifier for classification. Compared to manual detection methods and machine device detection methods, MV detection methods offer advantages such as reliability, convenience, and efficiency. Wang et al. [18] proposed a guiding template-based SDD method for steel strips. The core idea is to detect surface defects on steel strips through template matching. Experimental results show that a detection accuracy of 96.2% was achieved on 1500 test images, but the implementation process is complex, and the detection speed needs to be improved. Shi et al. [19] improved the Sobel edge detection operator by adding 45°, 135°, 180°, 225°, 270°, and 315° orientation templates to enrich the object edge information and thus improve the detection accuracy of the model. Jayaweera et al. [20] developed a system for detecting surface defects in APs based on the Canny operator, which achieves edge extraction of defects, but their research is not sufficiently complete or in-depth. Chondronasios et al. [21] used the Sobel edge detector to obtain the gradient magnitude of the image and proposed the Gradient-only Co-occurrence Matrix (GOCM). Classification and detection of defect-free, blistered, and scratched AP surfaces were achieved based on the GOCM, with a test accuracy of 98.6% on a self-built dataset. All of the above are traditional MV-based defect detection methods.
Although these methods have largely automated the detection of surface defects, their feature extraction methods must be designed by hand, lack generality and robustness, and may even require "one method for one scene". They are therefore difficult to deploy widely.
Deep Learning (DL), a major branch of machine learning, has made breakthroughs in recent years. Convolutional Neural Networks (CNN) in particular have been gradually applied to various object detection scenarios owing to their powerful feature extraction and non-linear representation capabilities [22]. Depending on the processing approach, DL-based object detection algorithms can be broadly classified into single-stage and two-stage object detection algorithms. Typical two-stage object detection algorithms include Region-CNN (R-CNN) [23], Faster Region-based CNN (Faster R-CNN) [24], Mask R-CNN [25], and so on. This class of algorithms divides the object detection process into two stages, with the first stage generating multiple proposal boxes in the image via a Region Proposal Network (RPN) and the second stage fine-tuning the proposal boxes. Because the detection process is divided into two stages, good detection accuracy can be obtained, but the detection speed is slow. Unlike two-stage algorithms, single-stage object detection algorithms discard the RPN and perform regression detection directly on the object, so the detection process forms a single whole. The representative algorithms are Single Shot MultiBox Detector (SSD) [26] and You Only Look Once (YOLO) [27]; the detection speed of this class of algorithms is greatly improved while a certain level of accuracy is maintained. In industrial production, real-time performance is an important requirement. Therefore, YOLO, as a typical single-stage object detection algorithm, is widely used for defect detection in industrial production [28].
With the emergence of the concept of Industry 4.0, industrial production is gradually upgrading from automation to intelligence. MV combined with DL provides a new solution for intelligent production. More and more scholars are beginning to apply DL technology to the SDD of industrial products [29], [30]. Compared to traditional MV detection methods, DL-based detection methods have autonomous feature learning capabilities, thus eliminating the need for manual feature design and offering good robustness and generalization, although they require larger datasets. He et al. [31] designed an end-to-end detection network with multiscale feature fusion for steel plate defect detection and proposed the NEU-DET dataset. The experimental results show that the proposed method achieves up to 82.3% mAP on the NEU-DET dataset and a detection speed of 20 FPS. Bachmann et al. [32], through an experimental study, proposed data augmentation and transfer learning as the key ingredients for training on small-sample datasets. The proposed algorithm achieves a detection accuracy of 47% mAP and a detection speed of 10 FPS on a self-built dataset of surface defects in APs. Dong et al. [33] proposed a novel PGA-Net for pixel-wise detection of surface defects by designing a pyramidal Feature Fusion Network (FFN) and a global context attention network. The mAPs of 82.15%, 74.78%, 71.31%, and 79.54% were achieved on the NEU-Seg, DAGM 2007, MT_defect, and Road_defect datasets, respectively. Wei and Bi [34] conducted a study on surface defects in APs and improved on the Faster R-CNN by multi-scale feature fusion, ultimately achieving a detection accuracy of 75.8% mAP, an improvement of 12.5%. Due to the two-stage object detection algorithm used, the detection speed was slow, and the FPS was only 1.2. Duan and Zhang [35] proposed a two-branch gradient image-based CNN for aluminum SDD, in which the original RGB image and the gradient image are input in two branches, and finally, the feature information of the two branches is fused using the concat operation. However, the two-branch network structure design may slow down the detection speed and have an impact on real-time performance. Ma et al. [36] improved YOLOv4 using depthwise separable convolution and a parallel dual-channel attention module for surface defect detection on APs. Li et al. [37] proposed a method for classifying surface defects in APs based on RepVGG and the Convolutional Block Attention Module (CBAM). The classification experiments on ten AP surface defects showed that the classification accuracy was as high as 99.41%. However, the study only carried out the classification of defect images and did not explore defect detection in depth. Wang et al. [38] proposed MS-YOLOv5 based on YOLOv5, improved with PE-Neck and multi-stream networks. Tests on a dataset of surface defects of APs with seven defects showed that the mAP could reach 87.4%, and the FPS was 19.1, which basically met the real-time requirements. However, the dataset has fewer defect types and does not fully reflect actual industrial production conditions. In summary, DL-based object detection methods are feasible for detecting surface defects in industrial products. However, the existing research on the detection of surface defects in APs suffers from several deficiencies as a whole. Firstly, the single-minded pursuit of higher detection accuracy leads to a slower detection speed, ignoring the real-time requirements of industrial inspection. Secondly, some studies address only the classification of defect images without delving into the detection of defects in images, although locating the specific defect is also crucial in industrial production.
Finally, existing studies cover few defect categories, even though a wide variety of defects will inevitably be encountered in a complex production environment. In addition, AP surface defects present problems such as a fuzzy definition of the defect area, large variation of defect scale, an imbalance in defect aspect ratio, and high similarity between inter-class defects. Therefore, considering both the shortcomings of current research and the characteristics of aluminum surface defects, this paper improves upon YOLOv5 and proposes an attention-guided MA-YOLO for AP surface defect detection. Experiments show that MA-YOLO can detect AP surface defects more efficiently while satisfying the real-time requirements of industrial inspection. The specific work in this paper is as follows.
(1) To address the randomness of the initial values taken by the K-Means clustering algorithm, we adopted the K-Means++ clustering algorithm to optimize the original anchor boxes, bringing the clustering results closer to the global optimum. This not only alleviates the problem of imbalanced defect aspect ratios but also improves the detection accuracy and convergence speed of the network.
(2) To address the large variation in defect scale and the imbalance in defect aspect ratio on the aluminum surface, this paper optimized a multi-scale FFN, which improved the detection performance on high aspect ratio defects and enhanced its adaptability to defects of different scales.
[...] of APs have shown that MA-YOLO has a significant improvement in precision while meeting real-time requirements of industrial production, proving the efficiency of the proposed method. The effectiveness of the proposed improvement strategy is also verified by ablation experiments.
The rest of the paper is organized as follows: Section II introduces the dataset used in this paper, as well as the data augmentation methods, and outlines the YOLOv5 algorithm. Section III presents MA-YOLO and details specific improvement strategies. Section IV describes the experimental environment, parameter settings, and evaluation metrics for this paper. Section V gives detailed experimental results, performs a visual analysis, and verifies the effectiveness of the improvement strategy and the superiority of the proposed method. Section VI summarizes the work of this paper and indicates future research directions.

II. RELATED WORK

A. DATA INTRODUCTION
To facilitate follow-up research by readers, the experiments in this paper use the open dataset from the AP surface defect detection contest held by Ali Tianchi [39]. For ease of presentation, we name this dataset AL_Dataset in this article. AL_Dataset contains ten defect types that are common in production: Non-Conductive (NC), Scratch, Corner Leak (CL), Orange Peel (OP), Leakage, Jet, Paint Bubble (PB), Crater, Parti-Colour (PC), and Dirty Point (DP). AL_Dataset contains 2776 defect images, all at a resolution of 2560 × 1920. Some of the surface defect images of APs are shown in Figure 1.

B. DATA AUGMENTATION PROCESSING
Compared with other computer vision tasks, SDD does not have a large and unified data set such as ImageNet [40], PASCAL-VOC [41], and COCO [42]. Defect detection mainly studies specific applications in different detection objects and scenarios. Compared with more than 14 million sample data in the ImageNet dataset, the most critical problem faced in SDD is the small sample size problem. Even in many real industrial scenes, there are only a few or dozens of defect images.
In this paper, the number of defective images of each category in AL_Dataset was statistically analyzed and the results are shown in Figure 2a. It was observed that, on the one hand, AL_Dataset has the problem of small data volume compared to large datasets such as ImageNet and COCO.
On the other hand, the number of Leakage defect images in AL_Dataset is 538 (the most), while the number of PB defect images is 82 (the least), which is a severe class imbalance. DL is strongly data-driven, and the quality and quantity of data often significantly impact the detection model's performance [43]. When the training data is insufficient, the detection model is more prone to overfitting, which can affect the model's detection accuracy and generalization ability. When the training samples are unbalanced, classes with more samples interfere with the model's learning of classes with fewer samples, which can also reduce the model's accuracy. To address these two issues, and considering the characteristics of surface defects in APs and specific production environmental factors, this article uses Gaussian noise, rotation transformation, brightness transformation, and contrast transformation to enhance and expand AL_Dataset through data augmentation. Table 1 details the reasons for using these data augmentation methods and the expected results. After data augmentation, AL_Dataset contains 8538 defect images, and each class contains at least about 500 images. The results are shown in Figure 2b; the augmentation effectively mitigates the problems of an insufficient dataset and an unbalanced distribution of sample data. Finally, we divided AL_Dataset into a training set, a validation set, and a test set in a ratio of 8:1:1 for the research on SDD of APs in this paper.
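The four augmentation operations named above can be sketched with NumPy as follows. This is only an illustration: the function names, the noise level, the brightness offset, the contrast factor, and the restriction to 90-degree rotations are all assumptions, since the text does not state the parameters actually used. For a detection dataset, the bounding-box labels would also have to be rotated together with the image, which is not shown here.

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise (sigma in 0-255 intensity units, assumed value)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def rotate_90(img, k=1):
    """Rotate an (H, W, C) image by k * 90 degrees counter-clockwise."""
    return np.rot90(img, k, axes=(0, 1)).copy()

def adjust_brightness(img, delta=30.0):
    """Shift all intensities by delta and clip to the valid range."""
    return np.clip(img.astype(np.float32) + delta, 0, 255).astype(np.uint8)

def adjust_contrast(img, factor=1.5):
    """Scale the deviation of each pixel from the global mean intensity."""
    pixels = img.astype(np.float32)
    mean = pixels.mean()
    return np.clip((pixels - mean) * factor + mean, 0, 255).astype(np.uint8)
```

Under-represented classes such as PB could be passed through several of these transforms to approach the per-class count reported above; how the transforms were actually combined per class is not specified in the text.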

C. INTRODUCTION TO THE YOLOv5 MODEL
YOLO is a representative single-stage object detection algorithm, which transforms object detection into a regression problem. Firstly, the offset between the ground truth box and the predicted box is calculated through the loss function; then an optimizer is used to regress the offset of the predicted boxes, which finally yields accurate detection. Moreover, compared to two-stage object detection algorithms such as Faster R-CNN and Mask R-CNN, YOLO has a significant advantage in detection speed, making it more favored by the industry. YOLOv5 is the most technically mature algorithm in the YOLO family and is widely used in industrial production [44]. Compared to previous versions of YOLO, YOLOv5 combines many of the best designs from advanced networks and therefore offers enhanced detection performance, faster detection, and smaller size. Depending on the depth and width of the network, YOLOv5 can be further divided into YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. In that order, detection performance and model size increase while detection speed decreases. YOLOv5s was chosen as the baseline algorithm for this paper due to real-time considerations for detecting surface defects in APs. The overall architecture of YOLOv5s is shown in Figure 3.
On the whole, YOLOv5s can be divided into four parts, namely Input, Backbone, Neck, and Head. Firstly, the Input part of the model is responsible for pre-processing the input images, including image enhancement and adaptive image scaling. Next is the Backbone, also called the feature extraction network, which consists of the CBS module, the C3 module, and the SPPF module. The CBS module replaces the traditional Pooling operation and is responsible for the feature map's scale and dimensional transformation. The C3 module is a modified version of the Cross Stage Partial module [45]; it adopts a two-branch structure and stacks residual structures to enhance the learning capability of the network while remaining lightweight, and it can effectively extract the shallow texture features and deep semantic features of the image. The SPPF module, on the other hand, uses Max Pooling operations at different scales, which effectively expands the receptive field of the model and increases the richness of the extracted features. Third, the Neck, also called the FFN, fuses feature map information from different scales in the feature extraction network, making the network more capable of feature representation. In YOLOv5, a Path Aggregation Network (PANet) is used [46], which improves the Feature Pyramid Network (FPN) and adds a new bottom-up feature transfer path to enhance the interactive fusion of shallow and deep information. Finally, there is the Head, also known as the detection head. In YOLOv5, there are three detection heads, one for large objects (20 × 20), one for medium objects (40 × 40), and one for small objects (80 × 80). The detection head generates multiple prediction boxes in the detection image; boxes below the confidence threshold are filtered out first, and the redundant boxes are then removed by Non-Maximum Suppression (NMS) to obtain the final detection result of the network.
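The post-processing described at the end of the paragraph above (confidence filtering followed by NMS) can be sketched in plain Python. This is a minimal, class-agnostic illustration rather than YOLOv5's own implementation; the function names, the (x1, y1, x2, y2) box format, and the two threshold values are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    """Return indices of the boxes that survive, highest score first."""
    # Step 1: drop boxes below the confidence threshold, sort the rest by score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Step 2: greedily keep a box only if it does not overlap a kept box too much.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep
```

In a multi-class detector the suppression step is usually applied per class, so that overlapping boxes of different defect types do not suppress each other.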

A. MA-YOLO OVERVIEW
Although YOLOv5 exhibits strong object detection performance and real-time detection speed in natural scenes, complex aluminum SDD scenarios present problems such as a fuzzy definition of defect areas, large variation of defect scale, imbalance of defect aspect ratio, and high similarity between inter-class defects, so YOLOv5 needs further optimization. To address the above problems, this paper firstly adopts the K-Means++ clustering algorithm to optimize the original anchor boxes [47], which makes the initial anchor box settings more relevant to AL_Dataset, alleviates the problem of extreme defect aspect ratio, and helps improve the detection accuracy and convergence speed of the network. Next, PANet is improved through cross-scale feature linking and weighted feature fusion so that deep feature information is transmitted more effectively, alleviating the problem of large variations in defect scale and imbalance of defect aspect ratio. Moreover, the effective fusion of deep and shallow information also enhances the adaptability of the network to features of different scales. Finally, this paper introduces a novel MA attention module, which enables the network to focus on the defect features on the surface of APs and improves the detection accuracy of the network. With the above-mentioned improvement strategies, this paper proposes an SDD method with attention guidance for APs, called MA-YOLO. The overall architecture of MA-YOLO is shown in Figure 4.

B. K-MEANS++ CLUSTERING ALGORITHM
YOLOv5 is a single-stage object detection algorithm based on anchor boxes. The setting of the initial anchor boxes affects the algorithm's detection accuracy and convergence speed to a certain extent. In order to make the initial anchor box sizes fit the training dataset better, the ground truth boxes in the training dataset are usually clustered using the K-Means algorithm and fine-tuned using a genetic algorithm before YOLOv5 training. However, convergence in the K-Means clustering process is heavily dependent on the initial values of the cluster centers, which may lead to a large difference between the randomly initialized cluster centers and the optimal ones. To address this problem, this paper uses the K-Means++ clustering algorithm instead. The core idea is to make the centers as far away from each other as possible when initializing the cluster centers, so that the clustering results are closer to the global optimum. Unlike objects in natural scenes, the surface defects of APs often present large variation in defect scale and imbalance of defect aspect ratio. Therefore, this paper proposes to use the K-Means++ algorithm to cluster the ground truth boxes in AL_Dataset to get more suitable anchor boxes for model training. The step-by-step process of the K-Means++ algorithm is as follows:
1. Randomly select a sample as the first initialized cluster center.
2. Calculate the distances of all samples to the already chosen cluster centers, and select a new cluster center with a probability that favors samples far from the existing centers.
3. Repeat step 2 until the required number of cluster centers has been selected.
4. Calculate the distance of each sample to each cluster center and reassign each sample to the cluster with the closest center.
5. Recalculate the average height and width of all samples in each cluster to obtain the new cluster centers.
6. Repeat steps 4 and 5 until the distance traveled by the cluster centers is less than a threshold, or the set upper limit on the number of iterations is reached.
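The six steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the distance is taken to be Euclidean distance between (width, height) pairs, and the seeding probability is taken to be proportional to the squared distance to the nearest existing center, as in the standard K-Means++ formulation. The genetic fine-tuning mentioned in the text is not included.

```python
import random

def kmeans_pp_anchors(boxes, k=9, max_iter=100, tol=1e-4, seed=0):
    """Cluster (width, height) pairs into k anchor sizes using K-Means++ seeding."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Steps 1-3: pick the first center at random, then prefer distant samples.
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        d2 = [min(sq_dist(b, c) for c in centers) for b in boxes]
        if sum(d2) == 0:  # every sample coincides with an existing center
            centers.append(rng.choice(boxes))
        else:
            centers.append(rng.choices(boxes, weights=d2, k=1)[0])

    for _ in range(max_iter):
        # Step 4: assign each box to its nearest center.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            nearest = min(range(k), key=lambda i: sq_dist(b, centers[i]))
            clusters[nearest].append(b)
        # Step 5: new center = mean width and height (empty clusters keep their center).
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centers.append((w, h))
            else:
                new_centers.append(center)
        # Step 6: stop once no center moves farther than the tolerance.
        shift = max(sq_dist(a, b) ** 0.5 for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break

    # Sort by area so the anchors can be split across the detection layers.
    return sorted(centers, key=lambda c: c[0] * c[1])
```

With k = 9, the sorted result can be split into three groups of three, one group per detection layer, from the smallest to the largest anchors.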
In this paper, we set 9 cluster centers based on the number of detection layers and the preset number of anchor boxes per layer. Specifically, we preset 3 scales of anchor boxes for each detection layer, and the entire model contains 3 detection layers. Before clustering, we adaptively scaled the image from a size of 2560 × 1920 to 640 × 480. In Figure 5, we present the clustering visualization results of the K-Means algorithm and the K-Means++ algorithm. Figure 5a shows the size distribution of all real boxes in the AL_Dataset training set, where the x-axis represents the width of the real box and the y-axis represents the height of the real box. It can be observed from the image that although there is some dispersion in the size of the real boxes, they are mainly concentrated in the lower left corner of the image and on the vertical line with a width value of 640. Figures 5b and 5c show the clustering results of the K-Means algorithm and the K-Means++ algorithm on the AL_Dataset training set, respectively. In the figure, different colors represent different clusters, and red stars are used to represent cluster centers. By observing the figure, it can be found that compared with the K-Means algorithm, the K-Means++ algorithm has a better partitioning effect, and the cluster centers of each cluster are relatively dispersed. At the same time, we also found that there are problems with large defect scale variation and imbalanced defect aspect ratios in the AL_Dataset. Table 2 records the preset sizes of the original anchor boxes in YOLOv5, the anchor box sizes after K-Means clustering, and the anchor box sizes after K-Means++ clustering.

C. MULTI-SCALE FEATURE FUSION NETWORKS
The multi-scale FFN is mainly proposed to solve the problem of multi-scale differences of objects in the detection image [48]. The basic idea is to fuse the image edge texture features extracted by the shallow network with the image semantic features extracted by the deep network for output so that the detection network can achieve better localization and regression [49]. FPN [50] was the first to propose the idea of multi-scale feature fusion; it builds a top-down network structure that transfers feature information from deep layers to shallow layers, achieving an interactive fusion of information and enhancing the network's detection of small objects. PANet adds a bottom-up feature information transfer path to the FPN to enhance the information fusion between feature maps, thus improving the overall detection performance of the network. The Bidirectional Feature Pyramid Network (BiFPN) [51] optimizes the bidirectional fusion structure of PANet by proposing cross-scale connections and weighted feature fusion, further enhancing the multi-scale feature sensing capability of the detection network. In this paper, PANet is improved by drawing on the ideas of cross-scale connections and weighted feature fusion in BiFPN to address the large variation of defect scale and the imbalance in defect aspect ratio in APs. The weighted cross-scale feature fusion is achieved by up-sampling the deep feature maps in the backbone network, which enables the deep feature information to be transferred more effectively, thus improving the network's performance in detecting multiscale defects. In Figure 6, a represents the original FFN in YOLOv5, and b represents the improved FFN. Taking the P4 detection layer in Figure 6b as an example, the fusion process of its feature information can be represented by formulas (1) and (2), where ω_i represents different learnable weights, ε = 0.0001 is used to avoid numerical instability, Conv represents a convolutional operation, and UpSample represents the upsampling operation.
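As a rough illustration of weighted feature fusion, the sketch below implements the fast normalized fusion used in BiFPN, in which each input is scaled by a non-negative learnable weight and the sum is divided by ε plus the sum of the weights. It is not a reproduction of formulas (1) and (2): the function names, the nearest-neighbour upsampling, and the choice of inputs are assumptions, and the convolution applied after fusion is omitted.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def weighted_fusion(features, weights, eps=1e-4):
    """Fast normalized fusion: sum_i(w_i * F_i) / (eps + sum_j w_j)."""
    # ReLU keeps the learnable weights non-negative, as in BiFPN.
    w = np.maximum(np.asarray(weights, dtype=np.float32), 0.0)
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused / (eps + w.sum())

# Example: fuse a P4-sized map with an upsampled, deeper P5-sized map.
p4 = np.ones((4, 4, 2), dtype=np.float32)
p5 = np.full((2, 2, 2), 3.0, dtype=np.float32)
p4_fused = weighted_fusion([p4, upsample2x(p5)], weights=[1.0, 1.0])
```

Because the weights are normalized, each fused value stays within the range of its inputs, and a weight that is learned toward zero effectively switches the corresponding input off.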

D. MA ATTENTION MODULE
The attention mechanism originates from the human visual system's perception of different things. Its basic idea is to make the detection network focus on useful feature information in the image and suppress useless background and noise interference, thereby improving the network's detection performance. In recent years, many attention mechanisms have been proposed. Squeeze and Excitation (SE) Attention is a typical channel attention proposed by Hu et al. [52]. It learns the correlation between different channels in the feature map by performing SE operations on the input feature map, giving higher weights to feature maps with better performance and lower weights to those with poorer performance. CBAM is an efficient attention module proposed by Woo et al. [53]. Building on the SE attention module, it adds a spatial attention module and combines channel attention and spatial attention, enabling the network both to perceive the correlation between different channels in feature maps and to focus on the position information of objects in feature maps. Currently, the mainstream methods learn the correlation between different channels in the feature map through Max Pooling and Average Pooling. There are also Mixed Pooling, Stochastic Pooling, and Random Pooling operations. However, they have problems such as high computational complexity, low accuracy, and difficult deployment, so they are not considered in this research [54]. Based on the research on SE and CBAM, this paper proposes a novel MA attention module for surface defect detection in APs. The specific implementation details are shown in Figure 7. We denote the original input feature map as F (H×W×C) and perform Max Pooling and Average Pooling on it separately. On the one hand, Max Pooling can better preserve important spatial information in F and reduce noise interference to some extent. On the other hand, Average Pooling can better preserve overall spatial information in F and is relatively smooth.
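To make the role of the two pooling operations concrete, the sketch below shows a CBAM-style channel attention, the prior design referred to above, in which global Average Pooling and global Max Pooling each produce one descriptor per channel and a shared two-layer MLP turns them into channel weights. It is not the MA module itself, whose structure is given in Figure 7; the function names, the weight shapes, and the reduction ratio are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """CBAM-style channel attention on an (H, W, C) feature map F.

    W1 has shape (C, C // r) and W2 has shape (C // r, C), where r is an
    assumed reduction ratio; both layers are shared by the two branches.
    """
    avg_desc = F.mean(axis=(0, 1))  # (C,) smooth, global summary per channel
    max_desc = F.max(axis=(0, 1))   # (C,) strongest response per channel

    def mlp(v):
        return np.maximum(v @ W1, 0.0) @ W2  # hidden layer with ReLU

    weights = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,) values in [0, 1]
    return F * weights, weights

# Example with random (untrained) weights, C = 16 and r = 4.
rng = np.random.default_rng(0)
F = rng.random((8, 8, 16))
W1 = rng.normal(size=(16, 4))
W2 = rng.normal(size=(4, 16))
refined, channel_weights = channel_attention(F, W1, W2)
```

The two descriptors are complementary: the maximum keeps the most salient activation, which may correspond to a small defect, while the average reflects the overall response of the channel.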

B. EVALUATION METRICS
In DL-based object detection algorithms, Precision (P), Recall (R), Average Precision (AP), mean Average Precision (mAP), and F1 score are important metrics for evaluating the merits of detection models. Precision is the proportion of samples predicted as positive by the model that are actually positive, and recall is the proportion of all labeled positive samples that are correctly predicted as positive by the model. AP refers to the average precision of a certain category in the dataset. It can be represented by the area enclosed by the PR curve with R as the horizontal axis and P as the vertical axis. Meanwhile, mAP represents the average of the AP over all categories in the dataset and is usually reported at an Intersection over Union (IOU) threshold of 0.5, also called mAP50. Finally, the F1 score is the harmonic mean of precision and recall, which gives a more comprehensive picture of the overall performance of the model. Their calculation formulas are given in (3)-(7), where TP is True Positive, FP is False Positive, FN is False Negative, TN is True Negative, i is the index of the category, and N is the total number of categories.
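The metrics described above can be sketched in plain Python as follows. The function names are hypothetical, and the AP computation uses one common convention (the area under the precision envelope of the PR curve); evaluation toolkits differ in how they interpolate the curve, so the exact numbers they report may differ slightly.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recalls, precisions):
    """Area under the PR curve; recalls must be sorted in ascending order."""
    # Add end points so the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [1.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values (N classes)."""
    return sum(ap_per_class) / len(ap_per_class)
```

For mAP50, a prediction counts as a true positive when its IOU with a ground truth box of the same class is at least 0.5, and the per-class AP values are then averaged over the categories in the dataset.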
In our experiments, in addition to the evaluation metrics above, we report model size and FPS to assess how far MA-YOLO remains lightweight while significantly improving mAP50.

A. OVERALL PERFORMANCE COMPARISON ANALYSIS
In the experimental environment described in this paper, we visually compared the performance of YOLOv5s and MA-YOLO on AL_Dataset. Figure 8 shows the PR curves of YOLOv5s and MA-YOLO. MA-YOLO improves mAP50 by 2.9% over YOLOv5s, which is a significant improvement. Figure 9 shows the F1 score curves of YOLOv5s and MA-YOLO. On the whole, the curves of all categories improve, and the F1 score generally increases by 3%. Comparing the PR curves and F1 score curves of the two models shows that MA-YOLO is significantly better than YOLOv5s in overall detection performance.

B. ABLATION EXPERIMENTS
In the previous section, this paper demonstrated the overall performance advantage of MA-YOLO through visual analysis of the PR curves and F1 score curves. To further verify the effectiveness of each improvement strategy proposed in this paper, we conducted ablation experiments on the AL_Dataset test set. The evaluation metrics were mAP50, model size, and FPS, and the specific experimental results are shown in Table 3. Method A represents the baseline YOLOv5s, with an mAP50 of 85.2%, a model size of 14.2 M, and an FPS of 38. Method B improved on YOLOv5s by using the K-Means++ clustering algorithm, achieving a 0.3% improvement in mAP50 without increasing computational complexity. This indicates that optimizing the initial anchor boxes is beneficial for improving the performance of anchor-based object detection algorithms. [...] maintaining the same FPS, effectively improving detection accuracy for surface defects on APs. This also proves that the MA attention module proposed in this paper can effectively improve the detection of surface defects on APs. Finally, compared with the baseline YOLOv5s, MA-YOLO increased mAP50 by 2.9% while increasing the model size by only 0.4 M and decreasing FPS by 4, still meeting the real-time requirements of industrial detection. Next, we conducted individual experimental analyses of each improvement strategy in MA-YOLO, including the K-Means++ clustering algorithm, the improved multi-scale FFN, and the MA attention module. The detailed experimental results are as follows.

1) EFFECTIVENESS OF K-MEANS++
The initial setting of anchor boxes is crucial for anchor-based object detection algorithms. This paper introduces the K-Means++ clustering algorithm to alleviate the tendency of the K-Means algorithm to fall into local optima, effectively improving the detection accuracy and convergence speed of the network. In Figure 10, we compare the effects of the K-Means and K-Means++ algorithms on the box loss during model training. The results show that the K-Means++ clustering algorithm effectively accelerates the convergence of the network and further improves the detection accuracy.
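The idea behind K-Means++ anchor clustering can be sketched as follows. This is a minimal illustration, not the exact procedure used in our experiments: it assumes boxes are described by (width, height) only, uses 1 - IoU as the distance (a common choice for anchor clustering), and all function names and the toy box list are hypothetical.

```python
import random

# Sketch only: K-Means++ seeding followed by Lloyd iterations on (width, height) pairs.
# The 1 - IoU distance, the names, and the toy data are assumptions for illustration.

def wh_iou(a, b):
    """IoU of two boxes given as (width, height), aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeanspp_init(boxes, k, rng):
    """Pick k seeds; each new seed is drawn with probability proportional to the
    squared distance to its nearest already-chosen seed."""
    centres = [rng.choice(boxes)]
    while len(centres) < k:
        d2 = [min(1.0 - wh_iou(b, c) for c in centres) ** 2 for b in boxes]
        if sum(d2) == 0:  # every box coincides with a seed; fall back to uniform
            centres.append(rng.choice(boxes))
        else:
            centres.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centres


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster boxes into k anchors, returned sorted by area."""
    rng = random.Random(seed)
    centres = kmeanspp_init(boxes, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            nearest = max(range(k), key=lambda j: wh_iou(b, centres[j]))
            clusters[nearest].append(b)
        new_centres = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centres[j]  # keep the old centre if a cluster is empty
            for j, cl in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return sorted(centres, key=lambda c: c[0] * c[1])


# Toy boxes with very different aspect ratios (wide, square, tall).
boxes = [(300, 20), (320, 22), (310, 18), (40, 40), (42, 44),
         (38, 41), (25, 280), (22, 300), (24, 290)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```

Compared with uniformly random initialization, the distance-weighted seeding makes it unlikely that two initial centres fall in the same group of boxes, which is why K-Means++ is less prone to poor local optima when the aspect ratios are strongly imbalanced.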

2) EFFECTIVENESS OF IMPROVING FPN
To better address the problem of large scale variation in surface defects of APs, this paper improves the FPN to enhance the fusion of feature information at different scales. Figure 11 compares three groups of detection results (a is the detection result before the FPN improvement, and b is the detection result after it). The improved FPN effectively alleviates the problem of object scale variation and also improves the detection of small objects, thus significantly reducing missed detections.

3) EFFECTIVENESS OF MA ATTENTION MECHANISM
The purpose of introducing the MA attention mechanism is to enhance the model's attention to defect features and suppress useless background and noise interference. In Table 4, we compared the effects of several different attention mechanisms on model performance, among which the MA attention mechanism improved P, R, and mAP50 by 2.5%, 2.3%, and 1.1%, respectively. Figure 12 shows the heatmap visualization analysis results of different attention mechanisms, indicating that the MA attention mechanism can effectively alleviate the interference of complex backgrounds and focus on collecting feature information of different defects. This result further proves the effectiveness of the MA attention mechanism.

C. COMPARISON WITH OTHER OBJECT DETECTION ALGORITHMS
In the previous section, we compared the performance of YOLOv5s and MA-YOLO. To comprehensively analyze the performance advantages of MA-YOLO, we compared it with mainstream object detection algorithms on the AL_Dataset test set. The experimental results are shown in Table 5, where MA-YOLO achieved the best F1 score and mAP50. Specifically, MA-YOLO outperformed the second-best algorithm, YOLOX_s, by 1.4% in mAP50 and 0.2% in F1 score, and outperformed the baseline YOLOv5s by 2.9% and 2.2%, respectively. Among the other algorithms, Faster R-CNN achieved the best recall, but its precision was low, resulting in poor overall detection performance. Moreover, because its detection process is divided into two stages, its detection speed is the slowest, with an FPS of only 7. In contrast to Faster R-CNN, SSD300 performs very well in precision but poorly in recall, resulting in average overall detection performance, with an mAP50 of 84.1%. Owing to its simple structure, SSD300 is fast, with an FPS of 34. YOLOv3 also achieved good detection performance, with an mAP50 of 84.6%, but its model is larger and its detection speed is slow. YOLOv4 and YOLOv7 perform similarly on the detection metrics, both with high precision and low recall, resulting in poor overall detection performance. However, the model size of YOLOv7 is only 30.6% of that of YOLOv4, which gives it an advantage in speed, with an FPS of 27. YOLOv6s is relatively balanced in precision and recall, but its overall detection performance is mediocre, with an mAP50 of 78.6%. As a Transformer-based end-to-end object detection network, DETR detects large objects well, so it also achieved good results in the defect detection of APs. However, its model size is 497.3 M, and its FPS is only 8. A comprehensive analysis of the overall performance metrics of the various object detection algorithms shows that MA-YOLO achieved the best detection performance while maintaining a leading detection speed, with an FPS of 34, second only to YOLOv5s, which meets the real-time requirements of industrial detection.
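For readers who prefer latency to throughput, the FPS values quoted above can be converted to an approximate per-frame processing time. The short sketch below does only this conversion; the helper name `frame_time_ms` is illustrative, and the figures assume that FPS was measured one image at a time.

```python
# FPS values quoted in the comparison above; the helper name is illustrative.
fps = {
    "Faster R-CNN": 7,
    "DETR": 8,
    "YOLOv7": 27,
    "SSD300": 34,
    "MA-YOLO": 34,
    "YOLOv5s": 38,
}


def frame_time_ms(frames_per_second):
    """Approximate time spent on one frame, in milliseconds."""
    return 1000.0 / frames_per_second


for name, value in fps.items():
    print(f"{name}: {frame_time_ms(value):.1f} ms per frame")
```

Under this assumption, MA-YOLO spends roughly 29.4 ms per frame against roughly 26.3 ms for YOLOv5s, a difference of about 3 ms.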
To verify the effectiveness of MA-YOLO more intuitively, the detection results of YOLOv5s and MA-YOLO on the AL_Dataset test set are visualized and analyzed. The detection results are shown in Figure 13, which compares five groups of AP surface defect images, containing 15 images in total. The first to fourth rows of Figure 13 show that YOLOv5s generally suffers from missed detections, especially when the defect is small and the network has difficulty identifying it. In contrast, MA-YOLO enhances the model's focus on minor defects on the surface of APs by introducing the MA attention module, thus significantly reducing missed detections. Taking smaller scratch defects as an example, experimental results show that MA-YOLO can detect scratch defects with a minimum size of 42 × 42 in the original image (size: 2560 × 1920; defects smaller than 42 pixels cannot be recognized by the detection algorithm at the same confidence level). In addition, the fifth row of Figure 13 shows that MA-YOLO also reduces false detections compared with YOLOv5s. The analysis suggests that, by optimizing the multi-scale FFN, deep feature information is fused more effectively with shallow feature information, thus avoiding the false detection of background as a defect. In conclusion, this paper has demonstrated through extensive experiments and visual analysis that MA-YOLO performs better than the current mainstream object detection algorithms in AP surface defect detection.
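To put the 42 × 42 minimum defect size in perspective, the following worked calculation gives the share of the 2560 × 1920 image that such a defect occupies and its approximate side length after resizing. The 640-pixel network input is an assumption (it is the YOLOv5 default and is not stated above), as is resizing by the longer image side.

```python
# Worked arithmetic for the minimum detectable defect size quoted above.
IMAGE_W, IMAGE_H = 2560, 1920   # original image size from the text
MIN_DEFECT = 42                 # minimum detectable side length in pixels
ASSUMED_INPUT = 640             # assumption: network input size, not stated in the text

# Share of the original image area covered by a 42 x 42 defect.
area_fraction = (MIN_DEFECT * MIN_DEFECT) / (IMAGE_W * IMAGE_H)

# Side length after scaling the longer image side down to the assumed input size.
scale = ASSUMED_INPUT / max(IMAGE_W, IMAGE_H)
defect_at_input = MIN_DEFECT * scale

print(f"{area_fraction:.4%}")   # 0.0359%
print(defect_at_input)          # 10.5
```

Under these assumptions the smallest detectable scratch covers about 0.036% of the image and shrinks to roughly 10 pixels per side at the network input, which illustrates why small-scale defects are the hardest case.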

D. FAILURE CASE ANALYSIS
Although MA-YOLO has shown the best detection performance in many experiments, it still produces missed detections and false detections in some cases because of the complex conditions found in industrial defect detection scenarios. The specific details are shown in Figure 14. For the failed cases, this paper analyzes in depth the reasons for the missed and false detections. In Figure 14, a and b are cases of missed detection, and the missed defects are marked with yellow boxes. In case a, the analysis suggests that the missed defect closely resembles the reflection of light from the surface of the AP, which causes MA-YOLO to overlook it. In case b, even the human eye cannot clearly identify the minor defects on the aluminum profile's surface, making it difficult for MA-YOLO to detect them accurately. Case c in Figure 14 is a misclassification, and the defect is marked with a blue box. MA-YOLO misclassified the PB defect as an NC defect, probably because continuous and single PB defects have inconsistent characteristics and because the PB defect and the NC defect in case c are similar in size. This analysis of failure cases exposes the shortcomings of MA-YOLO. In future work, we will optimize the model to reduce missed and false detections.

VI. CONCLUSION
The detection of defects on the surface of APs has an important influence on the quality of AP products. In this paper, we propose MA-YOLO, an attention-guided algorithm for detecting defects on AP surfaces, to address issues such as fuzzy definition of the defect area, large variation in defect scale, imbalance in defect aspect ratio, and high similarity between inter-class defects. Firstly, the original anchor boxes are optimized using the K-Means++ algorithm instead of the K-Means algorithm so that the initial anchor box settings better fit the defect dataset, thereby alleviating the problem of imbalance in the defect aspect ratio. Secondly, improving the multi-scale FFN not only improves the detection of defects with highly imbalanced aspect ratios but also enhances the adaptability to features at varying scales. Finally, a novel MA attention module is proposed to improve the detection accuracy of the network, especially for small defects, by enhancing the model's focus on defect feature information and suppressing unwanted background and noise interference. Experiments on the AP surface defect dataset show that MA-YOLO has better detection accuracy than current mainstream object detection algorithms. Compared to the baseline YOLOv5s, MA-YOLO improves mAP50 and F1 score by 2.9% and 2.2%, respectively. Overall, MA-YOLO achieved a significant improvement in mAP50 while remaining comparable to YOLOv5s in model size, meeting the real-time requirements of industrial detection. In future work, we will improve the detection of the defect categories that are currently not detected well, thus improving the overall detection accuracy of the model.
YAHONG MA was born in Fuping, Shaanxi, China. She received the Ph.D. degree in electrical engineering from Xi'an Jiaotong University, in 2013. She was with the Department of Information Engineering, Xijing University, where she is currently an Assistant Professor. She is the author of two books and more than 20 articles. She has led or participated in many national projects. Her research interests include intelligent information acquisition and processing and the Internet of Things.
JIAWEI DU received the B.E. degree from Xijing University, Xi'an, Shaanxi, China, in 2021, where she is currently pursuing the master's degree in electronic information. Her research interests include networks and information security.