PCB Defect Detection Method Based on Transformer-YOLO

To address the low accuracy and efficiency of reference-based methods for printed circuit board (PCB) defect detection, a Transformer-YOLO network detection model is proposed. Firstly, an improved clustering algorithm is used to generate anchor boxes suited to the PCB defect data set used in this paper. Secondly, instead of the traditional approach of extracting image features with a convolutional neural network, Swin Transformer is used as the feature extraction network, which can effectively establish dependencies between image features. Finally, to adjust the order of the channels in the feature map and enable the network to focus more effectively on the most valuable information, a convolution and attention mechanism module is added to the feature detection network. The proposed model is compared with Faster R-CNN, SSD, YOLOv3, YOLOv4 and YOLOv5; the experimental results show that it improves accuracy by 23.90%, 15.51%, 10.70%, 7.83% and 6.12%, respectively, outperforming most mainstream target detection models while keeping a relatively small model volume.


I. INTRODUCTION
The printed circuit board (PCB) is a key component in most electronic devices [1]. With the advent of the fourth industrial revolution, demand for electronic products such as smartphones and laptops keeps increasing. As the underlying operating platform of electronic products and devices [2], the PCB industry's future prosperity is beyond question. At the same time, with the increasing demand for PCBs and rising productivity, PCB manufacturing is becoming more complex and miniaturized [4], which poses a considerable challenge for detecting defects in circuit boards. Defects in a PCB can greatly change its performance, indirectly affect the quality and sales of downstream products, and cause serious economic losses. Therefore, it is necessary to carry out defect detection during PCB production, which can reduce the production cost of the product and effectively improve the product qualification rate [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
In industrial production, PCB defects are caused by welding failures and by improper operation or storage. Current PCB defects are mainly divided into six categories: short, open circuit, spur, spurious copper, mouse bite and missing hole [7]. PCB defect detection methods mainly include manual inspection, performance testing, the reference comparison method and the non-reference method [8]. Manual inspection relies mostly on technicians' subjective judgment of whether defects are visible to the naked eye on the PCB; it is inefficient, and defects are easily missed. Performance testing determines whether a PCB is defective mainly through functional tests, but it can cause some damage to the PCB. The reference comparison method [9], which is widely used in industry, is based on template matching: the PCB under test is compared with a standard template to determine whether defects are present. Its main drawback is the need for accurate spatial alignment; otherwise the false detection rate is high. In recent years, automatic optical inspection (AOI) has become the main method used in industrial inspection, and its detection accuracy is higher than that of the other methods. However, because of its strict parameter-setting requirements and high sensitivity, an AOI system is particularly prone to over-screening and therefore requires subsequent manual secondary screening [10]. Compared with these traditional methods, deep learning-based defect detection methods perform better and effectively avoid low detection accuracy and efficiency. They have therefore become a hot research topic in recent years.
At present, PCB defect detection using deep learning algorithms is relatively mature. The relevant algorithms can be divided into two categories: one-stage target detection algorithms, represented by SSD [11] and the YOLO series [12], and two-stage target detection algorithms, represented by R-CNN [16], Fast R-CNN [17] and Faster R-CNN [18]. Many researchers have studied the PCB defect detection task and reported results. Xin et al. [19] proposed an improved YOLOv4 model by analyzing the CSPDarknet53 backbone and adjusting the model hyperparameters, raising the accuracy to 96.88%. Li et al. [20] achieved 93.07% detection accuracy by preprocessing the data with a combination of real and virtual PCB images and by extending the three prediction output layers of YOLOv3 to four. Hu et al. [21] detected common PCB defect types by improving Faster R-CNN and using ResNet50 with image pyramids as the feature extraction network. Zeng et al. [22] proposed an enhanced multi-scale feature fusion method for small target detection and verified its effectiveness on the PCB data set, where it also obtained good results.
Based on the literature above, and in order to better balance detection accuracy, detection speed and model volume for PCB defect detection, this paper improves the YOLOv5 model and proposes a new network model. The main contributions of this paper are as follows:
• The original clustering algorithm is improved so that the resulting anchor boxes are better suited to the PCB defect data set, which improves the model's detection accuracy.
• Swin Transformer is adopted as the feature extraction network instead of a traditional convolutional neural network, because Swin Transformer has a strong global interaction mechanism that can efficiently establish dependencies between image features and does not suffer from the limited receptive field of convolutional neural networks.
• An attention mechanism module is added where the feature extraction network and the feature detection network interact, and the arrangement order of the channels in the feature map is adjusted so that the network model can focus more efficiently on the most important channel information.
Finally, the task of detecting and identifying common PCB defect types is completed.

II. RELATED WORKS
YOLOv5 is a widely used target detection model. Compared with YOLOv3 and YOLOv4, its detection accuracy is further improved. The prediction results output by the network mainly include the class and location of the target object. The overall architecture is largely consistent with YOLOv3 and YOLOv4. As shown in Figure 1, it consists of four parts. The first part is the input section, which mostly pre-processes the input image, including data augmentation and automatic learning of anchor boxes. The second part is the backbone network; YOLOv5 adopts CSPDarknet to extract image features. The third part is the feature detection network, which fuses feature information from several layers top-down and bottom-up. The fourth part is the output network, which gives the predicted class, confidence level and position of the target object.

III. METHOD
This paper makes its improvements on the basis of the YOLOv5 network, but the model keeps the same overall architecture: input, image feature extraction network, feature detection network, and output prediction results. Firstly, the original clustering technique is improved so that the resulting anchor boxes are better suited to the target data set. Secondly, Swin Transformer is used for image feature extraction. Finally, a convolution and SE Block module is added to the connection between the feature extraction network and the feature detection network. The overall architecture of the network is shown in Figure 2.

A. IMPROVED CLUSTERING ALGORITHM
The anchor boxes in the original YOLOv5 model are formed primarily by the k-means clustering algorithm and a genetic algorithm, and they have a direct impact on the network's detection accuracy. The k-means algorithm used in the original model has two major problems: all initial cluster centers are chosen by random selection, and Euclidean distance is used as the similarity measure between samples. Both problems affect the generated anchor boxes and, in turn, the model's detection accuracy. Therefore, in this paper, K-means++ is used to cluster the data set; it selects the cluster centers one at a time, iterating until all of them have been chosen. In addition, because the main purpose of the anchor boxes is to maximize the overlap area between the prediction box and the ground truth box, the clustering metric is redefined, and formula 1 is used to measure the distance between samples in the data set. The final clustering result is shown in Table 1, and the main process of the improved clustering algorithm is as follows:
Step1: Choose one sample from the PCB defect data set to serve as the initial cluster center.
Step2: Calculate the distance between each remaining sample in the data set and the initial cluster center according to formula 1.
Step3: Calculate the probability P(x) of each sample being selected as the next cluster center according to formula 2.
Step4: Repeat Step2 and Step3 until all nine cluster centers have been chosen, where 9 is the number of anchor boxes produced by the network model in this paper.
Step5: For each sample x in the data set, calculate the distance between the sample and each of the 9 cluster centers selected above according to formula 1, and assign the sample to the category of the nearest cluster center.
Step6: For each cluster category, recalculate the cluster center according to formula 3, and repeat Step5 and Step6 until the locations of the cluster centers no longer change.
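The six steps above can be sketched in a few lines of NumPy. Formulas 1 to 3 are not reproduced in this text, so the sketch assumes the forms commonly used for anchor clustering: a distance of 1 − IoU between width-height pairs (formula 1), a K-means++ selection probability proportional to the squared distance to the nearest chosen center (formula 2), and the per-cluster mean as the updated center (formula 3). The function names `iou_wh` and `kmeanspp_anchors` are hypothetical, and the sketch assumes the data set contains at least k distinct box sizes.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, treating all boxes as sharing one corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeanspp_anchors(boxes, k=9, seed=0, max_iter=100):
    """Cluster (w, h) pairs into k anchors, sorted by area."""
    rng = np.random.default_rng(seed)
    # Step1: pick the first center from the samples.
    centers = boxes[rng.integers(len(boxes))][None, :]
    # Step2-Step4: add the remaining centers one at a time.
    while len(centers) < k:
        d = (1.0 - iou_wh(boxes, centers)).min(axis=1)  # assumed formula 1
        p = d ** 2 / np.sum(d ** 2)                      # assumed formula 2
        centers = np.vstack([centers, boxes[rng.choice(len(boxes), p=p)]])
    # Step5-Step6: assign samples and update centers until they stop moving.
    for _ in range(max_iter):
        labels = (1.0 - iou_wh(boxes, centers)).argmin(axis=1)
        new_centers = np.array([
            boxes[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)                            # assumed formula 3
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

Passing an array of ground-truth box widths and heights (in pixels, shape N × 2) returns nine anchors ordered from smallest to largest area, which is the order in which YOLO-style detectors usually assign anchors to output scales.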

B. IMPROVED FEATURE EXTRACTION NETWORK
Defects on PCBs typically occupy a relatively small area, as the data set distribution in Figure 3 illustrates. Most defects are smaller than 60 pixels and make up only a small part of the entire image. At present, the feature extraction networks of one-stage and two-stage target detection models still rely primarily on convolutional neural networks, which are built from a series of convolution kernels of various sizes. Take the YOLOv5 model as an example: it employs CSPDarknet as the feature extraction network, which consists mostly of the Focus and CSP [23] architectures and contains a large number of convolution kernels. Although the depth of a convolutional neural network can be adjusted freely, a network designed with only a few convolution kernels lacks depth and can extract only part of the shallow feature information of the image. Conversely, a network with many convolution kernels can extract deeper semantic information, but a new problem arises: because the architecture is excessively complex, feature transfer across layers is weakened, which reduces the capacity to extract fine local features. In addition, because a convolutional neural network expands its receptive field only gradually as layers are stacked, it pays more attention to local feature information, learns the global aspects of the image poorly, and neglects the correlation between contextual features.
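To make the point about gradual receptive field growth concrete, the following sketch applies the standard recurrence for stacked convolutions: each layer enlarges the receptive field by (kernel − 1) times the product of the strides of all earlier layers. The helper name `receptive_field` and the layer configurations in the usage note are illustrative assumptions, not the actual CSPDarknet configuration.

```python
def receptive_field(layers):
    """Receptive field of one output unit.

    layers: list of (kernel_size, stride) tuples in input-to-output order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth scales with accumulated stride
        jump *= stride
    return rf
```

For example, `receptive_field([(3, 1)] * 3)` gives 7 and `receptive_field([(3, 2)] * 3)` gives 15. With stride-1 3 × 3 kernels the field grows by only 2 pixels per layer, so covering a 640-pixel input would take roughly 320 such layers, whereas an attention layer relates all positions inside its window in a single step.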
Since Google proposed the Transformer model [24] in 2017, it has greatly accelerated progress in Natural Language Processing. In the past two years, many researchers have introduced the Transformer into Computer Vision and achieved remarkable results. A prime example is DETR, a new network framework proposed in 2020 by Carion et al. [25], who applied the Transformer to target detection for the first time. In October of the same year, Dosovitskiy et al. [26] proposed ViT, a Transformer variant that is an image classification network based purely on the attention mechanism. Throughout the development of Computer Vision, convolutional neural network models, whose core is the convolution kernel module, have occupied a dominant position and played an irreplaceable role; their disadvantages are discussed in detail in the previous paragraph. In contrast, the Transformer is built mostly from attention mechanism modules and is therefore entirely distinct from the convolutional model. Using its attention mechanism, the Transformer can quickly determine the degree of connection between image features, make full use of contextual information, and model both global and local image features with a superior global interaction mechanism.
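The "degree of connection between image features" mentioned above is what the attention weights express. As a minimal sketch, the following NumPy code implements single-head scaled dot-product self-attention, softmax(QKᵀ/√d)V, over a set of feature tokens. The function names and the projection matrices `w_q`, `w_k`, `w_v` are illustrative; a real Transformer layer adds multiple heads, residual connections and normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over tokens of shape (N, D)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise similarity, (N, N)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights
```

Row i of `weights` states how strongly token i draws on every other token, so each output feature can depend on any position in the input after a single layer.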
In this paper, the main improvement is to the feature extraction network, where Swin Transformer [18] replaces the convolutional neural network. Swin Transformer builds on the Transformer and is composed of Patch Partition, Linear Embedding, Swin Transformer Block and other components, with the Swin Transformer Block as its core. In this module, the window attention mechanism allows the computation to be parallelized, and the sliding window multi-head attention mechanism addresses the limited receptive field of convolutional neural networks. Depending on how many Swin Transformer Blocks are stacked and how many heads the attention mechanism uses, Swin Transformer is available in four models with different parameter configurations. In this paper, Swin Transformer Tiny is used as the feature extraction network; its network architecture, parameter configuration and feature map shape changes are shown in Figure 4.
Swin Transformer Tiny in Figure 4 is explained here for a color input image of size 640 × 640. Firstly, Patch Partition divides the input image into non-overlapping patches of 4 × 4 pixels, giving a 160 × 160 grid of patches. Each patch contains 16 pixels with 3 channels each, and flattening these along the channel dimension yields a feature map of 160 × 160 × 48. Secondly, the network stacks the Stage1, Stage2, Stage3 and Stage4 modules in cascade. The Linear Embedding architecture in Stage1 only adjusts the number of channels without changing the size of the feature map, whereas the Patch Merging architecture in Stage2 through Stage4 changes both the size of the feature map and the number of channels. The Swin Transformer Block module, however, does not alter the shape of the feature map between its input and output. The module is mainly composed of the window attention mechanism, the sliding window attention mechanism, layer normalization and a multilayer perceptron. Blocks must be used in pairs: the first uses the window attention mechanism and the second uses the sliding window attention mechanism. The calculation process of the window attention mechanism is shown in Figure 5, where the data correspond mainly to the Swin Transformer Block module in Stage3. The window attention mechanism divides the input feature map according to the specified window size and processes the windows in parallel. The sliding window mechanism enables feature information to interact between different windows, which not only rapidly enlarges the receptive field but also lets the network attend to local and global feature information at the same time.
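The shape bookkeeping described above can be checked with a short NumPy sketch. It covers Patch Partition, the per-stage feature map shapes, window partitioning, and the cyclic shift commonly used to implement the sliding window. Two values are assumptions rather than figures taken from this text: the embedding width of 96 (the standard Swin Transformer Tiny setting) and a window size of 8, chosen here only because it divides the 40 × 40 Stage3 map evenly. All function names are hypothetical.

```python
import numpy as np

def patch_partition(image, patch=4):
    """Split an (H, W, C) image into patches and flatten each onto channels."""
    h, w, c = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // patch, w // patch, patch * patch * c)

def stage_shapes(size=640, patch=4, embed=96, stages=4):
    """Feature map shape after each stage; Patch Merging halves the side
    length and doubles the channels (embed=96 is an assumed value)."""
    side = size // patch
    return [(side // 2 ** i, side // 2 ** i, embed * 2 ** i)
            for i in range(stages)]

def window_partition(feat, window=8):
    """Split (H, W, C) into non-overlapping windows: (num_windows, window*window, C)."""
    h, w, c = feat.shape
    x = feat.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, c)

def cyclic_shift(feat, window=8):
    """Roll the map by half a window so the next partition straddles
    the previous window borders."""
    s = window // 2
    return np.roll(feat, shift=(-s, -s), axis=(0, 1))
```

With these assumptions, a 640 × 640 × 3 input becomes 160 × 160 × 48 after Patch Partition, and the stages produce 160 × 160 × 96, 80 × 80 × 192, 40 × 40 × 384 and 20 × 20 × 768. Attention is computed independently inside each window, which is why the windows can be processed in parallel; applying `cyclic_shift` before `window_partition` in the second block of each pair lets information cross the window borders of the first block.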

C. IMPROVED FEATURE DETECTION NETWORK
The original network model attends indiscriminately to the feature information in all channels of the feature map, which is unreasonable and ineffective, because different channels contribute differently to network detection. Thus, considering the importance of Transfer Learning for network training, and in order not to destroy the architecture of the feature extraction network, our model adds SE Block [28] to the interaction part between the feature extraction network and the feature detection network. The main purpose is to learn a set of weights that represent the importance of the feature information in each channel of the feature map, and to reorder the channels according to that importance. The network can then focus on the more important channel information and ignore part of the less significant channel information. The overall architecture of SE Block is shown in Figure 6. Its main process is as follows. Firstly, given an input feature map of shape (X, Y, C), the information in each channel is converted into a single value according to formula 4, and these values are combined into a vector of 1 × 1 × C. Secondly, dimension reduction and dimension increase are carried out through two fully connected layers, and the Sigmoid function constrains each output value of the vector to lie between 0 and 1; each value represents the importance of the feature information in the corresponding channel of the original feature map. Finally, the vector is multiplied with the original feature map, the order of the channels in the feature map is readjusted according to their importance, and a new feature map is obtained.
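The three steps of SE Block can be sketched as follows. Formula 4 is not reproduced in this text; the sketch assumes it is global average pooling over the two spatial dimensions, as in the standard SE design. The ReLU between the two fully connected layers and the parameter names `w1`, `b1`, `w2`, `b2` are likewise assumptions. Note that in this sketch the channels are rescaled by their learned importance rather than physically permuted, which is how SE blocks are commonly implemented.

```python
import numpy as np

def se_block(feat, w1, b1, w2, b2):
    """Squeeze-and-excitation over a feature map of shape (X, Y, C).

    w1: (C, C // r) reduction weights, w2: (C // r, C) expansion weights.
    Returns the rescaled feature map and the per-channel weights.
    """
    squeeze = feat.mean(axis=(0, 1))                    # (C,), assumed formula 4
    hidden = np.maximum(0.0, squeeze @ w1 + b1)         # reduce, ReLU (assumed)
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # expand, Sigmoid in (0, 1)
    return feat * scale, scale                          # broadcast over X and Y
```

A channel whose weight in `scale` is close to 1 passes through almost unchanged, while a channel with a weight close to 0 is suppressed, which gives the following detection layers the emphasis on important channels described above.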

B. EXPERIMENTAL DATA
The data set is the PCB defect data set published by the Intelligent Robotics Open Laboratory of Peking University. The defect types are short, open circuit, spur, spurious copper, mouse bite and missing hole. To prevent network overfitting, we expanded the original 693 images by random rotation, random cropping, brightness adjustment, noise addition, etc., giving a final data set of 10,668 images. The distribution of the number of each defect type is shown in Table 2, and the six defects are shown in Figure 7.
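Two of the augmentations listed above, brightness adjustment and noise addition, leave the defect bounding boxes unchanged and are therefore simple to sketch. The function names, the multiplicative brightness model and the Gaussian noise model are assumptions for illustration; the parameters actually used are not stated here. Random rotation and random cropping would additionally require transforming the box coordinates and are omitted.

```python
import numpy as np

def adjust_brightness(image, factor):
    """Scale pixel intensities of a uint8 image and clip to [0, 255]."""
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```

A typical call would be `add_gaussian_noise(adjust_brightness(img, 1.2), 8.0, np.random.default_rng(0))`, where the factor 1.2 and sigma 8.0 are arbitrary example values.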

C. NETWORK TRAINING
All the experiments in this paper adjusted and set the parameters starting from pre-trained weights. The network was trained for 300 epochs: the first 50 epochs with the network frozen and a batch size of 8, and the last 250 epochs with the network unfrozen and a batch size of 4. The data set was divided as (training set + validation set) : test set = 9:1, with the training set and validation set again split 9:1, giving a final training set of 8640 images, a validation set of 961 images and a test set of 1067 images.
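The reported split sizes follow from applying the 9:1 ratio twice to the 10,668 images. The sketch below reproduces them under the assumption that the held-out share is rounded up at each step; the rounding rule is not stated in the text, but it is the one that matches the reported counts. The function name `split_counts` is hypothetical.

```python
import math

def split_counts(total, holdout=0.1):
    """Return (train, validation, test) sizes for a nested 9:1 split."""
    test = math.ceil(total * holdout)            # 10% of all images
    val = math.ceil((total - test) * holdout)    # 10% of the remainder
    return total - test - val, val, test
```

`split_counts(10668)` returns `(8640, 961, 1067)`, matching the training, validation and test set sizes given above.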

D. EVALUATION INDEX
In this paper, the mean average precision (MAP), detection speed (FPS) and model size (MB) are the main indexes used to evaluate the model. P and R denote precision and recall, respectively. P can be understood as the ability of the network not to mistake the background for the target, and R as the ability of the network not to mistake the target for the background.
Here, TP, FP and FN are explained with reference to the data set in this paper: TP means that an actual PCB defect is predicted by the model as a PCB defect; FP means that something that is not a PCB defect is predicted as a PCB defect; FN means that an actual PCB defect is not predicted as a PCB defect.
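In terms of these counts, the usual definitions are P = TP / (TP + FP) and R = TP / (TP + FN); AP is the area under the precision-recall curve of one class, and MAP is the mean of the AP values over all classes. The sketch below implements these definitions. It uses all-point interpolation of the PR curve, which is one common convention; the interpolation rule actually used in the experiments is not stated, so this choice is an assumption, as are the function names.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the PR curve; recalls must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Replace each precision by the maximum precision at any higher recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """MAP as the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```

For a class with 8 true positives, 2 false positives and 2 false negatives, `precision_recall(8, 2, 2)` gives a precision and recall of 0.8 each. For this paper, `mean_average_precision` would be applied to the six per-class AP values.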

V. RESULTS AND ANALYSIS
A. ABLATION EXPERIMENTS
Ablation experiments were designed to verify the three improvements described above one by one. There are five groups of experiments: experiment 1 is the original YOLOv5 model; experiment 2 modifies the clustering algorithm of the original model; experiment 3 replaces the feature extraction network of the original model; experiment 4 adds the attention mechanism architecture to the original model; and experiment 5 is the model proposed in this paper. For the models used in the five groups of experiments, the PR curves of the six types of defects are shown in Figure 8, and the MAP and per-class AP are shown in Table 3.
In Figure 8, the area enclosed by the PR curve and the coordinate axes represents the detection accuracy of each defect category.
For the six defect categories, the area enclosed by the PR curves of the proposed model, shown in Figure 8(e), is visibly larger than that of experiment 1, shown in Figure 8(a).
The results in Table 3 are analyzed as follows. In experiment 1, the MAP over the six types of defects reached 90.92%, so the overall accuracy was reasonably high. However, the AP values show that the original model detected three types of defects poorly (mouse bite, spurious copper and spur, whose AP was below the MAP), while its detection performance for missing hole was extremely good. In experiment 2, only the clustering algorithm of the original model was improved. Table 3 shows that the MAP increased by 0.62% and that the detection accuracy of the six types of defects generally improved to some extent. In experiment 3, Swin Transformer was used as the image feature extraction network. Table 3 shows that the MAP improved by 5.33%, a clear gain. Compared with experiment 1, the AP of the six types of defects also improved significantly; in particular, the detection accuracy for spur, spurious copper and mouse bite improved markedly, and the detection accuracy differed little across the six types of defects. In experiment 4, the attention mechanism module was added to the interaction between the image feature extraction network and the feature detection network. Table 3 shows that the MAP improved by 0.60% and that the detection accuracy of most of the six types of defects improved to some extent. In experiment 5, the model proposed in this paper, the MAP improved by 6.12% to 97.04%. Among the five groups of experiments, the proposed model achieved the best MAP, and it also maintained good detection accuracy for all six types of defects.
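The MAP values implied by this analysis can be recovered from the baseline and the stated gains. The sketch below assumes the gains are absolute percentage points added to the experiment 1 baseline, which is consistent with the stated result for experiment 5 (90.92 + 6.12 = 97.04). The values for experiments 2 to 4 are derived here rather than quoted from Table 3, and the names are hypothetical.

```python
BASELINE_MAP = 90.92                              # experiment 1, from the text
GAINS = {2: 0.62, 3: 5.33, 4: 0.60, 5: 6.12}      # stated MAP gains, in points

def implied_map(baseline, gains):
    """MAP per experiment, assuming each gain is added to the baseline."""
    return {exp: round(baseline + gain, 2) for exp, gain in gains.items()}
```

This gives 91.54, 96.25, 91.52 and 97.04 for experiments 2 to 5. The three individual gains sum to 6.55 points, slightly more than the combined gain of 6.12 points, which suggests that the improvements partly overlap rather than add up independently.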

B. COMPARISON OF ALGORITHMS
The network model proposed in this paper is compared with other current mainstream object detection models; the results are shown in Table 4. Compared with the Faster R-CNN network, the proposed model has advantages in model volume and detection speed, and its detection accuracy is higher by 23.90%. Compared with the SSD network, the detection accuracy of the proposed model is higher by 15.51%, even though SSD is slightly superior in model volume and inference speed. Compared with YOLOv3 and YOLOv4 from the YOLO series, the proposed model is relatively small while differing little in detection speed, and its detection accuracy is higher by 10.70% and 7.83%, respectively. Compared with YOLOv5, detection accuracy is higher by 6.12%, at the cost of some detection speed and model size. The model achieves 97.04% detection accuracy and outperforms the current mainstream target detection algorithms. Finally, the model is applied to PCB defect detection; the detection results for six common defects are shown in Figure 9.
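The accuracy of each compared model can be back-calculated from the proposed model's 97.04% and the stated improvements, again assuming the improvements are absolute percentage points. These figures are derived here and are not quoted from Table 4; the names are hypothetical.

```python
PROPOSED_ACCURACY = 97.04                 # proposed model, from the text
IMPROVEMENTS = {                          # stated accuracy gains, in points
    "Faster R-CNN": 23.90,
    "SSD": 15.51,
    "YOLOv3": 10.70,
    "YOLOv4": 7.83,
    "YOLOv5": 6.12,
}

def implied_baselines(proposed, improvements):
    """Accuracy of each compared model implied by the stated gain."""
    return {name: round(proposed - gain, 2) for name, gain in improvements.items()}
```

Under this assumption the implied accuracies are 73.14 for Faster R-CNN, 81.53 for SSD, 86.34 for YOLOv3, 89.21 for YOLOv4 and 90.92 for YOLOv5. The YOLOv5 value agrees with the 90.92% reported for experiment 1 in the ablation study, which supports reading the gains as percentage points.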

VI. CONCLUSION
In contrast to the target detection network models currently in common use, this paper proposes a novel hybrid network model for detecting PCB defects. First, the problems of the existing clustering algorithm are fixed in order to produce anchor boxes better suited to the PCB defect data set used in this paper. Next, an attention-based network is used to extract image features, and the interaction between the image feature extraction network and the feature detection network is enhanced by adding the attention mechanism module. In summary, the experiments confirm that the proposed network model achieves the best detection accuracy among the compared methods on the public PCB data set and outperforms the current mainstream techniques, so it can essentially meet the demands of industrial applications.
WEI CHEN received the Ph.D. degree. He is currently an Associate Professor and a Master's Tutor. His research interests include computer vision, artificial intelligence, image processing, and optoelectric detection.
ZHONGTIAN HUANG is currently pursuing the M.S. degree in electronic information with the Xi'an University of Science and Technology, Xi'an, China. His research interests include computer vision, artificial intelligence, image processing, and optoelectric detection.
QIAN MU is currently pursuing the M.S. degree in electronic information with the Xi'an University of Science and Technology, Xi'an, China. Her research interests include computer vision, artificial intelligence, image processing, and optoelectric detection.
YI SUN received the Ph.D. degree. He is currently an Associate Professor and a Master Tutor. His current research interests include blockchain, game theory, swarm robotics, and distributed systems. VOLUME 10, 2022