YOLO-ESCA: A High-Performance Safety Helmet Standard Wearing Behavior Detection Model Based on Improved YOLOv5

To address the problem of workers wearing helmets incorrectly, this study proposes a standard helmet-wearing detection model, YOLO-ESCA, based on an improved YOLOv5n. The model can monitor workers' helmet wearing in real time via UAVs and other means and automatically save video stream detection results. The model is trained on a self-built dataset containing 4400 images. To address the shortcomings of the original YOLOv5, the efficient intersection over union loss function (EIOU-loss), Soft-NMS non-maximum suppression, and the convolutional block attention module (CBAM) are employed, and a small target detection layer (ADL) is added to improve model performance. The experimental results show that the mAP@0.5 of the improved model reaches 94.7%, the FPS reaches 65.3, the model size is only 4.47 MB, and the number of detections on the self-built dataset and the SHWD dataset is 41.7% and 73% greater, respectively, than that of the original model.


I. INTRODUCTION
Safety accidents frequently occur in the construction industry due to factors such as labor intensiveness, the intersection of multiple processes, and complex operating environments. Safety helmets, one of the "three treasures" of construction, can prevent most of the injuries that occur during the construction process. Moreover, wearing a helmet can prevent fatal injuries. Nevertheless, in some accidents, such as falls from height, improperly worn helmets can cause secondary injuries, which eventually lead to tragedy. According to the national standard "Head Protection: Helmets" (GB2811-2019), a safety helmet should be adjusted to the wearer's head circumference by means of the headband or chin strap to ensure that it is worn firmly and does not accidentally shift or slip off. When an accident does occur, even the mildest brain injury may require physical and psychological treatment for memory problems, behavioral changes, depression, and personality changes. Therefore, workers must wear safety helmets correctly to guard against potential dangers. However, due to low safety awareness, construction workers often do not comply with national standards. Traditional manual management is inefficient, consumes resources, and hinders effective accident prevention. Therefore, automatic detection of helmet-wearing situations is critical.
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
Target recognition based on deep learning has recently been a research hotspot in computer vision. Unlike traditional methods that require manual design and feature extraction, deep learning can improve model accuracy by automatically learning features [1], [2]. Deep learning detection algorithms are divided into two-stage detection algorithms based on candidate regions and end-to-end single-stage detection algorithms [3]. Two-stage detection algorithms are represented by R-CNN [4], a deep learning algorithm for target detection proposed by R. Girshick et al., and by the higher-performance Fast R-CNN [5] and Faster R-CNN [6] algorithms. Two-stage detection algorithms have high detection accuracy but long detection times, so they are unsuitable for real-time detection. Single-stage object detection algorithms, represented by SSD [7] and YOLO [8], [9], [10], [11], have fast detection speeds and high accuracy. Compared with other algorithms, the YOLO series has the advantages of a simpler network structure, stronger generalization ability, and better performance. Therefore, many scholars have proposed applying the YOLO algorithm to real-time construction scene detection. Chen et al. proposed a YOLOv5n-based helmet and reflective undershirt detection algorithm; in their method, they used the efficient intersection over union loss function (EIOU-loss), mixed a convolutional block attention module (CBAM) and a CA attention mechanism into the network structure, and subsequently added a detection layer to improve the model performance [23].
The above research is very important, but the following problems remain: (1) Some algorithms have high detection accuracy, but their number of parameters and computational cost still place a heavy burden on the computing equipment. (2) Some detection models have low computational cost but also low detection accuracy. (3) Helmet detection has been widely investigated, and some studies have been conducted on reflective undershirts. However, no scholars have explored whether helmets are worn correctly. (4) All of the above studies optimize and improve the detection performance of algorithms without considering the needs of practical applications. Notably, the YOLO series algorithms have been updated to the eighth version (YOLOv8). Nevertheless, in recent years, the vast majority of scholars have based their research on the YOLOv5 algorithm, because YOLOv5 has a simpler network structure and provides base models with different network depths, so that scholars can choose the base model that is most suitable for their research. More importantly, YOLOv5 has a smaller model size and a higher FPS.
Therefore, this paper proposes a standard helmet-wearing detection model based on an improved YOLOv5. To address the high computational cost of the above algorithms, YOLOv5n, which is the least computationally intensive, is used as the base model for training. To address low detection accuracy, and considering the small targets, high overlap rate, and frequent occlusion at construction sites, the algorithm replaces the CIOU-loss with the EIOU-loss [24] to improve the model performance. Because the original YOLOv5n has a high missed-detection rate for dense targets, Soft-NMS [25] is employed instead of NMS to improve the recognition of occluded targets. The CBAM [26] attention module is added to increase the attention given to target features, which addresses the problem that the chin strap of the helmet is small and easily confused. A small target detection layer (ADL) is added to improve the detection performance for small targets at long range. The contributions of this study are summarized as follows:
1. The first standard helmet-wearing image dataset was established; it includes 4400 images taken in different environments, such as dense targets, long-distance targets, dense long-distance targets, and insufficient illumination.
2. For the first time, we treat whether a helmet is worn in the standard way as a target detection problem and apply it to standard helmet-wearing detection at construction sites based on the YOLOv5 algorithm, filling the research gap on standard helmet-wearing detection.
3. The experiments show that ADL reduces the model's accuracy. Nevertheless, in combination with the CBAM, the model can meet the real-time detection accuracy requirements of construction sites and significantly reduce the missed detection rate.
4. From the perspective of improving the practicability of detection results, this study developed an automatic preservation function for video stream detection results, which can be utilized as an important basis and support for decision-making and the implementation of construction site safety management.

II. METHOD

1) YOLOV5
YOLOv5 is one of the most advanced single-stage target detection algorithms; it was released on June 10, 2020, and is still being updated. There are currently eight versions. This paper selects the latest version, 6.1. YOLOv5 officially provides five network models; in order of increasing network depth, they are YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. This paper uses the YOLOv5n model. The YOLOv5n network model has the smallest volume and the fastest detection speed, and its detection accuracy can also meet actual needs. The model can be deployed on low-performance UAVs and has extremely high versatility.
The YOLOv5n network model is divided into four parts: the input, backbone, neck, and output. The network structure of YOLOv5n is shown in Fig. 1.

2) EIOU-LOSS
The IOU (intersection-over-union) loss represents the difference between the predicted values and the true values of the target position and can be used to correct the position coordinates of the prediction box. However, when the prediction box and the real box do not intersect, the IOU reflects neither the distance between the two boxes nor the size of the overlap. Therefore, we use the EIOU-loss to improve the accuracy of the prediction box.
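The symbol definitions that follow refer to the IOU and the CIOU-loss. As a reference, their standard forms are sketched below; the split into four expressions is an assumption reconstructed from those definitions, not the original typeset equations.

```latex
\begin{align}
IOU &= \frac{|A \cap B|}{|A \cup B|} \\
L_{CIOU} &= 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \\
v &= \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2} \\
\alpha &= \frac{v}{(1 - IOU) + v}
\end{align}
```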
where A represents the prediction box, B represents the target box, ρ(·) represents the Euclidean distance between the centroids of the prediction box and the target box, c is the length of the diagonal of the smallest box that encloses the two bounding boxes, and $b$ and $b^{gt}$ denote the centroids of A and B, respectively. α is the weight function; v measures the similarity of the aspect ratios of the prediction box and the target box; $w$ and $w^{gt}$ are the widths of A and B, respectively, and $h$ and $h^{gt}$ are the heights of A and B, respectively. The CIOU-loss does not account for the true differences between the widths and between the heights of the two boxes, which sometimes hinders the convergence of the model. In response to this problem, the EIOU-loss is used instead of the CIOU-loss. The specific formula is as follows.
$$L_{EIOU} = L_{IOU} + L_{dis} + L_{asp} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{(w_{c})^{2}} + \frac{\rho^{2}(h, h^{gt})}{(h_{c})^{2}}$$

In the illustration in Fig. 2, yellow is the prediction box, blue is the target box, green is the smallest enclosing box, and $w_c$ and $h_c$ are the width and height, respectively, of the smallest enclosing box that covers both boxes. The EIOU-loss is divided into three parts: the IOU loss $L_{IOU}$, the distance loss $L_{dis}$, and the aspect loss $L_{asp}$. In this way, the difference between the width and height of the target box and those of the prediction box can be reduced while retaining the advantages of CIOU, thereby obtaining faster convergence and better localization results.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
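The three-part structure of the EIOU-loss can be illustrated with a minimal sketch for a single pair of boxes. The function name `eiou_loss` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example, and both boxes are assumed to have positive area; this is not the training code used in the experiments.

```python
def eiou_loss(pred, target):
    """EIOU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = tx2 - tx1, ty2 - ty1

    # Intersection over union
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    iou = inter / union

    # Smallest enclosing box: width w_c, height h_c, squared diagonal c^2
    w_c = max(px2, tx2) - min(px1, tx1)
    h_c = max(py2, ty2) - min(py1, ty1)
    c2 = w_c ** 2 + h_c ** 2

    # Squared distance between the two box centers
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    l_iou = 1.0 - iou                      # overlap term
    l_dis = rho2 / c2                      # center-distance term
    l_asp = (w - w_gt) ** 2 / w_c ** 2 \
          + (h - h_gt) ** 2 / h_c ** 2     # width and height terms
    return l_iou + l_dis + l_asp


# Two disjoint boxes: the IOU is 0 in both cases, but the loss still grows
# with the distance between the boxes.
near = eiou_loss((0, 0, 1, 1), (2, 0, 3, 1))
far = eiou_loss((0, 0, 1, 1), (5, 0, 6, 1))
```

For the disjoint pairs above, the IOU term alone is identical, while the distance term separates the nearer box from the farther one, which is the property motivating the replacement of the plain IOU loss.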

3) SOFT-NMS
Soft-NMS is used to replace NMS to increase the accuracy and recall of occluded target detection. When removing redundant detection frames, the original NMS decides whether to remove a frame based on its IOU value: the frame is removed when the IOU exceeds the set threshold. When the targets are dense, mutual occlusion leads to a very large IOU value, and NMS incorrectly removes the detection frame, causing the target to be missed. In helmet-wearing scenes, targets often overlap and occlude each other, so Soft-NMS is used to reduce missed detections.
In standard NMS, the score of any detection frame whose IOU exceeds the threshold is set directly to 0, as shown in (9). In contrast, Soft-NMS decays the score with a penalty. There are two types of penalties. The first penalty function is shown in (10), but this function is not continuous, which leads to an abrupt change in the ranking of detections. A continuous penalty function should impose no penalty when there is no overlap and a very high penalty when the overlap is high. Moreover, the penalty should increase gradually when the overlap is low. Thus, a second, Gaussian penalty function is proposed, as shown in (11), so that Soft-NMS does not need a hard threshold.
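For reference, the three rescoring rules referred to as (9), (10), and (11) are sketched below in their standard Soft-NMS form. The parameter σ, which controls the width of the Gaussian decay, is an assumption taken from that standard formulation and is not defined in the text.

```latex
\begin{align}
s_i &= \begin{cases} s_i, & IOU(M, x_i) < N_t \\ 0, & IOU(M, x_i) \ge N_t \end{cases} \tag{9} \\
s_i &= \begin{cases} s_i, & IOU(M, x_i) < N_t \\ s_i \left( 1 - IOU(M, x_i) \right), & IOU(M, x_i) \ge N_t \end{cases} \tag{10} \\
s_i &= s_i \, e^{-\frac{IOU(M, x_i)^{2}}{\sigma}} \tag{11}
\end{align}
```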
where $s_i$ denotes the classification score, M indicates the prediction box with the highest prediction score, $x_i$ is the prediction box under consideration for removal, and $N_t$ denotes the threshold value of NMS. Soft-NMS is a greedy algorithm and does not find the globally optimal rescoring of the detection frames. Soft-NMS is a generalized non-maximum suppression, and conventional NMS is a particular case of Soft-NMS.
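The greedy rescoring procedure with the Gaussian penalty can be sketched as follows. The function names, the default `sigma=0.5`, and the small `score_thresh` used to discard fully decayed boxes are illustrative assumptions rather than the settings used in the experiments.

```python
import math


def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS. Returns (index, rescored score) pairs in selection order."""
    remaining = [(s, i) for i, s in enumerate(scores)]
    kept = []
    while remaining:
        # Greedy step: select the box M with the highest current score
        best = max(range(len(remaining)), key=lambda k: remaining[k][0])
        m_score, m_idx = remaining.pop(best)
        kept.append((m_idx, m_score))
        # Decay the scores of the other boxes instead of removing them outright
        rescored = []
        for s, i in remaining:
            s *= math.exp(-iou(boxes[m_idx], boxes[i]) ** 2 / sigma)
            if s >= score_thresh:
                rescored.append((s, i))
        remaining = rescored
    return kept


# Boxes 0 and 1 overlap heavily (IOU of about 0.82); box 2 is far away.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

With a hard NMS threshold of 0.5, box 1 would be deleted; here it survives with a reduced score, so an occluded worker next to a detected one is ranked lower rather than lost.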

4) CBAM
Because the chin strap of a helmet is a small target that covers few pixels, it is easily confused or missed. This paper therefore adds the CBAM before the SPPF module in the backbone.
The CBAM consists of two submodules: the channel attention module (CAM) and the spatial attention module (SAM). As shown in Fig. 3(a), a feature map is input, and attention maps are inferred sequentially along two dimensions, channel and space. Each attention map is then multiplied with the feature map for adaptive feature refinement, and the refined feature map is output. The structure of the CAM is shown in Fig. 3(b). The input feature map F is subjected to global max pooling and global average pooling over the spatial dimensions. The two resulting feature vectors are then fed into a shared two-layer fully connected network. The two outputs are summed and passed through the sigmoid activation function to obtain the channel attention feature $M_c$. The structure of the SAM is shown in Fig. 3(c). $M_c$ and the input feature map F are multiplied elementwise to obtain the input feature F' of the SAM; F' is max-pooled and average-pooled along the channel dimension, and the two resulting maps are concatenated and reduced to a single channel by a convolution operation. The spatial attention feature $M_s$ is then generated by the sigmoid activation function. The CBAM is a lightweight module that needs to be added only where required, without additional training, and its impact on the detection time is negligible. The structure of the CBAM is shown in Fig. 3. The improved network model is shown in Fig. 4.
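The data flow through the CAM and the SAM can be traced with a small dependency-free sketch. The weights are random, the 3 × 3 kernel and reduction ratio of 2 are chosen only to keep the example small, and all names are illustrative assumptions; the actual module in the network is a trained layer.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def channel_attention(F, W1, W2):
    """M_c: shared two-layer network applied to avg- and max-pooled descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(map(max, ch)) for ch in F]

    def mlp(v):
        hidden = [max(0.0, x) for x in matvec(W1, v)]  # ReLU
        return matvec(W2, hidden)

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, kernel):
    """M_s: k x k convolution over the channel-wise avg and max maps, then sigmoid."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    avg = [[sum(F[c][y][x] for c in range(C)) / C for x in range(W)] for y in range(H)]
    mx = [[max(F[c][y][x] for c in range(C)) for x in range(W)] for y in range(H)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for ch, fmap in enumerate((avg, mx)):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < H and 0 <= xx < W:  # zero padding
                            acc += kernel[ch][dy][dx] * fmap[yy][xx]
            out[y][x] = sigmoid(acc)
    return out


def cbam(F, W1, W2, kernel):
    C, H, W = len(F), len(F[0]), len(F[0][0])
    Mc = channel_attention(F, W1, W2)
    # F' = M_c * F (one weight per channel)
    F1 = [[[Mc[c] * F[c][y][x] for x in range(W)] for y in range(H)] for c in range(C)]
    Ms = spatial_attention(F1, kernel)
    # Refined output = M_s * F' (one weight per spatial position)
    return [[[Ms[y][x] * F1[c][y][x] for x in range(W)] for y in range(H)] for c in range(C)]


random.seed(0)
C, H, W, r, k = 4, 5, 5, 2, 3
F = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
W1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
W2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
kernel = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(k)] for _ in range(2)]
refined = cbam(F, W1, W2, kernel)
```

The output has the same shape as the input, which is why the module can be placed in front of the SPPF module without changing the surrounding layers.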

5) SMALL TARGET DETECTION LAYER
The original YOLOv5n model has only three detection layers, as shown in Fig. 1; these layers are used to detect large, medium, and small targets, and the sizes of the corresponding detection layer feature maps are 20 × 20, 40 × 40, and 80 × 80, respectively [27]. Because the chin strap of a helmet is small, it is easily occluded, and workers are scattered across the construction site. Moreover, the helmets in the detection images also differ in size, especially in images taken by drones. Therefore, based on the YOLOv5n network, another target detection layer with a feature map size of 160 × 160 is added to improve the accuracy under these complex conditions. This layer has a smaller receptive field and richer position information. Its feature map can better utilize the multilevel feature information of dense objects, thus improving the detection performance of the model in long-range scenes. The improved YOLOv5n network structure is shown in Fig. 5.
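The relationship between the detection layers and their feature map sizes can be checked with a short calculation. The 640 × 640 input resolution and the strides of 32, 16, 8, and 4 are assumptions; they are consistent with the 20, 40, 80, and 160 grid sizes given above but are not stated in the text.

```python
def grid_sizes(input_size=640, strides=(32, 16, 8, 4)):
    """Side length of each detection feature map for a square input."""
    return {stride: input_size // stride for stride in strides}


sizes = grid_sizes()
# Each cell of the added stride-4 layer covers a 4 x 4 pixel patch of the input,
# compared with 8 x 8 for the finest layer of the original model.
cells_added_layer = sizes[4] ** 2
```

Under these assumptions the added layer contributes 25600 grid cells, four times as many as the 80 × 80 layer, which is consistent with the increase in computation and the drop in FPS reported for ADL.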

6) AUTOMATIC STORAGE OF VIDEO STREAM DETECTION RESULTS
Construction sites are generally equipped with video monitoring systems (e.g., CCTV). Nevertheless, the feeds from multiple locations are displayed on the same screen in the monitoring room, and manual monitoring is time-consuming, laborious, and inefficient. Therefore, using the existing video monitoring equipment, the model proposed in this paper performs real-time detection of helmet wearing at construction sites, and the detection results are automatically extracted and saved to terminals, which plays an important role in making safety measures at construction sites more targeted. When safety managers penalize construction personnel for on-site violations, detection results that contain information about the work site environment are needed as evidence. The original YOLOv5 can save only the cropped image of the target frame, which loses important information, such as the work site environment, and therefore cannot serve as evidence. For example, if a worker briefly holds the helmet in his or her hand to adjust it because of a problem such as a loose hatband, this behavior is acceptable, but the model will save only a crop of the worker's head, which cannot show why the worker was not wearing the helmet. It would be unreasonable to penalize the worker for this behavior. YOLOv5 can also save the output video, but manually checking the surveillance video is time-consuming, laborious, and error-prone, even if the video contains the detection results. Therefore, this study adds a function for saving video stream detection results: when a target is detected, only the image of the current frame is saved to a terminal such as a monitoring device. These images are made available only to safety managers, so there will be no legal implications. Fig. 6 shows the detection results on a 1080p construction site safety education video.
VOLUME 12, 2024
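The saving logic described above can be outlined in a few lines. The names `save_detected_frames`, `detect`, and `save_frame` are hypothetical placeholders: `detect` stands in for the detection model and `save_frame` for whatever routine writes an image to the terminal. This is a sketch of the idea, not the implementation added to YOLOv5.

```python
def save_detected_frames(frames, detect, save_frame):
    """Save the whole frame (not a crop) whenever the detector reports a target."""
    saved = []
    for index, frame in enumerate(frames):
        detections = detect(frame)
        if detections:  # frames without any target are skipped
            save_frame(index, frame, detections)
            saved.append(index)
    return saved


# Stand-in data: strings play the role of video frames.
frames = ["empty", "worker_without_helmet", "empty", "worker_with_helmet"]
store = {}


def fake_detect(frame):
    return [] if frame == "empty" else [frame]


def fake_save(index, frame, detections):
    store[index] = (frame, detections)


saved = save_detected_frames(frames, fake_detect, fake_save)
```

Because the full frame is stored together with its detections, the surrounding scene remains available when a safety manager reviews a flagged event.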

III. EXPERIMENT AND ANALYSIS

1) DATASETS
Since there is no research on the standard wearing of helmets at home or abroad, there is a lack of open-source datasets. Therefore, in this paper, according to the national standard "Head Protection: Helmets" (GB2811-2019), wearing a helmet with the hat band fastened is recorded as standard wearing, wearing a helmet without the hat band fastened is recorded as nonstandard wearing, and images are collected accordingly. The dataset consists of 4400 pictures, which do not involve privacy or conflict-of-interest issues, and the details are shown in Table 1. The dataset contains images of large targets, small targets, and small and dense targets; some examples are shown in Fig. 7. The LabelImg tool was subsequently used to label the pictures. The label format was YOLO, with three categories: categories 0, 1, and 2 correspond to standard wearing of a helmet (helmet), not wearing a helmet (head), and nonstandard wearing of a helmet (uncertainty), respectively. The labeling process is shown in Fig. 8.
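A YOLO-format label line stores the class index followed by the normalized box center, width, and height. The sketch below converts one such line to pixel coordinates using the three class names given above; the function name and the example line are made up for illustration.

```python
CLASS_NAMES = {0: "helmet", 1: "head", 2: "uncertainty"}


def parse_yolo_label(line, img_w, img_h):
    """Convert 'class xc yc w h' (normalized) to (name, x1, y1, x2, y2) in pixels."""
    cls, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return CLASS_NAMES[int(cls)], xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# A nonstandard-wearing box centered in a 640 x 480 image
label = parse_yolo_label("2 0.5 0.5 0.25 0.5", 640, 480)
```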

2) EXPERIMENTAL ENVIRONMENT AND MODEL TRAINING
The experimental equipment used was a Shinelong M7-E6S3 notebook computer. Parameter selection was based on previous studies, and the parameters for data preprocessing were taken from the default values in the hyp.scratch-low.yaml file. The specific configuration is shown in Table 2, and the training parameters are shown in Table 3.

3) EVALUATION CRITERIA
Precision assesses the accuracy of the predictions, that is, the proportion of detections that are correct.
Recall evaluates the completeness of the detection, that is, the proportion of real targets that are detected.
The single-category average precision (AP) is the average of the precision values obtained over all possible recall values. mAP@0.5 is the mean of the AP over all categories when the IOU threshold is 0.5, where m is the total number of categories.
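These criteria can be made concrete with a small sketch. Precision is TP / (TP + FP), recall is TP / (TP + FN), and AP is computed here as the area under the precision-recall curve after making precision non-increasing, which is one common convention and an assumption for this example; the evaluation code used in the experiments may interpolate differently. A detection counts as a true positive when its IOU with an unmatched ground-truth box is at least 0.5.

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: ground-truth count."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection, by decreasing confidence
    for _, is_tp in detections:
        tp += is_tp
        fp += not is_tp
        points.append((tp / num_gt, tp / (tp + fp)))
    # Use the best precision at this or any higher recall, then sum the steps
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)


# Three detections for one class with two ground-truth boxes
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

In this toy case the first half of the recall range is covered at precision 1 and the second half at precision 2/3, giving an AP of 5/6.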
To verify whether each of the above four improvements can enhance the model performance, we conduct separate comparison experiments for each improvement before the ablation experiment. The results before and after each individual improvement are shown in Table 4, and the default parameters are used for training. The improved network structure is shown in Fig. 9, and the improved detection effect is shown in Figs. 10-13.
In Table 4, we show the single-class performance and average performance of each model. In terms of average performance, after adopting the EIOU loss function, the precision, recall, mAP@0.5 and mAP@0.5:0.95 of the model increased by 1.4%, 0.2%, 0.9% and 1.5%, respectively; the FPS improved by 0.7; the model size decreased by 0.02 MB; and the size of the target box was closer to the real situation than before the improvement. When the CBAM attention module is added in front of the SPPF module, image feature extraction is enhanced, the false detection rate is reduced, and the precision, recall, mAP@0.5, and mAP@0.5:0.95 of the model increase by 1.2%, 0.3%, 0.7%, and 0.5%, respectively; the FPS decreases by 0.7, and the model size increases by 0.11 MB. After ADL is added, the precision, recall, mAP@0.5 and mAP@0.5:0.95 of the model decrease by 10.5%, 4.7%, 3.8% and 7.2%, respectively; the FPS decreases by 8.2, and the model size increases by 0.7 MB. Since only some of the images in the dataset contain small targets, the addition of a small target detection layer reduces the performance of the model. However, as shown in Figure 9, ADL has an acceptable impact on detection accuracy and can effectively detect small targets, significantly reducing missed detections. After Soft-NMS is used, the precision, recall, mAP@0.5 and mAP@0.5:0.95 of the model increase by 1.8%, 0.3%, 1.2% and 4%, respectively; the FPS increases by 1, the model size decreases by 0.02 MB, and the missed detection rate is effectively reduced when the targets are dense. However, as shown in Fig. 10, neither algorithm can detect the leftmost helmet. We believe that the photographer did not focus on the leftmost helmet at the time of the shot, so this portion of the image was somewhat blurred and some features were obscured, causing the algorithm to miss the detection.
In terms of single-class performance, the precision of the uncertainty category is the lowest for all models because the detection targets of the uncertainty and helmet categories are too similar: the model classifies some of the targets as uncertainty early in training, but at the late stage of training, as the performance improves, the model again classifies these targets as helmets, which reduces the uncertainty detection accuracy. The detection performance of the other categories improves after the improvements. Adding a small target detection layer and the CBAM decreases the FPS of the model and increases the model size, although the effect of the CBAM is negligible. However, these two improvements effectively improve the detection performance, so they are necessary.

5) ABLATION EXPERIMENTS
In this study, an ablation experiment was performed to verify the effect of mixing improvements on the model performance.
To ensure the effectiveness of the experiment, 11 ablation experiments were created by arranging and combining the four improvements, of which seven groups included ADL. The experimental data are shown in Table 5. In Table 5, we divide all the models into two categories. The first category comprises the models without the small target detection layer, and the second category comprises the models with the small target detection layer. In each category, the model with the best performance is selected as the representative; the two models are named YOLO-ESC and YOLO-ESCA after the acronyms of the improvement methods. Compared with the original YOLOv5n model, the models from method 1 to YOLO-ESC show that the improvements, combined in any way other than with the addition of a small target detection layer, improve the performance of the model to varying degrees. The precision and recall increase by 0.13% and 0.1%, respectively, on average. mAP@0.5 has an average increase of 0.45%, mAP@0.5:0.95 has an average increase of 0.2%, FPS has an average increase of 1.9, and the model size has an average increase of 0.04 MB. Moreover, the YOLO-ESC model, which combines all three improvements, achieves the best performance, which indicates that these three improvements not only have no negative effects but also may complement each other. Thus, better results are obtained. As demonstrated by the models from method 4 to YOLO-ESCA, when a small target detection layer is added without a CBAM attention module, the model precision decreases by 11.3% on average, the recall decreases by 6.5% on average, the mAP@0.5 decreases by 3.9% on average, and the mAP@0.5:0.95 decreases by 6.7% on average. The FPS decreases by 12.8 on average, and the model size increases by 0.7 MB on average. When the small target detection layer and the CBAM are combined, the precision decreases by 10.53% on average, the recall decreases by 4.78% on average, the mAP@0.5 decreases by 3.83% on average, the mAP@0.5:0.95 decreases by 6.58% on average, the FPS decreases by 14.2 on average, and the model size increases by 0.74 MB on average. Compared with the other improvements, the CBAM attention module significantly reduces the impact of adding a small target detection layer but also increases the complexity of the model. This result is consistent with the conclusion drawn in the previous section. In summary, both the EIOU-loss and Soft-NMS improve the prediction stage of the model; therefore, neither the EIOU-loss nor Soft-NMS reduces the impact of ADL. Although adding a small target detection layer reduces the model performance, especially the precision, the experiments in the previous section have shown its necessity.
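The counts of 11 ablation groups and seven groups containing ADL are consistent with enumerating every combination of at least two of the four improvements. That reading is an assumption (the single-improvement runs being those already reported in Table 4), and it can be checked directly:

```python
from itertools import combinations

improvements = ["EIOU-loss", "Soft-NMS", "CBAM", "ADL"]

# Every combination of two, three, or four improvements
groups = [
    combo
    for size in range(2, len(improvements) + 1)
    for combo in combinations(improvements, size)
]
with_adl = [g for g in groups if "ADL" in g]
without_adl = [g for g in groups if "ADL" not in g]
```

Under this reading, the four groups without ADL end with the three-way combination that corresponds to YOLO-ESC, and the seven groups with ADL end with the four-way combination that corresponds to YOLO-ESCA.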

6) COMPARISON OF METHODS
To evaluate its performance, the proposed method is compared with existing mainstream lightweight target detection algorithms, and the results are shown in Table 6. As shown in Table 6, although the precision, recall, mAP@0.5 and mAP@0.5:0.95 of the YOLOv3, YOLOv5m, YOLOv5l and YOLOv5x models are better than those of the YOLO-ESC and YOLO-ESCA models, their FPS and model size are unacceptable, which hinders deployment. The precision, recall, mAP@0.5 and mAP@0.5:0.95 of YOLOv3-tiny and YOLOv5s are not much different from those of YOLO-ESC; however, YOLO-ESC has a higher FPS and a smaller model size, so YOLO-ESC is easier to deploy. Therefore, the remaining question is which of the two models, YOLO-ESC or YOLO-ESCA, is more suitable for practical applications.

7) STATISTICAL SIGNIFICANCE TEST
In this section, a statistical significance test (t test) is conducted to assess the generalizability of the proposed model. A t test is a statistical hypothesis test that determines whether there is a significant difference between the means of two groups or samples by calculating the t statistic and the p value. The t statistic measures the difference between the mean values of two groups relative to the within-group variance and indicates how much the means differ in units of standard error. The p value is the probability of obtaining a t statistic at least as extreme as the observed one if the null hypothesis holds. The null hypothesis states that the means of the two groups are equal. A significance level of 0.05 was used as the cutoff for determining statistical significance. A p value less than 0.05 indicates that the null hypothesis is rejected and that a statistically significant difference exists between the means of the two groups.
In this work, we choose mAP@0.5 for comparison and apply the paired t test to evaluate the performance of YOLO-ESCA against other models. The smaller the t statistic is, the better the performance of the model. The p value was obtained by comparing YOLO-ESCA with each of the other models. As shown in Table 7, the p value of our model relative to each of the other models is less than 0.05, indicating a significant difference from the benchmark models. Therefore, YOLO-ESCA is a considerable improvement over YOLO-ESC, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x.
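The paired t statistic can be computed from the per-run differences between two models, as sketched below. The mAP@0.5 values are invented for illustration and are not the values behind Table 7. The sketch compares the statistic with the two-sided critical value for four degrees of freedom at the 0.05 level (2.776) instead of computing a p value; a library routine such as `scipy.stats.ttest_rel` would return the p value directly.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical mAP@0.5 values from five paired runs of two models
model_a = [0.95, 0.94, 0.96, 0.95, 0.94]
model_b = [0.91, 0.92, 0.93, 0.92, 0.90]

t_stat, dof = paired_t_statistic(model_a, model_b)
significant = abs(t_stat) > 2.776  # two-sided critical value, dof = 4, alpha = 0.05
```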

8) DETECTION EXPERIMENT
The comprehensive performance of the model cannot be judged solely by its performance indicators; the effect of the model in actual application is also highly important. To further validate the reasonableness of the improvements, the 4400 images in the self-built dataset were selected as a detection dataset. To validate the generalizability of the model, we also selected all the images from the open-source SHWD helmet-wearing detection dataset, which is employed by most scholars, as another detection dataset. YOLOv5s, YOLOv5n, YOLO-ESC and YOLO-ESCA, which have similar performances, were selected according to the above experiments for the detection comparison experiments, and the specific data are shown in Figs. 14 and 15.
From Fig. 14, we can see that on the self-built dataset, YOLOv5s detects a total of 9415 targets, YOLOv5n detects a total of 8181 targets, and YOLO-ESC detects 9585 targets, which is not much different from the number of detections of YOLOv5s and is a 17.2% improvement over YOLOv5n, whereas YOLO-ESCA detects as many as 11591 targets, a 41.7% improvement over YOLOv5n and a 20.9% improvement over YOLO-ESC. By category, YOLOv5s detected 3701 uncertainty targets, 3063 head targets, and 2651 helmet targets; YOLOv5n detected 3211 uncertainty targets, 2599 head targets, and 2371 helmet targets; YOLO-ESC detected 3837 uncertainty targets, 3084 head targets, and 2664 helmet targets, which is not much different from the number of detections of YOLOv5s and is 19.5%, 18.7%, and 12.4% higher, respectively, than YOLOv5n; and YOLO-ESCA detected 4798 uncertainty targets, 3786 head targets, and 3007 helmet targets, which are 49.4%, 45.7%, and 26.8% higher, respectively, than YOLOv5n and 25%, 22.8%, and 12.9% higher, respectively, than YOLO-ESC.
Fig. 15 shows that on the SHWD dataset, YOLOv5s detects a total of 12750 targets, YOLOv5n detects a total of 10585 targets, and YOLO-ESC detects 13268 targets, which is not much different from the number of detections of YOLOv5s and is a 25.3% improvement over YOLOv5n; moreover, YOLO-ESCA detects a total of 18314 targets, which is a 73% improvement over YOLOv5n and a 38% improvement over YOLO-ESC. By category, YOLOv5s detected 5814 uncertainty targets, 3063 head targets, and 1278 helmet targets; YOLOv5n detected 5338 uncertainty targets, 4291 head targets, and 956 helmet targets; and YOLO-ESC detected 5836 uncertainty targets, 6125 head targets, and 1307 helmet targets, which is not much different from the number of detections of YOLOv5s and is 9.3%, 42.7%, and 36.7% higher, respectively, than YOLOv5n. YOLO-ESCA detected 7152 uncertainty targets, 9700 head targets, and 1462 helmet targets, which are 34%, 126.1%, and 52.9% higher, respectively, than YOLOv5n and 22.5%, 58.4%, and 11.9% higher, respectively, than YOLO-ESC.
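The percentage improvements quoted for the total detection counts follow from the relative difference (new − base) / base. The short check below uses the totals reported above; the helper name `relative_gain` is chosen for this example.

```python
def relative_gain(new, base):
    """Percentage increase of `new` over `base`, rounded to one decimal place."""
    return round(100.0 * (new - base) / base, 1)


self_built = {"YOLOv5n": 8181, "YOLO-ESC": 9585, "YOLO-ESCA": 11591}
shwd = {"YOLOv5n": 10585, "YOLO-ESC": 13268, "YOLO-ESCA": 18314}

gain_self_built = relative_gain(self_built["YOLO-ESCA"], self_built["YOLOv5n"])
gain_shwd = relative_gain(shwd["YOLO-ESCA"], shwd["YOLOv5n"])
```

The two results reproduce the 41.7% and 73% figures for YOLO-ESCA relative to YOLOv5n.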

9) DISCUSSIONS
First, although YOLO-ESCA has good performance, it cannot be denied that our dataset is not large or representative enough, so the model may fail to detect targets in scenarios that are absent from or underrepresented in the training set. Therefore, it is necessary to expand the dataset by collecting more images from various environments at construction sites. Second, to reduce the number of parameters of the model and facilitate its deployment, we selected YOLOv5n, which has the smallest volume, as the base model for training. Although YOLO-ESCA based on YOLOv5n has excellent performance, as shown in Fig. 16, there are still cases of missed detections, which are an unavoidable side effect of the drastic reduction in volume, and the addition of a small target detection layer increases the volume of the model. Therefore, in the future, we will first make the model more lightweight and then improve its detection performance without increasing its volume. Third, we have not applied the model to real construction work, so we do not know its actual performance; however, our model has a very small model size, a high FPS, and low hardware requirements, which makes it highly suitable for deployment on UAVs with low computational power.

IV. CONCLUSION
In this paper, we propose a standard helmet-wearing detection model, YOLO-ESCA; the model can detect not only whether a worker is wearing a helmet but also whether the helmet is worn in the standard way. First, we develop an automatic saving function for video stream detection results to improve the practicality of the model. Second, we improve YOLOv5n by using the EIOU-loss, Soft-NMS non-maximum suppression, a CBAM attention module and a small target detection layer. Although all the performance indices of the model decrease, the detection experiments prove that YOLO-ESCA is better than the original YOLOv5n and the YOLO-ESC model without a small target detection layer in this application. Notably, our model still misses some small targets at long distances. More importantly, our model size is only 4.47 MB with an FPS of up to 65.3, which is conducive to deploying the model. Our ongoing work is focused on developing reliable target detection models. The goal of future work will be to continue to improve this model and other models for use in terminals, such as Raspberry Pi or NVIDIA Jetson Nano devices.
Fang et al. proposed a helmet detection algorithm based on YOLOv2; they added a dense network to the feature extraction network and utilized a lightweight MobileNet network structure to reduce the model complexity and improve the detection speed [12]. Wu et al. replaced the Darknet53 feature extraction network in YOLOv3 with a DenseNet network to improve helmet detection accuracy [13]. Shi et al. added a feature pyramid to YOLOv3 to improve the recognition accuracy of people and helmets [14]. Yang et al. investigated helmet-wearing detection based on YOLOv3 and used a support vector machine (SVM) to classify the detection results [15]. Wang et al. based their helmet detection algorithm on YOLOv5s, introduced the coordinate attention (CA) mechanism into the backbone network and utilized a weighted bidirectional feature pyramid network (BiFPN) structure to improve the model detection accuracy [16]. Alateeq et al. proposed a personal protective equipment (PPE) and heavy equipment detection model based on the YOLOv5s algorithm and incorporated weather conditions into the model, which makes it possible to analyze whether the area around the equipment is dangerous given the prevailing weather [17]. Lo et al. constructed a new PPE dataset, trained three PPE detection models using the YOLOv3, YOLOv4 and YOLOv7 algorithms and summarized the advantages and disadvantages of each algorithm [18]. Zhu et al. proposed a YOLOv5s-based detection model for the electric power industry, trained on a self-constructed dataset, to detect the protective equipment of power staff [19]. Fu et al. applied K-means reclustering to YOLOv5s and added a detection layer to improve the detection accuracy [20]. Zhao et al. used the DenseBlock module instead of the Focus structure in the YOLOv5 backbone network and added the SE-Net attention module to improve the detection performance [21]. Du et al. employed the Swin Transformer as a feature extractor for the YOLOv5s network and introduced a dense spatial pyramid pooling module to improve model detection [22].

FIGURE 2. Schematic of the mechanism of EIOU-loss.
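
The mechanism in Fig. 2 can be sketched in a few lines of code. The example below follows the commonly published EIoU formulation: the IoU term plus three penalties for the center distance, the width difference, and the height difference, each normalized by the smallest enclosing box. It is a minimal single-box sketch, not the training code used in this study; the function name `eiou_loss` and the `eps` stabilizer are assumptions.

```python
def eiou_loss(box_pred, box_gt, eps=1e-9):
    """EIoU loss for one predicted box and one ground-truth box.

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    eps is a small stabilizer added for illustration (hypothetical).
    """
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # IoU term.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # Squared center distance, normalized by the squared enclosing diagonal.
    dx = (px1 + px2) / 2 - (gx1 + gx2) / 2
    dy = (py1 + py2) / 2 - (gy1 + gy2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Width and height differences, normalized by the enclosing box sides.
    width_term = (pw - gw) ** 2 / (cw * cw + eps)
    height_term = (ph - gh) ** 2 / (ch * ch + eps)

    return 1.0 - iou + center_term + width_term + height_term
```

The loss is close to zero for a perfect match and still provides a nonzero distance penalty when the boxes do not overlap, which is where a plain IoU loss gives no useful signal.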

FIGURE 3. CBAM: (a) Structure of the CBAM, (b) structure of the channel attention module, and (c) structure of the spatial attention module.

FIGURE 7. Distribution of some dataset categories: (a) Large target sample, (b) medium target sample, (c) small target, and (d) intensive target sample.

FIGURE 8. Image annotation: (a) Standard wearing of helmets, (b) nonstandard wearing of helmets, and (c) failure to wear a helmet.

FIGURE 11. Comparison of detection before and after improvement of the CBAM: (a) Image, (b) YOLOv5n detection results, and (c) CBAM detection results.

FIGURE 12. Comparison of detection before and after ADL improvement: (a) Image, (b) YOLOv5n detection results, and (c) ADL detection results.

FIGURE 16. Visual comparison of several results. Columns from left to right are YOLOv5n, YOLO-ESC, and YOLO-ESCA: (a) Intensive target detection results, (b) long-range, small-target detection results, (c) dark environment intensive target detection results, (d) fuzzy target detection results, and (e) target detection results from a UAV perspective.

TABLE 4. Comparison of the improvement in performance for each part of the model.

TABLE 5. Results of ablation experiments.

TABLE 6. Method comparison results.

TABLE 7. Statistical significance test results.

FIGURE 14. Self-made dataset detection results.