
Automated Welder Safety Assurance: A YOLOv3-Based Approach for Real-Time Detection of Welding Helmet Availability




Abstract:

This paper presents the development of a novel real-time monitoring and detection system designed to identify the presence of welding helmets on workers’ faces during welding activities. The system employs a Convolutional Neural Network (CNN) based on the YOLOv3 algorithm and is trained and validated using a diverse dataset that includes images with varying levels of blur, grayscale images, and drone-captured photos. The model’s effectiveness is evaluated using five key performance metrics: accuracy, precision, recall, F1 score, and the AUC-ROC curve. Additionally, the study investigates the impact of various input image sizes, batch sizes, activation functions, and the incorporation of additional convolutional layers on model performance. The results indicate that the Swish activation function, combined with a batch size of 128, an input image size of $256\times 256$, and the addition of one convolutional layer, yielded superior performance. The model achieved outstanding values of 98% for precision, recall, and F1 score, along with an AUC of 0.98, underscoring its accuracy and reliability in detecting welding helmets.
Published in: IEEE Access ( Volume: 13)
Page(s): 2187 - 2202
Date of Publication: 30 December 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Workers operating welding machines should adhere to strict safety protocols, including the use of a welding helmet, protective gloves, and fully covering clothing to mitigate the risk of burns. According to data from the Bureau of Labor Statistics (BLS), the United States experiences 21 welding-related accidents per 100,000 workers annually. The BLS further reports that for every 100 million work hours, approximately 1,000 workers sustain a welding-related injury, a rate that is nearly 100 times higher than the average injury rate for workers in other occupations [1]. The most common injury among welders is eye injury, particularly flash burns. A flash burn, also known as ‘welder’s flash’ or ‘arc eye’, refers to painful inflammation of the cornea, the transparent layer covering the front of the eye. This injury typically results from exposure to intense ultraviolet (UV) light, with welding torches being a primary source [2]. For instance, Tetteh et al. [3] reported that the use of eye and face shields could prevent approximately 90% of welding-related eye injuries. Similarly, Amani et al. [4] researched occupational hazards related to welders. The results of their research showed that 92% of welders suffer eye problems due to the improper use of safety equipment recommended for welding jobs. While traditional supervision techniques, such as employing safety inspectors, can be beneficial in certain instances, they are insufficient for ensuring comprehensive workplace safety. Consequently, advanced approaches leveraging artificial intelligence (AI) and deep learning should be implemented to enhance the monitoring and reduction of these injuries.

The use of required safety equipment is essential for safeguarding workers’ lives in any occupational setting. Welding, in particular, demands adherence to several safety equipment standards. One of the most critical safety measures during the welding process is ensuring proper face protection. Face and eye protection are crucial for preventing injuries, with the welding helmet being the primary piece of protective equipment, as shown in Figure 1(a) [5]. Failure to use a welding helmet can result in serious eye injuries, including flash burns, injuries from flying debris, and inflammation of the eyes caused by prolonged exposure to ultraviolet radiation. Some of these potential injuries are depicted in Figure 1 [6], [7].

FIGURE 1. (a) Welding helmet (b) Injury due to flying particles (c) Flash burns.

Existing research in safety monitoring primarily focuses on the detection of conventional safety equipment, such as hard hats [8], [9], [10], [11] and safety vests [12], [13], [14]. However, a notable gap exists in the current literature regarding the detection and real-time monitoring of complex safety equipment, particularly welding helmets. The novelty of this paper lies in its examination of a previously unexplored object of interest: the welding helmet. Detecting welding helmets presents a distinct set of challenges compared to conventional hard hats. Welding helmets are designed to provide full-face protection, incorporating tinted visors to shield welders from the intense light of welding arcs. This tinting introduces significant variability in lighting, shadows, and reflections, complicating detection for standard object detection algorithms. Additionally, the dynamic nature of welding tasks, characterized by frequent and rapid head movements, poses additional challenges for consistent detection. The presence of sparks, smoke, and other artifacts further contributes to the complexity, as these elements can obscure or distort the appearance of the welding helmet. Unlike static hard hats, welding helmets are subject to varying degrees of wear and tear, with potential obstructions such as welding visors in different positions. As a result, the detection of welding helmets demands a robust algorithm capable of handling diverse environmental conditions, complex visual features, and rapid changes in the helmet’s appearance during welding operations.

In this paper, we present an implementation of an enhanced deep learning-based monitoring and detection system designed to provide a reliable real-time tool, leveraging the YOLO algorithm. Deep learning methodologies, particularly the YOLO algorithm, have become foundational in both academic research and practical applications for monitoring occupational safety. For instance, Chen et al. [15] developed an enhanced real-time detection algorithm based on YOLOv5 for detecting helmets (hard hats) and reflective vests. This improved model significantly boosted detection accuracy and convergence speed, achieving a mean average precision of 94.9%. Similarly, Hu et al. [16] proposed an improved YOLOv3-based hard-hat detection method, integrated into real-time detection software with an alert function. The system was successfully deployed across multiple construction sites, demonstrating both high accuracy and real-time performance. Benyang et al. [17] introduced a method for detecting helmet compliance using the YOLOv4 algorithm, employing construction site videos and web-sourced images as a dataset. Their model achieved a high accuracy rate of approximately 93%. Fu et al. [18] presented a hardhat detection model based on an optimized YOLOv5 algorithm, improving both feature fusion and multi-scale detection layers by incorporating an additional fusion scale layer for enhanced small-target recognition. Their model achieved an average detection accuracy of 95%, representing a 2.9% improvement over the initial model. Furthermore, the accuracy of helmet recognition reached 94.6%, marking a 2.4% increase. Similarly, Anushkannan et al. [19] developed a YOLOv3-based model for hard-hat detection, achieving a high mean average precision of approximately 97%. Shanti et al. [20] presented a novel approach for monitoring workers at heights using a pretrained object detector based on the YOLOv3 algorithm. The model demonstrated strong performance, with an accuracy of 91.26% and a precision of 99%. The notable finding in this research is the versatility and effectiveness of the YOLO algorithm in construction and industrial environments. The model developed in this study was further enhanced by integrating drone technology for real-time violation detection [21]. Among the various YOLO algorithms, YOLOv3 was selected due to its demonstrated reliability in detecting objects under dynamic conditions [22]. The YOLOv3-based model developed here specifically targets the detection of welding helmets, critical face protection equipment, on workers during welding activities, helping to mitigate the risk of face and eye injuries.

The primary contributions of this paper are twofold: first, it introduces a YOLOv3 architecture tailored for real-time monitoring of welding activities and helmet usage on construction sites; second, it presents a comprehensive parametric study of the YOLOv3 architecture, focusing on optimizing the detection accuracy of welding helmets. This study evaluates the influence of several key factors, including batch size, activation functions, and input image resolution, on the model’s performance. To further assess robustness, the model was tested on various image conditions, such as blurry, grayscale, and low-light images. Moreover, the research demonstrates the integration of drone technology with the developed model, resulting in a high-performance real-time monitoring system utilizing drone-captured images. This work addresses a specific gap in the field by focusing on welding helmet detection in construction environments, an area that has received significantly less attention compared to the detection of hard hats and reflective vests. Our system extends existing object detection methods by optimizing YOLOv3 for the specific task of welding helmet detection, thereby improving both accuracy and efficiency. Furthermore, we highlight the potential industrial impact of this work, particularly in enhancing worker safety through real-time monitoring, which can play a crucial role in reducing workplace accidents and injuries.

SECTION II.

Methodology

This study is conducted in three distinct stages: dataset creation, model training, and model testing, as depicted in Figure 2. The first stage encompasses the collection, creation, and preprocessing of a dedicated welding helmet dataset. In the subsequent stage, the deep learning algorithm YOLOv3 is trained using the collected dataset. Finally, the model is tested to evaluate its performance. Detailed descriptions of each of these stages are provided in the following sections.

FIGURE 2. Flowchart of the methodology outlining the stages of dataset collection, model training, and model testing for real-time detection of welding helmets using the YOLOv3 algorithm.

A. Dataset Creation

One of the most critical tasks in developing and training a deep learning algorithm for accurate welding helmet detection on construction sites is the collection of a diverse dataset. Welding helmets vary in shape, design, and color, making it essential to gather a comprehensive set of images that reflect these variations. The dataset compiled for this study includes images of different types and colors of welding helmets, primarily focusing on workers wearing them during active tasks. The dataset was also diverse, encompassing various industrial environments with images from different sectors. It included a wide range of worker demographics, featuring both male and female workers from multiple geographic locations. Additionally, the dataset accounted for challenging conditions, such as low-light environments, to enhance the model’s robustness in detecting welding helmets across diverse scenarios. In this study, the two primary sources of training images were real-time photographs captured at actual construction and industrial sites, as well as web-based images. Figures 3 and 4 illustrate examples of the collected images used for training. Additionally, testing was conducted using a separate dataset comprising unique images collected from various sources, including CCTV footage from factories, online databases, and images obtained from construction sites.

FIGURE 3. Sample images used in creating the training dataset: (a) black helmet with wide lens, (b) white helmet with medium-sized lens, (c) brown helmet with narrow small lens, (d) blue helmet with medium-sized lens, (e) black helmet with medium-sized lens, and (f) red helmet with medium-sized lens [23], [24], [25], [26], [27], [28].

FIGURE 4. Sample images used in creating the diverse training dataset that contains people from different demographics, different working conditions, and light conditions: (a) male welder, (b) female welder, (c) external working environment, and (d) low-light condition.

After collecting and creating a dataset of 1,550 image files, these images underwent preprocessing to prepare them for training and testing the deep learning model. Preprocessing is widely recognized as a critical factor in achieving high model accuracy [29], [30]. In this study, various preprocessing methods and techniques were employed, including resizing and cleaning the images. Resizing, a common preprocessing step in deep learning, adjusts the dimensions of images to a specific size, thereby improving the computational efficiency of the model. Cleaning encompasses identifying and removing deficiencies in the dataset, such as missing, inaccurate, duplicate, irrelevant, or non-representative images. Although time-consuming, this step is essential for ensuring a clean and diverse dataset suitable for training. Following the preprocessing phase, the images were annotated, a crucial process in which the object of interest, in this study the welding helmet, was labeled in the training data. The goal of annotation is to provide the algorithm with ground-truth data, allowing it to learn patterns and accurately detect and recognize objects in new, unseen data. Data annotation is a critical step in supervised learning, where the model is trained using input-output pairs. In this study, the annotation process was conducted using the LabelImg software, as demonstrated in Figure 5, which shows an example of the labeling performed.

FIGURE 5. Data annotation for one of the training dataset images using the LabelImg software; the welding helmet is annotated in the blue box.
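To illustrate the annotation pipeline, the short Python sketch below converts a Pascal VOC XML file of the kind LabelImg exports into the normalized text format expected by Darknet-style YOLO training; the file names and single class index are hypothetical and not taken from the study’s actual dataset.

# Minimal sketch (not the authors' exact pipeline): converting a Pascal VOC XML
# annotation exported by LabelImg into the normalized YOLO text format used by
# Darknet. File names and the class index are illustrative assumptions.
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, txt_path, class_id=0):
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all relative to image size
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        bw = (xmax - xmin) / img_w
        bh = (ymax - ymin) / img_h
        lines.append(f"{class_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))

voc_to_yolo("welder_001.xml", "welder_001.txt")  # hypothetical file names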

B. YOLOv3 for the Detection of the Welding Helmet

The object detection model used in this study was based on the YOLOv3 algorithm, which represents a significant improvement over earlier YOLO versions. YOLOv3 enhances mean average precision (mAP) by up to 10% and increases the number of frames per second by 12% [31]. One of its key advancements is the integration of Darknet-53, a CNN architecture specifically optimized for object detection. Darknet-53, a variant of the ResNet architecture, consists of 53 convolutional layers and achieves state-of-the-art performance on various object detection benchmarks. For YOLOv3, an additional 53 layers are added, resulting in a 106-layer fully convolutional architecture [31]. The YOLOv3 model performs detections at three different points in the network: the 82nd, 94th, and 106th layers, as illustrated in Figure 6.

FIGURE 6. YOLOv3 architecture for the detection of welding helmets. Detections are performed at the 82nd, 94th, and 106th layers in the network [32].

The network downsamples the input image by factors of 32, 16, and 8 at the 82nd, 94th, and 106th layers, respectively. These values, known as strides, indicate the degree of downsampling, reflecting how much smaller the output at these points is compared to the network’s input.
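As a simple illustration (not part of the trained model itself), the following Python snippet computes the grid sizes produced at the three detection layers for a given square input size:

# Illustrative only: grid sizes produced at the three YOLOv3 detection layers
# for a given square input size (the 256x256 size adopted later in this study).
def detection_grids(input_size):
    strides = {"layer 82": 32, "layer 94": 16, "layer 106": 8}
    return {layer: input_size // s for layer, s in strides.items()}

print(detection_grids(256))  # {'layer 82': 8, 'layer 94': 16, 'layer 106': 32}
print(detection_grids(416))  # {'layer 82': 13, 'layer 94': 26, 'layer 106': 52}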

In this research, a YOLOv3 architecture is developed for the accurate detection of welding helmets on construction sites through a parametric study. The proposed YOLOv3 architecture incorporates an additional 2D convolutional layer into the Darknet-53 network. This enhancement improves the architecture’s feature extraction capabilities, enabling it to capture more complex and abstract features from the input data. As a result, the model generates a richer and more informative representation of objects within the image, thereby enhancing its ability to distinguish intricate patterns. The inclusion of this additional layer also deepens the model’s structure, allowing it to learn more sophisticated and abstract representations of the input data, and it mitigates the risk of underfitting, since the added capacity lets the model capture more nuanced patterns in the training data. However, it is essential to maintain a balance, as excessive layer addition can lead to overfitting. The addition of a convolutional layer, although seemingly modest, was a deliberate and strategic enhancement designed to improve the model’s feature extraction capabilities and overall performance in the specific task of detecting welding helmets in construction environments. This decision was informed by the recognition that even minor architectural adjustments can lead to substantial performance improvements, especially in specialized applications. To evaluate the impact of this modification, three distinct models were trained, with one, two, and three additional convolutional layers, respectively. As illustrated in Figure 7, the YOLOv3 architecture developed in this research incorporates the newly added layer, highlighted in orange, into the default structure. These enhancements were deliberately designed to improve the model’s detection accuracy and robustness, demonstrating that even incremental changes to the network architecture can yield significant performance gains. Additionally, the model retains its three prediction layers, each responsible for detecting objects of a different size: small, medium, and large.

FIGURE 7. The YOLOv3 architecture with one additional convolutional layer added into the Darknet-53 network.
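The modification itself was implemented in the Darknet configuration; purely as a conceptual sketch of what one extra layer adds, the following PyTorch snippet builds the standard Darknet-style convolution, batch-normalization, and LeakyReLU block and applies it to a dummy stride-32 feature map. The channel count and kernel size are assumptions for illustration, not the study’s exact settings.

# Conceptual sketch only (the actual modification was made in the Darknet
# configuration, not in PyTorch): one additional 3x3 convolutional block of the
# kind used throughout Darknet-53, appended after the backbone features.
import torch
import torch.nn as nn

def extra_conv_block(channels=1024, kernel_size=3):
    """One Darknet-style conv block: Conv2d + BatchNorm + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Example: applying the extra block to a dummy 8x8 feature map (the stride-32
# scale for a 256x256 input); the spatial size is preserved, only features change.
features = torch.randn(1, 1024, 8, 8)
print(extra_conv_block()(features).shape)  # torch.Size([1, 1024, 8, 8])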

The selection of YOLOv3 in this research is driven by its faster processing time compared to most conventional object detection algorithms. This shorter processing time is crucial for real-time detection in construction and industrial environments. Moreover, YOLOv3 has demonstrated both theoretically and in practice its efficiency and reliability in such settings, making it a suitable choice for this application. The model was trained using Google Colab Notebook, utilizing the built-in GPUs provided by Colab through cloud computing. The GPU was NVIDIA Tesla T4, featuring 15.36 GB of GDDR6 memory, paired with an Intel(R) Xeon(R) CPU operating at 2.00 GHz. Furthermore, Darknet was the deep learning library/framework used to train and build the model.

SECTION III.

Results and Discussion

The YOLOv3 architecture proposed in the previous section was tested using a dataset consisting of diverse images of welding helmets. Of the 1,550 total images, 70% (i.e., 1,085 images) were allocated for training, while the remaining 30% (i.e., 465 images) were used for testing the YOLOv3 model.

A. Training of the YOLOv3 Model

Several combinations of input hyperparameters, referred to as scenarios in this study, were tested to achieve optimal model performance. The first parameter evaluated was the input image size. Various sizes were assessed to determine the dimensions that would yield the highest accuracy. After completing the training, the final loss for each case was obtained. The YOLOv3 loss function is the sum of three losses; that is, the total loss ($L$) is the weighted sum of the objectness loss ($L_{obj}$), the localization loss ($L_{coord}$), and the classification loss ($L_{cls}$):\begin{equation*} L = \lambda _{coord} L_{coord} + \lambda _{obj} L_{obj} + L_{cls} \tag {1}\end{equation*}where $L_{obj}$ measures how well the model predicts the presence of an object in a given bounding box and $\lambda _{obj}$ penalizes the model for failing to detect an object or for incorrect predictions. $L_{coord}$ is associated with the accuracy of the predicted bounding box coordinates and $\lambda _{coord}$ penalizes the model for imprecise localization by weighting the difference between predicted and ground-truth box coordinates. $L_{cls}$ is related to the accuracy of object class predictions. Figure 8 compares the different image sizes in terms of loss value. As depicted in the figure, a larger input image size facilitates faster model convergence, leading to reduced loss. However, after approximately 500 epochs, the model with a $256\times 256$ input image size demonstrated a convergence pattern similar to that of the larger $416\times 416$ size. Subsequently, by the 1,000th epoch, both the $416\times 416$ and $256\times 256$ configurations exhibited closely aligned loss values, with the latter showing marginally better performance.
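As a minimal illustration of how the terms in equation (1) combine, the sketch below forms the total loss from the three component losses; the weighting coefficients shown are placeholder assumptions rather than the values used during training.

# Illustrative composition of the YOLOv3 total loss in Eq. (1). The weighting
# coefficients below are placeholder assumptions, not the values used in training.
def total_loss(l_coord, l_obj, l_cls, lambda_coord=5.0, lambda_obj=1.0):
    return lambda_coord * l_coord + lambda_obj * l_obj + l_cls

# Example with dummy per-batch loss values:
print(total_loss(l_coord=0.12, l_obj=0.35, l_cls=0.08))  # 1.03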

FIGURE 8. Training performance of the YOLOv3 architecture for various image sizes.

The next two hyperparameters of interest were batch size and activation function. In this step, various combinations and scenarios of these hyperparameters were created, as summarized in Table 1. The selection of activation functions was guided by their optimal performance in object detection applications.

TABLE 1. Different model scenarios.

Trial tests were conducted to determine the optimal values for these two deep learning hyperparameters, namely batch size and activation function, within the model’s neural network. Other input hyperparameters were set as follows: number of epochs = 1,000, learning rate = 0.001, and input image size $= 256\times 256$. Figure 9 illustrates a comparison of three different scenarios related to batch size. The results indicate that the model achieving the best performance, with the lowest loss value, corresponded to a batch size of 128. As shown in Figure 9, the scenarios exhibited nearly identical behavior during the initial training epochs up to the 600th epoch. Beyond this point, the model with a batch size of 128 demonstrated superior performance, achieving a lower loss value compared to the other scenarios.

FIGURE 9. Effect of batch size on training performance of the YOLOv3 architecture.

After determining the optimal batch size, the focus shifted to evaluating five different scenarios (scenarios 3 to 7) related to activation functions. These activation function scenarios were analyzed and compared to identify the most suitable function for the model. An activation function governs whether a neuron is activated by applying a mathematical operation that determines the relevance of its input to the network’s prediction. It plays a critical role in producing the output from the input values fed to a node or layer. In this research, five activation functions were tested: linear, tanh, ReLU, Leaky ReLU, and Swish. Table 2 provides a summary of these activation functions along with their mathematical representations.

TABLE 2. Activation functions.
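For reference, the five candidate activation functions in Table 2 can be expressed compactly in NumPy as follows; the Leaky ReLU slope of 0.1 is an assumption reflecting the value conventionally used in Darknet.

# NumPy forms of the five candidate activation functions in Table 2. The Leaky
# ReLU slope of 0.1 is an assumption (the value conventionally used in Darknet).
import numpy as np

def linear(x):            return x
def tanh(x):              return np.tanh(x)
def relu(x):              return np.maximum(0.0, x)
def leaky_relu(x, a=0.1): return np.where(x > 0, x, a * x)
def swish(x):             return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
for f in (linear, tanh, relu, leaky_relu, swish):
    print(f.__name__, np.round(f(x), 3))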

Furthermore, Figure 10 presents a comparison of the training loss obtained when each of these activation functions was individually applied to the model. The results indicate that the model trained using the Swish activation function performed the best. Additionally, it can be observed in the figure that Leaky ReLU, Swish, and ReLU exhibited very similar behavior throughout the training process. This similarity can be attributed to the fact that Leaky ReLU represents an enhancement over traditional ReLU, and Swish, in turn, is an improvement upon Leaky ReLU.

FIGURE 10. Effect of activation functions on training performance of YOLOv3 when all other parameters are fixed (batch size = 128 and input image size $= 256\times 256$).

After determining the optimal hyperparameters for the default YOLOv3 model, these parameters were applied to the modified YOLOv3 models. Three different models were trained, incorporating up to three additional 2D convolutional layers. Figure 11 presents a comparison of the training loss between the default YOLOv3 model and the developed models. As shown in Figure 11, the model with one additional layer (YOLOv3+1CL) exhibited the best performance compared to the default model, the two-additional-layer model (YOLOv3+2CL), and the three-additional-layer model (YOLOv3+3CL). The superior performance of the one-additional-layer model can be attributed to the mitigation of overfitting, as the models with two and three additional layers struggled to learn new features and instead overfit to the training dataset. Table 3 summarizes the final model parameters used in this study.

TABLE 3. Summary of the parameters used in our model.
FIGURE 11. Training loss of YOLOv3 for different numbers of convolutional layers.

B. Testing of the YOLOv3 Model

The detection of welding helmets was performed on a distinct dataset using the developed YOLOv3 architecture, leveraging the hyperparameters outlined in Table 3, which were obtained during the training phase. As mentioned earlier, a total of 1,550 images were used for both training and testing, with the testing dataset comprising 30% (465 images) of the total. These testing images included welding helmets of various shapes, types, and colors, ensuring a comprehensive validation of the trained YOLOv3 model. Figure 12 illustrates the performance of the developed YOLOv3 model in detecting welding helmets with diverse attributes, such as different colors (e.g., white, blue, and black), shapes, angles, and lens sizes (e.g., wide, medium, and small). Following the testing phase, results were obtained for various evaluation metrics: (1) precision, (2) recall, (3) accuracy, (4) F1 score, and (5) confidence in detection. These metrics were calculated to demonstrate the reliability and consistency of the model. The model was tested across all seven scenarios, and the evaluation metrics, which are interrelated, were closely analyzed. For instance, precision quantifies the proportion of correctly detected items, while recall measures the proportion of relevant elements detected [33]. The F1 score is the harmonic mean of precision and recall and ranges from zero to one, with higher values indicating better performance. Precision, recall, accuracy, and the F1 score were calculated using the following formulas:\begin{align*} \mathrm {Precision}&=\frac {\mathrm {TP}}{\mathrm {TP+FP}}, \tag {2}\\ \mathrm {Recall}&=\frac {\mathrm {TP}}{\mathrm {TP+FN}}, \tag {3}\\ \mathrm {Accuracy}&=\frac {\mathrm {TP+TN}}{\mathrm {TP+TN+FP+FN}}, \tag {4}\\ \mathrm {F1~score}&=\frac {2\times \mathrm {Precision}\times \mathrm {Recall}}{\mathrm {Precision}+\mathrm {Recall}}. \tag {5}\end{align*}
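A minimal sketch of equations (2)-(5), computed directly from confusion-matrix counts, is given below; the counts in the example are hypothetical and do not correspond to the reported test results.

# Minimal sketch of the evaluation metrics in Eqs. (2)-(5), computed directly
# from the confusion-matrix counts of a detection run.
def detection_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Example with hypothetical counts (not the actual test-set values):
print(detection_metrics(tp=120, fp=10, fn=15, tn=30))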

FIGURE 12. Developed model sample results when various shapes, colors, and angles of welding helmets were used: (a) black helmet with a confidence rate of 97%, (b) silver helmet with a confidence rate of 100%, (c) blue helmet with a confidence rate of 100%, (d) black helmet with a confidence rate of 100%, (e) blue helmet with a confidence rate of 99%, (f) blue and white helmet with a confidence rate of 99%, (g) black helmets with confidence rates of 99% and 97%, and (h) black helmet with a confidence rate of 100%.

True Positives (TP) represent the number of correctly detected objects, while False Positives (FP) indicate the number of incorrectly predicted positive outcomes. False Negatives (FN) refer to cases where the model failed to detect an object that was actually present, and True Negatives (TN) refer to cases where no object was present and none was detected. Additionally, the confidence score reflects the model’s certainty that a particular bounding box contains the object of interest, such as a welding helmet. The confidence score (CS) is calculated as follows:\begin{equation*} \mathrm {CS}= \Pr (\text {object})\times \mathrm {IoU}\times \Pr (\text {class}\vert \text {object}) \tag {6}\end{equation*}where $\Pr (\text {object})$ is obtained by applying the logistic function to the raw objectness output $P_{\mathrm {object}}$:\begin{equation*} \Pr \left ({{ \mathrm {object} }}\right )=\frac {1}{1+e^{-P_{\mathrm {object}}}}. \tag {7}\end{equation*}Moreover, IoU, the intersection over union, is the ratio of the area of overlap between the predicted bounding box and the ground-truth bounding box (as shown in Figure 13) to the area of their union, i.e.,\begin{equation*} \mathrm {IoU}=\frac {\mathrm {Area~of~Overlap}}{\mathrm {Area~of~Union}}. \tag {8}\end{equation*}$\Pr (\text {class}\vert \text {object})$ is the softmax function applied to the class predictions, computed as\begin{equation*} \Pr (\text {class}\vert \text {object}) = \frac {e^{P_{\mathrm {class}}}}{\sum _{\mathrm {class}}e^{P_{\mathrm {class}}}} \tag {9}\end{equation*}where $P_{\mathrm {class}}$ is the raw output for each class.
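The following sketch implements equations (6)-(9) for a single predicted box, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the raw outputs and box coordinates in the example are hypothetical.

# Sketch of Eqs. (6)-(9): IoU of two axis-aligned boxes (x1, y1, x2, y2) and the
# confidence score assembled from raw network outputs. Inputs are hypothetical.
import numpy as np

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

def confidence_score(p_object_raw, p_class_raw, box_pred, box_truth):
    pr_object = 1.0 / (1.0 + np.exp(-p_object_raw))               # Eq. (7)
    pr_class = np.exp(p_class_raw) / np.sum(np.exp(p_class_raw))  # Eq. (9), softmax
    return pr_object * iou(box_pred, box_truth) * pr_class.max()  # Eq. (6)

print(confidence_score(2.0, np.array([3.0, -1.0]),
                       box_pred=(10, 10, 60, 60), box_truth=(12, 8, 58, 62)))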

FIGURE 13. Ground-truth bounding box and predicted bounding box comparison.

As illustrated in Figure 12, the model demonstrated a high detection rate with consistent accuracy. However, a few instances of FP were observed, where the model incorrectly identified objects as welding helmets. Figure 14 presents one such case, where the model accurately detected three welding helmets but erroneously identified an additional object as a helmet. Although the occurrence of FP in this study was minimal, in scenarios where FP rates are significantly higher, they can compromise the model’s reliability, potentially causing operational disruptions and diminishing user confidence. To mitigate this issue, strategies such as augmenting the training dataset, refining the model’s architecture, or adjusting detection thresholds can be employed to improve model performance.

FIGURE 14. A sample result where the model had a false positive (the round black pipe).

In the preceding section, a systematic comparison of various batch sizes was conducted to determine the optimal configuration for the model. As detailed in Table 4, the results demonstrate that a batch size of 128 delivers the most effective and highest-performing model. This batch size consistently outperformed the others, achieving values above 90% across all evaluation metrics. The data indicate that larger batch sizes facilitate more efficient convergence during training.

TABLE 4. Testing performance of the trained YOLOv3 for different batch sizes.

In addition to the enhanced model performance observed with a batch size of 128, this choice is also advantageous from a computational perspective. Larger batch sizes exploit the parallel processing capabilities of modern hardware, such as GPUs, leading to improved resource utilization and reduced training time. While smaller batch sizes result in more frequent updates to the model’s weights, the training process benefits significantly from the reduced communication overhead and increased throughput that larger batches offer. The selection of a batch size of 128 therefore balances detection performance and computational efficiency, making the training process more scalable and effective. This batch size, identified as optimal in the previous analysis, was subsequently employed to train the model using the five specified activation functions. Table 5 provides detailed results, highlighting the comparative performance of each activation function in terms of accuracy, precision, and other relevant metrics.

TABLE 5. Testing performance of the trained YOLOv3 for different activation functions.

The model trained with the Swish activation function demonstrated superior performance, achieving a precision of 93.80% and a recall of 92.0%. Additionally, it exhibited the highest confidence rate in its detections, at 92.31%. Conversely, the model trained with the linear activation function failed to detect any objects (i.e., welding helmets). This failure stems from the linear activation function being suited only to simple tasks that do not demand learning intricate patterns or features. The linearity of this activation function fundamentally constrains the expressiveness of the neural network, as the composition of multiple linear transformations yields an overall linear mapping. In contrast, deep learning architectures, particularly those engineered for object detection tasks, are purposefully designed to learn hierarchical and non-linear representations of input data. The reliance on a linear activation function therefore severely limits the model’s capacity to capture and learn intricate and complex features within the data.

Furthermore, Table 6 summarizes the results achieved with various input image sizes. The findings indicate that an input image size of $256\times 256$ consistently produces the highest precision and recall values among all tested configurations. Additionally, the model’s prediction time was significantly reduced, being halved when employing the $256\times 256$ image size compared to the $416\times 416$ image size.

TABLE 6. Testing performance of the trained YOLOv3 for different image sizes.

Despite the satisfactory performance of the default YOLOv3 model when utilizing optimal hyperparameters, there is a pressing need for enhancement due to the critical importance of workplace safety. Even minor improvements in object detection capabilities can have significant implications for saving lives. To address this need, all optimal hyperparameters were employed as a baseline for training an enhanced YOLOv3 model. Subsequently, the performance of this model was compared with that of the default YOLOv3, as well as YOLOv4, YOLOv5, YOLOv7, and YOLOv8 models, all trained on the same dataset and hyperparameters for consistency. Table 7 provides a comparative analysis of these YOLO versions in terms of their architectural design. All YOLO models employ a fully convolutional network (FCN), which enables the generation of dense pixel-wise predictions, surpassing the capabilities of traditional CNNs typically utilized for image classification. YOLOv3 utilizes Darknet-53 as its backbone feature extractor, while YOLOv4 and YOLOv5 incorporate CSPDarknet53. YOLOv7 employs a Cross-Stage Bottom-Up and Top-Down Connections (CBS) architecture, and YOLOv8 leverages CSPNet (Cross Stage Partial Network). Moreover, YOLOv3, YOLOv4, and YOLOv5 share the same loss function, binary cross-entropy, which is widely used in binary classification tasks. This function quantifies the difference between the predicted probability distribution and the actual distribution for a binary classification task. In contrast, YOLOv7 adopts the focal loss function, a variant of cross-entropy loss specifically designed to mitigate class imbalance issues in both binary and multi-class classification scenarios. YOLOv8 utilizes a combination of binary cross-entropy and focal loss. Additionally, YOLOv3 employs a feature pyramid network (FPN) for multi-scale feature representation, whereas the subsequent versions implement a path aggregation network (PANet) for the same purpose.

TABLE 7. Comparison of various YOLO versions [36].

Additionally, Table 8 provides a comparative analysis of the discussed YOLO models, employing precision, recall, accuracy, F1 score, and average confidence as evaluation metrics. The YOLOv3 model with one additional layer (YOLOv3+1CL) emerges as the top performer across all metrics, achieving a precision of 98.0%, recall of 98.0%, accuracy of 96.2%, and an F1 score of 98.0%. These results highlight the effectiveness of the proposed modifications in enhancing the model’s capability to accurately identify and classify objects. Following closely, the default YOLOv3 model demonstrates strong performance with a precision of 95.9% and a recall of 94.0%.

TABLE 8. Comparison of the developed YOLOv3 model (YOLOv3+1CL) with other YOLO models.

YOLOv5 exhibits lower results, particularly in precision (94.2%) and F1 score (94.2%), relative to YOLOv3. YOLOv4 maintains a balanced performance with a precision of 93.3% and recall of 93.3%. However, YOLOv7 and YOLOv8 show comparatively lower values across all metrics, indicating significant opportunities for improvement. The average confidence scores further elucidate model performance, with the YOLOv3 model with one extra layer achieving the highest confidence score at 93.17%, reinforcing its reliability in object detection tasks. Furthermore, Figure 15 illustrates the detection and confidence rates among the various models, confirming that the developed YOLOv3 model outperforms all others. The enhancement was achieved by incorporating a single additional convolutional layer, while two other models with two and three additional layers were tested but exhibited lower performance rates. Overall, the YOLOv3 model developed in this study demonstrated performance exceeding 96% across all evaluation parameters. Moreover, when comparing YOLOv3 with other YOLO versions, several advantages of YOLOv3 become evident, particularly in the context of this study. YOLOv3 typically demonstrates faster inference speeds on certain hardware configurations, which is essential for real-time applications. Its effectiveness in detecting small objects is enhanced by its multi-scale prediction capabilities, making it particularly advantageous for recognizing helmets at various scales. Additionally, YOLOv3 is less computationally intensive, allowing it to operate efficiently on limited resources, an important consideration in the construction industry, where access to advanced technology and expertise may be restricted. Furthermore, YOLOv3 offers superior compatibility with legacy systems and older hardware, which are often prevalent at construction sites [34]. We also conducted a comparison between our model (YOLOv3+1CL) and a Faster R-CNN model documented in the literature, specifically designed for detecting safety hardhats [35]. The results indicate that our developed model outperformed the Faster R-CNN in terms of precision. While the enhanced Faster R-CNN model achieved a precision of 94.7%, our model reached a precision of 98%. These findings validate the reliability of our model and demonstrate its competitive performance compared to other architectures.

FIGURE 15. Comparison of the developed YOLOv3 model with other YOLOs: (a) YOLOv3, confidence rate 95%; (b) YOLOv4, confidence rate 87%; (c) YOLOv5, confidence rate 99%; (d) YOLOv7, confidence rate 82%; (e) YOLOv3 (1 layer), confidence rate 99%; (f) YOLOv8, confidence rate 97%.

An additional parameter used to evaluate the performance of the developed model was the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The AUC-ROC is a widely utilized metric for assessing the performance of classification and detection models, providing insight into the model’s ability to distinguish between classes [37]. Specifically, it evaluates how effectively the model separates positive instances (e.g., the presence of a welding helmet) from negative ones (e.g., the absence of a welding helmet). In the context of our model, the classification results fall into two categories: safe (welding helmet detected) and unsafe (welding helmet not detected). The ROC curve is generated from two key parameters: the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR, also known as recall, is defined in equation (3), while the FPR is calculated as shown in equation (10) [38]:\begin{equation*} \mathrm {FPR}=\frac {\mathrm {FP}}{\mathrm {TN+FP}}. \tag {10}\end{equation*}Figure 16 displays the AUC-ROC curve for our developed model alongside those of the other YOLO versions trained and tested for comparative analysis. As shown in the figure, YOLOv3+1CL outperformed all other models, since its TPR is consistently higher than that of the other models, resulting in a higher AUC value. Furthermore, Table 10 provides a comparison of the AUC values achieved by the tested YOLO versions. The model achieved an AUC value of approximately 0.98, which is considered excellent by literature standards, as indicated in Table 9 [39]. The results indicate that the YOLOv3 with one additional convolutional layer (YOLOv3+1CL) outperforms the other versions with an AUC of 0.98, showing the highest detection accuracy. A high AUC value indicates good separation between classes, i.e., the model’s ability to distinguish between them. In contrast, YOLOv7 and YOLOv8 achieved slightly lower AUC values of 0.91 and 0.92, respectively. These results suggest that the developed model is reliable in both detection and classification tasks.
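As an illustration of how the ROC curve and AUC can be computed in practice, the sketch below uses scikit-learn on hypothetical per-image safe/unsafe labels and model confidence scores; it is not the evaluation script used in this study.

# Illustrative computation of the ROC curve and AUC with scikit-learn, using
# hypothetical safe/unsafe labels (1 = helmet present) and model confidence scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 1, 0, 1, 0]                            # ground-truth labels
y_score = [0.97, 0.92, 0.88, 0.40, 0.95, 0.91, 0.90, 0.55]   # model confidences

fpr, tpr, thresholds = roc_curve(y_true, y_score)            # FPR per Eq. (10), TPR per Eq. (3)
print("AUC =", roc_auc_score(y_true, y_score))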

TABLE 9. AUC value quality categories.
TABLE 10. AUC values for different YOLO versions.
FIGURE 16. ROC-AUC curves for the various tested YOLO model versions.

C. Applications of the Object Detection Algorithm on Diverse Image Sources

After determining the optimal hyperparameters and training the developed YOLOv3 model on these parameters, the model was subjected to a new verification test. Initially, the model was tested using static images of workers engaged in welding tasks. This section extends the testing by introducing the use of a drone to capture images and videos of workers performing welding activities, followed by the detection process. Various parameters, including precision, recall, and detection time, were recorded during this experiment to evaluate the reliability and flexibility of the developed model. The results for each metric will be presented in the following subsections.

To conduct this test, a drone equipped with a camera was required. The Tello drone, a product developed by Ryze Tech in collaboration with DJI, was selected for this purpose. This drone was chosen for its ease of use, affordability, high-quality camera, stability, and hovering capabilities. The Tello drone is lightweight and designed primarily for testing applications, featuring key components such as four motors, propeller guards, and a high-resolution camera. It is powered by an 1100 mAh battery and is designed to balance energy efficiency with a lightweight build, making it ideal for short recreational flights and basic aerial photography. Its efficiency is enhanced by its lightweight frame, allowing it to make the most of its limited power capacity, although factors such as wind, rapid maneuvers, and camera use can reduce the actual flight time. The Tello drone’s battery lasts approximately 13 minutes. To mitigate this constraint, solutions such as scheduled monitoring intervals and the integration of additional power sources can be implemented to ensure continuous and reliable monitoring. Typically, the battery life of a consumer drone ranges between 10 and 35 minutes; for example, the DJI Air 2S offers around 30 minutes of flight time [40]. Thus, more advanced drones can be used when applying our system to real sites; the Tello drone in this study was used only for experimental purposes. For this research, a controlled lab environment was established for drone testing. Various scenarios simulating workers engaged in welding tasks were projected onto a screen, and the drone maneuvered within the area, capturing images from multiple angles. Figure 17 illustrates the Tello drone in action, showcasing its hovering capabilities during the lab tests. Following image capture, detections were performed on the acquired images, with the model exhibiting exceptional accuracy and precision in detecting welding helmets. Performance metrics in this validation experiment remained consistently high, matching those from the initial tests; both the overall precision and recall were 98%. Figure 18 provides a sample of these detections, demonstrating the model’s robust capability in accurately identifying welding helmets from drone-captured images.
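A sketch of the capture-and-detect loop used in such an experiment is shown below, assuming the djitellopy package for accessing the Tello video stream and OpenCV’s Darknet importer for the trained weights; the configuration and weight file names are hypothetical placeholders.

# Sketch of a capture-and-detect loop, assuming the djitellopy package for the
# Tello video stream and OpenCV's Darknet importer for the trained model.
# The .cfg/.weights file names are hypothetical placeholders.
import cv2
from djitellopy import Tello

net = cv2.dnn.readNetFromDarknet("yolov3_helmet.cfg", "yolov3_helmet.weights")

tello = Tello()
tello.connect()
tello.streamon()

frame = tello.get_frame_read().frame                        # latest video frame from the drone camera
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (256, 256), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())   # three detection scales

h, w = frame.shape[:2]
for scale in outputs:
    for det in scale:                                       # det = [cx, cy, bw, bh, objectness, class scores...]
        confidence = det[4] * det[5:].max()
        if confidence > 0.5:
            cx, cy = det[0] * w, det[1] * h
            print(f"welding helmet near ({cx:.0f}, {cy:.0f}), confidence {confidence:.2f}")

tello.streamoff()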

FIGURE 17. (a) Tello drone and its components, (b) Tello drone hovering in the lab to take images of workers performing the welding activity (front view), and (c) Tello drone hovering in the lab to take images of workers performing the welding activity (side view).

FIGURE 18. Drone detection sample results: (a) detection 1 with a 77% confidence rate, and (b) detection 2 with confidence rates of 98% and 81%, respectively.

A key factor in the development of such systems is time efficiency, as the time taken to detect safety violations plays a crucial role in reducing or preventing workplace injuries or fatalities. In this study, the entire detection process took around eight seconds from the time the drone captured an image to when the model produced a detection result. Of this total, capturing and processing the image took about four seconds, while the model’s inference required approximately four seconds. This streamlined process ensures that the system operates efficiently, even in real-world industrial environments. Notably, this detection time is significantly shorter compared to the time typically required by a safety officer to manually identify and report such violations. The computational requirements for running our model are minimal, as the detection process can be executed efficiently on a standard CPU without the need for a GPU. If the model is hosted on a cloud-based provider, only an internet connection is required to perform detections remotely. However, when using a local executor, the model can function offline without any need for an internet connection. This simplicity makes the deployment of the model highly accessible, allowing it to be integrated seamlessly into various environments, both cloud-based and local, with minimal hardware demands.

To further assess the reliability and flexibility of the model, two additional image scenarios were tested. The first scenario involved presenting the model with blurry images, while the second utilized grayscale (colorless) images. These tests were designed to simulate real-world situations where consistently capturing high-quality images may not always be feasible. The results from both tests were highly encouraging, with the model achieving performance metrics comparable to those obtained using standard images. Figures 19 and 20 showcase sample detection results from the blurry and grayscale image tests, respectively. Despite the challenging conditions, the model maintained high accuracy and precision.
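Degraded test variants of this kind can be generated with OpenCV as sketched below; the blur kernel size and file names are illustrative assumptions, since the exact degradation settings are not specified here.

# Sketch of generating degraded test variants with OpenCV: a Gaussian-blurred
# copy and a greyscale copy of an original test image. The blur kernel size and
# file names are assumptions, not the authors' exact settings.
import cv2

image = cv2.imread("welder_test.jpg")                  # hypothetical test image

blurred = cv2.GaussianBlur(image, (15, 15), 0)         # blurry variant
grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)         # greyscale variant
grey_3ch = cv2.cvtColor(grey, cv2.COLOR_GRAY2BGR)      # back to 3 channels for the detector

cv2.imwrite("welder_test_blur.jpg", blurred)
cv2.imwrite("welder_test_grey.jpg", grey_3ch)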

FIGURE 19. Sample results for blurred images: (a) normal image, confidence rate 95%; (b) blurred image, confidence rate 91%; (c) normal image, confidence rate 95%; (d) blurred image, confidence rate 91%.

FIGURE 20. Greyscale image sample results: (a) normal image, confidence rate 96%; (b) greyscale image, confidence rate 94%; (c) normal image, confidence rate 98%; and (d) greyscale image, confidence rate 96%.

Two additional tests were also conducted. The first involved using images that contained various types of headgear, including welding helmets and hard hats, to evaluate the model’s ability to distinguish between different helmet types. The model performed exceptionally well, accurately differentiating between the different headgears. A sample result is shown in Figure 21. The second test assessed the model’s performance in low-light conditions to ensure its effectiveness across diverse environments reflective of real-world construction sites. Figure 22 illustrates a sample of the detection results under low-light condition. The result shows that the model was capable of detecting the welding helmet in the low light condition with a very high confidence rate of 99%.

FIGURE 21. A sample result for two different types of helmets present in an image. The confidence rate is 99%.

FIGURE 22. A sample result under the low-light condition. The confidence rate in the detection is 99%.

D. Privacy Considerations and Worker Acceptance

Our strategy for implementing an automated monitoring system in industrial settings will prioritize both worker privacy and workplace safety. The primary objective is to enhance safety by ensuring compliance with welding helmet requirements, but we recognize the importance of addressing potential privacy concerns. To build worker trust and acceptance, we will maintain full transparency regarding the system’s purpose and functionality. We will clearly communicate that the system is designed solely to uphold safety protocols, not to conduct surveillance. In addition, we will adhere to laws that regulate the use of surveillance cameras on construction sites, which protect individuals’ privacy rights by requiring informed consent. This will involve posting visible signs around the site to notify workers and visitors that monitoring is in place. By emphasizing that our monitoring aligns with industry standards for safety, we reinforce the protective intent of the system, reassuring workers that it exists to safeguard them, not to invade their privacy.

SECTION IV.

Conclusion

Face and eye protection are critical for ensuring safety when working with construction and manufacturing tools. Welding, a primary task in the industrial sector, is associated with a high incidence of accidents involving the face and eyes. In response, this paper presents a novel deep learning model based on the YOLOv3 algorithm, aimed at detecting whether workers are wearing safety welding helmets during welding tasks. The model was trained on a custom dataset compiled from various sources, which served as the foundation for both training and performance evaluation. The development process involved testing seven distinct scenarios, varying parameters such as batch sizes, activation functions, and input image sizes. The model exhibited notably strong performance, particularly when using the Swish activation function, which was identified as the optimal choice. To further validate its real-world applicability, a drone was deployed to assess the model’s performance in practical settings. In this study, five key metrics were used to evaluate the model’s performance: accuracy, precision, recall, F1 score, and the AUC-ROC curve. The model achieved exceptional results, with 98% precision, 98% recall, 98% F1 score, and a notable AUC of 0.98. These metrics demonstrate the model’s high accuracy and reliability compared to the default YOLOv3 model, YOLOv4, and other models documented in the literature. Furthermore, the integration of the model with drone technology maintained consistently high performance, similar to the initial test results. The total time required for violation detection, from image capture by the drone to output generation, was approximately eight seconds. These findings highlight the potential of artificial intelligence and drone technology in enhancing worker safety in construction and industrial environments.

Despite the promising outcomes of this development, there are potential limitations. One significant challenge is data acquisition for model training, particularly in ensuring the diversity and representativeness of the dataset, given the limited availability of such data. Additionally, the limited battery life of the drone poses a constraint, although this could be addressed by integrating additional power sources. Regulatory hurdles, such as obtaining government approvals for drone operations, also present a challenge, requiring adherence to specific aviation and safety regulations for drone deployment in industrial settings.

Looking ahead, several avenues for future research can be considered. Exploring the field of explainable AI in safety detection could be highly beneficial. Researchers are making strides in addressing the “black box” nature of machine learning models, working towards making the decision-making processes more transparent and understandable. Moreover, we plan to test our model in real construction environments to evaluate its performance under practical conditions, including real-time images and video streams. Expanding the model’s scope to include the detection of a broader range of Personal Protective Equipment (PPE) items could further enhance workplace safety across various scenarios. In addition, we aim to develop and test similar models using alternative architectures to YOLO, conducting comparative analyses to identify the most effective solutions. This expansion will contribute to a more comprehensive understanding of PPE compliance and its impact on worker safety.
Finally, this study focused on developing a reliable model for detecting welding helmets in real time, prioritizing high accuracy and robustness. For future work, we aim to enhance the system’s functionality by incorporating real-time alerts to enable immediate responses to safety violations, utilizing detection software with alert capabilities inspired by the system developed by Hu et al. [16]. This addition will further strengthen the model’s practical application in construction and industrial environments.

Appendix

A unique dataset was generated especially for the purpose of this research. The dataset produced for this research is available upon request to the corresponding author.
