Introduction
Workers operating welding machines should adhere to strict safety protocols, including the use of a welding helmet, protective gloves, and fully covering clothing to mitigate the risk of burns. According to data from the Bureau of Labor Statistics (BLS), the United States experiences 21 welding-related accidents per 100,000 workers annually. The BLS further reports that for every 100 million work hours, approximately 1,000 workers sustain a welding-related injury, a rate that is nearly 100 times higher than the average injury rate for workers in other occupations [1]. The most common injury among welders is eye injury, particularly flash burns. A flash burn, also known as ‘welder’s flash’ or ‘arc eye’, refers to painful inflammation of the cornea, the transparent layer covering the front of the eye. This injury typically results from exposure to intense ultraviolet (UV) light, with welding torches being a primary source [2]. Tetteh et al. [3] reported that the use of eye and face shields could prevent approximately 90% of welding-related eye injuries. Similarly, Amani et al. [4] investigated occupational hazards among welders and found that 92% of welders suffer eye problems due to improper use of the safety equipment recommended for welding jobs. While traditional supervision techniques, such as employing safety inspectors, can be beneficial in certain instances, they are insufficient for ensuring comprehensive workplace safety. Consequently, advanced approaches leveraging artificial intelligence (AI) and deep learning should be implemented to enhance the monitoring and reduction of these injuries.
The use of required safety equipment is essential for safeguarding workers’ lives in any occupational setting. Welding, in particular, demands adherence to several safety equipment standards. One of the most critical safety measures during the welding process is ensuring proper face protection. Face and eye protection are crucial for preventing injuries, with the welding helmet being the primary piece of protective equipment, as shown in Figure 1(a) [5]. Failure to use a welding helmet can result in serious eye injuries, including flash burns, injuries from flying debris, and inflammation of the eyes caused by prolonged exposure to ultraviolet radiation. Some of these potential injuries are depicted in Figure 1 [6], [7].
Existing research in safety monitoring primarily focuses on the detection of conventional safety equipment, such as hard hats [8], [9], [10], [11] and safety vests [12], [13], [14]. However, a notable gap exists in the current literature regarding the detection and real-time monitoring of complex safety equipment, particularly welding helmets. The novelty of this paper lies in its examination of a previously unexplored object of interest – the welding helmet. Detecting welding helmets presents a distinct set of challenges compared to conventional hard hats. Welding helmets are designed to provide full-face protection, incorporating tinted visors to shield welders from the intense light of welding arcs. This tinting introduces significant variability in lighting, shadows, and reflections, complicating detection for standard object detection algorithms. Additionally, the dynamic nature of welding tasks, characterized by frequent and rapid head movements, poses additional challenges for consistent detection. The presence of sparks, smoke, and other artifacts further contributes to the complexity, as these elements can obscure or distort the appearance of the welding helmet. Unlike static hard hats, welding helmets are subject to varying degrees of wear and tear, with potential obstructions such as welding visors in different positions. As a result, the detection of welding helmets demands a robust algorithm capable of handling diverse environmental conditions, complex visual features, and rapid changes in the helmet’s appearance during welding operations.
In this paper, we present an implementation of an enhanced deep learning-based monitoring and detection system designed to provide a reliable real-time tool, leveraging the YOLO algorithm. Deep learning methodologies, particularly the YOLO algorithm, have become foundational in both academic research and practical applications for monitoring occupational safety. For instance, Chen et al. [15] developed an enhanced real-time detection algorithm based on YOLOv5 for detecting helmets (hard hats) and reflective vests. This improved model significantly boosted detection accuracy and convergence speed, achieving a mean average precision of 94.9%. Similarly, Hu et al. [16] proposed an improved YOLOv3-based hard-hat detection method, integrated into real-time detection software with an alert function. The system was successfully deployed across multiple construction sites, demonstrating both high accuracy and real-time performance. Benyang et al. [17] introduced a method for detecting helmet compliance using the YOLOv4 algorithm, employing construction site videos and web-sourced images as a dataset. Their model achieved a high accuracy rate of approximately 93%. Fu et al. [18] presented a hardhat detection model based on an optimized YOLOv5 algorithm, improving both feature fusion and multi-scale detection layers by incorporating an additional fusion scale layer for enhanced small-target recognition. Their model achieved an average detection accuracy of 95%, representing a 2.9% improvement over the initial model. Furthermore, the accuracy of helmet recognition reached 94.6%, marking a 2.4% increase. Similarly, Anushkannan et al. [19] developed a YOLOv3-based model for helmet (hardhat) detection, achieving a high mean average precision of approximately 97%. Shanti et al. [20] presented a novel approach for monitoring workers at heights using a pretrained object detector based on the YOLOv3 algorithm. The model demonstrated strong performance, with an accuracy of 91.26% and a precision of 99%. A notable finding of this body of research is the versatility and effectiveness of the YOLO algorithm in construction and industrial environments. The model developed in this study was further enhanced by integrating drone technology for real-time violation detection [21]. Among the various YOLO algorithms, YOLOv3 was selected due to its demonstrated reliability in detecting objects under dynamic conditions [22]. The YOLOv3-based model developed here specifically targets the detection of welding helmets, critical face protection equipment, on workers during welding activities, helping to mitigate the risk of face and eye injuries.
The primary contributions of this paper are twofold: first, it introduces a YOLOv3 architecture tailored for real-time monitoring of welding activities and helmet usage on construction sites. Additionally, this paper presents a comprehensive parametric study of YOLOv3, focusing on optimizing the detection accuracy of welding helmets. This study evaluates the influence of several key factors, including batch size, activation functions, and input image resolution, on the model’s performance. To further assess robustness, the model was tested on various image conditions, such as blurry, grayscale, and low-light images. Moreover, the research demonstrates the integration of drone technology with the developed model, resulting in a high-performance real-time monitoring system utilizing drone-captured images. This work addresses a specific gap in the field by focusing on welding helmet detection in construction environments, an area that has received significantly less attention compared to the detection of hard hats and reflective vests. Our system extends existing object detection methods by optimizing YOLOv3 for the specific task of welding helmet detection, thereby improving both accuracy and efficiency. Furthermore, we highlight the potential industrial impact of this work, particularly in enhancing worker safety through real-time monitoring, which can play a crucial role in reducing workplace accidents and injuries.
Methodology
This study is conducted in three distinct stages: dataset creation, model training, and model testing, as depicted in Figure 2. The first stage encompasses the collection, creation, and preprocessing of a dedicated welding helmet dataset. In the subsequent stage, the deep learning algorithm YOLOv3 is trained using the collected dataset. Finally, the model is tested to evaluate its performance. Detailed descriptions of each of these stages are provided in the following sections.
Flowchart of the methodology outlining the stages of dataset collection, model training, and model testing for real-time detection of welding helmets using the YOLOv3 algorithm.
A. Dataset Creation
One of the most critical tasks in developing and training a deep learning algorithm for accurate welding helmet detection on construction sites is the collection of a diverse dataset. Welding helmets vary in shape, design, and color, making it essential to gather a comprehensive set of images that reflect these variations. The dataset compiled for this study includes images of different types and colors of welding helmets, primarily focusing on workers wearing them during active tasks. The dataset was also diverse, encompassing various industrial environments with images from different sectors. It included a wide range of worker demographics, featuring both male and female workers from multiple geographic locations. Additionally, the dataset accounted for challenging conditions, such as low-light environments, to enhance the model’s robustness in detecting welding helmets across diverse scenarios. In this study, the two primary sources of training images are real-time photographs captured at actual construction and industrial sites, as well as web-based images. Figures 3 and 4 illustrate examples of the collected images used for training. Additionally, testing was conducted using a separate dataset comprising unique images collected from various sources, including CCTV footage from factories, online databases, and images obtained from construction sites.
Sample images used in creating the training dataset: (a) black helmet with wide lens, (b) white helmet with medium-sized lens, (c) brown helmet with narrow lens, (d) blue helmet with medium-sized lens, (e) black helmet with medium-sized lens, and (f) red helmet with medium-sized lens [23], [24], [25], [26], [27], [28].
Sample images used in creating the diverse training dataset, containing people from different demographics, working conditions, and lighting conditions: (a) male welder, (b) female welder, (c) external working environment, and (d) low-light condition.
After collecting and creating a dataset of 1,550 image files, these images underwent preprocessing to prepare them for training and testing the deep learning model. Preprocessing is widely recognized as a critical factor in achieving high model accuracy [29], [30]. In this study, various preprocessing methods and techniques were employed, including resizing and cleaning the images. Resizing, a common preprocessing step in deep learning, adjusts the dimensions of images to a specific size, thereby improving the computational efficiency of the model. Cleaning encompasses identifying and removing deficiencies in the dataset, such as missing, inaccurate, duplicate, irrelevant, or non-representative images. Although time-consuming, this step is essential for ensuring a clean and diverse dataset suitable for training. Following the preprocessing phase, the images were annotated, a crucial process in which the object of interest, in this study the welding helmet, was labeled in the training data. The goal of annotation is to provide the algorithm with ground truth data, allowing it to learn patterns and accurately detect and recognize objects in new, unseen data. Data annotation is a critical step in supervised learning, where the model is trained using input-output pairs. In this study, the annotation process was conducted using the LabelImg software, as demonstrated in Figure 5, which shows an example of the labeling performed.
Data annotation for one of the training dataset images using the LabelImg software; the welding helmet is annotated in the blue box.
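Depending on the export mode chosen in LabelImg, annotations may be saved either directly in the YOLO text format or as Pascal VOC XML, in which case a small conversion step is needed before Darknet training. The sketch below illustrates such a conversion; the directory names and the single class ID are illustrative assumptions rather than the exact layout used in this study.

```python
# Minimal sketch (assumed file layout): convert a LabelImg Pascal VOC XML
# annotation into the normalized YOLO text format used for Darknet training.
import xml.etree.ElementTree as ET
from pathlib import Path

def voc_to_yolo(xml_path: str, class_id: int = 0) -> list[str]:
    """Return YOLO-format lines: 'class x_center y_center width height' (all normalized)."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        x_c = (xmin + xmax) / 2.0 / img_w          # normalized box centre (x)
        y_c = (ymin + ymax) / 2.0 / img_h          # normalized box centre (y)
        w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines

# Hypothetical usage: one .txt label file per training image.
Path("labels").mkdir(exist_ok=True)
for xml_file in Path("annotations").glob("*.xml"):
    Path("labels", xml_file.stem + ".txt").write_text("\n".join(voc_to_yolo(str(xml_file))))
```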
B. YOLOv3 for the Detection of the Welding Helmet
The object detection model used in this study was based on the YOLOv3 algorithm, which represents a significant improvement over earlier YOLO versions. YOLOv3 enhances mean average precision (mAP) by up to 10% and increases the number of frames per second by 12% [31]. One of its key advancements is the integration of Darknet-53, a CNN architecture specifically optimized for object detection. Darknet-53, which adopts residual connections similar to those in ResNet, consists of 53 convolutional layers and achieves state-of-the-art performance on various object detection benchmarks. For YOLOv3, an additional 53 layers are added, resulting in a 106-layer fully convolutional architecture [31]. The YOLOv3 model performs detections at three different layers in the network: the 82nd, 94th, and 106th, as illustrated in Figure 6.
YOLOv3 architecture for the detection of welding helmets. Detections are performed at the 82nd, 94th, and 106th layers in the network [32].
The network downsamples the input image by factors of 32, 16, and 8 at the 82nd, 94th, and 106th layers, respectively. These values, known as strides, indicate the degree of downsampling, reflecting how much smaller the output at these points is compared to the network’s input.
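As a worked example, assuming a 416×416 input resolution (a common YOLOv3 setting used here only for illustration), these strides translate into the following detection grid sizes and numbers of candidate boxes:

```python
# Worked example: output grid sizes at the three YOLOv3 detection layers.
# A 416x416 input (assumed here for illustration) is downsampled by strides
# of 32, 16, and 8 at layers 82, 94, and 106.
input_size = 416
for layer, stride in [(82, 32), (94, 16), (106, 8)]:
    grid = input_size // stride
    # Each grid cell predicts 3 boxes, so the layer outputs grid*grid*3 candidate boxes.
    print(f"layer {layer}: stride {stride} -> {grid}x{grid} grid, {grid * grid * 3} boxes")
# -> 13x13 (507 boxes), 26x26 (2,028 boxes), 52x52 (8,112 boxes); 10,647 boxes in total.
```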
In this research, a YOLOv3 architecture is developed for the accurate detection of welding helmets on construction sites through a parametric study. The proposed architecture incorporates an additional 2D convolutional layer into the Darknet-53 network. This enhancement improves the network’s feature extraction capabilities, enabling it to capture more complex and abstract features from the input data and to generate a richer, more informative representation of objects within the image, thereby sharpening its ability to distinguish intricate patterns. Deepening the structure in this way also increases the model’s representational capacity and mitigates the risk of underfitting, since a more expressive model can capture more nuanced patterns in the training data. However, a balance must be maintained, as excessive layer addition can lead to overfitting. Although seemingly modest, the added convolutional layer was a deliberate enhancement aimed at improving feature extraction and overall performance in the specific task of detecting welding helmets in construction environments, reflecting the observation that even minor architectural adjustments can yield substantial performance gains in specialized applications. To evaluate the impact of this modification, three distinct models were trained, incorporating one, two, and three additional convolutional layers, respectively. As illustrated in Figure 7, the YOLOv3 architecture developed in this research incorporates the newly added layers, highlighted in orange, into the default structure. The model retains its three prediction layers, each responsible for detecting objects of a different size: small, medium, and large.
The YOLOv3 architecture with one additional convolutional layer added to the Darknet-53 network.
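Conceptually, the modification amounts to inserting one additional convolution, batch normalization, and activation block into the backbone before the detection heads. The PyTorch-style sketch below illustrates the idea only; the actual model was built in Darknet, and the channel count, kernel size, and insertion point shown here are assumptions.

```python
# Illustrative sketch (PyTorch), not the Darknet configuration used in the paper:
# an additional 3x3 convolutional block appended to a backbone feature extractor.
import torch
import torch.nn as nn

class ExtraConvBlock(nn.Module):
    """One added convolution + batch norm + activation, as in the modified backbone."""
    def __init__(self, channels: int = 1024):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)   # Darknet-53 uses leaky ReLU with slope 0.1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# Hypothetical usage: pass the deepest backbone feature map through the extra
# block before the first detection head.
backbone_out = torch.randn(1, 1024, 13, 13)     # deepest Darknet-53 feature map (stride 32)
extra = ExtraConvBlock(1024)
print(extra(backbone_out).shape)                # torch.Size([1, 1024, 13, 13])
```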
The selection of YOLOv3 in this research is driven by its faster processing time compared to most conventional object detection algorithms. This shorter processing time is crucial for real-time detection in construction and industrial environments. Moreover, YOLOv3 has demonstrated, both theoretically and in practice, its efficiency and reliability in such settings, making it a suitable choice for this application. The model was trained using a Google Colab notebook, utilizing the built-in GPUs provided by Colab through cloud computing. The GPU was an NVIDIA Tesla T4, featuring 15.36 GB of GDDR6 memory, paired with an Intel(R) Xeon(R) CPU operating at 2.00 GHz. Furthermore, Darknet was the deep learning framework used to build and train the model.
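A typical Colab workflow for this setup compiles Darknet with GPU support and launches training from the pretrained Darknet-53 convolutional weights. The commands below follow the standard Darknet training procedure; the data and configuration file names are placeholders rather than the exact files used in this study.

```python
# Sketch of a Colab training cell (paths and config names are illustrative).
import subprocess

cmds = [
    "git clone https://github.com/AlexeyAB/darknet",
    # Enable GPU/cuDNN/OpenCV in the Makefile before compiling.
    "sed -i 's/GPU=0/GPU=1/; s/CUDNN=0/CUDNN=1/; s/OPENCV=0/OPENCV=1/' darknet/Makefile",
    "make -C darknet",
    # Train from the Darknet-53 pretrained convolutional weights.
    "cd darknet && ./darknet detector train data/obj.data cfg/yolov3-custom.cfg darknet53.conv.74 -dont_show",
]
for cmd in cmds:
    subprocess.run(cmd, shell=True, check=True)
```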
Results and Discussion
The YOLOv3 architecture proposed in the previous section was tested using a dataset consisting of diverse images of welding helmets. Of the 1,550 total images, 70% (i.e., 1,085 images) were allocated for training, while the remaining 30% (i.e., 465 images) were used for testing the YOLOv3 model.
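One simple way to reproduce such a split is to shuffle the image list with a fixed seed and write the 70/30 partition into the train/test list files that Darknet reads; the directory layout below is an assumption for illustration.

```python
# Minimal sketch of the 70/30 split (1,085 training / 465 testing images).
import random
from pathlib import Path

images = sorted(Path("dataset/images").glob("*.jpg"))   # assumed image location
random.seed(42)                                          # fixed seed for reproducibility
random.shuffle(images)

split = int(0.7 * len(images))                           # 70% for training
train, test = images[:split], images[split:]
Path("train.txt").write_text("\n".join(map(str, train)))
Path("test.txt").write_text("\n".join(map(str, test)))
print(len(train), "training images,", len(test), "testing images")
```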
A. Training of the YOLOv3 Model
Several combinations of input hyperparameters, referred to as scenarios in this study, were tested to achieve optimal model performance. The first parameter evaluated was the input image size. Various sizes were assessed to determine the dimensions that would yield the highest accuracy. After completing the training, the final loss for each case was obtained. The YOLOv3 loss function is the sum of three components: the bounding-box coordinate (localization) loss, the objectness loss, and the classification loss. The total loss ($L$) is therefore expressed as \begin{equation*} L = \lambda_{coord} L_{coord} + \lambda_{obj} L_{obj} + L_{cls} \tag {1}\end{equation*}
The next two hyperparameters of interest were batch size and activation function. In this step, various combinations and scenarios of these hyperparameters were created, as summarized in Table 1. The selection of activation functions was guided by their optimal performance in object detection applications.
Trial tests were conducted to determine the optimal values for these two deep learning hyperparameters, namely batch size and activation function, within the model’s neural network. Other input hyperparameters were set as follows: number of epochs =1,000, learning rate =0.001, and input image size
After determining the optimal batch size, the focus shifted to evaluating five different scenarios (scenarios 3 to 7) related to activation functions. These activation function scenarios were analyzed and compared to identify the most suitable function for the model. An activation function governs whether a neuron should be activated by determining the relevance of its input to the network during the prediction process through mathematical operations. It plays a critical role in producing the output from the input values fed to a node or layer. In this research, five activation functions were tested: linear, tanh, ReLU, leaky ReLU, and swish. Table 2 provides a summary of these activation functions along with their mathematical representations.
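For reference, the five functions summarized in Table 2 can be expressed as simple NumPy operations; the leaky ReLU slope of 0.1 shown below is the Darknet default and is assumed here rather than taken from Table 2.

```python
# The five activation functions compared in Table 2, expressed with NumPy.
import numpy as np

def linear(x):      return x                                    # identity
def tanh(x):        return np.tanh(x)
def relu(x):        return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.1):  return np.where(x > 0, x, alpha * x)
def swish(x, beta=1.0):        return x / (1.0 + np.exp(-beta * x))   # x * sigmoid(beta*x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, tanh, relu, leaky_relu, swish):
    print(f"{f.__name__:>10}: {np.round(f(x), 3)}")
```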
Furthermore, Figure 10 presents a comparison of the training loss obtained when each of these activation functions was individually applied to the model. The results indicate that the model trained using the swish activation function performed the best. Additionally, it can be observed in the figure that leaky ReLU, swish, and ReLU exhibited very similar behavior throughout the training process. This similarity can be attributed to the fact that leaky ReLU represents an enhancement over traditional ReLU, and swish, in turn, is an improvement upon leaky ReLU.
Effect of activation functions on training performance of YOLOv3 when all other parameters are fixed (batch size=128 and input image size
After determining the optimal hyperparameters for the default YOLOv3 model, these parameters were applied to the modified YOLOv3 models. Three different models were trained, incorporating up to three additional convolutional 2D layers. Figure 10 presents a comparison of the training loss between the default YOLOv3 model and the developed models. As shown in Figure 11, the model with one additional layer (YOLOv
B. Testing of the YOLOv3 Model
The detection of welding helmets was performed on a distinct dataset using the developed YOLOv3 architecture, leveraging the hyperparameters outlined in Table 3, which were obtained during the training phase. As mentioned earlier, a total of 1,550 images were used for both training and testing, with the testing dataset comprising 30% (465 images) of the total. These testing images included welding helmets of various shapes, types, and colors, ensuring a comprehensive validation of the trained YOLOv3 model. Figure 12 illustrates the performance of the developed YOLOv3 model in detecting welding helmets with diverse attributes, such as different colors (e.g., white, blue, and black), shapes, angles, and lens sizes (e.g., wide, medium, and small). Following the testing phase, results were obtained for various evaluation metrics, including (1) precision, (2) recall, (3) accuracy, (4) F1 score, and (5) confidence in detection. These metrics were calculated to demonstrate the reliability and consistency of the model. The model was tested across all seven scenarios, and the evaluation metrics, which are interrelated, were closely analyzed. For instance, precision quantifies the proportion of detections that are correct, while recall measures the proportion of relevant objects that are detected [33]. The F1 score is the harmonic mean of precision and recall and ranges from zero to one, with higher values indicating better performance. Precision, recall, accuracy, and the F1 score were calculated using the following formulas:\begin{align*} \mathrm {Precision}& =\frac {\mathrm {TP}}{\mathrm {TP+FP}}, \tag {2}\\ \mathrm {Recall}& =\frac {\mathrm {TP}}{\mathrm {TP+FN}}, \tag {3}\\ \mathrm {Accuracy}& =\frac {\mathrm {TP+TN}}{\mathrm {TP+TN+FP+FN}}, \tag {4}\\ \mathrm {F1~score}& =\frac {2\times \mathrm {Precision}\times \mathrm {Recall}}{\mathrm {Precision}+\mathrm {Recall}}. \tag {5}\end{align*}
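These metrics follow directly from the confusion-matrix counts. The sketch below implements Equations (2) through (5); the example counts are purely illustrative and are not results from this study.

```python
# Evaluation metrics from Equations (2)-(5), computed from confusion-matrix counts.
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy  = (tp + tn) / (tp + fp + fn + tn) if (tp + fp + fn + tn) else 0.0
    f1        = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Hypothetical counts for illustration only (not results from this study).
print(detection_metrics(tp=450, fp=10, fn=15, tn=40))
```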
Developed model sample results when various shapes, colors, and angles of welding helmets were used (a) Black helmet with a confidence rate of 97% (b) Silver helmet with a confidence rate of 100% (c) Blue helmet with a confidence rate of 100% (d) Black helmet with a confidence rate of 100% (e) Blue helmet with a confidence rate of 99% (f) Blue and white helmet with a confidence rate of 99% (g) Black helmets with a confidence rate of 99% and 97% (h) Black helmet with a confidence rate of 100%.
True Positives (TP) represent the number of correctly detected objects, while False Positives (FP) indicate the number of detections in which the model incorrectly predicted an object that was not present. False Negatives (FN) refer to cases where the model failed to detect an object that was actually present, and True Negatives (TN) correspond to cases where no object of interest was present and the model correctly made no detection. Additionally, the confidence score reflects the model’s certainty that a particular bounding box contains the object of interest, such as a welding helmet. The confidence score (CS) is calculated as follows:\begin{equation*} \mathrm {CS}= \Pr (\text {object})\times \text {IoU}\times \Pr (\text {class}\vert \text {object}) \tag {6}\end{equation*}
where the objectness probability is obtained by applying a sigmoid to the raw objectness score,\begin{equation*} \Pr \left ({{ \mathrm {object} }}\right )=\frac {1}{1+\mathrm {e}^{-\mathrm {P}_{\mathrm {object}}}}, \tag {7}\end{equation*} the intersection over union (IoU) measures the overlap between the predicted and ground-truth bounding boxes,\begin{equation*} \mathrm {IoU}=\frac {\mathrm {Area~of~Overlap}}{\mathrm {Area~of~Union}}, \tag {8}\end{equation*} and the conditional class probability is obtained from a softmax over the class scores,\begin{equation*} \Pr (\mathrm {class}\vert \mathrm {object}) = \frac {\mathrm {e}^{\mathrm {P}_{\mathrm {class}}}}{\sum _{\mathrm {class}}\mathrm {e}^{\mathrm {P}_{\mathrm {class}}}}. \tag {9}\end{equation*}
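The sketch below shows how these three terms combine into a single confidence score following Equations (6) through (9); the raw scores and bounding boxes are illustrative values, not outputs of the trained model.

```python
# Confidence score from Equations (6)-(9): sigmoid objectness, IoU, and
# softmax class probability combined into a single detection confidence.
import numpy as np

def sigmoid(p):            # Eq. (7): objectness probability
    return 1.0 / (1.0 + np.exp(-p))

def iou(box_a, box_b):     # Eq. (8): boxes given as (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def softmax(scores):       # Eq. (9): class probability given an object
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Illustrative raw network outputs (not taken from the trained model).
p_object, class_scores = 2.0, np.array([3.5, -1.0])           # class 0 = welding helmet
pred_box, true_box = (50, 60, 200, 220), (55, 65, 205, 230)
cs = sigmoid(p_object) * iou(pred_box, true_box) * softmax(class_scores)[0]   # Eq. (6)
print(round(cs, 3))
```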
As illustrated in Figure 12, the model demonstrated a high detection rate with consistent accuracy. However, a few instances of FP were observed, where the model incorrectly identified objects as welding helmets. Figure 14 presents one such case, in which the model accurately detected three welding helmets but erroneously identified an additional object as a helmet. Although the occurrence of FP in this study was minimal, in scenarios where FP rates are significantly higher, they can compromise the model’s reliability, potentially causing operational disruptions and diminishing user confidence. To mitigate this issue, strategies such as augmenting the training dataset, refining the model’s architecture, or adjusting detection thresholds can be employed to improve model performance.
A sample result in which the model produced a false positive (the round black pipe).
In the preceding section, a systematic comparison of various batch sizes was conducted to determine the optimal configuration for the model. As detailed in Table 4, the results demonstrate that a batch size of 128 delivers the most effective and high-performing model. This batch size consistently outperformed the others, achieving values above 90% across all evaluation metrics. The data indicate that larger batch sizes facilitate more efficient convergence during training.
In addition to the enhanced model performance observed with a batch size of 128, this choice is also advantageous from a computational perspective. Larger batch sizes exploit the parallel processing capabilities of modern hardware, such as GPUs, leading to improved resource utilization and reduced training time. While smaller batch sizes produce more frequent updates to the model’s weights, larger batches benefit from reduced communication overhead and increased throughput. The batch size of 128 therefore offers a balance between accuracy and computational efficiency, making the training process more scalable and effective. This batch size, identified as optimal in the previous analysis, was subsequently employed to train the model using the five specified activation functions. Table 5 provides detailed results, highlighting the comparative performance of each activation function in terms of accuracy, precision, and other relevant metrics.
The model trained with the swish activation function demonstrated superior performance, achieving a precision of 93.80% and a recall of 92.0%. Additionally, it exhibited the highest confidence rate in its detections, reaching 92.31%. Conversely, the model trained with the linear activation function failed to detect any objects (i.e., welding helmets). This failure stems from the fact that a linear activation function is suited only to simple tasks that do not demand learning intricate patterns or features. The linearity of this activation function fundamentally constrains the expressiveness of the neural network, as the composition of multiple linear transformations yields an overall linear mapping. In contrast, deep learning architectures, particularly those engineered for object detection tasks, are purposefully designed to learn hierarchical and non-linear representations of input data. The reliance on a linear activation function therefore severely limits the model’s capacity to capture and learn intricate and complex features within the data.
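This limitation can be verified numerically: composing several linear layers without a non-linearity is mathematically equivalent to a single linear layer, so additional depth adds no expressive power, as the short check below demonstrates.

```python
# Demonstration of why a purely linear network fails: stacking linear layers
# (without a non-linearity) collapses to a single linear transformation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))
W1, W2, W3 = (rng.normal(size=(5, 5)) for _ in range(3))

deep_linear = W3 @ (W2 @ (W1 @ x))       # three "layers" with linear activations
single      = (W3 @ W2 @ W1) @ x         # one equivalent linear layer

print(np.allclose(deep_linear, single))  # True: depth adds no expressive power
```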
Furthermore, Table 6 summarizes the results achieved with various input image sizes. The findings indicate that an input image size of
Despite the satisfactory performance of the default YOLOv3 model when utilizing optimal hyperparameters, there is a pressing need for enhancement due to the critical importance of workplace safety. Even minor improvements in object detection capabilities can have significant implications for saving lives. To address this need, all optimal hyperparameters were employed as a baseline for training an enhanced YOLOv3 model. Subsequently, the performance of this model was compared with that of the default YOLOv3, as well as YOLOv4, YOLOv5, and YOLOv7 models, all trained on the same dataset and hyperparameters for consistency. Table 7 provides a comparative analysis of these YOLO versions in terms of their architectural design. All YOLO models employ a fully convolutional neural network (FCN), which enables the generation of dense pixel-wise predictions, surpassing the capabilities of traditional CNNs typically utilized for image classification. YOLOv3 utilizes Darknet-53 as its backbone feature extractor, while YOLOv4 and YOLOv5 incorporate CSPDarknet53. YOLOv7 employs Cross-Stage Bottom-Up and Top-Down Connections (CBS) architecture, and YOLOv8 leverages CSPNet (Cross Stage Partial Network). Moreover, YOLOv3, YOLOv4, and YOLOv5 share the same loss function, binary cross-entropy, which is widely used in binary classification tasks. This function quantifies the difference between the predicted probability distribution and the actual distribution for a binary classification task. In contrast, YOLOv7 adopts the focal loss function, a variant of cross-entropy loss specifically designed to mitigate class imbalance issues in both binary and multi-class classification scenarios. YOLOv8 utilizes a combination of both binary cross-entropy and focal loss. Additionally, YOLOv3 employs a feature pyramid network (FPN) for multi-scale feature representation, whereas the subsequent versions implement a path aggregation network (PANet) for the same purpose.
Additionally, Table 8 provides a comparative analysis of the discussed YOLO models, employing precision, recall, accuracy, F1 score, and average confidence as evaluation metrics. The YOLOv3 model with one additional layer (YOLOv
YOLOv5 exhibits lower results, particularly in precision (94.2%) and F1 score (94.2%) relative to YOLOv3. YOLOv4 maintains a balanced performance with a precision of 93.3% and recall of 93.3%. However, YOLOv7 and YOLOv8 show comparatively lower values across all metrics, indicating significant opportunities for improvement. The average confidence scores further elucidate model performance, with the YOLOv3 model with one extra layer achieving the highest confidence score at 93.17%, reinforcing its reliability in object detection tasks. Furthermore, Figure 15 illustrates the detection and confidence rates among the various models, confirming that the developed YOLOv3 model outperforms all others. The enhancement was achieved by incorporating a single additional convolutional layer, while two other models with two and three additional layers were tested but exhibited lower performance rates. Overall, the YOLOv3 model developed in this study demonstrated performance exceeding 96% across all evaluation parameters. Moreover, when comparing YOLOv3 with other YOLO versions, several advantages of YOLOv3 become evident, particularly in the context of this study. YOLOv3 typically demonstrates faster inference speeds on certain hardware configurations, which is essential for real-time applications. Its effectiveness in detecting small objects is enhanced by its multi-scale prediction capabilities, making it particularly advantageous for recognizing helmets at various scales. Additionally, YOLOv3 is less computationally intensive, allowing it to operate efficiently on limited resources, an important consideration in the construction industry, where access to advanced technology and expertise may be restricted. Furthermore, YOLOv3 offers superior compatibility with legacy systems and older hardware, which are often prevalent at construction sites [34]. We also conducted a comparison between our model (YOLOv
Comparison of the developed YOLOv3 model with other YOLOs (a) YOLOv3 - confidence rate 95% (b) YOLOv4 - confidence rate 87% (c) YOLOv5 – confidence rate 99% (d) YOLOv7 – confidence rate 82% (e) YOLOv3 (1 layer) – confidence rate 99% (f) YOLOv8 - Confidence rate 97%.
An additional parameter used to evaluate the performance of the developed model was the Area Under Curve (AUC)-Receiver Operating Characteristics (ROC) curve. The AUC-ROC is a widely utilized metric for assessing the performance of classification and detection models, providing insight into the model’s ability to distinguish between classes [37]. Specifically, it evaluates how effectively the model separates positive instances (e.g., the presence of a welding helmet) from negative ones (e.g., the absence of a welding helmet). In the context of our model, the classification results fall into two categories: safe (e.g., welding helmet detected) and unsafe (e.g., welding helmet not detected). The ROC curve is generated based on two key parameters: the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR, also known as recall, is defined in equation 3, while the FPR is calculated as shown in equation 10 [38].\begin{equation*} \mathrm {FPR}=\frac {\mathrm {FP}}{\mathrm {TN+FP}}. \tag {10}\end{equation*}
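In practice, the ROC curve and AUC can be computed from the ground-truth labels and the model's confidence scores, for example with scikit-learn; the labels and scores below are synthetic placeholders rather than the study's test data.

```python
# Sketch of computing the ROC curve and AUC with scikit-learn.
# Labels and scores here are synthetic placeholders, not the study's test data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])          # 1 = helmet present (safe)
y_score = np.array([0.95, 0.9, 0.3, 0.85, 0.4, 0.7, 0.2, 0.55, 0.8, 0.6])  # model confidence

fpr, tpr, thresholds = roc_curve(y_true, y_score)            # TPR = recall, FPR = Eq. (10)
print("AUC =", round(roc_auc_score(y_true, y_score), 3))
```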
C. Applications of the Object Detection Algorithm on Diverse Image Sources
After determining the optimal hyperparameters and training the developed YOLOv3 model on these parameters, the model was subjected to a new verification test. Initially, the model was tested using static images of workers engaged in welding tasks. This section extends the testing by introducing the use of a drone to capture images and videos of workers performing welding activities, followed by the detection process. Various parameters, including precision, recall, and detection time, were recorded during this experiment to evaluate the reliability and flexibility of the developed model. The results for each metric will be presented in the following subsections.
To conduct this test, a drone equipped with a camera was required. The Tello drone, a product developed by Ryze Tech in collaboration with DJI, was selected for this purpose. This drone was chosen for its ease of use, affordability, high-quality camera, stability, and hovering capabilities. The Tello drone is lightweight and designed primarily for testing applications, featuring key components such as four motors, propeller guards, and a high-resolution camera. The Tello drone is powered by an 1100 mAh battery and is designed to balance energy efficiency with a lightweight build, making it ideal for short recreational flights and basic aerial photography. Its efficiency is enhanced by its lightweight frame, allowing it to make the most of its limited power capacity, although factors such as wind, rapid maneuvers, and camera use can reduce the actual flight time. The Tello drone’s battery lasts approximately 13 minutes. To mitigate this constraint, solutions such as scheduled monitoring intervals and the integration of additional power sources can be implemented to ensure continuous and reliable monitoring. Typically, the battery life of a consumer drone ranges from 10 to 35 minutes; for example, the DJI Air 2S offers around 30 minutes of flight time [40]. When applying our system to real sites, more advanced drones can therefore be used; the Tello drone in this study was used only for experimental purposes. For this research, a controlled lab environment was established for drone testing. Various scenarios simulating workers engaged in welding tasks were projected onto a screen, and the drone maneuvered within the area, capturing images from multiple angles. Figure 17 illustrates the Tello drone in action, showcasing its hovering capabilities during the lab tests. Following image capture, detections were performed on the acquired images, with the model exhibiting exceptional accuracy and precision in detecting welding helmets. Performance metrics in this validation experiment remained consistently high, matching those from the initial tests. Both the overall precision and recall were 98%. Figure 18 provides a sample of these detections, demonstrating the model’s robust capability in accurately identifying welding helmets from drone-captured images.
(a) Tello drone and its components, (b) Tello drone hovering in the lab to take images of workers while performing the welding activity – Front view, and (c) Tello drone hovering in the lab to take images of workers while performing the welding activity – Side view.
Drone detection sample results: (a) detection 1 with a 77% confidence rate, and (b) detection 2 with confidence rates of 98% and 81%, respectively.
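For reference, drone image capture of this kind can be scripted with the open-source djitellopy wrapper around the Tello SDK; the capture cadence and file names in the sketch below are assumptions, and the exact capture pipeline used in this study may differ.

```python
# Sketch of drone image capture with the djitellopy library (an open-source
# Tello SDK wrapper); the capture cadence and file names are assumptions.
import time
import cv2
from djitellopy import Tello

drone = Tello()
drone.connect()
drone.streamon()                          # start the video stream
drone.takeoff()

try:
    for i in range(5):                    # capture a handful of frames while hovering
        frame = drone.get_frame_read().frame
        cv2.imwrite(f"capture_{i}.jpg", frame)   # frames are then passed to the detector
        time.sleep(2)
finally:
    drone.land()
    drone.streamoff()
```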
A key factor in the development of such systems is time efficiency, as the time taken to detect safety violations plays a crucial role in reducing or preventing workplace injuries or fatalities. In this study, the entire detection process took around eight seconds from the time the drone captured an image to when the model produced a detection result. Of this total, capturing and processing the image took about four seconds, while the model’s inference required approximately four seconds. This streamlined process ensures that the system operates efficiently, even in real-world industrial environments. Notably, this detection time is significantly shorter compared to the time typically required by a safety officer to manually identify and report such violations. The computational requirements for running our model are minimal, as the detection process can be executed efficiently on a standard CPU without the need for a GPU. If the model is hosted on a cloud-based provider, only an internet connection is required to perform detections remotely. However, when using a local executor, the model can function offline without any need for an internet connection. This simplicity makes the deployment of the model highly accessible, allowing it to be integrated seamlessly into various environments, both cloud-based and local, with minimal hardware demands.
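As an illustration of such CPU-only deployment, a Darknet-trained YOLOv3 model can be loaded and run with OpenCV's DNN module; the configuration and weight file names below are placeholders for the trained model files.

```python
# Sketch of CPU-only inference with OpenCV's DNN module; the config/weight
# file names are placeholders for the trained model files.
import time
import cv2

net = cv2.dnn.readNetFromDarknet("yolov3-helmet.cfg", "yolov3-helmet.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)        # plain CPU execution
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

image = cv2.imread("capture_0.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)

start = time.perf_counter()
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())   # three YOLO detection layers
print(f"inference time: {time.perf_counter() - start:.2f} s")
```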
To further assess the reliability and flexibility of the model, two additional image scenarios were tested. The first scenario involved presenting the model with blurry images, while the second utilized grayscale (colorless) images. These tests were designed to simulate real-world situations where consistently capturing high-quality images may not always be feasible. The results from both tests were highly encouraging, with the model achieving performance metrics comparable to those obtained using standard images. Figures 19 and 20 showcase sample detection results from the blurry and grayscale image tests, respectively. Despite the challenging conditions, the model maintained high accuracy and precision.
Sample detection results for blurry images: (a) normal image, confidence rate 95%; (b) blurred image, confidence rate 91%; (c) normal image, confidence rate 95%; and (d) blurred image, confidence rate 91%.
Sample detection results for greyscale images: (a) normal image, confidence rate 96%; (b) greyscale image, confidence rate 94%; (c) normal image, confidence rate 98%; and (d) greyscale image, confidence rate 96%.
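The degraded test conditions described here, and the low-light condition considered next, can be simulated from standard images with OpenCV; the kernel size and gamma value in the sketch below are illustrative choices rather than the settings used to produce the test images.

```python
# Sketch of how the degraded test conditions can be simulated with OpenCV:
# Gaussian blur, grayscale conversion, and a darkened (low-light) variant.
import cv2
import numpy as np

image = cv2.imread("test_image.jpg")                         # placeholder file name

blurred = cv2.GaussianBlur(image, (15, 15), 0)               # blurry test image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)               # grayscale test image
gray_3ch = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)            # 3-channel input for the detector

gamma = 3.0                                                  # gamma > 1 darkens the image
low_light = np.clip(255.0 * (image / 255.0) ** gamma, 0, 255).astype(np.uint8)

for name, img in [("blurred", blurred), ("grayscale", gray_3ch), ("low_light", low_light)]:
    cv2.imwrite(f"{name}.jpg", img)
```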
Two additional tests were also conducted. The first involved using images that contained various types of headgear, including welding helmets and hard hats, to evaluate the model’s ability to distinguish between different helmet types. The model performed exceptionally well, accurately differentiating between the different headgear. A sample result is shown in Figure 21. The second test assessed the model’s performance in low-light conditions to ensure its effectiveness across diverse environments reflective of real-world construction sites. Figure 22 illustrates a sample of the detection results under low-light conditions. The result shows that the model was capable of detecting the welding helmet in low light with a very high confidence rate of 99%.
The sample result for two different types of helmets present in an image. The confidence rate is 99%.
The sample result under the low light condition. The confidence rate in the detection is 99%.
D. Privacy Considerations and Worker Acceptance
Our strategy for implementing an automated monitoring system in industrial settings will prioritize both worker privacy and workplace safety. The primary objective is to enhance safety by ensuring compliance with welding helmet requirements, but we recognize the importance of addressing potential privacy concerns. To build worker trust and acceptance, we will maintain full transparency regarding the system’s purpose and functionality. We will clearly communicate that the system is designed solely to uphold safety protocols, not to conduct surveillance. In addition, we will adhere to laws that regulate the use of surveillance cameras on construction sites, which protect individuals’ privacy rights by requiring informed consent. This will involve posting visible signs around the site to notify workers and visitors that monitoring is in place. By emphasizing that our monitoring aligns with industry standards for safety, we reinforce the protective intent of the system, reassuring workers that it exists to safeguard them, not to invade their privacy.
Conclusion
Face and eye protection are critical for ensuring safety when working with construction and manufacturing tools. Welding, a primary task in the industrial sector, is associated with a high incidence of accidents involving the face and eyes. In response, this paper presents a novel deep learning model based on the YOLOv3 algorithm, aimed at detecting whether workers are wearing safety welding helmets during welding tasks. The model was trained on a custom dataset compiled from various sources, which served as the foundation for both training and performance evaluation. The development process involved testing seven distinct scenarios, varying parameters such as batch sizes, activation functions, and input image sizes. The model exhibited notably strong performance, particularly when using the swish activation function, which was identified as the optimal choice. To further validate its real-world applicability, a drone was deployed to assess the model’s performance in practical settings. In this study, five key metrics were used to evaluate the model’s performance: accuracy, precision, recall, F1 score, and the AUC-ROC curve. The model achieved exceptional results, with 98% precision, 98% recall, a 98% F1 score, and a notable AUC of 0.98. These metrics demonstrate the model’s high accuracy and reliability compared to the default YOLOv3 model, YOLOv4, and other models documented in the literature. Furthermore, the integration of the model with drone technology maintained consistently high performance, similar to the initial test results. The total time required for violation detection, from image capture by the drone to output generation, was approximately eight seconds. These findings highlight the potential of artificial intelligence and drone technology in enhancing worker safety in construction and industrial environments.
Despite the promising outcomes of this development, there are potential limitations. One significant challenge is data acquisition for model training, particularly in ensuring the diversity and representativeness of the dataset, given the limited availability of such data. Additionally, the limited battery life of the drone poses a constraint, although this could be addressed by integrating additional power sources. Regulatory hurdles, such as obtaining government approvals for drone operations, also present a challenge, requiring adherence to specific aviation and safety regulations for drone deployment in industrial settings.
Looking ahead, several avenues for future research can be considered. Exploring the field of explainable AI in safety detection could be highly beneficial, as researchers are making strides in addressing the “black box” nature of machine learning models and working towards making decision-making processes more transparent and understandable. Moreover, we plan to test our model in real construction environments to evaluate its performance under practical conditions, including real-time images and video streams. Expanding the model’s scope to include the detection of a broader range of Personal Protective Equipment (PPE) items could further enhance workplace safety across various scenarios. In addition, we aim to develop and test similar models using alternative architectures to YOLO, conducting comparative analyses to identify the most effective solutions. This expansion will contribute to a more comprehensive understanding of PPE compliance and its impact on worker safety.
Finally, this study focused on developing a reliable model for detecting welding helmets in real time, prioritizing high accuracy and robustness. For future work, we aim to enhance the system’s functionality by incorporating real-time alerts to enable immediate responses to safety violations, utilizing detection software with alert capabilities inspired by the system developed by Hu et al. [16]. This addition will further strengthen the model’s practical application in construction and industrial environments.
Appendix
A unique dataset was generated specifically for the purpose of this research and is available upon request from the corresponding author.