A Review of Computer Vision-Based Monitoring Approaches for Construction Workers’ Work-Related Behaviors

Construction workers’ behaviors directly affect labor productivity and their own safety, thereby influencing project quality. Recognizing and monitoring construction-related behaviors is therefore crucial for high-quality management and orderly construction site operation. Recent strides in computer vision technology suggest its potential to replace traditional manual supervision approaches. This paper explores research on monitoring construction workers’ behaviors using computer vision. Through bibliometrics and content-based analysis, the authors present the latest research in this area from three perspectives: “Detection, Localization, and Tracking for Construction Workers,” “Recognition of Workers’ Construction Activities,” and “Occupational Health and Safety Behavior Monitoring.” The volume of literature in this field has increased notably. Safety-related literature predominates, underscoring the concern for occupational health. Among vision algorithms, the use of object detection has increased. Ongoing and future research is expected to involve multi-algorithm integration and an emphasis on enhancing robustness. The authors then summarize the review in terms of engineering impact and technical suitability, and analyze the limitations of current research from the perspectives of technical approaches and application scenarios. Finally, the paper discusses future research directions in this field together with generative AI models. The authors hope this paper can serve as a valuable reference for both scholars and engineers.


I. INTRODUCTION
The construction industry is recognized as a labor-intensive field that involves multi-tasking operations. On-site workforce management can be challenging due to its cumbersome nature, which has the potential to result in low productivity and high safety risks [1]. Possible inactive construction activities, inefficiency, and irresponsible attitudes of operators have been found to result not only in time and resource wastage but also in economic loss for the entire project, along with a decline in safety and construction quality [2], [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Zhenhua Guo.
Lately, there has been a growing emphasis on proactively evaluating the occupational health and safety challenges confronted by construction laborers [4]. Furthermore, several countries have recognized the need to strengthen construction supervision regulations at the governmental policy level [5], [6].
Traditionally, professionals conduct on-site inspections and supervise and record work, which often leads to problems such as insufficient coverage, unreasonable personnel scheduling, time-consuming tasks, and subjective supervision results [7]. In recent years, many countries have embraced the concept of intelligent construction, driven by the development of information technology and the Internet of Things [8], [9], [10]. Thanks to technology and policy, several intelligent technologies, such as sensor technology, audio technology, and computer vision technology, are being employed in engineering construction supervision.
The primary types of sensors used for monitoring construction workers are position sensors [11] and acceleration sensors [12]. However, these sensors must be attached to the worker, which can affect the worker's comfort and increase equipment maintenance costs [13]. The audio-based approach cannot identify construction tasks that produce little or no sound [14]. In the field of civil engineering, computer vision technology has become a high-precision non-contact monitoring method, used for tasks such as crack monitoring [15], [16], [17], strain measurement [18], [19], and damage recognition [20], [21]. Vision-based approaches can also record worker activities in real time, allowing project managers to track construction site productivity, progress, and safety risks, and make timely decisions about project progress. Using computer vision to achieve intelligent control of smart construction sites is not only a frontier of current research but also a driving force for the development of smart construction sites [22].
Nonetheless, fewer studies have been conducted on monitoring worker behavior than on assessing building structure damage, as indicated by Mostafa and Hegazy [23]. In recent years, several comprehensive reviews have been conducted, focusing on safety concerns in engineering construction. Zhang et al. [24] and Fang et al. [25] have explored the application of computer vision technology in worker safety. From an economic development perspective, Luo et al. [26] have elaborated on construction safety issues, emphasizing emerging trends like deep learning and interdisciplinary technologies. Additionally, Zeng et al. [27] have pointed to artificial intelligence as a direction in construction safety research.
Construction workers' behavior plays a pivotal role in influencing construction quality [2]. Their trajectories, work behaviors, and safety practices need to be seamlessly integrated into intelligent management systems [24], [28]. In recent years, the number of works using computer vision to study construction workers' behavior has increased year by year, and the algorithmic techniques are constantly being updated and iterated. A comprehensive review is therefore needed to summarize the latest literature in the field in order to understand current research trends and status, challenges, and limitations. At the same time, we hope that this paper can serve as a reference for engineers and support innovation in construction management.
To achieve this objective, the remainder of the paper is organized as follows: Section II outlines the methodology employed for the literature search, Section III describes the reviewed literature, Section IV presents the computer vision techniques utilized in the reviewed research, Section V discusses application scenarios of computer vision techniques for workers' construction-related behaviors, Section VI summarizes the review from the technical and practical application perspectives, and finally, research gaps and limitations are identified and potential future research directions are suggested.

II. RESEARCH METHODOLOGY
Figure 1 shows the technical framework of this article. First, the authors establish the study's background in the introduction, providing the groundwork for subsequent research. Following this, a meticulous screening and analysis was conducted on 137 publications published since 2013, all of which are pertinent to the work behaviors of construction workers. Subsequently, based on the research content of these articles, the literature is examined from three perspectives: the localization and tracking of construction workers, the recognition of construction activities, and the monitoring of occupational health and safety behaviors. These three dimensions cover the critical issues of workers' movement trajectories, construction activities, and safety practices, all of which exert a direct or indirect influence on project quality control and the safety of construction personnel. Building upon the insights gleaned from the literature review, a comprehensive synthesis and discussion are presented, describing the current state of research and outlining prospective avenues for future investigation.
The authors conducted a search of relevant literature using Web of Science (WOS). The search string used was: (worker AND construction) AND ("computer vision" OR "vision-based" OR "deep learning" OR "image processing" OR vision). The retrieval process comprised three sequential phases. In the initial stage, utilizing the aforementioned string, 596 scholarly articles from journal publications and conference papers were obtained from the WOS core collection. In the second stage, review papers were excluded, and literature from 2013 to June 2023 was retained, leaving 496 papers. The third stage involved a careful manual check of the abstract, keywords, and body content to exclude content unrelated to "construction workers" and "machine vision," leaving 137 papers for analysis.
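The first two retrieval stages can be illustrated with a short sketch. The code below only approximates the screening logic and is not the WOS query engine: the record fields (`title`, `abstract`, `year`, `doc_type`) and the prefix-matching rule are assumptions made for illustration, the hyphenated form of the "vision-based" term is assumed, and the third, manual stage is not modeled.

```python
import re

# Vision-related terms from the search string quoted above.
VISION_TERMS = ["computer vision", "vision-based", "deep learning",
                "image processing", "vision"]


def has_term(text, term):
    """True if `term` starts at a word boundary in `text` (so 'worker' also matches 'workers')."""
    return re.search(r"\b" + re.escape(term) + r"\w*", text) is not None


def matches_query(text):
    """(worker AND construction) AND (any vision-related term)."""
    t = text.lower()
    return (has_term(t, "worker") and has_term(t, "construction")
            and any(has_term(t, v) for v in VISION_TERMS))


def screen(records):
    """Return the records surviving stage 1 (query) and stage 2 (type and year filter)."""
    stage1 = [r for r in records
              if matches_query(r["title"] + " " + r["abstract"])]
    stage2 = [r for r in stage1
              if r["doc_type"] != "review" and 2013 <= r["year"] <= 2023]
    return stage1, stage2
```

Calling `screen(records)` on a list of such dictionaries returns both intermediate lists, so the number of papers retained at each stage can be reported in the same way as in the text.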
Figure 2 illustrates the distribution of the 137 articles analyzed in this study by their year of publication and the number of citations they received in each year. It is evident that the number of publications in this field has grown rapidly since 2015. Moreover, the number of citations reveals that the research outcomes have received significant attention from the academic community. This trend could be attributed to the growing interest in intelligent construction technology and the advancements in computer vision technology since 2010. The popularity of computer vision was further enhanced with the introduction of the Faster R-CNN algorithm in 2015.

III. OVERVIEW OF REVIEWED LITERATURES
VOSViewer, a bibliometric software tool, was utilized to cluster the reviewed papers and identify current topics of interest in the field [29]. The authors utilized VOSViewer for an in-depth analysis of the papers' keywords, which led to the creation of a heat map (depicted in Figure 4). In generating this heat map, each point on the map is assigned a color based on the density of the elements in its vicinity. The higher the density, the closer the color is to red; the lower the density, the closer it is to blue. The magnitude of this density depends on the significance of these elements, providing a rapid overview of critical research areas [27]. It is evident that the central themes in the literature revolve around keywords such as "deep learning," "computer vision," "construction," and "workers," among others.
Figure 7 illustrates the research progress in each country or region, highlighting Mainland China's current leadership in this field. This may be attributed to the rapid growth of China's construction industry since the start of the 21st century, surpassing the United States in terms of construction output by 2012, with the gap in output widening over the following decade. However, this growth has also brought about negative consequences, including safety and environmental concerns, leading China to prioritize enhanced regulation, smart construction, and green construction. As a result, this concerted effort has catalyzed the swift advancement and implementation of computer vision technology in the construction industry. The United States, as the world's number one economy, has also made significant contributions to this field. With its status as a technology powerhouse, the United States provides fertile ground for the development of artificial intelligence technology. Furthermore, South Korea, Australia, and other countries have also shown significant development in the "computer vision + worker construction behavior + construction" field.

IV. TECHNOLOGIES

A. OVERVIEW OF COMPUTER VISION TECHNIQUES
Computer Vision, also known as Machine Vision, is a technology that uses mathematical algorithms and computers to [...] detection [85]. Image classification methods such as support vector machines (SVM) [86] can be used to classify objects based on their features. Object tracking methods like the Kalman filter [87] can track the position and motion state of objects in consecutive frames. Additionally, segmentation methods based on thresholding, edges, regions, etc. [88] can divide images into different regions or parts. While these traditional methods can be useful for tasks such as image recognition, their application in complex scenarios is limited due to the need for manually designed features.
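As an illustration of the thresholding-based segmentation mentioned above, the following minimal sketch applies Otsu's method to a flat list of 8-bit grey levels. Otsu's method is chosen here only as a well-known example; the reviewed studies do not necessarily use it, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of grey levels in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def segment(pixels, t):
    """Binary mask: 1 for pixels brighter than the threshold, else 0."""
    return [1 if p > t else 0 for p in pixels]
```

The threshold is derived from the image histogram alone, which is why such methods need no training data but also struggle with the cluttered, unevenly lit scenes typical of construction sites.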
Currently, deep learning-based computer vision methods have become mainstream. Convolutional neural networks (CNN) and their derivatives, such as ResNet [89], are widely used for various tasks, including image classification, detection, and segmentation. Recurrent neural networks (RNN) [90] are used for sequential data processing, such as video, speech, and text. Object detection networks such as Faster R-CNN [91] and YOLO [92] are used to locate target objects in images and estimate their size, while segmentation networks such as Mask R-CNN [93] are used to segment images into different parts or regions. Generative adversarial networks (GANs) [94] are used for image generation and restoration, as they can produce high-quality images. Furthermore, long short-term memory (LSTM) networks [95] can be used for visual tasks such as video classification and object tracking.
The computer vision technologies mentioned above provide powerful support for quickly extracting information from construction sites and identifying site conditions and worker behavior.

B. COMPUTER VISION ALGORITHMS' INNOVATION
The construction environment is characterized by its intricate and dynamic nature, where diverse categories of heavy machinery coexist and operate in parallel, alongside considerable personnel mobility. To better adapt computer vision technology to engineering needs, researchers have improved and innovated upon existing research results in the field of computer vision, updating the technical framework. For example, the recognition of steel bar-related activities performed by construction workers was addressed by Luo et al. [52] through a three-stream CNN that captures static spatial features, short-term motion, and long-term motion in video clips. Yang et al. [96] proposed the Spatial Temporal Relation Transformer (STR-Transformer), which better fuses temporal and spatial features in construction video clips. They also created a video clip dataset including seven types of construction workers' behaviors. Fang et al. [56] proposed an improved Faster R-CNN that achieves better multi-scale detection when detecting and locating workers and equipment on construction sites. Liu et al. [97] proposed a Multi-Domain Convolutional Neural Network to improve the tracking performance for construction workers. Huang et al. [98] achieved better hardhat-wearing detection performance for construction workers by improving YOLO v3. Park et al. [99] introduced DIoU and NMS for YOLO v5, utilizing weight triplet attention, expansion feature-level, and soft-pool to improve worker detection at construction sites, in particular the detection of overlapping objects in complex environments. Other studies have also involved various computer vision methods for algorithm improvements [100], [101], [102], [103], [104], [105], [106].
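To make the role of DIoU and NMS in overlapping-worker detection more concrete, the sketch below implements the generic Distance-IoU score and a greedy DIoU-based non-maximum suppression. It follows the standard textbook formulation and is not necessarily the exact variant used in [99]; the box format `(x1, y1, x2, y2)` and the threshold value are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """Distance-IoU: IoU minus the normalized squared distance between box centers."""
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    d2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    # Squared diagonal of the smallest box enclosing both a and b.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou(a, b) - (d2 / c2 if c2 > 0 else 0.0)


def diou_nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep a box unless its DIoU with an already kept box exceeds thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(diou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep
```

Because the center-distance penalty lowers the score of boxes whose centers are far apart, two partially overlapping workers are less likely to be merged into one detection than under plain IoU-based suppression.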
Establishing a highly stable vision algorithm model is partly dependent on high-quality training data. While some open-source databases like COCO and ImageNet are available, they have limited applications in the construction industry. To address this limitation, Yang et al. [96] created a video clip dataset comprising seven types of construction workers' behavior. Jixiu et al. [107] also developed an open-access dataset called GDUT-HWD, which includes 3174 images for hardhat detection. Additionally, several studies have developed new databases, including Yang et al. [108], Luo et al. [52], Tian et al. [44], and Xiong et al. [48]. These efforts have resulted in an increased availability of training data for construction-related computer vision tasks.
Section IV-A emphasizes that different classes of vision algorithms can fulfill distinct functions, and for construction sites with complex scenes, a single class of algorithm may not suffice for behavior recognition. Thus, combining multiple algorithms is necessary. For instance, Xiao et al. [109] proposed a construction worker tracking method that utilizes the features of various algorithms, such as Mask R-CNN and the Kalman filter, for tracking multiple construction workers, even in the face of challenges like occlusion and feature scale variation. Piao et al. [110] proposed a framework for computer vision methods that incorporates a Dynamic Bayesian Network, Openpose, and Faster R-CNN to assess the fall risk of construction workers during dynamic construction. Roberts et al. [66] developed a computer vision-based method for construction worker activity analysis, which included YOLO v3, Alphapose, and a CNN-based i3D. Their findings indicated that incorporating pose estimation enhances construction activity analysis. Fang et al. [38] integrated Faster R-CNN for detecting construction activities, SORT (Simple Online And Realtime Tracking) for tracking individuals, and face recognition for confirming the identity of workers, to eventually identify whether there are uncertified workers engaged in irrelevant construction tasks. Li et al. [78] used YOLO v5 to detect Personal Protective Equipment (PPE) and Openpose to detect skeleton joints, then combined the visual feature information in a 1D-CNN to determine whether the PPE was correctly used. Ding et al. [58] combined CNN and LSTM to identify construction workers' unsafe behaviors. Cai et al. [84] used Faster R-CNN to detect the head and body orientation of construction workers, and subsequently applied a multi-task learning network to assess the visual attention direction of construction workers. Fang et al. [51] combined Mask R-CNN and a Cascaded Pyramid Network to localize construction workers in monocular vision, then identified their unsafe behaviors. Table 1 presents other papers that integrated multiple algorithms to achieve improved visual recognition.

V. APPLICATIONS
This section provides an in-depth analysis of the reviewed articles from the application perspective. The literature selected in this paper considers all construction-related behaviors, including workers' movement trajectories, construction activities, and safety behaviors. These behaviors have indirect or direct impacts on project quality control and workers' safety. The section analyzes the literature from three perspectives: construction workers' location and tracking, workers' construction activities recognition, and monitoring of occupational health and safety behaviors.
Figure 8 illustrates the distribution of literature among various application domains, with the majority of the research dedicated to occupational health and safety behavior monitoring, constituting over 70% of the total literature. This finding suggests that safety in construction is a significant concern among researchers worldwide. Within this domain, researchers show a higher inclination towards detecting Personal Protective Equipment (PPE) use, which constitutes 32.8% of the total reviewed literature.

A. DETECTION, LOCALIZATION AND TRACKING FOR CONSTRUCTION WORKERS
Engineering construction involves multi-tasking operations with a high turnover of personnel, and the locations and trajectories of construction workers affect both productivity and efficiency. Thus, detecting and tracking construction workers is a crucial task in construction management. Early studies on detecting construction workers often relied on a single algorithm, such as SVM. For instance, Memarzadeh et al. [72] extracted HOG features and then used SVM to detect construction workers and equipment. On the other hand, Fang et al. [56] and Son et al. [62] both used the Faster R-CNN algorithm for object detection and achieved better performance. Park et al. [99] proposed SOC-YOLO to improve the detection of construction workers in complex scenes when they overlap. Achieving more accurate localization of individual workers over time requires combining various algorithms [112], [117]. For instance, Yang et al. [112] added DeepSORT after Faster R-CNN for multi-object detection and then used the Person ReID algorithm to localize individual workers over time. Some researchers have extended worker detection applications. For example, Huang et al. [118] and Mei et al. [119] used skeleton-based and object detection-based approaches, respectively, to monitor whether a worker has intruded into a prohibited area based on localization. Wei et al. [54] integrated spatial attention networks and temporal attention networks to recognize individual identities.
Cai et al. [84] used Faster R-CNN to detect the head and body orientation of construction workers, and then applied a multitask learning network to evaluate the visual attention direction of workers. Yan et al. [39] first detected workers using Faster R-CNN and then applied 3D pose estimation to locate them in 3D space.
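The localization-based intrusion monitoring described in this subsection can be sketched with a simple geometric test. The example below is an illustrative assumption rather than the method of [118] or [119]: it takes the bottom-center of each detected bounding box as the worker's ground position in image coordinates and checks it against a prohibited zone drawn as a polygon, using the standard ray-casting rule. All function names are hypothetical.

```python
def point_in_polygon(pt, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from pt crosses."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def foot_point(box):
    """Bottom-center of a box (x1, y1, x2, y2); the image y axis points downward."""
    return ((box[0] + box[2]) / 2, box[3])


def intrusions(worker_boxes, zone):
    """Indices of detected workers whose foot point lies inside the prohibited zone."""
    return [i for i, box in enumerate(worker_boxes)
            if point_in_polygon(foot_point(box), zone)]
```

In practice the zone would usually be defined in ground-plane coordinates after camera calibration; working directly in image coordinates, as here, is a simplification.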
Worker tracking builds on the localization and detection of construction personnel and supports a better understanding of labor allocation and mobility within construction sites. The Kalman filtering algorithm has found wide-ranging applications in the field of object tracking, as evidenced by the reviewed literature [109], [113], [120], [121], [122]. It is common to combine Kalman filtering with object detection algorithms. For example, Neuhausen et al. [120], [121] used YOLO v3 to detect construction workers and then tracked them using the Kalman filter, demonstrating the robustness of the Kalman filter in compensating for the limitations of YOLO v3. Yongyue et al. [122] proposed a different approach by combining Kalman filtering with the individual pose joint point estimation algorithm, Openpose, to reduce the manual annotation workload. Similar results can be achieved by combining object detection algorithms with other techniques. Son and Kim [63] proposed the combination of YOLO v4 and a Siamese network for construction worker detection and tracking, achieving an accuracy of 0.975. Wan et al. [106] improved YOLO v5 with an attention mechanism, atrous spatial pooling, and universal upsampling to track construction workers by classifying hardhats, which in turn identified unauthorized intrusions. Angah and Chen [123] integrated Mask R-CNN, Matching, and Rematching, enabling multiple construction worker tracking on the screen. Liu et al. [97] introduced a multidomain representation for CNNs to improve tracking in complex dynamic scenes. Furthermore, researchers have expanded tracking into 3D space based on 2D tracking. Cai and Cai [81] proposed a hybrid approach of visual tracking and radio localization to address the problem of easily losing targets in visual recognition. Lee and Park [124] employed two cameras for separate tracking and localization, followed by Entity Matching for 3D tracking.
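The way a Kalman filter can compensate for missed detections is illustrated by the following minimal sketch. It is not the implementation of any reviewed study: it assumes a constant-velocity motion model per image axis, a simplified diagonal process noise, and placeholder noise values `q` and `r`. Detections are worker box centers, with `None` marking frames in which the detector returned nothing.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis (state: position, velocity)."""

    def __init__(self, pos, q=1.0, r=4.0):
        self.p, self.v = float(pos), 0.0
        self.P = [[10.0, 0.0], [0.0, 10.0]]  # state covariance (assumed initial value)
        self.q, self.r = q, r                # process / measurement noise (assumed)

    def predict(self, dt=1.0):
        """Propagate the state: x <- F x, P <- F P F^T + Q, with F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        P = self.P
        p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        s = self.P[0][0] + self.r
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        y = z - self.p
        self.p += k0 * y
        self.v += k1 * y
        P = self.P
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]


def track(centres):
    """Smooth a sequence of (x, y) detections; None entries are filled by prediction."""
    first = centres[0]
    kx, ky = AxisKalman(first[0]), AxisKalman(first[1])
    out = [(float(first[0]), float(first[1]))]
    for c in centres[1:]:
        x, y = kx.predict(), ky.predict()
        if c is not None:
            kx.update(c[0])
            ky.update(c[1])
            x, y = kx.p, ky.p
        out.append((x, y))
    return out
```

For a worker moving steadily to the right, a frame with a missed detection is bridged by the predicted position instead of breaking the track, which is the behavior the detector-plus-filter combinations above rely on.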
LSTM networks can analyze input data with a time series component. To predict the action trajectories of construction workers, Cai et al. [83] proposed a context-aware LSTM-based method. Similarly, Tang et al. [73] combined LSTM with a mixture density network to achieve the prediction of construction workers' action trajectories.
Computer vision-based detection and tracking methods have the potential to effectively assist project managers in analyzing the precise location information of construction workers. As computer vision technology continues to advance, the accuracy and stability of tracking methods will gradually improve, leading to enhanced tracking capabilities of individuals in 3D space and improved prediction of action trajectories.

B. RECOGNITION OF WORKERS' CONSTRUCTION ACTIVITIES
The accurate recognition of workers' construction activities plays a crucial role in enabling project managers to effectively monitor construction progress and optimize the allocation of labor resources. It serves as a valuable tool for analyzing labor productivity. Previous studies primarily employed traditional image processing algorithms for construction activities recognition. For instance, Liu et al. [125] utilized a silhouette-based approach, while Yang et al. [108] employed an SVM approach for classifying image features. However, with technological advancements, Yang [126] enhanced the method by incorporating data-driven scene parsing while retaining the dense trajectories.
VOLUME 12, 2024
In the last several years, the advancement of convolutional neural networks has led to reduced training costs for object detection algorithms, as well as improved detection speed and accuracy. These advancements have facilitated the application of object detection algorithms in construction activities recognition. Luo et al. [43] employed Faster R-CNN for the identification of construction workers and various entities depicted in images captured at construction sites. They employed a relevance network to recognize multiple construction activities by analyzing the spatial relationship between them. Similar studies include Fang et al. [38], who employed SORT and face recognition for the detection of non-certified work, and Li et al. [75], who used CenterNet to detect construction workers and objects to evaluate construction productivity while recognizing reinforcement assembly activities.
The research group led by Luo Xiaochun and Li Heng has conducted extensive investigations in construction activity recognition, combining object detection algorithms with complementary methods. For instance, they combined YOLO v3 with SORT and 3D CNN to achieve spatial localization and construction activities recognition of construction workers [41]. In another study, the researchers utilized YOLO v3, SORT, KCF (Kernelized Correlation Filter), C3D, and CRF (Conditional Random Field) algorithms in multiple steps, enabling the recognition of construction activities performed by workers in groups [50]. Considering the spatio-temporal properties of construction activities, Li et al. [76] directly employed the Faster R-CNN algorithm to recognize construction activities in images, simultaneously detecting construction workers and determining the activities performed by different operators based on the spatial relationship between individuals and activities. Bhokare et al. [127] also proposed a method based on YOLO v3 for direct recognition of construction activities.
To effectively integrate construction activities with spatiotemporal information, utilizing video clips for recognition is a more promising approach. Previous studies have proposed the use of multistream convolutional neural networks [45], [52]. In recent years, research methods based on videos have become more diverse. Roberts et al. [66] incorporated pose estimation algorithms into video-based construction activities recognition. Similarly, Cai et al. [128] introduced an attention direction estimation method to identify groups of construction workers and subsequently classified their activities using LSTM. Li et al. [116] identified three activities of construction workers (throwing, operating, and crossing) through YOLO and ST-GCN. Torabi et al. [129] and Yang et al. [96] employed YOWO53 and Transformer, respectively, demonstrating high recognition accuracy on today's advanced computing hardware. Li and Li [79] applied Openpose and GAN to estimate the complete skeleton joints of construction workers under occlusion, then used ResNet to recognize construction activities.
Table 2 presents the comprehensive details of the reviewed literature in this section. The literature encompasses a broad spectrum of application scenarios, extending beyond the physical activities of construction workers to encompass various processes within civil engineering construction. These technologies hold significant potential for facilitating intelligent construction and enhancing project management.

C. OCCUPATIONAL HEALTH AND SAFETY BEHAVIOR MONITORING FOR CONSTRUCTION WORKERS

1) POSE ESTIMATION
The analysis of construction workers' body posture enables the understanding of their fatigue level, safety condition, and construction behavior. While algorithms like Openpose [130] and Alphapose [131] offer rapid and accurate estimation of 2D skeleton joint points, it is important to note that 2D pose estimation may not fully capture the nuanced movements of construction workers. Consequently, researchers in the engineering field have placed greater emphasis on pose estimation in 3D space.
Earlier studies relied on 3D depth cameras, as vision algorithms were not as advanced. Han et al. [68] employed a depth camera to acquire data and reconstruct the 3D coordinates of joint points belonging to construction workers. This approach facilitated the detection of perilous actions performed during the process of ascending ladders. Khosrowpour et al. [69] also employed a depth camera to acquire 3D joint point data and performed activity classification. Liu et al. [132] utilized a stereo video camera to capture 3D skeleton joint points. In an early-stage approach, Seo et al. [133] transformed motion capture data obtained from vision-based methods for biomechanical analysis. In another study, Seo et al. [37] employed shape- and radial histogram-based features to estimate the posture of construction workers. Additionally, Seo et al. [134] compared multiple pose estimation methods and concluded that the sensor-based approach yielded the least error.
In the past few years, the rise of deep learning methods has facilitated direct vision-based pose estimation. Zhang et al. [46] employed a multi-stage convolutional neural network to estimate the 3D pose of construction workers using monocular vision. Yu et al. [36] utilized a vision-based approach to obtain the 3D spatial pose of construction workers and further collected ergonomic information through an intelligent insole, enabling the analysis of their physical condition. In another study, Yu et al. [47] curated a construction posture dataset encompassing various postures to facilitate 3D posture estimation. Chu et al. [135] developed an ergonomic 3D posture assessment framework for construction workers, which integrated tracking algorithms, 2D detection, and 3D body generation. Kim et al. [136] constructed a synthetic dataset to enhance 3D pose estimation, while Tian et al. [44] compiled a construction worker motion dataset that can contribute to 3D pose recognition. Table 3 provides a list of typical studies in pose estimation.
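Once 3D joint coordinates are available, ergonomic indicators are typically derived from joint angles. The sketch below shows this step in its simplest form; it is not taken from any of the cited frameworks. The joint names, the z-up axis convention, and the choice of trunk flexion as the example indicator are assumptions for illustration.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a-b-c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    nu = math.sqrt(sum(ui * ui for ui in u))
    nv = math.sqrt(sum(vi * vi for vi in v))
    if nu == 0 or nv == 0:
        raise ValueError("joint points must be distinct")
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_angle = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(cos_angle))


def trunk_flexion(hip, neck):
    """Angle in degrees between the hip-to-neck vector and the vertical (z-up assumed)."""
    above_hip = (hip[0], hip[1], hip[2] + 1.0)  # point directly above the hip
    return joint_angle(above_hip, hip, neck)
```

For example, `joint_angle(shoulder, elbow, wrist)` gives the elbow angle, and `trunk_flexion(hip, neck)` gives how far the trunk is bent away from upright; such angles could then be compared against thresholds from an ergonomic assessment scheme.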

2) UNSAFE BEHAVIOR RECOGNITION
During construction tasks, it is crucial to monitor the actions of construction workers as they may engage in hazardous movements or enter dangerous areas due to various factors. Such behaviors pose significant safety risks and require vigilant monitoring.
Computer vision techniques facilitate extracting postural features related to unsafe behaviors of construction workers and assessing safety by analyzing their spatial location in relation to other entities. A research team from Huazhong University of Science and Technology developed a safety monitoring framework during the Wuhan Metro construction project, which effectively detects and predicts unsafe behaviors, including instances of construction workers illegally walking on structural supports [32], [33], [53], [57]. The risk of falls is a major concern for construction workers, and Piao et al. [110] successfully employed a Dynamic Bayesian Network, Openpose, and Faster R-CNN to detect fall risks step by step. During hoisting operations, workers face the risk of object strikes in dangerous areas. Chian et al. [138] utilized CenterNet to detect and track construction workers entering these hazardous zones. During the Covid-19 pandemic, Chian et al. [139] also employed CenterNet to monitor safe distances between construction workers. Ladder climbing poses a potential fall risk, which has been addressed by Han et al. [70], Anjum et al. [59], Ding et al. [58], and Chen et al. [140] using various computer vision methodologies. Furthermore, several computer vision-based studies have contributed to the recognition of unsafe behaviors, encompassing safety risk determination [71], [141], fall detection [142], and identification of unsafe actions [48], [65], [143], [144].
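The safe-distance monitoring mentioned above reduces, after detection, to a pairwise distance check. The following sketch is a simplified illustration and not the procedure of [139]: it assumes that worker positions have already been projected to ground-plane coordinates in metres, and the 2.0 m threshold is a placeholder value.

```python
import math
from itertools import combinations


def close_pairs(positions, min_dist=2.0):
    """Return index pairs (i, j) of workers standing closer than min_dist to each other."""
    return [(i, j)
            for (i, p), (j, q) in combinations(enumerate(positions), 2)
            if math.dist(p, q) < min_dist]
```

For instance, `close_pairs([(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)])` flags only the first two workers. The same check could be applied between workers and the boundary of a hazardous zone, at the cost of a calibration step to map image pixels to site coordinates.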
Construction sites often involve the simultaneous operation of various large machinery and random stacking of materials, increasing the risk of collisions between construction workers and these objects. The application of computer vision methods for assessing collision risks between workers and machinery or materials plays a vital role in ensuring the safety of construction workers. In earlier studies, Kim et al. [145] employed Gaussian mixture models to identify collision risks between construction workers and mechanical equipment. Fang et al. [51] utilized Mask R-CNN and Cascaded Pyramid Network to identify the relationships between construction workers, equipment, and materials using monocular vision. Zhang et al. [74] applied Faster R-CNN to detect workers and machines at construction sites, obtaining their location coordinates and determining collision risks through fuzzy inference. Yan et al. [35] employed the Faster R-CNN algorithm to detect key points of construction trucks, followed by 3D reconstruction to establish distance recognition in the 3D scene. Shin and Kim [146] integrated YOLO v3 and the Openpose algorithm to identify collision risks between construction workers and trucks. Zhang and Ge [77] employed a Transformer-based approach to identify dynamic collision risks between construction workers and tower cranes. Additionally, other studies have contributed to collision risk identification [64], [147], [148] and fall risk identification [149]. Table 4 presents some of the typical research literature in this field.
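As a rough illustration of how location coordinates can be turned into a collision warning, the sketch below grades risk by the gap between a worker box and a machine box. It deliberately replaces the fuzzy inference and 3D reconstruction used in the cited studies with fixed thresholds, and it assumes the boxes are already expressed in ground-plane meters after camera calibration. The threshold values and function names are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed to be in meters


def box_gap(a: Box, b: Box) -> float:
    """Shortest distance between two axis-aligned boxes; 0.0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def risk_level(worker: Box, machine: Box, warn: float = 5.0, danger: float = 2.0) -> str:
    """Map the worker-machine gap to a coarse risk label (thresholds are assumptions)."""
    gap = box_gap(worker, machine)
    if gap <= danger:
        return "danger"
    if gap <= warn:
        return "warning"
    return "safe"


worker = (0.0, 0.0, 1.0, 1.0)
print(risk_level(worker, (2.0, 0.0, 6.0, 3.0)))      # danger  (gap 1 m)
print(risk_level(worker, (4.0, 0.0, 8.0, 3.0)))      # warning (gap 3 m)
print(risk_level(worker, (10.0, 10.0, 12.0, 12.0)))  # safe
```

A fixed threshold ignores speed and heading; approaches such as fuzzy inference exist precisely because a slow-moving machine at 3 m and a fast-approaching one are not equally risky.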

3) PPE'S USAGE INSPECTION
Safety accidents are prevalent in engineering construction, with a significant number of casualties resulting from fall accidents and object strikes, as reported by the Ministry of Urban and Rural Construction of the People's Republic of China and the U.S. Bureau of Labor Statistics [152], [153], [154]. To mitigate the occurrence of such casualties, various countries have implemented mandatory policies and regulations regarding personal protective equipment (PPE). For instance, the State Administration of Quality Supervision, Inspection and Quarantine of China and the U.S. Occupational Safety and Health Administration have specified the use of hardhats during construction operations [155], [156] and require safety harnesses when working at heights. With the advancements in artificial intelligence technology, numerous computer vision-based studies have emerged to address the detection of PPE usage.
Before the rapid advancement of deep learning techniques, traditional methods for processing image features and machine learning algorithms were more commonly employed. For instance, Park et al. [157] utilized image processing methods to extract the spatial relationship between construction workers and hardhats. They subsequently employed SVM to detect and match this feature, determining whether construction workers were wearing hardhats or not. Mneymneh et al. [158], [159], [160] extracted various image features such as SURF and employed template matching and cascade classifiers for hardhat detection. In order to achieve better performance in safety vest detection, Seong et al. [161] compared various features and classifier algorithms. The results demonstrated that combining the C4.5 classifier with YCbCr and SVM classifier yielded superior outcomes. Similar to previous application areas, deep learning methods are rapidly replacing traditional techniques. Weili et al. [55] employed Faster R-CNN to detect the usage of safety harnesses by construction workers at heights. In another study conducted by the same group, Fang et al. [42] employed Faster R-CNN for far-field monitoring at construction sites to detect hardhat wearing. The utilization of the Faster R-CNN algorithm has significantly enhanced image recognition accuracy [162]. Fu et al. [163] employed this algorithm for hardhat wearing detection, while Kamoona et al. [164] applied it to detect high-visibility vests.
Chen et al. [184] utilized YOLO v3 and Openpose to extract features related to PPE and individual joints. They employed a scene graph representation to describe the spatial relationship between joints and PPE, enabling the detection of multiple PPE items such as hardhats, dust masks, safety glasses, and safety belts. Li et al. [78] introduced the hierarchy of control and emphasized the importance of ensuring correct usage of PPE from an administrative control perspective. By combining the features extracted by YOLO v5 and Openpose, they employed 1D-CNN for classification and detection of incorrect usage of hardhats and safety harnesses. Table 5 provides further details on some typical studies.
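The reason for pairing an object detector with a pose estimator can be illustrated with a much simpler rule than the scene graph or 1D-CNN classifiers described above. The sketch below assumes that one model has returned hardhat boxes and another has returned a head keypoint per worker, and it treats a hardhat as worn only if some hardhat box contains that keypoint. The worker IDs, pixel margin, and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Point = Tuple[float, float]


def contains(box: Box, point: Point, margin: float = 0.0) -> bool:
    """True if the point lies inside the box, enlarged by `margin` pixels on each side."""
    x1, y1, x2, y2 = box
    x, y = point
    return x1 - margin <= x <= x2 + margin and y1 - margin <= y <= y2 + margin


def hardhat_status(
    head_points: Dict[str, Point],
    hardhat_boxes: List[Box],
    margin: float = 10.0,
) -> Dict[str, bool]:
    """For each worker, report whether any detected hardhat sits at the head keypoint."""
    return {
        worker_id: any(contains(box, head, margin) for box in hardhat_boxes)
        for worker_id, head in head_points.items()
    }


# Two hypothetical workers; only one hardhat is detected, near the first worker's head.
heads = {"w1": (120.0, 60.0), "w2": (300.0, 80.0)}
hats = [(100.0, 40.0, 140.0, 70.0)]
print(hardhat_status(heads, hats))  # {'w1': True, 'w2': False}
```

Detection alone would only say that a hardhat is present in the frame. Tying it to a joint is what distinguishes a hardhat on the head from one carried in the hand, which is the kind of incorrect usage the studies above target.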

D. SUMMARY
This section presents an overview of computer vision applications for recognizing construction worker behavior on construction sites, primarily focusing on occupational health and safety risks. These articles represent significant advancements in utilizing computer vision technology to enhance both productivity and safety in engineering construction. Compared to other sensing and monitoring technologies, computer vision technology offers immense potential for development, cost-effectiveness, and a wide range of applications. According to survey results, CNNs emerge as the most widely employed technique among researchers, leveraging their ability to rapidly extract features compared to traditional machine learning algorithms. With the continuous evolution of CNNs, object detection technology has been extensively adopted, demonstrating strong adaptability in monitoring worker construction behavior. It has achieved breakthroughs in detection performance and real-time monitoring, enabling the tracking of construction workers' location, activities, and safety risks. As object detection technology continues to advance and Transformer technology gains prominence, computer vision technology holds the promise of making even greater contributions to the field's development.

VI. DISCUSSION
This article presents the latest research outcomes concerning the utilization of computer vision techniques for the recognition of construction workers' behavior. Considering prevailing trends and the global emphasis on the high-quality development of infrastructure, we anticipate that this research avenue will continue to garner considerable enthusiasm in the ensuing years. Within this section, we introduce the impact of relevant technologies on construction, undertaking a comparative analysis of the appropriateness of various algorithmic approaches. Subsequently, we summarize the current research gaps and delve into possible future research directions. Notably, we incorporate insights from generative AI models, a highly prevalent trend this year, providing an additional perspective and offering recommendations for researchers engaged in related pursuits.

A. IMPACT ON ENGINEERING AND CONSTRUCTION
Artificial intelligence (AI) has spearheaded the evolution of intelligent construction, enhancing the efficiency of construction operations. Some practitioners assert that the application of AI technology can yield cost savings of up to 20% in building construction [186]. Over recent years, with the continual refinement of algorithms and methods, a growing number of outcomes are finding practical applications in engineering. China, which boasts the largest construction-industry economy, has witnessed a growth in engineering applications. Noteworthy instances include the successful implementation of results from Luo Hanbin's team at Huazhong University of Science and Technology in the Wuhan metro project [53], [56], [57], [58]. The established quality and safety management platform comprehensively monitors on-site construction conditions, provides real-time updates on quality and safety risks, and guides dynamic adjustments to the construction process [187] (refer to Figure 9). Additionally, the outcomes of Li et al. [75], [76] and Yang et al. [80] were integrated into a super high-rise construction project in the Donggang Business District of Dalian City in collaboration with China Construction Eighth Engineering Division. This integration led to a reduction in project managers' workload, indirectly contributing to a zero-accident occurrence throughout the project. While some results cited in this paper may not explicitly showcase the integration of applications in real projects, the data collected originates from actual construction sites and holds the potential for further development into tangible products [43], [84], [99]. Technology companies like Baidu and HikVision have launched commercial solutions for smart construction sites, aligning with the computer vision trend. In contrast to university scholars' emphasis on methodological innovation, commercial solutions prioritize functionality. In the current stage, the swift progress of AI technology has influenced public decision-making [8], [9], [10], and the concept of smart sites is expected to persist as a focal point in the engineering and construction field. Anticipating increased investment in research and development, it is foreseeable that novel methodologies will expedite technological integration with commercial solutions, further enriching the body of knowledge.

B. TECHNICAL SUITABILITY
One of the challenges hindering the widespread adoption of computer vision technology in the engineering construction field is the method's applicability. The construction site environment is intricate, exhibiting significant variations from one project to another. Factors such as lighting conditions and occlusion pose substantial impacts on the algorithm's recognition outcomes. Consequently, the algorithm's robustness becomes a critical consideration.
In the realm of computer vision, earlier algorithms predominantly relied on image processing and machine learning techniques, exemplified by methods like HOG features [72] and support vector machine (SVM) [108], [157]. However, these algorithms suffer from sluggish image feature extraction, an inability to handle training on extensive data, low robustness, and limited engineering applicability.
CNNs, featuring weight sharing and automated feature extraction, can train effectively on extensive datasets. However, when the dataset primarily represents a single scene composition, their applicability to other projects may be compromised [35], [50], [80]. In such instances, remedying this limitation necessitates secondary training. Conversely, if the dataset encompasses diverse lighting conditions, CNNs demonstrate efficacy in addressing such challenges. Notably, methods utilizing Faster R-CNN have yielded satisfactory results in scenarios involving varied lighting conditions [42], [76]. The emergence of larger-scale datasets related to construction behavior further substantiates the success of algorithms like Faster R-CNN and YOLO [44], [48], [52], [107], [108]. In terms of methodology, there are also ways to address applicability issues such as occlusion [79]. As algorithms mature, the focus on improving robustness becomes pertinent. For instance, Park et al. [99] enhanced YOLO v5 with multiple strategies, enabling the detection of occluded and overlapping construction workers even in low-light environments (Figure 10). With continued algorithmic and methodological advancements, coupled with ongoing dataset updates, the prospects for resolving the robustness problem appear promising.

C. RESEARCH GAPS AND POSSIBLE FUTURE RESEARCH DIRECTIONS

1) TECHNICAL APPROACHES
Considering the profound influence of visual algorithm performance on the precision of behavior recognition, enhancing the accuracy and speed of visual algorithms remains a pivotal trajectory for future development in this domain. In the realm of CNN-based methods, accuracy enhancements can be achieved by modifying the network structure, while computational speed can be augmented through alterations in convolution techniques or the implementation of alternative methods, such as refining the loss function.
While CNN-based methods excel in local feature extraction, they often lack robustness to data noise and deformation. Presently, Transformer technology, grounded in a self-attention mechanism, demonstrates superior modeling ability and enhanced robustness. Its application has begun to permeate the field of engineering and construction. Nevertheless, the Transformer model's intricate structure and numerous parameters pose challenges. Hence, a plausible future direction involves developing a lightweight Transformer model tailored for the construction domain. Potential measures include incorporating lightweight attention modules and using convolution or pooling operations to reduce sequence length, among other strategies.
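Why shortening the sequence helps can be shown with a toy calculation. Self-attention compares every token with every other token, so its cost grows with the square of the sequence length. The sketch below, which is illustrative only and not a Transformer implementation, average-pools a token sequence and counts the resulting token pairs. The token values, the stride, and the function names are hypothetical.

```python
from typing import List

Token = List[float]


def average_pool(tokens: List[Token], stride: int) -> List[Token]:
    """Merge each run of `stride` consecutive tokens into their element-wise mean."""
    pooled = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(t[k] for t in group) / len(group) for k in range(dim)])
    return pooled


def attention_pairs(num_tokens: int) -> int:
    """Self-attention scores every token against every token: n * n pairs."""
    return num_tokens * num_tokens


tokens = [[float(i), float(i) + 1.0] for i in range(16)]  # 16 toy tokens of dimension 2
pooled = average_pool(tokens, stride=4)

print(len(tokens), attention_pairs(len(tokens)))  # 16 256
print(len(pooled), attention_pairs(len(pooled)))  # 4 16
```

Pooling by a factor of 4 cuts the number of attention pairs by a factor of 16 in this example, at the price of coarser spatial detail, which may matter for small or distant workers.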
The integration of various computer vision algorithms can effectively tackle intricate behavior recognition tasks, constituting a methodologically sound approach. However, to align with the practical demands of engineering applications, future research could explore the establishment of a unified specification for operational environments and normalized data formats applicable to diverse computer vision algorithms. Alternatively, the development of novel algorithms or monitoring strategies geared towards multifunctional real-time monitoring could be pursued.

2) APPLICATION SCENARIOS
From the reviewed articles, the predominant focus has been on addressing occupational health and safety risks within the construction context. There is a noticeable dearth of studies pertaining to the tracking of construction workers and the recognition of construction activities. While understanding the location distribution and movement trajectory of workers proves valuable in visualizing the labor force's distribution status and aiding resource allocation, vision-based tracking encounters challenges in preventing the loss of targets across spatio-temporal scales. This predicament has led to the increased attention given to ReID (Re-Identification) algorithms in recent years. Future research directions could involve the development and exploration of ReID algorithms tailored for construction workers or the pursuit of alternative methods with comparable functionality.
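The core matching step behind re-identification can be sketched as follows. The example assumes that some appearance model, not specified here, has already converted each worker crop into a feature vector; a worker who reappears after the track was lost is then assigned the stored identity with the most similar vector, provided the similarity clears a threshold. The vectors, IDs, and threshold are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def reidentify(
    query: List[float],
    gallery: Dict[str, List[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the gallery ID most similar to the query, or None if nothing is close enough."""
    best_id, best_score = None, threshold
    for track_id, feature in gallery.items():
        score = cosine_similarity(query, feature)
        if score >= best_score:
            best_id, best_score = track_id, score
    return best_id


# Hypothetical appearance features stored before two tracks were lost.
gallery = {"worker_A": [1.0, 0.0, 0.0], "worker_B": [0.0, 1.0, 0.0]}

print(reidentify([0.9, 0.1, 0.0], gallery))  # worker_A
print(reidentify([0.5, 0.5, 0.7], gallery))  # None
```

On construction sites, workers often wear near-identical vests and hardhats, so appearance features alone may be weakly discriminative. This is one reason ReID methods tailored to construction workers are suggested as future work.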
Construction activity recognition intuitively reflects the productivity status of a site and deserves more attention in future research. One concrete step would be to establish a large-scale construction activity database to support the development of computer vision-based approaches.
Engineering construction sites present intricate scenarios, with environmental variations observed across different locations. Some research is carried out in laboratory settings for image acquisition and testing [78], while others collect data from specific engineering projects [76]. Consequently, the performance and effectiveness of these approaches cannot be assured when applied to different sites. To illustrate, considering Personal Protective Equipment (PPE) usage detection, numerous studies report accuracy rates surpassing 90%. However, these achievements are often based on self-constructed datasets for testing rather than utilizing open-source public datasets.
Future research endeavors should focus on constructing extensive public datasets encompassing diverse facets of worker construction behavior. Such datasets should span static images and dynamic videos, encompassing a spectrum of construction scenarios across various project types. It is strongly advised that researchers employ these public datasets during the development of algorithmic frameworks. Doing so would significantly contribute to the broader comprehension of algorithms among researchers, fostering collaboration and knowledge exchange.
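One way to make results on a shared dataset comparable is to score every method with the same matching rule. The sketch below shows a common pattern, offered as an assumption rather than a protocol from the reviewed studies: a predicted box counts as correct if its intersection over union (IoU) with an unmatched ground-truth box is at least 0.5, and precision and recall are computed from those matches. Predictions are matched greedily in the order given, whereas a full benchmark would typically sort them by confidence first.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], iou_threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box."""
    unmatched = list(truths)
    true_positives = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box can be matched only once
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical ground truth (two hardhats) and three predictions, one of them spurious.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60), (20, 20, 30, 31)]
precision, recall = precision_recall(preds, truths)
print(round(precision, 3), recall)  # 0.667 1.0
```

Reporting precision and recall under a stated IoU threshold, rather than a single accuracy figure, makes it easier to see whether a method misses workers or raises false alarms.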

3) GENERATIVE AI'S RESPONSE
In 2023, generative AI emerged as a prominent field within AI, with ChatGPT [188] and NewBing [189] standing out as two highly popular generative AI models. ChatGPT, developed by OpenAI, is a conversational AI model built upon the GPT-3.5 series models, which have been trained using reinforcement learning techniques. NewBing, on the other hand, is an interactive tool offered by Microsoft that incorporates ChatGPT technology to provide more innovative insights and suggestions.
To gather Q&A information related to generative AI, Dr. Yang Zhaozheng from the University of Strasbourg adopted both NewBing and ChatGPT, directing specific inquiries to each model. It should be noted that the statements presented by generative AI models do not reflect the perspectives of Dr. Yang or any of the contributors to this article.
Here are some suggestions for future research directions given by generative AI models:
a. Enhance object detection and recognition algorithms for construction workers in complex indoor environments.
b. Improve accuracy and robustness under variations in lighting conditions, occlusions, camera movements, and background noise.
c. Combining computer vision with technologies like deep learning, IoT, augmented reality, and digital twin for better performance in progress monitoring systems.
d. Progress monitoring will not only cover physical aspects but also psychological, social, and environmental factors affecting workers' well-being and behavior.
e. Standard benchmarks and evaluation metrics will be established for comparing different computer vision-based progress monitoring methods and systems.
f. Integration of multi-sensor data, such as wearable sensors or GPS, with visual data for a more comprehensive understanding of workers' behavior.
g. Development of worker-centered monitoring approaches while respecting privacy and autonomy.
h. Exploration of robust monitoring systems that can adapt to changing environments, including lighting conditions and weather.
i. Implementation of real-time monitoring and intervention systems to provide immediate feedback to workers and prevent potential hazards.
Moreover, generative AI models offer substantial potential benefits. Excellent design capabilities are exemplified by Cadence's Allegro X AI technology. This generative AI model facilitates the optimization of the system design process. Furthermore, the generative AI models' capacity holds promise for creating algorithms to recognize the construction workers' behaviors. When this technology matures, the AI model can seamlessly execute automated testing of product features, including compatibility testing for software applications. Additionally, in the realm of construction management, where decision-making is integral, generative AI models can prove invaluable by providing targeted advice.

VII. CONCLUSION
This paper presents a comprehensive review of the literature pertaining to the study of monitoring construction workers' behavior using computer vision technology. Initially, the researchers conducted a keyword search on the WOS platform, resulting in 596 papers from the past decade. Through manual inspection, 137 papers were selected for further analysis. The distribution of publication years indicates that the field is undergoing rapid development, with ongoing updates to technical methods. The authors employed VOSviewer software to identify research hotspots and highlight notable researchers and institutions in the field.
The paper provides an overview of computer vision techniques and technological advancements in worker behavior recognition. The reviewed literature is categorized into three sections based on application scenarios: “Detection, localization and tracking for construction workers,” “Recognition of workers' construction activities,” and “Occupational health and safety behavior monitoring.” The abundance of literature demonstrates the preference of current researchers for employing computer vision to address safety concerns in engineering construction. The analysis reveals a trend wherein vision-based approaches begin to be adopted in real-world engineering, impacting decision-making processes in civil engineering. Initially, across different application scenarios, early research inclined towards addressing challenges through conventional image processing methods and machine-learning classification. However, as CNN and object detection techniques matured, vision methods exhibited enhanced robustness, adept at handling behavior recognition issues even in intricate scenarios, leading to the swift displacement of traditional methods. Recent years have witnessed an increase in the integration of multiple algorithms to address real-world challenges, marking a prevalent trend. Additionally, the emergence of Transformer technology holds the promise of transcending existing algorithmic limitations in this domain.
The authors conclude by summarizing research gaps and outlining future directions. Further optimization is required in terms of accuracy and speed, especially when integrating multiple algorithms, necessitating the establishment of a unified framework platform and integration strategy. In terms of application scenarios, improvements can be made to enhance scene applicability and diversify data sources. Furthermore, insights from the generative AI models underscore the importance of considering the privacy and autonomy of construction workers, suggesting their involvement in monitoring projects. This paper serves as a reference for engineers and researchers, aiming to contribute to the field of engineering and construction while benefiting fellow scholars' scientific studies.

FIGURE 1. The research logic framework.
displays the sources of the papers, including Automation in Construction, Journal of Computing in Civil Engineering, Journal of Construction Engineering and Management, Advanced Engineering Informatics, IEEE Access, etc. These journals are reputable and may have a considerable potential to make further contributions to the field.

FIGURE 2. The publication frequency across various years.

FIGURE 3. The volume of scholarly articles published in relevant academic journals.

FIGURE 5. Analysis of research authors.

FIGURE 6. Analysis of research institutions.
automate and simulate human vision. Its primary purpose is to extract valuable information from image and video data and then analyze, process, and understand it. Traditional computer vision methods rely mainly on image processing and pattern recognition techniques. These methods include feature extraction techniques like edge

FIGURE 7. Analysis of research countries or areas.

FIGURE 8. Analysis of research countries or areas.

TABLE 1. Typical research for algorithms integrated.

TABLE 2. Construction activities recognition research details.

TABLE 3. Typical research for pose estimation.

TABLE 4. Typical research for unsafe behavior recognition.

TABLE 5. Typical research for PPE's usage inspection.