Online Safety Zone Estimation and Violation Detection for Nonstationary Objects in Workplaces

This study presents a deep neural network (DNN)-based safety monitoring method. Nonstationary objects such as moving workers, heavy equipment, and pallets were detected, and their trajectories were tracked. Time-varying safety zones (SZs) of moving objects were estimated based on their trajectories, velocities, proceeding directions, and formations. SZ violations are defined by set operations with sets of points in the estimated SZs and the object trajectories. The proposed methods were tested using images acquired by CCTV cameras and virtual cameras in 3D simulations in plants and on loading docks. DNN-based detection and tracking provided accurate online estimation of time-varying SZs that were adequate for safety monitoring in the workplace. The set operation-based SZ violation definitions were flexible enough to monitor various violation scenarios that are currently monitored in workplaces. The proposed methods can be incorporated into existing site monitoring systems with single-view CCTV cameras at vantage points.


I. INTRODUCTION
SAFETY monitoring is an important part of workplace safety that prevents accidents by issuing alarms at critical moments and enforcing safety rules and regulations. However, it is unrealistic for safety officers to monitor large dynamic sites with multiple operations performed by many workers. There have been various developments for assisting or automating site monitoring. Detecting workers and equipment in workplaces and tracking their trajectories for possible accidents are integral parts of automated and continuous safety monitoring. In [1]-[5], real-time tracking systems based on wearable sensors were used to track workers. An additional advantage of wearable sensors is that they can also be used to measure other data, such as workers' activities and health conditions. However, these systems require all workers, possibly from many different organizations, to be equipped with compatible and calibrated sensors. When there are many reflective metal structures in a workplace, accurate localization of sensors that utilize radio frequencies may be difficult. Sensitive issues, such as the disclosure of personal information, may also arise.
CCTV cameras are commonly used to monitor safety in the workplace [6]- [8]. Computer vision techniques were applied to achieve an understanding and visualization of situations in images acquired by the cameras. Objects in images are detected using object models based on pixel, color, and other feature information [9], [10]. Trajectories of objects can be tracked through pixels, segmentation, contours, kernels, and graph-based tracking algorithms [11]- [14]. Physical locations in the 3D space of detected and tracked objects cannot be specified through a single CCTV camera [15]. In [16]- [19], two or multiple cameras were used to locate objects in 3D space. Alternatively, range data from distance sensors, such as LIDAR, can be used to determine locations of objects in 3D spaces [20].
Deep learning-based object detection enables accurate detection of multiple objects in the workplace, such as workers and heavy equipment. In [21]-[25], personal protective equipment was detected using DNNs. In [26]-[28], heavy equipment in images was detected using DNNs. In [29], [30], images from multiple cameras were monitored using DNNs to detect workers in close proximity to heavy equipment in operation. In [31], support structures at construction sites were segmented using a deep network to detect workers standing on the structures. The safety rule violations in these studies rely on safety zones (SZs) around stationary objects, such as heavy equipment operating at fixed locations or concrete and steel structures on the sites.
Tracking the trajectories of moving objects has been studied in various areas [32], [33]. Once an object is detected, its locations over time can be tracked using deterministic [34], [35], probabilistic [36], [37], and deep network models [38]. Recently, the trajectories of multiple objects have been studied in sports events [39], [40] and in crowded spaces [41]-[44]. These approaches try to model the general movement of objects, for example, players and humans, in a given situation. For safety monitoring purposes, we cannot expect heavy equipment and workers to follow general and safe trajectories. A simple tracking method based on a simple deterministic motion model without complicated assumptions is more suitable for our purpose.
This study presents online SZ estimation and detection of safety rule violations involving nonstationary equipment. In particular, we consider collisions of moving workers, moving heavy equipment in plants, and accidental falls of moving workers from elevated platforms arranged with moving pallets in loading docks. Images from a single CCTV camera at the vantage point, which is often already available in workplaces, were used to monitor the site. Objects, such as workers, helmets, forklifts, trucks, pallets, and staircases, were detected and segmented using DNNs. Object locations were mapped to the top plane, where trajectories of objects, as well as the distances between moving objects, were estimated. The SZs of moving workers and heavy equipment are defined by circular sectors adaptive to the time-varying trajectories and speeds. The SZs of elevated platform formations are defined via morphological operations [45]. Various safety rule violations are defined and detected using set operations involving SZ sets. The proposed zone estimation and zone violation detection algorithms, which involve the use of neural networks for detection and segmentation, were trained using images from CCTV cameras installed in a plant and a loading dock. The trained algorithms were implemented in the current site monitoring systems for field testing. We also prepared images acquired from virtual cameras in 3D worlds created using the 3D simulation software Unity 3D [46]. The proposed methods were trained and tested using Unity 3D images, which allowed us to validate the concepts and quantitatively evaluate the detection, segmentation, and tracking accuracy. With DNN-based object detection, segmentation and trajectory tracking and morphological operation-based online zone estimation, we were able to detect zone violations and issue alarms for collision and fall accidents.
The contributions of this work can be summarized as follows. i) The DNN-based object detection and segmentation methods provided accurate detection and segmentation of multiple objects in the workplace with few false negatives from a single camera so that they can be used in safety monitoring. ii) Time-varying SZs are estimated based on the trajectories, velocities, headings, and formations of objects so that safety violations involving moving objects in workplaces can be considered. iii) SZ violations are defined as set operations with SZs and trajectories in the top-view plane so that violation scenarios in various workplaces can be expressed easily and detected accurately. iv) The proposed methods can be readily incorporated into existing site monitoring systems with single-view CCTV cameras at vantage points.
The remainder of this study is organized as follows. Section II-B presents the perspective transformation of acquired images to the top-view plane, where trajectories of objects were estimated. In Sections II-C and II-D, the time-varying SZs of moving heavy equipment and moving elevated platforms are defined. Sections III-A and III-B address SZ violation detection using set operations. Section IV provides the experimental results and discussion. Section V concludes the study.

A. OBJECT DETECTION AND SEGMENTATION
DNNs are utilized for object detection and segmentation. Fig. 1 shows the network architectures of YOLOv3 [47] and YOLACT [48] used for object detection and segmentation, respectively. YOLOv3 extracts features with deep convolutional layers. The features at three different resolutions are used to predict the types and locations of objects. YOLACT utilizes a fully convolutional network (FCN) [49] to produce prototype masks and then refines the prototype masks based on the object detection results to find pixels that belong to each detected object.
Both YOLOv3 and YOLACT learn features using deep convolutional layers from data. DNN-based object detection shows improved performance over detection methods based on domain-specific hand-selected features. For example, features such as histogram of oriented gradients (HOG) [50]-[52] and aggregated channel features (ACF) [53], [54] can be extracted using a sliding window on various scales, and then objects can be detected using a classifier such as a support vector machine (SVM) [55] based on the extracted features. Fig. 2 shows examples of DNN-based and specific feature-based object detection. Images from a frontal view camera in workplaces are used to detect workers. Both HOG-based and ACF-based detection showed difficulties in detecting workers partially occluded by other objects and multiple workers who reside in close proximity. In comparison, DNN-based detection showed robust detection of the workers in these cases. In general, DNN-based detection showed better performance for corner cases where specific feature-based algorithms may suffer. Similar trends in performance comparisons were reported in other applications [56], [57].
The recall, which is defined as

$$\text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}, \tag{1}$$

was measured for the images used in the evaluation. The recall values of the HOG-based and ACF-based detection methods were 85.29 and 82.50, respectively, while that of the DNN-based method was 98.95. The DNN-based method reported significantly higher recall than the HOG- and ACF-based methods. In safety monitoring, the number of false negatives should be kept as small as possible to avoid missing an object or an event that may be involved in an accident. Hence, this work employs DNN-based methods for object detection and segmentation.
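As a small worked illustration of (1), the following sketch computes the recall from detection counts. The counts in the usage line are hypothetical and are not taken from the evaluation above.

```python
def recall(true_positive: int, false_negative: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    total = true_positive + false_negative
    return true_positive / total if total else 0.0


# Hypothetical counts chosen only for illustration.
print(f"recall = {100 * recall(94, 6):.2f}%")
```

A missed worker counts as a false negative, so a higher recall directly corresponds to fewer unmonitored objects.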

B. PERSPECTIVE TRANSFORM
A single camera is used to monitor a workplace. Objects in the scene are detected using a DNN. Locations of the detected objects are provided as pixel locations of bounding boxes that contain the objects. We transferred the object locations to a physical plane. In particular, we transferred the object locations to the plane of a top-view camera via a perspective transform [29], [30]. Let $(\tilde{x}, \tilde{y})$ and $(x, y)$ be the pixel locations in an image acquired by the camera and in an image transformed to the top-view plane by the perspective transformer, respectively. The pixel locations are related by

$$\gamma \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = M \begin{bmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{bmatrix}, \tag{2}$$

where $M$ is a 3×3 matrix and $\gamma$ is a scalar quantity. We locate a square or rectangular structure on the workplace ground and map the pixel locations of its quadrangle vertices in the acquired image to the vertices of a square or rectangle in the top-view plane to determine $M$ and $\gamma$.
In many cases, a monitoring camera is installed at a high vantage point with the camera angled downward. Pointing the camera downward allows it to capture a wide area of the workplace. One disadvantage of this installation is that the same physical distance between objects appears as a different pixel distance depending on where the objects are located in the image. Transforming the object locations to the top-view plane via the perspective transformer makes the same distances appear the same, regardless of the object locations. Moreover, if the quadrangle vertices of a structure with a known dimension are used to find the perspective transformer, the physical dimension of a pixel can be found and used to set up safety distances or SZ dimensions.
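The estimation of the perspective transformer from four ground points can be sketched as follows. This is a minimal standard-library illustration, assuming the bottom-right entry of $M$ is fixed to one so that eight unknowns remain; the function names and the example coordinates are hypothetical and do not come from the monitored sites.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_perspective(src, dst):
    """Find the 3x3 matrix M that maps four image points to four top-view points."""
    rows, rhs = [], []
    for (xs, ys), (xd, yd) in zip(src, dst):
        rows.append([xs, ys, 1, 0, 0, 0, -xd * xs, -xd * ys])
        rhs.append(xd)
        rows.append([0, 0, 0, xs, ys, 1, -yd * xs, -yd * ys])
        rhs.append(yd)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def to_top_view(m, x, y):
    """Apply the transform; gamma is the scalar that normalizes the result."""
    u = m[0][0] * x + m[0][1] * y + m[0][2]
    v = m[1][0] * x + m[1][1] * y + m[1][2]
    gamma = m[2][0] * x + m[2][1] * y + m[2][2]
    return u / gamma, v / gamma


# Hypothetical example: a 2 m x 4 m ground rectangle seen as a trapezoid.
image_quad = [(300, 400), (700, 400), (900, 700), (100, 700)]
top_rect = [(0, 0), (200, 0), (200, 400), (0, 400)]
M = fit_perspective(image_quad, top_rect)
metres_per_pixel = 2.0 / 200  # known width divided by its top-view pixel width
```

Because the rectangle has a known physical size, the ratio in the last line gives the physical dimension of a top-view pixel, which can then be used to express safety distances in meters.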

C. SAFETY ZONE FOR MOVING HEAVY EQUIPMENT
SZs for moving heavy equipment, such as forklifts and trucks, were estimated. First, heavy equipment was detected using a DNN, which provided bounding boxes that contained the objects in the current frame. The centers of the bases of the bounding boxes were then compared to those detected in the previous frame. Based on the Euclidean distance between the current and previous bounding boxes, the objects were assigned identification numbers. A newly appearing object was registered and assigned an i.d. number, and an object that disappeared was de-registered and its i.d. number was released. Trajectories of objects were recorded in the top-view plane as

$$\{(x^e_\tau, y^e_\tau)\}_{\tau = t-K}^{t}, \tag{3}$$

where $t$ is the current time index, $K$ is the number of previous frames that we tracked, and $e = 1, 2, \cdots, E$ is the i.d. number of the moving heavy equipment. The SZ of moving heavy equipment is estimated based on the trajectory. The velocity of the $e$th heavy equipment is estimated by

$$v^e_t = \left(x^e_t - x^e_{t-K},\; y^e_t - y^e_{t-K}\right). \tag{4}$$

The safety zone of the $e$th object is set up as a circular sector with the radius

$$r^e_t = \alpha^e \lVert v^e_t \rVert \tag{5}$$

and the angle between $\angle v^e_t - \theta^e$ and $\angle v^e_t + \theta^e$. The estimated SZ is placed at the current location $(x^e_t, y^e_t)$. The parameter $\alpha^e$ controls how far the SZ extends in front of an object, and the parameter $\theta^e$ controls how wide the SZ spreads in front of an object. The safety zone of the $e$th equipment in the $t$th frame, denoted by $z^e_t$, is defined by the set of pixel indices $(x, y)$ inside the circular sector of the $e$th heavy equipment. Note that the SZ of moving heavy equipment changes frame-by-frame depending on the locations and the velocity of the heavy equipment. For visual monitoring, the estimated SZ of the moving equipment was mapped back to the acquired images via inverse perspective transformation.
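The velocity estimate and the circular-sector SZ described above can be sketched as follows. This is a minimal illustration that assumes the trajectory is a list of top-view pixel locations ordered from the oldest to the current frame, and that the SZ is enumerated by brute force over a `width` by `height` grid; the function names are hypothetical.

```python
import math


def velocity(trajectory):
    """Displacement from the oldest tracked location to the current one."""
    (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
    return x1 - x0, y1 - y0


def sector_zone(trajectory, alpha, theta, width, height):
    """Set of pixel indices inside the circular sector at the current location."""
    vx, vy = velocity(trajectory)
    radius = alpha * math.hypot(vx, vy)
    heading = math.atan2(vy, vx)
    cx, cy = trajectory[-1]
    zone = set()
    for x in range(width):
        for y in range(height):
            dx, dy = x - cx, y - cy
            dist = math.hypot(dx, dy)
            if dist > radius:
                continue
            if dist == 0:
                zone.add((x, y))  # the apex of the sector
                continue
            diff = math.atan2(dy, dx) - heading
            diff = (diff + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
            if abs(diff) <= theta:
                zone.add((x, y))
    return zone
```

With this representation, a faster object yields a longer sector and a turning object yields a rotated sector, so the zone follows the speed and heading frame by frame.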
SZ estimation for moving heavy equipment can be extended to objects whose locations are provided by other methods. For example, a crane operating in a plant is a considerable factor in plant safety. Unlike heavy equipment such as forklifts and trucks, the location of a crane head can be provided by a crane control system. Let $(x^e_t, y^e_t, z^e_t)$ be the location of the crane head. Then, the trajectory of the crane in the top-view plane is given by

$$\{(x^e_\tau, y^e_\tau)\}_{\tau = t-K}^{t}.$$

The SZ of the crane is set as the union of circles

$$\bigcup_{\tau = t}^{t+T} C(x^e_\tau, y^e_\tau; r^e_t),$$

where $C(x^e_\tau, y^e_\tau; r^e_t)$ is a circle centered at $(x^e_\tau, y^e_\tau)$ with radius $r^e_t$. The future locations $(x^e_\tau, y^e_\tau)$ for $\tau = t+1, \cdots, t+T$ are predicted using the velocity of the crane head, $v^e_t$, which is estimated by subtracting the $K$th previous location from the current location. The radius $r^e_t$ is set as a function of the crane height, where $\beta^e$ and $\alpha^e$ are the parameters that determine the size of the height-dependent SZ.
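The union-of-circles SZ of the crane can be sketched as follows. Two points are assumptions of this sketch rather than statements of the method: the future locations are extrapolated with a constant per-frame displacement obtained by dividing the $K$-frame displacement by $K$, and the radius is passed in as a precomputed value because the height-dependent radius function is not reproduced here.

```python
def crane_zone(trajectory, horizon, radius, width, height):
    """Pixels covered by circles at the current and predicted crane locations."""
    if len(trajectory) < 2:
        raise ValueError("at least two tracked locations are needed")
    k = len(trajectory) - 1
    (x0, y0), (xt, yt) = trajectory[0], trajectory[-1]
    # Assumed constant-velocity prediction: displacement per frame.
    step_x, step_y = (xt - x0) / k, (yt - y0) / k
    centers = [(xt + i * step_x, yt + i * step_y) for i in range(horizon + 1)]
    return {
        (x, y)
        for x in range(width)
        for y in range(height)
        if any((x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 for cx, cy in centers)
    }
```

The resulting set elongates along the direction of travel, which is the intended effect of taking the union over the current and future locations.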

D. SAFETY ZONE ESTIMATION FOR ELEVATED PLATFORMS
Edges of elevated platforms are estimated as the SZ. First, elevated platforms, such as pallets and stairs, were detected using a DNN, which segmented pixels belonging to the detected objects. The binary mask that represented the segmented pixels of the elevated platform equipment was then mapped to the top-view plane using the perspective transformer. We applied a series of morphological operations [45] to the binary mask to estimate the SZ. The closing operation was applied to fill small gaps between multiple pieces of equipment:

$$m^c_t = (m_t \oplus s^c) \ominus s^c,$$

where $m_t$ and $m^c_t$ are the binary masks before and after closing, respectively, and $\oplus$ and $\ominus$ are the dilation and erosion operations, respectively. $s^c$ is the structural element for the closing operation. The size of the gaps to close is controlled by the structural element $s^c$. The edge of the closed mask is determined by

$$m^e_t = m^c_t - (m^c_t \ominus s^e),$$

where $m^e_t$ is the binary mask that represents the edge, and $s^e$ is the structural element. The width of the edge region is controlled by the structural element $s^e$. The binary mask of the edge served as the SZ of the elevated platform. The SZ for the elevated platform, denoted by $z^p_t$, is the set of pixel indices where $m^e_t$ is one. Note that the SZ changes frame-by-frame and hence can handle the changes in SZ due to the formation of multiple pallets. For visual monitoring, the estimated SZ of the elevated platform was mapped back to the acquired image via an inverse perspective transformer.
SZ estimation for the elevated platforms can be extended to consider fixed areas. An area defined by a user in the image can be transformed to the top-view plane and used as the SZ.
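The closing and edge-extraction steps of this section can be sketched on a small binary mask as follows. This is a minimal standard-library illustration, assuming square structural elements of odd size and treating pixels outside the image as background; the function names are hypothetical, and an image-processing library would normally be used in practice.

```python
def dilate(mask, k):
    """Binary dilation with a k x k square structural element (k odd)."""
    h, w, r = len(mask), len(mask[0]), k // 2
    return [[int(any(mask[j][i]
                     for j in range(max(0, y - r), min(h, y + r + 1))
                     for i in range(max(0, x - r), min(w, x + r + 1))))
             for x in range(w)] for y in range(h)]


def erode(mask, k):
    """Binary erosion; pixels outside the image count as background."""
    h, w, r = len(mask), len(mask[0]), k // 2

    def covered(y, x):
        return all(0 <= j < h and 0 <= i < w and mask[j][i]
                   for j in range(y - r, y + r + 1)
                   for i in range(x - r, x + r + 1))

    return [[int(covered(y, x)) for x in range(w)] for y in range(h)]


def platform_edge(mask, k_close, k_edge):
    """Close small gaps, then keep the band removed by one more erosion."""
    closed = erode(dilate(mask, k_close), k_close)
    shrunk = erode(closed, k_edge)
    edge = [[c - s for c, s in zip(rc, rs)] for rc, rs in zip(closed, shrunk)]
    return closed, edge


def zone_from_mask(edge):
    """Set of (x, y) pixel indices where the edge mask is one."""
    return {(x, y) for y, row in enumerate(edge) for x, v in enumerate(row) if v}
```

For example, two pallets separated by a one-pixel gap are merged by a 3×3 closing, and the edge band then runs around the combined outline rather than around each pallet.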

III. SAFETY ZONE VIOLATION DETECTION
Violations of SZ can be expressed in terms of the relations between the locations, trajectories, velocities, heading, and formations of objects. We consider the following violation scenarios: i) future trajectories of workers and heavy equipment that collide, ii) workers staying in dangerous areas for a longer period, iii) future trajectories of workers entering dangerous areas, and iv) workers stepping backward to dangerous areas. Violation scenarios are written as set operations with SZs and trajectories defined in the top-view plane for violation detection.

A. SAFETY ZONE VIOLATION FOR MOVING HEAVY EQUIPMENT
Workers are detected using a DNN, which provides bounding boxes that contain the workers in the current frame. Following the same procedure presented in Section II-C, the trajectories of workers are recorded as

$$\{(x^w_\tau, y^w_\tau)\}_{\tau = t-K}^{t}$$

for the $K$ previous frames, where $w = 1, 2, \cdots, W$ is the i.d. number of workers. The velocity of the $w$th worker, $v^w_t$, is estimated by subtracting the $K$th previous location from the current location. The SZ of the $w$th worker is set as a circular sector with the radius

$$r^w_t = \alpha^w \lVert v^w_t \rVert$$

and the angle between $\angle v^w_t - \theta^w$ and $\angle v^w_t + \theta^w$. The SZ of the $w$th worker in the $t$th frame, denoted by $z^w_t$, is defined by the set of pixel indices inside the circular sector of the $w$th worker. An SZ violation occurs when the future trajectories of a worker and heavy equipment collide. In terms of the SZs of the heavy equipment and workers, $z^e_t$ and $z^w_t$, respectively, an SZ violation occurs for the $w$th worker when the intersection of the two sets is not empty:

$$z^w_t \cap \left( \bigcup_{e=1}^{E} z^e_t \right) \neq \emptyset,$$

where $E$ is the number of heavy equipment detections in the frame. When an SZ violation is detected, the $w$th worker in the acquired images is highlighted for monitoring, and an alarm is issued.
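The set-operation form of this violation can be sketched directly with sets of pixel indices. The sketch assumes each SZ is already available as a Python set of (x, y) tuples keyed by i.d. number; the function name and the dictionary layout are hypothetical.

```python
def violating_workers(worker_zones, equipment_zones):
    """Return the i.d. numbers of workers whose SZ meets any equipment SZ."""
    union = set().union(*equipment_zones.values())
    return [w for w, zone in worker_zones.items() if zone & union]


# Hypothetical zones for two workers and two pieces of equipment.
workers = {1: {(5, 5), (6, 5)}, 2: {(20, 20)}}
equipment = {1: {(6, 5), (7, 5)}, 2: {(40, 40)}}
print(violating_workers(workers, equipment))
```

Expressing the rule as a set intersection keeps it independent of how each zone was produced, so sector-shaped, circular, and user-defined zones can be mixed.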

B. SAFETY ZONE VIOLATION FOR ELEVATED PLATFORMS
The same DNN that provides segmentation of pixels for the platform equipment was used to detect workers. A bounding box that contains the segmented pixels of a detected worker is found, and the center of its base is used to obtain the location, trajectory, and velocity of the worker. An SZ violation occurs when a worker stays at the edge of the elevated platform for a long duration. In terms of the SZ of the platform, $z^p_t$, and the locations of workers, $(x^w_t, y^w_t)$, an SZ violation occurs for the $w$th worker when the worker's locations belong to the current set $z^p_t$ for a duration $T$, where $T$ is a parameter that determines the duration of a worker staying at the edges of the platform. An SZ violation also occurs when the future trajectory of a worker intersects the edge of the platform. In terms of the SZs of the platform and workers, $z^p_t$ and $z^w_t$, respectively, an SZ violation occurs for the $w$th worker when the intersection of the two sets is not empty:

$$z^w_t \cap z^p_t \neq \emptyset.$$

Another violation that we detected was when a worker stepped backward on the platform. To determine whether a worker is walking forward or backward, the gaze direction is estimated. Workers in the workplace are required to wear helmets, which are white on the front and blue on the back. The helmets are detected using the same DNN. The detected helmets are assigned to the worker i.d. numbers based on the Euclidean distance between the locations of helmets and workers. The image of the helmet is resized to a fixed dimension. Binary maps for the white and blue parts of the helmet are then found using thresholding in the HSV color space. The gaze direction of the $w$th worker, $\tilde{g}^w_t$, is defined by subtracting the center of mass of the white mask from that of the blue mask. To compare the walking direction and the gaze direction, the trajectory of the worker is recorded with the locations in the acquired images as

$$\{(\tilde{x}^w_\tau, \tilde{y}^w_\tau)\}_{\tau = t-K}^{t},$$

from which the velocity of the $w$th worker, $\tilde{v}^w_t$, is computed. Backstepping is detected when the angle between the gaze and walking directions is greater than a threshold, or

$$\angle\left(\tilde{g}^w_t, \tilde{v}^w_t\right) > \varphi,$$

where $\varphi$ is a threshold.
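The backstepping test can be sketched as an angle comparison between two image-plane vectors. The sketch takes the gaze and walking vectors as given inputs; the `min_speed` guard, which skips workers who are nearly stationary, is an assumption added for illustration and is not part of the definition above.

```python
import math


def angle_between(u, v):
    """Unsigned angle in radians between two 2D vectors, in [0, pi]."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(math.atan2(cross, dot))


def is_backstepping(gaze, walk, phi, min_speed=1.0):
    """True when the walking direction deviates from the gaze by more than phi."""
    if math.hypot(*walk) < min_speed or math.hypot(*gaze) == 0:
        return False  # assumed guard: direction is unreliable when standing still
    return angle_between(gaze, walk) > phi
```

Using `atan2` of the cross and dot products avoids the rounding problems that `acos` has for nearly parallel vectors.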
When an SZ violation is detected, the $w$th worker in the acquired images is highlighted for monitoring, and an alarm is issued.

Fig. 3 shows examples of the perspective transformers. Images in (a) were acquired using CCTV cameras installed at vantage points in a plant and a loading dock. Because the cameras are pointed downward, objects showcase perspective with a vanishing point. In the images, objects in the front appear larger than those in the back. We cannot determine distances between objects by simply measuring the distance between pixels. Images in (b) show the results of mapping the images to the top-view plane via the perspective transforms. Quadrangle vertices of rectangular structures with a known dimension are used to find the perspective transform in (2). In particular, we measured the dimensions of four points in the pathways. Objects may appear stretched out in some directions in the top-view image. However, the footings of the objects bear correct locations on the ground.

1) Perspective Transform
We evaluated the accuracy of the perspective transformer through images created using the 3D simulation software Unity 3D [46]. 3D worlds that are similar to the plant and the loading dock in Fig. 3 were created. Virtual cameras were placed at vantage points, and images from the virtual camera were acquired. The perspective transformers are found using four points in the ground, with which the images are mapped to the top-view planes. Fig. 4 shows examples of the acquired and transformed images. For evaluation, we placed checkerboard patterns on the ground in the virtual worlds and measured the dimensions of the checkerboard patterns in the top-view images. A 2 m × 2 m square and a 4 m × 4 m square were used in the plant and the loading dock, respectively. The average angle between the adjacent sides of the squares mapped to the top-view plane was 89.34 degrees. The average aspect ratio of the square was 1.00:1.07. The physical dimension of a pixel can be calculated from the known dimensions of the checkerboard pattern. A pixel in the top-view images corresponds to 0.015×0.014, 0.016×0.017, 0.018 × 0.017, and 0.026 × 0.020 m in Fig. 4.

2) Online Safety Zone Estimation for Moving Equipment
Objects in the acquired images are detected using a DNN. YOLOv3 [47] is used to detect workers, forklifts, and trucks. Images in a plant were acquired while workers performed various tasks over several days, and the objects in the images were labeled. A total of 1443 images with 4198 labeled objects were used for the training. Data augmentation was implemented with scaling by ×0.5 and ×1.5 and flipping of the images in both directions. The network was implemented with Keras and TensorFlow using two NVIDIA RTX 2080 Ti GPUs. Adam [58] was used as the optimizer. The learning rate was set to $1.0 \times 10^{-3}$ with decay. Random batches were used with a batch size of 8. The accuracy of object detection was evaluated with 350 images that included 925 labeled objects prepared separately for testing. For evaluation, we also prepared images in two 3D worlds similar to the plant. The movement of the workers, forklifts, and trucks was simulated, while a view from a virtual camera at vantage points was acquired. Objects in the images were labeled, and YOLOv3 was trained using the labeled images. Table 1 shows the accuracy of object detection in terms of the mean average precision (mAP) [59] and recall. Fourfold cross-validation [60] is used for the evaluation. The training set is divided into four folds. The images in three of the folds are used for training, and the images in the remaining fold are used for evaluation. The process is repeated for each fold to acquire the average and the standard deviation of the detection rates. The recall is the ratio between the detected objects and all the objects. The recall is not 100%; hence, there are false negatives, i.e., some objects are not detected by the network. Since we are detecting objects in video sequences, there is no case where an object is missed for an entire appearance. A rare case of a frame with a missed object is handled by the tracking algorithm, where only objects that were not detected for consecutive frames were deregistered.

Several methods are used for evaluating the safety and robustness of deep learning-based systems in [61]-[64]. We also evaluated our systems using the statistical difference measures of SafeML [62], which are intended for safety and security monitoring. Table 2 shows the differences given by various distance measures for the dataset used for YOLO. Five measures were selected for evaluation: Kolmogorov-Smirnov Distance (KSD), Kuiper Distance, Anderson-Darling Distance (ADD), Wasserstein Distance (WD), and Wasserstein-Anderson-Darling Distance (WAD), which combines WD and ADD. Fourfold cross-validation [60] is used for the evaluation. Error values are sufficiently small such that the results are acceptable in both the Unity 3D and CCTV datasets for YOLO. In all cases, WAD gave the smallest error.
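The way the tracker tolerates occasional missed detections can be sketched with a small centroid tracker. This is a minimal illustration assuming greedy nearest-neighbor matching on Euclidean distance; the class name and the parameters `max_distance` and `max_missed` are hypothetical, and their values would need to be tuned for a given site and frame rate.

```python
import math


class CentroidTracker:
    """Assigns i.d. numbers by Euclidean distance and tolerates brief misses."""

    def __init__(self, max_distance=50.0, max_missed=5):
        self.max_distance = max_distance  # largest jump accepted between frames
        self.max_missed = max_missed      # consecutive misses before deregistering
        self.tracks = {}                  # i.d. -> last known (x, y)
        self.missed = {}                  # i.d. -> consecutive missed frames
        self.next_id = 1

    def update(self, detections):
        unmatched = list(detections)
        for tid, pos in list(self.tracks.items()):
            best = min(unmatched, key=lambda d: math.dist(d, pos), default=None)
            if best is not None and math.dist(best, pos) <= self.max_distance:
                self.tracks[tid] = best
                self.missed[tid] = 0
                unmatched.remove(best)
            else:
                self.missed[tid] += 1
                if self.missed[tid] > self.max_missed:
                    del self.tracks[tid], self.missed[tid]  # deregister
        for det in unmatched:  # register newly appearing objects
            self.tracks[self.next_id] = det
            self.missed[self.next_id] = 0
            self.next_id += 1
        return dict(self.tracks)
```

An object that is missed in a single frame keeps its i.d. number and last known location, so a short detection gap does not break its trajectory.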
The detection, tracking, and trajectory estimation allowed us to obtain the SZ of moving heavy equipment, which is adaptive to the trajectories and speeds. Fig. 5 shows examples of online SZ estimation for moving heavy equipment. The SZ of heavy equipment is a circular sector. The radius of the circular sector depends on the speed of the heavy equipment. The forklift in the images (a) decelerated upon spotting a worker in front of it. It can be observed that the SZ shrinks as the forklift decelerates. The trajectories and velocities of the forklifts are shown as red and cyan lines, respectively, in (b). Fig. 6 shows another example, where a truck was turning left. It can be observed that as the truck in the images (b) was turning left, the circular sector was directed in the turning direction. The trajectories and velocities of the truck are shown as blue and cyan lines, respectively, in (b). The extents of the circular sectors are controlled by the parameters $\alpha^w$, $\alpha^e$, $\theta^w$, and $\theta^e$, which were determined through experiments.

3) Online Safety Zone Estimation for Elevated Platforms
YOLACT [48] was used to segment pixels for workers, pallets, and stairs. Images in the loading dock were acquired while workers performed loading and unloading in various pallet formations over several days. Pixels belonging to the objects were labeled. A total of 2007 images, including 25433 labeled objects, were used for the training. Data augmentation was implemented as well. The network was implemented with Keras and PyTorch using an NVIDIA RTX 2080 Ti GPU. Adam was used as the optimizer. The learning rate was set to $1.0 \times 10^{-3}$ with decay. Random batches were used with a batch size of 5. The accuracy of object detection was evaluated using 863 images with 7691 labeled objects prepared separately for testing. For evaluation, we also prepared images in two 3D worlds similar to the loading dock. The movement of workers and pallets was simulated, while a view from a virtual camera at vantage points was acquired. YOLACT was trained with the labeled images. YOLACT provides both bounding boxes and segmentation of pixels of the detected objects. Table 3 shows the accuracy of object detection in terms of the mean average precision (mAP) and the recall based on the bounding boxes of the fourfold cross-validation. Table 4 shows the differences given by various distance measures for the dataset used for YOLACT. The five measures used in the YOLO evaluation were also selected for YOLACT. Fourfold cross-validation [60] is used for the evaluation. Similar to the YOLO case, the error values are sufficiently small such that the results are acceptable in both the Unity 3D and CCTV datasets for YOLACT. In all cases, WAD again gave the smallest error.
We used the segmentation results to estimate the SZ, which was determined to be the edge of the segmented pallets. Hence, how accurately YOLACT segments the pixels is important for accurate estimation of the SZ. The pixelwise segmentation accuracy was evaluated with the intersection over union (IoU) [65], which is given by

$$\text{IoU} = \frac{TP}{TP + FN + FP},$$

where TP, FN, and FP are the number of pixels in the true positive, false negative, and false positive segmentation, respectively. Table 5 shows the segmentation accuracy. It can be observed that the estimated segmentation accurately overlaps the ground truth segmentation. Morphological operations were applied to obtain the SZ for the elevated platform. For morphological dilation and erosion, a 5×5 rectangle was used as the structural element. Fig. 7 illustrates the procedure of obtaining the SZ. The segmentation of the pallets in (a) was mapped to the top-view plane to form a binary mask in (b). The iterations of dilation followed by the iterations of erosion close the narrow gaps between the pallets, as given in (c). The number of iterations determined the sizes of the gaps to be closed. The map in (c) was shrunk by the iterations of erosion, the result of which is shown in (d). The shrunk map was subtracted from the closed map in (c) to determine the edges around the combined pallets. The width of the edges was determined by the iterations of the erosion. The SZ for the elevated platform formed by the three pallets is shown in (e). It was mapped to the acquired image using the inverse perspective transformer. The SZ overlay on the image is shown in red in (f). Fig. 8 shows examples of SZ estimation for the elevated platform. In (a) and (b), the pallet in the front moved in and formed the workspace shown in (c). Following unloading, the middle pallet moved out in (d). The SZ is estimated for each frame to accommodate the changing formation of the elevated platform.
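The pixelwise IoU used for the segmentation accuracy can be sketched for two binary masks as follows. The masks in the usage lines are hypothetical, and returning 1.0 when both masks are empty is a convention assumed for this sketch.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as lists of rows."""
    tp = fn = fp = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif t:
                fn += 1
            elif p:
                fp += 1
    denom = tp + fn + fp
    return tp / denom if denom else 1.0  # assumed convention for empty masks


# Hypothetical 2 x 3 masks: two pixels agree, one is missed, one is spurious.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, truth))
```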

1) Safety Zone Violation for Moving Heavy Equipment
Workers and heavy equipment were detected by YOLO. The center of the base of the bounding box of each worker and heavy equipment was tracked in time to find the trajectory. The ground truth trajectory was difficult to obtain for the CCTV images. We used Unity 3D images, in which we were able to obtain the exact locations of objects that we placed in the 3D world, to evaluate the accuracy of the estimated trajectories. Fig. 9 shows examples of trajectory estimation using Unity 3D images. The trajectories in the top-view plane are shown, where the estimated trajectories and the ground truth are marked with blue and red lines, respectively. Table 6 shows the average error between the estimated and true trajectories of the fourfold cross-validation. Five video sequences from two virtual cameras at two plants were used. The average error was 6.82 pixels. By using the physical dimension of pixels obtained through the checkerboard patterns in Fig. 4, the average error of the trajectory estimation converted to meters was 0.24 meters.
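The trajectory error and its conversion to meters can be sketched as follows. The sketch assumes that the estimated and ground truth trajectories are sampled at the same frames and that a single isotropic scale in meters per pixel is adequate; the function name and the numbers in the usage lines are hypothetical and are not the values reported in Table 6.

```python
import math


def mean_trajectory_error(estimated, truth, metres_per_pixel):
    """Mean Euclidean error between two trajectories, in pixels and in meters."""
    errors = [math.dist(e, g) for e, g in zip(estimated, truth)]
    mean_px = sum(errors) / len(errors)
    return mean_px, mean_px * metres_per_pixel


# Hypothetical two-frame trajectories and pixel size.
err_px, err_m = mean_trajectory_error([(0, 0), (3, 4)], [(0, 0), (0, 0)], 0.02)
```

When the pixel dimensions differ noticeably along the two axes, the horizontal and vertical error components would need to be scaled separately before taking the norm.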
An SZ violation occurs when the SZs of a worker and heavy equipment overlap. Fig. 10 shows examples of SZ violations for moving heavy equipment using CCTV images. Images with overlaid SZs are shown in (a), and the trajectories of objects in the top-view plane are shown in (b). We also showcase examples of SZ violations using Unity 3D images in Fig. 11. SZ violation detection for moving equipment is extended to cases where the locations of moving equipment are provided from outside. Figs. 12 and 13 show examples of SZ violations for moving cranes using CCTV and Unity 3D images, respectively. The locations of the crane head are provided by crane control systems. The SZ is the union of the circles under the current and future crane locations with the radius proportional to the height of the crane. An SZ violation occurs when a worker's SZ overlaps with the crane's SZ. The trajectories and velocities of the crane head are shown in red and cyan lines, respectively, in (b).

2) Safety Zone Violation for Elevated Platforms
YOLACT returns both the bounding boxes and segmentation of detected objects. The base of the bounding boxes is tracked in the top-view plane to obtain the trajectories. Fig. 14 shows examples of trajectory estimation using the Unity 3D images. Blue and red lines indicate the estimated and ground truth trajectories, respectively. Table 7 shows the average error of trajectory estimation of the fourfold cross-validation, which was 13.00 pixels, or equivalently 0.16 meters.
To evaluate the accuracy of a worker's gaze direction estimation, we prepared test videos of a worker wearing a helmet with different colors on the front and back. The front half of the helmet is colored white, while the back half is colored blue. Fig. 15 illustrates the procedure of the gaze direction estimation. From the helmet image inside the bounding box in (a), binary maps for the white and blue parts of the helmet are obtained by thresholding in the HSV color space, which are shown in (b) and (c), respectively. The gaze direction is then estimated as the vector from the center of mass of the blue (back) mask to that of the white (front) mask. The estimated gaze direction is overlaid in (d).
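A minimal sketch of this procedure is given below, using NumPy only. It assumes that the helmet crop has already been converted to HSV with OpenCV-style value ranges (H in 0 to 179, S and V in 0 to 255), that the gaze points from the back (blue) centroid toward the front (white) centroid, and that angles are measured counterclockwise from the right horizontal direction with the image y axis flipped. The threshold ranges and function names are hypothetical.

```python
import numpy as np

def color_mask(hsv, lower, upper):
    """Boolean mask of pixels whose H, S, and V all lie within [lower, upper]."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return np.all((hsv >= lower) & (hsv <= upper), axis=-1)

def centroid(mask):
    """Center of mass (x, y) of a boolean mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return np.array([xs.mean(), ys.mean()])

def gaze_direction_deg(hsv, white_range, blue_range):
    """Angle in degrees of the vector from the blue (back) to the white (front) centroid."""
    front = centroid(color_mask(hsv, *white_range))
    back = centroid(color_mask(hsv, *blue_range))
    if front is None or back is None:
        return None
    dx, dy = front - back
    # The image y axis points down; negate dy so counterclockwise is positive.
    return float(np.degrees(np.arctan2(-dy, dx)))

# Hypothetical thresholds: low-saturation bright pixels are "white".
WHITE = ((0, 0, 200), (179, 40, 255))
BLUE = ((100, 100, 50), (130, 255, 255))

# Synthetic helmet crop: blue on the left half, white on the right half.
helmet = np.zeros((4, 8, 3), dtype=np.uint8)
helmet[:, :4] = (120, 200, 200)
helmet[:, 4:] = (0, 0, 255)
print(gaze_direction_deg(helmet, WHITE, BLUE))
```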
Examples of videos prepared to evaluate the detection of backstepping workers are shown in Fig. 16. Three scenarios are shown: a worker standing while looking around, moving forward, and moving backward. The worker's movement and gaze directions are plotted in red and blue, respectively. Directions range from -180 degrees to 180 degrees, where the right horizontal direction is zero degrees. In (a), the worker stands facing front while looking left and right, and the gaze direction varies between -180 and 0 degrees. For moving workers, the difference between the moving and gaze directions is small while stepping forward and large while stepping backward, as shown in (b) and (c).
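The comparison between moving and gaze directions can be sketched as a wrapped angular difference followed by a threshold. The threshold of 120 degrees and the minimum speed used to ignore a standing worker are hypothetical values chosen for illustration; the paper does not state the values it uses.

```python
import math

def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def moving_direction_deg(p_prev, p_curr):
    """Direction of motion between two top-view positions, in degrees."""
    dx = p_curr[0] - p_prev[0]
    dy = p_curr[1] - p_prev[1]
    return math.degrees(math.atan2(dy, dx))

def is_backstepping(move_deg, gaze_deg, speed, min_speed=0.1, threshold_deg=120.0):
    """Flag backstepping when a moving worker's motion opposes the gaze direction.

    min_speed and threshold_deg are hypothetical tuning parameters.
    """
    if speed < min_speed:
        return False  # A standing worker who looks around is not backstepping.
    return abs(wrap_deg(move_deg - gaze_deg)) > threshold_deg

# Worker moves to the right (0 degrees) while facing left (180 degrees).
move = moving_direction_deg((0.0, 0.0), (1.0, 0.0))
print(is_backstepping(move, 180.0, speed=1.0))
```

Wrapping the difference matters near the ±180 degree boundary: directions of 170 and -170 degrees differ by only 20 degrees, not 340.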
We considered three types of SZ violations for the elevated platform: a worker staying at the edge of the platform for a long duration, a worker walking toward the edge of the platform, and a worker backstepping on the platform. Fig. 17 shows examples of workers staying at the edge for a long duration. The images and the corresponding violation situations in the top-view plane are shown in (a) and (b), respectively. The workers with and without SZ violations are shown in red and green, respectively. The duration of the stay was set to 3 seconds. Fig. 18 shows examples of the SZ violation where workers are walking onto the edge of the platform. The images and the corresponding violation situations, along with the trajectories of workers in the top-view plane, are shown in (a) and (b), respectively. Workers reaching the edges of the platform were detected and marked in red. Fig. 19 shows examples of the SZ violation where workers are backstepping on the platform. Accidents have been reported in which workers have stepped backward and fallen. However, such incidents are rare and are dangerous to enact for evaluation. Hence, we played the recorded videos backward so that all workers appear to step backward. The images and the corresponding violation situations, along with the trajectories of workers in the top-view plane, are shown in (a) and (b), respectively. All workers were detected as backstepping and marked in red. Fig. 20 shows examples of the three cases of SZ violation using Unity 3D images. For the backstepping cases, the videos were also played backward for evaluation. All the SZ violations were appropriately detected in the simulation using Unity 3D images. SZ violations for elevated platforms can be extended to consider fixed areas. Figs. 21 and 22 show examples of SZ violations for fixed areas in front of the part storage area. The selected fixed areas are transformed to the top-view plane and are shown in gray in (b). An SZ violation occurs when a worker stays in the SZ for longer than a predefined period.
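The dwell-time rule for fixed areas can be sketched as a point-in-polygon test on each top-view trajectory point combined with a counter of consecutive frames spent inside the area. This is an illustrative sketch, not the implementation used in the study: the polygon, the frame rate, and the function names are hypothetical, and the default of 3 seconds simply reuses the duration quoted for the platform-edge case.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # The edge straddles the horizontal ray.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def dwell_violations(trajectory, zone, fps, max_dwell_s=3.0):
    """Per-frame flags that become True once the stay exceeds max_dwell_s."""
    flags = []
    frames_inside = 0
    for point in trajectory:
        if point_in_polygon(point, zone):
            frames_inside += 1
        else:
            frames_inside = 0  # Leaving the area resets the timer.
        flags.append(frames_inside / fps > max_dwell_s)
    return flags

# Hypothetical fixed area and a worker who stays for five frames at 1 frame/s.
zone = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
track = [(2.0, 2.0)] * 5 + [(10.0, 10.0)]
print(dwell_violations(track, zone, fps=1.0))
```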

V. CONCLUSION
The DNN-based object detection and segmentation methods provided accurate and efficient detection and segmentation of multiple objects that were adequate for safety monitoring in workplaces. Time-varying SZs involving moving objects and set operation-based SZ violation definitions allowed us to monitor various SZ violation scenarios that are currently monitored by safety monitoring teams in workplaces. Safety and robustness measures for object detection are also provided by the evaluation method. The proposed methods are currently incorporated into existing site monitoring systems, from which feedback is being collected. The proposed methods can be easily extended to various workplaces with their own safety monitoring requirements. However, object detection using a single CCTV camera has a limitation in detecting obscured objects. In future work, we will improve detection and localization performance by incorporating multiple views of a workplace.