Generating Synthetic Depth Image Dataset for Industrial Applications of Hand Localization

In this paper, we focus on the problem of applying domain randomization to produce synthetic datasets for training depth image segmentation models for the task of hand localization. We provide new synthetic datasets for industrial environments suitable for various hand tracking applications, as well as ready-to-use pre-trained models. The presented datasets are analyzed to identify the characteristics that affect the generalizability of the trained models, and recommendations are given for adapting the simulation environment to achieve satisfactory results when creating datasets for specialized applications. Our approach is not limited by the shortcomings of standard analytical methods, such as dependence on color, specific gestures, or hand orientation. The models in this paper were trained solely on a synthetic dataset and were never trained on real camera images; nevertheless, we demonstrate that our most diverse datasets allow the models to achieve up to 90% accuracy. The proposed hand localization system is designed for industrial applications where the operator shares the workspace with the robot.

The proposed hand localization system is intended for industrial applications in which the operator shares the workplace with the robot and they collaborate in close proximity.

In human-machine interaction, hand gesture recognition and processing is a key topic because gestures represent a natural way for humans to communicate non-verbally. Using a recognized gesture, we can create a specific command to control the robot; by knowing the position of the hand in the workspace, we can guide the robot to a specific location; finally, information about the presence of the hand in the workspace can be used for safety measures. Camera-based safety systems often use detection and tracking to localize the operator in the workspace and, potentially, adapt the technological process [1], [2], [3] in order to ensure the safety of each operator. In our approach, we focus on the localization of fingers, hands, and whole arms in the workplace from a top-view camera. The output of the hand recognition system can be used to control robotic applications through various gesture-based interfaces [4], [5], [6].

In this paper, we focus on the problem of semi-automatic generation of synthetic datasets for training depth-image segmentation models for the task of hand localization. Our work contributes to the topic by providing new synthetic datasets for industrial environments suitable for various hand-tracking applications, as well as ready-to-use pre-trained models. We also provide a simulation scene that can be customized and optimized for specific applications. We further elaborate on and analyze the characteristics of these datasets that affect the resulting generalization ability of the trained models.

The majority of localization methods use RGB camera images because these cameras are widely available.

The most important step in training a machine learning model is to obtain a dataset sufficiently large to accurately represent the real-world domain in a wide range of circumstances and contexts. A typical approach for obtaining such a dataset is to manually label hundreds of thousands of images to define the ground truth for each image, or to use third-party labeled datasets available on various platforms. While obtaining sample data can be relatively straightforward, the subsequent labeling of the data can take an enormous amount of time, depending on the complexity of the scene and the desired diversity of labels. Melireddi et al. demonstrated the use of coloured gloves [15] to label hand regions in the created dataset images. Another approach to labelling automation is the assumption that the hand is the closest object to the camera [16], which can then be found by applying a color threshold to the image. An alternative approach [17] is to use tracking sensors, or infrared markers [18] fastened to the hand and fingers, to automatically generate the labels.
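As an illustration of the closest-object assumption, a depth-based sketch of such a labeling heuristic is shown below; the margin value and the function name are our own assumptions, not details taken from the cited works.

```python
import numpy as np

def closest_object_mask(depth_m, margin_m=0.15):
    """Naive automatic labeling sketch: assume the hand is the closest
    object to the camera and keep all pixels within a small depth margin
    of the nearest valid measurement."""
    valid = depth_m > 0                      # zero depth = missing data
    nearest = depth_m[valid].min()           # depth of the closest surface
    return valid & (depth_m <= nearest + margin_m)
```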

Datasets obtained by collecting images from a real camera have the advantage of being close to the real domain, but at the same time they suffer from a limited range of captured conditions, since the environment arrangements provided during acquisition are limited and usually cannot cover the full range of scenarios (e.g., changing positions of obstacles in the view, lighting, reflections, shadows).

Synthetic datasets provide an alternative to traditional manual collection and labeling. These datasets are created programmatically by simulating a domain, by combining real images with a known ground truth into new ones [19], or by combining these methods and overlaying labeled objects over a randomized simulated scene [20]. Each method allows producing arbitrarily large, fully-labeled datasets [21]. A prepared simulated environment allows generating extensive synthetic datasets by adjusting the conditions and applying augmentations to the generated images [22]. Keskin et al. demonstrated synthetic dataset generation based on a fully simulated scene [23]. They used a 3D skinned mesh with a skeleton defining parts of the hand and links of the fingers, which was used for both animating the mesh and creating the ground truth labels. This solution reduces the cost of preparing datasets while increasing data diversity and labeling accuracy.

In general, approaches to generating synthetic datasets can be divided into two main groups depending on their appearance: realistic and randomized datasets. Realistic datasets have an obvious advantage: they are very similar to the real environment, which allows the model to learn important realistic characteristics of the domain. However, the use of synthetic data entails the so-called "reality gap", which is the inability to fully reproduce real-world data for numerous reasons, including textures, lighting, and complex domain specifics. All appearances generated by realistic simulations can only cover a user-defined range of conditions, e.g., daylighting, programmed object positions, and interactions. Thus, these generated environments represent only a subset of all the conditions that may occur in reality. Achieving higher photo-realism with high-fidelity rendering engines comes at the cost of computational resources and rendering time. In an attempt to mitigate the "reality gap", the opposite approach, domain randomization, deliberately randomizes the appearance of the simulated scene instead of pursuing realism.

An additional policy ensured that the camera was rotated so that the hand was at least partially visible from the selected viewpoint and that the random background image did not contain a person.

Mueller et al. [29] presented pose and shape reconstruction of interacting hands, with a model trained on a synthetic dataset containing depth image samples complemented with RGB-encoded segmentation masks, where the color represented correspondence to vertices of a MANO hand model [28]. The model was additionally trained on real camera data to help it generalize, since the generated dataset did not contain any augmentations or background obstacles.

An extensive review of available depth-based hand datasets is given in [30]. However, each of the discussed publicly available datasets possesses one or several of the following disadvantages:

• reliance solely on RGB information;
• the assumption that the hand covers the majority of the image area;
• the assumption that the hand is the closest object to the camera;
• the absence of obstructions around the hand.

In addition, the available datasets assume a different placement of the camera than in our specific industrial workplace. These factors motivated the creation of our own customized dataset generator and the subsequent training of the network.

We focused on depth image datasets because depth capture is less sensitive to light, color, and texture, and instead captures shape.

In order to localize the hand in the scene, we propose a method based on a convolutional neural network trained on a synthetic dataset. The dataset is generated in a simulated environment that is set up according to the testing scenario in the real workspace. We compare the effects of different augmentations and simulated scene settings on the resulting accuracy of the trained neural network by evaluating the quality of the segmentation on a testing dataset that comprises images obtained from a real sensor [31].

The parameters and general appearance of the simulated scene were set with respect to the presumed use in the industrial application and in close correspondence to our experimental workplace (Figure 1). In the workplace, a single depth camera heading downwards is mounted 1 meter above the work table. The initial experiments were carried out using a setup with the camera placed 1 m above the floor when no robot was involved. We utilized an Intel RealSense D435 RGB-D camera as the sensor for capturing the depth images.

Random obstacle objects are placed in the simulated scene because, in industrial conditions, it is not possible to ensure that the hand is always the closest object to the camera. The positions and orientations of these objects are random, but they are governed by a policy which ensures that a created object never overlaps the hand (considering the fingertips and palm centre point) in the camera view. If an overlap is found, the object is moved to a different position until the requirement is satisfied.
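A minimal sketch of such a rejection-sampling placement policy is shown below. For simplicity it works in image (pixel) coordinates rather than in the 3D scene; the minimum distance and retry limit are illustrative assumptions, not the values used in our simulation.

```python
import numpy as np

def place_obstacle(hand_keypoints_px, image_size=(320, 240),
                   min_dist_px=25, rng=None, max_tries=100):
    """Draw random obstacle positions until the candidate keeps a minimum
    pixel distance from every hand keypoint (fingertips, palm centre)."""
    rng = rng or np.random.default_rng()
    w, h = image_size
    for _ in range(max_tries):
        pos = rng.uniform((0, 0), (w, h))        # candidate position in pixels
        dists = np.linalg.norm(hand_keypoints_px - pos, axis=1)
        if dists.min() >= min_dist_px:           # no overlap with the hand
            return pos
    raise RuntimeError("no valid obstacle position found")
```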

The neural network used in the experiment is implemented using TensorFlow. The architecture is based on U-Net [32], which is a fast convolutional network for accurate image segmentation. It has a contracting part consisting of convolutional layers and max pooling operations (see Figure 6); this part is responsible for capturing the context. The symmetrical expanding part of the network provides precise localisation. The contracting part of our network consists of 5 convolutional layers with an increasing number of filters which are multiples of 16. The convolutional layers are followed by pooling layers of size 2 × 2. The expanding side is a set of upsampling deconvolution blocks.

We trained and validated models using the generated datasets (A, B, C, D, CD) of different sizes (2,346, 34,166, and 270,106 images). The fifth dataset (CD) was generated by randomly combining the C and D datasets.
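The following is a minimal TensorFlow/Keras sketch of a U-Net of the shape described above. The kernel sizes, activation functions, and exact block composition are our assumptions, since the text only fixes the number of levels, the 2 × 2 pooling, and filter counts that are multiples of 16.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(240, 320, 1), base_filters=16, depth=5):
    """U-Net-style segmentation network: a contracting path of
    convolution + 2x2 max-pooling blocks followed by a symmetric
    expanding path of upsampling blocks with skip connections."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    skips = []
    # Contracting part: filter counts are multiples of base_filters.
    for level in range(depth):
        filters = base_filters * (2 ** level)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        if level < depth - 1:
            skips.append(x)
            x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    # Expanding part: upsampling (transposed-convolution) blocks.
    for level in reversed(range(depth - 1)):
        filters = base_filters * (2 ** level)
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[level]])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Single-channel sigmoid output: per-pixel hand / background mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```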

All models were trained for 8 epochs with an initial learning rate of 0.001 and 32 images per mini-batch. 20% of each corresponding dataset was used for validation. For better sensitivity to both hand sides, we used horizontal flipping of the training images.
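A training sketch under these settings might look as follows. The placeholder arrays stand in for the generated depth/mask pairs, and the loss and optimizer choices are our assumptions; the paper fixes only the epochs, learning rate, batch size, validation share, and horizontal flipping.

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the generated dataset; in the real
# pipeline these arrays come from the simulated-scene renderer.
images = np.random.rand(256, 240, 320, 1).astype("float32")
masks = (np.random.rand(256, 240, 320, 1) > 0.5).astype("float32")

def flip_pair(image, mask):
    # Mirror image and mask together so the model sees both hand sides.
    do_flip = tf.random.uniform(()) > 0.5
    image = tf.cond(do_flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(do_flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return image, mask

split = int(0.8 * len(images))  # 80% training / 20% validation
train_ds = (tf.data.Dataset.from_tensor_slices((images[:split], masks[:split]))
            .map(flip_pair)
            .batch(32))
val_ds = tf.data.Dataset.from_tensor_slices((images[split:], masks[split:])).batch(32)

model = build_unet()  # from the architecture sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.BinaryIoU(target_class_ids=[1])])
model.fit(train_ds, validation_data=val_ds, epochs=8)
```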

In order to compare the impact of the datasets, we propose an evaluation dataset captured by a real sensor as a benchmark and use it to measure the quality of the predictions generated by the trained neural network models. This benchmark provides a quantitative evaluation using the mean intersection over union (IoU) metric, which is the ratio between the overlap area and the union area of the predicted and baseline regions. In addition, we perform a qualitative evaluation of the obtained results, where we examine the predictions and explain the reasons for the quantitative results. The captured scenes include challenging conditions such as direct sunlight and can generally be considered more complex than we would expect for industrial use.
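For reference, a straightforward implementation of the mean IoU metric as defined above might look like this (the convention for an empty union is our assumption):

```python
import numpy as np

def mean_iou(pred_masks, true_masks, threshold=0.5):
    """Mean intersection over union across a set of predictions:
    IoU = |prediction AND ground truth| / |prediction OR ground truth|."""
    ious = []
    for pred, true in zip(pred_masks, true_masks):
        p = pred >= threshold
        t = true >= threshold
        union = np.logical_or(p, t).sum()
        if union == 0:           # both masks empty: treat as a perfect match
            ious.append(1.0)
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```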

The data in Table 2 show that the simplest datasets (A, B) had difficulty generalizing, which apparently caused their over-fitting; the more complex datasets, however, allowed the networks to generalize successfully, and over-fitting did not occur.

Comparing the results from Table 2 and Figure 9, it can be observed that when the dataset contains no obstacles, the number of false-positive predictions is high. Random background objects added to the dataset increase the accuracy of the predictions. Additional noise may increase the sensitivity to shape details; nonetheless, it has to be balanced to avoid type I errors.
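A sketch of the kind of noise augmentation discussed here is shown below; the Gaussian spread and dropout probability are illustrative values, not the settings used for our datasets.

```python
import numpy as np

def add_depth_noise(depth_img, sigma=2.0, dropout_p=0.01, rng=None):
    """Add Gaussian noise and random invalid-pixel dropout to a byte-range
    depth image, roughly mimicking real depth-sensor artifacts."""
    rng = rng or np.random.default_rng()
    noisy = depth_img.astype(np.float32) + rng.normal(0.0, sigma, depth_img.shape)
    dropout = rng.random(depth_img.shape) < dropout_p
    noisy[dropout] = 0.0                     # zero encodes a missing measurement
    return np.clip(noisy, 0, 255).astype(np.uint8)
```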

To compare our dataset with existing work in this area, we adapted several well-known publicly available RGB-D and depth-based hand datasets (see Table 3); their descriptions were presented in Section II.C. The following modifications were applied to adapt the labeled inputs of the datasets (a code sketch of these adaptations appears below):
• Single-class masks (HandSeg: both hands' masks were merged; DenseHands: the binarized dense correspondence was used as the hand mask; RHD: all masks except hands were filtered out; ObMan: all masks except hands were filtered out).

• The 0-1 m depth range was mapped to the 0-255 byte range according to the settings in the test environment; the remaining range was truncated to the 1 m boundary.

The dataset adaptation code is available in the GitHub repository [31]. Our dataset input pipeline automatically adapted all images to 320 × 240 resolution. We then trained the U-Net model using the adapted datasets with an 80%/20% training-validation split and equivalent training settings. The trained models were evaluated on a set of real camera data representing the expected environment. DenseHands, RHD, and HandSeg contained only hand masks. Table 2 shows that the difference in performance between the 34k and 270k datasets is not significant. The lower scores in Table 3 are partially due to the assumptions listed above and to the fact that the target environment varies. For our task, these assumptions cannot be guaranteed; therefore, when creating our dataset, we tried to avoid these shortcomings by making the modelled scene contain random obstacles that force the network to distinguish the hand from its surroundings.
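A minimal sketch of the two adaptations referenced above could be implemented as follows; the function names are ours, and the actual adaptation code is in the repository [31].

```python
import numpy as np

def merge_hand_masks(left_mask, right_mask):
    """Single-class mask: fold separate per-hand masks into one binary
    hand/background mask (the HandSeg-style adaptation)."""
    return np.logical_or(left_mask > 0, right_mask > 0).astype(np.uint8)

def depth_to_byte(depth_m, max_range_m=1.0):
    """Map the 0-1 m metric range to the 0-255 byte range; depths beyond
    max_range_m are truncated to the 1 m boundary before scaling."""
    clipped = np.clip(depth_m, 0.0, max_range_m)
    return np.round(clipped / max_range_m * 255.0).astype(np.uint8)
```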

The results in Table 3 correspond to the input data of the trained networks. The DenseHands dataset is highly similar to our dataset A, where no objects and no noise are present in the scene and the training process tends to over-fit; the images of this dataset also feature low variability in hand position and orientation. A slight improvement in the results can be observed for the ObMan dataset, which includes several objects in the surroundings, yet the position of the hand is mostly in the middle of the image and at approximately the same depth. Better results are shown by the HandSeg dataset, which is not synthetic and has a natural representation of the images acquired by the camera; however, the low variability of the dataset features causes the trained model to perform significantly worse than with our presented dataset under the specified environmental conditions. The high variability of the images in the RHD dataset improves the results, but the absence of noise that could make the synthetic dataset look similar to actual camera images limits the quality of the predictions compared to a network trained on our dataset.

The initial experiment with the camera mounted above the ground, with common items serving as obstacles, was extended to a real-world scenario. We used an existing workplace with an industrial collaborative robot UR3e to test the trained models. Figure 1a shows the real workplace with the robot.

A sufficiently diverse dataset covers a range of conditions that allows the model to generalize. In terms of the labor required to prepare and acquire the dataset, the synthetic dataset in our case remains the advantageous option, since the simulation can easily be extended and adapted to any specific workspace. For synthetic datasets, the most time-consuming operations are the preparation of the simulation, the definition of the collision rules for the obstacles, and the selection of augmentations; these, however, need only be set once, after which the dataset creation process is simple, and generating an arbitrary number of images takes little time compared to the manual arrangement and labelling of images.

The conventional approach of collecting a dataset from real camera images requires manual compositing of the workspace to create a sufficiently large and diverse dataset, with subsequent manual labeling, which is much more labor-intensive. Repeatedly rearranging elements in the workspace by hand to provide enough diversity for the trained network to generalize is tedious and time-consuming, and still cannot come close to the diversity of scenes created with domain randomization.

In terms of performance, a sufficiently complex scene with an arbitrary number of added obstacle objects (which represent objects present in the real environment) does not affect the performance of the simulation, since it uses neither complex rendering nor physical simulation. The simulation can be further optimized to achieve even better performance.

In this paper, we focused on generating synthetic datasets for training depth image segmentation models for the hand localization task. The use of a domain randomization technique enabled the rapid generation of an arbitrarily large synthetic dataset that included a wide range of samples with features important for accurate hand localization. The evaluations performed on the trained models allowed us to analyze the effects of the complexity of the dataset and of the additional post-processing augmentations on the resulting image segmentation accuracy. Moreover, these benchmarks allowed us to identify the version of the dataset with the highest accuracy, over 90%. We provide new synthetic datasets for industrial environments suitable for various hand-tracking applications, as well as ready-to-use pre-trained models and simulation scenes that can be used to create custom datasets.

In the future, we plan to extend the dataset generator to enable a simpler and more user-friendly solution for adapting the simulation to the requirements of the real workspace.