Increasing the Robustness of Deep Learning Models for Object Segmentation: A Framework for Blending Automatically Annotated Real and Synthetic Data

Recent problems in robotics can sometimes only be tackled using machine learning technologies, particularly those that utilize deep learning (DL) with transfer learning. Transfer learning takes advantage of pretrained models, which are later fine-tuned using smaller task-specific datasets. The fine-tuned models must be robust against changes in environmental factors such as illumination since, often, there is no guarantee for them to be constant. Although synthetic data for pretraining has been shown to enhance DL model generalization, there is limited research on its application for fine-tuning. One limiting factor is that the generation and annotation of synthetic datasets can be cumbersome and impractical for the purpose of fine-tuning. To address this issue, we propose two methods for automatically generating annotated image datasets for object segmentation, one for real-world and another for synthetic images. We also introduce a novel domain adaptation approach called filling the reality gap (FTRG), which can blend elements from real-world and synthetic scenes in a single image to achieve domain adaptation. We demonstrate through experimentation on a representative robot application that FTRG outperforms other domain adaptation techniques, such as domain randomization or photorealistic synthetic images, in creating robust models. Furthermore, we evaluate the benefits of using synthetic data for fine-tuning in transfer learning and continual learning with experience replay using our proposed methods and FTRG. Our findings indicate that fine-tuning with synthetic data can produce superior results compared to solely using real-world data.


I. INTRODUCTION
Modern robotic applications often need to function in dynamic environments, where certain aspects, such as illumination, clutter, and object occlusions, are constantly changing [1], [2]. Data-driven approaches, such as deep learning (DL), are commonly utilized in these scenarios. Robots can assess their environment based on visual data processing performed by DL models, providing them the required flexibility to operate under dynamic circumstances. The robustness of these models can be described as their capability to adapt or generalize to a range of settings with different environmental factors [1]. Models that can generalize to a large variety of settings with different lighting, clutter, and other conditions are considered more robust than models that cannot. In this article, we inspect two ways to achieve good generalization. One is through the use of continual learning [2], [3], where generalization comes from accumulating knowledge over time in a dynamic environment while avoiding forgetting the previously acquired knowledge [3]. The other is through transfer learning [4], [5], which relies on training datasets that force the models trained on them to generalize well. We show that both approaches can benefit from using synthetic training data for increasing model robustness, with the help of representative experimental setups for object segmentation and image recognition for robotic tasks.
For both approaches, the training data has great significance as it directly influences the generalization capabilities (and thus the robustness) of the trained models. It has been shown that models trained on a dataset coming from a source domain adapt to a target domain better if the source and target domains are similar [5]. Variations in the data distribution of the training set can also help the trained model accommodate the difference in the data distributions between the source and target domains [6], [7]. This suggests that training datasets that contain samples from the same or a similar domain the models will be used in and which incorporate sufficient variations in the data distribution are likely to result in more robust trained models. However, data collection with a real-world robot setup can be highly time consuming and expensive due to the sheer amount of data needed to incorporate variations to obtain robust models [8]. Furthermore, since many DL-based solutions use supervised learning techniques, training datasets often have to be annotated [9]. However, the manual annotation of the data is also very time consuming or practically impossible, especially in complex tasks, such as semantic scene segmentation or instance segmentation, which are often required in robotic applications. In addition, DL models are also known for their need for extensive training datasets. As a result, collecting and annotating the training data can be a massive roadblock to developing new DL-based solutions for robotics.
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
It is easy to see the great potential of using automated means for data collection/annotation and synthetic data for training DL models. Synthetic data can be generated rapidly and inexpensively with automated annotation, incorporating arbitrary variations in the data distribution. However, synthetic data for training DL models comes with a tradeoff by introducing a discrepancy between the domains from which the training data is generated and in which the models will be used. Consequently, models trained on synthetic datasets may not be able to adapt well to real-world data, even if they perform well on synthetic data. This phenomenon is known as the "reality gap." Overcoming this challenge (often referred to as "bridging the reality gap" [1], [6], [7]) is crucial for leveraging the benefits of synthetic datasets and ensuring that DL models trained on synthetic data are robust and generalize well to the real world.
There are two primary techniques for bridging the reality gap, regardless of the synthetic data generation approach. The first one is to create a synthetic dataset similar to a real-world dataset [10], [11], [12], [13], [14]. In the case of synthetic image data, this is achieved by creating photorealistic images. The other approach is domain randomization. This technique introduces unrealistic levels of variation in the synthetic dataset, which forces the models trained on such datasets to ignore the effects of the randomized factors and thus to generalize to the real-world data as well [6], [7], [13], [14]. Many approaches focus on what aspects and to what extent photorealism or domain randomization should be used and whether these two methods could be combined [13], [14], [15]. However, these approaches still try to bridge the reality gap, "building from only one side," by improving synthetic data generation. In this article, we propose a novel method, which we call filling the reality gap (FTRG), that takes advantage of an automated real-world dataset generation technique and a corresponding synthetic data generation pipeline and aims to overcome the reality gap by blending real-world and synthetic components inside images. With FTRG, we have complete control over which parts of an image come from the real-world data and which ones come from the corresponding synthetic images. Furthermore, real-world and synthetic parts can also be overlaid on top of each other by taking advantage of the alpha channel of the images. By continuously modifying the opacity, we can achieve a seamless transition between real-world and synthetic image components. In our experiments, we demonstrate how this approach can yield superior results when fine-tuning models for object segmentation compared to using datasets whose images are either completely synthetic or taken entirely from the real-world dataset.
The benefits of using a synthetic dataset are often discussed from the aspect of transfer learning. It is well established that pretraining on a large synthetic dataset and then fine-tuning on real-world data can be superior to training only on real-world data [12], [16]. For example, Nowruzi et al. [16] explored the benefits of having a synthetic dataset in addition to real-world annotated data for object detection. In their experiments, they trained a single-shot detector (SSD) [17] with MobileNet as the backbone [18] from scratch on a dataset containing both synthetic and real-world data. They found that the additional synthetic data significantly reduced the need for training on real-world samples (10%, 5%, and 2.5% of the original real-world dataset were used). Additionally, when training the model from scratch, synthetic pretraining and fine-tuning on real-world data outperformed simply training the model on a mixed dataset (consisting of both real-world and synthetic samples).
Creating a synthetic dataset is often seen as an alternative when one lacks a large annotated real-world dataset for pretraining. This article focuses on the benefits of using synthetic data for fine-tuning when an automatic real-data labeling pipeline or an already annotated real dataset is available. We consider the typical case when a pretrained model is already available, and only the fine-tuning step has to be completed. We show how synthetic data can improve the fine-tuning steps of transfer learning approaches for robotic manipulation and the training of continual learning methods using experience replay for image classification.
The contributions of this article include the following.
1) A real-data labeling procedure for robotic manipulation, which can automatically generate instance segmentation masks for images of real-world scenes.
2) A Blender-based annotation tool which can automatically generate instance segmentation masks for rendered synthetic images.
3) Experiments showing how synthetic rendered data can help boost the performance of continual learning methods using experience replay.
4) A novel way of creating a mixed reality dataset combining our automatic annotation techniques. This method is called "FTRG" as it combines synthetic and real-world data seamlessly.
5) Experimental results showing the benefits of using synthetic data in the fine-tuning step of transfer learning in instance segmentation, showcasing the potential of the FTRG method.

II. RELATED WORK
Novel techniques employing synthetic data have shown encouraging results in tasks that were previously deemed challenging, including transparent object detection, robotic cloth folding, and object rearrangement using visual data [19], [20], [21], [22]. One particularly successful approach, Dex-Net 2.0, leverages synthetic point cloud data and analytic grasp metrics to train a grasp quality convolutional neural network (GQCNN) for robotic grasp prediction [23].
Liu et al. [24] proposed a synthetic dataset generation pipeline for robotic picking with a vacuum gripper based on RGB-D data. They utilized Blender, an open-source 3-D computer graphics software [25], for generating photorealistic synthetic images and a learning-based (GAN) approach for overcoming the reality gap. They evaluated their approach against the popular Dex-Net 3.0 benchmark [26] and reported its performance on a real-world robotic picking setup. Their results revealed that the proposed approach can deal robustly with multiobject picking and challenging novel objects (transparent or small objects). Many recent scientific results utilize similar GAN-based image-to-image domain adaptation approaches to overcome the reality gap [27], [28], [29]. Contrary to these solutions, FTRG takes advantage of real-world images directly rather than relying on a learning-based approach to extract features that transfer well to the real-world domain.
Similarly to the solution of Liu et al., Denninger et al. [30] also utilized Blender for their synthetic dataset generator, BlenderProc. BlenderProc is entirely Blender-based and can generate photorealistic rendered images and corresponding segmentation, depth data, surface normals, etc. The potential of BlenderProc is demonstrated in indoor scene segmentation examples, but it could also be adapted to robotic tasks.
Greff et al. [31] introduced Kubric, a synthetic dataset generator utilizing PyBullet [32] and Blender. It is an extremely versatile framework that can generate photorealistic renders and various annotation types, such as segmentation masks, pose, optical flow, surface normals, etc. Several models trained using synthetic data generated from Kubric have been shown to achieve outstanding results in various fields, such as point tracking, semantic segmentation, salient object detection, pose estimation, etc. [31], which demonstrates its effectiveness and flexibility.
Our synthetic data generation pipeline also utilizes Blender to create photorealistic renders and corresponding annotations. However, instead of creating an extensive framework, such as Kubric or BlenderProc, we aimed to keep our solution lightweight and implemented it as a Blender add-on. As a result, it can be used with no additional installation steps (only Blender is needed), and the pipeline can be applied to a wide range of scenarios (not limited to robotics solutions).

A. Photorealistic Synthetic Data
Dvornik et al. [33] proposed the so-called copy-paste method for augmenting image datasets for object detection. Their approach utilizes object segmentation annotations to crop object instances from a set of real-world images and paste these cropped instances into other real-world images. They achieved significantly better model performance by utilizing the copy-paste strategy if the instances were placed considering the visual context. They used a CNN to model the visual context of different object categories and used this network to guide the instance placement.
Li et al. [34] also utilized the copy-paste method. They proposed a sim-to-real framework for training object recognition and localization DL models for industrial robotic bin picking. They used photorealistic synthetic images, corresponding depth data, object segmentation masks, and 6-D object poses to train their models. They found that even though their generated synthetic data looked photorealistic, it still lacked the fine details that can be observed in real-world images, such as uneven illumination, object deformations, etc. As a result, they proposed a semi-synthetic dataset inspired by the copy-paste method. They manually cropped instances from real-world images and pasted them into real-captured backgrounds (hence the visual context was correct without the need for a context modeling technique). They showed that models trained on a mixed dataset containing images from their photorealistic synthetic and semi-synthetic datasets outperformed models trained using either only synthetic or semi-synthetic images. They showed that DL models trained with their framework could be directly applied to real-world samples without fine-tuning them on real-world data and achieved superior performance compared to state-of-the-art methods.
A similar approach can be seen in the winning solutions at the Amazon Robotics Challenge (ARC) 2017 [35], [36]. They both utilized semi-automated methods to generate segmentation masks for images of novel objects from multiple views, which were later used to construct a semi-synthetic dataset on which their models could be fine-tuned quickly.
FTRG takes this concept further by combining synthetic and real-world images. Instead of pasting the cropped real-world object instances on top of real-world backgrounds, with FTRG, we take advantage of a real-world and multiple corresponding synthetic images with the same camera-object alignments and camera setups but different textures and lighting. This allows us to mix and match different parts in a single image (e.g., real-world and synthetic objects in a single image with synthetic background). Instead of simply cropping and pasting different parts in the images, FTRG makes use of opacity, with which we can continuously control to what extent a certain part of an image is synthetic or real, thus creating a seamless transition between synthetic and real images.
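The opacity-based mixing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `blend_regions`, the dictionary-based region interface, and the convention that opacity 1 means "fully real" are all assumptions made for illustration. It assumes a real image and a pixel-aligned synthetic render of the same scene, plus one boolean mask per region (for example, an object or the background).

```python
import numpy as np


def blend_regions(real, synth, masks, alphas):
    """Blend a real and a pixel-aligned synthetic image region by region.

    real, synth: float arrays of shape (H, W, 3) with values in [0, 1].
    masks:  dict mapping a region name to a boolean (H, W) mask (hypothetical interface).
    alphas: dict mapping the same names to the opacity of the REAL image
            in that region (0 = fully synthetic, 1 = fully real).
    Pixels not covered by any mask are taken from the synthetic image.
    """
    alpha_map = np.zeros(real.shape[:2], dtype=np.float32)
    for name, mask in masks.items():
        alpha_map[mask] = alphas[name]
    a = alpha_map[..., None]  # broadcast the opacity over the color channels
    return a * real + (1.0 - a) * synth


# Toy usage: a 2x2 image whose top row is an "object" and bottom row is "background".
real = np.ones((2, 2, 3), dtype=np.float32)
synth = np.zeros((2, 2, 3), dtype=np.float32)
masks = {
    "object": np.array([[True, True], [False, False]]),
    "background": np.array([[False, False], [True, True]]),
}
mixed = blend_regions(real, synth, masks, {"object": 0.25, "background": 1.0})
```

Sweeping the per-region opacity between 0 and 1 yields the continuous transition between a fully synthetic and a fully real region mentioned in the text.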
Schwarz and Behnke [37] proposed Stillleben, a synthetic dataset generation pipeline for training DL models used for perception tasks in robotics, such as semantic segmentation, object detection, and pose estimation. Contrary to the copy-paste and similar approaches, which facilitate 2-D synthesis, Stillleben utilizes a synthetic 3-D scene to generate high-quality rendered images for known objects with corresponding segmentation masks, depth data, surface normals, etc., in an online fashion. The authors emphasize that Stillleben can be used online, enabling its usage in lifelong/continual learning. However, this property limits the quality of the rendered images. In order to overcome this challenge, Benedikt et al. utilized the GAN-based CUT approach (from Park et al. [38]), an image-to-image translation model, to adapt the synthetic images from Stillleben to the real-world domain [27]. They trained the RefineNet semantic segmentation model (proposed by Lin et al. [39]) from scratch. They showed that the model trained with the dataset utilizing their domain adaptation approach achieved higher intersection over union (IoU) scores with a narrower distribution when compared to the model trained on images directly from Stillleben.
Our synthetic data generation pipeline also utilizes a 3-D synthetic scene. However, while Stillleben uses OpenGL to generate rendered images quickly, our method utilizes Blender Cycles, a photorealistic renderer. This means that our approach can only generate offline synthetic datasets, but in return, it can provide better-quality rendered images. In our experiments, we demonstrate how using an offline-generated synthetic dataset can still be advantageous in the context of continual learning.
Martinez-Gonzalez et al. [10] and Garcia-Garcia et al. [11] introduced a photorealistic synthetic data generator for indoor semantic scene segmentation and robot manipulation using Unreal Engine and virtual reality (VR) technologies. The DL models trained on their data showed promising qualitative results in monocular depth estimation, 6-D pose estimation for synthetic samples, and 6-D pose estimation [10]. Their results suggest that VR technologies can help bring realism into synthetic scenes, such as realistic robot interactions. In the FTRG method, we aim to leverage the same principle. However, instead of "borrowing" real human actions to replace the synthetic robot interactions, we borrow parts of a real-world scene to replace parts of our synthetic scene in a mixed-reality setting.
Roberts et al. [12] created Hypersim, a photorealistic synthetic dataset for indoor scene understanding. They emphasize the use of publicly available 3-D assets. According to their research, most synthetic datasets only provide rendered images rather than 3-D assets, limiting the flexibility of their use cases and their potential for future development. They also highlight that their annotation pipeline is not tied to the rendering process, which makes it possible to generate or change the annotations of a scene without rerendering it. Similarly, our method for generating annotations for synthetic data is decoupled from the scene preparation and the rendering. This allows the approach to be used for any scene, not limited to robotic manipulation, and to change the annotations without rerendering. For our scenes, we only used publicly available free 3-D assets. In comparison, the preparation of Hypersim came with a cost of $57K, of which $6K was used to purchase the required 3-D assets, although their scenes are of much higher quality and the number of 3-D assets they used is much higher than ours as well.
Roberts et al. also conducted experiments to show the sim-to-real performance of models trained on Hypersim. Their experiments revealed that Hypersim pretraining could improve the semantic segmentation performance on NYUv2 [40] compared to pretraining on PBRS [41]. However, it did not improve performance compared to pretraining on SceneNet RGB-D [42]. They attribute these results to the fact that PBRS contains an order of magnitude more samples than Hypersim, while SceneNet RGB-D contains two orders of magnitude more samples. They imply that the reason why Hypersim is still able to yield comparable results to these datasets is its better photorealism. They also suggest that there could be a tradeoff between photorealism and the amount of data needed to achieve good sim-to-real performance. This means that smaller but more photorealistic datasets could be on par with more extensive but less photorealistic datasets. This is in accord with the findings of Huh et al. [43], who investigated what makes the ImageNet dataset [44] suitable for transfer learning. They showed that increasing the number of classes or the amount of pretraining data beyond a certain point did not bring significant benefits for transfer learning. These results indicate that an additional, small amount of well-chosen, good-quality data can improve transferability more than simply increasing the size of the dataset. Even though these findings are for the pretraining phase of transfer learning, they are also promising for the fine-tuning phase since the generalization capability should be preserved as much as possible during fine-tuning. At the same time, the size of the fine-tuning set should be kept as small as possible. In response to this, in our synthetic dataset generation pipeline, we aim to create high-quality photorealistic rendered images.

B. Domain Randomization
Domain randomization introduces unrealistic levels of variation in the synthetic dataset, which forces the models trained on such datasets to ignore the randomized factors' effects and thus generalize to the real data as well [6], [7], [13], [14]. Nonphotorealistic synthetic images were shown to be useful for training object segmentation models for robotics in recent approaches [45], [46].
Tobin et al. [7] showed that a DL model trained on a large quantity of low-fidelity rendered images could be successfully deployed in a real-world scenario. They randomized camera and object positions and lighting conditions while they used unrealistic randomized environment textures. Their experiments demonstrated that a DL model, which was exclusively trained on their domain randomized synthetic data, was able to detect simple geometric objects in a real scene. Furthermore, the detections were also accurate and reliable enough to be used in a robotic grasping pipeline.
Tremblay et al. [6] showed that the domain randomization technique can also be used for bridging the reality gap in more complex scenarios. They showed that DL models for vehicle detection on the real KITTI dataset [47], which were trained on their domain randomized data, could compete in performance with other models trained on the Virtual KITTI dataset [48]. In our experiments, we utilize domain randomization by assigning unrealistic textures to the objects in the synthetic scene.
Seeing the success of photorealism and domain randomization, a logical question arises: Can they be combined to get the best of both worlds? Tremblay et al. [13] proposed combining the two approaches for bridging the reality gap. They used photorealistic synthetic images in combination with domain-randomized ones. They showed that DL models exclusively trained on such a synthetic dataset could achieve state-of-the-art performance in robotic manipulation. Our experiments also inspect the effects of combining domain randomized and photorealistic data for fine-tuning.
Eversberg and Lambrecht [14] examined whether and to what extent it is worthwhile to implement photorealism versus domain randomization techniques in a synthetic dataset for an industrial object detection task. They used Blender to generate the synthetic dataset. Their experimental results suggest that domain randomization techniques outperformed higher realism for the background and clutter objects (which are not related to the object of interest). For the object of interest, realistic textures and realistic lighting affecting the object resulted in better performance than domain randomization techniques. Similar to their solution, our synthetic data annotation pipeline also uses Blender, and we build upon their findings to determine how the blending of real-world and synthetic data should be done in our FTRG approach.
Prakash et al. [15] introduced structured domain randomization. They aimed to generate domain-randomized synthetic data but keep the structural context of the environment. For example, a conventional domain randomized image dataset for vehicle detection would place the vehicles, the camera, and other environmental objects in the scene according to a random distribution. In contrast, in a structured domain randomization dataset, the vehicles are placed on roads, so the structural context of the environment is preserved. They compared photorealistic approaches, such as the Virtual KITTI and the GTA V dataset [49], a domain randomized dataset [6], and their structured domain randomization approach. Their experiments showed that models trained on a dataset with structured domain randomization could outperform models trained on photorealistic or domain-randomized synthetic data. They suggest the advantage of structured domain randomization comes from the trained model having a better understanding of the scene context than models trained on conventional domain randomized data while also being exposed to a similarly large variation in the data distribution. Our FTRG method can be considered as a variation of structured domain randomization, where the environmental context is given by the arrangement of objects in the actual scene, and the randomization is applied to the texture and lighting of the objects.
Instead of trying to preserve environmental context, a more effective approach might be to do the inverse and identify and randomize the environmental factors that affect a DL model's performance, as proposed by She et al. [2] for continual learning in robotics.

C. Continual Learning
Continual learning techniques aim to overcome the challenge of an ever-changing environment by continuously updating the prediction model as new data becomes available [3]. A great challenge for continual learning techniques is catastrophic forgetting. It happens when already acquired knowledge is lost (forgotten) due to training the model on new observations. There are multiple approaches to overcome catastrophic forgetting, such as introducing a regularization term, enforcing architectural changes in the network structure, or keeping a working memory of previous training data for experience replay during training [3]. In our experiments, we inspect the experience replay approach.
She et al. [2] aimed to identify the factors influencing the prediction accuracy of continual learning algorithms for robotics. They created a benchmark dataset, OpenLORIS Object, which explicitly contains quantitative information on environmental factors, such as lighting level, object pixel size, clutter, and occlusions. They used the train-test accuracy matrix to evaluate the trained models, from which they derived metrics, such as backward transfer (BWT) and forward transfer (FWT). BWT characterizes how well the model can solve previously seen tasks. This is a measure that continual learning techniques need to consider to avoid catastrophic forgetting. On the other hand, the FWT characterizes how well a model performs on yet-unseen future tasks. This is closely related to model robustness. Their results suggest that one reason for the poor performance of continual learning approaches is their struggle with transferring knowledge to new tasks and scenes. This is most apparent in the FWT measure since BWT is usually improved due to techniques that try to avoid catastrophic forgetting. We hypothesize that a small amount of synthetic data could improve the FWT of certain continual learning approaches. Their findings on the environmental factors that affect DL performance are accounted for in our solutions, where we explicitly include randomization of said factors in our synthetic datasets.
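To make the two transfer measures concrete, the following sketch computes them from a train-test accuracy matrix. It assumes one commonly used formulation, which may differ in normalization from the exact definitions used in [2]: `R[i, j]` is the accuracy on task `j` after training up to task `i`, BWT averages `R[i, j] - R[j, j]` over all `i > j`, and FWT averages `R[i, j]` over all `i < j`. The function names and the example matrix are illustrative only.

```python
import numpy as np


def backward_transfer(R):
    """Mean change in accuracy on earlier tasks after learning later ones.

    Assumed convention: R[i, j] = accuracy on task j after training up to task i.
    Negative values indicate forgetting.
    """
    R = np.asarray(R, dtype=float)
    rows, cols = np.tril_indices(R.shape[0], k=-1)  # all pairs with i > j
    return float(np.mean(R[rows, cols] - R[cols, cols]))


def forward_transfer(R):
    """Mean accuracy on tasks that have not been trained on yet (i < j)."""
    R = np.asarray(R, dtype=float)
    rows, cols = np.triu_indices(R.shape[0], k=1)  # all pairs with i < j
    return float(np.mean(R[rows, cols]))


# Made-up 3-task accuracy matrix, used only to exercise the functions.
R_example = np.array([
    [0.9, 0.5, 0.4],
    [0.8, 0.9, 0.6],
    [0.7, 0.8, 0.9],
])
bwt = backward_transfer(R_example)
fwt = forward_transfer(R_example)
```

Under this convention, adding samples that anticipate future conditions would be expected to show up in the upper triangle of the matrix, which is the part FWT summarizes.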
Using synthetic data in continual learning is not unheard of. Synthetic data can be used during the evaluation of continual learning methods [50], while generative models are often used to enrich available data based on past experiences and thus prevent catastrophic forgetting [51], [52]. However, to the best of our knowledge, a combination where synthetic data with implicit domain knowledge is used for experience replay instead of a generative model is yet to be seen. Such a system can automatically generate synthetic scenarios ahead of time. For example, suppose lighting conditions are expected to change during the operation, but the model has not yet encountered samples with different lighting. In that case, synthetic samples can be generated with different lighting conditions and added to the experiences encountered by the model. Our experiments show how such a system could improve the FWT of continual learning with experience replay and its effect on BWT and overall accuracy.
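The idea of adding pre-generated synthetic samples to the replayed experiences can be sketched as follows. This is a simplified illustration, not the system evaluated in this article: the class `ReplayMemory`, the function `make_batch`, the use of reservoir sampling, and the batch composition are all assumptions chosen to keep the example short.

```python
import random


class ReplayMemory:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Replace a stored sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def make_batch(new_samples, memory, synthetic_pool, n_replay, n_synth, rng):
    """Mix new real samples with replayed real and pre-generated synthetic ones."""
    batch = list(new_samples)
    batch += memory.sample(n_replay)
    batch += rng.sample(synthetic_pool, min(n_synth, len(synthetic_pool)))
    rng.shuffle(batch)
    return batch


# Toy usage with placeholder samples instead of images.
rng = random.Random(1)
memory = ReplayMemory(capacity=5)
for i in range(20):
    memory.add(("real", i))
# Hypothetical pool rendered ahead of time, e.g., with lighting not yet seen.
synthetic_pool = [("synthetic", i) for i in range(10)]
batch = make_batch([("real", 100), ("real", 101)], memory, synthetic_pool,
                   n_replay=3, n_synth=4, rng=rng)
```

Keeping the synthetic pool separate from the reservoir is one possible design choice: it guarantees that anticipated conditions stay represented in every batch regardless of what the reservoir happens to retain.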

A. Real Data Annotation Pipeline
Our real-data annotation pipeline can generate instance segmentation masks for real-world images automatically. We maintain a virtual counterpart of the significant elements of the real scene, such as the camera and the objects. The virtual scene is a digital twin of the real world, so the virtual camera and objects reflect the pose of their real counterparts. This setup enables us to compute the segmentation masks for the objects in the virtual scene and then associate these annotations with corresponding images taken by the camera in the actual scene.
The generation of instance segmentation masks relies on computing the perspective projection of 3-D points on the objects' surfaces onto the image plane. We use the formalism described in [53], according to which the perspective projection $\mathbf{x} = (u, v, 1)$ of a 3-D point given in the world frame, ${}^{w}\mathbf{X} = ({}^{w}X, {}^{w}Y, {}^{w}Z, 1)$, is described, up to the homogeneous scale factor, as

$$\mathbf{x} = K \, \Pi \, {}^{c}T_{w} \, {}^{w}\mathbf{X} \tag{1}$$

where $K$ is the matrix containing the intrinsic parameters of the camera, $\Pi$ is the projection matrix, which has the form $[I \mid 0]$, where $I$ is a 3 × 3 identity matrix and $0$ is a column vector of three zeros, and ${}^{c}T_{w}$ is the 4 × 4 homogeneous transformation matrix describing the transformation between the world and the camera frame. The intrinsic parameters can be determined by camera calibration. Our solution used the OpenCV library [54] for the camera calibration, with a printed A4-sized checkerboard pattern. Without the need for additional calibration steps, we were able to achieve sub-10-pixel average accuracy (in 1920×1080 resolution images) for projecting 3-D points onto the image plane, which resulted in human-level annotations.
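As a minimal numerical sketch of this projection, the snippet below multiplies the intrinsic matrix, the $[I \mid 0]$ projection matrix, and the world-to-camera transformation, and then divides by the third homogeneous coordinate to obtain pixel coordinates. The function name `project_points` and all numerical values (focal length, principal point, camera pose) are made up for illustration and are not taken from the calibration used in this article.

```python
import numpy as np


def project_points(K, T_cw, points_w):
    """Project 3-D world points to pixel coordinates.

    K:        (3, 3) intrinsic matrix.
    T_cw:     (4, 4) homogeneous world-to-camera transformation.
    points_w: (N, 3) points given in the world frame.
    Returns an (N, 2) array of (u, v) pixel coordinates.
    """
    points_w = np.asarray(points_w, dtype=float)
    homog = np.hstack([points_w, np.ones((len(points_w), 1))])  # (N, 4)
    proj = np.hstack([np.eye(3), np.zeros((3, 1))])             # [I | 0]
    x = (K @ proj @ T_cw @ homog.T).T                           # (N, 3)
    return x[:, :2] / x[:, 2:3]  # normalize so the third coordinate is 1


# Illustrative values only: 1000 px focal length, 1920x1080 image center.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
T_cw = np.eye(4)  # camera frame coincides with the world frame
pixels = project_points(K, T_cw, [[0.0, 0.0, 2.0], [0.1, -0.2, 2.0]])
```

Points behind the camera (non-positive depth in the camera frame) are not handled here; a practical implementation would need to filter them out before the division.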
Let $P({}^{w}\mathbf{X})$ denote the perspective projection of the point ${}^{w}\mathbf{X}$, $F = \{{}^{w}\mathbf{X}_1, {}^{w}\mathbf{X}_2, \ldots, {}^{w}\mathbf{X}_n\}$ a set of points that form a face, $R({}^{w}\mathbf{X})$ a ray starting from the origin of the camera frame and passing through the point ${}^{w}\mathbf{X}$, and ${}^{w}X_{O}^{\mathrm{all}}$ the set of all possible points on the surface of the object $O$. For creating segmentation masks, each object has to be associated with a selected, finite set of points on its surface: ${}^{w}X_{O} = \{{}^{w}\mathbf{X} \mid {}^{w}\mathbf{X} \text{ on the surface of } O\}$, ${}^{w}X_{O} \subseteq {}^{w}X_{O}^{\mathrm{all}}$. The power set $\mathcal{P}({}^{w}X_{O})$ contains all the possible (not necessarily meaningful) faces for object $O$, for a given set of surface points ${}^{w}X_{O}$. Polygons can be formed in the image plane by projecting each point of a face, $\mathrm{Poly}_{F} = \{P({}^{w}\mathbf{X}_i) \text{ for } {}^{w}\mathbf{X}_i \in F\}$, and using the projections as the vertices of the polygon. A set of faces $F_{O} \subseteq \mathcal{P}({}^{w}X_{O})$ has to be chosen for object $O$ such that all the projections $P({}^{w}\mathbf{X}_j)$ for ${}^{w}\mathbf{X}_j \in {}^{w}X_{O}^{\mathrm{all}}$ fall inside at least one polygon $\mathrm{Poly}_{F_k}$, for $F_k \in F_{O}$, but projections $P({}^{w}\mathbf{X})$, where $R({}^{w}\mathbf{X})$ does not intersect the object, do not fall into any of the polygons $\mathrm{Poly}_{F_k}$, for $F_k \in F_{O}$. In the case of a simple cube, for example, the selected set of surface points should be the vertices (${}^{w}X_{O}$ = vertices), while the selected faces would naturally be the set of six faces of the cube ($F_{O}$ = faces). When projecting the points of the faces, one would get six tetragons in the image plane. It is easy to see that for any point on the cube's surface, the projection would fall into at least one of these tetragons. Any other point that is not on the surface of the cube and is neither between the camera and the cube nor behind the cube (so the ray from the origin of the camera frame going through the point does not intersect the object) would be projected outside of all the tetragons. Thus, merging all the tetragons into a single polygon gives us the segmentation mask for the cube.
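The merging of projected faces into a mask can be sketched as follows, assuming the surface points have already been projected to pixel coordinates and that every face is a triangle. The function names `fill_triangle` and `mask_from_faces` are hypothetical, and the per-pixel inside test is just one simple way to rasterize the polygons; it is not necessarily how the pipeline described here is implemented.

```python
import numpy as np


def fill_triangle(mask, tri):
    """Mark the pixels of `mask` that lie inside the 2-D triangle `tri`.

    mask: boolean (H, W) array, modified in place.
    tri:  (3, 2) array of (u, v) pixel coordinates.
    """
    h, w = mask.shape
    x0, y0 = np.floor(tri.min(axis=0)).astype(int)
    x1, y1 = np.ceil(tri.max(axis=0)).astype(int)
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w - 1), min(y1, h - 1)
    if x0 > x1 or y0 > y1:
        return  # triangle lies completely outside the image
    xs, ys = np.meshgrid(np.arange(x0, x1 + 1), np.arange(y0, y1 + 1))

    def edge(a, b):
        # Signed area test: which side of the edge a->b each pixel is on.
        return (b[0] - a[0]) * (ys - a[1]) - (b[1] - a[1]) * (xs - a[0])

    e0, e1, e2 = edge(tri[0], tri[1]), edge(tri[1], tri[2]), edge(tri[2], tri[0])
    inside = ((e0 >= 0) & (e1 >= 0) & (e2 >= 0)) | ((e0 <= 0) & (e1 <= 0) & (e2 <= 0))
    mask[y0:y1 + 1, x0:x1 + 1] |= inside


def mask_from_faces(points_2d, faces, shape):
    """Union of all projected triangular faces as a boolean mask."""
    points_2d = np.asarray(points_2d, dtype=float)
    mask = np.zeros(shape, dtype=bool)
    for face in faces:
        fill_triangle(mask, points_2d[list(face)])
    return mask


# Toy usage: a square split into two triangles in a 10x10 image.
square = [[2, 2], [6, 2], [6, 6], [2, 6]]
mask = mask_from_faces(square, [(0, 1, 2), (0, 2, 3)], (10, 10))
```

This sketch produces the mask of a single object in isolation. Occlusions between several objects are not handled and would require additional depth ordering.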
However, for objects with complex shapes, the manual selection of surface points and the manual definition of faces are not feasible. Fortunately, a very common virtual representation for 3-D object models is the standard triangle language (STL) format. This representation defines a 3-D surface model of the object, given by an object mesh consisting of triangles formed by vertices. As a result, we can directly use the STL-style representation of an object by defining ${}^{w}\mathbf{X}_{O} = {}^{w}T_{o} \, {}^{o}\mathbf{X}_{O}$, where ${}^{o}\mathbf{X}_{O}$ are the vertices in the STL format (defined in the object frame), and ${}^{w}T_{o}$ is the 4 × 4 homogeneous transformation matrix describing the transformation between the object and world frames. $F_{O}$ can be selected according to the triangles in the STL representation. In common robotics scenarios, it is usually assumed that a 3-D model of the objects is available or can be easily created. A photogrammetry application can also be used to acquire a 3-D mesh, as such methods are becoming increasingly accessible with a mobile phone [55]. In our synthetic dataset generation pipeline, we utilized Qlone [56] for scanning clutter objects. Using the STL representation of the object mesh also has the advantage that one only needs to measure the pose of an object relative to the world frame, from which ${}^{w}T_{o}$ can be determined, instead of measuring every individual point of ${}^{w}\mathbf{X}_{O}$ relative to the world frame.
The transformation matrix ${}^{c}T_{w}$ from (1) needs to be measured before the annotation procedure. In our setup, we utilize an industrial robot and attach the camera in a known, fixed pose to the robot's end effector. As a result, the transformation between the camera frame and the robot's tool center point (TCP), ${}^{c}T_{\text{TCP}}$, is fixed regardless of the robot's pose. The transformation between the robot TCP and the world frame, ${}^{\text{TCP}}T_{w}$, can be obtained from the robot controller at all times. Thus, the transformation between the camera and the world frame can be written as ${}^{c}T_{w} = {}^{c}T_{\text{TCP}} \, {}^{\text{TCP}}T_{w}$. Fig. 1 shows our setup for the automatic annotation of real data. The preliminary steps needed for the annotations are as follows.
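The chain of frame changes described above (object to world, world to TCP, TCP to camera) amounts to multiplying 4 × 4 homogeneous matrices. The following sketch shows the composition with made-up poses; the helper names and all numerical values are hypothetical and only serve to illustrate the order of multiplication.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a rotation matrix and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(angle):
    """Rotation by `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def apply_transform(T, points):
    """Apply a homogeneous transform to Nx3 points."""
    points_h = np.hstack([points, np.ones((len(points), 1))])
    return (T @ points_h.T).T[:, :3]

# Hypothetical measurements (not taken from the paper).
c_T_tcp = make_transform(np.eye(3), [0.0, 0.05, -0.10])      # fixed camera mount
tcp_T_w = make_transform(rot_z(np.pi / 2), [0.4, 0.0, 0.8])  # from the robot controller
w_T_o = make_transform(np.eye(3), [0.1, 0.2, 0.0])           # measured object pose

# Camera-from-world transform used in the projection equation.
c_T_w = c_T_tcp @ tcp_T_w

# Mesh vertices are stored in the object frame and moved to the world frame once.
verts_o = np.array([[0.0, 0.0, 0.0], [0.01, 0.0, 0.0]])
verts_w = apply_transform(w_T_o, verts_o)
verts_c = apply_transform(c_T_w, verts_w)
```

Only `tcp_T_w` changes from image to image in such a setup, so the other two transforms can be measured once and reused.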
1) Acquiring object meshes in STL format (${}^{o}\mathbf{X}_{O}$ for all objects).
2) Camera calibration (determining $K$).
3) Attaching the camera to the robot (measuring ${}^{c}T_{\text{TCP}}$).
4) Placing the objects in known poses (measuring ${}^{w}T_{o}$).

After the preliminary steps, the data collection and automated annotation can be carried out. We move the camera with the robot to a set of pregenerated target poses scattered in a grid pattern on the surface of concentric spheres centered around the scene. At each pose, the robot stops, and the camera takes an image of the scene. The current pose of the robot TCP relative to the world frame is attached to the image as metadata. Based on this pose, ${}^{c}T_{w}$ can be determined, and the object masks can be computed.
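One possible way to pregenerate target poses on concentric spheres is sketched below. The paper does not specify the grid spacing, the elevation range, or the camera axis convention, so the parameters here (azimuth and elevation counts, a 20° to 70° elevation range, and a camera whose z-axis points at the scene centre) are assumptions for illustration.

```python
import numpy as np

def sphere_grid_positions(center, radii, n_azimuth, n_elevation,
                          min_elev_deg=20.0, max_elev_deg=70.0):
    """Camera positions on an azimuth/elevation grid over concentric spheres."""
    center = np.asarray(center, dtype=float)
    azimuths = np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False)
    elevations = np.deg2rad(np.linspace(min_elev_deg, max_elev_deg, n_elevation))
    positions = []
    for r in radii:
        for el in elevations:
            for az in azimuths:
                direction = np.array([np.cos(el) * np.cos(az),
                                      np.cos(el) * np.sin(az),
                                      np.sin(el)])
                positions.append(center + r * direction)
    return np.array(positions)

def look_at_rotation(position, target, up=(0.0, 0.0, 1.0)):
    """Rotation whose columns are the camera x, y, z axes in the world frame.

    The z-axis points from the camera towards the target. The viewing
    direction must not be parallel to `up`.
    """
    z_axis = np.asarray(target, dtype=float) - np.asarray(position, dtype=float)
    z_axis /= np.linalg.norm(z_axis)
    x_axis = np.cross(z_axis, up)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    return np.column_stack([x_axis, y_axis, z_axis])

# Hypothetical scene centre and sphere radii, in metres.
center = np.array([0.5, 0.0, 0.0])
positions = sphere_grid_positions(center, radii=[0.4, 0.6],
                                  n_azimuth=8, n_elevation=3)
rotations = [look_at_rotation(p, center) for p in positions]
```

Each position and rotation pair describes a camera pose in the world frame; converting it to a TCP target would additionally require the fixed camera-to-TCP transform.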
However, generating the segmentation masks for each object individually is not enough. Since multiple objects are present in the scene simultaneously, occlusions also have to be considered. In most cases, a simplistic approach, such as generating the segmentation masks for the objects in order of their distance from the camera and allowing these segmentations to overwrite each other, would be enough. On the other hand, such an approach would not be able to handle complex occlusions where objects could interlock. We propose an algorithmic solution that deals with such occlusions (Algorithm 1).
As seen from Algorithm 1, each object gets a unique color ID (RGB values). The algorithm starts with an empty segmentation mask M with five channels. The first three channels are used for color (RGB representation), while the other two channels store the information about which triangle of which object the given pixel belongs to. The algorithm goes through each object in order and projects the vertices of the triangles in the object's mesh. We use O and T to represent the object and the triangle that are being projected. After projecting the vertices of a triangle T, we check which pixels fall inside the projection of T. Then, for each of these pixels, a test is performed, which can have three possible outcomes. If the color of the pixel is black, it can be colored with the color ID of O. If the color of the pixel is the same as the color ID of O, the pixel was already marked as belonging to O, so nothing needs to be done. If the pixel was already colored, but with the color ID of a different object Õ, it has to be decided which color the pixel should have. For this purpose, we look up the triangle T̃ of the object Õ based on the last two channels of the mask and call the IsOccluded function on the two triangles T and T̃. This function returns true if T is occluded by T̃, in which case nothing should be done. Otherwise, the pixel is colored with the color ID of the object O. The IsOccluded function determines whether there is an occlusion by first checking trivial cases (all vertices of one triangle being closer to the camera than those of the other) and, in nontrivial cases, by using the SignedVolume function. The complete definitions of the IsOccluded and SignedVolume functions are given in Algorithms 2 and 3, respectively, in the Appendix.
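The per-pixel logic described above can be sketched as follows. This is not a transcription of Algorithm 1: the function `is_occluded` is a deliberately simplified stand-in that compares mean vertex depths, whereas the authors' IsOccluded and SignedVolume functions are defined in their Appendix and are not reproduced here. The sketch also marks empty pixels with an object index of -1 instead of testing for black, and it assumes that triangles are already expressed in the camera frame.

```python
import numpy as np

def pixels_in_triangle(tri_uv, height, width):
    """Rows and columns of pixel centres inside a projected triangle."""
    (u0, v0), (u1, v1), (u2, v2) = tri_uv
    v, u = np.mgrid[0:height, 0:width]
    d0 = (u1 - u0) * (v - v0) - (v1 - v0) * (u - u0)
    d1 = (u2 - u1) * (v - v1) - (v2 - v1) * (u - u1)
    d2 = (u0 - u2) * (v - v2) - (v0 - v2) * (u - u2)
    inside = ((d0 >= 0) & (d1 >= 0) & (d2 >= 0)) | ((d0 <= 0) & (d1 <= 0) & (d2 <= 0))
    return np.nonzero(inside)

def is_occluded(tri, other_tri):
    """Simplified stand-in: True if `tri` is farther from the camera on average."""
    return tri[:, 2].mean() > other_tri[:, 2].mean()

def build_mask(objects, K, height, width):
    """objects: list of (color_id, triangles); triangles has shape (T, 3, 3), camera frame."""
    mask = np.zeros((height, width, 5), dtype=np.int32)
    mask[..., 3:] = -1  # channels 3 and 4: object index and triangle index
    for o_idx, (color, triangles) in enumerate(objects):
        for t_idx, tri in enumerate(triangles):
            proj = (K @ tri.T).T
            tri_uv = proj[:, :2] / proj[:, 2:3]
            rows, cols = pixels_in_triangle(tri_uv, height, width)
            for r, c in zip(rows, cols):
                other_o, other_t = mask[r, c, 3], mask[r, c, 4]
                if other_o == o_idx:
                    continue  # pixel already belongs to this object
                if other_o >= 0 and is_occluded(tri, objects[other_o][1][other_t]):
                    continue  # the triangle stored in the mask is in front
                mask[r, c, :3] = color
                mask[r, c, 3:] = (o_idx, t_idx)
    return mask

# Illustrative scene: a small near triangle in front of a large far triangle.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
near = np.array([[[-0.1, -0.1, 1.0], [0.1, -0.1, 1.0], [0.0, 0.1, 1.0]]])
far = np.array([[[-0.4, -0.4, 2.0], [0.4, -0.4, 2.0], [0.0, 0.4, 2.0]]])
RED, GREEN = (255, 0, 0), (0, 255, 0)
mask_a = build_mask([(RED, near), (GREEN, far)], K, 64, 64)
mask_b = build_mask([(GREEN, far), (RED, near)], K, 64, 64)
```

In this example the resulting colors do not depend on the order in which the two objects are processed, which is the property that distinguishes the approach from simple depth-ordered overwriting.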
This method has the limitation that occlusions with intersecting objects and occlusions involving objects with significant size differences (one having much larger triangles in its object mesh than the other) may not be handled properly. These situations, however, can be considered marginal for a system that automatically generates segmentation-type annotations. Fig. 2 shows some example images demonstrating the segmentation masks generated by our automatic real-data annotation pipeline. The images belong to an instance segmentation dataset with different numbers and types of objects, clutter, and illumination. We use this dataset to train our baseline model and to test all the models in our experiments on the benefits of using synthetic data for fine-tuning.

B. Synthetic Datasets and Automatic Annotation
Our synthetic data generation pipeline uses the open-source 3-D computer graphics suite Blender. For our experiments, we created two synthetic scenes. One is a tabletop environment containing objects available at our laboratory for testing the FTRG method. We refer to this scene as the OE scene. The other replicates a scene from the OpenLORIS Object dataset [2] for comparing continual learning model performance with and without synthetic data for experience replay. This scene is the synthetic counterpart of one of the real scenes in the OpenLORIS Object dataset, so we refer to it as SynLORIS. Both scenes were prepared with the scope of a fine-tuning dataset in mind. Although the prepared synthetic datasets demonstrate specific tasks, they are representative in the sense that similar results could be achieved in other tasks of a similar kind without additional complexity in the data generation pipeline.
1) OE Scene: The OE scene has two types of objects of interest: the 3-D printed "OE" logos (after which the scene was named) and standard DIN EN ISO 10642 M8x55 type bolts (later referred to as bolts). The scene also contains clutter objects (clamps and pliers), which were photo-scanned in our laboratory using the Qlone photogrammetry application. The scene's background is a planar surface serving as the tabletop. The arrangement of the OE logos, the bolts, and the clutter objects, as well as the distance of the camera from the background plane, were randomized.
We created photorealistic and randomized textures for our objects and the background and used realistic lighting. For the photorealistic shading of the tabletop, we used an image texture; the clutter objects used the textures acquired from the photo-scanning; and the OE logos and bolts used shaders created in Blender. The randomized textures were created to be unrealistic for the purpose of domain randomization.
From this scene, we generated 2500 images (OE synthetic dataset). Table I details the dataset composition. It contains images generated using five different settings, each with a training and validation split (400 and 100 images). The images from the first, second, and fourth settings utilize photorealistic textures, while those from the third and fifth settings use randomized textures. This means 60% of the images in the dataset use photorealistic textures and 40% use randomized textures. The images generated from the first three settings only show the OE logos, while the ones from the last two settings also contain bolts and clutter objects. Apart from images from the first setting, all other images were created with randomized lighting conditions, which make up 80% of the images in the dataset (light intensity and color were randomized, but the values were kept within realistic bounds). The number, position, and orientation of the OE logos, the bolts, and the clutter objects, as well as the camera's distance from the table, were randomized for each image, resulting in a unique arrangement of the scene.
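The composition described above can be checked with a short script. The dictionary keys below are hypothetical labels introduced for this sketch; the values per setting are taken from the description in the text.

```python
# Five generation settings as described in the text (labels are illustrative).
settings = [
    {"textures": "photorealistic", "lighting": "fixed", "bolts_and_clutter": False},
    {"textures": "photorealistic", "lighting": "randomized", "bolts_and_clutter": False},
    {"textures": "randomized", "lighting": "randomized", "bolts_and_clutter": False},
    {"textures": "photorealistic", "lighting": "randomized", "bolts_and_clutter": True},
    {"textures": "randomized", "lighting": "randomized", "bolts_and_clutter": True},
]
TRAIN_PER_SETTING, VAL_PER_SETTING = 400, 100

total_images = len(settings) * (TRAIN_PER_SETTING + VAL_PER_SETTING)
photorealistic_share = sum(s["textures"] == "photorealistic" for s in settings) / len(settings)
randomized_light_share = sum(s["lighting"] == "randomized" for s in settings) / len(settings)

print(total_images, photorealistic_share, randomized_light_share)
```

Since every setting contributes the same number of images, the shares per setting equal the shares per image.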
Some example rendered frames of the OE synthetic dataset can be seen in Fig. 3. It displays rendered images with either photorealistic or randomized textures for the OE logos, bolts, clutter objects, and the background, with different lighting conditions and randomized object and camera placement.
2) SynLORIS: The SynLORIS scene was modeled after a real scene from the OpenLORIS Object dataset. We attempted to reconstruct a similar environment by replicating the placement of the desk and background elements, as well as the direction of the lighting. We also aimed to use 3-D assets that resemble the real objects and textures, but we limited our selection to freely available 3-D assets from BlenderKit's library. For this scene, we did not aim for very high-quality photorealistic renders or a perfect recreation of the objects, because the OpenLORIS Object benchmark resizes the images to 50 × 50 pixels, so the fine details would have been lost anyway.
In the SynLORIS scene, we introduced variation in two factors mentioned in [2]: 1) illumination and 2) object pixel size. There are three sources of illumination in the scene: 1) an HDRI; 2) an area light representing the light coming in through the window; and 3) a point light source above the table. In order to change the illumination, both the power of the lamps and the strength of the lighting from the HDRI were modified. The variations in the object pixel size were emulated by moving the camera closer to and further from the object along a manually defined 3-D spline.
Similarly to the single-factor experiments in OpenLORIS Object, we also created nine tasks for the illumination factor, each with a different illumination level. For each task, we considered seven different objects (a small subset of the objects in the OpenLORIS Object dataset, which all use the same scene). We rendered 30 synthetic images for each object in each task. The camera path was the same regardless of the object or the task. In total, the generated dataset contains 1890 rendered images (9 tasks, 7 objects per task, and 30 images per object). Fig. 4 shows some of the rendered frames. We use the SynLORIS dataset in our experiments to show how an image classification model using experience replay can benefit from synthetic data.
3) Blender Annotation Tool: For the generation of the segmentation-type annotations in our OE synthetic dataset, we created the Blender annotation tool (BAT), a Blender add-on.¹ BAT can be used via a simple user interface (referred to as the BAT panel), located in its own tab of the "n-panel" of the 3-D Viewport. The class for the background is added by default with black color. The BAT panel can be used to create, delete, or rename classes, change the color ID of a class or the collection of objects associated with it, and toggle whether the object collection should be treated as a collection of instances or not.
BAT uses the OpenGL viewport renderer to generate the segmentation masks. As a result, BAT only works if a Blender GUI is open; it cannot run in background mode. However, the generation of the annotations takes significantly less time and fewer resources than rendering the images (two orders of magnitude in our experience), and the rendering of the synthetic images can be completely decoupled from the generation of the annotations. This allows us to render the dataset on a powerful headless server while, in the meantime, generating the corresponding annotations on a separate system with limited resource usage. A few examples of the generated annotations can be seen in Fig. 5.

C. Continual Learning Experiments
We conducted our experiments on continual learning using the OpenLORIS Object benchmark by She et al. [2]. Their paper introduced an evaluation method for continual learning techniques based on the train-test accuracy matrix, which is described in Table II. They introduced the metrics FWT, which is the average accuracy calculated for the upper triangle of the train-test accuracy matrix (marked in blue in the table), and BWT, which is the average accuracy for the lower triangle of the train-test accuracy matrix (marked in red in the table). BWT characterizes how well a model remembers previous tasks, while FWT describes how well a model can adjust to new tasks after training on the preceding ones.
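As a sketch of how these two metrics follow from the train-test accuracy matrix, the snippet below averages the strictly upper and strictly lower triangles of a matrix R, where R[i, j] is the accuracy after training on task i and testing on task j. Excluding the diagonal from both triangles and reporting the plain mean of all entries as an overall score are assumptions of this sketch; the exact definitions are given in [2]. The matrix values are invented for illustration.

```python
import numpy as np

def triangle_metrics(R):
    """BWT and FWT as triangle averages of a square train-test accuracy matrix."""
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    lower = np.tril_indices(n, k=-1)  # i > j: tasks seen earlier in training
    upper = np.triu_indices(n, k=1)   # i < j: tasks not yet trained on
    return {
        "BWT": R[lower].mean(),
        "FWT": R[upper].mean(),
        "overall": R.mean(),  # assumed here: mean over all entries
    }

# Made-up 3-task example.
R = [[0.9, 0.5, 0.4],
     [0.8, 0.9, 0.6],
     [0.7, 0.8, 0.9]]
metrics = triangle_metrics(R)
```

In this toy matrix the lower triangle is noticeably brighter than the upper one, which mirrors the observation that forward transfer tends to be the harder quantity to improve.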
The results of She et al. showed that FWT is the most challenging metric to maximize for many continual learning approaches. In our interpretation, FWT measures how well the trained model can generalize to new tasks. Our experiments aim to test our hypothesis that synthetic data, in the form of rendered images, can help improve the FWT of certain continual learning models. For this purpose, we use continual learning with experience replay [57] and evaluate it on the single-factor benchmarks introduced by She et al.
For our experiments, we use a selected set of seven objects from the OpenLORIS Object dataset (bottle_01, bowl_01, cup_02, cup_04, ladle_02, paper_cutter_04, and scissors_01), as these are the objects for which we generated our synthetic dataset SynLORIS as well. There are four factors in total: 1) illumination; 2) occlusion; 3) clutter; and 4) object pixel size. For each factor, there are nine tasks with different levels of the corresponding factor. We evaluate two models for each factor. One of the models is only trained with data from the original OpenLORIS Object training set, while the other is trained with data from both the OpenLORIS Object training set and the corresponding SynLORIS data. Both models were given the same memory budget of 2000 and were trained for 100 iterations on each task. We use the validation set of the OpenLORIS Object dataset for the corresponding seven objects for testing across all factors and tasks. We evaluate each model for all four factors according to the same metrics that were used by She et al. We report our findings in Section IV-A.
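To make the notion of a fixed memory budget concrete, the following sketch shows a replay memory that never stores more than a given number of samples. The paper does not state how samples are selected for the memory, so the use of reservoir sampling here is an assumption, as are the class name and the placeholder sample tuples.

```python
import random

class ReplayMemory:
    """Fixed-budget sample store using reservoir sampling (an assumed policy)."""

    def __init__(self, budget, seed=0):
        self.budget = budget
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.budget:
            self.samples.append(sample)
        else:
            # Keep each sample seen so far with equal probability budget / seen.
            j = self.rng.randrange(self.seen)
            if j < self.budget:
                self.samples[j] = sample

    def sample(self, k):
        """Draw up to k stored samples without replacement."""
        return self.rng.sample(self.samples, min(k, len(self.samples)))

# Placeholder samples: (source, task index, image index).
memory = ReplayMemory(budget=2000)
for task in range(9):
    for idx in range(400):
        memory.add(("real", task, idx))
    for idx in range(200):
        memory.add(("synthetic", task, idx))
replay_batch = memory.sample(32)
```

Real and synthetic samples share the same budget in this sketch, so adding synthetic data does not increase the amount of memory available to the model.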

D. FTRG Method
FTRG combines our automatic real-data segmentation pipeline with our synthetic data generation and annotation method to create rendered counterparts to real-world images. This allows us to blend different elements of the real and synthetic scenes (such as the objects of interest, the background, or clutter objects) in a single image. By controlling opacity, we can influence how much the virtual scene is blended with the real one, creating a seamless transition between reality and the synthetic environment. Reflecting this purpose, we named our method filling the reality gap (FTRG).
FTRG starts with creating a real dataset with our automatic real-data annotation pipeline. Then, a synthetic counterpart of the scene is built in Blender, and based on the images from the real dataset, the camera pose, its motion, and its internal parameters are determined by the motion tracking module of Blender. After rendering, the real and synthetic images can be seamlessly blended using Blender's compositor workspace by layering them on top of each other and continuously adjusting the opacity of different parts of these layers based on the automatically generated segmentation masks. For our experiments, we created an FTRG dataset by combining different elements of the real and synthetic images (e.g., real background with synthetic objects and synthetic background with real objects). The labels for the FTRG dataset are "inherited" either from the real or the synthetic scene (using BAT). Fig. 6 shows possible ways to blend synthetic and real data in the FTRG dataset, showcasing the seamless transition between real-world and synthetic image components.
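The mask-driven opacity blending can be illustrated outside Blender with a few lines of NumPy. This is only a conceptual sketch of alpha compositing under stated assumptions (aligned images of equal size, float pixel values in [0, 1], a boolean mask selecting the region to replace); it does not reproduce the compositor node setup used by the authors.

```python
import numpy as np

def blend(real, synthetic, mask, opacity):
    """Blend `synthetic` over `real` inside `mask` with the given opacity.

    real, synthetic: HxWx3 float arrays in [0, 1]
    mask: HxW boolean array (e.g. a segmentation mask of selected objects)
    opacity: 0.0 keeps the real image, 1.0 fully replaces the masked region
    """
    alpha = opacity * mask.astype(float)[..., None]
    return alpha * synthetic + (1.0 - alpha) * real

# Toy 4x4 images: a black real image and a white synthetic render.
real = np.zeros((4, 4, 3))
synthetic = np.ones((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

half = blend(real, synthetic, mask, opacity=0.5)
full = blend(real, synthetic, mask, opacity=1.0)
```

Sweeping `opacity` from 0 to 1 produces the kind of gradual transition between real and synthetic appearance that the method relies on, while pixels outside the mask remain untouched.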
To show the effectiveness of this method, we compare the detection results of Mask-RCNN [58] networks fine-tuned using the FTRG method with those of networks fine-tuned using photorealistic synthetic data, domain-randomized synthetic data, or real images.
The FTRG method is currently limited by the automated real-world dataset collection approach because it requires that the pose of each object in the real-world scene is known. In simple tabletop scenarios, this can be ensured, but in a more complex setting, such as robotic bin-picking, the exact object poses are very hard to determine. In the future, an improved real-world data collection pipeline and advanced camera tracking could solve this issue.

A. Continual Learning Experiments
In our continual learning experiments, we compare the performance of models using experience replay on the OpenLORIS Object benchmark's single-factor experiments. One of the models only had access to formerly seen samples for experience replay, while the other also had access to samples from the SynLORIS dataset. Fig. 7 visualizes the train-test accuracy matrices of the models for all four factors for qualitative assessment. The train-test accuracy matrices are represented as images, where the intensity of a pixel reflects the value of an element of the matrix, with 0 being black and 1 being white.
It is important to mention that the SynLORIS dataset only includes variations in two factors: 1) illumination and 2) object pixel size. The effects of this are visible in Fig. 7, where significant differences in the train-test accuracy matrices can only be observed in the illumination and object pixel size factors. In these cases, the model that had access to synthetic samples outperformed the model that was only trained on real samples. Table III shows the quantitative results of our continual learning experiments, using the metrics She et al. introduced in [2]. The first row (m1) shows the results of training models exclusively on the OpenLORIS Object dataset. The second row (m2) shows the performance of models that were trained on data from both the OpenLORIS Object and the SynLORIS datasets. Notice how the models trained on real and synthetic samples had significantly better FWT for the illumination and object pixel size factors than models that only used real data.

B. Synthetic Data for Fine-Tuning and FTRG
Our experiments demonstrate the effects of synthetic data in the fine-tuning phase of training a deep neural network. We also highlight the benefit of using the FTRG method compared to using only photorealistic synthetic images and/or domain randomization. We trained multiple Mask-RCNN [59] models for instance segmentation using different datasets and evaluated them on the same test dataset.² All the models had the same network architecture and were initialized with the same pretrained weights (pretrained on the COCO dataset [60]). We used the same hyperparameters and a fixed number of training steps to fine-tune all the models. We trained six models in total and named each after the type of data used for its training.
1) MRCNN-R: This model was fine-tuned using only real data. The training dataset was annotated by our automated real dataset annotation method.
2) MRCNN-P: This model was fine-tuned using only photorealistic samples from the OE synthetic dataset.
3) MRCNN-DR: This model was fine-tuned using only synthetic images with unrealistic textures from the OE synthetic dataset.
4) MRCNN-DR-R: This model was fine-tuned using both domain-randomized synthetic samples from the OE synthetic dataset and samples from the real dataset.
5) MRCNN-DR-P-R: This model was fine-tuned using synthetic samples from the OE synthetic dataset (both photorealistic and domain randomized) and real samples.
6) MRCNN-FTRG: This model was fine-tuned using the FTRG dataset, which combines the real dataset with the domain-randomized synthetic dataset using the FTRG method.

For the test data, we collected real images using a variety of scenes (different backgrounds and arrangements of objects), clutter (level of clutter as well as types of clutter objects), and illumination conditions. This test set was also annotated with our automated real data annotation method. In Table IV, we report the mean average precision (mAP) at IoU greater than or equal to 0.5 for all six models over five randomly selected subsets of the test dataset.
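The IoU criterion underlying the reported metric can be illustrated with a small helper for binary masks. The sketch below only shows how a predicted mask is judged against a ground-truth mask at the 0.5 threshold; it is not a full mAP implementation, and the function name and toy masks are assumptions.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union

# Toy 8x8 masks: one ground truth and two predictions shifted by 2 and 1 pixels.
ground_truth = np.zeros((8, 8), dtype=bool)
ground_truth[0:4, 0:4] = True
pred_far = np.zeros((8, 8), dtype=bool)
pred_far[0:4, 2:6] = True
pred_close = np.zeros((8, 8), dtype=bool)
pred_close[0:4, 1:5] = True

iou_far = mask_iou(ground_truth, pred_far)      # 8 / 24
iou_close = mask_iou(ground_truth, pred_close)  # 12 / 20
matches = [iou >= 0.5 for iou in (iou_far, iou_close)]
```

Only the second prediction would count as a correct detection at this threshold; average precision is then computed from such matches over all predictions ranked by confidence.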
The results show that models fine-tuned exclusively on synthetic data performed worse than the model trained exclusively on real samples.However, models trained on datasets consisting of both synthetic and real samples could significantly outperform the model trained using only real data.This suggests that using synthetic data in the fine-tuning phase of training deep neural networks can yield superior performance compared to training only on real data.

Fig. 6. Samples from our FTRG dataset. (a) Seamless transition from real to synthetic textures on a selected subset of objects. (b) Random texture for a selected subset of objects of interest with real background and clutter. (c) Real objects in a synthetic scene with synthetic clutter.

Fig. 7. Train-test accuracy matrices of image classification models, using experience replay and data from the OpenLORIS Object and the SynLORIS datasets; as evaluated by the OpenLORIS Object benchmark on all four factors (brighter color means greater accuracy).

TABLE I
COMPOSITION OF THE OE SYNTHETIC DATASET

TABLE II
TRAIN-TEST ACCURACY MATRIX R FROM [2]; Tr REPRESENTS TRAINING DATA, Te REPRESENTS TESTING DATA, R_{i,j} IS THE ACCURACY OF THE MODEL TRAINED ON Tr_i AND EVALUATED ON Te_j, AND N IS THE NUMBER OF TASKS

TABLE III
QUANTITATIVE RESULTS FROM OUR CONTINUAL LEARNING EXPERIMENTS. VALUES PER CELL FROM TOP TO BOTTOM: ACCURACY, BWT, FWT, AND OVERALL ACCURACY (AS DESCRIBED IN [2])

TABLE IV
PERFORMANCE OF MODELS (mAP @ IoU ≥ 0.5) EVALUATED ON FIVE RANDOMLY SELECTED SUBSETS OF OUR TESTING SET