Object Detection Using Sim2Real Domain Randomization for Robotic Applications

Robots working in unstructured environments must be capable of sensing and interpreting their surroundings. One of the main obstacles to applying deep-learning-based models in robotics is the lack of domain-specific labeled data for different industrial applications. In this article, we propose a sim2real transfer learning method based on domain randomization for object detection, with which labeled synthetic datasets of arbitrary size and object types can be generated automatically. Subsequently, a state-of-the-art convolutional neural network, YOLOv4, is trained to detect the different types of industrial objects. With the proposed domain randomization method, we could shrink the reality gap to a satisfactory level, achieving 86.32% and 97.38% $\mathrm{mAP}_{50}$ scores in the case of zero-shot and one-shot transfers, respectively, on our manually annotated dataset containing 190 real images. Our solution is suitable for industrial use, as the data generation process takes less than 0.5 s per image and the training lasts only around 12 h on a GeForce RTX 2080 Ti GPU. Furthermore, it can reliably differentiate similar classes of objects with access to only one real image for training. To the best of our knowledge, this is the only work thus far satisfying these constraints.


I. INTRODUCTION
New-generation intelligent manufacturing (NGIM) is a recent trend embodying the in-depth integration of new-generation artificial intelligence (AI) with advanced manufacturing technology such as robotics. It has become the driving force of the fourth industrial revolution and belongs to the human-cyber-physical system 2.0 (HCPS 2.0) framework. Contrary to traditional manufacturing, where robots work in structured environments and perform high-accuracy, repetitive tasks with minimal sensory input, an NGIM system is designed to be flexible and to take over as much intellectual and manual labor as possible. Thus, the human workforce can concentrate on more valuable creative work [1]. Consequently, the main interest has been shifting toward adaptive robotic applications that can cost-efficiently handle low-quantity customized products and integrate human operators with different skills and abilities. Computer vision plays an essential role here: this highly researched field has already proven useful in pick-and-place, bin picking, grasping, navigation, and quality assurance tasks. As two examples, Zeng et al. [2] created a vision-based model that can predict parameters of motion primitives through trial and error for robotic grasping and throwing, and Alonso et al. [3] designed a model for real-time semantic segmentation of RGB images for a mobile robot application.
In a computer vision context, deep convolutional neural networks (DCNNs) have been performing incredibly well on large public datasets such as ImageNet [4] or MS COCO [5].
Having reached human-level performance in classification, the main focus of computer vision research has shifted to object detection and led to networks such as the faster R-CNN [6], single shot multibox detector (SSD) [7], and YOLOv4 [8]. Even though these networks outperform traditional machine-vision-based methods by a significant margin, their application in robotic systems has its difficulties. One of the main obstacles in applying deep-learning-based models is that they need to be trained on a sufficiently large, domain-specific, and expertly labeled dataset.
Levine et al. [9] conducted two experiments in order to create a dataset of real images for their DCNN model predicting the success of robotic grasp attempts, as well as to control these attempts. Yielding records of more than 1.7 million grasp attempts with the simultaneous use of 6-14 robots, the process took months to complete. This example shows that, in general, collecting a dataset from the real world not only requires an immense amount of resources but is a time-consuming process as well.
The main motivation behind transfer learning is to overcome the aforementioned obstacle by transferring knowledge between tasks or domains [10], [11]. Sim2real transfer is a special case of transfer learning, where the source domain is the virtual simulation of the real world, while the target domain is the physical reality itself [12]. With sim2real knowledge transfer, the model can be trained in a virtual simulation, having the necessary amount of labeled synthetic data. In the case of computer vision, the images are rendered, and the labels, too, can be generated for them in a self-supervised way. Thus, the time-consuming process of data collection and labeling can be omitted. As the domains of the training and test datasets are inherently different, ceteris paribus, the learned model will perform poorly in the target domain. The phenomenon of performance loss from the simulation to the real domain is called the reality gap. Domain adaptation [13] and domain randomization [14], [15] are two ways of shrinking this gap.
The contributions of this article are as follows.
1) A sim2real domain randomization method for object detection, which comprises a data generation process with our domain randomization methods, a model training phase, and an evaluation phase.
2) A real-world dataset of 190 manually annotated images (RGB and depth) containing 920 objects of ten classes that addresses the problem of high class similarity to validate our method. The dataset is publicly available alongside our code and can serve as a benchmark for object detection algorithms.
3) For evaluation, we introduced an altered type of confusion matrix fit for object detection. It proved to be extremely useful for detecting and quantifying misclassifications, which are the primary cause of performance loss in the case of similar classes.
4) A real-world robotic implementation of the method as a proof of concept, containing an ROS-based robot control system.
5) Our implementations of our sim2real data generation and training module and of our robot control framework, which are available online (see footnotes 1 and 2). Both can be used as out-of-the-box software modules for industrial robotic applications.
Results of the article are as follows.
1) We achieved 86.32% $\mathrm{mAP}_{50}$ and 97.38% $\mathrm{mAP}_{50}$ scores in zero-shot and one-shot transfers, respectively, which shows the usability of our methods even in the case of an industrial application where high reliability is crucial.
2) Our experiments show that having even only one sample image from the target domain significantly improves the model's performance for similar classes.
3) A thorough ablation study focusing on finding the key factors of the data generation process.
1 [Online]. Available: https://git.sztaki.hu/emi/sim2real-object-detection
2 [Online]. Available: https://git.sztaki.hu/emi/robot_control_framework
The industrial benefit of our work is a freely available tool streamlining the training of CNN-based models for object detection. Our built-in sim2real domain randomization method spares the user the effort of collecting and annotating a large dataset, as it automatically generates training data from 3-D models. Optionally, one annotated image with all relevant objects can be added for improved performance. With training being automated as well, the entire workflow from 3-D models to trained CNN takes only around 13 h.
The rest of this article is organized as follows. Sections II and III present the problem statement and related work. In Sections IV and V, our method is outlined with the evaluation metrics. In Sections VI and VII, our dataset and our results are shown. Additionally, in Section VIII, a thorough ablation study of our method is presented. Section IX shows a real-world robotic implementation of our method. Finally, Section X concludes this article.

II. PROBLEM STATEMENT
The main problem we tackle is how to transfer knowledge efficiently from simulation to the real world in the case of object detection. In the common computer vision problem of detecting objects, the model is given an image in which it finds center points and dimensions of bounding boxes around objects and classifies the latter. Knowledge transfer, which is the primary focus of our article, belongs to the field of transfer learning. Additional challenges arise from the following circumstances.
1) Having no or only one image from the real world.
2) Having industrial objects that share similar features, thus making it more challenging to properly classify them.
In the following, we give a brief overview of transfer learning, which is the main field of this article.
Pan et al. [10] and Weiss et al. [11] define transfer learning in their surveys. A domain $\mathcal{D}$ is defined with a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, x_2, x_3, \ldots, x_n\} \in \mathcal{X}$, while a task $\mathcal{T}$ is defined with a label space $\mathcal{Y}$ and a predictive function $f(\cdot)$. From a probabilistic point of view, $f(\cdot)$ can be seen as $P(Y \mid X)$. Thus, a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$ and a task $\mathcal{T} = \{\mathcal{Y}, P(Y \mid X)\}$. Given a specific source domain and task pair $\{\mathcal{D}_S, \mathcal{T}_S\}$ and a specific target domain and task pair $\{\mathcal{D}_T, \mathcal{T}_T\}$, transfer learning can be defined as the process of increasing the performance of the target predictive function $f_T(\cdot)$ with the help of knowledge gained from $\{\mathcal{D}_S, \mathcal{T}_S\}$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
An example of $\mathcal{T}_S \neq \mathcal{T}_T$ is when an image classifier, trained on a large public dataset, is reused and altered to perform object detection on the same domain ($\mathcal{D}_S = \mathcal{D}_T$). In this case, the label spaces are different, and consequently, the conditional probability distributions of the inputs and the outputs $P(Y \mid X)$ are disparate as well. Nevertheless, the marginal distributions of the inputs $P(X)$ are equivalent.
The other case of transfer learning is when $\mathcal{T}_S = \mathcal{T}_T$ but $\mathcal{D}_S \neq \mathcal{D}_T$, i.e., the tasks are the same, yet the domains are different. Thus, the marginal probability distributions $P(X)$ and, possibly, the conditional probability distributions $P(Y \mid X)$ differ in the source and target domains. These phenomena are referred to as the frequency feature bias and the context feature bias, respectively. In order to improve $f_T(\cdot)$, our aim is to extract the relevant (not domain-specific) knowledge from $\{\mathcal{D}_S, \mathcal{T}_S\}$, so that the model performs well on the target domain ($\mathcal{D}_T$). If $\mathcal{D}_S$ and $\mathcal{D}_T$ are not related enough, negative transfer can occur, and knowledge transfer does not improve the performance of $f_T(\cdot)$, or even decreases it.
Sim2real object detection is a special case of transfer learning: instead of real images obtained from the target domain, the model is trained on synthetic data, thus $\mathcal{D}_S \neq \mathcal{D}_T$. On the other hand, it performs the same task, namely object detection (on the same classes of objects), therefore $\mathcal{T}_S = \mathcal{T}_T$. Nevertheless, the model trained on synthetic data, ceteris paribus, does not work on real images as the domains are disparate. This phenomenon is referred to as the reality gap, and the main goal of sim2real transfer is to bridge this gap. In our case, the sim2real transfer is the second phase of the knowledge transfer, shown in Fig. 1.
Domain adaptation (DA) is an approach to diminish the reality gap. It attempts to transform one domain into the other domain or transform both domains into a common domain. In the case of sim2real object detection, it usually consists in generating photo-realistic images for the training dataset. The more the generated images resemble the real ones, the more the difference between domains is reduced, and thus, the performance with real images is improved. Typical data generation models for domain adaptation are based on variational autoencoders (VAE) [16] or generative adversarial networks (GAN) [17].
Domain randomization (DR), on the other hand, introduces variability by adding artificial noise to the synthetic training images. The idea is that the added noise makes the model robust to different domains, as it does not overfit on the domain-specific characteristics, but learns the domain-independent underlying data representation. Another possible interpretation is to regard the different domains as perturbed versions of one common domain. The general idea of introducing variance to simulation was first presented by Jacobi [18].
Other important concepts of transfer learning are the zero-shot and the one-shot transfers. In the context of object detection, zero-shot transfer means that not even one image is used from the target domain for training. In the case of one-shot transfer, only one or a few images are used from the target domain. A way to do one-shot transfer is to train the model on synthetic data and then fine-tune it on some examples from the target domain. In our one-shot transfer case, only one real image was used for training. Moreover, we did not separate the process into training and fine-tuning, as we mixed the copies of the real image and the synthetic images.
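The mixing strategy described for the one-shot case can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `build_training_list`, the file names, and the number of copies of the real image are hypothetical, as the text above does not specify them.

```python
import random


def build_training_list(synthetic_paths, real_path, n_real_copies, seed=0):
    """Mix copies of one real image into the synthetic training set.

    Assumption: n_real_copies is a free parameter chosen by the user;
    its value is not taken from the article.
    """
    samples = list(synthetic_paths) + [real_path] * n_real_copies
    # Shuffle so that the real copies are spread over the training order
    # instead of forming a separate fine-tuning phase at the end.
    random.Random(seed).shuffle(samples)
    return samples


if __name__ == "__main__":
    synthetic = [f"synthetic_{i:04d}.png" for i in range(10)]
    mixed = build_training_list(synthetic, "real_0000.png", n_real_copies=3)
    print(len(mixed), mixed.count("real_0000.png"))
```

Because the real copies are shuffled into the same list as the synthetic images, a single training run sees both domains, which matches the description of not separating training from fine-tuning.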

III. RELATED WORK
This section presents the related work in sim2real knowledge transfer (Section III-A) and the different types of object detection models (Section III-B).

A. Domain Randomization and Domain Adaptation
Tobin et al. [14] trained a modified version of the VGG-16 [19] deep neural network architecture for object localization. They generated nonrealistic synthetic RGB images randomizing the number and shape of the distractor objects, the position and texture of all objects, the texture of the background, the position, orientation, and field of view of the camera, the number of lights in the scene, the position, orientation, and specular characteristics of the lights, and the type and amount of random noise added to images. The random textures were either a random single color, a gradient between two colors, or a checker pattern of two random colors. The following nonindustrial objects were used: cones, cubes, cylinders, hexagonal prisms, pyramids, rectangular prisms, tetrahedrons, and triangular prisms. The images were rendered with the built-in renderer of the MuJoCo Physics Engine [20], and no real images were used for training the model. They achieved around 1.5 cm accuracy in the real-world environment. Tobin et al. [21] conducted further research, where they trained a deep neural network for grasp planning using only synthetic images and domain randomization, and achieved an 80% success rate in a real-world environment.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Borrego et al. [22] presented a plug-in for the Gazebo simulator [23]. Introducing variation reduced the reality gap between simulated and real-world data. In the case study, three types of objects were detected: box (cube), cylinder, and sphere. The simulated scenes contained a ground plate and a single light source. The objects were placed on a grid to prevent collision, but they were rotated randomly. (In this regard, we found that introducing some disturbance to object placement significantly increases the performance; see Section VIII-D.) Then, the camera and the light source were moved to random positions. Four different types of textures were used, namely, flat, gradient, checkerboard, and Perlin noise [24]. For training, the SSD [7] and, separately, the faster R-CNN [6] networks were used. With the two networks, 70% and 88% $\mathrm{mAP}_{50}$ were achieved, respectively, using 121 real images. Training the same networks with 9000 simulated images yielded 64% and 82% $\mathrm{mAP}_{50}$, respectively. Interestingly, the hybrid approach (using real and synthetic images) accomplished 62% and 83% $\mathrm{mAP}_{50}$, respectively. For all experiments, an IOU threshold of 0.5 (the 50 in $\mathrm{mAP}_{50}$) was used, and the test results were validated on 121 real images (different from the 121 images used for training). A follow-up ablation study [15] revealed that Perlin noise has a crucial influence on the performance of the model. Furthermore, the data generation process was accelerated to 9000 full-HD images in roughly 1.5 h (around 0.6 s per image).
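To make the role of the 0.5 IOU threshold concrete, the following sketch computes the intersection over union of two axis-aligned boxes and decides whether a detection counts as a true positive. The corner-based box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this illustration; whether the comparison with the threshold is strict or not varies between evaluation toolkits.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_class, gt_box, gt_class, threshold=0.5):
    """A detection matches if the class agrees and the IOU reaches the threshold."""
    return pred_class == gt_class and iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

The mean average precision is then obtained by averaging the per-class average precision values computed from such matches; that aggregation step is omitted here for brevity.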
Pashevich et al. [12] trained manipulation policies in a simulation environment with an object localization proxy task. Depth images for training were simulated in PyBullet [25] and gathered with a Kinect-1 camera from the real world. For finding the best data augmentation transformations and their order, Monte Carlo Tree Search (MCTS) [26] was used. The transformations were selected from the Python Image Library (PIL). The transformations were evaluated individually and as a sequence as well. From all transformations examined, the cutout transformation [27] performed best on real images (although in our experiments this was not the case for RGB images; see Section VIII-E), and the best sequence of transformations was: cutout, erasing an object, white noise, edge noise, scale, salt noise, posterize, and sharpness, in this order. With the aforementioned sequence, a position error of 1.09 ± 0.73 cm was achieved in the real environment, for cubes of 4.7 cm edge length.
James et al. [28] trained an end-to-end robotic controller on synthetic data with domain randomization. The inputs of the deep-neural-network-based model were an image and the joint angles of the robot, while its outputs were motor velocities. The task was an abstract tidying manipulation, namely, putting a cube into a basket. Similarly to [22], Perlin noise was used as a perturbation. The model was examined in dynamically changing illumination settings, in the presence of distractors, including human presence, with a new cube size at test time, and with a moving basket. Experiments yielded at least a 75% success rate in all conditions, except for a spotlight and a smaller cube at test time. In these cases, the model had a 56% and a 41% success rate, respectively.
Devo et al. [29] used domain randomization to train a target-driven visual navigation model. The goal was to find a specific object in a maze. Maze wall heights, maze wall textures, maze floor textures, light color and intensity, and the light source angle were subject to randomization. For simulation, the Unreal Engine 4 [30] was used. An average success rate of 72% was achieved in simulation, and 46% in the real world. The experiments showed that direct sim2real transfer is possible for this kind of problem as well.
Chen et al. [31] created the domain adaptive faster R-CNN model for cross-domain object detection. Domain shift stemming from image-level and instance-level shifts was tackled with an approach based on H-divergence theory and adversarial training. The study focuses on street images for self-driving cars where the domains are disparate due to different camera types and setups, different cities and diverse appearance of objects, or the particular weather conditions. Some experiments were also carried out with sim2real knowledge transfer, as the model was trained on SIM 10k [32] and evaluated on the Cityscapes dataset [33]. Their method improved the performance of the faster R-CNN model from 30.12 $\mathrm{AP}_{50}$ to 38.97 $\mathrm{AP}_{50}$ in the case of the car class.
Focusing on street scenarios, Sankaranarayanan et al. [34] proposed an unsupervised domain adaptation approach based on generative adversarial networks for semantic segmentation problems. For the synthetic source domain, the SYNTHIA [35] and the GTA V [36] datasets were used, and for the target domain, the Cityscapes dataset [33]. The approach achieved 36.1 and 37.1 mIOU scores transferring knowledge from SYNTHIA and GTA V, respectively. Without domain adaptation, the method scored 26.8 and 29.6 mIOU.
Tremblay et al. [37] generated synthetic images with domain randomization techniques to perform object detection of cars in street scenarios. 100 K images were generated with a maximum of 14 cars each, selected randomly from 36 car models. The models were evaluated on the KITTI dataset [38]. Three DCNN architectures were trained (faster R-CNN [6], R-FCN [39], and SSD [7]), scoring 78.1, 71.5, and 46.3 $\mathrm{AP}_{50}$, respectively, on the single-class object detection problem. Interestingly, better results were obtained than by training the same architectures on the virtual KITTI dataset [40], which has a high correlation to the KITTI dataset. The performance could be improved by fine-tuning the models on real images. With 6000 real images, the performance of the faster R-CNN model reached 98.5 $\mathrm{AP}_{50}$.
Hinterstoisser et al. [41] inserted 3-D models of objects in real images, using OpenGL with Phong shading [42] for rendering. Small perturbations were permitted in the ambient, diffuse, and specular parameters, and the light color. Gaussian noise and a blur with a Gaussian kernel were added to better integrate the objects with the background. A faster R-CNN model was primarily used for training, with the weights of the feature extractor frozen. The latter significantly improved the performance of the model (although Tremblay et al. [37] later reported the opposite effect in their case).
Zhang et al. [43] proposed an adversarial discriminative sim2real approach to transfer visuo-motor policies. The method was demonstrated in a table-top object-reaching task. A blue cuboid object had to be reached with a velocity-controlled 7-DoF robot arm. The method could reduce the real data requirement by 50%, while a 97.8% success rate and 1.8 cm control accuracy were achieved.
Clever et al. [44], [45] proposed a method to predict human position (resting on a bed) and contact pressure from depth data and gender information. The method achieved 3.837 MSE (kPa$^2$) trained on 97 K synthetic images. In comparison, the same method reached 3.151 MSE (kPa$^2$) trained on 11 K real images and 2.849 MSE (kPa$^2$) trained on both real and synthetic images (108 K). For evaluation, the SLP dataset [46] was used.
Gomes et al. [47] proposed a simulated model for the GelSight tactile sensor. Having computed the height map of the elastomer, the internal illumination of the elastomer is calculated. The usefulness of the model was also demonstrated with a sim2real classification task. For the study, 12 texture maps resembling real objects were created and randomly perturbed on the captured synthetic data, improving the classification accuracy from 43.76% to 76.19%.

TABLE I SUMMARY OF RELATED WORKS
The above works are diverse in terms of the problem itself, the input type, the domain of the application, and the amount of synthetic and real images used to train the model, making a complete comparison a challenge. Nevertheless, a general overview organized by selected characteristics is presented for reference in Table I. In general, certain limitations of the above works relate closely to our work (solving object detection).
1) The classification part of the problem is less challenging as the works use easy shapes such as cubes, spheres, and cones, or even have one class only.
2) The works rely on considerably more synthetic and real images for training.
Even though the cited works use transfer learning (domain adaptation or domain randomization) to reduce the reality gap, they do not solve the same machine learning problem, and may use different models as well. All of this needs to be taken into consideration if an in-depth comparison is desired.

B. Object Detection
A comprehensive overview of object detection models and the history of the field is not in the scope of this article, therefore, we limit this section to a selection of sources relevant to our work.
For further detail, we refer the reader to standard surveys such as the work of Zou et al. [53].
DCNN architectures can be categorized into two groups: two-stage detectors and one-stage detectors. Two-stage detectors have a proposal detection stage, where a set of bounding box candidates is generated, and a verification stage, where each of these bounding boxes is evaluated separately to determine whether it contains an object of a specific class. Examples of these networks are R-CNN [54], SPPNet [55], fast R-CNN [56], and faster R-CNN [6].
In the case of one-stage detectors, on the other hand, a single neural network is applied to the full image that predicts the bounding boxes straight away. The slow detection time, which is the biggest disadvantage of the two-stage detectors, can be overcome with the one-stage approach. Detection time is crucial for many applications, especially but not exclusively in the field of robotics or self-driving cars. Redmon et al. proposed the first one-stage detector, YOLO, in 2015 [57], which was also the first real-time object detector. Subsequent updates introduced its second [58] and third versions [59]. Single shot multibox detector (SSD) [7] and RetinaNet [60] are two other popular one-stage detectors.
Bochkovskiy et al. [8] created the fourth version of YOLO, aiming to improve the accuracy of the model while still keeping an optimal accuracy-speed tradeoff. With the CSPDarknet-53 [8] backbone, 65.7% $\mathrm{mAP}_{50}$ could be achieved on the MS COCO dataset [5] at a speed of around 65 FPS on a Tesla V100. In comparison, on the same dataset, SSD with a VGG-16 [19] backbone achieved 48.5% $\mathrm{mAP}_{50}$ and RetinaNet with a ResNet-101 [61] backbone achieved 57.5% $\mathrm{mAP}_{50}$.

IV. METHOD
This section presents our method in detail, namely, the proposed sim2real knowledge transfer in Section IV-A, the data generation module in Section IV-B, and the training module in Section IV-C. The implementation is freely available online (see footnote 3).

A. Sim2Real Knowledge Transfer
The flowchart diagram of our data generation, training, and evaluation process is depicted in Fig. 2. It can be broken down into functionally separable tasks. The data generation process creates randomized and postprocessed synthetic images of given objects. It also automatically generates the annotations for the images. Thus, the output of the data generation process is a set of images paired with their labels grouped into a training and a validation dataset.
In order to train the model (with the set of hyperparameters), only the images from the training dataset are used. As the initial layers of the neural network perform low-level image processing tasks such as detecting contours, lines, or edges, we utilized a pretrained image classifier model as a feature extractor of our object detector. This is the first phase of our knowledge transfer, depicted on the left side of

B. Data Generation
The data generation process is responsible for the creation of synthetic images paired with accurate automatic ground-truth annotations. In several stages of this process, artificial random perturbations are applied as domain randomization techniques.
1) Framework: For data generation, the PyBullet [25] physics simulator was utilized since it has an easy-to-use, intuitive API, including an image renderer tool, and an integrated physics simulator where the gravitational force can be simulated easily.
The duration of dataset generation is a relevant aspect of the method, as in the industry, on many occasions, it is not feasible to wait long hours or even days to start the training, which can be a time-consuming process itself. One of the biggest advantages of domain randomization over domain adaptation is that it is generally faster, as images do not need to be photo-realistic. In our case, for data generation, we could achieve less than 0.5 s per image on a GeForce RTX 2080 Ti GPU. With 4000 images, this amounts to around 33 min. If one image took 1 min to render instead of 0.5 s (which is plausible in the case of photorealistic images), the aforementioned 4000 images would take more than 66 h. Having generated the dataset, the training lasts around 12 h; thus, a complete generation and training process can be executed automatically in around 13 h.
2) Object Generation: The framework is capable of placing any type of object into the simulation if its 3-D description file is given. In the case of industrial applications, which is the aim of this research, these 3-D models are easily accessible.
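The timing figures quoted for the 4000-image dataset can be checked with a few lines of arithmetic. The snippet below only restates the numbers given in the text (0.5 s and 1 min per image, 12 h of training); the function name is introduced here for illustration.

```python
def generation_time_s(n_images, seconds_per_image):
    """Total dataset generation time in seconds."""
    return n_images * seconds_per_image


fast = generation_time_s(4000, 0.5)   # domain randomization: 0.5 s per image
slow = generation_time_s(4000, 60.0)  # photorealistic rendering: 1 min per image
total_h = fast / 3600 + 12.0          # generation plus roughly 12 h of training

print(f"{fast / 60:.1f} min")   # 33.3 min
print(f"{slow / 3600:.1f} h")   # 66.7 h
print(f"{total_h:.1f} h")       # 12.6 h, i.e., around 13 h
```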

TABLE II MOST RELEVANT INPUT PARAMETERS OF THE DATA GENERATOR MODULE IN TERMS OF OBJECT GENERATION
The most relevant input parameters of the object generation process are summarized in Table II and the process works as follows.
• A horizontal plane is placed at the vertical z = 0 position.
• According to the grid size (n) and the grid spacing (d) parameters, the x and y coordinates of the grid points are set.
• The vertical z > 0 coordinates of the grid points are set by a separate input parameter (see Table II).
• The objects are not placed exactly at the grid points. The x, y, and z coordinates of the objects are drawn from a uniform distribution whose center point is a given grid point and whose limits are set by pos.
• Once object selection has been performed, the appropriate 3-D model of the object is loaded at the specific coordinates. Predefined weights describe the probability of selecting a specific object. Furthermore, a distracting cuboid object (which is not in any of the classes) or a void object (leaving that grid point empty) can be selected. The sizes of the distracting objects are also individually randomized. The aforementioned probabilities are set by $P_{\mathrm{objects}}$.
• The objects are also randomly rotated around the x, y, and z axes, described by a uniform distribution whose limits are set by rot.
• The objects and the ground plane are given random textures drawn from three public datasets [62], [63], [64], with the probability of $P_{\mathrm{texture}}$. Some examples of the textures are shown in Fig. 3. Random RGB colors are assigned to the objects (or to the ground plane) which do not receive any texture.
• Before rendering the image, the objects are dropped from z > 0 to the ground plane. Thus, the objects are captured in a natural stable position. The simulation of the free fall takes around 0.05-0.1 s per scenario (with every step included, the generation of an image with its label takes around 0.45-0.5 s).
Fig. 3. Some examples of the textures used [62], [63], [64].
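The placement steps above (grid points, uniform position and rotation jitter, weighted object selection with distractor and void options) can be sketched in plain Python. This is an illustrative sketch only: the function and parameter names (`sample_layout`, `z0`, `pos_limit`, `rot_limit`, `class_weights`), the centering of the grid on the origin, and the example values are assumptions, and the loading of 3-D models, texturing, and the free-fall simulation in the physics engine are left out.

```python
import random


def sample_layout(n, d, z0, pos_limit, rot_limit, class_weights, rng):
    """Sample one randomized object layout on an n x n grid.

    Returns a list of (name, (x, y, z), (rx, ry, rz)) tuples.
    Assumptions: the grid is centered on the origin, and z0 > pos_limit
    so that every object starts above the ground plane at z = 0.
    """
    if z0 <= pos_limit:
        raise ValueError("z0 must exceed pos_limit to keep z > 0")
    names = list(class_weights)
    weights = [class_weights[name] for name in names]
    offset = (n - 1) * d / 2.0
    layout = []
    for i in range(n):
        for j in range(n):
            # Weighted choice among object classes, distractor, and void.
            name = rng.choices(names, weights=weights, k=1)[0]
            if name == "void":
                continue  # this grid point stays empty
            # Uniform jitter centered on the grid point.
            position = (
                i * d - offset + rng.uniform(-pos_limit, pos_limit),
                j * d - offset + rng.uniform(-pos_limit, pos_limit),
                z0 + rng.uniform(-pos_limit, pos_limit),
            )
            # Uniform random rotation around the x, y, and z axes.
            rotation = tuple(rng.uniform(-rot_limit, rot_limit) for _ in range(3))
            layout.append((name, position, rotation))
    return layout


if __name__ == "__main__":
    weights = {"part_a": 0.4, "part_b": 0.3, "distractor": 0.2, "void": 0.1}
    for item in sample_layout(3, 0.2, 0.5, 0.05, 3.14, weights, random.Random(0)):
        print(item)
```

In a full pipeline, each sampled entry would be passed to the simulator to load the corresponding model, after which the simulation is stepped until the objects come to rest.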

TABLE III MOST RELEVANT INPUT PARAMETERS OF THE DATA GENERATOR MODULE IN TERMS OF IMAGE RENDERING
3) Image Rendering: For rendering an image, the camera pose, its intrinsic parameters, and additional parameters must be set. The most relevant parameters of the image rendering are presented in Table III. The algorithm works as follows.
• The camera is placed in a randomized position pointing to a random point around the center of the grid, defined by $C_{\mathrm{pos}}$ and $T_{\mathrm{pos}}$. The randomization is constrained to ensure that the center points of all objects appear on the rendered image.
• The camera field-of-view (FOV) and its additional intrinsic parameters are set. Image width and height are obtained from a uniform distribution defined by $R_{\mathrm{width}}$ and $R_{\mathrm{height}}$.
• The RGB, D (depth), or RGB-D images are taken, as defined by $I_{\mathrm{type}}$. RGB-D images are created by concatenating the RGB and the D images. For one layout (object generation), only one image is taken.
4) Label Generation: Having generated the objects and rendered the image, the ground-truth bounding box (BB) parameters must be computed. For different CNNs, the format is different, but it is generally true that the bounding box parameters describe all the objects in an image as follows.
• The center point (x and y) of the object.
• The width and the height of the bounding box.
• The class of the object.
The aforementioned ground-truth generation is an automatic process involving a coordinate transformation from the simulation 3-D world coordinate system to the image 2-D coordinate system.
The 4×4 transformation matrix is the matrix product of the view matrix and the projection matrix of the camera. In order to transform a point from the world coordinate system to the image space, it must be multiplied by this transformation matrix and then divided by its fourth (homogeneous) coordinate to get the true projection. For a detailed explanation of the projection, we refer the reader to [65].
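The projection step can be illustrated with a short, dependency-free sketch. It assumes row-major 4×4 matrices given as nested lists, column-vector convention (the combined matrix is projection times view), and an OpenGL-style mapping from normalized device coordinates to pixels with the image origin in the top-left corner. These conventions and the function names are assumptions of this example; a simulator may store its matrices in a different layout, which would have to be converted first.

```python
def matmul4(a, b):
    """Product of two 4x4 matrices given as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def world_to_pixel(point, view, proj, width, height):
    """Project a 3-D world point to 2-D pixel coordinates (u, v)."""
    m = matmul4(proj, view)
    x, y, z = point
    # Homogeneous multiplication with the implicit fourth coordinate 1.
    clip = [m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3] for i in range(4)]
    # Perspective divide by the fourth coordinate.
    ndc_x, ndc_y = clip[0] / clip[3], clip[1] / clip[3]
    # Map [-1, 1] to pixel coordinates; the image y axis points downward.
    u = (ndc_x + 1.0) * 0.5 * width
    v = (1.0 - ndc_y) * 0.5 * height
    return u, v


if __name__ == "__main__":
    near, far = 0.1, 10.0
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    # Perspective matrix for a 90 degree FOV and unit aspect ratio.
    proj = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]
    print(world_to_pixel((1.0, 1.0, -2.0), identity, proj, 640, 480))
```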
In the framework, we implemented two ways of computing the bounding boxes. The two approaches differ in the number of points that are transformed into the image space. One approach transforms only the eight corner points (per object) of the world 3-D axis-aligned bounding boxes (AABB), which are available in the PyBullet simulator, whereas the other transforms all the points of the objects to the image space. Henceforward, we refer to the former as the eight-point method and to the latter as the all-point method. Having obtained the transformed points, the second part of the algorithm is the same in both cases: the minimum and maximum values in the x and y directions are selected to define the limits of the BBs. The center point can be computed as the arithmetic mean of the minimum and maximum values. As a result, the all-point method gives tighter, more accurate bounding boxes at the cost of extra calculations.
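The shared second step of both methods can be sketched as follows. The function names are hypothetical, the points are assumed to be already projected to pixel coordinates, and the normalized (center, width, height) output is an assumption modeled on the YOLO annotation format; clipping to the image borders is not handled.

```python
from itertools import product

def aabb_corners(aabb_min, aabb_max):
    """Return the eight corner points of a 3-D axis-aligned bounding box."""
    return [tuple(c) for c in product(*zip(aabb_min, aabb_max))]

def bbox_from_pixels(pixels, img_w, img_h):
    """Enclose projected (u, v) points in a 2-D box.

    Returns (x_center, y_center, width, height), normalized by the image size.
    """
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    u_min, u_max = min(us), max(us)
    v_min, v_max = min(vs), max(vs)
    return ((u_min + u_max) / 2.0 / img_w,   # center as mean of min and max
            (v_min + v_max) / 2.0 / img_h,
            (u_max - u_min) / img_w,
            (v_max - v_min) / img_h)

# Eight-point method: project only these corners, then enclose them.
corners = aabb_corners((0, 0, 0), (1, 2, 3))
# All-point method: the same enclosing step, applied to every projected vertex.
box = bbox_from_pixels([(100, 50), (300, 150), (200, 100)], 400, 200)
```

The only difference between the two methods is which point set is passed to the enclosing step, which is why the eight-point variant can never be tighter than the all-point one.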

5) Postprocessing:
Domain randomization was applied in multiple steps of the previously defined synthetic image generation process. In the postprocessing phase, additional artificial noise is introduced to alter the images as a further domain randomization technique. The images are perturbed with a randomized multicolor pepper-and-salt noise and a Gaussian blur. Furthermore, as Pashevich et al. [12] found rectangle cutouts useful for depth images, we experimented with rectangle cutouts, as well as additional circle and line cutouts, in our RGB images. The noise types are shown in Fig. 4, described in Table IV, and were applied in the following order: 1) rectangle cutout; 2) circle cutout; 3) line cutout; 4) multicolor pepper-and-salt; 5) Gaussian blur. The goal of postprocessing is to force the model not to learn the synthetic domain-specific characteristics, but to try to learn the domain-independent underlying data representation. The ablation study on our experiments, described in Section VIII, shows that the postprocessing module clearly improves the performance of the models on the test dataset. Nevertheless, it also reveals that the added cutout noises did not improve the performance compared to the default Gaussian blur and multicolor pepper-and-salt noise.
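The two default noise types and their order of application can be sketched in a few lines. This is an illustrative stand-in rather than the actual postprocessing module: images are plain nested lists of RGB tuples to avoid dependencies, "multicolor" is interpreted here as random saturated colors, and the noise amount, blur radius, and sigma are hypothetical values. The application probabilities of 1.0 and 0.5 are the ones reported for the best zero-shot model in Section VII-A.

```python
import math
import random

def pepper_and_salt(image, amount, rng):
    """Replace a fraction `amount` of the pixels with random saturated colors."""
    colors = [(r, g, b) for r in (0, 255) for g in (0, 255) for b in (0, 255)]
    return [[rng.choice(colors) if rng.random() < amount else px for px in row]
            for row in image]

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(pixels, kernel):
    """Convolve a sequence of RGB tuples with a 1-D kernel, clamping at the edges."""
    half, n = len(kernel) // 2, len(pixels)
    out = []
    for i in range(n):
        acc = [0.0, 0.0, 0.0]
        for k, weight in enumerate(kernel):
            px = pixels[min(max(i + k - half, 0), n - 1)]
            for ch in range(3):
                acc[ch] += weight * px[ch]
        out.append(tuple(acc))
    return out

def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: horizontal pass followed by vertical pass."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [[tuple(int(round(c)) for c in px) for px in row] for row in zip(*cols)]

def postprocess(image, rng, p_noise=1.0, p_blur=0.5, amount=0.02):
    """Apply the noise first and the blur second, each with its own probability."""
    if rng.random() < p_noise:
        image = pepper_and_salt(image, amount, rng)
    if rng.random() < p_blur:
        image = gaussian_blur(image)
    return image

gray = [[(100, 100, 100)] * 5 for _ in range(4)]
result = postprocess(gray, random.Random(1))
```

A production pipeline would more likely use array-based image libraries; the nested-list form is chosen only to keep the sketch self-contained.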

C. Training
Even though the method would work with any given CNN, we have chosen the YOLOv4 [8] architecture for this research for the following reasons.
1) It has the best tradeoff between speed and accuracy, which makes it a good fit for robotic applications. It also has a tiny version, allowing it to run in real time even on a microcomputer such as a Raspberry Pi or NVIDIA Jetson Nano.
2) Its training framework contains additional advanced data augmentation tools. For more information, we refer the reader to [8]. These tools help to introduce further perturbation to the system.
For the training, a model pretrained on ImageNet [4] is used. The most relevant hyperparameters for the advanced data augmentation tools are shown in Table V, keeping the original names of the parameters.
In Section VIII-F, we present the results of our method when only the object detection model is changed from YOLOv4 to Faster R-CNN.
This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.

V. EVALUATION
In this section, we outline the metrics used to evaluate our models. First, we define how we measured the reality gap, then we introduce our altered version of the confusion matrix, and last, we outline some further details of our evaluation process.
To evaluate the solution of the classical machine learning problem (training and evaluation on the same domain), real-world images would not be needed. In this case, the performance is assessed on the generated validation dataset that is not used for training but comes from the same distribution, $P_{\mathrm{train}}(X) = P_{\mathrm{valid}}(X)$. The solution can be assessed by the value of the mAP score of the model on the validation dataset, and the capability of generalization (within the specific domain) can be measured by comparing the performance of the model on the training and the validation datasets, as in the following:
$$\text{generalization gap} = \mathrm{mAP}_{\mathrm{train}} - \mathrm{mAP}_{\mathrm{valid}}. \tag{1}$$
To evaluate the performance of the knowledge transfer, a manually annotated test set of real images is needed. In this case, $P_{\mathrm{valid}}(X) \neq P_{\mathrm{test}}(X)$. We expect that the given model performs notably worse on the test set than on the validation and training sets. To measure the magnitude of the reality gap, we can define it as the difference of the performance of the model on the validation and test sets, as shown in the following:
$$\text{reality gap} = \mathrm{mAP}_{\mathrm{valid}} - \mathrm{mAP}_{\mathrm{test}}. \tag{2}$$
Furthermore, we adapt the confusion matrix measure from the field of image classification to object detection and use it as an additional performance measure. The adaptation works as follows.
1) An extra row and an extra column are added to the classical confusion matrix. Thus, there are c + 1 rows and columns, where c is the number of classes. The additional column represents the objects that are not predicted to belong to any of the classes but actually belong to one class. On the other hand, the additional row of the matrix represents the cases when the model predicted an object of a class in a position where there should not be any object.
2) The values of the diagonal are the correct predictions. For simplicity, the last element of the diagonal is zero. This element would contain the number of objects that are not in the images and that the model rightfully did not find, which does not have any meaning.
3) As more than one prediction can belong to one ground-truth object, a given ground-truth object appears in the matrix as many times as there are predictions paired with it. Therefore, contrary to the traditional confusion matrix, the sum of all elements in the matrix will not necessarily be equal to the sum of all ground-truth objects or predictions.
The adapted confusion matrix proved useful for detecting and quantifying misclassifications, which turned out to be the primary cause of performance loss in the case of similar classes. Examples are shown in Section VII in Figs. 14 and 20.
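The adapted confusion matrix can be sketched as follows. The pairing rule is an assumption, since it is not spelled out in the text: each prediction is paired with the ground-truth box of highest IoU, provided the IoU reaches 0.5, and predictions are assumed to be already filtered by the confidence threshold. Rows index the actual class, columns the predicted class, and the last index stands for "no object". The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_confusion_matrix(gts, preds, num_classes, iou_thr=0.5):
    """Build the (c + 1) x (c + 1) matrix from lists of (class_id, box) pairs."""
    none = num_classes  # index of the extra row and column
    matrix = [[0] * (num_classes + 1) for _ in range(num_classes + 1)]
    matched = set()
    for p_cls, p_box in preds:
        best_iou, best_idx = 0.0, None
        for idx, (_, g_box) in enumerate(gts):
            value = iou(p_box, g_box)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= iou_thr:
            # A ground-truth object is counted once per paired prediction.
            matrix[gts[best_idx][0]][p_cls] += 1
            matched.add(best_idx)
        else:
            matrix[none][p_cls] += 1   # extra row: detection where nothing exists
    for idx, (g_cls, _) in enumerate(gts):
        if idx not in matched:
            matrix[g_cls][none] += 1   # extra column: missed object
    return matrix

gts = [(0, (0, 0, 10, 10)), (1, (20, 20, 30, 30)), (0, (50, 50, 60, 60))]
preds = [(0, (0, 0, 10, 10)), (0, (1, 1, 10, 10)),
         (0, (20, 20, 30, 30)), (1, (80, 80, 90, 90))]
cm = detection_confusion_matrix(gts, preds, num_classes=2)
```

In this toy case the entries sum to 5, although there are only 3 ground-truth objects and 4 predictions, which illustrates item 3) above.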
As presented in detail in Section VII, several training runs were carried out to test our method. For every dataset generated, three independent training sessions were conducted, resulting in three different models (sets of weights), in order to measure the variability of the training process. The average performance of the models refers to the arithmetic mean of the results of these three models. We also use the $F_1$ score measure, which is defined as the harmonic mean of the precision and the recall values.
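For reference, with precision $P$ and recall $R$ computed from the numbers of true positives (TP), false positives (FP), and false negatives (FN) at a given confidence threshold, the standard definition reads:

$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F_1 = \frac{2 P R}{P + R}.$$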

VI. DATASET
This section presents the dataset created for the validation elaborated in detail in Section VII. Ten industrial parts were selected for the dataset. Object diversity as well as object similarity were the two major points of consideration. The former helps us to evaluate the detection performance of the model for various types of objects, whereas the latter is important in assessing the classification performance of the model. In general, it is easier to misclassify objects with similar features. Thus, this problem can be considered to be more challenging than the detection of less complex and fairly different shapes such as cubes and spheres. The selected objects are depicted in Fig. 5, and their virtual counterparts in Fig. 6. These images are samples of $X \in \mathcal{X}$ obtained from two different $P(X)$ probability distributions. The virtual images are from the probability distribution $P_S$ of $D_S = \{\mathcal{X}, P_S(X)\}$, while the real images are from the probability distribution $P_T$ of $D_T = \{\mathcal{X}, P_T(X)\}$.
As can be seen, on the one hand, objects of different sizes, shapes, colors, and materials were selected to increase diversity. On the other hand, some objects share similar characteristics, such as circular holes. Furthermore, two parts, the bonnet (#7) and the body (#8), were chosen because of their high level of similarity, as shown in Fig. 7.
To construct the dataset, 190 real images containing 920 object instances were taken with different layouts and illumination settings. The images were captured with an Intel RealSense D435 camera. For easy and fast image capturing, a frame was designed that holds the camera 310 mm from the ground. A 300×210×10 mm light blue wooden base supports the frame; this is also where the parts were placed. The images show not only this base area but the background (tabletop) as well; this is done on purpose. In order to have a slightly different dataset as a reference, we also transformed the aforementioned dataset by cropping the images to fit the wooden base. The cropped area is marked with dashed green lines in Fig. 5. Some examples of cropped images are shown in Figs. 15 and 21.
The annotations for the test dataset were labeled manually and saved in the YOLO annotation format. As they contain all the necessary bounding box and class information, other annotation formats can be generated from them. We emphasize that these images of real objects were not used at any point for training the models, except in the case of one-shot knowledge transfer. In this case, only one real image was used. The experiment of one-shot transfer is presented in Section VII-B.
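As an illustration of how other formats can be derived, the sketch below converts one YOLO annotation line (class index followed by the normalized box center, width, and height) into absolute pixel corner coordinates. The function name and the example values are hypothetical.

```python
def yolo_to_corners(line, img_w, img_h):
    """Convert 'class x_center y_center width height' (normalized) to pixels.

    Returns (class_id, (x_min, y_min, x_max, y_max)).
    """
    cls, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(cls), (xc - w / 2.0, yc - h / 2.0, xc + w / 2.0, yc + h / 2.0)

class_id, corners_px = yolo_to_corners("7 0.5 0.5 0.25 0.5", 640, 480)
```

From the corner form, formats that store the top-left corner plus width and height follow by simple subtraction.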
For all images, the matching depth images are recorded as well. The depth images are transformed in a way that each pixel point of the RGB image can be associated with the same pixel point of the depth image (the transformation is necessary as the fields-of-view of the cameras for RGB and for depth images are different). Thus, all the annotations for the RGB images are the same for the depth images. Even though the depth images were not used in the current research, this additional data can be valuable for later use or for other researchers.
The test dataset is summarized in Table VI. In Group A, every image contains only one object (except one image without any objects). In Group B, every image contains multiple objects, but no class is represented more than once. In Group C, spotlight illumination is applied from one side to test the robustness of the models to illumination settings. In Group D, cluttered scenes are recorded. In Group E, distractor objects (cubes, cylinders, triangular prisms, and a 3-D-printed elephant) are placed in the scene. Finally, in Group F, only one class is present in every picture; however, unlike in Group A, there are several instances of this class in every image.
The group-wise class distributions are depicted in Fig. 8. As can be seen, the classes are relatively equally distributed in the groups. Even though the mAP metric averages over the classes and is therefore less influenced by class imbalance, it is advantageous to create an equally distributed test dataset. For training, which can be sensitive to class imbalance, the synthetic data is generated with a random selection of objects, thus eliminating any notable class imbalance. The dataset can be downloaded from the project repository:$^{4}$

VII. RESULTS
In this section, we show the strengths of our sim2real object detection method, as described in Section IV, by applying it to the problem presented in Section VI.

A. Zero-Shot Transfer (ZST)
The best-performing zero-shot transfer model (ZST_BEST$_1$)$^{5}$ achieved 86.32% $\mathrm{mAP}_{50}$ on the cropped test dataset. For data generation, a 2×2 grid with fixed z positions and a placement disturbance of ±10% of the grid spacing was set in the horizontal directions. The simulation of gravity was enabled and the objects (including distractors and empty places) were selected with equal chance. The objects had random textures with a probability of 0.8 and a random color with a probability of 0.2. The camera target position was set to the center of the grid with a fixed 45° field-of-view. The pitch of the camera was randomized between −0.17 and 0.17 radians. The width and height of the images were chosen randomly, independently of each other, with values between 640 and 1300 pixels. For postprocessing, multicolor pepper-and-salt noise and Gaussian blur were used with probabilities of 1.0 and 0.5, respectively.$^{6}$ With these parameters, 4000 images were generated for the training dataset, and 200 for the validation dataset. The performance of the model on the training set was measured only on the first 200 images of the training set. Two examples of the synthetic images are shown in Fig. 9.
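The randomized quantities listed above can be gathered into a small sampling sketch. The function and key names are hypothetical and do not correspond to the parameter names of the data generator; the numeric ranges are those reported in the paragraph above, and treating the image size bounds as inclusive integers is an assumption.

```python
import random

def sample_scene_parameters(rng, grid=(2, 2), spacing=1.0):
    """Draw one set of randomized generation parameters for a single image."""
    params = {
        "image_width": rng.randint(640, 1300),    # sampled independently
        "image_height": rng.randint(640, 1300),
        "camera_pitch_rad": rng.uniform(-0.17, 0.17),
        "camera_fov_deg": 45.0,                   # fixed field of view
        "cells": [],
    }
    for row in range(grid[0]):
        for col in range(grid[1]):
            params["cells"].append({
                # Nominal grid position disturbed by up to 10% of the spacing.
                "x": (col + rng.uniform(-0.1, 0.1)) * spacing,
                "y": (row + rng.uniform(-0.1, 0.1)) * spacing,
                # Random texture with probability 0.8, random color otherwise.
                "appearance": "texture" if rng.random() < 0.8 else "color",
            })
    return params

scene = sample_scene_parameters(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the data-generation seed separate from the training seed, which matters for the seed study in Section VIII-A.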
The precision-recall curves of the three ZST_BEST models (from the three training sessions) are shown in Fig. 10. As both the training and validation mAPs are close to 100%, it can be stated that the solution of the classical machine learning problem is satisfactory. Furthermore, observing the sim2real transfer, it can be seen that the models not only have a relatively good performance on the test data, but also have little variance. Moreover, the models perform relatively similarly on the original and on the cropped images, which shows the robustness of the method. The $F_1$ score is depicted in Fig. 11. It shows that while the performance of the model on the cropped images is not affected by the threshold, the models work better with higher threshold values on the original images.
The performance of the models on the different groups of the test dataset (described in Table VI) is presented in Table VII. The performance is relatively stable across the different scene types. Group D, containing the most crowded images, performs just slightly worse than the others. In the case of the original images, Group A has relatively low performance. This is due to the fact that the model occasionally falsely identifies the brackets of the camera holder frame (at the two sides of the camera holder frame, which is displayed in Fig. 5, marked with the letter "F") as L-brackets; this is not surprising as these parts look similar. As Group A has the lowest number of objects, this phenomenon has the most impact on the results in this case. The cropped images, as shown in Fig. 5, do not contain this part of the image.
Furthermore, the performance of the models for the different classes is worth investigating. The data are presented in Table VIII and the average results are shown in Fig. 12. Looking at the dataset of cropped images (green), it can be seen that 6 out of 10 classes perform above 92%, one class is relatively close to them with 87.94% $\mathrm{AP}_{50}$, two classes have worse results with 69.54% and 67.18% $\mathrm{AP}_{50}$, and one class, the bonnet, has significantly worse performance with 30.72% $\mathrm{AP}_{50}$. In contrast, the performance on the validation dataset is close to 100% for all classes. The findings indicate that the performance loss is not caused by an unsuccessful solution of the classical machine learning problem, but by the existence of the reality gap. As most of the classes have relatively good APs, the bad classes are outweighed by them in the mAP calculation.
In order to investigate the aforementioned problem, the classwise precision-recall graph of ZST_BEST$_1$ is shown in Fig. 13. As can be seen, the bonnet, the L-bracket, and the seat are the worst classes, consistent with Fig. 12.
The proposed confusion matrix depicted in Fig. 14 (described in detail in Section V) is essential in finding the root causes of the weaknesses of the models. In most cases, the objects are detected and classified to the correct class (diagonal). However, several instances of L-bracket and seat are not detected (36 and 49 examples) and many bonnets are classified as body objects (62 instances). The problem with these two objects is not surprising considering the high level of similarity of the two objects, as shown in Fig. 7. In general, this representation of the results not only confirms the aforementioned assumptions of class performances but also shows the underlying reason behind the lack of performance in their cases.
Finally, having presented the quantitative evaluation, two examples are given for qualitative evaluation as well. Fig. 15 shows an accurate and an inaccurate example, both with the

B. One-Shot Transfer (OST)
Even though the best zero-shot transfer achieved 86.32% $\mathrm{mAP}_{50}$, it had some difficulties with four classes. With one-shot transfer, we could overcome these difficulties. The hyperparameters of data generation remained the same as described in the previous zero-shot transfer example. The difference between the two approaches lies in the data used to train the model, which is shown in Table IX. The OST_BEST$_3$ model achieved 97.38% $\mathrm{mAP}_{50}$ on the cropped images, while the OST_BEST$_1$ model had 97.04% $\mathrm{mAP}_{50}$ on the original images. These results are significantly better than the results with zero-shot transfer. The precision-recall curves are shown in Fig. 16, and the $F_1$ scores are shown in Fig. 17. The mAP scores are close to optimal and there is only an insignificant deviation between the different training sessions. The $F_1$ score is also relatively high and flat in all cases, indicating that the models are not sensitive to different thresholds.
The performance on the different types of test images is presented in Table X. In general, the models work well, above 94% $\mathrm{mAP}_{50}$ in each case. The crowded scenes (Group D) have slightly worse performance on average, but the difference is not significant.
Furthermore, the mAP scores of the different classes are presented in Table XI and shown in Fig. 18. All classes perform well, the worst-performing class with the original images being the L-bracket with 92.62% $\mathrm{mAP}_{50}$. The precision-recall curves of OST_BEST$_3$ for the different classes are depicted in Fig. 19. Compared to the zero-shot approach, the curves are shifted toward the top-right corner, which demonstrates better performance. The proposed confusion matrix of OST_BEST$_3$ with the threshold set to 0.8 is shown in Fig. 20. Almost all the values
For qualitative evaluation, Fig. 21 shows an accurate prediction and an inaccurate solution. In the latter case, the distractor objects resembling a body object in their main characteristics could mislead the model, implying that the model learned an overly general representation of the object. As the quantitative results show, the vast majority of examples are accurate.

VIII. ABLATION STUDY
In this section, the ablation study of the method is presented, focusing on the different elements of the domain randomization methods, the training data size, and the object detection model.

A. Seed
In general, the initial random seed of stochastic algorithms can significantly influence their performance. This phenomenon is unpleasant as it makes the algorithms unpredictable. We aim to measure the influence of the seed of our domain randomization method in the case of the ZST_BEST models. It is important to note that we do not use the same random seed for the training and for the domain randomization method. Table XII shows different seeds (with two equal seeds for reference), with three independent training sessions each. In these experiments, we showed that the magnitude of the deviation of the results due to the stochastic training process of the neural network and due to the different seeds of the randomized data generation methods are comparable. Thus, our randomized data generation method is not less robust to a given seed than the stochastic training method itself.

B. Texture and Postprocessing
Two important factors in our domain randomization method are the random textures of the objects and the postprocessing method. We have generated datasets without these factors. The results are summarized in Table XIII and the results on the original images are shown in Fig. 22. Both the added texture and the postprocessing methods contribute significantly to the performance. Without the added texture, the performance drops to 63.50% and 74.71% $\mathrm{mAP}_{50}$ in the case of the original and the cropped images, respectively. Without postprocessing, the performance is only 55.87% and 60.42% $\mathrm{mAP}_{50}$, respectively. Finally, without either of the two methods, the performance decreases drastically to 10.83% and 13.81% $\mathrm{mAP}_{50}$. These experiments show how essential these types of domain randomization methods are. As the average performance of the model on the validation dataset is 99.84% $\mathrm{mAP}_{50}$, according to (2), the reality gap shrinks, in the case of the original images, on average from 89.01% (-TPP) to 16.68% $\mathrm{mAP}_{50}$ (BEST).
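As a worked check of these figures, taking the reality gap as the validation score minus the test score and using the reported average validation score of 99.84%, the -TPP case on the original images gives

$$99.84\% - 10.83\% = 89.01\%,$$

which matches the value stated above. By the same relation, the reported gap of 16.68% for the BEST configuration would correspond to an average test score of about $99.84\% - 16.68\% = 83.16\%$ $\mathrm{mAP}_{50}$ on the original images; this last figure is derived here and is not quoted from the tables.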

C. Data Size
The size of the training dataset is a key attribute of any machine learning problem. In general, the more data are used in the training, the better its distribution will match the real probability distribution. Nevertheless, this phenomenon does not necessarily apply to the case of knowledge transfer. The results of the performance of the ZST_BEST models for different training data sizes are presented in Table XIV. It is important to note that for the case of 8000 images, the training time was doubled from 5000 to 10 000 iterations. Even though increasing the training data size from 1000 to 4000 allows the model to gain notable performance, doubling the data size from 4000 to 8000 only causes marginal improvement.

D. Gravity, Positional Disturbance, and Bounding-Box Calculation
In this part of the ablation study, the effect of simulated gravity, the effect of random disturbance around the grid positions, and the effect of replacing the all-point bounding box calculation with the 8-point bounding box calculation (described in Section IV-B4) are measured. The findings are summarized in Table XV. All of the aforementioned factors have a relevant effect on the performance. In the case of cropped images, on average, gravity brings 11.01% $\mathrm{mAP}_{50}$, the randomness of object positions contributes 14.22% $\mathrm{mAP}_{50}$, and the tight all-point bounding box calculation method is responsible for

E. Cutouts
Some additional experiments, presented in Table XVI, were conducted with different types of cutouts at the postprocessing phase of data generation. In these experiments, four types of cutouts were considered: rectangles, partly transparent rectangles, circles, and lines. The number of cutouts and the bounds of the randomized sizes of the cutouts varied over the experiments. The results show that none of the cases could achieve a better performance than the ZST_BEST model, which does not have this type of domain randomization. Nevertheless, a more thorough evaluation of the effect of different cutouts can be subject to further research.

F. Faster R-CNN
To test the performance of the data generation process in the case of a two-stage object detection model, we trained the R101-FPN version of the Faster R-CNN [6] model using the Detectron2 [66] framework and PyTorch [67]. This model uses the ResNet-101 [61] model with the feature pyramid network [68] backbone. The results are shown in Table XVII. Even though the performance of the model falls behind YOLOv4, it could
The training of the Faster R-CNN model was approximately four times faster than in the case of YOLOv4. On the other hand, inference was around ten times slower, at 2 FPS. In conclusion, YOLOv4 outperformed the Faster R-CNN approach in performance and in inference time, which are the two most relevant factors.

IX. ROBOTIC APPLICATION
Object detection can be utilized in many ways. Examples are robotic grasping or pick-and-place applications where the robot needs to detect different workpieces and grasp them or move them to specific locations.
In this section, a real-world robotic implementation of our method is presented. The application can serve as a proof of concept built upon our previous work [69], where we proposed a 5C model-based [70] system architecture for visual-servo-guided cyber-physical robotic assembly cells. Relying on the object detection model, the parameters of grasping (micro plan) can be computed.
The robotic system consists of a six-DoF collaborative robot arm equipped with a digital depth camera, a force sensor, and a two-finger gripper. The task of the robot is to detect scattered workpieces (center points and bounding box information), as well as to predict grasping poses. The sensors and actuators of the robot and the sim2real computer vision module are connected in a robot control framework based on ROS [71]. The setup is depicted in Fig. 23, while the software components of the robot control framework are shown in Fig. 24.
For robotic applications, every component of the system must work reliably and in real time. In order to evaluate the sim2real computer vision module in a new case, three new industrial parts were used, as depicted in Fig. 23. The data generation and training process went without any problems. Thus, within 13 h, the new model was ready to use. As a qualitative evaluation, the robot was programmed to follow a path over the workpieces while streaming the camera data. On a GeForce RTX 3060 GPU, our computer vision model ran at 20 FPS and consistently localized and classified the objects correctly, with more than 98% confidence most of the time, even in significantly different illumination settings and in the presence of distractor objects.
For grasping the workpieces, the grasping pose needs to be estimated and transformed to the robot coordinate system. For estimating the orientation, we used principal component analysis on the detected and cropped bounding boxes of the objects, and a standard camera calibration method was applied for the transformation.
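The orientation estimate can be illustrated with a closed-form, two-dimensional PCA. Which pixels of the cropped bounding box are fed to the analysis is not specified in the text; the sketch assumes a list of foreground pixel coordinates obtained from some segmentation of the crop, and the function name is hypothetical.

```python
import math

def principal_axis_angle(points):
    """Angle (radians) of the first principal axis of 2-D points.

    Uses the closed-form eigenvector direction of the 2x2 covariance matrix;
    the result lies in (-pi/2, pi/2].
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

diagonal = principal_axis_angle([(0, 0), (1, 1), (2, 2), (3, 3)])
horizontal = principal_axis_angle([(0, 0), (1, 0), (2, 0), (3, 0)])
```

The angle is only defined up to 180°, since a principal axis has no preferred direction; for a two-finger gripper this ambiguity is usually harmless.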
This use case was presented in an exhibition,$^{7}$ and the implementation of the ROS-based robot control system is available at:$^{8}$

X. CONCLUSION
The article presented our sim2real domain randomization method for object detection. As our general aim is to facilitate the trend of new-generation intelligent manufacturing with adaptive robotic applications, our solution needed to be capable of differentiating similar classes using only one real example for training and of working in real time. To the best of our knowledge, this is the first work thus far that satisfies all of these constraints in this domain.
As recent works on transfer learning did not concentrate on validation for objects similar to each other, we addressed this problem by creating a dataset with 190 real annotated images of 920 objects of ten classes of industrial workpieces. The dataset is publicly available and can serve as a benchmark for industrial object detection models.
We introduced a novel type of confusion matrix tailored to object detection. It proved useful for finding the root cause of performance loss.
The results presented in the article validated the strengths of our approach. We achieved 86.32% $\mathrm{mAP}_{50}$ in the case of zero-shot transfer, while with one-shot transfer, the best model scored 97.38% $\mathrm{mAP}_{50}$ on the test set. With these experiments, we also demonstrated how to diminish the performance loss caused by similar classes by introducing only one image from the target domain. Even though it is hard to compare solutions to different problems, we believe that these results are better than the ones in the literature considering the complexity of the problem and the size of the synthetic and real training datasets.
In a thorough ablation study, we showed that adding random texture and our postprocessing domain randomization methods were crucial parts of the process.We also found that simulating gravity, random initial placement, and the all-point bounding box calculating method contributed significantly to the performance.
As a proof of concept, we showed that our model works reliably and in real time in a robotic pick-and-place application.
Both the sim2real data generation and training module, and the robot control framework could be used as a freely available, out-of-the-box solution to industrial problems.
In the future, we aim to further improve the performance of the zero-shot learning method.We are also interested in working with point cloud data, exploring the field of adversarial training, and extending our method to be able to predict end-to-end grasping poses as well.

Fig. 1. Different phases of knowledge transfer. The picture of the Boston bull on the bottom left corner of the figure is from ImageNet [4].

Fig. 2. Flowchart diagram of our data generation, training, and evaluation process. The blue and orange boxes depict the data generating and data gathering steps. The purple boxes represent the steps of training and evaluation.

The knowledge transfer goes from $\{D_G, T_G\}$, where $D_G$ is the domain of the dataset of general public images and $T_G$ is classification, to $\{D_S, T_S\}$, where $D_S$ is the domain of synthetic images (source domain of the sim2real knowledge transfer), and $T_S$ is object detection. Here, $D_G \neq D_S$, and $T_G \neq T_S$. The second phase of knowledge transfer is the sim2real transfer, which goes from $\{D_S, T_S\}$ to $\{D_T, T_T\}$, where $D_T$ is the domain of our industrial environment (target domain), and $T_T$ is object detection. Here, $D_S \neq D_T$ but $T_S = T_T$. Although the pretrained network does possess some learned knowledge from the domain of a given general public dataset ($D_G$), it does not have direct knowledge of the target objects. Consequently, $D_G \neq D_S \neq D_T$. Even though $D_G \neq D_T$, the characteristics of the domains are similar. Thus, in Figs. 1 and 2, the domains of general public images and the task-specific real images are marked with different shades of orange.

$^{3}$ [Online]. Available: https://git.sztaki.hu/emi/sim2real-object-detection

Fig. 7. Similarity of the body and the bonnet objects.

Fig. 9. Two examples of synthetic images with the automatically generated annotations. The bounding boxes are shown here for illustration purposes only. (a) Example 1. (b) Example 2.

Fig. 10. Precision-recall curves of the ZST_BEST models. The train and valid scores are overlapping and relatively close to the perfect 1.0 values.

Fig. 12. Average $\mathrm{mAP}_{50}$ scores of the ZST_BEST models in the different classes.

Fig. 13. Class precision-recall curves of the ZST_BEST$_1$ model on the cropped images.

Fig. 14. Confusion matrix of the ZST_BEST$_1$ model on the cropped images with the threshold set to 0.8.

Fig. 16. Precision-recall curves of the OST_BEST models. The training and validation scores are overlapping and relatively close to the perfect 1.0

Fig. 20. Confusion matrix of the OST_BEST$_3$ model on the cropped images with the threshold set to 0.8.

Fig. 24. Information flow in the robotic application. The computer vision module, which is the main topic of this article, is highlighted in orange. In this application, the force sensor was not used (marked with the dashed line). The following software resources were used: [72], [73], [74], [75], [76].

TABLE IV TYPES OF NOISES IN POSTPROCESSING

TABLE V MOST RELEVANT ADVANCED DATA AUGMENTATION TOOLS IN THE TRAINING PROCESS

TABLE VI SUMMARY OF THE TEST DATASET

Fig. 8. Class distributions of the test dataset.

TABLE VII $\mathrm{mAP}_{50}$ SCORES OF THE ZST_BEST MODELS IN THE DIFFERENT TEST GROUPS

TABLE VIII $\mathrm{mAP}_{50}$ SCORES OF THE ZST_BEST MODELS FOR THE DIFFERENT CLASSES

TABLE X $\mathrm{mAP}_{50}$ SCORES OF THE OST_BEST MODELS IN THE DIFFERENT TEST GROUPS

TABLE XI $\mathrm{mAP}_{50}$ SCORES OF OST_BEST MODELS FOR THE DIFFERENT CLASSES

TABLE XII $\mathrm{mAP}_{50}$ SCORES OF ZST_BEST MODELS WITH DIFFERENT SEEDS

TABLE XIII $\mathrm{mAP}_{50}$ SCORES OF DIFFERENT ZST MODELS. -T: WITHOUT TEXTURE, -PP: WITHOUT POSTPROCESSING, -TPP: WITHOUT POSTPROCESSING AND TEXTURE

TABLE XIV $\mathrm{mAP}_{50}$ SCORES OF ZST_BEST MODELS WITH DIFFERENT DATA SIZES

TABLE XV $\mathrm{mAP}_{50}$ SCORES OF ZST MODELS WITHOUT DIFFERENT FACTORS. -G: NO GRAVITY, -R: NO RANDOMNESS IN GRID POSITIONS AND NO GRAVITY, 8P: 8-POINT BOUNDING BOX CALCULATION

TABLE XVI $\mathrm{mAP}_{50}$ SCORES OF ZST MODELS WITH DIFFERENT CUTOUTS AT THE POSTPROCESSING METHODS

a 41.76% $\mathrm{mAP}_{50}$ performance gain. In the case of bounding box calculation, the performance drops with the less tight BBs, for two likely reasons. First, the ground-truth BBs are tight; thus, computing the $\mathrm{IoU}_{50}$ with less tight BBs may result in many discarded matches. Second, in the crowded images, the BBs are too extensive; thus, they could significantly overlap each other, which may confuse the model.

TABLE XVII $\mathrm{mAP}_{50}$ SCORES OF R101-FPN FASTER R-CNN MODEL

be increased with a more exhaustive hyperparameter search. Moreover, the Darknet framework uses extra data augmentation for training, which we did not reproduce for the Detectron2 framework. It is important to note that here, too, considerable performance improvement is achieved by having one real image for training.