Analysis of Depth and Semantic Mask for Perceiving a Physical Environment Using Virtual Samples Generated by a GAN

Micro aerial vehicles (MAVs) can explore 3D environments using perception technologies, such as Simultaneous Localization and Mapping (SLAM), that map the environment and estimate the location of objects that could cause collisions. Nevertheless, the agent needs to move while mapping the environment, which reduces the flying time available for additional activities. Adding more devices (sensors) to an MAV also implies more power consumption: more energy is required to perform tasks, and the resulting growth in MAV dimensions limits the flying time. In contrast, Generative Adversarial Networks (GANs) have proven useful for translating images from one domain to another, but GAN domain changes require a large number of samples. Therefore, an interoperability coefficient is employed to determine the minimum number of samples needed to connect the different domains. To validate the coefficient, the performance of depth and semantic mask estimation between authentic and virtual samples is analyzed with a limited number of samples. Consequently, an RGB-D sensor can be replaced by a GAN-based model built from a few samples of a real scenario. Although GANs allow creating images with depth and semantic mask information, there is an additional problem to be tackled: the presence of intrinsic noise, for which a simple GAN architecture is not enough. In this proposal, the performance of this solution is compared against a physical RGB-D sensor (Microsoft Kinect V1) and other state-of-the-art approaches. Experimental results allow us to affirm that this proposal is a viable option to replace a physical RGB-D sensor with limited information.


I. INTRODUCTION
Robotics is a research area whose fundamental challenges have been obstacle detection and collision avoidance. Therefore, one of the major topics to be investigated in this field is environment perception [1]. A common way to perceive is through sensory devices that translate stimuli received from the environment into interpretable data [2], where cameras and Inertial Measurement Units (IMUs) are used as a complement to Global Navigation Satellite Systems (GNSS) in outdoor environments [3]. Typically, an IMU comprises two main modules: an accelerometer and a gyroscope. When combined into a single device, they can create an array of information. However, the incorporation of arrays of sensors and embedded devices in unmanned aerial vehicles (UAVs) is limited by the power consumption of all devices and by vehicle dimensions. An emerging area of mechatronic design is applying artificial intelligence techniques to optimize the design of UAVs, considering both aspects [4]. Visual perception allows knowing the features of the environment [5], [6], while spatial context adds intelligence to autonomous systems to interpret the environment in order to develop a path planning solution [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.
In mobile robotics, there are three challenges to face when navigating in closed areas, especially when considering three dimensions (3D navigation) [8], [9]. The first is collision detection and obstacle avoidance, the second consists of dynamic route planning (finding an alternative route when an obstacle is detected), and the third is the energy consumption of the entire system [10]-[12]. According to [13], the path planning problem can be described with two approaches: the first, called direct form, considers the robotic system as a point of reference, while the second, called indirect form, focuses on the navigation environment. In the first approach, navigation environment analysis is about the movement generated by the system, and we examine it from a first-person perspective.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
On the other hand, in indirect form, the robotic system is considered a particle; therefore, we analyze the environment from a third-person perspective.
Recently, technologies based on application-specific devices have been developed to facilitate indoor navigation. RGB-D sensors are used together with RGB data for object tracking [14], [15], as well as to improve object classification accuracy [16]. Inertial Measurement Units (IMUs) are typically used to improve localization and avoid collisions [17], [18], as well as to map the environment based on vehicle movement [19], [20]. Other alternatives, such as Lidar sensors [21], [22] and ultrasonic sensors [23], [24], can perform the same tasks properly.
Modern exploration systems need to be efficiently designed and equipped with advanced perception systems to complete the task successfully. This implies a compromise between energy consumption and exploration autonomy. One of the main limitations MAVs face when exploring 3D spaces is the duration of the batteries available in the vehicle. There is a direct relationship between the number of devices used in the environment perception system and their energy consumption: the more devices in the system, the greater the energy consumption. There is also an inverse relationship between energy consumption and flight autonomy: the higher the energy consumption, the shorter the battery life, and therefore the shorter the exploration time.
Hence, it is important to optimize the resources available to the MAV. Most of these vehicles already have a built-in camera, so this resource can be used as a perception system to estimate the depth and semantic mask of authentic images without adding devices. This paper proposes a double-GAN-based architecture with noise reduction to estimate the depth and semantic mask of authentic images using information generated by a virtual environment representation dataset with limited samples. This approach can effectively represent an RGB-D sensor using a few samples of a real scenario.
The remainder of the paper is organized as follows. Section II reviews and discusses previous works related to external environment perception systems. Section III describes the GAN architecture. Section IV introduces the interoperability coefficient used to determine the limited number of samples, as well as the architecture with noise reduction used to estimate the depth and semantic mask of real images. The experimental settings are described in Section V. The analysis of results is presented in Section VI. Finally, some conclusions are presented in Section VII.

II. RELATED WORKS
It is well known that Machine Learning (ML) improves perception analysis in the robotics field [25]. However, there is little evidence of RGB-D sensors being developed from a few samples of a real scenario using ML techniques. During the last decade, perceiving the environment through RGB-D sensors has regained interest. Physical sensors such as Microsoft Kinect V1 [26] have been widely used to develop multiple solutions in the field of perception, providing autonomy to ground vehicles [27], [28]. This RGB-D sensor offers adequate performance to perceive the environment [29] and detect obstacles [30]-[32], improving navigation [33]. Nonetheless, Microsoft Kinect V1 has specific features that limit its use in small spaces. Its minimum perception range is 40 cm, so objects must be beyond this distance to be perceived, with a coverage range of up to 4 meters. Microsoft Kinect V1 also has a coverage range configuration of up to 6 meters, with an 80 cm offset. Thus, this sensor is unsuitable for small scenarios [34].
Computer vision processing has been used to analyze and find features on input data [35], [36]. In the same way, ML algorithms have been widely used in classification [37], [38] and regression tasks [39], [40]. As a result of the intersection of both paradigms, another research area, called Generative modeling, has emerged. Generative modeling uses Generative Adversarial Networks (GANs) to generate realistic examples across various problem domains. This type of network automatically learns the regularities in input data so that the model can be used to generate new examples that plausibly could have been drawn from the original dataset.
GANs are based on a competition between two types of neural networks: a generative network and a discriminative network [41]. The first is responsible for generating data from a noisy source, while the discriminative network is in charge of extracting a set of known characteristics of examples to validate the generator model [42]. In other words, the generative network produces candidates and the discriminative network evaluates them.
Besides, GANs have also been used for image transformation to map data into a different domain [43] and to generate data to create an image with different machine learning approaches [44]-[46]. This topic is essential in our approach because we can obtain data from a conventional camera and process them to generate a depth image, a semantic mask image, or both using GANs. The depth data describe a point in 3D space, whereas the semantic mask represents the information that composes an object.
For this reason, in this work, it is proposed to generate a representation of an RGB-D sensor using information from samples created by a virtual environment to estimate the depth and semantic mask of authentic images to map a 3D environment.
In this way, the camera available on most MAVs is used to obtain an efficient depth sensor without adding devices that affect the aerodynamics and structure of the vehicle, its energy consumption, and consequently its flight autonomy.

III. GAN ARCHITECTURE
A GAN is an architecture that uses two neural networks to generate new synthetic data instances that could plausibly be considered real data. As shown in Figure 1, a simple GAN architecture is composed of an auto-encoder as the generator network and a deep convolutional network as the discriminator. The specific parameters are as follows: the kernel size is 3, the stride is 2 in the generator network and 1 in the discriminator network, ReLU is the activation function in both networks, and Adam is the optimizer. Figure 2 shows the architecture parameters for the generator and discriminator, respectively. To estimate the depth and semantic mask of authentic images using information generated by a virtual environment representation, the performance provided by a simple network is insufficient because each set of samples must share some in-domain and cross-domain features [47], [48]. For this reason, it is necessary to have an intermediary element between domains to minimize noise and share some cross-domain features.
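To make the effect of the stride choice concrete, the following minimal sketch traces the spatial size of a feature map through a stack of convolutions. Only the kernel size (3) and the strides (2 and 1) come from the text; the "same" padding, the number of layers, and the 256-pixel input are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math


def conv_output_size(size, kernel=3, stride=1, padding="same"):
    """Spatial output size of one convolution along a single axis."""
    if padding == "same":
        # "same" padding keeps ceil(size / stride) positions.
        return math.ceil(size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding mode: {padding}")


def trace_sizes(size, n_layers, stride, padding="same"):
    """Return the feature-map size before and after each layer."""
    sizes = [size]
    for _ in range(n_layers):
        size = conv_output_size(size, stride=stride, padding=padding)
        sizes.append(size)
    return sizes


# Assumed depth of 4 layers, for illustration only.
generator_sizes = trace_sizes(256, n_layers=4, stride=2)
discriminator_sizes = trace_sizes(256, n_layers=4, stride=1)
print("generator encoder:", generator_sizes)
print("discriminator:", discriminator_sizes)
```

Under these assumptions, each stride-2 layer halves the resolution, whereas the stride-1 layers preserve it, so any downsampling in the discriminator would have to come from another operation.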

A. GAN COST FUNCTIONS
The system requires an adequate implementation of a GAN that generates a representation of an RGB-D sensor using information from samples created by a virtual environment, in order to estimate the depth and semantic mask of authentic images and map a 3D environment. It is worth mentioning that GANs contain regularization terms that allow for adequate training. This set of rules is called the cost function. To obtain better performance, an optimization process needs to be incorporated. Specifically, this process involves the maximization of the discriminator cost function $D_{1,2}$, the minimization of the generator network cost function $G_{1,2}$, and the minimization of the noise source cost function $Z_{1,2}$. The cost functions of a GAN are derived from the calculus of entropy [49]. The principal characteristics of the GAN are given by Definitions 1, 2, and 3.
Definition 1: Let $n$ be the number of samples, let $D_{1,2}$ be the cost function of the discriminator network, let $G_{1,2}$ be the cost function of the generator network, and let $Z_{1,2}$ be the noise source. Maximization of the cost function of the discriminator network is obtained according to the following expression:
Definition 2: Let $n$ be the number of samples, let $D_{1,2}$ be the cost function of the discriminator network, let $G_{1,2}$ be the cost function of the generator network, and let $Z_{1,2}$ be the noise source. Minimization of the cost function of the generator network is obtained according to the following expression:
Definition 3: Let $D_{1,2}$ be the cost function of the discriminator network, let $G_{1,2}$ be the cost function of the generator network, and let $Z_{1,2}$ be the noise source. The full cost function of a simple GAN architecture is obtained according to the following expression:
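For orientation, the standard GAN objective from the original formulation by Goodfellow et al. can be written as below. This is a reference sketch only: $x_i$ (a real sample), $z_i$ (a noise sample), $D(\cdot)$, $G(\cdot)$, and $V(D, G)$ are the usual textbook notation and are assumptions here, so these lines are not necessarily the exact expressions of Definitions 1 to 3.

```latex
\begin{align}
\max_{D}\; & \frac{1}{n}\sum_{i=1}^{n}\Big[\log D(x_i) + \log\big(1 - D(G(z_i))\big)\Big] \\
\min_{G}\; & \frac{1}{n}\sum_{i=1}^{n}\log\big(1 - D(G(z_i))\big) \\
\min_{G}\max_{D}\; & V(D, G) = \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]
\end{align}
```

The first line is the discriminator's maximization, the second the generator's minimization, and the third combines both into the usual min-max game.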

IV. PROPOSED WORK
In this section, the generation of the dataset is described.
In addition, a coefficient is employed to determine a limited number of samples of a physical scenario. Finally, the proposed architecture is illustrated to generate a virtual sensor with limited samples.

A. 3D VIRTUAL ENVIRONMENT AND SAMPLES
Considering that access to the physical environment is limited, a 3D virtual environment is built based on a limited number of samples of the environment. The generated virtual samples have a resolution of 256 × 144, with an aspect ratio of 16:9. Thus, each pixel represents a centimeter at a distance of 1.5 m with a Field of View (FOV) of 82.6°, as shown in Figure 3. The AirSim framework generates depth images in a range of 100 m. For this research, the framework was modified to create depth images in ranges of 2 m and 5 m. For example, Figure 4 shows depth images in a range of 2 m, in which the size of each sample is 256 × 128 with three channels. Likewise, the dataset is composed of three domains: Figure 4a represents the domain of authentic samples, Figure 4b shows the virtual representation domain, and Figure 4c displays the depth and semantic mask domain.
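The statement that one pixel covers about a centimeter can be checked with a pinhole-camera calculation. The sketch below assumes that the 82.6 figure is a horizontal field of view in degrees and that the 256-pixel image width spans it; the function name is hypothetical.

```python
import math


def pixel_footprint_cm(distance_m, fov_deg, pixels):
    """Width in centimeters covered by one pixel at a given distance."""
    # Scene width seen by the camera at that distance (pinhole model).
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)
    return 100.0 * scene_width_m / pixels


footprint = pixel_footprint_cm(distance_m=1.5, fov_deg=82.6, pixels=256)
print(f"{footprint:.2f} cm per pixel")
```

Under these assumptions the footprint comes out close to 1 cm per pixel, which is consistent with the figure quoted in the text.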

B. SIMILARITY BETWEEN IMAGES
One of the tools for assessing the correlation between two images is the Histogram of Oriented Gradients (HOG) [51]. This algorithm allows comparing real samples with their virtual representation. The descriptor obtains a characteristic vector for each sample and computes a coefficient that indicates the similarity level. Its hyperparameters are: 8 orientations, 32 × 32 pixels per cell, and 4 × 4 cells per block. For example, Figure 5 shows a physical sample and its virtual representation with two different detail levels. The first variation has basic lighting, and the second has a larger number of directional lighting sources and materials that give more realism to the virtual environment. Table 1 shows correlation measurements between 30 physical world samples and their virtual representation with the two detail levels. The correlation coefficient of the more detailed samples (lights and materials) is higher than that of the basic lighting samples, since lights increase the detail level. However, the correlation coefficient between virtual samples created with video game engines and real examples is not high enough to claim that an adequate representation of the physical world is obtained.
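As an illustration of how two descriptor vectors can be reduced to a single similarity coefficient, the sketch below computes a Pearson correlation in plain Python. The text does not state which correlation measure is applied to the HOG vectors, so Pearson is an assumption, and the two 8-element vectors are made-up stand-ins for the 8 orientation bins of one cell, not measured data.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical descriptors for a real sample and its virtual counterpart.
real_descriptor = [0.12, 0.40, 0.31, 0.05, 0.22, 0.18, 0.09, 0.27]
virtual_descriptor = [0.10, 0.35, 0.36, 0.08, 0.15, 0.25, 0.11, 0.20]

similarity = pearson_correlation(real_descriptor, virtual_descriptor)
print(f"similarity: {similarity:.4f}")
```

A value near 1 indicates that the gradient-orientation distributions of the two images agree closely, while a value near 0 indicates little linear relationship.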

C. INTEROPERABILITY COEFFICIENT FOR CONNECTING VIRTUAL AND REAL ENVIRONMENTS
In [52], an interoperability coefficient was introduced, which determines a minimum number of physical samples needed to connect the virtual and physical domains using the GAN characteristics. This coefficient is composed of a correlation factor generated by HOG and the GAN's entropy, and it is expressed in (4).
In this proposal, the HOG correlation is 0.5490, and the interoperability coefficient is 0.5047 with 43 physical world samples, as shown in Table 2. For this reason, it is recommended to take the number of samples at which the interoperability coefficient exceeds 0.50. Thus, the details in the virtual representation are lower than in the authentic sample. We consider that the virtual representation must have enough information to allow deep learning to use textures. Furthermore, it is observed that if the virtual representation samples increase their detail in lighting and materials, then the interoperability coefficient should increase, and the number of samples can be smaller. When the joint entropy is low, the data dispersion is similar between the GAN architecture and the virtual representation samples. However, we are in a case where the authentic information is limited and the samples do not have enough detail. In this way, taking more samples is avoided.

D. ARCHITECTURE
Several methodologies have been proposed to solve navigation problems, but most of these approaches require physical world samples for path planning and MAV training. This implies that the samples used for training must come from the same domain. Therefore, the MAV training has to be carried out in an authentic environment. Besides, this type of training is time-consuming and incurs operating and maintenance costs for the MAV.
In this way, a double-GAN-based architecture with noise reduction is proposed to estimate authentic images with depth and semantic mask using virtual samples. The first GAN creates a virtual sample of the authentic environment. This architecture is an intermediary element between domains (virtual and real) used to minimize noise and share some cross-domain features. Figure 6 shows the parameters of the Double GAN architecture. An additional layer is added in generator networks because the output image size is 256×256 pixels.
Therefore, the samples compose three different domains, namely the physical world, the virtual environment, and a third domain consisting of GAN-generated images representing the semantic mask and depth information of physical world samples, as shown in Figure 7. Likewise, we propose a three-module architecture, as shown in Figure 8. The first module is an intermediary element between the real and virtual domains. This module converts an authentic image to a virtual representation through the GAN architecture. The second module generates the depth and semantic mask samples. Finally, Definition 4 describes the third module, which minimizes the noise.
Definition 4: Let $n$ be the number of samples, let $\hat{y}_i$ be a generated image, and let $y_i$ be an original image. The minimization cost function for noise reduction, defined by the MSE value as stated in Algorithm 1, is obtained according to the following expression:
$$\min \; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
This architecture provides data stability between the three domains and maintains some cross-domain features to represent the depth and semantic mask. Furthermore, with the proposed architecture it is possible to create different virtual environments to generate many virtual samples from limited information about an authentic environment.

V. EXPERIMENTAL PHASE
The proposed architecture, shown in Figure 8, was implemented on a g4dn.xlarge instance in Amazon Web Services (AWS) with the following specifications: 4 vCPUs (Xeon 8259CL, 2.5 GHz), 16 GB RAM, 125 GB SSD storage, and an NVIDIA Tesla T4 GPU with 320 Tensor Cores and 16 GB of memory. Algorithm 1 was implemented in TensorFlow 2.3. Training time was 3 hours and 10 minutes.
During the experimental phase, 43 samples were used for the first GAN architecture (Module 1); for the second module, 1000 samples were generated from the virtual environment; finally, the same 1000 virtual samples were used for the third module. These virtual samples were processed through a serial convolutional network to reduce data randomness and maintain better virtual sample uniformity.
The physical environment is composed of three different types of objects. Two of these objects are cardboard boxes of different colors and sizes, and the third is a ball. In this way, we have flat surfaces and curved surfaces to measure consistent data. The Microsoft Kinect V1 minimum perception range is 40 cm in the 4 m range configuration. Therefore, objects must be beyond 40 cm to be perceived, with a coverage range of up to 4 meters. The depth in virtual samples was estimated in two ranges: 2 meters and 5 meters. For the 2 m samples, each pixel is equal to 0.7843 cm, while for the 5 m samples it is 1.9607 cm.
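The two per-pixel values quoted above are consistent with an 8-bit depth encoding in which the maximum range maps to gray level 255. That reading is an inference from the numbers rather than something the text states, so the sketch below should be taken as an assumption; the function names are hypothetical.

```python
def depth_step_cm(max_range_m, levels=256):
    """Depth in centimeters represented by one gray level."""
    # Assumes depth is mapped linearly so that max_range_m is the top level.
    return 100.0 * max_range_m / (levels - 1)


def gray_to_depth_cm(gray, max_range_m, levels=256):
    """Convert a gray level back to a depth in centimeters."""
    if not 0 <= gray <= levels - 1:
        raise ValueError("gray level out of range")
    return gray * depth_step_cm(max_range_m, levels)


for max_range in (2, 5):
    print(f"{max_range} m range: {depth_step_cm(max_range):.4f} cm per level")
```

Under this assumption, the coarser step of the 5 m configuration means that each gray-level error costs about 2.5 times more depth than in the 2 m configuration.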
The stability of the complete cost function of both architectures (Simple GAN and Double GAN), after 1000 epochs, is shown in Figure 9. The graph shows that the GAN 1 network is less stable than the GAN 2 network, and that the GAN 2 network improves data output stability. The performance of each module of both architectures is shown in Figure 10. The graph shows that the $D_2$ and $G_2$ networks have more stable behavior than the $D_1$ and $G_1$ networks. Likewise, the noise reduction module performance is notable.

Algorithm 1 Algorithm for Estimating the Depth and Semantic Mask
Input: a set of 43 images from the physical domain.
Input: a set of 300 images from the virtual domain.
Output: a set of estimated depths, one for each sample.
Initialization:
1: Load dataset.
2: Apply data augmentation to increase sample variety.
3: Create Generative network model.
4: Create Discriminative network model.
5: Create Noise reduction model.
6: Define loss function for generative and discriminative networks by mean value.
7: Define loss function for noise reduction model by MSE value.
Loop training:
8: for i = 0 to 300 do
9:   Run sample batch on generative and discriminative networks to change to first domain.
10:   Run sample batch on generative and discriminative networks to change to second domain.
11:   Run noise reduction model.
12:   Update gradients.
13: end for
14: return estimated depth samples
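The control flow of the training loop in Algorithm 1 can be sketched independently of any deep learning framework. In the sketch below, every callable is a hypothetical placeholder supplied by the caller, not an API of TensorFlow or of the authors' code, and the loop bound is read as 300 iterations, which is an assumption about how "0 to 300" is meant.

```python
def train(sample_batch, to_virtual, to_depth_mask, denoise,
          update_gradients, iterations=300):
    """Run the per-iteration sequence of Algorithm 1 and return the last output."""
    estimated = None
    for i in range(iterations):
        batch = sample_batch(i)
        virtual = to_virtual(batch)          # first domain change (GAN 1)
        depth_mask = to_depth_mask(virtual)  # second domain change (GAN 2)
        estimated = denoise(depth_mask)      # noise reduction model
        update_gradients()                   # one optimizer update
    return estimated


def make_stage(name, log):
    """Build a stand-in stage that records its call and tags its input."""
    def stage(x):
        log.append(name)
        return f"{name}({x})"
    return stage


log = []
result = train(
    sample_batch=lambda i: f"batch{i}",
    to_virtual=make_stage("gan1", log),
    to_depth_mask=make_stage("gan2", log),
    denoise=make_stage("nr", log),
    update_gradients=lambda: log.append("update"),
    iterations=2,
)
print(result)
print(log)
```

The stand-in stages only record the call order, which makes it easy to confirm that each iteration chains the two domain changes and the noise reduction before the gradient update.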

A. METRICS
Evaluation criteria are based on the error and accuracy metrics proposed by Eigen et al. [53] to evaluate and compare the performance of depth estimation methods. These metrics are formulated as follows, where $Y_p$ is a pixel in the depth image $Y$ (ground truth image), $\hat{Y}_p$ is a pixel in the estimated depth image $\hat{Y}$, and $k$ is the total number of pixels for each depth image.
Definition 5: Relative error (rel) is obtained according to the following expression:
$$\mathrm{rel} = \frac{1}{k}\sum_{p=1}^{k}\frac{\left|Y_p - \hat{Y}_p\right|}{Y_p}$$
Definition 6: Average ($\log_{10}$) error is obtained according to the following expression:
$$\log_{10} = \frac{1}{k}\sum_{p=1}^{k}\left|\log_{10} Y_p - \log_{10} \hat{Y}_p\right|$$
Definition 7: Root mean-squared error (RMSE) is obtained according to the following expression:
$$\mathrm{RMSE} = \sqrt{\frac{1}{k}\sum_{p=1}^{k}\left(Y_p - \hat{Y}_p\right)^2}$$
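The three metrics can be computed directly from flattened depth images. The sketch below uses their commonly cited forms (mean absolute relative error, mean absolute difference of base-10 logarithms, and root mean-squared error); the function names and the three-pixel example values are hypothetical, and all depths are assumed to be strictly positive.

```python
import math


def relative_error(truth, estimate):
    """Mean of |Y_p - Yhat_p| / Y_p over all pixels."""
    return sum(abs(y - y_hat) / y for y, y_hat in zip(truth, estimate)) / len(truth)


def log10_error(truth, estimate):
    """Mean of |log10(Y_p) - log10(Yhat_p)| over all pixels."""
    return sum(abs(math.log10(y) - math.log10(y_hat))
               for y, y_hat in zip(truth, estimate)) / len(truth)


def rmse(truth, estimate):
    """Square root of the mean squared pixel difference."""
    return math.sqrt(sum((y - y_hat) ** 2
                         for y, y_hat in zip(truth, estimate)) / len(truth))


# Hypothetical depths in centimeters for three pixels.
truth = [100.0, 150.0, 50.0]
estimate = [110.0, 150.0, 45.0]

print(f"rel   = {relative_error(truth, estimate):.4f}")
print(f"log10 = {log10_error(truth, estimate):.4f}")
print(f"RMSE  = {rmse(truth, estimate):.4f}")
```

The relative and log-based errors are scale-independent, whereas RMSE is expressed in the same unit as the depth values, so the three are usually reported together.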

VI. RESULTS
The performance of this proposal was evaluated using the metrics presented in Section V-A. These metrics have been widely used for depth estimation with adversarial training [54]-[60]. Table 3 shows the performance results achieved by this proposal in three different stages: Simple GAN, Double-GAN, and Double-GAN with noise reduction (Double-GAN-NR). Table 4 shows depth estimation average values and standard deviations of 50 samples of flat surfaces on three different objects. The DG-2 and DG-5 results correspond to the architecture output without noise reduction at two distances: 2 m and 5 m (Figure 8). Accordingly, the Double-GAN with noise reduction results are labeled Double-GAN-NR-2 and Double-GAN-NR-5 for the two distances.
As shown in Table 4, DG-2 and DG-5 present less promising results, since their depth estimates are further from the ground truth value. Microsoft Kinect shows better results than the DG-2 and DG-5 models. However, the depth estimation results achieved by this sensor exhibit an inverse relationship between the distance to be detected and measurement precision. That is, as the distance range increases, depth estimation precision decreases. Regarding the noise reduction models, Double-GAN-NR-2 shows better performance than Double-GAN-NR-5, because the Double-GAN-NR-2 model achieves depth measurements closer to the ground truth value. Table 5 shows how close the distance measurements of each model are to the ground-truth value. As can be seen, Double-GAN-NR-2 shows the best performance.
Likewise, Figure 11 shows a graphical description of the sample capture procedure to build a 3D environment, as suggested in [61]. Figure 12 shows the 3D environment representation obtained with the compared GAN methods. Finally, Figure 13 shows Microsoft Kinect results for 3D scene representation with 2 meters and 5 meters depth configurations.
To validate the Double-GAN approach, we compared the complete set of virtual samples with the original images. Figure 12a shows random behavior that appears on flat surfaces. The intermediary module therefore minimizes noise and shares cross-domain features between domains. As can be seen in Figure 12b, more consistent results are obtained. However, when performing the 3D representation, irregularities are observed in flat sections of the objects. Figure 12c and Figure 12d show the results of the Double-GAN architecture with noise reduction, in which depth in virtual samples was estimated in two ranges: 2 meters and 5 meters, respectively. As can be seen in Figure 12d, the best depth estimation results were obtained with the 5 m configuration.
The results in Tables 4 and 5 show that the Double-GAN-NR-2 model can estimate image depth successfully in a range from 50 cm to 150 cm, considering a maximum offset of 7.4 cm. On the other hand, the maximum offset required by the Double-GAN-NR-5 model to estimate the depth of an image in the same range is 20.9 cm, while for the Microsoft Kinect sensor it is 8.5 cm.
A moderately detailed 3D image with an adequate lighting level allows creating a virtual representation of the authentic scene without having access to the authentic environment. The experimental results shown in Figures 12 and 13 confirm that it is possible to obtain a real scene representation using depth and semantic mask estimation in virtual samples using a Double-GAN approach with a limited number of samples.

VII. CONCLUSION
Because the interoperability coefficient determines a limited number of samples of the authentic environment, we can employ a virtual environment to reduce the number of samples required from environments that are difficult to access.
Likewise, experimental results confirm that a low-cost depth sensor developed through virtual samples of an authentic environment can be obtained using GANs. Besides, the results suggest that a simple GAN approach for semantic mask and depth estimation in virtual samples of an authentic environment is not enough to achieve competitive performance. Conversely, a Double-GAN approach can estimate depth in virtual samples of an authentic environment showing competitive performance.
Furthermore, experimental results confirm that the Double-GAN approach obtains enough information from a conventional camera to estimate the semantic mask and depth in a 3D environment that requires continuous depth estimation with limited samples. Thus, we avoid adding more devices and preserve the flight time of MAVs in a controlled indoor environment where the illumination is constant.
As future work, we propose to use the Double-GAN approach to detect moving obstacles in MAVs' paths in order to implement obstacle avoidance algorithms and optimal navigation path planning.