DDL-SLAM: A Robust RGB-D SLAM in Dynamic Environments Combined With Deep Learning

Visual Simultaneous Localization and Mapping (VSLAM) has developed into a basic capability of robots over the past few decades, and many impressive open-source SLAM systems exist. However, most current SLAM theories and approaches rely on the static scene assumption, which rarely holds in practice because moving objects are ubiquitous and unavoidable in most circumstances. This paper proposes DDL-SLAM (Dynamic Deep Learning SLAM), a robust RGB-D SLAM system for dynamic scenarios that builds on ORB-SLAM2 and adds dynamic object segmentation and background inpainting. Moving objects are detected using both semantic segmentation and multi-view geometry. A map of the static scene then allows the frame background that was occluded by moving objects to be inpainted. With these additions, the localization accuracy in dynamic environments is greatly improved. Experiments on a public RGB-D benchmark dataset show that DDL-SLAM significantly enhances the robustness and stability of the RGB-D SLAM system in highly-dynamic environments.


I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) is a prerequisite for many robot applications, such as industrial automation, autonomous vehicles, and collision-free navigation. SLAM was first put forward by Smith et al. [1], [2] in 1986. An autonomous robot estimates its pose from data acquired by various sensors and from information about previous locations as it travels through an unknown scene, while incrementally building a consistent map of the scene. The solution has been regarded for more than a decade as a pivotal milestone in the pursuit of truly autonomous robots. Nowadays, it is safe to say that the SLAM problem has been solved in many ways, at least in theory [3].
Visual SLAM, in which the camera is the only exteroceptive sensor, has been extensively investigated in recent years. It uses images as the sole source of information about the external environment [4], because images contain a large amount of useful information and can also serve other visual tasks, such as semantic segmentation, object detection and tracking. A typical visual SLAM algorithm mainly calculates the camera pose and rebuilds the 3D map using multi-view geometry. To improve the processing speed, many algorithms first extract sparse feature points and then perform inter-frame estimation and loop closing by matching them. For instance, SIFT [5] and ORB [6] features are widely used in visual SLAM because they are robust, distinctive, and fast to compute. However, hand-crafted sparse image features remain limited, and they face difficulties under conditions such as dynamic scenes, too many or too few feature points, and large-scale scenarios. In recent years, hierarchical image feature extraction represented by deep learning has emerged in visual SLAM and has been applied to visual odometry (e.g. [7]-[10]) and loop closure detection (e.g. [11]-[13]). Deep learning is a representation-learning method with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level [14]. At present, deep learning is combined with SLAM mainly in three areas: inter-frame estimation [7], [9], [10], loop closure detection [15]-[17] and semantic mapping [18], [19].
The associate editor coordinating the review of this manuscript and approving it for publication was Leo Chen.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the past few decades, some impressive SLAM systems have been developed and achieve good performance in certain cases. Nevertheless, substantial issues remain unsolved, for instance how to cope with moving objects in dynamic environments, and how to make robots fully understand their surroundings and complete advanced tasks. The primary contributions of this paper are:
• A novel RGB-D SLAM framework combined with deep learning is proposed to reduce the impact of moving objects on camera pose estimation. Semantic segmentation combined with multi-view geometry serves as a preprocessing stage that filters out data related to dynamic targets.
• A background inpainting method is used to repair the frame background that is covered by moving objects. These synthesized frames are then used to generate an octree map.
The rest of the paper is organized as follows. Section II briefly reviews SLAM work on dynamic scenarios. Section III describes the architecture of our SLAM system. Section IV presents qualitative and quantitative results of DDL-SLAM on the TUM RGB-D dataset [20], demonstrating the effectiveness, availability and accuracy of the system. Section V concludes the paper with a brief discussion.

II. RELATED WORK
SLAM in dynamic environments has been an active research field in the robotics community for years. Some SLAM systems treat dynamic content as outliers and filter out the corresponding observations. The observations of the static areas of the scene are then used for mapping, localization and navigation. Dynamic environments can be further divided into low-dynamic environments, which consist of static objects and entities that move slowly or seldom, such as doors, chairs, tables or parked cars, and highly-dynamic environments, in which objects such as moving people or cars continuously change their pose and occupy most of the scene.
For low-dynamic environments, [21] presents an occupancy grid mapping algorithm for robots operating in surroundings where non-stationary objects frequently move, [22] proposes a SLAM method that simultaneously detects and tracks moving targets using a laser scanner, and [23] adds the time dimension to the mapping process so that a robot can maintain an accurate map while operating in dynamic scenes, introducing the Dynamic Pose Graph SLAM. However, these methods use a laser scanner as the sensor, which differs from our approach. On the other hand, [24] describes the parallel execution of monoSLAM and a 3D object tracker, which allows moving objects and occlusion to be inferred, and [25] proposes an incremental motion segmentation system that effectively segments numerous dynamic targets while constructing a map of outdoor scenes with a monocular camera.
Multiple cues based on optical flow and two-view geometry are combined to perform the segmentation. [26] presents a stereo-based visual SLAMMOT (simultaneous localization, mapping and moving object tracking) approach to handle moving objects while performing SLAM in highly-dynamic environments. In [27], stereo-based visual SLAM is combined with dense scene flow to improve on traditional algorithms in highly-dynamic and large-scale environments. Furthermore, some RGB-D SLAM systems deal with moving targets in challenging dynamic scenes [28]-[32]. Our goal is to enhance the robustness and stability of RGB-D SLAM based on ORB-SLAM2 [33] in highly-dynamic scenarios, and we propose several improvements to that end.

III. SYSTEM DESCRIPTION
This section introduces DDL-SLAM in detail and consists of five parts. First, the framework of DDL-SLAM is presented. Second, we briefly describe the semantic segmentation employed in our system. Third, the multi-view geometry algorithm used to improve the segmentation of dynamic content is introduced. Fourth, the tracking and mapping module, which is based on ORB-SLAM2, is described. Finally, we show the method used to inpaint the occluded background and build an octree map.

A. FRAMEWORK OF DDL-SLAM
In real-life applications (e.g. autonomous robots, unmanned aerial vehicles), accurate pose estimation and dependability under severe conditions are key factors. To the best of our knowledge, ORB-SLAM2 performs well in a variety of environments, from a handheld camera in indoor scenes to drones flying outdoors and unmanned vehicles driving around a city. Therefore, DDL-SLAM adopts its RGB-D SLAM as the overall SLAM scheme, on top of which we detect moving objects and generate the octree map. Fig. 1 shows an overview of DDL-SLAM.
The framework of our DDL-SLAM system is shown in Fig. 2. First, the raw RGB images are processed by a CNN (convolutional neural network) that segments, pixel-wise, the a priori dynamic objects, for example humans. Once the potentially dynamic objects have been segmented, the camera pose is tracked using the static part of the frame; at this stage the ORB-SLAM2 algorithm is simpler and its computational load is smaller. Afterwards, multi-view geometry is used to refine the segmentation of dynamic objects. After all dynamic content has been detected and the camera localization has been completed, the occluded background of the current frame is reconstructed using static information from previous frames. The inpainted RGB and depth images are then used to generate a local point cloud, which is transformed and maintained in an octree map. Finally, ORB features of the static part of the frame are extracted for use in the tracking and mapping thread.

B. SEMANTIC SEGMENTATION
To detect dynamic objects, DDL-SLAM adopts DUNet [34] (deformable U-Net [35]) for pixel-wise semantic segmentation, building on the PyTorch implementation by Tramac (https://github.com/Tramac/awesome-semantic-segmentation-pytorch). DUNet is an FCN-based [36] network that greatly enhances the segmentation capability of deep neural networks.
The DUNet trained on the PASCAL VOC dataset [37] can segment the following potentially movable classes: bicycle, person, boat, bird, horse, sheep, cat, cow, dog, aeroplane, bus, car, motorbike and train. Most moving objects likely to appear in real applications are covered by this list. If other potentially dynamic classes are needed, the network could also be trained on MS COCO [38].
The input of DUNet is an original RGB image of size h×w×3, and the output of the network is a matrix of size h×w×n, where n is the number of dynamic objects in the image. A binary mask is obtained for each output channel i ∈ {1, ..., n}. Merging all the channels into one yields the segmentation of all dynamic objects that appear in the image.
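The channel-merging step described above amounts to a pixel-wise logical OR over the n binary masks. The following is a minimal sketch of that idea in plain Python; the function name `merge_masks` and the toy masks are illustrative assumptions, not part of the DDL-SLAM code.

```python
from typing import List

Mask = List[List[int]]  # h x w grid holding 0 (static) or 1 (dynamic)


def merge_masks(channel_masks: List[Mask]) -> Mask:
    """Combine n binary masks of identical size with a pixel-wise logical OR."""
    if not channel_masks:
        raise ValueError("at least one mask is required")
    height, width = len(channel_masks[0]), len(channel_masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask in channel_masks:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    merged[r][c] = 1
    return merged


# Toy example: two hypothetical output channels for a 2 x 3 image.
person = [[0, 1, 0],
          [0, 1, 0]]
car = [[0, 0, 0],
       [1, 1, 0]]
dynamic = merge_masks([person, car])
```

A pixel is marked dynamic in the merged mask as soon as any single channel marks it, which matches the statement that merging yields the segmentation of all dynamic objects in the image.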

C. SEGMENTATION OF DYNAMIC CONTENT USING DUNET AND MULTI-VIEW GEOMETRY
Although most dynamic objects can be segmented with DUNet, a handful of objects cannot be detected by this means alone, because they are not a priori dynamic but movable. For example, the cup, telephone and book are still in Fig. 3 (a), but are moved at separate times in Fig. 3 (c), (d). Multi-view geometry is therefore added to the system to improve the segmentation of dynamic objects. It refines the segmentation of dynamic content previously obtained with DUNet and, moreover, detects new dynamic object instances that are static most of the time and are not labelled as moving in the network stage. The multi-view geometry algorithm is given in Algorithm 1.
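The text here does not spell out the individual steps of the multi-view geometry stage, so the following is only a generic sketch of one kind of multi-view consistency test, not the authors' Algorithm 1. A pixel seen in an earlier keyframe is lifted to 3D, moved into the current camera frame, and flagged as dynamic when its predicted depth disagrees with the depth measured at the projected pixel by more than a threshold. The pinhole intrinsics, the threshold `tau`, and all function names are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # fx, fy, cx, cy (hypothetical)


def back_project(u: float, v: float, depth: float, intr: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = intr
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(R: Sequence[Sequence[float]], t: Sequence[float], p: Vec3) -> Vec3:
    """Apply the rigid motion p' = R p + t (keyframe frame to current frame)."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def is_dynamic(u: float, v: float, depth_kf: float,
               R: Sequence[Sequence[float]], t: Sequence[float],
               depth_lookup: Callable[[int, int], Optional[float]],
               intr: Intrinsics, tau: float = 0.1) -> bool:
    """Flag a keyframe pixel as dynamic if predicted and measured depth differ by more than tau."""
    p_cur = transform(R, t, back_project(u, v, depth_kf, intr))
    if p_cur[2] <= 0.0:
        return False  # behind the current camera: no decision possible
    fx, fy, cx, cy = intr
    u_cur = fx * p_cur[0] / p_cur[2] + cx
    v_cur = fy * p_cur[1] / p_cur[2] + cy
    measured = depth_lookup(round(u_cur), round(v_cur))
    if measured is None:
        return False  # no valid depth at the projected pixel
    return abs(p_cur[2] - measured) > tau


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
NO_MOTION = (0.0, 0.0, 0.0)
INTR = (500.0, 500.0, 320.0, 240.0)  # made-up intrinsics for the example

# Same camera pose in both frames; only the measured depth changes.
static_case = is_dynamic(320, 240, 2.0, IDENTITY, NO_MOTION, lambda u, v: 2.02, INTR)
moving_case = is_dynamic(320, 240, 2.0, IDENTITY, NO_MOTION, lambda u, v: 1.20, INTR)
```

In this toy setup `static_case` is false (2 cm disagreement) and `moving_case` is true (80 cm disagreement). A real system would also have to handle occlusion boundaries and depth noise, which this sketch ignores.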

D. TRACKING AND MAPPING
Algorithm 1 Multi-view Geometry
Data: ...
...
16: MasksList ← CombineMasks(F_1, mask);

Based on ORB-SLAM2, this module consists of three parallel threads: tracking, local mapping and loop closing. The RGB and depth images, together with their segmentation masks, are the input to this stage of DDL-SLAM. In the tracking thread, ORB features are extracted from the image regions classified as static. The camera poses are then estimated with respect to the previous frames by finding feature matches in the local map and minimizing the re-projection error using motion-only bundle adjustment (BA). In the local mapping thread, the algorithm manages and optimizes the local map by performing local BA. In the loop closing thread, it detects large loops and corrects the accumulated drift using pose-graph optimization. After the pose-graph optimization, this thread launches a further thread that performs full BA to compute the optimal structure and motion solution.
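The idea of restricting tracking to static image regions can be illustrated with a small filter that discards keypoints falling inside the dynamic mask. This is a simplified sketch: it filters keypoints after detection rather than restricting the extractor itself, and the helper name `keep_static_keypoints` and the toy data are assumptions for illustration only.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float]  # (u, v) = (column, row) in pixels


def keep_static_keypoints(keypoints: List[Keypoint],
                          dynamic_mask: List[List[int]]) -> List[Keypoint]:
    """Keep keypoints whose nearest pixel is inside the image and not marked dynamic."""
    height, width = len(dynamic_mask), len(dynamic_mask[0])
    static = []
    for u, v in keypoints:
        col, row = int(round(u)), int(round(v))
        inside = 0 <= row < height and 0 <= col < width
        if inside and dynamic_mask[row][col] == 0:
            static.append((u, v))
    return static


# 3 x 4 toy mask: the right half of the image is covered by a moving object.
mask = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
candidates = [(0.2, 0.1), (2.9, 1.0), (1.1, 2.2), (7.0, 1.0)]
static_points = keep_static_keypoints(candidates, mask)
```

Only the first and third candidates survive: the second lies on the masked object and the fourth lies outside the image.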

E. BACKGROUND INPAINTING AND OCTREE MAP BUILDING
To inpaint the occluded background using static information from previous views, the last 15 keyframes are selected and projected into the dynamic parts of the current frame. Fig. 4 shows synthetic images produced from input frames of several sequences in the TUM RGB-D dataset. All dynamic objects have been detected and removed, and most of the segmented areas have been correctly inpainted with information from the static background. However, a few blocks are not completely inpainted, either because the corresponding parts of the scene have not yet appeared in the keyframes or because they have appeared but lack valid depth information. These gaps cannot be rebuilt with geometric approaches alone, and a more elaborate inpainting technique will be required in future work.
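The filling logic can be sketched as follows, under the simplifying assumption that each previous keyframe has already been warped into the current view, with `None` marking pixels that have no valid projection (for example because of missing depth). The function name and the toy images are hypothetical; the projection itself is not shown.

```python
from typing import List, Optional

Image = List[List[Optional[float]]]


def inpaint_from_keyframes(current: Image,
                           dynamic_mask: List[List[int]],
                           warped_keyframes: List[Image]) -> Image:
    """Replace masked pixels with the value from the first keyframe that covers them.

    warped_keyframes is ordered from most recent to oldest. Masked pixels
    that no keyframe covers are left as None, i.e. they remain gaps.
    """
    result = [row[:] for row in current]
    for r, mask_row in enumerate(dynamic_mask):
        for c, is_dyn in enumerate(mask_row):
            if not is_dyn:
                continue
            result[r][c] = None  # discard the dynamic content first
            for keyframe in warped_keyframes:
                if keyframe[r][c] is not None:
                    result[r][c] = keyframe[r][c]
                    break
    return result


current = [[10.0, 99.0],
           [99.0, 99.0]]
mask = [[0, 1],
        [1, 1]]
newest = [[None, None],
          [12.0, None]]
older = [[None, 11.0],
         [50.0, None]]
inpainted = inpaint_from_keyframes(current, mask, [newest, older])
```

Here the top-right pixel is filled from the older keyframe, the bottom-left pixel from the newest one, and the bottom-right pixel stays `None` because no keyframe observed it, which mirrors the unfilled blocks mentioned above.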
These synthesized frames are then used to generate the local point cloud, which is transformed and maintained in a global octree map. The octree map representation [39] is flexible, compact and updatable. Moreover, it can be stored efficiently and used easily for navigation.
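As a rough illustration of this step, the sketch below back-projects a depth image into a point cloud with a pinhole model and quantizes the points into occupied voxel indices. A flat set of voxel keys stands in for the leaf level of an octree; it is not the octree library cited as [39]. The intrinsics, the resolution, and the function names are assumptions, and the transformation into the global frame is omitted (the camera is taken to be at the origin).

```python
import math
from typing import List, Optional, Set, Tuple

Point = Tuple[float, float, float]


def depth_to_points(depth: List[List[Optional[float]]],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Back-project every pixel with valid depth to a 3D point (camera frame)."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d is None or d <= 0.0:
                continue  # skip missing or invalid depth
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def occupied_voxels(points: List[Point], resolution: float) -> Set[Tuple[int, int, int]]:
    """Quantize points to integer voxel indices at the given edge length."""
    return {
        (math.floor(x / resolution), math.floor(y / resolution), math.floor(z / resolution))
        for x, y, z in points
    }


# 2 x 2 toy depth image with unit focal length and principal point at (0, 0).
depth_image = [[1.0, None],
               [0.0, 2.0]]
cloud = depth_to_points(depth_image, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
voxels = occupied_voxels(cloud, resolution=0.5)
```

Only two of the four pixels carry valid depth, so the cloud contains two points and the voxel set contains two keys.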

IV. EXPERIMENTAL RESULTS
In this section, DDL-SLAM is evaluated on the public TUM RGB-D dataset. The dataset provides many sequences in dynamic environments, for example walking, sitting and desk, with ground truth acquired by a highly accurate motion capture system. In the walking sequences, two young people walk from the foreground to the background and then sit down at the desk. These sequences are highly dynamic and hence difficult for general SLAM systems. In the sitting sequences, two young people sit at a desk while speaking and gesticulating. These sequences are considered low-dynamic because the people seldom move. All experiments are carried out on a computer with an Intel i7 CPU, an NVIDIA TITAN GPU, and 12 GB of memory.
Figure 4 caption (fragment): ... Figure 4 (a), the output of our system is shown in Figure 4 (c), in which dynamic content has been segmented and the background has been reconstructed. Figure 4 (b) and (d) show the input and output depth images respectively, which have also been processed.
DDL-SLAM adopts ORB-SLAM2, generally accepted as the state-of-the-art algorithm at present, as its global SLAM solution, so we compare against RGB-D ORB-SLAM2. The absolute trajectory error (ATE) is well suited to measuring the performance of a visual SLAM system, while the relative pose error (RPE) measures the drift of the visual odometry. We therefore compute both ATE and RPE for the quantitative evaluation.
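To make the ATE metric concrete, the sketch below computes common summary statistics of the translational error between an estimated and a ground-truth trajectory. It assumes the two trajectories are already associated by timestamp and rigidly aligned, so only the per-pose position differences remain. The function name, the use of the population standard deviation, and the toy trajectories are assumptions for illustration, not the benchmark's official tool.

```python
import math
import statistics
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]


def ate_statistics(estimated: List[Position],
                   ground_truth: List[Position]) -> Dict[str, float]:
    """Summarize per-pose translational errors of two aligned trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return {
        "rmse": math.sqrt(sum(err * err for err in errors) / len(errors)),
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "sd": statistics.pstdev(errors),  # population S.D. (an assumption)
    }


# Toy trajectories with per-pose errors of 0 m, 3 m and 4 m.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0), (2.0, 0.0, 4.0)]
stats = ate_statistics(estimated, ground_truth)
```

RMSE penalizes large deviations more strongly than the mean or median, so a few badly localized poses raise it noticeably.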
Tab. 1 shows the quantitative comparison results, where halfsphere, xyz, static and rpy in the first column stand for four types of camera ego-motion [20]: (1) halfsphere: the camera moves along the trajectory of a hemisphere of 1 m diameter, (2) xyz: the camera moves along the x, y and z axes, (3) static: the camera is held still by hand, and (4) rpy: the camera rotates about the roll, pitch and yaw axes. The Root-Mean-Square Error (RMSE), Mean Error, Median Error and Standard Deviation (S.D.) are reported; we focus on RMSE and S.D. because they better reflect the robustness and stability of the system. As Tab. 1 shows, our method improves performance by an order of magnitude in most highly-dynamic sequences. In terms of ATE, the improvements in RMSE and S.D. reach up to 98.6% and 98.2%, respectively. These results show that DDL-SLAM significantly enhances the robustness and stability of the SLAM system in highly-dynamic scenarios. In low-dynamic scenes, the error is similar to that of the original RGB-D ORB-SLAM2 system. The primary reason is that the original ORB-SLAM2 already handles low-dynamic environments well, so the room for improvement is limited.
Tab. 2 and Tab. 3 report the performance of the visual odometry, and the results agree with the ATE analysis above. Fig. 5 plots selected ATE curves for the highly-dynamic sequences, where the errors are significantly decreased with our method. Selected ATE curves for the low-dynamic sequences are shown in Fig. 6. The original ORB-SLAM2 performs well in these cases. With our method combined into the SLAM system, the ATE values are greatly decreased. However, in the freiburg2_desk_with_person sequence, we found that our method could not improve on the original performance.
We believe the primary reason is that there are no moving objects in the early stage of the sequence, so the scene is in fact static during this period. Moreover, low-dynamic movements are generally not continuous in the sequences, and the moving objects remain motionless in some frames.
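The improvement percentages quoted in this section are not defined explicitly. A natural reading, assumed here, is the relative error reduction with respect to the ORB-SLAM2 baseline. The sketch below shows that calculation; the function name and the numbers are made up for illustration and are not values from the tables.

```python
def improvement_percent(baseline_error: float, our_error: float) -> float:
    """Relative error reduction with respect to the baseline, in percent."""
    if baseline_error <= 0.0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - our_error) / baseline_error * 100.0


# Hypothetical RMSE values in metres, not taken from the paper.
example = improvement_percent(baseline_error=0.5, our_error=0.025)  # roughly 95
```

Under this definition, an improvement above 90% corresponds to an error reduction of at least one order of magnitude, and a negative value indicates that the error increased.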

V. CONCLUSION
In this paper, a robust and stable RGB-D SLAM system (DDL-SLAM) for highly-dynamic environments using deep learning is proposed. A pixel-wise semantic segmentation convolutional neural network, DUNet, is integrated with a multi-view geometry algorithm to filter out all dynamic content of the scene. The matched ORB feature points in the detected dynamic areas are then removed, and synthetic RGB frames without dynamic objects and with an inpainted background, together with their matching synthesized depth images, are obtained. Quantitative evaluations were carried out on the challenging dynamic sequences of the TUM RGB-D dataset. Experimental results show that DDL-SLAM clearly outperforms ORB-SLAM2 in accuracy and robustness in highly-dynamic scenarios. Nevertheless, our approach still has a few limitations. For example, the real-time performance of the algorithm needs to be improved, a more elaborate background inpainting method needs to be developed, and the octree map produced by our system could be endowed with semantic information for use in navigation in future work.