Semantic SLAM With More Accurate Point Cloud Map in Dynamic Environments

A static environment is a prerequisite for most existing vision-based SLAM (simultaneous localization and mapping) systems to work properly, which greatly limits the use of SLAM in real-world environments. The quality of the global point cloud map constructed by a SLAM system in a dynamic environment depends on the accuracy of camera pose estimation and on the removal of noise blocks from the local point cloud maps. Most dynamic SLAM systems focus on improving the accuracy of camera localization, but rarely study noise block removal. In this paper, we propose a novel semantic SLAM system that builds a more accurate point cloud map in dynamic environments. We obtain the masks and bounding boxes of the dynamic objects in the images with BlitzNet. The mask of a dynamic object is extended by analyzing the depth statistics of the mask within the bounding box. The islands generated by the residual information of dynamic objects are removed by a morphological operation after geometric segmentation. With the bounding boxes, the images can be quickly divided into environment regions and dynamic regions, and the depth-stable matching points in the environment regions are used to construct epipolar constraints that locate the static matching points in the dynamic regions. To verify the performance of the proposed SLAM system, we conduct experiments on the TUM RGB-D dataset. Compared with state-of-the-art dynamic SLAM systems, our system constructs the best global point cloud map.


I. INTRODUCTION
Simultaneous localization and mapping (SLAM) plays an important role in the field of autonomous robots and unmanned vehicles [1]. The main purpose of SLAM is to use sensors such as cameras to construct a model of the environment without prior knowledge of the scene, and to estimate the pose and motion trajectory of the carrier itself [2]. Recent years have seen great progress in visual SLAM, which can be classified into feature-based indirect systems [3]-[5] and direct systems based on photometric error [6], [7].
A common assumption for most of these visual SLAM systems is that the environment is static, which greatly limits their practical application. Real environments contain many dynamic objects, such as pedestrians and vehicles. When these dynamic objects enter the camera's field of view, the pose estimation of the camera is directly disturbed [8], and information about these dynamic objects is preserved in the constructed map [9]. Such a contaminated map cannot be used for robot navigation or human-computer interaction.
The associate editor coordinating the review of this manuscript and approving it for publication was Kumaradevan Punithakumar.
The main idea of SLAM systems that work in dynamic environments is to classify the points in the environment as static or dynamic, and to use the static points to estimate the pose of the camera and construct the environment map [10]. The dynamic points can be discarded directly or tracked, according to the requirements of the task.
With the development of deep learning, great progress has been made in target recognition and segmentation, so some researchers have combined the semantic information of the objects in the environment with SLAM [11]-[15]. The main idea of current semantic SLAM systems working in dynamic environments is to define the potential dynamic objects in advance, obtain semantic information about the objects in the images through a deep CNN, and then combine the semantic information with geometric information to segment the dynamic regions accurately. Finally, the systems use the static parts of the environment to estimate the camera pose.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we propose a novel semantic SLAM system that builds a more accurate point cloud map in dynamic environments. The semantic information of our SLAM system is provided by BlitzNet [16], which simultaneously generates bounding boxes and masks of the potential dynamic objects, such as people. To remove the noise blocks formed by the leaked information of the dynamic objects, we extend the masks of the dynamic objects using the depth statistics of the masks within their bounding boxes. A geometric segmentation is then performed on the depth image. The residual information of the dynamic objects that is still not covered by the extended masks is segmented into islands, which we delete with a simple morphological operation to obtain a clean local point cloud map. In the feature matching stage, we only use depth-stable feature points, which effectively eliminates the influence of missing depth values and sudden changes in depth values. The images are divided into environment regions and dynamic regions by the bounding boxes of the dynamic objects. The static matching points in the dynamic regions are located by the epipolar constraint constructed from the inliers in the environment regions. The main contributions of this paper are as follows:
1) We extend the mask of the dynamic object to include more information about the object.
2) A bidirectional search strategy is proposed to track the bounding box of the dynamic object.
3) We integrate our approach with the ORB-SLAM2 system [5]. Evaluations and method comparisons are performed on the TUM RGB-D dataset [17]. Our SLAM system obtains clean and accurate global point cloud maps in both high-dynamic and low-dynamic environments.
The rest of the paper is organized as follows: Section II discusses the related work. Section III presents our proposed method. Section IV shows the experiment results and discussion. Section V, finally, concludes the work.

II. RELATED WORK

A. DYNAMIC SLAM
CoSLAM [18] divides points into four types (static, dynamic, false, and uncertain) based on the assumption that the projected position of a static point in space follows a Gaussian distribution, and the type of a point can change according to the motion state of the object. The system also adopts a multi-camera joint observation strategy to deal effectively with situations in which a given camera captures few static points or none at all. Evers and Naylor [19] proposed GEM-SLAM, which is based on probability hypothesis density filters for dynamic scenes; the method probabilistically anchors the observer state by fusing the observer information inferred from the scene with the observer motion reports. Bahraini et al. [20] proposed an approach to segment and track multiple dynamic objects based on the multilevel-RANSAC algorithm. DMS-SLAM [21] uses a sliding window model to achieve feature matching between two non-consecutive image frames, and adopts the Grid-based Motion Statistics (GMS) algorithm [22] to filter the initial feature matches. This approach not only eliminates the impact of dynamic objects, but also yields more feature matches than ORB-SLAM2, which is a great advantage for estimating the camera pose accurately.
An RGB-D camera provides both visual and depth information of the scene and is convenient to install. In recent years, many researchers have used RGB-D cameras to study SLAM in dynamic environments. Sun et al. [23] proposed a motion removal approach based on RGB-D data as a preprocessing stage of DVO SLAM [24] to filter out data related to moving objects. The approach consists of two online parallel processes: learning, which constructs and updates the foreground model, and inference, which performs pixel-wise segmentation with the foreground model. Scona et al. [25] segmented the input RGB-D image pair into K geometric clusters by applying K-means to the 3D coordinates of the scene points. Assuming that each geometric cluster behaves as a rigid body, the segmentation of static and dynamic objects is converted into an analysis of the states of the geometric clusters. This method focuses on building the static environment model rather than analyzing the motion state of objects. Li and Lee [26] proposed a SLAM system based on static point weights, where the weight of a point indicates the likelihood that it is part of the static environment. Kim and Kim [27] proposed a nonparametric background model built from depth scenes, which reduces the influence of dynamic objects; the camera motion is then estimated by an energy-based dense visual odometry approach based on the background model. Dai et al. [28] used the consistency of the geometric correlations between points to resist the interference caused by moving objects, where the geometric correlations between map points are created by Delaunay triangulation. The dynamic objects can be separated from the static environment by removing inconsistent connections.
To deal with the motion blur caused by high-speed camera motion in highly dynamic environments, researchers at the University of Zurich introduced the event camera into SLAM research [29]-[33]. The output of an event camera is an asynchronous stream of events. Because this output differs from that of a traditional camera, traditional vision algorithms cannot be applied directly to a SLAM system built on an event camera, and the higher price of the sensor also limits its use.
Dyna-SLAM removes all the potential dynamic objects, such as people, cars, and animals. Some dynamic objects cannot be detected by Mask R-CNN because they are not defined in advance as potential dynamic objects, such as a rotating chair or a book held by a moving person, so the authors use multi-view geometry to locate these movable objects. MaskFusion mainly works indoors, so the authors proposed two strategies to judge whether an object is dynamic or static: first, the consistency of the object's motion; second, an object in contact with a person is considered movable. MID-Fusion consists of four parts: segmentation, tracking, fusion, and raycasting. The system creates sub-maps for every possibly rigidly moving object in the environment, and fuses the geometric, semantic, and motion properties of dynamic objects into these sub-maps. During camera tracking, MID-Fusion discards the matching points in the human mask area. Because the contours of the dynamic objects obtained by semantic segmentation are not precise, the system proposed by Zhao et al. refines the boundaries of the detected dynamic objects by integrating the Canny edges of the RGB images with the mask boundaries.
DS-SLAM [39], SOF-SLAM [40], and SDF-SLAM [41] use SegNet [42] as the semantic segmentation algorithm. The authors of DS-SLAM assume that the feature points on people are most likely to be outliers, so they exclude the people in the images and construct an epipolar line constraint from the matching points in the environment regions. The constraint is then used to detect whether each person is static. If a person is determined to be static, the matching points on that person can be used to estimate the pose of the camera. SOF-SLAM proposed a dynamic feature detection algorithm that uses semantic information to aid the calculation of epipolar geometry, and the system can remove dynamic feature points effectively. SDF-SLAM is the continuation of SOF-SLAM and outperforms it because it solves two problems: first, matching points on slow-moving objects may be recognized as static matching points; second, the dynamic information in adjacent frames is easily disturbed by noise. Brasch et al. [43] proposed a joint probabilistic model based on the semantic prior information provided by a CNN, using temporal motion information to determine the state of a given map point. This method deals effectively with slow-moving and temporarily static objects. Sun et al. [44] proposed a movable-object-aware SLAM system based on weakly supervised semantic segmentation, whose main advantage is that it avoids expensive annotations for training. DDL-SLAM [45] detects dynamic objects with semantic masks obtained by DUNet [46] and multi-view geometry, and then reconstructs the background occluded by dynamic objects through image inpainting.
Instead of obtaining masks of potential dynamic objects, some researchers directly use bounding boxes to remove the dynamic regions. Yang et al. [47] used Faster-RCNN [48] to detect the potential dynamic objects and refined the data association by removing mismatched points, so that the camera pose is calculated from the improved data association and graph optimization. Zhang et al. [49] used YOLO [50] to detect and recognize objects, so that the relationships between keyframes and objects are built to filter the dynamic feature points and generate semantic maps. Dynamic-SLAM [51] uses an SSD [52] object detector to detect the dynamic objects, which are defined by a dynamic characteristic score based on life experience, and the system adopts a compensation algorithm based on speed invariance in adjacent frames to deal with missed detections. Aiming at the three problems that dynamic SLAM needs to solve, namely accurate camera localization, navigation map construction, and real-time segmentation of dynamic objects, Sun et al. [53] proposed a multi-purpose dynamic SLAM framework that is compatible with different segmentation methods for different purposes and situations.
In summary, most of the above SLAM systems for dynamic environments focus on improving the camera positioning accuracy, and there is almost no research on removing noise blocks from the resulting point cloud map. The quality of the global map constructed by SLAM depends on two factors. One is the accuracy of camera pose estimation, and the other is whether the noise blocks in the map are effectively removed. Since existing semantic segmentation algorithms are not perfect [35], some information of the dynamic objects leaks into the environment, and this information is retained in the constructed local point cloud maps, where it forms a large number of noise blocks. These noise blocks greatly affect the practical use of the map. In addition, if point cloud maps with a large number of noise blocks are converted to other forms of maps, such as octomaps [54], the quality of the maps will not be significantly improved.

III. SYSTEM DESCRIPTION
The overview of our proposed semantic SLAM system for dynamic environments is shown in Fig. 1. First, the RGB images pass through a CNN (convolutional neural network) that performs object detection and pixel-wise segmentation at the same time. Considering the practical application of the system, the choice of CNN for semantic segmentation of potential dynamic objects requires a balance between real-time performance and accuracy. We choose BlitzNet, a real-time deep neural network that performs object detection and semantic segmentation in a single forward pass, as the basic network of our semantic SLAM system. The detected objects include people, tvmonitors, chairs, etc., which are common indoors. We roughly divide the objects in the environment into three categories: dynamic objects, such as people; potential dynamic objects, such as chairs, books, and keyboards, whose status is determined by the dynamic objects; and static objects, such as tvmonitors, whose positions in the environment are relatively fixed and do not change easily. In essence, the global point cloud map $P_G$ constructed by the SLAM system is obtained by stitching together the local point clouds $P_L^i$ obtained from each group of key frames, namely
$$P_G = \bigcup_{i=1}^{n} \left( R_i P_L^i + t_i \right) \tag{1}$$
where $n$ is the total number of key frames, $i = 1, \ldots, n$, and $R_i$ and $t_i$ are the rotation matrix and translation vector that convert the local point cloud map into the coordinate system of the global point cloud map. The values of $R_i$ and $t_i$ are determined by the camera's pose in space. The quality of the global point cloud map obtained after stitching in a dynamic environment depends on two factors: first, the accuracy of camera pose estimation; second, the removal of the noise blocks formed by the dynamic objects in the local point cloud maps.
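The stitching step can be illustrated with a minimal sketch, assuming each local cloud is stored as an N x 3 NumPy array and each key-frame pose is given as a rotation matrix and a translation vector. The function name `stitch_global_map` and the toy poses are hypothetical and are not part of the described system.

```python
import numpy as np

def stitch_global_map(local_clouds, rotations, translations):
    """Move every local cloud (N_i x 3) into the global frame and concatenate."""
    parts = []
    for P, R, t in zip(local_clouds, rotations, translations):
        # Row-vector form of R @ p + t, applied to all points of the cloud.
        parts.append(P @ R.T + t)
    return np.vstack(parts)

# Toy example: two one-point clouds, the second rotated 90 degrees about z
# and lifted by one unit along z.
cloud_a = np.array([[1.0, 0.0, 0.0]])
cloud_b = np.array([[1.0, 0.0, 0.0]])
R_identity = np.eye(3)
R_z90 = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
global_map = stitch_global_map(
    [cloud_a, cloud_b],
    [R_identity, R_z90],
    [np.zeros(3), np.array([0.0, 0.0, 1.0])],
)
```

Because every point of a local cloud is copied into the global map, any dynamic-object pixel that survives in a local cloud reappears as a noise block after stitching, which is why the local clouds are cleaned first.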
Dynamic objects cause smears in the constructed point cloud map. Removing the image information corresponding to the mask regions of the dynamic objects improves this situation, but the dynamic object masks obtained by existing semantic segmentation are not perfect. In particular, the edges of a dynamic object might not be included in the mask, so these parts leak into the environment. The noise blocks formed by this edge information greatly affect the quality of the point cloud map, and can even make it look chaotic. In this paper, we extend the mask of a dynamic object by analyzing the depth information of the mask within the bounding box of the object, so that the extended mask contains as much information about the dynamic object as possible. We then segment the depth image with a geometric method and remove the islands formed by the residual information of the dynamic objects. After that, a clean local point cloud map can be obtained.
In a dynamic environment, the main reason for large camera pose estimation errors is that matching points on the dynamic objects take part in camera localization. In particular, when the dynamic objects occupy a large part of the image and are richly textured, these matching points cannot be removed effectively by traditional algorithms such as RANSAC [55]. In this paper, we quickly divide the image into environment regions and dynamic regions using the bounding boxes of the dynamic objects. The matching points in the environment regions are used to construct an epipolar constraint that locates the static matching points in the dynamic regions, which ensures that the final matching points used for camera localization are static.

A. DYNAMIC OBJECT MASK EXTENSION
The masks of the dynamic objects obtained by BlitzNet are not complete, so some information of the dynamic objects leaks into the environment. In this section we take human mask extension as an example. As shown in Fig. 2, comparing the RGB image with the obtained human mask image clearly shows that parts of the sitting person's body, marked by the red boxes in the figure, are not included in the mask. To observe the leaked body parts more intuitively, we set the depth values of the areas in the depth image that correspond to the human masks to 0. The parts of the human bodies that leak into the environment are marked with green boxes. The depth image shows that part of the edge of the walking person also leaks into the environment. The local point cloud generated from such depth and RGB images contains many noise blocks, which are marked with blue boxes.
We use the depth statistics of the human mask regions to find the pixels within the person's bounding box that belong to the human body but leak into the environment, that is, to extend the human body mask. As shown in Fig. 3, the depth values corresponding to the two human mask regions are counted. Considering that the human mask regions may contain some pixels with a depth value of 0 as well as some noise, the value 0 is not counted in the statistics, and the outliers are then removed. $D_{P(i)}$ represents the set of depth values within the human mask, where $P(i)$ is the label of the person that appears in the image. The maximum and minimum values in $D_{P(i)}$ are used as the upper and lower bounds of the depth value in the bounding box of $P(i)$, respectively, that is
$$D_{\max}^{P(i)} = \max\left( D_{P(i)} \right) \tag{2}$$
$$D_{\min}^{P(i)} = \min\left( D_{P(i)} \right) \tag{3}$$
The human mask can be extended according to (4):
$$(u, v) \in M_{P(i)} \quad \text{if} \quad D_{\min}^{P(i)} \le D^{BBX}_{P(i)}(u, v) \le D_{\max}^{P(i)} \tag{4}$$
where $D^{BBX}_{P(i)}(u, v)$ denotes the depth value of the pixel $(u, v)$ in the bounding box of $P(i)$, and $M_{P(i)}$ represents the mask of $P(i)$.
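A minimal sketch of this mask extension follows. The text does not state how the outliers are removed, so the sketch assumes the "three scaled MAD from the median" rule purely for illustration; the function name `extend_mask`, the bounding-box convention (exclusive lower-right corner), and the toy depth values are also assumptions.

```python
import numpy as np

def extend_mask(depth, mask, bbox):
    """Extend a boolean object mask inside its bounding box using depth bounds.

    depth: H x W array, 0 means missing depth.
    mask:  H x W boolean array from the segmentation network.
    bbox:  (x0, y0, x1, y1), lower-right corner exclusive (assumed convention).
    """
    values = depth[mask].astype(np.float64)
    values = values[values > 0]                 # zero depth is not counted
    if values.size == 0:
        return mask.copy()
    # Assumed outlier rule: drop values more than 3 scaled MAD from the median.
    med = np.median(values)
    scaled_mad = 1.4826 * np.median(np.abs(values - med))
    if scaled_mad > 0:
        values = values[np.abs(values - med) <= 3 * scaled_mad]
    d_min, d_max = values.min(), values.max()   # lower and upper depth bounds
    x0, y0, x1, y1 = bbox
    roi = depth[y0:y1, x0:x1]
    extended = mask.copy()
    # Pixels in the box whose depth lies within the bounds join the mask.
    extended[y0:y1, x0:x1] |= (roi >= d_min) & (roi <= d_max)
    return extended

# Toy scene: background at 3000, a person around 1000 whose right edge
# (column 3) was missed by the segmentation mask.
depth = np.full((4, 5), 3000.0)
depth[1, 1], depth[1, 2], depth[2, 1], depth[2, 2] = 990.0, 1000.0, 1010.0, 1000.0
depth[1, 3], depth[2, 3] = 1005.0, 995.0
mask = np.zeros((4, 5), dtype=bool)
mask[1:3, 1:3] = True
extended = extend_mask(depth, mask, bbox=(1, 1, 4, 3))
```

In the toy scene the two leaked pixels in column 3 fall inside the depth bounds and are absorbed into the mask, while the background pixels in the box stay outside it.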
If a dynamic object is far away from the camera, or the segmentation algorithm performs poorly on a certain type of object, the area of the obtained mask is very small, so the depth values in the mask are not enough to represent the depth information of the dynamic object effectively. In this case, we remove the information in the whole bounding box of the dynamic object to eliminate its impact on the local point cloud map, as follows:
$$S^{mask}_{DO(i)} = S^{BBX}_{DO(i)} \quad \text{if} \quad S^{mask}_{DO(i)} < \tau_1 \tag{5}$$
where $S^{mask}_{DO(i)}$ represents the mask area of the $i$-th dynamic object, $S^{BBX}_{DO(i)}$ denotes the area of the bounding box of the $i$-th dynamic object, $i$ is the index of the detected dynamic object, and $\tau_1$ is a preset threshold. In this paper, $\tau_1 = 5000$ pixels.
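The fallback for small masks can be sketched as follows, assuming boolean masks and a bounding box with an exclusive lower-right corner. The helper name `removal_region` and the example sizes are hypothetical; only the threshold value comes from the text.

```python
import numpy as np

TAU_1 = 5000  # minimum mask area in pixels, the value reported in the text

def removal_region(mask, bbox, tau_1=TAU_1):
    """Return the boolean region to delete for one dynamic object.

    If the mask is too small to be trusted, the whole bounding box is used.
    bbox: (x0, y0, x1, y1), lower-right corner exclusive (assumed convention).
    """
    if int(mask.sum()) >= tau_1:
        return mask.copy()
    x0, y0, x1, y1 = bbox
    region = np.zeros_like(mask)
    region[y0:y1, x0:x1] = True
    return region

# A distant person: the mask covers only 200 pixels, so the 40 x 50 box is used.
tiny_mask = np.zeros((480, 640), dtype=bool)
tiny_mask[100:120, 200:210] = True
region = removal_region(tiny_mask, bbox=(190, 90, 230, 140))
```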
We have also noticed that some content of a dynamic object may leak into the environment beyond the bounding box. This leaked information is usually long and narrow in shape. Because the dynamic object moves continuously in the environment, the difference between the depth values at the edge of the dynamic object in two adjacent frames is much larger than that of the environment. We use this property to remove these long and narrow edges. First, the previous depth image is subtracted from the current depth image and the absolute value is taken:
$$d_{sub}(u, v) = \left| d_{cur}(u, v) - d_{pre}(u, v) \right| \tag{6}$$
where $d_{cur}$ and $d_{pre}$ denote the current and previous depth images. If the value of $d_{sub}(u, v)$ is large, there are two possibilities for pixel $(u, v)$. One is that it is a point at the edge of an object in the environment. This type of point can be removed, because the process of constructing the global point cloud map contains a lot of redundant information, and the large changes in depth at object edges are the cause of the obvious layering at object edges in the obtained point cloud. The other is that it is a point of the dynamic object that leaked into the environment. In this paper, the current frame is processed as follows:
$$d_{cur}(u, v) = 0 \quad \text{if} \quad d_{sub}(u, v) > \tau_2 \tag{7}$$
where $\tau_2$ is a preset threshold; in this paper, $\tau_2 = 5000$. The dynamic object mask extension procedure is shown in Algorithm 1.
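The frame-differencing step can be sketched as below, assuming 16-bit depth images in the raw sensor units used in the text. The function name `remove_moving_edges` and the sample values are hypothetical. The cast to a signed type is an implementation detail added here so that subtracting unsigned images cannot wrap around.

```python
import numpy as np

TAU_2 = 5000  # depth-difference threshold, the value reported in the text

def remove_moving_edges(depth_cur, depth_prev, tau_2=TAU_2):
    """Zero out pixels of the current depth image whose depth changed sharply."""
    # Signed arithmetic avoids wrap-around when subtracting uint16 images.
    d_sub = np.abs(depth_cur.astype(np.int64) - depth_prev.astype(np.int64))
    cleaned = depth_cur.copy()
    cleaned[d_sub > tau_2] = 0
    return cleaned

depth_prev = np.array([[5000, 5000],
                       [5000, 5000]], dtype=np.uint16)
depth_cur = np.array([[5000, 12000],
                      [5100, 5000]], dtype=np.uint16)
cleaned = remove_moving_edges(depth_cur, depth_prev)
```

Only the pixel whose depth jumped by 7000 is cleared; the pixel that changed by 100 is kept.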

B. INTERACTION JUDGMENT BETWEEN POTENTIAL DYNAMIC OBJECT AND DYNAMIC OBJECT
The state of a potential dynamic object is determined by whether a dynamic object interacts with it. For example, when someone adjusts the position of a chair or sits on it, the chair should be regarded as a dynamic object at that time. If the chair stands isolated in the environment, it can be considered a static object. When a chair is a dynamic object, the camera cannot use the matching points on it for pose estimation, and its information should be removed from the constructed point cloud map. When a chair is a static object, the matching points detected on it can take part in camera pose estimation, and its information is an essential part of the point cloud map. It is therefore necessary to judge the status of each potential dynamic object. People are the main dynamic objects in indoor environments, so we consider a potential dynamic object to be dynamic when it is in contact with a person. We use the bounding boxes and the depth information within the masks of the person and the potential dynamic object to judge the status of the potential dynamic object. The flow diagram is shown in Fig. 4.
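First, we label the people and the potential dynamic objects in the image and get the corresponding labels, such as P(i) for a person and C(j) for a chair. We then save the label groups {P(i), C(j)} whose bounding boxes intersect with each other.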
Taking the case shown in Fig. 5 as an example, the dynamic objects in Fig. 5(a) are two persons, marked by green boxes, while a swivel chair and an ordinary chair are the potential dynamic objects, marked by red boxes. The labels of the sitting person and the walking person are P(1) and P(2), respectively. C(1) and C(2) represent the swivel chair and the ordinary chair. It is obvious in Fig. 5(a) that the bounding box of the sitting person intersects with that of the swivel chair, and the bounding box of the walking person intersects with that of the ordinary chair. That is, the label groups that need to be saved are {P(1), C(1)} and {P(2), C(2)}.
Next, the depth values within the masks of the two people and the two chairs are counted. For ease of observation, we display the histograms of the depth values of the two people and the two chairs on one graph, as shown in Fig. 5(b). Fig. 5(b) shows that the depth interval of P(1) intersects with those of C(1) and C(2). Combined with the label groups saved in the previous step, the only label group that actually needs to be saved is {P(1), C(1)}, that is, P(1) interacts with C(1).
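The two-stage judgment (bounding boxes first, depth intervals second) can be sketched in plain Python. The dictionary layout, the function names, and all coordinates and depth intervals below are hypothetical values chosen to mimic the situation of Fig. 5; they are not measurements from the paper.

```python
def boxes_intersect(a, b):
    """Axis-aligned boxes (x0, y0, x1, y1) overlap in both x and y."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def intervals_intersect(a, b):
    """Closed depth intervals (d_min, d_max) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def interacting_pairs(people, objects):
    """Keep (person, object) label groups that pass both checks."""
    pairs = []
    for p_label, p in people.items():
        for o_label, o in objects.items():
            if (boxes_intersect(p["bbox"], o["bbox"])
                    and intervals_intersect(p["depth"], o["depth"])):
                pairs.append((p_label, o_label))
    return pairs

# Hypothetical numbers: P(2) overlaps C(2) in the image but not in depth,
# and P(1) overlaps C(2) in depth but not in the image.
people = {
    "P(1)": {"bbox": (100, 100, 300, 400), "depth": (7000, 9000)},
    "P(2)": {"bbox": (350, 50, 500, 450), "depth": (12000, 14000)},
}
objects = {
    "C(1)": {"bbox": (150, 250, 320, 450), "depth": (8000, 10000)},
    "C(2)": {"bbox": (450, 300, 600, 470), "depth": (8500, 9500)},
}
pairs = interacting_pairs(people, objects)
```

Requiring both tests means that an object which merely appears behind or in front of a person in the image is not treated as being in contact with that person.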

C. BOUNDING BOX TRACKING
In some cases, BlitzNet cannot effectively detect the dynamic object, either because the dynamic object is too small or because only part of it appears in the image, as shown in Fig. 6. Therefore, it is necessary to re-detect the dynamic objects when a detection is missed, that is, to track the bounding boxes of the dynamic objects.
Based on the assumption that the moving speed of the dynamic object relative to the camera is constant over a short period of time, this paper proposes a bidirectional search strategy to track the bounding box of the dynamic object. $D^{K_t}_{indicator} \in \{0, 1\}$ indicates whether BlitzNet detects a dynamic object in frame $K_t$. $D^{K_t}_{indicator} = 0$ indicates that no dynamic object is detected in the frame, and $D^{K_t}_{indicator} = 1$ denotes that the dynamic object $D_{target}$ is detected in frame $K_t$. Let $BBX^{K_t} = \left( x^{K_t}_{tlc}, y^{K_t}_{tlc}, x^{K_t}_{lrc}, y^{K_t}_{lrc} \right)$ represent the bounding box of the dynamic object $D_{target}$ in frame $K_t$, where $x^{K_t}_{tlc}, y^{K_t}_{tlc}$ are the coordinates of the upper left corner of the bounding box, and $x^{K_t}_{lrc}, y^{K_t}_{lrc}$ are those of the lower right corner. As shown in Fig. 7, the red rectangle represents the current frame $K_i$. When no dynamic object is detected in $K_i$, that is, $D^{K_i}_{indicator} = 0$, the 3 previous frames and the 3 later frames are searched. The bounding box of the dynamic object in the current frame $K_i$ can be tracked using the values recorded in the indicators of these 6 frames. Whether there is a dynamic object in the current frame, and if so, what its bounding box is, are determined as follows:
1) When the dynamic object is detected both in the previous 3 frames and in the later 3 frames, the dynamic object is considered to exist in the current frame $K_i$, and its bounding box can be obtained by
$$BBX^{K_i} = BBX^{K_{i-n_p}} + \frac{n_p}{n_p + n_l} \left( BBX^{K_{i+n_l}} - BBX^{K_{i-n_p}} \right) \tag{10}$$
where $K_{i-n_p}$ and $K_{i+n_l}$ are the previous frame and the later frame closest to $K_i$ in which the same dynamic object is detected, with $n_p \le 3$ and $n_l \le 3$.
2) When the dynamic object is detected only in the previous 3 frames or only in the later 3 frames, the dynamic object is considered to exist in the current frame $K_i$, and the bounding box of the dynamic object in the frame closest to $K_i$ is used as the bounding box in $K_i$.
3) When BlitzNet does not detect the dynamic object in any of these 6 frames, we consider that there is no dynamic object in the current frame $K_i$.
The proposed bounding box tracking algorithm can effectively track the bounding box of the dynamic object missed by BlitzNet in Fig. 6. The results are shown in Fig. 8.
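The three cases of the bidirectional search can be sketched in plain Python. The sketch assumes a single tracked object, a list of per-frame detections in which `None` marks a missed detection, and linear interpolation between the two nearest detections as the consequence of the constant-speed assumption; the function name `track_bbox` and the sample boxes are hypothetical.

```python
def track_bbox(detections, i, window=3):
    """Recover the bounding box of frame i from up to `window` frames on each side.

    detections[k] is a box (x_tl, y_tl, x_lr, y_lr) or None if nothing was detected.
    Returns a box, or None when no neighbouring frame contains the object.
    """
    if detections[i] is not None:
        return detections[i]
    prev = later = None
    for n in range(1, window + 1):
        if prev is None and i - n >= 0 and detections[i - n] is not None:
            prev = (n, detections[i - n])        # closest earlier detection
        if later is None and i + n < len(detections) and detections[i + n] is not None:
            later = (n, detections[i + n])       # closest later detection
    if prev and later:
        # Case 1: constant relative speed, so interpolate linearly in time.
        (n_p, box_p), (n_l, box_l) = prev, later
        w = n_p / (n_p + n_l)
        return tuple(p + w * (l - p) for p, l in zip(box_p, box_l))
    if prev or later:
        # Case 2: reuse the box of the closest frame on the only available side.
        return (prev or later)[1]
    return None                                  # Case 3: no dynamic object

# The object moves right by 10 pixels per frame; frames 1 and 2 are missed.
detections = [(0, 0, 10, 10), None, None, (30, 0, 40, 10)]
box_1 = track_bbox(detections, 1)
box_2 = track_bbox(detections, 2)
```

Note that the search looks at later frames, so in an online system the decision for frame K_i is only available after the following frames have been processed.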

D. GEOMETRIC SEGMENTATION OF DEPTH IMAGE
After removing the dynamic objects, we found that some of their information is still left in the environment. To remove this residual information, geometric segmentation is performed on the depth image. The residual information usually appears as small isolated patches in the segmented depth image, which can be removed by a simple morphological operation. In the depth image, the depth is not continuous at the junction between different objects, that is, the depth value changes sharply between an object and the background. Based on this property, the segmentation edges of the depth image can be placed at the depth discontinuities.
Our segmentation method for the depth image is as follows. We traverse the depth image with a sliding window of size 2 × 2. The image coordinates of the pixel in the upper left corner of the window are $(u, v)$, and the depth values in the window are recorded as
$$D_s = \{ d(u, v),\ d(u+1, v),\ d(u, v+1),\ d(u+1, v+1) \} \tag{11}$$
where $d$ represents the depth image. The depth image can be segmented quickly by
$$d(u, v) = 0 \quad \text{if} \quad \max(D_s) - \min(D_s) > \tau_3 \tag{12}$$
where $\tau_3$ is a preset threshold; in this paper, $\tau_3 = 500$. Area statistics are then computed for the image patches with depth information in the segmented depth image to obtain $S_{patch(i)}$, $i = 1, \ldots, m$, where $m$ is the number of image patches. The islands formed by the residual information of the dynamic objects can be removed as follows:
$$d(u, v) = 0,\ \forall (u, v) \in patch(i) \quad \text{if} \quad S_{patch(i)} < \tau_4 \tag{13}$$
where $\tau_4$ is a preset threshold; in this paper, $\tau_4 = 1000$ pixels.
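A minimal sketch of the segmentation and island removal follows. It assumes that a window is cut when the spread of its four depth values exceeds the threshold and that patches are 4-connected; the text specifies neither detail, so both are illustrative choices, as are the function names and the toy image.

```python
from collections import deque
import numpy as np

def segment_depth(depth, tau_3=500):
    """Zero the top-left pixel of every 2 x 2 window that spans a depth jump."""
    d = depth.astype(np.int64)
    # The four corners of all 2 x 2 windows, stacked along a new axis.
    win = np.stack([d[:-1, :-1], d[:-1, 1:], d[1:, :-1], d[1:, 1:]])
    jump = (win.max(axis=0) - win.min(axis=0)) > tau_3
    seg = depth.copy()
    seg[:-1, :-1][jump] = 0
    return seg

def remove_islands(seg, tau_4=1000):
    """Delete 4-connected patches of valid depth smaller than tau_4 pixels."""
    h, w = seg.shape
    valid = seg > 0
    seen = np.zeros((h, w), dtype=bool)
    out = seg.copy()
    for sy in range(h):
        for sx in range(w):
            if not valid[sy, sx] or seen[sy, sx]:
                continue
            queue, patch = deque([(sy, sx)]), []
            seen[sy, sx] = True
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                patch.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and valid[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(patch) < tau_4:
                ys, xs = zip(*patch)
                out[list(ys), list(xs)] = 0
    return out

# Toy image: background at 5000 with a small leftover blob at 1000.
depth = np.full((6, 8), 5000, dtype=np.uint16)
depth[2:4, 5:7] = 1000
cleaned = remove_islands(segment_depth(depth), tau_4=10)
```

The segmentation cuts a ring of zeros around the blob, which turns it into an isolated island that the area test then deletes. The pure-Python flood fill is only meant to keep the sketch dependency-free; a connected-component routine from an image library would normally be used for full-size images.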

E. FEATURE POINTS WITH STABLE DEPTH VALUES
Assume that we get two sets of matching points $A = \{P_{a1}, \ldots, P_{an}\}$ and $B = \{P_{b1}, \ldots, P_{bn}\}$. The external parameter matrix of the camera, composed of a rotation $R$ and a translation $t$, can be obtained by solving the following least squares problem:
$$\min_{R, t} \frac{1}{2} \sum_{i=1}^{n} \left\| P_{ai} - \left( R P_{bi} + t \right) \right\|_2^2 \tag{14}$$
In two adjacent depth frames, there are regions where the depth values are missing, and the depth values of these regions are 0. Matching points in these regions cannot provide any useful information for solving ICP (Iterative Closest Point). In addition, the depth values of feature points at the edges of objects change suddenly, which directly affects the solution of (14). The depth values of some matching points on the dynamic targets also change greatly. In this paper, we solve (14) by using matching points with stable depth values.
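For matched 3D point pairs, this kind of least squares problem has a well-known closed-form solution based on the singular value decomposition (the Kabsch method). The sketch below uses that standard technique for illustration; the text does not state which solver the system actually uses, and the function name and synthetic data are assumptions.

```python
import numpy as np

def solve_rigid_transform(A, B):
    """Find R, t minimising sum_i ||A_i - (R @ B_i + t)||^2 for N x 3 arrays."""
    centroid_a, centroid_b = A.mean(axis=0), B.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (B - centroid_b).T @ (A - centroid_a)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = centroid_a - R @ centroid_b
    return R, t

# Synthetic check: rotate by 30 degrees about z and translate.
rng = np.random.default_rng(0)
B_pts = rng.normal(size=(10, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])
A_pts = B_pts @ R_true.T + t_true
R_est, t_est = solve_rigid_transform(A_pts, B_pts)
```

Every matched pair enters the sum with equal weight, so a single pair with a missing or jumping depth value can pull the estimate away from the true pose, which motivates restricting the input to depth-stable matches.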
Feature points with stable depth values usually lie on the surfaces of objects, such as the desk baffle region marked by the red dotted frame [56], as shown in Fig. 9. The red cross on the RGB image represents the location of a detected feature point. An image block of size 3 × 3 centered on $(i, j)$ is taken from the depth image for later processing, where $(i, j)$ are the integer pixel coordinates of the feature point.
First, we check whether any depth value is missing in the image block corresponding to each feature point. If the image block contains a pixel with a depth value of 0, the corresponding feature point is considered to lie in a region where the depth value is missing, and the feature point is deleted. Fig. 10 is a detailed view of some of the regions in Fig. 9 that contain feature points with missing depth values. The figure shows that some parts of the computer keyboard region lose their depth values. Although the roof region contains a lot of texture information, the depth values of the feature points in this region are completely missing, since the region is far from the camera and exceeds the effective measurement range of the depth camera.
Next, we calculate the standard deviation of the depth values in the image block corresponding to each retained feature point. The standard deviation of the depth values is
$$\sigma = \sqrt{ \frac{1}{9} \sum_{p=-1}^{1} \sum_{q=-1}^{1} \left( d(i+p, j+q) - \bar{d} \right)^2 } \tag{15}$$
where $\bar{d}$ is the mean of the nine depth values in the block. In the course of the experiments, we found that the number of image blocks with sudden changes in depth values is usually much smaller than the number with stable depth values. All the standard deviations obtained are stored in a sequence, and the feature points corresponding to the outliers in the sequence are eliminated. An outlier is defined as a value that is more than three scaled MAD (median absolute deviations) away from the median. As shown in Fig. 11, the red asterisks represent the outliers that need to be rejected. Fig. 12 is a detailed view of some of the regions in Fig. 9 that contain feature points with sudden changes in depth values. These regions are mainly distributed at the edges of objects. The depth values obtained by the depth camera on the surfaces of distant objects are also inaccurate. In fact, the accuracy of the depth value obtained by the depth camera decreases as the distance between the target and the camera increases: the closer the target, the more accurate the depth value. In the process of constructing the point cloud map, we discard points whose depth value exceeds 30000 (6 m).
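The two filtering steps can be sketched as follows. The sketch assumes the population standard deviation over the nine pixels, the usual scale factor 1.4826 for the scaled MAD, and that feature points whose 3 × 3 block leaves the image are treated like points with missing depth; the function name, label strings, and toy image are hypothetical.

```python
import numpy as np

def classify_feature_depth(depth, keypoints):
    """Label each (row, col) keypoint as 'missing', 'unstable' or 'stable'."""
    h, w = depth.shape
    labels = ["missing"] * len(keypoints)
    kept, stds = [], []
    for k, (i, j) in enumerate(keypoints):
        if not (1 <= i < h - 1 and 1 <= j < w - 1):
            continue                            # block would leave the image
        block = depth[i - 1:i + 2, j - 1:j + 2].astype(np.float64)
        if np.any(block == 0):
            continue                            # step 1: missing depth
        kept.append(k)
        stds.append(block.std())                # step 2: spread of the block
    if not kept:
        return labels
    stds = np.array(stds)
    med = np.median(stds)
    scaled_mad = 1.4826 * np.median(np.abs(stds - med))
    for k, s in zip(kept, stds):
        # Outlier: more than three scaled MAD away from the median.
        labels[k] = "unstable" if abs(s - med) > 3 * scaled_mad else "stable"
    return labels

# Toy depth image: gentle slopes of different steepness, one depth edge,
# and one hole with missing depth.
depth = np.full((12, 12), 5000.0)
for (ci, cj), slope in {(2, 2): 1.0, (2, 6): 2.0, (6, 2): 3.0, (6, 6): 4.0}.items():
    depth[ci - 1:ci + 2, cj - 1:cj + 2] += slope * np.array([-1.0, 0.0, 1.0])
depth[:, 10:] = 9000.0        # depth edge next to column 9
depth[10, 2] = 0.0            # missing measurement
keypoints = [(2, 2), (2, 6), (6, 2), (6, 6), (2, 9), (10, 2)]
labels = classify_feature_depth(depth, keypoints)
```

Using the median and MAD rather than the mean and standard deviation keeps the threshold from being inflated by the very edge points it is meant to reject.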
After the above two steps, the feature points with stable depth values are obtained. Fig. 13 displays the three types of feature points on the RGB and depth images: green points are feature points with stable depth values, red points are feature points with missing depth values, and blue points are feature points with sudden changes in depth values. When matching feature points between a pair of images, we use only those with stable depth values. It can be seen from Fig. 13 that a large number of green points lie on the walking person, which is the main reason for the large error in the camera pose estimation.

F. LOCATION OF STATIC MATCHING POINTS
After obtaining the dynamic objects, we use their bounding boxes to segment the image quickly into dynamic regions and environment regions. After feature matching, the feature points in the image can be divided into four groups: the inlier set $P_E^I$ in the environment regions, the outlier set $P_E^O$ in the environment regions, the dynamic point set $P_D^D$ in the dynamic regions, and the static point set $P_D^S$ in the dynamic regions. After matching the feature points in the environment regions of the previous frame and the current frame, $P_E^O$ can be effectively removed by the RANSAC algorithm, and the fundamental matrix $F$ between the two adjacent frames can be calculated from $P_E^I$. By matching the feature points in the dynamic regions of the previous frame and the current frame, the matching points $P_D^P = [u_D^P, v_D^P, 1]^T$ and $P_D^C = [u_D^C, v_D^C, 1]^T$ are obtained. The distance from a matching point to the corresponding epipolar line can be calculated by
$$d = \frac{\left| (P_D^C)^T F P_D^P \right|}{\sqrt{l_x^2 + l_y^2}},$$
where $l_x$ and $l_y$ are obtained from
$$[l_x, l_y, l_z]^T = F P_D^P.$$
Each group of matching points in the dynamic regions yields a distance $d_i$, where $i$ is the serial number of the group, and the set to which the $i$-th group of matching points belongs is determined by
$$(P_{D,i}^P, P_{D,i}^C) \in \begin{cases} P_D^S, & d_i < \tau_5 \\ P_D^D, & d_i \ge \tau_5 \end{cases}$$
where $\tau_5$ is a preset threshold; in this paper $\tau_5 = 0.5$. The groups of matching points that belong to $P_D^D$ are discarded directly, and those that belong to $P_D^S$ but do not lie on the dynamic object mask can participate in camera pose estimation.
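The epipolar test on the matches in the dynamic regions can be sketched as follows. The sketch assumes the standard point-to-epipolar-line distance, homogeneous pixel coordinates (u, v, 1), and a fundamental matrix that maps a point of the previous frame to its epipolar line in the current frame; the function names and the use of a strict inequality at the threshold are assumptions made for illustration.

```python
import math


def epipolar_distance(F, p_prev, p_curr):
    """Distance from p_curr to the epipolar line F @ p_prev in the current image.

    F is a 3x3 nested list; p_prev and p_curr are homogeneous points (u, v, 1).
    """
    # Epipolar line (l_x, l_y, l_z) induced by the point of the previous frame.
    line = [sum(F[r][c] * p_prev[c] for c in range(3)) for r in range(3)]
    norm = math.hypot(line[0], line[1])
    if norm == 0.0:
        return math.inf  # degenerate line: treat the match as unreliable
    return abs(sum(p_curr[r] * line[r] for r in range(3))) / norm


def classify_matches(F, matches, tau=0.5):
    """Split (p_prev, p_curr) matches into static and dynamic groups."""
    static, dynamic = [], []
    for p_prev, p_curr in matches:
        if epipolar_distance(F, p_prev, p_curr) < tau:
            static.append((p_prev, p_curr))
        else:
            dynamic.append((p_prev, p_curr))
    return static, dynamic
```

A match on a truly static point satisfies the epipolar constraint up to noise, so its distance stays below the threshold, whereas a point that moved between the two frames generally drifts away from its epipolar line.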

IV. EXPERIMENTAL RESULTS
Our system adopts ORB-SLAM2 [5], one of the most outstanding feature-based SLAM systems, as the global SLAM solution. Dyna-SLAM [34] and DS-SLAM [39], the two best solutions for SLAM in highly dynamic environments, are both built on ORB-SLAM2.
In this section, we compare the proposed system with ORB-SLAM2, Dyna-SLAM and DS-SLAM on five sequences selected from the TUM RGB-D dataset. They comprise four walking sequences, which are the main focus of our experiments, and one sitting sequence, which serves as the reference group.
In the walking sequences, two persons walk back and forth in the scene, occasionally sitting on the chairs, talking and gesturing, so they can be regarded as highly dynamic objects. The walking sequences are divided into four groups according to the motion of the camera: halfsphere, rpy, static and xyz. Halfsphere means that the camera moved along a halfsphere-like trajectory; rpy means that the camera rotated about the roll, pitch and yaw axes; static means that the camera was held roughly in place by hand; xyz means that the camera moved along the x, y and z axes. For convenience, we use fr3/w/half, fr3/w/rpy, fr3/w/static and fr3/w/xyz to denote the four walking sequences. In the sitting sequences, the two persons moved only slightly relative to the environment, sitting on chairs, chatting and gesturing most of the time. In this paper, we choose fr3/s/static as the reference group.

A. EVALUATION OF THE CAMERA LOCATION
The Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics are used for quantitative comparison, and the experimental results are shown in Tables 1-3. The tables report the Root Mean Square Error (RMSE) and the Standard Deviation (S.D.); RMSE measures the deviation between the observed value and the true value, and S.D. reflects the dispersion of the errors as a whole. The two values indicate the robustness and stability of SLAM systems, respectively.
Table 1 gives the results of the ATE. ORB-SLAM2 cannot handle highly dynamic scenes effectively, and the other three systems improve greatly on it. Our proposed system achieved the best results on fr3/w/half and fr3/w/xyz, and its results on fr3/w/rpy and fr3/w/static are close to those of Dyna-SLAM.
Table 2 presents the results of the translational RPE. On fr3/w/half and fr3/w/xyz, the results of our system are the best. On fr3/w/static, the results of our system, DS-SLAM and Dyna-SLAM are very close: Dyna-SLAM achieved the best RMSE and DS-SLAM the best S.D., while the RMSE of DS-SLAM is the same as that of our system.
Table 3 provides the results of the rotational RPE. Our system obtained the best results on fr3/w/half and fr3/w/xyz. Dyna-SLAM achieved the best results on fr3/w/rpy, but our system is better than DS-SLAM and ORB-SLAM2. It should be noted that the RMSE values of the three dynamic SLAM systems on fr3/w/rpy show no obvious improvement. On fr3/w/static, Dyna-SLAM achieved the best RMSE and our system the best S.D., and the RMSE of our system is better than that of DS-SLAM; in fact, the results of the three systems are very close to each other. On fr3/w/xyz, our system achieved the best RMSE and DS-SLAM the best S.D.
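For reference, the two statistics reported for the ATE can be computed as in the following sketch. It assumes that the estimated and ground-truth trajectories have already been associated by timestamp and aligned into a common frame, that positions are given as (x, y, z) tuples, and that the population standard deviation is used; the function name is hypothetical and the sketch does not reproduce the evaluation scripts used in the experiments.

```python
import math
import statistics


def ate_statistics(estimated, ground_truth):
    """RMSE and standard deviation of the per-pose translational errors.

    Both inputs are equally long lists of (x, y, z) positions that are
    assumed to be time-associated and already aligned.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between each estimated position and its ground truth.
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    sd = statistics.pstdev(errors)
    return rmse, sd
```

RMSE penalizes large individual errors more strongly than a plain mean, while S.D. shows how evenly the error is spread along the trajectory.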
According to the results on fr3/w/rpy, it can be inferred that rotation estimation is especially challenging for SLAM systems in a highly dynamic environment when the camera rotates about the roll, pitch and yaw axes. As can be seen from Tables 1-3, the results of the three dynamic SLAM systems on fr3/s/static differ little from those of ORB-SLAM2, so we conclude that ORB-SLAM2 can handle the camera localization problem well in a low-dynamic environment.
Fig. 14 shows the estimated trajectories of ORB-SLAM2, Dyna-SLAM, DS-SLAM and our system compared with the ground truth. As can be seen from the first row of images, in highly dynamic environments the trajectories generated by ORB-SLAM2 have large errors compared with the real trajectories. Dyna-SLAM, DS-SLAM and our system achieve good results compared with ORB-SLAM2. On fr3/w/half, fr3/w/rpy and fr3/w/static, the trajectories generated by Dyna-SLAM are less complete than those of the other three SLAM systems, as shown in the second row. Table 4 gives the number of successfully tracked trajectory points for the four SLAM systems. As Table 4 shows, our system tracked the same number of trajectory points as DS-SLAM on all five sequences.

B. EVALUATION OF THE GLOBAL POINT CLOUD MAP
First, we show the global point cloud maps constructed by the four SLAM systems in a highly dynamic environment, taking the maps obtained on fr3/w/xyz as an example (Fig. 15).
From the front view of the global point cloud map obtained by ORB-SLAM2, we can see that traces of the two persons remain in the map, and that other objects in the environment, such as the table, TV monitors and chairs, are obscured by these smears. The top view shows that the plank of the table is twisted. The reason for this phenomenon is that the camera pose estimate has a large error, so the points on the plank are mapped to incorrect positions when the point cloud map is constructed. The map is so chaotic that it cannot be used for robot navigation or human-computer interaction.
The camera pose estimation accuracy is greatly improved after removing the interference of the dynamic objects, so compared with ORB-SLAM2, the quality of the global point cloud maps constructed by Dyna-SLAM, DS-SLAM and our system is greatly improved. In the global point cloud maps constructed by these three dynamic SLAM systems, we can clearly see the chairs, screens and other targets in the environment.
However, as can be seen from the images in the second and third columns, because Dyna-SLAM and DS-SLAM have no operation for removing noise blocks, the information that the two people leak into the environment remains in their global point cloud maps. The number of noise blocks is directly related to the masks produced by the dynamic object segmentation algorithms that the two systems use. The more information the dynamic object mask contains, the less information the dynamic object leaks into the environment. The noise blocks in the global point cloud map of Dyna-SLAM are mainly slender edges, while those of DS-SLAM are coarser.
As can be seen from the fourth column of images, after the noise block removal operations, the information leaked by the two people into the environment has been effectively removed from our global point cloud map. That is, our SLAM system can construct a clean and accurate global point cloud map in a highly dynamic environment.
Then we show the global point cloud maps constructed by the four SLAM systems in a low-dynamic environment, taking the maps obtained on fr3/s/static as an example (Fig. 16).
As can be seen from the first column of images, although the global point cloud map of ORB-SLAM2 retains the information of the two people, the objects in the environment are clearly visible and the plank of the table is not distorted. As concluded in Section IV-A, the camera pose obtained in the low-dynamic environment is relatively accurate, so most of the points are mapped to the correct positions in the reference coordinate system of the global point cloud map. That is, the quality of the global point cloud map obtained by a SLAM system in a low-dynamic environment depends on whether the noise blocks are effectively removed.
In the sitting sequence, the two people sit on the chairs all the time. From the left view of the global point cloud maps of Dyna-SLAM and DS-SLAM, we can clearly see the body contours of the two people. The fourth column of images shows that the noise blocks in the global point cloud map of our system are completely removed. That is, our system can deal effectively with map construction in a low-dynamic environment.
When a robot uses the constructed map to navigate or to interact with the environment, a large number of noise blocks in the map will inevitably have an adverse impact on the robot's decisions. Comparing the global point cloud maps constructed by the four SLAM systems in highly dynamic and low-dynamic environments, we can see that the maps of our SLAM system have advantages over those of the other three systems.

V. CONCLUSION
In this paper, we proposed a semantic SLAM system with a more accurate point cloud map in dynamic environments. The bounding boxes and masks of the potential dynamic objects are obtained with BlitzNet, and the image is quickly divided into environment regions and dynamic regions by the bounding boxes. We introduce a novel statistical method of depth analysis to remove the noise blocks formed by the dynamic objects as well as the islands generated by geometric segmentation. We construct an epipolar constraint from the depth-stable matching points in the environment regions and use it to locate the static matching points in the dynamic regions. The experimental results on five sequences of the TUM RGB-D dataset demonstrate that our method can eliminate the influence of the dynamic objects effectively. Comparisons with ORB-SLAM2, Dyna-SLAM and DS-SLAM show that our method has certain advantages in the accuracy of camera pose estimation and the integrity of the trajectory. To our knowledge, the global point cloud map constructed by our method looks the best among the maps built by the existing dynamic SLAM systems. Our system can effectively remove noise blocks from global point cloud maps in both highly dynamic and low-dynamic environments, which is its main advantage.
However, the proposed method has some shortcomings. Firstly, the potential dynamic objects are specified in advance based on life experience. If an unknown dynamic object occupies most of the camera's field of view, the system will regard the object as part of the static environment regions, which causes errors in the estimated camera pose and trajectory. Secondly, the semantic information provided by BlitzNet is not fully utilized. Finally, we did not study the specific motion state of the dynamic objects in the environment.
In view of these problems, our future work includes the handling of unknown dynamic objects and the construction of a semantic map. In addition, a robot is likely to collide with dynamic objects when exploring an unknown environment. Therefore, we need to further study the motion of the dynamic objects in the environment in order to provide safe navigation routes for the robot.
QICHI ZHANG received the B.S. degree in electronics and information engineering from Shenzhen University, China, in 2017. He is currently pursuing the M.S. degree in electronics and communication engineering with the School of Artificial Intelligence, Xidian University, Xi'an, China. His research interests include computer vision, simultaneous localization and mapping, and machine learning, with a focus on dynamic SLAM.