SIIS-SLAM: A Vision SLAM Based on Sequential Image Instance Segmentation

Simultaneous localization and mapping (SLAM) is a fundamental function of intelligent robots. To reduce the influence of dynamic objects on SLAM in dynamic environments, this study proposes a visual SLAM system based on sequential image instance segmentation, referred to as SIIS-SLAM. Built on ORB-SLAM3, SIIS-SLAM integrates a sequential image instance segmentation module and an optical flow dynamic detection module. The sequential image segmentation module is designed to eliminate the influence of dynamic objects on the estimation of the relative pose between sequential frames. Specifically, based on the coarse relative pose estimated by ORB-SLAM3 and the box coordinates of instances detected by Mask R-CNN, the sequential image segmentation module effectively improves the speed and accuracy of instance segmentation. Dynamic objects can then be detected by combining the instance segmentation results with the optical flow module, and filtering out the feature points on dynamic objects improves the accuracy and robustness of SLAM. Experimental results demonstrate that SIIS-SLAM achieves better accuracy in dynamic environments than ORB-SLAM3 and other advanced methods.


I. INTRODUCTION
Simultaneous localization and mapping (SLAM) is a research topic that has attracted considerable attention in recent years. Its main purpose is the detection of a set of corresponding features from multiple images, while concurrently estimating the camera position and pose, and the three-dimensional layout of the scene [1].
With the continuous reduction in the power requirements and cost of various sensors, SLAM is being studied increasingly widely. Researchers have attempted to solve common problems, including scale uncertainty, difficulty in initial positioning, and continuous drift accumulation in the SLAM process, from different perspectives, such as joint detection with multi-source sensors and combining traditional and deep learning methods. In general, visual SLAM has attracted widespread interest because of its cost effectiveness. Many excellent visual SLAM frameworks have been established, comprising sensor information reading, front-end visual odometry (VO), back-end state estimation, loop-closure detection, and mapping. Meanwhile, several advanced algorithms, such as the ORB-SLAM series [2]-[4], LSD-SLAM [5], and MonoSLAM [6], have achieved remarkable performance.
However, certain problems persist in many existing frameworks and algorithms. Several algorithms focus on the geometric relationship between images acquired at different times and places but ignore the semantic information of the key objects in the environment, which is not conducive to understanding the environment. In addition, many algorithms achieve excellent localization and mapping results only in a static environment; when an unknown number and type of dynamic objects appear, the localization and mapping accuracy is significantly reduced. To solve these two problems, many researchers have integrated deep learning methods into SLAM research, in the forms of semantic SLAM and dynamic SLAM. Semantic SLAM, which adds semantic information to the SLAM process to improve accuracy and optimize the localization and mapping results by providing a high-level understanding of the environment, exhibits superior performance and task-driven perception characteristics. For example, Pop-up SLAM [7], based on monocular vision, has proven that semantic information can improve the accuracy of state estimation and dense reconstruction in low-texture environments. In addition, [8]-[10] show that incorporating semantic information in SLAM can improve the accuracy of localization and mapping and can serve higher-level tasks as well. Dynamic SLAM, a SLAM method suited to dynamic factors in the environment, is realized by incorporating dynamic detection, object recognition, and dynamic object tracking into the SLAM process. For example, DS-SLAM [11] combines a semantic segmentation network with a moving consistency check to reduce the impact of dynamic objects, thereby significantly improving localization accuracy in dynamic environments. Detect-SLAM [12] combines a SLAM system with a CNN-based object detector to simultaneously improve the accuracy of object detection and localization.
As shown in [13], [14], these methods accomplish localization and mapping in a dynamic environment by detecting dynamic objects and achieve good robustness. However, none of these methods exploit the characteristics of sequential images in the process of object detection or instance segmentation.
To reduce the influence of dynamic objects on SLAM feature extraction in a dynamic environment, the instance segmentation and optical flow modules are integrated into SLAM to deal with dynamic objects. This study builds on the ORB-SLAM3 [4] system, where the sequential segmentation and optical flow modules are applied to isolate dynamic objects and prevent their features from entering the subsequent processing steps. As in the visual object tracking task, it is necessary in visual SLAM to infer the relationship between the objects in each frame. In this study, according to the characteristics of the image sequence, we propose a joint segmentation strategy for sequential images to improve the speed and accuracy of the deep-learning-based instance segmentation method, which is then combined with the optical flow dynamic detection module to determine whether a target is dynamic. An overview of SIIS-SLAM is shown in Fig. 1. Mask R-CNN [15] is first used to predict the position and mask information of the objects in the first key frame of the sequential images. Then, the depth information of the RGB-D images and the camera pose are used to estimate the position of each object in the next frame, which serves as prior information for the instance segmentation of that frame. Finally, the optical flow dynamic detection module is used to determine whether an object is dynamic, and the system can then judge whether the feature points within its mask are used for localization and mapping.
The main contributions of this study can be summarized as follows:
1) Based on the ORB-SLAM3 system, a complete dynamic SLAM system with semantic information is proposed, which reduces the influence of dynamic objects on localization and mapping.
2) A sequential image instance segmentation method based on the improved Mask R-CNN is proposed for the SLAM requirements, which improves the speed and precision of instance segmentation using the characteristics of the sequential image.
3) We combine the sequential image instance segmentation and optical flow dynamic detection module to effectively reduce the influence of dynamic targets in the SLAM process.

II. RELATED WORK
A. IMAGE SEGMENTATION
Instance segmentation is essential for many computer vision applications, and the accuracy and efficiency of segmentation have a cumulative impact on subsequent work. In this study, instance segmentation is introduced into the SLAM system to reduce the influence of dynamic objects. Instance segmentation methods can be divided into two-stage and one-stage methods. Two-stage methods, such as Mask R-CNN [15] and BlendMask [16], first detect objects and then classify each pixel using a classification network to obtain the final segmentation results. In contrast, one-stage methods, such as YOLACT [17], CondInst [18], and SOLOv2 [19], directly predict the instance masks without region proposals. The two-stage methods have an advantage in detection and segmentation accuracy, whereas the one-stage methods have a significant advantage in speed. To improve the accuracy and speed of image segmentation, many strategies have been proposed. Several post-processing methods [20]-[22] have been introduced for instance segmentation, among which PointRend [20] is a module based on an iterative subdivision algorithm, and SegFix [21] is a model-agnostic post-processing scheme that increases the boundary quality of the segmentation result. For sequential images, the geometric relationship between frames can be used to improve the robustness and speed of segmentation. For example, FEELVOS [22] adds semantic information and a local-global matching mechanism to the segmentation task and uses position information as prior information for subsequent frames to improve the accuracy of semantic segmentation.

B. DYNAMIC SLAM
To solve the problem of SLAM in dynamic environments, scholars have proposed many different solutions and achieved notable results. SLAM based on dynamic object recognition first uses an object detection or segmentation algorithm to obtain the position or mask information of each object, and then determines the dynamic or static state of the feature points using conventional methods. These methods avoid using feature points on dynamic objects throughout the SLAM estimation process to reduce the additional impact caused by object movement. For example, DS-SLAM [11] combines a semantic segmentation network with a moving consistency check to reduce the impact of dynamic objects and improve localization accuracy, but the characteristics of sequential images are not used in its semantic segmentation. MaskFusion [23] combines Mask R-CNN and a geometric edge method based on ElasticFusion [24] to recognize, segment, and assign semantic labels to different objects in the scenario. DymSLAM [25] is a dynamic stereo visual SLAM system capable of reconstructing a 4D (3D + time) dynamic scenario with rigid moving objects. DynaSLAM [13], building on ORB-SLAM2, adds the capabilities of dynamic object detection and background inpainting. DynaSLAM II [38] makes use of instance semantic segmentation and ORB features to track dynamic objects; the structure of the static scene and the dynamic objects are optimized jointly with the trajectories of both the camera and the moving agents within a novel bundle adjustment proposal. However, the addition of the multi-view geometry stage introduces an additional slowdown, due mainly to the region growth algorithm, and the background inpainting also introduces a delay. ClusterSLAM [26] uses the 3D motion consistency extracted from rigid bodies to achieve clustering and the estimation of static and dynamic objects.
ClusterVO [27] uses stereo images as input, performs object detection with YOLO for each frame, and utilizes the extracted ORB features to track moving objects. Iqbal et al. [28] incorporate a classifier in the SLAM pipeline to add semantic information to a scenario. Crowd-SLAM [29] adds object detection to ORB-SLAM2 for crowded environments. AcousticFusion [30] fuses the sound source direction into the RGB-D image and thus removes the effect of dynamic obstacles on a multi-robot SLAM system, but its robustness is reduced in the presence of serious noise. DOT [31] combines instance segmentation and multi-view geometry to generate masks for dynamic objects in order to avoid such image areas in its optimizations, which reduces the rate at which segmentation must be performed and reduces the computational cost with respect to the state of the art. FlowFusion [32] decouples dynamic pixels from the static background by checking camera motion consistency, clustering the dynamic pixels, and removing them. DVO-SLAM [37] integrates a novel RGB-D data-based motion removal approach into the front end of RGB-D SLAM to filter out data associated with moving objects. However, its homography estimation degrades when the parallax between consecutive frames is large, and tracking fails when moving objects become motionless. In summary, a method that relies only on object detection or semantic information may not be applicable to unknown scenarios. Conversely, a method that relies only on conventional feature point methods may result in suboptimal efficiency. Therefore, a combination of deep learning and conventional methods is a future developmental trend for SLAM.

III. SYSTEM DESCRIPTION
In this section, we first introduce the framework of the proposed SLAM system and then explain the proposed segmentation method for sequential images. Subsequently, the optical flow dynamic detection module for judging whether an object is dynamic is introduced. On this basis, a strategy for filtering feature points using both instance segmentation and optical flow is proposed. The proposed system is designed based on ORB-SLAM3, which includes tracking, local mapping, loop-closure detection, and overall adjustment optimization. Compared to ORB-SLAM2, ORB-SLAM3 achieves better accuracy and robustness through a fast and accurate IMU (Inertial Measurement Unit) initialization technique and a multi-session map-merging function, and supports a variety of image data. To mitigate the impact of dynamic objects on the localization and mapping accuracy in dynamic scenarios, we combine the sequential image instance segmentation method with the optical flow dynamic detection module: an extra segmentation thread and an optical flow dynamic detection module are designed and integrated into the SLAM system.

A. SYSTEM FRAMEWORK OVERVIEW
As shown in Fig. 2, the proposed system comprises six parts: tracking, sequential instance segmentation, local mapping, loop closing, map merging, and full bundle adjustment (BA). A set of consecutive RGB-D frames and IMU data are first input to the system and processed by the tracking and segmentation threads. In the tracking thread, the ORB features are extracted, after which the IMU data (optional) are integrated to obtain an estimate of the initial pose while waiting for the segmentation result. In the segmentation thread, a sequential image instance segmentation method is proposed to improve the segmentation speed and reduce the accumulation of errors. First, the image sequence is divided into multiple sub-sequences, each containing 10 frames. The first frame of each sub-sequence is input to the original Mask R-CNN for segmentation to obtain the object information, and the subsequent frames are input to the improved Mask R-CNN. That is, in each sub-sequence, the first image is segmented with the original Mask R-CNN, and the other nine images are segmented with the improved Mask R-CNN; this whole process is called sequence segmentation. Specifically, we use the poses of the previous and current frames to estimate the objects' positions in the current frame, which are then input into Mask R-CNN as prior information. We improved the region proposal network (RPN) in Mask R-CNN to generate region proposals within the prior coordinate range instead of over the whole image, which reduces the time consumed in mapping, classification, and regression. The feature points are then filtered using the instance segmentation and optical flow modules. Finally, the static feature points enter the subsequent tracking and mapping processes. The implementation details are given in Section B.
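As a rough illustration, the sub-sequence scheduling described above can be sketched as follows. The functions `full_maskrcnn` and `improved_maskrcnn` are hypothetical placeholders for the original and modified networks, not the actual implementation; each is assumed to return per-frame masks together with the detection boxes that seed the next frame's prior.

```python
SUBSEQ_LEN = 10  # frames per sub-sequence, as described in the text

def segment_sequence(frames, full_maskrcnn, improved_maskrcnn):
    """Segment every frame, re-initializing with the full network at the
    start of each sub-sequence to limit the accumulation of errors."""
    results = []
    prior_boxes = None
    for i, frame in enumerate(frames):
        if i % SUBSEQ_LEN == 0:
            # First frame of a sub-sequence: original Mask R-CNN, no prior.
            masks, boxes = full_maskrcnn(frame)
        else:
            # Remaining nine frames: improved Mask R-CNN, guided by the
            # boxes mapped from the previous frame via the camera pose.
            masks, boxes = improved_maskrcnn(frame, prior_boxes)
        prior_boxes = boxes
        results.append(masks)
    return results
```

The periodic re-initialization is what bounds error accumulation: a drifting prior box can survive at most nine frames before the full network resets it.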
The remaining processes in the system continue to execute the processes and corresponding algorithms of ORB-SLAM3.

B. SEQUENTIAL IMAGE SEGMENTATION METHOD
In the proposed system, the impact of dynamic objects is mitigated by adding the instance segmentation thread. Considering that the SLAM system is designed for practical applications, it is necessary to achieve an optimal balance between speed and accuracy. Mask R-CNN operates at a speed of 195 ms per image, which falls far short of the speed requirement of SLAM for processing sequential images. To achieve an optimal balance between accuracy and speed, we propose a sequential image segmentation method based on Mask R-CNN. We divide the whole image sequence into several sub-sequences and use the characteristics of sequential images to improve the segmentation speed of each frame. In addition, by periodically re-initializing the segmentation result, the accumulation of errors can be reduced effectively.
As depicted in Fig. 3, the sequential image segmentation process can be divided into the following steps:

1) POINT MAPPING
In the sequential images, if the estimated camera poses of the previous and current frames are known, they can be used in conjunction with the point positions in the previous frame to calculate the estimated point positions in the current frame through point mapping. Let P_1 and P_2 represent the same point in the previous frame and current frame, respectively:

P_1 = [u_1, v_1, d_1]^T, P_2 = [u_2, v_2, d_2]^T    (1)

where u and v are the image pixel coordinates and d is the depth information measured by the depth camera. The camera coordinates of the point under the two camera poses can be calculated using the intrinsic parameters of the camera and Eq. (2):

X_Pi = (u_i - x_c) * Z_Pi / f_x,  Y_Pi = (v_i - y_c) * Z_Pi / f_y,  Z_Pi = d_i / depth,  i = 1, 2    (2)

where (x_c, y_c) are the coordinates of the principal point of the camera, f_x and f_y are the focal distances of the camera, and (X_P1, Y_P1, Z_P1) and (X_P2, Y_P2, Z_P2) are the camera coordinates of the point under the two camera poses; depth represents a fixed scale parameter of the depth camera. The relationship between the two camera poses can be expressed by Eq. (3):

[X_P2, Y_P2, Z_P2]^T = R * [X_P1, Y_P1, Z_P1]^T + t    (3)

where R is the rotation matrix and t is the translation vector; both can be obtained from the external parameters of the camera poses in the previous and current frames. Let (x_1, y_1, x_2, y_2) represent the coordinates of the box detected by Mask R-CNN, where (x_1, y_1) and (x_2, y_2) are the pixel coordinates of the upper-left and lower-right corners of the detection box. Based on Eqs. (1)-(3) and (x_1, y_1, x_2, y_2), the object coordinates (x_1', y_1', x_2', y_2') in the current frame can be estimated. These serve as prior information for the subsequent instance segmentation module.
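A minimal sketch of this point-mapping step: back-project a pixel with its depth (Eq. (2)), apply the relative pose (Eq. (3)), and re-project into the current frame. The intrinsics tuple and the depth scale of 5000 (a common convention for 16-bit depth images) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def map_point(u, v, d, K, R, t, depth_scale=5000.0):
    """Map pixel (u, v) with raw depth d from the previous frame into
    the current frame, given intrinsics K = (fx, fy, xc, yc) and the
    relative pose (R, t) between the two camera poses."""
    fx, fy, xc, yc = K
    Z1 = d / depth_scale                      # metric depth (Eq. (2))
    X1 = (u - xc) * Z1 / fx                   # back-projection
    Y1 = (v - yc) * Z1 / fy
    X2, Y2, Z2 = R @ np.array([X1, Y1, Z1]) + t  # pose relation (Eq. (3))
    u2 = fx * X2 / Z2 + xc                    # re-projection
    v2 = fy * Y2 / Z2 + yc
    return u2, v2
```

Mapping the two corners of a detection box with this function yields the prior box for the next frame's segmentation.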

2) IMPROVED MASK R-CNN
Mask R-CNN is a two-stage segmentation method: region proposals are mapped onto the feature map and input to the fully connected layers for classification and regression. Numerous region proposals are generated, and each must be mapped, classified, and regressed. Therefore, the original Mask R-CNN cannot meet the time-efficiency requirements of sequential image processing in SLAM. In this study, we improved the RPN process for generating region proposals. Using the position of the object as prior information, the numbers of initial anchors and region proposals are reduced, and the time consumed in mapping, classification, and regression is reduced as well. Fig. 4(a) shows the structure of the RPN layer. The RPN takes an image as input and outputs a set of rectangular object proposals, each with a score. In the original Mask R-CNN, the RPN anchors span five scales and three aspect ratios. The blue dots in Fig. 4 denote the anchors, which refer to the center points of the current sliding window in the feature map projected onto the original pixel space. All five feature layers (P2-P6) are traversed to select anchor points on each feature layer with a certain step size. Each anchor point generates three anchors and then outputs four regression values and two classification scores, which are used to calculate the cross-entropy loss function for network training. To reduce the number of generated anchors and increase the speed of the network, we exploit two characteristics of potential dynamic objects: on the one hand, the movement range of a dynamic object between frames is limited; on the other hand, the size of the dynamic object in the new frame also changes, given that the camera and the dynamic object move at the same time. In view of these two characteristics, we change the anchor generation rule. Fig. 4(b) shows the anchor generation area (represented by black grids) of feature layers P6 and P5 (owing to the number of grids, feature layers P4, P3, and P2 are not displayed) for an input image sized 512 × 512. In the figure, scale is the parameter ''RPN_ANCHOR_SCALES,'' which determines the size of the generated anchors; specifically, heights = scales/sqrt(ratios) and widths = scales · sqrt(ratios). The new anchor point selection strategy reduces the number of anchors to 1/8 of that of the original network.
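The restricted anchor generation can be sketched as follows. The anchor sizes follow the Mask R-CNN convention stated above (heights = scale/sqrt(ratio), widths = scale · sqrt(ratio)); the dilation margin applied to the prior box, which accommodates the object's limited inter-frame motion and size change, is an assumed parameter rather than a value from the paper.

```python
import numpy as np

def anchors_in_prior(feature_shape, stride, scale, ratios, prior_box, margin=1.25):
    """Generate anchors only at grid centers that fall inside the
    (dilated) prior box, instead of over the whole feature map."""
    x1, y1, x2, y2 = prior_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w = (x2 - x1) * margin          # dilate the prior box to allow for
    h = (y2 - y1) * margin          # inter-frame motion and size change
    anchors = []
    for fy in range(feature_shape[0]):
        for fx in range(feature_shape[1]):
            ax, ay = fx * stride, fy * stride    # anchor center in pixels
            if abs(ax - cx) > w / 2 or abs(ay - cy) > h / 2:
                continue                         # outside the prior region
            for r in ratios:                     # three aspect ratios
                ah = scale / np.sqrt(r)
                aw = scale * np.sqrt(r)
                anchors.append((ax - aw / 2, ay - ah / 2,
                                ax + aw / 2, ay + ah / 2))
    return anchors
```

On an 8 × 8 feature layer with three ratios, full-map generation produces 192 anchors; restricting centers to a prior box covering a fraction of the image cuts this by roughly the factor the paper reports.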

C. OPTICAL FLOW DYNAMIC DETECTION
After object segmentation, the next step is to judge whether the objects are dynamic. The relationship between feature points and objects can be divided into three categories: (i) when the feature points are not included in any object, they can be used for tracking and reconstruction; (ii) when the feature points are contained in a static object, they can be used for tracking and object reconstruction, but not for scene reconstruction; (iii) when the feature points are contained in a dynamic object, they cannot be used for tracking and reconstruction and should be eliminated. We use the optical flow algorithm to analyze the movement of feature points. If more than a certain number of feature points in one object are dynamic, the object is judged to be dynamic. When the texture of a detected target is too weak, the number of feature points on the target is small; in this case, if the number of dynamic feature points exceeds half of the total number of feature points, we judge the target to be dynamic. The first step is to calculate the optical flow pyramid to obtain the movement speeds of the feature points. The average movement speeds of the foreground and background can then be calculated. Specifically, we take the average optical flow velocity of the feature points in the background (excluding the targets being detected) as the average speed of the background. Assuming that N is the number of background feature points, [V_x, V_y] are the velocities of a pixel in the x and y directions, and [E_x, E_y] is the average velocity of the background, the calculation formula is as follows:

E_x = (1/N) Σ_{i=1}^{N} V_{x,i},  E_y = (1/N) Σ_{i=1}^{N} V_{y,i}    (4)

If the difference between the speeds of the object and the background exceeds a threshold value T, the object is judged to be dynamic:

sqrt((V_x - E_x)^2 + (V_y - E_y)^2) > T    (5)

The pyramidal Lucas-Kanade (LK) optical flow algorithm [33] calculates the optical flow at the highest layer, which is then used as the initial value for the next layer to obtain the optical flow of that layer; this procedure is repeated until the optical flow of the last layer is obtained. I^L and J^L are the Lth layers of pyramids I and J, respectively. u = [u_x, u_y]^T is a coordinate in image I; for I^L, the corresponding coordinate is u^L = u/2^L. Let g^L represent the optical flow value propagated from layer L + 1. It is required to find the pixel v^L in image J^L that minimizes the difference value. The optical flow calculation is treated as an optimization problem that minimizes the function in Eq. (6):

ε^L(d^L) = Σ_{x=u_x^L-w_x}^{u_x^L+w_x} Σ_{y=u_y^L-w_y}^{u_y^L+w_y} ( I^L(x, y) - J^L(x + g_x^L + d_x^L, y + g_y^L + d_y^L) )^2    (6)

where ε^L(d^L) is the difference value and w_x and w_y define the neighborhood area. The result is:

d^L = G^{-1} b,  G = Σ [ (I_x^L)^2, I_x^L I_y^L ; I_x^L I_y^L, (I_y^L)^2 ],  b = Σ [ δI^L I_x^L ; δI^L I_y^L ]    (7)

where I_x^L(x, y) and I_y^L(x, y) are the gradients of the image and δI^L(x, y) = I^L(x, y) - J^L(x, y) is the image difference. After obtaining the optical flow d^L of layer L, it is used as the initial value for layer L - 1 to calculate the optical flow of that layer; this step is repeated down to the last layer.
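The background-averaging and threshold test described above reduces to a few lines once per-point flow vectors are available. A hedged sketch follows; in practice the flow would come from a pyramidal LK tracker (e.g. OpenCV's `calcOpticalFlowPyrLK`), and the threshold and ratio values here are illustrative assumptions.

```python
import numpy as np

def is_dynamic(object_flow, background_flow, thresh=2.0, ratio=0.5):
    """object_flow, background_flow: (N, 2) arrays of per-point
    velocities [Vx, Vy] in pixels/frame. An object is judged dynamic
    when more than `ratio` of its points deviate from the background
    mean velocity by more than `thresh`."""
    E = background_flow.mean(axis=0)               # [Ex, Ey], background mean
    dev = np.linalg.norm(object_flow - E, axis=1)  # per-point speed deviation
    dynamic_pts = np.count_nonzero(dev > thresh)   # points exceeding threshold
    return dynamic_pts > ratio * len(object_flow)
```

Averaging over all background points makes the test robust to ego-motion: a static object viewed from a moving camera shares the background's flow and is not flagged.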

IV. EXPERIMENT
In this section, we evaluate the performance of SIIS-SLAM in dynamic environments on the public TUM RGB-D [34] and KITTI [35] datasets. The TUM dataset provides multiple sequences of dynamic scenes, obtained using Kinect sensors in different indoor scenes, including RGB images and depth images as well as ground-truth trajectories. We performed experiments using eight sequences captured in an office with rich texture, wherein the static objects include tables, chairs, and computers, and two persons are the dynamic objects. Among the eight sequences, those with names containing ''Sitting'' (shown as ''sit'' in the tables) are considered low-dynamic, whereas those with names containing ''Walking'' (shown as ''walk'' in the tables) are considered highly dynamic. The KITTI dataset contains stereo sequences recorded from a car in urban and highway environments. An NVIDIA DGX station equipped with four NVIDIA Tesla V100 graphics processors was used for the whole training and testing process. In addition, we also tested the full SLAM system on a laptop with a 2080S graphics card. The Ubuntu 16.04 (64-bit) operating system was used for all experiments.

A. DYNAMIC OBJECT DETECTION
In this section, we validate the proposed sequential image segmentation and analyze the time and accuracy. In addition, we verify the effectiveness of the optical flow detection module in dynamic scenes.

1) INSTANCE SEGMENTATION OF SEQUENTIAL IMAGE
Two sets of sequential images were selected from the TUM dataset for the experiment, each comprising 100 frames. The main objects in the images are people and chairs. We tested the selection of the anchor generation strategies and the effectiveness of the proposed joint segmentation method for sequential images.
We compared four strategies for anchor generation. Considering an 8 × 8 feature map as an example, the four strategies are depicted in Fig. 5, where the black grids represent the anchor generation area (Fig. 4(b) and the description of the improved Mask R-CNN in Section III-B are helpful for understanding the experimental content). To verify the effectiveness of the proposed joint segmentation method for sequential images, we tested three sets of sequential images and analyzed the accuracy and speed of object segmentation. For network training, the COCO [36] dataset was first used to train the network; then, the Labelme software was used to mark the instance targets of images in the TUM dataset as ground-truth segmentation results, which were used to further train the network, particularly for the dynamic objects (people and chairs). The results are shown in Table 1. The proposed sequential image segmentation method is almost three times faster than the original Mask R-CNN; moreover, there is an improvement in accuracy, demonstrating the effectiveness of the proposed strategy.
The anchor generation strategy depicted in Fig. 5(d) achieves the highest segmentation accuracy of 84.5%, but requires more time. In contrast, the anchor generation strategy shown in Fig. 5(c) requires the least segmentation time, but its accuracy is lower. In addition, the segmentation accuracies of the anchor generation strategies in Figs. 5(a) and 5(b) are similar, but the segmentation time of 5(b) is shorter. Thus, 5(b) is the optimal strategy. SIIS-SLAM (M) denotes the usage of only sequential segmentation to filter the feature points, whereas SIIS-SLAM (M + O) denotes the usage of both sequential segmentation and optical flow dynamic detection to filter the feature points. In this experiment, the segmentation results were used for qualitative analysis, and the root-mean-squared error (RMSE) [m] of the absolute trajectory of the SLAM results was used for quantitative analysis. The results are shown in Fig. 6 and Table 2. The extracted ORB feature points are denoted by green dots in Fig. 6(a). The optical flow information is shown in Fig. 6(b), where the lines represent the motion of the corresponding feature points. After filtering (based on a set threshold), the green long line segments represent the motion trajectories of the dynamic feature points, whereas the red short line segments represent the motion state of the static feature points. Fig. 6(c) shows the instance objects, marked by blue and red masks, which are regarded as potential dynamic objects in the experiment. The green mask in Fig. 6(d) represents the dynamic object obtained after segmentation and optical flow dynamic detection, whose feature points will not be used for localization and mapping. Qualitative analysis indicates that the sequential segmentation and optical flow dynamic detection modules used in this study can effectively filter the dynamic feature points. The results in Table 2 indicate that the proposed method can effectively reduce the absolute trajectory error and improve the accuracy of SLAM localization.
In the Walking_xyz sequence, this improvement is particularly significant. In the Walking_rpy sequence, the error increases when only segmentation is used, proving the effectiveness and necessity of the proposed combined strategy.

B. EXPERIMENTAL COMPARISON
1) TUM DATASET
ORB-SLAM2 is a complete SLAM system that supports monocular, stereo, and RGB-D cameras. DS-SLAM is a SLAM system designed based on the framework of ORB-SLAM2, whereas SIIS-SLAM is based on the framework of ORB-SLAM3. ORB-SLAM3 is the first SLAM system that can process both visual and inertial data and supports monocular, stereo, and RGB-D cameras. These four systems were compared to verify the effectiveness of the proposed system. We applied the absolute trajectory error (ATE) as the error metric, which represents the global consistency of the SLAM trajectory. In addition, we report the relative pose error (RPE), which denotes the rotational and translational drift, of ORB-SLAM3 and SIIS-SLAM for comparison. The ATE results presented in Table 3 demonstrate that the performance of the proposed method on most dynamic sequences is improved compared to those of ORB-SLAM3 and DS-SLAM. For the Walking_xyz, Walking_static, Walking_rpy, and Walking_half sequences, the RMSE and standard deviation (Std.) of the ATE are the lowest, demonstrating that SIIS-SLAM can effectively improve the robustness and stability of the SLAM system in a highly dynamic environment. However, for the Sitting_static sequence, the ATE values are only slightly lower than those of ORB-SLAM3, indicating that in low-dynamic or static sequences the performance improvement is not significant; the performance mainly depends on the underlying SLAM system. Table 4 depicts the RPE of ORB-SLAM3 and SIIS-SLAM. The results show that both the rotational and translational drift are the lowest in the ''Walking'' sequences. For the Sitting_static sequence, the rotational and translational drift of SIIS-SLAM are almost equal to those of ORB-SLAM3. These results demonstrate that the proposed SIIS-SLAM can effectively improve robustness and stability in a highly dynamic environment.
In addition, to further verify the effectiveness of the proposed method, we compared it with two advanced SLAM systems, DVO-SLAM [37] and DynaSLAM [13], as well as their original base systems. Similar to our approach, DVO-SLAM and DynaSLAM are both built on an original SLAM system; to compare the improvements achieved by the three methods, we also report the accuracy of their base methods. Table 5 reports the RMSE [m] of the absolute trajectory error.
As shown in Table 5, the ATE RMSE of the proposed method is the lowest compared to DVO-SLAM, DynaSLAM, and their original base systems, including ORB-SLAM3. The ATE RMSE values for the four dynamic sequences are 1.2 cm, 0.5 cm, 1.2 cm, and 2.3 cm, respectively. These results demonstrate the effectiveness of the proposed method. In terms of the improvement over their respective base systems, DVO-SLAM and DynaSLAM generate considerable improvement; this is because ORB-SLAM3 itself is already a SLAM system with superior performance, leaving limited scope for improvement. Nevertheless, SIIS-SLAM achieves the highest absolute localization accuracy.

2) KITTI DATASET
The absolute trajectory RMSE [m] values for the different methods are shown in Table 6. On KITTI 04, KITTI 08, and KITTI 09, the tracking accuracy is improved compared to ORB-SLAM3 and DynaSLAM. This result is expected because most of the vehicles in these sequences are moving. However, on KITTI 00, KITTI 02, and KITTI 06, our method did not work well, and the error was larger than that of ORB-SLAM3. This is because the texture of these areas is weak, and the number of feature points remaining after removing dynamic target points is too small. According to the experimental results, we believe that our method is only suitable for outdoor scenes with rich environmental textures. Judging from the overall process and results of the system, our method has higher robustness in indoor environments.

V. CONCLUSION
This study proposed a complete dynamic SLAM system with semantic information based on ORB-SLAM3, which reduces the impact of dynamic objects on localization and mapping. Besides the TUM dataset, we also used the KITTI dataset for our experiments. According to the experimental results, the method proposed in this paper is more suitable for indoor scenes, and its robustness is not high in outdoor scenes. Therefore, future work should be oriented toward more common scenarios. In addition, there is scope for improvement in this SLAM system: its performance in a given scene is affected by the learning samples and the final performance of the instance segmentation network. It is a challenge to select efficient and robust instance segmentation algorithms based on scenarios; the background environment and motion trajectory of each dynamic object should be restored and recorded in follow-up work. Thereby, the foreground and background in the tracking and mapping process can be separated and used for other high-level tasks.