Semantic SLAM Based on Improved DeepLabv3⁺ in Dynamic Scenarios

Simultaneous Localization and Mapping (SLAM) plays an irreplaceable role in the field of artificial intelligence. Traditional visual SLAM algorithms are stable under the assumption of a static environment, but their robustness and localization accuracy degrade in dynamic scenes. To address this problem, a semantic SLAM system named DeepLabv3+_SLAM is proposed, which adds a semantic segmentation thread and a geometric thread to ORB-SLAM3. The improved DeepLabv3+ semantic segmentation network combines context information to segment a priori potentially dynamic objects. Then, the geometric thread uses a multi-view geometry method to detect the motion state of the dynamic objects. Finally, a new ant colony strategy is proposed to find the groups of all dynamic feature points along an optimal path, which avoids traversing all the feature points, reduces the dynamic object detection time and improves the real-time performance of the system. Experiments on public datasets show that, compared with similar algorithms, the proposed method effectively improves both the positioning accuracy and the real-time performance of the system in highly dynamic environments.


I. INTRODUCTION
With the rapid development of robotics and computer science, autonomous mobile robots are widely used in various fields such as industry and agriculture. As one of the most advanced technologies in the field of robot motion, SLAM uses the sensor data from a robot for autonomous positioning and map construction. Autonomous localization and map construction depend on each other: only accurate localization makes it possible to build a correct map, and a correct map in turn helps the robot determine its position in the map accurately.
At present, most visual SLAM frameworks operate under the assumption of a static environment, such as ORB-SLAM [1], ORB-SLAM2 [2], ORB-SLAM3 [3], LSD-SLAM [4], and RGB-D SLAM [5]. Among these frameworks, ORB-SLAM3 is considered the most advanced method currently used in static scenes. ORB-SLAM3 is a system based on ORB-SLAM2 and ORB-SLAM-VI that can operate robustly in purely visual or visual-inertial systems, and is a complete and highly accurate generalized system. These algorithms can achieve satisfactory results in a static environment or an environment with a small number of dynamic objects. However, when the robot operates in an environment with a large number of dynamic objects (e.g., people, vehicles), the performance of the visual SLAM algorithm decreases significantly. This is because visual features from dynamic objects in the environment corrupt the pose estimation of the robot and greatly decrease the positioning accuracy of the system. In recent years, with the development of deep learning technology, more and more excellent image algorithms have been applied to visual SLAM, providing methods and ideas for improving the localization accuracy of the system.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhenbao Liu.
In this article, we propose a multi-threaded parallel semantic SLAM system to solve the problems caused by dynamic objects. The system is mainly based on the ORB-SLAM3 algorithm framework and introduces semantic segmentation and multi-view geometry approaches to the original framework. In the semantic segmentation thread, the ResNest [6] classification network, which has higher accuracy, replaces the original ResNet [7] in the DeepLabv3+ [8] segmentation network, which helps segment object boundaries more accurately. Dilated convolution [9] with a smaller dilation rate is more effective in extracting information from low-resolution feature maps, so a new layer of dilated convolution is added to the Atrous Spatial Pyramid Pooling (ASPP) module of DeepLabv3+, and the dilation rates are adjusted. In addition, to reduce the number of network parameters and improve the efficiency and training speed of the network, we replace all the dilated convolutions with depthwise separable convolutions [10] and perform 2D decomposition. In the geometry thread, multi-view geometry is used to determine the motion state of objects, and a new ant colony search strategy exploits the distribution characteristics of feature points in the image so that the multi-view geometry method does not have to analyze every feature point. This improves the robustness and real-time performance of the system.
The rest of this paper is organized as follows. Section II briefly describes some achievements and shortcomings of various visual SLAM systems in dynamic scenarios. Section III describes the architecture of our SLAM system. In Section IV, we conduct experiments on the TUM RGB-D dataset to verify the effectiveness and accuracy of the DeepLabv3+_SLAM system. Finally, in Section V, we conclude and discuss the paper.

II. RELATED WORK
The main methods for obtaining the semantic information of objects are target detection and semantic segmentation. Target detection determines the bounding box of an object, whereas semantic segmentation classifies objects at the pixel level. Both can be used to recognize dynamic objects in a scene. In comparison, semantic segmentation gives better recognition results, because the contours of objects can be segmented accurately, whereas a bounding box may contain pixels that do not belong to the object. After the abnormal objects are processed using semantic segmentation, a static background model without any dynamic objects is established, thus improving the accuracy and robustness of the visual SLAM system in dynamic environments.
With the rise of neural networks, semantic segmentation has been gradually introduced into semantic SLAM systems. For instance, Yu et al. [11] proposed the DS-SLAM scheme, which combines the visual SLAM algorithm with the SegNet [12] network to filter out the dynamic parts of a scene using semantic information and moving feature points. This method improves the accuracy of pose estimation, but the types of objects that can be recognized by the semantic segmentation network in this scheme are limited, which limits the scope of its application. Zhong et al. [13] combined ORB-SLAM2 and SSD [14] into a new coupled framework named Detect-SLAM, and proposed a method to propagate the motion probability of key points in real time to overcome the delay of the target detection thread. Semantic information is used to eliminate the negative effects caused by moving objects in SLAM. This framework aims to improve the efficiency of target detection and its sensitivity to the viewpoint transformation problem, but the real-time performance of the system needs to be further optimized. Xiao et al. [15] proposed Dynamic-SLAM, constructed an SSD target detector based on a convolutional neural network, and proposed a missed detection compensation algorithm based on the constant speed of adjacent frames to address the low recall of the SSD target detection network, which greatly improved the recall of detection. A selective tracking algorithm was also proposed to eliminate dynamic objects in a simple way, which improves the robustness and accuracy of the system. Cui and Ma [16] proposed a semantic optical flow method, which combines prior semantic information about motion, aids in the calculation of the epipolar geometry, filters out the truly dynamic features, and feeds only the remaining static features into the tracking optimization module to achieve accurate estimation of the camera pose in a dynamic environment.
Zhang et al. [17] proposed VDO-SLAM, a dynamic feature-based SLAM system, which utilizes image-based semantic information in the scene without prior knowledge of object pose or geometry to achieve localization, map building, and tracking of dynamic objects simultaneously. However, there are cases in which large errors occur due to problems with the algorithm or optimization function, and the real-time performance requires improvement. Chen et al. [18] proposed DM-SLAM, which combines the instance segmentation network Mask R-CNN with optical flow and epipolar geometry to constrain the outliers in the scene. Two different strategies for handling the segmentation results of potential dynamic objects in the dynamic point detection stage are proposed. One method reprojects the feature points with depth information to the current frame and uses the reprojection offset vector to distinguish dynamic points. The other method uses epipolar geometric constraints. Long et al. [19] proposed PSPNet-SLAM (Pyramid Scene Parsing Network SLAM), which integrates a pyramid-structured semantic thread and a geometric thread into ORB-SLAM2; the semantic thread combines contextual information to segment dynamic objects. An optimal error compensation homography matrix is designed to improve the accuracy of dynamic point detection, but the ability of the network to process image frames affects the real-time performance of the system, and the ability to remove dynamic objects needs to be improved. Bescos et al. [20] proposed DynaSLAM, which handles monocular and RGB-D cameras differently. In the monocular case, Mask R-CNN [21] is used to detect moving objects, and in RGB-D mode, the Mask R-CNN network and a multi-view geometric model are combined to detect moving objects. This method can detect multiple moving objects in the environment and repair the background occluded by dynamic objects. However, the system has difficulty operating in real time, because the Mask R-CNN network is time- and resource-consuming for image processing. Ai et al. [22] proposed the DDL-SLAM system, which improves the segmentation and background restoration abilities. By combining semantic segmentation and multi-view geometric algorithms to filter out dynamic objects in the scene, the background obscured by moving objects can be restored in the static scene map, thus improving the localization accuracy in highly dynamic environments. However, the real-time performance remains insufficient.
Although the various solutions described above show better performance than the traditional ORB-SLAM3 when detecting the semantic information of objects, there is still room for improvement and research on the correlation between objects in the semantic information, the localization accuracy and the real-time performance of the system.

III. SYSTEM DESCRIPTION
The system proposed in this paper is built on ORB-SLAM3. The overall structure block diagram is shown in Fig. 1. In the improved framework, a semantic thread and a geometric thread are added. First, the RGB-D camera collects image data. Then, the data is passed into the tracking thread for pre-processing, and the DeepLabv3+ model segments all the a priori dynamic content pixel by pixel, while the geometric thread module distinguishes the dynamic and static feature points in the image. Next, the segmentation results from the DeepLabv3+ model and the motion state information judged by the geometry module are combined and used to extract the contour regions of dynamic objects. Finally, feature points and spatial points in the dynamic object regions are removed, and image frames with only static features are used for subsequent tracking and map building, thus improving the accuracy and robustness of the visual SLAM system in a highly dynamic environment.
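As an illustration of the final step, the following minimal sketch keeps only the feature points that fall outside the dynamic object regions. It assumes that the combined dynamic regions are available as a binary mask indexed as `mask[row][column]` and that keypoints are `(u, v)` pixel coordinates lying inside the image. The function name `filter_static_keypoints` and this data layout are hypothetical and are not taken from the implementation of the system.

```python
def filter_static_keypoints(keypoints, dynamic_mask):
    """Keep keypoints whose pixel is not marked dynamic.

    keypoints: list of (u, v) pixel coordinates (u = column, v = row).
    dynamic_mask: 2D list of booleans, True where a dynamic object was found.
    Both layouts are assumptions made for this sketch.
    """
    static_points = []
    for u, v in keypoints:
        row, col = int(round(v)), int(round(u))
        if not dynamic_mask[row][col]:
            static_points.append((u, v))
    return static_points


# Toy 3 x 4 image with a single dynamic pixel at row 1, column 2.
mask = [[False] * 4 for _ in range(3)]
mask[1][2] = True
kept = filter_static_keypoints([(2, 1), (0, 0), (3, 2)], mask)
print(kept)
```

Only the keypoint lying on the masked pixel is discarded; the remaining points would be passed on to tracking and mapping.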

A. SEMANTIC SEGMENTATION DeepLabv3+
In traditional semantic SLAM systems, convolutional neural networks such as the fully convolutional network (FCN) [23], the encoder-decoder-based U-Net [24], and SegNet are used. However, each of these algorithms has problems, such as the inability to infer information from the context, the inability to handle the relationship between the scene and global information, or the inability to deal effectively with the relationships between categories, which leads to failed label association and discontinuous predictions. DeepLabv3+ is the best segmentation model in the DeepLab series [25]-[27] proposed by Google, but the model is not superior in terms of processing speed and model capacity. The overall structure of DeepLabv3+ is shown in Fig. 2. This model introduces an Encoder-Decoder design on top of a dilated FCN. The main function of the Encoder is to gradually reduce the resolution of the feature map and provide high-level semantic information. The main body of the Encoder is a DCNN with dilated convolution, for which the classification network can be ResNet, Xception or another network, followed by the ASPP module, which introduces multi-scale information and captures rich contextual information by performing pooling operations at different resolutions.
In what follows, $H_k^r$ denotes a convolution operation with a convolution kernel size of $k$ and a dilation rate of $r$.
The main function of the Decoder module is to further fuse the low-level features and high-level features to improve the accuracy of the segmentation boundaries and recover spatial information. The Decoder bilinearly upsamples the feature map output by the Encoder by a factor of 4, and then splices and fuses it with the low-level feature map from the backbone network after that map has passed through a 1 × 1 convolution for dimensionality reduction. Finally, the module applies a 3 × 3 convolution and upsamples by a factor of 4 again to obtain the final predicted semantic segmentation map.
As the backbone network of DeepLabv3+, ResNet performs well. ResNet mainly uses a residual structure based on a bottleneck design, which is generally used when the number of network layers is greater than 30, so that the network parameters can be significantly reduced and deeper networks can be trained. ResNet has largely alleviated the problem of network degradation caused by the deepening of network layers, so that the network can learn deeper image features. However, the size of its receptive field is fixed and single, so it cannot fuse multi-scale features, and it does not take advantage of the interaction between cross-channel features. ResNest was proposed to make up for these shortcomings of ResNet.
ResNest is a modification of ResNet that applies split attention to the feature maps within a single network, and extends the attention mechanism of the channel dimension to the representation of feature map groups in a modular form, as shown in Fig. 3. Compared with ResNet or its variants, ResNest does not require additional computation, yet it achieves a significant improvement in results. Therefore, this article uses ResNest as the backbone network of DeepLabv3+, so that the semantic thread in the SLAM system has better image segmentation performance.

B. ASPP MODULE
In the Encoder stage, the original ASPP module consists of a 1 × 1 convolution, three 3 × 3 dilated convolutions with dilation rates of 6, 12, and 18, and a global average pooling layer. As the backbone network continues to extract image features, the resolution of the feature map keeps decreasing, and dilated convolution with a larger dilation rate is not conducive to extracting information from lower-resolution feature maps. To address this problem, a new layer of dilated convolution is added to the original dilated convolutions, and the dilation rates are adjusted to 4, 8, 12, and 16 to improve the extraction of low-resolution feature map information. Writing $H_k^r$ for a convolution with kernel size $k$ and dilation rate $r$, the output can be expressed as:
$$y = H_3^{4}(x) + H_3^{8}(x) + H_3^{12}(x) + H_3^{16}(x)$$
ASPP stacks the dilated convolutions of different dilation rates in parallel to obtain multi-scale information gain. The one-dimensional mathematical expression of dilated convolution is:
$$y[i] = \sum_{s=1}^{S} x[i + r \cdot s]\, w[s]$$
where $x[i]$ is the input signal, $y[i]$ is the output signal, $r$ is the dilation rate (the step size with which the input is sampled), $w[s]$ is the convolution kernel parameter at position $s$, and $S$ is the size of the convolution kernel.
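The one-dimensional dilated convolution described above can be sketched in a few lines. The sketch assumes the standard form in which the output sample y[i] sums the inputs x[i + r·s] weighted by w[s] for s = 1, ..., S, and it only produces outputs for which every sampled index lies inside the input (no padding). The function name `dilated_conv1d` is hypothetical.

```python
def dilated_conv1d(x, w, r):
    """1D dilated convolution: y[i] = sum_{s=1..S} x[i + r*s] * w[s].

    x: input signal (list of numbers)
    w: kernel weights, stored 0-indexed so w[s - 1] is the weight at position s
    r: dilation rate, the spacing between sampled inputs
    No padding is applied (an assumption of this sketch).
    """
    S = len(w)
    n_out = len(x) - r * S  # largest valid i satisfies i + r*S <= len(x) - 1
    return [
        sum(x[i + r * s] * w[s - 1] for s in range(1, S + 1))
        for i in range(n_out)
    ]


signal = list(range(10))
print(dilated_conv1d(signal, [1, 1, 1], r=1))  # neighbouring samples
print(dilated_conv1d(signal, [1, 1, 1], r=2))  # every second sample
```

With the same three-tap kernel, a larger rate covers a wider span of the input without adding parameters, which is why very large rates become less useful once the feature map itself is small.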
Comparing the depthwise separable convolution with the standard convolution, we found that depthwise separable convolution can largely reduce the excessive number of parameters in the training process. The number of parameters in the standard convolution is about three times the number of parameters in the depthwise separable convolution for the same input. Therefore, we replace all the dilation convolutions in ASPP with depthwise separable convolutions to improve the training performance and efficiency of the system with less impact on the segmentation accuracy.
The main function of ASPP is to extract multi-scale information from the feature map. However, the 3 × 3 convolution learns redundant information, resulting in an increase in the number of system parameters that affects the speed of the system. In this paper, all 3 × 3 convolutions in ASPP are transformed into 3 × 1 and 1 × 3 convolutions using 2-dimensional decomposition without changing the dilation rate. This reduces the number of parameters by about 1/3 compared with the original structure, effectively reducing the computation of this module and speeding up training while retaining the ability to extract important feature information.
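The reduction of about 1/3 can be checked with a short weight count. The sketch below assumes plain convolutions with equal input and output channel counts, ignores bias terms, and uses a hypothetical width of 256 channels; none of these values is taken from the paper.

```python
def conv_weights(kernel_h, kernel_w, c_in, c_out):
    """Number of weights in a plain 2D convolution (bias terms ignored)."""
    return kernel_h * kernel_w * c_in * c_out


channels = 256  # hypothetical channel width, chosen only for illustration

full_3x3 = conv_weights(3, 3, channels, channels)
factorized = (conv_weights(3, 1, channels, channels)
              + conv_weights(1, 3, channels, channels))
reduction = 1 - factorized / full_3x3

print(full_3x3, factorized, round(reduction, 3))  # 589824 393216 0.333
```

A 3 × 3 kernel has 9 weights per channel pair, whereas the 3 × 1 and 1 × 3 pair has 6, so the saving is 1/3 regardless of the channel width under these assumptions.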
The improved ASPP module is shown in Fig. 4. When the feature map generated by the backbone network is sent to ASPP, it is processed in parallel by a 1 × 1 convolution, four convolutions with dilation rates of 4, 8, 12 and 16, and a global average pooling operation. Then, the six feature maps obtained are spliced and fused in the channel dimension. Finally, the feature map containing high-level semantic features is obtained after a 1 × 1 convolution for dimensionality reduction.

C. DYNAMIC OBJECT DETECTION BASED ON MULTI-VIEW GEOMETRY
Semantic segmentation networks can only detect objects that are dynamic with a high a priori probability, but in actual scenes, the SLAM system is often disturbed by objects that are normally static. Books and chairs are examples of static objects. However, when people move with books or chairs, these objects should be regarded as dynamic, yet they are treated as static objects and participate in the positioning and mapping. This has a great impact on the SLAM system. Therefore, we use a dynamic object segmentation method based on multi-view geometry for processing. As shown in Fig. 5, the map point cloud is projected to the current frame, and each object is classified as dynamic or static based on the viewing angle difference and the size of the change in depth value. We calculate the viewing angle value $v_{cf}$ of each key point in the current frame ($cf$) and the viewing angle value $v_{hf}$ in the historical frame ($hf$); if the difference $\Delta v = |v_{cf} - v_{hf}|$ is greater than the set threshold, the key point is determined to be a dynamic point. At the same time, we calculate the depth value $d_{cf}$ of the key point in the current frame and the projected depth value $d_{proj}$ of the historical frame in the current frame. If the difference between the depth values is $\Delta d = |d_{proj} - d_{cf}| = 0$, the key point is determined to be a static point. If $\Delta d$ is greater than the set threshold $d_{thresh}$, the key point is considered dynamic.
VOLUME 10, 2022
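The two tests described above can be sketched as a single decision function. The threshold values are left as parameters because the text does not give them, and the handling of a depth difference that is non-zero but below the threshold (treated here as static) is an assumption of this sketch. The function name `classify_keypoint` is hypothetical.

```python
def classify_keypoint(v_cf, v_hf, d_cf, d_proj, v_thresh, d_thresh):
    """Label a key point 'dynamic' or 'static' from viewing angle and depth.

    v_cf, v_hf: viewing angle values in the current and historical frame
    d_cf: measured depth of the key point in the current frame
    d_proj: depth of the historical point projected into the current frame
    v_thresh, d_thresh: thresholds, whose values are assumptions of the caller
    """
    delta_v = abs(v_cf - v_hf)
    delta_d = abs(d_proj - d_cf)
    if delta_v > v_thresh:
        return "dynamic"  # viewing angle changed too much
    if delta_d > d_thresh:
        return "dynamic"  # depth disagrees with the projection
    # Assumption: small non-zero depth differences are treated as static.
    return "static"


print(classify_keypoint(40.0, 5.0, 1.0, 1.0, v_thresh=30.0, d_thresh=0.1))
print(classify_keypoint(10.0, 8.0, 1.0, 1.5, v_thresh=30.0, d_thresh=0.1))
print(classify_keypoint(10.0, 8.0, 1.0, 1.0, v_thresh=30.0, d_thresh=0.1))
```

The first two calls are flagged as dynamic (by angle and by depth respectively), while the third has a zero depth difference and a small angle difference and is kept as static.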

D. ANT COLONY STRATEGY
The ant colony algorithm [28] is an optimization algorithm that simulates the foraging behavior of ants. Ants release pheromones related to the path length during movement. The path length is inversely proportional to the pheromone concentration, so the optimal path has the largest pheromone concentration. Ants choose their path according to pheromone concentration. The ant colony algorithm has two main processes: state transfer and pheromone update. Assuming that the probability of ant $m$ moving from node $i$ to node $j$ is $p_{ij}^{m}$, its state transition rule is given by the following equation:
$$p_{ij}^{m} = \begin{cases} \dfrac{[\tau(i,j)]^{\alpha}\,[\eta(i,j)]^{\beta}}{\sum_{s \in \mathrm{allowed}_m} [\tau(i,s)]^{\alpha}\,[\eta(i,s)]^{\beta}}, & j \in \mathrm{allowed}_m \\ 0, & \text{otherwise} \end{cases}$$
where $\tau(i, j)$ denotes the pheromone concentration on the path from $i$ to $j$, $\eta(i, j)$ is the corresponding heuristic information function, $\alpha$ is the information heuristic factor, $\beta$ is the expected heuristic factor, and $\mathrm{allowed}_m$ is the set of nodes not yet visited by the ant. The greater the value of $\alpha$, the more likely the ant is to choose a previously traveled path, and the randomness of the search path is weakened. The smaller the value of $\alpha$, the smaller the search range, and the more easily the search falls into a local optimum. The larger the value of $\beta$, the more easily the ant colony chooses the locally shortest path, and the faster the algorithm converges. When the ant completes a path transfer, it performs a pheromone update. The update rule is as follows:
$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}(t)$$
where $\rho$ is the information volatilization factor, $\rho \in [0, 1)$, $1-\rho$ denotes the residual factor, and $\Delta\tau_{ij}(t)$ is the pheromone increase from $i$ to $j$ at time $t$. When $\rho$ is too small, too much pheromone remains on each path, resulting in the continued search of invalid paths, which affects the efficiency of the algorithm. When $\rho$ is too large, although invalid paths can be excluded from the search range, valid paths may also be excluded, affecting the search for the optimal solution.
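The two processes can be sketched as follows, assuming the standard ant colony forms of the rules: the transition probability is the normalized product of pheromone raised to α and heuristic value raised to β over the unvisited nodes, and the update keeps a fraction 1 − ρ of the old pheromone and adds the new deposit. The function names, the matrix layout and the default parameter values are hypothetical.

```python
def transition_probabilities(tau, eta, i, allowed, alpha=1.0, beta=2.0):
    """Probability of moving from node i to each node j in `allowed`.

    tau[i][j]: pheromone concentration on edge (i, j)
    eta[i][j]: heuristic value of edge (i, j), e.g. inverse distance
    allowed:   nodes the ant has not visited yet
    """
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: weight / total for j, weight in weights.items()}


def update_pheromone(tau_ij, delta_tau_ij, rho=0.5):
    """Evaporate a fraction rho of the pheromone, then add the new deposit."""
    return (1.0 - rho) * tau_ij + delta_tau_ij


# Three nodes with equal pheromone; node 1 is twice as attractive as node 2.
tau = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
eta = [[0.0, 1.0, 0.5], [1.0, 0.0, 1.0], [0.5, 1.0, 0.0]]
probs = transition_probabilities(tau, eta, i=0, allowed=[1, 2])
print(probs)
print(update_pheromone(2.0, 0.5, rho=0.5))
```

With β = 2, halving the heuristic value of an edge reduces its weight by a factor of four, which illustrates how a large β pushes the colony toward the locally best edge.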

E. NEW ANT COLONY STRATEGY
When the multi-view geometry method transforms the image of the historical frame into the current frame by projection, a large number of projected feature points are obtained. Each point is determined to be static or dynamic by traversing all the projected feature points. However, feature extraction yields thousands of feature points, and if every one of them must be classified as static or dynamic, the real-time performance of SLAM is limited. In this paper, based on the strategy of the ant colony algorithm, we propose a new ant colony strategy to find the groups of all dynamic feature points along an optimal path, so as to avoid traversing all feature points, reduce the time consumed by feature point extraction and improve the real-time performance of SLAM.
In the ant colony algorithm, throughout the process from the origin to the destination, the ant colony avoids the obstacles it encounters in order to find an optimal path to the destination. Based on this strategy, this paper sets a search path from the starting point to the destination and searches the feature points on the path in turn. Because the dynamic points or static points in the image are distributed in groups rather than scattered chaotically throughout the image, when a dynamic feature point is found, the search is transferred to the group in which the feature point is located until all the feature points of the entire group are detected or the search exceeds the range of the group. The next dynamic feature point group is then searched. When a static feature point is detected, the point and its group are not processed, and the search continues along the path.
According to the distribution of feature points in the image, this paper designs a path $l$ from the starting point $S$ to the destination $T$, as shown in Fig. 6. The search strategy is as follows: the ant colony moves continuously from the feature point $m_0$ on the path to the next point $m_i$ ($i = 1, 2, \ldots, n$) until reaching the destination $T$. On the moving path, each feature point $m_i$ takes itself as the origin and searches for feature points within a radius $R$. If no dynamic point is found, the search continues to move forward on path $l$. When a dynamic point is found, the search expands outward with the bandwidth $h$. If a new dynamic point is found, the search continues to expand outward with $h$ until no dynamic point is found in the expanded area; it then returns to path $l$ and continues to search for the next feature point $m_i$ that matches the dynamic feature in turn until path $l$ is completed.

IV. EXPERIMENTS
A. EXPERIMENTAL ENVIRONMENT AND DATA SET
In this section, to compare the performance of our semantic SLAM system and other excellent SLAM systems in dynamic environments, experiments are conducted on the data set TUM RGB-D. In addition, the proposed system is compared with the original ORB-SLAM3 to quantify its improvement in dynamic scenarios. All experiments were performed on a computer equipped with an Intel i7 CPU, RTX2080Ti GPU and 16 GB of memory.
The TUM dataset is an excellent dataset for evaluating camera positioning accuracy and provides accurate ground truth for the sequences. The dataset contains 7 sequences recorded by an RGB-D camera at 30 fps with a resolution of 640 × 480. In this section, we use 5 sequences from the TUM dataset to evaluate the performance and demonstrate the effectiveness of DeepLabv3+_SLAM in dynamic environments, namely fr3_s_static, fr3_w_static, fr3_w_rpy, fr3_w_xyz, and fr3_w_halfsphere. Except for fr3_s_static, which is a static sequence, the sequences are dynamic. The "s" in the sequence name means "sitting" and "w" means "walking". The word after the underscore indicates the state of the camera; for example, "xyz" indicates that the camera moves along the x, y and z axes.
To quantitatively evaluate the advantages of our algorithm, the overall performance of the system is evaluated using Absolute Trajectory Error (ATE), which indicates the global consistency of the trajectory, and Relative Pose Error (RPE), which measures translational and rotational drift. Root Mean Square Error (RMSE) can reflect the accuracy and robustness of the system better than the mean and median values, and the Standard Deviation Error (S.D.) can reflect the stability of the system. Therefore, in this paper, the RMSE value and S.D. value of ATE and RPE are obtained by processing each sequence separately to judge the positional accuracy and system stability.
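As a minimal sketch of how the RMSE and S.D. of the ATE can be computed, the code below assumes that the estimated and ground-truth trajectories have already been associated by timestamp and aligned to the same frame, that positions are given as (x, y, z) tuples, and that the standard deviation is taken over the whole sequence (population form). The function name `ate_statistics` is hypothetical.

```python
import math


def ate_statistics(estimated, ground_truth):
    """Return (RMSE, S.D.) of the per-pose translational error.

    estimated, ground_truth: equally long lists of (x, y, z) positions,
    assumed to be time-associated and aligned beforehand.
    """
    errors = [math.dist(p, q) for p, q in zip(estimated, ground_truth)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean = sum(errors) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    return rmse, sd


est = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
gt = [(0, 0, 0), (1, 1, 0), (2, 2, 0)]
rmse, sd = ate_statistics(est, gt)
print(round(rmse, 4), round(sd, 4))
```

Because the errors are squared before averaging, the RMSE is dominated by the largest deviations, which is why it reacts more strongly to occasional tracking failures than the mean or the median.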

B. EXPERIMENTAL RESULT
The ATE and RPE of the ORB-SLAM3, DynaSLAM and DeepLabv3+_SLAM algorithms were obtained by conducting experiments on 5 sequences. The results are shown in Tables 1-3.
As shown in the tables, DeepLabv3+_SLAM and DynaSLAM significantly reduce the ATE and RPE of each sequence compared to ORB-SLAM3. In highly dynamic sequences, the method in this paper shows a significant improvement in ATE and RPE compared to DynaSLAM; in terms of ATE, the improvements in RMSE and S.D. reach 25.18% and 31.88%, respectively. This is mainly because the proposed semantic segmentation network not only has better performance, but also considers the information correlation with the geometric thread, so that the DeepLabv3+_SLAM system can significantly improve its localization accuracy and robustness in highly dynamic environments. In the low dynamic sequence fr3_s_static, the improvement of the method in this paper over ORB-SLAM3 is not obvious. This is mainly because ORB-SLAM3 itself is designed for low dynamic environments and already handles low dynamic scenes well, so the room for improvement is limited.
Figs. 7-9 show the ATE and RPE of ORB-SLAM3, DynaSLAM and DeepLabv3+_SLAM in the high dynamic sequence fr3_w_xyz. The black line represents the real trajectory of the camera and the blue line indicates the camera trajectory estimated by the SLAM algorithm. In the high dynamic environment, the motion trajectory estimated by the ORB-SLAM3 system deviates considerably from the real trajectory, and is even wrong in some regions. In contrast, the trajectories estimated by the DynaSLAM and DeepLabv3+_SLAM systems overlap closely with the true trajectories because the dynamic objects in the scene are eliminated, and the trajectories estimated by DeepLabv3+_SLAM are closer to the true trajectories than those estimated by DynaSLAM. This indicates that the method in this paper is more capable of handling highly dynamic scenes.
The purpose of this paper is to remove the feature points on dynamic targets and keep only the remaining static feature points. Therefore, to verify the effect of dynamic feature point rejection, this paper conducts experiments on the high dynamic sequence fr3_w_xyz. Fig. 10 shows the original image, the semantic segmentation image, and the image with unprocessed dynamic feature points from top to bottom, where the green dots represent the locations of ORB feature points. As can be seen from the figure, the feature points falling on dynamic objects have been detected and removed by the method in this paper, while other feature points falling on static objects are retained. There are also feature points in some regions at the edges of the human body that are not well rejected, which is related to the accuracy of semantic segmentation.
In practical applications, real-time performance is an important metric for evaluating SLAM systems. Therefore, to evaluate the real-time performance, we ran DeepLabv3+_SLAM and DynaSLAM on the five sequences under the same hardware conditions and recorded the time consumed by the geometric threads. The results are shown in Table 4. In terms of running time, the new ant colony strategy introduced in the geometric thread greatly reduces the time consumed by the geometric method to judge the object state information. The method in this paper therefore has better real-time performance than DynaSLAM, which improves the overall real-time performance of the SLAM system.

V. CONCLUSION
In order to eliminate the influence of dynamic objects on the positioning accuracy of the system, we propose the DeepLabv3+_SLAM system. This system introduces semantic and geometric threads into the original ORB-SLAM3. First, a priori dynamic information is obtained through the semantic thread. Then, the dynamic feature points in the scene are detected in the geometry thread using a multi-view geometry approach, while a new ant colony strategy selectively detects dynamic feature points using the distribution characteristics of the feature points in order to improve the real-time performance of the geometry thread. Finally, to verify the overall performance of the system, we conducted experiments and analyses on the TUM RGB-D dataset. The results show that the localization accuracy and real-time performance of the system are improved in a highly dynamic environment compared with existing advanced SLAM frameworks.
Despite the progress in localization accuracy and real-time performance, there are still many deficiencies. On the one hand, the real-time performance of the system still needs to be improved, and the speed of geometric thread image frame processing needs to be improved. On the other hand, we still need to continuously optimize the semantic segmentation network to improve the accuracy of network segmentation, or select other excellent and lightweight networks to help the system more effectively eliminate the impact caused by dynamic objects.