I. INTRODUCTION
For Simultaneous Localization and Mapping (SLAM) of small-scale robots in confined and dynamic environments, vision-based methods such as ORB-SLAM [1], RTAB-Map [2], and LSD-SLAM [3] typically extract features from image sequences to estimate camera poses and reconstruct feature point clouds in 3D space. Because they rely on feature extraction, these methods achieve good results in texture-rich scenes, but their state estimation becomes less reliable and often suffers from ambiguity in textureless scenes. In comparison, LiDAR-based SLAM [4] is efficient and versatile, since the sensor hardware directly provides rich structural information. However, LiDAR-based SLAM systems typically require relatively large equipment, which is particularly unacceptable in resource-constrained settings such as small robots or long-duration missions.