A Robust Keyframe-Based Visual SLAM for RGB-D Cameras in Challenging Scenarios

The accuracy of RGB-D SLAM systems is sensitive to image quality and can be significantly compromised in adverse situations, such as when input images are blurry, lacking in texture, or overexposed. In this paper, building on Continuous Direct Sparse Visual Odometry (CVO), we present a novel Keyframe-based CVO (KF-CVO) with an intrinsic keyframe selection mechanism that effectively reduces the tracking error. We then extend KF-CVO to an RGB-D SLAM system, CVO SLAM, equipped with place recognition via ORB features and joint bundle adjustment & pose graph optimization. Comprehensive evaluations on publicly available benchmarks show that the proposed RGB-D SLAM system achieves a higher success rate than current state-of-the-art methods, and is more robust on difficult benchmark sequences that contain adverse situations such as rapid camera motion, environments lacking texture, and overexposed images under strong illumination.


I. INTRODUCTION
Visual SLAM has been a focused research topic and is widely applied in areas such as 3D reconstruction [1], [2], [3], augmented reality [4], [5], and mobile robotics [6], [7]. Feature-based SLAM has long been regarded as the mainstream approach [8], [9]. Key strengths of feature-based SLAM systems such as the ORB-SLAM series [10], [11], [12] come from the computational efficiency and the viewpoint- and illumination-invariant properties of ORB features, which enable accurate feature matching and reliable place recognition. However, using a sparse set of features comes at the cost of discarding most of the image information, resulting in tracking loss and a lack of robustness.
On the contrary, odometry and SLAM systems based on direct methods [13], [14], [15], [16], [17], [18] have achieved robust tracking performance because sufficient image pixels are used. However, the precision of direct methods can be vulnerable to photometric inconsistency, and their loop-closure performance highly depends on the initial odometry accuracy. (The associate editor coordinating the review of this manuscript and approving it for publication was Wai-Keung Fung.)
In this work, we design a novel RGB-D SLAM system that is based on a robust direct odometry method from RGB-D images, Continuous Direct Sparse Visual Odometry (CVO) [19], which is combined with place recognition and bundle adjustment fueled by ORB features to enhance accuracy. The contributions of this paper are summarized below.
1) We design a keyframe selection mechanism based on the sensor registration problem formulation of CVO, and propose Keyframe-based CVO (KF-CVO), which effectively reduces the tracking error of CVO and outperforms the state-of-the-art direct odometry method, Dense Visual Odometry (DVO). 2) We extend KF-CVO to an RGB-D SLAM system, CVO SLAM, equipped with a sparse-to-dense loop-closure detection module and joint bundle adjustment & pose graph optimization. 3) We present extensive evaluations on publicly available benchmarks and show that, compared to state-of-the-art methods, the proposed SLAM system achieves a higher success rate over all evaluated benchmark sequences, and is more robust to challenging scenarios such as significant motion blur, lack of texture, and overexposure in images (Figure 1). 4) An open-source implementation of this work is available at https://github.com/bexilin/CVO-SLAM.

FIGURE 1. Performances of RGB-D SLAM systems under adverse conditions. Images in this example sequence were taken when the camera was facing strong illumination. The proposed system, CVO SLAM, is robust in this case and results in a more precise trajectory than state-of-the-art RGB-D SLAM systems.

The remainder of this paper is organized as follows. A review of related work is given in Section II. Background on continuous sensor registration is introduced in Section III. Section IV presents the methodology of KF-CVO. Sections V and VI discuss the extension of KF-CVO to CVO SLAM via loop-closure detection and graph optimization. The results of the evaluation experiments are shown in Section VII. Finally, Section VIII concludes the paper and comments on future work.

II. RELATED WORK
Henry et al. [20] proposed an RGB-D Iterative Closest Point (ICP) method. RGB-D-ICP initializes the sensor motion with SIFT feature matching and utilizes the ICP algorithm to estimate the pose. The SIFT features are also used for loop-closure detection. A pose graph optimization is applied to maintain global consistency. In KinectFusion, Newcombe et al. [2] proposed 3D scene reconstruction using depth cameras and the truncated signed distance function. The camera pose is tracked using coarse-to-fine ICP by aligning each frame to the global model. This align-and-fuse strategy associates the current frame with measurements from the past, thereby establishing loop-closures and reducing the drift compared to frame-to-frame tracking. Whelan et al. extended KinectFusion to an RGB-D SLAM system [21]. ElasticFusion also uses the same frame-to-model tracking approach [22], and surface loop-closures are applied frequently to maintain the consistency of the map. Using a frame-to-model dense odometry method, Yang et al. [23] maintain a globally consistent sparse feature map in the backend to improve the global map quality and pose estimation accuracy. Although not focusing on reconstructing a dense map, we use a similar structure that combines a semi-dense tracking frontend and feature-based loop-closure detection to achieve both stable local tracking and precise global pose estimation.
In [13], [14], and [15], direct image alignment is achieved by minimizing an energy function that corresponds to the photometric error. Kerl et al. [15], [24] combined this method with a Keyframe-based pose graph SLAM, where the Gaussian entropy ratio was proposed as a metric of similarity between frames. The keyframe selection and the loop-closure validation are based on hard thresholding of this ratio. For loop-closure detection, they choose a metric nearest-neighbor search, the performance of which is impaired when the odometry estimation is not sufficiently accurate. Schöps et al. [25] proposed BAD SLAM, which uses a fast direct Bundle Adjustment (BA) formulation. Starting from Keyframe-based direct RGB-D tracking, BAD SLAM creates dense surfels for each keyframe. The surfels are associated with descriptors and tracked carefully with BA. BAD SLAM can achieve high-accuracy trajectory estimation on sequences recorded with a well-calibrated global shutter camera. However, even with this ideal camera setup, the method is vulnerable to motion blur: as soon as the images get blurry, the method diverges, and recovery rarely happens. The odometry we use does not require data association during image registration, avoiding the impact of wrong data associations, and is robust to motion blur.
Min et al. [26], [27] propose a dense indirect visual odometry method that estimates poses by optimizing optical flow residuals, and extend it to a SLAM system that is compatible with RGB-D inputs. Mur-Artal et al. [10], [11] and Campos et al. [12] designed ORB feature-based SLAM systems for monocular, stereo, and RGB-D cameras. ORB features extracted from images are used in odometry, map point generation, loop-closure detection, and recovery from tracking loss. While direct methods are vulnerable to photometric inconsistency, ORB features are robust to viewpoint and illumination variation, so accurate matching can be achieved. Besides, bundle adjustments are performed in the backend to jointly optimize the map point positions and camera poses, further improving localization accuracy. However, low-texture environments are typical failure scenarios of these systems due to the difficulty of extracting ORB features in such environments. Our method uses a semi-dense odometry and a sparse-to-dense loop-closure estimation method to improve the reliability and estimation accuracy in texture-less environments.
More recently, high-level features and informative pixels have also been used to develop visual odometry and SLAM for structured environments [28], [29], [30], [31], [32], [33]. Company-Corcoles et al. [28] extract both point and line features from images, and perform both point and line feature alignment for pose estimation. Shu et al. [29] use point and line features not only for tracking, but also for plane detection and reconstruction, which can help to increase pose estimation accuracy and the semantic interpretation of the map. For structured indoor environments, SLAM systems that use plane features, which are more robust to measurement noise than point features, have been developed to improve tracking and mapping accuracy [30], [31], [32]. Fontan et al. [33] propose a semi-direct RGB-D SLAM system, which simultaneously tracks both features and informative pixels such as those with high gradients, and performs a joint optimization over both feature alignment and photometric residuals to enhance robustness. We adopt a similar idea of joint optimization with constraints from semi-dense frame alignment and feature matching during bundle adjustment.

III. BACKGROUND ON CONTINUOUS SENSOR REGISTRATION
CVO estimates the relative transformation between two overlapping RGB-D images by performing a continuous registration process on the semi-dense point clouds computed from the RGB-D images. The problem formulation of the continuous registration process is as follows.
Consider two (finite) collections of points, X = {x_i} and Z = {z_j}, with x_i, z_j ∈ R^3. We seek the rigid transformation h = (R, T) ∈ SE(3), where R ∈ SO(3) and T ∈ R^3, that aligns the two point clouds X and hZ = {hz_j} the ''best.'' To assist with this, we will assume that each point contains information described by a point in an inner product space, (I, ⟨·, ·⟩_I). To this end, we will introduce two labeling functions, ℓ_X : X → I and ℓ_Z : Z → I.
In order to measure their alignment, we will be turning the clouds, X and Z, into functions f_X, f_Z : R^3 → I that live in a reproducing kernel Hilbert space (RKHS), H. The problem of aligning the point clouds can now be rephrased as maximizing the scalar product of f_X and h.f_Z, i.e., we want to solve

    arg max_{h ∈ SE(3)} F(h),  F(h) := ⟨f_X, h.f_Z⟩_H.  (1)

We follow the same steps as in [19], with an additional step in which we use the kernel trick to kernelize the information inner product. For the kernel of our RKHS, H, we first choose the squared exponential kernel k : R^3 × R^3 → R,

    k(x, z) = σ^2 exp( −‖x − z‖^2_3 / (2ℓ^2) ),  (2)

for some fixed real parameters (hyperparameters) σ and ℓ, where ‖·‖_3 is the standard Euclidean norm on R^3. This allows us to turn the point clouds into functions via

    f_X(·) = Σ_i ℓ_X(x_i) k(·, x_i),  f_Z(·) = Σ_j ℓ_Z(z_j) k(·, z_j).  (3)

We can now define the inner product of f_X and f_Z by

    ⟨f_X, f_Z⟩_H = Σ_{i,j} ⟨ℓ_X(x_i), ℓ_Z(z_j)⟩_I k(x_i, z_j).  (4)

We use the well-known kernel trick in machine learning [34], [35], [36] to substitute the inner products in (4) with the appearance (color) kernel. The kernel trick can be applied to carry out computations implicitly in the high-dimensional space, which leads to computational savings when the dimensionality of the feature space is large compared to the number of data points [35]. After applying the kernel trick to (4), we get

    ⟨f_X, f_Z⟩_H = Σ_{i,j} k_c(ℓ_X(x_i), ℓ_Z(z_j)) k(x_i, z_j),  (5)

where we choose k_c to also be the squared exponential kernel, with fixed real hyperparameters σ_c and ℓ_c that are set independently.
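As a concrete sketch of how the kernelized inner product (5) can be evaluated (with the transform h applied to the second cloud, as in (1)), the NumPy snippet below computes F(h) for two labeled point clouds. The hyperparameter values are placeholders, not the values used in the paper:

```python
import numpy as np

def se_kernel(a, b, sigma, ell):
    """Squared exponential kernel between row-stacked vectors a (n,d) and b (m,d)."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return sigma**2 * np.exp(-np.maximum(d2, 0.0) / (2.0 * ell**2))

def alignment_F(X, cX, Z, cZ, R, T, sigma=0.1, ell=0.15, sigma_c=1.0, ell_c=0.2):
    """F(h) = sum_ij k_c(c_i, c_j) * k(x_i, R z_j + T) for h = (R, T).

    X, Z: (n,3)/(m,3) point positions; cX, cZ: their color labels.
    The hyperparameters here are illustrative placeholders.
    """
    hZ = Z @ R.T + T                            # apply the rigid transform to each z_j
    K_geo = se_kernel(X, hZ, sigma, ell)        # geometric kernel on 3D positions
    K_col = se_kernel(cX, cZ, sigma_c, ell_c)   # appearance (color) kernel
    return float(np.sum(K_geo * K_col))
```

In practice, the double sum is sparsified by discarding kernel values below a small cutoff, since the squared exponential decays quickly with distance.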

IV. KEYFRAME-BASED CVO (KF-CVO)
When performing consecutive frame odometry, i.e., frame-to-frame odometry, the total trajectory drift accumulates at every frame. To reduce the total drift, we adopt the idea of Keyframe-based tracking [4], [24], [37], and propose KF-CVO. Instead of estimating the global pose of the current frame with respect to that of the last frame, the current pose is computed using the current keyframe pose and keyframe-to-frame odometry. The overall pipeline of the Keyframe-based odometry is shown in the red dashed box in Figure 2. A detailed illustration of the step ''Update local pose graph'' is shown in the orange dashed box, where the notations are defined in the following paragraphs. Note that all graph optimizations are disabled when we evaluate the performance of KF-CVO in Section VII.

A. KEYFRAME-BASED TRANSFORMATION ESTIMATION
When receiving a new image n, we first track it with respect to the last frame n−1 to get the consecutive transformation T^{n−1}_n by solving an instance of equation (1):

    T^{n−1}_n = arg max_{h ∈ SE(3)} ⟨f_{X_{n−1}}, h.f_{X_n}⟩_H,  (6)

where we denote T^n_{n−1} = (T^{n−1}_n)^{−1}. In equation (6), f_{X_{n−1}} and f_{X_n} are point functions at the last frame n−1 and the current frame n, and h_0 is the initial pose for the solver, which is set to the previous pose T^{n−2}_{n−1}. This assumption can also be interpreted as a constant velocity motion model for consecutive tracking initialization.

FIGURE 2. Pipeline of CVO SLAM. When a new scan is fed into the system, both frame-to-frame tracking and Keyframe-based tracking will be performed. The resulting transformations will then be added to the local pose graph. When the current local map finishes, a pose graph optimization is performed on the local pose graph, and loop-closure detection will be triggered. The optimized relative pose of normal frames with respect to the keyframe will be stored. When a new keyframe is decided, ORB features will be extracted from the frame, and a bag-of-words vector will be computed accordingly for future loop-closure detection. When the resulting new map points, projection edges and relative pose edges from loop-closure detection are inserted into the global map, a local map optimization will be performed. When the SLAM process finishes, a final optimization will be performed on the global pose graph, and the poses of all normal frames are retrieved from optimized keyframe poses.
To get the transformation from the current keyframe m to the current frame n, T^m_n, we similarly solve another instance of equation (1):

    T^m_n = arg max_{h ∈ SE(3)} ⟨f_{X_m}, h.f_{X_n}⟩_H.  (7)

In equation (7), we set the initial value h_0 using the current frame-to-keyframe pose T^{n−1}_m and the consecutive pose T^n_{n−1}. Consequently, the estimated pose of the current frame in the world frame, T^W_n, can be obtained by multiplying the current keyframe pose T^W_m with the keyframe-to-frame pose T^m_n.
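The pose bookkeeping in this subsection reduces to products of 4×4 homogeneous transforms. A small sketch with made-up poses (the numbers are purely illustrative) shows the constant-velocity-style initialization of the keyframe-to-frame solver and the recovery of the global pose:

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(a):
    """Rotation about the z-axis by angle a (radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Keyframe m at an (illustrative) world pose, plus the two tracked transforms.
T_W_m  = se3(rot_z(0.10), np.array([1.0, 0.0, 0.0]))   # world pose of keyframe m
T_m_n1 = se3(rot_z(0.05), np.array([0.10, 0.0, 0.0]))  # keyframe m -> frame n-1
T_n1_n = se3(rot_z(0.02), np.array([0.05, 0.0, 0.0]))  # frame n-1 -> frame n

# Initial guess for the keyframe-to-frame solver: chain the last keyframe-to-frame
# pose with the newest consecutive transform.
h0 = T_m_n1 @ T_n1_n
# Global pose of the current frame from the keyframe pose and keyframe-to-frame odometry.
T_W_n = T_W_m @ h0
```

The point of the keyframe scheme is visible here: drift enters `T_W_n` through a single keyframe-to-frame product rather than through a long chain of frame-to-frame products.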

B. NEW KEYFRAME SELECTION
For keyframe selection, we define a novel parameter that is intrinsic to CVO using the ratio of two inner products:

    γ = ⟨f_{X_m}, T^n_m . f_{X_n}⟩_H / ⟨f_{X_m}, T^{m+1}_m . f_{X_{m+1}}⟩_H,  (8)

where f_{X_m} and f_{X_n} have the same definitions as those in equations (6) and (7). Since the frame next to the current keyframe, m+1, is the closest frame to the current keyframe m, the images are highly similar and T^m_{m+1} is expected to be more accurate than the transformations of future frames. Therefore, the point cloud X_m and the aligned point cloud T^m_{m+1} X_{m+1} are also expected to be more similar than future point clouds. As such, the inner product ⟨f_{X_m}, T^{m+1}_m . f_{X_{m+1}}⟩_H, computed using equation (5), serves as a reference value for evaluating the alignment quality between the current keyframe m and the current frame n.
The inner product ratio γ indicates the relative alignment quality of the current keyframe-to-frame transformation T m n . In addition, γ is a relative quantity, which means it is independent of the scale variation in absolute inner product values. A consequence of this property is that we can set a fixed threshold γ thres .
In addition, we also set maximum thresholds on translation and rotation between neighboring keyframes such that they are not too far from each other and have similar view angles. In other words, the relative transformation between two consecutive keyframes m and n, T^m_n = T^m_W T^W_n, should not be too large. We use the absolute value of the rotation angle, θ_mn, computed through

    θ_mn = | arccos( (Tr(R^m_n) − 1) / 2 ) |,  (9)

where R^m_n is the rotation part of T^m_n and Tr(·) denotes the trace of a matrix. For the translational difference, we use the Euclidean norm of the translation part of T^m_n, denoted ‖t_mn‖_3. We set a translation norm threshold t_thres for ‖t_mn‖_3 and a rotation angle threshold θ_thres for θ_mn. If either exceeds its threshold, the difference between the two keyframes is considered too large; hence, a new keyframe is initialized.
In summary, we select a new keyframe that has good image alignment quality with respect to the last keyframe and an acceptable pose difference with the last keyframe. When γ < γ_thres, or ‖t_mn‖_3 > t_thres, or θ_mn > θ_thres, we set the last frame n−1 to be a new keyframe m+1.
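The keyframe decision can be summarized in a few lines. In the sketch below, the two inner products are assumed to be supplied by the CVO registration, and the threshold values are placeholders (the values actually used are listed in Table 1):

```python
import numpy as np

def rotation_angle(T):
    """Absolute rotation angle of a 4x4 transform: theta = |arccos((Tr(R) - 1) / 2)|."""
    c = (np.trace(T[:3, :3]) - 1.0) / 2.0
    return abs(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards against round-off

def need_new_keyframe(ip_kf_frame, ip_kf_next, T_m_n,
                      gamma_thres=0.3, t_thres=0.3, theta_thres=np.deg2rad(30.0)):
    """Spawn a new keyframe if any criterion is violated.

    ip_kf_frame: <f_Xm, T.f_Xn>_H for the current frame n (alignment quality).
    ip_kf_next:  <f_Xm, T.f_Xm+1>_H for the frame right after the keyframe (reference).
    T_m_n:       4x4 relative transform between the keyframes.
    All thresholds here are illustrative placeholders.
    """
    gamma = ip_kf_frame / ip_kf_next
    t_norm = np.linalg.norm(T_m_n[:3, 3])
    return gamma < gamma_thres or t_norm > t_thres or rotation_angle(T_m_n) > theta_thres
```

Because γ is a ratio of inner products from the same keyframe, the decision is insensitive to the absolute scale of the inner product values, which is what allows a fixed γ_thres.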

C. LOCAL POSE GRAPH UPDATE
Along with the tracking process, we also maintain and update a local pose graph, which is similar to that of DVO SLAM [24]. Every time the Keyframe-based tracking of the current frame completes, a new node is created in the local pose graph. The pose T^m_n is set to be the measurement of the edge between the keyframe node and the current frame node, and the edge between the last frame node and the current frame node is set to T^{n−1}_n. When the current local pose graph is complete, a new keyframe is created and a new local pose graph is initialized.

V. LOOP-CLOSURE DETECTION
To correct accumulated tracking drift, we perform place recognition and generate extra relative pose constraints among similar keyframes, as well as local mapping to establish the association between map points and their visible keyframes. A schematic pipeline is shown in Figure 2 (the blue dashed box).

A. PLACE RECOGNITION
When a new keyframe is selected, similar to ORB-SLAM2 [11], we extract ORB features from its image and compute for it a bag-of-words vector as defined in [38]. Once the current local pose graph in tracking completes, place recognition is performed between the current keyframe and all other keyframes except the last keyframe.
To measure the similarity between the image of the current keyframe and those of other keyframes, we compute the ℓ1-score [38], s(v_i, v_j), between the bag-of-words vector of the current keyframe, v_i, and that of each previous keyframe except the last keyframe, v_j, as

    s(v_i, v_j) = 1 − (1/2) ‖ v_i/‖v_i‖_1 − v_j/‖v_j‖_1 ‖_1.  (10)

We accept at most 10 previous keyframes with the greatest ℓ1-scores with respect to the current keyframe as loop-closure candidates.
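A minimal sketch of this similarity and of the candidate selection, assuming the standard ℓ1-normalized score used by the DBoW family of vocabularies:

```python
import numpy as np

def l1_score(v1, v2):
    """L1 similarity between bag-of-words vectors:
    s = 1 - 0.5 * || v1/|v1|_1 - v2/|v2|_1 ||_1, which lies in [0, 1]."""
    a = v1 / np.sum(np.abs(v1))
    b = v2 / np.sum(np.abs(v2))
    return 1.0 - 0.5 * np.sum(np.abs(a - b))

def top_candidates(v_cur, prev_vecs, k=10):
    """Indices of the (at most) k previous keyframes with the greatest L1 score."""
    scores = [l1_score(v_cur, v) for v in prev_vecs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]
```

Identical vectors score 1.0 and vectors with disjoint visual-word support score 0.0, so a fixed acceptance cutoff behaves consistently across scenes.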

B. INITIAL RELATIVE POSE ESTIMATION
For each candidate keyframe, we match its ORB features to those of the current keyframe. Inlier matches are determined by a RANSAC-based algorithm, described in [39], that checks whether both the 2D and 3D reprojection errors of each match are below given thresholds (σ_H and σ_T, respectively). In each iteration, 4 pairs of matched features are selected randomly; we compute a homography H from the 2D coordinates of the feature keypoints, and a rigid body transformation T that aligns the point clouds constructed from the matched features and their corresponding depth values. H and T are then used to compute the 2D and 3D reprojection errors, respectively. We use the SVD method described in [40] to compute T, which minimizes the distance between two sets of points with known correspondences. If the largest number of inlier matches is less than 10 after RANSAC, the loop-closure candidate is rejected. Otherwise, the transformation T_max that corresponds to the largest number of inlier matches, with keypoints {x_k}_{k=1,...,n} and {y_k}_{k=1,...,n}, is refined by solving the following problem to obtain the initial relative pose estimate T^i_j from the current keyframe i to the previous keyframe j:

    T^i_j = arg min_{T ∈ SE(3)} Σ_{k=1}^{n} ‖ x_k − π_D( T π_D^{−1}(y_k) ) ‖^2,  (11)

where π_D^{−1} maps a pixel to its corresponding 3D point in the local camera frame using the depth measurement D, and π_D denotes the inverse process (projection of a 3D point to a pixel).
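The closed-form rigid alignment of two point sets with known correspondences can be sketched with a common SVD-based solution (the centroid-and-SVD construction; the details of [40] may differ):

```python
import numpy as np

def rigid_align_svd(P, Q):
    """Least-squares R, t such that Q_i ~ R P_i + t for matched rows of P and Q.

    P, Q: (n,3) arrays of corresponding 3D points.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)     # centroids
    H = (P - cp).T @ (Q - cq)                   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                    # fix an improper (reflection) solution
        Vt[-1] *= -1.0
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t
```

Inside RANSAC, this is solved on each random 4-point sample, and the resulting T is scored by the 3D reprojection error over all matches.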

C. LOCAL MAP UPDATE FOR THE CURRENT KEYFRAME
If there are more than 10 inlier matches between the current keyframe and a loop-closure candidate, we try creating new map points and check the visibility of existing map points in the current keyframe. If neither feature in a match corresponds to an existing map point, we follow the method in ORB-SLAM2 [11] and try triangulating the match to create a new map point using the poses of the two keyframes. If the feature in the loop-closure candidate keyframe corresponds to an existing map point but that in the current keyframe does not, we check the projection error of the existing map point into the current keyframe to decide its visibility. Finally, we store a list of the best covisible keyframes (up to 10) that share the most covisible map points with the current keyframe.
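The visibility check for an existing map point reduces to a reprojection test. The sketch below, with an assumed pinhole camera model and a placeholder error threshold, illustrates the idea:

```python
import numpy as np

def project(K, T_C_W, p_W):
    """Project world point p_W into a camera with pose T_C_W (world -> camera, 4x4)
    and 3x3 intrinsics K. Returns the pixel coordinates and the depth in the camera."""
    p_C = T_C_W[:3, :3] @ p_W + T_C_W[:3, 3]
    u = K @ (p_C / p_C[2])
    return u[:2], p_C[2]

def is_visible(K, T_C_W, p_W, keypoint, err_thres=5.0):
    """Accept the map point if it projects in front of the camera and lands
    near the matched keypoint. err_thres (pixels) is a placeholder value."""
    uv, depth = project(K, T_C_W, p_W)
    return depth > 0 and np.linalg.norm(uv - keypoint) < err_thres
```

Points that fail this test are left unassociated rather than forced into the covisibility bookkeeping, which keeps bad loop-closure matches from polluting the map.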

D. RELATIVE POSE REFINEMENT
Since the number of ORB feature matches is limited, the corresponding point clouds are often sparse. To get a more accurate estimate of the pose, we adopt a sparse-to-dense approach similar to [41] and refine the initial estimate T^i_j with CVO, which registers the much denser semi-dense point clouds. The refined pose T_CVO is the solution to the following problem, initialized at T^i_j:

    T_CVO = arg max_{h ∈ SE(3)} ⟨f_{X_i}, h.f_{X_j}⟩_H.  (12)
Before we decide to add a loop-closure edge into the global pose graph, we conduct quality control. In KF-CVO, we use the inner product of CVO functions as a proxy for the alignment quality. We also use it here and define a parameter α as

    α = ⟨f_{X_i}, T_CVO . f_{X_j}⟩_H − max{ ⟨f_{X_i}, f_{X_j}⟩_H, ⟨f_{X_i}, T_rel . f_{X_j}⟩_H },  (13)

where ⟨f_{X_i}, f_{X_j}⟩_H corresponds to the case where the two frames are identical, T_rel = T^i_W T^W_j, and T^W_i and T^W_j are the global pose estimates of the two loop-closure keyframes. If α > 0, it means that T_CVO has better image alignment quality in the sense of the CVO inner product than the other transformations. In addition, we also define another parameter β:

    β = ⟨f_{X_i}, T_CVO . f_{X_j}⟩_H / ( ‖f_{X_i}‖_H ‖T_CVO . f_{X_j}‖_H ).  (14)
β is the cosine of the angle between the two CVO functions after registration; hence, a very small β value indicates a large angle difference and poor alignment quality of T_CVO. When α > 0 and β ≥ β_thres, we accept the loop-closure for incorporation into the global pose graph.
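The resulting acceptance test can be sketched as follows; the exact set of competing inner products entering α, and the value of β_thres, are assumptions here rather than the paper's settings:

```python
import numpy as np

def accept_loop_closure(ip_cvo, competing_ips, norm_i, norm_hj, beta_thres=0.15):
    """Quality control for a candidate loop-closure.

    ip_cvo:        <f_Xi, T_CVO.f_Xj>_H for the refined transform.
    competing_ips: inner products of the alternative alignments
                   (e.g. identity and the odometry-derived relative pose).
    norm_i, norm_hj: RKHS norms of f_Xi and T_CVO.f_Xj.
    beta_thres is an illustrative placeholder.
    """
    alpha = ip_cvo - max(competing_ips)   # margin over the best competing alignment
    beta = ip_cvo / (norm_i * norm_hj)    # cosine of the angle between the functions
    return alpha > 0.0 and beta >= beta_thres
```

Since every inner product here is nonnegative, β behaves like a normalized alignment score in [0, 1], while α checks that the refinement actually improved on the alternatives.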

VI. GRAPH OPTIMIZATION A. INFORMATION MATRIX FOR CONSTRAINTS
Since CVO registration is equivalent to solving equation (1), the Hessian matrix of the inner product function F(h) at the solution captures the local curvature of the objective, and we use it as the information matrix that serves as the weight of the corresponding CVO measurement in the optimization. The closed-form derivation of the Hessian matrix is given in [42]. As for the information matrix of a map point measurement, we use one similar to that of ORB-SLAM2 [11].

B. LOCAL POSE GRAPH OPTIMIZATION
When the current local pose graph mentioned in Section IV completes, an optimization is performed on it, which is equivalent to solving

    arg min_{{T^W_i}, i ∈ P} Σ_{(i,j) ∈ S} ρ_Cauchy( δ, (e^P_ij)^⊤ P_ij e^P_ij ),  (15)

where P is the set of all frames in the local pose graph, S is the set of pairs of frames with a relative pose measurement T_ij, P_ij is the information matrix of T_ij, and e^P_ij is the error (residual) between the pose estimates of frames i and j and the measurement T_ij. To suppress the influence of large error terms and make the optimization more robust to outliers, we add a Cauchy robust kernel using the g2o library [43], where the Cauchy loss function is ρ_Cauchy(δ, x) = δ^2 log(x/δ^2 + 1); δ can be tuned to adjust where the function becomes sublinear.
After the optimization, we store the relative poses of all other frames with respect to the current keyframe, and only insert the current keyframe into the global pose graph.
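The Cauchy loss used in the pose graph optimization is simple to write down and check numerically:

```python
import numpy as np

def cauchy_loss(delta, x):
    """rho_Cauchy(delta, x) = delta^2 * log(x / delta^2 + 1).

    Near x = 0 it behaves like x (so well-fit residuals are unaffected);
    for large x it grows only logarithmically, capping outlier influence.
    """
    return delta**2 * np.log(x / delta**2 + 1.0)
```

A larger δ pushes the transition to sublinear growth out to larger residuals, so δ is effectively a soft inlier/outlier boundary on the squared error.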

C. LOCAL MAP OPTIMIZATION
After loop-closure detection, we perform a joint bundle adjustment & pose graph optimization on the current local map, which includes the local keyframes P_1 and the local map points L. P_1 includes all keyframes from the current keyframe k to the farthest keyframe that either has a loop-closure with k or is one of the best covisible keyframes of k. L includes all map points seen by k and its best covisible keyframes. The optimization is equivalent to solving

    arg min Σ_{(i,j) ∈ F} (e^P_ij)^⊤ P_ij e^P_ij + Σ_{(m,n) ∈ G} (e^L_mn)^⊤ L_mn e^L_mn,  (16)

where F is the set of pairs of keyframes in P_1 with a relative pose measurement, and G is the set of pairs of map points in L and their visible keyframes, denoted P_2. For keyframes in P_2 \ P_1, their poses remain fixed during the optimization. e^L_mn is the 2D error when projecting map point n into its visible keyframe m, and L_mn is the information matrix of the corresponding projection measurement.

TABLE 1. Parameters used in experiments. The best-match feature threshold in the table is the ratio of the descriptor distance between a feature and its best match to the distance between the feature and its second-best match.

D. FINAL GLOBAL POSE GRAPH OPTIMIZATION
When the SLAM process ends, a final pose graph optimization that includes all keyframes and the relative pose measurements among them is performed. After that, the poses of all normal frames are computed from the optimized keyframe poses and their relative poses to their keyframes.

VII. EVALUATION EXPERIMENTS
To evaluate the performance of KF-CVO and CVO SLAM, we conduct experiments using the TUM RGB-D Dataset [44] and the ETH3D RGB-D Dataset [25]. We also choose several open-source state-of-the-art methods, introduced in the following paragraphs, as baselines, and run them on the same evaluation experiments and device as the proposed methods. All experiments are performed on a PC with an AMD Ryzen Threadripper 3970X CPU and an Nvidia RTX 3090 GPU; the GPU is only used for running BAD SLAM [25]. Table 1 shows the parameters used by KF-CVO and CVO SLAM in all evaluation experiments.

A. KF-CVO PERFORMANCE
As mentioned in Section IV, we disable the other three blocks (''Update local pose graph,'' ''Loop closure detection,'' and ''Pose graph optimization'' in Figure 2) to study the pure KF-CVO odometry performance. We choose CVO [19], DVO [15], and KF-DVO as odometry baseline methods. KF-DVO was introduced and used in DVO-SLAM [24]. Table 2 and Table 3 show the results on the fr1 and fr3 sequences of the TUM Dataset, respectively. KF-CVO achieves average improvements of 27% and 21% in translational and rotational RPE over CVO, and also outperforms DVO and KF-DVO in general.

B. CVO SLAM PERFORMANCE
We use three state-of-the-art RGB-D SLAM methods as baselines: ORB-SLAM2 [11], ORB-SLAM3 [12], and BAD SLAM [25]. To obtain a comprehensive understanding of the performance of CVO SLAM across a variety of operating scenarios, we evaluate it and the baselines on 27 sequences from the TUM Dataset and 61 sequences from the ETH3D Dataset. Table 4 summarizes the number of sequences successfully completed by each SLAM method, and Figure 3 shows the corresponding cumulative error curves. Some sequences in the ETH3D Dataset contain images that are completely dark, where systems using feature-based methods (ORB-SLAM2 and ORB-SLAM3) or a semi-direct method (CVO SLAM) can hardly extract any useful information and consequently fail; thus BAD SLAM has a generally superior performance on these dark sequences. However, CVO SLAM can still work reliably in most cases. CVO SLAM achieves the highest success rates on both TUM and ETH3D, and the consistent growth of its cumulative error curves compared to the other methods also indicates its robust performance across a variety of scenes. Figure 4 and Figure 5 show the results on some difficult sequences from the TUM and ETH3D Datasets, respectively. In the TUM fr1/teddy sequence, rapid camera motions cause blurry images. CVO SLAM and ORB-SLAM2 precisely capture most camera motions; ORB-SLAM3 clearly deviates from the groundtruth in the lower-left part of the x-y plane plot; BAD SLAM completely fails on this sequence. The 3D plot shows estimated camera poses at a timestamp with rapid camera movement along the z-axis, which can also be seen in the RGB-D images. CVO SLAM provides a good estimate that is very close to the groundtruth. In addition, the trajectory estimated by CVO SLAM near the plotted frame is more precise than those of ORB-SLAM2 and ORB-SLAM3.
In the TUM fr3/structure_notexture_far sequence, the camera moves in an environment that consists mainly of texture-less planar surfaces. As feature-based methods, ORB-SLAM2 and ORB-SLAM3 lose tracking on this sequence and only finish a small portion, due to the difficulty in extracting features. Using direct methods, CVO SLAM and BAD SLAM successfully complete these sequences, and it can be seen in the 3D plot that CVO SLAM gives more accurate pose trajectory estimates than BAD SLAM. Similar to TUM fr1/teddy, rapid camera motions also exist in the ETH3D mannequin_7 sequence, and are more significant than those in TUM fr1/teddy. CVO SLAM is the only system that finishes this sequence, and the resulting trajectory is mostly accurate.

TABLE 3. The RMSE of Relative Pose Error (RPE) results on TUM fr3 sequences. The sequence name is in the format (structure, texture, distance). ✓ and × mean with and without, respectively. The results show that KF-CVO has an overall better odometry accuracy in non-structured or non-textured environments than other methods.
In the ETH3D sfm_lab_room_1 sequence, small jitters accompanying the camera motion cause blurry images. CVO SLAM and ORB-SLAM2 are robust in this case and maintain good trajectory accuracy. ORB-SLAM3 fails under these small perturbations, and the performance of BAD SLAM is also affected.
The ETH3D vicon_light_1 and vicon_light_2 sequences were recorded while the camera was facing strong light, and thus contain many overexposed images. In these cases, CVO SLAM outperforms the other state-of-the-art systems and still gives reliable pose and trajectory estimates.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we report on a novel RGB-D SLAM system that takes advantage of the robust Continuous Direct Sparse Visual Odometry (CVO) and enhances its accuracy with place recognition and bundle adjustment using ORB features. We design an intrinsic keyframe selection technique and develop Keyframe-based CVO (KF-CVO), which has better tracking performance than CVO and the state-of-the-art DVO. We extend KF-CVO to an RGB-D SLAM system via loop-closure detection and joint bundle adjustment & pose graph optimization. Comprehensive evaluations on the TUM RGB-D Dataset and the ETH3D RGB-D Dataset show that CVO SLAM achieves an overall higher success rate than state-of-the-art RGB-D SLAM systems, and is more robust to difficult working conditions such as rapid camera motion, texture-less environments, and overexposed images.
Currently, we use the original implementation of CVO [19], which takes about 0.2 seconds (5 Hz) on average to perform a single frame-to-frame registration [45]. Consequently, our current implementation is not real-time. For future work, we will work on accelerating the system with a GPU. Besides, since CVO is a locally optimal algorithm, adding an inertial measurement unit (IMU) may help enhance its performance by improving the initialization of its point cloud registration.