Tightly-Coupled Monocular Visual-Odometric SLAM Using Wheels and a MEMS Gyroscope

In this paper, we present a novel tightly coupled probabilistic monocular visual-odometric simultaneous localization and mapping (VOSLAM) algorithm using wheels and a MEMS gyroscope, which can provide accurate, robust, and long-term localization for ground robots. First, we present a novel odometer preintegration theory on manifold; it integrates the wheel encoder measurements and gyroscope measurements to a relative motion constraint that is independent of the linearization point and carefully addresses the uncertainty propagation and gyroscope bias correction. Based on the preintegrated odometer measurement model, we also introduce the odometer error term and tightly integrate it into the visual optimization framework. Then, in order to bootstrap the VOSLAM system, we propose a simple map initialization method. Finally, we present a complete localization mechanism to maximally exploit both sensing cues, which provides different strategies for motion tracking when: 1) both measurements are available; 2) visual measurements are not available; and 3) wheel encoders experience slippage, thereby ensuring the accurate and robust motion tracking. The proposed algorithm is evaluated by performing extensive experiments, and the experimental results demonstrate the superiority of the proposed system.


I. INTRODUCTION
Simultaneous localization and mapping(SLAM) from onboard sensors is a fundamental and key technology for autonomous mobile robot to safely interact within its workspace.SLAM is a technique that builds a globally consistent representation of the environment(i.e. the map) and estimates the state of the robot in the map simultaneously.Because SLAM can be used in many practical applications, such as autonomous driving, virtual or augmented reality and indoor service robots, it has received considerable attention from Robotics and Computer Vision communities.
In this paper, we propose a novel tightly-coupled probabilistic optimization-based monocular visual-odometric SLAM(VOSLAM) system.By combining a monocular camera with wheels and a MEMS gyroscope, the method provides accurate and robust motion tracking for domestic service robots moving on a plane, e.g.cleaning robot, nursing robot and restaurant robot waiter.A single camera provides rich information about the environment, which allows for building 3D map, tracking camera pose and recognizing places already visited.However, the scale of the environment can not be determined using monocular camera, and visual tracking system is sensitive to motion blur, occlusions and illumination changes.Most ground robots are equipped with wheel S. Piao(piaosh@hit.edu.cn) and M. Quan(15b903042@hit.edu.cn) are the corresponding authors.
encoders that provide precise and stable translational measurements of each wheel at most of the time, the measurements contain the absolute scale information.Whereas, the wheel encoder cannot provide accurate self rotational estimates and occasionally provides faulty measurements.In addition, the MEMS gyroscope is a low cost and commercially widely used sensor, and provides accurate and robust inter-frame rotational estimate.However, the estimated rotation is noisy and diverges even in few seconds.Based on the analysis of each sensor, we can know that wheel encoder and gyroscope are complementary to the monocular camera sensor.Therefore, tightly fusing the measurements from wheel encoder and gyroscope to the monocular visual SLAM can not only dramatically improve the accuracy and robustness of the system, but also recover the scale of the environment.In the following, we will call the wheel encoder and MEMS gyroscope the odometer.
In order to tightly fuse the odometer measurements to the visual SLAM system in the framework of graph-based optimization, it is important to provide the integrated odometer measurements between the selected keyframes.Therefore, motivated by the inertial measurement unit(IMU) preintegration theory proposed in [1], we present a novel odometer preintegration theory and corresponding uncertainty propagation and bias correction theory on manifold.The preintegration theory integrates the measurements from the wheel encoder and gyroscope to a single relative motion constraint that is independent of the change of the linearization point, therefore the repeated computation is eliminated.Then, based on the proposed odometer preintegration model, we formulate the new preintegrated odometer factor and seamlessly integrate it in a visual-odometric pipeline under the optimization framework.
Furthermore, both visual and odometer measurements are not always available.Therefore, we present a complete visualodometric tracking framework to ensure the accurate and robust motion tracking in different situations.For the situation where both measurements are available, we maximally exploit the both sensing cues to provide accurate motion tracking.
For the situation where visual information is not available, we use odometer measurements to improve the robustness of the system and offer some strategies to render the visual information available as quick as possible.In addition, for the critical drawback of the wheel sensor, we provide a strategy to detect and compensate for the slippage of the wheel encoder.In this way, we can track the motion of the ground robot accurately and robustly.
The final contribution of the paper is the extensive evaluation of our system.Extensive experiments are performed to demonstrate the accuracy and robustness of the proposed algorithm.The presented algorithm is shown in Fig. 1, and the details are presented in later sections.

II. RELATED WORK
There are extensive scholarly works on monocular visual SLAM, these works rely on either filtering methods or nonlinear optimization methods.Filtering based approaches require fewer computational resources due to the continuous marginalization of past state.The first real-time monocular visual SLAM -MonoSLAM [2] is an extended kalman filter(EKF) based method.The standard way of computing Jacobian in the filtering leads the system to have incorrect observability, therefore the system is inconsistent and gets slightly lower accuracy.To solve this problem, the first-estimates Jacobian approach was proposed in [3], which computes Jacobian with the first-ever available estimate instead of different linearization points to ensure the correct observability of the system and thereby improve the consistency and accuracy of the system.In addition, the observability-constrained EKF [4] was proposed to explicitly enforce the unobservable directions of the system, hence improving the consistency and accuracy of the system.
On the other hand, nonlinear optimization based approaches can achieve better accuracy due to it's capability to iteratively re-linearize measurement models at each iteration to better deal with their nonlinearity, however it incurs a high computational cost.The first real-time optimization based monocular visual SLAM system is PTAM [5] proposed by Klein and Murray.The method achieves real-time performance by dividing the SLAM system into two parallel threads.In one thread, the system performs bundle adjustment over selected keyframes and constructed map points to obtain accurate map of the environment.In the other parallel thread, the camera pose is tracked by minimizing the reprojection error of the features that match the reconstructed map.Based on the work of PTAM, a versatile monocular SLAM system ORB-SLAM [6] was presented.The system introduced the third loop closing thread to eliminate the accumulated error when revisiting an already reconstructed area, it is achieved by taking advantage of bag-of-words [7] and a 7 degree-of-freedom(dof) pose graph optimization [8].
In addition, according to the definition of visual residual models, monocular SLAM can also be categorized into feature based approaches and direct approaches.The above mentioned methods are all feature based approaches, which is quite mature and able to provide accurate estimate.However, the approaches fail to track in poorly textured environments and need to consume extra computational resources to extract and match features.In contrary, direct methods work on pixel intensity and can exploit all the information in the image even in some places where the gradient is small.Therefore, direct methods can outperform feature based methods in low texture environment and in the case of camera defocus and motion blur.DTAM [9], SVO [10] and LSD-SLAM [11] are direct monocular SLAM systems, which builds the dense or semi-dense map from monocular images in real-time, however its accuracy is still lower than the feature based semi-dense mapping technique [12].
The monocular visual SLAM is scale ambiguous and sensitive to motion blur, occlusions and illumination changes.Therefore, based on the framework of monocular SLAM, it is often combined with other odometric sensors, especially IMU sensor, to achieve accurate and robust tracking system.Tightly-coupled visual-odometric SLAM can also be categorized into filtering based methods and optimization based methods, where visual and odometric measurements are fused from the raw measurement level.Papers [13]- [17] are filtering based monocular visual-inertial SLAM, the approaches use the inertial measurements to accurately predict the motion movement between two consecutive frames.An elegant example for filtering based visual-inertial odometry(VIO) is the MSCKF [16], which exploits all the available geometric information provided by the visual measurements with the computational complexity only linear in the number of features, it is achieved by excluding point features from the state vector.
OKVIS [18] is an optimization based monocular visualinertial SLAM, which tightly integrates the inertial measurements in the keyframe-based visual-inertial pipeline under the framework of graph optimization.However, in this system, the IMU integration is computed repeatedly when the linearization point changes.Therefore in order to eliminate this repeated computation, Forster et al. presented an IMU preintegration theory, and tightly integrated the preintegrated IMU factor and visual factor in a fully probabilistic manner in paper [1].Later, a real-time tightly-coupled monocular visual-inertial SLAM system -ORB-VISLAM [19] was presented.The system can close loop and reuse the previously estimated 3D map, therefore achieve high accuracy and robustness.Recently, another tightly-coupled monocular visual-inertial odometry was proposed in [20] [21], it provides accurate and robust motion tracking by performing local bundle adjustment(BA) for each frame and its capability to close loop.
There are also several works on the visual-odometric SLAM that fuses visual measurements and wheel encoder measurements.In [22], wheel encoder measurements are combined to the system of visual motion tracking for accurate motion prediction, thereby the true scale of the system is recovered.In addition, paper [23] proved that VINS has additional unob-servable directions when a ground robot moves along straight lines or circular arcs.Therefore a system fusing the wheel encoder measurements to the VINS estimator in a tightlycoupled manner was proposed to render the scale of the system observable.

III. PRELIMINARIES
We begin by briefly defining the notations used throughout the paper.We employ (•) W to denote the world reference frame, (•) O k , (•) C k and (•) B k to denote the wheel odometer frame, camera frame and inertial frame for the k th image.In following, we employ R F1 F2 ∈ SO(3) to represent rotation from frame {F 2 } to {F 1 } and p F1 F2 ∈ R 3 to describe the 3D position of frame {F 2 } with respect to the frame {F 1 }.
The rotation and translation between the rigidly mounted wheel encoder and camera sensor are R C O ∈ SO(3) and p C O ∈ R 3 respectively, and R O B ∈ SO(3) denotes the rotation from the inertial frame to the wheel encoder frame, these parameters are obtained from calibration.In addition, the pose of the k th image is the rigid-body transformation , and the 3D position of the j th map point in the global frame {W } and the camera frame {C k } are denoted as f W j ∈ R 3 and f C k j ∈ R 3 respectively.In order to provide a minimal representation for the rigidbody transformation during the optimization, we use a vector ξ ∈ R 3 computed from the Lie algebra of SO(3) to represent the over-parameterized rotation matrix R. The Lie algebra of SO(3) is denoted as so(3), which is the tangent space of the manifold and coincides with the space of 3×3 skew symmetric matrices.The logarithm map associates a rotation matrix R ∈ SO(3) to a skew symmetric matrix: where (•) ∧ operator maps a 3-dimensional vector to a skew symmetric matrix, thus the vector ξ can be computed using inverse (•) ∨ operator: Inversely, the exponential map associates the Lie algebra so(3) to the rotation matrix R ∈ SO(3): The input of our estimation problem is a stream of measurements from the monocular camera and the odometer.The visual measurement is a set of point features extracted from the captured intensity image where z kl is the corresponding feature measurement, and σ kl is the 2 × 1 measurement noise with covariance Σ C kl .The projection function π is determined by the intrinsic parameters of the camera, which is known from calibration.
In addition, the gyroscope of the odometer measures the angular velocity ω k at time-step k, the measurement is assumed to be affected by a slowly varying sensor bias b g k with covariance Σ bg and a discrete-time zero-mean Gaussian white noise η gd with covariance Σ gd : The wheel encoder of the odometer measures the traveled distance Dl k and Dr k of the both wheels from time-step k − 1 to k, which is assumed to be affected by a discrete-time zeromean Gaussian white noise η ed with variance Σ ed : Therefore, the measured 3D position of frame {O k } with respect to frame {O k−1 } from wheel encoder is: where constitute the pose of frame {O k−1 }, and R O k W and p O k W constitute the pose of frame {O k }.In many cases, the ground robot is moving on a plane.The motion on a plane has 3 dof in contrast to 6 dof of 3D motion, i.e. the roll, pitch angle and translation on zaxis of frame {O k } in the frame of physical plane should be close to zero.Since the additional information can improve the accuracy of the system, we also provide planar measurement pl k = [0, 0, 0] T ∈ R 3 with covariance Σ pl for each frame, where the first two elements correspond to the planar rotational measurement and the third element corresponds to the planar translational measurement.

IV. TIGHTLY-COUPLED VISUAL-ODOMETRIC NONLINEAR
OPTIMIZATION ON MANIFOLD We use K to denote the set of successive keyframes from i to j, and L to denote all the landmarks visible from the keyframes in K. Then the variables to be estimated in the window of keyframes from i to j is: where is the state of the keyframe k.We denote the visual measurements of L at the keyframe i as Z Ci = { z il } l∈L .In addition, we denote the odometer measurements obtained between two consecutive keyframes i and j as O ij = { ω t , Dl t , Dr t } time−step(i)≤t<time−step(j) .Therefore, the set of measurements collected for optimizing the state X is:

A. Maximum a Posteriori Estimation
The optimum value of state X is estimated by solving the following maximum a posteriori (MAP) problem: which means that given the available measurements Z, we want to find the best estimate for state X .Assuming measurements Z are independent, then using Bayes' rule, we can rewrite p(X |Z) as: The equation can be interpreted as a factor graph.The variables in X are corresponding to nodes in the factor graph.The terms p(X 0 ), p(O ij |x i , x j ), p( z il |x i , f W l ) and p( pl i |x i ) are called factors, which encodes probabilistic constraints between nodes.A factor graph representing the problem is shown in Fig. 2.
The MAP estimate is equal to the minimum of the negative log-posterior.Under the assumption of zero-mean Gaussian noise, the MAP estimate in (10) can be written as the minimization of sum of the squared residual errors: where r 0 , r Oij , r C il and r pl i are the prior error, odometer error, reprojection error and plane error respectively, as well as Σ 0 , Σ Oij , Σ C il and Σ pl are the corresponding covariance matrices, and ρ is the Huber robust cost function.In the following subsections, we provide expressions for these residual errors and introduce the Gauss-Newton optimization method on manifold.

B. Preintegrated Odometer Measurement
In this section, we derive the odometer preintegration between two consecutive keyframes i and j by assuming the gyro bias of keyframe i is known.We firstly define the rotation increment ∆R ij and position increment ∆p ij in the wheel odometer frame {O i } as: Then, using the first-order approximation and dropping higher-order noise terms, we split each increment in (13) to preintegrated measurement and its noise.For rotation, we have: where . Therefore, we obtain the preintegrated rotation measurement: For position, we have: Therefore, we obtain the preintegrated position measurement:

C. Noise Propagation
We start with rotation noise.From (14), we obtain: The rotation noise term δφ ij is zero-mean and Gaussian, since it is a linear combination of zero-mean white Gaussian noise η gd .
Furthermore, from (16), we obtain the position noise: The position noise δp ij is also zero-mean Gaussian noise, because it is a linear combination of the noise η ψd and the rotation noise δφ ij .
We write ( 18) and ( 19) in iterative form, then the noise propagation can be written in matrix form as: or more simply: Given the linear model ( 21) and the covariance Σ η d ∈ R 6×6 of the odometer measurements noise η d , it is possible to compute the covariance of the odometer preintegration noise iteratively: with initial condition Σ ii = 0 6×6 .Therefore, we can fully characterize the preintegrated odometer measurements noise as:

D. Bias update
In the previous section, we assumed that the gyro bias b gi is fixed.Given the bias change δb g , we can update the preintegrated measurements using a first-order approximation.For preintegrated rotation measurement: where For preintegrated position measurement: where

E. Preintegrated Odometer Measurement Model
From the geometric relations between two consecutive keyframes i and j, we get our preintegrated measurement model as: Therefore, preintegrated odometer residual

F. Gyro Bias Model
Gyro bias is slowly time-varying, so the relation of gyro bias between two consecutive keyframes i and j is: where η b gd is the discrete-time zero-mean Gaussian noise with covariance Σ b gd .Therefore, we can express the gyro bias residual r bg ∈ R 3 as:

G. Odometer Factor
Given the preintegrated odometer residual in (27) and the gyro bias residual in (29), the odometer error term in ( 12) is:

H. Visual Factor
Through the measurement model in (4), the l th map point expressed in the world reference frame {W } can be projected onto the image plane of the i th keyframe as: Therefore, the reprojection error r C il ∈ R 2 for the l th map point seen by the i th keyframe is:

I. Plane Factor
The x-y plane of the first wheel encoder frame {O 1 } coincides with the physical plane, so the planar measurement in section III corresponds to that the roll, pitch angle and translation on z-axis between frame {O 1 } and {O k } should be close to zero.Therefore, we express the plane factor r pl ∈ R 3 as:

J. On-Manifold Optimization
The MAP estimate in ( 12) can be written in general form on manifold M as: We use the retraction approach to solve the optimization problem on manifold.The retraction R is a bijective map between the tangent space and the manifold.Therefore, we can re-parameterize our problem as follows: where δX is an element of the tangent space and the minimum dimension error representation.The objective function F(R X (δX )) is defined on the Euclidean space, so it is easy to compute Jacobian.For the rigid-body transformation SE(3), the retractraction at where δϕ ∈ R 3 , δp ∈ R 3 .However, since the gyro bias and position of map points are already in a vector space, the corresponding retraction at b g and f ω are: where δb g ∈ R 3 and δf ω ∈ R 3 .We adopt the Gauss-Newton algorithm to solve (34) since a good initial guess X can be obtained.Firstly, we linearize each error function in (34) with respect to δX by its first order Taylor expansion around the current initial guess X : where, J mn is the jacobian of r mn R X (δX ) with respect to δX , which is computed in X , and r mn = r mn ( X ) = r( X m , X n , Z mn ).Substituting (38) to the each error term F mn of (34), we obtain: where r, Σ −1 , J are formed by stacking r mn , Σ −1 mn , J mn respectively.Then we take the derivative of F R X (δX ) with respect to δX and set the derivative to zero, which leads to the following linear system: Finally, the state is updated by adding the increment δX * to the initial guess X : Following the scheme of the Gauss-Newton algorithm, we solve (34) by iterating linearization in (38), the computation of increments in (40), and the state update in (41) until a given termination criterion is met.Moreover, the previous solution is used as the initial guess for each iteration.

V. MONOCULAR VISUAL-ODOMETRIC SLAM
Our monocular visual-odometric SLAM system is inspired by the ORB-SLAM [6] and visual-inertial ORB-SLAM [19] methods.Fig. 1 shows an overview of our system.In this section, we detail the main changes of our visual-odometric SLAM system with respect to the referenced system.

A. Map Initialization
The map initialization is in charge of constructing an initial set of map points by using the visual and odometer data.Firstly, we extract ORB features in the current frame k and search for feature correspondences with reference frame r.If there are sufficient feature matches, we perform the next step, else we set the current frame as reference frame.The second step is to check the parallax of each correspondence and pick out a set of feature matches F that have sufficient parallax.When the size of F is greater than a threshold, we use the odometer measurements to compute the relative transformation between two frames, and triangulate the matched features F.
Finally, if the size of the successfully created map points is greater than a threashold, a global BA that minimizes all reprojection error, odometer error and plane error in the initial map is applied to refine the initial map.

B. Tracking when Previous Visual Tracking is Successful
Once the initial pose of current frame is predicted using the odometer measurements, the map points in the local map are projected into the current frame and matched with the keypoints extracted from the current frame.Then the pose of current frame is optimized by minimizing the corresponding energy function.Depending on whether the map in backend is updated, the pose prediction and optimization methods are different, which will be described in detail below.In addition, we provide a detection strategy and solution for wheel slippage.The tracking mechanism is illustrated in Fig. 3.

1) Tracking when Map Updated:
When tracking is performed just after a map update in the back-end, we firstly compute the preintegrated odometer measurement between current frame k and last keyframe m.Then the computed relative transformation T O k Om is combined with the optimized pose of last keyframe to predict the initial pose of the current frame.The reason for this state prediction is that the pose estimate of the last keyframe is accurate enough after performing a local or global BA in the back-end.Finally, the state of the current frame k is optimized by minimizing the following energy function: After the optimization, the resulting estimation and Hessian matrix are served as a prior for next optimization.
2) Tracking when no Map Updated: When the map is not changed in the back-end, we compute the preintegrated odometer measurement between current frame k and last frame k − 1, and predict the initial pose of current frame k by integrating the relative transformation T O k O k−1 to the pose of last frame.Then, we optimize the pose of current frame k by performing the nonlinear optimization that minimizing the following objective function: where the residual r prior error term of last frame: where and Σ 0 k−1 are the estimated states and resulting covariance matrix from previous pose optimization.The optimized result is also served as a prior for next optimization.
3) Detecting and Solving Wheel Slippage: Wheel encoder is an ambivalent sensor, it provides a precise and stable relative transformation at most of the time, but it can also deliver very faulty data when the robot experiences slippage.If we perform visual-odometric joint optimization using this kind of faulty data, in order to simultaneously satisfy the constraints of both odometer measurements with slippage and visual measurements, the optimization will lead to a false estimate.Therefore, we provide a strategy to detect and solve this case.We think the current frame k experienced a slippage if the above optimization makes more than half of the original matched features become outliers.Once the wheel slippage is detected, we set slippage flag to current frame and reset the initial pose of current frame k as the pose of last frame k − 1.Then we re-project the map points in the local map and rematch with features on the current frame.Finally, the state of current frame is optimized by only using those matched features: After the optimization, the resulting estimate and Hessian matrix of current frame are served as a prior for next optimization.

C. Tracking when Previous Visual Tracking is Lost
If visual information is not available in current frame, only odometer measurements can be used to compute the pose of the frame.So in order to obtain more accurate pose estimate, we should make the visual information available as early as possible.
Supposing the previous visual tracking is lost, then one of the three cases will happen for the current frame: (1) the robot revisits to an already reconstructed area; (2) the robot visits to a new environment where sufficient map points are newly constructed; (3) the visual features are still unavailable wherever the robot is.For these different situations, we perform different strategies to estimate the pose of the current frame.For case 1, a global relocalization method as done in [6], i.e. using DBOW [7] and PnP algorithm [24], is performed to compute the pose of the current frame and render the visual information available.For case 2, we firstly use the odometer measurements to predict the initial pose of current frame, then project map points seen by last keyframe to the current frame and optimize the pose of current frame using those matched features.For case 3, we use the odometer measurements to compute the pose of the current frame.
When enough features are extracted from the current frame after visual tracking is lost, we firstly think the robot may returned to an already reconstructed environment, therefore perform the global relocaliation method(solution for case 1).However, if the relocalization continuously fails until the second keyframe with enough features is selected to enable the reconstruction of the new map, we think the robot entered into a new environment, thereby the localization in newly constructed map is performed as solution for case 2. We deem the visual information becomes available for motion tracking of current frame when the camera pose is supported by enough matched features.So if the computed pose in case 1 and case 2 is not supported by enough matched features or fewer features are extracted from the current frame, we think the visual information is still unavailable for motion tracking of the current frame and set the pose of current frame according to the solution for case 3.

D. Keyframe Decision
If the visual tracking is successful, we have two criteria for keyframe selection: (1) current frame tracks less than 50% features than last keyframe; (2) Local BA is finished in the back-end.These criteria insert keyframes as many as possible to make visual tracking and mapping to work all the time, thereby ensure a good performance of the system.
In addition, if visual tracking is lost, we insert a keyframe to the back-end when one of the following conditions is satisfied: (1) The traveled distance from the last keyframe is large than a threshold; (2) The relative rotation angle from the last keyframe is beyond a threshold; (3) Local BA is finished in the local mapping thread.These conditions ensure that when the previous map is not available and the robot enters into a new environment where there are enough features, the system can still build new map that are consistent with the previous map.

E. Back-End
The back-end includes the local mapping thread and the loop closing thread.The local mapping thread aims to construct the new map points of the environment and optimize the local map.When new keyframe k is inserted to local mapping thread, we make small changes in convisibility graph update and local BA with respect to paper [19].If the visual tracking of new keyframe k is lost, we update the covisibility graph by adding a new node for keyframe k and an edge connected with the last keyframe to ensure the ability to build new map.In addition, visual-odometric local BA is performed to optimize the last N keyframes(local window) and all points seen by those N keyframes, which is achieved by minimizing the cost function (12) in the window.One thing to note is that the odometer constraint linking to the previous keyframe is only constructed for those keyframes without the slippage flag.The loop closing thread is in charge of eliminating the accumulated drift when returning to an already reconstructed area, it is implemented in the same way as paper [19].

VI. EXPERIMENTS
In the following, we perform a number of experiments to evaluate the proposed approach both qualitatively and quantitatively.Firstly, we perform qualitative and quantitative analysis of our algorithm to show the accuracy of our system in Section VI-A.Then the validity of the proposed strategy for detecting and solving the wheel slippage is demonstrated in Section VI-B.Finally in Section VI-C, we test the tracking performance of the algorithm when the previous visual tracking is lost.The experiments are performed on a laptop with Intel Core i5 2.2GHz CPU and an 8GB RAM, and the corresponding videos are available at: https://youtu.be/EaDTC92hQpc.In addition, our system is able to work robustly in raspberry pi platform that has a Quad Core 1.2GHz Broadcom BCM2837 64bit CPU and 1GB RAM, at the processing frequency of 5Hz.

A. Algorithm Evaluation
We evaluate the accuracy of the proposed algorithm in a dataset provided by the author of [23].The dataset is recorded by a Pioneer 3 DX robot with a Project Tango, and provides 640 × 480 grayscale images at 30 Hz, the inertial measurements at 100Hz and wheel-encoder measurements at 10 Hz.In addition, the dataset also provides the ground truth which is computed from the batch least squares offline using all available visual, inertial and wheel measurements.
We process images at the frequency of 10 Hz, qualitative comparison of the estimated trajectory and the ground truth is shown in Fig. 4. The estimated trajectory and the ground truth are aligned in closed form using the method of Horn [25].We can qualitatively compare our estimated trajectory with the result provided by the approach of Wu. et.al [23] in their figure 6.It is clear that our algorithm produces more accurate trajectory estimate, which is achieved by executing the complete visual-odometric tracking strategies, performing the local BA to optimize the local map and closing loop to eliminate the accumulated error when returning to an already mapped area.Quantitatively, the sequence is 1080m long, and the positioning Root Mean Square Error(RMSE) of our algorithm is 0.606m, it is the 0.056% of the total traveled distance with comparison to 0.25% of the approach [23].

B. Demonstration of Robustness to Wheel Slippage
In the following experiments, we use data recorded from a DIY robot with a OV7251 camera mounted on it to look upward for visual sensing.The sensor suite provides the 640 × 480 grayscale images at the frequency of 30Hz, the wheelodometer and gyroscope measurements at 50 Hz.Since there is no ground truth available, we only do qualitatively analysis.
The wheel slippage experiment is performed in two situations.In the first experiment, we firstly let the ground robot to walk normally, then hold the robot to make it static but the wheel is spinning, and finally let it to normally walk once again.The estimated results in some critical moments are shown in Fig. 5 and Fig. 6.Fig. 5a is the captured image at  the first critical moment when the platform start to experience wheel slippage, and the trajectories estimated by our method and the odometer from the beginning to this moment are shown in Fig. 6a.We can see that both methods can accurately estimate the position of the sensor suite under normal motion.The image and the estimated trajectory obtained at second moment when wheel slippage is over are given in Fig. 5b and Fig. 6b.As evident, the images at first and second moments are almost the same, our method gives the very close pose for these two moments with comparison to the odometer who provides far away positions for these two moments due to the wheel slippage.Thus the validity of the proposed strategy for detecting and solving wheel slippage can be proved.The reconstructed 3D map for the sequence are shown in Fig. 5c, the map is globally consistent, which is achieved by effectively solving the problem of wheel slippage.The situation is also tested in artificial lighting and relatively low texture environment.In Fig. 7 and 8, its intermediate and final results are given, which also demonstrates the robustness of our system to the slippage of wheel encoder.
The second experiment is performed as follows.The sensor suite walks normally at first, then the wheel turns normally, however the platform is moved to another location artificially, and finally it normally walks once again.The test results for the second situation are shown in Fig. 9 and Fig. 10.Before the sensor is moved away, the estimated trajectories from both our method and the odometer are close to each other as shown in Fig. 10a.Fig. 9a and Fig. 9b are the captured images at first moment when platform starts to move and at second moment when the platform has been moved to another location.Comparing to the estimated motion by the odometer, the proposed method gives precise tracking for the movement as shown in Fig. 10b.Thereby, the performance of the proposed strategy for wheel slippage is demonstrated again.

C. Demonstration of Tracking Performance when Previous Visual Tracking is Lost
The tracking performance of our system when previous visual tracking is lost is tested in two sequences, sequence 1 includes the case 1 and case 3 in Section V-C and the sequence 2 includes the case 2 and case 3 in Section V-C.Firstly, we use sequence 1 to test the proposed solution for the case 1 and case 3, the estimated results in some critical moments are shown in Fig. 11 and Fig. 12.The robot firstly moves on areas where enough visual information is available to build a map of the environment shown in Fig. 11a.Then we turn out the lights to make the visual information unavailable.The motion of robot is continuously computed in the period of visual loss as shown in Fig. 12b, which is achieved by using the odometer measurements as solution to case 3. Finally, we turn on the lights to make the robot to revisit an already reconstructed area, at which moment, the global relocalization is triggered.The reconstructed map at the end of sequence is shown in Fig. 11c, it is globally consistent without closing the loop.Therefore, we can demonstrate the validity of the proposed solution for case 1 and case 3. Furthermore, we can conclude that our system is robust to visual loss, thanks to the stable measurements from the odometer.
Secondly, we perform the case 2 experiment in sequence 2, the test results for the experiment are shown in Fig. 13.The ground robot firstly moves on areas where enough visual information is available to build the map of the environment shown in Fig. 13a.Then the robot goes to the low texture environment, and later enters new environment where enough features are available.From Fig. 13b, we can know that new map is created when there are enough feature points in new environment, which however is not consistent with the previously reconstructed map.Finally, the robot returns to an already mapped area, this leads the system to trigger loop closure for eliminating the accumulated error, thereby globally consistent map is constructed as shown in Fig. 13c.

VII. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a tightly-coupled monocular visual-odometric SLAM system.It tightly integrates the proposed odometer factor and visual factor in the optimization framework to optimally exploit the both sensor cues, which ensures the accuracy of the system.In addition, the system uses the odometer measurements to compute the motion of frame when visual information is not available, and is able to detect and reject false information from wheel encoder, thereby ensuring the robustness of the system.The experiments have domenstrated that our system can provide accurate, robust and long-term localization for the wheeled robots mostly moving on a plane.
In future work, we aim to exploit line features to improve the performance of our algorithm in environments where only fewer point features are available.In addition, the camerato-wheel encoder calibration parameters are only known with finite precision, which can pose a bad effect on results, so we intend to estimate the extrinsic calibration parameters online and optimize this parameters by BA.Finally, we will add full IMU measurements to our system for accurate motion tracking when both visual information and wheel odometric information cannot provide valid information for localization.

Fig. 1 .
Fig. 1.Flow diagram showing the full pipeline of the proposed system.

Fig. 2 .
Fig. 2. Factor graph representing the visual-odometric optimization problem.The states are shown as circles and factors are shown as squares.The blue squares represent the odometer factors and connect to the state of the previous keyframe, red squares denote the visual factors corresponding to camera observations, black squares denote prior factors and gray squares denote the plane factors.

Fig. 3 .
Fig.3.Evolution of the factor graph in the tracking thread when previous visual tracking is successful.If map is updated, we optimize the state of frame k by connecting an odometer factor to last keyframe m.If map is not changed, state of both last frame k − 1 and current frame k are jointly optimized by linking an odometer factor between them and adding a prior factor to last frame k − 1.The prior for last frame k − 1 is obtained from the previous optimization.At the end of each joint optimization, we judge whether the wheel slippage has occurred through the optimized result.If wheel slippage is detected, the pose of frame k is re-optimized using factor graph in the diamond.

Fig. 4 .
Fig. 4. Comparison between the estimated trajectory and the ground truth.

Fig. 5 .
Fig.5.Sample images when the platform begins to experience wheel slippage and wheel slippage is over, and the finally reconstructed 3D map at the end of the sequence.

Fig. 9 .
Fig. 9. Sample images when the platform begins to move and the platform has been moved to another location, and the finally reconstructed 3D map at the end of the sequence.
Fig.13.Reconstructed 3D map when visual tracking begins to be lost, when the robot enters new environment where new 3D map is constructed after the visual loss and at the end of the sequence.