Simple But Effective Scale Estimation for Monocular Visual Odometry in Road Driving Scenarios

In large-scale environments, scale drift is a crucial problem of monocular visual simultaneous localization and mapping (SLAM). A common solution is to utilize the camera height, which can be obtained using the reconstructed 3D ground points (3DGPs) from two successive frames, as prior knowledge. Increasing the number of 3DGPs by using more preceding frames is a natural extension of this solution for estimating a more precise camera height. However, naively employing multiple frames with conventional methods is hardly applicable in real-world scenarios because vehicle motion and inaccurate feature matching inevitably cause large uncertainty and noisy 3DGPs. In this study, we propose an elaborate method to collect confident 3DGPs from multiple frames for robust scale estimation. First, we gather 3DGP candidates that can be seen in more than a predefined number of frames. To verify the 3DGP candidates, we filter out the 3D points at the exterior of the road region obtained by a deep-learning-based road segmentation model. In addition, we formulate an optimization problem constrained by a simple but effective geometric assumption that the normal vector of the ground plane lies in the null space of the movement vector of the camera center, and provide a closed-form solution. ORB-SLAM with the proposed scale estimation method achieves an average translation error of 1.19% on the KITTI dataset, which outperforms state-of-the-art conventional monocular visual SLAM methods in road driving scenarios.


I. INTRODUCTION
(The associate editor coordinating the review of this manuscript and approving it for publication was Wai-Keung Fung.)

In autonomous driving, simultaneous localization and mapping (SLAM) algorithms play an important role. A SLAM algorithm estimates the motion of a vehicle using a mounted sensor such as a depth sensor, calibrated stereo cameras, or a single RGB camera. In particular, monocular visual SLAM (vSLAM) is attractive to the automobile industry because a single RGB camera is cheaper and requires less calibration effort than other sensors. However, since no depth information is available in a monocular vSLAM system, the pose of the current camera is usually obtained based on the previously estimated camera poses. This consequently causes error propagation, referred to as
the scale drift problem [1]. To solve this problem, loop-closure methods [1], [2] or prior knowledge of the environment [3]-[12] are widely adopted. In the loop closure method [1], once a closed-loop trajectory of the vehicle is detected, all camera poses are corrected through a pose-graph optimization [13] constrained with a similarity transformation. However, vehicles do not always move around a loop in real autonomous driving scenarios. Recently, prior knowledge of the environment, such as the size of a detected object [4], [5], [8], [9] and the known height of the camera mounted on the vehicle [3], [6]-[12], has been used as a reference to correct the camera pose. The relative camera pose is scaled to real-world coordinates using the ratio between the size of a reference object in the reconstruction and that in the real world. This ratio is commonly denoted as the scale factor. As the size of reference objects, e.g., cars, may vary significantly, the camera height, which can be easily measured, is generally employed to obtain the scale factor.

VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. For sequence 00 in the KITTI dataset [14], 3DGPs collected from multiple keyframes are represented by black points. Due to the large uncertainty of the 3D ground points, the planes obtained by the RANSAC plane-fitting algorithm differ from each other. The obtained planes are shown in different colors. Such non-unique solutions may cause a large error in scale estimation.

In [7]-[12], the camera height is computed using
homography decomposition, or 3D ground points (3DGPs) visible in the current frame, or both of them. Earlier works estimate the camera height using the homography induced by the ground plane between two successive frames. However, the homography decomposition process is sensitive to noise and is challenging in terms of numerical stability [7], [8]. In [7], to improve the stability of the homography decomposition, a divide-and-conquer technique is proposed, in which the rotation and translation are obtained using the entire set of feature points matched between the two successive frames; these camera motion parameters are then substituted into the decomposition of the homography. To compute the camera height, Gräter et al. [12] fit the ground plane to the 3D points triangulated between the two frames using the random sample consensus (RANSAC) method [15]. In addition, based on the property that the vertical vanishing point of an image shares its direction with the normal vector of the ground plane, the normal vector is computed and then used to select the 3DGPs for ground-plane fitting. Compared with [4], [5], [7], multiple-cue-based methods [8]-[10] have also been proposed, in which two or more predictions of the camera height are made using the homography decomposition, 3DGPs, and the size of an object. The resulting camera height is then obtained using a Kalman filter [8], [9] or a pre-trained classifier [10]. Recently, SLAM methods employing information from deep learning, such as the estimated depth, segmented regions, or 3D bounding boxes of objects, have also been developed to lessen the scale drift [16]-[19]. The hand-crafted methods mentioned above estimate the scale factor based on 3DGPs, planar homography, or both of them.
Methods [11], [12] based on 3DGPs are simple, but their performance is limited since these methods only use the 3DGPs that are visible in the current frame and conduct inaccurate 3DGP validation. Multiple-cue-based methods [8]-[10] can produce reliable results. However, they require an auxiliary data-fusion process followed by a parameter training process. Considering that scale estimation is a sub-system of the monocular vSLAM system, we propose a lightweight algorithm that estimates the scale factor only from the information available in the conventional vSLAM system, i.e., we do not need additional processes such as optimization-based image patch registration.
Our motivation starts from a very natural extension, namely enlarging the candidate 3DGP pool based on multiple frames. For ground plane estimation, robust estimation methods such as RANSAC or MLESAC [20] are widely applied. In these methods, the largest consensus set is found and used to estimate the plane parameters via least-squares estimation. If we simply fit a plane with the likely 3DGPs collected from multiple keyframes, however, the estimation is often unstable since those 3DGPs have large uncertainty. For example, Fig. 1 illustrates the 3DGPs (black points) and the ground planes (in different colors) obtained by conducting the RANSAC-based 3D plane-fitting algorithm three times with these 3DGPs. Similar results are obtained when MLESAC is employed. In scale estimation, such non-unique solutions may yield a large error.
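To make the instability concrete, the following is a minimal sketch (not the paper's implementation) of RANSAC plane fitting on a synthetic, noisy ground-point cloud; different random sample draws can yield different plane normals, which is the non-uniqueness illustrated in Fig. 1.

```python
import numpy as np

def fit_plane_svd(points):
    """Least-squares plane fit: returns unit normal n and offset d with n.p = d."""
    centroid = points.mean(axis=0)
    # The normal is the right-singular vector of the centered point matrix
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]
    return n, n @ centroid

def ransac_plane(points, iters=100, thresh=0.01, seed=None):
    """Plain RANSAC plane fit; the result depends on the random sample draws."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n, d = fit_plane_svd(sample)
        inliers = np.abs(points @ n - d) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_plane_svd(points[best_inliers])

rng = np.random.default_rng(0)
# Noisy, elongated "ground" point cloud (large depth spread along z,
# height noise along y), loosely imitating triangulated 3DGPs.
pts = np.column_stack([rng.uniform(-2, 2, 300),
                       rng.normal(0.0, 0.05, 300),
                       rng.uniform(0, 30, 300)])
n1, _ = ransac_plane(pts, seed=1)
n2, _ = ransac_plane(pts, seed=2)
# The angle between the two estimated normals quantifies the instability.
angle = np.degrees(np.arccos(np.clip(abs(n1 @ n2), 0.0, 1.0)))
```

Running the fit with different seeds stands in for the "three runs" of Fig. 1; with low-noise data the normals agree, but as the height noise grows, so does the disagreement.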
In this study, we propose an elaborate 3DGP validation method to robustly estimate the ground plane with multiple keyframes for scale estimation. To verify the 3DGP candidates, we adopt a road segmentation network model and employ the segmentation mask to reduce outliers. We also use a simple geometric assumption that the normal vector of the ground plane lies in the null space of the movement vector of the camera center. Under this assumption, we formulate a constrained optimization problem and provide a closed-form solution. We evaluated the performance of ORB-SLAM with the proposed scale estimation method on the KITTI dataset [14]. The experimental results demonstrate that the proposed method outperforms state-of-the-art conventional methods in road driving scenarios. The main contributions of this paper are summarized as follows:
• To the best of our knowledge, the proposed method is the first attempt to employ multiple keyframes to improve the performance of scale estimation. The proposed scale estimation method can serve as a baseline for multi-keyframe-based approaches.
• We adopt a supervised road-segmentation network and collect the 3DGPs among the 3D points employing the semantic information from the road segmentation results.
• From the viewpoint of an optimization framework, we formulate the geometric assumption that, in the co-visibility graph of the current keyframe, the vehicle moves on a local plane, and provide a closed-form solution. Compared with conventional objectives, the proposed closed-form solution effectively improves the performance without a significant increase in computation.

FIGURE 2. Overview of the proposed scale estimation algorithm. We collect the 3D points s_m, camera poses T_k, and camera centers c^w_k from the local mapping thread. If enough 3DGPs are collected, the moving direction of the camera m^w is estimated. Based on m^w, we fit a plane to the collected 3DGPs via the proposed constrained optimization problem (5). Finally, we correct the camera poses based on the scale factor obtained via the fitted plane.
• ORB-SLAM using our scale estimation method with and without the proposed geometrically constrained optimization achieves 1.19% and 1.23%, respectively, in terms of the average translation error on the KITTI dataset [14], which is superior to state-of-the-art conventional monocular vSLAM methods in road driving scenarios.

II. BACKGROUND
The proposed method employs the monocular feature-based vSLAM system to obtain the camera pose. In this section, we briefly introduce the vSLAM algorithms and the 3D plane geometric model used in the proposed method.

A. FEATURE-BASED vSLAM
The popular feature-based vSLAM algorithms include MonoSLAM [21], PTAM [22], and ORB-SLAM. MonoSLAM is the first proposed vSLAM algorithm. In this method, the camera poses and 3D points are estimated simultaneously using the extended Kalman filter (EKF). However, MonoSLAM is computationally expensive and is not suitable for large-scale environments due to the large state vectors of the EKF. PTAM is a ground-breaking vSLAM algorithm developed for augmented reality applications in small workspaces. To achieve real-time performance, it was the first to run tracking and mapping in two separate threads. ORB-SLAM, as an extension of PTAM, has become one of the most popular and complete feature-based monocular vSLAM algorithms [23].
As compared with other feature-based vSLAM algorithms, ORB-SLAM achieves fully automatic initialization without requiring human intervention and obtains outstanding performance [24]. Keyframes are adaptively selected to achieve robust tracking in various environments [25], [26].1 The ORB-SLAM algorithm mainly consists of three steps: tracking, local mapping, and loop closure. To achieve real-time performance, these three steps are performed in parallel using multi-threading. In the tracking step, the camera pose of the upcoming (current) frame is estimated based on the detected and matched ORB feature points [31].

1 Compared with feature-based vSLAM, a denser 3D map can be reconstructed by (semi-)dense vSLAM methods such as SVO [27], LSD-SLAM [28], and DSO [29]. However, in road driving scenarios, the (semi-)dense methods tend to lose tracking or fail to initialize the vSLAM system if the inter-frame motion is large. We use ORB-SLAM as our baseline algorithm due to the robustness of its vSLAM initialization [30].
In the local mapping step, the 3D point corresponding to each 2D feature is reconstructed and refined by local bundle adjustment (BA) [32]. The local BA efficiently improves the accuracy of both the reconstructed 3D points and the camera poses by minimizing the re-projection error of the 3D points with respect to their corresponding 2D features. In practice, as running BA with all feature points and all frames is not feasible [22], [26], the local BA is executed on a subset of frames. Such selected frames are referred to as keyframes. In general, successive frames containing redundant information, such as frequently repeated features, are not included in the keyframes. In the loop-closure step, if a loop closure is detected, all the computed camera poses are optimized based on a similarity transformation to eliminate the scale drift.
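The re-projection residual that local BA minimizes can be sketched as follows; this is a simplified pinhole model, not the ORB-SLAM implementation, and the intrinsic matrix uses illustrative KITTI-like values.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixels with pose (R, t) and intrinsics K."""
    x_cam = R @ X + t                  # world -> camera frame
    x = K @ (x_cam / x_cam[2])         # perspective division, then intrinsics
    return x[:2]

def reprojection_error(K, R, t, X, uv):
    """Per-point residual; local BA minimizes the sum of these over the
    local keyframes and their observed 3D points."""
    return np.linalg.norm(project(K, R, t, X) - uv)

K = np.array([[718.856, 0.0, 607.19],  # illustrative KITTI-like intrinsics
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)          # identity pose for the sketch
X = np.array([1.0, 0.5, 10.0])         # a point 10 m in front of the camera
uv = project(K, R, t, X)               # its observed pixel location
# With a consistent pose, point, and observation, the residual vanishes.
err = reprojection_error(K, R, t, X, uv)
```

BA jointly adjusts R, t, and X so that this residual is small for every observation in the local window.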

B. PLANAR GEOMETRY
As mentioned before, some conventional methods adopt a plane equation to compute the camera height [7], [10], [12] based on the RANSAC-based plane-fitting method [33]. A plane in 3D space can be represented by its unit normal vector n ∈ R^{3×1} and the distance d ∈ R from the plane to the camera center. Let g_n ∈ R^{3×1} be the nth point among N 3D points on the plane, referred to as 3DGPs; then, the plane equation can be defined as n^T g_n − d = 0. The normal vector n is obtained by solving the constrained optimization problem

    min_n ||G^T n||^2  subject to  ||n||^2 = 1,    (1)

where G = [g_1, · · · , g_N] − ḡ 1^T, ḡ = (1/N) Σ_{n=1}^{N} g_n, and 1 is the vector whose entries are all ones, with an appropriate length. The solution of (1) can be obtained as stated in the following basic theorem [34].
Theorem 1 (Rayleigh-Ritz): Let A ∈ R^{n×n} be real and symmetric, and let the eigenvalues of A, λ_1, · · · , λ_n, be arranged in descending order. Then, for any x ≠ 0,

    λ_n ≤ (x^T A x) / (x^T x) ≤ λ_1,

and the minimum is attained at x_n, the unit eigenvector corresponding to the smallest eigenvalue λ_n.

Let G = UΣV^T be the singular-value decomposition (SVD) of G. Based on Theorem 1, the normal vector is the column vector of U corresponding to the smallest singular value. Then, the distance from the camera center to the ground plane, denoted as h, can be computed by

    h = n^T ḡ.    (2)
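The SVD-based solution of (1) and the height computation (2) can be sketched on synthetic data; the ground points below are an assumed test case (a plane 1.65 m below a y-down camera with small triangulation noise), not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3DGPs on the plane y = 1.65 in a y-down camera-style frame
# (camera 1.65 m above the ground); noise imitates triangulation error.
N = 200
g = np.column_stack([rng.uniform(-3, 3, N),
                     np.full(N, 1.65) + rng.normal(0, 0.01, N),
                     rng.uniform(2, 20, N)])

g_bar = g.mean(axis=0)
G = (g - g_bar).T                 # 3 x N centered matrix, as in (1)
U, S, Vt = np.linalg.svd(G)
n = U[:, -1]                      # column of U for the smallest singular value
h = abs(n @ g_bar)                # camera height, Eq. (2)
```

By Theorem 1, minimizing n^T (G G^T) n over unit vectors picks the eigenvector of G G^T with the smallest eigenvalue, which is exactly the last column of U in the SVD of G.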

C. ROAD SEGMENTATION
In recent years, thanks to increases in available computing power, deep-learning-based road segmentation methods have achieved great success and have been widely used in autonomous driving systems and advanced driver assistance systems (ADAS) [35]. Traditional road segmentation methods based on low-level features such as color, edge, and texture are not robust to illumination changes, and the location-prior assumption does not hold when there is an on-road object in front of the autonomous vehicle, resulting in unsatisfactory segmentation results [36]. Since CNN-based segmentation methods such as [37]-[39] show superior performance and robustness to illumination changes and occlusion, a variety of CNN-based road segmentation methods [36], [40]-[43] have been introduced to improve the accuracy of road segmentation. Owing to big data and plentiful computing resources, better results are generally obtained by CNN-based methods, as listed in the benchmark [44]. In scale estimation methods, since the scale factor is estimated based on the feature points or image patches matched between two successive frames on the road region, larger and more accurate road segmentation results can lead to higher scale estimation performance [11]. In this paper, since we mainly attempt to accurately estimate the scale factor, we use a CNN-based road segmentation method to reduce the error sources.

III. PROPOSED METHOD
Fig. 2 represents the flowchart of the proposed scale-correction framework. From the results obtained in the local mapping step, we first collect the 3D points s_m for m = 1, · · · , M in the world coordinate system and the camera poses T_k for k = 1, · · · , K representing the rigid transformation of each keyframe, where M and K are the numbers of 3D points and keyframes, respectively. With the camera poses, we can transform the 3D points from the world coordinate system to the coordinate system of each keyframe.
The camera positions c^w_k without scale correction are also collected. Among the collected 3D points s_m, the 3DGP candidates g_n are selected using the road segmentation mask (Section III-A). If the number of 3DGPs, N, is larger than a threshold δ, a motion vector m^w representing the movement direction of the camera is computed. Then, a 3D plane whose normal vector is orthogonal to m^w is fitted to the 3DGPs. Finally, using the normal vector n of the ground plane and the mean point ḡ of the collected 3DGPs, the scale factor is computed and used to correct the camera poses. If N ≤ δ, we assume that the result of the proposed method is unavailable, and the previously computed scale factor is used to correct the camera pose.

A. 3DGP CANDIDATES COLLECTION USING ROAD SEGMENTATION
In conventional methods [8]-[10] that employ two successive frames, the accuracy of the reconstructed 3DGPs is low due to feature-matching errors and small triangulation angles. In the early stage of the proposed scale estimation, we first collect confident 3DGP candidates from each keyframe based on the semantic information from the road-segmentation network model. Fig. 3 depicts the proposed 3DGP candidate collection method using road segmentation. The marked points and camera-shaped icons represent 3D space points and the camera positions corresponding to keyframes in the world coordinate system, respectively. Let an arbitrary 3D point s_m be given in the world coordinate system for a given keyframe. Then, we can determine whether this point is a 3DGP by projecting s_m onto the image plane as follows:

    x_g = π(T(s_m)),    (3)

where T(·) is the rigid transformation of the given keyframe, which transforms a 3D point from the world coordinate system to that of the given keyframe, and π(·) is the transformation that projects the 3D point onto the image plane. If the projected point x_g is inside the ground region of the keyframe, we assign s_m as a 3DGP candidate. To obtain 3DGP candidates with high confidence, we obtain the ground region using our previous work [36], referred to as RBANet. We employ VGGNet [45] as the backbone. The coarse road region is first predicted from the top-level feature maps. Then, the road region is gradually refined by learning the residual refinement module in a top-down scheme. Since most false predictions occur on the road boundary, the reverse and boundary attention units in the residual refinement module are employed, which guide the network to focus on the region near the road boundary. The detailed architecture and implementation details can be found in [36]. The road segmentation network model is trained on the URBAN road category of the KITTI benchmark [46].
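The candidate test of (3) — project a map point into a keyframe and check it against the segmentation mask — can be sketched as follows. The mask here is a dummy array standing in for the RBANet output, and the intrinsics are illustrative KITTI-like values.

```python
import numpy as np

def is_3dgp_candidate(s_w, T_kw, K, road_mask):
    """Return True if world point s_w projects inside the road mask of a
    keyframe. T_kw: 4x4 rigid transform world -> keyframe; road_mask: HxW
    boolean array (dummy stand-in for the segmentation network output)."""
    s_k = (T_kw @ np.append(s_w, 1.0))[:3]   # T(s_m): world -> keyframe frame
    if s_k[2] <= 0:                          # behind the camera: reject
        return False
    u, v, _ = K @ (s_k / s_k[2])             # pi(.): pinhole projection
    u, v = int(round(u)), int(round(v))
    h, w = road_mask.shape
    return 0 <= v < h and 0 <= u < w and bool(road_mask[v, u])

K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])
T = np.eye(4)                                # keyframe at the world origin
mask = np.zeros((376, 1241), dtype=bool)
mask[200:, :] = True                         # pretend the lower image part is road
# A point 1.65 m below the camera (y-down) and 10 m ahead projects onto the road.
on_road = is_3dgp_candidate(np.array([0.0, 1.65, 10.0]), T, K, mask)
```

Points that project outside the mask, outside the image, or behind the camera are rejected, which is the filtering role the segmentation plays in the proposed pipeline.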
Without loss of generality, any road segmentation method can be applied in the proposed scale estimation framework. The scale estimation performance for different road segmentation methods will be presented in Section IV-B. Fig. 4 shows an example of the road segmentation result and the points projected from the collected 3D points. The road region identified by the network is shown in green. The estimated 3D ground and space points are marked as red and blue points, respectively. Conventional methods employing a fixed region of interest (RoI) on the road [7]-[10], [12] often fail to estimate the camera height if the RoI is occluded by an unexpected object on the road. By contrast, with the proposed segmentation-based 3DGP verification method, we can secure 3DGP candidates with high confidence.

B. 3DGP SELECTION FROM MULTIPLE KEYFRAMES
For a vehicle running in road driving scenarios, the movement in the up-down (y) direction is generally much smaller than that in the forward-backward (z) and left-right (x) directions. Based on this fact, we assume that, if the captured images share enough common 3D points, the vehicle is running on a locally planar surface. To enlarge the candidate 3DGP pool for plane fitting, we propose to select 3DGPs from multiple keyframes.
To this end, we use the co-visibility graph adopted in ORB-SLAM [26]. In the co-visibility graph, any two keyframes are linked if they share observations of the same 3D points (at least 15 3D points). For the current keyframe, we first form a set of keyframes, denoted as K_1, each element of which shares 3D points with the current keyframe. For example, as shown in Fig. 3, keyframes linked with the current keyframe (see the red camera icon) in the co-visibility graph are denoted by the blue camera icons. The red dots and cross lines represent their 3DGP candidates and 3D space points (non-ground 3D points), respectively. Then, we gather all 3DGP candidates of the keyframes in K_1, which are denoted by blue and red dots. To select 3DGPs with higher confidence, we employ the 3DGP candidates with large triangulation angles. For each 3DGP candidate, we count the number of keyframes in which that point can also be observed. To this end, we first enlarge the keyframe set K_1 by further collecting the keyframes that are linked with the keyframes in K_1. We denote the enlarged keyframe set as K_2, which is indicated by the blue and green camera icons in the figure. Only the 3DGP candidates that can be observed in at least λ keyframes are used as 3DGPs for plane fitting.

FIGURE 5. An example of 3DGP collection results when the road region is occluded by an object. The first row shows a keyframe in the co-visibility graph of the current frame. The second row shows the current frame. The 3D space points and 3DGPs are represented as blue and red points, respectively. One can note that, since we collect 3DGPs from multiple keyframes, the 3DGPs can be successfully collected even though the road region is largely occluded by the truck.
When a vehicle moves forward, the proposed 3DGP selection ensures that the triangulation angle is sufficiently large, because the triangulation angle grows as the distance between the camera centers increases. Note that, since we collect 3DGPs from multiple keyframes, the number of 3DGPs is sufficient for a robust plane-fitting algorithm. Even if the road region is largely occluded by unexpected objects, or segmentation fails due to over-exposure for several frames, the proposed 3DGP collection strategy can work robustly. Fig. 5 illustrates an example in which the 3DGPs are robustly detected even when a truck occludes a large part of the road region.
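The selection rule above (keep only candidates observed in at least λ keyframes of the enlarged set K_2) can be sketched as follows; the observation records and identifiers are hypothetical stand-ins for ORB-SLAM's map-point bookkeeping.

```python
LAMBDA = 3   # minimum number of observing keyframes (the paper's lambda)

def select_3dgps(candidates, observations, k2_frames, lam=LAMBDA):
    """Keep only candidate point ids seen in at least `lam` keyframes of the
    enlarged keyframe set K2. `observations` maps point id -> set of
    keyframe ids that observe it (a sketch, not ORB-SLAM's data structure)."""
    return [p for p in candidates
            if len(observations.get(p, set()) & k2_frames) >= lam]

# Hypothetical example: three candidate points, five keyframes in K2.
obs = {1: {10, 11, 12},            # seen in 3 keyframes -> kept
       2: {12},                    # seen in 1 keyframe  -> dropped
       3: {9, 10, 11, 12, 13}}     # seen in 5 keyframes -> kept
k2 = {9, 10, 11, 12, 13}
kept = select_3dgps([1, 2, 3], obs, k2)
```

Requiring multiple observing keyframes is what indirectly enforces a large triangulation baseline for the retained 3DGPs.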

C. ESTIMATING THE MOVEMENT DIRECTION FROM PREVIOUS KEYFRAMES
We compute the plane equation based on the movement direction of the camera. For a vehicle moving on a ground surface, the normal vector of the ground plane is expected to be orthogonal to the movement direction of the camera center. To obtain the movement direction, we first collect the 3D coordinates of the camera positions of each keyframe in K_1. Then, we fit a line in vector form to the 3D coordinates of the camera positions, as follows:

    max_{m^w} Σ_{k=1}^{K} (m^{wT} (c^w_k − c̄^w))^2  subject to  ||m^w||^2 = 1,    (4)

where K = |K_1|, m^w ∈ R^{3×1} is the motion vector of the camera to be estimated, c^w_k ∈ R^{3×1} is the 3D coordinate of the kth camera center in the world coordinate system, and c̄^w = (1/K) Σ_{k=1}^{K} c^w_k. The solution of (4) can be obtained by the SVD of the matrix C ∈ R^{3×K} given by C = [c^w_1, · · · , c^w_K] − c̄^w 1^T. Let C = U_c Σ_c V_c^T be the SVD of C. The movement direction of the camera in the world coordinate system, m^w, is the column vector of U_c corresponding to the largest singular value. Computing the movement direction with only two camera centers tends to produce incorrect results, since the movement direction may change frequently in the pitch direction if the vehicle is moving on an uneven surface. Therefore, the proposed multi-keyframe-based camera-motion estimation method provides a more stable result.
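The line fit of (4) via SVD can be sketched as follows; the camera-center trajectory is synthetic (forward motion along z with small pitch jitter), chosen only to illustrate the averaging effect over multiple keyframes.

```python
import numpy as np

def motion_direction(centers):
    """Fit a 3D line to camera centers: the direction m^w is the left-singular
    vector of the centered 3xK matrix C with the largest singular value."""
    c_bar = centers.mean(axis=0)
    C = (centers - c_bar).T          # 3 x K, as in Section III-C
    U, _, _ = np.linalg.svd(C)
    return U[:, 0]

# Synthetic camera centers: 12 keyframes moving mostly along z, with a small
# lateral drift in x and pitch-like jitter in y.
rng = np.random.default_rng(0)
k = np.arange(12, dtype=float)
centers = np.column_stack([0.05 * k,
                           rng.normal(0.0, 0.02, 12),
                           1.0 * k])
m_w = motion_direction(centers)
```

With only two centers, the y jitter would tilt the estimated direction; fitting over all |K_1| centers suppresses it, which is the stability argument made above.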

D. 3D GROUND PLANE FITTING
The RANSAC scheme is widely adopted in conventional methods to estimate the plane parameters: the largest consensus set of inliers is found and used to estimate the plane parameters via least-squares estimation. In the proposed method, based on the selected inliers, the 3D plane is obtained by solving the following constrained optimization problem:

    min_n ||G^T n||^2  subject to  ||n||^2 = 1,  M^T n = 0,    (5)

where G ∈ R^{3×N} is the matrix formulated as defined in (1), M = [m^w, 0, 0], 0 ∈ R^{3×1} is the vector whose entries are all zeros, and m^w is the movement direction of the camera obtained in (4). Unlike (1), (5) adds another constraint whereby the normal vector of the fitted ground plane must be orthogonal to the movement direction of the camera. Note that, to ensure that the normal vector of the fitted plane is perpendicular to the motion vector m^w, instead of solving (5) by least-squares minimization, we first find the left null space of the motion vector, denoted as null(M^T); i.e., any vector in this null space is orthogonal to the motion vector m^w. Since M is a rank-one matrix, the two vectors spanning null(M^T) can be easily obtained from the SVD of M, i.e., M = PSQ^T. The null space is shown as a gray-colored plane in Fig. 3. Let B = [p_2 p_3] ∈ R^{3×2}, where p_2 and p_3 are the two column vectors of P corresponding to the zero singular values. Then, the normal vector of the plane can be obtained as a linear combination of the columns of B:

    n = By,    (6)

where y ∈ R^2. By substituting (6) into (5), the optimization problem used to fit the 3D plane can be rewritten as

    min_y ||G^T By||^2  subject to  ||y||^2 = 1.    (7)

It should be noted that ||y||_2 = 1, as B is formed by two orthonormal vectors and ||n||_2 = 1. The minimum of (7) can be achieved by using Theorem 1. Consequently, the height of the camera is obtained as

    h = n^T ḡ.    (8)

FIGURE 6. We collect 3DGPs from |K_1| keyframes in the co-visibility graph of the current keyframe and fit a 3D line to the collected keyframes. If |K_1| is larger than the number of keyframes in the co-visibility graph, only the available keyframes are collected. The normal vector of the ground plane is found in the null space of the line direction (vehicle movement direction).

In the case of vehicle rotation, since the view of the scene changes dynamically, only a limited number of previous keyframes share observations of the same (at least 15) 3D points with the current keyframe. As a result, the number of keyframes in the co-visibility graph is small, and the trajectory of these keyframes exhibits negligible curvature. Therefore, the line direction of these keyframes can also be used to reduce the pitch error of the fitted ground surface. To illustrate this, we used a subsequence of sequence 06 of the KITTI dataset in which road curvatures over 180 degrees are included. Fig. 6 shows the collected keyframes in the case of rotation. The trajectory of the vehicle is denoted by colored asterisks. The keyframes in K_1 are shown as filled red circles. The blue arrow indicates the fitted line direction. As can be seen, the number of collected keyframes decreases as the road curvature increases. The line direction obtained by (4) is almost parallel to the trajectory traveled by the collected keyframes.
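The closed-form solution of (5)-(8) — restrict the normal to the null space of the motion vector via B, then apply Theorem 1 in the reduced 2D space — can be sketched as follows on synthetic ground points (an assumed test case, not the paper's data).

```python
import numpy as np

def constrained_plane_normal(G, m_w):
    """Solve (5): min ||G^T n|| s.t. ||n|| = 1 and n orthogonal to m_w.
    B spans null(M^T); the answer is n = B y (Eq. (6)) with y the unit
    vector minimizing ||G^T B y|| (Eq. (7))."""
    M = np.column_stack([m_w, np.zeros(3), np.zeros(3)])   # rank-one matrix
    P, S, Qt = np.linalg.svd(M)
    B = P[:, 1:]                      # columns p2, p3: zero singular values
    _, _, Vt = np.linalg.svd(G.T @ B) # Theorem 1 in the reduced 2D space
    y = Vt[-1]                        # smallest right-singular direction
    return B @ y

# Synthetic 3DGPs 1.65 m below a y-down camera, with triangulation noise.
rng = np.random.default_rng(0)
N = 200
g = np.column_stack([rng.uniform(-3, 3, N),
                     np.full(N, 1.65) + rng.normal(0, 0.02, N),
                     rng.uniform(2, 20, N)])
g_bar = g.mean(axis=0)
G = (g - g_bar).T                     # 3 x N centered matrix, as in (1)
m_w = np.array([0.0, 0.0, 1.0])       # forward motion direction
n = constrained_plane_normal(G, m_w)
h = abs(n @ g_bar)                    # camera height, Eq. (8)
```

By construction, n lies in the span of B, so n ⟂ m^w holds exactly, unlike an unconstrained fit where forward-stretched noise can tilt the normal toward the driving direction.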

E. SCALE COMPUTATION AND APPLICATION
After obtaining the camera height using (8), we compute the scale factor as s = h_gt / h, where h_gt is the ground-truth camera height. For the KITTI dataset [14], h_gt = 1.65 m. The camera-center position of the current keyframe is updated as

    c^w_c ← c^w_p + s (c^w_c − c^w_p),    (9)

where c^w_c and c^w_p are the camera-center positions of the current and previous keyframes, respectively, in the world coordinate system, as obtained in the local mapping step. Since all camera poses are optimized by the local BA in the local mapping step, for the keyframes in K_1, we assume that the translation vector is accurate up to a scale factor [47]. Consequently, if the number of 3DGPs is smaller than δ, the scale factor employed for the previous keyframe is adopted. Unlike conventional methods, the proposed method uses the scale factor directly, without applying any post-processing such as a Kalman filter [8], [9], [12] or median filter [11]. In addition, the scaled camera pose is used as the final output, without performing any further BA optimization.
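The scale computation and camera-center update described above can be sketched as follows; the numbers are a made-up example (an estimated height of 0.825 in the vSLAM's arbitrary units), chosen so the correction is easy to follow.

```python
import numpy as np

H_GT = 1.65   # known height of the camera mounted on the KITTI vehicle (m)

def correct_scale(c_curr, c_prev, h_est):
    """Rescale the last translation step by s = h_gt / h, a sketch of the
    camera-center update described in the text."""
    s = H_GT / h_est
    return c_prev + s * (c_curr - c_prev), s

c_prev = np.array([0.0, 0.0, 10.0])    # previous keyframe center
c_curr = np.array([0.0, 0.0, 11.0])    # current center: a 1-unit step in vSLAM scale
# The plane fit reports a camera height of 0.825 units, so s = 1.65 / 0.825 = 2,
# and the 1-unit step is stretched into a 2 m step.
c_new, s = correct_scale(c_curr, c_prev, h_est=0.825)
```

If too few 3DGPs are available (N ≤ δ), `h_est` would simply be carried over from the previous keyframe, as stated above.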

IV. EXPERIMENTAL RESULTS
The performance of the proposed method was compared with that of the following hand-crafted methods: Lee's method [3], Song's method [9], Wang's method [11], and Fanani's method [10]. All these methods estimate the scale factor based on the prior knowledge of a fixed camera height. In addition, we also compared the performance of the proposed method with that of four CNN-based methods: Yin's method [16], Sucar's method [17], Yang's method [18], and Loo's method [19]. ORB-SLAM using the proposed scale estimation method with and without the guidance of the motion vector is referred to as GCSEMK (geometrically constrained scale estimation from multiple keyframes) and SEMK (scale estimation from multiple keyframes), respectively.

A. IMPLEMENTATION DETAILS
The proposed method includes three steps that run in parallel: tracking, local mapping, and scale correction. The tracking and local mapping steps are exactly the same as in ORB-SLAM. We considered that, for the scale estimation methods [3]-[12] that use prior knowledge such as the camera height and the size of objects in the scene, the performance of scale drift elimination can be better evaluated without applying loop closure. Since the proposed method also estimates the scale factor using the camera height, for a fair comparison, we did not apply loop closure. The scale factor was obtained directly using only the fitted 3D plane, without post-processing or additional processes. We evaluated the performance of the proposed method on the KITTI visual odometry dataset [14]. The KITTI visual odometry dataset covers urban, residential, country, and highway roads. The images were captured at 10 fps over a driving distance of 39.2 km. The vehicle speed varies from 0 to 90 km/h. Since highly accurate camera poses are available, the KITTI dataset is widely used for evaluating scale estimation algorithms. The training sequences (00-10) were used, as the ground-truth poses are publicly available. The performance of the proposed method was quantified using the KITTI visual odometry evaluation metric [14]. We compared the difference between the predicted and ground-truth camera poses, and computed the average rotation and translation errors over all possible sub-sequences of lengths 100, 200, · · · , 800 meters [14]. To further test the capability of the proposed scale estimation method, we also used the RobotCar dataset [48]. Compared with the KITTI dataset, the RobotCar dataset is more difficult for vSLAM algorithms because of significant illumination changes. For some sequences in the RobotCar dataset, state-of-the-art vSLAM algorithms such as SVO [27], DSO [29], and ORB-SLAM cannot estimate the camera poses due to feature-tracking failures [19].
Since ORB-SLAM was used as our baseline algorithm, we tested the proposed method on the sequence for which the ORB-SLAM system can produce camera poses and 3D points.
To evaluate the performance of scale drift elimination, we also computed the absolute trajectory error (ATE) [49] of the proposed method and compared it with those of ORB-SLAM, ORB-SLAM without loop closure (OS-NL), and ORB-SLAM with the ground-truth scale (OS-GTs). We followed the evaluation method used in Wang's [11] and Loo's [19] methods. Since the absolute scale is unknown for ORB-SLAM and OS-NL, for comparison, the ATE was computed after aligning the estimated trajectory with the ground truth using a similarity transformation. All the experiments were performed on a PC with an AMD Ryzen 7 2700X CPU, 16 GB RAM, and an NVIDIA GeForce Titan XP GPU. In the RANSAC-based ground-plane fitting, the maximum number of iterations and the distance threshold were set to 500 and 0.01, respectively. For each iteration, three 3DGPs were used.

B. PARAMETER STUDY AND PROCESSING TIME
To study the effect of the parameters λ, |K1|, and δ used in the proposed method, we evaluated the results obtained using SEMK and GCSEMK with different parameter settings. For each setting, we executed the proposed methods five times on the KITTI training sequences. The results for sequence 01 are not presented since the proposed methods failed on this sequence.
The proposed method was developed using the open-source ORB-SLAM as a baseline algorithm. We selected a 3DGP as a candidate point for plane fitting only if it can be observed in at least λ keyframes. In our experiments, we set λ = 3 for the following reasons. First, in ORB-SLAM, keyframes are selected adaptively to ensure that features can be robustly tracked in various environments [25], [26]. For instance, if the scene changes significantly, as in the case of rotation or high vehicle speed, keyframes are inserted as quickly as possible for robust tracking. Second, since a larger baseline yields higher 3D reconstruction precision, 3DGPs reconstructed with a higher λ value are preferred. Third, we found that, when λ > 3, the number of collected 3DGPs decreases sharply. This is because, as the camera moves forward, feature points on the ground plane exit the field of view very quickly, and few 3DGPs are simultaneously observable from four keyframes. In our experiments, when λ > 3, the scale factor could not be obtained for several parts of sequences 06, 07, 08, and 09 because the number of collected 3DGPs was smaller than the minimum required for plane fitting. Thus, under the ORB-SLAM keyframe selection method, we found λ = 3 to be an appropriate value for choosing 3DGPs with high reconstruction precision. Reliable results for both the KITTI and RobotCar datasets were also obtained with λ = 3.
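The λ-based candidate selection can be illustrated with a minimal sketch; `MapPoint` here is a hypothetical stand-in for ORB-SLAM's map-point structure, not its actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MapPoint:
    """Hypothetical minimal map point: a 3D position plus the set of
    ids of the keyframes in which the point is observed."""
    xyz: tuple
    observations: set = field(default_factory=set)

def collect_3dgp_candidates(map_points, lam=3):
    """Keep only points observed in at least `lam` keyframes (lambda = 3
    in the paper), so every candidate is triangulated from a
    sufficiently wide baseline."""
    return [p for p in map_points if len(p.observations) >= lam]
```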
To fully study the impact of the parameter |K1|, we ran the proposed method five times on the KITTI training sequences with |K1| values ranging from 1 to 20, as shown in Fig. 8. The average result for each sequence is shown as a dashed line in a different color, and the overall performance for each |K1| is shown as a thick red line. Note that, when |K1| > 10, the difference in overall performance is negligible. This is because we only assembled the keyframes in the co-visibility graph of the current keyframe. The accuracy and processing time of the proposed method for different |K1| are shown in Fig. 8. The processing time per frame increased with |K1| since many more 3DGPs are collected. The processing time was measured without any CPU or GPU acceleration techniques such as Intel streaming SIMD extensions or CUDA-accelerated libraries. Considering both processing time and accuracy, we set |K1| = 12. In the proposed method, the previously computed scale factor is used to correct the camera pose if the number of collected 3DGPs is smaller than δ. After setting |K1| = 12 and λ = 3, we ran the proposed methods five times on the KITTI training sequences; the results are shown in Fig. 3. The performance of the proposed method is almost unchanged when δ < 30, and degrades when δ > 30. This is because, for textureless ground regions, the number of collected 3DGPs is generally smaller than 30 even though we collected 3DGPs from multiple keyframes. In summary, the default settings were determined as |K1| = 12, λ = 3, and δ = 15. To study the impact of the road segmentation method, we tested the proposed methods on the KITTI dataset with SegNet [38], MultiNet [43], and RBANet. Table 1 lists the overall translation error of the proposed scale estimation method with each network.
For each network, the F1-measure evaluation results are also listed in Table 1. On the URBAN road category of the KITTI dataset, SegNet, MultiNet, and RBANet produced F1-measures of 84.04%, 94.88%, and 96.30%, respectively. Similar results were obtained by the proposed methods with the different road segmentation networks, which indicates that the proposed scale estimation method is not very sensitive to the choice of network. For the remaining experiments, we used RBANet since it performed slightly better than SegNet and MultiNet. For RBANet, the segmentation process takes about 170 ms per 1024 × 768 image.
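For reference, the F1-measure used to compare the segmentation networks is the harmonic mean of pixel-level precision and recall; a minimal sketch:

```python
def f1_measure(tp, fp, fn):
    """Pixel-level F1: harmonic mean of precision and recall, computed
    from true-positive, false-positive, and false-negative pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```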

C. OVERALL PERFORMANCE ON THE KITTI DATASET
For the KITTI dataset, we used the images captured by the left camera of the KITTI binocular stereo setup. To evaluate the proposed scale estimation method as comprehensively as possible, we ran SEMK and GCSEMK several times on all sequences. ORB-SLAM occasionally loses tracking on sequences 02, 08, and 09. Since the translation error accumulates over time, for each sequence, only camera poses from full trajectories were used for evaluation. We obtained 30 full trajectories for each sequence and computed their average translation error. The performance of the conventional and proposed methods on the KITTI dataset is listed in Table 2, and example trajectories obtained using GCSEMK are shown in Fig. 9. As can be seen, the estimated vehicle trajectories align closely with the ground-truth trajectories. The average, minimum, and standard deviation of the translation error are denoted as T(%), min, and σT, respectively; the best results are boldfaced. The scores of the compared methods were taken from the authors' papers, and 'n.a.' indicates unavailable data. As shown in Table 2, the proposed method GCSEMK achieved the best overall result, with a translation error of 1.31%, followed by Fanani's method and SEMK, with a translation error of 1.54%. For sequence 00, GCSEMK and Wang's method provided the best result, with a translation error of 1.01%. For sequence 01, only Lee's and Fanani's methods obtained full-trajectory results, as listed in Table 2. This sequence is a highway scenario with frequently repeated features, and the images were captured at only 10 fps despite a vehicle speed of 90 km/h. The proposed method failed on sequence 01 because the 3DGPs collected across multiple keyframes were still insufficient in several parts of the sequence. For sequence 02, the best result was obtained by Fanani's method, with a translation error of 0.99%, followed by Song's method and the proposed method.
For sequences 03-08, the proposed methods, SEMK and GCSEMK, considerably improved the translation accuracy compared with the conventional methods. In particular, for sequence 07, in which the road region exhibits less texture than in the other sequences and, for a short time, over 70% of the frames are occluded by a large truck, SEMK and GCSEMK obtained results superior to the other methods. Because the proposed method uses multiple keyframes, SEMK and GCSEMK can reliably collect 3DGPs even from partially occluded keyframes. Comparing the results of SEMK and GCSEMK, we conclude that identifying the normal vector of the ground plane in the null space of the motion vector is an effective and efficient strategy: the average translation error improved from 1.54% to 1.31%. In addition, the σT of GCSEMK is smaller than that of SEMK, indicating that more stable results can be obtained with the guidance of the movement vector of the camera. For sequence 10, Wang's method provided the best result, with a translation error of 0.93%, followed by GCSEMK, with an error of 1.37%. Video results can be found on our project webpage. Table 3 provides the ATE results obtained using the standard ORB-SLAM, OS-GTS, OS-NL, Loo's method [19], Wang's method [11], SEMK, and GCSEMK. As listed in Table 3, there are large gaps between the ATE results obtained using the ground-truth scale and the scale estimated via loop closure. This indicates that, if the scale factor is estimated accurately, the performance of monocular visual odometry/SLAM in a large-scale environment can be significantly improved simply by applying scale correction. Among these methods, GCSEMK obtained the best result, with an average root mean square error (RMSE) of 2.75 m. For the short-distance sequences 03 and 04, all methods obtained reasonable results. The ATE results for each keyframe are shown in Fig. 10.
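The similarity-aligned ATE used in Table 3 can be sketched with the closed-form Umeyama alignment, a standard technique for comparing scale-ambiguous monocular trajectories. This is our illustrative implementation, not the evaluation code used in the experiments:

```python
import numpy as np

def align_similarity(est, gt):
    """Umeyama closed-form similarity alignment: finds scale s,
    rotation R, and translation t minimising ||gt - (s R est + t)||."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))   # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after similarity alignment."""
    s, R, t = align_similarity(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Because the alignment absorbs one global scale, this ATE rewards methods whose scale is consistent along the whole trajectory, which is exactly what scale-drift elimination targets.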
The results of the other sequences are available on the project webpage. The results of SEMK and GCSEMK are marked with green dashed lines and blue solid lines, respectively. For each sequence, the blue solid lines generally lie below the green dashed lines, indicating that, in road driving scenarios, the normal vector of the ground surface traveled by the vehicle can be effectively found in the null space of the movement direction of the vehicle.
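The constrained fit behind GCSEMK can be illustrated as follows: restricting the normal n to the null space of the (unit) motion vector m, i.e. enforcing m . n = 0, reduces the plane-normal search to a 2-D eigenproblem with a closed-form solution. This sketch uses our own parametrization, not necessarily the paper's exact formulation:

```python
import numpy as np

def constrained_plane_normal(points, motion):
    """Closed-form plane-normal fit under the constraint m . n = 0:
    build an orthonormal basis of the null space of the motion vector m,
    express n in that 2-D basis, and take the eigenvector of the reduced
    scatter matrix with the smallest eigenvalue."""
    m = motion / np.linalg.norm(motion)
    # rows 2-3 of Vt from the SVD of m^T span the null space of m
    B = np.linalg.svd(m[None, :])[2][1:].T        # 3x2 orthonormal basis
    P = points - points.mean(axis=0)
    M = B.T @ (P.T @ P) @ B                       # reduced 2x2 scatter matrix
    w, V = np.linalg.eigh(M)                      # eigenvalues ascending
    n = B @ V[:, 0]                               # smallest-scatter direction
    return n / np.linalg.norm(n)
```

By construction the returned normal is exactly orthogonal to the motion direction, which is why the constrained fit is more stable than an unconstrained one when the 3DGPs are noisy.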

E. EXPERIMENTAL RESULTS ON THE RobotCar DATASET
For the RobotCar dataset, we used the images captured by the center camera of the Bumblebee XB3 trinocular stereo setup. The real-world camera height was set to h_gt = 1.58 m, according to the RobotCar dataset. For a fair comparison with Loo's method [19], the first 200 frames of each RobotCar sequence were skipped because the images are highly over-exposed at the beginning. In addition, the images were cropped to 1248 × 376. The results of the proposed methods were obtained using the same parameters as for the KITTI dataset. The ATE results and estimated trajectories are shown in Table 4 and Fig. 11, respectively. For sequence 2014/05/06/12/54/54, SEMK and GCSEMK obtained better results than Loo's method [19] and ORB-SLAM. The best result for sequence 2014/06/25/16/22/15 was obtained by Loo's method, since it is more robust to illumination changes than ORB-SLAM, SEMK, and GCSEMK. We found that, for short trajectories, such as sequences 03 and 04 of the KITTI dataset and the RobotCar sequences, the performance of SEMK and GCSEMK is similar. The experimental results indicate that the proposed method can accurately estimate the scale factor as long as the camera pose is reliably provided by ORB-SLAM.
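Given a fitted ground plane and the known mounting height, the scale factor is the ratio of the real to the estimated camera height. A minimal sketch, with the plane expressed in the current camera frame so the camera centre is the origin (the helper name is ours; 1.58 m is the RobotCar height quoted above):

```python
import numpy as np

def scale_from_camera_height(n, d, h_real=1.58):
    """Recover the metric scale from the fitted plane n . x + d = 0:
    the scale-ambiguous camera height is the point-plane distance from
    the origin, |d| / ||n||, and the scale factor is the ratio of the
    known real camera height to that estimate."""
    h_est = abs(d) / np.linalg.norm(n)
    return h_real / h_est
```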

F. DISCUSSION OF LIMITATIONS OF THE PROPOSED METHOD
The proposed 3DGP collection method is based on the assumption that the multiple keyframes share similar fields of view. When the view of the scene changes dynamically, such as on a curved road or a longitudinal slope as shown in Fig. 12(a), few 3DGPs can be reconstructed due to feature matching failure, and the number of collected 3DGPs can decrease significantly. If N ≤ δ, we use the previously computed scale factor to correct the current camera pose, as shown in Fig. 12(b)-(c). This may introduce a relatively large RMSE, as shown in Fig. 12(d). Representative keyframes corresponding to non-planar surfaces are shown in Fig. 12(e)-(f). As can be seen, since the view of the ground region changes dynamically, only a small number of feature points can be found and matched across such different viewpoints. For dynamic changes in the shape and illumination of the ground region, visual information alone is insufficient for robust navigation, so we plan to integrate the proposed method with inertial measurement unit sensors.
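The fallback described above reduces to a few lines; this is an illustrative sketch with hypothetical names, where `delta` defaults to the paper's setting of 15:

```python
def choose_scale(n_3dgps, estimated_scale, prev_scale, delta=15):
    """If too few 3DGPs survive the collection step (fewer than `delta`),
    fall back to the previously computed scale factor instead of
    trusting a plane fit on too few points."""
    if n_3dgps < delta or estimated_scale is None:
        return prev_scale
    return estimated_scale
```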

V. CONCLUSION
In this paper, aiming to estimate the scale factor accurately, we first collected 3DGPs across multiple keyframes, gathering the 3DGPs of each keyframe in the co-visibility graph. The proposed 3DGP collection method provides a sufficiently large number of 3DGPs for robust plane fitting; as a result, the scale factor can be computed accurately using only a 3D plane-fitting algorithm, without additional processing. To further reduce the scale estimation error caused by the inevitable uncertainty of the reconstructed 3DGPs, we searched for the normal vector of the 3D ground plane in the null space of the movement direction of the camera center: we modeled the 3D plane fitting as a constrained optimization problem and provided a closed-form solution. The experimental results demonstrated that the proposed geometric constraint effectively reduces the solution space of the normal vector. The performance on the challenging KITTI dataset showed that ORB-SLAM with the proposed scale estimation method outperforms the state-of-the-art conventional methods.