Fast and Low-Drift Visual Odometry With Improved RANSAC-Based Outlier Removal Scheme for Intelligent Vehicles

Visual odometry estimates the ego-motion of a vehicle using only the input of a single or multiple cameras mounted on the vehicle. This paper focuses on the research of the stereo visual odometry system of intelligent vehicles, and discusses how to improve the robustness, accuracy and efficiency. A new robust estimation algorithm, Locally Optimized Progressive Sample Consensus algorithm, is proposed. Compared with the RANSAC algorithm, it can not only improve the accuracy of model estimation, but also can find more inliers to terminate the iteration process in advance, thereby speeding up the algorithm. A decoupling-based motion estimation algorithm is proposed. Monocular method is used to estimate the rotation parameters, which eliminates the influence of mismatching between left and right frames on rotation estimation. Moreover, when estimating the translational motion, the decoupling method makes the normalized re-projection error criterion better distinguish between inliers and outliers. The performance of the method is evaluated on the KITTI benchmark dataset by comparing it with the existing visual odometry systems. The experimental results show that the proposed technique has a high accuracy and efficiency.


I. INTRODUCTION
In a mobile robot system, estimating the robot's own pose is essential for target detection and localization. Traditional pose estimation methods include the global positioning system (GPS), inertial measurement units (IMUs), wheel speed sensors and sonar positioning systems. In recent years, camera systems have become cheaper while offering higher resolution and frame rates, and computer performance has improved enough to make real-time image processing possible. These developments gave rise to a new pose estimation method, visual odometry (VO). The term VO was coined by Nister et al. [1] in 2004. It estimates the pose of an agent using only the stream of images acquired by a single or multiple cameras [2]. It has a low cost and can work in GPS-denied environments such as underwater and in the air. Its local drift rate is less than that of wheel speed sensors and low-precision IMUs. The data it obtains can be easily fused with other vision-based algorithms, eliminating the calibration between sensors.
The associate editor coordinating the review of this manuscript and approving it for publication was Antonio J. R. Neves.
The idea of estimating camera ego-motion from continuous image sequences was first proposed by Moravec [3]. He used a slidable camera to obtain visual information and completed the robot's indoor navigation. In 1987, Matthies et al. [4] designed a theoretical framework from feature extraction, feature matching and tracking to motion estimation, which is still followed by most VO systems. Most of the early VO systems were mainly used in planetary exploration [3], [5], the most typical of which was NASA's Mars exploration project. VO was used in the Mars Exploration Rovers to measure 6 degrees of freedom (6-DoF) parameters when the wheel speed sensor failed. Nister et al. [1] designed a real-time VO system, which truly realized the robot's outdoor navigation. At the same time, they also proposed two types of VO implementation methods and processes, namely monocular vision and stereo vision, which laid a new foundation for the later research of VO. In recent years, with the development of autonomous driving technology, VO is widely used in intelligent vehicles.
VO usually consists of a feature module and a motion estimation module. The feature module includes feature detection and feature matching. Every time a new frame is acquired, the algorithm first detects some significant and repeatable features. Then feature matching is carried out between images, that is, to find the projection points of the same spatial point on the two images. The motion estimation module is the core calculation step of VO systems, and it is mainly composed of two parts: outlier removal and motion estimation. Correspondences generated by feature matching usually contain some abnormal data points that do not conform to the mathematical model. These data points are called outliers, and the data points that conform to the mathematical model are called inliers. Outliers will have a serious impact on motion estimation, so they need to be removed. Some outlier removal works are done independently before motion estimation. These works usually can not remove all outliers, but can only increase the proportion of inliers in the data set. Therefore, it is usually necessary to use a robust estimation algorithm to remove outliers in the process of motion estimation. There is usually an iterative process between the robust estimation algorithm and the motion estimation, and the result of the iteration is to output a set of inliers. These inliers are used to calculate the relative motion between two consecutive frames, and then to calculate the current camera pose. This paper makes full use of the driving environment and motion characteristics of vehicles to construct a stereo VO system suitable for intelligent vehicles. The research is carried out from three aspects: feature module, robust estimation algorithm and motion estimation algorithm based on decoupling, aiming to improve the robustness, accuracy and efficiency of the algorithm. 
The major contributions of the work are summarized as follows: 1) A feature module that is appropriate for vehicle visual odometry is designed. Two constraints are imposed in the process of feature matching, namely speed smoothness constraint and nearest neighbor ratio constraint. These constraints can speed up the feature matching and reduce the probability of mismatching. The refined feature screening strategy is used to further select feature points by considering their spatial distribution and temporal evolution. The feature points are evenly distributed on the image by using bucketing technology. By retaining the older feature points, some features with better tracking characteristics are obtained for motion estimation. The experiment proves that the designed feature module can stably output high-quality correspondences.
2) In order to improve the ability of the VO system to remove outliers, a Locally Optimized Progressive Sample Consensus (LOPSAC) algorithm is proposed. Progressive sampling preferentially extracts the data points with high quality scores, which increases the probability of drawing an outlier-free sample and speeds up the algorithm. Local optimization finds the optimal solution near the hypothesis generated by the minimum set, which not only improves the accuracy of the model, but also finds more inliers so that the iteration process terminates early.
3) A motion estimation algorithm based on decoupling is proposed, and different outlier criteria are used according to the vehicle motion mode. There are many objective functions and outlier criteria used in VO systems. At present, the most commonly used is to construct the re-projection error on the image plane as the objective function and take it as the outlier criterion. However, the re-projection error is not only related to the matching quality of the feature points, but also related to its spatial position. In this paper, the normalized re-projection error based on decoupling is used as the outlier criterion of translation estimation when the vehicle goes straight. The decoupling calculation of rotation and translation not only enables the normalized re-projection error criterion to better distinguish between inliers and outliers, but also eliminates the influence of false matching between left and right frames on rotation estimation.
The rest of the paper is organized as follows. Related work is introduced in Section II. Sections III-V present the core algorithms, including the feature module design, the LOPSAC algorithm and the motion estimation based on decoupling. Section VI covers experiments and system evaluation, comparing the proposed method with the VISO2-S [6], S-PTAM [7] and ORB-SLAM2 [8] systems. Conclusions are drawn in Section VII.

II. RELATED WORK
The essential parts of any VO system are outlier removal and motion estimation. Outlier removal can be divided into independent outlier removal before motion estimation and outlier removal using a robust estimation algorithm. Some methods remove outliers by analyzing the consistency of optical flow. Adam et al. [9] rotated an image to build many image pairs with matching points. They showed that correct matching points create line segments that point in approximately the same direction. Using the mean shift mode algorithm, they found the segments that do not point in the mode direction and classified them as false matches. Grinstead et al. [10] argued that when the motion of the agent conforms to the planar motion assumption, the optical flow directions of correct matches share a common direction. They used a directional tracking algorithm to calculate the consistency between the optical flow direction of each feature point and the main direction, and then excluded the feature points with poor consistency by setting a threshold. Santana et al. [11] further advanced the above two methods. They observed that changes in the roll angle make the optical flow pattern complex, so that correctly matched optical flow no longer has a common direction. To reduce the interference of the roll angle, they divided the image into left and right parts. Although the optical flow of each part still does not have a common direction, the difference among correct matches is much smaller than that among mismatches. Therefore, the points with a large difference in optical flow direction can be removed separately in the left and right parts. In addition to mismatches, feature points on moving targets are also an important source of outliers.
VOLUME 10, 2022
The RDSLAM system [12] developed by Zhejiang University can detect scene changes and identify the changed three-dimensional points. We [13] proposed a spatial position constraint method to remove moving points based on the smoothness principle of vehicle motion. However, these methods can not deal with scenes with significant changes in a short time.
The above methods can not remove all the outliers, but only increase the proportion of inliers in the data set. Therefore, the robust estimation algorithm used in the process of motion estimation is an important means of outlier removal. At present, most VO systems [6], [14]- [16] use RANSAC as a robust estimation algorithm, but the way of using it is slightly different. The VISO2-S system [6] set the iteration number to a fixed value of 200. The NOTF system [14] set the estimated proportion of inliers to 0.5, which can also obtain a fixed iteration number. Wu et al. [15] used the new maximum consistent set to update the required iteration number after each iteration, so the iteration process can be ended early. The SSLAM system [16] used RANSAC in both feature matching and motion estimation. It can improve the proportion of inliers and save the time of motion estimation, but it will increase the calculation time of the feature module. Although there are some improved algorithms for RANSAC [17], [18], the current VO system rarely uses these algorithms. Only the cv4xv1-sc system [19] adopted the MLESAC algorithm [20], which is an improved algorithm of RANSAC. Based on the in-depth study of RANSAC and its improved algorithm, this paper proposes a robust estimation algorithm that takes into account both accuracy and efficiency, and applies it to the VO system.
The objective function needs to be constructed according to the vehicle motion model to solve the motion parameters. The most commonly used vehicle motion model is the 6-DoF method. This method usually adopts the re-projection error to construct the objective function, and uses the nonlinear least square method to solve the motion parameters. For example, Pire et al. [7] used the sum of squares of re-projection errors as the objective function. Wu et al. [15] added weight coefficients to the re-projection error according to the characteristics of the feature points. Li et al. [21] fused the semantic invariant with the re-projection error function. In order to improve the efficiency of motion estimation, some vehicle VO systems also use motion model constraints. For example, assuming that the camera makes plane motion, the complexity of the motion model was reduced to 3 degrees of freedom [22]. Scaramuzza et al. [23], [24] reduced the complexity of the motion model to 2 degrees of freedom by introducing the non-holonomic constraints of vehicle motion. They only used a pair of feature points to obtain the model solution of vehicle motion, and the motion estimation frequency was as high as 400Hz. However, if the vehicle's motion is inconsistent with the assumed model, the accuracy of motion estimation will be significantly reduced.
The above methods solve the rotation and translation motion parameters simultaneously. However, some methods [25]- [27] decouple rotation and translation. The SOFT system [25] showed that solving the rotation and translation separately is more conducive to improving the motion estimation accuracy in the stereo VO system. The reason is that rotation can be estimated by the monocular method. The monocular method only involves feature matching between consecutive frames, while the stereo vision method involves a quadruple matching. Therefore, the monocular method has a higher probability of obtaining correct matching. In this paper, the decoupling-based normalized re-projection error is introduced as one of the criteria to measure the outliers of translation estimation according to the vehicle's motion mode. The decoupling calculation of rotation and translation not only makes the normalized re-projection error criterion better distinguish between inliers and outliers, but also can use the monocular method to estimate the rotation parameters, eliminating the influence of the mismatch between the left and right frames on the rotation estimation.

A. FEATURE EXTRACTION AND MATCHING
The combination of the ORB feature detector and the ORB feature descriptor [28] is robust to a variety of image changes. When applied to VO systems, ORB-ORB shows high performance in both accuracy and efficiency. Therefore, ORB-ORB is selected as the feature operator of the system. Since the ORB descriptor is a binary feature descriptor, the Hamming distance is used as the similarity measure in feature matching. To narrow the search range and reduce the proportion of outliers, two constraints are imposed in the matching process, namely the speed smoothness constraint and the nearest neighbor ratio constraint. For the implementation of the speed smoothness constraint, please refer to [13]; the nearest neighbor ratio constraint is introduced in detail here.
Suppose that the template image and the image to be matched are $I_1$ and $I_2$, respectively. There is a feature point $p$ on $I_1$ with descriptor vector $V_p$. The nearest neighbor and next-nearest neighbor of $p$ on $I_2$ are the feature points $q$ and $q'$, with descriptor vectors $V_q$ and $V_{q'}$. The nearest neighbor to next-nearest neighbor distance ratio $R_p$ is defined as:
$$R_p = \frac{D_{HAM}(V_p, V_q)}{D_{HAM}(V_p, V_{q'})} \tag{1}$$
Here $D_{HAM}(\cdot)$ is the Hamming distance function. $R_p$ represents the matching quality. It is taken as an indicator of whether to accept the matching, and a matching with $R_p$ greater than or equal to the threshold $T_R$ is discarded. In addition, the nearest neighbor distance must be less than the threshold $T_D$. The matching relationship between points $p$ and $q$ is expressed as follows:
$$p \begin{cases} = q, & R_p < T_R \ \text{and} \ D_{HAM}(V_p, V_q) < T_D \\ \neq q, & \text{otherwise} \end{cases} \tag{2}$$
Here "$=$" means accepting the matching and "$\neq$" means rejecting the matching.
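The acceptance rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: descriptors are stored as plain integers, and the function names `hamming` and `ratio_match` as well as the default thresholds `t_r = 0.8` and `t_d = 64` are assumptions chosen for the example, not values taken from the paper.

```python
from typing import List, Optional


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors stored as integers."""
    return bin(a ^ b).count("1")


def ratio_match(v_p: int, candidates: List[int],
                t_r: float = 0.8, t_d: int = 64) -> Optional[int]:
    """Return the index of the accepted nearest neighbor, or None if rejected.

    t_r and t_d play the roles of T_R and T_D; their defaults are hypothetical.
    """
    if len(candidates) < 2:
        return None  # the ratio needs a nearest and a next-nearest neighbor
    order = sorted(range(len(candidates)),
                   key=lambda j: hamming(v_p, candidates[j]))
    d1 = hamming(v_p, candidates[order[0]])  # nearest neighbor distance
    d2 = hamming(v_p, candidates[order[1]])  # next-nearest neighbor distance
    if d1 >= t_d:
        return None  # nearest neighbor is not similar enough
    if d2 == 0 or d1 / d2 >= t_r:
        return None  # ambiguous match: ratio R_p is too large
    return order[0]
```

A smaller ratio means the best candidate is clearly better than the runner-up, which is why the same value can later serve as a quality score for sorting correspondences.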

B. FEATURE SCREENING BY BUCKETING PLUS AGING
Although the above constraints are imposed in the process of feature matching, the number of correspondences retained is still large. Using too many correspondences for motion estimation does not improve accuracy and increases the complexity of the algorithm. Therefore, we only retain feature points with better characteristics. To achieve accurate motion estimation, the bucketing technique [6] is adopted to distribute the retained features evenly over the image. Bucketing divides the image into small rectangular blocks, namely buckets, and retains only a limited number of features in each bucket. The traditional bucketing method [6] selects the features retained in each bucket at random, which is a very simple strategy. This paper adopts a more refined feature screening strategy that uses an age attribute as the selection criterion within each bucket. The age of a newly detected feature is 0, and it increases by one each time the feature is successfully tracked in a subsequent frame. We conducted an experiment on a group of continuous images of the KITTI dataset [29] to analyze the relationship between the age of features and their tracking characteristics. After tracking 50 frames, all the features detected at the beginning had disappeared, i.e., the maximum survival age of features in this experiment was 49. The mean of the inter-frame errors of all features with the same survival age is called the average inter-frame error (AIE). Figure 1 shows the relationship between the AIE and the survival age. It indicates that features with a longer survival age give a smaller AIE. Many experiments on other image data sets also lead to the same conclusion.
The above analysis shows that the features with older survival age have better tracking characteristics. The features with a longer survival age are more reliably tracked, thus giving less tracking error. Therefore, they have a higher priority to be selected for motion estimation in this study. Based on this principle, the points in each bucket are sorted in descending order of age to form a sequence, and one feature point is selected from the head of the sequence each time.
Assuming that the number of selected features in the bucket is b, stop selecting when one of the following two conditions is true: a) No more candidate features in the bucket; b) b = b max . Here b max represents the maximum number of feature points in each bucket. Figure 2 shows the feature points retained by the above screening method, where b max = 2. Each square separated by the yellow line represents a bucket, the green dots represent the retained feature points, and the red number represents the age of the corresponding point.
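The screening rule can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: features are represented as `(u, v, age)` tuples, and the function name `select_by_bucket` and the bucket size parameters are hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[float, float, int]  # (u, v, age); layout assumed for this sketch


def select_by_bucket(features: List[Feature], bucket_w: float,
                     bucket_h: float, b_max: int) -> List[Feature]:
    """Keep at most b_max features per bucket, oldest features first."""
    buckets: Dict[Tuple[int, int], List[Feature]] = defaultdict(list)
    for feat in features:
        u, v, _ = feat
        buckets[(int(u // bucket_w), int(v // bucket_h))].append(feat)

    kept: List[Feature] = []
    for key in sorted(buckets):
        # Sort in descending order of age and take from the head of the sequence.
        oldest_first = sorted(buckets[key], key=lambda f: f[2], reverse=True)
        # Slicing covers both stopping conditions: bucket exhausted or b == b_max.
        kept.extend(oldest_first[:b_max])
    return kept
```

For example, with 50 x 50 pixel buckets and `b_max = 2`, a bucket holding features of ages 3, 7 and 1 keeps only the features of ages 7 and 3.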

IV. LOPSAC ALGORITHM
There are still outliers in the data set output by the feature module. This section addresses model estimation in the presence of outliers, i.e. the robust estimation algorithm. The RANSAC algorithm [30] can extract the optimal subset from a data set containing a large proportion of outliers through iteration. It is the most commonly used robust estimation algorithm. Ideally, the probability of finding the correct model is equal to the probability of randomly selecting a sample without outliers. RANSAC gives the number of iterations required as follows:
$$k_{max} = \frac{\log(1 - \eta_0)}{\log(1 - \alpha^m)} \tag{3}$$
where $\alpha$ is the proportion of inliers and $m$ is the sample size. $\eta_0$ is the confidence of extracting at least one sample without outliers, and it is usually set to 0.95. This paper proposes the LOPSAC algorithm, which improves RANSAC in three respects: progressive sampling, reverse order model verification and local optimization.

A. PROGRESSIVE SAMPLING
Let the number of all correspondences be $N$, and let the data points $\mu_i$ ($i = 1 \ldots N$) in the data set be arranged in descending order of quality score to obtain the set $M_N$. The quality score in this paper is the nearest neighbor to next-nearest neighbor distance ratio $R_p$ defined in (1), where a smaller $R_p$ indicates a higher-quality match. Progressive sampling assumes that the data points with higher quality scores are more likely to be inliers. Standard RANSAC randomly extracts $T_N$ samples of size $m$ from $M_N$, and the formed sequence is expressed as $\{\mathcal{S}_i\}_{i=1}^{T_N}$. Progressive sampling also extracts all the samples $\mathcal{S}_i$ in the sequence. However, sampling is carried out in the order of quality scores from high to low, and the sequence formed is $\{\mathcal{S}_{(i)}\}_{i=1}^{T_N}$. Suppose the data set composed of the first $n$ data points with the highest quality scores in $M_N$ is $M_n$, and $\{\mathcal{S}_i\}_{i=1}^{T_N}$ contains on average $T_n$ ($n \geq m$) samples whose data points are all from $M_n$, i.e. $T_n = T_N \binom{n}{m} / \binom{N}{m}$. The recursive relationship of $T_{n+1}$ can then be derived:
$$T_{n+1} = \frac{n+1}{n+1-m}\, T_n \tag{4}$$
There are $T_n$ samples with all data points from $M_n$, and $T_{n+1}$ samples with all data points from $M_{n+1}$. Since the subsets $M_n$ and $M_{n+1}$ satisfy $M_{n+1} = M_n \cup \{\mu_{n+1}\}$, there are $T_{n+1} - T_n$ samples containing the data point $\mu_{n+1}$ and $m - 1$ data points from $M_n$. Therefore, for $n = m \ldots N$, $T_{n+1} - T_n$ samples need to be drawn, and each sample is composed of $\mu_{n+1}$ and $m - 1$ data points randomly drawn from $M_n$. Since $T_n$ is not necessarily an integer, the integer count $T'_n$ is defined as follows:
$$T'_m = 1, \qquad T'_{n+1} = T'_n + \lceil T_{n+1} - T_n \rceil \tag{5}$$
where $\lceil \cdot \rceil$ means rounding up. So for $n = m \ldots N$, the number of samples to be drawn is $T'_{n+1} - T'_n$. According to the above analysis, the outline of the LOPSAC algorithm is shown in Algorithm 1. Since progressive sampling preferentially extracts the data points that are more likely to be inliers, it is likely to end the sampling process earlier, thereby speeding up the algorithm.
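The sampling schedule can be made concrete with a small Python sketch. It assumes the standard PROSAC growth function, in which $T_n$ equals $T_N$ times the fraction of all size-$m$ subsets that lie in the top $n$ points, and an initial integer count of 1 for $n = m$. The names `growth_schedule` and `draw_sample` are hypothetical, and exact fractions are used only to avoid rounding issues in the ceiling operation.

```python
import math
import random
from fractions import Fraction
from typing import Dict, List


def growth_schedule(n_total: int, m: int, t_total: int) -> Dict[int, int]:
    """Map each subset size n (m..n_total) to the integer sample count T'_n."""
    # T_m = T_N * C(m, m) / C(N, m), built as a product of exact fractions.
    t_n = Fraction(t_total)
    for i in range(m):
        t_n *= Fraction(m - i, n_total - i)

    t_prime = {m: 1}  # assumed initial value for the smallest subset
    for n in range(m, n_total):
        t_next = t_n * Fraction(n + 1, n + 1 - m)  # recursion for T_{n+1}
        t_prime[n + 1] = t_prime[n] + math.ceil(t_next - t_n)
        t_n = t_next
    return t_prime


def draw_sample(n: int, m: int, rng: random.Random) -> List[int]:
    """Sample made of the newest point (0-based index n) plus m - 1 points
    drawn at random from the n higher-quality points before it."""
    return [n] + rng.sample(range(n), m - 1)
```

With 5 points, samples of size 2 and a budget of 10 samples, the schedule grows as 1, 3, 6, 10, so the full RANSAC budget is only reached once all points are eligible.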

B. REVERSE ORDER MODEL VERIFICATION
A simple and efficient model verification method is designed in this paper. Assuming that the maximum number of inliers obtained from the previous $k - 1$ samplings is $I_{max}$, the minimum number of outliers is $N - I_{max}$. When the data points in $M_N$ are verified with the model parameters $\theta_k$ obtained in the $k$-th sampling, the verification can be stopped as soon as the number of outliers found in $M_N$, denoted $O_k$, satisfies:
$$O_k > N - I_{max} \tag{6}$$
because the current model can then no longer reach $I_{max}$ inliers. In addition, since the data points in $M_N$ are arranged in descending order of quality score, the data points at the back of the queue are more likely to be outliers. Therefore, this paper verifies the data points in $M_N$ from back to front, so that enough outliers are found earlier and verification time is saved further.
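A minimal Python sketch of this verification strategy follows. It assumes that the per-point errors have already been computed in the quality-sorted order of the data set, and that verification is abandoned once the outlier count exceeds the minimum outlier count of the best model so far. The strict inequality and the name `count_inliers_reverse` are assumptions of this sketch.

```python
from typing import List, Optional


def count_inliers_reverse(errors: List[float], tau: float,
                          i_max: int) -> Optional[int]:
    """Verify points from the lowest-quality end of the sorted data set.

    errors[i] is the model error of the i-th point in descending quality order.
    Returns the inlier count, or None if the model was rejected early because
    it can no longer reach i_max inliers.
    """
    n = len(errors)
    outliers = 0
    for err in reversed(errors):  # back to front: likely outliers first
        if err >= tau:
            outliers += 1
            if outliers > n - i_max:
                return None  # early exit, remaining points are not checked
    return n - outliers
```

Because low-quality points are checked first, a poor hypothesis is typically rejected after only a few evaluations.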

C. LOCAL OPTIMIZATION
The standard RANSAC assumed that a model computed from an uncontaminated sample is consistent with all inliers. However, this assumption is usually invalid when the observation data are polluted by a lot of noise. Since the assumption of RANSAC is generated from the minimum set, it is greatly affected by noise. When the iteration number reaches the stop criterion, the number of inliers is usually less than the expected value. As a result, the number of effective inliers is reduced, so the iteration number required by RANSAC is usually greater than the expected value determined by (3).
Although the observation data are polluted by noise, a hypothesis generated by a minimum set containing only inliers is always close to the optimal solution. Therefore, local optimization of such a hypothesis is expected to achieve good results. If the current maximum consensus set is obtained in the $k$-th iteration, local optimization is started. First, a non-minimum set is randomly drawn from the $I_k$ inliers obtained in the $k$-th iteration, and the model parameters are calculated from this set and used to verify all data points in $M_N$. New model parameters are then calculated from the data points whose error is less than the threshold $\rho \cdot \tau$ ($\rho > 1$). $\rho$ is reduced and the process is iterated until the error threshold decreases to $\tau$, at which point the model and consistent set are recorded. The above process is repeated $k_{in}$ times ($k_{in} = 10$ in this paper), and the maximum consistent set and its model are taken as the return value of the $k$-th iteration. The outline of the local optimization algorithm is shown in Algorithm 2. The local optimization step does not need to be executed in every sampling round; it only starts when the current model generates a new maximum consensus set. Therefore, the number of local optimizations is much smaller than the maximum sampling number $k_{max}$.
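The shrinking-threshold loop can be sketched in Python on a deliberately simple toy problem. The model here is a single scalar estimated as the mean of its inliers, which stands in for the motion model purely for illustration. The function name, the default values `rho = 3.0` and `d_rho = 0.5`, and the scalar model are all assumptions of this sketch; only the sample size rule and `k_in = 10` follow the text.

```python
import random
from statistics import mean
from typing import List, Optional, Tuple


def local_optimization(data: List[float], consensus: List[float], tau: float,
                       rho: float = 3.0, d_rho: float = 0.5, k_in: int = 10,
                       rng: Optional[random.Random] = None
                       ) -> Tuple[Optional[float], List[float]]:
    """Refine a scalar model starting from a consensus set of inliers."""
    rng = rng or random.Random(0)
    best_theta: Optional[float] = None
    best_set: List[float] = []
    for _ in range(k_in):
        # Non-minimum set: half of the consensus set, capped at 14 points.
        size = max(1, min(len(consensus) // 2, 14))
        theta = mean(rng.sample(consensus, size))
        r = rho
        while r >= 1.0:
            inliers = [x for x in data if abs(x - theta) < r * tau]
            if not inliers:
                break
            theta = mean(inliers)  # re-estimate from the current inliers
            r -= d_rho             # tighten the threshold towards tau
        final = [x for x in data if abs(x - theta) < tau]
        if len(final) > len(best_set):
            best_theta, best_set = theta, final
    return best_theta, best_set
```

Starting from a loose threshold lets the refined model absorb inliers that the noisy minimal-sample hypothesis missed, which is what allows the outer loop to terminate earlier.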

D. STOPPING CRITERION
Since the proportion of inliers ($\alpha$) in the data set is usually unknown, standard RANSAC uses the worst-case estimate of $\alpha$ to calculate the maximum sampling number $k_{max}$, which stays constant throughout the process. This paper uses an adaptive estimation method: every time a new maximum consistent set is generated, it is used to update $\alpha$, and $k_{max}$ is then updated with (3). The implementation of the iteration stopping criterion has been reflected in Algorithm 1. To make a distinction, the "RANSAC" mentioned later refers to the RANSAC algorithm that determines $k_{max}$ by the adaptive method. The RANSAC whose $k_{max}$ is determined by the initial setting is called "RANSAC-$k_{max}$". For example, if $k_{max}$ is set to a fixed value of 200, it is represented as "RANSAC-200".

[Algorithm 1, local optimization step: input the consensus set of the $k$-th iteration, call the local optimization program of Algorithm 2, and return the optimized model parameters $\theta_k^{lo}$ and the optimized consensus set.]
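The adaptive stopping rule can be illustrated with a short Python simulation. It assumes the standard RANSAC iteration formula based on the inlier ratio, the sample size and the confidence. The function names and the initial budget `k_init = 1000` are hypothetical, and the per-iteration inlier counts are supplied as a list instead of being produced by real model fitting.

```python
import math
from typing import Iterable, Tuple


def required_samples(alpha: float, m: int, eta0: float = 0.95) -> int:
    """Number of samples needed to draw one outlier-free sample
    with confidence eta0, given inlier ratio alpha and sample size m."""
    p_good = alpha ** m
    if p_good <= 0.0:
        return 10 ** 9  # effectively unbounded when no inliers are known
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - eta0) / math.log(1.0 - p_good))


def adaptive_run(inlier_counts: Iterable[int], n_points: int, m: int,
                 eta0: float = 0.95, k_init: int = 1000) -> Tuple[int, int]:
    """Return (iterations executed, best inlier count) under adaptive k_max."""
    k_max, best, k = k_init, 0, 0
    for count in inlier_counts:
        if k >= k_max:
            break
        k += 1
        if count > best:  # new maximum consistent set: update alpha and k_max
            best = count
            k_max = min(k_max, required_samples(best / n_points, m, eta0))
    return k, best
```

For an inlier ratio of 0.5 and a 5-point sample this formula asks for 95 samples, while a 1-point sample needs only 5, which shows why smaller minimal sets terminate much sooner.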

V. MOTION ESTIMATION ALGORITHM BASED ON DECOUPLING
In this section, a more discriminative outlier criterion is designed when estimating the translational motion according to the motion characteristics of the vehicle. In addition, a motion estimation algorithm based on decoupling is proposed.

A. OUTLIER CRITERION OF TRANSLATION ESTIMATION
Taking the re-projection error as the outlier criterion causes problems when estimating the translational motion. The problem is particularly prominent in scenes with large differences in the spatial positions of feature points. To solve this problem, a new outlier criterion needs to be designed.

Algorithm 2 The Outline of the Local Optimization Algorithm
Input: $M_N$: the data set in which data points are arranged in descending order of quality scores; $\tau$: inlier threshold; $\rho$: threshold coefficient of local optimization; $\Delta\rho$: attenuation step length of $\rho$; $k_{in}$: sampling number of local optimization; $\mathcal{I}$: consistent set before local optimization
Output: $\theta^{lo}$: model parameters after local optimization; $\mathcal{I}^{lo}$: consistent set after local optimization
Repeat $k_{in}$ times:
    A non-minimum set of size $\min(|\mathcal{I}|/2, 14)$ is randomly drawn from $\mathcal{I}$
    Use the non-minimum set to calculate the model parameters $\theta$
    while $\rho \geq 1$ do
        Use model parameters $\theta$ to verify the data points in $M_N$
        Get the data point set whose error is less than the threshold $\rho \cdot \tau$
        Update model parameters $\theta$ with the data point set
        Reduce $\rho$ by $\Delta\rho$
    end while
    Record the model and its consistent set
Return the maximum consistent set and its model as $\mathcal{I}^{lo}$ and $\theta^{lo}$

1) THE RE-PROJECTION ERROR IN THE CASE OF STRAIGHT TRAVEL WITHOUT ROTATION
Suppose the pixel coordinate of the spatial point $P^i = [X^i, Y^i, Z^i]^T$ on frame $k$ is $p_k^i = [u_k^i, v_k^i]^T$, and the image coordinate is $\bar{p}_k^i = [x_k^i, y_k^i]^T$. The rotation matrix from frame $k - 1$ to frame $k$ is $R_k$, and the translation vector is $t_k$. The re-projection error is then expressed as:
$$\varepsilon_k^i = \left\| p_k^i - \pi\left(R_k P_{k-1}^i + t_k\right) \right\| \tag{7}$$
where $P_{k-1}^i$ is the coordinate of the spatial point $P^i$ on frame $k - 1$, which can be obtained by triangulation, and $\pi(\cdot)$ is the perspective projection function. Since the motion parameters are the same for all feature points, $\varepsilon_k^i$ can be used to measure the matching quality only if it depends solely on the motion parameters and the matching quality, and on no other factor. We now examine whether other factors affect the re-projection error $\varepsilon_k^i$ in addition to these two. For convenience of analysis, it is assumed that the correspondences contain a motion parameter error but no matching error. The motion parameters with error are expressed as $(\tilde{R}_k, \tilde{t}_k)$. The re-projection error $\tilde{\varepsilon}_k^i$ at this time can be expressed as:
$$\tilde{\varepsilon}_k^i = \left\| p_k^i - \pi\left(\tilde{R}_k P_{k-1}^i + \tilde{t}_k\right) \right\| = \left\| p_k^i - \tilde{p}_k^i \right\| \tag{8}$$
Since there is no matching error, $p_k^i$ is the real pixel coordinate, and $\tilde{p}_k^i$ contains the errors caused by the motion parameters.
To analyze the factors affecting $\tilde{\varepsilon}_k^i$, it is assumed that the vehicle travels straight without rotation. That is, the rotation matrix $R$ is the identity matrix, and the translation vector $t = [t_x, t_y, t_z]^T$ satisfies the straight-travel constraint (9). Let $\lambda_{k-1}^i = Z_{k-1}^i / f$, and let the motion parameter with error be $\tilde{t}_z = t_z + \Delta t_z$. Combining the constraints on $R$ and $t$ with the re-projection error in (8) yields (10). It can be seen from (10) that $\tilde{\varepsilon}_k^i$ is related not only to the motion parameter error, but also to the projection position $(x_{k-1}^i, y_{k-1}^i)$ and to the parameter $\lambda_{k-1}^i$ determined by the depth. If the re-projection error is used as the outlier criterion, points far from the center of the image and points close to the camera are therefore easily removed.
In fact, $x_{k-1}^i$, $y_{k-1}^i$ and $\lambda_{k-1}^i$ are all determined by the spatial position of the feature point. If a quantity related to the spatial position of the feature point can be found to normalize $\tilde{\varepsilon}_k^i$, the influence of this factor can be eliminated. That quantity is the amplitude of the optical flow. Combining with the constraints on $R$ and $t$, the normalized re-projection error $\bar{\varepsilon}_k^i$ is given by (11). With the approximate equality in (11), it can be seen that the re-projection error after normalization is no longer related to the image coordinates and distance of the feature points.
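The effect of the normalization can be checked numerically with a small Python sketch. It assumes a pinhole camera without principal point offset, the convention that a point moves from $P_{k-1}$ to $P_{k-1} + t$ under pure translation, and a hypothetical focal length of 700 pixels. Normalization is taken to mean division by the optical-flow amplitude between the two frames; the function names are introduced here for illustration.

```python
import math
from typing import Tuple

Point3 = Tuple[float, float, float]


def project(point: Point3, f: float) -> Tuple[float, float]:
    """Pinhole projection without principal point offset (assumption)."""
    x, y, z = point
    return (f * x / z, f * y / z)


def errors_forward(point_prev: Point3, t_z: float, t_z_est: float,
                   f: float = 700.0) -> Tuple[float, float]:
    """Return (re-projection error, error normalized by optical-flow amplitude)
    for pure forward motion with a wrong estimate t_z_est of t_z and
    no matching error."""
    x, y, z = point_prev
    p_prev = project(point_prev, f)
    p_true = project((x, y, z + t_z), f)      # observed pixel in frame k
    p_est = project((x, y, z + t_z_est), f)   # pixel predicted by wrong motion
    reproj = math.dist(p_true, p_est)
    flow = math.dist(p_true, p_prev)          # optical-flow amplitude
    return reproj, reproj / flow


near = errors_forward((2.0, 1.0, 10.0), -1.0, -1.1)
far = errors_forward((2.0, 1.0, 40.0), -1.0, -1.1)
```

In this setting the raw error of the point at 10 m depth is more than ten times that of the point at 40 m, although both are perfectly matched, while both normalized errors stay close to the relative translation error of 0.1.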

2) RELAX CONSTRAINTS
The preceding paragraphs impose constraints on the rotation and translation of the vehicle. When the vehicle is driving at a high speed, the normalized re-projection error $\bar{\varepsilon}_k^i$ is almost independent of the spatial position of the feature point, so it is suitable as an outlier criterion. However, when the rotation and translation constraints are not satisfied, the effect of this method is not ideal. The rotation constraint is relaxed by decoupling the rotation and translation. First, the monocular method is used to calculate the rotational motion $\hat{R}_k$, and then $\hat{R}_k$ is used to compensate for the optical flow caused by the rotation of the vehicle, so that the normalized re-projection error based on decoupling (DNRE) is obtained:
$$\delta_k^i = \frac{\left\| p_k^i - \pi\left(\hat{R}_k P_{k-1}^i + t_k\right) \right\|}{\left\| p_k^i - \pi\left(\hat{R}_k P_{k-1}^i\right) \right\|} \tag{12}$$
Here $t_k$ is the translation vector to be estimated. The denominator is the amplitude of the optical flow generated by translational motion. Because $\hat{R}_k$ is the estimated rotation matrix, the re-projection error in the numerator approximates the error caused by pure translational motion. Compared with the normalized re-projection error in (11), the normalized re-projection error based on decoupling removes the rotation constraint.
The translation constraint limits the motion mode of the vehicle to driving straight ahead. When calculating the translational motion, we adopt different outlier criteria according to the vehicle motion. According to the smoothness principle of the vehicle speed, if the translation vector $t_{k-1} = [t_x, t_y, t_z]^T$ of the previous frame satisfies condition (13), where $T_{rat}$ is the preset threshold, the translational motion of the vehicle is mainly straight ahead. In this case, the DNRE is used as the outlier criterion. Otherwise, the re-projection error (RE) is used as the outlier criterion.

B. MOTION ESTIMATION ALGORITHM
This section constructs the framework of the motion estimation algorithm, and introduces in detail the two important components of the framework, namely rotation estimation and translation estimation.

1) ALGORITHM FRAMEWORK
The flow of the motion estimation algorithm is shown in Figure 3. The input of the algorithm is the feature correspondences of frame k − 1 and frame k, and the output is the rotation matrix R k and the translation vector t k . Assuming that the set of all correspondences is M N , we first use the left image point set M l N to calculate the rotation matrix R k through the 5-point LOPSAC algorithm (Algorithm 3). After obtaining the rotation matrix, take it as a known quantity and use the binocular point set M N to calculate the translation vector t k . At this time, at least one point is needed to solve the translation vector, so the 1-point algorithm is used to calculate the translation vector.

2) ROTATION ESTIMATION
We use the left-image point set $M^{l}_{N}$ to calculate the rotation matrix $R_k$ through the 5-point LOPSAC algorithm. The specific implementation is described in Algorithm 3. Because the translation vector calculated by the monocular method lacks absolute scale, the model parameters obtained comprise 3 rotation parameters and 2 translation parameters; only the rotation parameters are used here.

Local optimization (step of Algorithm 3): input the consensus set of frame $k$, call the local optimization procedure of Algorithm 2, and return the locally optimized model $\theta^{lo}_{k}$ and its consensus set.

3) TRANSLATION ESTIMATION
According to the rigid body motion equation: where $R_k$ is the rotation matrix obtained through the rotation estimation; $t_k$ is the translation vector to be estimated; and ${P'}^{i}_{k-1}$ is the coordinate of $P^{i}_{k-1}$ after rotation. There is only translational motion, and no rotational motion, between ${P'}^{i}_{k-1}$ and $P^{i}_{k}$. The pixel coordinate ${p'}^{i}_{k}$ of the spatial point ${P'}^{i}_{k-1}$ in frame $k$ is then expressed as: Thus, the following objective function is constructed: where $p^{i}_{k}$ is the pixel coordinate obtained by feature matching and $\varepsilon^{i}_{k}$ is the re-projection error. Since there is no rotation between ${P'}^{i}_{k-1}$ and $P^{i}_{k}$, the rotation constraint is satisfied. When the vehicle motion conforms to (13), it approximately satisfies the translation constraint, and the DNRE is used as the outlier criterion. Here it is expressed as: For a correspondence $\mu^{i}_{k} \in M_N$ and the inlier set, when the vehicle motion conforms to (13), we use the following formula to determine whether $\mu^{i}_{k}$ is an inlier: When the vehicle motion does not conform to (13), we use the following formula to determine whether $\mu^{i}_{k}$ is an inlier: Here $\tau_{\delta}$ and $\tau_{\varepsilon}$ are preset thresholds for determining inliers. The translation vector has 3 degrees of freedom, and each feature point provides 3 constraints. Therefore, at least one binocular correspondence is required to solve for the translation vector $t_k$. Since the 1-point algorithm is not suitable for progressive sampling, this paper uses the Gauss-Newton method to solve for $t_k$ within a framework of RANSAC plus local optimization.
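As an illustration of the translation step, the following Python sketch runs a plain Gauss-Newton iteration on the stereo re-projection error once the points have been rotated by the estimated rotation. It is a sketch under stated assumptions, not the paper's implementation: the model assumes that a rotated point plus $t_k$ gives the point in frame $k$, the cameras are rectified pinhole cameras sharing `f`, `cx`, `cy` with a horizontal `baseline`, and each correspondence contributes three residuals (left u, left v, right u), which is one way to realize the "3 constraints per point" mentioned above. The RANSAC loop and local optimization are omitted, and all names are hypothetical.

```python
def stereo_project(P, f, cx, cy, baseline):
    """Project P = (X, Y, Z) to (u_left, v, u_right) for a rectified stereo rig."""
    X, Y, Z = P
    return (f * X / Z + cx, f * Y / Z + cy, f * (X - baseline) / Z + cx)

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular.
    """
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, 3):
            k = M[r][c] / M[c][c]
            for j in range(c, 4):
                M[r][j] -= k * M[c][j]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = sum(M[r][j] * x[j] for j in range(r + 1, 3))
        x[r] = (M[r][3] - s) / M[r][r]
    return x

def estimate_translation(points_rot, observations, f, cx, cy, baseline,
                         iters=20, tol=1e-10):
    """Gauss-Newton estimate of t minimizing the stereo re-projection error.

    points_rot   : rotation-compensated 3D points from frame k-1
    observations : matched (u_left, v, u_right) measurements in frame k
    """
    t = [0.0, 0.0, 0.0]
    for _ in range(iters):
        H = [[0.0] * 3 for _ in range(3)]  # J^T J
        g = [0.0] * 3                      # J^T r
        for (X0, Y0, Z0), obs in zip(points_rot, observations):
            X, Y, Z = X0 + t[0], Y0 + t[1], Z0 + t[2]
            pred = stereo_project((X, Y, Z), f, cx, cy, baseline)
            res = [p - o for p, o in zip(pred, obs)]
            # Jacobian rows of (u_left, v, u_right) with respect to t.
            J = [(f / Z, 0.0, -f * X / (Z * Z)),
                 (0.0, f / Z, -f * Y / (Z * Z)),
                 (f / Z, 0.0, -f * (X - baseline) / (Z * Z))]
            for row, r in zip(J, res):
                for a in range(3):
                    g[a] += row[a] * r
                    for b in range(3):
                        H[a][b] += row[a] * row[b]
        step = solve3(H, [-gi for gi in g])
        t = [ti + si for ti, si in zip(t, step)]
        if max(abs(si) for si in step) < tol:
            break
    return tuple(t)
```

Because the right-image coordinate adds depth information, the normal equations are full rank even for a single correspondence, so the same routine can serve as the minimal 1-point solver inside a sampling loop and as the refinement over a whole consensus set.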

VI. EXPERIMENTS AND SYSTEM EVALUATION
To test the performance of the system, this section reports experiments in two scenes: highway and city.

A. EXPERIMENTAL DESIGN
We evaluated the performance of the algorithm from three aspects: robustness, accuracy and efficiency. For robustness, the proportion of detected inliers was taken as the main indicator, and other data produced by the robust estimation algorithm were also analyzed. The average rotation error (ARE) and average translation error (ATE) [29] were used as the accuracy indicators. Many experiments have shown that ATE reflects the accuracy of the trajectory better than ARE, so we used ATE as the main indicator and ARE as the auxiliary indicator. The average single-frame calculation time was used as the efficiency indicator. To evaluate the impact of using the normalized re-projection error as the outlier criterion on accuracy, the threshold $T_{rat}$ was varied and the change in ATE was observed. The computer used in the experiments was equipped with a quad-core Intel Core i5 processor running at 2.6 GHz, and the development environment was Ubuntu 14.04 with KDevelop 4.
When evaluating the robustness of the algorithm, the LOPSAC algorithm (LOP) was compared with the following three algorithms:
1) VISO2-S: The VISO2-S system was selected to evaluate the quality of the correspondences output by the feature module of the proposed method. The RANSAC sampling number was set to a fixed value of 200.
2) RAN: the RANSAC version of the proposed method. The LOPSAC algorithm in this system is replaced with the RANSAC algorithm to evaluate the performance of LOPSAC.
3) RANS-20000: the RANSAC-20000 version of the proposed method. The LOPSAC algorithm in this system is replaced with the RANSAC-20000 algorithm. As mentioned above, RANSAC-20000 refers to RANSAC with a fixed 20000 samples. Since this sampling number is much greater than the value determined by (3), the result was taken as the true inliers output by the feature module.
In order to evaluate the accuracy and efficiency of the proposed method (PM), the following three open source VO or SLAM systems were selected for comparison: VISO2-S, S-PTAM and ORB-SLAM2. Although the accuracy of VISO2-S has lagged behind some advanced VO systems, its computing speed is still in the leading position due to the use of the SSE instruction set. S-PTAM and ORB-SLAM2 are stereo SLAM systems with high precision. ORB-SLAM2 has a closed-loop detection function, but S-PTAM does not.
The KITTI 01 and 07 sequences were selected as the experimental dataset. These two sequences represent highway scenes and urban scenes, respectively. Figure 4 shows example images from the two sequences. The 01 sequence contains 1101 images, with a driving distance of 2452 m and an average driving speed of 80 km/h; the 07 sequence contains 1101 images, with a driving distance of 695 m and an average driving speed of 23 km/h. The camera parameters of the two sequences are shown in Table 1 (rounded to 4 decimal places).
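The quoted average speeds can be cross-checked from the frame counts and distances. The short Python sketch below assumes the 10 Hz frame rate of the KITTI odometry sequences, which is not stated in this section; the function name is hypothetical.

```python
def average_speed_kmh(distance_m, n_frames, fps=10.0):
    """Average speed in km/h, given the distance covered over n_frames images."""
    duration_s = (n_frames - 1) / fps  # n_frames images span n_frames - 1 intervals
    return distance_m / duration_s * 3.6

speed_01 = average_speed_kmh(2452.0, 1101)  # highway sequence
speed_07 = average_speed_kmh(695.0, 1101)   # urban sequence
```

Rounded to the nearest integer, the two values agree with the 80 km/h and 23 km/h stated above.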

B. ROBUSTNESS EVALUATION
The robustness of the system depends on the quality of the correspondences output by the feature module and on the robust estimation algorithm. The proportion of inliers used in each motion estimation (α), the sampling number of a single frame (k), the number of verifications per sampling (vps) and the motion estimation time of a single frame (time) are shown in Table 2. These indicators are averages over all frames. Because this paper uses the motion decoupling algorithm, the sampling number and the motion estimation time of a single frame are the sums of the sampling numbers and the calculation times of the two motion estimations (rotation and translation), respectively.
It can be seen from Table 2 that the proportions of inliers obtained by the RANSAC-20000 algorithm are 77.32% and 91.85% in the two sequences, respectively, which are close to the true values. The proportions of inliers obtained by the latter three algorithms are much larger than that of VISO2-S, indicating that the correspondences output by the feature module in this paper are of higher quality. The proportion of inliers obtained by the LOPSAC algorithm is lower only than that of the RANSAC-20000 algorithm and higher than those of the other two algorithms. Figure 5 shows the per-frame inlier ratio curves of the four algorithms. It can be seen from the curves that the inlier ratio obtained by the LOPSAC algorithm is, in most frames, higher than that of RANSAC and slightly lower than the approximate true inlier ratio obtained by RANSAC-20000. The inlier ratio obtained by each algorithm in the 07 sequence is higher than in the 01 sequence, indicating that matching is easier in the 07 sequence. This is mainly due to the faster driving speed in the 01 sequence. Figure 6 shows the inliers obtained by the LOPSAC algorithm in different scenes of the two sequences. The vehicle is driving straight ahead in Figure 6(a) and turning right in Figure 6(b); it is driving straight ahead in Figure 6(c) and turning left in Figure 6(d). The matched positions of the feature points in two consecutive frames are connected by straight lines. The red correspondences represent the inliers detected by the algorithm, and the green correspondences represent the outliers. It can be seen that all the feature points on the moving targets (cyclists and moving vehicles) in the four figures have been excluded as outliers, and the remaining inliers consistently reflect the motion state of the vehicle.
As shown in Table 2, RANSAC-20000 sets the sampling number for each motion estimation to 20000, so the total sampling number for rotation and translation estimation is 40000. The VISO2-S system also uses a fixed RANSAC sampling number, here its default value of 200. The sampling number of RANSAC is only 35 (12) in the 01 (07) sequence, far less than the 200 of the VISO2-S system. There are two main reasons. First, the sampling number of RANSAC is not fixed here; the required sampling number is updated after each sampling from the current inlier ratio using (3), as described earlier. Second, the inlier ratio output by the feature module in this paper is higher than that of VISO2-S. The sampling number of LOPSAC is only 10 (5), much less than that of RANSAC, because LOPSAC uses progressive sampling and local optimization to find a sufficient number of inliers early. LOPSAC also uses reverse-order model verification, so its number of verifications per sampling is much less than the total number of data points. The other three algorithms verify every model completely, so their number of verifications per sampling equals the total number of data points.
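The adaptive update of the sampling number can be sketched in Python as follows. Equation (3) is not reproduced in this section, so the sketch assumes it has the standard RANSAC form $k = \log(1-p) / \log(1-w^{s})$, where $w$ is the current inlier ratio, $s$ the minimal sample size and $p$ the desired confidence; the function name, the default confidence of 0.99 and the cap of 20000 are illustrative assumptions.

```python
import math

def required_samples(inlier_ratio, sample_size, confidence=0.99, max_samples=20000):
    """Number of random samples needed to draw at least one all-inlier sample
    with the given confidence (standard RANSAC termination rule, assumed here)."""
    p_good = inlier_ratio ** sample_size  # probability that one sample is all inliers
    if p_good <= 0.0:
        return max_samples
    if p_good >= 1.0:
        return 1
    # log1p keeps the denominator accurate and nonzero when p_good is tiny.
    k = math.log(1.0 - confidence) / math.log1p(-p_good)
    return min(max_samples, max(1, math.ceil(k)))
```

Under this rule a higher inlier ratio lowers the required sampling number, and a 1-point solver needs fewer samples than a 5-point solver at the same inlier ratio, which is consistent with the trend reported in Table 2.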
The calculation time is an important indicator. It is mainly determined by the sampling number of a single frame (k), the number of verifications per sampling (vps), and the calculation time of the motion parameters. It can be seen from Table 2 that the calculation time of LOPSAC is much less than that of RANSAC and VISO2-S, indicating that the motion estimation of the proposed algorithm is very fast.

C. ACCURACY EVALUATION
Table 3 lists the accuracy indicators of the four systems VISO2-S, S-PTAM, ORB-SLAM2 and PM on the KITTI 01 and 07 sequences. In the 01 sequence, the ARE obtained by the proposed method is only slightly larger than that of ORB-SLAM2, while its ATE is 20% to 30% smaller than those of the two SLAM systems, S-PTAM and ORB-SLAM2. Although the proposed system uses only two-frame estimation, it still achieves high estimation accuracy, whereas the accuracy of the other two-frame system, VISO2-S, is far lower than that of the other three methods. In the 07 sequence, the accuracy of the proposed method is lower only than that of ORB-SLAM2 and higher than those of S-PTAM and VISO2-S. Since the 07 sequence contains a closed loop, the closed-loop detection function of ORB-SLAM2 improves its accuracy. In addition, we calculate the accuracy indicators of the four algorithms on all 11 sequences of the KITTI training set. The proposed method obtains the minimum ATE and the minimum ARE on 5 sequences each, and its average accuracy is also the highest. The ARE and ATE of the system are 0.21 deg/100 m and 0.70%, respectively. Figure 7 compares the vehicle trajectories obtained by the four methods with the real trajectory. As shown in Figure 7(a), the trajectory of PM is closest to the real trajectory, followed by those of ORB-SLAM2 and S-PTAM. The trajectory of VISO2-S deviates far from the real trajectory, which is consistent with the ATE results. Figure 7(b) shows that the trajectories of ORB-SLAM2 and PM are closest to the real trajectory. Because the errors are very small in this sequence, it is difficult to tell which of the two is closer.
Figure 8 shows the $t_{rat}$ curve obtained from the ground truth. For ease of observation, the vertical axis range is set to 0 to 200. In the 01 and 07 sequences, the mean values of $t_{rat}$ are 79 and 50, respectively. To analyze the impact of using DNRE as the outlier criterion on accuracy, we observe how the ATE changes as the threshold $T_{rat}$ is varied. The results are shown in Figure 9, where $T_{rat}$ takes integer values from 0 to 200. Since the $t_{rat}$ values of most frames are less than 200, the ATE for $T_{rat} > 200$ is approximately equal to the ATE for $T_{rat} = 200$. When $T_{rat} = 0$, all frames use DNRE as the outlier criterion; when $T_{rat} = 200$, nearly all frames use RE.
As shown in Figure 9, both curves first decrease and then increase as $T_{rat}$ increases, and the minimum error occurs around $T_{rat} = 28$ (22) in the 01 (07) sequence. This shows that the accuracy obtained by using RE as the outlier criterion is higher when $T_{rat} \le 28$ (22), while the accuracy obtained by using DNRE is higher when $T_{rat} > 28$ (22). The maximum error occurs at $T_{rat} = 200$. In other words, whatever value the threshold $T_{rat}$ takes, the accuracy of the algorithm in this paper is always higher than or equal to that of the algorithm using RE alone as the outlier criterion. The value of $T_{rat}$ can be determined experimentally. In this paper, $T_{rat}$ is set to a fixed value of 30.
Table 4 lists the average single-frame calculation time of the four methods, broken down into the feature module, the motion estimation module and the total calculation time; for S-PTAM and ORB-SLAM2, only the time of the tracking thread is counted. The calculation time of the feature module of the proposed method is longer only than that of S-PTAM and shorter than that of the other two methods. The feature module of ORB-SLAM2 takes the most time, mainly because of its long matching time and the need to complete a local map tracking step. The motion estimation module of the proposed method takes much less time than those of the other three methods. In terms of total calculation time, the proposed method is the most efficient of the four, about 26-30% more efficient than VISO2-S.

VII. CONCLUSION
This paper presents a novel stereo-based VO system for the navigation of autonomous vehicles. The system shows good performance in robustness, accuracy and efficiency. We design a VO feature module with a fast running speed and a high proportion of inliers. A refined feature screening strategy selects feature points by considering their spatial distribution and temporal evolution. This paper proposes a new robust estimation algorithm, the LOPSAC algorithm. Compared with the RANSAC algorithm, it saves considerable time and extracts more inliers. To improve the accuracy of the proposed method, we also propose a motion estimation algorithm based on decoupling. The experiments were conducted on the public KITTI dataset. They show that our system achieves slightly lower accuracy than the ORB-SLAM2 system in the sequence with a closed loop, and higher accuracy than the ORB-SLAM2 system in the sequence without a closed loop. In both sequences, our system runs much faster than the ORB-SLAM2 system and outperforms the VISO2-S and S-PTAM systems in both accuracy and running speed.
This paper studies the two-frame VO system, which relies on accurate inter-frame motion estimation to reduce trajectory drift, so it is difficult to obtain a consistent estimate of the trajectory over long distances. Therefore, our future work is to introduce a multi-frame optimization method.
WENYAN CI received the Ph.D. degree from the University of Shanghai for Science and Technology, in 2019. He is currently an Assistant Professor with Huzhou University, Zhejiang, China. His research interests include computer vision, intelligent vehicles, and process control performance monitoring.
TIANXIANG XU is currently pursuing the master's degree with Huzhou University, Zhejiang, China. His research interests include intelligent vehicles and computer vision.
TIE XU is currently pursuing the master's degree with Huzhou University, Zhejiang, China. His research interests include intelligent vehicles and computer vision.
XIALAI WU received the B.S. degree in automation from Zhejiang Sci-Tech University, China, and the Ph.D. degree from Zhejiang University, China, in 2019. He is currently a Lecturer with Huzhou University. His current research interests include intelligent vehicles and integrated design and control.
SHAN LU received the B.S. and Ph.D. degrees from Zhejiang University, China, in 2011 and 2016, respectively. He is currently an Assistant Professor and the Deputy Director of the Institute of Intelligence Science and Engineering, Shenzhen Polytechnic. His research interests include data-driven optimization, cyber system control, and signal processing. He has undertaken several national and provincial funds and projects in the field of smart manufacturing, process control system, and acts as the Principal Investigator of Innovation Team of Guangdong Province, China.