Efficient Ego-Motion Estimation for Multi-Camera Systems With Decoupled Rotation and Translation

In this article, we present novel solutions to estimate the ego-motion of a multi-camera system with a known vertical direction (e.g., from an inertial measurement unit). By assuming small camera motion between successive video frames, we demonstrate that rotation and translation estimation can be decoupled. As a result, our methods require fewer correspondences to estimate the ego-motion while retaining good accuracy. Accordingly, we estimate the ego-motion in two steps. First, we propose a 1-point method that estimates rotation from a single correspondence and produces up to two solutions. Then, we adopt a 3-point linear method and a 2-point sampling method to solve for translation, each of which produces a single solution. We compared our algorithms with state-of-the-art algorithms on synthetic and real datasets. The experiments demonstrate that our algorithms are accurate and efficient in road driving scenarios. We also demonstrate that our proposed methods can efficiently find an optimal inlier set using histogram voting or exhaustive search instead of RANSAC.


I. INTRODUCTION
The relative pose estimation problem is classical and fundamental in computer vision applications, such as robotics, the automotive industry, augmented reality, and visual simultaneous localization and mapping. The problem refers to computing the pose of the current frame with respect to the coordinate system of the previous frame [1]. Different camera configurations are used to solve the problem, such as monocular, stereo, and multi-camera systems. Monocular systems have attracted wide attention from researchers, and a large number of algorithms have been proposed in prior work. The classical and basic solvers are the normalized 8-point algorithm [2] and the 5-point minimal algorithm [3].
However, recently, the multi-camera system has been extensively used for many emerging applications, such as autonomous driving with drones and vehicles, because it covers a potentially large field-of-view [4], [5]. The larger the field-of-view (FoV), the more information we obtain around the environment. This allows us to detect and track objects robustly, particularly in environments with little texture. Our work focuses on the relative pose estimation for a multi-camera system. A multi-camera system can be modeled as a generalized camera, which was proposed by [6], [7]. If the light rays passing through the three-dimensional (3D) world points and image points intersect at a single center of projection, the camera system is modeled as a central perspective projection model; otherwise, the camera system is modeled as a generalized camera model; that is, the difference between the central perspective projection model and generalized camera model is that the latter does not have a single center of projection, as shown in Fig. 2. The ego-motion of the multi-camera system can be obtained linearly using 17 points [6]-[8] and minimally using six points [9], [10].

The associate editor coordinating the review of this manuscript and approving it for publication was Ramakrishnan Srinivasan.

Figure caption: Example of a multi-camera system configuration mounted on a car. Given the known vertical direction, we estimate the relative rotation using the far features, and estimate the relative translation using the near features.
To deal with outlier matches, ego-motion estimation algorithms are applied within a robust framework, such as random sample consensus (RANSAC) [11]. Reducing the number of points required to estimate a motion model is an effective way to lower the computational cost and improve the robustness of RANSAC [12]. Thus, it is necessary and important to study minimal solvers for ego-motion estimation. Researchers use additional information to reduce the number of points required. One approach is to use motion constraints to simplify the problem, for example, planar motion [13], [14] or Ackermann steering motion [15], [16] in road driving scenarios. Another approach is to obtain additional information from other sensors, for example, the inertial measurement unit (IMU). As IMUs have become cheaper and more prevalent, they are increasingly often mounted on multi-camera systems. Because the yaw angle from the IMU is less accurate than the roll and pitch angles, we use only the roll and pitch angles to determine the vertical direction, which reduces the degrees of freedom (DOFs) of the relative pose by two. This makes the ego-motion estimation process simpler and faster. This idea has been applied to monocular camera systems [13], [17]-[19] and multi-camera systems [20]-[22].
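The effect of the sample size on RANSAC can be made concrete with the standard iteration bound N = log(1 - p) / log(1 - w^s), where s is the number of correspondences per sample, w is the inlier ratio, and p is the desired confidence. The following Python sketch evaluates this bound for the sample sizes of the solvers discussed in this article; the inlier ratio of 0.5 and the confidence of 0.99 are illustrative assumptions, not values taken from the experiments.

```python
import math

def ransac_iterations(sample_size, inlier_ratio, confidence=0.99):
    """Draws needed so that, with the given confidence, at least one
    sample contains only inliers (standard RANSAC bound)."""
    p_all_inliers = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))

# Sample sizes of the 1-point, 4-point, 6-point and 17-point solvers.
for s in (1, 4, 6, 17):
    print(s, ransac_iterations(s, inlier_ratio=0.5))
```

Under these assumed values the bound grows roughly exponentially with the sample size, which is why a 1-point rotation solver is attractive inside a robust estimation loop.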
Our work aims at solving the ego-motion estimation problem for a multi-camera system when the vertical direction in the multi-camera coordinate frame is provided by the IMU [13], [20]. Knowledge of the vertical direction reduces the DOFs of the relative pose by two, leaving the translation and the yaw angle as the remaining unknowns. Additionally, we use the fact that for points that are far away, the parallax shift (induced by translation) between two views is hardly noticeable [13]. Hence, we classify all points into two sets: ''far points,'' which are far away, and ''near points,'' which are nearby in the scene. For the far points, the translation between consecutive frames is negligible when the ego-motion is small. Thus, we can decouple rotation and translation estimation if there are some far points in the scene. Accordingly, we estimate the ego-motion in two steps. First, we propose a 1-point algorithm to estimate the rotation, which allows us to solve for the rotation of the multi-camera rig with a minimal set of one correspondence. To obtain accurate and robust results, the 1-point algorithm is embedded into either histogram voting or a RANSAC loop. Then, we propose two methods to estimate translation: a linear method using three near points and a sampling method using two near points.
The main contributions of this article are as follows:
• Our work decouples rotation and translation estimation for multi-camera systems. We estimate rotation using the far points, and then estimate translation using the near points.
• We propose a 1-point method to estimate rotation for multi-camera systems on the condition of knowing the vertical direction. The method requires only a single point and produces up to two candidate solutions, thus improving the efficiency of our method in RANSAC.
• We propose two methods to estimate translation for multi-camera systems, a 3-point linear method and a 2-point sampling method, both of which achieve high accuracy.
The remainder of this article is organized as follows. In Section II, we present an overview of related work. We briefly establish notation and introduce the generalized epipolar constraint (GEC) in Section III. In Section IV, we describe our methods in detail. We conduct experiments on synthetic and real datasets in Section V, where our methods are compared with state-of-the-art methods.

II. RELATED WORK
Pless et al. [6] formulated the GEC, which allows a network of cameras to be treated as a single camera. The linear solution of ego-motion for generalized cameras requires 17 corresponding image rays because there are 18 unknowns in the constraint. Similarly, Sturm et al. [7] provided epipolar geometry for generalized cameras and suggested 17 correspondences to solve the relative motion linearly. The above methods have their merits, but do not work on real data. Li et al. [8] found that the 17-point approach is not applicable to certain special generalized camera configurations; hence, they extended the 17-point approach and proposed linear 16-point and 14-point approaches for certain special configurations, such as multi-camera systems in which the camera centers are aligned.

VOLUME 8, 2020
A minimal solution was first proposed by Stewénius et al. [9]. The algorithm requires only six correspondence pairs to determine relative motion using the Gröbner basis technique and provides up to 64 solutions. However, it is unsuitable for a real-time system because of the high computational complexity of the solver. Kneip et al. proposed a nonlinear optimization algorithm over the relative rotation only, based on an efficient eigenvalue minimization strategy [23], [24]. A single solution is sought using seven or more correspondences, but the optimization is susceptible to converging to a local optimum. Ventura et al. [10] proposed a solver that uses a first-order approximation of the relative pose. The approximate motion model is appropriate under the assumption of small motion between two images, so it is applied to continuous motion. Although the model simplifies the relative pose problem, the method yields up to 20 solutions, so it is unsuitable for inclusion in a RANSAC scheme. Moreover, solving a 20th-degree polynomial makes the method sensitive to noise.
To reduce the DOFs of relative motion, researchers have exploited extra information and/or assumptions from motion models. Consequently, fewer correspondences are required to solve the problem. By constraining the camera motion to the Ackermann motion model, Lee et al. [15] proposed a 2-point method that yields up to six candidate solutions. However, the model applies only when the car moves on a plane, which is a very strict assumption in practice. For practical applications, Lee et al. [20] used the information of the vertical direction to reduce the DOFs of relative motion by two and then proposed minimal 4-point and linear 8-point algorithms. The minimal 4-point algorithm provides up to eight possible solutions via the hidden variable resultant method, and the linear 8-point algorithm yields a single solution using the standard SVD method. Sweeney et al. [21] formulated relative motion problems with a known axis of rotation as quadratic eigenvalue problems. Similar to the algorithm of Lee et al. [20], it leads to an eighth-degree polynomial. Unlike the algorithm of Lee et al. [20], however, it yields up to six solutions using four correspondences. Liu et al. [22] used a first-order approximation motion model and an IMU sensor to determine the unknown yaw angle from the roots of a fourth-degree polynomial.

III. GENERALIZED EPIPOLAR CONSTRAINT
In this section, we briefly introduce the generalized epipolar constraint. A multi-camera rig can be described as a generalized camera that captures a set of light rays [7], [8]. We use the Plücker vector to express a light ray. The Plücker vector is composed of a pair of 3-vectors, $u$ and $q$, which are the direction vector and moment vector, respectively. We choose a reference frame $V$ arbitrarily; the extrinsic matrix of the $i$th camera $C_i$ in $V$ is denoted by $[R_{C_i}, T_{C_i}]$ and its intrinsic matrix by $K_{C_i}$. The Plücker coordinate of the light ray from the optical center of camera $C_i$ to the normalized image point $\hat{x}_{ij} = K_{C_i}^{-1} x_{ij}$ is given by

$$L_{ij} = \begin{bmatrix} u_{ij} \\ T_{C_i} \times u_{ij} \end{bmatrix}, \tag{1}$$

where $u_{ij} = R_{C_i}\hat{x}_{ij}$ is the unit direction of the light ray in the reference coordinate system. The transformation from frame $k$ to frame $k+1$ is denoted by the rotation matrix $R$ and translation vector $t$. Suppose $L_{ij,k} \leftrightarrow L_{ij,k+1}$ is a pair of Plücker line correspondences in two views. Then the Plücker coordinate of $L_{ij,k}$ in frame $k+1$ is expressed as

$$L_{ij,k}^{k+1} = \begin{bmatrix} R & 0 \\ [t]_\times R & R \end{bmatrix} L_{ij,k}, \tag{2}$$

where $[t]_\times$ is the skew-symmetric matrix formed from the translation vector $t$. $L_{ij,k}$ and $L_{ij,k+1}$ intersect in space if and only if

$$L_{ij,k+1}^{T} \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix} L_{ij,k}^{k+1} = 0. \tag{3}$$

As a result, the generalized epipolar constraint (GEC) can be written as

$$L_{ij,k+1}^{T} \begin{bmatrix} [t]_\times R & R \\ R & 0 \end{bmatrix} L_{ij,k} = 0. \tag{4}$$
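The constraint can be checked numerically on a noise-free correspondence. The sketch below is only an illustration under stated assumptions: it takes the frame transformation to be X' = R X + t, builds each Plücker line as (u, c × u) from a camera center c expressed in the rig frame, and uses made-up values for the rig, the point, and the motion. The helper names (`plucker_line`, `gec_residual`) are hypothetical and not part of any library.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def plucker_line(center, point):
    """Plücker line (u, m): unit direction u and moment m = center x u."""
    d = [p - c for p, c in zip(point, center)]
    norm = math.sqrt(dot(d, d))
    u = [x / norm for x in d]
    return u, cross(center, u)

def gec_residual(line_k, line_k1, R, t):
    """Evaluate L_{k+1}^T [[t]x R, R; R, 0] L_k for one correspondence."""
    u, m = line_k
    u1, m1 = line_k1
    Ru, Rm = matvec(R, u), matvec(R, m)
    upper = [a + b for a, b in zip(cross(t, Ru), Rm)]  # [t]x R u + R m
    return dot(u1, upper) + dot(m1, Ru)

# Hypothetical rig and motion (illustrative values only).
camera_center = [0.25, 0.0, 0.0]   # camera position in the rig frame
R = rot_z(0.02)                    # small rotation between the two frames
t = [0.1, 0.3, 0.0]                # small translation between the two frames
X_k = [3.0, 8.0, 1.0]              # 3D point expressed in frame k
X_k1 = [a + b for a, b in zip(matvec(R, X_k), t)]  # same point in frame k+1

line_k = plucker_line(camera_center, X_k)
line_k1 = plucker_line(camera_center, X_k1)

residual_true = gec_residual(line_k, line_k1, R, t)
residual_wrong = gec_residual(line_k, line_k1, R, [0.5, 0.3, 0.0])
print(residual_true, residual_wrong)
```

With the simulated motion the residual vanishes up to rounding, whereas a deliberately wrong translation leaves a clearly non-zero residual.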

IV. METHODS
We solve the ego-motion estimation problem of a multi-camera system in three steps. First, we use the roll and pitch angles from the IMU to transform the Plücker line correspondences, thereby aligning the vertical direction of the multi-camera system. This step reduces the DOFs of rotation from three to one, so that a single rotation unknown remains to be solved. Second, because the frame rate of a multi-camera system in road vehicle applications is at least 10 Hz, a car cannot travel far within a time interval of 0.1 s or less. It is therefore reasonable to assume that the ego-motion between two successive frames is small [10], [22]. In practice, in the case of small ego-motion, we find that the change in the image coordinates of a near point contains both rotation and translation information, whereas the change in the image coordinates of a far point contains only rotation information. Accordingly, we propose a 1-point method to estimate the rotation. Finally, given the estimated rotation, we estimate translation with either a 3-point linear method or a 2-point sampling method.

A. ALIGN THE VERTICAL DIRECTION
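The vertical direction refers to the direction of gravity, that is, the ''up'' direction of the multi-camera system. Knowledge of the vertical direction can be provided by vanishing points [25] or by IMU measurements. In this study, because the yaw angle from the IMU is less accurate than the roll and pitch angles, we use the roll and pitch angles from the IMU to align the vertical direction. We can obtain the pitch angle (rotation around the X-axis), roll angle (rotation around the Y-axis), and yaw angle (rotation around the Z-axis) from the IMU with respect to the reference frame $V$, where the XY plane is parallel to the ground and the Z-axis points down. The rotation matrices built from the yaw, pitch, and roll angles of the two consecutive generalized camera frames are denoted by $R_y, R_p, R_r \leftrightarrow R'_y, R'_p, R'_r$. Hence, the relative rotation matrix $R$ is written as

$$R = (R'_y R'_p R'_r)^T (R_y R_p R_r) = (R'_p R'_r)^T \, R'^{T}_y R_y \, (R_p R_r). \tag{5}$$

Here $R'^{T}_y R_y$ is the relative yaw rotation matrix, so from now on we denote it simply by $R_y$. As observed from (5), only a single unknown $R_y$ in the relative rotation remains to be solved. We substitute (5) into the GEC in (4) and drop the Plücker line and camera indices $ij$ for brevity, writing $L \leftrightarrow L'$ for a correspondence between frames $k$ and $k+1$:

$$L'^{T} \begin{bmatrix} [t]_\times (R'_p R'_r)^T R_y R_p R_r & (R'_p R'_r)^T R_y R_p R_r \\ (R'_p R'_r)^T R_y R_p R_r & 0 \end{bmatrix} L = 0. \tag{6}$$

To simplify (6), we first factor out $R_p R_r$ to obtain

$$L'^{T} \begin{bmatrix} [t]_\times (R'_p R'_r)^T R_y & (R'_p R'_r)^T R_y \\ (R'_p R'_r)^T R_y & 0 \end{bmatrix} \begin{bmatrix} R_p R_r & 0 \\ 0 & R_p R_r \end{bmatrix} L = 0. \tag{7}$$

Because $R'_p R'_r$ is a rotation matrix, we have

$$[t]_\times (R'_p R'_r)^T = (R'_p R'_r)^T [\tilde{t}]_\times, \qquad \tilde{t} = R'_p R'_r \, t. \tag{8}$$

Then, factoring out $(R'_p R'_r)^T$ gives

$$L'^{T} \begin{bmatrix} (R'_p R'_r)^T & 0 \\ 0 & (R'_p R'_r)^T \end{bmatrix} \begin{bmatrix} [\tilde{t}]_\times R_y & R_y \\ R_y & 0 \end{bmatrix} \begin{bmatrix} R_p R_r & 0 \\ 0 & R_p R_r \end{bmatrix} L = 0. \tag{9}$$

From (9), we can obtain a simplified GEC:

$$\tilde{L}'^{T} \begin{bmatrix} [\tilde{t}]_\times R_y & R_y \\ R_y & 0 \end{bmatrix} \tilde{L} = 0, \tag{10}$$

where $\tilde{L} = \begin{bmatrix} R_p R_r & 0 \\ 0 & R_p R_r \end{bmatrix} L$ and $\tilde{L}' = \begin{bmatrix} R'_p R'_r & 0 \\ 0 & R'_p R'_r \end{bmatrix} L'$ are the vertically aligned Plücker lines. Thus, after aligning the vertical direction, we have two unknowns to solve: $\tilde{t}$ and $R_y$.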
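The alignment step can be illustrated with a short Python sketch. It assumes, as in the text, that pitch is a rotation about the X-axis and roll a rotation about the Y-axis, and it assumes the composition order R_p R_r for the alignment rotation; the IMU angles and the translation are made-up values. The sketch aligns a Plücker line, maps the translation into the aligned frame, checks the skew-matrix identity that such a change of variable relies on, and maps the translation back.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def skew(t):
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def align_line(Q, line):
    """Rotate both the direction and the moment of a Plücker line by Q."""
    u, m = line
    return matvec(Q, u), matvec(Q, m)

# Assumed IMU pitch and roll of frame k+1 (illustrative values only).
Q_k1 = matmul(rot_x(math.radians(1.5)), rot_y(math.radians(-0.8)))

t = [0.1, 0.4, 0.02]            # translation in the original frame
t_tilde = matvec(Q_k1, t)       # translation in the aligned frame

# Identity behind the change of variable: [t]x Q^T = Q^T [Q t]x.
lhs = matmul(skew(t), transpose(Q_k1))
rhs = matmul(transpose(Q_k1), skew(t_tilde))
max_diff = max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3))

# Once t_tilde has been estimated, the original translation is recovered.
t_back = matvec(transpose(Q_k1), t_tilde)

aligned = align_line(Q_k1, ([0.0, 0.6, 0.8], [0.2, 0.0, 0.0]))
print(max_diff, t_back, aligned)
```

The same `align_line` call would be applied with the pitch and roll of frame k to the lines observed in frame k.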

B. ROTATION ESTIMATION METHODS
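We assume that the change in the image coordinates of a far point is only affected by the relative rotation, that is, the relative translation is treated as zero when we only use the far points [13]. Therefore, if the vertically aligned Plücker lines $\tilde{L} \leftrightarrow \tilde{L}'$ are both formed by far points, a new GEC can be obtained:

$$\tilde{L}'^{T} \begin{bmatrix} 0 & R_y \\ R_y & 0 \end{bmatrix} \tilde{L} = 0. \tag{11}$$

We rewrite $R_y$ by applying the tangent half-angle substitution $q = \tan(\alpha/2)$, which gives $\cos\alpha = \frac{1-q^2}{1+q^2}$ and $\sin\alpha = \frac{2q}{1+q^2}$, where $\alpha$ is the relative yaw angle that makes up $R_y$:

$$R_y = \frac{1}{1+q^2} \begin{bmatrix} 1-q^2 & -2q & 0 \\ 2q & 1-q^2 & 0 \\ 0 & 0 & 1+q^2 \end{bmatrix}. \tag{12}$$

We substitute the yaw rotation matrix $R_y$ into the generalized epipolar constraint in (11) and multiply by $1+q^2$ to obtain

$$A q^2 + B q + C = 0, \tag{13}$$

where the coefficients $A$, $B$, and $C$ are formed by the elements of the Plücker line correspondence $\tilde{L}(\tilde{l}_1, \tilde{l}_2, \tilde{l}_3, \tilde{l}_4, \tilde{l}_5, \tilde{l}_6) \leftrightarrow \tilde{L}'(\tilde{l}'_1, \tilde{l}'_2, \tilde{l}'_3, \tilde{l}'_4, \tilde{l}'_5, \tilde{l}'_6)$:

$$\begin{aligned} A &= -\tilde{l}'_1\tilde{l}_4 - \tilde{l}'_2\tilde{l}_5 + \tilde{l}'_3\tilde{l}_6 - \tilde{l}'_4\tilde{l}_1 - \tilde{l}'_5\tilde{l}_2 + \tilde{l}'_6\tilde{l}_3, \\ B &= 2\,(\tilde{l}'_2\tilde{l}_4 - \tilde{l}'_1\tilde{l}_5 + \tilde{l}'_5\tilde{l}_1 - \tilde{l}'_4\tilde{l}_2), \\ C &= \tilde{l}'_1\tilde{l}_4 + \tilde{l}'_2\tilde{l}_5 + \tilde{l}'_3\tilde{l}_6 + \tilde{l}'_4\tilde{l}_1 + \tilde{l}'_5\tilde{l}_2 + \tilde{l}'_6\tilde{l}_3. \end{aligned} \tag{14}$$

We obtain $q$ by solving (13). A single Plücker line correspondence provides up to two possible solutions for $q$. We use two methods to select the best solution in this article: 1-point RANSAC and histogram voting.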
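A minimal Python sketch of the 1-point rotation solver follows. It assumes a yaw rotation about the Z-axis with the usual counter-clockwise sign convention and the half-angle parameter q = tan(alpha / 2); the quadratic coefficients are obtained by expanding u'^T R_y m + m'^T R_y u = 0 for one aligned correspondence (u, m) and (u', m'). The function name `yaw_candidates` and the test geometry are hypothetical.

```python
import math

def yaw_candidates(line_k, line_k1, eps=1e-12):
    """Candidate yaw angles (radians) from one far-point Plücker correspondence."""
    (l1, l2, l3), (l4, l5, l6) = line_k      # direction and moment, frame k
    (p1, p2, p3), (p4, p5, p6) = line_k1     # direction and moment, frame k+1
    A = -p1 * l4 - p2 * l5 + p3 * l6 - p4 * l1 - p5 * l2 + p6 * l3
    B = 2.0 * (p2 * l4 - p1 * l5 + p5 * l1 - p4 * l2)
    C = p1 * l4 + p2 * l5 + p3 * l6 + p4 * l1 + p5 * l2 + p6 * l3
    if abs(A) < eps:                          # degenerate case: linear equation
        roots = [-C / B] if abs(B) > eps else []
    else:
        disc = B * B - 4.0 * A * C
        if disc < 0.0:
            return []
        s = math.sqrt(disc)
        roots = [(-B + s) / (2.0 * A), (-B - s) / (2.0 * A)]
    return [2.0 * math.atan(q) for q in roots]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plucker_line(center, point):
    d = [p - c for p, c in zip(point, center)]
    norm = math.sqrt(sum(x * x for x in d))
    u = [x / norm for x in d]
    return u, cross(center, u)

# Noise-free test: a far point seen by one camera under a pure yaw rotation.
true_yaw = 0.01
c, s = math.cos(true_yaw), math.sin(true_yaw)
center = [0.25, 0.0, 0.0]
X_k = [40.0, 100.0, 5.0]
X_k1 = [c * X_k[0] - s * X_k[1], s * X_k[0] + c * X_k[1], X_k[2]]

candidates = yaw_candidates(plucker_line(center, X_k), plucker_line(center, X_k1))
print(candidates)
```

In this noise-free demo both rays pass through the same camera center and no roll or pitch is applied, so C vanishes and one candidate is the trivial root q = 0; the other candidate recovers the simulated yaw. Selecting between candidates is left to the voting or RANSAC stage.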

1) 1-POINT RANSAC
RANSAC is the standard procedure for estimating a model in the presence of outlier matches. RANSAC randomly samples minimal data sets to generate model hypotheses. Then, the model hypotheses are tested on the whole data set to identify and remove outliers. Finally, only inliers are used to estimate the model. As illustrated in Fig. 3, if the estimated transformation is perfect, the difference between the reprojected Plücker line $L_{repr}$ and the measured Plücker line $L_{meas}$ is negligible. Hence, we compute the reprojection error as $1 - L_{meas}^{T} L_{repr}$ in this article. It should be noted that we consider a corresponding point pair as an inlier if the angle $\alpha$ between $L_{meas}$ and $L_{repr}$ is lower than the threshold $\alpha_{threshold} = \arctan(t/f)$, where $f$ is the focal length and $t$ is the threshold of the classical reprojection error in pixels [26].
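The inlier test described above can be sketched in Python as follows. The sketch assumes that the error is evaluated on the unit direction vectors of the measured and reprojected rays, so that it equals 1 - cos(angle); the pixel threshold of 1 and the focal length of 1,000 pixels are example values, and the function names are hypothetical.

```python
import math

def angular_threshold(pixel_threshold, focal_length):
    """Angle (radians) subtended by a pixel threshold at the given focal length."""
    return math.atan(pixel_threshold / focal_length)

def reprojection_error(u_meas, u_repr):
    """1 - cos(angle) between two unit direction vectors."""
    return 1.0 - sum(a * b for a, b in zip(u_meas, u_repr))

def is_inlier(u_meas, u_repr, pixel_threshold=1.0, focal_length=1000.0):
    limit = 1.0 - math.cos(angular_threshold(pixel_threshold, focal_length))
    return reprojection_error(u_meas, u_repr) < limit

def ray(angle):
    """Unit ray in the x-z plane at the given angle from the z-axis."""
    return [math.sin(angle), 0.0, math.cos(angle)]

reference = ray(0.0)
print(is_inlier(reference, ray(0.0005)))  # about half the threshold angle
print(is_inlier(reference, ray(0.0020)))  # about twice the threshold angle
```

Comparing 1 - cos(angle) against 1 - cos(threshold) avoids an explicit arccos for every correspondence.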

2) HISTOGRAM VOTING
Each possible solution for q is computed from a single Plücker line correspondence; hence, a straightforward approach that requires no iteration is the histogram voting method [16]. The method is more efficient than RANSAC because it avoids computing the inliers and outliers for each possible solution. Each Plücker line correspondence is used to compute a hypothesis of q. Under our assumption that the ego-motion is small, we have q ∈ (−1, 1). We then accumulate these hypotheses of q into a histogram with discrete bins (e.g., a bin size of 0.01). The center of the bin that contains the largest number of hypotheses is taken as the best solution for q.
Finally, we substitute q into (12) to obtain the relative yaw angle matrix. Therefore, we can obtain the relative rotation according to (5). Fig. 4 shows an example histogram generated using real data.
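The voting step can be sketched in a few lines of Python. The bin size of 0.01 and the range (−1, 1) follow the text; the hypotheses in the demo are synthetic values chosen for illustration, and `histogram_vote` is a hypothetical helper name.

```python
import math

def histogram_vote(q_hypotheses, bin_size=0.01, q_min=-1.0, q_max=1.0):
    """Return the center of the most populated bin of q hypotheses."""
    n_bins = int(round((q_max - q_min) / bin_size))
    counts = [0] * n_bins
    for q in q_hypotheses:
        if q_min < q < q_max:  # hypotheses outside the range are ignored
            index = min(int((q - q_min) / bin_size), n_bins - 1)
            counts[index] += 1
    best = max(range(n_bins), key=counts.__getitem__)
    return q_min + (best + 0.5) * bin_size

# Synthetic hypotheses: a cluster around q = 0.0125 plus scattered outliers.
inlier_q = [0.0125 + 0.00005 * k for k in range(-20, 21)]
outlier_q = [-0.9 + 0.07 * k for k in range(25)]

best_q = histogram_vote(inlier_q + outlier_q)
yaw_deg = math.degrees(2.0 * math.atan(best_q))
print(best_q, yaw_deg)
```

The resolution of the result is limited by the bin size, so a finer bin or a refinement over the hypotheses in the winning bin may be preferable when higher accuracy is needed.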

C. TRANSLATION ESTIMATION METHODS
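After estimating rotation, it is easy to perform translation estimation using near points. According to the generalized epipolar constraint in (10), after aligning the vertical direction, we factor out $R_y$ to rewrite (10) as

$$\tilde{L}'^{T} \begin{bmatrix} [\tilde{t}]_\times & I \\ I & 0 \end{bmatrix} \hat{L} = 0, \tag{15}$$

where $\hat{L} = \begin{bmatrix} R_y & 0 \\ 0 & R_y \end{bmatrix} \tilde{L}$ and $\tilde{L}$, $\tilde{L}'$ denote the vertically aligned Plücker lines in the two frames.

1) 3-POINT LINEAR METHOD
With $R_y$ known, (15) is linear in $\tilde{t} = (\tilde{t}_x, \tilde{t}_y, \tilde{t}_z)^T$. Each Plücker line correspondence gives one equation of the form

$$m_1 \tilde{t}_x + m_2 \tilde{t}_y + m_3 \tilde{t}_z = n, \tag{16}$$

where $(m_1, m_2, m_3)^T = \hat{u} \times \tilde{u}'$ and $n = -(\tilde{u}'^{T} \hat{q} + \tilde{q}'^{T} \hat{u})$, with $(\hat{u}, \hat{q})$ and $(\tilde{u}', \tilde{q}')$ denoting the direction and moment vectors of $\hat{L}$ and $\tilde{L}'$, respectively.
The constraints from three Plücker line correspondences can be stacked into a linear equation system. We solve the linear equation system to obtain $\tilde{t}$. Then the estimated translation $t$ is recovered using (8).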
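The linear translation step can be sketched as follows. The sketch assumes a known yaw rotation and already aligned Plücker lines, derives one row m · t = n per correspondence by expanding the constraint with the identity u'^T (t × û) = t^T (û × u'), and solves the 3×3 system with Cramer's rule. The two-camera rig, the points, and the motion are made-up values; the demo mixes correspondences from both cameras, because in this noise-free setup three correspondences from a single camera only constrain the translation up to scale.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def plucker_line(center, point):
    d = [p - c for p, c in zip(point, center)]
    norm = math.sqrt(dot(d, d))
    u = [x / norm for x in d]
    return u, cross(center, u)

def constraint_row(line_k, line_k1, R_y):
    """Row (m, n) of the linear system m . t = n for one near correspondence."""
    u, q = line_k
    u1, q1 = line_k1
    u_hat, q_hat = matvec(R_y, u), matvec(R_y, q)
    return cross(u_hat, u1), -(dot(u1, q_hat) + dot(q1, u_hat))

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def solve_translation(rows):
    """Solve the stacked 3x3 system with Cramer's rule."""
    M = [list(m) for m, _ in rows]
    n = [value for _, value in rows]
    d = det3(M)
    solution = []
    for j in range(3):
        Mj = [row[:] for row in M]
        for i in range(3):
            Mj[i][j] = n[i]
        solution.append(det3(Mj) / d)
    return solution

# Hypothetical two-camera rig, near points and motion (illustrative values).
centers = [[0.25, 0.0, 0.0], [-0.25, 0.0, 0.0]]
R_y = rot_z(0.02)
t_true = [0.05, 0.4, 0.02]
observations = [(0, [2.0, 6.0, 1.0]), (0, [-3.0, 8.0, -0.5]), (1, [1.0, 5.0, 2.0])]

rows = []
for cam, X_k in observations:
    X_k1 = [a + b for a, b in zip(matvec(R_y, X_k), t_true)]
    rows.append(constraint_row(plucker_line(centers[cam], X_k),
                               plucker_line(centers[cam], X_k1), R_y))

t_est = solve_translation(rows)
print(t_est)
```

With noisy data and more than three correspondences, a least-squares solve over all rows would be the natural replacement for Cramer's rule.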

2) 2-POINT METHOD USING DISCRETE SAMPLING FOR THE X-Y TRANSLATION DIRECTION
The 3-point linear method in the previous subsection requires three Plücker line correspondences. In this subsection, we adopt a sampling method to reduce the number of correspondences required. First, we choose an appropriate parameter and sample it discretely within a suitable bounded range. Then we search for the global optimum using an exhaustive search. Because the direction of the translation in the x-y plane is naturally bounded, we sample this direction in this article.
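The direction of the translation in the x-y plane can be described by an angle $\theta$, which is sampled in steps of 1° from 0° to 360°. With $\tilde{t}_x = r\cos\theta$ and $\tilde{t}_y = r\sin\theta$, where $r = \sqrt{\tilde{t}_x^2 + \tilde{t}_y^2}$, we can rewrite (16) as

$$(m_1\cos\theta + m_2\sin\theta)\, r + m_3 \tilde{t}_z = n. \tag{17}$$

Equation (17) has two unknowns, $r$ and $\tilde{t}_z$, which leads to a 2-point method for estimating translation. The constraints from two correspondences can be stacked into an equation system:

$$\begin{cases} (m_1\cos\theta + m_2\sin\theta)\, r + m_3 \tilde{t}_z = n, \\ (m'_1\cos\theta + m'_2\sin\theta)\, r + m'_3 \tilde{t}_z = n'. \end{cases} \tag{18}$$

For a given angle $\theta$, let $a = m_1\cos\theta + m_2\sin\theta$ and $a' = m'_1\cos\theta + m'_2\sin\theta$. The two unknowns $r$ and $\tilde{t}_z$ can then be obtained as

$$r = \frac{n\, m'_3 - n'\, m_3}{a\, m'_3 - a'\, m_3}, \qquad \tilde{t}_z = \frac{a\, n' - a'\, n}{a\, m'_3 - a'\, m_3}. \tag{19}$$

From the values of $r$ and $\tilde{t}_z$, we can recover $\tilde{t}$. Then, we substitute $\tilde{t}$ into (8) to obtain the translation $t$.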
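The sampling procedure can be sketched in Python as follows. For each sampled direction the sketch solves the 2×2 system for r and the vertical component, and then scores the resulting translation on all available constraint rows. The scoring rule (sum of absolute algebraic residuals), the rejection of negative r, and the use of the first two rows as the minimal sample are assumptions made for this illustration; the rows in the demo are synthetic and exactly consistent with a chosen translation.

```python
import math

def solve_for_direction(row_a, row_b, theta, eps=1e-12):
    """Solve the two stacked constraints for (r, t_z) at a fixed direction theta."""
    (m_a, n_a), (m_b, n_b) = row_a, row_b
    c, s = math.cos(theta), math.sin(theta)
    a = m_a[0] * c + m_a[1] * s
    b = m_b[0] * c + m_b[1] * s
    det = a * m_b[2] - b * m_a[2]
    if abs(det) < eps:
        return None
    r = (n_a * m_b[2] - n_b * m_a[2]) / det
    t_z = (a * n_b - b * n_a) / det
    return r, t_z

def sample_translation(rows, step_deg=1):
    """Exhaustive search over the x-y direction; returns (score, degrees, t)."""
    best = None
    for deg in range(0, 360, step_deg):
        theta = math.radians(deg)
        solution = solve_for_direction(rows[0], rows[1], theta)
        if solution is None or solution[0] < 0.0:
            continue  # negative r duplicates the direction theta + 180 degrees
        r, t_z = solution
        t = [r * math.cos(theta), r * math.sin(theta), t_z]
        score = sum(abs(m[0] * t[0] + m[1] * t[1] + m[2] * t[2] - n)
                    for m, n in rows)
        if best is None or score < best[0]:
            best = (score, deg, t)
    return best

# Synthetic constraint rows consistent with a known translation (illustrative).
theta_true = math.radians(60)
t_true = [0.4 * math.cos(theta_true), 0.4 * math.sin(theta_true), 0.02]
m_vectors = [[0.3, -0.1, 0.9], [-0.2, 0.5, 0.4], [0.7, 0.2, -0.3],
             [0.1, -0.6, 0.5], [-0.4, 0.3, 0.8]]
rows = [(m, sum(a * b for a, b in zip(m, t_true))) for m in m_vectors]

best_score, best_deg, t_best = sample_translation(rows)
print(best_deg, t_best)
```

Because the search is exhaustive over a bounded one-dimensional range, its cost is fixed by the angular step rather than by the outlier ratio.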

V. EXPERIMENTS
We performed experiments on both synthetic and real scene data to validate the performance of the proposed methods.
The tests on the synthetic scene were used to demonstrate the accuracy and robustness of our methods with respect to pixel noise and IMU noise. The tests on the real scene were used to demonstrate the feasibility of our solvers in practical autonomous driving scenarios. We compared the root mean square errors of the rotation and translation direction with those of state-of-the-art solvers. The errors were defined as follows:

$$\varepsilon_R = \arccos\left(\frac{\operatorname{trace}(R_{gt} R_{est}^{T}) - 1}{2}\right), \qquad \varepsilon_t = \arccos\left(\frac{t_{gt}^{T} t_{est}}{\|t_{gt}\| \, \|t_{est}\|}\right),$$

where $R_{gt}$ denotes the ground truth rotation, $R_{est}$ denotes the corresponding estimated rotation, $t_{gt}$ denotes the ground truth translation, and $t_{est}$ denotes the corresponding estimated translation. The abbreviations of the solvers for comparison are as follows:
17pt-Li: linear solver of Li et al. to determine the relative pose problem of a multi-camera system with 17 correspondences [8].
8pt-Kneip: solver of Kneip et al. to determine the relative pose of a multi-camera system with an efficient eigenvalue minimization strategy [24].
6pt-Stewénius: minimal solver of Stewénius et al. to determine the relative pose of a multi-camera system with the Gröbner basis technique [9].
4pt-Liu: minimal solver of Liu et al. to determine the relative pose of a multi-camera system using a first-order approximation motion model [22].
4pt-Lee: minimal solver of Lee et al. to determine the relative pose of a multi-camera system using the hidden variable resultant method [20].
4pt-Our: our minimal solver to determine the relative pose of a multi-camera system using 1-point RANSAC rotation estimation method and 3-point linear translation estimation method.
Histogram voting: our solver to determine the relative rotation of a multi-camera system using the histogram voting approach.
Histogram voting+3pt: our solver to determine the relative translation of a multi-camera system using the 3-point linear method after solving the relative rotation using the histogram voting approach.
Histogram voting+2pt: our solver to determine the relative translation of a multi-camera system by sampling for the x-y translation direction after solving the relative rotation using the histogram voting approach.
All code was implemented in C++ and tested on a 2.81 GHz Intel Core i7 with 16 GB RAM. The implementations of 17pt-Li, 8pt-Kneip, and 6pt-Stewénius were provided by the OpenGV library [26]. For 4pt-Liu, we used the publicly available implementation from GitHub. We implemented the 4pt-Lee solver ourselves. We tested each solver on 10,000 randomly generated problems to compute the average computation times shown in Table 1.
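For reference, the rotation and translation direction errors used in the evaluation can be computed as in the following Python sketch. It assumes the usual angular definitions: the rotation error is the angle of the residual rotation R_gt R_est^T, and the translation error is the angle between the ground truth and estimated translation directions. The function names and test values are illustrative only.

```python
import math

def rotation_error_deg(R_gt, R_est):
    """Angle (degrees) of the residual rotation R_gt * R_est^T."""
    # trace(R_gt R_est^T) equals the sum of element-wise products.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error_deg(t_gt, t_est):
    """Angle (degrees) between two translation directions (scale is ignored)."""
    dot = sum(a * b for a, b in zip(t_gt, t_est))
    norm_gt = math.sqrt(sum(a * a for a in t_gt))
    norm_est = math.sqrt(sum(b * b for b in t_est))
    cos_angle = max(-1.0, min(1.0, dot / (norm_gt * norm_est)))
    return math.degrees(math.acos(cos_angle))

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

rot_err = rotation_error_deg(rot_z(0.02), rot_z(0.03))
trans_err = translation_error_deg([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
print(rot_err, trans_err)
```

Clamping the cosine to [-1, 1] guards against rounding errors that would otherwise make `acos` fail for nearly identical inputs.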

A. SYNTHETIC DATA EXPERIMENTS
In the simulations, the experimental setup was as follows. We generated two cameras with a baseline of 0.5 m, a focal length of 1,000 pixels, and non-overlapping FoVs. For each trial, we created two sets of 3D points: the far points were ran-  5,5] m. The two sets of 3D points each contained 100 points. We projected the 3D points onto the image planes of the multi-camera system to obtain the feature points. The motion between consecutive frames is small in autonomous driving; therefore, the relative rotation angle about each axis was set to a random angle in the range [−1°, 1°] and the translation was set randomly from 0 to 0.5 m in the simulations. These conditions were chosen to reflect realistic conditions. Each solver was used within a RANSAC scheme.

1) IMAGE NOISE EXPERIMENT
We conducted experiments to validate how the accuracy of the image point coordinates affected the relative pose estimated by our solvers. Gaussian noise with a standard deviation ranging from 0 to 2 pixels, at an interval of 0.2 pixels, was added to the image point coordinates, while the IMU data were kept perfect. We compared our solvers with five solvers (4pt-Liu [22], 4pt-Lee [20], 6pt-Stewénius [9], 8pt-Kneip [24], and 17pt-Li [8]). For each level of image noise, 1,000 random trials were generated with perfect IMU data, and we used the median error as the measure of performance to evaluate the estimated transformation. Fig. 5 shows the accuracy of rotation and translation computed using different solvers for three cases: forward motion, sideways motion, and random motion. As observed in Fig. 5(a) and (b), our two methods were close to each other and slightly outperformed the other methods for forward motion. However, in the case of sideways motion shown in Fig. 5(c), there was no obvious trend in the rotational error of our methods with gradually increasing image noise. The results show that image noise had less effect on the accuracy of rotation estimation than sideways motion. For random motion, it is interesting to see that our methods were slightly worse in the absence of image noise, while our methods worked better as the image noise level gradually increased. It seems that our assumption that far points are only influenced by rotation led to certain errors. We observe that the results of 8pt-Kneip [24] were poor. It seems that the method becomes numerically degenerate when the rotation matrix is close to identity.

2) IMU NOISE EXPERIMENT
We conducted experiments to validate how the accuracy of IMU affects the relative pose estimated by our solvers as our solvers rely on IMU measurements. Hence, we compared our solvers with 4pt-Liu [22] and 4pt-Lee [20], which work with the known vertical direction. Gaussian noise of standard deviation ranging from 0° to 1° was added to the IMU while assuming image noise with a standard deviation of 1 pixel. Figs. 6 and 7 show the median error of rotation and translation from 1,000 trials at each level of IMU noise using our methods compared with 4pt-Liu [22] and 4pt-Lee [20]. The figures demonstrate that our methods were close to each other in terms of the median error, and outperformed the other methods at all levels of IMU noise in terms of both rotation and translation estimation.

B. REAL-WORLD DATASET EXPERIMENTS
To determine the performance of our algorithms in a practical driving scene, we compared our methods with state-of-the-art methods on the KITTI autonomous driving benchmark dataset [27]. We performed experiments on the first 11 sequences (00-10) of the visual odometry benchmark dataset, which provide the ground truth. These sequences provide left and right images, and contain approximately 46,000 images. In our experiments, we extracted feature matches from each camera individually using the SURF algorithm and did not use any cross-camera matches. In the following sections, we describe three sets of experiments with the KITTI dataset. In the first experiment, we tested the effectiveness of our strategy of discarding a large part of the feature points before rotation estimation. In the second experiment, we tested the performance of our algorithms compared to state-of-the-art methods. In the third experiment, we tested the quality of the inlier detection in comparison to state-of-the-art methods.

1) SELECTION OF THE INLIER USED TO ESTIMATE ROTATION ACCORDING TO THE Y-COORDINATE
In this study, we only use the far points to estimate rotation; thus, we wish to discard the near points as early as possible. In practice, we observed that, after aligning the vertical direction, the y-coordinate of the image of a faraway point changes little between consecutive frames. As shown in Fig. 8, if the camera moves forward by a distance $d$, the change of the y-coordinate of the image point can be computed as

$$\Delta y = \frac{fY}{D-d} - \frac{fY}{D} = \frac{fYd}{D(D-d)}, \tag{23}$$

where $f$ is the focal length, $D$ is the distance from a 3D point $P$ to the camera, and $Y$ is the distance from the 3D point to the optical axis. As observed in (23), the change of the y-coordinate of the image point approaches 0 when the distance $D$ from the point $P$ to the camera is large enough. Consequently, for far points, the parallax shift (induced by translation) between two views is hardly noticeable. The yaw rotation matrix influences the change of the x-coordinate of the image point, and does not influence the change of the y-coordinate of the image point. This is precisely what allows us to decouple rotation and translation. We separate far from near points in two steps. First, with the knowledge of the vertical direction, we pre-rotate the image points so that the camera plane is perpendicular to the ground plane. Then, the points are partitioned into far points and near points according to whether the change of the y-coordinate in the image coordinate system is less than 1 pixel. This threshold was chosen on the basis of simulation experiments. Fig. 9 shows an example of the separation of far and near points using this criterion, where the green points denote far points and the red points denote near points. As observed in Fig. 9, the far points were well separated from the near points using the y-coordinate of the image points. Table 2 shows the number of far points in each sequence based on this simple criterion. NumberPoints refers to the average number of points extracted using SURF from the right and left images in each sequence.
NumberFar refers to the average number of far points chosen using the y-coordinate. Ratio = NumberFar/NumberPoints refers to the average percentage of far points. As shown in Table 2, the criterion rejects more than 50% of the points before rotation estimation. Removing such a large part of the feature points makes the rotation estimation more efficient.
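The far/near criterion can be illustrated with a small Python sketch. It assumes a pinhole model in which a point at lateral offset Y and depth D projects to y = f Y / D, so that a forward motion d changes the y-coordinate by f Y d / (D (D − d)); the focal length, offsets, and depths below are example values, and the 1-pixel threshold follows the text.

```python
def delta_y(focal_length, Y, D, d):
    """Change of the image y-coordinate after moving forward by d (pixels)."""
    return focal_length * Y * d / (D * (D - d))

def split_far_near(matches, threshold_px=1.0):
    """Partition (y_k, y_k1) pairs of aligned image points into far and near."""
    far, near = [], []
    for y_k, y_k1 in matches:
        (far if abs(y_k1 - y_k) < threshold_px else near).append((y_k, y_k1))
    return far, near

# Example values: f = 1000 px, point 1.5 m off the optical axis, 0.5 m motion.
f, Y, d = 1000.0, 1.5, 0.5
depths = [10.0, 30.0, 50.0]
matches = [(f * Y / D, f * Y / (D - d)) for D in depths]

far, near = split_far_near(matches)
for D in depths:
    print(D, round(delta_y(f, Y, D, d), 3))
print(len(far), len(near))
```

Under these example values the point at 10 m shifts by several pixels and is classified as near, while the points at 30 m and 50 m shift by less than one pixel and are kept as far points.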

2) COMPARISON OF ROTATION AND TRANSLATION ESTIMATION WITH THE GROUND TRUTH
We compared our algorithms with 17pt-Li [8], 8pt-Kneip [24], 6pt-Stewénius [9], 4pt-Liu [22], and 4pt-Lee [20]. To compare our algorithms fairly, we did not apply any nonlinear refinement, bundle adjustment, or loop closure. This means that we only computed the frame-to-frame visual odometry component. We ran all algorithms within a RANSAC framework. For all sequences, the number of RANSAC iterations was fixed at 100 and the inlier threshold was set to 1 pixel in the experiments.
We computed the median error of the rotation and translation estimates with respect to the ground truth. We report the accuracy results of the rotation and translation estimation on KITTI odometry sequences in Table 3 and Table 4. From the tables, we observe that the performance of our methods was comparable with or better than that of the other methods, particularly for translation estimation. The higher accuracy of the translation estimation resulted from the strategy of using only the near points to estimate the translation. According to Table 3, histogram voting was the best approach for rotation estimation on most sequences. It should be noted that 4pt-Liu [22] and 4pt-Lee [20] were close to each other and 8pt-Kneip [24] was slightly worse, which is consistent with the simulation experiments on image pixel noise. Table 4 shows that our three methods, 4pt-Our, Histogram voting+3pt, and Histogram voting+2pt, were all better than the other methods in terms of translation estimation; in particular, the accuracy of Histogram voting+2pt was the highest.

TABLE 3. Accuracy results of rotation estimation on KITTI odometry sequences 00-10 (unit: degrees). The median for each error measure is given.

TABLE 4. Accuracy results of translation estimation on KITTI odometry sequences 00-10 (unit: degrees). The median for each error measure is given.
The empirical cumulative error distributions of rotation and translation for KITTI VO-seq-00 are provided in Fig. 10. The proposed solvers (Histogram voting, Histogram voting+3pt, and Histogram voting+2pt) provided significantly better estimates than the state-of-the-art methods. 4pt-Our performed on par with 4pt-Liu [22] and 4pt-Lee [20]. The average RANSAC runtimes of our method and of state-of-the-art multi-camera ego-motion methods over the KITTI sequences are compared in Table 5. The proposed method ran more efficiently within RANSAC than the state-of-the-art methods while maintaining good ego-motion accuracy.
The scenarios of the KITTI odometry dataset are diverse and include, for example, changing illumination and low-texture environments. It is interesting to see that our methods performed well across the KITTI odometry sequences. Consequently, the real-data experiments demonstrate that our method fits road driving scenarios very well, even under changing illumination or in low-texture environments.

3) COMPARISON OF THE INLIER RECOVERY RATE
Another extremely helpful application of our methods is selecting a correct inlier set required for the next step (e.g., accurate motion estimation and non-linear optimization). Therefore, we conducted an experiment that tested how many of the real inliers (calculated from the ground truth) can be found using our methods. Table 6 shows the mean of the inlier recovery rate on KITTI odometry sequences 00-10 using our methods and state-of-the-art methods. Histogram voting+3pt and Histogram voting+2pt were slightly better than the other methods. Fig. 11 shows the inlier set detection of the first two frames from KITTI seq-00 using Histogram voting+2pt.

VI. CONCLUSION
In this article, we proposed new methods to solve the problem of ego-motion estimation of a multi-camera system with decoupled rotation and translation estimation when the vertical direction is known. We assumed that the far points are not affected by the translation under small motion, an assumption that held in road driving scenes in our experiments on the KITTI dataset. Based on this assumption, we proposed a minimal solver to estimate rotation with only a single far point. To estimate translation, we proposed a linear method with three near points and a sampling method with two near points. We verified the efficiency and robustness of our methods in a series of experiments on synthetic and real data. These experiments demonstrated that our methods are well suited to autonomous driving scenarios that contain far features. In future work, we plan to try more reliable feature extraction to improve the accuracy of the ego-motion estimation.
MIAO TIAN received the B.S. degree in automation from Nanjing Normal University, Nanjing, China, in 2012, and the M.S. degree in guidance navigation and control from Information Engineering University, Zhengzhou, China, in 2016. She is currently pursuing the Ph.D. degree in aeronautical and astronautical science and technology from the National University of Defense Technology, Changsha, China. Her research interests include computer vision and visual navigation.