Frame-to-Frame Visual Odometry Estimation Network With Error Relaxation Method

Estimating frame-to-frame (F2F) visual odometry from monocular images suffers significantly from propagated, accumulated drift. We propose a learning-based approach to F2F monocular visual odometry estimation with novel, simple methods that consider the coherence of camera trajectories without any post-processing. The proposed network consists of two stages: initial estimation and error relaxation. In the first stage, the network learns from disparity images to extract features and predicts the relative camera pose between two adjacent frames through the attention, rotation, and translation networks. In the error relaxation stage, loss functions are proposed to reduce local drift and increase consistency in dynamic driving scenes. Moreover, our skip-ordering scheme shows the effectiveness of dealing with sequential data. Experiments on the KITTI benchmark dataset show that our proposed network outperforms other approaches with higher and more stable performance.


I. INTRODUCTION
Estimating camera orientation and location (pose or ego-motion) in every frame, called visual odometry (VO), is an essential module for navigation, robotics, and augmented reality. When accurate camera poses are predicted for all frames, they are valuable for analyzing the surrounding world in the camera coordinate system and for understanding both indoor and outdoor environments, including autonomous vehicles [1], [2], [3], unmanned robotics [4], [5], and quadrocopters [6].
Especially in robotics vision, such as Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM), estimating camera pose from sequential images has been widely studied in both offline and online settings. One line of work is geometric-based approaches, such as feature-based [7], [8], [9], semi-direct [10], [11], and direct methods [12]. They predict the camera pose by matching corresponding features between two frames and optimize it with 3D geometric constraints [13] or optimizers such as the weighted Gauss-Newton method. However, these methods rely heavily on continuous, laborious camera pose optimization, such as bundle adjustment [14] and loop closure [15], rather than on estimating a precise F2F camera pose.
In our system, we aim to estimate the F2F camera pose with only two adjacent frames and replace those laborious optimizations with error relaxation methods. Furthermore, given that camera movements vary in a driving environment, we propose a novel and simple scheme called skip-ordering. By skipping one interval frame of the image sequence when training the network, we find it effective for learning enriched camera poses that span various speeds of camera motion.
Fig. 1. Training strategy of the proposed network. The network is trained using disparity images that are the output of [16]. The network learns to predict camera pose through the forward loss function in the initial estimation stage. Accumulated drift is decreased by our proposed bi-directional and correction loss functions in the error relaxation stage.
Inspired by geometric-based approaches that predict approximate camera poses and then refine them, our training strategy consists of two stages, as shown in Fig. 1. First, the deep neural network predicts approximate camera poses in the initial estimation stage, where separate rotation and translation networks are designed to learn the 6-DOF camera pose effectively. After that, optimization loss functions are proposed to reduce error in the relaxation stage. Motivated by the UnFlow network [17], which exploits the constraint that the sum of two motion vectors with opposite directions equals zero, we design a bi-directional loss function. With the reversed sequence of training images, our network learns not only the forward but also the backward camera pose. Since the core problem of F2F camera odometry estimation is propagated error, we add a correction loss function to relax this error. To the best of our knowledge, this is the first approach to predict pose and relax errors simultaneously by training on disparity images only.
Moreover, for predicting a precise camera pose in a dynamic environment, such as moving vehicles and pedestrians, we propose an attention network that emphasizes static regions and de-highlights dynamic regions within feature space. We show that this network helps to focus on static objects and be robust in a highly dynamic scene.

II. RELATED WORK
This section introduces the relevant work concerning the two approaches to visual odometry: geometric-based methods and learning-based approaches.

A. GEOMETRIC-BASED METHODS
Visual odometry has long been studied with geometric-based methods. Geiger et al. [7] proposed a camera ego-motion algorithm with a sparse feature (circular, blob, and corner) matching method, which generates a consistent 3D point cloud. Ciarfuglia et al. [18] argued that optical flow is correlated with camera ego-motion, using a support vector machine [19].
Klein et al. [8] presented the notable Parallel Tracking and Mapping (PTAM) system, which incorporates a bundle adjustment algorithm [20]. They showed that camera poses can be estimated in every frame and updated in real time by using keyframes, which are representative images, to reduce the redundancy of the algorithm. Inspired by this real-time pose prediction, many SLAM systems were introduced that demonstrated accurate poses and dense 3D maps, such as ORB-SLAM, Large-Scale Direct SLAM (LSD-SLAM), Dense Tracking and Mapping (DTAM), and RGB-D SLAM.
A typical pipeline, as in ORB-SLAM by Mur-Artal et al. [9], is to compute correspondences with the ORB descriptor [21] and estimate camera pose with the EPnP algorithm [22]. They also introduced a loop-closing optimizer, which eliminates odometry drift when the camera revisits a previously visited place, using DBoW2 [23] with ORB. However, since this descriptor is based on FAST keypoints [24], the resulting map has low density.
To overcome this problem and improve map quality, various studies have attempted to use the entire image rather than individual points. Newcombe et al. [12] aligned previous and current images, rather than matching interest points, to predict camera pose and generate a highly dense 3D map. Similar approaches [25] used per-pixel depth maps produced by a Kinect sensor. However, using whole images to predict camera pose and generate a 3D map makes updating more difficult than using interest points.
Engel et al. [10] proposed a semi-dense approach based on a probabilistic depth map representation to solve these problems. They align only the high-gradient regions of images, such as corners and edges (hence semi-dense), to estimate and update the camera pose effectively. DSO [26] proposed joint optimization of parameters, including geometry and camera motion, and was designed to minimize the photometric error.

B. LEARNING-BASED APPROACHES
With the success of CNNs in many robotics vision areas [27], various approaches have attempted to solve camera odometry estimation by modeling it as a regression problem. One of the first CNN-based trials was by Konda et al. [28], [29], [30], who showed the possibility of predicting the speed and directional changes of the camera using depth information. Since camera motion is predicted differently according to the distance of an object from the camera, they argued that depth images can be used to estimate visual odometry.
Flowdometry [31] proposed that an optical flow image contains more motion information than other image types for training an end-to-end convolutional neural network; it used the pretrained FlowNetS architecture [32] to perform odometry regression. P-CNN [33] was trained on both dense optical flow and RGB images to extract more visual information, arguing that training on both image types learns new feature representations.
A recurrent neural network (RNN) is a natural architecture for processing continuous data. Wang et al. [34] suggested an RNN, called DeepVO, to estimate accurate visual odometry using prior knowledge. They argued that information from past images is essential for computing camera motion. Wang et al. [35] also presented an RNN model, called DenseSLAMNet, for 3D reconstruction from consecutive images. However, unlike CNNs, these networks have many parameters, so enormous capacity is needed to train on successive images.
Recently, attracted by the fact that self-supervised learning needs no labels, several works [36], [37], [38], [39] have studied the relationship between depth information and camera pose. Given the exact depth values of one frame and the camera intrinsic parameters, the 3D relative pose of the adjacent frame can be computed geometrically. Although these approaches are popular, they focus on optimizing both depth and camera pose between adjacent frames, not on predicting the camera trajectory over all frames of a scene, which may contain camera drift.

III. PROPOSED METHOD
This section describes the proposed network architecture as shown in Fig. 2 and the framework for learning F2F camera odometry with the error relaxation methods in detail.

A. NETWORK ARCHITECTURE
1) ENCODER
Various CNN architectures, such as VGG [40], ResNet [41], and DenseNet [42], have been studied and have shown tremendous performance in classification, recognition, and segmentation tasks. However, those networks are unsuitable for estimating exact camera poses in VO estimation. To extract geometric features for VO estimation, it is important not to lose spatial information in the feature domain. To preserve this information, we use a convolution layer with stride 2 to embed features rather than a pooling layer, as mentioned in [43].
To acquire rotation and translation from two images, each feature should be extracted in the same way so that the displacement between them can be analyzed. Thus, the encoder shares its kernel weights, which means that each input image is processed by the same convolution layers. Each output of the shared encoder is combined with the output of the attention network, and the channel-wise stacked features are then passed through the rotation and translation networks.
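To make the shared-weight idea concrete, the following is a minimal numpy sketch (the kernel size, image size, and single-filter encoder are toy assumptions; the real encoder stacks many stride-2 convolution layers):

```python
import numpy as np

def conv2d_stride2(img, kernel):
    """Toy valid-mode convolution with stride 2: downsamples while
    preserving spatial structure, in place of a pooling layer."""
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // 2 + 1
    ow = (img.shape[1] - kw) // 2 + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(img[2*y:2*y+kh, 2*x:2*x+kw] * kernel)
    return out

# Both adjacent disparity frames pass through the SAME kernel
# (shared weights); the resulting features are stacked channel-wise.
rng = np.random.default_rng(0)
kernel  = rng.normal(size=(3, 3))      # one shared filter (toy)
frame_i = rng.normal(size=(64, 64))
frame_j = rng.normal(size=(64, 64))
feat_i = conv2d_stride2(frame_i, kernel)
feat_j = conv2d_stride2(frame_j, kernel)
stacked = np.stack([feat_i, feat_j], axis=0)   # shape (2, 31, 31)
```

Because the weights are shared, identical structures in the two frames map to identical feature responses, so any difference between `feat_i` and `feat_j` reflects feature displacement rather than encoder bias.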

2) ATTENTION NETWORK
To be robust to outliers when estimating VO, geometric-based methods check whether hand-crafted features fit their model. For instance, a survival-of-the-fittest policy is applied to reduce outliers in the ORB-SLAM system [9], and VISO2 [7] exploits random sample consensus (RANSAC) to remove them.
Inspired by these concepts, the attention network is designed to assign weights to the output of the encoder. A disparity image is inversely proportional to the depth map; nearby objects, which have large pixel intensities in the disparity image, have more influence on predicting camera pose. For example, predicting VO by comparing the displacement of features from distant or dynamic objects is quite challenging. This network helps to highlight or de-emphasize features by multiplying the output of the encoder pixel-wise. All layers in the attention network use the ReLU function except for the last layer, where a sigmoid function sets the output between 0 and 1.
Fig. 3. The orientation of the coordinate system is the same as that provided by the KITTI dataset [45], in which forward, rightward, and downward correspond to the z, x, and y axes, respectively. Each blue arrow represents a 3D camera pose with respect to the reference frame. The red arrow denotes the F2F camera odometry T_{i,i+1}, with R_{i,i+1} ∈ SO(3) and p_{i,i+1} ∈ R^3, of the (i+1)-th frame with respect to the i-th frame.
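The pixel-wise gating described here can be sketched in a few lines (shapes and the toy random inputs are assumptions; in the network, the logits come from the attention layers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_attention(features, attention_logits):
    """Pixel-wise gating: the sigmoid squashes the mask into (0, 1),
    so confident (static, nearby) regions can be emphasized and
    dynamic or distant ones suppressed by element-wise multiplication."""
    mask = sigmoid(attention_logits)
    return features * mask, mask

feats  = np.random.randn(8, 16, 16)    # toy encoder output (C, H, W)
logits = np.random.randn(8, 16, 16)    # toy attention-network output
gated, mask = apply_attention(feats, logits)
```

A mask value near 1 passes the feature through unchanged, while a value near 0 effectively removes it, mimicking the outlier rejection that RANSAC performs in geometric pipelines.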

3) ROTATION AND TRANSLATION NETWORKS
The rotation and translation networks learn to predict camera orientation and location from geometric features. Producing both orientation and location through a single network forces the extracted feature to contain rotation and translation information simultaneously. Since this information lives in different feature spaces, we design the rotation and translation networks independently. The rotation network has two more layers than the translation network because rotation has higher non-linearity [44]. Our network outputs a camera pose as six values through fully connected layers.
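The two independent heads can be sketched as small MLPs (the layer widths and the random weights are illustrative assumptions; only the structure, a deeper rotation head and a shallower translation head over a shared feature, follows the text):

```python
import numpy as np

def mlp(x, widths, rng):
    """Fully connected stack with ReLU on all but the last layer."""
    for i, (din, dout) in enumerate(zip(widths, widths[1:])):
        x = x @ rng.normal(scale=0.1, size=(din, dout))
        if i < len(widths) - 2:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
feat = rng.normal(size=128)   # toy geometric feature from the encoder

# Independent heads: translation is shallower, rotation two layers deeper.
p_hat     = mlp(feat, [128, 64, 32, 3], rng)             # (p_x, p_y, p_z)
theta_hat = mlp(feat, [128, 64, 64, 32, 32, 3], rng)     # (th_z, th_y, th_x)
pose_6dof = np.concatenate([theta_hat, p_hat])
```

Keeping the heads separate means a gradient step that improves rotation cannot distort the translation features, and vice versa.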

B. CAMERA POSE
In this section, we describe the framework for learning F2F camera odometry from two monocular images, I_i and I_{i+1}, as shown in Fig. 3. Our network outputs the transformation matrix T, given by a 3D camera rotation matrix R and translation vector p:

$$T = \begin{bmatrix} R & p \\ \mathbf{0}^{\top} & 1 \end{bmatrix}, \tag{1}$$

where R and p belong to SO(3) and R^3, respectively. For convenience, we denote the forward, rightward, and downward axes as z, x, and y. Every rotation matrix must satisfy both orthogonality (RR^T = I) and unit determinant (det(R) = 1), which defines the special orthogonal group. In robotics vision, various mathematical representations of 3D rotation exist, such as the quaternion, Euler angles, and the Lie group. Since most driving motion is straight ahead, the fourth value of the quaternion is abnormally large compared to the other values; this hinders learning and disturbs the balance among the rotation values. The Lie group representation has the disadvantage that the reference axis cannot be identified in a stopped-driving environment. Therefore, our network produces the 6-DOF transformation between frames as three Euler angles and a translation vector. For convenient notation, the Euler angles and translation vector are denoted θ and p, and T is composed of six values, [θ_z, θ_y, θ_x, p_x, p_y, p_z].
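As an illustration, the 6-DOF vector can be assembled into a 4×4 transformation matrix; this is a minimal numpy sketch (the Z-Y-X Euler composition order is one common convention, assumed here rather than taken from the original implementation):

```python
import numpy as np

def euler_to_T(pose):
    """Build the 4x4 transform from [th_z, th_y, th_x, p_x, p_y, p_z]
    (z forward, x right, y down, as in the KITTI convention)."""
    tz, ty, tx, px, py, pz = pose
    Rz = np.array([[np.cos(tz), -np.sin(tz), 0],
                   [np.sin(tz),  np.cos(tz), 0],
                   [0,           0,          1]])
    Ry = np.array([[ np.cos(ty), 0, np.sin(ty)],
                   [ 0,          1, 0],
                   [-np.sin(ty), 0, np.cos(ty)]])
    Rx = np.array([[1, 0,           0],
                   [0, np.cos(tx), -np.sin(tx)],
                   [0, np.sin(tx),  np.cos(tx)]])
    R = Rz @ Ry @ Rx            # assumed Z-Y-X composition order
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [px, py, pz]
    return T

T = euler_to_T([0.01, 0.02, -0.01, 0.1, 0.0, 1.2])
# The rotation block is special orthogonal: R R^T = I and det(R) = 1.
```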
In the initial estimation stage, we train our network with the following forward loss function:

$$\mathcal{L}_{f} = \lambda_{\theta}\,\|\hat{\theta} - \theta\|_{2}^{2} + \|\hat{p} - p\|_{2}^{2}, \tag{2}$$

where θ and p denote the label and θ̂ and p̂ the estimated camera pose, and λ_θ is a scale factor that balances the two terms. This loss function is composed of the mean squared error (MSE) of the Euler angles θ and the translation p.
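A sketch of this loss in numpy (placing the balancing factor on the rotation term, consistent with λ_θ = 350 in the implementation details, is an assumption about the exact weighting):

```python
import numpy as np

def forward_loss(theta_hat, p_hat, theta_gt, p_gt, lam=350.0):
    """MSE of Euler angles plus MSE of translation; lam (lambda_theta)
    scales the small-magnitude rotation term up so it is not
    swamped by the translation term."""
    return lam * np.mean((theta_hat - theta_gt) ** 2) \
               + np.mean((p_hat - p_gt) ** 2)
```

With a perfect estimate the loss is zero; a unit rotation error alone contributes `lam` times more than a unit translation error.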

C. POSE RELAXATION METHODS
The set of transformations with respect to the first frame is computed as follows:

$$T_{1,n} = \prod_{i=1}^{n-1} T_{i,i+1} = T_{1,2}\,T_{2,3}\cdots T_{n-1,n}. \tag{3}$$

The fundamental issue of F2F pose estimation is reducing propagated errors. As shown in Eq. 3, if drift occurs at the beginning, the trajectory error accumulates and the gap continues to grow. Therefore, relaxing this error is vital for predicting the F2F camera pose.
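The propagation in Eq. 3 can be demonstrated directly (the 1 m steps and the 0.1 m drift are made-up values for illustration):

```python
import numpy as np

def accumulate(relative_Ts):
    """Chain relative transforms into poses w.r.t. the first frame:
    T_{1,n} = T_{1,2} @ T_{2,3} @ ... @ T_{n-1,n}."""
    poses = [np.eye(4)]
    for T in relative_Ts:
        poses.append(poses[-1] @ T)
    return poses

# A small error in one early step propagates to ALL later poses.
step  = np.eye(4); step[0, 3]  = 1.0   # move 1 m along x each frame
drift = np.eye(4); drift[0, 3] = 1.1   # first step over-estimated by 0.1 m
clean  = accumulate([step] * 5)
drifty = accumulate([drift] + [step] * 4)
# The 0.1 m error made at frame 1 is still present at frame 5.
```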
Geometric-based approaches, on the other hand, continuously clean up this error through loop closure and bundle adjustment [46], optimizing the predicted camera poses locally and globally in post-processing. However, since our purpose is to design a network that learns both prediction and correction at once, we divide training into two stages, initial estimation and error relaxation, as shown in Fig. 1. Furthermore, a novel and simple loss function, called the correction loss and described in detail in Section III-C2, is defined to decrease odometry drift in the error relaxation stage, which yields higher and more stable performance than other networks.

1) BI-DIRECTIONAL LOSS
Note that the ground truths provided by the KITTI benchmark [45] are unbalanced, as shown in Fig. 4. Even though the distribution along the y-axis is widely spread, the Euler angles of all axes are roughly Gaussian. The F2F translation along the z-axis, however, takes only positive values. This means that the network only learns the forward direction, leading to a one-way learning process. To obtain accurate camera poses, the network should compute the displacement of features. However, when the camera moves forward, all pixels spread outward from the vanishing point. These biased pixel movements prevent the network from learning feature displacement and cause unsatisfactory results.
To solve these issues, we define a bi-directional loss function, L_b, in the error relaxation stage as follows:

$$\mathcal{L}_{b} = \left\|\hat{T}_{i,i+1}\,\hat{T}_{i+1,i} - I\right\|_{2}^{2}, \tag{4}$$

where I is the identity matrix, and T̂_{i,i+1} and T̂_{i+1,i} are the outputs of the proposed network when the input images are {I_i, I_{i+1}} and {I_{i+1}, I_i}, respectively.
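The constraint is that composing the forward and backward estimates should return to the identity; a minimal sketch (using a squared Frobenius norm, which is an assumption about the exact norm):

```python
import numpy as np

def bidirectional_loss(T_fwd, T_bwd):
    """Composing the forward estimate T_{i,i+1} with the backward
    estimate T_{i+1,i} should give the identity; penalize any
    deviation from it."""
    return np.sum((T_fwd @ T_bwd - np.eye(4)) ** 2)

T = np.eye(4); T[2, 3] = 1.0   # toy pose: 1 m forward along z
# A perfect backward estimate is the inverse, giving zero loss.
```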

2) CORRECTION LOSS
Inspired by the fact that image de-noising removes errors by referring to the surrounding pixels, we design an additional loss function that uses adjacent estimated camera poses to obtain coherent camera odometry. Assuming that T_{i,i+1} + e contains an error e, as shown in Fig. 5, it needs to be relaxed using the neighboring camera poses T_{i−1,i} and T_{i+1,i+2}. Therefore, we define our correction loss function as follows:

$$\mathcal{L}_{c} = \left\|\hat{T}_{i,i+1} - \tfrac{1}{2}\left(\hat{T}_{i-1,i} + \hat{T}_{i+1,i+2}\right)\right\|_{2}^{2}. \tag{5}$$

When training our network, we first use the forward loss function in the initial estimation stage. After that, the bi-directional and correction loss functions are added in the error relaxation stage to relax errors that occurred in the first stage. These loss functions improve accuracy, as described in detail in Sec. IV.
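A sketch of the de-noising idea, assuming a neighbor-averaging form in which the current estimate is pulled toward the mean of its temporal neighbors (the exact formulation in the original may differ):

```python
import numpy as np

def correction_loss(T_prev, T_cur, T_next):
    """De-noising-style smoothing: penalize the deviation of
    T_{i,i+1} from the average of its temporal neighbours
    T_{i-1,i} and T_{i+1,i+2} (assumed neighbor-averaging form)."""
    return np.sum((T_cur - 0.5 * (T_prev + T_next)) ** 2)
```

An isolated spike in one F2F estimate then costs extra loss, while a smooth sequence of poses costs nothing, which is exactly the local-coherence behavior the text describes.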

D. SKIP-ORDERING
F2F visual odometry solves a regression problem over consecutive images. Although an RNN might be suitable for handling sequential data, it needs huge resources to train on successive images. To deal with these issues using a CNN, we use three consecutive images as network inputs in both the initial estimation and error relaxation stages. The green dotted box in Fig. 2 represents the skip-ordering method. When training the network, three consecutive disparity images, I_i, I_{i−1}, and I_{i−2}, are used as input, where I_i denotes the i-th disparity image. We define the pair of I_i and I_{i−2} as the skip-ordering pair, which allows the network to learn large pixel movements (e.g., high speed). We also experimented with skipping two frames, i.e., the pair of I_i and I_{i−3}; however, the results were poor because few overlapping regions exist between the images. Our network becomes robust to various speeds thanks to the skip-ordering scheme, as explicitly explained in Sec. IV.
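The pair construction can be sketched as follows (the function name is ours; only the pairing scheme, adjacent pairs plus the one-frame skip, comes from the text):

```python
def make_pairs(frame_ids):
    """From each triple (i-2, i-1, i), build the adjacent pairs plus
    the skip-ordering pair (i-2, i), which exposes larger pixel
    motion (i.e., higher apparent speed) during training."""
    pairs = []
    for i in range(2, len(frame_ids)):
        a, b, c = frame_ids[i - 2], frame_ids[i - 1], frame_ids[i]
        pairs += [(a, b), (b, c), (a, c)]   # (a, c) skips one frame
    return pairs

print(make_pairs([0, 1, 2, 3]))
# [(0, 1), (1, 2), (0, 2), (1, 2), (2, 3), (1, 3)]
```

A skip pair at recording rate f behaves like an adjacent pair at rate f/2, so the same sequence contributes training samples at two effective speeds.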

IV. EXPERIMENT
In this section, we evaluate our proposed method on the KITTI dataset [45]. Our network is compared to other geometric-based and learning-based approaches using only monocular images. Since monocular visual odometry has a fundamental problem, scale ambiguity, some algorithms are re-scaled by reference to the ground truth.

A. DATASETS
The KITTI datasets are well-known benchmarks that provide driving-scene data and evaluate submitted algorithms. Notably, they provide stereo images and GPS/INS information in the odometry dataset. We used the left original images (370 × 1220) to train the network and made F2F labels using the provided INS data. The KITTI odometry dataset consists of 11 sequences (00–10) with labels and 11 sequences (11–21) without labels. We set the experimental environment identically to that of other algorithms: the training and test sets consist of sequences 00–07 (16,338 images) and 08–10 (6,863 images), respectively.
The training images are outputs of the method proposed by Godard et al. [16]. Their network was trained on Cityscapes [47] and the KITTI dataset and evaluated on a different dataset [48]. We leverage this pre-trained network because disparity carries spatial cues at each pixel. Although color images have more texture than disparity images, they contain redundant features that hinder learning camera poses. The experimental results support our hypothesis: the proposed network trained with only disparity images outperforms the network trained with color or color + disparity, as described in Sec. IV-D.

B. IMPLEMENTATION DETAILS
To preserve geometric features, we did not employ batch normalization in any layer. We heuristically set λ_θ to 350 to balance the rotation and translation losses. We used Adam [49] optimization and Xavier initialization. After training the network for 30 epochs in the initial estimation stage, the bi-directional and correction losses are added and the network is re-trained for 10 epochs in the error relaxation stage.
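The two-stage schedule can be sketched as a small helper (the function name and 0-indexed epoch counting are ours; the epoch counts follow the implementation details above):

```python
def loss_schedule(epoch, stage1_epochs=30, stage2_epochs=10):
    """Which loss terms are active at a given (0-indexed) epoch:
    forward loss only for the first 30 epochs (initial estimation),
    then forward + bi-directional + correction for 10 more epochs
    (error relaxation)."""
    if epoch < stage1_epochs:
        return ("forward",)
    if epoch < stage1_epochs + stage2_epochs:
        return ("forward", "bidirectional", "correction")
    raise ValueError("training finished")
```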

C. ODOMETRY RESULT
To evaluate performance, we use the KITTI benchmark metrics, T_error (%) and R_error (°/100 m), from the KITTI odometry dataset. They measure the difference between estimation and ground truth in translation and rotation as a function of path length [m] and speed [km/h]. These are the major evaluation metrics of the KITTI odometry benchmark for ranking odometry algorithms.
TABLE 1. Visual odometry results evaluated on sequences 08, 09, and 10. T_error and R_error represent translation and rotation drift. Since each sequence has a different number of frames, we also compute the standard deviation (std) to show the stability of the odometry results. Algorithms marked with stars are re-scaled by force.
Table 1 compares the proposed network with other algorithms, including hand-crafted [7], [9] and learning-based methods [18], [31], [33], [38], and reports the average error and standard deviation of camera odometry on each test sequence. The VISO2-M [7] and SFMLearner [38] algorithms, marked with (*), cannot solve the scale problem themselves; we therefore recovered their scale using the ratio of estimated to ground-truth motion. Our translation accuracy is the best among all algorithms, and our rotation error is competitive. Moreover, the standard deviations show that our results are stable. ORB-SLAM (LC), which uses loop closure with ORB descriptors to match features in monocular sequences, shows low translation performance. Despite its use of optimization such as loop closure and bundle adjustment, our error relaxation network works effectively, showing lower translation errors. Interestingly, geometric-based methods (e.g., VISO2-M, ORB-SLAM) achieve markedly lower rotation errors than the learning-based methods. SVR-VO [18] exploits a support vector machine to handle non-linear relations between features.
Since an SVM has relatively fewer parameters than deep networks, it shows more stable performance. However, it has rather low performance in both translation and rotation error, because deep learning's high-dimensional feature space represents geometric features more easily.
Regarding the deep learning-based approaches, P-CNN VO [33] performs better than the other networks. P-CNN [33] trains on both color and dense optical flow images to extract more visual information. Since the optical flow image encodes movement along the temporal axis, both P-CNN and Flowdometry [31] show low rotation errors compared to other methods. However, our network shows a lower translation error than the optical flow-based networks.
Detailed errors with respect to path length [m] and speed [km/h] are shown in Fig. 6. Our method achieves competitive results compared with the other algorithms. Thanks to the skip-ordering and error correction loss, we achieve stable results in high-speed scenes. We also plot the test sequences to evaluate the qualitative trajectories, as shown in Fig. 7. Sequences 08, 09, and 10 have 4071, 1591, and 1201 frames, respectively. Since sequence 08 has a large number of frames (4071), the other methods continuously accumulate many more errors than our network.

D. ABLATION EXPERIMENT
To validate the proposed method, we conduct several ablation experiments, as shown in Table 2. The models RGB and RGB+Disparity (stacked channel-wise) refer to the type of training image. We find that the network trained with color images shows higher error than the network trained with disparity, due to the redundant features of color images. The RGB+Disparity model also shows higher translation errors than the RGB model.
A, S, and C indicate whether the attention network, skip-ordering, and correction loss are included. Although Disparity/+S shows the best performance on Seq. 09, the error correction loss reduces the overall loss. Interestingly, merely skipping the interval image reduces the average translation and rotation errors from 10.70 to 7.57 and from 4.11 to 3.24, respectively. Furthermore, the attention network helps to reduce both translation and rotation errors; as shown in Fig. 8, it filters out non-confident regions.
Fig. 9. Ablation experiment. Dotted black, red, and green represent the ground truth and the estimated camera poses with and without error correction. While there is a small drift at the beginning, the drift increases over time.
Fig. 9 shows qualitatively how our correction loss functions help estimate camera odometry. At the beginning of the sequence, the network trained with only the forward loss function cannot remove accumulated errors. However, the network with the error relaxation losses eliminates them effectively, even in the complicated sequence 09.

V. CONCLUSION
We have conducted several experiments on estimating F2F camera location and orientation simultaneously with a learning-based method. The essential issue in predicting F2F VO is continuously propagating error. Geometric-based methods solve this problem by optimizing their initial estimations in post-processing with the well-known loop closing and bundle adjustment. To replace such optimizers, we define bi-directional and correction loss functions in the error relaxation stage. These loss functions increase the coherence of camera trajectories and remove errors that occurred in the initial estimation stage. In future work, we plan to extend our model to train on multimodal data (e.g., LiDAR and RGB) from other datasets to increase the accuracy of camera odometry. Color images have plentiful features and LiDAR provides high-accuracy depth values, so incorporating their strengths could improve performance.