OmniVO: Toward Robust Omni-Directional Visual Odometry With Multicamera Collaboration for Challenging Conditions

With recent developments in computer vision, vision-based odometry plays an increasingly important role in autonomous systems. However, traditional visual odometry and visual simultaneous localization and mapping (vSLAM) perform well only in simple environments with obvious structural features. Visual odometry may easily fail in complex environments due to the sparsity of stable features, sensor failure, extreme weather, or strong sunlight. Traditional monocular-camera algorithms are highly affected by these issues, leading to stability and reliability problems. Therefore, to deal with such issues, a vision-based omnidirectional odometry based on multi-camera collaboration is proposed. The major contributions are the development of feature-based omnidirectional odometry and a feature-prioritization scheme that limits the computational complexity introduced by multiple cameras. Firstly, the multi-camera state perception model is developed based on the spherical camera model, which guarantees an accurate transformation from the camera to the spherical coordinate system. Feature detection and tracking are performed on the images of the individual cameras in parallel threads. Tracking features across the different cameras prevents failures; however, it also adds extra computational complexity to the pose estimation and optimization module. To budget the feature distribution, a feature-prioritization algorithm is proposed to limit the number of features. Furthermore, the multi-view pose refinement module further reduces the complexity of the system. Feature prioritization helps to maintain a smaller set of tracked features with similar accuracy and less computational cost. Finally, the pose is estimated in a spherical coordinate system by projecting all successfully tracked keypoints onto the sphere with the help of the omnidirectional perception model. To validate the proposed method, a dataset is collected from an outdoor environment with ground truth provided by high-accuracy GPS. Detailed qualitative and quantitative evaluations show that the proposed algorithm improves the position accuracy by about 40%-60% compared with state-of-the-art methods, with limited computation time.

The main contributions of this work are as follows:

• A complete feature-based visual odometry algorithm is developed for a multi-camera panoramic system, with collaboration of multiple cameras, by using a spherical state perception model for challenging situations.

• A feature selection and outlier removal scheme is proposed to budget the features and limit the computational cost. A multi-view optimization scheme, based on the per-camera feature probability, limits the number of 3D landmarks to reduce the optimization cost.

• Importantly, an extensive evaluation is performed with different camera settings and compared with other methods. The dataset is collected from an outdoor environment that covers many challenging situations, with ground truth provided by RTK GPS.

The remainder of this paper is structured as follows: related work is reviewed in Section II. The multi-camera panoramic perception, including fisheye camera calibration and the multi-camera rig model, is discussed in Section III. A detailed explanation of the proposed visual odometry is given in Section IV. The experimental evaluation is presented in Section V. Sections VI and VII present the discussion and conclusions, respectively.

Similarly, the multi-camera vision system has also been integrated into direct sparse odometry [30]. For visual SLAM with panoramic or multi-fisheye camera systems, some algorithms have been published in the literature. PanoSLAM [17] proposed a complete feature-based SLAM system with a multi-fisheye panoramic camera. The authors proposed a fisheye camera calibration method that combines an equidistant projection model with a trigonometric polynomial for a high-accuracy transformation of fisheye points to their corresponding panoramic points. Furthermore, a bundle adjustment with a specific backpropagation error function was developed for panoramic cameras. Experiments conducted on self-collected and open-source datasets demonstrated better performance in challenging scenes. Omni-SLAM [31] is also a feature-based SLAM system designed for a multi-camera setup, adding deep learning for depth estimation of the panoramic image.

A SLAM system based on a monocular camera suffers from a small field of view, which ultimately causes the failure of the system in challenging situations such as strong sunlight, abrupt motion changes, sharp turns, and feature-less regions in the surroundings. Multi-camera methods track all feature points and perform optimization with each camera's information; the information associated with all views helps to recover from failures but adds extra computational complexity to the algorithm. Therefore, we propose an effective visual odometry algorithm for an omnidirectional camera with multi-camera collaboration. Feature prioritization and multi-view pose optimization are introduced into the traditional pipeline, which effectively improves the performance of the algorithm. Distortion in the images affects the feature-matching capabilities of the odometry algorithm; therefore, the KLT tracking algorithm is used in a multi-threaded way for each camera. The complete hardware system is designed and a dataset is collected from the outdoor environment, including a parking lot, a main road environment, and a narrow road region across the building. The detailed experimentation suggests that the proposed method can considerably improve the performance and reliability in complex regions where single-camera methods have difficulty tracking.
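To make the per-camera tracking step concrete, the following is a minimal sketch of multi-threaded KLT tracking using OpenCV's pyramidal Lucas-Kanade implementation; the function names, thread-pool layout, and tracker parameters are illustrative assumptions, not the authors' code.

from concurrent.futures import ThreadPoolExecutor

import cv2
import numpy as np

# Lucas-Kanade parameters (illustrative values)
LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def track_camera(prev_gray, curr_gray, prev_pts):
    """Track one camera's features with pyramidal KLT; keep successful tracks."""
    if prev_pts is None or len(prev_pts) == 0:
        empty = np.empty((0, 1, 2), np.float32)
        return empty, empty
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None, **LK_PARAMS)
    ok = status.ravel() == 1
    return prev_pts[ok], curr_pts[ok]

def track_all_cameras(prev_frames, curr_frames, prev_pts_per_cam):
    """Run KLT for all cameras in parallel threads (one thread per camera)."""
    with ThreadPoolExecutor(max_workers=len(curr_frames)) as pool:
        return list(pool.map(track_camera, prev_frames, curr_frames,
                             prev_pts_per_cam))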

To obtain a high-precision SLAM or odometry system, a suitable camera model is important for the projection relationship. Normally, a monocular camera is associated with a world coordinate system through an extrinsic transformation, as shown in Figure 4. Figure 4 depicts the relation between the world coordinates, the five cameras of the Ladybug, and the center of the panoramic camera. In the case of a multi-camera panoramic system, the world coordinate system is attached to the panoramic coordinate system through a fixed rigid transformation from each camera to the panoramic sphere. Based on equations (2) and (3), the multi-camera panoramic space perception model can be derived, which shows the relationship between the 3D point p_w in space, the 2D point p_i in pixel coordinates, and the 3D point p_s in the sphere coordinate system, similar to [17].
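For reference, a plausible form of these projection relations, written from the definitions in the surrounding text (the exact form used in the original paper may differ), is:

% Hypothetical reconstruction of equations (2)-(4); notation follows the text.
\begin{align}
  p_i &= K_k\!\left( R_k\, p_w + T_k \right), \qquad k = 0,1,2,3,4, \tag{2}\\
  p_s &= s \left( R_i\, K_i(p_i) + T_i \right), \qquad i = 0,1,2,3,4, \tag{3}\\
  p_s &\leftarrow r \, \frac{p_s}{\lVert p_s \rVert}. \tag{4}
\end{align}

Here K_k(·) in (2) denotes the projection of lens k onto its image plane, while K_i(·) in (3) is the calibration function that maps distorted pixels to rectified ray coordinates, as explained next.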
Equation (2) shows the general procedure for projecting a 3D world point onto the corresponding image plane with known extrinsic calibration parameters. R_k is the rotation and T_k the translation obtained from the extrinsic calibration of each lens of the Ladybug camera, where k = 0, 1, 2, 3, 4 indexes the five cameras. Similarly, the projection of a 2D image point to its 3D spherical coordinates can be performed as follows: the 2D image points are projected onto the 3D sphere by equation (3) with known calibration parameters from each camera lens to the sphere. R_i and T_i are the rotation and translation parameters for each camera, and s is a scale factor. K_i(x) is the calibration function used to project 2D distorted points to their rectified coordinates; again, i = 0, 1, 2, 3, 4 indexes the five cameras of the Ladybug system. Equation (4) constrains the panoramic point to a fixed sphere of radius r.

The camera model used in this work represents a multi-camera rig with a fixed slight offset between each fisheye camera center C and the panoramic camera center S, as shown in Figure 3. The multi-camera system is built from five separate fisheye lenses, each with a slight fixed offset between its camera center and the panoramic camera center. The procedure in equations (3) and (4) expresses the collinearity between the camera center, the pixel coordinate, and the panoramic center. Figure 5 shows the images obtained from the omnidirectional camera, together with the results of image rectification and panoramic image generation from the distorted camera images. The top row shows the raw images covering a 360° horizontal view, while the rectified camera images are shown in the middle row and projected into a panoramic image. The raw images have lens distortion that needs to be corrected before pose estimation. However, our algorithm extracts features from the raw image and then undistorts and rectifies the keypoints before sphere projection. The bottom row shows the rectified images obtained through this process.

The feature prioritization module's job is to budget the number of features for tracking. The large amount of distortion present in the images also complicates KLT-based feature tracking and matching; therefore, outliers are identified and removed in the feature prioritization module. The limited number of selected features is projected onto the sphere based on the spherical camera geometry developed in Section III. The pose is then estimated with the monocular technique in a spherical coordinate system using the 8-point algorithm, and refined based on the triangulated points obtained from the previous motion. The open-source library g2o [32] is used to carry out the optimization tasks. Because the triangulated points from all the cameras are still numerous enough to increase the computational complexity of the algorithm, multi-view optimization is adopted based on the camera with the highest number of triangulated features. The detailed procedure of the proposed algorithm is explained in the next sections.
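As an illustration of the feature prioritization and sphere projection just described, the sketch below budgets each camera's tracks with RANSAC-based outlier rejection and back-projects the surviving undistorted keypoints onto the panoramic sphere; the budget value, subsampling rule, and function names are assumptions for illustration, not the paper's exact scheme.

import cv2
import numpy as np

def prioritize_features(prev_pts, curr_pts, budget=200):
    """Reject outlier tracks with RANSAC, then subsample down to the budget."""
    F, mask = cv2.findFundamentalMat(prev_pts, curr_pts, cv2.FM_RANSAC, 1.0, 0.999)
    if mask is None:                        # degenerate geometry: keep nothing
        return prev_pts[:0], curr_pts[:0]
    ok = mask.ravel().astype(bool)
    prev_in, curr_in = prev_pts[ok], curr_pts[ok]
    if len(prev_in) > budget:               # uniform subsampling as a simple
        keep = np.linspace(0, len(prev_in) - 1, budget).astype(int)  # stand-in rule
        prev_in, curr_in = prev_in[keep], curr_in[keep]
    return prev_in, curr_in

def to_sphere(pts_px, K, R_cs, t_cs, radius=1.0):
    """Back-project pixels onto the panoramic sphere (cf. equations (3)-(4))."""
    rays = cv2.undistortPoints(pts_px.reshape(-1, 1, 2).astype(np.float32),
                               K, None).reshape(-1, 2)
    rays = np.hstack([rays, np.ones((len(rays), 1))])   # normalized viewing rays
    p_s = rays @ R_cs.T + t_cs                           # camera -> sphere frame
    return radius * p_s / np.linalg.norm(p_s, axis=1, keepdims=True)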
where s is the similarity score according to which features are selected or rejected. The fundamental matrix is decomposed by singular value decomposition (SVD), which can be written as E = UΣV^T, where U and V are orthogonal matrices and Σ is the matrix of singular values. The decomposition of the essential matrix then leads to the following solutions for the rotation and translation vectors.
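A minimal sketch of this decomposition follows, assuming the standard SVD-based recovery of the four candidate (R, t) pairs from the essential matrix; the correct pair is normally chosen afterward by a positive-depth (cheirality) check on triangulated points.

import numpy as np

def decompose_essential(E):
    """Decompose E = U diag(s, s, 0) V^T into four candidate (R, t) pairs."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:        # enforce proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                      # translation known only up to scale/sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]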
[Algorithm: multi-view pose refinement; Return: refined pose]

A detailed experiment has been performed with self-collected data in order to validate the proposed system. The proposed method uses an omnidirectional camera with five individual cameras placed in a horizontal row, with fixed baselines from each other and from the sphere center. 1000 FAST features are extracted from each of the five images in parallel threads. To evaluate the proposed system, extensive experiments have been conducted on a real dataset. RTK GPS is used to quantitatively measure the accuracy by computing the translation Root Mean Squared Error (RMSE) between the estimated poses and the ground truth. The trajectories are aligned using the Horn method [35], and two error metrics are computed similarly to [36]: the RMSE of the Absolute Translation Error (ATE) and the Rotation Error (RE) with yaw-only orientation. The RMSE of the ATE can be calculated as shown in equation (11):

\mathrm{RMSE}_{\mathrm{ATE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert t_{e,i} - t_{g,i} \rVert^{2}} \tag{11}
where t_{e,i} and t_{g,i} are the estimated and ground-truth translation vectors, respectively. The total number of observations is denoted by n, while i denotes the i-th observation. Besides the RMSE, the minimum, maximum, and standard deviation of the error are also calculated.
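For illustration, the following is a minimal sketch of this evaluation, assuming a closed-form rigid alignment in the spirit of Horn's method (here the Kabsch/SVD variant) followed by the error statistics of equation (11); the function names are illustrative.

import numpy as np

def horn_align(est, gt):
    """Closed-form rigid alignment of the estimated trajectory onto ground truth."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    t = mu_g - R @ mu_e
    return est @ R.T + t

def ate_stats(est, gt):
    """RMSE (equation (11)), min, max, and std of the absolute translation error."""
    err = np.linalg.norm(horn_align(est, gt) - gt, axis=1)
    return np.sqrt(np.mean(err ** 2)), err.min(), err.max(), err.std()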

To evaluate the overall performance of the proposed method, we conducted several experiments with different camera settings. The study shows the advantage of adding more cameras for pose estimation accuracy and the ability of the system to perform in challenging situations. Three different combinations are used, namely cam2 (2 cameras), Trifocular (3 cameras), and the 5-camera setting named OmniVO. For each experiment, 1000 features per frame are extracted. For each camera combination, only the related features are tracked and used for pose estimation and optimization. The resulting output trajectories with the different combinations of cameras are shown in Figures 8 and 9. The output trajectories of all combinations, together with the reference ground-truth trajectory, are shown in Figure 8, where Figures 6(a), 6(b), and 6(c) show the resulting trajectories for the three different datasets. It can be seen from the figure that adding features from different viewing directions makes the system reliable for tracking and increases pose estimation accuracy in complex situations.

Sequence 1 of the experiments is recorded in an outdoor parking environment, which normally contains a sufficient number of features due to the large buildings and trees surrounding it; in the case of sharp turns, however, a smaller number of features are successfully tracked due to less structural information at the turns, where the road region and open space cover the majority of the area in the image. Therefore, adding structural information distributed across all the camera views considerably improves the position accuracy. Sequence 3 is recorded from the main road and shows small differences for all combinations. However, in the case of bumps and other lighting problems, the multiple-camera setups improve the positioning accuracy.

The output trajectory of OmniVO with reference to the ground truth is shown in Figure 9 for all three sequences. The front-view camera is selected to run ORBSLAM-mono and Viso2-mono. The images are rectified and distortions are corrected for both algorithms. The qualitative comparison of the proposed method (OmniVO), ORBSLAM-mono, and Viso2-mono against the GNSS ground-truth trajectories is shown in Figures 11, 12, and 13 for the three sequences, respectively.

As shown by the qualitative results, the proposed algorithm with the omnidirectional camera setup performs better in each of the selected scenarios. In the case of sequence 1, the proposed method has slightly better results than ORBSLAM-mono and Viso2-mono. ORBSLAM-mono failed in the case of a very sharp turn due to a lack of reliable feature tracking. Similarly, Viso2-mono is able to track some features, but due to fewer matches its pose estimation is not very accurate, so it slightly drifts away from the actual path, as shown in Figure 11. The quantitative evaluation reports the translation error and rotation error (RE) by computing the RMSE, minimum, and standard deviation. Moreover, the number of successfully tracked frames is also reported to check the tracking failure of each algorithm. In terms of recovered frames, the proposed method is able to track all frames over the entire sequence, while ORBSLAM-mono fails to recover all frames. Similarly, the proposed method has higher positioning and orientation accuracy in comparison with state-of-the-art single-camera methods for sequence 1 and sequence 2: because the environment is slightly feature-less and contains very sharp turns, single-camera methods are not able to provide reasonable performance. In the case of sequence 3, Viso2-mono and ORBSLAM-mono provide results similar to the proposed method, as this data is recorded from the main road with obvious feature information.

The translation error for each method is shown in Figure 14. The proposed algorithm with the multi-camera setup performs better than ORBSLAM-mono and Viso2-mono for the first two sequences, while the position error of the proposed method and ORBSLAM-mono is almost similar for the third sequence; however, the accuracy remains better than Viso2-mono, as shown in Figures 15(b) and 15(c).

The proposed method provides an efficient way of utilizing the multi-camera system with a spherical camera model to estimate the camera pose. The presented results show that the proposed method can alleviate the problems posed by a single camera. Furthermore, feature prioritization can significantly improve the computational performance of the system. The proposed method can efficiently estimate the motion of the multi-camera system but cannot provide accurate real-scale information, as the overlapping region between cameras is very small. Therefore, a 3D LiDAR can be used to provide real-scale depth for the visual information, and a tightly coupled approach with another sensor could be a good choice for a future implementation. Furthermore, without mapping, loop closures, and back-end optimization, the odometry drifts from the actual path over longer trajectories. Therefore, the integration of the proposed method into a visual-LiDAR SLAM algorithm for an omnidirectional camera system will be considered.

In this paper, aiming to provide a robust visual odometry system for an omnidirectional multi-camera setup, a method is proposed that can significantly improve the performance of visual odometry. Firstly, features are detected in each view in a parallel thread and tracked with the help of the KLT algorithm. An outlier removal algorithm is proposed to remove wrongly tracked features after KLT tracking. The features are prioritized to reduce the computational complexity of the pose estimation inside a RANSAC loop. The pose is estimated with the 8-point algorithm by using the spherical camera model. Finally, the inlier landmarks are obtained from all the cameras and the pose is optimized using the camera with the largest set of inliers. Most importantly, the hardware setup is developed and the dataset is recorded from the outdoor environment. The detailed experimental results show that the proposed method can effectively improve the result by approximately 40% in comparison with state-of-the-art methods. To address the real-scale issue, integrating a 3D LiDAR will be considered in a future implementation.