
XVO: Generalized Visual Odometry via Cross-Modal Self-Training



Abstract:

We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find the audio prediction task to significantly enhance the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France


1. Introduction

Monocular Visual Odometry (VO) methods for recovering ego-motion from a sequence of images have mostly been studied within a restricted scope, where a single dataset, such as KITTI [28], may be used for both training and evaluation under a fixed pre-calibrated camera [37], [45], [54], [77], [109], [112], [116], [124], [126], [128]. However, very few studies have analyzed the task of generalized VO, i.e., relative pose estimation with real-world scale across differing scenes and capture setups.
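
To make the setup concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of a direct VO regression network with one auxiliary audio-prediction head and a single teacher-to-student pseudo-label update, assuming PyTorch; all layer sizes, loss weights, tensor shapes, and the audio-feature target are illustrative placeholders rather than the paper's actual architecture or training recipe.

import torch
import torch.nn as nn

class DirectVONet(nn.Module):
    # Shared encoder over a stacked RGB frame pair; the pose head regresses
    # 6-DoF relative motion with metric scale, and the audio head stands in
    # for the cross-modal auxiliary supervision (segmentation/flow/depth/audio)
    # described in the abstract. Layer sizes are placeholders.
    def __init__(self, aux_audio_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.pose_head = nn.Linear(64, 6)               # [rx, ry, rz, tx, ty, tz]
        self.audio_head = nn.Linear(64, aux_audio_dim)  # auxiliary prediction

    def forward(self, frame_pair):
        feat = self.encoder(frame_pair)
        return self.pose_head(feat), self.audio_head(feat)

def self_training_step(teacher, student, frames, audio_target, optimizer, aux_weight=0.1):
    # One pseudo-label update: the frozen teacher labels unconstrained video,
    # and the student fits those labels plus the auxiliary audio target.
    with torch.no_grad():
        pseudo_pose, _ = teacher(frames)
    pred_pose, pred_audio = student(frames)
    loss = nn.functional.mse_loss(pred_pose, pseudo_pose) \
        + aux_weight * nn.functional.mse_loss(pred_audio, audio_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for dash-camera clips and
# precomputed audio features.
teacher, student = DirectVONet(), DirectVONet()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
frames = torch.randn(2, 6, 128, 416)   # batch of 2 stacked frame pairs
audio = torch.randn(2, 64)             # hypothetical audio-feature targets
print(self_training_step(teacher, student, frames, audio, optimizer))

In the paper's framing, the teacher is first trained with supervision and the auxiliary tasks help regularize noisy pseudo-labels from heterogeneous YouTube video; the specific heads, loss weighting, and frame stacking shown above are assumptions made only for illustration.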
