High Quality Depth Estimation from Monocular Images Based on Depth Prediction and Enhancement Sub-Networks


Abstract:

This paper addresses the problem of depth estimation from a single RGB image. Previous methods mainly focus on the problems of depth prediction accuracy and output depth resolution, but few of them tackle both problems well. Here, we present a novel depth estimation framework based on a deep convolutional neural network (CNN) to learn the mapping between monocular images and depth maps. The proposed architecture can be divided into two components, i.e., depth prediction and depth enhancement sub-networks. We first design a depth prediction network based on the ResNet architecture to infer the scene depth from the color image. Then, a depth enhancement network is concatenated to the end of the depth prediction network to obtain a high-resolution depth map. Experimental results show that the proposed method outperforms other methods on benchmark RGB-D datasets and achieves state-of-the-art performance.
Date of Conference: 23-27 July 2018
Date Added to IEEE Xplore: 11 October 2018
ISBN Information:

ISSN Information:

Conference Location: San Diego, CA, USA

1. Introduction

Acquiring depth information of real scenes is a non-trivial task for many applications, such as semantic labeling, pose estimation [1], 3D modeling [2], etc. While high-quality texture information can be easily captured by popular color cameras, the acquisition of depth information remains a challenging task in real conditions. Traditional methods of depth acquisition mainly rely on stereo matching techniques [4] or on specialized depth-sensing apparatus [5]. Stereo matching uses image correspondence matching and triangulation to compute depth from two-view images captured by calibrated binocular camera systems, whereas depth sensors, e.g., Time-of-Flight cameras and the Microsoft Kinect, use an active mechanism to acquire scene depth directly (some post-processing techniques [6] are employed to obtain a high-quality depth map). These methods can achieve relatively satisfactory results, but are extremely dependent on the capturing apparatus. Hence, it is essential to develop a method that estimates scene depth by exploiting monocular cues in scenarios where direct depth sensing is unavailable or impractical. It is worth noting that, in the absence of geometric assumptions about the scene, depth estimation from a color image of a generic scene is severely ill-posed due to the inherent ambiguity of mapping a color measurement to a depth value (Fig. 1).
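The triangulation step used by stereo matching, as described above, reduces to a simple relation for a rectified, calibrated binocular rig: depth Z = f * B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the disparity of a matched pixel pair. A minimal sketch of this relation follows; the focal length, baseline, and disparity values are hypothetical and not taken from the paper.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth from pixel disparity for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: f = 700 px, baseline B = 0.12 m, matched disparity d = 35 px
z = depth_from_disparity(35.0, focal_px=700.0, baseline_m=0.12)  # -> 2.4 m
```

The inverse relation between disparity and depth also explains why stereo accuracy degrades with distance: a one-pixel disparity error causes a depth error that grows quadratically with Z.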

Fig. 1. Depth estimation example. (a) Color image; (b) ground-truth (gt) depth map; results obtained by (c) Laina et al. [3] and (d) ours.
