1. Introduction
Acquiring depth information of real scenes is a non-trivial task for many applications, such as semantic labeling, pose estimation [1], and 3D modeling [2]. While high-quality texture information can be easily captured by popular color cameras, acquiring depth information remains challenging in real conditions. Traditional methods of depth acquisition mainly rely on stereo matching techniques [4] or specialized depth-sensing apparatus [5]. Stereo matching computes depth through image correspondence matching and triangulation from two-view images captured by calibrated binocular camera systems, whereas depth sensors, e.g., Time-of-Flight cameras and the Microsoft Kinect, use active sensing mechanisms to acquire scene depth directly (postprocessing techniques [6] are often employed to obtain a high-quality depth map). These methods can achieve relatively satisfactory results, but they depend heavily on the capturing apparatus. Hence, it is essential to develop a method that estimates scene depth by exploiting monocular cues in scenarios where direct depth sensing is unavailable or impossible. It is worth noting that, in the absence of geometric assumptions about the scene, depth estimation from a single color image is severely ill-posed due to the inherent ambiguity of mapping a color measurement to a depth value (Fig. 1).
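To make the triangulation step above concrete, the following is a minimal sketch of classical two-view depth recovery using OpenCV's block matcher; the focal length f, baseline B, and input file names are hypothetical placeholders rather than values from any cited system, and a calibrated, rectified stereo pair is assumed.

```python
import cv2
import numpy as np

# Placeholder calibration values (assumptions, not from any cited system):
# f = focal length in pixels, B = stereo baseline in meters.
f, B = 700.0, 0.1

# Assumes a rectified grayscale stereo pair on disk (hypothetical file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching searches for per-pixel correspondences along epipolar lines.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
# StereoBM returns fixed-point disparities scaled by 16; convert to pixels.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Triangulation: depth is inversely proportional to disparity, Z = f * B / d.
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * B / disparity[valid]
```

Because depth varies inversely with disparity, small matching errors at distant (low-disparity) points produce large depth errors, which is one reason such pipelines remain sensitive to scene texture and calibration quality.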
Fig. 1. Depth estimation example. (a) Color image; (b) ground-truth (GT) depth map; results obtained by (c) Laina et al. [3] and (d) ours.