I. Introduction and Literature Review
Multi-view stereo (MVS) aims to recover a dense 3-D representation of a scene from calibrated 2-D images captured from multiple (more than two) views, using stereo correspondences as the main cue; the task is essentially equivalent to solving for pixel correspondences across the multi-view images. Recently, learning-based MVS approaches [1], [2], [3], [4], [5], [6], [7], [8], [9] have significantly outperformed their traditional counterparts on MVS benchmarks [10], [11], [12], [13]. Deep MVS approaches decompose MVS into a two-stage process: learning-based depth map estimation, followed by depth map filtering and fusion. Compared to the handcrafted photometric measures used in traditional approaches, deep MVS approaches encode scene cues, such as reflective priors and illumination changes, into the network by adopting powerful feature extraction and cost volume representations, thereby achieving superior reconstruction accuracy and completeness. Despite the superiority of learning-based MVS approaches, the following improvements can be made to further boost the overall reconstruction quality.