I. Introduction
Estimating 6D object poses, i.e., the 3D translations and 3D orientations of objects with respect to the camera, is crucial for a variety of real-world applications ranging from robotic navigation and manipulation to augmented and virtual reality. Most existing works address instance-level 6D pose estimation [1]–[7], where a set of 3D CAD models of known instances is given as a prior. The problem is thereby reduced to finding sparse or dense correspondences between a target object and a prior 3D model. Although 3D CAD models are available in some industrial applications, such as parts assembly, this requirement significantly limits many practical robotic applications, since it can be expensive or even impossible to acquire 3D CAD models of all the objects in an environment.