I. Introduction
Recent years have witnessed the emergence of 3D scene understanding technologies, which enables robots to understand the geometric, semantic and cognitive properties of real-world scenes, so as to assist robot decision making. However, 3D scene understanding remains challenging due to the following problems: 1) Holistic understanding requires many sub-problems to be addressed, such as semantic label assignment [2], object bounding box localization [3] and room structure boundary extraction [1] etc. However, current methods solve these tasks with separate models, which is expensive in terms of storage and computation. 2) The physical commonsense [5] like gravity [6] or interference [7] between different tasks are ignored and potentially violated, producing geometrically implausible results.