I. Introduction
Visual perception and understanding of the surrounding environment are crucial for building reliable robotic systems such as Simultaneous Localization and Mapping (SLAM) and Advanced Driver Assistance Systems (ADAS). Because perception provides an ego-frame agent with detailed local features and structural information, various perception approaches have been actively studied, including 3D detection and semantic segmentation. As a latent representation for these tasks, Bird's Eye View (BEV) space has frequently been employed. BEV space is free from the distortions of homogeneous coordinate systems and categorizes object shapes into a few classes; it thus provides a robust representation of elements in 3D space, including cars, buildings, pedestrians, and large-scale scenes.