Abstract:
In embodied visual room rearrangement, an agent must restore a scene to its goal state by interacting with the environment based on egocentric visual observations, after the locations and states of some objects have been changed. The task has important application potential in robotics and is challenging in terms of visual perception, scene understanding, and action execution. Existing methods do not take full advantage of the semantic information and spatial relationships among objects during scene perception and understanding. To address these challenges and shortcomings, we build a hierarchical decision framework based on a pretrained semantic scene representation and a transformer-based scene memory. Results in unseen scenes demonstrate the effectiveness of the proposed model compared with other methods.
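To make the "transformer-based scene memory" idea concrete, here is a minimal illustrative sketch, not the paper's implementation: object observations are stored as embedding vectors, and a query (e.g. the current egocentric feature) reads the memory via scaled dot-product attention. All names, dimensions, and the write/read interface are assumptions for illustration only.

```python
import numpy as np

class SceneMemory:
    """Hypothetical sketch of a transformer-style scene memory.

    Each observed object contributes one embedding slot; a query vector
    reads the memory with scaled dot-product attention. This is an
    illustration of the general mechanism, not the paper's model.
    """

    def __init__(self, dim):
        self.dim = dim
        self.slots = []  # one embedding per observed object

    def write(self, embedding):
        # Append an object embedding to the memory.
        self.slots.append(np.asarray(embedding, dtype=float))

    def read(self, query):
        # Scaled dot-product attention over all stored slots.
        keys = np.stack(self.slots)                      # (n, dim)
        scores = keys @ np.asarray(query, dtype=float) / np.sqrt(self.dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                         # softmax
        return weights @ keys                            # weighted readout

# Usage: the readout is biased toward slots similar to the query.
mem = SceneMemory(dim=4)
mem.write([1.0, 0.0, 0.0, 0.0])
mem.write([0.0, 1.0, 0.0, 0.0])
out = mem.read([1.0, 0.0, 0.0, 0.0])
```

In a rearrangement agent, such a readout could condition a downstream policy on previously seen objects; the hierarchical decision framework described in the abstract would sit above this memory.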
Published in: IEEE Transactions on Neural Networks and Learning Systems (Early Access)