Embodied Scene Understanding for Vision Language Models via MetaVQA | IEEE Conference Publication | IEEE Xplore