Vision-language model-driven scene understanding and robotic object manipulation | IEEE Conference Publication | IEEE Xplore