Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-Training and Multi-Modal Tokens | IEEE Conference Publication | IEEE Xplore