Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-Training and Multi-Modal Tokens

IEEE Conference Publication | IEEE Xplore
