Vision-Text Cross-Modal Fusion for Accurate Video Captioning | IEEE Journals & Magazine | IEEE Xplore