VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs | IEEE Conference Publication | IEEE Xplore