Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog | IEEE Journals & Magazine | IEEE Xplore