Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation | IEEE Conference Publication | IEEE Xplore