Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions | IEEE Conference Publication | IEEE Xplore