Cross Modal Video Representations for Weakly Supervised Active Speaker Localization