Modeling Long-Term Multimodal Representations for Active Speaker Detection With Spatio-Positional Encoder | IEEE Journals & Magazine | IEEE Xplore