1. Introduction
Temporal reasoning over multi-modal data plays a significant role in human perception across diverse environmental conditions [10], [37]. Grounding the multi-modal context is critical to many current and future tasks of interest that guide research efforts in this space, e.g., embodied perception of automated agents [29], [4], [8], human-robot interaction with multi-sensor guidance [25], [6], [2], and active sound source localization [34], [22], [18], [27]. Similarly, audio-visual event (AVE) localization demands complex multi-modal correspondence grounded in joint audio-visual perception [24], [7]. The simultaneous presence of audio and visual cues in a video frame denotes an audio-visual event. As shown in Fig. 1, the person's speech is audible in all of the frames; however, the speaker is visible in only a few particular frames, which constitute the AVE. Precise detection of such events therefore depends heavily on a contextual understanding of the multi-modal features across the video frames.
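To make this definition concrete, the sketch below illustrates how a segment-level AVE label can be viewed as the conjunction of audio and visual presence indicators; the presence flags and segment counts here are hypothetical illustration data, not output of any cited method.

```python
# Minimal sketch of the AVE definition: an audio-visual event is marked only
# where the sounding source is both audible AND visible in the same segment.
# The flags below are assumed example values for a video like Fig. 1.

audio_present  = [1, 1, 1, 1, 1, 1, 1, 1]  # speech is audible in every segment
visual_present = [0, 0, 1, 1, 1, 0, 0, 0]  # the speaker is on-screen only in segments 2-4

# An AVE segment requires simultaneous audio and visual evidence.
ave_label = [a & v for a, v in zip(audio_present, visual_present)]
print(ave_label)  # -> [0, 0, 1, 1, 1, 0, 0, 0]; segments 2-4 constitute the AVE
```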