
AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization


Abstract:

An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory signals in a video segment. Precise localization of AVEs is very challenging since it demands effective multi-modal feature correspondence to ground both short- and long-range temporal interactions. Existing approaches struggle to capture the different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to effectively operate on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP, pre-trained with audio-image pairs, into the AVE localization task on video frames through contrastive fine-tuning, effective mean video feature extraction, and multi-scale training phases. (2) We propose a multi-domain attention mechanism that operates on both the temporal and feature domains over varying timescales to fuse local and global feature variations. (3) We introduce a temporal refining scheme with event-guided attention followed by a simple-yet-effective post-processing step to handle significant variations of the background over diverse events. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% improvement in mean accuracy, demonstrating its superiority over existing approaches.
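
To make the multi-window idea concrete, below is a minimal PyTorch sketch of temporal self-attention applied within local windows of several sizes plus a global window, with the results fused per segment. The window sizes, feature dimension, and module names are illustrative assumptions for exposition, not the authors' exact architecture or configuration.

```python
# Illustrative sketch of multi-window temporal self-attention over fused
# audio-visual segment features (assumed shapes and window sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiWindowTemporalAttention(nn.Module):
    """Self-attention within local windows of several sizes plus one
    global (full-sequence) window, fused by a linear projection."""

    def __init__(self, dim=256, heads=4, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.local_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in window_sizes
        )
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(dim * (len(window_sizes) + 1), dim)

    def forward(self, x):                       # x: (batch, T, dim)
        b, t, d = x.shape
        outs = []
        for w, attn in zip(self.window_sizes, self.local_attn):
            pad = (-t) % w                      # pad T to a multiple of w
            xp = F.pad(x, (0, 0, 0, pad))
            chunks = xp.reshape(b * (xp.shape[1] // w), w, d)
            y, _ = attn(chunks, chunks, chunks)  # attention inside each window
            outs.append(y.reshape(b, -1, d)[:, :t])
        g, _ = self.global_attn(x, x, x)         # attention over the full sequence
        outs.append(g)
        return self.fuse(torch.cat(outs, dim=-1))  # (batch, T, dim)


# Example: 10 one-second segments of 256-d fused audio-visual features.
feats = torch.randn(2, 10, 256)
print(MultiWindowTemporalAttention()(feats).shape)  # torch.Size([2, 10, 256])
```

The design choice captured here is that short windows model local feature variations while the global window models long-range context; the actual AVE-CLIP fusion additionally attends over the feature domain and applies event-guided refinement, which this sketch omits.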
Date of Conference: 02-07 January 2023
Date Added to IEEE Xplore: 06 February 2023
Conference Location: Waikoloa, HI, USA


1. Introduction

Temporal reasoning over multi-modal data plays a significant role in human perception in diverse environmental conditions [10], [37]. Grounding the multi-modal context is critical to current and future tasks of interest, especially those that guide current research efforts in this space, e.g., embodied perception of automated agents [29], [4], [8], human-robot interaction with multi-sensor guidance [25], [6], [2], and active sound source localization [34], [22], [18], [27]. Similarly, audio-visual event (AVE) localization demands complex multi-modal correspondence grounded in audio-visual perception [24], [7]. The simultaneous presence of corresponding audio and visual cues in a video frame denotes an audio-visual event. As shown in Fig. 1, the speech of the person is audible in all of the frames; however, the individual speaking is visible in only a few particular frames, which constitute the AVE. Precise detection of such events greatly depends on a contextual understanding of the multi-modal features across the video frames, as illustrated by the toy example below.
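
As a toy illustration of this definition (not taken from the paper), segment-level AVE labels mark only those segments where the sound source is both audible and visible:

```python
# Toy example: per-segment AVE labels for the Fig. 1 scenario, where speech is
# heard throughout but the speaker is on screen only in a few segments.
audible = [1, 1, 1, 1, 1, 1]   # speech heard in every 1 s segment
visible = [0, 0, 1, 1, 0, 0]   # speaker visible only in segments 2-3
ave_labels = [a & v for a, v in zip(audible, visible)]
print(ave_labels)  # [0, 0, 1, 1, 0, 0] -> the AVE spans segments 2-3
```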
