Abstract:
We address audio-visual event (AVE) localization, the task of locating the video segments that contain an AVE and identifying their event categories. Because different event-relevant video segments often describe different aspects of an AVE, they can complement each other. However, current approaches model AVE localization as a sequential classification process, in which event-relevant video segments cannot collaborate accurately with each other. We therefore propose Collaborative Segments Decision (CSD), which enables collaboration among event-relevant video segments by modeling AVE localization as a sequential decision process. In addition, to realize collaboration between cross-modal features, we propose Consistent Feature Propagation (CFP), which exploits their consistency over time. Combining these two components yields the Collaborative Decision Network (CDN). Experimental results show that CDN outperforms baseline methods in both fully and weakly supervised settings.
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023