I. Introduction
Research on the audio-visual event (AVE) localization task has shown that joint audio-visual representations can facilitate the understanding of unconstrained videos [1], [2], [3], [4], [5], [6]. Specifically, an AVE is defined as an event that is both audible and visible at the same time (as shown in Fig. 1). AVE localization requires a model to identify the category of an AVE and to localize its temporal boundaries. This task is essential for understanding, recommending, and searching video content, especially on short-video platforms.
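Concretely, the task is commonly cast in a segment-level form; the following is a minimal sketch of that formulation, with notation ($T$, $C$, $a_t$, $v_t$, $y_t$) chosen here for illustration rather than taken from a specific prior work. A video is divided into $T$ non-overlapping temporal segments $\{(a_t, v_t)\}_{t=1}^{T}$, where $a_t$ and $v_t$ denote the audio and visual streams of segment $t$, and the model predicts a per-segment label
\[
  y_t \in \{1, \dots, C\} \cup \{\mathrm{background}\},
\]
assigning each segment one of $C$ event categories when the event is simultaneously audible and visible in that segment, and background otherwise. The temporal boundary of an AVE then corresponds to a maximal run of consecutive segments sharing the same non-background label.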