VideoXum: Cross-Modal Visual and Textural Summarization of Videos


Abstract:

Video summarization aims to distill the most important information from a source video into either an abridged video clip or a textual narrative. Existing methods often treat the generation of video and text summaries as independent tasks, thus neglecting the semantic correlation between visual and textual summarization. In other words, these methods produce output in a single modality rather than coherent video and text outputs together. In this work, we first introduce a novel task: cross-modal video summarization. This task seeks to transform a long video into a condensed video clip and a semantically aligned textual summary, collectively referred to as a cross-modal summary. We then establish VideoXum (X refers to different modalities), a new large-scale human-annotated video benchmark for cross-modal video summarization. VideoXum is reannotated based on ActivityNet Captions with diverse open-domain videos. In the current version, VideoXum provides 14K long videos, with a total of 140K pairs of aligned video and text summaries. Compared to existing datasets, VideoXum offers superior scalability while preserving a comparable level of annotation quality. To validate the dataset's quality, we provide a comprehensive analysis of VideoXum, comparing it with existing datasets. Further, we perform an extensive empirical evaluation of several state-of-the-art methods on this dataset. Our findings highlight the impressive generalization capability of the vision-language encoder-decoder framework on VideoXum. In particular, we propose VTSUM-BLIP, an end-to-end framework, serving as a strong baseline for this novel benchmark. Moreover, we adapt CLIPScore for VideoXum to measure the semantic consistency of cross-modal summaries effectively.
Published in: IEEE Transactions on Multimedia ( Volume: 26)
Page(s): 5548 - 5560
Date of Publication: 29 November 2023
