Abstract:
In general, videos excel at recording physical patterns (e.g., spatial layout), while texts excel at describing abstract symbols (e.g., emotion). When video and text are used together in multi-modal tasks, they are regarded as complementary, and their distinct information is considered crucial. However, in cross-modal tasks (e.g., retrieval), existing works typically exploit only their common part through common space learning, while their distinct information is discarded. In this paper, we argue that distinct information is also beneficial for cross-modal retrieval. To this end, we propose a divide-and-conquer learning approach, namely Complementarity-aware Space Learning (CSL), which recasts the challenge as the learning of two spaces (i.e., a latent space and a symbolic space) that jointly exploit the common and distinct information of the two modalities, in keeping with their complementary character. Specifically, we first learn a symbolic space from video with a memory-based video encoder and a symbolic generator. Conversely, we learn a latent space from text with a text encoder and a memory-based latent feature selector. Finally, we propose a complementarity-aware loss that integrates the two spaces to facilitate video-text retrieval. Extensive experiments show that our approach outperforms existing state-of-the-art methods by 5.1%, 2.1%, and 0.9% in R@10 for text-to-video retrieval on three benchmarks, respectively. An ablation study further verifies that the distinct information from video and text improves retrieval performance. Trained models and source code have been released at https://github.com/NovaMind-Z/CSL.
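For readers who want a concrete picture of the two-space design described in the abstract, the sketch below shows one plausible PyTorch realization of a latent (common) branch plus a symbolic branch, combined by a weighted loss. It is an illustrative assumption, not the authors' released implementation (available at the GitHub link above); all names here (TwoSpaceRetrieval, complementarity_aware_loss) are hypothetical, and the memory-based encoder/selector components are abstracted away as simple linear projections.

```python
# Minimal sketch of a two-space (latent + symbolic) retrieval model.
# NOT the authors' code; module names and loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoSpaceRetrieval(nn.Module):
    """Toy two-branch model: a shared latent space plus a symbolic space."""

    def __init__(self, video_dim=2048, text_dim=768, latent_dim=512, vocab_size=1000):
        super().__init__()
        # Latent-space branch: project both modalities into a common embedding.
        self.video_to_latent = nn.Linear(video_dim, latent_dim)
        self.text_to_latent = nn.Linear(text_dim, latent_dim)
        # Symbolic-space branch: predict symbol (e.g., concept/word)
        # probabilities from each modality so symbol distributions can be matched.
        self.video_to_symbols = nn.Linear(video_dim, vocab_size)
        self.text_to_symbols = nn.Linear(text_dim, vocab_size)

    def forward(self, video_feat, text_feat):
        v_lat = F.normalize(self.video_to_latent(video_feat), dim=-1)
        t_lat = F.normalize(self.text_to_latent(text_feat), dim=-1)
        v_sym = torch.sigmoid(self.video_to_symbols(video_feat))
        t_sym = torch.sigmoid(self.text_to_symbols(text_feat))
        return v_lat, t_lat, v_sym, t_sym

def complementarity_aware_loss(v_lat, t_lat, v_sym, t_sym, alpha=0.5):
    """Weighted sum of a latent contrastive term and a symbolic match term."""
    # Latent term: symmetric InfoNCE over matched video-text pairs in the batch.
    logits = v_lat @ t_lat.t() / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    latent_loss = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # Symbolic term: match the video's predicted symbol distribution to the text's.
    symbolic_loss = F.binary_cross_entropy(v_sym, t_sym.detach())
    return alpha * latent_loss + (1.0 - alpha) * symbolic_loss

# Usage with dummy pooled features (batch of 8 video-text pairs):
model = TwoSpaceRetrieval()
v = torch.randn(8, 2048)  # pooled video features
t = torch.randn(8, 768)   # pooled text features
loss = complementarity_aware_loss(*model(v, t))
```

The design intent this illustrates is the abstract's divide-and-conquer idea: the latent term captures what the modalities share, while the symbolic term lets each modality's distinct information contribute to the match rather than being discarded.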
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 33, Issue: 8, August 2023)
- IEEE Keywords / Index Terms: Learning Spaces, Video-text Retrieval, Latent Space, Common Space, Latent Features, Spatial Layout, Common Information, Distinct Information, Retrieval Performance, Symbolic Space, Video Encoding, Text Encoder, Visual Features, Global Features, Hidden State, Textual Descriptions, Word Embedding, Shared Space, Video Features, Video Captioning, Visual Space, Textual Features, Semantic Correlation, Video Information, Retrieval Results, Symbolic Features, Graph Neural Networks, Training Videos