Abstract:
Temporal Language Grounding (TLG) aims to localize moments in untrimmed videos that are most relevant to natural language queries. While existing weakly-supervised methods have achieved significant success in exploring cross-modal relationships, they still face a critical bottleneck: the interference of task-irrelevant information in query embeddings. To address this issue, we propose TLG Frequency Spiking (TFS), a dimensional mask derived from the frequency domain that models the varying importance specific to different queries. By enhancing the understanding of queries, TFS effectively optimizes the cross-modal alignment of visual and textual modalities. Experimental results show that TFS significantly outperforms state-of-the-art baselines on both the Charades-STA and ActivityNet-Captions datasets.
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025