Abstract:
The video paragraph captioning task aims at generating a fine-grained, coherent, and relevant paragraph for a video. Unlike images, where objects are static, the temporal states of objects change in videos, and this dynamic information contributes to understanding the whole video content. Existing works rarely focus on modeling the dynamically changing states of objects in videos, so the activities occurring in videos are poorly or wrongly depicted in the generated paragraphs. To address this problem, we propose a novel Object State Tracking Network, which captures the temporal state changes of objects. However, because consecutive frames in a video are highly similar, the visual information is redundant and noisy. We therefore further propose a semantic alignment mechanism that enables sentence information to refine the visual information. Extensive experiments on ActivityNet Captions demonstrate the effectiveness of our method.
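To make the core idea concrete, the following is a minimal, hypothetical sketch of tracking a per-object state across frames with a simple recurrent blending update. It is not the paper's Object State Tracking Network (which is learned end-to-end); the function names, the fixed blending weight `alpha`, and the toy feature vectors are all illustrative assumptions.

```python
# Hypothetical sketch, NOT the paper's actual OSTN: each object's "state"
# vector is carried across frames and blended with the current frame's
# object feature, illustrating how temporal state change can be modeled.

def update_state(prev_state, frame_feature, alpha=0.5):
    """Blend the previous object state with the current frame's feature.
    alpha is an illustrative fixed weight; the real network learns this."""
    return [alpha * p + (1 - alpha) * f for p, f in zip(prev_state, frame_feature)]

def track_object(frame_features, dim=3):
    """Run the recurrent update over a sequence of per-frame object features,
    returning the object's state trajectory (one state per frame)."""
    state = [0.0] * dim
    states = []
    for feat in frame_features:
        state = update_state(state, feat)
        states.append(state)
    return states

# Toy per-frame features for one object across three frames.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
trajectory = track_object(frames)
print(trajectory[-1])  # final state mixes information from all frames
```

The point of the sketch is only that the state at frame t depends on all earlier frames, so the trajectory encodes how the object changed over time rather than a single static appearance.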
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023