Abstract:
The Large Vision-Language Model (LVLM) has achieved impressive performance in visual-language understanding. However, its ability to understand longer videos remains limited by the length and informational diversity of multi-modal video, and accurately matching detailed content within videos is still an open research problem. We design a new inference framework for LVLMs, "Tower of Thoughts" (ToT), which extends the "Chain-of-Thought" (CoT) approach to the visual domain and constructs the high-dimensional semantics of the complete video from the bottom up. To enable question answering about video details within a restricted context window, we further propose self-retrieval augmented generation (SRAG), which stores and accesses video text as dense vectors in a non-parametric memory, making it possible to recover details from long videos. Combining ToT with SRAG gives our model cross-modal, high-density semantic fusion together with comprehensive and accurate generation capabilities, yielding well-reasoned video answers. Experiments on public benchmarks demonstrate the effectiveness of the proposed method. We also conduct experiments on open-world multi-modal long videos and achieve remarkable results. These findings provide new perspectives and technical routes for the future development of vision-language models.
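As a reading aid, the sketch below illustrates the retrieval idea the abstract attributes to SRAG: segment-level video text is stored as dense vectors in a non-parametric memory, and the segments most relevant to a question are retrieved to fit the LVLM's limited context window. This is a minimal, assumed illustration, not the paper's implementation; the embed() placeholder, DenseVideoMemory class, and example captions are hypothetical.

```python
import numpy as np


def embed(text: str, dim: int = 128) -> np.ndarray:
    """Toy stand-in for a real text encoder: returns a unit-norm pseudo-embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


class DenseVideoMemory:
    """Non-parametric memory: stores per-segment video text with its dense vector."""

    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, segment_text: str) -> None:
        # Index one segment-level description (e.g. a caption or transcript chunk).
        self.texts.append(segment_text)
        self.vecs.append(embed(segment_text))

    def retrieve(self, question: str, k: int = 3) -> list[str]:
        # Rank stored segments by cosine similarity to the question embedding.
        q = embed(question)
        sims = np.stack(self.vecs) @ q
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]


# Usage: index segment descriptions of a long video, then build a compact prompt
# that fits the context window; the prompt would then be passed to the LVLM.
memory = DenseVideoMemory()
for caption in [
    "0:00-0:30 a chef chops onions in the kitchen",
    "0:30-1:00 the pan briefly catches fire",
    "1:00-1:30 guests arrive and sit at the table",
]:
    memory.add(caption)

context = memory.retrieve("What went wrong while cooking?", k=2)
prompt = "Answer using these video segments:\n" + "\n".join(context)
print(prompt)
```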
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025