Find Details in Long Videos: Tower-of-Thoughts and Self-Retrieval Augmented Generation for Video Understanding | IEEE Conference Publication | IEEE Xplore

Abstract:

Large Vision-Language Models (LVLMs) have achieved impressive performance in vision-language understanding. However, their ability to understand longer videos is still limited by the length and information diversity of multi-modal videos, and accurately matching detailed content within videos remains an open research problem. We design a new framework for LVLM inference, "Tower of Thoughts" (ToT), which extends the "Chain-of-Thought" (CoT) approach to the visual domain and constructs the high-dimensional semantics of a complete video from the bottom up. To answer questions about video details within a restricted context window, we also propose self-retrieval augmented generation (SRAG), which makes it possible to obtain details from long videos by storing and accessing video text as dense vectors in non-parametric memory. Combining ToT with SRAG gives our model cross-modal, high-density semantic fusion together with comprehensive and accurate generation, and thereby yields well-reasoned answers about videos. Experiments on public benchmarks demonstrate the effectiveness of the proposed method. We also conducted experiments on open-world multi-modal long videos and achieved remarkable outcomes. These results provide new perspectives and technical routes for the future development of vision-language models.
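The retrieval idea behind SRAG can be illustrated with a small sketch. The abstract does not specify the encoder, the memory layout, or the prompt format, so everything below is an assumption made for illustration: `VideoTextMemory`, `build_prompt`, the stopword list, and the normalised bag-of-words vectors (a toy stand-in for a learned dense text encoder) are hypothetical names and choices, not the authors' implementation. The sketch stores per-segment video text as vectors outside the model, retrieves the segments most similar to a question, and places only those in a size-limited prompt.

```python
import math
import re

# Small stopword list (an assumption) so function words do not dominate similarity.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "on", "in", "into", "out", "what", "which"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class VideoTextMemory:
    """Hypothetical non-parametric memory over (start_sec, end_sec, text) segments."""

    def __init__(self, captions):
        self.captions = list(captions)
        vocab = sorted({tok for _, _, text in self.captions for tok in tokenize(text)})
        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.vectors = [self.embed(text) for _, _, text in self.captions]

    def embed(self, text):
        """Toy stand-in for a dense encoder: L2-normalised word-count vector."""
        vec = [0.0] * len(self.index)
        for tok in tokenize(text):
            i = self.index.get(tok)
            if i is not None:  # words unseen at build time are ignored
                vec[i] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def retrieve(self, question, k=2):
        """Return the k segments with the highest cosine similarity to the question."""
        q = self.embed(question)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), cap)
            for vec, cap in zip(self.vectors, self.captions)
        ]
        # Stable sort: segments with equal scores keep their temporal order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [cap for _, cap in scored[:k]]


def build_prompt(question, segments, max_chars=500):
    """Pack retrieved segments into a prompt under a character budget,
    a crude proxy for a restricted context window."""
    lines, used = [], 0
    for start, end, text in segments:
        line = f"[{start}s-{end}s] {text}"
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line)
    return "Video notes:\n" + "\n".join(lines) + f"\nQuestion: {question}"


if __name__ == "__main__":
    memory = VideoTextMemory([
        (0, 30, "A man opens the fridge and takes out a carton of milk"),
        (30, 60, "A dog runs across the garden chasing a red ball"),
        (60, 90, "The man pours milk into a glass on the kitchen table"),
    ])
    question = "What colour is the ball the dog chases?"
    print(build_prompt(question, memory.retrieve(question, k=1)))
```

In a real system the word-count vectors would be replaced by embeddings from a trained text encoder and the linear scan by an approximate nearest-neighbour index; the point of the sketch is only that the detail needed for an answer is fetched from external memory rather than held in the context window for the whole video.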
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
Department of Electronic Engineering, Tsinghua University, China
Beijing National Research Center for Information Science and Technology (BNRist), China
Department of Electronic Engineering, Tsinghua University, China
Beijing National Research Center for Information Science and Technology (BNRist), China
Samsung Research China - Beijing (SRC-B)
Department of Electronic Engineering, Tsinghua University, China
Beijing National Research Center for Information Science and Technology (BNRist), China
Department of Electronic Engineering, Tsinghua University, China
Beijing National Research Center for Information Science and Technology (BNRist), China
Department of Electronic Engineering, Tsinghua University, China
Beijing National Research Center for Information Science and Technology (BNRist), China

