Abstract:
In recent years, the rapid advancement of multi-modal large language models has propelled the development of video-based conversation models. Owing to their exceptional video understanding capabilities, these models are often expected to handle all video-related tasks, including action recognition. However, action recognition datasets typically lack semantic information, which limits the performance of dialogue models. Additionally, because these dialogue models are designed for general video understanding, their model architectures and training dataset configurations frequently overlook continuous motion, information that is critical for action recognition. To address these challenges, we first propose a novel two-step mapping framework based on large language models, termed "Vision-Semantics-Label" mapping, to better adapt video-based large language models to action recognition. In the first step, we propose a visual-skeletal collaborative learning large language model (VS-LLM), which utilizes human keypoints to compensate for the missing motion details without increasing the input token length of the large language model. In the second step, we design two mapping methods, verb-noun match (VN-Match) and all-text match (ALL-Match), which effectively extract relevant action descriptions from the generated text. Finally, we construct semantic action recognition datasets so that the training data inherently contains action details, enabling the model to better perform action recognition. We evaluate our approach on five benchmark datasets, demonstrating state-of-the-art performance of large language models in action recognition. The source code and dataset are publicly available at https://github.com/xiaoyu92568/VS-LLM.
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Early Access)
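
As a rough illustration of the second mapping step, the sketch below shows one way a VN-Match-style and an ALL-Match-style mapping could turn a model's generated description into a fixed action label. The spaCy pipeline, function names, label format, and overlap heuristics are our assumptions for illustration; for the authors' actual implementation, see the repository linked above.

```python
# Hypothetical sketch of the two label-mapping ideas named in the abstract
# (VN-Match and ALL-Match). Function names, the lemma-overlap scoring, and
# the label format are illustrative assumptions, not the released code.
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def vn_match(description: str, labels: list[str]) -> str | None:
    """VN-Match sketch: collect verb and noun lemmas from the generated
    description and return the first label whose own verbs/nouns are all
    covered by them. POS tagging of short labels is fragile, so this is a
    heuristic stand-in, not a faithful reproduction of the paper's method."""
    doc = nlp(description.lower())
    verbs = {t.lemma_ for t in doc if t.pos_ == "VERB"}
    nouns = {t.lemma_ for t in doc if t.pos_ in ("NOUN", "PROPN")}
    for label in labels:
        lab = nlp(label.lower().replace("_", " "))
        lab_verbs = {t.lemma_ for t in lab if t.pos_ == "VERB"}
        lab_nouns = {t.lemma_ for t in lab if t.pos_ in ("NOUN", "PROPN")}
        if (lab_verbs or lab_nouns) and lab_verbs <= verbs and lab_nouns <= nouns:
            return label
    return None


def all_match(description: str, labels: list[str]) -> str:
    """ALL-Match sketch: score every label against the full generated text;
    a simple lemma-overlap ratio stands in for whatever similarity measure
    the paper actually uses."""
    desc = {t.lemma_ for t in nlp(description.lower()) if t.is_alpha}

    def overlap(label: str) -> float:
        lab = {t.lemma_ for t in nlp(label.lower().replace("_", " ")) if t.is_alpha}
        return len(lab & desc) / max(len(lab), 1)

    return max(labels, key=overlap)


if __name__ == "__main__":
    labels = ["ride horse", "play guitar", "brush hair"]
    text = "A person is riding a brown horse across an open field."
    print(vn_match(text, labels))   # expected: "ride horse" (if POS tags cooperate)
    print(all_match(text, labels))  # expected: "ride horse"
```

In this toy setup, VN-Match is the stricter rule (every verb/noun in the label must appear in the description), while ALL-Match always returns the best-scoring label, which matches the abstract's framing of them as two alternative ways to extract the relevant action from free-form text.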