Abstract:
This paper focuses on few-shot action recognition (FSAR), where a model must recognize human actions from only a few video samples per class. Although this setting has only recently been explored, the most advanced methods employ action textual features, pre-trained by a visual-language model (VLM), as cues to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook rich motion cues within videos. To tackle this issue, we propose a novel framework, namely, the motion-aware visual-language representation modulation network (MoveNet). MoveNet exploits dynamic motion cues within videos to integrate motion-aware textual and visual feature representations, which in turn modulate the video prototypes. To this end, a long short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. With these motion cues at hand, a motion-conditional prompting module (MCPM) uses them as conditions to strengthen the semantic associations between textual features and action classes. We further develop a motion-guided visual refinement module (MVRM) that adopts motion cues as guidance to enhance local frame features. The proposed components complement each other and contribute to significant performance gains on the FSAR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, which considerably outperforms current state-of-the-art methods.
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Early Access)
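To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical sketch of how motion-aware prototype modulation could be composed from the three named modules (LSMAM, MCPM, MVRM). All class names, layer choices, feature shapes, and the fusion rule are assumptions made for illustration only; they do not reflect the authors' actual implementation.

```python
# Hypothetical sketch of a MoveNet-style prototype modulation pipeline.
# Module internals are illustrative assumptions, NOT the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSMAM(nn.Module):
    """Long short motion aggregation (sketch): temporal convolutions with two
    dilation rates stand in for short- and long-range motion cues."""
    def __init__(self, dim):
        super().__init__()
        self.short = nn.Conv1d(dim, dim, kernel_size=3, padding=1, dilation=1)
        self.long = nn.Conv1d(dim, dim, kernel_size=3, padding=2, dilation=2)

    def forward(self, frames):                      # frames: (B, T, D)
        x = frames.transpose(1, 2)                  # (B, D, T)
        motion = F.relu(self.short(x)) + F.relu(self.long(x))
        return motion.mean(dim=2)                   # (B, D) pooled motion cue

class MCPM(nn.Module):
    """Motion-conditional prompting (sketch): shift static VLM class text
    embeddings by a projection of the motion cue."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_emb, motion):            # (C, D), (B, D)
        shift = self.proj(motion).mean(dim=0)       # pooled condition, (D,)
        return F.normalize(text_emb + shift, dim=-1)

class MVRM(nn.Module):
    """Motion-guided visual refinement (sketch): gate local frame features
    with a motion-derived channel attention plus a residual path."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, frames, motion):              # (B, T, D), (B, D)
        g = torch.sigmoid(self.gate(motion)).unsqueeze(1)
        return frames * g + frames

def modulated_prototypes(support, text_emb, lsmam, mcpm, mvrm):
    """Fuse motion-aware textual and visual features into class prototypes
    (here one support video per class, i.e. a 1-shot episode)."""
    motion = lsmam(support)                         # (C, D)
    vis = mvrm(support, motion).mean(dim=1)         # (C, D) video features
    txt = mcpm(text_emb, motion)                    # (C, D) motion-aware text
    return F.normalize(F.normalize(vis, dim=-1) + txt, dim=-1)

if __name__ == "__main__":
    D, T, C = 64, 8, 5                              # feature dim, frames, classes
    lsmam, mcpm, mvrm = LSMAM(D), MCPM(D), MVRM(D)
    support = torch.randn(C, T, D)                  # 5-way 1-shot support clips
    text_emb = torch.randn(C, D)                    # static VLM class embeddings
    query = torch.randn(1, T, D)
    protos = modulated_prototypes(support, text_emb, lsmam, mcpm, mvrm)
    q = F.normalize(mvrm(query, lsmam(query)).mean(dim=1), dim=-1)
    print((q @ protos.t()).softmax(dim=-1))         # query class probabilities
```

In this sketch the query is matched to text-modulated prototypes by cosine similarity, one plausible reading of "modulating the video prototypes" with motion-aware representations; the paper itself should be consulted for the actual module designs and fusion strategy.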