Abstract:
Unsupervised few-shot action recognition is a practical but challenging task that adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, existing methods cannot achieve satisfactory performance due to low-quality pseudo-classes and episodes. Although vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance gains remain limited when only the visual encoder is used and textual modality information is absent. In this paper, we propose to fully exploit the multimodal knowledge of the pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. A textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES), based on a video-text ensemble distance metric, is proposed to accurately estimate pseudo-classes and thereby construct high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) adapts the pre-trained model to novel tasks via category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.
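The episode construction described above hinges on clustering unlabeled videos with a video-text ensemble distance. As an illustration only, the minimal sketch below shows one plausible form of such a distance: blending cosine distances computed from CLIP visual embeddings and from text embeddings of auto-generated captions, then clustering into pseudo-classes. The function names, the mixing weight `alpha`, and the k-means stand-in are assumptions for exposition, not the authors' MACES implementation.

```python
# Hypothetical sketch of a video-text ensemble distance for pseudo-class
# clustering. Assumes each unlabeled video already has a CLIP visual
# embedding and a CLIP text embedding of its generated caption.
import numpy as np
from sklearn.cluster import KMeans


def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def ensemble_distance(vis: np.ndarray, txt: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend visual-visual and text-text cosine distances into one matrix."""
    return alpha * cosine_distance(vis, vis) + (1.0 - alpha) * cosine_distance(txt, txt)


# Toy usage: group 100 unlabeled videos into 10 pseudo-classes from the
# blended distance matrix (k-means on the distance rows is a simple stand-in
# for whatever clustering procedure the paper actually employs).
rng = np.random.default_rng(0)
visual_emb = rng.normal(size=(100, 512))   # stand-in CLIP visual features
text_emb = rng.normal(size=(100, 512))     # stand-in CLIP caption features
dist = ensemble_distance(visual_emb, text_emb, alpha=0.5)
pseudo_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(dist)
print(pseudo_labels[:20])
```

Pseudo-classes obtained this way could then be sampled into N-way K-shot episodes for episodic training; the adaptation and inference stages (VLMA, multimodal adaptive inference) are described in the paper itself and are not sketched here.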
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Early Access)
Author Affiliation:
School of Computer Science and Technology, Sun Yat-sen University, Guangzhou, China