Abstract:
Recent research on skeleton-based action recognition has focused on designing network architectures that effectively capture motion features. In this work, we draw inspiration from the similarity between the temporal dynamics of skeleton sequences, which embed action information, and those of natural text, and propose a novel framework called Action Recognition via Language Processing (ARLP). The framework treats skeleton sequences as analogous to "sentences" in Natural Language Processing (NLP) and employs similar processing pipelines for feature extraction and classification. We propose a Skeleton Vector Quantized Variational Autoencoder (SVQ-VAE), built on GCNs and VQ-VAE, to transform skeleton sequences into skeleton tokens, addressing the challenge of preserving the spatial dimensions and topological information of skeleton data. These skeleton tokens are then fed into the proposed Pose Transformer (POTR), whose decoder introduces an "action queries" mechanism adapted to the action recognition task, enabling the model to extract temporal features autoregressively and enhancing interpretability. Experimental results demonstrate that ARLP significantly outperforms benchmark models on three mainstream datasets.
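The abstract only sketches the pipeline at a high level; the paper's actual implementation is not reproduced here. The PyTorch fragment below is a minimal sketch of one plausible reading of it: the names SkeletonTokenizer and ActionQueryDecoder are hypothetical, a linear layer stands in for the GCN encoder, straight-through VQ training is omitted, and the action-query decoding is shown non-autoregressively (DETR-style learned queries) for brevity.

```python
import torch
import torch.nn as nn

class SkeletonTokenizer(nn.Module):
    """Hypothetical SVQ-VAE-style tokenizer: embeds each skeleton frame and
    quantizes it against a learned codebook (vector quantization)."""
    def __init__(self, in_dim, embed_dim, codebook_size):
        super().__init__()
        self.encoder = nn.Linear(in_dim, embed_dim)  # stand-in for the GCN encoder
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, x):  # x: (batch, frames, in_dim) flattened joint features
        z = self.encoder(x)                                    # (B, T, D)
        # Nearest codebook entry per frame embedding (straight-through
        # estimator for VQ training is omitted in this sketch).
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, T, K)
        token_ids = dists.argmin(dim=-1)                       # (B, T)
        return self.codebook(token_ids), token_ids

class ActionQueryDecoder(nn.Module):
    """Hypothetical POTR-style head: learned 'action queries' cross-attend
    over the skeleton-token sequence, then a linear layer classifies."""
    def __init__(self, embed_dim, num_queries, num_classes, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):  # tokens: (B, T, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.decoder(q, tokens)             # queries attend to tokens
        return self.classifier(out.mean(dim=1))   # pooled logits: (B, classes)

# Illustrative usage: 25 joints x 3 coords per frame -> tokens -> class logits.
tokenizer = SkeletonTokenizer(in_dim=75, embed_dim=64, codebook_size=512)
head = ActionQueryDecoder(embed_dim=64, num_queries=8, num_classes=60)
skeletons = torch.randn(2, 100, 75)               # (batch, frames, joints*coords)
tokens, _ = tokenizer(skeletons)
logits = head(tokens)                             # (2, 60)
```

Treating frames as discrete tokens is what lets the NLP machinery apply: once quantized, a skeleton sequence can be consumed by a standard transformer exactly like a tokenized sentence.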
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025