Abstract:
The goal of spatial-temporal action detection is to generate spatially and temporally aligned action tubes. Most existing 2D CNN-based solutions aggregate temporally adjacent context across frames directly, without alignment. The resulting misaligned spatial-temporal contextual features can lead to chaotic representations and misaligned action tubes. Moreover, most existing methods fail to exploit motion dependencies efficiently. In this paper, we propose Modulation-based Center Alignment (MCA) and Sparse Valuable Motion Mining (SVMM) for more accurate action detection. First, key-frame based modulation, built on deformable convolution, aligns the action center across temporal frames; then, motion-region-guided sparse self-attention mines valuable motion. Experimental results on two widely used benchmarks, J-HMDB and UCF101-24, show that our framework significantly outperforms current 2D CNN-based methods.
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023