Abstract:
Most modern approaches to temporal action localization (TAL) focus mainly on time-domain information and neglect the advantages of information from other domains. How to effectively exploit information from different domains, and the interactions between them, remains an attractive yet challenging problem in TAL. In this paper, we propose a novel cross time-frequency Transformer model (TFFormer) for TAL. A dual-branch network architecture is designed to capture time and frequency features at multiple scales, using a multi-scale Transformer in the time branch and the DB1 Discrete Wavelet Transform (DWT) in the frequency branch. To fuse features from the two domains, we propose a cross time-frequency attention mechanism with a time pathway and a frequency pathway, which enhances the interaction between temporal and frequency features. Furthermore, a gated control mechanism is designed to aggregate features from different scales and to characterize the respective contribution of each scale. We also design a new regression loss function for locating temporal boundaries. Extensive experiments were carried out on four challenging benchmark datasets, including two third-person datasets and two first-person datasets. TFFormer achieves an average mAP of 23.2% on Ego4D and 25.6% on EPIC-Kitchens 100, outperforming previous state-of-the-art methods by a large margin. It also obtains competitive results on ActivityNet v1.3 and THUMOS14, with average mAPs of 36.2% and 67.8%, respectively. We also conducted extensive ablation studies to validate the effectiveness of each component of the proposed method.
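To make the frequency branch more concrete, the following is a minimal sketch of a multi-level DB1 (Haar) DWT applied to a one-dimensional sequence, written in plain Python. It is not the authors' implementation: the function names `haar_dwt_step` and `haar_pyramid` are hypothetical, the input is a scalar sequence rather than a sequence of video feature vectors, and the assumption that each level decomposes the approximation band of the previous level is ours, since the abstract does not spell out how the scales are formed.

```python
import math


def haar_dwt_step(signal):
    """One level of the DB1 (Haar) DWT on an even-length sequence.

    Returns (approx, detail), each half the length of the input.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)  # orthonormal Haar filter coefficient
    # Low-pass: scaled sum of each non-overlapping pair.
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    # High-pass: scaled difference of each non-overlapping pair.
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_pyramid(signal, levels):
    """Decompose repeatedly, feeding each approximation band to the next level.

    Assumption for illustration: one (approx, detail) pair per scale.
    """
    pyramid = []
    current = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt_step(current)
        pyramid.append((approx, detail))
        current = approx
    return pyramid


if __name__ == "__main__":
    sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    for level, (approx, detail) in enumerate(haar_pyramid(sequence, 2), start=1):
        print(level, len(approx), len(detail))
```

Each level halves the temporal resolution, which is what lets a wavelet decomposition line up with a feature pyramid; in a real model the same pairwise sum and difference would be applied along the time axis of a feature tensor.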
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 34, Issue: 6, June 2024)
IEEE Keywords (Index Terms):
- Temporal Localization
- Action Localization
- Temporal Action Localization
- Time Domain
- Effects Of Components
- Extensive Experiments
- Attention Mechanism
- Temporal Features
- Wavelet Transform
- Transformer Model
- Contribution Of Features
- Time-domain Information
- Frequency Domain
- Multilayer Perceptron
- Temporal Information
- Characteristic Scale
- Multi-scale Features
- Frequency Information
- Action Classes
- Feature Pyramid
- Wavelet Basis Function
- Multi-scale Feature Fusion
- Action Instances
- Wavelet Basis
- Frequency Domain Features
- Multilayer Perceptron Layer
- Gating Units
- Classification Head
- 1D Convolution
- Pyramid Level