Neighbor-Guided Pseudo-Label Generation and Refinement for Single-Frame Supervised Temporal Action Localization


Abstract:

Due to the sparsity of single-frame annotations, current Single-Frame Temporal Action Localization (SF-TAL) methods generally employ threshold-based pseudo-label generation strategies. However, these approaches suffer from inefficient data utilization: only the unlabeled frames whose confidence scores surpass a predefined threshold are selected for training. Moreover, the variability of single-frame annotations and unreliable model predictions introduce pseudo-label noise. To address these challenges, we propose two strategies that exploit the relationships between video segments and their neighbors: 1) temporal neighbor-guided soft pseudo-label generation (TNPG); and 2) semantic neighbor-guided pseudo-label refinement (SNPR). TNPG uses a local-global self-attention mechanism in a transformer encoder to capture temporal neighbor information while attending to the whole video. The generated self-attention map is then multiplied by the network predictions to propagate information between labeled and unlabeled frames, producing soft pseudo-labels for all segments. Even so, label noise persists because of unreliable model predictions. To mitigate this, SNPR refines the pseudo-labels based on the assumption that a segment's prediction should resemble those of its semantic nearest neighbors. Specifically, we search for the semantic nearest neighbors of each video segment by cosine similarity in the feature space, and then obtain refined soft pseudo-labels as a weighted combination of the original pseudo-label and the neighbors'. Training the model with the refined pseudo-labels greatly improves performance. Comprehensive experiments show that we achieve state-of-the-art performance on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 benchmarks.
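The two mechanisms described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function names, the neighbor count `k`, and the mixing weight `alpha` are illustrative assumptions. TNPG propagates labels by multiplying a self-attention map by the network's per-segment predictions; SNPR blends each segment's pseudo-label with the mean pseudo-label of its cosine-similarity nearest neighbors.

```python
import numpy as np

def tnpg_soft_labels(attn, preds):
    """Temporal neighbor-guided pseudo-label generation (sketch).

    attn:  (T, T) self-attention map over T video segments from a
           transformer encoder (each row sums to 1).
    preds: (T, C) per-segment class probabilities from the network.

    Multiplying the attention map by the predictions propagates
    information between labeled and unlabeled segments, yielding
    soft pseudo-labels for every segment.
    """
    return attn @ preds  # (T, C) soft pseudo-labels

def snpr_refine(features, pseudo, k=3, alpha=0.5):
    """Semantic neighbor-guided pseudo-label refinement (sketch).

    features: (T, D) segment features; pseudo: (T, C) soft labels.
    k and alpha are illustrative hyperparameters, not the paper's.

    Finds each segment's k semantic nearest neighbors by cosine
    similarity, then blends the original pseudo-label with the
    neighbors' mean pseudo-label using weight alpha.
    """
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T                       # (T, T) cosine similarity
    np.fill_diagonal(sim, -np.inf)            # exclude self-matches
    nn_idx = np.argsort(-sim, axis=1)[:, :k]  # indices of k neighbors
    neighbor_mean = pseudo[nn_idx].mean(axis=1)
    return alpha * pseudo + (1 - alpha) * neighbor_mean
```

Note that when the attention rows and prediction rows are both normalized, the resulting soft pseudo-labels remain valid probability distributions, and the convex combination in the refinement step preserves that property.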
Published in: IEEE Transactions on Image Processing ( Volume: 33)
Page(s): 2419 - 2430
Date of Publication: 22 March 2024

PubMed ID: 38517712


I. Introduction

Temporal Action Localization (TAL) seeks to identify temporal boundaries and classify actions in untrimmed videos, a field gaining traction due to its potential in applications such as video surveillance [1], event analysis [2], and highlight detection [3]. Despite significant advances in fully-supervised TAL [4], [5], [6], [7], the high cost of collecting frame-level annotations remains a challenge. To alleviate this problem, weakly supervised TAL (WTAL) methods [8], [9], [10], [11] that use only video-level annotations have been proposed. However, WTAL's performance lags behind that of fully-supervised approaches due to limited access to action information during training. To narrow this performance gap, research has explored point-level or single-frame annotation [12], [13], [14], [15], [16], a cost-effective compromise that maintains performance.

