I. Introduction
Temporal Action Localization (TAL) seeks to identify the temporal boundaries of actions in untrimmed videos and to classify them, a field gaining traction due to its potential in applications such as video surveillance [1], event analysis [2], and highlight detection [3]. Despite significant advances in fully-supervised TAL [4], [5], [6], [7], the high cost of collecting frame-level annotations remains a challenge. To alleviate this problem, weakly supervised TAL (WTAL) methods [8], [9], [10], [11] that use only video-level annotations have been proposed. However, WTAL’s performance lags behind that of fully-supervised approaches due to limited access to action information during training. To narrow this performance gap, research has explored point-level or single-frame annotation [12], [13], [14], [15], [16], offering a cost-effective compromise that maintains performance.