I. Introduction
Temporal action localization in videos has a wide range of applications [4]. The task aims to localize action instances in untrimmed videos along the temporal dimension. Most existing methods [5]–[9] are trained in a fully supervised manner, which requires frame-level annotations. This requirement does not suit real-world applications, since densely annotating large-scale videos is expensive and time-consuming. To address this challenge, weakly supervised methods [10], [11] have been developed that require only video-level labels, which are much easier to annotate. Among the various forms of weak supervision, video-level category labels are the easiest to collect and the most commonly used [11]–[13].