I. Introduction
Research on the audio-visual event (AVE) localization task has shown that joint audio-visual representations can facilitate the understanding of unconstrained videos [1], [2], [3], [4], [5], [6]. Specifically, an AVE is defined as an event that is both audible and visible at the same time (as shown in Fig. 1). AVE localization requires a model to identify the category of an AVE and to localize its temporal boundaries. This task is essential for understanding, recommending, and searching video content, especially on short-video platforms.
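Concretely, the task is commonly cast in a segment-level form; the following is a minimal sketch of that formulation, with notation ($T$, $C$, $a_t$, $v_t$, $y_t$) chosen here for illustration rather than taken from a specific prior work. A video is divided into $T$ non-overlapping temporal segments $\{(a_t, v_t)\}_{t=1}^{T}$, where $a_t$ and $v_t$ denote the audio and visual streams of segment $t$, and the model predicts a per-segment label
\[
  y_t \in \{1, \dots, C\} \cup \{\mathrm{background}\},
\]
assigning each segment one of $C$ event categories when the event is simultaneously audible and visible in that segment, and background otherwise. The temporal boundary of an AVE then corresponds to a maximal run of consecutive segments sharing the same non-background label.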