Multi-Modality Self-Distillation for Weakly Supervised Temporal Action Localization


Abstract:

As a challenging task in high-level video understanding, Weakly-supervised Temporal Action Localization (WTAL) has attracted increasing attention in recent years. However, because the only supervision comes from whole-video classification labels, it is difficult to determine action instance boundaries accurately. To address this issue, pseudo-label-based methods [Alwassel et al. (2019), Luo et al. (2020), and Zhai et al. (2020)] have been proposed to generate snippet-level pseudo labels from classification results. Despite their promising performance, these methods rarely take full advantage of multiple modalities, i.e., RGB and optical flow sequences, to generate high-quality pseudo labels. Most of them also overlook how to mitigate label noise, which hinders the network's ability to learn discriminative feature representations. To address these challenges, we propose a Multi-Modality Self-Distillation (MMSD) framework, which contains two single-modal streams and a fused-modal stream to perform multi-modality knowledge distillation and multi-modality self-voting. On the one hand, multi-modality knowledge distillation improves snippet-level classification performance by transferring knowledge between the single-modal streams and the fused-modal stream. On the other hand, multi-modality self-voting mitigates label noise through modality voting, according to the reliability and complementarity of the streams. Experimental results on the THUMOS14 and ActivityNet1.3 datasets demonstrate the effectiveness of our method and its superior performance over state-of-the-art approaches. Our code is available at https://github.com/LeonHLJ/MMSD.
Published in: IEEE Transactions on Image Processing ( Volume: 31)
Page(s): 1504 - 1519
Date of Publication: 20 January 2022

PubMed ID: 35050854

I. Introduction

Temporal action localization in videos has a wide range of applications [4]. The task aims to localize action instances in untrimmed videos along the temporal dimension. Most existing methods [5]–[9] are trained in a fully supervised manner, which requires frame-level annotations. This requirement does not suit real-world applications, since densely annotating large-scale videos is expensive and time-consuming. To address this challenge, weakly supervised methods [10], [11] have been developed that train with only video-level labels, which are much easier to annotate. Among the various forms of weak supervision, video-level category labels are the easiest to collect and the most commonly used [11]–[13].
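
To make the weak-supervision setting concrete, the following is a minimal Python sketch of how video-level category labels can be related to snippet-level scores. It assumes top-k temporal pooling and a fixed score threshold, which are common choices in weakly supervised localization but are not stated in the text above and are not necessarily what MMSD uses. The function names `video_level_scores` and `snippet_pseudo_labels`, the threshold value, and the toy scores are all hypothetical.

```python
from typing import List


def video_level_scores(snippet_scores: List[List[float]], k: int) -> List[float]:
    """Pool snippet scores into one score per class (top-k temporal pooling).

    snippet_scores[t][c] is the score of class c at snippet t.
    Assumption: averaging the k highest snippet scores per class.
    """
    num_classes = len(snippet_scores[0])
    k = max(1, min(k, len(snippet_scores)))  # clamp k to a valid range
    pooled = []
    for c in range(num_classes):
        top = sorted((row[c] for row in snippet_scores), reverse=True)[:k]
        pooled.append(sum(top) / k)
    return pooled


def snippet_pseudo_labels(
    snippet_scores: List[List[float]],
    video_labels: List[int],
    threshold: float = 0.5,  # hypothetical value, chosen for illustration
) -> List[List[int]]:
    """Derive snippet-level pseudo labels from scores and video-level labels.

    A snippet is marked positive for class c only if class c is present
    in the video-level label and the snippet score exceeds the threshold.
    """
    return [
        [1 if video_labels[c] == 1 and s > threshold else 0
         for c, s in enumerate(row)]
        for row in snippet_scores
    ]


# Toy example: 4 snippets, 2 classes; only class 0 occurs in the video.
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.3], [0.2, 0.1]]
pooled = video_level_scores(scores, k=2)
pseudo = snippet_pseudo_labels(scores, video_labels=[1, 0])
```

The pooled scores can be compared against the video-level label with a classification loss during training, while the pseudo labels give a rough snippet-level target. Because such labels are derived from the classifier's own predictions, they are noisy, which is the problem the abstract's self-voting component is meant to address.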
