Abstract:
Weakly supervised anomaly detection aims to identify the time window in which an anomalous event occurs, given only a video-level label indicating whether the video contains an anomaly. Recent efforts have focused on leveraging multi-modal data, specifically combining visual and audio information, to enhance detection accuracy. While some studies have explored intra-video separation techniques, their primary emphasis remains on treating the highest-scoring clips as potentially anomalous events and the lowest-scoring clips as normal events. Nevertheless, challenges persist in delineating the boundaries between normal and abnormal events, particularly when visual differences are subtle. Our proposed framework, Audio-Visual Collaborative Learning (AVCL), addresses this ambiguity in weakly supervised anomaly detection. Our core idea is to exploit both audio-track variations and the perceptual robustness of visual information to detect and differentiate such challenging cases. The framework is composed of two essential modules: the Audio-Visual collaborative Hard cases Separation (AVHS) module and the Multi-modal Mutual Learning (MML) module. The AVHS module addresses the difficulty of discerning visually ambiguous clips in anomalous videos, separating normal from abnormal events. To further improve detection accuracy, we introduce the MML module, which enables mutual learning between the single-modal branch and the multi-modal branch, facilitating the exchange of knowledge and expertise between them. We demonstrate that the proposed approach achieves state-of-the-art detection performance on the XD-Violence and CCTV-Fights_{sub} benchmarks.
Published in: IEEE Transactions on Multimedia (Early Access)
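The abstract does not specify the exact form of the MML objective. Below is a minimal sketch of the general mutual-learning idea it describes, assuming a deep-mutual-learning setup in which a visual-only branch and an audio-visual branch regularize each other's per-clip anomaly scores through a symmetric KL term; the function and variable names are hypothetical, not the paper's actual formulation.

```python
# Hypothetical sketch of a mutual-learning objective between a
# single-modal (visual) branch and a multi-modal (audio-visual)
# branch. The symmetric-KL formulation is an assumption, not the
# paper's exact MML loss.
import torch


def mutual_learning_loss(scores_single: torch.Tensor,
                         scores_multi: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between per-clip anomaly-score distributions.

    scores_single: (B, T) logits from the visual-only branch.
    scores_multi:  (B, T) logits from the audio-visual branch.
    """
    # Turn each per-clip logit into a two-class (anomaly / normal)
    # probability distribution of shape (B, T, 2).
    p_anom = torch.sigmoid(scores_single)
    q_anom = torch.sigmoid(scores_multi)
    p = torch.stack([p_anom, 1.0 - p_anom], dim=-1)
    q = torch.stack([q_anom, 1.0 - q_anom], dim=-1)

    # KL(p || q) + KL(q || p), averaged over batch and clips.
    log_p = p.clamp_min(1e-8).log()
    log_q = q.clamp_min(1e-8).log()
    kl_pq = (p * (log_p - log_q)).sum(dim=-1)
    kl_qp = (q * (log_q - log_p)).sum(dim=-1)
    return (kl_pq + kl_qp).mean()
```

In a training loop of this kind, each branch would also be supervised by its own weakly supervised video-level loss, with the mutual term added so the two branches exchange knowledge as the abstract describes.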