
Audio-Guided Attention Network for Weakly Supervised Violence Detection


Abstract:

Detecting violence in video is a challenging task due to its complex scenarios and great intra-class variability. Most previous works specialize in the analysis of appearance or motion information, ignoring the co-occurrence of certain audio and visual events. Physical conflicts such as abuse and fighting are usually accompanied by screaming, while crowd violence such as riots and wars is generally associated with gunshots and explosions. Therefore, we propose a novel audio-guided multimodal violence detection framework. First, separate deep neural networks extract visual and audio features. Then, a Cross-Modal Awareness Local-Arousal (CMA-LA) network is proposed for cross-modal interaction, which performs audio-to-visual feature enhancement over the temporal dimension. The enhanced features are fed into a multilayer perceptron (MLP) to capture high-level semantics, followed by a temporal convolution layer that produces high-confidence violence scores. To verify the effectiveness of the proposed method, we conduct experiments on a large public violent video dataset, XD-Violence. Experimental results demonstrate that our model outperforms several existing methods and achieves new state-of-the-art performance.
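The pipeline described above can be sketched in a few lines. The sketch below is an illustrative assumption, not the authors' released code: feature dimensions, the sigmoid gating form of the audio-to-visual enhancement, and the MLP/convolution sizes are all hypothetical choices standing in for the CMA-LA details, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_v, d_a = 16, 128, 64        # snippets per video, visual/audio dims (assumed)

V = rng.standard_normal((T, d_v))  # per-snippet visual features (e.g. from a CNN)
A = rng.standard_normal((T, d_a))  # per-snippet audio features (e.g. from a CNN)

# Audio-guided enhancement (assumed form): audio features produce a per-snippet
# arousal gate that modulates the visual features along the temporal axis,
# with a residual connection so un-aroused snippets are not zeroed out.
W_a = rng.standard_normal((d_a, 1)) * 0.1
gate = 1.0 / (1.0 + np.exp(-(A @ W_a)))   # (T, 1) temporal gate in (0, 1)
V_enh = V + V * gate                       # audio-enhanced visual features

# MLP head for high-level semantics, then a width-3 temporal convolution
# producing one violence score per snippet.
W1 = rng.standard_normal((d_v, 32)) * 0.1
h = np.maximum(V_enh @ W1, 0.0)            # (T, 32) ReLU MLP layer
k = rng.standard_normal((3, 32)) * 0.1     # temporal conv kernel, width 3
h_pad = np.pad(h, ((1, 1), (0, 0)))        # same-length padding over time
scores = np.array([np.sum(h_pad[t:t + 3] * k) for t in range(T)])
probs = 1.0 / (1.0 + np.exp(-scores))      # per-snippet violence probabilities
print(probs.shape)
```

In the weakly supervised setting, only video-level labels are available, so a per-snippet score vector like `probs` would typically be aggregated (e.g. by top-k pooling) into a video-level prediction for training.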
Date of Conference: 14-16 January 2022
Date Added to IEEE Xplore: 21 February 2022
Conference Location: Guangzhou, China
