Semi-Supervised Sound Event Detection Using Self-Attention and Multiple Techniques of Consistency Training


Abstract:

We present a system that detects sound events and their time boundaries in audio signals. The proposed system is based on the mean-teacher framework for semi-supervised learning of a deep neural network that uses the transformer architecture as its self-attention mechanism. The network can be trained efficiently with a small amount of strongly labeled synthetic data and a large amount of weakly labeled or unlabeled real data. The model parameters are learned under multiple consistency criteria, including interpolation consistency, shift consistency, and clip-level consistency, to improve generalization and representation power. We also apply data augmentation with spectral and temporal masks to increase data diversity. Finally, an adaptive post-processing stage effectively smooths the frame-level network output. The proposed system is evaluated on the data released for DCASE 2020 Task 4. It achieves state-of-the-art performance: an event-based F-score of 46.30%, a segment-based F-score of 72.21%, and a polyphonic sound detection score (PSDS) of 69.01%. These numbers surpass the 41.54%, 68.11%, and 63.56% attained by a reference system without the proposed transformer blocks, consistency objective functions, and data augmentation.
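The core ingredients named in the abstract, a mean-teacher weight update, a frame-level consistency objective, and temporal masking for augmentation, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; all function names, the MSE choice for the consistency loss, and the mask parameters are assumptions for the sketch.

```python
import numpy as np

def ema_update(teacher_w, student_w, alpha=0.999):
    """Mean-teacher update: the teacher's weights track an exponential
    moving average (EMA) of the student's weights after each step."""
    return alpha * teacher_w + (1.0 - alpha) * student_w

def consistency_loss(student_pred, teacher_pred):
    """Frame-level consistency: mean squared error between student and
    teacher posteriors on the same (differently augmented) clip."""
    return float(np.mean((student_pred - teacher_pred) ** 2))

def time_mask(spec, max_width=8, rng=None):
    """SpecAugment-style temporal mask: zero out a random span of time
    frames in a (freq, time) spectrogram; a frequency mask is analogous
    along axis 0."""
    rng = rng or np.random.default_rng(0)
    n_frames = spec.shape[1]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    out = spec.copy()
    out[:, start:start + width] = 0.0
    return out
```

On unlabeled real data only the consistency term contributes to the gradient, which is what lets the large unlabeled set shape the representation alongside the small strongly labeled synthetic set.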
Date of Conference: 14-17 December 2021
Date Added to IEEE Xplore: 03 February 2022
Conference Location: Tokyo, Japan
