
Adapting Single-Channel Pre-trained Transformer Models for Multi-Channel Sound Event Localization and Detection


Abstract:

In recent years, the significance of pre-trained transformer audio models has been increasingly recognized. However, existing pre-trained transformer audio models are based on single-channel audio and cannot be directly applied to multi-channel audio for Sound Event Localization and Detection (SELD) tasks. To address this issue, we propose SELD-SSAST, a novel model based on the single-channel Self-Supervised Audio Spectrogram Transformer (SSAST). Specifically, we first introduce a fusion feature that enables SSAST to effectively learn features unique to the SELD problem. Second, we feed the multi-channel audio features into a single SSAST module to learn temporal information across channels through channel-mixing. Finally, to enable SSAST to learn the relationships among multi-channel audio features, we propose a Convolutional Cross Attention (CCA) module that replaces the Transformer's Self-Attention, together with an intensity vector (IV) enhanced module that learns the differences between channel features. Our experiments show that SELD-SSAST improves performance by 23.5% and 20.2% over the baseline on two datasets, respectively. Moreover, at the same data scale, SELD-SSAST outperforms state-of-the-art (SOTA) methods on both datasets.
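The abstract does not specify how the IV-enhanced module computes its input, but intensity vectors are a standard spatial feature in SELD systems built on first-order ambisonics (FOA) audio: the active intensity at each time-frequency bin is the real part of the product of the conjugated omnidirectional channel (W) with the three dipole channels (X, Y, Z), typically normalized by the bin energy. The sketch below is a minimal illustration of that standard feature, not the paper's implementation; the function name and the energy normalization convention are assumptions.

```python
import numpy as np

def intensity_vector(stft_foa, eps=1e-8):
    """Compute a normalized acoustic intensity vector from FOA STFT channels.

    stft_foa: complex array of shape (4, T, F), channels ordered (W, X, Y, Z).
    Returns a real array of shape (3, T, F): the active intensity along
    X, Y, Z, normalized by the total energy of each time-frequency bin.
    """
    w = stft_foa[0]                                # omnidirectional channel
    xyz = stft_foa[1:]                             # dipole channels X, Y, Z
    iv = np.real(np.conj(w)[None] * xyz)           # active intensity, (3, T, F)
    # Per-bin energy used for normalization (convention varies across papers).
    energy = 0.5 * (np.abs(w) ** 2 + np.sum(np.abs(xyz) ** 2, axis=0))
    return iv / (energy[None] + eps)
```

For a plane wave arriving along the +X axis, the X and W channels are in phase while Y and Z vanish, so the normalized vector points along +X; features like this give the transformer an explicit direction-of-arrival cue that single-channel spectrograms lack.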
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
