Abstract:
In recent years, the significance of pre-trained transformer audio models has been increasingly recognized. However, existing pre-trained transformer audio models are based on single-channel audio and cannot be directly applied to multi-channel audio for Sound Event Localization and Detection (SELD) tasks. To address this issue, we propose SELD-SSAST, a novel model based on the single-channel Self-Supervised Audio Spectrogram Transformer (SSAST). Specifically, we first introduce a fusion feature that enables SSAST to learn the features unique to SELD effectively. Second, we feed the multi-channel audio features into a single SSAST module, which learns temporal information across channels through channel-mixing. Finally, to enable SSAST to learn the relationships between multi-channel audio features, we propose a Convolutional Cross Attention (CCA) module to replace the Transformer’s self-attention, together with an intensity vector (IV) enhanced module that learns the differences between channel features. Our experiments show that SELD-SSAST improves performance by 23.5% and 20.2% over the baseline on two datasets, respectively. Moreover, at the same data scale, SELD-SSAST outperforms state-of-the-art (SOTA) methods on both datasets.
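The intensity vector (IV) mentioned in the abstract is a standard directional feature in SELD systems. As a point of reference, a minimal sketch of the common active-IV computation for first-order Ambisonics (FOA) input is shown below; this reflects the usual formulation in the SELD literature, not necessarily the paper’s specific IV-enhanced module, and the function name and normalization choice are illustrative assumptions:

```python
import numpy as np

def intensity_vector(stft_foa, eps=1e-8):
    """Active intensity vector from first-order Ambisonics STFTs.

    stft_foa: complex array of shape (4, T, F) with the W, X, Y, Z
    channel spectrograms. Returns a real array of shape (3, T, F).
    The normalization by per-bin energy (a common choice, assumed
    here) makes the feature encode direction rather than loudness.
    """
    w, x, y, z = stft_foa
    # Re{W* . [X, Y, Z]} gives the active acoustic intensity per bin.
    iv = np.real(np.conj(w)[None] * np.stack([x, y, z]))  # (3, T, F)
    # Per-bin energy used for normalization; eps avoids division by zero.
    energy = np.abs(w) ** 2 + (np.abs(x) ** 2 + np.abs(y) ** 2 + np.abs(z) ** 2) / 3.0 + eps
    return iv / energy[None]
```

Features of this kind are typically stacked with the multi-channel log-mel spectrograms to form the fused input the abstract refers to.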
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025