
Improving Self-Supervised Learning for Audio Representations by Feature Diversity and Decorrelation



Abstract:

Self-supervised learning (SSL) has recently shown remarkable results in closing the gap between supervised and unsupervised learning. The idea is to learn robust features that are invariant to distortions of the input data. Despite its success, this idea can suffer from a collapsing issue where the network produces a constant representation. To address this issue, we introduce SELFIE, a novel Self-supervised Learning approach for audio representation via Feature Diversity and Decorrelation. SELFIE avoids the collapsing issue by ensuring that the representation (i) maintains a high diversity among embeddings and (ii) decorrelates the dependencies between dimensions. SELFIE is pre-trained on the large-scale AudioSet dataset and its embeddings are validated on nine audio downstream tasks, including speech, music, and sound event recognition. Experimental results show that SELFIE outperforms existing SSL methods in several tasks.
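The abstract only names the two anti-collapse ideas (embedding diversity and dimension decorrelation) without giving a formula. As a rough illustration of how such regularizers are commonly implemented (in the spirit of variance/covariance terms used by methods like VICReg), the sketch below keeps per-dimension standard deviation above a threshold and pushes off-diagonal covariance entries toward zero. Function and variable names are hypothetical; SELFIE's actual objective may differ.

```python
# Minimal sketch (not the paper's exact formulation) of a regularizer that
# counters representational collapse as described in the abstract:
# (i) keep diversity across embeddings and (ii) decorrelate feature dimensions.
import torch
import torch.nn.functional as F

def diversity_decorrelation_loss(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """z: batch of embeddings with shape (batch_size, dim)."""
    n, d = z.shape
    z = z - z.mean(dim=0)                       # center each dimension

    # (i) Diversity: penalize dimensions whose standard deviation falls below 1,
    # discouraging all embeddings from collapsing to a constant vector.
    std = torch.sqrt(z.var(dim=0) + eps)
    diversity = torch.relu(1.0 - std).mean()

    # (ii) Decorrelation: drive off-diagonal covariance entries toward zero
    # so that different embedding dimensions carry non-redundant information.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    decorrelation = (off_diag ** 2).sum() / d

    return diversity + decorrelation

# Example: combine with an invariance term over two distorted views of the input.
z1, z2 = torch.randn(256, 512), torch.randn(256, 512)   # embeddings of two views
invariance = F.mse_loss(z1, z2)
loss = invariance + diversity_decorrelation_loss(z1) + diversity_decorrelation_loss(z2)
```

The key design point is that the regularizer depends only on batch statistics of the embeddings, so it avoids collapse without requiring negative pairs or a momentum encoder.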
Date of Conference: 04-10 June 2023
Conference Location: Rhodes Island, Greece

1. INTRODUCTION

Deep learning has achieved great success in many application domains, including computer vision, audio processing, text processing, and others [1]. However, it often requires a huge amount of labeled data for training. To reduce the high cost of annotating large-scale data, self-supervised learning (SSL) aims to learn powerful feature representations by leveraging supervisory signals from the input data itself. Learning is often done by solving a hand-crafted pretext task without using any human-annotated labels. Various pretext tasks for SSL have been proposed, including prediction of future frames [2], masked feature prediction [3], [4], contrastive learning [5]-[8], and predictive coding [9]. Once the network is trained to solve the pretext task, feature representations are extracted from the pre-trained model in order to solve new downstream tasks. Powerful and generic representations can benefit downstream tasks, especially those with limited labeled data.

