1. INTRODUCTION
Deep learning has achieved great success in many application domains, including computer vision, audio processing, and text processing [1]. However, it typically requires large amounts of labeled data for training. To reduce the high cost of annotating large-scale data, self-supervised learning (SSL) aims to learn powerful feature representations by leveraging supervisory signals derived from the input data itself. Learning is typically done by solving a hand-crafted pretext task without any human-annotated labels. Various pretext tasks for SSL have been proposed, including prediction of future frames [2], masked feature prediction [3], [4], contrastive learning [5]-[8], and predictive coding [9]. Once the network has been trained to solve the pretext task, feature representations are extracted from the pre-trained model to solve new downstream tasks. Powerful and generic representations benefit downstream tasks, especially those with limited labeled data.
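As a rough illustration of this pretext-then-downstream workflow, the sketch below pre-trains a toy encoder with a contrastive (InfoNCE-style) objective, one of the pretext task families cited above, and then reuses it as a frozen feature extractor. The encoder architecture, the noise-based "augmented views," the temperature, and the downstream head are illustrative assumptions, not the setup of any specific cited method.

```python
# Minimal SSL sketch (assumptions: toy encoder, noise augmentation, InfoNCE loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy encoder mapping raw input vectors to feature embeddings."""
    def __init__(self, in_dim=128, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, x):
        return self.net(x)

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive pretext loss: two views of the same input are positives;
    all other items in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = Encoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Pretext training on unlabeled data: two "views" are simulated here by
# adding independent noise to the same batch (a stand-in for augmentation).
for _ in range(100):
    x = torch.randn(32, 128)                # unlabeled batch
    view1 = x + 0.1 * torch.randn_like(x)
    view2 = x + 0.1 * torch.randn_like(x)
    loss = info_nce_loss(encoder(view1), encoder(view2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Downstream use: freeze the pre-trained encoder and train only a small
# classifier head on the (limited) labeled data.
with torch.no_grad():
    features = encoder(torch.randn(16, 128))  # labeled downstream batch
head = nn.Linear(64, 10)                      # e.g. a hypothetical 10-class task
```

The same two-stage pattern applies to the other pretext tasks listed above; only the loss and the way the input is corrupted or split (masking, future-frame prediction, predictive coding) change, while the downstream stage still consumes the pre-trained encoder's representations.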