1. INTRODUCTION
An effective way to learn audio representations for sound classification is to train deep neural networks (DNNs) on supervised tasks using large annotated datasets [1], [2], [3]. However, building these datasets requires considerable effort, and they are inevitably limited in size, which constrains the quality of the learned representations. Recent research explores unsupervised, self-supervised, and semi-supervised learning methods for obtaining generic audio representations that can later be used for different downstream tasks [4], [5], [6], [7]. The large amount of multimedia data available online offers a great opportunity for such approaches to learn powerful audio representations.