1. INTRODUCTION
Given the large imbalance between the amounts of available labeled and unlabeled data, self-supervised pre-training is key to good performance on low-resource downstream tasks. Representations of discrete input sequences, such as those in Natural Language Processing (NLP) applications, are typically learned through masked prediction [1], [2] or auto-regressive generation [3] of partially obfuscated inputs. For continuous inputs, such as those in Computer Vision (CV) applications, representations are often learned through instance classification, in which each image and its augmentations are treated as a single output class and contrasted against all other samples [4].
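As a rough illustration of the instance-classification idea, the sketch below computes an InfoNCE-style contrastive loss over a batch of paired embeddings; the names (`info_nce_loss`, the temperature value, and the random tensors standing in for encoder outputs of two augmented views) are illustrative assumptions, not part of any method described in this work.

```python
# Minimal sketch of instance classification with an InfoNCE-style loss.
# Assumes two embeddings per image, one for each augmented view; the encoder
# and augmentation pipeline producing them are hypothetical and omitted here.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrast each sample's two views against all other samples in the batch.

    z1, z2: (batch, dim) embeddings of two augmentations of the same images.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0))   # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Random tensors standing in for encoder outputs of two views of 8 images.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```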