1. INTRODUCTION
Deep learning has achieved great success in many application domains, including computer vision, audio processing, and text processing [1]. However, it typically requires large amounts of labeled data for training. To reduce the high cost of annotating large-scale data, self-supervised learning (SSL) aims to learn powerful feature representations by leveraging supervisory signals derived from the input data itself. Learning is typically done by solving a hand-crafted pretext task without any human-annotated labels. Various pretext tasks for SSL have been proposed, including prediction of future frames [2], masked feature prediction [3], [4], contrastive learning [5]-[8], and predictive coding [9]. Once the network has been trained to solve the pretext task, feature representations are extracted from the pre-trained model to solve new downstream tasks. Powerful and generic representations benefit downstream tasks, especially those with limited labeled data.
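As a rough illustration of this pretext-then-downstream workflow, the sketch below pre-trains a toy encoder with a contrastive (InfoNCE-style) objective, one of the pretext task families cited above, and then reuses it as a frozen feature extractor. The encoder architecture, the noise-based "augmented views," the temperature, and the downstream head are illustrative assumptions, not the setup of any specific cited method.

```python
# Minimal SSL sketch (assumptions: toy encoder, noise augmentation, InfoNCE loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy encoder mapping raw input vectors to feature embeddings."""
    def __init__(self, in_dim=128, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, x):
        return self.net(x)

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive pretext loss: two views of the same input are positives;
    all other items in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = Encoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Pretext training on unlabeled data: two "views" are simulated here by
# adding independent noise to the same batch (a stand-in for augmentation).
for _ in range(100):
    x = torch.randn(32, 128)                # unlabeled batch
    view1 = x + 0.1 * torch.randn_like(x)
    view2 = x + 0.1 * torch.randn_like(x)
    loss = info_nce_loss(encoder(view1), encoder(view2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Downstream use: freeze the pre-trained encoder and train only a small
# classifier head on the (limited) labeled data.
with torch.no_grad():
    features = encoder(torch.randn(16, 128))  # labeled downstream batch
head = nn.Linear(64, 10)                      # e.g. a hypothetical 10-class task
```

The same two-stage pattern applies to the other pretext tasks listed above; only the loss and the way the input is corrupted or split (masking, future-frame prediction, predictive coding) change, while the downstream stage still consumes the pre-trained encoder's representations.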