1. INTRODUCTION
Speaker recognition refers to identifying or verifying a claimed speaker by analyzing speech from that speaker. Over the past few years, supervised deep learning methods have greatly improved the performance of speaker recognition systems [1], [2], [3]. These methods require large-scale labeled datasets to learn discriminative speaker representations. However, manually annotating speaker labels for a large-scale dataset can be expensive and problematic. On the other hand, vast amounts of unlabeled speech data are available for training deep neural networks (DNNs). With self-supervised methods, the supervisory signal is derived from the data itself, so training can benefit from massive amounts of data without manual labeling. Self-supervised learning is a long-standing and active research area, and it has recently received growing attention in speech signal processing [4], [5], [6], [7], [8], natural language processing [9], and computer vision [10], [11], [12], [13], [14], [15].