I. Introduction
Applications of audio classification have grown significantly, and their impact has become increasingly evident [1], [2]. These applications span various domains, including acoustic scene classification [3], underwater acoustic signal classification [4], [5], and urban sound classification [6]. Although supervised and self-supervised learning methods achieve strong performance in these tasks, they typically require extensive labeling, which is time-consuming and labor-intensive [7]. A key challenge in this field is therefore to efficiently select the most informative audio data for labeling within a limited budget, reducing annotation costs while maximizing the model’s performance.