A Survey on Deep Active Learning: Recent Advances and New Frontiers

Active learning seeks to achieve strong performance with fewer training samples. It does this by iteratively asking an oracle to label newly selected samples in a human-in-the-loop manner. This technique has gained increasing popularity due to its broad applicability, yet surveys of it, especially of deep learning-based active learning (DAL), remain scarce. Therefore, we conduct an advanced and comprehensive survey on DAL. We first describe how the reviewed papers were collected and filtered. Second, we formally define the DAL task and summarize the most influential baselines and widely used datasets. Third, we systematically provide a taxonomy of DAL methods from five perspectives, including annotation types, query strategies, deep model architectures, learning paradigms, and training processes, and objectively analyze their strengths and weaknesses. Then, we comprehensively summarize the main applications of DAL in Natural Language Processing (NLP), Computer Vision (CV), Data Mining (DM), etc. Finally, we discuss challenges and perspectives after a detailed analysis of current studies. This work aims to serve as a useful and quick guide for researchers in overcoming difficulties in DAL. We hope that this survey will spur further progress in this burgeoning field.


I. INTRODUCTION
THE remarkable success of deep learning relies heavily on large-scale datasets with human-annotated labels [1]. However, continually labeling large-scale datasets is extremely time-consuming, expensive, and laborious, and it tends to become a bottleneck for deep learning when labeled data are limited. To tackle this issue, Deep Active Learning (DAL) has recently shown great potential. As Fig. 1 shows, DAL models are first trained on an initial training dataset. Then, query strategies are iteratively applied to select the most informative and representative samples from a large pool of unlabeled data. Finally, an oracle labels the selected samples and adds them to the training dataset for retraining or fine-tuning of the DAL models. DAL aims to achieve competitive performance while reducing annotation costs within a reasonable time [2]-[4]. Benefiting from the strong representation capabilities of various neural networks, such as Graph Neural Networks (GNNs) [5], Convolutional Neural Networks (CNNs) [6], and Transformers [7], as well as leveraging prior knowledge from pre-trained models like Contrastive Language-Image Pre-Training (CLIP) [8] and Generative Pre-trained Transformer (GPT) [9], DAL has made significant advances.
As a methodology for selecting or generating a subset of training data in data-centric AI, DAL is closely related to learning settings and practical techniques including curriculum learning [10], transfer learning [11], data augmentation or pruning [12], [13], and dataset distillation [14]. What these methods have in common is that they train or fine-tune a model on a small number of samples, aiming to remove noise and redundancy and to improve training efficiency without degrading the model's performance on downstream tasks. However, one primary difference from DAL is that these approaches have full access to all labels when selecting, distilling, or generating training subsets. DAL assumes by default that all data are unlabeled during training-subset selection, making it better suited for real-world scenarios where labels are initially unavailable.
To summarize DAL methodologies, recent efforts have focused on specific tasks such as text classification [15] and image analysis [16], [17], specific domains like NLP [18] and CV [19], [20], or reproducing mainstream baselines [21], [22]. A common shortcoming of earlier surveys is that they either discuss recent advances only briefly [23]-[25] or do not summarize emerging learning paradigms (e.g., contrastive learning) and challenges [26], [27], especially in light of rapidly developing deep learning techniques (e.g., fine-tuning pre-trained models). To assist researchers in reviewing, summarizing, and planning for future exploration, we provide a comprehensive review encompassing the latest advancements and insights in the field. While some survey papers focus on stream-based DAL [28], this paper concentrates on pool-based DAL.
Specifically, we first introduce our strategy for collecting reviewed papers and explain our criteria for selecting them in Section II. Then, we give a formal definition of DAL in Section III-A and chronologically summarize the most influential DAL baselines and the widely used datasets in Section III-C. As Fig. 2 shows, in Section IV, we develop a high-level taxonomy to provide a broad overview of this field, categorizing previous studies from five perspectives. In Section IV-A, we classify the annotation types into hard, soft, hybrid, explanatory, and random/multi-agent annotations, and give a detailed introduction to each annotation type. In Section IV-B, we summarize query strategies into five distinct categories, including uncertainty-based, representative-based, influence-based, Bayesian-based, and hybrid methods, and analyze the strengths and weaknesses of each query type. In Section IV-C, deep model architectures are mainly categorized into Recurrent Neural Networks (RNNs), CNNs, GNNs, and pre-trained methods, and we discuss the benefits and drawbacks of each type of architecture. In Section IV-D, we find that various emerging learning paradigms, such as Curriculum Learning and Continual Learning, have shown promising results when combined with DAL. For each learning paradigm, we provide a detailed description of its definition and how to integrate it with DAL. In Section V, we present domains in which DAL methods have been successfully applied, including NLP, CV, DM, etc. As depicted in Fig. 3, despite the remarkable progress in DAL, this rapidly developing field still faces several crucial emerging challenges. In Section VI, we analyze the causes and opportunities of each challenge, which can be summarized as follows:
• Pipeline-related: inefficient & costly human annotation, insufficient research on stopping strategies, and cold-start;
• Task-related: difficulty in cross-domain transfer, unstable performance, and lack of scalability and generalizability;
• Dataset-related: outlier data & oracles, data scarcity & imbalance, and class distribution mismatch.
Finally, after organizing and summarizing the current DAL-related research, we have four intriguing findings that we would like to share with readers:
(1) As shown in Section IV-E, DAL has great potential as a sample selection strategy for applying large-scale pre-trained models with billions of parameters in few-shot or one-shot settings [29], [30]. Furthermore, as discussed in Section III-C, many studies have shown that using only 10∼20% labeled samples for fine-tuning pre-trained language models with billions of parameters can yield even better performance and be 5∼10 times more efficient than training with a fully labeled dataset [31], [32].
(2) Intuitively, having more high-quality samples can improve model performance for some tasks. Thus, as shown in Section IV-D, many works integrate DAL with semi-supervised strategies, which makes it possible to obtain more high-quality labeled samples without increasing the need for human labor. However, as discussed in Section VI-C, semi-supervised methods are highly sensitive to outliers and erroneous labels, easily fueling a vicious cycle in which models keep labeling samples with wrong pseudo-labels. How to effectively integrate DAL with semi-supervised strategies, using human-labeled true signals to guide semi-supervised annotation and avoid this mislabeling cycle, remains an open and challenging issue.
(3) From the detailed analysis of Scalability & Generalizability in Section VI-B, although DAL has achieved great success in classification tasks, comparing various DAL methods to choose the optimal one for a given task remains time-intensive and unrealistic in practice. Thus, there is an urgent need for a universal framework that is friendly to various downstream tasks.
(4) By summarizing DAL applications for NLP in Section V-A, we find that only a few DAL studies focus on generative tasks. Generative tasks, such as summarization and question answering, urgently require more attention and research compared to classification tasks. This is because generating informative objects, such as annotations, is more difficult and time-consuming. Defining the most meaningful samples for generation tasks and explaining why those samples play an important role are two core problems that need to be solved. We hope that future research can promote the development of DAL for generation tasks.
Overall, the main contributions of this paper are as follows:
• This is the latest comprehensive and systematic survey paper on DAL, intended to help researchers review, summarize, and look ahead to the future of DAL.
• Based on the novel DAL taxonomy, we explain and discuss the methodology in detail, covering annotation types, query strategies, deep model architectures, learning paradigms, and training processes.
• The difficult challenges in DAL are presented from multiple perspectives. Through a detailed analysis of challenges and current studies, we discuss possible advanced solutions for them.
• A GitHub repository is available with the most up-to-date DAL techniques, including papers, code, and datasets.
The remainder of this survey is organized as follows. Section II shows the collection of DAL papers. Section III introduces important DAL baselines and datasets. Section IV details the taxonomy of DAL methods. Section V reviews DAL-related applications. Section VI introduces DAL challenges and opportunities. Section VII ends this article with the conclusions.

II. PAPER COLLECTION AND FILTERING
We first determine relevant keywords used to search articles and create an initial keyword list, as shown in Fig. 4. We perform searches across multiple databases using all possible 3-keyword combinations from defined keyword groups, such as "Active Learning", "Machine Learning", and "Open-set". The databases searched include Google Scholar, Scopus, Semantic Scholar, and Web of Science. We limit the number of papers collected per query to 200, and the publication date ranges from January 2013 to March 2023. We collect a total of 10,000 research papers from various sources and obtain 3,967 unique papers after removing any duplicates. Fig. 4 shows the trend of these articles over time, revealing a growing interest in the topic we are investigating. To ensure the relevance of the collected articles to DAL, we conduct a detailed manual inspection of their abstracts. As a result, we identify 1,273 articles that are considered interesting and pertinent for our study. Based on the collected materials, we employ these keywords to perform a final filtering process and also consider the reputation of the conferences or journals in which the papers were published, as well as their impact. This approach further refines our dataset, resulting in 405 articles that are selected for systematic analysis, of which 220 articles are finally summarized and discussed, focusing on their key findings and contributions. This rigorous analysis ensures that the articles are relevant and provide valuable insight into the field of DAL.

III. DEEP ACTIVE LEARNING
In this section, we first introduce the basic notation and definition of DAL and then discuss the most important DAL baselines based on their relevance and chronological order.
Algorithm 1: Pool-based DAL (only the final steps are recoverable)
7: $M_i \leftarrow$ Train $M_{i-1}$ on $D^{i}_{\text{train}}$;
8: end while

A. Notations & Definition
We focus on pool-based DAL methods since most DAL methods belong to this category. Pool-based DAL methods iteratively select the most informative samples from a large pool of unlabeled data until either the base model reaches a target level of performance or a pre-defined budget is exhausted. As shown in Algorithm 1, we use a classification task as an example for illustration, while other tasks follow the typical definition of their task domains. Given an initial labeled training dataset $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{m}$ and a large-scale pool of unlabeled data $D_{\text{pool}} = \{x_i\}_{i=1}^{n}$, where $m \ll n$, $x_i$ represents the feature vector of the $i$-th sample, and $y_i \in \{0, 1\}$ is the class label for binary classification (or $y_i \in \{1, \ldots, k\}$ for multi-class classification), the DAL procedure is carried out in $T$ iterations. In the $i$-th iteration, a batch of samples $Q_i$ with batch size $b$ is selected from $D^{i-1}_{\text{pool}}$ on the basis of the base model $M$ and an acquisition function $\alpha(\cdot)$. These samples $Q_i$ are then labeled by an oracle and added to the $i$-th training dataset $D^{i}_{\text{train}}$, with which the model $M$ is then re-trained. DAL terminates when the labeling budget $Q$ is exhausted or the desired performance of the model is reached.
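The pool-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the survey's Algorithm 1: the base model is a stand-in nearest-centroid classifier, the acquisition function is predictive entropy, and all function names (`train_centroids`, `predict_proba`, `pool_based_dal`) as well as the `oracle` callable are hypothetical. The initial labeled set is assumed to contain at least one sample per class.

```python
import numpy as np

def train_centroids(X, y, num_classes):
    """Stand-in base model M: one mean vector (centroid) per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to the class centroids."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def entropy_acquisition(probs):
    """Acquisition function alpha(.): predictive entropy per sample."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def pool_based_dal(X_train, y_train, X_pool, oracle, b, budget, num_classes):
    """Query batches of size b until the labeling budget is exhausted."""
    model = train_centroids(X_train, y_train, num_classes)
    pool_idx = np.arange(len(X_pool))  # indices still unlabeled
    spent = 0
    while spent < budget and len(pool_idx) > 0:
        scores = entropy_acquisition(predict_proba(model, X_pool[pool_idx]))
        k = min(b, budget - spent, len(pool_idx))
        top = np.argsort(-scores)[:k]          # positions of the k highest scores
        query = pool_idx[top]                  # batch Q_i
        X_train = np.vstack([X_train, X_pool[query]])
        y_train = np.concatenate([y_train, oracle(query)])  # oracle labels Q_i
        pool_idx = np.delete(pool_idx, top)    # remove Q_i from the pool
        spent += k
        model = train_centroids(X_train, y_train, num_classes)  # re-train M
    return model, X_train, y_train
```

In practice the stand-in model would be replaced by a deep network and the stopping test could also check a validation metric; the loop structure stays the same.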

B. Comparisons between Traditional and Deep AL
The differences between traditional and deep AL mainly lie in the following two aspects: (1) Most traditional AL methods use fixed pre-processed features to calculate uncertainty/representativeness. In deep learning tasks, feature representations are jointly learned with Deep Neural Networks (DNNs). Therefore, feature representations change dynamically during DAL, and the pairwise distances/similarities used by representativeness-based measures need to be recomputed in every stage. In contrast, for traditional AL with classical ML tasks, these pairwise terms can be precomputed once [22]. (2) DAL can leverage advanced large-scale pre-trained language models to achieve comparable performance in few-shot or one-shot settings. In contrast, traditional AL methods in few-shot or one-shot settings may not meet the minimum number of training samples needed to achieve comparable performance [30], [33]. On the other hand, the main similarity between traditional and deep AL methods is that both use a small number of the most informative samples to train models, thereby improving efficiency and reducing reliance on labeled samples.

C. Important DAL Baselines and Datasets
The most important baselines for DAL are carefully categorized in Table I from six perspectives to provide readers with a complete understanding of the development of DAL and the identification of the most relevant works. These influential studies have achieved breakthroughs in designing new DAL methods, tackling novel tasks, or integrating with emerging learning paradigms. They have been published in influential international conferences or high-quality journals in machine learning, CV, NLP, etc., and have been highly cited, with more than 100 total citations or more than 10 citations per year.
BCBA [34] pioneers the combination of AL with Bayesian neural networks (BNNs), using Monte Carlo dropout as a variational Bayesian approximation for image classification. Based on this, DBAL [35] proposes an uncertainty-based query strategy for high-dimensional image classification. To expand the number of labeled samples without increasing human labor, CEAL [36] combines DAL with semi-supervised strategies by assigning pseudo-labels to high-confidence samples while requesting annotations for the most uncertain samples. Relying on a single query strategy may lead to errors. Thus, ESNN [37] uses a deep ensemble of DNNs to measure sample uncertainty from multiple aspects and achieves good robustness on unbalanced datasets. However, the aforementioned methods are criticized for being less effective for batch DAL [45]. To address this issue, CoreSet [41] selects informative batches that cover the whole data distribution, and BatchBALD [45] uses mutual information to identify the most informative batches. Cluster-Margin [55] aims to select informative and diverse mini-batches to improve accuracy and efficiency.
To better help DAL adjust to different tasks, reinforcement learning provides detailed rewards for dynamically controlling query strategies. For example, PAL [38] learns a deep reinforcement learning-based Q-network as an adaptive policy to select data samples for labeling. Similarly, DRAL [46] uses a reinforcement learning framework to dynamically adjust the acquisition function via rewards to obtain high-quality queries. UCBVI [62] provides a new modification to the Q-network formulation for reward-free exploration, significantly reducing query complexity. However, reinforcement learning requires a large amount of training data and human-designed rewards, which is difficult for many real-world applications. To address this issue, meta learning and transfer learning have become the main solutions. LAL [39] trains a regressor to learn optimal query strategies for downstream tasks. MAML [59] combines meta learning and DAL by initializing an active learner with meta-learned parameters obtained through meta-training on tasks similar to the target task during DAL. DLER [47] designs an architecture to learn a transferable model from a high-resource setting to a low-resource one, allowing DAL to select a few informative samples based on the knowledge of the source domain. AADA [50] jointly considers domain alignment, uncertainty, and diversity for sample selection.
To enlarge the labeled training dataset for DNNs without incurring additional human labor costs, semi-supervised and self-supervised DAL methods have been proposed. MIAL [44] pioneers semi-supervised DAL using cluster-based strategies to measure sample informativeness. ASM [43] combines self-learning with DAL, designing a selector function to selectively and seamlessly determine the confidence of the samples, where high-confidence samples are labeled by a pseudo-labeling module and low-confidence samples are labeled by humans. CSAL [51] first uses semi-supervised learning to distill information from unlabeled data during the training stage and then uses consistency-based sample selection for DAL. TOD [54] leverages a novel unlabeled data sampling strategy for data annotation in conjunction with a semi-supervised training scheme to improve the performance of the task model with unlabeled data. Recently, data augmentation has been extended to deep generative models that produce virtual instances to help expand training datasets. GAAL [40] introduces a generative adversarial network into the DAL query method to generate informative samples to train the model. BGADL [48] extends GAAL and combines generative adversarial DAL with Bayesian data augmentation to generate diverse and informative samples. DFAL [42] uses adversarial DAL to select samples close to the decision boundary as the most informative samples for DAL. VAAL [49] learns a latent space using a variational autoencoder (VAE) to generate new informative samples and trains an adversarial network to discriminate labeled and unlabeled data. Inspired by these works, TA-VAAL [57] incorporates a learning loss prediction module and a task ranker to enable task-aware sample selection. SRAAL [52] proposes a relabel adversarial model that aims to obtain the most informative unlabeled samples. LADA [56] anticipates the impact of data augmentation by scoring both real and virtually augmented instances, allowing training on informative labeled and augmented data.
Large-scale pre-trained language models (PLMs) have achieved great success and become a milestone in artificial intelligence. Owing to sophisticated pre-training objectives and huge numbers of parameters, large-scale PLMs effectively capture knowledge from massive labeled and unlabeled data. DAL also ushers in a new paradigm by leveraging the prior knowledge in PLMs to enable few-shot or zero-shot learning for many downstream tasks. ALPS [31] extracts knowledge from PLMs to select the first batch of data using the masked language modeling loss, which addresses the cold-start problem of DAL. Ein-Dor et al. [53] use multiple DAL methods to select samples for fine-tuning in BERT-based text classification, achieving comparable or higher performance than fine-tuning on full datasets with only 10%∼20% labeled samples. Karamcheti et al. [58] use DAL to identify and remove noisy data, select balanced samples to fine-tune PLMs, and achieve better performance in visual question answering. BATL [32] is a task-independent batch acquisition method for PLMs that uses a triplet loss to identify hard samples, i.e., samples in the unlabeled data pool that have similar features but labels that are difficult to determine. TYROGUE [60] designs an interactive DAL framework to flexibly select samples to fine-tune PLMs for multiple low-resource tasks. Schroder et al. [61] extend the PLMs using available unlabeled data for greater adaptability and introduce effective fine-tuning for the robustness of DAL in low-resource and high-resource settings.
As shown in Table II, we also summarize the most widely used datasets in DAL, including images, text, and audio.

TABLE II: Widely used datasets in DAL
Dataset | Size | Modality | Task
(name missing in source) [64] | 600,000 | Images | Classification, Localization
ImageNet [65] | 1.2M | Images | Classification, Detection
MSCOCO [66] | 123,287 | Images | Object detection
Cityscapes [67] | 5,000 | Images | Semantic segmentation
Caltech-101 [68] | 9,000 | Images | Classification
SST [69] | 11,855 | Text | Sentiment analysis
TREC [70] | 5,952 | Text | Question answering
SNLI [71] | 570,000 | Text | Natural language inference
IMDB [72] | 50,000 | Text | Sentiment analysis
AGNews [73] | 31,900 | Text | Classification
PubMed [74] | 19,717 | Text | Document classification
YouTube-8M [75] | 237,000 | Audio | Classification
MIMIC-III [76] | 112,000 | Medical | Healthcare analytics

A. Annotation Type
1) Hard annotations provide one or multiple discrete categorical labels independently for each sample. For example, Citovsky et al. [55] annotate each image with a specific label such as "balloon" or "strawberry" for an image classification task. Wiechman et al. [77] design an online annotation system to assign multiple labels to long documents based on their sentiments, topics, and spam/non-spam status.
2) Soft annotations allow continuous and subjective labels for samples. For instance, ReDAL [78] annotates continuous 2D region labels for 3D point clouds in semantic segmentation. Kothawade et al. [79] use mutual information as an auxiliary metric to select annotation regions in images for autonomous vehicles. Xie et al. [80] propose a region-based approach to automatically query a small subset of image regions to label while maximizing segmentation performance.
3) Hybrid annotations combine automatic pseudo-labels for high-confidence predictions with human labeling of low-confidence samples in an iterative self-paced manner [43]. For example, Wang et al. [36] propose a complementary sample selection strategy to progressively choose the most informative samples, pseudo-labeling high-confidence predictions for training. Yu et al. [81] jointly use the expertise of different annotation groups, inter-relations between workers, and label correlations within groups. By weighting groups, they reduce the impact of low-quality workers and calculate reliable consensus labels.
4) Explanatory annotations provide a hard or soft label along with an explanation for each annotation. For example, Schroder et al. [82] use topic-related annotations for environmental texts. Similarly, Yan et al. [83] annotate the text and list keywords as evidence of the accuracy of the label. Unlike the above methods, Zhou et al. [84] annotate samples by minimizing correlations between tasks and provide explainable medical knowledge to distinguish selected samples.
5) Random/multi-agent annotations use multiple independent pseudo-annotators to randomly label new unlabeled samples without human input [85]. For example, Gong et al. [86] use an agent team to collaboratively select informative images for annotation based on the decisions from the other agents.
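The hybrid annotation scheme described in this subsection, where confident predictions are pseudo-labeled and uncertain ones go to a human, can be sketched as follows. This is a minimal illustration, not the procedure of [36] or [43]; the function name `split_for_hybrid_annotation` and the fixed `confidence_threshold` are assumptions made for the example.

```python
import numpy as np

def split_for_hybrid_annotation(probs, confidence_threshold, b):
    """Split unlabeled samples into a pseudo-labeled set and a human query set.

    probs: array of shape (N, C) with predicted class probabilities.
    Returns (pseudo_idx, pseudo_labels, query_idx).
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)  # top-class probability per sample
    # High-confidence samples receive the model's own prediction as a label.
    pseudo_idx = np.flatnonzero(confidence >= confidence_threshold)
    pseudo_labels = probs[pseudo_idx].argmax(axis=1)
    # Among the rest, the b least confident samples are sent to the oracle.
    rest = np.flatnonzero(confidence < confidence_threshold)
    query_idx = rest[np.argsort(confidence[rest], kind="stable")[:b]]
    return pseudo_idx, pseudo_labels, query_idx
```

A fixed threshold is the simplest choice; a schedule that tightens or relaxes the threshold over DAL rounds would fit the "self-paced" description but is not shown here.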

B. Query Strategy
1) Uncertainty-based methods aim to select the most ambiguous samples according to model predictions. Given an input $x_i$, the score is computed from the predictive distribution $P(\hat{y}_i \mid x_i)$, which represents the likelihood that $x_i$ is classified into the $i$-th class [87]. Uncertainty-based methods focus on designing various score functions to measure sample uncertainty and informativeness, including predictive entropy [87], least confidence [88], highest estimated dual variables [89], and mutual information between model posterior and predictions [79]. Some strategies treat samples near the decision boundary as the most uncertain ones [90], such as instances close to the hyperplane [44] or close to the margin [91]. Others combine multiple query strategies, forming a query-by-committee [92] or disagreement-based [93] DAL strategy to decrease errors made by a single query strategy. With the development of adversarial learning, instead of selecting samples from unlabeled datasets, some models generate the most informative and uncertain synthetic samples to expand the training dataset [48].
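Three of the score functions named above can be written directly from their textbook definitions. The sketch below assumes a matrix `probs` of predicted class probabilities with one row per sample; the function names are illustrative, and each score is oriented so that a larger value means a more uncertain sample.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the predictive distribution; largest for a uniform prediction."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def least_confidence(probs):
    """One minus the probability of the most likely class."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs.max(axis=1)

def margin_uncertainty(probs):
    """One minus the gap between the two most likely classes."""
    probs = np.asarray(probs, dtype=float)
    top2 = np.sort(probs, axis=1)[:, -2:]  # second-best and best probabilities
    return 1.0 - (top2[:, 1] - top2[:, 0])
```

For binary classification the three scores rank samples identically; they can disagree once there are three or more classes.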
However, uncertainty-based methods have some common drawbacks: (1) redundant uncertain samples are repeatedly selected, so the selected set lacks coverage; (2) scoring each sample in isolation lacks robustness to outliers; (3) these task-specific designs exhibit limited generalizability.
2) Representative-based methods aim to sample the most prototypical data points that effectively cover the distribution of the entire feature space. Existing methods can be categorized into density-based and diversity-based approaches. Density-based methods prefer to select samples that can represent all unlabeled samples. They use clustering methods to select cluster centers [94] as the most representative samples or select samples that maximize probability coverage of the whole feature space of unlabeled datasets [41]. For example, Kim et al. [95] design a density-aware coreset approach to estimate sample densities and preferentially select diverse points from sparse regions. Given the input $x_i$, its density is estimated from $N(x_i, k)$, the $k$-nearest neighbors of $x_i$ [95].
Coleman et al. [96] and Gudovskiy et al. [97] achieve efficiency by only considering nearest neighbors rather than all data or matching feature densities with self-supervised methods.
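The coverage idea behind these methods is often approximated with a greedy k-center rule: repeatedly pick the pool point that is farthest from everything already chosen. The sketch below is a generic version of that rule under the assumption that `features` holds one embedding per sample and `labeled_idx` is non-empty; it is not claimed to reproduce the exact procedure of [41] or [95], and the function name is illustrative.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, b):
    """Greedily select b indices that best cover the feature space.

    features: array of shape (N, D) holding labeled and unlabeled samples.
    labeled_idx: indices of already labeled samples (initial centers).
    """
    features = np.asarray(features, dtype=float)
    centers = features[labeled_idx]
    # Distance from every point to its nearest existing center.
    dist = np.linalg.norm(
        features[:, None, :] - centers[None, :, :], axis=2
    ).min(axis=1)
    selected = []
    for _ in range(b):
        i = int(np.argmax(dist))  # farthest point from all current centers
        selected.append(i)
        # The new center may now be the nearest one for some points.
        dist = np.minimum(dist, np.linalg.norm(features - features[i], axis=1))
    return selected
```

Because deep features change after each re-training round, the distances would need to be recomputed in every DAL iteration, which is the main cost of this family of methods.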
Diversity-based methods prefer to select samples that are different from the labeled samples. They use context-sensitive methods [98] that take into account the distance between a sample and its surrounding labeled samples to enrich the diversity of the labeled dataset. BMAL [99] performs DAL for the image labeling problem, where diversity is measured by the KL divergence between the class-probability distributions of similar neighboring instances. Other diversity-based methods train a model, such as adversarial networks [57], contrastive networks [100], hierarchical clustering [44], and pre-trained models [53], to help discriminate labeled and unlabeled sets and select the most different unlabeled samples. For example, Li et al. [101] explicitly learn a non-linear embedding to select representative samples. Parvaneh et al. [102] explore neighborhoods around unlabeled data by interpolating features with labeled points. Li et al. [103] propose an acquisition function that measures mutual information between a batch of queries to encourage diversity. To further increase label efficiency, Citovsky et al. [55] use hierarchical clustering to diversify batches, requiring only 40% of the labels to achieve the same target performance. However, since they use ResNet-101 as their backbone, which has only about 170 MB of parameters, more than 20% labeled samples are required for fine-tuning the model.
However, the aforementioned representative-based methods, which focus solely on sampling diverse samples, are often insensitive to samples that are close to the decision boundary (excluding hybrid methods that jointly consider representativeness and uncertainty), even though such samples are probably more important to the prediction model, as suggested by Zhao et al. [104]. In addition, representative-based methods work well only for small datasets and classifiers with a small number of classes, since their computational complexity is almost quadratic with respect to data size [55].
3) Influence-based methods aim to select samples that will have the greatest impact on the performance of the target model. These techniques can be categorized into three main groups.
(1) The first group directly measures the expected impact on the model through metrics such as gradient norm [105], query complexity [106], kernel approximation [107], KL divergence [97], change of loss function [108] or model parameters [54], and expected error reduction (EER) [109]. Specifically, EER scores each candidate by the expected reduction in the model's error once that sample is labeled and added to the training set, where $x_s$ refers to the labeled sample. (2) The second group incorporates different learning policies, such as reinforcement learning and imitation learning, to select samples based on reward signals or demonstrated actions. Despite the promising advantages, this requires significant additional training [110]. For example, Wertz et al. [111] propose reinforced DAL, a reinforcement learning policy that uses multiple elements of the data and the task to dynamically pick the most useful unlabeled subset during the DAL process. (3) The last group trains a separate model to estimate the impact on the target model [89]. For example, Peng et al. [14] propose a knowledge distillation framework to evaluate the impact of samples based on the knowledge learned by the student model. Elenter et al. [89] use the dual variables of the original model to measure the impact on the target model.
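For reference, one common textbook form of expected error reduction with 0/1 loss is given below. It is offered as a hedged illustration rather than the exact expression used in [109]: the symbols $\theta$ (current model parameters), $\theta^{+(x,y)}$ (parameters after re-training on $D_{\text{train}} \cup \{(x, y)\}$), and $x'$ (a pool sample used to estimate the remaining error) are assumptions introduced here.

```latex
x^{*} = \arg\min_{x \in D_{\text{pool}}}
        \sum_{y} P_{\theta}(y \mid x)
        \left[ \frac{1}{|D_{\text{pool}}|}
               \sum_{x' \in D_{\text{pool}}}
               \Bigl( 1 - \max_{y'} P_{\theta^{+(x,y)}}(y' \mid x') \Bigr)
        \right]
```

The outer sum averages over the unknown label of the candidate, weighted by the current model's belief; the inner term estimates the error that would remain after re-training. The need to re-train once per candidate and per possible label is what makes this criterion expensive for deep models.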
However, despite recent advances, influence-based DAL remains challenging. Directly measuring model changes or incorporating new learning policies usually incurs large time and space costs, while training a separate model makes the selection over-reliant on that model's accuracy and often leads to unstable results.
4) Bayesian methods aim to minimize classification errors and improve model beliefs by leveraging Bayes' rule. Most studies have treated Bayesian models (e.g., Gaussian process [109], BNNs [35], Bayesian probabilistic ensemble [112]) as uncertainty-based methods, using them to estimate the informativeness of the sample. However, Bayesian DAL is better viewed as its own distinct system, with methods that select batches by directly measuring impact on the target model, such as BatchBALD [45] and Causal-BALD [113]. For example, given a Bayesian model with parameters $w \sim p(w \mid D_{\text{train}})$, BALD estimates the mutual information between the model predictions and the model parameters, formulated as:
$$\mathbb{I}[y; w \mid x, D_{\text{train}}] = \mathbb{H}[y \mid x, D_{\text{train}}] - \mathbb{E}_{p(w \mid D_{\text{train}})}\bigl[\mathbb{H}[y \mid x, w]\bigr],$$
where $\mathbb{H}$ represents the entropy and $\mathbb{E}$ is the expectation. Compared to standard DNNs, the aforementioned Bayesian DAL methods, which leverage the advantages of probabilistic graphical theory [35], can often provide reasonable explanations for why these samples should be selected [45]. However, they often require extensive and accurate prior knowledge and tend to underperform deep learning models in representation learning and fitting capacity.
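In practice, the mutual information between predictions and parameters is commonly estimated from several stochastic forward passes (for example, with Monte Carlo dropout) as the entropy of the averaged prediction minus the average per-pass entropy. The sketch below assumes those passes have already been collected into an array `mc_probs` of shape (S, N, C); the function name is illustrative and the estimator is the standard Monte Carlo approximation, not code from [45].

```python
import numpy as np

def bald_scores(mc_probs):
    """Monte Carlo estimate of BALD for each of N samples.

    mc_probs: array of shape (S, N, C) with class probabilities from
    S stochastic forward passes, i.e., S draws of the parameters w.
    """
    mc_probs = np.asarray(mc_probs, dtype=float)
    eps = 1e-12
    mean_probs = mc_probs.mean(axis=0)  # (N, C), approximates p(y | x, D_train)
    # H[y | x, D_train]: entropy of the averaged prediction.
    entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
    # E_w[H[y | x, w]]: average entropy of the individual passes.
    per_pass_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(axis=2)  # (S, N)
    mean_of_entropy = per_pass_entropy.mean(axis=0)
    return entropy_of_mean - mean_of_entropy
```

A sample on which the passes confidently disagree gets a high score, while a sample on which every pass is equally unsure gets a score near zero, which is what separates this criterion from plain predictive entropy.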
5) Hybrid methods aim to take advantage of the above multiple query strategies and to achieve a trade-off among them. Hybrid methods can be further categorized according to interaction patterns. Serial-form hybrids apply criteria sequentially within a DAL cycle, filtering out non-informative samples until the batch is filled [55]. Criteria-selection hybrids use only one query strategy in each DAL iteration, selecting the query strategy or network architecture that scores highest on a given criterion. For example, DUAL [114] switches between density-based and uncertainty-based selectors to choose the best criterion for each DAL cycle. Unlike DUAL, iNAS [115] searches a restricted candidate set to find the optimal model architecture incrementally in each DAL iteration. Parallel-form hybrids use multi-objective optimization methods or a weighted sum to merge multiple query criteria into one for sample selection. For example, Gu et al. [2] efficiently acquire batches with discriminative and representative samples by proposing procedures to update labeled and unlabeled sets, based on path-following optimization techniques. Citovsky et al. [55] jointly optimize the uncertainty and diversity criteria in batch mode using multi-objective acquisition functions. TOD [54] selects samples with high model uncertainty and output discrepancy through a weighted combination of both metrics.
Hybrid methods combine the advantages of different query strategies.However, determining the most effective combinations and trade-offs between criteria is time consuming and still remains open for further investigation.
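A parallel-form hybrid that merges criteria through a weighted sum can be sketched as follows. This is an illustrative toy, not the procedure of any cited method: the choice of predictive entropy as the uncertainty term, nearest-neighbor distance as the diversity term, the greedy batch construction, and the parameter `alpha` are all assumptions made for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_batch(pool_feats, pool_probs, labeled_feats, batch_size, alpha=0.5):
    """Greedy weighted-sum hybrid: alpha * uncertainty + (1 - alpha) * diversity.

    Assumes at least two classes and a non-empty labeled set.
    """
    max_entropy = math.log(len(pool_probs[0]))
    reference = list(labeled_feats)  # labeled points plus points chosen so far
    remaining = set(range(len(pool_feats)))
    chosen = []
    while remaining and len(chosen) < batch_size:
        # Diversity: distance from each candidate to its nearest reference point.
        dists = {i: min(euclidean(pool_feats[i], r) for r in reference)
                 for i in remaining}
        max_dist = max(dists.values()) or 1.0  # avoid division by zero

        def score(i):
            uncertainty = entropy(pool_probs[i]) / max_entropy
            diversity = dists[i] / max_dist
            return alpha * uncertainty + (1 - alpha) * diversity

        best = max(sorted(remaining), key=score)
        chosen.append(best)
        reference.append(pool_feats[best])
        remaining.remove(best)
    return chosen


pool_feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
pool_probs = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
labeled_feats = [[0.0, 0.0]]
print(select_batch(pool_feats, pool_probs, labeled_feats, 1, alpha=1.0))
print(select_batch(pool_feats, pool_probs, labeled_feats, 2, alpha=0.0))
```

Setting `alpha` to 1 or 0 recovers a pure uncertainty or pure diversity selector, which makes the open question raised above concrete: the quality of the batch depends on a weight that must be tuned per task.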

C. Model Architecture
Traditional Machine Learning architectures, such as Forest [39] and Support Vector Machine (SVM) [44], are statistical models that do not use neural networks. They attracted great attention in the early stages of DAL development.
Bayesian Neural Networks (BNNs) combine neural networks with Bayesian inference, quantifying the uncertainty introduced by the models in terms of outputs and weights to explain the trustworthiness of the prediction [116]. Many studies propose DAL strategies based on BNNs, aiming to improve efficiency and explainability in sample selection [38], [45].
Recurrent Neural Networks (RNNs) [117] use their reasoning from previous experiences to predict upcoming events and are able to learn features with long-term dependencies. They have been widely used for sequential data such as text and audio. DAL is seldom combined with RNNs since they require large-scale labeled datasets for training. Some special tasks with easily recognizable patterns, such as malicious word detection on social networks [118], can be solved with DAL.
Convolutional Neural Networks (CNNs) [6] are feedforward neural networks that can extract features from data with convolution structures and have been widely used for image processing with three advantages: local connections, weight sharing, and down-sampling dimensionality reduction. DAL can be effectively combined with CNNs since Sener et al. [41] proved that a subset of samples (coreset) can geometrically characterize all features of the entire image set and can be selected by minimizing a rigorous bound. Following their study, more studies have been conducted [49], [55].
Graph Neural Networks (GNNs) [5] learn node representations by aggregating neighborhood information and achieve great success in various tasks, such as node classification. However, effectively handling graph data with dense interconnections between samples using limited labeled data remains an open challenge [119]. DAL can help address this by selectively querying labels for the most informative samples and executing only one training epoch to reduce the annotation cost for various types of graphs, such as homogeneous graphs [120], heterogeneous graphs [121] and attribute graphs [122].
Variational Autoencoders (VAEs) are a class of neural network architectures designed with an encoder-decoder framework [123]. They aim to capture the underlying data distribution and learn to generate samples that closely resemble the input data. VAE-based DAL methods usually generate samples to fool discriminators in an adversarial training manner, thus improving the discriminators' ability to select the most challenging-to-distinguish samples for training DAL models [49], [57].
Pre-Trained Language Models (PLMs), based on Transformers, utilize multi-head self-attention to capture long-term dependencies. By pre-training on large unlabeled corpora, PLMs embed substantial general knowledge and transfer to downstream tasks, enabling state-of-the-art (SOTA) performance [30]. For example, Seo et al. [32] identify the most informative samples for a given task, focusing on PLM fine-tuning, to learn salient patterns with minimal annotation cost. The combination of the rich knowledge foundation from pre-training and DAL's sample-efficient tuning unlocks further potential of PLMs for many applications.
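The coreset idea mentioned for CNNs in this section is commonly approximated with a greedy k-center selection over feature embeddings. The sketch below illustrates that greedy approximation on plain feature vectors; the function names, the use of Euclidean distance, and the toy data are assumptions for illustration, and the bound-minimizing formulation of [41] is not reproduced here.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_center_greedy(pool_feats, labeled_feats, budget):
    """Pick pool indices that greedily shrink the largest pool-to-center distance.

    Assumes a non-empty labeled set whose features act as the initial centers.
    """
    # Distance from every pool point to its nearest current center.
    min_dist = [min(euclidean(x, c) for c in labeled_feats) for x in pool_feats]
    selected = []
    for _ in range(min(budget, len(pool_feats))):
        idx = max(range(len(pool_feats)), key=lambda i: min_dist[i])
        if min_dist[idx] == 0.0:
            break  # every remaining point coincides with an existing center
        selected.append(idx)
        # The new center may now be the nearest one for some pool points.
        for i, x in enumerate(pool_feats):
            min_dist[i] = min(min_dist[i], euclidean(x, pool_feats[idx]))
    return selected


pool_feats = [[0.0], [1.0], [10.0], [11.0]]
labeled_feats = [[0.5]]
print(k_center_greedy(pool_feats, labeled_feats, budget=2))
```

In practice the feature vectors would come from an intermediate layer of the trained network, so the selection quality depends on how well that embedding reflects the data geometry.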

D. Learning Paradigm
Traditional Learning Paradigm, as illustrated in Algorithm 1, iteratively queries and labels samples to train the models in a vanilla supervised learning manner, without incorporating any advanced learning paradigms [32], [34].
Semi-supervised Learning, also known as weakly-supervised learning, aims to jointly use real-labeled samples and pseudo-labeled samples to train the models. Current DAL methods are designed with various efficient strategies to obtain pseudo-labels for unlabeled samples. For instance, DBAL [35] and CoreSet [41] first predict pseudo-labels using their models and then calculate the samples' confidence scores to judge whether these pseudo-labels should be trusted. On the other hand, LADA [56] and BGADL [48] propose new data augmentation methods to create more samples based on original labeled samples, using the original real labels as pseudo-labels. These studies effectively reduce human labor and achieve performance comparable to traditional supervised learning with more labeled samples.
Contrastive Learning improves feature representation by pulling similar instances closer together while pushing dissimilar instances apart [124]. Contrastive methods extract discriminative features, such as semantics [100] and distinctiveness [57], to estimate the sample uncertainty during acquisition. For example, as shown in Fig. 5, Du et al. [125] extract both semantic and distinctive features with contrastive learning and then combine them in a query strategy to choose the most informative unlabeled samples with matched categories.
Adversarial Learning enables a model to be trained in a fully differentiable manner by solving minimax optimization problems [49]. This approach can be used as a generative query technique for DAL. For example, DAL can be combined with generative adversarial networks, which consist of a generator and a discriminator, where the DAL model acts as the discriminator and the generator explores the distribution of unlabeled data to generate the most informative and uncertain synthetic samples for training [57]. Li et al. [122] propose SEAL, as shown in Fig. 6, which consists of two adversarial components. The graph embedding network encodes all nodes into a shared space, with the intention of making the discriminator treat all nodes as labeled. Additionally, a semi-supervised discriminator is used to differentiate unlabeled nodes from labeled ones. The divergence score of the discriminator is used as an informativeness measure to actively select the most informative node for labeling. The two components form a loop to mutually improve DAL.
[Fig. 6: An example for contrastive learning based query strategies. Recoverable labels: Pseudo Labels, Discriminator.]
Meta Learning enables DNNs to leverage the knowledge acquired from multiple tasks, represented in the network with their weights, to adapt faster to new tasks. Meta learning can provide an acquisition function for DAL [39], [126] or favorable model initialization during DAL by controlling the transfer of knowledge from multiple source tasks. For example, Shao et al. [127] propose Learning-to-Sample, where a boosting model and a sampling model dynamically learn from each other and iteratively improve performance. Zhu et al. [59] combine both paradigms by initializing an active learner with meta-learned parameters via meta-training on tasks similar to the target task.
Reinforcement Learning involves an agent that can interact with its environment and learn to alter its behavior in response to received rewards [119]. Given that almost all DAL methods use heuristic acquisition functions with limited effectiveness, DAL can instead be framed as a reinforcement learning problem that explicitly optimizes an acquisition policy. In this setup, an autonomous agent (acquisition selector) controlled by a deep learning algorithm observes a state s_t from its environment (predictor) at time t. It takes an action a_t to maximize the reward r_t (prediction accuracy), where a_t decides whether to query unlabeled samples [62].
Curriculum Learning mimics human and animal learning processes, where training progresses gradually from simple to complex samples. This provides a natural way to exploit labeled data for robust learning [10], [128]. Specifically, curriculum learning uses a predefined learning constraint to incrementally incorporate additional labeled samples during training. It introduces a weighted loss on all labeled samples, acting as a general regularizer over the sample weights. For example, Wang et al. [129] use a pseudo-labeling strategy that iteratively assigns pseudo-labels to unlabeled samples with high prediction confidence.
Continual Learning is developed for task-based settings, where the model continuously learns a sequence of tasks one at a time and all data for the current task are labeled and available in increments. However, real-world systems do not have the luxury of large labeled datasets for each new task. To address this issue, Mundt et al. [...]
[...] methods adapt models learned from a labeled source domain to a different unlabeled target domain with the same task, while inductive methods ensure that the source and target domains are the same but the tasks are different. DAL and transfer learning can enhance each other's performance by selecting the best target samples with a distribution similar to the source domain [50]. In addition, transfer learning can minimize the number of annotation labels needed and provide auxiliary information for DAL acquisition functions. For example, as shown in Fig. 7, Xie et al. [87] propose an energy-based active domain adaptation that balances domain representation and uncertainty when selecting target data.
Imitation Learning provides SOTA results in many structured prediction tasks by learning near-optimal search policies [92]. Such methods assume access to an expert during training that can provide the optimal action in any queried state, essentially asking "what would you do here?" and learning to mimic that choice. For example, Bullard et al. [132] use imitation learning to allow an agent in a constrained environment to concurrently reason about both its internal learning goals and externally imposed environmental constraints within its objective function.
Curriculum Learning Training gradually progresses from easy to complex samples, mimicking human and animal learning processes. This provides a natural and iterative way to exploit labeled data for robust learning. For example, Tang et al. [135] propose a self-paced DAL approach that jointly considers the value and difficulty of a sample. It queries samples from easy to hard to minimize annotation cost. Wang et al. [43] show that curriculum learning alone improves the accuracy of object detection by 3.6%, while the combination of curriculum learning and DAL improves the accuracy by 4.3%.
Pre-training & Fine-tuning (Pre+FT) has become a primary training process with the development of large-scale PLMs [58]. It leverages the rich prior knowledge in PLMs to solve different downstream tasks. DAL attracts attention as a sample selection strategy for fine-tuning, with only 10%∼20% of the labeled data achieving competitive performance compared to full-data fine-tuning [32]. DAL iteratively selects and annotates batches of informative samples to fine-tune the PLMs for the downstream task. This satisfies task-specific needs while also enabling few-shot learning [30].
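Several paradigms in this section rely on confidence-based pseudo-labeling: predictions that the model is confident about are used as labels, and only the remaining samples are sent to the oracle. The sketch below shows one simple way to make that split. The threshold value, the query size, and the function name are assumptions for illustration and do not correspond to a specific cited method.

```python
def split_by_confidence(pool_probs, threshold=0.95, query_size=2):
    """Split an unlabeled pool into pseudo-labeled and to-be-queried samples.

    pool_probs: list of class-probability vectors predicted by the current model.
    Returns (pseudo_labeled, to_query), where pseudo_labeled maps a pool index
    to its predicted class and to_query lists the least confident indices.
    """
    pseudo_labeled = {}
    uncertain = []
    for idx, probs in enumerate(pool_probs):
        confidence = max(probs)
        if confidence >= threshold:
            # Trust the model's own prediction as a pseudo-label.
            pseudo_labeled[idx] = probs.index(confidence)
        else:
            uncertain.append((confidence, idx))
    # Ask the oracle about the samples the model is least sure of.
    uncertain.sort()
    to_query = [idx for _, idx in uncertain[:query_size]]
    return pseudo_labeled, to_query


pool_probs = [[0.99, 0.01], [0.55, 0.45], [0.02, 0.98], [0.70, 0.30]]
pseudo, queries = split_by_confidence(pool_probs, threshold=0.95, query_size=1)
print(pseudo, queries)
```

A fixed threshold is the simplest choice; a threshold that is too low admits wrong pseudo-labels into training, which is one source of the noise discussed later in the survey.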

V. APPLICATIONS OF DAL
As shown in Table III, the integration of DL and AL is leading to an increasing application of AL methods in various domains of life, ranging from agricultural development [82] to industrial revitalization [82], and from artificial intelligence [137] to biomedical fields [160].In this section, we aim to provide a systematic and detailed overview of existing DAL-related work from a broad application perspective.

A. Applications in Natural Language Processing
With the emergence of large-scale language models, NLP has achieved great success using computers to help understand intricate languages. However, fine-tuning these language models requires a substantial amount of data, computation resources, and time. DAL provides a strategy for selecting a small set of high-quality samples to help fine-tune the model and save resources. In the following, we introduce some of the most influential DAL methods in NLP.

Table III (excerpt): applications of DAL with the strengths and weaknesses of representative methods.
| Task | Method | Strengths | Weaknesses |
| | [83], [136] | make the selection process efficient | high time consumption, unstable performance |
| | uncertainty sampling [61] | high efficiency and performance | vulnerable to outliers, unstable performance |
| | use pre-trained language models [137] | easily adapt to new datasets | vulnerable to outliers and imbalanced datasets |
| | | remove outliers and diverse sampling | vulnerable to document embeddings |
| | | efficiently minimize costly data annotations | wait for human reaction, need expert knowledge |
| Information Extraction | label identical subsequences [141] | high efficiency and effectiveness | lack of generalizability, cold-start |
| | label most novel words [142] | high efficiency and effectiveness | unstable performance, cold-start |
| | | select the most semantically varied samples | vulnerable to outliers, lack of scalability |
CV
| Image Captioning | semantic adversarial DAL [145] | overcome scarcity of labeled data | difficulty in cross-domain transfer, cold-start |
| | domain transfer learning [146] | transfer knowledge from high-resource | vulnerable to outliers, data scarcity |
| | | balance between label efforts and effect | vulnerable to outliers, imbalanced datasets |
| | | suppress noisy instances | unstable performance, lack of scalability |
| | | can learn an optimal sampling policy | vulnerable to outliers and imbalanced datasets |
| | | high efficiency and effectiveness | high time consumption, cold-start |
| Person Re-identification | human-in-the-loop [46] | improve model performance | high time consumption, lack of generalizability |
| | incremental annotation [155] | select diverse samples without redundancy | vulnerable to outliers, cold-start |
| | | stable performance | single sample selection costs much time |
| Link Prediction | multi-view DAL [156] | query informative samples from multi-view | lack of scalability and generalizability |
| | transfer learning DAL [157] | easily apply to new datasets | unstable performance, cold-start |
| | | efficiency and effectiveness | unstable performance, cold-start |

Interestingly, they achieve comparable performance on widely used text classification datasets while training on less than 20% of the labeled data, which demonstrates their ability to utilize limited labeled data. In another study, Jelenic et al. [137] conduct an initial empirical study to investigate the transferability of DAL using PLMs. They find that DAL can effectively adapt to new datasets with pre-trained models.
Abstractive Text Summarization (ATS) aims to compress a document into a brief, informative and readable summary that retains the key information of the original document. However, constructing human-annotated datasets is a time-consuming and costly endeavor. DAL is explored to reduce the amount of annotation needed while achieving a certain level of ATS performance. For example, Gidiotis et al. [138] address the issue from a Bayesian view and study uncertainty estimation for SOTA text summarization models. They augment the pre-trained summarization models with Monte Carlo dropout, forming the corresponding variational Bayesian PLMs. By generating multiple summaries from these models, they approximate Bayesian inference and estimate the summarization uncertainty. Experiments on multiple benchmark datasets consistently demonstrate improved summarization performance with higher Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores. Unlike the above method, as Fig. 9 (a) shows, Tsvigun et al. [139] propose an alternative query strategy for ATS based on diversity principles. This strategy, known as in-domain diversity sampling, involves selecting instances that are dissimilar from annotated documents, but similar to the core documents of the domain. Given a limited annotation budget, they can improve model performance and consistency scores.
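The in-domain diversity principle described above (prefer documents that resemble the domain but differ from what is already annotated) can be illustrated with a simple similarity-based score. This is a sketch under stated assumptions rather than the scoring function of [139]: the use of cosine similarity over document embeddings, the unlabeled pool as a proxy for the domain core, and the weight `lam` are all hypothetical choices.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def in_domain_diversity_scores(pool_embs, labeled_embs, lam=0.5):
    """Score each pool document: high if close to the pool, far from labeled data."""
    scores = []
    for i, x in enumerate(pool_embs):
        # Similarity to the rest of the unlabeled pool (proxy for the domain core).
        sim_domain = mean(cosine(x, u) for j, u in enumerate(pool_embs) if j != i)
        # Similarity to documents that already have annotations.
        sim_labeled = mean(cosine(x, l) for l in labeled_embs)
        scores.append(lam * sim_domain - (1 - lam) * sim_labeled)
    return scores


pool_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labeled_embs = [[0.0, 1.0]]
scores = in_domain_diversity_scores(pool_embs, labeled_embs)
print(scores)
```

In the toy data, the third document duplicates an annotated one and therefore receives the lowest score, while the first document is typical of the pool and unlike the labeled set.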
Question Answering involves answering questions about images or passages of text [161]. However, current models require large-scale training data to achieve high performance. DAL methods, such as Datamap [58] and hierarchical dialog policies [140], are designed to maximize performance with minimal labeling effort. Specifically, in Fig. 9 [...]
Semantic Parsing aims to convert a natural language utterance to a logical form: a machine-understandable representation of its meaning [162]. DAL can help reduce data requirements and improve efficiency for semantic parsing. For example, Duong et al. [143] design a simple hyperparameter selection technique for DAL to accelerate data annotation. Experiments show that their method significantly reduces the need for data annotation and improves the model's performance on semantic parsing. Li et al. [163] also design a hyperparameter tuning module to reduce the additional annotation cost. In addition, they design a novel query strategy that prioritizes examples with various logical form structures and more lexical choices, which further improves the performance of semantic parsing. Cohen et al. [144] propose a novel DAL method with two new annotation methods, called HAT. Experiments show that HAT can pick out the most semantically varied and illustrative utterances, leading to the highest possible gains in parser performance.

B. Applications in Computer Vision
With the remarkable success of CNNs and Vision Transformers, it has become clear that more labeled image data generally leads to better task performance. However, as the amount of data increases, training DNNs becomes time- and resource-consuming. Additionally, even as the amount of data increases, the presence of noise often limits the performance improvement. DAL can effectively reduce noise and time consumption in many CV tasks. Hereafter, we provide detailed information on specific tasks and the improvements achieved with DAL in CV.
Image Classification aims to accurately classify images based on the provided labels for many specific fields such as remote sensing [16], medical imaging [164] and face recognition [129]. We list the most successful DAL methods for image classification in Section III-C, such as BCBA, DBAL and CEAL, which can be referred to for more detailed information.
Image Captioning aims to automatically generate descriptive text about the content of an image. Achieving high-quality captioning requires large-scale datasets with diverse images. Unfortunately, creating such a dataset is time-consuming and costly. To tackle this issue, Zhang et al. [145] devise a novel adversarial DAL model, which uses visual and textual information to select the most representative samples to optimize the performance of image captioning. Experiments show that they overcome the limitations of labeled data scarcity and improve the practicality and effectiveness of image captioning. In a similar vein, Cheikh et al. [146] introduce a knowledge-transferable DAL framework for low-resource datasets. They take advantage of existing datasets, translating their captions into Arabic, and train the model with the translated caption datasets as prior knowledge for the low-resource ArabicFlickr1K dataset (which contains only 1,095 images). Their model achieves a Bilingual Evaluation Understudy (BLEU) score of 47%, serving as compelling evidence for the effectiveness of their approach.
Semantic Segmentation aims to understand images at the pixel level, serving as the basis for various applications, including autonomous driving [80] and robot manipulation [30]. However, training segmentation models requires an extensive amount of data with pixel-wise annotations, a process that is burdensome and prohibitively expensive [78]. To solve this challenge, Konyushkova et al. [147] propose an uncertainty-based DAL method with geometric priors to expedite and simplify the annotation process for image segmentation. Experiments show that their method can be applied to both background-foreground and multi-class segmentation tasks. Qiao et al. [148] introduce a collaborative panoptic regional DAL framework for partially annotated semantic segmentation. By incorporating semantic-agnostic panoptic matching and region-based selection and extension, their model strikes a balance between labeling efforts and performance. Similarly, Xie et al. [80] propose an automated region-based DAL approach for semantic segmentation that considers the spatial adjacency of image regions and the confidence in prediction. Experiments show that they can use a small number of labeled image regions while maximizing segmentation performance.
Object Detection is transformed into a region classification task by generating candidate regions of objects from the input image. Features are typically extracted from candidate object regions using CNNs, and classifiers are subsequently employed for the final detection. DAL can reduce the amount of labeled data needed to fit the numerous parameters of CNNs. Wu et al. [149] propose a novel hybrid query strategy that jointly considers uncertainty and diversity. Extensive experiments on two object detection datasets demonstrate the effectiveness of their model. Wang et al. [43] introduce active sample mining with switchable selection criteria to incrementally train robust object detectors using unlabeled or partially labeled samples, avoiding the influence of noisy samples and outliers. The effectiveness of the model is demonstrated through extensive experiments on publicly available object detection benchmarks. Yuan et al. [150] define an instance uncertainty learning module that takes advantage of the discrepancy of two adversarial instance classifiers trained on the labeled set to predict the instance uncertainty of the unlabeled set. With iterative instance uncertainty learning and re-weighting, they suppress noisy instances, bridging the gap between instance-level and image-level uncertainty.
Pose Estimation aims to localize the positions of specific key points in images, which has a wide range of applications, such as augmented reality, translation of sign language, and human-robot interaction. Obtaining pose annotations can be extremely expensive and laborious. To address this issue, Caramalau et al. [151] propose distribution-based methods for the selection of diverse and representative samples. Experiments demonstrate their high efficiency and effectiveness for pose estimation. Similarly, Shukla et al. [152] use an uncertainty-based query strategy that annotates the samples with the lowest confidence scores, further improving performance with fewer labeled samples. Gong et al. [86] design a novel meta agent teaming DAL (MATAL) framework to actively select and label informative images for effective learning. MATAL formulates the sample selection procedure as a Markov Decision Process and learns an optimal sampling policy that effectively maximizes the performance of the pose estimator.
Target Tracking aims to accurately track targets in images and has numerous applications, including video surveillance and autonomous vehicles. DAL can help train neural networks with limited labeled samples for target tracking. Yuan et al. [153] present a new DAL sequence selection method based on multi-frame collaboration for target tracking. To ensure the diversity of selected sequences, they measure the similarity of samples by the temporal relation between multiple frames in each video, and they use a nearest neighbor discriminator to select the representative samples. Experiments show that their method can eliminate background noise and improve efficiency.
Person Re-identification (Re-ID) aims to match a specific pedestrian across different cameras, which is an essential task for public security. Previous efforts mainly concentrate on enhancing the performance of Re-ID models, relying on large labeled datasets. However, these efforts often overlook data redundancy issues that can arise in constructing Re-ID datasets. To address data redundancy in Re-ID datasets, Liu et al. [46] propose an alternative human-in-the-loop model based on reinforcement learning. In their method, a human annotator provides binary feedback to fine-tune a pre-trained CNN Re-ID model. Extensive experiments show the superiority of their method compared to existing unsupervised, transfer learning, and DAL models. On the other hand, Xu et al. [155] focus on learning from scratch with incremental labeling through human annotators and model feedback. They combine DAL with an incremental annotation process to select informative and diverse samples without redundancy from an unlabeled set in each iteration. These samples are then labeled by human annotators to further improve the performance of the model.

C. Applications in Graph Data Mining and Learning
There has been a substantial increase in content-rich networks from various domains, such as social networks, citation networks, and financial networks. Graphs have emerged as a powerful tool for representing and discovering knowledge, with nodes representing instances characterized by rich content features and edges denoting relationships or interactions between nodes.
Node Classification aims to predict the labels of unlabeled nodes in a partially labeled network. GNNs rely heavily on a sufficient number of labeled nodes, which are costly and time-consuming to obtain. To address this problem, many graph-based DAL methods are proposed. For example, ICA-based methods [165] leverage label dependence among neighboring nodes to select diverse samples for node classification, while AGE [166] and ANRMAB [167] integrate GCNs with three traditional DAL query strategies and achieve good performance on many node classification datasets. As Fig. 10 shows, Hu et al. [120] present a graph policy network for transferable DAL on graphs, which formalizes DAL on graphs as a Markov decision process and learns the optimal query strategy with reinforcement learning. The state is defined based on the current graph status, and the action is to select a node for annotation at each query step. The reward is defined as the performance gain of the GNNs trained with the selected nodes.
Link Prediction aims to predict missing or potential links between nodes in a given network. It involves using existing connections or relationships to infer the likelihood of forming new connections. In the context of link prediction, the challenge arises from the limited availability of existing link information between nodes in a network. DAL can help alleviate this issue. For example, DALAUP [168] uses neural networks to obtain vector representations of user pairs and utilizes multiple query strategies to select informative user pairs for labeling and model training, achieving superior performance compared to existing methods. Cai et al. [156] design a multi-view DAL method that reduces the annotation cost by selectively querying metadata for the most informative examples, using a mapping function from the visual view to the text view. They demonstrate that multi-view DAL can use richer information to improve performance compared with using a single view. Zhao et al. [157] propose a DAL-based transfer learning framework for link prediction in recommender systems, which iteratively selects entities from source systems for target systems using uncertainty-based criteria. Experiments show that their method successfully improves efficiency and effectiveness.
Community Detection aims to accurately partition nodes into distinct classes based on the topological structure of the networks. However, in many practical scenarios, unsupervised methods struggle to recover the exact communities. To solve this issue, Gupta et al. [158] propose community trolling, a DAL-based method for topic-based community detection. Their method selects relevant samples from polluted big data, reducing the unreliable dataset to a reliable one for studying communities. Chien et al. [159] propose a novel DAL method for geometric community detection. They first remove many cross-cluster edges while preserving intra-cluster connectivity to avoid noise. Then, they interactively query the label of one node for each disjoint component to recover the underlying clusters. Experiments show that they can achieve SOTA performance in community detection.

D. Other Selected Interesting Applications
Engineering Systems. DAL methods exhibit remarkable performance in computationally demanding engineering systems by significantly reducing running time and computational costs. For example, Yue et al. [169] introduce two novel DAL algorithms: the variance-based weighted AL and the D-optimal weighted AL, designed specifically for Gaussian processes with uncertainties. Numerical studies demonstrate the effectiveness of their approach, notably improving predictive modeling for automatic shape control of composite fuselage structures. In another vein, Lee et al. [170] optimize their DAL acquisition function by jointly considering safe variance reduction and safe region expansion tasks, aiming to minimize failures without explicit knowledge of failure regions. This approach is tailored for real systems with uncertain failure conditions, as demonstrated in the predictive modeling of composite fuselage deformation, achieving zero failures by considering the composite failure criterion. Furthermore, Lee et al. [171] introduce a partitioned DAL method, comprising two systematic steps: global searching for uncertain design spaces and local searching using local Gaussian processes. They apply their method to aerospace manufacturing and materials science, achieving superior performance in prediction accuracy and computational efficiency compared to benchmarks.
Personalized Medical Treatment explores how patient health is affected by taking a drug and how user questions are answered by search recommendation [172]. Although modern methods can achieve impressive performance, they need a significant amount of labeled data. To solve this issue, Deng et al. [160] propose the use of DAL to recruit patients and assign treatments that reduce the uncertainty of an Individual Treatment Effect model. Sundin et al. [173] propose to use a Gaussian process to model the individual treatment effect and use the expected information gain over the S-type error rate, defined as the error in predicting the sign of the conditional average treatment effect, as their acquisition function. Jesson et al. [113] develop epistemic uncertainty-aware methods for DAL of personalized treatment effects from high-dimensional observational data. In contrast to previous work that only uses information gain as the acquisition objective, they propose Causal-BALD, which considers both information gain and overlap between the treatment and control groups. Li et al. [174] use DAL to help people by recognizing their emotions.
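The S-type error mentioned above can be written compactly. The notation below is an assumption introduced for illustration (potential outcomes Y(1) and Y(0), covariates X, and an estimate \hat{\tau}); it restates the verbal definition in the text and is not copied from [173].

```latex
\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right],
\qquad
\mathrm{err}_{S}(x) = \Pr\left[\, \operatorname{sign}\hat{\tau}(x) \neq \operatorname{sign}\tau(x) \,\right]
```

Here \tau(x) is the conditional average treatment effect, and the S-type error is the probability that the model predicts the wrong sign of that effect, that is, recommends the treatment when it harms or withholds it when it helps.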

VI. CHALLENGES & OPPORTUNITIES OF DAL
As Table IV shows, hereafter, we summarize the challenges and the corresponding potential solutions and opportunities.

A. Pipeline-related Issues
Inefficient & Costly Human Annotation. DAL assumes that human annotators are readily available to label new samples once they are required. However, this assumption may not hold in some real-world applications. Human annotators can get tired or need breaks, forcing the DAL process to be suspended until they reappear. Moreover, human annotation is time-consuming and needs expert knowledge, resulting in long waits before models can be re-trained with new labeled data.
To improve efficiency, DAL methods incorporate additional techniques to reduce human annotation.Wang et al. [36] use self-supervised learning by adding pseudo-labels with high confidence to help reduce human effort and improve the performance of the model.Go one step further, Yang et al. [85] introduce multiple pseudo-annotators that provide labels for unlabeled samples, achieving good performance without requiring human expert knowledge.On the other hand, as shown in Fig. 11, Huang et al. [134] propose a new annotation strategy to allow servers, workers, and annotators to cooperate efficiently for sharing candidate queries and annotations.Experiments show that their model can avoid annotation noise and save much time for re-checking annotations.To further reduce expert knowledge, others tend to reduce the search scope in each iteration to improve efficiency.For example, Yang et al. [94] restrict candidate samples to their nearest neighbors of the labeled set rather than scanning all data.) Insufficient Research on Stopping Strategies.Few studies are designed for stopping strategies of DAL methods [196].

Challenge: Insufficient research on stopping strategies.
Opportunities: stop when the confidence among the selected samples does not increase [175]; stop when all instances lying between two contour lines have been labeled [176]; use an upper bound on the expected generalization errors as the stopping criterion [177].
However, stopping strategies are essential for DAL because they reduce the amount of human labor by limiting the number of samples that need to be labeled and prevent the inclusion of noisy and redundant samples, which can negatively affect the performance of DAL models.
McDonald et al. [175] design two novel stopping strategies for DAL methods in the document classification task. The first strategy measures the overall confidence of the classifiers in correctly classifying the remaining unlabeled documents. It assumes that once the classifier's mean confidence for the remaining documents stabilizes, the DAL process can be stopped, since its effectiveness would no longer improve. The second strategy measures the confidence of the classifiers on the documents selected for review. It assumes that when the classifier's confidence stops increasing for these documents, the classifier has reached its maximal confidence, and the DAL process is stopped. Building on the margin exhaustion criterion, Yu et al. [176] identify two corresponding contour lines in the instance space and assume that the DAL process can only be stopped when all instances lying between these two contour lines have been labeled. They achieve good performance in many classification tasks. Based on Bayesian theory, Ishibashi et al. [177] derive a novel upper bound for the difference in expected generalization errors before and after obtaining new training data. They then combine this upper bound with a statistical test to derive a stopping criterion for DAL and significantly improve efficiency.
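To make the first confidence-based strategy concrete, the following is a minimal sketch of a "stop when mean confidence stabilizes" rule. It is not the implementation of [175]: the function names, the sliding `window`, and the tolerance `tol` are illustrative assumptions, and confidence is taken here to be the top-class probability.

```python
from typing import Sequence

import numpy as np


def mean_confidence(probs: np.ndarray) -> float:
    """Mean top-class probability over the remaining unlabeled pool.

    `probs` has shape (n_samples, n_classes); this definition of
    confidence is an assumption made for illustration.
    """
    return float(np.max(probs, axis=1).mean())


def should_stop(history: Sequence[float], window: int = 3, tol: float = 0.005) -> bool:
    """Return True once the last `window` confidence values vary by less than `tol`.

    `window` and `tol` are hypothetical hyperparameters, not values from [175].
    """
    if len(history) < window:
        return False
    recent = list(history)[-window:]
    return max(recent) - min(recent) < tol
```

In a DAL loop, one would append `mean_confidence(...)` to `history` after each round and end the annotation process as soon as `should_stop(history)` holds. The window guards against stopping on a single flat step.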
3) Cold-start. Most DAL methods fail to improve over random selection when the annotation budget is very small, a phenomenon sometimes termed "cold-start" [179]. Uncertainty sampling has been shown to be inherently unsuitable for low budgets, which possibly explains the cold-start phenomenon [201]. Low budgets arise in many applications, especially those that require an expert tagger whose time is expensive. If we want to expand deep learning to new domains, overcoming the cold-start problem is an ever-important task. To relieve the cold-start issue, Yuan et al. [31] use embeddings pre-trained on unsupervised tasks, decreasing budget dependency while remaining faithful to uncertainty sampling. Similarly, Yu et al. [178] use pre-trained knowledge from PLMs to avoid cold-start. They select few-shot samples to fine-tune a large-scale PLM, achieve SOTA performance on six datasets, and improve labeling efficiency over existing baselines by 3.2%-6.9% on average. On the other hand, as shown in Fig. 12 (a-b), Yehuda et al. [180] develop a new DAL initialization strategy to solve the cold-start issue for low-budget image classification, which significantly outperforms CoreSet initialization in the low-budget regime. They also theoretically analyze different DAL strategies in embedding spaces and improve performance in both low- and high-budget scenarios. As shown in Fig. 12 (c), Cao et al. [181] apply the informative sampling policy on the γ tube to solve the cold-start sampling problem. Mahmood et al. [182] query a diverse set of examples with minimal Wasserstein distance from the unlabeled data. They report a significant performance boost in the low-budget regime.

B. Task-related Issues
1) Difficulty in Cross-domain Transfer. We discuss two difficulties of cross-domain transfer in DAL. First, machine learning systems are often deployed on various devices with the same labeled dataset. However, DAL is often model-dependent and not directly transferable, i.e., data queried for one model may be less effective for another [183]. Second, transfer learning biases DAL to select samples that match the distribution of the source domain to the target domain, leading to sampling bias and a high cost of transfer learning. To benefit multiple target models, some methods aim to select samples in joint disagreement regions across models [183], adopt multi-agent reinforcement learning for optimal selection [154], or leverage multi-task learning to transfer common knowledge from the source domain, as shown in Fig. 13. To avoid sampling bias, Farquhar et al. [184] apply corrective weighting using an unbiased risk estimator to maintain the target distribution during pool-based sampling. Trang et al. [110] introduce a heuristic query strategy that matches the distribution of the source domain while retrieving valuable target samples. Hu et al. [120] learn transferable DAL policies on labeled source graphs that generalize selection to unlabeled target graphs. Experiments show that the above methods can achieve excellent performance and transferability.
2) Unstable Performance. DAL methods often have unstable performance, i.e., results for the same method vary significantly with different initialization seeds [108]. Two primary reasons can explain this instability. First, DAL methods are sensitive to the initial labeled dataset. The initially selected samples have a great influence on the eventual outcome of current approaches. With insufficient initial labeling, subsequent DAL cycles become highly biased, resulting in poor selection. Second, current DAL methods usually treat active learning and deep learning as two separate processes, which easily leads to sub-optimal and unstable performance [202].
To reduce DAL's sensitivity to initialization, current methods mainly rely on diverse sampling and pre-trained models. Yu et al. [176] adopt hierarchical clustering to select 10% of the samples near each cluster center as representative samples. Their new initialization greatly helps stabilize the performance. Zlabinger et al. [185] take into account both diversity and polarization to effectively select initial samples for DAL methods, which further stabilizes the performance of the DAL process. Yang et al. [94] select initial samples by evaluating the total distance between the unlabeled samples and the initial samples, showing that the same distance between them can result in better and more stable performance. On the other hand, Yuan et al. [31] incorporate language information as prior knowledge to help learn node representations and use clustering methods to select the initial data. Similarly, Ein-Dor [53] uses BERT to learn the representations of the input sentences and uses a hybrid query strategy to select the most uncertain and diverse samples as the initial training data.
To bridge the gap between AL and deep learning models, Kwak et al. [186] introduce Trustworthy AL (TrustAL), a label-efficient DAL framework that transfers distilled knowledge from deep learning models to the data selection process. As Fig. 14 shows, they jointly optimize knowledge distillation and DAL to obtain more consistent and reliable performance than the two best-performing baselines on three benchmarks. Similarly, Ma et al. [187] learn nonlinear embeddings to map inputs into a latent space and introduce a selection block to choose representative samples in the learned latent space to achieve stable performance. Margatina et al. [61] extend the PLMs by continually pre-training them on the available unlabeled data to tailor them to the task-specific domain, so that they can benefit from both labeled and unlabeled data at each DAL iteration. Their experiments show considerable enhancements in data efficiency and stability compared to the standard fine-tuning approach, emphasizing the importance of a suitable training strategy in DAL. Mamooler et al. [188] combine DAL with PLMs in the legal domain, where they use unlabeled data in three stages: training the model to adjust it to the downstream task, using knowledge distillation to direct the embeddings to a semantically meaningful space, and identifying the initial set.
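The clustering-based initialization idea discussed above (choosing representative samples near cluster centers) can be illustrated with a short sketch. It uses plain k-means rather than the hierarchical clustering of [176], and the function name, `n_iters`, and `seed` are assumptions for illustration. The embeddings are assumed to come from some pre-trained encoder.

```python
import numpy as np


def cluster_center_init(embeddings, n_clusters, n_iters=20, seed=0):
    """Pick the sample closest to each k-means centroid as the initial labeled set.

    Plain Lloyd-style k-means is used here as a stand-in for the clustering
    methods cited in the text; this is not the authors' exact procedure.
    """
    X = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct samples.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Pairwise distances between samples and centroids, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = X[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centers[k] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # One representative per centroid; duplicates are merged.
    return sorted({int(i) for i in dists.argmin(axis=0)})
```

The returned indices would be sent to the oracle before the first DAL round. Because k-means depends on its random initialization, a practical system might run it with several seeds or use a deterministic seeding scheme.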
3) Lack of Scalability & Generalizability. Current DAL methods lack scalability, as they often require significant modifications to neural network architectures to adapt to different query strategies. Another issue with current methods is their heavy reliance on DAL's weight parameters, which may not generalize to different datasets. Users are required to prepare additional labeled samples as a validation set to tune these parameters by cross-validation, which contradicts the goal of minimizing the need for labeled data. In response to the above issues, Maekawa et al. [60] introduce a novel DAL method, called TYROGUE, that uses a hybrid query strategy to improve model generalization and reduce labeling costs. As Fig. 15 shows, uncertainty-based methods tend to acquire similar data points from a specific area within an iteration, diversity-based methods tend to acquire data points similar to the samples acquired in previous iterations, and TYROGUE balances diversity and uncertainty by acquiring samples that are diverse and also close to the model decision boundary. RMQCAL [104] is a novel scalable DAL method that allows for any number and type of query criteria, eliminates the need for empirical parameters, and makes the trade-offs between the query criteria self-adaptive. On the other hand, Wan et al. [189] propose an embedded network of nearest-neighbor classifiers to enhance the generalization ability of models trained in labeled and unlabeled sub-spaces in a simple but effective manner. Deng et al. [190] focus on combining sample annotation and counterfactual sample construction in the DAL procedure to enhance the model's out-of-distribution generalization. Wang et al. [191] introduce a new training scheme to improve the model's generalizability and show a strong positive correlation between convergence speed and generalization performance under ultra-wide conditions.
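The general idea of balancing uncertainty and diversity in one query can be sketched as a two-stage selection: first shortlist the most uncertain samples, then pick a spread-out subset of the shortlist. This is only a generic illustration, not TYROGUE's actual algorithm. The function names and the `candidate_factor` parameter are hypothetical.

```python
import numpy as np


def entropy(probs):
    """Predictive entropy per sample for an (n_samples, n_classes) array."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def hybrid_query(probs, embeddings, batch_size, candidate_factor=3):
    """Shortlist by entropy, then choose a diverse batch by greedy farthest-point selection."""
    X = np.asarray(embeddings, dtype=float)
    scores = entropy(probs)
    n_candidates = min(len(scores), batch_size * candidate_factor)
    # Stage 1: the most uncertain samples form the candidate shortlist.
    candidates = np.argsort(-scores, kind="stable")[:n_candidates]
    # Stage 2: start from the most uncertain candidate and repeatedly add
    # the candidate farthest from everything chosen so far.
    chosen_pos = [0]
    min_dist = np.linalg.norm(X[candidates] - X[candidates[0]], axis=1)
    min_dist[0] = -np.inf  # never re-select an already chosen candidate
    while len(chosen_pos) < min(batch_size, n_candidates):
        nxt = int(np.argmax(min_dist))
        chosen_pos.append(nxt)
        d = np.linalg.norm(X[candidates] - X[candidates[nxt]], axis=1)
        min_dist = np.minimum(min_dist, d)
        min_dist[nxt] = -np.inf
    return [int(candidates[p]) for p in chosen_pos]
```

A larger `candidate_factor` shifts the balance toward diversity, while a factor of 1 reduces the procedure to pure uncertainty sampling.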

C. Dataset-related Issues
1) Outlier Data & Noisy Oracles. DAL methods tend to acquire outliers, since models often assign high uncertainty scores to them. Outliers can damage a model's learning ability and fuel a vicious cycle in which DAL methods continue to select them [43]. Identifying and removing outliers has therefore become an important direction for improving DAL performance and robustness. On the other hand, classic DAL methods assume that annotators have high labeling accuracy. However, in real-world settings, sample difficulty and annotator expertise can significantly affect the quality and accuracy of annotation, which may further degrade model performance.
Fig. 16: An example of imbalanced sampling [195].

To remove outliers, Park et al. [126] propose MQ-Net to adaptively find the best balance between purity and informativeness of samples, filtering out noisy open-set data. Elenter et al. [89] introduce a new query strategy based on Lagrangian duality to select diverse samples, efficiently removing redundant data. Other studies [14] use knowledge distillation to compress useful knowledge into a small model, effectively identifying and removing outliers. To obtain high-quality annotations, AMCC [81] measures worker annotations considering both their commonality and individuality to reduce the impact of unreliable workers and improve effectiveness. Zhao et al. [192] actively select samples that are relabeled multiple times through crowd-sourcing majority voting. EMMA [193] relabels samples to remove noisy annotations by analyzing the stimulus based on model memory retention and greedy heuristics. BALT [203] improves human expertise during labeling to improve relabeling quality and significantly improve model performance. Zlabinger et al. [185] train human annotators on a set of pre-labeled samples to improve the quality of annotations. Huang et al. [134] propose a multi-server, multi-worker framework for DAL, where servers and workers cooperate to select diverse samples and improve model performance.
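As a small illustration of the majority-voting idea mentioned above, the sketch below aggregates several crowd labels for one sample and flags low-agreement samples for another round of labeling. It is a generic aggregation routine, not the method of [192]. The `min_agreement` threshold is a hypothetical parameter, and a non-empty label list is assumed.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning label, agreement ratio) for one sample's crowd labels.

    Assumes `labels` is non-empty; ties are resolved in favour of the
    label that was seen first.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def needs_relabel(labels, min_agreement=0.6):
    """Flag a sample whose annotators agree too little to trust the vote."""
    _, agreement = majority_vote(labels)
    return agreement < min_agreement
```

Samples for which `needs_relabel` returns `True` are natural candidates to send back to the crowd, which concentrates the relabeling budget where annotators disagree.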
2) Data Scarcity & Imbalance. Data scarcity and imbalance pose two critical challenges. First, datasets are difficult to collect and annotate [204]. Second, DAL methods share the underlying assumption that all classes are equal, whereas some classes have more samples than others (skewed class distribution [176]) or may be more difficult to learn than others, leading to sampling bias in the acquisition process [205].
For scarce datasets, Chen et al. [12] use data augmentation to generate diverse samples to expand the training data. Other studies use PLMs as prior knowledge and fine-tune them to reduce the number of required labeled samples [32]. For difficult annotations, Gudovskiy et al. [97] introduce several novel self-supervised pseudo-label estimators to correct acquisition bias by minimizing the distribution shift between unlabeled data and weakly labeled validation data. To mitigate class imbalance, Yu et al. [176] are the first to use cost-sensitive learning. They choose the extreme weighted learning machine as the base learner to select samples based on the class imbalance ratio, class overlap, and small disjunction. They investigate why DAL can be impacted by a skewed instance distribution and improve DAL performance on imbalanced datasets. Choi et al. [194] address data imbalance by considering, when querying samples, the probability of mislabeling a class, the probability of the data given a predicted class, and the prior probability of the abundance of a predicted class. Experiments show that they can significantly enhance the ability of existing DAL methods to handle imbalanced datasets. As shown in Fig. 16, Zhao et al. [195] propose an alternative query strategy that uses the medial distribution to find a compromise between importance weighting and class-balanced sampling. Experiments show that their model can be easily combined with various DAL methods and successfully select balanced samples in imbalanced datasets. Hartford et al. [196] present an exemplar-guided DAL method that shows strong empirical performance under extremely skewed label distributions by using exemplar embedding. Zhang et al. [197] propose a graph-based DAL method that
3) Class Distribution Mismatch. DAL methods assume that the labeled and unlabeled data are drawn from the same class distribution, which means that the categories of both datasets are identical [200]. However, in real-world scenarios, unlabeled data often come from uncontrolled sources, and a large portion of the examples may belong to unknown classes. For example, when crawling images for binary image classification using keywords like "dog" and "cat," over 50% of the images in the unlabeled dataset are irrelevant to the task (e.g., "deer," "horse"). Annotating these irrelevant images wastes annotation budget, as they are unnecessary for training the desired classifier. Despite this, existing DAL systems tend to select these irrelevant images for annotation, as they contain more uncertain knowledge.
To address this issue, as shown in Fig. 17 (a), He et al. [198] propose the energy discrepancy to measure the density distribution between the seen and unseen classes. Then, they propose an iterative optimization strategy that helps the teacher-student distillation network avoid selecting samples from unseen classes. Furthermore, Tang et al. [199] propose a dual DAL framework that simultaneously performs model search and data selection. Their framework effectively addresses the issue of distribution mismatch and significantly improves model performance. In Fig. 17

VII. CONCLUSION
Due to its advantages, such as high efficiency, good effectiveness, and strong robustness, DAL has been deployed in both research and industry projects. This article provides a comprehensive survey on DAL, including its paper collection, definition, influential baselines and datasets, taxonomy, applications, challenges, and some inspiring prospects. First, we discuss the collection and filtering of DAL papers to ensure their high quality. Second, we give the definition of DAL tasks and present the basic pipeline, influential baselines, and widely used datasets. Third, we present our taxonomy of DAL methods from several perspectives and discuss their strengths and weaknesses. From them, we obtain some guidelines for selecting query strategies, deep model architectures, and learning paradigms for different tasks. In addition, different annotation strategies can significantly reduce manual labor while also bringing certain drawbacks. In terms of the training process, curriculum learning-based training and Pre+FT can better adapt to the current era of large language models. Fourth, we discuss some typical applications of DAL. Beyond the popular DAL methods commonly used for CV tasks, we also introduce carefully designed DAL methods for NLP, DM, etc. Finally, even though DAL has many benefits, we

Fig. 1: The general pipeline in deep active learning.
Finally, in Section IV-E, three different training processes, including traditional training, curriculum learning-based training, and pre-training & fine-tuning, will be introduced with typical examples.

Fig. 7: An example of transfer learning-based query strategies.
Löffler et al. [131] propose an imitation learning scheme (IALE) that mimics the selection of the best-performing expert heuristic at each stage of the learning cycle in a batch-mode setting. As shown in Fig. 8, IALE can well imitate the Entropy-based and CoreSet-based methods and thus obtain better performance.
Multi-task Learning (MTL) focuses on formulating methods to maintain performance across multiple tasks rather than a single task. Multi-task DAL (MTAL) methods combine multiple individual task-related query strategies into a single unified approach and jointly optimize the unified one. In contrast to single-task query settings, where the uncertainty of a single selected task classifier is used to query unlabeled samples, in MTAL the uncertainty of an instance is determined by the uncertainties from classifiers across all tasks. For example, Ikhwantri et al. [133] propose an MTAL framework for semantic role labeling with entity recognition as an auxiliary task. This alleviates data needs and leverages entity information to aid role labeling. Their experiments show that MTAL can outperform single-task DAL and standard MTL, using 12% less training data than passive learning. Zhou et al. [84] propose a Multi-Task Adversarial DAL framework, where adversarial learning maintains the effectiveness of the MTL and DAL modules. A task discriminator eliminates irregular task-specific features, while a diversity discriminator exploits heterogeneity between samples to satisfy diversity constraints.

E. Training Process
1) Traditional Training first trains a model on an initial training dataset and then selects unlabeled samples to annotate based on the predictions of the current model. The newly annotated samples are added to the training set for re-training the model in the next iteration [134]. This iterative process continues, with the model parameters randomly re-initialized before each epoch of re-training [36], until either the sample budget or the number of DAL iterations is reached.
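The traditional training process described above can be sketched as a pool-based loop. This is a minimal illustration under several assumptions: a tiny nearest-centroid classifier stands in for the deep model, least-confidence sampling stands in for the query strategy, and `oracle` is a hypothetical callable that returns the label of a sample index.

```python
import numpy as np


class NearestCentroid:
    """Tiny stand-in model; a real DAL system would train a deep network here."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Softmax over negative distances to the class centroids.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        logits = -d
        logits = logits - logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


def active_learning_loop(X, oracle, init_idx, batch_size, budget):
    """Train, query the least confident samples, annotate, and re-train until the budget is spent."""
    labeled = [int(i) for i in init_idx]
    labels = {i: oracle(i) for i in labeled}
    while True:
        # A fresh model object each round mirrors re-initialised re-training.
        y = np.array([labels[i] for i in labeled])
        model = NearestCentroid().fit(X[labeled], y)
        pool = np.array([i for i in range(len(X)) if i not in labels], dtype=int)
        if len(labeled) >= budget or len(pool) == 0:
            return model, labeled
        probs = model.predict_proba(X[pool])
        uncertainty = 1.0 - probs.max(axis=1)  # least-confidence score
        k = min(batch_size, budget - len(labeled), len(pool))
        query = pool[np.argsort(-uncertainty, kind="stable")[:k]]
        for i in query:
            labels[int(i)] = oracle(int(i))  # the oracle annotates the query
            labeled.append(int(i))
```

The loop stops on a label budget. A cap on the number of DAL iterations, the other stopping condition mentioned in the text, could be added as an extra counter.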

Fig. 9: An example of sample selection of ATS and Datamap.
(b), Ning et al. [200] introduce a detector-classifier DAL framework, where the detector filters unknown classes using Gaussian Mixture Models and the classifier selects uncertain in-distribution samples for retraining. By actively acquiring purer in-distribution query sets, this framework improves the model generalization on class distribution mismatch.
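The detector-then-classifier idea can be sketched as a two-step filter: discard pool samples that a detector considers out-of-distribution, then rank the remainder by classifier uncertainty. This is only an illustration of the general pattern, not the framework of [200]. The `id_scores` input (a per-sample in-distribution score from any detector, such as a Gaussian mixture model) and the `id_threshold` parameter are assumptions.

```python
import numpy as np


def select_in_distribution_uncertain(probs, id_scores, batch_size, id_threshold=0.5):
    """Select the most uncertain samples among those judged in-distribution.

    probs:     (n_samples, n_classes) classifier probabilities.
    id_scores: (n_samples,) hypothetical detector scores; higher means
               more likely to belong to a known class.
    """
    probs = np.asarray(probs, dtype=float)
    id_scores = np.asarray(id_scores, dtype=float)
    # Step 1: the detector removes samples that look like unknown classes.
    keep = np.flatnonzero(id_scores >= id_threshold)
    if keep.size == 0:
        return []
    # Step 2: rank the surviving samples by predictive entropy.
    p = np.clip(probs[keep], 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=1)
    order = np.argsort(-ent, kind="stable")[:batch_size]
    return [int(i) for i in keep[order]]
```

Without the filtering step, a highly uncertain but irrelevant sample would be queried first, which is exactly the budget waste described for class distribution mismatch.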

TABLE I: Detailed taxonomy of important Deep Active Learning baselines. Refer to Section IV for a detailed explanation of each category. "Any Types" in Query Strategy means that the proposed framework can be combined with any type of DAL query strategy.

TABLE II: Widely used DAL dataset information.

TABLE III: Illustration of DAL-related applications in main fields, including classic methods with their advantages and disadvantages.

TABLE IV: Summary of various challenges and opportunities.