How Much Should a Model Be Trained by Passive Learning Before Active Learning?

Most pool-based active learning studies have focused on query strategies for active learning. In this paper, via an empirical analysis of the effect of passive learning performed before starting active learning, we reveal that the amount of data acquired by passive learning significantly affects the performance of active learning algorithms. In addition, we confirm that the best amount of data to acquire by passive learning depends on the given settings: network complexity, query strategy, and dataset. Inspired by these observations, we propose a method to automatically determine the starting point of active learning for the given settings. To this end, we suggest the entropy of sample-uncertainty to measure the training degree of a target model and develop three empirical formulas to determine an appropriate entropy of sample-uncertainty that should be reached by passive learning before starting active learning. The effectiveness of the proposed method is validated by extensive experiments on popular image classification benchmarks and query strategies.


I. INTRODUCTION
In recent years, deep learning-based algorithms have achieved breakthroughs in many machine learning tasks, such as image classification [1]. The success of deep learning is attributed to the availability of huge amounts of supervised data. Recent studies have reported that increasing the amount of data can improve the performance of a model [2]. However, in actual applications, obtaining a tremendous amount of supervised data is impractical due to limitations of time and cost.
Recently in the deep learning field, active learning has emerged as a way to efficiently collect supervised data. The key idea of active learning is to improve the performance of the target model efficiently by actively selecting the data to be labeled through an algorithm, rather than selecting them at random. There are many scenarios for active learning [3], but recent studies have mainly considered a pool-based active learning scenario. The pool-based scenario consists of two learning phases [3]. First, the model is trained with a small amount of supervised data which are randomly selected from an unlabeled data pool and labeled by a human. This process is referred to as Passive Learning (PL). Next, a query strategy suggests worthy samples in the unlabeled data pool. The suggested samples are queried to a human and labeled manually. This process is referred to as Active Learning (AL). The pool-based scenario proceeds over several iterations. PL is performed for the first few iterations (usually 1 iteration), and then AL is performed during the remaining iterations.
(The associate editor coordinating the review of this manuscript and approving it for publication was Yonghong Peng.)
Recent deep AL studies have focused on designing effective query strategies, and most previous works overlook PL in the pool-based scenario. It should be pointed out, however, that many query strategies depend on the initial state of a model, because most query strategies utilize the model prediction outputs [4]-[7] or features extracted from a certain layer of the model [8]. Thus, the amount of PL used to initialize a model before applying AL can have a significant impact on the performance of an AL algorithm. For this reason, it is important to analyze the effect of PL on the performance of AL and to find an appropriate amount of PL before starting AL in the pool-based scenario.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 1. An example of PL effects on AL performance. The performance of the AL algorithm can be degraded when the amount of PL is (a) too small or (c) too large before starting AL. The amount of PL refers to the number of data acquired by PL. In this case, we can see the (b) best performance is achieved when the model is trained with 100 data points acquired by PL before starting AL.
In this paper, we empirically reveal the effect of PL on the performance of an AL algorithm. In particular, we attempt to find an appropriate amount of PL before applying AL, as shown in Fig. 1. To this end, we conduct extensive experiments that perform pool-based AL scenarios with increasing amounts of PL. Based on the analysis of the results, we present a meaningful trend for an appropriate amount of PL before starting AL. Then, utilizing this trend, we propose empirical formulas to recommend an appropriate amount of PL without rigorous experiments. For the proposed method, we first suggest a measure to estimate the training degree of a model, referred to as the Entropy of sample-Uncertainty (EoU). We utilize EoU as a criterion to determine whether the current amount of PL is sufficient to start AL. Then we analyze the trend of EoU depending on three important settings of AL: network complexity, AL query strategy, and dataset complexity. Based on the trend, we develop empirical formulas to determine an appropriate EoU that should be achieved by PL before AL. Through experiments on the MNIST, CINIC, and CIFAR datasets, we validate the effectiveness of our method by applying it to popular AL algorithms.
Our contributions are summarized as follows.
1) We reveal how the amount of PL affects the performance of an AL algorithm through extensive experiments.
2) We suggest a metric called the Entropy of sample-Uncertainty (EoU) to measure the training degree of a model achieved by PL.
3) We develop empirical formulas to automatically determine an EoU value representing an appropriate amount of PL for a given model, query strategy, and dataset.

II. RELATED WORKS
Recently, active learning [3] has been applied to various deep learning tasks such as object detection [4], [9], person re-identification [10], multi-task learning [11], named entity recognition [12], human pose estimation [13], action localization [14], and biomedical image analysis [15], [16]. Recent active learning methods can be categorized into two types [17]. One is the uncertainty-sampling methods [4]-[7], [9], [10], [12], [15], which select the most uncertain (also referred to as informative) samples for the target model. The other is the representative-subset methods [8], [18], which select a representative subset of the unlabeled data pool. The core of the uncertainty-sampling methods is to estimate the uncertainty of each unlabeled sample. In case the target model infers a probability distribution (e.g., image classification), classical measures such as the entropy [19] or variation ratio [20] of the predicted distribution can be used as uncertainty estimators. Despite their simplicity, these classical methods are still utilized in many deep learning applications and show prominent performance [7], [9], [12]-[15], [21].
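The two classical uncertainty estimators mentioned above can be sketched as follows (a minimal sketch, assuming softmax outputs stored as NumPy arrays; the function names are our own, not from the cited works):

```python
import numpy as np

def entropy_uncertainty(probs):
    """Predictive entropy of a softmax output; higher means more uncertain."""
    # probs: (N, C) array of class probabilities, one row per sample.
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def variation_ratio(probs):
    """Variation ratio: 1 minus the probability of the modal (predicted) class."""
    return 1.0 - probs.max(axis=1)

def select_queries(probs, b):
    """Uncertainty sampling: pick the b most uncertain samples to label."""
    u = entropy_uncertainty(probs)
    return np.argsort(-u)[:b]
```

For example, a near-uniform prediction yields higher entropy (and a higher variation ratio) than a confident one, so uncertainty sampling prefers it.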
However, most deep learning applications, such as object detection [22] or human pose estimation [23], infer deterministic results instead of probabilistic ones. In addition, the classical uncertainty-based active learning methods scale poorly to high-dimensional data and models with huge numbers of parameters [6].
Therefore, recent studies attempt to efficiently estimate the sample-uncertainty in a deep model. The dropout-based method [6] performs multiple forward passes with dropout layers [24] to predict the sample-uncertainty. The ensemble-based method [5] utilizes multiple deep neural networks which have the same structure but are initialized differently. The method proposed in [4] estimates the sample-uncertainty by predicting the loss value of a sample. To this end, an additional network (a loss prediction module) is attached to the target model. Since a loss function is defined in any deep learning task, this method can be applied to any deep learning task.
The representative-subset methods select a representative subset of the unlabeled data pool. The Core-set approach [8] formulated the active learning problem as the k-Center problem [25], whose goal is to select k center points such that the maximum distance between any data point and its nearest center is minimized. The Core-set method then solves the problem via integer programming. The variational adversarial active learning method [18] selects unlabeled data that are not similar to the labeled data by training a VAE [26] and a discriminator adversarially. k-centered clustering algorithms such as k-medoid clustering [27] can also be utilized to select representative samples by choosing cluster centers [28]. Some recent studies [29], [30] have attempted to combine the two kinds of active learning strategies mentioned above. The method proposed in [29] selects samples that have high uncertainty while preserving the distribution of the dataset. The method proposed in [30] improves on [29] by additionally considering the easiness of a sample.

III. ANALYSIS SETTINGS
In this section, first, we introduce pool-based active learning (AL) and define notations. Then, we describe our evaluation scenarios and settings of datasets and models to analyze the effect of passive learning (PL) on the performance of the AL algorithm.

A. POOL-BASED ACTIVE LEARNING
Pool-based AL is an iterative process. At each iteration, the following steps are performed given the labeled dataset D_l, the unlabeled dataset D_u, and the model M.
First, M is trained on D_l, and a pooling subset P_u is randomly chosen from D_u. The query strategy is applied to P_u rather than D_u to mitigate the overlapping problem [4], [5]. Next, the information necessary to perform the query strategy is collected, such as the output map or a layer activation map (feature map) of the trained model for samples in D_l and P_u. Then the query strategy suggests queried samples Q_u from P_u, and Q_u are labeled by human experts. Note that if the query strategy is random selection, the process is referred to as PL. Finally, the newly labeled samples are added to D_l and removed from D_u. The process above is repeated until the T-th iteration. In general, the cardinality of Q_u is referred to as the budget b, which is set to a fixed value for every iteration.
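One iteration of the pool-based scenario above can be sketched as follows (a minimal sketch; `train` and `query_strategy` are hypothetical placeholder callables, and D_l/D_u are assumed to be sets of hashable samples):

```python
import random

def al_iteration(D_l, D_u, b, pool_size, train, query_strategy):
    """One iteration of pool-based AL: train, pool, query, label, update."""
    model = train(D_l)                                        # train M on D_l
    P_u = random.sample(list(D_u), min(pool_size, len(D_u)))  # pooling subset
    Q_u = query_strategy(model, P_u, b)                       # suggest b samples
    for x in Q_u:           # oracle labeling: move queried samples to D_l
        D_l.add(x)
        D_u.remove(x)
    return model
```

With `query_strategy` replaced by random selection, the same loop implements PL, which is exactly the distinction the paper draws.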
To provide a clear explanation, we define additional notation. Let x_i be the i-th sample in a dataset. Then p_i^c denotes the predicted probability that x_i belongs to the c-th class, where c ∈ {1, ..., C} in a C-class classification problem.

B. EVALUATION SCENARIOS ON THE EFFECT OF PL
The goal of this paper is to reveal the effect of PL on the performance of AL, especially in terms of the amount of PL (i.e., the number of data acquired by PL). In our evaluation scenarios, we train a model for T iterations, where in the k-th scenario, the first k iterations adopt PL and the last (T − k) iterations adopt AL. From k = 0 to k = T, T + 1 scenarios are conducted. Thus, in the k-th scenario, the numbers of passively and actively labeled samples become k × b and (T − k) × b, respectively. The total number of training samples in every scenario is the same: T × b. Each scenario is evaluated by the test accuracy of the trained model. For each dataset, the corresponding deep model, b (the number of samples to be labeled at each iteration), T (the number of iterations performed during one pool-based AL scenario), and |P_u| (the cardinality of the pooling subset) are given.
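The bookkeeping of the scenarios above can be sketched as follows (a small illustrative helper of our own, not the authors' code):

```python
def scenario_label_counts(T, b):
    """For each scenario k (first k iterations PL, remaining T-k AL),
    return (k, passively labeled, actively labeled, total) sample counts."""
    rows = []
    for k in range(T + 1):        # T + 1 scenarios in total
        passive = k * b           # samples labeled by random selection (PL)
        active = (T - k) * b      # samples labeled by the query strategy (AL)
        rows.append((k, passive, active, passive + active))
    return rows
```

For instance, with T = 20 and a budget b of 10, every scenario labels 200 samples in total; only the PL/AL split varies.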

C. EVALUATION SETTINGS 1) DATASETS
Image classification is the most popular application in AL studies [4]-[6], [8], [15]. Thus, we performed experiments on four widely used image classification datasets: MNIST [31], CINIC-10 [32], and CIFAR-10/100 [33]. The details of the datasets are provided in Appendix A-B. Table 1 presents the AL settings for each dataset. To observe the effect of PL in a fine-grained manner, we set a small b, which determines the resolution of the analysis for a fixed total number of training samples. T is set to 20 due to a computational limitation; note that the computational burden of performing one evaluation scenario is roughly equivalent to that of training on as many as b(2T^3 + 6T^2 + 3T)/6 samples. For |P_u|, we empirically found that the best performance was achieved when |P_u| was 10% of the full dataset size.

2) MODEL
We employed the Keras implementation [34] as the deep model for the MNIST dataset. To validate the effect of network complexity on AL, we further experimented with four variations of the original implementation, using the models M1, M2, M3, M4 (original), and M5. A higher-numbered network has higher complexity. For CIFAR and CINIC-10, we employed ResNet-18 [35] as the deep model. A drop-out layer was added before the last fully connected layer with a dropout ratio of 0.5. The first convolution layer of ResNet-18 was modified to '3×3 kernel, stride of 1, and padding of 1' to fit CIFAR and CINIC-10. Detailed implementations and training schemes are provided in Appendix A.

IV. EMPIRICAL ANALYSIS AND METHODOLOGY
In this section, based on the evaluation results of the proposed scenarios, we present our observations on the effect of PL on the performance of AL. From the observations, we find that the best amount of PL depends on the target settings: model complexity, query strategy, and dataset, which are related to the training degree of a model. To estimate the training degree of a model, we propose a metric based on the entropy of sample-uncertainty. Using the metric, we suggest empirical formulas to determine the best amount of PL for AL.

FIGURE 2. The results of our evaluation scenarios on MNIST with three AL methods (Entropy, VarR, and DBAL) and the M1 network model. In each graph, the x axis denotes the number of labeled data acquired by PL, and the y axis denotes the test performance of the model. The mean and standard deviation of the repeated results are represented as a red point and a blue bar, respectively. A cubic polynomial fitted curve (green curve) of the results is also presented to clearly show the tendency.

A. EVALUATION RESULTS
Fig. 2 shows the results of the proposed evaluation scenarios for the MNIST dataset. See Appendix B for further results on the other datasets and network models. In each graph, the x axis denotes the amount of PL (the number of labeled data acquired by PL) and the y axis denotes the test accuracy of the model after completing the pool-based AL scenario. To obtain robust results, the experiments were repeated 20/5/5 times for MNIST / CINIC-10 / CIFAR, respectively. The mean and standard deviation of the repeated results are presented as red points and blue bars in the graphs. A cubic polynomial fitted curve (green curve) is also illustrated to clearly show the tendency of the results.

B. EFFECT OF PL ON PERFORMANCE OF AL
Based on the experimental results, we found a common tendency: the final performance is degraded when the amount of PL is either too large or too small. First, in the case that the amount of PL is too large, the performance degradation is reasonable because the pool-based AL scenario terminates with little benefit from AL.
Interestingly, the performance can also be degraded when the amount of PL is too small. We analyzed this phenomenon in relation to the training degree of the model achieved by PL. The fundamental goal of AL is to select the samples most helpful for improving the performance of the current model. Most query strategies depend on the training degree of a model [4]-[8]. For example, in the case of uncertainty-sampling methods [3], such as Entropy-based sampling [19], unlabeled samples are fed forward through the model, and the model outputs are utilized by the query strategy. Another example is the Core-set method [8]. Core-set does not utilize predicted results, but it also depends on the model, since it utilizes the responses of a specific layer of the model as features for a sample. Therefore, if the model does not reach a certain training degree through a sufficient amount of PL, the query strategy cannot choose proper samples for training the model. Poor sample selection can lead to performance degradation of an AL algorithm. From the above investigations, our observations are summarized as follows.
Observation 1: The amount of PL before AL affects the performance of AL.
By investigating further experimental results (see Appendix B), we discovered additional phenomena. As shown in the graphs given in Appendix B, the best amount of PL differs depending on the target settings: model complexity, query strategy, and dataset. This result means that the training degree a model must reach before starting AL can vary depending on the target AL settings. From these phenomena, we suggest the following observation.
Observation 2: The best amount of PL depends on the target settings: model complexity, query strategy, and datasets.
Based on Observations 1 and 2, our objective is to find a method to determine the best amount of PL before AL depending on the target settings. Note that performing extensive experiments on T + 1 scenarios to determine the best amount of PL is impractical in actual applications. Therefore, we aim to propose formulas that automatically determine the best amount of PL without conducting evaluation experiments over multiple scenarios.
In our observation, the amount of PL is highly correlated with the training degree of the target model trained by PL. Hence, to efficiently search for the best amount of PL, we need a metric that can measure the training degree of a model for the given settings. As this metric, we suggest the entropy of sample-uncertainty (EoU) in the following section. Using EoU, we can automatically determine the amount of PL before starting AL within a single scenario, without investigating T + 1 scenarios.

C. EoU: TRAINING DEGREE OF A MODEL
To estimate the training degree of a model, we have designed a formula that considers both the trained model M and the pooling subset P u . To this end, we have focused on the uncertainty of model responses for an unlabeled sample, which is referred to as sample-uncertainty for simplicity.
Sample-uncertainty is low for a familiar sample similar to the samples the model has already learned, whereas it is high for an unfamiliar sample rarely presented to the model. When the model has not learned any data, it is unfamiliar with all samples in the unlabeled dataset. Therefore, the sample-uncertainties for all unlabeled data are estimated to be high and are nearly uniformly distributed. The more labeled data a model learns, the more unlabeled samples have low sample-uncertainty. Thus, the distribution of sample-uncertainties over the unlabeled dataset tends to concentrate as training proceeds. Fig. 3 shows the sample-uncertainties estimated for all unlabeled samples in P_u of MNIST. We trained models with three different sizes of randomly chosen training datasets. As shown in Fig. 3, the sample-uncertainties are uniformly distributed when the model is not trained (top graph), and the distribution becomes unbalanced as the model is trained with more training samples (middle and bottom graphs). Therefore, using the distribution of sample-uncertainties, we can quantify the training degree of a model.
Entropy [19] can be utilized to measure the uncertainty of a distribution. To apply entropy, the sample index i is treated as a discrete random variable, and the probability that the i-th sample is unfamiliar (uncertain) to the model is defined by the normalized sample-uncertainty. Let u_i be the sample-uncertainty of the i-th sample. Then the uncertain probability q_i for the i-th sample is given by

q_i = u_i / Σ_{j=1}^{M} u_j,

where M = |P_u|. Then the normalized entropy of the distribution of sample-uncertainties is defined by

H = -(1 / log M) Σ_{i=1}^{M} q_i log q_i,

where log M is introduced to normalize the entropy to [0, 1], since the maximum entropy depends on the number of unlabeled samples, which varies during the AL process.
In conclusion, we suggest H, referred to as the Entropy of sample-Uncertainty (EoU), as a measure of the training degree of a model. Fig. 4 illustrates the correlation between the performance (classification accuracy on the test dataset) and the EoU of the models. The result shows that EoU is negatively correlated with the performance (training degree) of the model. Note that the validation performance could also be utilized to measure the training degree of the model. However, the validation performance varies largely depending on the experimental settings and tasks. In contrast, EoU is experimentally found to lie within the range [0.80, 1.00]. Thus, EoU provides a more consistent criterion for judging the training degree of a model.
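The EoU measure defined above can be computed in a few lines (a minimal sketch, assuming the sample-uncertainties for the pooling subset are given as a NumPy array):

```python
import numpy as np

def eou(uncertainties):
    """Entropy of sample-Uncertainty (EoU): normalized entropy of the
    distribution of sample-uncertainties over the pooling subset P_u."""
    u = np.asarray(uncertainties, dtype=float)
    q = u / u.sum()                 # q_i: normalized sample-uncertainty
    M = len(u)                      # M = |P_u|
    eps = 1e-12                     # avoid log(0)
    # Divide by log(M) so that H lies in [0, 1] regardless of |P_u|.
    return float(-np.sum(q * np.log(q + eps)) / np.log(M))
```

As the text argues, a uniform uncertainty distribution (untrained model) yields EoU near 1, while a concentrated distribution (better-trained model) yields a lower value.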

D. DETERMINATION OF THE BEST AMOUNT OF PL
Utilizing EoU, we propose a method to automatically determine the best amount of PL. The proposed method is based on two of our findings: (1) for effective AL, a model should achieve a certain training degree, and (2) EoU decreases as the model is trained. The concept of the proposed method is simple: start AL when the EoU of the model falls below some scalar threshold τ. To this end, at each iteration, we estimate the EoU of the model and compare it to τ. If the EoU becomes less than τ, we start AL. The amount of PL at this iteration is regarded as the best amount of PL. Algorithm 1 describes the detailed process of the proposed method. Although Algorithm 1 assumes the Entropy method as the sample-uncertainty, other types of sample-uncertainty can also be applied.
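The schedule described above can be sketched as follows (a simplified sketch of the idea, not the authors' exact Algorithm 1; `passive_step`, `active_step`, and `estimate_eou` are hypothetical callables for random labeling, query-strategy labeling, and EoU estimation, respectively):

```python
def pool_based_with_auto_start(T, tau, passive_step, active_step, estimate_eou):
    """Run PL while EoU >= tau, then switch to AL for the rest of the
    T iterations. Returns whether AL was ever started."""
    started_al = False
    for t in range(T):
        if not started_al and estimate_eou() < tau:
            started_al = True        # model is trained enough: switch to AL
        if started_al:
            active_step()            # label b samples via the query strategy
        else:
            passive_step()           # label b samples at random (PL)
    return started_al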
The remaining problem is the determination of τ. Since τ depends on the settings (query strategy, model complexity, and dataset), in the following section we analyze the dependency of EoU on the given target settings and provide empirical formulas to determine τ.
To design the formulas empirically, we analyzed the dependency of EoU on the given target settings. For a concise explanation, we denote the amount of PL and the EoU of the trained model by P and H, respectively, and denote the best cases by P* and H*. To see the dependency of the best cases on the settings, P* and H* were chosen as those of the scenario closest to the peak of the fitted curve.

1) DEPENDENCY OF H * ON QUERY STRATEGY TO DESIGN f 1 (·)
As shown in Table 2, H* varies depending on the query strategy. First, we analyze the trend of H* in relation to the meaning of EoU.
As explained in the previous section, EoU (H) can be interpreted as a measure of the training degree of a model, which depends on the number of samples in the unlabeled dataset that are unfamiliar to the model. For Entropy and VarR, H* was measured around 0.90 ∼ 0.93. This means the query strategy works well when the samples uncertain to the model amount to around 90 ∼ 93% of the unlabeled dataset, i.e., the model needs to understand around 7 ∼ 10% of the unlabeled dataset during PL. For DBAL, the model needs to understand only 3 ∼ 5% of the unlabeled dataset to utilize the AL query strategy well. In our opinion, the reason that DBAL needs less PL (a higher H*) than the others is that DBAL utilizes multiple predictions for one sample. By aggregating multiple predictions, DBAL can estimate more reliable sample-uncertainties than Entropy and VarR, which use only one prediction per sample. Hence, for DBAL, the query strategy can be utilized effectively even with a model of low training degree. Thus it is reasonable that DBAL allows a higher H* than Entropy and VarR.
Since α_1 ∈ [0, 1] in f_1(·) denotes the reliability of the sample-uncertainty estimated by a query strategy, a high α_1 means that the estimated sample-uncertainty is reliable. Then, considering the results on H* in Tables 2 and 3, we adjust τ_0 according to α_1 for the query strategy, following the empirically designed Formula 1.
Formula 1: Depending on the query strategy, the hyper-parameter τ_1 is determined according to α_1 ∈ [0, 1] as

τ_1 = f_1(τ_0, α_1) = τ_0 + 0.1 α_1 − 0.05.

Remark 1: In the analysis, we observed a positive correlation between the best EoU H* and the guessed reliability α_1 of the query strategy, and so adopted a linear model for the function f_1(τ_0, α_1). First, we set τ_0 as the baseline for the Entropy strategy. Then we set the slope and bias to 0.1 and −0.05 so that τ_1 stays within [0, 1] and τ_0 is not adjusted for the middle reliability α_1 = 0.5. We provide examples of how to determine τ_0 and α_1. We adopt the Entropy strategy as the baseline. Since H* for the Entropy strategy is less than 0.93, we set τ_0 = 0.93 and α_1 = 0.5 (median) as the middle reliability for the baseline. VarR is similar to Entropy, so α_1 for VarR is set to the same value as that of Entropy. For DBAL, we increase α_1 to 0.7 since its sample-uncertainty is more reliable than that of Entropy, as mentioned above. For LLAL, which utilizes an add-on module to calculate sample-uncertainty [4], we decrease α_1 to 0.2 since the add-on module only returns scalar values, which requires a lot of training to yield reliable predictions.

2) DEPENDENCY OF H * ON MODEL COMPLEXITY TO DESIGN f 2 (·)
According to the results in Table 2, the P* and H* values tend to decrease as the model becomes more complex (note that a higher-numbered model is more complex). In general, the more complex a model is, the better it can learn the given samples by over-fitting. Over-fitting can be beneficial to AL because an over-fitted model can distinguish well between untrained (uncertain) samples and trained samples. As shown in Table 2, a complex model can achieve a similar EoU (H*) with fewer PL samples (P*) than a simple model.
Since α_2 ∈ [0, 1] in f_2(·) denotes a guessed model complexity, a high α_2 means that the model is complex. Then, considering the results on H* in Table 2, we adjust τ_1 according to α_2, following Formula 2.
Formula 2: Depending on the model complexity, the hyper-parameter τ_2 is determined according to α_2 ∈ [0, 1] as

τ_2 = f_2(τ_1, α_2) = τ_1 − 0.1 α_2 + 0.05.

Note that α_2 is subtracted because a complex model has a lower H* than a simple model. Remark 2: Formula 2 is also designed as a linear function in a similar way to Formula 1. However, the signs of the slope and bias are reversed to model the negative correlation between H* and α_2. We set the baseline to the M4 (original Keras) network; thus α_2 = 0.5 is given for the M4 network. We set α_2 = 0.7 for ResNet-18 since ResNet-18 is more complex than the M4 network.

3) DEPENDENCY OF H* ON DATASET TO DESIGN f_3(·)
Table 3 shows the experimental results for various datasets. CINIC-10 has intermediate complexity between CIFAR-10 and CIFAR-100 [32]. Regardless of the query strategy, H* for a complex dataset is higher than that for an easy dataset. We analyze this result as follows. Under the same conditions (network, budget, etc.), a complex dataset makes it hard for a model to understand the unlabeled data. Thus, for complex datasets, the overall training degree of a model becomes low, and H* therefore tends to increase.
Thus we define α_3 ∈ [0, 1] to represent the dataset complexity in f_3(·). A high α_3 means that a dataset is complex. Considering the results on H* in Table 3, we adjust τ_2 according to α_3, following Formula 3.
Formula 3: Depending on the dataset complexity, the hyper-parameter τ is determined according to α_3 ∈ [0, 1] as

τ = f_3(τ_2, α_3) = τ_2 + 0.1 α_3 − 0.05.

Remark 3: Formula 3 is also designed as a linear function in a similar way to Formulas 1 and 2. The signs of the slope and bias are set equal to those of Formula 1 in order to model the positive correlation between H* and α_3. We set the baseline to the MNIST case; thus α_3 = 0.5 is given for MNIST. Then we set α_3 = 0.8/0.9/1.0 for CIFAR-10 / CINIC-10 / CIFAR-100, respectively. Finally, using Formulas 1-3, we can determine the hyper-parameter τ ≈ H* for Algorithm 1.
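The chained determination of τ can be sketched as follows (a reconstruction assuming the linear forms with slope 0.1 and bias 0.05 described in Remarks 1-3, since the printed equations did not survive extraction; the α values are the guessed reliabilities/complexities from the remarks):

```python
def determine_tau(tau0, alpha1, alpha2, alpha3):
    """Chain Formulas 1-3 to obtain the AL-start threshold tau."""
    tau1 = tau0 + 0.1 * alpha1 - 0.05  # Formula 1: query-strategy reliability
    tau2 = tau1 - 0.1 * alpha2 + 0.05  # Formula 2: model complexity (negative)
    tau = tau2 + 0.1 * alpha3 - 0.05   # Formula 3: dataset complexity
    return tau
```

By construction, setting all α values to the baseline 0.5 leaves τ equal to τ_0 = 0.93, and a more reliable query strategy (e.g., DBAL with α_1 = 0.7) raises τ, i.e., allows AL to start after less PL.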

V. EXPERIMENTS
A. EVALUATION OF PROPOSED METHOD
To verify the effectiveness of the proposed method for determining the best amount of PL, we compared three methods: (1) the 'existing method' adopted in each AL study, where a pre-determined amount of PL is performed only in the first iteration and the AL algorithm is performed in the remaining iterations; (2) the 'proposed method', which determines the amount of PL automatically using Algorithm 1 with τ given by Formulas 1-3; and (3) a 'rigorous search', which selects the best amount of PL by evaluating all T + 1 scenarios. We evaluated the methods with various AL query strategies, including Entropy, VarR, DBAL, K-means, Core-set [8], and LLAL [4]. The experimental settings are equivalent to our analysis settings. Table 4 shows the results. In most cases, the proposed method achieved equal or better performance than the existing method. However, for the Core-set method, the proposed method achieved degraded performance. In our opinion, the reason is that τ was not accurately estimated, since Core-set does not utilize sample-uncertainty, whereas the proposed method is based on EoU. We also provide a non-parametric analysis of the repeated experiments in Fig. 5 by presenting all individual results. Each circle denotes an individual result, and the horizontal line denotes the mean value. For each AL algorithm, the results of the existing method (blue), the proposed method (red), and the rigorous search are presented.
Since the proposed method calculates EoU at every iteration of an AL scenario, it requires more computation than the existing method. Calculating EoU takes about as much time as applying the AL query strategy to obtain sample-uncertainties. Therefore, for a scenario with T iterations, the proposed method requires additional computation equivalent to T applications of the query strategy. In general, the time complexity of a query strategy is much smaller than that of model training, so the additional cost of the proposed method is modest.

VI. SUMMARY AND DISCUSSION
In this paper, we revealed how the amount of PL affects the performance of an AL algorithm through extensive experiments. From the results, we observed that finding the best amount of PL before starting AL is important. To utilize our observations practically, we developed a method to automatically determine the best amount of PL without extensive experiments. To this end, we suggested the Entropy of sample-Uncertainty (EoU) to measure the training degree of a model and utilized it as a criterion to determine the best amount of PL. By exploring the trends of the best amount of PL and EoU for various AL settings, we suggested empirical formulas to determine the hyper-parameter (τ) of the proposed method. We demonstrated the effectiveness of the proposed method by applying it to various AL algorithms.

FIGURE 6. Results of our experimental scenarios on the MNIST dataset, with three AL methods (Entropy, VarR, and DBAL) and four network models (M2, M3, M4, and M5). In each graph, the x axis denotes the number of labeled data acquired by PL and the y axis denotes the test accuracy of the model. The mean and standard deviation of the repeated results are indicated as a red point and a blue bar, respectively. A cubic polynomial fitted curve (green curve) of the results is also presented to clearly show the tendency of the results.

APPENDIX A
IMPLEMENTATION DETAILS
A. NETWORK STRUCTURES
In the experiments, we employed the Keras implementation [34] for the deep model on MNIST [31]. To analyze the influence of network complexity on AL, we further experimented with four variations of the original implementation: M1, M2, M3, M4 (original), and M5.
The detailed structure of each model for MNIST is described in Table 5. For CIFAR and CINIC-10, we employed ResNet-18 [35] as the deep model. A drop-out layer was added before the last fully connected layer with a dropout ratio of 0.5. The first convolution layer of ResNet-18 was modified to '3×3 kernel, stride of 1, and padding of 1' to fit CIFAR and CINIC-10.
B. DATASETS AND PRE-PROCESSING
Pre-processing techniques were applied to the data during model training. For CIFAR and CINIC-10, we normalized the data with the channel-wise mean (0.4914, 0.4822, 0.4465) and standard deviation (0.2470, 0.2435, 0.2616), and augmented the original data with random cropping from zero-padded 36 × 36 images and random horizontal flipping.
C. TRAINING SCHEME
Models were trained with the Adam optimizer [36] with a learning rate of 5 × 10^−4 and a batch size of 64 for 100 epochs. For each iteration of pool-based AL, the models were trained from scratch with the updated labeled dataset.

D. COMPUTING ENVIRONMENTS
We implemented the proposed algorithm using PyTorch library [37]. We ran the experiments on Intel Core i7-6700K CPU @ 4.00 GHz and a single NVIDIA GeForce GTX 1080-Ti of 12 GB memory.

APPENDIX B FULL RESULTS OF EVALUATION SCENARIOS
In this section, we provide the experimental results that are not included in the main manuscript. Fig. 6 shows the results for MNIST with three AL methods (Entropy [19], VarR [20], and DBAL [6]) and four networks (M2, M3, M4, and M5). Fig. 7 shows the results for CINIC-10 [32] and CIFAR-10/100 [33] with three AL methods (Entropy, VarR, and DBAL) and ResNet-18 [35]. Note that we fit the results of Entropy and VarR on CINIC-10 with a quartic polynomial curve since the results are noisy. We provide some explanations for the noisy results. For MNIST, we obtained smooth results since (1) the experiments were repeated 20 times, (2) the results are reported finely with a small b, and (3) the images of MNIST are well aligned. However, for CIFAR and CINIC, the results appear noisy since (1) the experiments were repeated 5 times, the same as in existing works, (2) b is larger than that for MNIST, and (3) CIFAR contains hard samples.
An interesting point is that applying AL for only a few iterations shows worse performance than not applying AL at all. We guess the reason is as follows. Most datasets may contain outlier samples. Outliers deviate from the distribution of the entire dataset but have a significant influence on the formation of the decision boundary. From the perspective of query strategies, outliers may seem informative, but in reality they adversely affect performance. Some outliers can be included in Q_u during the first few steps of applying the AL algorithm. Consequently, applying only PL can achieve better performance than applying AL for very few steps. Note that the results on MNIST show a clear shape since the images of MNIST are well aligned (i.e., there are fewer outliers).
SANGDOO YUN received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, South Korea, in 2010, 2013, and 2017, respectively. He is currently a Research Scientist with NAVER AI Laboratory. His current research interests include computer vision, deep learning, and image classification.
JIN YOUNG CHOI (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in control and instrumentation engineering from Seoul National University, Seoul, South Korea, in 1982, 1984, and 1993, respectively. From 1984 to 1989, he was with the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea, where he was involved in the project of switching systems. From 1992 to 1994, he was with the Basic Research Department, ETRI, where he was a Senior Member of Technical Staff involved in the neural information processing system. From 1998 to 1999, he was a Visiting Professor with the University of California at Riverside, Riverside, CA, USA. Since 1994, he has been with Seoul National University, where he is currently a Professor with the School of Electrical Engineering. He is also with the Engineering Research Center for Advanced Control and Instrumentation, Automation and Systems Research Institute, and the Automatic Control Research Center, Seoul National University. His current research interests include adaptive and learning systems, visual surveillance, motion pattern analysis, object detection, object tracking, and pattern recognition.