A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification

Pool-based active learning (AL) aims to optimize the annotation process (i.e., labeling) as the acquisition of annotations is often time-consuming and therefore expensive. For this purpose, an AL strategy queries annotations intelligently from annotators to train a high-performance classification model at a low annotation cost. Traditional AL strategies operate in an idealized framework. They assume a single, omniscient annotator who never gets tired and charges uniformly regardless of query difficulty. However, in real-world applications, we often face human annotators, e.g., crowd or in-house workers, who make annotation mistakes and can be reluctant to respond if tired or faced with complex queries. Recently, a wide range of novel AL strategies has been proposed to address these issues. They differ in at least one of the following three central aspects from traditional AL: (1) They explicitly consider (multiple) human annotators whose performances can be affected by various factors, such as missing expertise. (2) They generalize the interaction with human annotators by considering different query and annotation types, such as asking an annotator for feedback on an inferred classification rule. (3) They take more complex cost schemes regarding annotations and misclassifications into account. This survey provides an overview of these AL strategies and refers to them as real-world AL. Therefore, we introduce a general real-world AL strategy as part of a learning cycle and use its elements, e.g., the query and annotator selection algorithm, to categorize about 60 real-world AL strategies. Finally, we outline possible directions for future research in the field of AL.


I. INTRODUCTION
INFORMATION and communication technology has become an integral part of humans' lives and supports us embedded in our surroundings [1]. In particular, improving computational power and the ease of collecting a plethora of data has promoted machine learning (ML) [2,3]. Nowadays, ML models are employed in various fields [4], ranging from recommender systems [5], text classification [6], and speech recognition [7] to object detection in videos [8]. In this survey, we consider ML for building classification models. They learn from data sets consisting of instances and their corresponding annotations (e.g., class labels, membership probabilities, etc.). However, annotating instances may be costly and time-consuming since it is often manually executed by annotators.

M. Herde, D. Huseljic, B. Sick, and A. Calma are with the department of Intelligent Embedded Systems, University of Kassel, Germany (e-mail: {marek.herde dhuseljic bsick adrian.calma}@uni-kassel.de).
This research was supported by the CIL project at the University of Kassel under internal funding P/710 and P/1082. We thank Daniel Kottke, Tuan Pham Minh, Lukas Rauch, and Robert Monarch for their comments that greatly improved this survey.

In general, an annotator is an information or knowledge source such as a human, the Internet, or a simulation system [9] and can annotate various types of queries. In this survey, we focus on human annotators. Other commonly used terms are oracle [10], expert [11], worker [12], teacher [13], and labeler [14]. A large group of (human) annotators who do not necessarily know each other is also named a crowd [15]. The exact characteristics of a crowd, e.g., the number and heterogeneity of the annotators, depend on the requirements of the crowdsourcing initiative at hand.
Active learning (AL) is a subfield of human-in-the-loop learning [16] and interactive ML [17,18], which directly and iteratively interacts with human annotators. It aims at reducing annotation and misclassification cost [19]. Thus, an AL strategy queries annotations for instances from which the classification model is expected to learn the most [20]. As a result, the classification model trained on an actively selected subset of annotated instances reaches, on average, a performance superior to that of a model trained on a randomly selected subset. AL strategies have been successfully employed in several applications, e.g., malware detection [21], waste classification [22], classification of medical images [23], and training of robots [24]. However, many of these AL strategies make three central assumptions that limit their practical use [25], and we refer to them as traditional AL.
(1) There is a single omnipresent and omniscient annotator providing the correct annotation for each query at any time. This assumption conflicts with the available options of annotation acquisitions. In particular, crowdsourcing represents a popular way to obtain data annotations [26]. However, on crowdsourcing platforms, e.g., Amazon's Mechanical Turk [27,28], CloudResearch (formerly TurkPrime) [29], and Prolific Academic [30], multiple error-prone annotators have to be considered [31]. Otherwise, annotation mistakes (e.g., noisy class labels) will degrade the classification model's performance [32,33].
(2) The cost of an annotation is constant across the queries. This assumption is violated in cases such as biomedical citation screening [11], in which articles are to be classified as relevant or irrelevant for a particular research topic. The time to annotate an article depends on its length, complexity, and the queried annotator [34]. Hence, the cost varies across pairs of articles and annotators.

Fig. 1. Overview of real-world AL and structure of this survey's main body: The blue nodes of the tree name the different topics of real-world AL identified in this survey. Additionally, they provide a brief summary of each topic and reference the specific section with more details. Nodes of the tree (root: Real-world Active Learning):
• Objective and Learning Cycles (Section II)
• Selection Algorithms (Section VI): Which query is presented to which annotator?
    • Joint Selection (Section VI-B): Pairs of queries and annotators are jointly selected.
    • Sequential Selection (Section VI-A): Queries are selected in the first step and annotators in the second step.
• Annotator Performance Models (Section V): Who is the best annotator?
    • Annotator Performance (Section V-B): Annotators differ in their performances regarding annotation quality.
    • Influence Factors (Section V-A): Annotators are influenced by several factors during the annotation process.
• Interaction Schemes (Section IV): How to interact with annotators?
    • Annotation Types (Section IV-B): There are different ways of providing learning information as annotations.
    • Query Types (Section IV-A): There are different ways of querying an annotator for learning information.
• Cost Types (Section III): Which costs are to be considered?
    • Annotation Cost (Section III-B): Costs may be imbalanced and they may depend on the annotator or query.
    • Misclassification Cost (Section III-A): Costs may be imbalanced and they may depend on the type of misclassification.

(3) Each query requests the class label of a specific instance. This assumption ignores the possibility of designing more general and effective queries [25,35]. Some of these queries also require annotations to be more complex than simple class labels. We avoid confusion regarding the terms label and annotation by defining an annotation as the most general reply to a query, e.g., an annotator could answer a query with "I have no idea!". Correspondingly, a class label is a specific example of an annotation.
Various concepts have been proposed to overcome the limitations above. These include collaborative interactive learning [36,37] and proactive learning [38,39]. We summarize their main differences to traditional AL through the three following aspects: (1) Instead of assuming a single omniscient and omnipresent annotator, they consider (multiple) human annotators whose performances can be affected by various factors, e.g., missing expertise, fatigue, and malicious behavior. (2) Instead of repeatedly querying class labels of instances, they generalize the interaction with human annotators by considering different types of queries and annotations, such as asking an annotator for feedback on an inferred classification rule. (3) Instead of assuming uniform cost, they take more complex cost schemes regarding annotations and misclassifications into account. In this survey, we provide an overview of existing AL strategies taking at least one of the three aspects into account and refer to them as real-world AL. We limit the scope by including only strategies for classification in the pool-based AL setting [19] because it is the most researched AL field. However, many implications of this survey go beyond this scope and are emphasized in the outlook. Based on these prerequisites, this survey makes the following contributions:
• We formalize the objective of a real-world AL strategy as the optimal annotation sequence to a cost-sensitive classification problem.
• We propose a taxonomy of existing cost types, interaction schemes, annotator performance models, and selection algorithms to compare different real-world AL strategies.
• We give a comprehensive comparison of about 60 real-world AL strategies and analyze them regarding their handling of error-prone annotators, usage of query and annotation types, consideration of imbalanced misclassification and annotation cost, and query-annotator selection.
• We identify five unsolved challenges in the real-world AL setting and formulate them as future research directions.
We structure this survey's main body according to Fig. 1, which gives an overview of the main topics reviewed in this survey. Sections III-VI are each accompanied by a tabular literature overview of real-world AL strategies, with detailed analyses provided in this survey's appendices. Based on these literature overviews, we formulate challenges in the setting of real-world AL and beyond in Section VII. We conclude this survey in Section VIII.

II. REAL-WORLD ACTIVE LEARNING
In this section, we introduce the problem setting of real-world AL. A real-world application illustrates a possible scenario violating the assumptions of traditional AL and thus indicating the need for real-world AL strategies. In the context of this application, we also explain the notation used throughout this survey. Moreover, we formalize the objective of real-world AL as the optimal solution to a cost-sensitive classification problem and present learning cycles finding greedy approximations of this solution.

A. Motivating Application
An example of a practical use case requiring the employment of a real-world AL strategy is the classification of low-voltage grids described in [40,41]. They connect most consumers, e.g., households, to the electrical power system, and an illustration of such a grid is given in Fig. 2.
Formerly, the power system was designed to transport energy from a few central generators (plants) to the consumers. However, recent developments are characterized by an increasing number of installed distributed generators, particularly photovoltaic generators, in low-voltage grids [42]. These distributed generators may provoke, e.g., an overload of electrical components, and violate critical voltage values within a grid. Assessing the hosting capacity of low-voltage grids for distributed generation supports the responsible distribution system operator in deciding for which low-voltage grid an investment in the infrastructure could be most beneficial such that its sustainable operational reliability is guaranteed [40].

Fig. 2. Example of real-world AL: The AL strategy queries four human annotators to assess the structure of a low-voltage grid regarding its hosting capacity for distributed generators. Their answers show possible issues when employing AL in real-world applications. The structure of a grid is assessed through ordinal class labels leading to non-uniform misclassification costs. The annotators can have conflicting opinions or may have no time to process a query. An annotator could also be reluctant to answer a query due to its difficulty. Moreover, one annotator may demand more money for an annotation than another. Sometimes the annotation costs are even unknown in advance (e.g., if annotation time is the major cost factor).

In this context, a significant challenge is the high complexity of low-voltage grids. Therefore, multiple annotators with heterogeneous background knowledge are requested to provide annotations, such as strong, weak, etc., classifying the hosting capacities of low-voltage grids (cf. Fig. 2). To do so, the annotators have access to a grid diagram and corresponding tabular information. The provided annotations, i.e., ordered classes and confidence assessments in this case, are prone to error because of missing expertise, for example. Moreover, the annotations are expensive because the annotators have to investigate the grids to generate an annotation regarding the hosting capacity. A real-world AL strategy can save time and money by training a classification model to categorize each possible low-voltage grid's hosting capacity automatically.
As a result, a representative question of this survey would be: "How to design a real-world AL strategy for solving problems such as the classification of low-voltage grids?".

B. Formalization of Problem Setting
An instance is described by a $D$-dimensional feature vector $\mathbf{x} = (x_1, \dots, x_D)^{\mathrm{T}}$, $D \in \mathbb{N}$. It is drawn from the distribution $\Pr(X) = \Pr(X_1, \dots, X_D)$ defined over a $D$-dimensional feature (input) space $\Omega_X$, where $X_d$ denotes the random variable of the feature $d \in \{1, \dots, D\}$ and $X$ is used as a shortcut for the $D$-dimensional random variable representing all features. The observed multi-set of independently and identically distributed instances is given by $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subseteq \Omega_X$. For example, the instance of the low-voltage grid illustrated in Fig. 2 may take the form

[Eq. 1: feature vector of the example low-voltage grid; its features include # house connections.]

Each instance $\mathbf{x}_n$ belongs to a true class $y_n \in \Omega_Y$ sampled from the categorical distribution $\Pr(Y \mid X = \mathbf{x}_n)$ with $Y$ denoting the random variable of the true class labels. In total, there are $|\Omega_Y| = C \in \mathbb{N}_{\geq 2}$ classes. The multi-set of true class labels for the observed instances in $\mathcal{X}$ is denoted as $\mathcal{Y} = \{y_1, \dots, y_N\}$. Regarding the classification of low-voltage grids, the class labels would have an ordinal structure ranging from a very weak ($Y = 1 \in \Omega_Y$) to a very strong ($Y = 5 \in \Omega_Y$) hosting capacity.
As pointed out, there is no omniscient and omnipresent annotator in most applications. In the context of real-world AL, we work with (multiple) error-prone annotators, whom we summarize in the set $\mathcal{A} = \{a_1, \dots, a_M\}$. Each annotator can be queried to provide annotations. An annotation is not restricted to be a specific class label $y \in \Omega_Y$, but all kinds of annotations are allowed, e.g., confidence scores [43], probabilistic labels [44], or refusing to answer a query [45]. The space of possible annotations is summarized by the set $\Omega_Z$, e.g., $\Omega_Z = [0, 1]$ if probabilistic class labels are expected for a binary classification problem. The (multivariate) random variable for the annotations of annotator $a_m$ is denoted as $Z_m$.
A query is not restricted to asking for the class label of a specific instance $\mathbf{x}_n$; more general queries such as "Do instance $\mathbf{x}_n$ and instance $\mathbf{x}_m$ belong to the same class?" can also be formulated [46]. To learn from queries and annotations, a classification model requires appropriate mathematical representations of them. An exemplary representation of the query given above would be $q = \{\mathbf{x}_n, \mathbf{x}_m\}$. The mathematical representations of all possible queries are summarized in a set $\mathcal{Q}_{\mathcal{X}}$, which depends on the underlying classification problem and the set of observed instances $\mathcal{X}$ [47]. Due to this dependency, we can interpret the queries as random events, and $Q$ denotes the associated random variable. In most cases, a query asks for the class label of a specific instance such that we can define $\mathcal{Q}_{\mathcal{X}} = \mathcal{X}$ as query set.
The task of a real-world AL strategy is to generate a sequence for the execution of the annotation process, which is assumed to consist of countable distinct (time) steps. In other words, a sequence answers the question: "Which annotator has to answer which query at which time step?". Accordingly, we define a sequence as a function $\mathcal{S}: \mathbb{N} \to \mathcal{P}(\mathcal{Q}_{\mathcal{X}} \times \mathcal{A})$, such that $(q_l, a_m) \in \mathcal{S}(t)$ induces an annotation of query $q_l$ by annotator $a_m$ at time step $t \in \mathbb{N}$. The annotation behavior of an annotator can be modeled through a conditional distribution $\Pr(Z_m \mid Q = q_l, t)$ from which $z_{lm}^{(t)} \in \Omega_Z$ is drawn as annotation of annotator $a_m$ for query $q_l$, i.e., $z_{lm}^{(t)} \sim \Pr(Z_m \mid Q = q_l, t)$. As a result, annotators are not necessarily deterministic in their decisions. Moreover, their decisions might change throughout the annotation process, e.g., if an annotator gets tired [48]. An annotation process executed until the beginning of the time step $t$ according to a sequence $\mathcal{S}$ leads to a data set consisting of triplets of a query, an annotator, and an annotation:
$$\mathcal{D}(t) = \{(q_l, a_m, z_{lm}^{(t')}) \mid t' < t \wedge (q_l, a_m) \in \mathcal{S}(t')\}.$$
We define the end of a sequence $\mathcal{S}$ as the last time step at which an annotation has been performed, i.e., where the selection is not empty:
$$t_{\mathcal{S}} = \max\{t \in \mathbb{N} \mid \mathcal{S}(t) \neq \emptyset\}.$$
On a data set $\mathcal{D}(t)$, a classification model described by its parameters $\theta$ can be trained. We denote the resulting parameters of the classification model by $\theta_{\mathcal{D}(t)}$. For example, these parameters would correspond to weights in the case of a neural network [49] taken as a classification model. The trained classification model predicts class labels for given instances, where the prediction for an instance $\mathbf{x} \in \Omega_X$ is denoted by $\hat{y}(\mathbf{x} \mid \theta_{\mathcal{D}(t)}) \in \Omega_Y$. In many cases, the classification model can not only predict the class label of an instance but also estimate the probabilities of class memberships. In this case, we denote the estimated class membership probability that a given instance $\mathbf{x}$ belongs to class $y$ by $\Pr(Y = y \mid X = \mathbf{x}, \theta_{\mathcal{D}(t)})$.
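The notation above can be made concrete with a small sketch. It assumes, purely for illustration, that a sequence is stored as a dictionary mapping each time step to a list of query-annotator pairs and that an annotator's behavior is a deterministic function (in the formalization it is a random draw from a conditional distribution). The names `end_of_sequence`, `build_data_set`, and `toy_annotate` are hypothetical and not part of the surveyed literature.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Query = Hashable
Annotator = str
Triplet = Tuple[Query, Annotator, object]
# Assumed representation: time step t -> list of selected (query, annotator) pairs.
AnnotationSequence = Dict[int, List[Tuple[Query, Annotator]]]


def end_of_sequence(sequence: AnnotationSequence) -> int:
    """Return the last time step with a non-empty selection (0 if there is none)."""
    return max((t for t, pairs in sequence.items() if pairs), default=0)


def build_data_set(
    sequence: AnnotationSequence,
    annotate: Callable[[Query, Annotator, int], object],
    t: int,
) -> List[Triplet]:
    """Collect (query, annotator, annotation) triplets for all time steps before t."""
    data: List[Triplet] = []
    for step in sorted(sequence):
        if step >= t:
            break
        for query, annotator in sequence[step]:
            data.append((query, annotator, annotate(query, annotator, step)))
    return data


def toy_annotate(query: Query, annotator: Annotator, step: int) -> object:
    """Toy behavior: annotator 'a2' refuses to answer from step 3 onward (fatigue)."""
    if annotator == "a2" and step >= 3:
        return "no answer"
    return "strong" if query == "grid-1" else "weak"


# Hypothetical sequence over two grids and two annotators; step 3 selects nothing.
sequence = {1: [("grid-1", "a1")], 2: [("grid-2", "a2")], 3: [], 4: [("grid-1", "a2")]}
last_step = end_of_sequence(sequence)
data_set = build_data_set(sequence, toy_annotate, last_step + 1)
```

The data set passed to the cost terms of the objective is the one available after the last annotation step, which is why the example evaluates `build_data_set` at `last_step + 1`.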

C. Objective
Given the formalized problem setting and generalizing the objective definitions in [38,50] toward all query and annotation types including complex cost schemes, we formulate the objective of real-world AL as determining the optimal annotation sequence for a cost-sensitive classification problem:
$$\mathcal{S}^* = \underset{\mathcal{S} \in \Omega_{\mathcal{S}}}{\arg\min} \left[ \mathrm{MC}(\theta_{\mathcal{D}(t_{\mathcal{S}}+1)} \mid \boldsymbol{\kappa}) + \mathrm{AC}(\mathcal{D}(t_{\mathcal{S}}+1) \mid \boldsymbol{\nu}) \right] \quad \text{subject to } \mathcal{C}, \tag{4}$$
where $\Omega_{\mathcal{S}}$ denotes the set of all potential sequences. MC and AC are the misclassification and annotation cost, respectively. We expect them to be on the same scale. Otherwise, extra normalization might be necessary. The optimal annotation sequence $\mathcal{S}^*$ minimizes the total cost while satisfying all constraints $\mathcal{C}$. A common constraint is a maximum annotation budget $B \in \mathbb{R}_{>0}$, i.e., $\mathcal{C} = \{\mathrm{AC}(\mathcal{D}(t_{\mathcal{S}}+1) \mid \boldsymbol{\nu}) \leq B\}$. The total cost is decomposed into MC and AC, where the vector $\boldsymbol{\kappa}$ encodes given hyperparameters for computing the MC, e.g., a cost matrix, and the vector $\boldsymbol{\nu}$ represents the hyperparameters for computing AC, e.g., wages of the annotators. We provide a more detailed discussion on different cost schemes in the setting of real-world AL in Section III.
Since it is difficult to find the optimal annotation sequence $\mathcal{S}^*$ given by Eq. 4 in advance [38], an AL strategy aims to approximate the optimal solution through a greedy approach. Therefore, the annotation sequence $\mathcal{S}$ is defined iteratively at run time by executing a cycle where one iteration corresponds to a single time step. We start with the description of such a cycle for traditional AL. Subsequently, we restructure it to fit the setting of real-world AL.

D. Traditional Active Learning Cycle
In traditional AL, a single omniscient and omnipresent annotator, i.e., $\mathcal{A} = \{a_1\}$, is assumed to be available [19]. Moreover, a query expects the class label of an instance such that the set of queries can be represented by $\mathcal{Q}_{\mathcal{X}} = \mathcal{X}$ and the set of annotations is given by the set of classes, i.e., $\Omega_Z = \Omega_Y$. Traditional AL strategies distinguish between the labeled (annotated) set $\mathcal{L}(t) = \{(\mathbf{x}_n, y_n) \mid (\mathbf{x}_n, a_1, y_n) \in \mathcal{D}(t)\}$ and the unlabeled (non-annotated) set $\mathcal{U}(t) = \{\mathbf{x}_n \mid \mathbf{x}_n \in \mathcal{X} \wedge (\mathbf{x}_n, y_n) \notin \mathcal{L}(t)\}$ obtained after executing the $(t-1)$-th iteration cycle. The main idea is to develop a strategy intelligently selecting instances from the unlabeled pool $\mathcal{U}(t)$ to which the annotator $a_1$ assigns true class labels. Due to the omniscience of this annotator $a_1$, the annotation distribution satisfies $\Pr(Z_1 = y_n \mid X = \mathbf{x}_n, t) = 1$ for all iteration cycles $t \in \mathbb{N}$ and observed instances $\mathbf{x}_n \in \mathcal{X}$. Fig. 3 summarizes the entire selection procedure as a cycle.
The selection of an instance is based on a so-called utility measure $\phi: \mathcal{X} \to \mathbb{R}$ [51] estimating the utilities of the observed instances $\mathcal{X}$ regarding the classification model to be trained. In general, the unlabeled instance with the maximum utility is selected in iteration cycle $t$:
$$\mathbf{x}_{n^*} = \underset{\mathbf{x}_n \in \mathcal{U}(t)}{\arg\max}\ \phi(\mathbf{x}_n).$$
There are many approaches to computing instances' utilities. In the following, we briefly describe two fundamental concepts:
• The simplest concept of utility measures is uncertainty sampling (US) [52], which usually requires an instance's class membership probabilities estimated by the classification model to be trained. Alternatively, distances to decision boundaries [53] are used as proxies of them. US ranks all instances in the unlabeled pool $\mathcal{U}(t)$ based on an uncertainty measure and queries the label for the instance with the maximum uncertainty regarding its class information. A common uncertainty measure is the entropy $H$ [54] of the class distribution such that an instance's utility is computed as
$$\phi(\mathbf{x}_n) = H[Y \mid X = \mathbf{x}_n, \theta_{\mathcal{L}(t)}] = -\sum_{y \in \Omega_Y} \Pr(Y = y \mid X = \mathbf{x}_n, \theta_{\mathcal{L}(t)}) \log \Pr(Y = y \mid X = \mathbf{x}_n, \theta_{\mathcal{L}(t)}).$$
• The decision-theoretic framework expected error reduction (EER) [55] estimates the performance of the classification model. To this end, EER assumes that the instances in the unlabeled pool $\mathcal{U}(t)$ form a validation set. For each unlabeled instance, the classification model's expected error is computed on this validation set by retraining the classification model with each combination of the given unlabeled instance and its possible class label. The multiple retraining procedures of the classification model lead to high computational complexity. The resulting estimate of the negative expected error defines the utility measure. Correspondingly, EER selects the instance leading to the minimum estimated error.
One of the main challenges regarding the design of utility measures is the exploration-exploitation trade-off. On the one hand, we aim to select instances near the classification model's decision boundary to refine it (exploitation). On the other hand, we aim to select instances in unknown regions (exploration) [56]. More advanced AL strategies balance this trade-off by considering distances to the decision boundaries, density, class distribution estimates [57,58,59], or using a Bayesian approach [60].
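As a minimal sketch of entropy-based uncertainty sampling, the following snippet ranks unlabeled instances by the entropy of their estimated class distribution and returns the most uncertain one. The class-membership estimates are hypothetical values standing in for the output of a trained classification model, and the function names are illustrative assumptions.

```python
import math
from typing import Dict, Hashable, Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy (natural logarithm) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def select_most_uncertain(class_probs: Dict[Hashable, Sequence[float]]) -> Hashable:
    """Return the unlabeled instance whose class distribution has maximum entropy."""
    return max(class_probs, key=lambda x: entropy(class_probs[x]))


# Hypothetical class-membership estimates for three unlabeled instances.
unlabeled = {
    "x1": [0.9, 0.1],
    "x2": [0.55, 0.45],
    "x3": [0.7, 0.3],
}
selected = select_most_uncertain(unlabeled)  # "x2": closest to the uniform distribution
```

Because this utility only looks at the model's current estimates, it favors instances near the decision boundary (exploitation) and does not by itself explore unknown regions.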
In batch mode AL [23], we must consider the diversity of instances since a batch of instances is selected in each iteration cycle. However, a detailed analysis of instance utility measures in the traditional AL setting is beyond the scope of this survey, and more detailed discussions of them are given in [19,20,51,61].

Fig. 3. Traditional AL cycle [19]: (1) At the start of the iteration cycle $t$, the traditional AL strategy selects an unlabeled instance $\mathbf{x}_{n^*}$ from the unlabeled pool: $\mathcal{U}(t+1) = \mathcal{U}(t) \setminus \{\mathbf{x}_{n^*}\}$. (2) Subsequently, the instance is presented to the omniscient annotator $\mathcal{A} = \{a_1\}$ who provides its true class label $y_{n^*}$. The resulting instance-label pair is inserted into the labeled pool: $\mathcal{L}(t+1) = \mathcal{L}(t) \cup \{(\mathbf{x}_{n^*}, y_{n^*})\}$, (3) on which the classification model is retrained by updating its parameters $\theta_{\mathcal{L}(t)} \to \theta_{\mathcal{L}(t+1)}$. (4) At the end of the cycle, the traditional AL strategy decides whether to continue or to stop learning. This decision is made by a so-called stopping criterion [62,63,64], which is part of ongoing research and not within this survey's scope.
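The four steps of the traditional cycle can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are not taken from the survey: instances are one-dimensional, the classification model is a nearest-class-mean classifier for two classes, the utility is the negative distance to the decision boundary (used as an uncertainty proxy), the initial labeled set contains both classes, and the stopping criterion is a fixed number of queries.

```python
from typing import Callable, Dict, List, Tuple


def fit_boundary(labeled: Dict[float, int]) -> float:
    """Fit a 1-D nearest-class-mean model and return its decision boundary."""
    means = []
    for label in (0, 1):
        values = [x for x, y in labeled.items() if y == label]
        means.append(sum(values) / len(values))  # requires both classes to be present
    return (means[0] + means[1]) / 2.0


def traditional_al(
    pool: List[float],
    oracle: Callable[[float], int],
    initial: List[float],
    budget: int,
) -> Tuple[Dict[float, int], float]:
    """Run the cycle for `budget` iterations; return the labeled set and final boundary."""
    labeled = {x: oracle(x) for x in initial}
    unlabeled = [x for x in pool if x not in labeled]
    boundary = fit_boundary(labeled)
    for _ in range(budget):  # (4) assumed stopping criterion: fixed number of queries
        if not unlabeled:
            break
        # (1) Select the unlabeled instance closest to the decision boundary.
        x_star = min(unlabeled, key=lambda x: abs(x - boundary))
        unlabeled.remove(x_star)
        # (2) The omniscient annotator returns the true class label.
        labeled[x_star] = oracle(x_star)
        # (3) Retrain the model on the extended labeled set.
        boundary = fit_boundary(labeled)
    return labeled, boundary


# Toy pool on [0, 1]; the hypothetical true boundary lies at 0.62.
pool = [i / 10 for i in range(11)]
labeled, boundary = traditional_al(pool, lambda x: int(x > 0.62), [0.0, 1.0], budget=3)
```

In this toy run, the queried instances concentrate around the true boundary, which illustrates the exploitation behavior discussed above.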

E. Real-world Active Learning Cycle
The traditional AL cycle depicted in Fig. 3 has to be adjusted to fit the setting of real-world AL. Our resulting cycle, including the real-world AL strategy's elements (i.e., query utility measure, annotator performance measure, and selection algorithm), is shown in Fig. 4.
The query utility measure $\phi: \mathcal{Q}_{\mathcal{X}} \to \mathcal{R}_\phi$ is an element that is already part of the traditional AL setting. However, in the real-world AL setting, not only can the class labels of non-annotated (unlabeled) instances be queried, but more general queries can be selected for annotation. This also includes a re-annotation of instances, known as repeated labeling [65], re-labeling [66], or backward instance labeling [67]. Hence, the strict distinction into a non-annotated (unlabeled) set $\mathcal{U}(t)$ and an annotated (labeled) set $\mathcal{L}(t)$ is often not adequate anymore. As a result, the utility measure $\phi$ needs to be adapted to quantify the utility of more general queries. Another adaptation concerns the form of the output of the utility measure. Instead of computing a single score per query, i.e., $\mathcal{R}_\phi \subseteq \mathbb{R}$, a utility measure may provide a more general description for each query, e.g., a distribution, which can then be combined with annotator performance estimates [68,69]. We provide an overview of query utility measures for different query and annotation types in Section IV.
The annotator performance measure $\psi: \mathcal{Q}_{\mathcal{X}} \times \mathcal{A} \to \mathcal{R}_\psi$ represents a novel element compared to traditional AL strategies and is defined through an annotator model. Similar to a classification model, an annotator model has parameters $\omega_{\mathcal{D}(t)}$ learned from a data set $\mathcal{D}(t)$. Its main task concerns the estimation of the performance $\psi(q_l, a_m \mid \omega_{\mathcal{D}(t)}) \in \mathcal{R}_\psi$ of an annotator $a_m$ regarding a query $q_l$ [68], e.g., the probability of providing a correct annotation. In most cases, $\psi(q_l, a_m \mid \omega_{\mathcal{D}(t)})$ is a point estimate, i.e., $\mathcal{R}_\psi \subseteq \mathbb{R}$, but there are also annotator models estimating probability distributions over annotator performances [68,69]. Moreover, an annotator model may account for improvements and deteriorations of annotators' performances, e.g., when an annotator learns or gets exhausted. The annotator performance may also be affected by collaboration mechanisms between the annotators, e.g., the best annotator is asked to teach the worst annotator [70]. We provide an overview of annotator performance measures in Section V.
A real-world AL strategy is completed by the selection algorithm as the final element. It updates the annotation sequence $\mathcal{S}$ by selecting query-annotator pairs in each iteration cycle $t$. This selection is specified by choosing a subset of query-annotator pairs $\mathcal{S}(t) \subseteq \mathcal{Q}_{\mathcal{X}} \times \mathcal{A}$. To this end, it assesses potential query-annotator pairs through the query utility and annotator performance measure. If the set $\mathcal{S}(t)$ contains multiple queries, we face similar challenges as in batch mode AL, e.g., selecting diverse queries. We provide an overview of selection algorithms in Section VI.
AC and MC are modeled in AL literature by designing cost-sensitive variants of query utility measures, annotator performance measures, or selection algorithms. We provide an overview in Section III.

Fig. 4. Real-world AL cycle: (1) At the start of the iteration cycle $t$, the classification and annotator model, both trained on the current data set $\mathcal{D}(t)$, provide information regarding the query utility and the annotator performance measure of the real-world AL strategy. (2) Based on both measures, the real-world AL strategy's selection algorithm specifies a set of query-annotator pairs $\mathcal{S}(t) \subseteq \mathcal{Q}_{\mathcal{X}} \times \mathcal{A}$. Each pair $(q_l, a_m) \in \mathcal{S}(t)$ initiates an annotation of query $q_l$ by annotator $a_m$. (3) The annotations are inserted into the data set: $\mathcal{D}(t+1) = \mathcal{D}(t) \cup \{(q_l, a_m, z_{lm}^{(t)}) \mid (q_l, a_m) \in \mathcal{S}(t)\}$. (4) Then, the classification and annotator model are retrained on the updated data set ($\theta_{\mathcal{D}(t)} \to \theta_{\mathcal{D}(t+1)}$ and $\omega_{\mathcal{D}(t)} \to \omega_{\mathcal{D}(t+1)}$). (5) At the end of the iteration, the real-world AL strategy decides whether to stop or to continue learning.
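One iteration of the real-world cycle can be sketched as follows. This is only an illustration under stated assumptions: a single query-annotator pair is selected per iteration in a sequential manner (query first, annotator second), and the utility and performance measures are toy functions computed directly from the data set rather than from trained classification and annotator models. All names, including `real_world_al_step` and the fatigue penalty of 0.2, are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Triplet = Tuple[Hashable, str, object]


def real_world_al_step(
    queries: List[Hashable],
    annotators: List[str],
    data: List[Triplet],
    utility: Callable[[Hashable, List[Triplet]], float],
    performance: Callable[[Hashable, str, List[Triplet]], float],
    annotate: Callable[[Hashable, str, int], object],
    t: int,
) -> List[Triplet]:
    """Select one query-annotator pair, obtain its annotation, and return D(t+1)."""
    # (1) Both measures are evaluated on the current data set D(t).
    query = max(queries, key=lambda q: utility(q, data))
    # (2) Sequential selection: pick the annotator with the best estimate for this query.
    annotator = max(annotators, key=lambda a: performance(query, a, data))
    # (3) Insert the new triplet into the data set.
    return data + [(query, annotator, annotate(query, annotator, t))]


BASE_ACCURACY: Dict[str, float] = {"a1": 0.9, "a2": 0.8}  # hypothetical initial estimates


def toy_utility(query: Hashable, data: List[Triplet]) -> float:
    """Prefer queries with few annotations so far (re-annotation stays possible)."""
    return -float(sum(1 for q, _, _ in data if q == query))


def toy_performance(query: Hashable, annotator: str, data: List[Triplet]) -> float:
    """Hypothetical fatigue model: each past annotation lowers the estimate by 0.2."""
    return BASE_ACCURACY[annotator] - 0.2 * sum(1 for _, a, _ in data if a == annotator)


data: List[Triplet] = []
for t in range(1, 4):  # (5) assumed stopping criterion: three iterations
    # (4) "Retraining" is implicit here because both toy measures are recomputed from `data`.
    data = real_world_al_step(
        ["q1", "q2"], ["a1", "a2"], data, toy_utility, toy_performance,
        lambda q, a, step: "label", t,
    )
selected_pairs = [(q, a) for q, a, _ in data]
```

In the toy run, the second query goes to the nominally weaker annotator because the stronger one is assumed to be fatigued, and the third iteration re-annotates an already annotated query, so no strict split into labeled and unlabeled sets is needed.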

III. COST TYPES
MC and AC are the most crucial cost types in the real-world setting, and we will summarize typical schemes of them in this section. There exist several additional types of cost when solving a classification problem, e.g., cost of computation (e.g., renting a graphics processing unit) and cost of test (e.g., getting the results of a blood test). They are described as a taxonomy in [71]. At the end of this section, we present a literature overview of real-world AL strategies explicitly modeling MC and/or AC.

A. Misclassification Cost
Mistakes of the classification model induce MC (the first summand in Eq. 4). In the literature, we identified three cost schemes and describe them in increasing order of complexity in the following:
Uniform MC: Each classification error is charged at an equal cost. The classification model's performance is inversely proportional to the misclassification rate [72], i.e., the proportion of misclassified instances. This cost scheme is the simplest one and is assumed by traditional AL strategies.
Class-dependent MC: This cost scheme is probably the most common one in cost-sensitive classification [73,74]. The cost of a classification error is defined by means of a cost matrix/table $\mathbf{C} \in \mathbb{R}_{\geq 0}^{C \times C}$, where an entry $\mathbf{C}[y, y']$ in row $y$ and column $y'$ denotes the cost of predicting the class label $\hat{y}(\mathbf{x}_n \mid \theta_{\mathcal{D}(t)}) = y'$, when the instance $\mathbf{x}_n$ actually belongs to class $y_n = y$. Our grid classification example could use the mean absolute error on class numbers as a typical cost measure for ordinal classes [75]. It would be implemented through $\mathbf{C}[y, y'] = |y - y'|$. In some applications, the cost matrix is extended by adding an extra column representing cases where the classification model is too uncertain and rejects predicting a class label (known as reject option [76]).
Instance-dependent MC: Costs of classification errors depend on specific characteristics of instances. An example is fraud detection, where the amount of money involved in a particular case has an essential impact on MC [77]. For our grid classification example, it would be more expensive if many households were affected by overloading a low-voltage grid. Consequently, the feature # house connections in Eq. 1 is to play a central role when computing the cost of misclassifying a grid.
MC can be computed as the expectation regarding the true (but unknown) joint distribution $\Pr(X, Y)$ of instances and class labels [76]. For example, class-dependent MC is computed according to
$$\mathrm{MC}(\theta_{\mathcal{D}(t)} \mid \boldsymbol{\kappa}) = \mathbb{E}_{(\mathbf{x}, y) \sim \Pr(X, Y)}\left[\mathbf{C}[y, \hat{y}(\mathbf{x} \mid \theta_{\mathcal{D}(t)})]\right],$$
where the cost matrix $\mathbf{C}$ is encoded by $\boldsymbol{\kappa}$. In practice, the exact computation of MC is often infeasible due to the limited size of test data. Furthermore, in the real-world AL setting, its estimation based on a separate set of instances is challenging because of a sampling bias (arising from the active data acquisition) [78] and the lack of known ground truth class labels (arising from the error-proneness of the annotators). Nevertheless, some real-world AL strategies take imbalanced, i.e., class- or instance-dependent, MC into account.
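To illustrate the uniform and class-dependent schemes, the following sketch builds the ordinal cost matrix of the grid example (absolute difference of class numbers) and estimates MC as the average cost over a small evaluation set. It assumes an independently sampled evaluation set with trusted labels, which, as noted above, is often not available in the real-world AL setting. Class indices are 0-based here, and the labels and predictions are made-up values.

```python
from typing import List, Sequence


def ordinal_cost_matrix(num_classes: int) -> List[List[float]]:
    """Class-dependent cost matrix with cost[y][y_pred] = |y - y_pred|."""
    return [[float(abs(y - p)) for p in range(num_classes)] for y in range(num_classes)]


def uniform_cost_matrix(num_classes: int) -> List[List[float]]:
    """Uniform cost matrix: every error costs 1, correct predictions cost 0."""
    return [[0.0 if y == p else 1.0 for p in range(num_classes)] for y in range(num_classes)]


def empirical_mc(
    y_true: Sequence[int], y_pred: Sequence[int], cost: List[List[float]]
) -> float:
    """Average misclassification cost over an evaluation set (0-based class indices)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(cost[y][p] for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions for four grids with five ordinal classes.
y_true = [0, 2, 4, 1]
y_pred = [0, 3, 1, 1]
mc_ordinal = empirical_mc(y_true, y_pred, ordinal_cost_matrix(5))  # (0 + 1 + 3 + 0) / 4
mc_uniform = empirical_mc(y_true, y_pred, uniform_cost_matrix(5))  # misclassification rate
```

With the uniform matrix, both errors count equally, whereas the ordinal matrix penalizes confusing a very strong with a weak hosting capacity three times as much as an error between neighboring classes.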

B. Annotation Cost
AC (the second summand in Eq. 4) arises from the work effort of the annotators who have to invest time to decide on appropriate annotations for the posed queries. The exact specification of AC depends on the underlying cost scheme. In the literature, we identified four different schemes and describe them in increasing order of complexity in the following:
Uniform AC: The cost of obtaining an annotation is constant for each query and independent of the queried annotator. Correspondingly, the AC is proportional to the number of acquired annotations. This cost scheme is the simplest one and is frequently used. In particular, it is often employed in crowdsourcing environments, where the requester sets a constant pay rate per query. This means the qualification of an annotator and the time spent on annotating a query have no impact on the AC.
Annotator-dependent AC: In this cost scheme, the AC explicitly depends on the queried annotator. This setting is typical when annotators with different qualifications receive different earnings per query, e.g., annotators with different levels of expertise. Of course, there is typically no guarantee that expensive annotators provide more accurate annotations [14].
Query-dependent AC: Since there may be more or less difficult queries, the cost of annotating a query may depend on the query itself. For example, assessing the hosting capacity of a large and complex low-voltage grid may require more time than assessing a small and simple grid. Another example is the annotation of voice mails, where the duration of a voice mail is used as a proxy of the AC, e.g., 0.01 US dollar per second [50]. For the classification of documents, the number of words or characters in a document is often correlated to the AC [34]. Additionally, the query type affects the AC. For example, comparing two instances and deciding whether both belong to the same class is often easier than assigning an instance to one of many classes [79,80].
Query- and annotator-dependent AC: If the query-dependent and annotator-dependent cost schemes are combined, the AC varies across the pairs of query and annotator [81]. This cost scheme fits scenarios in which annotators are paid according to their individual hourly wages and the annotation time depends on the query [11].
The exact computation of AC depends on the underlying scheme. If we assume, as an example, annotator-dependent AC with ν = {ν_1, ..., ν_M} and ν_m > 0 representing the payment per query for annotator a_m, we would obtain
AC^{(t)} = \sum_{m=1}^{M} \nu_m \cdot N_m^{(t)} \quad \text{with} \quad N_m^{(t)} = \sum_{t' < t} \sum_{(q, a) \in S(t')} \delta(a \doteq a_m),
where ≐ denotes a Boolean comparison and the indicator function δ ∶ {false, true} → {0, 1} returns one if the argument is true and zero otherwise. Correspondingly, N_m^{(t)} ∈ N is the number of annotations provided by annotator a_m until the start of step t. In certain scenarios, such an exact specification of the AC is infeasible. This is the case when the annotation time is the major cost factor and is not known in advance. Therefore, an AL strategy is required to estimate the AC before querying an annotator.
Table I gives a literature overview of real-world AL strategies explicitly modeling imbalanced MC or AC. The first part of this table lists strategies being MC-sensitive, i.e., class-dependent or instance-dependent. The second part summarizes strategies taking imbalanced AC into account, i.e., annotator- and/or query-dependent. Each strategy is categorized according to its cost scheme, the type of classification problem (binary vs. multi-class), and its predefined or estimated required cost information (cost matrix, annotation time, etc.). Additionally, we provide a brief description of each strategy's main idea. A more in-depth analysis of them is provided in the appendices of this survey.
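The annotator-dependent AC described above amounts to counting the annotations per annotator and weighting each count by the respective payment. The following sketch illustrates this bookkeeping; the function name, the log format, and the payment values are assumptions made for illustration only.

```python
def annotator_dependent_cost(selected_pairs, payments):
    """Accumulate annotator-dependent AC from a log of query-annotator pairs.

    selected_pairs: iterable of (query_id, m) pairs acquired before step t,
                    where m is the index of the queried annotator a_m.
    payments:       list of payments per query, payments[m] > 0 for annotator a_m.
    Returns the total AC and the number of annotations per annotator.
    """
    counts = [0] * len(payments)
    for _query_id, m in selected_pairs:
        counts[m] += 1  # the indicator contributes one for the matching annotator
    total = sum(nu * n for nu, n in zip(payments, counts))
    return total, counts


# Hypothetical log: annotator 0 answered two queries, annotator 1 answered one.
log = [("q1", 0), ("q2", 1), ("q3", 0)]
payments = [1.0, 3.0]  # assumed payment per query for each annotator
total_ac, n_per_annotator = annotator_dependent_cost(log, payments)
```

Setting all payments to the same value recovers the uniform scheme, in which the AC is proportional to the number of acquired annotations.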

IV. INTERACTION SCHEMES
Interaction with human annotators forms an essential part of AL. In this survey, we focus on the query-annotation-based interaction that is typical of AL. For this purpose, we provide an overview of different query types and annotation types based on the literature. At the end of this section, we present a literature overview of existing real-world AL strategies using different combinations of queries and annotations as interaction schemes.

A. Query Types
The set of possible queries Q_X specifies how a real-world AL strategy can interact with the available annotators A. Depending on the underlying classification problem, there are different possibilities to design these queries. In the literature, we identified the following three most common query types:
Instance queries ask for information on a specific instance x_n, as illustrated in Fig. 5(a), and are the most common query type. Next to class labels, a query may request additional information. Concrete examples are presented in [44,101], where annotators are asked for confidence scores interpreted as proxies of an instance's class membership probabilities.
Region queries do not query information regarding a specific instance, but ask annotators to provide information about an entire region in the feature space [102]. For this purpose, the query is to be formulated in an appropriate and human-readable representation [103]. A common way to achieve this requirement involves formulating premises of sharp or possibilistic classification rules by defining conditions on the value ranges of features [104]. An example of such a region query is depicted in Fig. 5(b). Although a region query provides class information about many instances, this type of query differs from batch mode AL, where each instance of a selected batch is annotated individually [23].
Comparison queries enhance the learning process by obtaining relative information between instances [47]. For example, the comparison query, illustrated in Fig. 5(c), compares two instances x n and x m by requesting whether they belong to the same class or not [46,105]. Regarding the ordinal grid classification example, another conceivable comparison query may ask which of the two grid instances x n and x m has a superior hosting capacity.
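To make the distinction between the three query types concrete, the following sketch represents each as a small data structure. All class and field names are hypothetical; in particular, the region query is modeled as a conjunction of value ranges on features, which is only one of several conceivable representations of a rule premise.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class InstanceQuery:
    """Asks for information on a single instance x_n."""
    n: int  # index of the instance in the pool


@dataclass
class RegionQuery:
    """Asks for information on a region given by value ranges of features."""
    bounds: Dict[int, Tuple[float, float]]  # feature index d -> (low, high)

    def covers(self, x: Sequence[float]) -> bool:
        # An instance lies in the region if every constrained feature is in range.
        return all(low <= x[d] <= high for d, (low, high) in self.bounds.items())


@dataclass
class ComparisonQuery:
    """Asks for relative information on two instances x_n and x_m."""
    n: int
    m: int


# Hypothetical two-dimensional pool and a region query on both features.
pool = [[0.5, 2.0], [1.5, 2.5], [3.0, 0.1]]
region = RegionQuery(bounds={0: (1.0, 2.0), 1: (2.0, 3.0)})
covered = [n for n, x in enumerate(pool) if region.covers(x)]
```

A single answer to the region query informs about all covered instances at once, whereas batch mode AL would still issue one instance query per selected instance.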
Going beyond these three query types, we will present our own proposals for query types as future research directions in Section VII.

B. Annotation Types
Usually, the type of an annotation depends on the query itself. In this survey, we differentiate between the following three annotation types:
Distinct annotations are the simplest form of annotations. They represent categorical information that leaves no room for interpretation. Most AL strategies use them to encode class labels. Other AL strategies expect a simple yes or no as a distinct annotation [46,47]. Furthermore, they can encode a sorting of instances in case of a comparison query [106].
Soft annotations allow for the representation of continuous information. They are often inaccurate and subjective. Many AL strategies use them to obtain information on the confidence of a provided class label by requesting a numerical value in a continuous confidence interval [43,101]. Another example is the use of probabilistic labels as gradual annotations [44,107], which enhanced the classification performance for certain tasks, e.g., in the medical domain [108].
Explanatory annotations are the most informative type of annotations. Instead of only communicating a distinct or soft decision, an explanatory annotation also explains why a certain decision has been made. An exemplary explanation would be: "The instance x_n does not belong to the positive class because its feature value x_nd is too low." [109].

TABLE I: Literature overview of real-world AL strategies explicitly modeling imbalanced MC or AC.

Misclassification Cost (MC)

| Strategy | Cost Scheme | Classification Problem | Cost Information | Main Idea |
|---|---|---|---|---|
| Margineantu [82], Joshi et al. [79,80] | class-dependent MC | multi-class | cost matrix (predefined) | These strategies compute the expected MC on the annotated set. For this purpose, they simulate the annotation of an instance and its addition to the classification model's training set. We can interpret this approach as a cost-sensitive variant of EER. |
| Liu et al. [83] | class-dependent MC | multi-class | cost matrix (predefined) | This strategy extends traditional US by making use of self-supervised training. For this purpose, a cost-sensitive classification model is trained on instances annotated by annotators and instances annotated by a cost-insensitive classification model. |
| Chen and Lin [84] | class-dependent MC | multi-class | cost matrix (predefined) | The first variant of this strategy computes the maximum expected MC of an instance. In contrast, the second variant computes the cost-weighted minimum margin between the two predictions with the lowest estimated MCs. |
| Krempl et al. [85] | class-dependent MC | binary | cost ratio of false negative vs. false positive (predefined) | This strategy computes the density-weighted expected MC reduction in an instance's neighborhood within the feature space. For this purpose, it simulates the annotation of an instance and its addition to the classification model's training set. |
| Käding et al. [86] | class-dependent MC | multi-class | cost function (predefined) | This strategy computes the classification model's expected change by simulating the annotation of an instance. |
| Nguyen et al. [87] | class-dependent MC | binary | cost matrix (predefined) | This strategy computes the expected MC reduction when obtaining an annotation from an error-prone annotator and from an infallible expert. For this purpose, it employs a cost-sensitive variant of EER. |
| Huang and Lin [88] | class-dependent MC | multi-class | cost matrix (predefined) | This strategy is based on a cost embedding approach, which transfers the MC information into a distance measure of a latent space. Utilities are defined as expected MCs that are represented through distances in the latent space. |
| Min et al. [89] | class-dependent MC | multi-class | cost matrix (predefined) | This strategy queries only annotations for instances with MCs being higher than their respective ACs. The MCs are estimated through a cost-sensitive k-nearest neighbor model. |
| Wu et al. [90], Wang et al. [91] | class-dependent MC | binary | cost matrix (predefined) | These strategies employ a density-based clustering technique to construct a master tree of instances. In an iterative process, this master tree is subdivided into blocks, and for each block, an estimated MC-optimal number of instances is annotated. |
| Krishnamurthy et al. [92,93] | instance-dependent MC | multi-class | cost of predicting a class label for an instance (estimated) | These strategies query MC information per instance and class from an annotator. Their idea is to query the actual MC information for the class label, for which an instance has the largest estimated MC range. |

Annotation Cost (AC)

| Strategy | Cost Scheme | Classification Problem | Cost Information | Main Idea |
|---|---|---|---|---|
| Zheng et al. [14], Chakraborty [94] | annotator-dependent AC | multi-class | cost of querying an annotator (predefined) | These strategies solve an optimization problem to specify a subset of annotators with low ACs and high performances. |
| Moon and Carbonell [95], Huang et al. [96] | annotator-dependent AC | multi-class | cost of querying an annotator (predefined) | These strategies compute the annotator performance per AC unit to prefer annotators with high performances and low ACs. |
| Nguyen et al. [87] | annotator-dependent AC | multi-class | cost ratio of querying crowd worker vs. expert (predefined) | This strategy normalizes the utility of querying an expert or crowd worker by their respective ACs to find a trade-off between expensive but correct expert annotations and cheap but error-prone crowd worker annotations. |
| Margineantu [82], Donmez and Carbonell [38,39] | query-dependent AC | multi-class | cost of annotating a query (predefined) | These strategies subtract the query's individual AC from its utility to prefer highly useful queries with low ACs. These ACs are assumed to be known in advance for each query or to follow a predefined model. |
| Joshi et al. [79,80] | query-dependent AC | multi-class | number of comparisons per query (estimated) | These strategies use multiple comparison queries to reveal the class label of a non-annotated instance. The expected number of comparisons required to reveal an instance's class label is used as an AC proxy and subtracted from an instance's utility. |
| Tsou and Lin [97] | query-dependent AC | multi-class | cost of annotating a query (predefined) | This strategy builds a decision tree throughout the AL process. It computes the average AC of the already annotated instances in each leaf of this tree. These AC estimates are used to normalize the query utilities of the instances in the respective leaves. |
| Settles et al. [81], Haertel et al. [98], Tomanek and Hahn [99], Wallace et al. [100] | query-dependent AC | multi-class | annotation time per query (estimated) | These strategies estimate the annotation time per query. For this purpose, they either use historical data in the form of logged annotation times or employ prior knowledge regarding a domain, e.g., the number of words when annotating text. The estimated annotation times are considered by normalizing query utility or employing a linear rank combination of utilities and annotation times. |
| Wallace et al. [11] | annotator-, query-dependent AC | multi-class | annotation time per query (estimated) + cost per time unit for each annotator (predefined) | This strategy computes ACs by multiplying the annotation times (estimated through the number of words in a document) with the respective salaries (predefined) of the annotators. Annotators with low estimated ACs are more often queried. |
| Arora et al. [34] | annotator-, query-dependent AC | multi-class | annotation time per query-annotator pair (estimated) | This strategy estimates the annotation time as a function of the annotator and the query. For this purpose, it uses features to describe the query and the annotator in combination with historical data in the form of logged annotation times. |

C. Literature Overview
A query mostly requests information of a specific kind. Accordingly, the query and annotation types are closely coupled. Table II gives a literature overview of existing combinations of queries and annotations as interaction schemes. The query "To which class does instance x n belong?" known already from the traditional AL setting is excluded. A more in-depth analysis of the real-world AL strategies in Table II with a focus on their query utility measures is provided in the appendices of this survey.

V. ANNOTATOR PERFORMANCE MODELS
The error-proneness of annotators poses a major challenge in real-world AL [36]. In this section, we discuss the typical factors influencing the performance of error-prone annotators. Moreover, we identify three different types of annotator performance. At the end of this section, we present a literature overview of existing annotator performance models.

A. Influence Factors
We refer to "annotator performance" as a general term for the quality of the annotations obtained from an annotator. There is no clear definition of this term, but there exist several concrete interpretations, e.g., label accuracy [123], confidence [101], uncertainty [70], reliability [124], etc. Such an interpretation is closely coupled to the annotation type and the expected optimal annotation of a query.
The annotator performance may be affected by various factors [125,126], and the most prominent ones identified in the AL literature are given in the following:
The domain knowledge of annotators has an essential impact on their performances [127]. Insufficient knowledge leads to a deterioration of the annotator performance. In complex tasks, such as assessing the hosting capacity of a low-voltage grid, a certain level of domain knowledge is indispensable.
The query difficulty affects the probability of obtaining an optimal annotation [12,128,129]. For example, in the recognition of hand-written digits, it is often more challenging to differentiate between the digits 1 and 7 than to discriminate between the digits 1 and 8 [44]. Next to the subject of a query, its type can also be crucial for the performance of an annotator [80].
The ability of annotators to assess themselves reliably plays a central role, particularly in scenarios where queries ask for confidence scores as annotations [101]. Although empirical studies [11,130] have shown that annotators can reliably estimate their performances in some domains, the Dunning-Kruger effect [131] states that unskilled annotators, in particular, not only provide erroneous annotations but also fail to recognize their mistakes. This effect has also been confirmed in a large-scale crowdsourcing study [132].
Motivation or level of interest of an annotator may influence the thoroughness of the annotation process. For example, in a crowdsourcing study analyzed in [127], more interested annotators performed better.
The payment of an annotator may have a significant impact on the annotator performance, such that well-paid annotators provide higher-quality annotations. In a crowdsourcing environment, an improvement of the annotation quality has been observed when increasing the pay from $0.10 to $0.25 per query [127].
The annotator has to concentrate when annotating a query [133]. Otherwise, annotation mistakes arise because of inattention or tiredness.
A constant stream of queries of the same type may be annoying for the annotator [134]. Therefore, the way of interaction between the AL strategy and an annotator may influence the annotation results and needs to be designed appropriately. For example, different interaction schemes can lead to different degrees of an annotator's enjoyment, as experimentally shown in [135].
The learning aptitudes of annotators are also crucial for their performances. For example, one could teach the annotators to provide high-quality annotations [125].
The collaboration between annotators is also interlinked with their performances. Incorporating corresponding mechanisms for collaboration can strongly improve the annotation quality [136].

B. Annotator Performance Types
Modeling and quantifying the influence of each of the previously listed factors on annotator performance is infeasible. Instead, existing annotator models abstract from these factors to estimate annotator performance. In the literature, we identified three different types of annotator performances. For this purpose, we generalize the class label noise taxonomy, presented by Frénay and Verleysen [137], to the setting of real-world AL by including queries and annotations instead of instances and classes. The resulting statistical taxonomy of annotator performance types is presented in Fig. 6, and we provide more details in the following:
Uniform annotator performance: The annotator performance depends only on the characteristics of the annotator. As a result, neither the query itself nor the query's optimal annotation has an influence. An example is given in Fig. 7, where an annotator has the constant probability of 90% to recognize a hand-written digit correctly.
Annotation-dependent annotator performance: The annotator performance depends next to the annotator's characteristics on the optimal annotation for a query. An example is given in Fig. 7, where an annotator is better at identifying the digit 1 (constant correctness probability of 90%) than the digit 7 (constant correctness probability of 70%) in images of hand-written digits.
Query-dependent annotator performance: The annotator performance depends on the annotator's characteristics, the query, and the optimal annotation. An example is given in Fig. 7, where an annotator has a low probability to correctly identify the third digit as 7 because it can be misinterpreted as the digit 2.

Fig. 5. (a) An instance query is illustrated by marking the selected instance x_n* for which the class membership probability for the blue class is expected as an annotation. (b) A region query is depicted by the gray rectangle. Again, the class membership probability for the blue class represents the annotation. (c) A comparison query requests whether the instances x_n* and x_m* belong to the same class (solid gray line). A solid black line connects instances belonging to the same class, and a dashed black line indicates that instances are assigned to different classes.
As an additional dimension, possible temporal dependencies regarding annotator performance can be taken into account. Accordingly, we distinguish between persistent and time-varying annotator performance. In the first case, the annotator performance is constant during the entire annotation process. In the latter case, the annotator performance may increase due to the learning progress of an annotator [70,81] or may decrease because of exhaustion or emerging boredom [135].
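The three performance types and the optional temporal dependency can be read as functions returning the probability of obtaining the optimal annotation. The sketch below reuses the 90% and 70% values of the digit example. The query feature "legibility", the linear fatigue model, and all function names are assumptions introduced solely for illustration.

```python
import random

# Correctness probabilities per optimal annotation, taken from the digit example.
PER_CLASS = {1: 0.9, 7: 0.7}


def uniform_performance(query, optimal, t=0):
    """Uniform: depends only on the annotator, not on query or optimal annotation."""
    return 0.9


def annotation_dependent_performance(query, optimal, t=0):
    """Annotation-dependent: depends on the optimal annotation of the query."""
    return PER_CLASS[optimal]


def query_dependent_performance(query, optimal, t=0):
    """Query-dependent: additionally depends on the query itself.

    'legibility' in [0, 1] is a hypothetical query feature.
    """
    return PER_CLASS[optimal] * query["legibility"]


def time_varying(performance, decay=0.01, floor=0.5):
    """Wrap a persistent performance so that it decreases with step t (assumed fatigue)."""
    def wrapped(query, optimal, t=0):
        return performance(query, optimal, t) * max(floor, 1.0 - decay * t)
    return wrapped


def simulate_annotation(performance, query, optimal, labels, rng, t=0):
    """Return the optimal annotation with the modeled probability, else another label."""
    if rng.random() < performance(query, optimal, t):
        return optimal
    return rng.choice([z for z in labels if z != optimal])


clear_seven = {"legibility": 1.0}
sloppy_seven = {"legibility": 0.5}
p_clear = query_dependent_performance(clear_seven, 7)
p_sloppy = query_dependent_performance(sloppy_seven, 7)
p_tired = time_varying(uniform_performance)(clear_seven, 1, t=20)
answer = simulate_annotation(query_dependent_performance, sloppy_seven, 7,
                             labels=[1, 2, 7], rng=random.Random(0))
```

Each function is a special case of the next one, which mirrors the increasing generality of the taxonomy.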

C. Literature Overview
During the AL process, the performances of the annotators are estimated by annotator models. Table III provides a literature overview, including a categorization of those models. Next to the assumptions regarding the type of annotator performance, we use several other factors to categorize different annotator models. In particular, the query and annotation types described in Section IV are essential properties of an annotator model. However, to the best of our knowledge, existing annotator models focus on instance queries such that no column for the query type is present in Table III. As a further category, we differentiate between the assumed relation of the annotators. In the case of multiple annotators, they are either independent or collaborative. If a model can work with a single annotator, the term single is denoted for this category. Furthermore, we indicate in Table III whether an annotator model allows for the integration of prior knowledge regarding the performances of annotators. Additionally, we provide a brief description of each annotator model's main idea. A more in-depth analysis of these annotator models is provided in the appendices of this survey.
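As a simplified illustration of how an annotator model can estimate performances without ground truth, the sketch below takes the majority vote per instance as a stand-in for the true class label and reports each annotator's agreement with it. This is not the exact procedure of any model listed in Table III; the data layout and function names are assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def estimate_accuracies(annotations):
    """Estimate one (uniform) performance value per annotator.

    annotations: dict mapping instance id -> dict mapping annotator id -> label.
    The majority vote per instance serves as the estimated true class label.
    """
    correct, total = Counter(), Counter()
    for votes in annotations.values():
        consensus = majority_vote(list(votes.values()))
        for annotator, label in votes.items():
            total[annotator] += 1
            correct[annotator] += int(label == consensus)
    return {annotator: correct[annotator] / total[annotator] for annotator in total}


# Hypothetical annotations of three digit images by three annotators.
annotations = {
    0: {"a1": 1, "a2": 1, "a3": 7},
    1: {"a1": 7, "a2": 7, "a3": 7},
    2: {"a1": 1, "a2": 7, "a3": 7},
}
accuracies = estimate_accuracies(annotations)
```

Grouping the counts by the consensus label instead of pooling them would yield an annotation-dependent estimate, i.e., one value per class-annotator pair.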

VI. SELECTION ALGORITHMS
The selection of query-annotator pairs is based on a selection algorithm. It uses the query utility measure φ and the annotator performance measure ψ as a basis to specify S(t) ⊆ Q_X × A as the set of query-annotator pairs in each AL iteration cycle t ∈ N. In this context, we differentiate between two types of selection algorithms, explained in the following. At the end of this section, we present a literature overview of existing selection algorithms.
Fig. 6. Statistical models of annotator performance types: Following the idea of Frénay and Verleysen [137], we present three different annotator performance types as graphical models. There are four random variables depicted as nodes: Q is the query, Z is the optimal annotation, Z_m is the annotation provided by annotator a_m, and P_m is the variable indicating the performance of the annotator a_m. In the simplest case, P_m is a binary variable to represent whether an annotator provides the optimal annotation (P_m = 1) or not (P_m = 0). We denote observed variables by shading the corresponding nodes, whereas the other nodes represent latent variables. The variable t is a deterministic parameter denoting the time. Arrows represent statistical dependencies, e.g., the optimal annotation always depends on the underlying query. The dashed arrow between the annotator performance variable P_m and the time t indicates an optional dependency. If this dependency is considered, the annotator performance is time-varying [48]. Otherwise, it is assumed to be persistent.

[Figure 7: three images of hand-written digits for the query "Which digit is shown in the image?", each shown with its optimal annotation and the annotator performance under the uniform, annotation-dependent, and query-dependent type.]
Fig. 7. Illustration of annotator performance types: There are three images of hand-written digits. Assuming a uniform annotator performance, an annotator has an equal chance of correct digit recognition for each of the three images. In the case of annotation-dependent performance values, the chance of recognizing a digit correctly depends on its true class as optimal annotation, e.g., the annotator is better at recognizing digit 1 than digit 7. The assumption of query-dependent performance values is more general and realistic. For example, the annotator has a low chance of recognizing the right digit due to its unclear writing. In the case of time-varying annotator performances, the chance of correct digit recognition can change over time, e.g., the chance may increase due to the learning progress of an annotator [70,81] or may decrease because of exhaustion or emerging boredom [135].
| Strategy | Query | Annotation |
|---|---|---|
| … | … | soft annotations: … where Ω_C denotes the set of possible confidence scores |
| Song et al. [43] | How confident are you that instance x_n ∈ X belongs to the positive class? | soft annotations: Ω_Z = [−1, 1] with z ∈ Ω_Z indicating the confidence that x_n belongs to the positive class |
| Biswas and Parikh [109] | Does instance x_n ∈ X belong to class y ∈ Ω_Y? If this is not the case, can you explain the reason? | explanatory annotations: … representing the set of explanations |
| Teso and Kersting [114] | Does instance x_n ∈ X belong to class y ∈ Ω_Y because of explanation e ∈ Ω_E? | explanatory annotations: … |

Region Queries

| Strategy | Query | Annotation |
|---|---|---|
| Druck et al. [115], Settles [116] | For which classes is a positive feature value X_d > 0 highly indicative? | distinct annotations: … indicating the set of possible classes |
| Du and Ling [102] | What is the proportion of positive instances in the region described by the constellation of categorical features, e.g., … | … |
| … | What is the proportion of positive instances in the region described by the feature constellation, e.g., … | … |

Comparison Queries

| Strategy | Query | Annotation |
|---|---|---|
| Fu et al. [46,105], Joshi et al. [79,80] | Do instance x_n ∈ X and instance x_m ∈ X belong to the same class? | distinct annotations: … |
| Xiong et al. [120] | Is instance x_n ∈ X more similar to instance x_m ∈ X than instance x_o ∈ X? | distinct annotations: Ω_Z = {yes, no, uncertain} |
| Kane et al. [47], Xu et al. [121], Hopkins et al. [122] | Is instance x_n ∈ X more likely to belong to the positive class than instance x_m ∈ X? | distinct annotations: … |
| Qian et al. [106] | What is the decreasing order of the instances {x_n, x_m, x_o} ⊂ X regarding their similarities to instance x_p ∈ X? | distinct annotations: Ω_Z consists of all possible orderings of the available instances, e.g., (x_m, x_o, x_n) ∈ Ω_Z |

A. Sequential Selection of Queries and Annotators
Sequential selection of queries and annotators is made in two steps. In the first step, one or multiple (in the case of batch mode AL) queries with the highest utilities are selected. In a second step, corresponding annotators are selected and assigned to the respective queries, e.g., a predefined number of the annotators with the highest estimated performances per query [139]. Ideally, the selected annotators lead to low AC while providing high accuracy annotations. The main motivation for a sequential selection is to emphasize useful queries by selecting them in advance of the annotators. Moreover, the issue of annotator selection reduces to determining a ranking of the annotators regarding a selected query. As a result, not the exact but only the relative differences between the performances of the annotators are crucial for the annotator selection.
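The two-step procedure can be sketched as follows. For contrast with the joint selection discussed in the next subsection, the sketch also includes a variant that ranks query-annotator pairs by the product of utility and performance. The dictionaries standing in for φ and ψ, the function names, and all numerical values are illustrative assumptions.

```python
def sequential_selection(utility, performance, n_queries=1, n_annotators=1):
    """Select queries by utility first, then rank the annotators per selected query.

    utility:     dict mapping query -> utility value (stand-in for phi)
    performance: dict mapping (query, annotator) -> estimated performance (psi)
    """
    queries = sorted(utility, key=utility.get, reverse=True)[:n_queries]
    annotators = sorted({a for (_, a) in performance})
    selected = []
    for q in queries:
        # Only the ranking of the annotators for this query matters here.
        ranked = sorted(annotators, key=lambda a: performance[(q, a)], reverse=True)
        selected.extend((q, a) for a in ranked[:n_annotators])
    return selected


def joint_selection(utility, performance, n_pairs=1):
    """Rank all query-annotator pairs by the product of utility and performance."""
    ranked = sorted(performance,
                    key=lambda qa: utility[qa[0]] * performance[qa],
                    reverse=True)
    return ranked[:n_pairs]


# Hypothetical values: q1 is slightly more useful, but no annotator handles it well.
utility = {"q1": 0.9, "q2": 0.8}
performance = {("q1", "a1"): 0.50, ("q1", "a2"): 0.55,
               ("q2", "a1"): 0.95, ("q2", "a2"): 0.60}
sequential = sequential_selection(utility, performance)
joint = joint_selection(utility, performance)
```

In this toy setting, the sequential variant commits to q1 before looking at the annotators, whereas the joint variant prefers q2 because annotator a1 is far more reliable on it. The joint variant needs a performance estimate for every pair, while the sequential one only needs estimates for the selected queries.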

B. Joint Selection of Queries and Annotators
Selecting queries without considering the annotators' performances can result in low-quality annotations because there is no guarantee that at least one annotator has a sufficient performance regarding a selected query [45]. This problem can be resolved by applying a selection algorithm jointly selecting queries and annotators. For this purpose, the query utility and the annotator performance measure are to be combined appropriately, e.g., by taking their product [96]. Compared to the sequential selection of queries and annotators, the joint selection comes with higher computational complexity. Instead of computing the annotator performance estimates only for the selected queries, the annotator performance estimates are required for each possible query. Moreover, exact estimates regarding the annotator performance are more crucial since the annotator performance estimates are directly integrated into the selection criterion. If these estimates are unreliable, not only the annotator selection will be negatively affected but also the combination with the query selection.

C. Literature Overview
Table IV provides an overview of selection algorithms employed by existing real-world AL strategies, which select

TABLE III: Literature overview of annotator performance models.

Uniform Annotator Performance

| Annotator Model | Annotation Type | Temporal Dependency | Annotator Relation | Prior Knowledge | Main Idea |
|---|---|---|---|---|---|
| Zheng et al. [14] | distinct: class labels | persistent | independent | yes | This annotator model estimates the true class label of an instance by means of majority voting. Following the interval estimation method [138], the majority votes are then used to evaluate the upper bound of the fraction of correctly annotated instances as performance estimate for each annotator. |
| Donmez et al. [48] | distinct: class labels | time-varying | independent | no | This annotator model represents the quality of each annotator as a time-varying latent state sequence. For this purpose, it assumes that the change in the annotation quality from one state to the next follows a Gaussian distribution with a zero mean and a known variance, which is shared among all annotators. |
| Long et al. [139,140], Long and Hua [141] | distinct: binary class labels | persistent | independent | no | This annotator model, based on a probabilistic model with Gaussian processes [142], estimates a single performance value per annotator by comparing the provided annotations to the estimated true annotations. The performance value of an annotator indicates the probability that this annotator assigns the correct class label to an instance. |

Annotation-dependent Annotator Performance

| Annotator Model | Annotation Type | Temporal Dependency | Annotator Relation | Prior Knowledge | Main Idea |
|---|---|---|---|---|---|
| Wu et al. [143] | distinct: binary class labels | persistent | independent | yes | This annotator model, based on the logistic regression model proposed by Raykar et al. [144], estimates the performance of an annotator in dependence of an instance's (unknown) true class label. Using the maximum a posteriori criterion, the model's training follows the expectation-maximization algorithm [145], which iteratively estimates the true class labels (expectation-step) to evaluate the probability of a correct annotation for each class-annotator pair (maximization-step). |
| Rodrigues et al. [146] | distinct: binary class labels | persistent | independent | no | This annotator model, based on a Gaussian process [142] framework and expectation propagation [147], estimates the class-dependent specificity and sensitivity of each annotator by comparing the provided annotations to the estimated true annotations. |
| Moon and Carbonell [95] | distinct: class labels | persistent | independent | no | This annotator model expects an initial set of instances annotated by each annotator. The true class labels of these instances are estimated through majority voting. Subsequently, the performance of an annotator is computed as the annotation accuracy, i.e., estimated fraction of correct annotations, per class. |
| Nguyen et al. [87] | distinct: class labels | persistent | independent | yes | This annotator model distinguishes between infallible experts and error-prone crowd workers. The performance of the latter ones is estimated by comparing their provided class labels with the expert class labels. For this purpose, the model computes a confusion matrix including a Bayesian prior for the group of crowd workers. |

Query-dependent Annotator Performance

| Annotator Model | Annotation Type | Temporal Dependency | Annotator Relation | Prior Knowledge | Main Idea |
|---|---|---|---|---|---|
| Wallace et al. [11] | distinct: binary class labels and uncertain | persistent | independent | yes | This annotator model relies on domain information in the form of annotators' pay grades. It assumes that the pay grades of the annotators are highly correlated with their annotator performances. |
| Donmez and Carbonell [38,39] | soft: class labels and confidence scores | persistent | independent | no | This annotator model uses the annotators' confidence scores as proxies of their annotator performances. Using k-means clustering [148], an annotator is queried to annotate the k ∈ N instances closest to the respective k cluster centroids. It is assumed that instances belonging to a cluster, whose centroid has a high-confidence annotation, will be accurately annotated by the corresponding annotator. |
| Du and Ling [10] | distinct: binary class labels | persistent | single | no | Since the classification model is trained under a single annotator's supervision, this annotator model assumes that the classification model behaves similarly to the annotator. As a result, the annotator performance estimates near the classification model's decision boundary are lower than in regions where the classification model is certain. |
| Yan et al. [68,149] | distinct: binary class labels | persistent | independent | no | This annotator model, based on a logistic regression model proposed in [150], estimates the performance of an annotator in dependence of an instance and its true class label. Using the maximum likelihood criterion, the model's training follows the expectation-maximization algorithm [145], which iteratively estimates the true class labels (expectation-step) and uses them to evaluate the annotator performance for each instance-annotator pair (maximization-step). |
| Ni and Ling [101] | soft: binary class labels and confidence scores | persistent | single/independent | no | This annotator model uses the confidence scores provided by the annotators as proxies of their annotator performances. For non-annotated instances, these scores are estimated using the (inverse-)distance-weighted k-nearest-neighbor rule [151]. Accordingly, an annotator's performance for an instance is defined as the weighted mean confidence score of its k nearest neighbors being already annotated by this annotator. |
| Fang et al. [70] | distinct: binary class labels | time-varying | collaborative | no | This annotator model interprets the performance as uncertainty of an annotator regarding high-level concepts, e.g., sports, politics, and culture in case of document classification. These concepts are latent variables and modeled through a Gaussian mixture model [152]. An instance may belong to multiple concepts. Using the maximum likelihood criterion, the model's training follows the expectation-maximization algorithm [145], which iteratively estimates the true class labels (expectation-step) and takes them as basis for evaluating an annotator's uncertainty in annotating an instance (maximization-step). |
| Fang and Zhu [113] | distinct: binary class labels and uncertain | persistent | single/independent | no | This annotator model expects the annotator to provide uncertain as annotation if the annotator does not know an instance's true class label. Using this information, the model characterizes the performance of the annotator by training a classifier to estimate the probability that an instance will not belong to the annotator's uncertain knowledge set. |
| Fang et al. [153,154] | distinct: binary class labels | persistent | independent | no | This annotator model assumes that the performance of an annotator depends on a high-level representation of an instance's features and the instance's true class label. This dependency is indirectly modeled by introducing a latent variable for the expertise of each annotator. The expertise of an annotator is then computed as a weighted linear combination of the instance's high-level features. |
| Zhao et al. [12] | distinct: binary class labels | persistent | independent | no | This annotator model estimates the annotator performance through two latent variables, namely, the query difficulty and the query-independent expertise of an annotator. For example, for annotators with high expertise or for easy queries, the probability of providing the true class label is high. The query difficulty and annotator expertise are latent and therefore iteratively estimated through the expectation-maximization algorithm [145]. |
Zhong et al. [45], Käding et al. [86] distinct: class labels and uncertain persistent single/independent no These annotator model allow an annotator to provide uncertain as annotation in case of a lack of knowledge regarding an instance's class membership, otherwise she/he provides a class label. The instances annotated with class labels (positive class) and the ones annotated with uncertain (negative class) form a binary classification problem. They are used to train an annotator model, i.e., a support vector machine [53] in [45] and Gaussian processes [142] in [86]. It predicts whether an annotator has sufficient knowledge to annotate an instance (positive class) or not (negative class).
Huang et al. [96] distinct: class labels persistent independent no This annotator model expects an initial set of instances with true class labels and annotations of each annotator. The model assumes that an annotator has a similar performance on similar instances. Therefore, it estimates an annotator's performance for an instance by computing the annotation accuracy regarding the instance's nearest neighbors in the initial set.
Yang et al. [155] distinct: class labels persistent independent no This annotator model learns a low-dimensional embedding for each annotator to capture the annotator's expertise regarding latent topics. Additionally, an embedding for each instance is learned as representation by the latent topics. Both embeddings are combined to estimate the performance of an annotator. Since these embeddings are latent variables, they are learned through the expectation-maximization algorithm [145].
Chakraborty [94] distinct: class labels persistent independent no This annotator model expects an initial set of instances with true class labels and annotations of each annotator. Since the true class labels are known in this set, the mistakes of each annotator can be determined on this set. A binary logistic regression classifier is then trained for each annotator separately. The trained logistic regression model of an annotator estimates her/his performance as the probability of obtaining a correct annotation for a certain instance.
Herde et al. [69] | distinct: class labels | persistent | independent | yes | This annotator model estimates the performance of an annotator for a certain instance in the form of a Beta distribution. This distribution is parameterized by the number of estimated false and true annotations in the local neighborhood of an instance. A false or true annotation of an annotator is identified by comparing the annotations of a single annotator to the predictions of a classifier trained with the annotations of the other annotators.

query-annotator pairs. Next to the differentiation between a sequential and joint selection of queries and annotators, the number of selected queries and annotators per learning cycle is of interest. Selecting only a single query-annotator pair is often easier than selecting a batch of query-annotator pairs. In the latter case, the selection algorithm must ensure that queries are diverse. Otherwise, redundant information is queried. Moreover, multiple annotators are to be distributed across queries. To differentiate between both settings, we denote either single or batch for the query and annotator selection categories in Table IV. A few selection algorithms consider criteria beyond annotator performance and query utility, e.g., a collaboration between annotators. We denote these criteria accordingly in Table IV. Additionally, we provide a brief description of each selection algorithm's main idea. A more in-depth analysis of them is provided in the appendices of this survey.

[153,154], Zhong et al. [45] | single | single | none | These strategies select the query with the highest estimated utility. Subsequently, they select the annotators with the highest estimated performances regarding the annotation of this query.
Wallace et al. [11] | single | single | workload of annotators | This strategy selects either a non-annotated query with the highest estimated utility or a query for re-annotation. The annotator selection follows a categorical distribution whose parameters reflect a certain objective, e.g., balancing the annotation workload among annotators.
Zhao et al. [12] | single | single | none | This strategy selects the query with the highest estimated utility. The annotator selection follows one of two options. On the one hand, an annotator can be selected with a probability proportional to her/his estimated performance. On the other hand, either the estimated best annotator is selected with a pre-defined probability or, otherwise, a random one.
Donmez et al. [123] | single | batch | none | This strategy selects the query with the highest estimated utility. Subsequently, it selects an adaptive number of annotators with the highest estimated performances.
Zheng et al. [14] | single | batch | none | This strategy selects the query with the highest estimated utility. In an exploration phase, it assigns an adaptive number of annotators with the highest estimated performances to this query. In the subsequent exploitation phase, a fixed subset of annotators with low ACs and high performances is determined and always selected.
Fang et al. [70] | single | batch | collaboration between annotators | This strategy selects the query with the highest estimated utility. Subsequently, it selects not only the annotator with the highest estimated performance but additionally the annotator with the lowest estimated performance. This way, the estimated best annotator can teach the estimated worst annotator.
Long et al. [139,140], Long and Hua [141] | single | batch | none | These strategies select the query with the highest estimated utility. Subsequently, they select a pre-defined number of annotators with the highest estimated performances.
Yang et al. [155] | batch | batch | none | This strategy selects a pre-defined number of queries with the highest estimated utilities. Subsequently, it assigns to each of these selected queries the respective annotator with the highest estimated performance.
Joint Selection
Donmez and Carbonell [38,39], Moon and Carbonell [95], Huang et al. [96] | single | single | none | These strategies select the query-annotator pair whose product of estimated query utility and annotator performance is the highest.
Yan et al. [149] | single | single | none | This strategy jointly selects a query and annotator by solving a linearly constrained and bi-convex optimization problem. Its goal is to find the optimal trade-off between a highly useful query and a high-performance annotator.
Yan et al. [68] | single | single | none | This strategy combines the query utility information and annotator performance information through a mutual information criterion [156] as the joint selection criterion for a query-annotator pair.
Nguyen et al. [87], Herde et al. [69] | single | single | none | These strategies jointly select a query and annotator by incorporating the estimated performance of an annotator (group) into the query utility measure quantifying the performance gain of the classification model.
Chakraborty [94] | batch | batch | query diversity | This strategy jointly selects a batch of query-annotator pairs by solving a linear programming problem. Its solution balances the trade-off between useful queries, accurate annotators, and a small redundancy between these queries.

VII. FUTURE RESEARCH DIRECTIONS
This section proposes some future research directions resulting from analyzing the real-world AL strategies discussed in the previous sections. We structure them into three categories to distinguish between challenges that strongly relate to this survey and those that go partially beyond it. Although we define these addressable challenges separately, they are not entirely solvable without taking a holistic view.

A. Active Learning for Classification
Multi-criteria cost functions: The majority of existing real-world AL strategies minimize the number of queries and misclassifications. However, in real-world applications, the ACs are often unknown in advance and may be query- and annotator-dependent. Furthermore, the computation of the MC is related to the application at hand. Therefore, an AL strategy needs to accept a user-defined objective function as input.
This function needs to account for additional criteria, such as balancing the workload between annotators [11].
Novel query types and a combination of them: Present AL strategies focus on collecting novel information relevant to the classification model. However, a query may not only improve the classification model but additionally the queried annotator [157]. For example, a strategy could ask "Are you certain that instance x n ∈ X belongs to class y ∈ Ω Y ? Previously, you stated that the similar instance x m ∈ X belongs to class y ′ ∈ Ω Y ?". Such a query may help the annotator to learn from previous annotation mistakes. Moreover, most pool-based AL strategies query class information of instances. However, recently, Liang et al. [158] proposed the strategy active learning with contrastive natural language explanations (ALICE). It uses queries of the form "How would you differentiate between the class y ∈ Ω Y and class y ′ ∈ Ω Y ?" in combination with explanatory annotations. As a result, ALICE does not need a pool of non-annotated instances but only a small initial training set. Next to novel query types, future strategies may combine different query types to enhance interaction with annotators further.
Batch selection of diverse queries and annotators: The generalization capabilities of deep learning models depend on a vast amount of data. Therefore, annotating single queries per AL cycle may be inappropriate [159]. Instead, a batch of diverse and useful queries is to be selected per AL cycle. Such a batch maximizes usefulness by avoiding redundancies. In a multi-annotator setting, assigning appropriate annotators to these queries is an additional challenge. For example, assigning all queries in a batch to a single annotator can be harmful because it could bias the performance estimates of the other annotators [146].
Advanced annotator performance estimation: Existing annotator models are limited in their application due to their assumptions. On the one hand, most of them assume persistent annotator performances and thus disregard, e.g., learning capabilities, collaboration, or signs of fatigue. On the other hand, they do not incorporate background knowledge about the annotators, e.g., interests, skills, level of education, age, etc. Such knowledge may improve the selection of annotators [160].
Realistic evaluation: Evaluating real-world AL strategies is more complex than assessing traditional AL strategies. In particular, the simulation of realistic experimental settings represents a challenge. For example, there is a need to collect real-world data sets processed by multiple annotators to verify the performance of AL strategies in multi-annotator settings. When collecting such data sets, it is infeasible to present each possible query to each annotator. Therefore, a further research direction is the simulation of annotators for different query types. Moreover, an AL strategy may be evaluated in a real-world system [161] in addition to simulated experiments on benchmark data sets to verify its effectiveness regarding real-world applications.

B. Active Learning Issues Beyond Classification
Although we focused on AL strategies for classification in this survey, their analysis provides insights beyond a classification setting. Taking object detection in images as an example, similar challenges arise when employing AL strategies. For instance, relying on the number of annotated images as AC is not representative. Instead, the number of objects within an image is more appropriate [162] because annotating images with many objects is more time-intensive. Another example in object detection is the handling of error-prone annotators, where the AL strategy additionally has to assess the quality of the provided bounding box annotations.
A challenge affecting pool-based AL with multiple annotators is the asynchronous nature of the annotation process [136].
This results from different working speeds of annotators, i.e., some annotators process queries faster than others. Due to this asynchronous nature, the selection of query-annotator pairs must be adaptive regarding the working states of the annotators. This is, in particular, true for stream-based AL.
Techniques of explainable artificial intelligence may further improve the interaction between annotators and AL strategy. For example, the ML model can visualize its decision-making process such that a human annotator can monitor the model's learning progress and correct wrong decisions [157].

C. Active Learning Issues Beyond Artificial Intelligence
Deploying AL strategies into real-world applications not only raises challenges in the scope of artificial intelligence but also involves research beyond it. One example is the design of graphical user interfaces for the annotation process, which are crucial for the efficiency of the AL process. Studies have shown that an appropriate user interface design strongly decreases the annotation time and thus the AC [116,163]. Another example is the design of queries and annotations from a psychological perspective. On the one hand, queries are to be formulated neutrally, without a bias toward a specific annotation. On the other hand, annotations are to be comparable, particularly when asking for the annotators' self-assessments.
Another future research direction is integrating AL into application areas that have been explored little or not at all to exploit its full potential. For example, it can be employed in material science to actively design experiments in a more systematic way [164] or for automatic program repair [165] to save cost and time. Another example would be the review process in science, where AL can select appropriate reviewers as annotators for articles. For this purpose, one could use feedback from authors of past conferences and the reviewers' background knowledge to train annotator models.

VIII. CONCLUSION
At the start of this survey, we pointed out unrealistic assumptions as disadvantages of traditional AL strategies. Based on that, we identified three crucial requirements for real-world AL strategies, i.e., estimating costs, asking alternative queries, and modeling annotator performances. Subsequently, we formalized the objective for classification tasks as the specification of the optimal annotation sequence leading to minimum MC and AC. Additionally, we proposed a novel AL cycle that generalizes the settings of the majority of existing real-world AL strategies. A strategy is part of a learning system in this cycle and comprises a query utility measure, an annotator performance measure, and a selection algorithm. We provided tabular literature overviews of existing real-world AL strategies regarding their cost types, their query- and annotation-based interaction, their handling of error-prone annotators, and their selection of query-annotator pairs. In addition, we analyzed the real-world AL strategies in more detail and embedded them in our unifying mathematical notation in the appendices. These analyses resulted in the formulation of future research directions in the field of AL.
Marek Herde received his B.Sc. and M.Sc. degrees in computer science from the Univ. of Kassel, Germany. Currently, he is also pursuing his Ph.D. degree in computer science there. His research focuses on active learning, deep learning, and methods for learning from error-prone annotators.
Denis Huseljic received his B.Sc. and M.Sc. degrees in computer science from the Univ. of Kassel, Germany. Currently, he is also pursuing his Ph.D. degree in computer science there. His research focuses on active learning, deep learning for computer vision, and methods for uncertainty estimation.
Bernhard Sick received his Diploma, Ph.D., and Habilitation degrees from the Univ. of Passau, Germany. He is currently a full professor for Intelligent Embedded Systems at the Univ. of Kassel, Germany. His research comprises data science and machine learning with applications, e.g., in renewable energies, autonomous driving, physics/materials science. He authored more than 200 peer-reviewed publications in these areas. He received several theses, best paper, teaching, and inventor awards. He is a member of IEEE and GI.
Adrian Calma received his B.Sc., M.Sc., and Ph.D. degrees in computer science from the Univ. of Kassel, Germany. He is keen on applying active learning techniques to real-world problems. His research focuses on developing active learning techniques handling error-prone annotators. Currently, he is a Fellow in Intelligent Embedded Systems Lab at the Univ. of Kassel, where he is working on improving precision farming methods with active learning.

Appendices of A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification
Marek Herde, Denis Huseljic, Bernhard Sick, Member, IEEE, Adrian Calma

GENERAL
THE following appendices provide a more in-depth analysis of the real-world AL strategies reviewed in the associated survey. This analysis includes discussing the real-world AL strategies regarding their cost types in Appendix A, their interaction schemes in Appendix B, their annotator performance models in Appendix C, and their selection algorithms in Appendix D. Table I lists essential abbreviations, and Table I explains the mathematical notation used throughout this survey. For ease of notation, we do not explicitly denote step t if it is not required. Table II lists all analyzed real-world AL strategies, including their acronyms where N/A denotes not available. The crosses and check-marks indicate to which of the four research aspects these strategies contributed. If more than one reference is given for a strategy, the year of publication refers to the most recent one.

Symbol | Meaning
Data Spaces
$\Omega_X$ | feature/input space of possible instances
 | random variables of all features
$Y$ | random variable of true class labels
$Z = (Z_1, \dots, Z_M)$ | random variables of annotations
$Q$ | random variable of queries
$A$ | random variable of annotators
$P = (P_1, \dots, P_M)$ | random variables of annotators' performances
Annotation Process
$\mathcal{S}$ | sequence of the annotation process
$\mathcal{S}^*$ | optimal annotation sequence
$t \in \mathbb{N}$ | (time) step
$t_{\mathcal{S}}$ | last step of annotation sequence $\mathcal{S}$
$\mathcal{C}$ | constraints of the annotation process
$\mathcal{D}(t)$ | data set obtained at the beginning of step $t$
$\mathcal{U}(t)$ | non-annotated data set at the beginning of step $t$
$\mathcal{L}(t)$ | annotated data set at the beginning of step $t$
$z^{(t)}_{lm}$ | annotation for query $q_l$ by annotator $a_m$ obtained during step $t$
Costs
$B \in \mathbb{R}_{>0}$ | annotation budget
$\mathbf{C} \in \mathbb{R}^{C \times C}$ |

In this appendix, we analyze concrete real-world AL strategies regarding their handling of costs. We structure this analysis according to the cost types, i.e., AC and MC, including their underlying cost schemes identified in Section III in the associated survey.

A. Misclassification Cost
Class-dependent MC: Margineantu [69] proposed active cost-sensitive learning (ACTIVE-CSL) as one of the first real-world AL strategies taking class-dependent MC into account. It expects a cost matrix as input. Concerning the already annotated instances, ACTIVE-CSL computes the expected MC after annotating an instance and adding it to the classification model's training set. Since the annotation of an instance is not known in advance, ACTIVE-CSL takes the expectation over all possible annotations. This query utility measure is similar to EER. As a result, it involves a lot of retraining and is thus computationally intensive. Moreover, taking only the set of already annotated instances into account biases the estimation of the expected MC. This is, in particular, true in the early stage of the AL process, where only a few instances have been annotated.
Strategy | Year | Acronym | Appendix
Herde et al. [1] | 2021 | MaPAL | C-C, D-B
Hopkins et al. [2] | 2020 | N/A | B-C
Chakraborty [3] | 2020 | N/A | A-B, C-C, D-B
Min et al. [4] | 2019 | TALK | A-A
Wu et al. [5] | 2019 | CADU | A-A
Wang et al. [6] | 2019 | CATS | A-A
Krishnamurthy et al. [7,8] | 2019 | COAL | A-A
Tsou and Lin [9] | 2019 | CSTS | A-B
Hu et al. [10] | 2019 | ALPF | B-A
Bhattacharya and Chakraborty [11] | 2019 | N/A | B-A
Teso and Kersting [12] | 2019 | N/A | B-A
Luo and Hauskrecht [13,14,15] | 2019 | HALG & (A*)HALR | B-B
Calma et al. [16] | 2018 | N/A | B-A
Song et al. [17] | 2018 | N/A | B-A
Yang et al. [18] | 2018 | DALC | C-C, D-A
Huang et al. [19] | 2017 | CEAL | A-B, C-C, D-B
Kane et al. [20] | 2017 | ACCQ | B-C
Xu et al. [21] | 2017 | ADGAC | B-C
Huang and Lin [22] | 2016 | N/A | A-A
Long et al. [23,24] | 2016 | JGPC-ASAL | C-A, D-A
Long and Hua [25] | 2015 | MARMGPC-ASAA | C-A, D-A
Krempl et al. [26] | 2015 | OPAL | A-A
Nguyen et al. [27] | 2015 | N/A | A-A, A-B, C-B
Käding et al. [28] | 2015 | GP-EMOC PDE+R | A-A, B-A, C-C
Zhong et al. [29] | 2015 | ALCU-SVM | B-A, C-C
Xiong et al. [30] | 2015 | N/A | B-C
Qian et al. [31] | 2015 | ARP | B-C
Moon and Carbonell [32] | 2014 | N/A | A-B, C-B, D-B
Fu et al. [33,34] | 2014 | QHAL & PHAL | B-C
Rodrigues et al. [35] | 2014 | GPC-MA | C-B, D-A
Fang and Zhu [36] | 2014 | EIAL | B-A, C-C
Fang et al. [37,38] | 2014 | AL+kTrM & ALM+TrU | D-A, C-C
Zhao et al. [39] | 2014 | N/A | C-C, D-A
Chen and Lin [40] | 2013 | MEC & CWMM | A-A
Biswas and Parikh [41] | 2013 | N/A | B-A
Wu et al. [42] | 2013 | PMActive | C-B, D-A
Haque et al. [43] | 2013 | GQAL | B-B
Joshi et al. [44,45] | 2012 | N/A | A-A, A-B, B-C
Cebron et al. [46] | 2012 | N/A | B-A
Ni and Ling [47] | 2012 | BMO | B-A, C-C, D-A
Fang et al. [48] | 2012 | STAL | C-C, D-A
Yan et al. [49] | 2012 | N/A | C-C, D-B
Yan et al. [50] | 2011 | N/A | C-C, D-B
Wallace et al. [51] | 2011 | MEAL | A-B, B-A, C-C, D-A
Settles [52] | 2011 | DUALIST | B-B
Rashidi and Cook [53] | 2011 | RIQY | B-B
Zheng et al. [54] | 2010 | IEAdjCost | A-B, C-A, D-A
Donmez and Carbonell [55,56] | 2010 | N/A | A-B, B-A, C-C, D-B
Tomanek and Hahn [57] | 2010 | N/A | A-B
Wallace et al. [58] | 2010 | N/A | A-B
Du and Ling [59] | 2010 | N/A | C-C
Donmez et al. [60] | 2010 | SFilter | C-A
Liu et al. [63] | 2009 | CS USST | A-A
Arora et al. [64] | 2009 | N/A | A-B
Druck et al. [65] | 2009 | GE WU | B-B
Donmez et al. [66] | 2009 | IEThresh | C-A, D-A
Settles et al. [67] | 2008 | N/A | A-B
Haertel et al. [68] | 2008 | N/A | A-B
Margineantu [69] | 2005 | ACTIVE-CSL | A-A

Joshi et al. [44,45] proposed a similar cost-sensitive EER variant. In contrast to ACTIVE-CSL, this measure also exploits the non-annotated set of instances to evaluate the expected MC of the classification model. Since the true class labels of these instances are unknown, it relies on the estimated class membership probabilities of the classification model after each retraining. However, the high computational complexity of this strategy remains, as for ACTIVE-CSL, a limitation when training classification models with computation-intensive training procedures.

The more recent strategy optimized probabilistic active learning (OPAL), proposed by Krempl et al. [26], partially overcomes the issue of high computational complexity. It computes the density-weighted reduction in the MC when annotating an instance. Different from ACTIVE-CSL, the expected MC is computed regarding the candidate instance. To this end, it relies on so-called kernel frequency estimates. We can interpret them as the number of annotations per class in the neighborhood of an instance. They are often estimated through a kernel function quantifying similarities between instances. The kernel frequency estimates allow for a closed-form solution for computing the expected MC. Additionally, they can be easily updated when adding additional annotations. Accordingly, OPAL is non-myopic by considering more than one annotation acquisition at once. The major disadvantage of OPAL is its need for an appropriate kernel frequency estimation, which is difficult for domains such as images. Another disadvantage is OPAL's restriction to binary classification problems.
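The kernel frequency estimates described above can be illustrated with a minimal sketch. It assumes a Gaussian (RBF) kernel with a hypothetical bandwidth parameter `gamma`; the function names are illustrative and are not taken from an OPAL implementation. The sketch only shows how the per-class estimates are obtained and incrementally updated, not OPAL's closed-form expected MC.

```python
import math


def rbf_kernel(x, x_prime, gamma=1.0):
    """Gaussian similarity between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def kernel_frequency_estimates(x, annotated, n_classes, gamma=1.0):
    """Similarity-weighted number of annotations per class around instance x.

    `annotated` is a list of (feature_vector, class_index) pairs.
    """
    counts = [0.0] * n_classes
    for x_n, y_n in annotated:
        counts[y_n] += rbf_kernel(x, x_n, gamma)
    return counts


def add_annotation(counts, x, x_new, y_new, gamma=1.0):
    """Update the estimates of x after one new annotation, without recomputation."""
    updated = list(counts)
    updated[y_new] += rbf_kernel(x, x_new, gamma)
    return updated
```

Because each new annotation only adds one kernel value to one class count, simulating several future annotations is cheap, which is what makes a non-myopic utility computation feasible.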
Instead of computing the MC reduction, Käding et al. [28] proposed the expected model output change (EMOC) as a utility measure. It quantifies how the classification model's predictions change by simulating the annotation of an instance. For this, it compares the updated and old classification model's predictions through a cost (loss) function and assigns high utilities to instances leading to significant differences between the prediction pairs. Compared to an EER-based approach, EMOC does not need highly reliable estimates of the class membership probabilities to compute meaningful utilities. Nevertheless, many retraining procedures of the classification model are required and hence lead to high computational complexity.
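The idea of simulating an annotation and measuring the resulting change in predictions can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the simulated annotations are weighted by the current model's class membership probabilities, an L1 loss compares old and new outputs, and `fit` and `predict_proba` are hypothetical callables standing in for an arbitrary probabilistic classifier.

```python
def expected_model_output_change(candidate, labeled, pool, n_classes,
                                 fit, predict_proba):
    """Expected change of the model outputs on `pool` if `candidate` were annotated.

    labeled: list of (instance, class_index) pairs
    fit(labeled) -> model, predict_proba(model, x) -> list of class probabilities
    """
    old_model = fit(labeled)
    old_outputs = [predict_proba(old_model, x) for x in pool]
    label_probs = predict_proba(old_model, candidate)

    score = 0.0
    for y in range(n_classes):
        # One retraining per simulated annotation of the candidate.
        new_model = fit(labeled + [(candidate, y)])
        change = 0.0
        for x, old in zip(pool, old_outputs):
            new = predict_proba(new_model, x)
            change += sum(abs(n - o) for n, o in zip(new, old))  # L1 loss
        score += label_probs[y] * change / len(pool)
    return score
```

The loop over classes makes the retraining cost explicit: each candidate requires as many model updates as there are classes.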
In favor of computational efficiency, Liu et al. [63] proposed the strategy cost-sensitive uncertainty sampling with self-training (CS USST). Its idea is to use US to select instances for finding the decision boundary minimizing the misclassification rate. Subsequently, a classification model is trained on the updated annotated set L and used to obtain predictions for the instances of the non-annotated set U. Finally, a cost-sensitive classification model is trained on the union of both sets, including the previously obtained predictions. Using this semi-supervised learning approach to determine the parameters of a cost-sensitive classification model, CS USST aims to overcome the selection bias caused by taking only the instances selected by US into account. Although this strategy resolves some limitations of US, the missing exploration issue of US remains.
Chen and Lin [40] proposed two further uncertainty-based real-world AL strategies, namely maximum expected cost (MEC) and cost-weighted minimum margin (CWMM). MEC generalizes the minimum confidence variant of US, and its utility measure is based on the expected MC under a user-defined cost matrix $\mathbf{C} \in \mathbb{R}^{C \times C}_{\geq 0}$. In contrast, CWMM is a generalization of the minimum margin US variant. It computes the MC difference between the prediction with the lowest cost, $\hat{y}^{(1)} \in \Omega_Y$, and the prediction with the second lowest cost, $\hat{y}^{(2)} \in \Omega_Y$. Both utility measures are easy to compute and can be used in combination with any probabilistic classification model. However, they share similar disadvantages as US-based utility measures, e.g., no consideration of representativeness and missing exploration.
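A minimal sketch of both utility measures is given below. It is one plausible reading of the description above rather than the original formulation: the entry `cost_matrix[y][y_hat]` is assumed to be the cost of predicting class `y_hat` when the true class is `y`, MEC is taken as the expected cost of the cost-optimal prediction, and CWMM as the negative gap between the two cheapest predictions. All function names are illustrative.

```python
def expected_costs(probs, cost_matrix):
    """Expected MC of each possible prediction y_hat: sum_y p(y|x) * C[y][y_hat]."""
    n_classes = len(probs)
    return [
        sum(probs[y] * cost_matrix[y][y_hat] for y in range(n_classes))
        for y_hat in range(n_classes)
    ]


def mec_utility(probs, cost_matrix):
    """MEC sketch: expected cost of the cheapest prediction (higher = more useful)."""
    return min(expected_costs(probs, cost_matrix))


def cwmm_utility(probs, cost_matrix):
    """CWMM sketch: negative cost margin between the two cheapest predictions."""
    lowest, second_lowest = sorted(expected_costs(probs, cost_matrix))[:2]
    return -(second_lowest - lowest)
```

With a 0/1 cost matrix, `mec_utility` reduces to one minus the highest class probability and `cwmm_utility` to the negative difference between the two highest class probabilities, i.e., the minimum confidence and minimum margin variants of US.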
Huang and Lin [22] introduced a different way of calculating uncertainty. The strategy active learning with cost embedding (ALCE) is based on a cost embedding approach that transfers the cost information into a distance measure of an $(O \in \mathbb{N})$-dimensional latent space. To this end, the class labels in $\Omega_Y$ are encoded as vectors $u_1, \dots, u_C \in \mathbb{R}^O$ such that $\lVert u_i - u_j \rVert < \lVert u_k - u_l \rVert$ holds for the Euclidean distance if and only if $\mathbf{C}[i, j] < \mathbf{C}[k, l]$. A cost-sensitive classifier is implemented through a multi-target regression model predicting latent vectors $\hat{u}(x \mid \theta_{\mathcal{D}}) \in \mathbb{R}^O$. The final class label prediction is obtained by finding the nearest vector of the transformed class labels. This embedding allows ALCE to define a cost-sensitive uncertainty measure that assigns high utility to instances whose expected costs are high. On the one hand, ALCE overcomes the requirement of a probabilistic classifier. On the other hand, a suitable multi-target regression model is required instead.

Nguyen et al. [27] proposed a strategy whose utility measure quantifies the expected MC reduction for querying an instance's annotation from cheap crowd workers or expensive experts. It separates candidate instances into non-annotated instances and those annotated through the majority vote annotation obtained from multiple crowd workers. A non-annotated instance can be selected for annotation through the crowd workers, whereas an annotated instance can be chosen to obtain an annotation from an expert. The latter case can be important for instances whose crowd worker annotations are likely wrong. The computation of the expected MC follows the idea of a cost-sensitive EER variant and extends it by modeling the crowd workers' error-proneness. Moreover, the strategy pre-selects a set of candidate instances (according to the selection criterion of US) to reduce its computational complexity. Nevertheless, it involves multiple retraining iterations of a probabilistic classification model.
Following a methodology of divide and conquer [70], Min et al. [4] proposed the strategy tri-partition active learning through k-NN (TALK). Based on a cost-sensitive $(k \in \mathbb{N})$-nearest neighbor (k-NN) model, TALK divides the non-annotated instances into three regions. For this, the MC of each non-annotated instance $x$ is estimated as instance utility from the set $N^k_x \subset X$, which contains the $k$ annotated NN of instance $x$. If the estimated MC of an instance is lower than its AC, the instance is assigned to the first region and annotated according to the classification model's prediction. The remaining non-annotated instances are sorted according to their estimated MCs. A predefined number of the instances with the highest MCs are assigned to the second region and annotated by a human annotator. The other non-annotated instances form the third region and will be processed in the next cycle iteration. The major issues of TALK are its requirement for an appropriate k-NN model and its missing consideration of an instance's representativeness.
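The three-way split described above can be sketched in a few lines. The sketch takes the estimated MCs as given (how they are obtained from the k-NN model is not reproduced here) and assumes a uniform AC per human annotation; the function and parameter names are illustrative.

```python
def tri_partition(estimated_mc, annotation_cost, n_queries):
    """Split non-annotated instances into the three regions described above.

    estimated_mc: dict mapping an instance id to its estimated MC
    annotation_cost: AC of one human annotation (assumed uniform in this sketch)
    n_queries: predefined number of instances passed to the human annotator
    """
    # Region 1: misclassifying is cheaper than asking, so the model's
    # prediction is used as the annotation.
    auto_labeled = [i for i, mc in estimated_mc.items() if mc < annotation_cost]
    auto_set = set(auto_labeled)

    # Sort the remaining instances by estimated MC, highest first.
    remaining = sorted(
        (i for i in estimated_mc if i not in auto_set),
        key=lambda i: estimated_mc[i],
        reverse=True,
    )
    to_query = remaining[:n_queries]   # Region 2: annotated by a human annotator
    deferred = remaining[n_queries:]   # Region 3: processed in the next cycle
    return auto_labeled, to_query, deferred
```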
Two similar strategies to TALK are cost-sensitive active learning through density clustering under a label uniform distribution (CADU), proposed by Wu et al. [5], and cost-sensitive active learning through statistical methods (CATS), presented by Wang et al. [6]. They additionally consider an instance's density and aim to query the annotations of an MC-optimal number of instances per region/cluster.
Instance-dependent MC: Each of the previously discussed strategies assumes class-dependent MC, e.g., defined by a cost matrix. In contrast, Krishnamurthy et al. [7,8] presented cost overlapped active learning (COAL) as an approach for instance-dependent MC. Correspondingly, the cost of interchanging two classes may differ from instance to instance. COAL requires access to a set of regression functions, which provide the range of possible costs that a predicted class label for an instance may cause. The idea of COAL is to actively query the cost of predicting a specific class label for an instance instead of querying class labels. Therefore, COAL aims to query only the cost information of class labels with high uncertainty regarding their possible cost values. This uncertainty is quantified through a cost range that we can interpret as COAL's utility measure.

B. Annotation Cost
Annotator-dependent AC: In the following, we denote the annotators' individual ACs as $\nu = (\nu_1, \dots, \nu_M)^\mathrm{T} \in \mathbb{R}^M_{>0}$ with $\nu_m$ as the AC for querying annotator $a_m \in A$.
For this setting, Zheng et al. [54] proposed the strategy IEAdjCost. It solves an optimization problem to determine a subset of annotators whose majority vote annotations achieve, on average, a user-defined level of accuracy $\rho \in [0, 1]$ at the minimum sum of the annotators' ACs. Mathematically, this annotator set is defined through
$$\arg\min_{A' \subseteq A} \sum_{a_m \in A'} \nu_m \quad \text{subject to} \quad A'_{\mathrm{acc}} \geq \rho,$$
where $A'_{\mathrm{acc}}$ denotes the estimated probability that the majority vote annotation of $A'$ for an arbitrary instance will be correct.
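For a handful of annotators, this kind of constrained selection can be illustrated by exhaustive search. The sketch below is not the solver of Zheng et al.; it assumes binary class labels, independent annotators with known accuracies, and odd-sized annotator sets (to avoid ties in the majority vote). All names are hypothetical.

```python
from itertools import combinations

def majority_vote_accuracy(accuracies):
    """P(majority vote is correct) for an odd number of independent binary annotators."""
    # dist[j] = probability that exactly j annotators are correct.
    dist = [1.0]
    for p in accuracies:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)      # this annotator is wrong
            new[j + 1] += q * p        # this annotator is correct
        dist = new
    needed = len(accuracies) // 2 + 1  # strict majority
    return sum(dist[needed:])

def cheapest_annotator_set(accuracies, costs, rho):
    """Return (subset, total AC, accuracy) of the cheapest odd-sized subset
    whose majority vote accuracy reaches rho, or None if no subset qualifies."""
    best = None
    ids = range(len(accuracies))
    for size in range(1, len(accuracies) + 1, 2):
        for subset in combinations(ids, size):
            acc = majority_vote_accuracy([accuracies[m] for m in subset])
            cost = sum(costs[m] for m in subset)
            if acc >= rho and (best is None or cost < best[1]):
                best = (subset, cost, acc)
    return best
```

The example shows why a cost-aware selection can prefer several cheap, moderately accurate annotators over one expensive expert; the exhaustive search is exponential in the number of annotators and is meant for illustration only.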
The strategy of Chakraborty [3] also solves an optimization problem to determine, in each iteration cycle, a set of annotators with high performances and low ACs to annotate a batch of queries. To this end, Chakraborty combines query utility, annotator performance, and AC into one final score, i.e., the annotation error-weighted AC normalized by the query utility.
Using the annotators' ACs as a normalization factor to assess the utility of a query or the annotators' performances is another method to consider annotator-dependent AC explicitly. The strategies proposed in [19,32,55,56] compute a kind of AC-effective performance, i.e., an annotator's estimated performance normalized by the annotator's AC. One disadvantage of such a normalization approach may be the non-linear relation between annotator performance and AC since they are computed on different scales. In contrast, the strategy of Nguyen et al. [27] uses the AC to normalize the expected reduction in MC. Both cost types are estimated on the same scale. As a result, this strategy may find a better trade-off between both cost types.
Query-dependent AC: In the following, we denote the queries' individual ACs as $\nu = (\nu_1, \dots, \nu_L)^\mathrm{T} \in \mathbb{R}^L_{>0}$ with $\nu_l$ as the AC for obtaining an annotation for query $q_l \in Q_X$. Many strategies [44,45,55,56,69] take query-dependent AC into account by subtracting the AC of a query from its utility:
$$\phi(q_l) - \nu_l. \tag{8}$$
Computing query utility per AC unit is another common approach to make a strategy cost-sensitive [9,57,58,67,68]:
$$\frac{\phi(q_l)}{\nu_l}. \tag{9}$$
The approaches in Eq. 8 and Eq. 9 implicitly assume that the utilities and ACs can be expressed on the same scale [57]. If this assumption is violated, one may re-scale the AC, e.g., through a linear transformation:
$$\alpha_0 + \alpha_1 \nu_l,$$
where $\alpha_0, \alpha_1 \in \mathbb{R}$ are corresponding coefficients to be determined. However, these coefficients are often unknown in real-world settings, or even non-linear transformations of the ACs are required [68]. For this reason, Haertel et al. [68] proposed two other approaches to consider query-dependent AC explicitly. On the one hand, the selection of queries can be constrained to a user-defined maximum AC $\nu_{\max} \in \mathbb{R}_{>0}$ such that all queries with $\nu_l > \nu_{\max}$ are excluded from the annotation process. On the other hand, a query's utility and its AC can be combined through a linear combination of their ranks. In this context, a high query utility and a low AC lead to high ranks, and $\rho \in [0, 1]$ serves as a user-defined weighting term between both ranks. Both approaches have the disadvantage that appropriate values for $\rho$ and $\nu_{\max}$ must be found.
Knowledge about the AC of each query is a prerequisite for employing the above approaches as part of a real-world AL strategy. Therefore, Margineantu [69] assumes that the AC per query is known in advance. In contrast, Donmez and Carbonell [55,56] and Joshi et al. [44,45] expect that the AC follows a fixed cost model where the AC of a query is correlated to the probability of obtaining its optimal annotation.
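The four ways of combining query utility and query-dependent AC discussed above can be compared on toy values. This is a minimal sketch: the function names are hypothetical, and the assignment of the weight rho to the utility rank (and 1 - rho to the AC rank) is an assumption, since the text only states that both ranks are linearly combined.

```python
def utility_minus_cost(utilities, costs):
    """Subtract the AC of each query from its utility (cf. Eq. 8)."""
    return [u - c for u, c in zip(utilities, costs)]

def utility_per_cost(utilities, costs):
    """Utility per AC unit (cf. Eq. 9)."""
    return [u / c for u, c in zip(utilities, costs)]

def affordable_queries(costs, max_cost):
    """Indices of queries whose AC does not exceed the user-defined maximum."""
    return [l for l, c in enumerate(costs) if c <= max_cost]

def ascending_ranks(values):
    """Rank 1 for the smallest value, rank len(values) for the largest."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def rank_combination(utilities, costs, rho):
    """Linear combination of ranks; high utility and low AC both give high ranks."""
    utility_ranks = ascending_ranks(utilities)
    cost_ranks = ascending_ranks([-c for c in costs])  # negate: low AC -> high rank
    return [rho * ru + (1 - rho) * rc for ru, rc in zip(utility_ranks, cost_ranks)]

utilities = [0.9, 0.5, 0.7]
costs = [3.0, 1.0, 2.0]
print(utility_minus_cost(utilities, costs))
print(utility_per_cost(utilities, costs))
print(affordable_queries(costs, max_cost=2.0))
print(rank_combination(utilities, costs, rho=0.75))
```

The rank-based variant sidesteps the scale mismatch between utilities and ACs, at the price of discarding the magnitudes of both quantities.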
More advanced strategies [9,57,58,67,68] estimate the individual ACs at run-time during the annotation process. In particular, when we define AC through the annotation time, such an estimation is crucial. Settles et al. [67] provided a detailed analysis of annotation time used as a proxy for the AC. For this purpose, four data sets with four different tasks (i.e., extracting entities from news articles, classifying biomedical abstracts, extracting contact details from e-mail signature lines, and classifying segments in images) were annotated. The corresponding annotation times were logged. The results indicate a high variability of the annotation time per query on all four data sets. Moreover, the annotation time is found to be substantially annotator-dependent. Another investigated issue concerns the development of the annotation time during the annotation process. It has been observed that, in general, the annotation time decreases because the annotators can adapt quickly to the annotation task. The observations regarding the annotation time made by Settles et al. [67] have been mostly confirmed by Arora et al. [64] in another empirical investigation. In a further case study, Raghavan and Jones [71] found that the annotation time strongly depends on the type of query, i.e., annotating the importance of a feature regarding a document classification problem took about one-fifth of the time required for annotating a document.
Following the above studies, there are several methods for estimating times for annotating documents [58,67,68]. They estimate the annotation time using a regression model, e.g., support vector regression [72] or ordinary least squares [73], as a function of simple numerical features such as the number of words in a document. Experiments with these models demonstrated that annotation times are fairly learnable.
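As an illustration of such a model, the following sketch fits an ordinary least squares line that predicts annotation time from the number of words in a document. The function names and the logged times are made up for illustration and do not come from the cited studies.

```python
def fit_ols(x, y):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_time(n_words, intercept, slope):
    """Predicted annotation time for a document with n_words words."""
    return intercept + slope * n_words

# Hypothetical log: number of words per document and annotation time in seconds.
word_counts = [50, 100, 150, 200]
seconds = [20.0, 35.0, 50.0, 65.0]
intercept, slope = fit_ols(word_counts, seconds)
print(predict_time(120, intercept, slope))
```

The predicted time can then serve as the query-dependent AC in any of the cost-sensitive utility measures; richer feature sets or support vector regression would replace `fit_ols` without changing the surrounding logic.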
Tsou and Lin [9] proposed a more general approach for estimating query-dependent AC. Their real-world AL strategy cost-sensitive tree sampling (CSTS) constructs a decision tree dividing the observed instances into disjoint leaves during the AL process. The AC of querying an instance's annotation is then estimated through the average AC of the already annotated instances in the corresponding leaf of this tree.
Query- and annotator-dependent AC: In the following, we denote the ACs for each combination of query and annotator as $\nu = (\nu_{11}, \dots, \nu_{LM})^\mathrm{T} \in \mathbb{R}^{L \cdot M}_{>0}$ with $\nu_{lm}$ as the AC for obtaining an annotation for query $q_l \in Q_X$ from annotator $a_m \in A$.
Wallace et al. [58] proposed a straightforward approach to compute the AC per query-annotator pair in the domain of document classification. To this end, they assume that the salary per time unit of each annotator is known. As a result, the AC $\nu_{lm}$ is estimated by multiplying the expected time to annotate query $q_l$ by the salary per time unit of annotator $a_m$. To estimate a query's annotation time, Wallace et al. simply assume that all annotators read a predefined number of words per minute and transform the length of a document to an annotation time under this model.
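A worked example of this cost model is given below. The reading speed, document lengths, and salaries are hypothetical values chosen for illustration, and the function names are not taken from Wallace et al.

```python
def annotation_cost(n_words, words_per_minute, salary_per_hour):
    """AC of one query-annotator pair: expected reading time times salary."""
    minutes = n_words / words_per_minute
    return minutes / 60 * salary_per_hour

def cost_matrix(doc_lengths, salaries_per_hour, words_per_minute=200):
    """Rows index queries (documents), columns index annotators."""
    return [
        [annotation_cost(n, words_per_minute, s) for s in salaries_per_hour]
        for n in doc_lengths
    ]

# Two documents (600 and 1200 words) and two annotators (20 and 60 per hour).
print(cost_matrix([600, 1200], [20.0, 60.0]))
```

Because the reading speed is shared by all annotators, the costs of different annotators for the same document differ only by the ratio of their salaries.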
The assumption that each annotator requires similar annotation times is often violated in real-world applications [64,67]. For this reason, Arora et al. [64] proposed an approach to estimate the annotation time as a function of query and annotator characteristics. Since this approach also focuses on documents, they use character length and percentage of stop words to describe a document's query. For the annotators, an ordinal scale is used to assess whether an annotator is a native speaker of English. Combined with respective logged annotation times, these characteristics are used as inputs to a linear or support vector regression model [72] to estimate the annotation time for new query-annotator pairs. A significant advantage of such an approach is its inductive learning toward annotators for which no annotation times have been collected yet. However, the approach was only tested on a relatively small data set.

APPENDIX B INTERACTION SCHEMES
In this appendix, we analyze concrete real-world AL strategies regarding their interaction schemes. We structure this analysis according to the query types, i.e., instance, region, and comparison query, identified in Section IV in the associated survey.

A. Instance Queries
Instance queries are the core of research in traditional AL and have already been reviewed in several other surveys [74,75,76,77]. Therefore, we focus on the following strategies using queries going beyond the standard formulation: "To which class does instance x n ∈ X belong?".
Instance queries with partial label information: Classification problems with many classes often challenge human annotators because they have to pick one out of many classes as an instance's true class. This is, in particular, difficult if multiple classes may be in question for an instance. Therefore, Cebron et al. [46] proposed a strategy overcoming such issues by facilitating instance queries. Instead of asking for an instance's class label, it refers to partial label information by querying to which classes an instance does not belong. Correspondingly, annotators are allowed to annotate an instance with a subset of class labels, i.e., Ω Z = P(Ω Y ). This partial label information is used to create a set of instances for each class. Such a set consists of instances that do not belong to the respective class. As illustrated by Fig. 1, a one-class support vector machine (SVM) [78] is trained for each set to define a region of instances not belonging to the respective class. An instance's prediction is obtained by computing the distance to each of the C regions and assigning the instance to the class with the highest distance. The utility of an instance is estimated through the expected error of the classification model, i.e., the committee of the C one-class SVMs. Instances with high distances to all regions have high utilities because the classification model cannot exclude specific classes for an instance. The strategy works for different numbers of excluded classes as annotation, e.g., an annotator can only exclude one or two classes in case of a classification problem with many classes. In this context, Cebron et al. made the limiting assumption that each class has an equal probability of being excluded. However, real human annotators could have certain preferences to exclude a class for an instance. Active learning with partial feedback (ALPF) proposed by Hu et al. [10] is another strategy querying partial label information about instances. 
ALPF assumes that the class labels can be organized into composite classes K = {K 1 , . . . , K O }, where each composite class is a subset of one or multiple classes, i.e., K 1 , . . . , K O ⊂ Ω Y . These composite classes can be generated through an existing hierarchy of the class labels Ω Y , e.g., when classifying animals, a composite class for dogs would contain all dog breeds in Ω Y . These composite classes are then used in combination with an instance to formulate queries of the type: "Does instance x n ∈ X belong to the composite class K o ∈ K?". Accordingly, the set of queries is defined through Q X = X × K. An annotator can either answer yes or no such that the annotation set is Ω Z = {yes, no}. ALPF's main idea is to use these yes/no queries to gradually reduce the set of potential classes to which an instance could belong. A neural network [79] with an extended variant of the cross-entropy loss function is trained with this partial label information. Hu et al. introduced three different utility measures to select queries, namely expected reduction in entropy, expected remaining classes, and expected decrease in classes. The first measure can be seen as a variant of US for partial labels. The second and third measures estimate how much an annotation would affect the set of potential classes an instance can belong to. ALPF showed promising performance results on large-scale classification benchmark data sets. Thus, it made a step toward annotating real-world data sets with a considerable number of classes, which cannot be surveyed in their entirety by a human annotator.
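The gradual reduction of an instance's potential classes through yes/no queries can be sketched as follows. The scoring function is one plausible reading of the "expected remaining classes" measure and is an assumption, as are all names and the toy class hierarchy; it is not the implementation of Hu et al.

```python
def expected_remaining_classes(probs, candidates, composite):
    """Expected number of still-possible classes after asking whether the
    instance belongs to `composite`, given predicted class probabilities."""
    inside = candidates & composite
    outside = candidates - composite
    total = sum(probs[y] for y in candidates)
    p_yes = sum(probs[y] for y in inside) / total
    return p_yes * len(inside) + (1 - p_yes) * len(outside)

def update_candidates(candidates, composite, answer):
    """Shrink the candidate set according to the annotator's yes/no answer."""
    return candidates & composite if answer == "yes" else candidates - composite

probs = {"poodle": 0.4, "beagle": 0.3, "cat": 0.2, "horse": 0.1}
candidates = set(probs)
composites = {
    "dog": {"poodle", "beagle"},
    "poodle": {"poodle"},
    "cat": {"cat"},
}
# Ask the composite-class question that leaves the fewest classes in expectation.
best = min(composites, key=lambda k: expected_remaining_classes(probs, candidates, composites[k]))
print(best)
print(update_candidates(candidates, composites[best], "yes"))
```

In this toy setting, the coarse question about the composite class is preferred over questions about single classes because it halves the candidate set regardless of the answer.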
Instance queries with self-assessments: For various classification problems, studies have shown that annotators can provide meaningful self-assessments in addition to a class label as an annotation for an instance [16,51].
Allowing annotators to express their missing knowledge regarding an instance's annotation is a common approach to incorporate annotators' self-assessments [28,29,36,51,55,56]. In this case, we can define the annotation set as $\Omega_Z = \Omega_Y \cup \{\text{unconfident}\}$, where unconfident represents that an annotator is not able or refuses to provide a class label as annotation. Wallace et al. [51] proposed the strategy multiple expert active learning (MEAL), which distinguishes between novices and experts as annotators. MEAL queries experts only for instances that could not previously be assigned to any class with certainty by the novices. In contrast, the strategies [28,29,36,55,56] use the instances annotated as unconfident to train their annotator models estimating the annotators' performances. We provide a more detailed discussion on how they obtain these performance estimates in Appendix C.
Song et al. [17] proposed a real-world AL strategy with confidence-based answers for crowdsourcing annotation tasks. It aims at aggregating the answers of the annotators of the crowd for creating highly accurate data sets at minimum AC. For this purpose, an annotator is required to provide a confidence score $z \in \Omega_Z = [-1, 1]$ as an instance's annotation. Only binary classification tasks are included, such that a confidence score is transformed to the probability $(z + 1)/2$ for the positive class. Relying on these probability estimates provided by multiple annotators for a single instance, a class label is aggregated by determining the maximum likelihood solution of a Beta distribution given the observed confidence scores.
The inferred mean of this Beta distribution is an estimate of the true class posterior probability. For the instance selection, US is applied in combination with the confidence intervals obtained by the fitted Beta distribution. Hence, the strategy of Song et al. selects instances for which the decision of the classification model and the aggregated label are uncertain.
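The transformation and aggregation of confidence scores can be sketched as follows. Note the swapped-in technique: instead of the maximum likelihood fit described above, this sketch fits the Beta distribution by the method of moments, which has a closed form and needs no numerical optimizer. Function names and the example scores are hypothetical.

```python
def to_probability(z):
    """Map a confidence score z in [-1, 1] to a positive-class probability."""
    return (z + 1) / 2

def beta_method_of_moments(ps):
    """Method-of-moments estimate (alpha, beta) of a Beta distribution.
    Stand-in for the maximum likelihood fit; requires at least two distinct
    values with sample variance below mean * (1 - mean)."""
    n = len(ps)
    mean = sum(ps) / n
    var = sum((p - mean) ** 2 for p in ps) / (n - 1)
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Confidence scores of four annotators for one instance.
scores = [0.6, 0.2, 0.8, 0.4]
alpha, beta = beta_method_of_moments([to_probability(z) for z in scores])
posterior_mean = alpha / (alpha + beta)   # estimate of the class posterior
label = 1 if posterior_mean >= 0.5 else 0  # aggregated binary class label
print(alpha, beta, posterior_mean, label)
```

The spread of the fitted distribution, which shrinks as more consistent scores arrive, is what makes confidence intervals available for the instance selection.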
Ni and Ling [47] proposed a similar strategy named best multiple oracles (BMO). It aims at guaranteeing a user-defined threshold c min ∈ Ω C = [0.5, 1] for the correctness of the class labels used for training a classification model. The main idea is to re-annotate an instance until the instance's class label certainty reaches the certainty threshold c min . BMO selects instances according to a user-defined selection strategy, e.g., US or EER. For annotating an instance, an annotator provides an estimated class label y ∈ Ω Y = {1, 2} and a confidence score c ∈ Ω C as additional feedback. Such an annotation (y, c) ∈ Ω Z = Ω Y × Ω C is assumed to describe that y is an instance's correct class label with probability c.
The strategies of Song et al. [17] and Ni and Ling [47] are limited to binary classification problems. In contrast, Calma et al. [16] proposed a strategy coping with multi-class problems. To this end, it expects a class label $y \in \Omega_Y$ and a corresponding confidence score $c \in \Omega_C = [0, 1]$ as annotation. Hence, the annotation set is given by $\Omega_Z = \Omega_Y \times \Omega_C$. Each annotation $(y, c)$ is transformed to a vector $p = (p_1, \dots, p_C)^\mathrm{T} \in [0, 1]^C$ of class probabilities, whose elements are computed from the class label $y$ and the confidence score $c$. The obtained probability vectors are then used to train a generative classification model in combination with the 4DS strategy [80] as a basis for the estimation of instances' utilities.
On the one hand, strategies asking for numerical confidence scores can improve the classification model's performance if these scores are well-calibrated. On the other hand, confidence scores increase the annotation effort and can be strongly biased in multi-annotator settings.
Instance queries with model predictions: Using the classification model's predictions during the formulation of queries may ease and speed up the annotation process. In this context, Bhattacharya and Chakraborty [11] proposed a strategy reducing the annotation effort per query. Its idea is to preselect the set of possible class labels to which an instance may belong. This set is defined through a user-defined number $O \in \{2, \dots, C - 1\}$ of the most probable class labels predicted by the classification model. Accordingly, a query is formulated as: "To which class in $\{y_{(n1)}, \dots, y_{(nO)}\} \subset \Omega_Y$ does instance $x_n \in X$ belong?". The utility of such a query considers only the probabilities of the $O$ most likely classes for instance $x_n$. For this purpose, a probability difference matrix $D_n$ is computed. Subsequently, the maximum eigenvalue of the inverse matrix $D_n^{-1}$ represents the instance's utility. This eigenvalue is inversely proportional to the average absolute difference between the estimated probabilities [81]. As a result, instances whose top $O$ estimated class-membership probabilities are close to each other have high utilities. During the entire annotation process, Bhattacharya and Chakraborty assume that an annotator always provides the true class label of an instance. This assumption includes cases where an instance's true class label is not part of the preselected $O$ most likely classes. However, in real-world applications, an annotator may be confused in such a case, in particular, if the annotator trusts the classification model's predictions.
Biswas and Parikh [41] also proposed a strategy directly incorporating the classification model's predictions into queries. To this end, it employs queries of the form: "Does instance $x_n \in X$ belong to class $y \in \Omega_Y$? If this is not the case, can you explain the reason?". Accordingly, a query consists of a pair of instance and class label: $(x_n, y) \in Q_X = X \times \Omega_Y$. As annotation, a yes is expected if the instance $x_n$ actually belongs to the class $y$. Otherwise, the annotator is required to explain why this is not the case. For example, an image of a city is given, and the real-world AL strategy queries whether this image shows a forest. Since the answer is not yes, the annotator explains that this image does not belong to the class forest because it is not natural enough. The term natural is a so-called relative attribute in this context. We can interpret a relative attribute as a high-level feature to compare instances among each other. Assuming that there are $r_1, \dots, r_O$ relative attributes, the set of possible explanations is defined as $\Omega_E = \{r_1^\uparrow, r_1^\downarrow, \dots, r_O^\uparrow, r_O^\downarrow\}$ with $r_o^\uparrow, r_o^\downarrow \in \Omega_E$ representing a too high/low value for the relative attribute $r_o$.
Together with the answer yes the explanations form the annotation set: Ω Z = {yes} ∪ Ω E . Given this interaction scheme, Biswas and Parikh answered three main questions.
(1) How to learn from attribute-based explanations? If an annotator says that "$x_n$ is too $r_o$ to belong to class $y$", i.e., $r_o^\uparrow$ as annotation, the AL strategy computes the strength of the attribute $r_o$ in the queried instance $x_n$ as $r_o(x_n)$. Then, the strategy identifies all non-annotated instances with an attribute strength above $r_o(x_n)$. These instances can also not belong to class $y$ if $x_n$ is too $r_o$ to belong to class $y$. Hence, the training data of the classification model is updated by adding these instances as counterexamples for class $y$. This kind of label propagation works analogously for the case that the annotator provides $r_o^\downarrow$ as an annotation.
(2) How to learn relative attribute models during the annotation process? A prerequisite for learning from relative attributes is the specification of the attribute predictors $r_1, \dots, r_O$. Such an attribute predictor is a function with $r_o(x) = w_o^\mathrm{T} x$. Given a ranking of instances regarding the $o$-th attribute, the weights $w_o \in \mathbb{R}^D$ ideally satisfy $r_o(x_n) > r_o(x_m)$ if instance $x_n$ has a stronger presence of attribute $r_o$ than instance $x_m$. The attribute predictors can be either pre-trained [82] or learned during the annotation process. In the latter case, the ranking of instances is intelligently built upon the relative feedback of the annotators.
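The label propagation described under (1) can be sketched with a linear attribute predictor. The sketch assumes that a "too high" explanation excludes all pool instances whose attribute strength is at least that of the queried instance (and symmetrically for "too low"); the weights, instances, and function names are hypothetical.

```python
def attribute_strength(w, x):
    """Linear attribute predictor: inner product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def propagate_counterexamples(w, x_query, pool, too_high=True):
    """Indices of pool instances that inherit the explanation of x_query,
    i.e., that become counterexamples for the queried class."""
    ref = attribute_strength(w, x_query)
    if too_high:
        # x_query has too much of the attribute: stronger instances are excluded too.
        return [i for i, x in enumerate(pool) if attribute_strength(w, x) >= ref]
    # x_query has too little of the attribute: weaker instances are excluded too.
    return [i for i, x in enumerate(pool) if attribute_strength(w, x) <= ref]

w_natural = [1.0, 0.0]                      # hypothetical attribute weights
x_query = (0.5, 0.3)                        # queried instance
pool = [(0.2, 0.9), (0.7, 0.1), (0.9, 0.4)]  # non-annotated instances
print(propagate_counterexamples(w_natural, x_query, pool, too_high=True))
print(propagate_counterexamples(w_natural, x_query, pool, too_high=False))
```

A single explanation can thus yield several negative training examples at once, which is the source of the annotation savings of this interaction scheme.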
(3) How to consider relative feedback annotations during the query selection? Traditional US considers only class labels as annotations for an instance. Therefore, Biswas and Parikh extended traditional US by accounting for the potential attribute-based feedback. The corresponding utility measure estimates the possible reduction of the entropy when annotating a query. For this purpose, it computes the expected entropy reduction over the different potential annotations in $\Omega_Z$. Finally, the query leading to the maximum entropy reduction is selected.
The strategy of Biswas and Parikh [41] expects the annotators to provide explanations if an instance does not belong to a specific class. In contrast, Teso and Kersting [12] proposed the strategy CAIPI, where a query presents an instance's prediction together with an explanation to an annotator. As a result, a query takes the form: "Does instance $x_n \in X$ belong to class $y_{(n1)} \in \Omega_Y$ because of explanation $e_n \in \Omega_E$?". The set of possible queries can be formalized as $Q_X = X \times \Omega_Y \times \Omega_E$. The instance to be queried is selected through a user-defined utility measure, e.g., US. The class label is simply the selected instance's most probable class output by the classification model. The explanation is generated through the implementation of local interpretable model-agnostic explanations (LIME) [83]. Such an explanation is a collection of relevant components, e.g., words in a document or objects in an image, that mainly lead to the classification model's prediction. The set of explanations can be represented through weights $w_1, \dots, w_O \in \mathbb{R}$, one per component. The magnitude $|w_o|$ indicates the overall contribution of the $o$-th component to the classification model's prediction. In contrast, the weight's sign indicates whether the component is a positive or negative indicator for the prediction. Once a query has been selected, an annotator can interact in three different ways: (1) The annotator confirms the query if the prediction and explanation are correct. (2) The annotator contradicts the query if the prediction is wrong and provides the instance's true class label. (3) The annotator corrects the explanation if the prediction is correct while the explanation is wrong. These three different interaction possibilities can be summarized through the annotation set $\Omega_Z = \{\text{yes}\} \cup \Omega_Y \cup \Omega_E$. The third case is novel in AL. CAIPI incorporates the corrected explanation by generating synthetic instances that teach the classification model to identify irrelevant components. Teso and Kersting empirically showed that CAIPI could increase the annotators' trust in the classification model and that corrected explanations can substantially improve the classification model's performance. A disadvantage of CAIPI is its need for appropriate component definitions regarding the classification problem at hand.

B. Region Queries
Real-world AL strategies may use region queries to capture high-level information about a classification task. The main challenge concerns the generation of human-understandable region queries whose annotations enhance the performance of a classification model. This challenge is similar to membership query synthesis [84], where the class label of any instance in the feature space Ω X can be queried. In this setting, Baum and Lang [85] encountered the problem that many synthetic instances had no natural semantic meaning and were thus difficult to annotate by human annotators. A further challenge of AL with region queries is the number of regions in a feature space. In the case of numerical features, there are infinitely many regions. Hence, existing region query utility measures take only a finite subset of possible regions into account. We discuss the main approaches for limiting the number of queries and measuring their utilities in the following.
Region queries for natural language processing: In natural language processing [86] tasks, documents as instances are often described by many features. Therefore, the sole use of instance queries may result in poor classification performance [65]. As an alternative, specific features can be queried [87,88], such as the correlation between a feature and a class. For example, in the classification task distinguishing hockey from baseball related text documents, the presence of the word puck is a strong indicator for the class hockey [65]. Assuming the feature value x nd ≥ 0 is given by the frequency of the word puck indexed by d in the document instance x n , the exemplary region query is formalized by q d = X d > 0 with its annotation z d = hockey. As a result, all documents containing the word puck are annotated with the class hockey. A feature can also be an indicator for multiple classes [65,88], e.g., the presence of the word player may indicate a baseball or hockey related document. Druck et al. [65] proposed the strategy GE WU which selects a query X d > 0 ∈ Q X = {X 1 > 0, . . . , X D > 0} for annotation through a weighted US variant. The weighting is to balance the tradeoff between very frequent and infrequent features (i.e., words).
The strategy DUALIST [52] uses the same form of region queries. However, it additionally combines them with instance queries. In each learning cycle, DUALIST selects a fixed number of the most uncertain instances (i.e., documents) employing the entropy-based US (cf. Eq. 6 in the survey). Moreover, the most informative region queries are presented to an annotator. Their utilities are defined as the information gains that their annotations would provide, computed with respect to a variable $\delta_d$ indicating the presence or absence of a feature (i.e., word) in an instance (i.e., document). This measure is inspired by a common feature selection method specifying the most salient features in text classification [89]. To speed up the annotation process, DUALIST organizes the selected region queries into classes. To this end, the query $X_d > 0$ is associated with the class with which it occurs most frequently and with any other class with which it appears at least 75% as often.
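The information gain of a word-presence feature can be computed as the mutual information between the presence indicator and the class label. The sketch below estimates all probabilities from empirical counts over labeled toy documents, which is an assumption made for brevity; in an AL setting, the class information would typically come from the classification model's predictions. Names and data are hypothetical.

```python
import math
from collections import Counter

def information_gain(presence, labels):
    """Mutual information (in nats) between a binary presence indicator
    and the class labels, estimated from empirical frequencies."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    p_feat = Counter(presence)
    p_cls = Counter(labels)
    ig = 0.0
    for (d, y), count in joint.items():
        p_joint = count / n
        ig += p_joint * math.log(p_joint / ((p_feat[d] / n) * (p_cls[y] / n)))
    return ig

labels = ["hockey", "hockey", "baseball", "baseball"]
puck = [1, 1, 0, 0]     # the word occurs only in hockey documents
player = [1, 0, 1, 0]   # the word occurs equally often in both classes
print(information_gain(puck, labels))
print(information_gain(player, labels))
```

A word that perfectly separates the two classes attains the maximum gain of ln 2, whereas a word spread evenly across the classes has zero gain and would not be presented as a region query.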
Asking generalized questions: Region queries of the form $X_d > 0$ consider only a single feature and may be overly general. As a result, an annotator may not be able to provide a meaningful annotation. Although the AL strategies GE WU and DUALIST allow an annotator to ignore a region query, the AL strategy asking generalized queries (AGQ) [61] goes further. It constructs queries with an adaptive degree of specificity. For this purpose, the most uncertain instance $x_{n^*}$ is selected according to US in the first step (cf. Eqs. 5, 6 in the survey). Instead of asking for the annotation of the specific instance $x_{n^*}$, irrelevant features are identified and removed from its feature vector in the second step. The resulting query is formalized by $q_{n^*} = \bigwedge_{d \in R_{n^*}} (X_d \doteq x_{n^*d})$, where $R_{n^*} \subseteq \{1, \dots, D\}$ denotes the index set of estimated relevant features regarding instance $x_{n^*}$. To determine those features, the set $R_{n^*}$ (initially containing all feature indices, i.e., $R_{n^*} = \{1, \dots, D\}$) is gradually reduced by removing the indices of irrelevant features. It is assumed that the values of irrelevant features have a negligible impact on the class membership probabilities estimated by the classification model. Based on this assumption and given the current index set $R_{n^*}$ of assumed relevant features, for each feature $X_d$ with $d \in R_{n^*}$, synthetic instances are generated to test how the probabilistic estimates of the classification model are affected when the values of the feature $X_d$ are varied. The feature $d^*$ leading to the smallest change in the probabilistic estimates is selected as irrelevant: $R_{n^*} \leftarrow R_{n^*} \setminus \{d^*\}$. This procedure is repeated until the change in the probabilistic estimates for each feature $X_d$ with $d \in R_{n^*}$ is below a user-defined threshold.
Region queries generated by the AL strategy AGQ are still restricted in their form since a feature $X_d$ is either entirely irrelevant or set to a specific value $X_d \doteq x_{n^*d}$. In particular, these queries are overly specific when dealing with numerical features. The AL strategy AGQ+ [62], being an extension of AGQ, resolves this problem. After the indices $R_{n^*}$ of the relevant features have been identified according to AGQ, the robustness of the estimated class membership probabilities regarding the value ranges of each of the relevant features $X_d$ with $d \in R_{n^*}$ is tested. To this end, a set of values $\mathcal{X}_{n^*d}$ is assigned to each relevant nominal feature $X_d$ with $d \in R^{\mathrm{nom}}_{n^*} \subseteq R_{n^*}$, whereas an interval $[x^{\min}_{n^*d}, x^{\max}_{n^*d}]$ is assigned to each numerical feature $X_d$ with $d \in R^{\mathrm{num}}_{n^*} \subseteq R_{n^*}$. The corresponding query takes the form
$$q_{n^*} = \bigwedge_{d \in R^{\mathrm{nom}}_{n^*}} (X_d \in \mathcal{X}_{n^*d}) \wedge \bigwedge_{d \in R^{\mathrm{num}}_{n^*}} (X_d \in [x^{\min}_{n^*d}, x^{\max}_{n^*d}]). \tag{15}$$
Since a region query generated by AGQ or AGQ+ may cover regions with instances of varying classes, so-called proportion labels are expected as annotations. In a binary classification problem, the proportion label of a region is a number between zero and one: $\Omega_Z = [0, 1]$. It informs about the proportion of positive instances in the region. There are a variety of methods to train classification models using proportion labels. The corresponding research area is called learning with label proportions [90,91].
Haque et al. [43] proposed generalized query-based active learning (GQAL) as another strategy for region queries. It is closely related to AGQ+. The main difference is that GQAL supports multi-class classification problems, whereas AGQ+ is restricted to a binary classification setting.
Rule-induced active learning query: As pointed out in [53], a drawback of AGQ and AGQ+ is the use of synthetic instances drawn from an estimated distribution. If the sampled instances do not reflect the actual distribution, the inferred query fails at defining a critical region within the feature space. For this reason, the AL strategy rule-induced active learning query (RIQY) [53] takes only the observed instances into account. Following the AL strategy AGQ+, the instance $x_{n^*}$ with maximum utility is selected. However, instead of applying exclusively $\phi_{\mathrm{US}}$ (cf. Eq. 6 in the survey) as a utility measure, RIQY combines the uncertainty regarding an instance's class membership with its density and its dissimilarity to previously selected instances. Thus, this utility measure selects less redundant and more representative instances. Subsequently, a set of similar non-annotated instances (neighbors) $N_{x_{n^*}} \subset U$ and a set of dissimilar instances (enemies) $E_{x_{n^*}} \subset U$ are specified with reference to the selected instance $x_{n^*}$. They form the training set of a rule induction classifier, such as a C4.5 decision tree [92], whose task is to generate classification rules separating the instances in the neighbor set $N_{x_{n^*}}$ from the instances in the enemy set $E_{x_{n^*}}$. The learned rules define regions around instance $x_{n^*}$ and represent the possible queries having the same form as the one shown in Eq. 15. Since a rule-based classifier generates multiple rules, a rule selection is performed based on each rule's accuracy and coverage. A rule's accuracy is a proxy for its discriminative power, and its coverage describes how many observed instances are located in the region defined by the rule. The rules with accuracy above a minimum threshold are ranked according to their coverage scores, and the top ones are selected as queries for annotation with proportion labels.
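The final rule selection step can be sketched as follows. The rules are given as hand-written predicates rather than being induced by C4.5, and accuracy is taken to be the fraction of covered instances that are neighbors; both are simplifying assumptions, and all names and data are hypothetical.

```python
def rule_stats(rule, neighbors, enemies):
    """Return (accuracy, coverage) of a rule, where `rule` is a predicate
    that is True for instances inside the region it defines."""
    covered_neighbors = sum(1 for x in neighbors if rule(x))
    covered_enemies = sum(1 for x in enemies if rule(x))
    coverage = covered_neighbors + covered_enemies
    accuracy = covered_neighbors / coverage if coverage else 0.0
    return accuracy, coverage

def select_rules(rules, neighbors, enemies, min_accuracy, top_k):
    """Keep rules reaching min_accuracy, rank them by coverage, return the top_k names."""
    scored = [(name, *rule_stats(rule, neighbors, enemies)) for name, rule in rules.items()]
    kept = [s for s in scored if s[1] >= min_accuracy]
    kept.sort(key=lambda s: -s[2])
    return [name for name, _, _ in kept[:top_k]]

neighbors = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
enemies = [(5.0, 5.0), (6.0, 4.0), (2.5, 1.0)]
rules = {
    "x1 < 3": lambda x: x[0] < 3,
    "x1 < 2.2": lambda x: x[0] < 2.2,
    "x2 < 1.5": lambda x: x[1] < 1.5,
}
print(select_rules(rules, neighbors, enemies, min_accuracy=0.7, top_k=1))
print(select_rules(rules, neighbors, enemies, min_accuracy=0.9, top_k=1))
```

Raising the accuracy threshold trades coverage for purity: the broader rule is chosen under the lenient threshold, the narrower but pure rule under the strict one.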
Hierarchical region queries: Since RIQY and AGQ+ extend the standard US by defining a region around the most uncertain instance, both AL strategies reduce the problem of utility estimation for a region query to utility estimation for an instance query. This way, the quality of the resulting region is not directly considered during the estimation of the query utility. Hence, the constructed region may not be meaningful. In this context, the term meaningful describes two points of view [15].
(1) From a human annotator's view, the region represents a reasonable and interpretable population of possible instances. (2) From the classification model's view, the region covers a large part of the feature space while being pure in the sense that the vast majority of instances in this region belong to the same class.
Several AL strategies aim at defining such regions in a hierarchical manner, namely hierarchical active learning with group proportion feedback (HALG) [14], hierarchical active learning with proportion feedback on regions (HALR) [13], and adaptive hierarchical active learning with proportion feedback on regions (A * HALR) [15]. The AL strategy A * HALR is an advancement of the other two AL strategies, and we summarize its main idea in the following. It constructs a binary hierarchical tree. Each node of this tree represents one region for which an annotator can be queried to provide a proportion label. An exemplary tree is illustrated in Fig 2. The region of the root node covers the entire feature space Ω X , and the first query q 1 = Ω X asks for the proportion label of this feature space. The obtained proportion label z 1 ∈ Ω Z = [0, 1] can be interpreted as the prior probability for the positive class. In the example of Fig. 2, the proportion label z 1 = 0.75 indicates 75% positive instances and 25% negative instances. Subsequently, the root node is divided into two sub-regions through value constraints on one of the features X 1 , . . . , X D . For example, in Fig. 2, two subregions are defined through the value constraints X 2 < 10 and X 2 ≥ 10. Such a split for the region of a query q i is defined according to the standard decision tree splitting based on information gain. However, A * HALR has no access to the class label of each instance in G i ⊆ X denoting the groups of observed instances contained by the region of query q i . For this reason, it relies on two heuristics generating instance-level annotations as proxies of the class labels. On the one hand, an unsupervised heuristic can be applied. In this case, a Gaussian mixture model (GMM) [93] divides the instances G i into two clusters. The computed cluster membership probabilities for one of the two clusters, also known as responsibilities, are used as annotations. 
On the other hand, a supervised heuristic can be used. For this purpose, the positive class membership probabilities predicted by the current classifier for all instances in the group $G_i$ are used as annotations to compute a split. As only one of the two heuristics can be applied as the splitting criterion at a time, A*HALR formulates the selection problem as a two-armed bandit problem [94]. It adapts its policy for the heuristic selection during the entire annotation process.
After each split, the proportion label for one of the two resulting region queries is requested. The example in Fig. 2 shows that an annotator provided the proportion label $z_2 = 0.65$ for the query $q_2$. Based on this information and in combination with the proportion label $z_1$ as annotation of the query $q_1$, the proportion label $z_3$ of the query $q_3$ can be inferred. Generally speaking, let $G_i = G_j \uplus G_k \subseteq \mathcal{X}$ define the partitioning of the instance group of query $q_i$ into the instance groups of the queries $q_j$ and $q_k$, and let the annotations $z_i$ and $z_j$ be known. Then the proportion label for query $q_k$ is inferred through

$$z_k = \frac{|G_i| \, z_i - |G_j| \, z_j}{|G_k|}.$$

Next to the definition of a split, the selection of the region to be split is essential. The potential regions that can be split are given by the leaves of the current hierarchical tree of region queries. After the initial split of query $q_1$ and the annotation of the queries $q_2$ and $q_3$, the region represented by query $q_2$ was selected for splitting in Fig. 2. However, one could also have split the region represented by query $q_3$. To decide which region is to be split, A*HALR computes a region's uncertainty as a proxy of its utility. The uncertainty of a region considers two factors: the label impurity and the number of observed instances enclosed by the region. These factors are combined into a kind of Gini index, which is the product of the region's proportions of positive and negative instances and its number of instances, i.e., $|G_i| \cdot z_i \cdot (1 - z_i)$ for the region of query $q_i$. The steps of region selection, splitting, and annotation are executed consecutively in each AL cycle.
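The inference of a sibling's proportion label and the Gini-like region uncertainty described above can be illustrated with a minimal Python sketch. The function names and the group sizes in the usage note are hypothetical and chosen for illustration only; the sketch is not the implementation of A*HALR [15].

```python
def infer_sibling_proportion(n_parent, z_parent, n_child, z_child):
    """Infer the proportion label of the non-queried child region.

    n_parent, n_child: numbers of observed instances in the parent region
    and in the annotated child region; z_parent, z_child: their proportion
    labels (fractions of positive instances).
    """
    n_sibling = n_parent - n_child
    if n_sibling <= 0:
        raise ValueError("the annotated child must be a proper subset")
    # Positives in the parent minus positives in the annotated child.
    positives = n_parent * z_parent - n_child * z_child
    # Clipping guards against slightly inconsistent (noisy) annotations.
    return min(1.0, max(0.0, positives / n_sibling))


def region_uncertainty(n_instances, z):
    """Gini-like uncertainty: region size times product of class proportions."""
    return n_instances * z * (1.0 - z)
```

For example, with hypothetical group sizes of 100 instances in the root and 60 instances in the region of $q_2$, the proportion labels $z_1 = 0.75$ and $z_2 = 0.65$ yield `infer_sibling_proportion(100, 0.75, 60, 0.65)` of about 0.9 for $q_3$. Under these assumed sizes, the region of $q_2$ has the larger uncertainty (about 13.65 versus 3.6) and would be split next.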

C. Comparison Queries
Comparison queries ask for relative information between multiple instances. The main task concerns the selection of the instances to be compared. For this purpose, a utility measure assesses possible comparison queries. In the following, we discuss the main approaches for measuring the utility of this type of query.
Class-based comparisons of instances: In classification problems with many classes, annotators require much time to assign an instance to a class. In contrast, binary annotations, i.e., Ω Z = {yes, no} answers, are less time-consuming [44,45] and less error-prone [34]. Therefore, several AL strategies aim at reducing the annotation effort and the number of mistakes by querying whether two instances belong to the same class.
Joshi et al. [44,45] proposed one of these AL strategies. Its main idea is to select a non-annotated instance and compare it with an annotated instance whose class label is already known. If an annotator confirms that both instances belong to the same class, the non-annotated instance is added to the set of annotated instances. Otherwise, another annotated instance is selected for comparison with the non-annotated instance. This process is iterated until a match is found. Accordingly, the set of potential queries is given through $\mathcal{Q}_X = \{\{x, x'\} \mid x \in \mathcal{U} \wedge \exists y \in \Omega_Y : (x', y) \in \mathcal{L}\}$. The utilities of the two instances forming a pair are measured separately. The utility of a non-annotated instance is computed as the difference between the negative expected MC (estimated by a cost-sensitive variant of EER) and the expected AC (estimated by the expected number of comparisons to obtain a match). Once the non-annotated instance $x_{n^*}$ with maximum utility has been selected, its class membership probability estimates are sorted. In the first comparison, the annotator has to decide whether the instance $x_{n^*}$ and a randomly picked instance of the class with the highest probability belong to the same class. If they do not match, $x_{n^*}$ is compared to a randomly picked instance of the class with the second-highest probability, and so on. Hence, we can interpret the class membership probability estimates of $x_{n^*}$ as utilities for selecting one of the already annotated instances.
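The comparison procedure described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the implementation of Joshi et al. [44,45]: the callable `same_class_oracle` is a hypothetical stand-in for an annotator comparing the selected instance with a randomly picked annotated instance of the candidate class, and the expected number of comparisons is estimated directly from the sorted class membership probabilities.

```python
def label_by_comparisons(class_probs, same_class_oracle):
    """Query classes in order of decreasing probability until a match occurs.

    class_probs: dict mapping class label -> estimated membership probability.
    same_class_oracle: hypothetical callable; returns True if the annotator
    confirms that the instance matches an annotated instance of that class.
    Returns the obtained label and the number of comparisons needed.
    """
    ranked = sorted(class_probs, key=class_probs.get, reverse=True)
    for n_comparisons, label in enumerate(ranked, start=1):
        if same_class_oracle(label):
            return label, n_comparisons
    raise ValueError("no annotated class matched the instance")


def expected_comparisons(class_probs):
    """Expected number of comparisons if the probability estimates were exact."""
    ranked = sorted(class_probs.values(), reverse=True)
    return sum(rank * p for rank, p in enumerate(ranked, start=1))
```

With the hypothetical estimates `{"cat": 0.5, "dog": 0.3, "bird": 0.2}` and a true class `"dog"`, two comparisons are needed, while the expected number of comparisons under these estimates is 1.7.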
In summary, the strategy of Joshi et al. [44,45] obtains class labels by comparing non-annotated to annotated instances. A single comparison reveals only new class information regarding the non-annotated instance. In contrast, the AL strategies querying pairwise label homogeneity active learning (QHAL) [33] and its advancement pairwise homogeneity based active learning (PHAL) [34] aim at increasing the information content by asking whether two non-annotated instances belong to the same class. Thus, the query set is given by $\mathcal{Q}_X = \{\{x, x'\} \mid x, x' \in \mathcal{U}, x \neq x'\}$. At the start of the annotation process, an ensemble with the semi-supervised graph min-cut learning algorithm [95] as the base learner is constructed. The ensemble's classification models create so-called k-NN graphs from the observed instances for different values of k (cf. Fig. 3). Each ensemble member determines a classification of the instances by partitioning its graph. It minimizes the number of similar pairs of instances belonging to different classes, i.e., partitions. The idea of PHAL is to repeatedly refine the edge weights of these graphs by querying the label homogeneity information regarding two non-annotated instances. If both instances belong to the same class, the edge weight is increased. Otherwise, it is decreased. PHAL selects only pairs of non-annotated instances lying on the max-flow path of a graph in the ensemble. These pairs are more effective (i.e., have a higher utility) in reducing the upper bound of the graph min-cut learning algorithm's misclassification error than a random selection. At the end of each learning cycle, a predefined number of non-annotated instances with the ensemble's highest prediction confidence are added to the set of annotated instances.
PHAL has two drawbacks. First, it considers only binary classification problems. Second, it has a high computational complexity due to the costly execution of the graph min-cut learning algorithm. On the other hand, PHAL is less sensitive to incorrect annotations than traditional AL strategies because erroneous changes of edge weights do not significantly affect the overall model.
Kane et al. [20] developed a rather theoretical strategy. Next to instance queries asking for class labels, it employs comparison queries of the form: "Is instance $x_n$ more likely to belong to the positive/negative class than instance $x_m$?". The corresponding query set is $\mathcal{Q}_X = \mathcal{X} \cup (\mathcal{X} \times \mathcal{X})$. The strategy focuses on the learning of halfspaces [96]. For this class of classification problems and under certain assumptions, the authors prove that the additional use of comparison queries leads to an exponential improvement over traditional AL strategies. Hence, approximately $O(\log N)$ queries are sufficient to reveal all class labels of the $N$ observed instances in $\mathcal{X}$. A similar setting is targeted by Xu et al. [21] and Hopkins et al. [2]. For annotating comparison and instance queries, they also allow certain noise types, e.g., Tsybakov [97] or Massart [98] noise. The theoretical analyses of these strategies point out the benefit of comparison queries since they can substantially reduce the query complexity.
Similarity-based comparisons of instances: In complex tasks, such as predicting a patient's risk of suffering from a particular disease, it is difficult to confidently provide a class label, even for medical experts [99]. Therefore, the strategy active patient risk prediction (ARP) [31] resorts to relative comparisons between patients. For example, given an image created with magnetic resonance imaging (MRI), ARP does not ask whether the corresponding patient will get a particular disease but requests a similarity-based comparison to MRI images of other patients. An exemplary query would be: "Is patient A more similar to patient B than to patient C?". ARP aims to use the resulting similarity information for updating a similarity matrix $S \in \mathbb{R}_{\geq 0}^{N \times N}$ between the patients' medical records. These records represent the observed instances $\mathcal{X}$ from an ML perspective. Since the number of theoretically possible comparisons would grow exponentially with the number of instances, ARP restricts the query set to $\mathcal{Q}_X = \{N_1, \dots, N_N\}$, where $N_n \subset \mathcal{X}$ is the neighborhood set of the n-th instance. These neighborhood sets can be defined according to the k-NN algorithm, for example. An annotation of a selected neighborhood set $N_{n^*}$ induces an ordering of the contained instances regarding their similarities to instance $x_{n^*}$. Such an annotation is used to update the similarity matrix $S$. For this purpose, the linear neighborhood propagation framework [100] is employed. Its idea is to learn the similarity matrix $S$ by minimizing the error of reconstructing each instance from the instances in its neighborhood set. When solving this optimization problem, the obtained annotations are incorporated as relative constraints, e.g., $S[n, m] \geq S[n, l]$ is required if instance $x_m$ is more similar to instance $x_n$ than instance $x_l$. The annotation costs are reduced by selecting the neighborhood set $N_{n^*}$ with the highest utility. This utility is estimated by greedily solving a counting set cover problem [101].
By doing so, ARP aims at determining a minimum set of neighborhood sets whose union covers all observed instances. Based on an initial subset of instances (patients) with already assigned risk predictions, the learned similarity matrix can be used to estimate the risk for the remaining instances (patients). The experiments in [31] showed that ARP performs better than randomly selecting neighborhood sets for annotation and is even competitive with instance queries asking for class labels (e.g., absolute risk estimates). However, it remains unclear how the initial definition of the neighborhood sets affects ARP's performance.
Xiong et al. [30] proposed a strategy related to ARP. It queries relative similarities among samples. However, instead of neighborhood sets, it assumes triplets of instances as queries: $q_{lmn} = (x_l, x_m, x_n) \in \mathcal{Q}_X = \mathcal{X}^3$. We can interpret such a triplet as the question "Is $x_l$ more similar to $x_m$ than to $x_n$?". An annotator returns an answer $z_{lmn} \in \Omega_Z = \{\text{yes}, \text{no}, \text{uncertain}\}$ as annotation. Generally, it is assumed that the annotation depends only on the true class labels $y_l, y_m, y_n$ of the triplet's instances. The utility of an instance triplet is estimated in a probabilistic manner. More specifically, the strategy employs the mutual information criterion [102]. Accordingly, the utility of a query $q_{lmn}$, i.e., an instance triplet, is the degree to which the query's annotation reduces the uncertainty of the three instances' unknown class labels $y_l, y_m, y_n$. Since there are $|\mathcal{Q}_X| = |\mathcal{X}|^3 = N^3$ possible queries, the utility computation is infeasible for a large number $N$ of observed instances. In this case, the strategy randomly samples a subset of queries and selects the query with the highest utility in this set. The annotated queries in $\mathcal{D}$ are used to learn a distance metric [103] for the observed instances $\mathcal{X}$. Subsequently, the instances are partitioned into clusters, e.g., using the k-means clustering algorithm [104]. The obtained cluster assignments are interpreted as proxies of the class labels and serve as training data for an arbitrary classification model. One major issue of this strategy is the information loss if an annotator provides uncertain as an annotation. In this case, the annotation is not used by the distance learning algorithm, and the annotation effort is wasted.

APPENDIX C ANNOTATOR PERFORMANCE MODELS
In this appendix, we analyze concrete real-world AL strategies regarding their annotator performance models. We structure this analysis according to the annotator performance types, i.e., uniform, annotation-dependent, and query-dependent performance, identified in Section IV in the associated survey.
To the best of our knowledge, each of these models assumes instance queries such that we define Q X = X as the query set for this appendix. The interpretation of the annotators' performance random variables P 1 , . . . , P M depends on the concrete models. Without further specification we assume that these variables are binary where P m = 1 indicates an optimal (correct) and P m = 0 a non-optimal (false) annotation.

A. Uniform Annotator Performance
Uniform annotator performance means that the quality of the annotations depends only on an annotator's characteristics. As a result, the goal of the following models is to define an annotator performance function $\psi$ as a proxy for the distributions $\Pr(P_m)$ in case of persistent or $\Pr(P_m \mid t)$ in case of time-varying performances for each annotator $a_m \in \mathcal{A}$.
Annotator models based on interval estimation learning: Donmez et al. [66] proposed IEThresh as one of the first real-world AL strategies considering error-prone annotators. It employs an annotator model based on interval estimation (IE) learning. Originally, IE was developed to address the exploration-exploitation trade-off regarding action selection in reinforcement learning [105]. In our real-world AL setting, we face a similar issue: finding an action yielding the maximum expected reward corresponds to selecting the annotator with the maximum expected performance. For this purpose, IEThresh introduces the reward function $r: \mathcal{A} \rightarrow \{0, 1\}$ as a mapping from the set of annotators to binary values. Given the majority vote annotation $\hat{z}_n \in \Omega_Y$ for instance $x_n$, the reward is one if the annotator $a_m$ agrees with the majority vote annotation and zero otherwise:

$$r(a_m) = \delta(z_{nm} \doteq \hat{z}_n), \quad (20)$$

where $\delta$ returns one if its argument holds and zero otherwise. This reward estimate requires selecting multiple annotators per instance to take the majority vote of their annotations. Based on the reward function, IEThresh defines the quality of an annotator $a_m$ as the upper confidence bound of the expected reward. It is computed according to:

$$\psi(x_n, a_m \mid \omega_{\mathcal{D}}) = \bar{r}_m + t_{\gamma}^{(N_m - 1)} \frac{s_m}{\sqrt{N_m}}, \quad (21)$$

where $\bar{r}_m \in [0, 1]$ is the mean reward of $a_m$, $s_m^2 \in [0, 0.25]$ is the variance of the reward of $a_m$, and $t_{\gamma}^{(N_m - 1)} \in (0, \infty)$ is the $\gamma \in (0.5, 1)$ quantile of the Student's t-distribution whose degrees of freedom are determined by $N_m \in \mathbb{N}$, i.e., the number of instances annotated by $a_m$ (cf. Eq. 9 in the survey). As annotator performance, the reward's upper confidence bound prefers annotators with a high expected reward (first summand in Eq. 21) and/or a high uncertainty in the estimated reward (second summand in Eq. 21). In total, the model requires a data set $\mathcal{D}$ and a value for the hyperparameter $\gamma$ to compute performance values such that the parameters can be defined as $\omega_{\mathcal{D}} = (\mathcal{D}, \gamma)$. The annotator model of IEThresh can also start with a data set $\mathcal{D} = \emptyset$ containing no annotations.
For this purpose, it assumes that each annotator has initially provided one correct and one false annotation. The annotator model is quite robust with respect to the hyperparameter $\gamma$. However, the quality of the majority vote annotations as estimates of the true class labels strongly affects the accuracy of the annotator performance estimates.
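The upper confidence bound used by IEThresh can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions rather than the original implementation of [66]: the quantile of the Student's t-distribution is passed in as the hypothetical argument `t_quantile` to keep the example free of external dependencies, and the population variance is used because it matches the stated range of the reward variance for binary rewards.

```python
import math
import statistics


def reward(annotation, majority_vote):
    """Binary reward: 1 if the annotator agrees with the majority vote."""
    return 1 if annotation == majority_vote else 0


def upper_confidence_bound(rewards, t_quantile):
    """Mean reward plus an uncertainty term shrinking with more annotations.

    rewards: non-empty list of binary rewards of one annotator.
    t_quantile: gamma quantile of the Student's t-distribution for this
    annotator, e.g., obtained via scipy.stats.t.ppf(gamma, df).
    """
    n = len(rewards)
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; variance <= 0.25
    return mean + t_quantile * std / math.sqrt(n)
```

Starting from the pseudo-rewards `[1, 0]` (one correct and one false annotation), an annotator without real annotations obtains a bound of about 1.21 for an assumed quantile of 2.0, whereas an annotator with the rewards `[1, 0, 1, 1, 1]` obtains about 1.16. The less explored annotator is therefore preferred, which reflects the exploration behavior described above.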
The AL strategy IEAdjCost [54] employs a similar annotator model, which considers annotators receiving different payments per annotation. For this purpose, it defines two phases during the annotation process. In the first phase, it explores the annotation qualities, which are estimated according to Eq. 21. This exploration phase can start with zero annotations. At the same time, the annotator model monitors the accuracy of these estimates by computing the confidence interval length for each annotator $a_m \in \mathcal{A}$. The performance value of an annotator $a_m$ is considered well estimated if the confidence interval length is below or equal to a threshold. Whenever a user-defined fraction $\lambda \in (0, 1)$ of the $M$ annotators have well-estimated performance values, the annotator model tries to determine an AC-optimal set of annotators $\mathcal{A}^* \subseteq \mathcal{A}$ with $|\mathcal{A}^*| = \lceil \lambda M \rceil$ (cf. Eq. 6). If the annotators $\mathcal{A}^*$ have a combined performance value above or equal to a threshold $\rho \in [0, 1]$, IEAdjCost enters the second phase by exploiting the performance of the annotators $\mathcal{A}^*$. A major disadvantage of this two-phase annotator model is its many hyperparameters in $\omega_{\mathcal{D}} = (\mathcal{D}, \gamma, \delta, \lambda, R)$.
Probabilistic annotator models: Long et al. [23,24] and Long and Hua [25] proposed probabilistic annotator models based on Gaussian processes. They directly model a global noise term in the annotations. This way, they ensure a certain level of robustness against wrongly annotated instances far from the decision boundary. Furthermore, they define and estimate the performance of each annotator $a_m \in \mathcal{A}$ as the correctness probability across all instances: $\psi(x_n, a_m \mid \omega_{\mathcal{D}}) = \Pr(P_m = 1 \mid \omega_{\mathcal{D}})$.
The parameters $\omega_{\mathcal{D}}$ are learned using the EP algorithm. The previously described annotator models assume persistent annotator performances. In contrast, the annotator model of the AL strategy SFilter [60] drops this assumption. It models the quality of each annotator as a time-varying latent state sequence. For this purpose, it assumes that the change in the annotator performance from one step to the next follows a Gaussian distribution with zero mean and a known variance, which is shared among all annotators. The use of a zero-mean Gaussian distribution avoids any directional bias. As a result, the annotator model can detect increases or decreases in the performance of an annotator. Due to a lack of real-world data with timestamps for each annotation, it is challenging to validate the assumptions of SFilter's annotator model.

B. Annotation-dependent Annotator Performance
Annotation-dependent annotator performance means that the quality of the annotations depends on an annotator's characteristics and the optimal annotation for a query. The following strategies consider class labels as annotations, i.e., $\Omega_Z = \Omega_Y$. As a result, their goal is to define an annotator performance function $\psi$ as a proxy for the distributions $\Pr(P_m \mid Y = y)$ in case of persistent or $\Pr(P_m \mid Y = y, t)$ in case of time-varying performances for each class $y \in \Omega_Y$ and annotator $a_m \in \mathcal{A}$.
Probabilistic annotator models: Many strategies [32,27,35,42] employ an annotator model estimating the probability that an annotator $a_m$ provides the correct annotation for an instance of class $y \in \Omega_Y$, i.e., $\Pr(P_m = 1 \mid Y = y, \omega_{\mathcal{D}})$. These annotator models mainly differ in their approaches for approximating this probability. The annotator model of PMActive [42] determines its parameters in a maximum likelihood approach using the EM algorithm [106]. In the E-step, PMActive estimates the instances' actual class labels in the data set $\mathcal{D}$ and uses them to update the model parameters $\omega_{\mathcal{D}}$ in the M-step. Considering the binary case with $C = 2$ classes, the annotator model of PMActive employs a logistic regression model [107] with the parameters $w \in \mathbb{R}^O$, $O \in \mathbb{N}$, to estimate an instance's true class membership probability through $\sigma(w^T \tilde{x})$, where $\sigma: \mathbb{R} \rightarrow (0, 1)$ denotes the logistic function and $\tilde{x} \in \mathbb{R}^O$ represents a (non-linear) transformation of instance $x$. The computed class membership probabilities are used to determine the distribution $\Pr(Z_m \mid Y = y, \omega_{\mathcal{D}})$, which is modeled as a Bernoulli distribution with the parameter $\mu_{my} \in [0, 1]$. This parameter represents the estimated performance value of the annotator $a_m$ regarding instances of class $y$. Altogether, the parameters to be optimized of this model are given by $\omega_{\mathcal{D}} = (w, \mu_{11}, \mu_{12}, \dots, \mu_{M1}, \mu_{M2})$ in the binary case. The final performance value of an annotator $a_m$ concerning an instance $x_n$ is computed by taking the maximum of these class-dependent performance values: $\psi(x_n, a_m \mid \omega_{\mathcal{D}}) = \max_{y \in \Omega_Y} \mu_{my}$. A problem of this performance computation is that it ignores the unknown true class label $y_n \in \Omega_Y$ of the instance $x_n \in \mathcal{X}$. The annotator model of the strategy GPC-MA [35] overcomes this issue by taking an instance's estimated class membership probabilities into account.
The corresponding performance value of an annotator $a_m$ regarding an instance $x_n$ is given by Eq. 29, which combines the class-dependent performance values with the instance's estimated class membership probabilities. To estimate the required probabilities, GPC-MA performs a fully Bayesian treatment based on a Gaussian process classifier [108]. The model's parameters are determined with the EP algorithm [109]. The overall estimation procedure is computationally intensive.
The annotator model of the strategy Proactive [32] also defines the annotator performance following Eq. 29. However, it resorts to a simpler and more efficient approach compared to the model of GPC-MA. For this purpose, it requires an initial subset $\mathcal{D}_{\text{init}}$ of fully annotated instances (Eq. 30), where $I$ represents the indices of these instances. The model computes the majority vote annotation $\hat{z}_n$ (cf. Eq. 20) for each of these instances as an estimator of the true class label. Then, the performance value of an annotator $a_m$ per class $y$ is computed according to Eq. 31. Despite its efficiency, a disadvantage of this model is the uncertainty in the specification of the data set $\mathcal{D}_{\text{init}}$. In particular, its size $|\mathcal{D}_{\text{init}}|$ is a critical issue. Moreover, once the model has computed the performance values in Eq. 31 on the data set $\mathcal{D}_{\text{init}}$, they are not refined during the following learning cycles. The class membership probabilities in Eq. 29 are obtained from any desired probabilistic classification model $\theta_{\mathcal{D}}$. Following our notation, the parameters of this annotator model are defined as $\omega_{\mathcal{D}} = (\mathcal{D}_{\text{init}}, \theta_{\mathcal{D}})$.
Nguyen et al. [27] proposed a similar annotator model that differs in two aspects. First, it requires expert annotations as estimates of ground truth labels and compares them with the annotations of crowd workers. Second, it incorporates prior terms into the numerator and denominator of Eq. 31. These prior terms are smoothing hyperparameters. Taking a Bayesian view, we can interpret them as prior counts.
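The estimation of class-dependent performance values on an initial fully annotated data set can be sketched as follows. Since the exact form of Eq. 31 is not restated here, the sketch assumes a simple agreement ratio per class; the function names and the smoothing arguments `prior_correct` and `prior_total` are hypothetical and only mimic the prior counts mentioned above. For simplicity, the sketch uses majority vote annotations as reference labels throughout, whereas the model of Nguyen et al. [27] relies on expert annotations.

```python
from collections import Counter


def majority_vote(annotations):
    """Most frequent annotation (ties resolved by first occurrence)."""
    return Counter(annotations).most_common(1)[0][0]


def class_dependent_performance(init_annotations, annotator, classes,
                                prior_correct=0.0, prior_total=0.0):
    """Per-class agreement rate of one annotator with the majority vote.

    init_annotations: list of dicts mapping annotator id -> class label,
    one dict per fully annotated instance.
    """
    agree, total = Counter(), Counter()
    for row in init_annotations:
        y_hat = majority_vote(list(row.values()))
        total[y_hat] += 1
        if row[annotator] == y_hat:
            agree[y_hat] += 1
    performance = {}
    for y in classes:
        denominator = total[y] + prior_total
        # Without data and priors, the performance for a class is undefined.
        performance[y] = ((agree[y] + prior_correct) / denominator
                          if denominator > 0 else float("nan"))
    return performance
```

For instance, if annotator `"a1"` agrees with the majority vote on both instances voted as class 1 but only on one of two instances voted as class 2, the unsmoothed performance values are 1.0 and 0.5. With the assumed prior counts `prior_correct=1` and `prior_total=2`, they become 0.75 and 0.5.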

C. Query-dependent Annotator Performance
Query-dependent annotator performance means that the quality of the annotations depends on an annotator's characteristics, a query, and optionally the optimal annotation for this query. As previously mentioned, the following annotator models consider only instance queries, i.e., $\mathcal{Q}_X = \mathcal{X}$. As a result, their goal is to define an annotator performance function $\psi$ as a proxy for the distributions $\Pr(P_m \mid Z = z, X = x_n)$ in case of persistent or $\Pr(P_m \mid Z = z, X = x_n, t)$ in case of time-varying performances for each observed instance $x_n \in \mathcal{X}$, the optimal annotation $z \in \Omega_Z$, and annotator $a_m \in \mathcal{A}$.
Annotator models using self-assessments: Many annotator models rely on annotators' self-assessments to estimate their performances. Such a self-assessment can be a confidence score or a confirmation of a lack of knowledge.
Several real-world AL strategies [28,29,36,55,56] allow the annotators to provide uncertain as an alternative annotation to class labels. Accordingly, these strategies define $\Omega_Z = \Omega_Y \cup \{\text{uncertain}\}$ as the set of possible annotations. The idea is to query an annotator only for instances for which this annotator will not provide uncertain as an annotation. For this purpose, the corresponding annotator models define the annotator performance as the probability of obtaining a class label instead of uncertain (Eq. 32). This probability is estimated by defining a binary classification problem where instances annotated with a class label $y \in \Omega_Y$ and the ones annotated with uncertain belong to separate classes. The annotator models differ in their approaches for solving this classification problem. Fang and Zhu [36] proposed the strategy EIAL, whose annotator model uses the diverse density concept [110] to transform instances into a new feature space before a user-defined classification model is trained. Zhong et al. [29] with their strategy ALCU-SVM employ an SVM, and Käding et al. [28] with their strategy GP-EMOC PDE+R employ Gaussian processes to solve the binary classification problem. Donmez and Carbonell [55,56] proposed an annotator model employing k-means clustering [104] to estimate the probability in Eq. 32. An annotator is queried to annotate the k instances closest to the respective k cluster centroids. If an annotator provides a class label, then the belief of obtaining a class label is propagated to nearby instances within this cluster. Otherwise, the belief of getting uncertain as annotation is propagated analogously.
In another scenario, Donmez and Carbonell [55,56] extended their model toward numerical confidence scores by propagating obtained confidence scores within a cluster. Ni and Ling [47] proposed the strategy BMO. Its annotator model also expects numerical confidence scores as annotations to estimate the annotators' performances. It assumes binary class labels $\Omega_Y = \{1, 2\}$ including confidence scores $\Omega_C = [0.5, 1.0]$ as annotations, i.e., $\Omega_Z = \Omega_Y \times \Omega_C$. Its annotator model estimates the annotators' confidence scores as proxies of their performances. The model is based on a k-NN approach and trained on the confidence scores of each annotator. It predicts the annotator performance according to Eq. 33, where $c_{om} \in \Omega_C$ denotes the confidence score of annotator $a_m$ for instance $x_o$ and $N_m^k(x_n) \subset \mathcal{X}$ contains the k-NN of instance $x_n$ among the instances annotated by the annotator $a_m$. Correspondingly, the numerator of Eq. 33 represents the average confidence of annotator $a_m$ for the k-NN of instance $x_n$. The denominator is the average distance to these neighbors, where the addition of one prevents dividing by zero.
Annotator models relying on annotators' self-assessments make the central assumption that these assessments are meaningful. However, there are settings in which humans fail at assessing their capabilities [111]. In such settings, the previously discussed annotator models provide unreliable annotator performance estimates.
Probabilistic annotator models: Du and Ling [62] proposed a simple annotator model as part of their strategy active learning algorithm with human-like noisy oracle (AL-HO). Considering only a single annotator, the goal is to train a classification model behaving similarly to the annotator. Accordingly, AL-HO expects the classification model to output probabilities that represent the single annotator's performance. In the case of a linear relationship between annotator and classification model, the annotator performance can be directly expressed through the classification model's most confident prediction:

$$\psi(x_n, a_m \mid \omega_{\mathcal{D}}) = \max_{y \in \Omega_Y} \Pr(Y = y \mid X = x_n, \theta_{\mathcal{D}}). \quad (34)$$

Accordingly, the annotator model is equal to the classification model, i.e., $\omega_{\mathcal{D}} = \theta_{\mathcal{D}}$. The primary issue of this model is its restriction to the single annotator scenario.
Zhao et al. [39] developed an annotator model that considers the expertise of an annotator and the difficulty of annotating an instance. The expertise of an annotator $a_m$ is represented through a real-valued number $e_m \in \mathbb{R}$. As the expertise $e_m$ approaches $+\infty$, the performance of annotator $a_m$ increases. As the expertise $e_m$ approaches $-\infty$, the annotator $a_m$ becomes malicious and provides incorrect annotations on purpose. In the case $e_m = 0$, the annotator $a_m$ randomly guesses annotations. The difficulty of annotating an instance $x_n$ is denoted by $d_n \in (0, +\infty)$. For simple instances, the difficulty $d_n$ approaches zero, whereas it converges to $+\infty$ for complicated instances. In the case of a binary classification problem, the model computes the performance according to Eq. 35, where the parameters $\omega_{\mathcal{D}} = (e_1, \dots, e_M, d_1, \dots, d_N)$ are estimated in a maximum likelihood approach using the EM algorithm. In the E-step, the instances' true class labels are estimated. Subsequently, these estimates are used to update the current estimates of the annotators' expertise and the instances' difficulties. The application of this annotator model is restricted to problems where each instance has at least one assigned annotation. Another issue of this model is that an instance's level of difficulty is not subjective but identical for each annotator. Wallace et al. [112] also adopt this modeling approach. However, they do not directly estimate the instances' difficulties and the annotators' expertise but assume that the annotators' salaries are rough proxies of their expertise.
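The interplay of expertise and difficulty can be illustrated with a small Python sketch. As Eq. 35 is not restated here, the sketch assumes one functional form that is consistent with the limiting cases described above, namely a logistic function applied to the ratio of expertise and difficulty. This form and the function names are our assumptions for illustration and are not necessarily identical to the model of Zhao et al. [39].

```python
import math


def logistic(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def correctness_probability(expertise, difficulty):
    """Assumed form: sigma(e_m / d_n) with e_m real-valued and d_n > 0."""
    if difficulty <= 0:
        raise ValueError("difficulty must be positive")
    return logistic(expertise / difficulty)
```

Under this assumed form, an annotator with zero expertise guesses randomly (probability 0.5), an expert annotating a simple instance is almost always correct, a very difficult instance pushes every annotator toward random guessing, and a strongly negative expertise yields almost always incorrect annotations.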
The annotator model proposed by Yan et al. [49,50] drops such assumptions. It posits that the annotation $z_{nm}$ provided by an annotator $a_m$ depends on the instance $x_n$ and its actual but unknown class label $y_n$. To model this behavior, it uses a Bernoulli distribution for binary classification:

$$\Pr(Z_m = z_{nm} \mid X = x_n, Y = y_n, \omega_{\mathcal{D}}) = \sigma(w_m^T x_n)^{\delta(z_{nm} \doteq y_n)} (1 - \sigma(w_m^T x_n))^{1 - \delta(z_{nm} \doteq y_n)}, \quad (36)$$

where $w_m \in \mathbb{R}^O$ are the parameters of a logistic regression model for annotator $a_m$. This model estimates the probability of obtaining the true, i.e., $P_m = 1$, or false, i.e., $P_m = 0$, class label as annotation. As a result, the annotator performance is defined through:

$$\psi(x_n, a_m \mid \omega_{\mathcal{D}}) = \Pr(P_m = 1 \mid x_n, \omega_{\mathcal{D}}) = \sigma(w_m^T x_n). \quad (37)$$

All parameters $\omega_{\mathcal{D}} = (w_1, \dots, w_M)$ of this annotator model are determined in a maximum likelihood approach using the EM algorithm, where the true class labels are estimated in the E-step and used to update the model parameters in the M-step.
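The instance-dependent performance in Eq. 37 and the corresponding Bernoulli likelihood of an annotation can be sketched as follows. The function names are hypothetical, instances and weights are plain Python lists, and the EM-based parameter estimation of Yan et al. [49,50] is deliberately omitted, so the weights are simply assumed to be given.

```python
import math


def logistic(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def annotator_performance(w_m, x_n):
    """Probability of a correct annotation: sigma(w_m^T x_n)."""
    return logistic(sum(w * x for w, x in zip(w_m, x_n)))


def annotation_likelihood(z_nm, y_n, w_m, x_n):
    """Bernoulli likelihood of annotation z_nm given the true label y_n."""
    p_correct = annotator_performance(w_m, x_n)
    return p_correct if z_nm == y_n else 1.0 - p_correct
```

For the assumed weights `[2.0, -1.0]` and the instance `[1.0, 1.0]`, the performance is the logistic function evaluated at 1, i.e., about 0.73, and the likelihoods of a correct and a false annotation sum to one.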
The annotator models proposed by Fang et al. [37,38,48] make similar assumptions to the one of Yan et al. [49,50]. However, they additionally introduce latent topics, e.g., sports, entertainment, and politics in the case of a text classification problem. Based on them, they estimate the expertise for each pair of annotator and topic. Since an instance may belong to one or multiple topics, the annotator model estimates the instances' membership degrees regarding the different topics. The latent topics and associated membership degrees are learned through a GMM [93]. Again, the EM algorithm is employed to learn the annotator model's parameters. The annotator model of the strategy self-taught active learning (STAL) [48] additionally considers collaborations between annotators. For this purpose, the estimated best annotator teaches the estimated worst one regarding the annotation of a specific instance. The learning process of the worst annotator is then modeled by simulating that the worst annotator provided the same annotation as the best one. In this case, the annotator performance is time-varying because the annotators can get better throughout the annotation process. Finding an appropriate representation of the aforesaid topics is often difficult. Therefore, the annotator model of Fang et al. [37,38] additionally exploits transfer learning. Its goal is to find common latent topics minimizing the divergence between the target domain and a related domain.
Yang et al. [18] proposed the learning-from-targeted crowds (LFTC) model, which also works with latent topics. However, it does not use a pre-trained GMM for this purpose but learns the topics during training. The LFTC model describes the annotation $z_{nm}$ of an annotator $a_m$ as a function of the instance $x_n$ and its true (but unknown) class label $y_n$ through a Bernoulli distribution:
$$\Pr(Z_m = z_{nm} \mid X = x_n, Y = y_n, \omega_D) = \sigma(w_m^T F x_n)^{\delta(z_{nm} \doteq y_n)} \left(1 - \sigma(w_m^T F x_n)\right)^{1 - \delta(z_{nm} \doteq y_n)},$$
where $F \in \mathbb{R}^{L \times O}$ denotes a matrix with $L \in \mathbb{N}$, $L \ll \min(n, m)$, and $w_m \in \mathbb{R}^L$ is an annotator-dependent vector. We can interpret the product $F x_n$ as a representation of instance $x_n$ through $L$ latent topics and the vector $w_m$ as an embedding of the expertise of annotator $a_m$ regarding these topics. The parameters $\omega_D = (F, w_1, \dots, w_M)$ of the LFTC model are learned iteratively using the EM algorithm. Yang et al. have shown that their LFTC model also scales toward deep learning applications.
The previously discussed probabilistic annotator models do not consider the annotator model's epistemic uncertainty regarding its performance estimates. Therefore, Herde et al. [1] proposed the Beta annotator model (BAM) as part of their multi-annotator probabilistic active learning (MaPAL) strategy. For each annotator, it solves a binary classification problem with correct and false annotation as possible classes. These class labels are unknown and have to be estimated. For this, BAM trains a classifier with the annotations of the annotators $\mathcal{A} \setminus \{a_m\}$ to assess the correctness of the annotations of annotator $a_m$. This way, any bias toward one annotator is avoided [35]. The performance of annotator $a_m$ is subsequently modeled through a Beta distribution:
$$\Pr(P_m \mid X = x_n, \omega_D) = \mathrm{Beta}(P_m \mid f_{nm} + \beta),$$
where the random variable $P_m \in [0, 1]$ denotes the annotation accuracy, $f_{nm} = (f_{nm1}, f_{nm2})^T \in \mathbb{R}^2_{\geq 0}$ are kernel frequency estimates, and $\beta = (\beta_1, \beta_2)^T \in \mathbb{R}^2_{>0}$ are hyperparameters.
The kernel frequency estimates $f_{nm}$ represent the estimated numbers of false and true annotations given by annotator $a_m$ in the neighborhood of instance $x_n$. Using a kernel function to quantify neighborhood relations between instances, they are estimated through a Parzen window approach [113]. Accordingly, we can interpret the vector $\beta$ as prior observations of false and true annotations. Finally, BAM defines the annotator performance as the Beta distribution's expectation:
$$\psi(x_n, a_m \mid \omega_D) = \mathbb{E}\left[\mathrm{Beta}(P_m \mid f_{nm} + \beta)\right]. \quad (40)$$
This way, BAM allows the direct incorporation of prior knowledge about the annotators and additionally considers the uncertainty in the performance estimates. BAM's disadvantages are its high computational complexity and its restriction to the kernel-based Parzen window approach.
Static annotator models: Similar to the annotator model of Moon and Carbonell [32], the model used by the strategy CEAL [19] requires an initial fully annotated data set $D_{\text{init}}$ according to Eq. 30. Additionally, it expects a matrix $S \in \mathbb{R}^{N \times N}_{\geq 0}$ of similarities between all pairs of observed instances. Given these preliminaries, it computes the performance of annotator $a_m$ regarding instance $x_n$ over $N^k_{x_n, D_{\text{init}}}$, the $k$ nearest neighbors ($k$-NN) of the instance $x_n$ in the set $D_{\text{init}}$. Correspondingly, we can interpret the annotator performance as the similarity-weighted average number of annotations agreeing with the majority vote annotations $\hat{z}_{n'}$ (cf. Eq. 20). Chakraborty [3] proposed a similar annotator model. Instead of a $k$-NN approach, it trains one logistic regression model per annotator based on the initial fully annotated data $D_{\text{init}}$. Each of these models solves a binary classification problem where agreement and disagreement with the majority vote annotation represent the two classes to be distinguished.
Due to the requirement for an initial fully annotated data set $D_{\text{init}}$, the disadvantages of both annotator models are related to those discussed for the annotator model of Moon and Carbonell [32]. In particular, these models are static such that their performance estimates do not change during the annotation process.
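The k-NN-based static estimate described for CEAL can be sketched as follows. This is an illustrative reading of the verbal description (a similarity-weighted share of agreements with the majority vote among the k most similar instances of the initial set), not the exact formula of the original strategy; the function name, the normalization by the sum of similarities, and the toy data are assumptions.

```python
def knn_agreement_performance(n, m, S, Z, z_hat, init_idx, k):
    """Similarity-weighted share of annotator m's annotations that agree
    with the majority vote among the k instances of the initial set
    that are most similar to instance n.

    S        -- similarity matrix, S[n][j] >= 0
    Z        -- Z[j][m] is the annotation of annotator m for instance j
    z_hat    -- z_hat[j] is the majority vote annotation for instance j
    init_idx -- indices of the initially fully annotated instances
    """
    neighbors = sorted(init_idx, key=lambda j: S[n][j], reverse=True)[:k]
    total = sum(S[n][j] for j in neighbors)
    if total == 0:
        return 0.0  # assumption: no similar neighbor, no evidence
    agree = sum(S[n][j] for j in neighbors if Z[j][m] == z_hat[j])
    return agree / total


# Hypothetical toy data: instance 0 is new, instances 1-3 form D_init.
S = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.4, 0.2],
    [0.5, 0.4, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
Z = [None, [1, 0], [0, 0], [1, 1]]  # two annotators
z_hat = [None, 1, 0, 1]             # majority vote annotations
p_0 = knn_agreement_performance(0, 0, S, Z, z_hat, [1, 2, 3], k=2)
p_1 = knn_agreement_performance(0, 1, S, Z, z_hat, [1, 2, 3], k=2)
```

Since the estimate only uses $D_{\text{init}}$, it stays fixed for the whole annotation process, which is exactly the static behavior noted above.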

APPENDIX D SELECTION ALGORITHMS
In this appendix, we analyze concrete real-world AL strategies regarding their selection of query-annotator pairs. We structure this analysis according to sequential and joint selection algorithms identified in Section VI in the associated survey.

A. Sequential Selection of Queries and Annotators
Sequential selection of queries and annotators is made in two steps. In the first step, one or multiple queries are selected, and corresponding annotators are assigned in the second step.
As a result, annotating the selected query $q_{l^*}$ is expected to be most beneficial for the training of the classification model. Subsequently, several AL strategies [29,35,37,38,42,47] present the selected query $q_{l^*}$ to a single annotator $a_{(l^*)}$ whose estimated performance regarding this query is maximum compared to the remaining annotators:
$$a_{(l^*)} = \arg\max_{a \in \mathcal{A}} \psi(q_{l^*}, a \mid \omega_D).$$
If the annotator performance estimates are inaccurate, the actual best annotator may be ignored. Moreover, querying the same annotator too often can bias the annotator performance estimates [35]. In particular, performance estimates regarding annotators who have been queried only a few times are often unreliable [66]. To resolve these issues, Zhao et al. [39] proposed two alternative annotator selection algorithms. The first algorithm chooses an annotator with a probability proportional to the respective performance estimate:
$$\Pr(A = a_m \mid Q = q_{l^*}) = \frac{\psi(q_{l^*}, a_m \mid \omega_D)}{\sum_{a \in \mathcal{A}} \psi(q_{l^*}, a \mid \omega_D)},$$
where $\psi(q_l, a \mid \omega_D) \geq 0$ is required and $A$ denotes the random variable defined over the annotator set $\mathcal{A}$. The second algorithm is inspired by the $\epsilon$-greedy algorithm used for the multi-armed bandit problem in reinforcement learning [114]. Here, the probability of selecting an annotator $a_m$ depends on a hyperparameter $\epsilon \in [0, 1)$ that controls the exploration-exploitation trade-off. The estimated best annotator is selected to exploit the knowledge about the annotators' performances, whereas another random annotator is picked for exploration.
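The three annotator selection rules discussed above (greedy, performance-proportional, and $\epsilon$-greedy) can be contrasted in a short sketch. The function names and performance values are hypothetical. In particular, the $\epsilon$-greedy variant shown here explores by drawing uniformly among the annotators other than the estimated best one, which is an assumption and not necessarily the exact rule of Zhao et al. [39].

```python
import random


def select_best(perf):
    """Greedy rule: annotator with the maximum estimated performance."""
    return max(perf, key=perf.get)


def select_proportional(perf, rng):
    """Draw an annotator with probability proportional to its estimate."""
    names = list(perf)
    weights = [perf[a] for a in names]
    if min(weights) < 0 or sum(weights) <= 0:
        raise ValueError("estimates must be non-negative and not all zero")
    return rng.choices(names, weights=weights, k=1)[0]


def select_epsilon_greedy(perf, epsilon, rng):
    """Exploit the estimated best annotator with probability 1 - epsilon.

    Assumption: exploration draws uniformly among the other annotators.
    """
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must lie in [0, 1)")
    best = select_best(perf)
    others = [a for a in perf if a != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return best


# Hypothetical performance estimates psi(q_l*, a | omega_D).
perf = {"a_1": 0.9, "a_2": 0.6, "a_3": 0.0}
rng = random.Random(0)
greedy_choice = select_best(perf)
sampled = [select_proportional(perf, rng) for _ in range(200)]
explored = [select_epsilon_greedy(perf, 0.3, rng) for _ in range(200)]
```

Under the proportional rule, an annotator with an estimate of zero is never drawn, whereas the $\epsilon$-greedy rule as sketched here can still explore such an annotator.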
Selecting only a single annotator per iteration cycle, i.e., $|\mathcal{A}_{l^*}| = 1$, can be disadvantageous. In particular, in the initial learning phase, retraining the annotator model and the classification model with partially false annotations leads to unreliable estimates of annotator performances and query utilities in subsequent iteration cycles. To overcome this issue, querying a user-defined number of annotators per query represents a more reliable alternative [23,24,25] at the expense of increased AC per query. However, this number is a fixed hyperparameter during the entire annotation process. Thus, the number of selected annotators is independent of their performance estimates. To obtain a more adaptive selection of multiple annotators, Donmez et al. [66] proposed a threshold-based selection as part of their strategy IEThresh. It specifies the annotator set according to
$$\mathcal{A}_{l^*} = \{a \in \mathcal{A} \mid \psi(q_{l^*}, a \mid \omega_D) \geq \rho \cdot \psi(q_{l^*}, a_{(l^*)} \mid \omega_D)\}, \quad (48)$$
where $\rho \in [0, 1]$ is a hyperparameter. It specifies the minimum performance an annotator needs to be selected, relative to the performance of the estimated best annotator $a_{(l^*)}$. Although the number of selected annotators is adaptive, an appropriate value of $\rho$ needs to be determined regarding the characteristics of the classification problem and the available annotators.
The strategy IEAdjCost of Zheng et al. [54] uses different annotator selection algorithms for different stages of the annotation process. In the initial phase, the annotator selection is performed similarly to IEThresh. The idea is to explore the performances of the annotators in this stage. Once these performance estimates have been sufficiently explored, IEAdjCost switches the annotator selection algorithm to exploit the gained knowledge about the annotators. In this exploitation phase, the strategy determines a fixed subset of annotators reaching a combined accuracy above a user-defined threshold while minimizing the AC (cf. Eq. 6).
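The threshold rule of Eq. 48 can be illustrated with a few lines of code. The sketch below takes the performance estimates as given; the function name and the example values are hypothetical.

```python
def select_by_threshold(perf, rho):
    """Annotators whose estimate reaches rho times the best estimate (Eq. 48)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    best = max(perf.values())
    return {a for a, p in perf.items() if p >= rho * best}


# Hypothetical performance estimates for the selected query.
perf = {"a_1": 0.9, "a_2": 0.8, "a_3": 0.5}
selected = select_by_threshold(perf, rho=0.85)  # threshold: 0.85 * 0.9 = 0.765
```

With $\rho = 1$, only annotators tied with the estimated best one are queried, while $\rho = 0$ queries all annotators, so $\rho$ directly trades annotation cost against redundancy.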
None of the above-described selection algorithms considers any form of collaboration among the annotators. However, Chang et al. [115] showed that collaboration could lead to improved annotations. The strategy STAL of Fang et al. [48] selects the estimated worst annotator in addition to the estimated best one. Instead of querying the two annotators independently, the best annotator chooses an annotation and explains this decision to the worst annotator. This way, the worst annotator can gain new knowledge to enhance their performance. While it would be an option to propagate the best annotator's knowledge to more than one annotator, the pairwise collaboration ensures that erroneous knowledge is not propagated among too many annotators, which would eventually deteriorate the annotation process.
Another important aspect when selecting annotators concerns their annotation workloads. For example, assigning too many queries to the same annotator may decelerate the annotation process since this annotator has to process the queries sequentially. As a solution, Wallace et al. [51] explicitly model the workload across multiple annotators as part of their strategy MEAL. They use a categorical distribution representing either a preference for uniform workloads among the annotators or another objective. From this distribution, annotators are then drawn and assigned to respective queries.
Batch of queries: Selecting only a single query during each AL iteration cycle is likely to have no significant impact on deep learning models' performances because of their local optimization methods [116]. Therefore, the strategy deep active learning from targeted crowds (DALC) proposed by Yang et al. [18] selects a batch with a user-defined number of queries having the highest utilities per iteration cycle. Subsequently, it assigns each of these queries to the annotator with the respective highest estimated performance. As this selection algorithm does not consider query diversity, the queries may request redundant learning information leading to worse performance than random sampling [117].
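The batch selection described above can be sketched as a top-k selection by utility followed by a per-query annotator assignment. This is a simplified illustration and not the DALC implementation; the function name, the dictionary-based data layout, and the values are assumptions.

```python
def select_batch(utilities, perf, batch_size):
    """Pick the batch_size queries with the highest utilities and assign
    each one to the annotator with the highest estimated performance.

    utilities -- maps query -> utility
    perf      -- maps query -> {annotator -> estimated performance}
    """
    ranked = sorted(utilities, key=utilities.get, reverse=True)[:batch_size]
    return [(q, max(perf[q], key=perf[q].get)) for q in ranked]


# Hypothetical utilities and query-dependent performance estimates.
utilities = {"q_1": 0.2, "q_2": 0.9, "q_3": 0.7}
perf = {
    "q_1": {"a_1": 0.9, "a_2": 0.1},
    "q_2": {"a_1": 0.4, "a_2": 0.8},
    "q_3": {"a_1": 0.7, "a_2": 0.6},
}
batch = select_batch(utilities, perf, batch_size=2)
```

The ranking uses the utilities only, so two nearly identical queries with high utilities would both enter the batch, which is the redundancy issue mentioned above.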

B. Joint Selection of Queries and Annotators
Selecting queries without considering the performances of the available annotators can result in low-quality annotations because there is no guarantee that at least one annotator has a sufficient performance regarding the selected query [29]. This problem can be resolved by applying a selection strategy that jointly selects queries and annotators.
Single query: For this purpose, the query utility measure $\phi$ and the annotator performance measure $\psi$ are combined appropriately. Taking the product of both represents a simple but effective combination [19,32,55,56]. Accordingly, the query-annotator pair maximizing this product is selected. This selection balances the trade-off between query utility and annotator performance by ensuring that both need to be high.
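The product-based joint selection can be sketched as an exhaustive search over all query-annotator pairs. The function name, data layout, and values are hypothetical, and the utilities and performance estimates are assumed to be given.

```python
def select_pair(utilities, perf):
    """Return the (query, annotator) pair maximizing utility * performance.

    utilities -- maps query -> utility
    perf      -- maps query -> {annotator -> estimated performance}
    """
    pairs = ((q, a) for q in utilities for a in perf[q])
    return max(pairs, key=lambda qa: utilities[qa[0]] * perf[qa[0]][qa[1]])


# Hypothetical values: q_1 has the higher utility, but no annotator
# is estimated to answer it reliably.
utilities = {"q_1": 0.9, "q_2": 0.6}
perf = {
    "q_1": {"a_1": 0.5, "a_2": 0.4},
    "q_2": {"a_1": 0.95, "a_2": 0.6},
}
pair = select_pair(utilities, perf)
```

In this example, a sequential selection would pick `q_1` first because of its utility, whereas the joint criterion prefers `q_2` with annotator `a_1` (product 0.57 versus 0.45).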
There are more advanced approaches to combine annotator performance and query utility [1,3,49,50]. All of them focus on instance queries and class labels as annotations.
Yan et al. [50] proposed a selection algorithm picking an instance-annotator pair by solving a linearly constrained, biconvex optimization problem with the BFGS [118] algorithm. As a solution, one obtains an instance including annotator importance values. This instance is not guaranteed to be in the set $X$ of observed instances. Therefore, Yan et al. resort to selecting the observed instance closest to the optimal one and the annotator with the respective highest performance.
The strategy ML+CI* of Yan et al. [49] estimates the information that an annotator's class label provides regarding an instance's true but unknown class membership. For this, mutual information [102] is employed as a criterion for selecting an instance-annotator pair.
The previous two selection algorithms employed US as instance utility criterion. In contrast, Nguyen et al. [27] and Herde et al. [1] compute the classification model's expected performance gain when obtaining an instance's class label from a certain annotator. In this context, Nguyen et al. [27] distinguish only between two groups of annotators: error-prone crowd workers and omniscient experts. For crowd workers, the expected performance gain is computed on the set of non-annotated instances. This performance gain directly considers the estimated accuracy of the crowd workers' majority vote annotation for an instance. For the experts, the expected performance gain is computed on the set of instances already annotated by the crowd workers. The idea is to query experts only for re-annotating instances that the crowd workers likely assigned to the wrong class. With their strategy MaPAL, Herde et al. [1] advance the performance gain computation by explicitly distinguishing between all individual annotators instead of groups. Based on the probabilistic active learning framework [119], MaPAL selects the instance-annotator pair maximizing the classification model's probabilistic performance gain. This selection is non-myopic because it simulates an instance's annotations from multiple annotators. However, such a look-ahead increases the computational complexity of the instance-annotator pair selection.
Batch of queries: All of these joint selection algorithms select only a single query-annotator pair during each iteration, whereas Chakraborty [3] allows for a batch selection, i.e., S > 1. The corresponding selection algorithm solves an optimization problem finding a trade-off between annotator performances, instance query utilities, and redundancies between selected instance queries. The redundancies are considered by incorporating cosine similarity measurements between instances into the objective function. This way, a batch of instances with low similarities to each other will be selected. Chakraborty has shown that the optimization problem can be formulated as an equivalent linear programming problem.