Learning from Multiple Expert Annotators for Enhancing Anomaly Detection in Medical Image Analysis

Building an accurate computer-aided diagnosis system based on data-driven approaches requires a large amount of high-quality labeled data. In medical imaging analysis, multiple expert annotators often produce subjective estimates of the "ground truth labels" during the annotation process, depending on their expertise and experience. As a result, the labeled data may contain a variety of human biases and a high rate of disagreement among annotators, which significantly affects the performance of supervised machine learning algorithms. To tackle this challenge, we propose a simple yet effective approach that combines annotations from multiple radiology experts to train a deep learning-based detector for abnormalities on medical scans. The proposed method first estimates the ground truth annotations and confidence scores of the training examples. The estimated annotations and their scores are then used to train a deep learning detector with a re-weighted loss function to localize abnormal findings. We conduct an extensive experimental evaluation of the proposed approach on both simulated and real-world medical imaging datasets. The experimental results show that our approach significantly outperforms baseline approaches that do not consider the disagreements among annotators, including methods in which all of the noisy annotations are treated equally as ground truth and an ensemble of models trained on the label sets provided separately by each annotator.


Introduction
Computer-aided diagnosis (CAD) systems for medical imaging analysis are becoming increasingly successful thanks to the availability of large-scale labeled datasets and advances in supervised learning algorithms [1,2]. To reach expert-level performance, those algorithms usually require high-quality label sets, which are commonly scarce because of the costly and intensive labeling procedures. A typical label collection process in medical imaging is "repeated-labeling", where multiple clinical experts annotate each data instance to overcome human biases [3,4,5]. However, because annotators differ in their biases and proficiency, annotations from the repeated-labeling process often suffer from high inter-reader variability [6,7,8], which could reduce learning performance if we treat them as ground truth.
Many prior works aim to mitigate inter-reader variation in annotations. They can be categorized into two main groups: (i) one-stage approaches and (ii) two-stage approaches. The first group learns the model, the annotators' proficiency, and the latent true labels jointly. The second group first estimates the true label of each instance from its multiple label sets [9], a process known as "truth inference", and then trains a supervised learning model on the estimated true labels. These approaches show impressive results on both classification and segmentation problems [10,11].
This work aims at addressing a fundamental question: "How to train a deep learning-based detector effectively from a set of possibly noisy labeled data provided by multiple annotators?" [12]. To this end, we introduce a novel approach that learns from multiple expert annotators to improve the performance of a deep neural network in detecting abnormalities from chest X-ray images. The proposed approach, as visualized in Figure 1, consists of two stages. The first stage is truth inference, which uses the Weighted Boxes Fusion (WBF) algorithm [13] to estimate the true labels and their confidence scores. The second stage trains an object detector on the estimated labels with a re-weighted loss function that uses the implicit annotators' agreement, represented by the estimated confidence scores. For evaluation, we first simulate and test the proposed approach on a multiple-experts-detection dataset derived from MNIST [14], called MED-MNIST. We then validate our approach on a real-world chest X-ray dataset with radiologists' annotations. Experiments in both scenarios demonstrate that the proposed approach provides better detection performance in terms of mAP scores than the baseline of treating multiple annotations as ground truth and the ensemble of models supervised by individual expert annotations.
In summary, our main contributions in this work are two-fold:
• First, we introduce a simple yet effective method that allows a deep learning network to learn from multiple annotators to improve its performance in detecting abnormalities from medical images. The proposed approach estimates the true annotations from multiple experts together with confidence scores and uses these annotations to train a deep learning-based detector. This helps remove uncertainty in the learning process and provides higher label quality to train predictive models.
• Second, the proposed approach demonstrates its effectiveness on both simulated and real medical imaging datasets by surpassing current state-of-the-art methods in the context of learning with multiple annotators. In particular, our method is simple and can be applied to a wide range of applications in medical imaging and object detection in general. The code used in the experiments is available on our GitHub page at https://github.com/huyhieupham/learning-from-multiple-annotators. We have also made the dataset used in this study available for public access on our project's webpage at https://vindr.ai/datasets/cxr.
The rest of the paper is organized as follows. Related works on learning from multiple annotators and weighted training techniques are reviewed in Section 2. Section 3 presents the details of the proposed method with a focus on how to estimate the ground truth annotations from multiple experts. Section 4 provides comprehensive experiments on a simulated object detection dataset and a real-world chest X-ray dataset. Section 5 discusses the experimental results, some key findings, and limitations of this work. Finally, Section 6 concludes the paper.

Related works
Learning from multiple annotators. There are two major lines of research on learning from multiple annotators: two-stage approaches [15,9,16] and one-stage approaches [10,17,18]. Two-stage approaches infer the true labels first, then train a model using the estimated ones. The simplest solution for label aggregation is majority voting, in which the choice of the majority of annotators is regarded as the truth [19]. However, when the skill levels of the annotators differ, the majority voting strategy may not work well. This is a common occurrence in the general "learning from crowds" problem when "spammers" are present.
Later approaches typically incorporate other information into the truth inference procedure, such as the annotators' proficiency [20], the annotators' confusion matrix [21,22], or the difficulty of each sample [23]. While two-stage approaches have the advantage of being simple to implement and debug, they do not make use of the raw annotations in model learning. One-stage approaches address this issue by simultaneously estimating the hidden true labels and learning the desired model from the noisy labels of multiple annotators. Earlier works in this group use the Expectation Maximization (EM) algorithm [24] to jointly model the annotators' ability and the latent ground truth. More recent approaches employ end-to-end frameworks that enable neural networks to learn directly from the noisy labels [12]; these were further developed by incorporating the annotators' confusion matrix [11,10] or instance features [17].

(ii) emphasize easy examples. Methods in group (i) include hard-example mining [25,26], which is a bootstrapping technique over the difficult examples; boosting algorithms [27], in which examples misclassified by preceding weak classifiers are assigned higher weights; and focal loss [28], which addresses class imbalance by adding a modulating factor to the cross-entropy loss in order to focus on hard negative examples. Works in group (ii) are instances of broader topics such as curriculum learning [29], which is inspired by the gradual way humans learn and prefers easier examples in early training stages, and learning with noisy labels [30,31], which prefers examples with smaller training losses because they are more likely to be clean.

Unlike the approaches above, we propose in this paper a new loss function that assigns more weight to more confident examples, where confidence is determined by the consensus of multiple annotators. Our experimental results support this hypothesis.

Proposed Method
This section presents details of the proposed method. We first give a formulation of learning from multiple annotators (Section 3.1). We then introduce a simple way to estimate the true labels from multiple annotators (Section 3.2). Next, our network architecture and training methodology with a new re-weighted loss function are described (Section 3.3).

We then train a supervised object detector with the estimated labels using the proposed re-weighted loss function. In order to evaluate the effectiveness of the proposed method, we use a gold-standard test set $T = \{(x^{(j)}, y^{(j)})\}_{j=1}^{M}$ containing $M$ examples. In medical imaging scenarios, where the true labels are not available, we obtain the gold-standard test labels $y^{(j)}$ from the consensus of a group of experienced radiologists. Figure 1 shows an overview of the proposed method.

Estimating the true labels from multiple expert annotators
We first estimate the true labels using the Weighted Boxes Fusion (WBF) algorithm [13]. This technique combines predictions from multiple sources, for example ensembling detectors to achieve better prediction results or combining the labels of different expert annotators. We describe the WBF algorithm in more detail in Algorithm 1.

2. Iterate through all boxes in A in a cycle and attempt to find a matching box in the list F. Two boxes are considered matched if they have a high degree of overlap (e.g., IoU > 0.4). If there is more than one matching box in F, the one with the highest IoU is chosen.
3. If no matching box is found in step 2, add the current box to L and F as a new entry for a new cluster before moving on to the next box in the list A.
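The truth-inference stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single finding class, the function and argument names (`fuse_expert_boxes`, `proficiencies`, `num_experts`) are hypothetical, and the confidence score is normalized to the fraction of experts who agree, which is one possible reading of the agreement score.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_expert_boxes(
    boxes: List[Box],
    proficiencies: List[float],
    num_experts: int,
    iou_thr: float = 0.4,
) -> Tuple[List[Box], List[float]]:
    """Cluster overlapping expert boxes and fuse each cluster into one box."""
    clusters: List[List[Tuple[Box, float]]] = []  # list L
    fused: List[Box] = []                         # list F
    for box, weight in zip(boxes, proficiencies):
        # Find the fused box with the highest IoU above the threshold.
        best_pos, best_iou = -1, iou_thr
        for pos, candidate in enumerate(fused):
            overlap = iou(box, candidate)
            if overlap > best_iou:
                best_pos, best_iou = pos, overlap
        if best_pos < 0:
            # No match: start a new cluster with this box.
            clusters.append([(box, weight)])
            fused.append(box)
        else:
            # Match: add to the cluster and recompute the weighted average.
            clusters[best_pos].append((box, weight))
            total = sum(w for _, w in clusters[best_pos])
            fused[best_pos] = tuple(
                sum(w * b[k] for b, w in clusters[best_pos]) / total
                for k in range(4)
            )
    # Assumption: confidence is the fraction of experts in each cluster.
    confidences = [min(len(c), num_experts) / num_experts for c in clusters]
    return fused, confidences
```

For instance, two heavily overlapping boxes from two of three experts fuse into one box with confidence 2/3, while an isolated box from the third expert keeps confidence 1/3.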

Network architecture and training methodology
Object detection is a multi-task problem, in which the loss function consists of two parts: (1) the localization loss $L_{loc}$ for predicting bounding box offsets and (2) the classification loss $L_{cls}$ for predicting conditional class probabilities. In this work, we focus on one-stage anchor-based detectors. A general form of the loss function for those detectors [32] combines these two terms and is defined in terms of the following quantities: $t$ and $t^{*}$ are the predicted and ground truth box coordinates, and $p$ and $p^{*}$ are the predicted and ground truth class category probabilities, respectively; $\mathrm{IoU}\{a, a^{*}\}$ denotes the Intersection over Union (IoU) between the anchor $a$ and its ground truth $a^{*}$; $\eta$ is an IoU threshold for objectness, i.e., the confidence score of whether there is an object or not; and $\beta$ is a constant for balancing the two loss terms $L_{cls}$ and $L_{loc}$.
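For concreteness, one form consistent with the symbols defined above is sketched here. How the terms combine is an assumption, in particular that the IoU indicator gates only the localization term; it is not necessarily the exact expression used in this work.

```latex
L(p, p^{*}, t, t^{*}) = L_{cls}(p, p^{*}) + \beta \, I(a) \, L_{loc}(t, t^{*}),
\qquad
I(a) =
\begin{cases}
1 & \text{if } \mathrm{IoU}\{a, a^{*}\} > \eta, \\
0 & \text{otherwise.}
\end{cases}
```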
We use the fused boxes' confidence scores $c_i^k$ obtained from Algorithm 1 to build a re-weighted loss function that emphasizes boxes with high annotator agreement. The new loss function, which we name Experts Agreement Re-weighted Loss (EARL), re-weights this general form using $c$, the fused box confidence score.
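A minimal sketch of such a confidence-weighted detection loss is given below. It assumes that the confidence score multiplies both the classification and localization terms of each anchor and that per-anchor loss values are already computed; the function name `earl_loss` and its arguments are hypothetical, not taken from the released code.

```python
from typing import Sequence


def earl_loss(
    cls_losses: Sequence[float],
    loc_losses: Sequence[float],
    ious: Sequence[float],
    confidences: Sequence[float],
    eta: float = 0.5,
    beta: float = 1.0,
) -> float:
    """Sum per-anchor losses, each scaled by its fused-box confidence."""
    total = 0.0
    for l_cls, l_loc, overlap, c in zip(cls_losses, loc_losses, ious, confidences):
        # The localization term only counts for anchors that overlap
        # their ground truth by more than the objectness threshold eta.
        is_positive = 1.0 if overlap > eta else 0.0
        # Assumption: the confidence c re-weights both loss terms.
        total += c * (l_cls + beta * is_positive * l_loc)
    return total
```

With this weighting, a box that all annotators agree on contributes its full loss, while a box drawn by a single annotator contributes proportionally less to the gradient.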

Experiments
We validate the proposed method in both synthetic and real-world scenarios: (1) MED-MNIST, an object detection dataset simulated from MNIST [14] with multiple expert annotations; and (2) VinDr-CXR [5], a chest X-ray dataset with labels provided by multiple radiologists. In the following sections, we describe those two datasets and our experiment setup, as well as the experimental results.

VinDr-CXR Dataset
VinDr-CXR [5] is by far the largest public chest X-ray database with radiologist-generated annotations. It consists of 18,000 chest X-ray scans that come with both the localization of critical findings and the classification of common thoracic

Rads-VinDr-CXR Dataset
One intriguing characteristic of the VinDr-CXR dataset [5] is that 94.28% of the abnormal scans in the training set (3,315 out of 3,516) were annotated by a group of three radiologists with the corresponding IDs R8, R9, and R10. We therefore create Rads-VinDr-CXR, a sub-dataset containing only the scans annotated by those three radiologists. Rads-VinDr-CXR serves as a suitable multiple-annotator dataset to validate the proposed approach.
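Selecting such a sub-dataset amounts to a simple filter over the annotation records. The sketch below assumes each record is a dictionary with hypothetical `image_id` and `rad_id` fields and keeps only the images whose annotator set is exactly {R8, R9, R10}; the actual file layout of the dataset may differ.

```python
from collections import defaultdict
from typing import Dict, List

TARGET_RADS = {"R8", "R9", "R10"}


def select_rads_subset(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the records of images annotated by exactly the target radiologists."""
    rads_per_image = defaultdict(set)
    for rec in records:
        rads_per_image[rec["image_id"]].add(rec["rad_id"])
    keep = {img for img, rads in rads_per_image.items() if rads == TARGET_RADS}
    return [rec for rec in records if rec["image_id"] in keep]
```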

Evaluation metric
For all experiments, we report the detection performance using the standard mean average precision metric at a threshold of 0.4 (mAP@0.4) [33]. Specifically, a predicted object is a true positive if it has an IoU of at least 0.4 with a ground truth bounding box. The average precision (AP) is the mean of 101 precision values, corresponding to recall values ranging from 0 to 1 with a step size of 0.01. The final metric is the mean of AP over all lesion categories. We also employ mAP@[0.5:0.95:0.05] as an additional metric to assess the model's performance at different IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05.
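The 101-point average precision for a single category can be sketched as follows. The interpolation rule used here (precision made non-increasing in recall, and zero where a recall level is never reached) follows common practice and is an assumption, as is the function name `average_precision_101`; the matching of detections to ground truth at the chosen IoU threshold is assumed to have been done already.

```python
from typing import Sequence


def average_precision_101(is_true_positive: Sequence[bool], num_gt: int) -> float:
    """AP as the mean of precision sampled at recall 0.00, 0.01, ..., 1.00.

    `is_true_positive` lists detections sorted by descending confidence.
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision monotonically non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    total = 0.0
    for k in range(101):
        level = k / 100
        # Precision at the first operating point whose recall reaches `level`.
        for p, r in zip(precisions, recalls):
            if r >= level:
                total += p
                break
    return total / 101
```

The mAP is then the mean of this value over all lesion categories, and mAP@[0.5:0.95:0.05] additionally averages over the ten IoU thresholds.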

Implementation Details
The main detector used in our experiments is YOLOv5-S [34]. The network is built with PyTorch 1.7.1 (https://pytorch.org/) and trained on two NVIDIA RTX 2080 Ti GPUs. All training and test images are resized to 640 × 640 pixels. The detector is trained for 50 epochs with 1cycle learning rate decay [35] using the SGD optimizer [36]. The initial learning rate is set to 1e-3.
To validate the robustness of the proposed approach across different deep learning detectors, we further train and evaluate EfficientDet [37] with sizes D3 and D4. Specifically, all images are resized to 640 × 640 pixels and the model is trained for 30 epochs with a constant learning rate of 3e-4 using the AdamW optimizer [38].

Comparison with the state-of-the-art
To the best of our knowledge, there is no existing multiple-annotators model for object detection tasks in the literature. Hence, we compare the performance of the proposed method against the baseline, which uses all experts' annotations per example without taking into account the disagreement among annotators.
On the Rads-VinDr-CXR dataset, we further compare our method with the Rads-ensemble, which is the ensemble of independent models trained on separate radiologists' annotation sets. In this case, the WBF algorithm is used to combine the predictions of those models.

Experimental Results
Table 1 and Table 2 report the experimental results of the YOLOv5-S detector on the MED-MNIST and VinDr-CXR datasets, respectively. On both synthetic and real-world datasets, the proposed approach outperforms the chosen baselines, including the ensemble of individual experts' models. Specifically, on the test set of the MED-MNIST dataset, our method reports an overall mAP@0.4 of 0.980 and an overall mAP@[0.5:0.95:0.05] of 0.849. These results are higher than those of the baseline, which obtains mAP@0.4 = 0.975 and mAP@[0.5:0.95:0.05] = 0.815, corresponding to relative improvements of 0.51% and 4.2%, respectively. Experimental results on the VinDr-CXR and Rads-VinDr-CXR datasets also validate the effectiveness of the proposed method. We achieve an overall mAP@0.4 of 0.200 on the VinDr-CXR dataset and an overall mAP@0.4 of 0.158 on the Rads-VinDr-CXR dataset. We emphasize that these results outperform the baseline model, the individual models trained on labels provided by a single annotator (i.e., R8, R9, R10), and the ensemble model.
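The percentages quoted for MED-MNIST are consistent with relative gains over the baseline scores. Under that reading, the arithmetic is:

```latex
\frac{0.980 - 0.975}{0.975} \approx 0.0051 = 0.51\%,
\qquad
\frac{0.849 - 0.815}{0.815} \approx 0.042 = 4.2\%.
```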
The experimental results with the EfficientDet detector are provided in Table 3. They show better detection performance than the baseline, which supports the robustness of the proposed approach across deep learning detectors.

Key findings and meaning
To the best of our knowledge, the proposed method is the first effort to train an image detector from labels provided by multiple annotators, which is crucial in constructing high-quality CAD systems for medical imaging analysis. In particular, we empirically showed a notable improvement in terms of mAP scores by estimating the true labels and then integrating the implicit annotators' agreement into the loss function to emphasize the clean bounding boxes over the noisy ones. The idea is simple but effective, and the overall framework can be applied to train any machine learning-based image detector.

Limitations
Despite the higher predictive performance over the relevant baselines, we acknowledge that the proposed method has some limitations. First, the overall architecture is not end-to-end, so it may not fully exploit the benefits of combining truth inference with training the desired image detector. Second, applying the WBF algorithm to annotation sets with a high level of noise may produce low-quality training data. This case is unlikely in the medical imaging field, where the annotators are experienced clinical experts, but it frequently occurs in general learning-from-crowds problems.

Conclusion
This paper concentrates on the use of annotations from multiple experts to build a robust deep learning system for abnormality detection on medical images. We propose using the Weighted Boxes Fusion (WBF) algorithm to obtain the aggregated annotations with the implicit annotators' agreement as confidence scores. The estimated annotations are then used to train a deep learning detector with a re-weighted loss function that incorporates the confidence scores to localize abnormal findings. We empirically demonstrate that the proposed approach outperforms current state-of-the-art baseline approaches in both synthetic and real-world scenarios. To the best of our knowledge, we introduce for the first time an effective method that trains an object detector from multiple annotators. We believe our method is simple and can be applied widely in medical imaging.
Weighted training examples. In this paper, we propose a new re-weighted loss function in which we assign more weight to examples that we consider to be more confident. Previous works on the use of weighted training examples can be briefly categorized into two groups: (i) emphasize hard examples and (ii) emphasize easy examples.

Given a set of $N$ training images $\{x_i\}$ [...] example $x_i$ given by annotator $r \in S(R)$, where $S(R)$ is a set of $R$ different expert annotators. In this study, we make use of those expert annotations $\tilde{y}^{(r)}$ [...] set of true labels with confidence scores $\{y_i; c_i\}_{i=1}^{N}$

Figure 1: Illustration of the proposed approach that aims to build a deep learning system for abnormality detection on medical scans from multiple expert annotators. The training process contains two stages. The first stage focuses on truth inference, in which the true labels are estimated using the WBF algorithm [13] with the implicit annotators' agreement as confidence scores. The second stage uses the estimated confidence scores to train a deep learning-based detector with a re-weighted object detection loss function. To provide abnormality analysis during the testing phase, only the fully trained image detector is required.

The final examples used to train deep learning detectors contain merged boxes with confidence scores. The fused boxes and their corresponding confidence scores are visualized in Figure 2. Our box fusion algorithm reflects the idea that the greater the agreement between bounding boxes (e.g., two or three annotators give the same diagnosis for an abnormal finding on the image), the more likely the box annotation is correct.

Algorithm 1: The WBF algorithm applied for multiple expert annotations
Input: An image $x$ with a list of annotations $\tilde{y}$ given by a set $S(R)$ of $R$ experts. The expert $r \in S(R)$ with proficiency $p_r$ provides annotations consisting of $r_x$ boxes, $A_r = [\text{box}_1, \ldots, \text{box}_{r_x}]$. All of the experts' annotations are merged into a list $A$.
Output: A list of $k$ fused boxes $F = [\text{box}_1, \ldots, \text{box}_k]$.
1. Declare empty lists L and F for box clusters and fused boxes, respectively. Each position in the list L can hold a cluster of boxes or a single box. Each position in F holds only one box, which is the fused box from the corresponding cluster in L.
4. If a match is found in step 2, add this box to the list L at the position pos that corresponds to the matching box in the list F.
5. Set the fused box's coordinates F[pos] to be the weighted average of the $T$ boxes accumulated in cluster L[pos], where $x_{1,2}^{(i)}$ and $y_{1,2}^{(i)}$ are the coordinates of the $i$-th box in the cluster and $p_i$ is the proficiency of its annotator:
$$x_{1,2} := \frac{\sum_{i=1}^{T} p_i \, x_{1,2}^{(i)}}{\sum_{i=1}^{T} p_i}, \qquad y_{1,2} := \frac{\sum_{i=1}^{T} p_i \, y_{1,2}^{(i)}}{\sum_{i=1}^{T} p_i}$$
6. Once all boxes in A have been processed, set the fused boxes' confidence scores in F to the number of boxes in the corresponding cluster in L: $c := c \min(T, N)$.
The fused boxes with confidence scores now represent the annotators' level of agreement.

(a) The original annotations provided by multiple radiology experts. The same abnormal finding is represented by the same color.
(b) Fused boxes with corresponding confidence scores after applying the WBF algorithm.

Figure 2: (a) Visualization of multiple expert annotations on a chest X-ray example from the VinDr-CXR dataset [5] and (b) the fused boxes with confidence scores obtained by the WBF algorithm.

Figure 3: Visualization of the original and synthesized transition matrices.To simulate the false negative scenario, we use an additional class called no_obj.

Figure 4: The MED-MNIST dataset with multiple expert annotations, obtained by perturbing boxes and classes from the MNIST dataset [14].

Figure 5: Visualization of abnormal findings (different bounding box colors represent different findings) from the VinDr-CXR dataset: (top) Each scan in the training set was annotated by three different radiologists; (bottom) Test set annotations were obtained from the consensus of five radiologists.

Table 1: Experimental results on the MED-MNIST dataset. The highest scores are highlighted in red.

Table 2: Experimental results on the VinDr-CXR and Rads-VinDr-CXR datasets with the YOLOv5-S detector. The highest scores are highlighted in red.

Table 3: Experimental results on the VinDr-CXR dataset when EfficientDet is used as the detector. The scores are measured in mAP@[0.5:0.95:0.05], with the highest values highlighted in red.