Active Learning Strategy for COVID-19 Annotated Dataset

The efficient diagnosis of COVID-19 plays a key role in preventing its spread. Recently, many artificial intelligence techniques, such as deep neural networks, have been applied to support the efficient diagnosis of COVID-19. However, the accuracy of deep learning depends on the tuning of many hyperparameters and on a large amount of labeled data. This COVID-19 data bottleneck, combined with insufficient human resources for data labeling, presents a challenging obstacle. In this paper, a novel discriminative batch-mode active learning framework (DS3) is proposed to allow faster and more effective COVID-19 data annotation. The framework is specifically designed to suit the imbalanced class distribution that is characteristic of COVID-19 data. Extensive experiments on four public real-world COVID-19 datasets from several countries (Brazil, China, Israel, and Mexico) show that our active learning framework significantly outperforms other state-of-the-art models, achieving an average G-Mean improvement of 10% across the four datasets. Finally, the results of significance testing verify the effectiveness of DS3 and its superiority over baseline active learning algorithms.

due to the fast spread of the virus, time limitations, and the heavy burden on the healthcare system. To address this problem, many works have applied active learning to reduce annotation time and effort. Active learning works by intelligently selecting the most informative samples to be labeled by a domain expert, with the expectation that model performance can be maintained while the annotation cost is reduced. Recently, many works have combined active learning with deep neural networks; the underlying idea is that deep neural networks can learn complex data representations, such as images, and thus improve the prediction outcome. Deep active learning has also improved the automatic detection of COVID-19 through lung X-ray image recognition. However, this approach is challenging to deploy with real human annotators, since retraining a deep neural network requires a certain amount of time and thus affects the annotator's mental awareness [2]. Furthermore, real-world COVID-19 datasets are significantly imbalanced: the number of positive cases is small compared to negative cases, making the positive class a minority class. It is therefore a challenge for an active learning model to select the most informative samples. In several works, informative samples are regarded as the samples that will most improve model performance [18]-[20]; for imbalanced data, however, it is also essential to select samples from the minority class. In this paper, a novel active learning framework to reduce annotation cost is proposed. We propose a discriminative batch-mode active learning framework, called DS3, that implements a discriminative, skew-specialized sampling suitable for imbalanced data. The experimental results demonstrate that DS3 can greatly cut the annotation cost of training a model and consistently outperforms state-of-the-art active learning methods in the diagnosis of COVID-19.
The contributions of this paper can be summarized as follows:
• We propose a novel batch-mode active learning framework specifically designed to solve imbalanced data annotation.
• We perform discriminative batch-mode active learning that outperforms state-of-the-art active learning approaches in cutting the labeling cost and achieves effective diagnosis of COVID-19.
• We perform experiments on four real COVID-19 patient datasets, compare the DS3 algorithm with other state-of-the-art batch-mode active learning algorithms, and statistically test the performance of our model against other popular classifiers with the Wilcoxon test.
The results show that DS3 outperforms most other state-of-the-art batch-mode active learning models.
The remainder of this paper is organized as follows. Section II gives a brief discussion of the related studies. Section V presents a detailed description of the experimental materials, proposed framework, and algorithms. Section VI demonstrates the experimental results and the corresponding empirical evaluation. Section VII presents a discussion of this work. The paper is concluded with an assessment of future work in Section VIII.

1 The source code and datasets of this work are publicly available at https://github.com/analyticray/Discriminative-Batch-Mode-Active-Learning-Framework-DS3-

II. RELATED WORK
The COVID-19 pandemic has attracted many researchers to develop state-of-the-art models for automatic detection. For example, Mohammed et al. [17] proposed a multi-criteria decision making (MCDM) approach to evaluate and benchmark different diagnostic models for COVID-19 with respect to evaluation criteria. Another study by AlWaisy et al. [15] proposed a novel multimodal deep learning system for identifying COVID-19 based on X-ray images; the proposed DeepNet architecture showed promising results in COVID-19 prediction. Similarly, AlWaisy et al. [16] proposed an advanced ResNet34 deep neural network image recognition model to classify healthy and COVID-19-infected patients based on X-ray images. More recently, Abdulkareem et al. [21] performed a large comparison study of various machine learning and deep learning models and reported that ResNet50 achieved the best accuracy of 98.8%. Although most of these studies show promising results, they are performed on large labeled datasets, which are expensive to generate. Therefore, an active learning strategy is proposed in this paper to reduce the cost of annotation in the COVID-19 prediction task. Conventional active learning (i.e., pool-based active learning) has been extensively explored in the literature [18], [22]. Most methods operate in an iterative manner, where "the most informative sample" is chosen for labeling; subsequently, the model is retrained with the newly labeled example. These steps are repeated until most of the examples can be classified with "reasonably high confidence" [23]. Retraining after each iteration is quite costly, especially with complex and expensive models. This is the main rationale behind batch-mode active learning (BMAL) methods, which select a group of informative instances simultaneously. BMAL methods fall into two main groups: 1) global methods and 2) cluster-based approaches.
Global methods try to find the most informative set of samples from the whole space directly by solving an optimization problem [20], [24]-[28]. These approaches have mathematically and empirically demonstrated good performance; however, they do not scale well to big datasets [29]. On the other hand, clustering-based methods, which are highly scalable, partition either the whole unlabeled space [30] or a fraction of it (i.e., the most uncertain part) [23], [31], [32] to reduce the probability of picking correlated queries. Once the partitions are formed, one or multiple instances are chosen to represent each of them. Recently, many works have combined deep neural networks with active learning [33]-[35]. However, to the best of our knowledge, COVID-AL [36] is the only work that explores active learning for CT-scan data labeling. The authors use hybrid active learning with a 3D residual network that simultaneously considers sample diversity and predicted loss. Although many deep active learning methods directly apply an uncertainty-based sampling strategy, such strategies can easily lead to insufficient diversity in the batch of queried samples (so that relevant knowledge about the data distribution is not fully utilized), which in turn leads to low or even invalid deep learning (DL) model training performance. Thus, a feasible strategy is to use a hybrid query strategy in a batch query, taking into account both the information content and the diversity of samples in either an explicit or implicit manner.

III. PROBLEM DEFINITION
First, let X = {x_1, x_2, ..., x_n} denote a dataset of n instances. We introduce the labeled set L and the unlabeled set U, where L ∪ U = X and L ∩ U = ∅. Every instance x_i^L in L is associated with a label y_i^L, which has been revealed by a domain expert and thus is known, whereas the labels associated with the instances x_i^U are still unknown. The proposed approach interactively selects a batch B of samples that satisfies B ⊂ U and |B| = b, where the batch size b is defined by human handling capacity. Note that all instances have equal annotation cost. The proposed approach operates in T iterations. In each iteration, the learner chooses b instances to be labeled by the domain expert and adds these labeled examples to L to update the classifier.
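The iterative bookkeeping above can be sketched as a generic pool-based loop. This is a minimal illustration, not the DS3 implementation: the `select_batch` and `train` callables are hypothetical placeholders for any query strategy and classifier.

```python
import numpy as np

def active_learning_loop(X, y_oracle, select_batch, train, T, b, seed=0):
    """Generic pool-based loop matching the problem definition:
    in each of T iterations, pick a batch B of size b from the
    unlabeled pool U, reveal its labels, move it to L, retrain."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labeled = set(rng.choice(n, size=b, replace=False))   # small seed set L
    unlabeled = set(range(n)) - labeled                   # pool U
    model = train(X[sorted(labeled)], y_oracle[sorted(labeled)])
    for _ in range(T):
        B = select_batch(model, X, sorted(unlabeled), b)  # B ⊂ U, |B| = b
        labeled |= set(B)                                 # oracle reveals labels of B
        unlabeled -= set(B)
        idx = sorted(labeled)
        model = train(X[idx], y_oracle[idx])              # update classifier on L
    return model, labeled
```

After T iterations the labeled set has grown by T·b instances, matching the budget defined in this section.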

IV. PROPOSED APPROACH
This section presents our proposed approach, the discriminative skew-specialized sampling (DS3), which has been specifically designed to tackle the class-imbalance problem in real-world applications. An illustration of the proposed approach is given in Fig. 1. The framework consists of two main components: 1) batch-mode imbalance learning, which predominantly focuses on finding a compromise between exploration and exploitation to effectively cover an uncertain space subject to a predefined budget, and 2) a balancing approach, which addresses the imbalanced-class problem for active learning.

A. BATCH-MODE IMBALANCE LEARNING
The main objective of DS3 is to develop a scalable batch-mode framework for the class-imbalance problem. The success of batch-mode active learning (BMAL) depends on selecting representative samples [37] as well as on the batch size and total budget constraints [29]. The key question is how to find the most representative samples from both the minority and majority classes so as to cover the whole uncertain space within the limited budget. To achieve these goals, the DS3 learning component consists of two parts: a) partition-based exploration and representation, and b) skew-specialized sampling.

1) Partition-based exploration
When dealing with massive amounts of unlabeled data, it is not feasible for a domain expert to examine every entry, and given the limited budget it is very likely that portions of the minority space are poorly represented. Therefore, it is beneficial to develop a discriminative model that can distinguish the most informative samples based on a criterion such as a ranking function. The proposed discriminative model is inspired by the work of Guo and Schuurmans [25]. Having access to both labeled and unlabeled samples, we build a model that maximizes the expected log-likelihood of the labeled data while minimizing the entropy of the missing labels on the unlabeled data:

f(w) = Σ_{i ∈ L} log P(y_i | x_i, w) − α Σ_{j ∈ U} H(y | x_j, w),   (1)

where w specifies the classification model, L is the labeled data, U is the set of unlabeled instances, and α is the trade-off parameter. To maximize the objective in equation 1, we construct a scoring function for a set of selected candidates S in iteration t + 1 according to:

score(S) = Σ_{i ∈ L_{t+1}} log P(y_i | x_i, w_{t+1}) − α Σ_{j ∈ U \ S} H(y | x_j, w_{t+1}),   (2)

where w_{t+1} is the parameter set of the conditional classification model trained on the new labeled set L_{t+1} = L_t ∪ S, and H(y | x_j, w_{t+1}) denotes the entropy of the conditional distribution P(y | x_j, w_{t+1}), such that

H(y | x_j, w_{t+1}) = − Σ_y P(y | x_j, w_{t+1}) log P(y | x_j, w_{t+1}).   (3)

The next step is to select the batch with the highest rank. We rank all samples using equation 2 and keep only the highest-scoring ones (the top 10% of the unlabeled pool). However, selecting the top K samples directly as a batch would harm performance, since many homogeneous samples sharing similar uncertainty scores would be selected. We therefore use a partition-based approach for uncertain-space exploration that divides the problem space into K disjoint partitions, giving a higher chance of exploring regions of the minority class [38]. In this work, we use K-Means and set the cluster size to 180 based on the budget derived from a batch-selection experiment. In our experimental studies, we also examine the effect of changing the cluster size.
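The rank-then-partition step can be sketched as follows. This is a simplified reading of the procedure, using predictive entropy (equation 3) as the ranking score rather than the full scoring function of equation 2; the helper names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def entropy(probs):
    """H(y|x, w) = -Σ_y P(y|x, w) log P(y|x, w), computed per row."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def partition_uncertain_space(model, X_unlabeled, k, top_frac=0.10, seed=0):
    """Rank the pool by predictive entropy, keep the top 10%,
    then split that uncertain region into k disjoint partitions
    with K-Means."""
    h = entropy(model.predict_proba(X_unlabeled))
    m = max(k, int(top_frac * len(X_unlabeled)))
    top = np.argsort(h)[::-1][:m]              # indices of most uncertain samples
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_unlabeled[top])
    return top, km.labels_, km.cluster_centers_
```

Partitioning only the uncertain region, rather than the whole pool, keeps the clustering cheap while still spreading the batch across distinct regions.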
Once the clusters are formed, a representative set, which is significantly smaller than the original set, needs to be identified; a good representation should capture most of the information of the original set. In equation 4, J represents the centroid of each cluster. The three samples nearest to the centroid of each cluster are used as its representative samples, with the intuition that the central point can represent a substantial portion of the instances inside a specific partition; several works in the literature likewise represent clusters by a central point [31], [32].

2) Skew-specialized sampling
In a highly skewed environment where the proportion of samples in the minority class is extremely low (under 10%) [19], conventional active learning approaches tend to perform poorly: even with an intelligent active learning approach, the probability of picking a minority sample can be under 1% [39]. The model's performance therefore tends to fluctuate over the training iterations. A classic approach to overcoming class imbalance is to represent the classes in a more balanced way, either by over-sampling the minority class, under-sampling the majority class, or a blend of the two. Here, a simple yet effective method is proposed that maintains the original population of the minority class while under-sampling the majority class in the query set. This method keeps the model stable by selecting the best representative samples to be labeled.
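The skew-specialized balancing step admits a compact sketch. Since the true labels of candidates are not yet known, this illustration balances by the model's predicted labels, which is an assumption about the mechanism rather than a transcription of the DS3 code.

```python
import numpy as np

def balance_query_set(candidates, predicted_labels, minority_label, seed=0):
    """Keep every candidate predicted as minority class, and randomly
    under-sample the predicted-majority candidates down to the same count,
    as in the paper's query-set balancing strategy."""
    rng = np.random.default_rng(seed)
    candidates = np.asarray(candidates)
    minority = candidates[predicted_labels == minority_label]
    majority = candidates[predicted_labels != minority_label]
    if len(minority) == 0 or len(majority) <= len(minority):
        return candidates                      # nothing to balance
    keep = rng.choice(len(majority), size=len(minority), replace=False)
    return np.concatenate([minority, majority[keep]])
```

The minority candidates are never discarded, which is what keeps the model stable across iterations in heavily skewed pools.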

B. BATCH SELECTION
Much research on batch-mode active learning picks the batch size as an arbitrary number, thus neglecting real human limitations on labeling. Commonly, batches of 20, 50, 100, 150, or 200 samples are selected for labeling, but there is a lack of explanation as to how these numbers are chosen. This work therefore follows recent studies [40], [41] that explain their reasoning for selecting a specific batch size and use 180 as the batch size. While Mirisaee et al. [40] chose 180 because it provides a good representation of the entire data, Fajri et al. [41] selected 180 through a real human labeling experiment.
The work [41] shows that a batch of 180 samples is suitable for clustered text data, which has a large feature space; it is therefore well suited for lower-dimensional feature spaces such as the COVID-19 datasets presented in this paper. The per-iteration procedure is:

  Initialize k clusters randomly;
  Set each cluster prototype as the cluster centroid;
  Select representative data from each cluster using equation 4;
  Balance the amounts using random under-sampling;
  Select the sample representation X*;
  Add the labels (x*, y) to L and remove X* from U;
  Update the model C_t using L.

V. EXPERIMENTAL SETUP

A. DATASETS
Several experiments are conducted on four publicly available COVID-19 datasets from several countries. We focus primarily on COVID-19 datasets because they represent both a recent data-science problem and an imbalanced set of data. Table 1 summarizes the characteristics of the datasets used in this paper. As the study is designed for predicting COVID-19 cases, the datasets contain mixed features, ranging from categorical to numerical data, such as age or COVID-19 symptoms. Table 1 also shows the number of features in each dataset as well as the imbalance ratio.

1) Data pre-processing and parameter tuning
This paper follows standard machine learning data pre-processing, including deleting null values and encoding categorical data.
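The two pre-processing steps can be sketched with pandas. The column names below are illustrative, not taken from the actual datasets.

```python
import pandas as pd

def preprocess(df, categorical_cols):
    """Drop rows containing null values, then one-hot encode the
    categorical columns, as described in the pre-processing step."""
    df = df.dropna()
    return pd.get_dummies(df, columns=categorical_cols)

# Hypothetical example: "age" is numerical, "symptom" is categorical.
example = pd.DataFrame({"age": [30, None, 45],
                        "symptom": ["cough", "fever", "fever"]})
clean = preprocess(example, ["symptom"])
```

After this step, every feature is numeric and can be fed directly to the tree-based learners used later in the paper.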

B. TWO-DIMENSIONAL VISUALIZATION OF DATASETS
Further experiments compared the sample-selection strategy of each model using t-SNE [44] visualization, as it preserves the local structure of the data better than PCA. Fig. 2 visualizes the samples selected by each model. LBC and CBMB both rely on uncertainty-driven scores: the former bounds the certainty of unlabeled data, while the latter measures the classification cost based on uncertainty sampling. Both approaches can increase the ability to select the most representative samples in 'round-shaped' data, such as the Mexican and Israeli datasets. However, for a spherical data shape, such as the Brazilian and Chinese datasets, cluster-based approaches such as certainty-based BMAL (CBMAL) and DS3 perform better. This performance is supported by the clustering algorithm, which can locate representative samples at the edge of the data shape.

C. LEARNING ALGORITHM & HYPERPARAMETERS SETTING
The proposed DS3 approach is model-agnostic; thus, any classification algorithm could be used. In the experiments, Random Forest is selected as the main learning algorithm: the model computes the entropy of the class-prediction probabilities produced by the Random Forest as the uncertainty-sampling score, as shown in equation 3. The Random Forest model is chosen because it is simple and has shown strong performance in many machine learning problems [45], [46]. Its hyperparameters are set to 50 trees with a maximum depth of 4 per tree. Experiments with different types of classifiers are reported in Section VI-C. The other hyperparameter of the experiment is the cluster size, which is fixed to 60 by default; experiments with different cluster sizes are also run to evaluate the sensitivity of the proposed approach in Section VI-B.
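The learner configuration and the entropy-based uncertainty score described above can be written directly in scikit-learn. The hyperparameters match the ones stated in this section; the function names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_learner(seed=0):
    """Random Forest with the hyperparameters from the paper:
    50 trees, maximum depth 4."""
    return RandomForestClassifier(n_estimators=50, max_depth=4,
                                  random_state=seed)

def uncertainty(model, X):
    """Entropy of the class-probability output (equation 3),
    used as the uncertainty-sampling score."""
    p = np.clip(model.predict_proba(X), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)
```

Because DS3 only needs `predict_proba`, swapping the Random Forest for any other probabilistic classifier (as in Section VI-C) changes nothing else in the pipeline.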

D. BASELINE METHODS
The DS3 method was compared with the most recent cluster-based BMAL approaches and the standard active learning method:
• CBMAL (Certainty-Based BMAL) [31]. The most ambiguous points are clustered together and the most uncertain point inside each cluster is sent for labeling.
• AL (Active Learning) [18]. In uncertainty-based active learning, the model first calculates the uncertainty of each sample and then presents a batch of uncertain data to be labeled.
• LBC [37]. As one of the most recent state-of-the-art objective-driven batch active learning methods, LBC uses the lower-bounded certainty score of unlabeled data. Subsequently, a large similarity matrix over the whole unlabeled space is formed, and a random greedy algorithm is employed to find a candidate batch for labeling.
• CBMB (Cost-Bound Make-Balance) [39]. CBMB is a recent active learning approach designed for unbalanced class distributions. It consists of two parts, Cost Bound and Make Balance: Cost Bound selects candidate samples based on a cost condition (uncertainty sampling or generated sample cost), while Make Balance balances the majority-class samples against the number of minority-class samples. The majority samples are selected at random.

E. EVALUATION CRITERIA
In a conventional classification problem, accuracy is a standard choice for performance evaluation. The accuracy score is straightforward and easy to implement:

Accuracy = (TP + TN) / (TP + TN + FP + FN).   (5)

This work uses accuracy as one of the evaluation metrics. However, it fails to reflect performance on skewed datasets. In such scenarios, the G-Mean and F1 measures are widely used in the literature. G-Mean is the geometric mean of the accuracies of the minority and majority classes:

G-Mean = sqrt( TP / (TP + FN) × TN / (TN + FP) ),   (6)

and F1 is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall).   (7)
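Equations 6 and 7 translate directly into code; a small sketch:

```python
import numpy as np

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (equation 6)."""
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def f1(precision, recall):
    """Harmonic mean of precision and recall (equation 7)."""
    return 2 * precision * recall / (precision + recall)
```

A perfectly balanced classifier (sensitivity = specificity = 1) yields a G-Mean of 1, while a classifier that ignores the minority class entirely (TP = 0) yields a G-Mean of 0, which is why the measure suits skewed data.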

VI. RESULTS AND DISCUSSION
This section evaluates the performance of the proposed approach and compares the model with other state-of-the-art methods. The experiments focus on the G-Mean measurement, and an investigation of why the proposed balancing approach outperforms other models is also presented.

A. PERFORMANCE EVALUATION
The proposed DS3 algorithm was compared with several state-of-the-art active learning models: CBMAL [31], CBMB [39], and LBC [37], as well as a common active learning baseline [18]. For comparison, a standard pool-based active learning strategy is implemented, dividing each dataset into three disjoint sets: train (10%), test (20%), and unlabeled (70%). Fig. 3 compares the F1 scores of the active learning models. The figure shows that the proposed algorithm ranks first in F1 score; DS3 generally outperformed the state-of-the-art active learning models. It performed best when the data had a spherical shape, for example in the Brazilian and Chinese datasets, where most of the informative samples reside at the edge of the data region, far from the center. For this shape, a clustering-based active learning approach is well suited to selecting the most informative samples. Fig. 4 shows the result for each dataset with respect to the ROC curve. Almost all models perform equally well on the Mexican, Israeli, and Chinese datasets; the performance differences there are only marginal. On the Brazilian dataset, DS3-labeled data achieved a higher ROC curve, at 0.71, which is 0.11 points above LBC and CBMB.
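The 10/20/70 pool-based split used for the comparison can be sketched as an index partition (the function name is ours):

```python
import numpy as np

def split_pool(n, seed=0):
    """Disjoint 10% train / 20% test / 70% unlabeled index split,
    matching the pool-based evaluation protocol."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_test = int(0.1 * n), int(0.2 * n)
    return (idx[:n_train],                     # initial labeled set L
            idx[n_train:n_train + n_test],     # held-out test set
            idx[n_train + n_test:])            # unlabeled pool U
```

Keeping the test set fixed across iterations is what makes the learning curves of the different query strategies comparable.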

B. EXPERIMENTS USING DIFFERENT CLUSTER SIZES
A further experiment tested the robustness of the DS3 approach under different cluster sizes. The main objective was to explore whether cluster size has a significant impact on the approach. In cluster-based approaches such as DS3, the cluster size influences how well the data are represented; the choice of cluster size should therefore maximize the creation of homogeneous clusters, making it easier to select representative data and contributing to better model performance. Table 3 reports DS3 performance across different cluster sizes. In particular, it shows that the performance of the DS3 model increases with cluster size. For example, on the Mexican COVID-19 dataset, DS3 with a cluster size of 100 has a G-Mean of 0.59 and an F1 score of 0.20; when the cluster size is extended to 300, performance rises to a G-Mean of 0.65 and an F1 score of 0.37. However, the accuracy of DS3 on the Israeli dataset behaves opposite to the general trend. This could be explained by the characteristics of the Israeli dataset, which is the largest dataset, so that the 180 samples selected from the clusters could not represent it well.

C. EXPERIMENTS USING DIFFERENT CLASSIFIERS
The DS3 underlying classifier was compared with other well-known tree-based algorithms: AdaBoost, CatBoost, XGBoost, and LightGBM. Random Forest was chosen as the underlying classifier of our main algorithm. Previously, the Support Vector Machine was a popular classifier for active learning [47], [48]; however, many recent works prefer Random Forest as the base classifier, since it works well for BMAL under unbalanced class distributions [49]-[51]. Since most of the datasets have an imbalanced class ratio, Random Forest was chosen as the base classifier. Table 4 compares the performance of each classifier with respect to the G-Mean, F1 score, and accuracy. The results show that DS3 with Random Forest performs slightly better than with other base classifiers. For example, Random Forest has a better F1 score on the Mexican and Chinese COVID-19 datasets, where DS3 with Random Forest reaches F1 scores of 0.37 and 0.89, respectively. On the other datasets, Random Forest performs on par with the other classifiers. The performance of DS3 with Random Forest is lower than with XGBoost and LightGBM on the Israeli and Brazilian datasets; however, the difference is only 0.01 and thus does not reflect a real performance gap. The performance of each classifier was further examined statistically. First, the normality assumption was tested with the Kolmogorov-Smirnov test, using the F1 score as the base score for statistical evaluation. The samples failed the normality test, so non-parametric tests, such as the Wilcoxon test, are more appropriate for evaluating the performance of our model. We used the Wilcoxon signed-rank test, and the results are presented in Table 6.
The Wilcoxon test shows that on almost all datasets Random Forest is significantly better than AdaBoost, XGBoost, CatBoost, and LightGBM. However, there are no significant differences between Random Forest and LightGBM on the Israeli and Brazilian datasets.
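The paired test above can be run with SciPy. The score arrays below are hypothetical per-run F1 scores, used only to show the call; they are not the paper's results.

```python
from scipy.stats import wilcoxon

def compare_classifiers(scores_a, scores_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on per-run F1 scores;
    returns the p-value and whether the difference is significant
    at the chosen alpha level."""
    stat, p = wilcoxon(scores_a, scores_b)
    return p, p < alpha
```

A non-parametric paired test is the right choice here because the F1 samples failed the Kolmogorov-Smirnov normality check and the runs of the two classifiers share the same data folds.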

D. EXPERIMENTS USING DIFFERENT INITIAL AMOUNTS OF TRAINING DATA
A final experiment evaluated how the DS3 framework behaves with different amounts of initial training data. The initial training set was set to a default of 10% of the dataset; however, it is interesting to evaluate the time performance with increased amounts of initial training data. The execution time of each model with respect to the various amounts of training data is presented in Table 5; the lowest execution time is shown in bold and the highest in italics. The experiment uses only the Mexican COVID-19 dataset, since it is the largest dataset presented in this paper. Table 5 shows that traditional AL has the lowest execution time across all training sets, which is intuitive since AL has no extra computational cost beyond calculating the entropy. The LBC model shows the highest execution time, since its characteristics are the opposite of AL, with substantial extra computational costs. The other three methods, CBMAL, CBMB, and DS3, incur extra computational time compared to AL; CBMB has a lower execution time than CBMAL and DS3, while DS3 shows the highest. However, considering the results, DS3 has a better F1 score on the Mexican dataset than the other approaches: its F1 score of 0.35 is almost double that of the rest of the models on this dataset.

VII. DISCUSSION
This research shows that a discriminative approach works best for batch-mode active learning in imbalanced-data scenarios. There are several potential explanations for the results. The first is the ability of DS3 to select the most representative data: identifying data that belong to the minority class inside each cluster leads to better sample selection than in other models. The second is that our balancing mechanism leads to more stable performance. We expect our framework to have a positive impact on the community, as the model could be used to reduce COVID-19 data annotation cost. With this reduced cost, deep learning models can be trained for automatic COVID-19 detection efficiently. Finally, our framework could be transferred to other domains that aim to reduce the cost of annotation, for both balanced and imbalanced datasets.
Although it shows better performance than the baselines, the proposed approach is no silver bullet, and there is room for improvement in the framework. For example, in the DS3 balancing approach, the data selection is random. In future work, it would be interesting to see how other sampling methods behave when combined with partition-based models.

VIII. CONCLUSION
This paper proposes a discriminative batch-mode active learning framework, called DS3, for the diagnosis of COVID-19. The framework can greatly reduce the cost of manual labeling for training models and can further relieve the burden on the healthcare system in the case of a fast-spreading pandemic. The proposed framework can boost the performance of any machine learning model by simultaneously considering the diversity and representativeness of the data samples while fitting the imbalanced data distribution. To verify the effectiveness of DS3, extensive experiments have been conducted on various real-world COVID-19 datasets. The experimental and statistical significance test results demonstrate that DS3 outperforms state-of-the-art batch-mode active learning baselines.