Trainable Weights for Multitask Learning

Research on multitask learning has been steadily increasing owing to its advantages, such as preventing overfitting, averting catastrophic forgetting, solving multiple inseparable tasks, and coping with data shortage. Here, we ask whether multitask learning should incorporate different orderings of feature levels based on the distinct characteristics of tasks and their interrelationships. While many classification tasks rely only on the features extracted from the last layer, we reasoned that, given the different characteristics of tasks, there may be a need to encompass different representation levels, i.e., different orderings of feature levels. Hence, we exploit the knowledge at different representation levels by extracting features from the various blocks of the main module and applying trainable parameters as weights on those features. In other words, we learn to weigh the features in a task-specific manner and solve each task with a combination of the newly weighted features. Our method, SimPara, presents a modular topology for multitask learning that is memory- and computation-efficient, effective, and easily applicable to diverse tasks and models. To show that our approach is task-agnostic and broadly applicable, we demonstrate its effectiveness in auxiliary task learning, active learning, and multilabel learning settings. This work underscores that by simply learning weights that better order the features learned by a single backbone, we can achieve better task-specific performance.


I. INTRODUCTION
With the rise of deep learning, methods for solving multiple tasks simultaneously (i.e., multitask learning) have become an active field of research, driven by the data-dependent characteristics of neural networks. The most common motivation for adopting multitask learning is to build a ''Swiss Army knife'' that generalizes well to multiple tasks. By learning multiple tasks in parallel, the hope is that the knowledge learned in each task helps the other tasks, and that this shared positive influence among tasks leads to better generalization [1]. In addition, without having to build an individual classifier for each task, a multitask learning architecture can reduce memory and computation time [2]. This goal often encompasses simultaneous learning of multiple tasks or lifelong learning (i.e., incremental or continual learning), which requires the network to maintain its performance on previously learned tasks while not losing performance on its current task [3], [4], [5], [6], [7], [8]. Multitask learning is also adopted when there is a strong need to share weights, e.g., training a model to classify images while also estimating the model's classification error so that it can predict which data from an unlabeled pool will incur the largest error (active learning) [9]. Multitask learning has also been used in auxiliary task learning to build models that perform on par with transfer learning from self-supervised pre-training [10], [11], [12]. This auxiliary learning scheme has been reported to help cope with data shortage and prevent overfitting.
Owing to these diverse applications of and needs for multitask learning, we performed experiments on three applications (i.e., active learning, auxiliary task learning, and multilabel learning) to demonstrate the effectiveness of our method. Additionally, we conducted experiments using ResNet architectures [13] of different parameter sizes and showed that our method works robustly regardless of model size. To the best of our knowledge, this is the first study to demonstrate a method's effectiveness across these various multitask learning applications spanning different research fields.
Although multitask learning has been widely adopted, the hard parameter sharing method remains dominant in multitask learning applications [10], [14]. The hard parameter sharing method maintains task-specific output layers only at the end of each task and shares the other hidden layers among all tasks [15]. Despite its poor performance compared to state-of-the-art methods, it remains dominant in real-world applications owing to its simplicity and its efficiency in terms of computation and memory. Therefore, we introduce our new architecture, SimPara, which resembles the hard parameter sharing method in its efficiency but offers better performance.
We summarize our contributions below:
1. We propose SimPara, a simple, modular, and efficient architecture for multitask learning.
2. Using small-, medium-, and large-sized models, we demonstrate that our algorithm works robustly regardless of the size of the backbone model.
3. We verify the effectiveness of this approach through experiments on three multitask learning application tasks (i.e., auxiliary task learning, active learning, and multilabel learning).

II. RELATED WORKS
A. MULTITASK LEARNING
Given that different tasks require learning different knowledge, multitask learning concerns how well we choose the optimal features for particular tasks. Numerous methods have been proposed to select appropriate weights for each task, such as choosing a different combination of layers for each task via lasso regularization [11], selecting appropriate weights with learned binary masks [4], using task-specific parameter masks generated by pruning [5], and selecting a minimal subset of task-specific existing layers [3]. Similar research is reported in [17], which aims for simplicity with minimal task-specific layers on the trunk of a VGG architecture. Although many multitask learning architectures fully exploit the last block by dividing it into several task-specific branches [10], [12], [15], [16], the informative low-level features learned in the front of the trunk are overlooked. Our trainable parameters rank the usefulness of the diverse levels of features for each task to improve performance. In doing so, we leverage the low-level knowledge learned in the early blocks and make use of it to increase performance.
Another limitation of previous multitask learning research is that some approaches are not easily applicable [16], [18]. There has been a continuing tendency to sacrifice efficiency for performance, because studies compare their intra-domain performance with that of other multitask learning models. As a result, these multitask learning modules pay little attention to time complexity and memory usage. Although multitask learning is an important field of research, it is important to explore the performance of an architecture in diverse settings, because it should be applicable to various domains. Therefore, we introduce a task-agnostic approach applicable to diverse tasks, and our experiments confirm that our model performs stably across various task characteristics, with only subtle limitations.

B. INCREMENTAL LEARNING
Incremental learning (also known as continual learning or multi-domain learning) involves introducing classifiers or detectors to a few classes at a time [3], [4], [5], [6], [7], [8]. In multi-domain learning (MDL), domains are learned incrementally, in settings where the data are not available all at once. Multitask learning approaches have frequently been evaluated for their effectiveness in incremental learning-related settings, that is, on the sequential addition of entire datasets or image classification tasks [3], [4], [5], [6]. Whenever new data are provided to learn from while the previously acquired knowledge must be maintained, the multitask learning algorithm treats the new subset of data as a new task in incremental learning. However, our architecture targets simultaneous learning rather than sequential training; hence, we did not test our method on incremental learning.

C. SELF-SUPERVISION TASKS AND AUXILIARY TASK LEARNING
To cope with the high cost of data annotation in computer vision, multitask learning has been actively used in conjunction with supervised learning [10], [11], [12]. Simultaneously learning a combination of self-supervised pretraining tasks has been reported to enhance performance, and this scheme often entails multitask learning. Auxiliary task learning, a type of multitask learning, differs from general multitask learning in that the multiple tasks exist to enhance performance on one main task. Thus, there is a priority among the tasks: the main task is the ultimate goal to fine-tune, while the other, auxiliary tasks are secondary.
Because neural networks are prone to overfitting, that is, they tend to memorize even random labels in the training set or to focus mostly on obvious data features, auxiliary tasks help models learn features that are overlooked but meaningful. The choice of auxiliary tasks to use, and how to avert negative transfer, has been an active area of research in multitask learning [5], [19], [20].

D. ACTIVE LEARNING
Active learning, which aims to find the most informative unlabeled data with which to query experts, can be addressed using multitask learning algorithms. Yoo and Kweon [9] adapted multitask learning to active learning by simultaneously learning two tasks: one task predicts the loss incurred by the model's prediction, and the other is the main task to solve (e.g., detection). We extend the architecture used in [9] by adding trainable weights to the extracted features immediately before the feature-wise concatenation.

III. APPROACH
As mentioned earlier, our proposed method extends the architecture reported in [9] to generalize and improve its capability. As shown in Figure 2, our approach has two main phases: extracting multiple features from different levels of learned representations, and learning trainable weights to better capture the interrelationships among multiple tasks. We refer to the trunk that solves its task from the shared blocks as the ''downstream task module'' and the module below it, which utilizes the features extracted from the downstream task module, as the ''pretext task module''. Thus, for n tasks, there is one downstream task module and n-1 pretext task modules.

A. FEATURE EXTRACTION
As the number of layers increases, the model gradually learns representations ranging from low- to high-level features [21]. To use a wide spectrum of representation levels, we extracted features from every block in the network, as shown in Figure 2. In Figure 2, there are three blocks, and by extracting one feature map from each block, three feature maps in total are extracted for use in the subtask. For every feature map, we performed global average pooling, passed the result through a fully connected layer, and finally added non-linearity to the feature vector using ReLU activation. We hypothesized that because different tasks require different knowledge, leveraging various levels of features will be useful, as the task-specific weights will optimize learning with the various levels of representation for their own purpose.
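To make this concrete, below is a minimal PyTorch sketch of the per-block feature processing described above (global average pooling, a fully connected layer, then ReLU). The module name, channel counts, and common projection width are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BlockFeatureExtractor(nn.Module):
    """Turn each block's feature map into a fixed-size vector via
    global average pooling -> fully connected layer -> ReLU."""

    def __init__(self, block_channels, feat_dim=128):
        super().__init__()
        # One small head per backbone block; block_channels lists the
        # channel count of each block's output feature map.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),   # global average pooling
                nn.Flatten(),
                nn.Linear(c, feat_dim),    # project to a common width
                nn.ReLU(inplace=True),
            )
            for c in block_channels
        ])

    def forward(self, block_maps):
        # block_maps: list of feature maps, one per backbone block
        return [head(fmap) for head, fmap in zip(self.heads, block_maps)]

# Toy usage with made-up shapes for a three-block backbone:
maps = [torch.randn(2, 16, 32, 32), torch.randn(2, 32, 16, 16), torch.randn(2, 64, 8, 8)]
feats = BlockFeatureExtractor([16, 32, 64])(maps)   # three (2, 128) feature vectors
```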

B. TRAINABLE WEIGHTS FOR EXTRACTED FEATURES
Because performance must be stable regardless of task characteristics or data, adapting trainable weights that automatically optimize the influence of the extracted features, and thereby the interactions between tasks, is crucial and is our main contribution. We restricted the feature weights from taking negative values because, experimentally, leaving the signs of the parameters unrestricted usually worsened performance or had no impact. We initialized all trainable weight parameters to 1 and limited them to not exceed 10,000 for training stability. Using the trainable parameter p_b for block b, we perform scalar multiplication on the feature vector f_b learned after the nonlinearity, to re-weigh the features for the particular task and obtain the reweighted feature r_b as

r_b = p_b · f_b.

For example, in Figure 2, because there are three blocks, α is p_1, β is p_2, and γ is p_3. The Greek letters are arbitrary, as there could be more than three blocks in the backbone model, so we write p_b for a more accurate illustration. After applying each scalar parameter to its feature map, we horizontally concatenate all reweighted feature vectors {r_1, ..., r_b} to derive the task-specific tensor for a particular downstream task t. The tensor is then passed through a fully connected layer that handles the task. We assumed that each scalar parameter applied to its respective feature map acts like a signal amplifier that controls the impact of that feature map, and this presumption is validated by our experiments. Lastly, there is a specific loss for each task, and the final loss L is the sum of all task-specific losses l_t, i.e., L = Σ_t l_t. The task-specific loss can be of any type that matches the task characteristics (e.g., categorical cross-entropy loss for multi-class classification, L1 loss for regression, etc.).
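The following minimal PyTorch sketch puts these pieces together: one trainable scalar per block, initialized to 1 and kept within [0, 10000], multiplied with the per-block feature vectors, concatenated, and passed through a task-specific fully connected layer, with the final loss taken as the sum of task losses. The class name, feature width, output sizes, and the use of clamping to enforce the bounds are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimParaHead(nn.Module):
    """One trainable scalar p_b per block feature, kept within [0, 10000],
    applied as r_b = p_b * f_b before feature-wise concatenation and a
    task-specific fully connected layer."""

    def __init__(self, num_blocks, feat_dim, num_outputs):
        super().__init__()
        self.p = nn.Parameter(torch.ones(num_blocks))   # initialized to 1
        self.fc = nn.Linear(num_blocks * feat_dim, num_outputs)

    def forward(self, feats):
        # feats: list of per-block feature vectors f_b, each (batch, feat_dim)
        p = self.p.clamp(min=0.0, max=10000.0)          # non-negative, bounded
        reweighted = [p[b] * f for b, f in enumerate(feats)]
        return self.fc(torch.cat(reweighted, dim=1))    # concatenate, then FC

# Hypothetical two-task setup: the final loss is the sum of the task losses.
head_main, head_aux = SimParaHead(3, 128, 10), SimParaHead(3, 128, 16)
feats = [torch.randn(4, 128) for _ in range(3)]
loss = F.cross_entropy(head_main(feats), torch.randint(0, 10, (4,))) + \
       F.cross_entropy(head_aux(feats), torch.randint(0, 16, (4,)))
```

Enforcing the bounds with a clamp is only one of several possible mechanisms; a re-parameterization (e.g., a softplus) could keep the weights non-negative as well.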

IV. EXPERIMENTS AND RESULTS
We used four ResNet architectures [13] as trunks in our experiments: ResNet-18, ResNet-32, ResNet-50, and ResNet-101. To demonstrate the stability of the model in a task-agnostic manner, we validated our method under three evaluation settings: auxiliary task learning [10], multitask learning in active learning [9], and multilabel classification. For every experiment, we compared against the approach of the baseline research for that task.

A. PUBLIC DATASET
As shown in Table 1, we considered various public datasets. Specifically, for active learning, we used the CIFAR-10 and CIFAR-100 datasets [22] to assess the performance of our method. For auxiliary task learning, we used CIFAR-10, CIFAR-100, SVHN [23], and STL-10 [24]. For multilabel learning, we used the FGVC-Aircraft dataset [25] with a three-level hierarchy, namely variant, family, and manufacturer, in fine-to-coarse order. We also used a private dataset for the active learning setting; information about this dataset is provided later in this work.

B. SELECTING THE NUMBER OF TRAINABLE PARAMETERS
As introduced earlier, we utilized the ResNet series for our experiments: ResNet-18 and ResNet-101 were used for active learning, ResNet-50 for multilabel learning, and ResNet-32 for auxiliary task learning. As SimPara is modular, the number of trainable weights and the features to be extracted can be chosen. Here, we used the last layer of each block in the network, so that the number of blocks determines the number of extracted features and trainable weights. The reason is that we aimed to obtain an evenly distributed set of feature levels ranging from low- to high-level representations. We used four trainable weights for ResNet-18, ResNet-50, and ResNet-101, since each has four similar blocks (we excluded the one at the front); in contrast, we used three trainable weights for ResNet-32 because it has three blocks.

C. AUXILIARY TASK LEARNING: LoRot-E
Because neural networks require high-quality data, several approaches have investigated optimal combinations of pretext tasks in self-supervised learning and the associated multitask learning settings [10], [11], [12]. Among previous works, we selected LoRot-E [10], which uses the hard parameter sharing architecture as a backbone to learn the auxiliary task and the main task simultaneously. We compare how our multitask learning architecture performs in auxiliary task learning when adapted to [10]. For a fair comparison with the original approach, we replaced the backbone architecture of LoRot-E with our approach and evaluated performance in auxiliary task learning. Because hard parameter sharing, i.e., the naïve multi-head architecture, is simple, it is not surprising that an architecture optimized for multitask learning outperforms it. However, we emphasize that the size of our approach does not differ much from that of [10], and our approach still outperforms the hard parameter sharing algorithm.
To briefly explain the algorithm of [10], there are 16 surrogate classes in the localizable rotation task, which serves as the auxiliary task, while the general classification task is the downstream task. The role of the rotation task is to enhance performance on the downstream task while the LoRot-E model solves both tasks simultaneously using the hard parameter sharing approach. Categorical cross-entropy loss was used for both tasks.
We report the training and test accuracy averaged over three runs with the standard deviation in Table 3; the backbone model used is ResNet-32. In Table 3, the numbers in bold denote the best-performing method for each dataset. For the CIFAR-10 and CIFAR-100 datasets [22], we prepared an imbalanced training setting as in [26]. Using the imbalance ratio µ = 0.01, the number of classes K, and the number of samples in the original training set for class i denoted ñ_i, the number of samples n_i for class i is defined as n_i = ñ_i · ν_i, where ν_i ∈ (µ, 1). The imbalanced data distribution is visualized in Figure 3. The SVHN [23] and STL-10 [24] datasets were kept balanced during training. In this experiment, we used a downstream task module and a pretext task module to solve the classification problem on the original classes and the task of the pseudo rotation labels, respectively, as introduced in [10]. As shown in Table 3, the performance of our architecture is better than that of the hard parameter sharing architecture, LoRot-E. The result indicates that the proposed model meets its goal in auxiliary task learning by preventing overfitting and helping the model learn less obvious yet helpful features for decision-making.
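As a rough illustration of how such an imbalanced split could be constructed, the sketch below assumes the commonly used exponential decay across classes; the exact decay form is not fully recoverable from the text, so only the ratio µ and the bounds on ν_i are taken from the description above.

```python
import numpy as np

def imbalanced_class_sizes(orig_sizes, mu=0.01):
    """Per-class sample counts n_i = n~_i * nu_i with nu_i decaying from 1
    down to mu. The exponential decay nu_i = mu ** (i / (K - 1)) is an
    assumption consistent with common long-tailed protocols; the text above
    only states that nu_i lies between mu and 1."""
    K = len(orig_sizes)
    nus = mu ** (np.arange(K) / (K - 1))
    return [int(n * nu) for n, nu in zip(orig_sizes, nus)]

# CIFAR-10-style example: 5,000 training images per class before subsampling.
print(imbalanced_class_sizes([5000] * 10, mu=0.01))   # decays from 5000 down to 50
```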
Before attaching trainable weights to the features and performing feature-wise concatenation, we explored how each feature participated in the pretext task. As shown in Figure 4, clusters become more identifiable as the layer from which the features are extracted deepens. Although the last layer in the last block is not directly related to the classification problem of the main task, we conclude that the highest-level features learned as the layers deepen are the most helpful in accomplishing both the main and the pretext tasks on the CIFAR-10 dataset. Thus, we verified that the pretext and main tasks have a positive-transfer relationship on the CIFAR-10 dataset in the auxiliary task learning setting [10].
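A per-block inspection like the one shown in Figure 4 can be reproduced with a short script along these lines; the feature dimensions and labels are random placeholders standing in for the actual pooled block features and rotation labels.

```python
import numpy as np
from sklearn.manifold import TSNE

# Project each block's pooled feature vectors to 2-D with t-SNE and compare
# how separable the pretext-task labels are, as in Figure 4. The feature
# arrays, dimensions, and labels below are random stand-ins.
rng = np.random.default_rng(0)
features_per_block = [rng.normal(size=(500, d)) for d in (64, 128, 256)]
labels = rng.integers(0, 4, size=500)   # pretext (rotation) labels, illustrative

embeddings = [TSNE(n_components=2, init="pca", random_state=0).fit_transform(f)
              for f in features_per_block]
# Each entry of `embeddings` is a (500, 2) array that can be scattered and
# colored by `labels` to produce a Figure 4-style comparison per block.
```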
As shown in Table 4, ablation studies were performed to determine the optimal architecture for the proposed approach. LoRot-E denotes the hard parameter sharing approach; LoRot-E w/ Feature Extraction represents feature-wise concatenated feature maps for the classification of the auxiliary task; LoRot-E w/ ReLU+GAP adds ReLU and global average pooling operations to LoRot-E w/ Feature Extraction; and lastly, LoRot-E w/ SimPara indicates altering the architecture of LoRot-E to SimPara. Also in Table 4, the numbers in bold denote the best-performing method. Note in Table 4 that the only difference between LoRot-E w/ ReLU+GAP and LoRot-E w/ SimPara is the trainable scalar parameters. Thus, the result validates our prior assumption that the trainable weights control the impact each feature map brings and enhance performance.

D. TRACING THE CHANGING PATTERNS OF WEIGHTS APPLIED TO FEATURES
While most multitask learning algorithms make use of only high-level representations (i.e., the last layer of the neural network), our study proposes that this may not be optimal and that multitask learning is highly task- and data-dependent. As shown in Figure 5, we observed that, in most cases, the features extracted from the last block (black line) contributed the most to the performance. More importantly, however, in Figure 5 (d) the weight applied to the feature extracted from the first block (blue line) exceeded that of the last block (black line). This observation is noteworthy, as SimPara also exhibited superior performance of 70.079 compared to the hard parameter sharing method at 67.575, as reported in Table 3.
From Figure 5 (d), we infer that the trainable parameters help choose optimal features more effectively depending on the unique characteristics of the dataset. More essentially, this demonstrates the effectiveness of our module in that high-level feature transfer is not always optimal. It also supports the possibility that, compared to the features extracted from the last block of the trunk, the features learned in the earlier blocks may be more useful depending on the dataset, and that it may be possible to devise better ways to exploit the early features than those carried forward by the skip connections in residual blocks. However, we leave this to future work.

E. ACTIVE LEARNING IN MULTITASK LEARNING
We also tested our module in an active learning setting, where the downstream task is classification and the subtask is to predict the loss of the given data [8]. In active learning, a new group of labeled images, chosen by the model as those estimated to be most helpful for the model's performance, is added to the training dataset each cycle. The loss for the downstream classification task was categorical cross-entropy, and the loss for the pretext task was the margin ranking loss with a margin of 1. The same training setup was applied to the CIFAR-10 and CIFAR-100 datasets for both SimPara and the learning loss for active learning (LLAL) approach [9], with ResNet-18 and ResNet-101 as the backbones. The test accuracy is shown in Tables 5, 6, 7, and 8; in all tables, numbers in bold denote the best-performing method and standard deviations are given in brackets. For simplicity, we denote Learning Loss for Active Learning [9] as LLAL. On CIFAR-10, as reported in Table 5, SimPara outperformed LLAL in 9 out of 10 cycles with ResNet-18, and as shown in Table 6, in 6 out of 10 cycles with ResNet-101 as the trunk. On the CIFAR-100 dataset, as described in Table 7, SimPara performed better in 8 out of 10 cycles with ResNet-18, and as shown in Table 8, in 7 out of 10 cycles with ResNet-101 as the trunk.
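For reference, a hedged sketch of the pairwise margin ranking objective for the loss-prediction subtask is given below. The pairing scheme and function name are illustrative assumptions; only the margin of 1 and the ranking formulation follow the setup described above and in [9].

```python
import torch
import torch.nn.functional as F

def loss_prediction_loss(pred_loss, true_loss, margin=1.0):
    """Pairwise ranking objective for the loss-prediction subtask, in the
    spirit of LLAL [9]: penalize pairs whose predicted losses are ordered
    differently from the actual per-sample losses. The pairing scheme
    (sample i with sample i + B/2) is an illustrative assumption."""
    half = pred_loss.size(0) // 2
    p1, p2 = pred_loss[:half], pred_loss[half:2 * half]
    t1, t2 = true_loss[:half], true_loss[half:2 * half]
    target = torch.where(t1 > t2, torch.ones_like(t1), -torch.ones_like(t1))
    return F.margin_ranking_loss(p1, p2, target, margin=margin)

# Toy usage: true_loss would be the detached per-sample classification loss.
pred = torch.randn(8)            # predicted losses from the pretext module
true = torch.rand(8).detach()    # per-sample cross-entropy values
print(loss_prediction_loss(pred, true))
```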

F. PRIVATE DATASET (SEOUL NATIONAL UNIVERSITY HOSPITAL, SNUH)
Active learning is frequently used in medical fields owing to the high cost of asking doctors or medical experts to label data manually. As shown in Figure 6, we tested our model on two private medical datasets from Seoul National University Hospital (SNUH): an X-ray dataset of neutral poses of the spine and a fundus dataset. The spine dataset comprises 3,831 labeled grayscale images, with 766 images as the test set. The fundus dataset comprises 20,000 color images, with 4,000 images as the test set. We solved a binary classification task as the downstream task on both datasets. For the spine data, label 1 indicates lumbar spinal stenosis and label 0 indicates normal. In the fundus dataset, label 1 indicates an abnormal fundus image and label 0 indicates a normal one.
The criterion was the same as in the active learning experiments on the public data in Tables 5, 6, 7, and 8, and we trained for 200 epochs. For the spine dataset, we used a batch size of 20 and trained with 300 labeled images in the initial cycle. In each cycle, 300 images were added from 700 randomly sampled data points from the unlabeled pool of images. The spine images were 700 × 600 pixels and were resized to 512 × 512 for use as the model input. For the fundus dataset, we used a batch size of 500, trained with 500 images in the initial cycle, and added 500 labeled images in each cycle from 3,000 randomly sampled data points from the unlabeled pool of images. The fundus images were 256 × 256 pixels, but we resized them to 128 × 128 for use as the model input.
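A minimal sketch of one such acquisition cycle, under the sampling sizes quoted above, might look as follows; the function name and scoring interface are hypothetical.

```python
import torch

@torch.no_grad()
def select_for_labeling(predict_loss, unlabeled, subset_size, budget):
    """One acquisition cycle as described above: draw a random subset of the
    unlabeled pool, score it with the loss-prediction head, and return the
    pool indices of the highest-scoring samples to send for labeling."""
    pool_idx = torch.randperm(len(unlabeled))[:subset_size]   # e.g., 700 or 3,000
    scores = predict_loss(unlabeled[pool_idx])                # predicted losses
    top = scores.topk(min(budget, subset_size)).indices       # e.g., 300 or 500
    return pool_idx[top]

# Toy usage with a random scorer standing in for a trained loss-prediction head.
pool = torch.randn(2000, 3, 32, 32)
chosen = select_for_labeling(lambda x: x.flatten(1).abs().mean(dim=1), pool, 700, 300)
```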
As shown in Tables 9 and 10, our method also outperformed the baseline architecture on the private datasets. The bold numbers denote the best performance in both tables. As shown in Table 9 for the spine dataset, our model outperformed the baseline in 6 out of 7 cycles, and as reported in Table 10 for the fundus dataset, in 5 out of 7 cycles. In the fourth cycle on the spine dataset, the accuracy of the baseline dropped to 40.90, whereas SimPara achieved 67.01 in the same cycle. Hence, we observed the instability of the baseline method and the stabilizing effect of our module, which we credit to the scalar parameters.

G. EXPLORING MULTI-LABEL LEARNING
We examined how our architecture performs on the multilabel FGVC-Aircraft dataset, which has multiple hierarchies, namely variant, family, and manufacturer, in fine-to-coarse order with regard to its classes. The proposed approach was compared with two other approaches: training an independent individual classifier for each of the three labels (Individual Classifier) and the conventional approach to multitask learning (Hard Parameter Sharing). Note that training individual classifiers is the most computationally costly method.
The results in Table 11 show that adapting various levels of feature maps does not always lead to the best performance, especially when the tasks are so highly related that the use of the last layer becomes extremely dominant. In Table 11, the best-performing method was Hard Parameter Sharing and the worst-performing method was Individual Classifier, while our method's performance lies in between. Our interpretation is that although there was positive knowledge transfer between the three labels, the impact of the last layer was exceptionally dominant and useful, so the hard parameter sharing approach performed better than our approach, which considers more levels of features. We can also infer from this result that adapting the proposed method to state-of-the-art vision models for a sole classification task will not lead to an increase in performance.

V. CONCLUSION
Although many approaches to multitask learning have been introduced, many studies still prefer the most efficient approach, hard parameter sharing, in multitask learning applications. The main reason is to preserve the memory and time efficiency needed for the application. Therefore, as an enhanced version of hard parameter sharing, we introduced an approach that adapts various feature maps from vision models with trainable weights to control their impact on the tasks. Our approach has shown improved performance in two multitask learning settings (i.e., auxiliary task learning and active learning) with various datasets.
However, a better understanding is needed of which tasks our model performs best on. Here, we observed that when the tasks learned simultaneously have different characteristics, our approach performs well, and less so when they do not. We leave it to future work to explore ways in which our approach (i.e., concatenation of differently weighted features) could also enhance sole-task models. Lastly, we conclude that our approach generally performs better than hard parameter sharing in multitask learning and also achieved our goal of enhancing the performance of the original baseline model for active learning.

FIGURE 1. Architecture of hard parameter sharing.

FIGURE 2. Architecture of SimPara, which closely resembles the hard parameter sharing method (Figure 1). Indigo layers indicate layers shared among tasks; dark orange and green layers are task-specific layers for tasks A and B, respectively. Alpha (α), beta (β), and gamma (γ) denote the trainable weights.

TABLE 3. Top-1 accuracy and standard deviation over three runs for naïve LoRot-E and LoRot-E with SimPara.

FIGURE 4. t-SNE visualizations of each feature immediately before the feature-wise concatenation, for 10 randomly selected classes from the validation set, using the model with the top-1 pretext task accuracy of 61.629 trained with SimPara on the highly imbalanced CIFAR-10 dataset. The labels indicate the rotation labels of the pretext task, and each plot contains 500 data points. (a) shows the features from the first block, (b) from the second block, and (c) from the last block.

TABLE 4. Ablation study of the architectural components of the proposed approach.

FIGURE 5. Changing patterns of the trainable scalar weights during training for 200 epochs in auxiliary task learning (LoRot-E) on different datasets. The red, blue, and black lines denote the trend of change, averaged over multiple runs, of the weight applied to the feature extracted from the first, second, and last block, respectively.

FIGURE 6. Data statistics of the private datasets from SNUH.

FIGURE 7. Two data samples for each label from each private SNUH dataset. Samples in (a) and (b) are from the spine dataset; samples in (c) and (d) are from the fundus dataset. (a) shows a normal spine image (label 0), (b) spinal stenosis (label 1), (c) a normal fundus image (label 0), and (d) an abnormal fundus image (label 1).
CHANGWOO LEE received the B.S. degree in industrial engineering and statistics from Inha University. He is currently pursuing the Ph.D. degree with the Department of Medical Device Development, Seoul National University. He is a Researcher with the Department of Transdisciplinary Medicine, Seoul National University Hospital. His current research interest includes machine and deep learning using medical images.

HYUK JIN CHOI received the M.D. and Ph.D. degrees. He is currently a Clinical Professor with the Department of Ophthalmology, Seoul National University Hospital Healthcare System Gangnam Center. He specializes in cornea, external eye disease, and cataracts. His current research interest includes healthcare systems-based ophthalmic research.

CHANG-HYUN LEE received the M.D. and Ph.D. degrees. He is currently an Associate Professor with the Department of Neurosurgery, Seoul National University Hospital, SNU Medicine. His current research interests include spine surgery and AI.

BYOUNGJUN JEON received the Ph.D. degree. He is currently a Research Professor with the Office of Hospital Information, Seoul National University Hospital. His current research interests include medical informatics and drug screening.

EUI KYU CHIE received the M.D. and Ph.D. degrees. He is currently the Deputy VP of Seoul National University and a Professor with the Department of Radiation Oncology, Seoul National University College of Medicine. He was the PI of the ''Intensive care platform of Million Patient information for AI CDSS Technology (IMPACT)'' project consortium. His current research interests include precision radiation oncology and expanding to medical informatics.

YOUNG-GON KIM received the Ph.D. degree. He is currently the Deputy Head of the AI Division and an Assistant Professor with the Department of Transdisciplinary Medicine, Seoul National University Hospital. His current research interest includes medical image processing for disease diagnosis and prognosis using machine and deep learning.

TABLE 2. Number of parameters and notation for each model.

TABLE 5. Active learning results for image classification on CIFAR-10 (ResNet-18).

TABLE 6. Active learning results for image classification on CIFAR-10 (ResNet-101).

TABLE 7. Active learning results for image classification on CIFAR-100 (ResNet-18).

TABLE 8. Active learning results for image classification on CIFAR-100 (ResNet-101).

TABLE 9. Active learning results for image classification on the spine dataset from SNUH.

TABLE 10. Active learning results for image classification on the fundus dataset from SNUH.