Coupling Semi-Supervised and Multiple Instance Learning for Histopathological Image Classification

The annotation of large datasets is often the bottleneck for the successful application of artificial intelligence in computational pathology. Commonly, Whole Slide Images (WSIs) are sliced into patches and a machine learning model is trained in a supervised fashion to predict the label of each patch. Unfortunately, obtaining detailed patch-level annotations is time-consuming and expensive, which is why Multiple Instance Learning (MIL) and Semi-Supervised Learning (SSL) approaches have recently gained popularity for training with fewer annotations. In this work we couple SSL and MIL to train a deep learning classifier that combines the advantages of both methods and overcomes their limitations. Our method is able to learn from the global WSI diagnosis and a combination of labeled and unlabeled patches. Furthermore, we propose and evaluate an efficient labeling paradigm that guarantees a strong classification performance when combined with our learning framework. We perform extensive experiments on three different public cancer datasets: SICAPv2, PANDA and Camelyon16. The advantages of each model component as well as the efficient labeling technique are empirically proven and the performance gains in comparison to the SSL and MIL baselines are highlighted. We also compare to the state of the art and to completely supervised training. With only a small percentage of patch labels our proposed model achieves a competitive performance on SICAPv2 (Cohen's kappa of 0.801 with 450 patch labels), PANDA (Cohen's kappa of 0.794 with 22,023 patch labels) and Camelyon16 (ROC AUC of 0.913 with 433 patch labels). Our code is publicly available: https://github.com/arneschmidt/ssl_and_mil_cancer_classification.


I. INTRODUCTION
The analysis of histopathological biopsies is the gold standard for the diagnosis of many different cancer types. In recent years, Computer-Aided Diagnosis (CAD) systems based on artificial intelligence have gained attention as a promising tool to reduce pathologists' workload, improve repeatability and avoid the variability of diagnostic processes. For the training of deep learning algorithms, many initial approaches relied on detailed local-level annotations of the digitized biopsies by pathologists [1].
Unfortunately, due to the large size of the WSIs, this annotation process is a time-consuming task, which makes it difficult to obtain large and heterogeneous annotated datasets. This recently led to the rise of approaches that do not need detailed
local-level annotations. Instead, they utilize the MIL assumption, where the image patches form the instances and the complete WSI the bag [2]. In this setting, no patch-level annotations are needed and only the diagnosis of the biopsies is used for training. Another strategy to learn with fewer patch-level annotations is SSL, where only a subset of the image patches is labeled.

Existing MIL approaches aggregate either the instance features (embedding-based) or their predictions (instance-based). Recently, the classical aggregation functions based on max or average pooling have been replaced by more advanced mechanisms, such as learnable attention methods [4]. Campanella et al. [2] showed promising results processing the top-ranked positive instance features with an RNN. In other works, instance-based aggregations based on top- and bottom-ranked instances [5] or min-max aggregation [6] have been proposed. Further approaches use embedding-based MIL via multi-head attention mechanisms [7] or combine instance-level predictions with embeddings [8]. Hashimoto et al. [9] use multiple scales with attention mechanisms and domain adversarial training for malignant lymphoma subtype classification. Common limitations of existing approaches are the requirement of very large datasets [2] and the incapability to make class predictions at instance level [2], [4], [9]. Furthermore, recent approaches often include complex multi-stage training procedures with multiple models [2], [9]. This motivates the development of well-performing but simpler approaches for an easy application in clinical practice.

SSL approaches use labeled and unlabeled patches for training. Otalora et al. [10] take ideas from Teacher-Student procedures [11] to obtain pseudo labels, and Pulido et al. [12] apply MixMatch [13] and FixMatch [14] under a highly noisy and imbalanced data setting. The SSL component of our work is related to FixMatch and Unsupervised Data Augmentation (UDA) [15]: UDA proposes to use unlabeled images for so-called consistency regularization. FixMatch extends the idea of consistency regularization with pseudo labels: based on weak image augmentations, pseudo labels are assigned to confident predictions, while the network is trained with strong image augmentations. The drawback of common SSL methods is that they do not make use of global information (bag labels) and always require a certain amount of labeled instances.

Inspired by these previous works on MIL and SSL, we propose a method able to take advantage of both learning strategies: it incorporates the augmentation strategy of FixMatch [14] and the consistency regularization of Unsupervised Data Augmentation (UDA) [15], while the pseudo label assignment is driven by the MIL perspective. As a result, the proposed method inherits the advantages of SSL and MIL while overcoming their existing limitations: our method achieves competitive results on small datasets, provides multi-class instance-level predictions, and only needs one training procedure, one stage and one model.

For all negative bags, every patch is known to be non-cancerous (class N_C):

y_bi = N_C, ∀i ∈ I_b. (2)

For all positive bags we know that some patches must contain the pattern of the present cancer class Y_b. For each bag b ∈ B+:

∃i ∈ I_b : y_bi = Y_b, (3)

which in the case of a primary and secondary label applies to both Y1_b and Y2_b. Note that the targets y are represented as a C-dimensional probability vector, with each dimension representing one class probability, and the
class labels are described as one-hot vectors.

We propose a data setting that we name Efficient Labeling (EL): for each cancerous WSI the pathologist only points out a few cancerous patches instead of annotating the whole WSI. For each global label Y_b some corresponding patch labels y_bi = Y_b are assigned. We consider the annotation of a few cancerous patches per WSI a realistic and time-efficient strategy for the annotation of a new dataset from scratch or for the data collection in already deployed CAD systems. In the latter case the pathologist provides labels during the diagnostic process (human-in-the-loop, see e.g. [16]). In our experiments, this data setting is simulated by randomly picking a certain amount of patch labels and hiding the others during model training. This allows us to systematically study the effect of a varying amount of patch labels. We divide the indices of each positive bag into the set of labeled (L_b ⊂ I_b) and the set of unlabeled (U_b ⊂ I_b) instances, such that all labels {y_bi | i ∈ L_b} are available due to the pathologist's annotation, while the labels {y_bi | i ∈ U_b} remain unknown.
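For illustration, this EL simulation could be sketched as follows. This is a minimal Python sketch with hypothetical names (simulate_efficient_labeling, UNLABELED); it is not part of the released implementation.

```python
import random

UNLABELED = -1  # marker for patches whose label is hidden during training

def simulate_efficient_labeling(patch_labels, global_labels, patches_per_label, seed=0):
    """Hide all but `patches_per_label` patch labels per global (WSI-level) label.

    patch_labels:  list of int class labels, one per patch of the WSI.
    global_labels: class labels present at WSI level
                   (e.g. primary and secondary Gleason grade).
    Returns a new list in which hidden patches are set to UNLABELED.
    """
    rng = random.Random(seed)
    visible = [UNLABELED] * len(patch_labels)
    for g in global_labels:
        candidates = [i for i, y in enumerate(patch_labels) if y == g]
        for i in rng.sample(candidates, min(patches_per_label, len(candidates))):
            visible[i] = patch_labels[i]
    return visible

# Example: a WSI with primary grade 4 and secondary grade 3, keeping P = 2 labels per grade.
labels = [0, 3, 4, 4, 3, 4, 0, 3]
print(simulate_efficient_labeling(labels, global_labels=[4, 3], patches_per_label=2))
```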

The training procedure (Figure 1) can be applied to any classification model and is divided into three steps that are repeated for each training epoch:

FIGURE 1. Proposed training framework for cancerous WSIs, combining MIL and SSL. We take all patches of the WSI and apply a weak augmentation to obtain soft labels and pseudo labels from the CNN predictions. Based on these labels, we train the same CNN with the strongly augmented patches.

Step 1: Obtain the CNN predictions for the weakly augmented image patches in the positive bags. For a given image patch x_bi of a positive bag b ∈ B+, we apply the weak image augmentation α. The weakly augmented patch α(x_bi) is passed through the CNN to obtain the output probability vector, which we define as ŷ_bi = p_θ(y | α(x_bi)). As some of these probability vectors will later serve as training targets, we call ŷ soft labels.
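A minimal sketch of this step, assuming a PyTorch classifier and a callable weak augmentation (both hypothetical names, not the paper's stated implementation), could be:

```python
import torch

@torch.no_grad()
def compute_soft_labels(model, patches, weak_augment):
    """Step 1 (sketch): predict soft labels ŷ from weakly augmented patches.

    patches:      tensor of shape (num_patches, 3, H, W) for one positive bag.
    weak_augment: callable implementing the weak augmentation α.
    """
    model.eval()
    logits = model(weak_augment(patches))
    return torch.softmax(logits, dim=1)  # ŷ_bi = p_θ(y | α(x_bi))
```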

Step 2: Calculate pseudo labels for each positive bag b ∈ B+. We know from equation (3) that some patches must have the same class as the WSI. Given the global label Y_b of the bag, we assign this pseudo label to the k patches whose predicted probability for class Y_b is the largest among all instances in the bag. Concretely, this is done by the following steps: (i) create a list of probability vectors ŷ_bi ordered with respect to class Y_b; (ii) select the first k items of this list to define the index set P_b ⊂ I_b; (iii) assign the one-hot class label Y_b to the patches indexed by P_b as pseudo label y^ps. In the case of two or more global labels (e.g. a primary and a secondary Gleason grade), this pseudo label assignment is performed for each of them.
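The top-k selection of Step 2 can be sketched as follows; again a hypothetical PyTorch sketch operating on the soft labels of one positive bag, not the released code:

```python
import torch

def assign_pseudo_labels(soft_labels, bag_class, k, num_classes):
    """Top-k pseudo-label assignment for one positive bag (sketch of Step 2).

    soft_labels: tensor of shape (num_patches, num_classes) holding the soft
                 labels ŷ predicted from the weakly augmented patches (Step 1).
    bag_class:   integer index of the global WSI label Y_b.
    Returns the index set P_b and the one-hot pseudo labels for those patches.
    """
    # (i) order the patches by their predicted probability of the bag class
    scores = soft_labels[:, bag_class]
    # (ii) the k highest-scoring patches form the index set P_b
    p_b = torch.topk(scores, k=min(k, soft_labels.shape[0])).indices
    # (iii) assign the one-hot label of the bag class as pseudo label
    pseudo = torch.zeros(len(p_b), num_classes)
    pseudo[:, bag_class] = 1.0
    return p_b, pseudo
```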

Step 3: Use the strongly augmented image patches β(x) and a combination of ground-truth labels, pseudo labels and soft labels to train the CNN. The loss function combines the cross-entropy H(·, ·) over these three types of targets, and λ is a hyperparameter that assigns a higher weight to part of the loss terms (equation (6)).
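As an illustration, a loss of this form, under the assumption that λ up-weights the ground-truth and pseudo-labeled terms while the remaining unlabeled patches are trained against their soft labels (an assumption on our part; the published equation (6) should be consulted for the exact form), can be written as:

\mathcal{L}(\theta) \;=\; \sum_{b \in B} \Big[\; \lambda \sum_{i \in L_b} H\big(y_{bi},\, p_\theta(y \mid \beta(x_{bi}))\big) \;+\; \lambda \sum_{i \in P_b} H\big(y^{ps}_{bi},\, p_\theta(y \mid \beta(x_{bi}))\big) \;+\; \sum_{i \in U_b \setminus P_b} H\big(\hat{y}_{bi},\, p_\theta(y \mid \beta(x_{bi}))\big) \Big].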
FIGURE 2. Simplified illustration of label propagation with weak and strong image augmentation. The data points correspond to image patches in our case. Shown are two example points x1 and x2 with the corresponding regions of weak and strong image augmentation (Vα(x) and Vβ(x), respectively).
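Putting the three steps together, one training iteration on a positive bag could look roughly as follows. This sketch reuses the hypothetical helpers from the previous snippets, marks unlabeled patches with -1, handles a single global label per bag for brevity, and follows our assumed weighting of the loss terms rather than the published equation (6):

```python
import torch
import torch.nn.functional as F

def train_step_positive_bag(model, optimizer, patches, labels, bag_class,
                            weak_augment, strong_augment, k, lam, num_classes):
    """One training iteration on a positive bag (sketch; unlabeled patches have label -1)."""
    soft = compute_soft_labels(model, patches, weak_augment)                 # Step 1
    unlabeled = torch.tensor([i for i, y in enumerate(labels) if y == -1],
                             dtype=torch.long)
    p_b, _ = assign_pseudo_labels(soft[unlabeled], bag_class, k, num_classes)  # Step 2
    pseudo_idx = unlabeled[p_b]

    model.train()
    log_probs = F.log_softmax(model(strong_augment(patches)), dim=1)         # Step 3
    loss = 0.0
    for i, y in enumerate(labels):
        if y != -1:                              # ground-truth patch label
            loss = loss + lam * F.nll_loss(log_probs[i:i + 1], torch.tensor([y]))
        elif i in pseudo_idx:                    # pseudo label = one-hot bag class
            loss = loss + lam * F.nll_loss(log_probs[i:i + 1], torch.tensor([bag_class]))
        else:                                    # soft label (consistency term)
            loss = loss - (soft[i] * log_probs[i]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```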

FIGURE 3. Comparison of data label settings: Efficient Labeling (EL) with P annotated patches per primary and secondary Gleason grade of all WSIs, and Complete Annotation (CA) of W WSIs with all the patch labels. We plot the mean and standard deviation of Cohen's quadratic kappa (patch level) of five runs against the total amount of annotated training patches of SICAPv2.
We used the four-fold cross-validation of SICAPv2 for the model selection and observed that an increasing complexity of the model indeed led to a better performance up to EfficientNet-B5. Models B6 and B7 did not show any further improvements, so we chose EfficientNet-B5 as our backbone. The number of pseudo labels k (used in Step 2 of Section II-C; tested values 3, 5, 10) and the weight λ = 3 (used in equation (6) of Section II-D; tested λ = 1, 2, 3, 5) were set to the values which showed the best results. The network was fine-tuned with stochastic gradient descent and a learning rate of 0.01. For relatively small datasets such as SICAPv2, a class-balanced loss was used (based on the true labels y and the estimated labels ŷ) to stabilize the training, as we sometimes observed convergence to 'bad' local minima (for the experiments P = 0 and P = 1 in Figure 3). We resized the image patches to 250x250, which is the input resolution of our model. For image augmentation, brightness shift, random flip and rotation were used. The difference between weak and strong augmentation in our experiments was the intensity of the brightness shift (multiplication of the alpha channel with a factor), which leads to a darker or brighter version of the original image. While the weak augmentation α uses random brightness shift factors between 0.9 and 1.1, the strong augmentation β uses a range from 0.5 to 1.5. The stronger the brightness shift, the harder it becomes to visually recognize the patterns in the images.
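For illustration, the weak and strong augmentations described above could be instantiated as follows; this is a minimal sketch assuming a PyTorch/torchvision pipeline (our assumption, not necessarily the framework used for the published experiments), with only the brightness ranges taken from the text:

```python
import torchvision.transforms as T

# Weak augmentation α: mild brightness shift (factor in [0.9, 1.1]) plus flips and rotation.
weak_augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomRotation(degrees=90),
    T.ColorJitter(brightness=(0.9, 1.1)),
])

# Strong augmentation β: same transforms, but a much wider brightness range [0.5, 1.5].
strong_augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomRotation(degrees=90),
    T.ColorJitter(brightness=(0.5, 1.5)),
])
```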

and secondary Gleason grade of each WSI. For the second approach (CA) we randomly selected W WSIs and used all patch labels of this selection for training. In Figure 3 we observe that the EL setting required substantially fewer labels than CA to obtain good results. We explain this by the higher variability of the annotated patches with EL, which allows the network to learn from more diverse examples. The annotated patches of CA have a higher co-similarity and therefore contribute less information to the model training. The steep ascent of the performance from P = 0 to P = 5 proves the efficiency of our learning approach and EL. To estimate the time and resources (to annotate a dataset) saved by training a model with our approach, we use the total amount of local annotations. Concretely, we count the total number of labeled patches used for training and compare settings with a reduced number of patches to supervised training using all available patch labels. In the case of SICAPv2, our model with P = 5 and EL showed a performance close to the supervised one while only using 450 of the 4384 available patch labels. This means that approximately 10 times fewer labeled patches were needed for training. For PANDA, the ratio of saved labeling effort is comparable: the model trained with P = 5 uses approximately 10 times fewer patch annotations for training than the supervised setup, but with much higher absolute numbers: while the supervised model is trained with 205,111 labeled patches, the model with P = 5 obtained 22,023 patch labels. For Camelyon16 (Table 3), the advantage is even bigger: the model with P = 5 and EL used 433 patch labels, while the supervised model was trained with all available patch labels.
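The reported ratios follow directly from the label counts above: 4384 / 450 ≈ 9.7 for SICAPv2 and 205,111 / 22,023 ≈ 9.3 for PANDA, i.e. roughly a tenfold reduction of annotated patches in both cases.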

FIGURE 4. Visual example of model predictions for a test WSI of SICAPv2. The cancerous areas are marked in green (Gleason grade 3), blue (Gleason grade 4) and red (Gleason grade 5). We compare the model predictions trained with P = 0 (MIL), P = 5 (some patch labels) and P = all (supervised), corresponding to the patch labels per class and WSI available during training. The marked areas of the predictions are interpolated from patch-level predictions and are therefore not as fine-grained as the ground-truth annotation. While the model trained in a MIL setting (a) correctly identifies the cancerous areas, the predicted classes are incorrect. The model predictions with the setting P = 5, depicted in (b), are very similar to those of the supervised model (c) and the ground truth (d).
All three datasets provide patch-level annotations, which made them particularly suitable for the validation of the proposed method. The SICAPv2 [21] dataset 1 was used to validate the proposed method on prostate cancer for the multiclass Gleason grading scenario. This dataset contains 155 prostate WSIs, which are sliced into 512x512 overlapping patches. The primary and secondary Gleason grade for all WSIs, as well as patch-level labels for a large number of instances, are included in the dataset. In our work, we maintained the proposed partitions of the original dataset for training, validation and testing.

Additionally, we use the PANDA dataset 2 for prostate cancer grading. It contains WSIs from two medical centers ('Radboud' and 'Karolinska'), but only the WSIs from Radboud have local annotations of the Gleason grade, while the annotations from Karolinska only distinguish non-cancerous and cancerous tissue. The exact classes of these cancerous patches remain unknown, so they can only be used as unlabeled data for training. We disregard all patches with less than 50% tissue and assign the label 'non-cancerous' to all patches that have at least 95% of their pixels annotated as non-cancerous or background. To the cancerous patches of Radboud we assign patch labels according to the local Gleason grade annotations.

1 Available at: https://data.mendeley.com/datasets/9xxm58dvs3/1
2 Available at: https://www.kaggle.com/c/prostate-cancer-grade-assessment

TABLE 1.
Ablation studies on SICAPv2 with 5 patch labels per cancerous WSI and global label. The results are reported as the mean and standard deviation of 5 independent runs.

D. ABLATION STUDIES
To study the effect of the different loss components and the improvement over the SSL and MIL baselines, we performed an ablation study on the SICAPv2 dataset with efficient labeling (see Section II-A) and 5 patch labels per WSI and global label (equal to the experiment P = 5 of Section III-E). In Table 1 we first compare four different label settings: only the available ground truth labels (GT), ground truth and soft labels (GT + SL), ground truth and pseudo labels (GT + PL), and the combination of all of them (GT + PL + SL). For the MIL baseline, we disregarded the available patch labels for training.

TABLE 2.
Comparison with previous works on patch-level Gleason grading of prostate cancer. We report the average result of 5 independent runs. Results are reported on different datasets, patch sizes and resolutions, see *.

TABLE 3.
Comparison with previous works on metastasis detection in sentinel lymph nodes of breast cancer patients (Camelyon16). Our reported results are the average of 3 independent train and test runs.
** Tested on Camelyon16, trained on the MSK breast dataset (total 9,894 WSIs, see [2] for details).
*** Trained and tested on the MSK breast dataset (total 9,894 WSIs, see [2] for details).

Max-pooling inspired our pseudo-label assignment and is commonly used, e.g. in [2]: the global label was assigned to the one patch with the highest class probability.

E. EFFICIENT LABELING VS. COMPLETE ANNOTATION
In the next experiment we compared two different data settings for the prostate cancer dataset SICAPv2: we wanted to study whether, with limited resources, it is better to use Efficient Labeling (EL, see Section II) with a few patch labels from all available WSIs or a few WSIs with Complete Annotation (CA). In the first case (EL) we randomly sampled a certain amount P of patch labels for the primary and secondary Gleason grade of each WSI.