Graph Segmentation-Based Pseudo-Labeling for Semi-Supervised Pathology Image Classification

Pathology image classification is an important step in cancer diagnosis and precision treatment. Training a pathology image classification model in a fully supervised manner requires exhaustive pixel-level manual annotations from pathologists, which may not be practical in real applications. Semi-supervised learning (SSL) has been widely used to exploit large amounts of unlabeled data to facilitate model training with a small set of labeled data. However, due to the limited annotations, it still suffers from the issue of inaccurate pseudo-labels of unlabeled data. In this paper, we propose a novel framework for semi-supervised pathology image classification, which incorporates graph-based segmentation to refine initial pseudo-labels of tissue regions by considering local and global contextual relationships of patches in whole-slide images (WSIs). Moreover, we define a new energy function for graph construction that allows the graph to take into account the uncertainty of network predictions on unlabeled data. Extensive experiments on two different pathology image datasets demonstrate the effectiveness of our method compared with state-of-the-art SSL baselines. In particular, when using 5% labeled data, our approach outperforms a strong baseline by 2.81% AUC.

The associate editor coordinating the review of this manuscript and approving it for publication was Chulhong Kim.

lesions, deep learning methods require region-level or even pixel-level annotations on WSIs since they generally take patches as inputs. Therefore, if sufficiently large and finely annotated WSI data are available, deep learning can deliver excellent and stable performance, as it has already shown in natural image domains [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. However, obtaining such large and exhaustively annotated data on WSIs is an extremely laborious and time-consuming process because (1) WSIs are tremendously large high-resolution images (sometimes larger than 100,000 × 100,000 pixels) and (2) WSIs are highly heterogeneous and complex as a consequence of acquisition procedures. To acquire region-level annotations on WSIs (localizing tumor regions or isolated tumor cells), specialized pathologists need to examine tissue regions at multiple magnification levels to consider both the context and the details of tissue. This process hinders the potential of fully supervised learning, which requires as much annotated data as possible.

Semi-supervised learning (SSL) has been actively researched in the domain of natural images [14] and recently

as new pseudo-labels of unlabeled data and the network is retrained on both labeled and pseudo-labeled data. Our main contributions are summarized as follows:
• We propose a novel approach to alleviate the inherent limitation of SSL, i.e., strong dependency on the labeled data, by incorporating the classical graph-based segmentation method. To the best of our knowledge, this is the first work that uses graph-based segmentation to refine pseudo-labels.
• We design a regional term correction scheme to make confident pseudo-labels more significant in graph-based segmentation.
• We outperform conventional SSL-based pathology image classification methods on public pathology datasets, especially when annotations are highly limited (1-2%).

The remainder of the paper is organized as follows. Section II reviews recent work on SSL and pathology image classification. The details of our method are described in Section III. The experimental results, ablation studies, and performance comparisons are provided in Section IV. Finally, Section V presents our conclusions.

Pseudo-labeling [25] is one of the most popular SSL methods. The goal of pseudo-labeling is to create pseudo-labels for unlabeled data, where the pseudo-labels are obtained from the predictions of the model trained on labeled data. The model is then retrained using both labeled and pseudo-labeled data. In this direction, Lee et al. [25] directly used network predictions to obtain hard pseudo-labels. Shi et al. [27] additionally considered confidence scores based on the density of local neighbors in the feature space, and Iscen et al. [28] proposed a label propagation method to assign pseudo-labels to unlabeled data. Moreover, another study demonstrated theoretical support for using network predictions as pseudo-labels [29].

However, pseudo-labeling inherently has a strong dependency on the reliability of pseudo-labels; thus, there have been various approaches to tackle this problem. Arazo et al. [30] showed that naive pseudo-labeling is prone to overfitting to incorrect pseudo-labels due to confirmation bias and proposed generating soft pseudo-labels. Rizve et al. [31] argued that conventional pseudo-labeling-based methods underperform due to incorrect pseudo-labels and proposed an uncertainty-aware pseudo-label selection method. Wang

FIGURE 1. Overall framework of the proposed method. The CNN is first pretrained on the patches extracted from the labeled data (gray arrow). The patches extracted from the unlabeled data are then fed to the pretrained CNN to obtain pseudo-labels, resulting in a pseudo-labeled WSI. Next, the pseudo-labels are refined via graph-based segmentation, resulting in a refined WSI. Finally, the refined pseudo-labels are added to the training data for retraining.

WSIs using the InceptionNet-based architecture [10].

, [17], [18], [19], [20], [21], [22].

Pseudo-labeling-based approaches have also proven effective in the computational pathology domain [17], [21], [22]. Jaiswal et al. [17] investigated the effectiveness of pseudo-labels in breast cancer detection of lymph node metastases. In addition, Shaw et al. [21] utilized pseudo-labeling in the form of a teacher-student chain to fine-tune the model for colorectal cancer classification. Silva-Rodriguez et al. [22] also proposed a teacher-student framework for Gleason score prediction. In particular, the teacher model is trained via the multiple instance learning framework, and the student model is trained on pseudo-labels generated by the teacher model. Although the above methods [17], [21], [22] investigated the effectiveness of pseudo-labels in pathology image classification, the inherent problem of pseudo-label-based methods, i.e., the strong dependency on the reliability of pseudo-labels, still exists. Therefore, it is desirable to find a way to obtain more reliable pseudo-labels and design a framework to use them as supervision for effective network training.
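To make the basic pseudo-labeling step concrete, the following minimal sketch turns per-patch tumor probabilities into hard pseudo-labels and, optionally, keeps only confident predictions. The function names, the 0.5 decision threshold, and the 0.1/0.9 confidence bounds are illustrative assumptions; the confidence filter is a crude stand-in for the idea of discarding unreliable pseudo-labels and is not the selection rule of any cited method.

```python
from typing import List


def hard_pseudo_labels(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Turn tumor probabilities into hard labels (1 = tumor, 0 = normal).

    The 0.5 threshold is an assumption made for illustration.
    """
    return [1 if p >= threshold else 0 for p in probs]


def confident_indices(probs: List[float], low: float = 0.1, high: float = 0.9) -> List[int]:
    """Indices of predictions far from the decision boundary (assumed bounds)."""
    return [i for i, p in enumerate(probs) if p <= low or p >= high]


# Toy probabilities for four patches (not taken from the paper).
probs = [0.97, 0.55, 0.08, 0.42]
labels = hard_pseudo_labels(probs)
kept = confident_indices(probs)
```

Hard labels discard the information that the second and fourth patches are close to the decision boundary, which is the weakness that soft-label and uncertainty-aware variants try to address.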

In this section, we first formally define our problem and the notations used in this paper. We then provide an overview and details of the proposed SSL framework. Finally, we provide more details about pseudo-label refinement, a key component of our proposed method.

Fig. 1) is trained on $D^{tr}_{l}$ as follows:

the unlabeled data $D_{u}$, we obtain the pseudo-labeled data

Then, $D^{tr}_{p}$ is added to the initial training data $D^{tr}_{l}$, yielding combined data $D^{tr}_{lp} = D^{tr}_{l} \cup D^{tr}_{p}$.

Using $D^{tr}_{lp}$, $f$ can be retrained as follows:

for all WSIs in the unlabeled data $D_{u}$, we obtain refined pseudo-labeled data $D^{tr}_{p+}$. $D^{tr}_{p+}$ is added to the initial training data $D^{tr}_{l}$, and the CNN is retrained on $D^{tr}_{lp+} = D^{tr}_{l} \cup D^{tr}_{p+}$ as follows:

where $\theta_{lp+}$ is a set of network parameters after training with $D^{tr}_{lp+}$.

A key component of the proposed framework is to refine the pseudo-labels of unlabeled data and provide better supervision to the CNN for patch-level classification, as illustrated in Fig. 3. To consider the local and global contextual relationships of patches in an unlabeled WSI $X^{u} \in D_{u}$, we first construct a slide-wise graph structure from its patches. Then, we define an energy function $E$ to obtain a refined pseudo-label $\bar{Y}^{u}$ from the initial pseudo-label $Y^{u}$. Here, the initial pseudo-label $Y^{u}$ is used as initial seeds for estimating the parameters of the energy function $E$, and the prediction result $P^{u}$ is used to modify the energy function $E$ to leverage the reliability of the initial pseudo-label. The refined pseudo-label $\bar{Y}^{u}$ is obtained by minimizing the energy function $E$ and is used to construct the refined pseudo-labeled data $D^{tr}_{p+}$, which is added to the initial training data $D^{tr}_{l}$ to retrain the CNN using (3).
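The overall procedure (pretrain on labeled patches, predict on unlabeled slides, refine the pseudo-labels, retrain on the union) can be sketched as a single function. This is only an outline under stated assumptions: `train`, `predict`, and `refine` are hypothetical callables supplied by the caller, the 0.5 threshold is assumed, and nothing here reproduces the actual losses or the graph-based refinement.

```python
from typing import Any, Callable, List, Sequence, Tuple

Patch = Any
LabeledData = List[Tuple[Patch, int]]


def ssl_round(
    train: Callable[[LabeledData], Any],
    predict: Callable[[Any, Patch], float],
    refine: Callable[[Sequence[Patch], List[int], List[float]], List[int]],
    labeled: LabeledData,
    unlabeled_slides: Sequence[Sequence[Patch]],
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[Any, LabeledData]:
    """One pretrain / pseudo-label / refine / retrain pass."""
    model = train(labeled)  # pretraining on the labeled patches only
    refined: LabeledData = []
    for slide in unlabeled_slides:
        probs = [predict(model, x) for x in slide]         # network predictions
        initial = [1 if p >= threshold else 0 for p in probs]  # initial pseudo-labels
        better = refine(slide, initial, probs)             # slide-wise refinement
        refined.extend(zip(slide, better))
    combined = list(labeled) + refined  # labeled data united with refined pseudo-labels
    return train(combined), combined


# Toy stand-ins (hypothetical) so the sketch can be executed end to end.
model, combined = ssl_round(
    train=lambda data: len(data),
    predict=lambda m, x: x,
    refine=lambda slide, init, probs: init,
    labeled=[(0.9, 1)],
    unlabeled_slides=[[0.2, 0.8]],
)
```

Passing the refinement step in as a callable keeps the sketch agnostic about how the pseudo-labels are improved; an identity `refine` reduces it to plain pseudo-labeling.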

A graph $G = \langle \mathcal{V}, \mathcal{E} \rangle$ consists of a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$. In this work, we use a patch-graph such that the patches from an unlabeled WSI are defined as the nodes of our graph. For simplicity, we shall omit the superscript $u$ and the subscript $j$. A WSI and its patches are thus denoted as $X$ and $\{x_k\}_{k=1}^{n}$, respectively. Also, each pair of connected nodes is defined as a single edge $e = \{x_p, x_q\} \in \mathcal{E}$, where $p$ and $q$ are node indexes, and the 8-neighbors are used to define connectivity. Here, the edges between graph nodes are called n-links ($\mathcal{N}$), which represent the informative relationship between neighboring nodes. In addition to the graph nodes, there are two special terminal nodes, called $S$ and $T$. In our graph, $S$ and $T$ correspond to the tumor and normal nodes, respectively. Other types of edges connecting graph nodes to these terminals are called t-links ($\mathcal{T}$),

FIGURE 3. Illustration of the proposed pseudo-label refinement. Based on the pseudo-labels obtained from the network prediction, a GMM is constructed for each class (normal and tumor). Using the Gaussian probability of each GMM, the edge weights between the S/T nodes and graph nodes are assigned. The network prediction is used to adjust the edge weights to assign higher weights to confident normal/tumor patches (dotted arrow).

an energy function that encodes the regional and boundary properties is defined as follows:

where $R(Y)$ and $B(Y)$ are the regional and boundary terms, respectively, and $\lambda$ is a coefficient that specifies the relative importance of $R(Y)$ and $B(Y)$. The regional term

where $P(h_k \mid Y)$ denotes a Gaussian probability distribution function:

The boundary term in (4) is defined as

The boundary term $B(Y)$ computes the penalty of assigning different labels to two adjacent nodes. It is assigned to each n-link to represent the local contextual relationships of patches in a WSI, according to the boundary penalty function $B_{x_p,x_q}$, which measures the similarity of the two nodes. If $x_p$ and $x_q$ are similar in the feature space, $B_{x_p,x_q}$ assigns a high penalty, and vice versa. Specifically, the boundary penalty function is defined as

When computing the Gaussian probability in (6), the standard graph-cut-based image segmentation [42] requires initial seed points, which are assumed to be given by user interactions. Specifically, manually labeled nodes should be given as initial seed points to estimate the parameters of each GMM. In contrast, we use the initial pseudo-labels $Y$ of unlabeled data as initial seed points, which are obtained by applying the CNN parameterized by $\theta_l$ in (1). However, using the initial pseudo-labels $Y$ as the seed points to compute the regional term in (5) can lead to undesirable results since they are hard labels obtained by thresholding the network predictions.
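The patch-graph and its n-link weights can be illustrated with a short sketch. The 8-neighbor connectivity follows the description above; the Gaussian form of the penalty and the value of `sigma` are assumptions borrowed from common graph-cut practice, chosen only because they yield a high penalty for similar features and a low one for dissimilar features. The feature vectors are toy values.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[int, int]
Edge = Tuple[Coord, Coord]


def n_links(coords: Sequence[Coord]) -> List[Edge]:
    """Undirected 8-neighbor edges between the patches present on the grid."""
    present = set(coords)
    # Four "forward" offsets visit each undirected neighbor pair exactly once.
    forward = [(0, 1), (1, -1), (1, 0), (1, 1)]
    edges: List[Edge] = []
    for (r, c) in sorted(present):
        for dr, dc in forward:
            q = (r + dr, c + dc)
            if q in present:
                edges.append(((r, c), q))
    return edges


def boundary_penalty(h_p: Sequence[float], h_q: Sequence[float], sigma: float = 10.0) -> float:
    """Assumed Gaussian similarity: close to 1 for similar features, near 0 otherwise."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(h_p, h_q))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Toy per-patch feature vectors keyed by grid position (row, column).
features: Dict[Coord, Tuple[float, float, float]] = {
    (0, 0): (200.0, 150.0, 180.0),
    (0, 1): (198.0, 149.0, 181.0),
    (1, 1): (120.0, 60.0, 140.0),
}
edges = n_links(list(features))
weights = {e: boundary_penalty(features[e[0]], features[e[1]]) for e in edges}
```

Background patches that were removed simply do not appear in `coords`, so no n-link is created across a gap in the tissue.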

Pseudo-labels resulting from different probabilities are expected to influence the energy function differently. In other words, the less confident the network prediction is, the more likely $Y$ is to be erroneous. To this end, we define a new regional term as follows:

where $p_k$ and $1 - p_k$ represent the probability of $x_k$ being tumor and normal, respectively. Note that $z_Y$ is inversely proportional to the prediction probability since the regional term measures the penalty of label assignments. Using the new regional term, the energy function in (4) is modified as:

where $R = \sum_{k} R_k(y_k)$.
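The idea of weighting the regional penalty by prediction confidence can be illustrated as follows. This is one plausible instantiation and not the exact term used here: it assumes a single 1-D Gaussian per class instead of a GMM, a negative log-likelihood penalty, and weights of the form 1/p and 1/(1 - p) clipped by a hypothetical `eps`. The class parameters are toy values.

```python
import math
from typing import Tuple

Model = Tuple[float, float]  # (mean, variance) of an assumed 1-D Gaussian


def gaussian_nll(h: float, mean: float, var: float) -> float:
    """Negative log-likelihood of a scalar feature h under N(mean, var)."""
    return 0.5 * math.log(2.0 * math.pi * var) + (h - mean) ** 2 / (2.0 * var)


def weighted_regional_penalty(
    h: float, p_tumor: float, tumor: Model, normal: Model, eps: float = 1e-6
) -> Tuple[float, float]:
    """Return (penalty of labeling tumor, penalty of labeling normal).

    Each penalty is scaled by a weight inversely proportional to the predicted
    probability of that label (assumed form). With var > 1/(2*pi) the NLL is
    non-negative, so the scaling cannot flip its sign.
    """
    z_tumor = 1.0 / max(p_tumor, eps)
    z_normal = 1.0 / max(1.0 - p_tumor, eps)
    return z_tumor * gaussian_nll(h, *tumor), z_normal * gaussian_nll(h, *normal)


# Toy class models and one patch feature (not taken from the paper).
TUMOR: Model = (120.0, 400.0)
NORMAL: Model = (200.0, 400.0)
confident = weighted_regional_penalty(130.0, 0.95, TUMOR, NORMAL)
uncertain = weighted_regional_penalty(130.0, 0.50, TUMOR, NORMAL)
```

Under these assumptions, a confident tumor prediction makes the normal label much more expensive than the tumor label, while an uncertain prediction leaves the two penalties closer together so that the boundary term has more influence on the final label.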

The patches with more than 50% background pixels (i.e., pixels with intensity values higher than 200 in the HSV space) were removed. The overall data statistics are shown in Table 1.
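The background filter can be sketched in a few lines. The sketch assumes that each patch has already been reduced to a single intensity channel on a 0-255 scale; which channel of the HSV representation is thresholded is not stated in the text, so that choice is left to the caller, and the function name is hypothetical.

```python
from typing import Sequence


def is_background_patch(
    intensities: Sequence[Sequence[float]],
    intensity_threshold: float = 200,
    max_background_fraction: float = 0.5,
) -> bool:
    """True if more than half of the pixels exceed the intensity threshold."""
    flat = [v for row in intensities for v in row]
    background = sum(1 for v in flat if v > intensity_threshold)
    return background / len(flat) > max_background_fraction


# Toy 2 x 2 patches (single-channel intensities, 0-255 scale assumed).
tissue = [[90, 120], [210, 130]]     # 25% background
blank = [[250, 240], [230, 100]]     # 75% background
boundary = [[250, 250], [100, 100]]  # exactly 50% background

kept = [p for p in (tissue, blank, boundary) if not is_background_patch(p)]
```

A patch with exactly 50% background is kept, matching the "more than 50%" wording.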

We used the ResNet50 architecture [40] for all our experiments. We pretrained the network for a maximum of 100 epochs with the initial training data $D^{tr}_{l}$ from the labeled data. We terminated the training early when the validation loss did not decrease for five consecutive epochs and used the model with the best validation loss for inference. For this pretraining, we used the Adam optimizer, a batch size of 64, and a learning rate of $10^{-5}$.
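The stopping rule can be expressed independently of any deep learning framework. The sketch below assumes that "did not decrease" is measured against the best validation loss seen so far (rather than against the previous epoch), which is the common convention but is an assumption here; the loss values are toy numbers.

```python
from typing import Sequence, Tuple


def early_stopping_run(
    val_losses: Sequence[float], max_epochs: int = 100, patience: int = 5
) -> Tuple[int, int]:
    """Return (epoch of the best validation loss, last epoch actually run)."""
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            # New best model: this is the checkpoint kept for inference.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, last_epoch


# Toy validation curve: the best loss is at epoch 2, followed by five
# epochs without improvement, so training stops before the final value.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.6]
best_epoch, last_epoch = early_stopping_run(losses)
```

Note that the run stops before the final, lower loss is reached, which is the usual trade-off of a fixed patience.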

When retraining the network with both labeled data and refined pseudo-labeled data $D^{tr}_{lp+}$, we trained the network for a maximum of 30 epochs, starting from the model with the best validation loss. For this retraining, we used a batch size of 64, with each batch consisting of 8 labeled patches and 56 pseudo-labeled patches. We also terminated the training early using the same criterion.

VOLUME 10, 2022
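A fixed labeled-to-pseudo-labeled ratio per batch can be obtained by sampling the two pools separately. The sketch below is a minimal illustration with a hypothetical function name; it samples without replacement within a batch and shuffles the result, both of which are assumptions about details the text does not specify.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def mixed_batch(
    labeled: Sequence[T],
    pseudo_labeled: Sequence[T],
    n_labeled: int = 8,
    n_pseudo: int = 56,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Draw one batch with a fixed number of labeled and pseudo-labeled items."""
    rng = rng or random.Random()
    # Each pool must hold at least as many items as are drawn from it.
    batch = rng.sample(list(labeled), n_labeled) + rng.sample(list(pseudo_labeled), n_pseudo)
    rng.shuffle(batch)  # avoid a fixed position for labeled items
    return batch


# Toy pools: tuples of (source tag, index) standing in for patches.
labeled_pool = [("L", i) for i in range(20)]
pseudo_pool = [("P", i) for i in range(100)]
batch = mixed_batch(labeled_pool, pseudo_pool, rng=random.Random(0))
```

Fixing the ratio keeps the ground-truth labels present in every gradient step even when pseudo-labeled patches greatly outnumber labeled ones.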

We compared our proposed framework with a set of widely used semi-supervised baselines, namely Mean Teacher [23], VAT [24], MixMatch [33], and pseudo-label [25]. We also compared ours with the state-of-the-art SSL approaches,

demonstrating that the proposed method effectively refines inaccurate pseudo-labels and is thus helpful for network training. Also, the performance of the proposed method is 0.67% and 1.79% higher than that of Soft-label [30] and UPS [31], respectively, which shows that the proposed method refines pseudo-labels more precisely using the graph representation. Moreover, the proposed method shows comparable performance to the fully supervised model (92.92% vs. 94.62%) using only 20% of training WSIs as labeled WSIs. We also present visualization results in Fig. 4. We can see that the proposed method predicts the tumor tissue region better than other semi-supervised baselines.

To investigate the effectiveness of the regional term correction, we compared the proposed method with and without applying it in pseudo-label refinement. Table 2 shows that the proposed method without the regional term correction outperformed several semi-supervised baselines. For example, with 2% labeled WSIs, the proposed method without the regional term correction achieved an average AUC of 82.44%, outperforming all other semi-supervised baselines except for UPS [31], which achieved an average AUC of 83.19%. With the regional term correction, the performance scores were further increased by 1.71%, 1.65%, 2.71%, 2.41% and 2.95% when the percentages of labeled WSIs were 1%, 2%, 5%, 10% and 20%, respectively, which are the highest among all compared methods. These improvements are also evident in the WSIs overlaid with pseudo-labels, as visualized in Fig. 5. Note that the proposed method without the regional term correction significantly increased both true positives (in red boxes) and false positives (in blue boxes) compared to the initial pseudo-labels. Meanwhile, the proposed method with the regional term correction increased true positives and decreased false positives, demonstrating the effectiveness of the regional term correction.

Similar to the Camelyon16 dataset, we observe that our proposed method outperforms the supervised and semi-supervised baselines on the TCGA dataset, as shown in Table 3. With 1%, 5%, 10% and 20% labeled WSIs, our proposed method achieved average AUCs of 75.83%,

and UPS [31], respectively, demonstrating the superiority of pseudo-label refinement in the proposed method. Overall, our proposed method outperforms all other approaches, especially when annotation budgets are limited. Compared to the fully supervised baseline trained on the whole set of training WSIs, the proposed framework achieved a comparable performance of an average AUC of 96.15% using only 20% labeled WSIs.

Table 3 also demonstrates that the proposed method without the regional term correction outperformed several semi-supervised baselines on the TCGA dataset. For example, with 10% labeled WSIs, the proposed method without the regional term correction achieved an average AUC of 94.33%,

TABLE 3. Experimental results on TCGA with different percentages of labeled WSIs. Each score represents the mean AUC ± standard deviation obtained by five different random samplings of the training WSIs. The supervised upper bound performance using the whole training WSIs (550 WSIs) is 98.62%.

FIGURE 5. WSI samples and their overlaid pseudo-labels in red colors: (a) the initial pseudo-labels obtained from the network predictions, (b) refined pseudo-labels obtained using the proposed refinement without regional term correction, (c) refined pseudo-labels obtained using the proposed refinement with regional term correction. The black contours indicate the boundaries of the ground truth tumor region. The blue and red boxes correspond to the normal and tumor region examples. Note that the regional term correction contributes to reducing the false positives (in blue boxes) while maintaining true positives (in red boxes).
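The reporting convention of a mean AUC ± standard deviation over several random samplings can be reproduced with the standard library alone. The sketch below computes the AUC as the fraction of tumor/normal pairs ranked correctly (ties count as half) and summarizes a list of per-run AUCs; the use of the sample standard deviation is an assumption, since the text does not say which estimator was used, and all values are toy numbers.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(aucs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (assumed estimator)."""
    return mean(aucs), stdev(aucs)


# Toy example: four patches with ground truth and predicted tumor scores.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Toy per-run AUCs standing in for five random samplings.
run_mean, run_std = summarize([0.90, 0.92, 0.91, 0.89, 0.93])
```

The pairwise formulation is quadratic in the number of patches and is meant for illustration; a rank-based implementation would be preferable for slide-scale evaluation.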

In this paper, we propose a semi-supervised deep learning framework for pathology image classification, which incorporates graph-based segmentation into the pseudo-labeling process to obtain accurate labels for unlabeled data. The proposed framework refines initial pseudo-labels using graph-based segmentation, which considers local and global contextual relationships between patches in a WSI. Also, for better segmentation, we formulate a new energy function that leverages the reliability of the initial pseudo-labels. The high-quality pseudo-labels generated by the proposed framework are used as a supervision signal to train the model. The experimental results on two independent datasets demonstrate that our proposed framework outperforms state-of-the-art semi-supervised learning baselines.

The proposed method has some limitations. First, during the pseudo-label refinement process, we extract the feature vector of a given patch by simply averaging color intensities. The performance could be improved by using features that better reflect the characteristics of pathology images. In the future, we will investigate various pathology-specific features, such as the shape and morphology of cell nuclei, for pseudo-label refinement. Second, the experiments were conducted for breast and kidney cancer classification tasks only. Further studies on various cancer classification tasks are needed to explore the generalizability of our approach.