A Graph-Based Multi-Scale Approach With Knowledge Distillation for WSI Classification

The usage of Multi Instance Learning (MIL) for classifying Whole Slide Images (WSIs) has recently increased. Due to their gigapixel size, the pixel-level annotation of such data is extremely expensive and time-consuming, practically unfeasible. For this reason, multiple automatic approaches have been raised in the last years to support clinical practice and diagnosis. Unfortunately, most state-of-the-art proposals apply attention mechanisms without considering the spatial instance correlation and usually work on a single-scale resolution. To leverage the full potential of pyramidal structured WSI, we propose a graph-based multi-scale MIL approach, DAS-MIL. Our model comprises three modules: i) a self-supervised feature extractor, ii) a graph-based architecture that precedes the MIL mechanism and aims at creating a more contextualized representation of the WSI structure by considering the mutual (spatial) instance correlation both inter and intra-scale. Finally, iii) a (self) distillation loss between resolutions is introduced to compensate for their informative gap and significantly improve the final prediction. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining a +2.7% AUC and +3.7% accuracy on the popular Camelyon16 benchmark.


I. INTRODUCTION
A NALYZING histological tissues is crucial for the diag- nosis and treatment planning of multiple human body lesions and diseases [1], [2].Fortunately, modern microscopes allow to scan of conventional glass slides into digital Whole Slide Images (WSIs), keeping the information in a multi-resolutions pyramidal structure (Fig. 1).This representation allows to store a large amount of information, preserving the ability to quickly analyze the tissue from different perspectives, i.e., from higher scales to observe the cellular The authors are with the Dipartimento di Ingegneria "Enzo Ferrari," Università degli Studi di Modena e Reggio Emilia, 41125 Modena, Italy (e-mail: gianpaolo.bontempo@unimore.it;federico.bolelli@unimore.it;angelo.porrello@unimore.it;simone.calderara@unimore.it;elisa.ficarra@unimore.it).
Digital Object Identifier 10.1109/TMI.2023.3337549Fig. 1.This figure depicts the pyramidal structure of WSIs, highlighting the information available at different scales which ranges from structure level at lower resolutions to cellular interaction at higher resolutions (i.e.lower levels).
morphology and the different cellular compartments, and from lower ones to identify distinct tissues, such as tumor, stroma, and immune system cells, and their location.However, due to their gigapixel size, the manual analysis of WSIs requires specific tools [3] and is usually expensive and time-consuming for pathologists.For these reasons, several methodologies have been recently proposed to support clinicians with fast, accurate, and automatic analysis [4], [5], [6].
Feeding modern neural networks with the entire gigapixel image is not feasible, as it requires significant computational resources and time.Therefore, researchers have focused on developing algorithms that can analyze WSIs efficiently by cropping them into smaller patches.Additionally, since pixel-level annotation is time-consuming, annotation is generally provided at WSI level.As a consequence, an effective solution to analyze the WSIs is the use of weakly supervised paradigms such as Multiple Instance Learning (MIL) [7], [8], whose application is becoming increasingly popular in WSI learning-based analysis [9], [10], [11], [12], [13].In MIL, each WSI is considered a "bag" composed of many patches called instances.By restricting the rule to a binary classification setting (e.g., tumor/not tumor), if at least one instance (patch) is positive, the image (bag) is marked as positive [14].
Recent proposals integrate multiple resolutions extracted from the WSI in the MIL learning representation mechanism employing a mere features concatenation [10], arguing its effectiveness w.r.t.single-resolution algorithms.Other very recent approaches [9], [15], [16] resort to more complex hierarchical structures by adapting Vision Transformers (ViT) [17] and Graph Neural Networks (GNN) [18].However, we argue that such methods do not take advantage of the full potential of the WSI pyramidal structure.As an example, the flat concatenation of features extracted at different resolutions [10] does not consider the substantial difference in the informative content they provide.On the other hand, in [15] many efforts have been spent to model intra-scale structural connection through a cluster-based pooling layer while overlooking interscale patterns.
A proficient learning approach should instead consider the differences in the information content available at different scales, including global structures and local cellular regions.Our proposed approach aims to leverage this heterogeneity by preserving the scale diversity of the input data.This paper proposes a pyramidal Graph Neural Network combined with (Self) Knowledge Distillation to leverage the multi-resolution structure of WSI in a MIL classification setting.In this context, the graph message passing brings information from higher to lower scales.More specifically, the proposed framework employs a 2-tiers GNN to provide a more contextualized representation.The first tier processes each input scale separately, preserving the scale diversity in terms of information content and acting as an adapter.The second tier allows for full communication between scales, making the information flow between all the considered resolutions and capturing their relationships.
The contextualized features generated by our GNN module are transmitted to individual (one for each scale) attentionbased MIL to derive bag labels.The (Self) Knowledge Distillation mechanism is introduced to encourage agreement across the predictions delivered at different resolutions.At the same time, individual scale features are learned in isolation to preserve the diversity in terms of information content.By transferring knowledge across scales, connected by the GNN, the model self-improves as information flows during training.Our main contributions can be resumed as follows: • We propose the use of GNNs to naturally exploit the heterogeneous information contained in WSI images, employing the message passing for sharing the informative content provided at different scales; • We devise a learning strategy based on (Self) Knowledge Distillation, that promotes the agreement across the predictions delivered by different scales.This approach allows valuable secondary information, which can only be inferred from a specific resolution, to be transferred to the other resolutions.Since connected through a GNN, the agreement between scales enables for self-improvement of the teacher module; • An exhaustive set of experiments on two publicly available datasets, Camelyon-16 1 [19] and TCGA-Lung,2 validate and confirm the effectiveness of our proposal for analyzing a variety of WSI and supporting clinical decision; • The source-code is available on GitHub3 to ensure experiment reproducibility and future comparisons.

A. MIL for Disease Detection in WSI
Existing MIL approaches for WSI classification can be clustered based on two different aspects: the number of resolutions employed and the aggregation mechanism used to provide the final prediction.With the former, we can distinguish singlescale algorithms [10], [12], [20], where the tiling process is done at a unique resolution, and multi-scale approaches [10], [21], [22].
Regarding the aggregation mechanism, a further distinction between instance-level predicting algorithms [23], [24] and those that deal with bag-level [10], [11], [12], [20], [21], [22] predictions can be highlighted.In the first case, the patch probability is employed to produce the final result (e.g., mean or max pooling), while bag-level approaches aggregate instances into bag representation and feed a classifier with it.The main difference between recent proposals concerns the attention mechanisms employed for aggregating instance-level information.Single Scale.Regarding the single resolution, the classical AB-MIL [11] is based on a side-branch network to calculate the attention scores.In [20] a similar attention mechanism is employed as support for a double-tier feature distillation approach, which distills features from pseudo-bags to the original slide.Those features are selected by relevance as MaxMin [20] or by aggregation as AFS [20].Another approach proposed in the literature, DS-MIL [10], is to apply non-local attention aggregation measuring the distance with the most relevant patch.In 2021, Lu et al. [25] proposed an algorithm based on a clustering loss applied on single or multiple branches (CLAM-SB and CLAM-MB), a variant of the classic AB-MIL.In Trans-MIL [12], a typical transformer architecture is used.The issue related to all the aforementioned solutions is that they miss to consider the mutual instance correlation.In [9], the authors leverage the well-performing Dino [26] feature extractor, proving its effectiveness also in this scenario.Beyond the classical attention mechanism, there are also algorithms based on Recurrent Neural Networks (RNNs) [22] and GNNs [27], [28] able to solve the same task.Also, these proposals miss to consider multiple resolutions and ignore the pyramidal structure of the WSIs.Multi Scale.Recently, different authors focused on multiresolution approaches.DSMIL-LC [10] concatenates representations obtained at different resolutions (e.g., a low instance representation is concatenated with all the ones obtained at a higher resolution).MS-RNNMIL [22], instead, fed an RNN with instances extracted at different scales.In [29], a selfsupervised hierarchical transformer is applied at each scale.In MS-DA-MIL [21], multi-scale features are included in the same attention algorithm.In H 2 MIL [15] the multi-resolution is exploited by a GNN architecture and the information is aggregated through a specifically designed iterative pooling layer based on patches' location.The proposed aggregation uses the WSI pyramidal structure, but forcing a local attention that may affect performances.Previously, Chen et al. [30] proposed the PTree-Net that selects patches at different resolutions based on the thumbnail level attention map.Those patches are then connected in a tree structure, later investigated through a relevance-enhanced GNN.However, this approach misses considering the hierarchical structure of patches and their heterogeneity [15].

B. Knowledge Distillation
Distilling knowledge from a more extensive network (teacher) to a smaller one (student) has been widely investigated in recent years [20], [31], [32], [33], [34].Typically, the loss evaluates the mimic capabilities of the student observing the teacher.Recent self-supervised representation learning approaches have also used this idea.In [26], [35], and [36], knowledge distillation is used to realize an agreement between networks that receive as input variants of the original image.In literature, knowledge distillation has been applied to different fields ranging from model compression [37] to WSI analysis [20], [31].In [20], distillation is used to transfer the knowledge between MIL tiers applied on different subsampled bags.In [38], a self-distillation term is used in a regression task where the student model is trained on soft labels provided by the teacher.Moreover, [39] includes the weighted-ground truth targets in the loss term to better guide the student.In [40], the authors analyze the importance of teacher diversity and they provide a series of label smoothing methods to directly increase predictive diversity.Differently, our model applies (self) knowledge distillation between WSI scale resolutions.This way, improving the worst scale benefits the best one since they are connected through the GNN layer.

III. METHOD
Our proposal aims to enhance the information flow through WSI resolutions.In this respect, while existing works [10], [12], [25] take into account the interactions between scales by mostly leveraging trivial operations (such as concatenation of related feature representations), we instead provide a novel technique that builds upon: i) a GNN module that propagates patches' representation taking into account the natural multi-resolution structure of WSI; ii) a regulation term based on (self) Knowledge Distillation.This regulation term guides the most effective resolution to train the others more effectively self-improving at the same time.In the following section, we will discuss the proposed architecture in detail.

A. Architecture
Our approach can be decomposed into three main stages: • Feature Extraction.Given the whole slide image, we resample it at various resolutions; for the i-th scale, we divide the resulting image into regular patches . ., x i N , which are then fed into a self-supervised feature extractor f (•; θ i ).This way, we obtain multiple grids of latent representations; • Context Enrichment.The representations are then re-arranged as nodes of a multi-Graph Neural Network (GNN), which spreads the information between resolutions through a message passing mechanism; • Multiple Instance Learning.An auxiliary module weights the contribution of each instance to the representation of the whole slide used to perform classification.
Feature Extraction.Recent studies have shown the effectiveness of self-supervised learning techniques in capturing patch-level characteristics.For instance, the authors of [10] resorted to SimCLR [41], an approach that aims to align the representations of positive pairs and keep the representations of negative pairs distant.This approach suffers from a slow convergence rate and has a huge memory footprint.
To address the shortcomings of self-supervised learning, our work advocates for Dino [26], which has recently shown promising results in the field of image understanding.In particular, it does not reckon on forging negative pairs during optimization, but rather, the training objective focuses solely on aligning the representations of positive pairs.In more detail, the latter is computed by two distinct networks, playing the roles of teacher and student.This way, the training is faster and requires lower computational resources, thus obtaining excellent representations in a few days.Similarly to [36], the risk of collapsing solutions is avoided through slower teacher network updates obtained through the Exponential Moving Average (EMA).The authors of Dino introduced additional enhancements such as the exploitation of the Vision Transformers [42] in the design of the backbone networks.However, we refer the reader to the original paper for a deep understanding of Dino.Concerning our work, we devise an initial stage where multiple feature extractors f (•; θ 1 ), . . ., f M (•; θ M ) are trained, each of which ends up being an expert of one of the M zoom scales of interest.On top of that, we freeze the weights of these networks and use them as patch-level feature extractors during the next step.
In our analysis, we focus only on two resolutions at the time (e.g. the 10× and 20× magnitudes) so that M = 2, but the approach can be extended to consider even more scales.Therefore, the overall input X for the subsequent layers can be summarized as follows: High-level representations  local patches.In general terms, such a module takes as input multi-scale patch-level representations X and produces the resulting Y through a cascade of two neural sub-networks GNN 1 and GNN 2 : where A 1 , A 2 and A 1∪2 identify the adjacency matrices used within this block and later described in this paragraph.The former module, GNN 1 , focuses only on intra-scale relations and no interactions are hence allowed between nodes from different sub-graphs.The computation of the resulting hidden activations H can be summarized as follows: where GAT identifies the graph attention layer proposed in [43].As can be understood, the two sub-graphs are processed through two independent graph layers.To explain such a design choice, we recall that the input embedding spaces X 1 and X 2 originate from different feature extractors.This way, individual scale features are learned in isolation to preserve the diversity in terms of information content.However, a direct shared vertex-wise transformation would clash with that input distribution shift. 4For this reason, we ask the two independent modules to act as adapters, projecting the representations into a shared embedding space.The second module, GNN 2 , allows for inter-scale relations.In particular, data are no longer considered as two distinct sub-graphs but rather as a unique and complex structure in which nodes from different scales can interact.This way, we aim to propagate neighborhood information across low and high resolutions. Briefly: We obtain the adjacency matrix A 1∪2 by preserving the edges already in A 1 and A 2 and adding those connections between a parent WSI patch (lying in the low resolution, we call this relation "part of") and its children, i.e. the higher-scale patches it contains.
For what concerns A 1 and A 2 , we propose to use two types of distances for the grid calculation: • Chessboard distance.In this case, a patch neighborhood is defined by the 8-connectivity, i.e., its surrounding 8 patches in image space.
• Similarity distance.The similarity distance is calculated as (X i ) • (X i ) T , and each node is linked to its K closest neighbors.This way, semantically similar nodes are connected in feature space.A visual representation of the proposed multi-layer graph structure is provided in Fig. 2. Bag-Level Representations via Critical Instance Selection.Following existing and well-established works [10], [15], [20], we compute the overall representation for the whole slide through a technique based on Multiple Instance Learning.Our model is inspired by DSMIL [10] and employs a self-attention mechanism that provides bag-level feature vectors, one for each of the two WSI scales involved in the pipeline.Such a module is based on self-attention [33] and builds upon the prior individuation of the critical patches Y CRIT = [y 1 CRIT , y 2 CRIT ], selected in a winner-takes-it-all fashion: where ) is a scoring projection network (with weights matrices W 1  CRIT and W 2 CRIT ) that assigns a single scalar weight to each patch.It is noted that the same equation holds for the selection of the critical instance y 2 CRIT of the high-level resolution.Once these instances have been individuated, the bag-level representations y 1 BAG and y 2 BAG are computed as: ). (2) For an in-depth description of how self-attention weights and values have been calculated, we refer the reader to [44].

B. Aligning Scales With (Self) Knowledge Distillation
We have hence obtained two distinct sets of predictions for the two resolutions: namely, a bag-level score (e.g., a tumor is either present or not) and a patch-level one (e.g., which instances contribute the most to the target class).However, as these learned metrics are inferred from different WSI magnitudes, a disagreement may emerge: indeed, as previously observed in this Section, the higher resolutions generally yield better classification performance w.r.t.lower ones.In this work, we exploit such a disparity to introduce two additional optimization objectives, which exploit the predictions of the high scale as a teaching signal for the low one.Further than improving the results of the low scale, the aim is to propagate the benefits of such an improvement to the shared message-passing module and back to the high resolution.
To achieve this goal, we propose extending the transfer by promoting consistency between scales.As the poorer scale is additionally encouraged to chase the richer one, the message-passing module placed in the middle turns out to be crucial in discovering valuable patterns and relations, which could help not only the lower scale but also, the higher one.In formal terms, we propose extending the training objective with a twofold term.The former asks the predictions from the two scales to be close as much as possible via (Self) Knowledge Distillation (KD) [45]: where KL identifies the Kullback-Leibler divergence and τ is a temperature that lets secondary information emerge from the teaching signal.Under the light of Bayesian statistical inference, the probability distribution given by the higher scale represents the prior distribution we enforce while fitting the lower one.
The second aligning term regards the scores computed through Eq. (3).More in detail, it encourages the two resolutions to assign criticality scores in a consistent manner: intuitively, if a low-resolution instance has been considered critical, then the average score attributed to its children instances should be likewise high.Such desiderata are carried out by minimizing the Euclidean distance between the low-resolution criticality grid map z 1 (Y 1 ) and its sub-sampled counterpart computed by the high-resolution branch: In the equation above, GraphPooling identifies a pooling layer applied over the higher scale: to do so, it considers the relation "part of" between scales.Overall Objective.To sum up, the overall optimization problem is formulated as a mixture of two objectives: the one requiring higher conditional likelihood w.r.t.ground truth labels y (e.g. a tumor is either present or not) and carried out through the Cross Entropy loss L CE (•; y); the other one based on (Self) Knowledge Distillation: where λ is a hyperparameter weighting the trade-off between the teaching signals provided by labels and the higher resolution, while β balances the contributions of the consistency regularization introduced in Eq. ( 4).

IV. EXPERIMENTS
Datasets.The proposed framework is evaluated on two different benchmarking datasets: the Camelyon16 [19] and a Lung dataset from The Cancer Genome Atlas (TCGA) project [10], which are widely employed for the evaluation of state-of-theart proposals for WSI analysis [10], [20], [25].More specifically, the Camelyon16 dataset -which derives from the homonym challenge that took place at the International Symposium on Biomedical Imaging (ISBI) in 2016is designed for the automatic detection of metastases in Hematoxylin and Eosin (H & E) stained WSIs of lymph node sections [19].It contains 398 WSIs, 128 of which are part of the official test set.Images have been acquired with two different slide scanners, named RUMC and UMCU, with 20× and 40× objective lenses, respectively, and comparable specimen-level pixel sizes, i.e. 0.243 µm × 0.243 µm for the RUMC and 0.226 µm × 0.226 µm for the UMCU.In this respect, the knowledge about the specimen-level pixels sizes allows to combine data acquired with different scanners and resolutions: we exploit this metadata to extract different scales from the original whole-slide images.To evaluate the performance of our proposal, we have adhered to the official training/test sets.
The second dataset -the TCGA Lung datasetis publicly available on the GDC Data Transfer Portal and comprises two subsets of cancer: Lung Adenocarcinoma (LUAD) and Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE I MEAN NUMBER OF PATCHES PER WSI OBTAINED BY OUR PRE-PROCESSING ALGORITHM AT DIFFERENT MAGNIFICATION LEVELS FOR BOTH THE CONSIDERED DATASETS
Lung Squamous Cell Carcinoma (LUSC), counting 541 and 513 WSIs respectively.The task tackled in this case is the classification of LUAD vs LUSC.To split the dataset into training and test, we have followed what has been devised by DSMIL [10], also removing ten corrupted slides from the dataset, according to [10].Pre-Processing.Before feeding our framework, WSIs are pre-processed to extract and filter background patches: to this aim, we adopt the CLAM framework [25].In particular, after an initial segmentation process based on Otsu [46], non-overlapped patches within the foreground regions are considered.The original WSI is memory-loaded with a down-sampled resolution, usually 32×, and converted into the HSV color space.The saturation channel of the image is filtered with median blur to smooth edges and then thresholded to produce a binary foreground map of tissue regions.Additional morphological operators [47] and connected components labeling [48] are employed to fill gaps and holes within the map.Eventually, the foreground objects are filtered based on their area to remove spurious noise blobs.Examples of the filtering process are depicted in Fig. 3. 256×256 patches are finally extracted within the segmented contours, ensuring no overlap between patches.The process is repeated multiple times for each slide at different target resolutions (5×, 10×, and 20× in this study).When changing the resolution, the number of patches can vary significantly (from a hundred to hundreds of thousands).Moreover, given that the process is independently performed at different scales, the number of patches at different levels does not necessarily respect a fixed ratio (e.g., 4:1 between 20× and 10× magnification levels).To provide a summary, Tab.I reports the average number of resulting patches per WSI at different magnification levels.
Metrics.The performances of our model have been measured through the AUC (Area Under the Curve) and the accuracy.
The AUC measures the area under the ROC (Receiver Operator Characteristics) curve by varying the probability threshold.The accuracy is evaluated by selecting the best threshold suggested by the ROC.Additionally, as proposed for the Camelyon16, the detection/localization performance is measured by means of the Free Response Operating Characteristic (FROC) curves.FROC differs from the ROC analysis since it substitutes the false positive rate on the x-axis with the average number of false positives per image.Specifically, a detection is considered a true positive, if the location of the detected region is within the ground truth lesion.If there are multiple findings for a single ground truth region, they are counted as a single true  [20].DAS-MIL PERFORMANCE ARE OBTAINED WITH T = 1.5, β = λ = 1

TABLE III COMPARISON BETWEEN SCALES. THE TARGET COLUMN INDICATES THE FEATURES PASSED TO THE TWO MIL LAYERS: THE "∥" SYMBOL INDICATES THAT THEY HAVE BEEN PREVIOUSLY CONCATENATED
Moreover, all the detections not within a specific distance from the ground truth annotations are considered false positives.Implementation Details.We leveraged the Pytorch-Geometric codebase [49] to build the graph structure proposed in this paper.To train our models, we used the Adam [50] optimizer with a learning rate of 2 × 10 −4 and a cosine annealing scheduler with a 1 × 10 −5 decay without warm restart [51].The Dino [26] feature extractor has been trained with two NVIDIA RTX5000 GPUs and all subsequent experiments have been performed on a SLURM server [52] with a single NVIDIA RTX2080 GPU.

A. Comparison With the State-of-the-Art
Tab. II compares the results obtained by the proposed framework with the state-of-the-art architectures, considering both single-scale and multi-scale alternatives.Whenever available, the results published in the official papers are reported, otherwise, the experiments are replicated on our premises, using the code provided by the original authors.We compare our architecture with six different single resolution MIL mechanisms, i.e., MILLRNN [22], ABMIL [11], CLAM [25], Trans-MIL [12], DTFT [20] and DS-MIL [10], as well as multi resolution proposals MS-DA-MIL [21], MS-MILRNN [22], DSMIL-LC [10], HIPT [29] and H 2 -MIL [15].As can be observed: i) the joint exploitation of multiple resolutions is generally more efficient; ii) DAS-MIL, the approach proposed in this work, yields robust and compelling results (especially on Camelyon16, where it provides 3.7% and 2.7% more than the SOTA accuracy and AUC, respectively).We acknowledge that methods reported in that table are based on different feature extractors, thus influencing the overall performance.The following subsection reports exhaustive ablation studies that shed light on the advantages of our technical contributions, disentangling the contribution of the feature extractor.Comparing C16 and TCGA datasets, there is a significant difference in the signal intensity.The former has roughly < 10% of tumor tissue, while the latter has > 80% of tumor regions per slide.In this sense, the contextualized representation provided by our DAS-MIL is much more effective in the first case.

B. Model Analysis
This section reports multiple ablation studies that detail the contribution of each component, highlighting their strengths and weaknesses.Single-Scale vs Multi-Scale.Let's start by analyzing the contribution of different scales (see Tab. III).For single-scale experiments, we fed the model only with patches extracted at a single reference scale and the corresponding adjacency matrix.For what concerns multi-scale, instead, magnitudes can RESULTS ARE OBTAINED WITH α = β = 1.0 be combined in different ways. 5The experiments revealed that the best results are obtained when the model is trained with 10× and 20× input resolutions.Tab.III also highlights that 5× magnitude is less effective and presents worse discriminative capabilities: we ascribe it to the type of samples in the WSIs (e.g.untreated tumor samples) and the specimen-level pixel size, underlying that different datasets and classification tasks may benefit from different scale combinations.At this point, the reader may wonder how to choose the scales to be employed without running all the possible combinations on the final test set.For this purpose, we highlight that such a selection may depend upon prior analyses of the domain and the nature of the task at hand, considering the peculiarities of the specific biological structure under analysis.In practice, it constitutes an hyperparameter; as such, common techniques based on Cross Validation (CV) can be employed to choose the proper resolutions: as reported in Tab.IV, the results on the validation set suggest and confirm our previous outcomes.
Interestingly, while [10] needs to concatenate the representations from different scales to attain satisfactory performance, such an operation has a negligible impact on our framework.In order to merge information from different resolutions, our approach already devises graph layers.While the concatenation assigns the same importance to the input representations, we argue that our mechanism can more effectively weight the contributions of patches since it can learn to route valuable features through the message-passing algorithm dynamically.The Impact of Feature Extractors and Graph-Based Fusion.We refer the reader to Tab.V for an investigation of these aspects.In more detail, we consider both SimCRL and Dino, as well as the recently proposed graph mechanism H 2 -MIL [15]: in doing so, we fix the input resolutions to 5× and 20×, according to our previous analysis.We can draw the following conclusions: i) when our DAS-MIL feature propagation layer is used, the selection of the optimal feature extractor (i.e.SimCLR vs Dino) has less impact on performance, as the message-passing can compensate for possible lacks in the initial representation; ii) DAS-MIL appears a better features propagator w.r.t.H 2 -MIL.On this latter point, we conjecture that our approach has a characteristic that better captures the peculiarities of WSIs, i.e., how the global representation of the slide is computed.Indeed, H 2 -MIL exploits a global pooling layer that fulfills only the spatial structure of patches.Consequently, if non-tumor patches surround a tumor patch, its contribution to the final prediction is likely to be outweighed by the IHPool module of H 2 -MIL.Differently, our approach is not restricted in such a way, as it can dynamically route the information across the hierarchical structure (also based on the connections with the critical instance).On the Role of (Self) Knowledge Distillation and Consistency Regularization.To assess the merits of the regularization objectives discussed earlier, we conducted several experiments varying the values of the corresponding balancing coefficients and present their results in Tab.VI.Lowering their values (even reaching λ = 0 i.e. no distillation is performed) negatively affects the performance.Notably, such a statement holds not only for the lower resolution (as one could expect), but also for the higher one, thus corroborating the claims we made in Sec.III-B on the bidirectional benefits of (Self) Knowledge Distillation in our multi-scale architecture.
We have also performed an assessment on the temperature τ , Tab.VII, which usually controls the smoothing factor applied to the teacher's predictions.We found out that the lower the temperature, the better the results, thus suggesting that the teacher scale is naturally not over-confident about its predictions but rather well-calibrated.results when mean pooling is used.Employing mean-pooling prevents knowledge transfer from misleading isolated samples, ensuring greater accuracy.However, when computing the 8-similarity graph in feature space, max pooling becomes more relevant.Not only it is more robust to overfitting (achieving 0.94% AUC at the last epoch), but it also outperforms the graph build on image space using the same pooling strategy.This is likely because using the feature space ensures the graph homophily property, dramatically reducing noise related to outlier instances.

Max vs Mean
Localization.In clinical scenarios, a discriminative model has to provide not only good predictive capabilities but also a reasonable explanation for its final prediction.The proposed architecture can provide such an explanation, highlighting the disease-positive patches extracted by the framework.An example is depicted in Fig. 5 where the model's output is compared with the ground-truth provided for a small subset of WSIs in the Camelyon dataset.Moreover, the chart in Fig. 4 compares the localization properties (FROC curves) of DAS-MIL with different graph types and DSMIL.Notably, the use of a similarity graph can have a positive impact on the localization task.In particular, when the number of connected neighbors K is low (e.g., K = 4), the model detects a high number of true positive instances at the cost of a few false positives.On the other hand, when K is high (e.g., K = 16), although the number of false positives is higher, the model achieves higher sensitivity.We can stress that K implicitly regulates the smoothing operation performed on the attention scores by the graph.

V. CONCLUSION
This paper proposes a novel way to exploit multiple resolutions in the domain of histological WSI.We conceive a novel graph-based architecture that learns correlations between different WSI resolutions.Specifically, a GNN cascade is used to extract a context-aware and instance-level feature considering the spatial relationship between scales.This connection is further boosted during the training process by a distillation loss, asking for an agreement between scales.On the one hand, an extensive set of experiments shows the effectiveness of the proposed distillation approach Tab.VI, Tab.VII, and of the graph mechanism employed Tab.V.
On the other hand, a few criticisms about the proposed architecture can be highlighted: i) while the criticality-based two-stage MIL approach is well suited for localization-related tasks (such as tumor detection), scenarios like tumor staging/survival prediction may require a deeper analysis across multiple critical regions [53]; ii) our DAS-MIL relies on a separate feature extractor for each target scale.While this is advantageous in terms of overall accuracy, it would require additional training stages for introducing new magnification(s) or adapting those already available for targeting a new task; iii) as currently devised, the patch-level feature extractors are trained in a self-supervised fashion and frozen in the subsequent supervised stages of our pipeline.However, it could be beneficial to envision end-to-end training, which could promote representations more aligned with the task under consideration.
In addition to tackling the aforementioned issues, future works will explore proper and profitable ways to leverage on more than two scales.Such an analysis will focus not only on mere technicalities but also on the rationales behind using deeper pyramidal structures: which medical imaging tasks would effectively benefit from such an approach?Are all the scales equally crucial for all the tasks?

Manuscript received 29
September 2023; accepted 19 November 2023.Date of publication 28 November 2023; date of current version 3 April 2024.This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant 965193 and in part by the Department of Engineering "Enzo Ferrari" of the University of Modena through the FARD-2022 (Fondo di Ateneo per la Ricerca 2022).(Corresponding author: Federico Bolelli.)

Fig. 2 .
Fig. 2. Overview of the proposed DAS-MIL framework.The features extracted at different scales are connected considering both the image (as depicted in this Figure) and similarity space, by means of different graphs.The nodes of both graphs (A 1 and A 2 ) are later fused into a third one (A 1∪2 ), respecting the relation "part of".The contextualized features generated by the GNN module are passed to distinct attention-based MIL blocks (one for each scale) that extract bag labels.(Self-) Knowledge Distillation mechanism encourages agreement across the predictions delivered at different scales, both at bag (L KD ) and instance level (L CRIT ).

W 1 V
CRIT ) Attention scores w.r.t. the critical patch.* V(Y 1 i ;Patch-level value.

Fig. 3 .
Fig. 3. Examples of WSI pre-processing.Green contours represent the considered tissue, the blue ones are holes the algorithm will discard.In this examples, the holes are mainly composed by fat tissue.

Fig. 5 .
Fig. 5. Localization Example.From left to right we have (a) the original image, (b) the tissue segmented with pre-processing algorithm, (c) the positive-disease annotation provided by Camelyon16, and (d) the attention map extracted with DAS-MIL.

TABLE II COMPARISON
WITH STATE-OF-THE-ART SOLUTIONS.RESULTS MARKED WITH " †" HAVE BEEN CALCULATED ON OUR PREMISES AS THE ORIGINAL PAPERS LACK THE SPECIFIC SETTINGS; ALL THE OTHER NUMBERS ARE TAKEN FROM [10] AND Pooling with Different Graph Topologies.The results presented in Tab.VIII indicate that (self-) knowledge distillation across different resolutions typically yield better

TABLE VIII EXPLORING
THE EFFECTIVENESS OF GraphPooling, EQ. (4), IMPLEMENTED AS mean OR max POOLING WHEN USING DIVERSE GRAPH TOPOLOGIES.OUR ANALYSIS FOCUSES ON GRAPHS BUILD UPON THE IMAGE SPACE CONSIDERING THE chessboard DISTANCE (8-CONNECTIVITY) OR K TOP NEAREST NEIGHBORS IN FEATURE SPACE