A Graph-Transformer for Whole Slide Image Classification

Deep learning is a powerful tool for whole slide image (WSI) analysis. Typically, when performing supervised deep learning, a WSI is divided into small patches, a model is trained at the patch level, and the patch-level outcomes are aggregated to estimate disease grade. However, patch-based methods introduce label noise during training by assuming that each patch is independent and carries the same label as the WSI, and they neglect overall WSI-level information that is significant in disease grading. Here we present a Graph-Transformer (GT) that fuses a graph-based representation of a WSI and a vision transformer for processing pathology images, called GTP, to predict disease grade. We selected 4,818 WSIs from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), the National Lung Screening Trial (NLST), and The Cancer Genome Atlas (TCGA), and used GTP to distinguish adenocarcinoma (LUAD) and squamous cell carcinoma (LSCC) from adjacent non-cancerous tissue (normal). First, using NLST data, we developed a contrastive learning framework to generate a feature extractor. This allowed us to compute feature vectors of individual WSI patches, which were used to represent the nodes of the graph, followed by construction of the GTP framework. Our model trained on the CPTAC data achieved consistently high performance on three-label classification (normal versus LUAD versus LSCC: mean accuracy = 91.2 ± 2.5%) based on five-fold cross-validation, and mean accuracy = 82.3 ± 1.0% on external test data (TCGA). We also introduced a graph-based saliency mapping technique, called GraphCAM, that can identify regions that are highly associated with the class label. Our findings demonstrate GTP as an interpretable and effective deep learning framework for WSI-level classification.


I. Introduction
Computational pathology [1]-[4], which entails the analysis of digitized pathology slides, has gained increasing attention over the past few years. The sheer size of a single whole slide image (WSI), which can exceed a gigabyte, means that traditional image analysis routines may not be able to fully process all this data in an efficient fashion. Modern machine learning methods such as deep learning have enabled great progress in analyzing WSIs, including disease classification [5], tissue segmentation [6], mutation prediction [7], and spatial profiling of immune infiltration [8]. Most of these methods rely on systematic breakdown of WSIs into image patches, followed by development of deep neural networks at the patch level and integration of outcomes on these patches to create overall WSI-level estimates [9], [10]. While patch-based approaches catalyzed research in the field, the community has begun to appreciate the conditions in which they confer benefit and those where they cannot fully capture the underlying pathology. For example, methods focused on identifying the presence or absence of a tumor on a WSI can be developed on patches using computationally efficient techniques such as multiple instance learning [11]. On the other hand, if the goal is to identify the entire tumor region or capture the connectivity of the tumor microenvironment characterizing the stage of disease, then it becomes important to assess both regional and WSI-level information. There are several other scenarios where both patch- and WSI-level features need to be identified to assess the pathology [12], and methods to perform such analysis are needed.
The success of patch-based deep learning methods can be attributed to the availability of deep neural networks pre-trained on natural images from public databases (e.g., ImageNet [13]). Since there are millions of parameters in a typical deep neural network, de novo training of such a network requires access to a large set of pathology data, and such resources are not necessarily available at all locations. To address this bottleneck, researchers have leveraged transfer learning approaches with models pre-trained on ImageNet to accomplish various tasks. Recently, transformer architectures were applied directly to sequences of image patches for various classification tasks. Specifically, Vision Transformers (ViT) were shown to achieve excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources for model training [14]. Position embeddings were used in ViTs to retain spatial information and capture the association of different patches within the input image. However, the self-attention mechanism in ViT requires the calculation of pairwise similarity scores on all the patches, resulting in memory and time complexity that are quadratic in the number of patches. Leveraging such approaches for pathology image analysis is therefore not trivial, because each WSI can contain thousands of patches. Additionally, approximations are often made on these patches, such as using the WSI-level label on each patch during training, which is not ideal in all scenarios, as there is a need to process both the regional information and the WSI in its entirety to better understand the pathological correlates of disease.
Analogous to this regional and WSI-level examination, we argue that an expert pathologist's workflow also involves examination of the entire slide under the microscope using manual operations such as panning and zooming in and out of specific regions of interest to assess various aspects of disease at multiple scales. In the zoom-in assessment, the pathologist performs an in-depth evaluation of regional manifestations of disease, whereas the zoom-out assessment involves obtaining a rational estimate of the overall disease on the entire WSI. Both assessments are critical, as the pathologist forms a gestalt across various image features to comprehensively assess the disease [12].

A. Related Work
Recent attempts to perform WSI-level analysis have shown promising results in terms of assessing the overall tissue microenvironment. In particular, graph-based approaches such as graph convolutional networks have gained a lot of traction due to their ability to represent the entire WSI and analyze patterns to predict various outcomes of interest. To learn hierarchical representations for graph embedding, recent approaches have proposed pooling strategies. For example, in AttPool [15], Huang and colleagues devised an attention pooling layer to select discriminative nodes and built a coarser graph based on the calculated attention values. AttPool effectively leveraged the hierarchical representation and facilitated model learning on several graph classification benchmark datasets. Zhou and colleagues developed a cell-graph convolutional neural network on WSIs to predict the grade of colorectal cancer (CRC) [16]. In this work, the WSI was converted to a graph, where each nucleus was represented by a node and the cellular interactions were denoted as edges between these nodes to accurately predict CRC grade. Also, Adnan and colleagues developed a two-stage framework for WSI representation learning [17], where patches were sampled based on color and a graph neural network was constructed to learn the inter-patch relationships to discriminate lung adenocarcinoma (LUAD) from lung squamous cell carcinoma (LSCC). In another recent work, Lu and team developed a graph representation of the cellular architecture on the entire WSI to predict the status of human epidermal growth factor receptor 2 and progesterone receptor [18]. Their architecture attempted to create a bottom-up approach (i.e., nuclei-to-WSI-level) to construct the graph, and in so doing, achieved a relatively efficient framework for analyzing the entire WSI.
Several other researchers have leveraged graph-based approaches to process WSIs to address various questions focused on survival analysis [19]-[22], prediction of lymph node metastasis [23], mutational prediction [24], cell classification [25], and retrieval of relevant regions [26]. With the objective of retaining the correlation among different patches within a WSI, Shao and colleagues leveraged the computationally efficient sampling approach of multiple instance learning and the self-attention mechanism of ViT in TransMIL [27]. To deal with the memory and quadratic time complexity issues in ViT, they adopted the Nyström method [28]. They demonstrated the significance of retaining these correlations among instances by showing good performance for tumor classification on three different datasets: CAMELYON16, TCGA-NSCLC, and TCGA-RCC. Motivated by these advances, we submit that integration of computationally efficient approaches such as ViTs along with graphs can lead to more efficient representation learning approaches for the assessment of WSIs.

B. Contributions
The main contributions of this paper are summarized below:

• We developed a graph-based vision transformer for digital pathology called GTP that leverages a graph representation of pathology images and the computational efficiency of transformer architectures to perform WSI-level analysis. To build the GTP framework, we constructed a graph convolutional network by embedding image patches in feature vectors using contrastive learning, followed by the application of a vision transformer to predict a WSI-level label.
• Using WSI and clinical data from three publicly available national cohorts (4,818 WSIs), we developed a model that could distinguish between normal, LUAD, and LSCC WSIs.

• We introduced graph-based class activation mapping (GraphCAM), a novel approach to generate WSI-level saliency maps that can identify image regions that are highly associated with the output class label. On a few WSIs, we also compared the performance of the GraphCAMs with pathologist-driven annotations and showed that our approach identifies important disease-related regions of interest.
• Over a series of ablation studies and sensitivity analyses, we showed that our GTP framework outperforms current state-of-the-art methods used for WSI classification.

II. Materials and Methods

A. Study Population
We obtained access to WSIs as well as demographic and clinical data of lung tumors (LUAD and LSCC) and normal tissue from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), the National Lung Screening Trial (NLST) and The Cancer Genome Atlas (TCGA) ( Table I). CPTAC is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis [29]. NLST was a randomized controlled trial to determine whether screening for lung cancer with low-dose helical computed tomography reduces mortality from lung cancer in high-risk individuals relative to screening with chest radiography [30]. TCGA is a landmark cancer genomics program, which molecularly characterized thousands of primary cancer and matched normal samples spanning 33 cancer types [31].

B. Graph-Transformer
Our proposed Graph-Transformer (GT) (Fig. 1(a)) fuses a graph representation G of a WSI (Fig. 1(b)) and a transformer that can generate WSI-level predictions in a computationally efficient fashion. Let G = (V, E) be an undirected graph, where V is the set of nodes representing the image patches and E is the set of edges between the nodes in V that represent whether two image patches are adjacent to each other. We denote the adjacency matrix of G as A = [A_ij], where A_ij = 1 if there exists an edge (v_i, v_j) ∈ E and A_ij = 0 otherwise. An image patch must be connected to other patches and can be surrounded by at most 8 adjacent patches, so the sum of each row or column of A is at least one and at most 8. A graph can be associated with a node feature matrix F ∈ ℝ^{N × D}, where each row contains the D-dimensional feature vector computed for an image patch, i.e., node, and N = |V|.
Using all the pixels within each image patch as features can make model training computationally intractable. Instead, our framework applies a feature extractor to generate a vector of features and uses it to define the information contained in an image patch, i.e., a node in the graph. This step reduces the node feature dimension from W_p × H_p × C_p to D, where W_p, H_p, and C_p are the width, height, and number of channels of the image patch, and D is the dimension of the extracted feature vector. The expectation is that the derived feature vector provides an efficient representation of the node and also serves as a robust means by which to define a uniform representation of an image patch for graph-based classification.
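The 8-connectivity construction described above can be sketched in a few lines; `build_patch_graph` and the grid-coordinate input are illustrative names, not from the paper:

```python
import numpy as np

def build_patch_graph(coords):
    """Build an 8-connectivity adjacency matrix A from patch grid
    coordinates: A[i, j] = 1 iff patches i and j touch horizontally,
    vertically, or diagonally, so each row sums to at most 8."""
    index = {rc: i for i, rc in enumerate(coords)}
    n = len(coords)
    A = np.zeros((n, n), dtype=np.int8)
    for (r, c), i in index.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                j = index.get((r + dr, c + dc))
                if j is not None:
                    A[i, j] = 1
    return A

# Three mutually adjacent patches: (0,0)-(0,1), (0,1)-(1,1), (0,0)-(1,1).
A = build_patch_graph([(0, 0), (0, 1), (1, 1)])
```

Because only tissue patches become nodes, rows of A corresponding to patches at the tissue border have fewer than 8 nonzero entries.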
As described above, current methods that have been developed at the patch level impose WSI-level labels on all the patches or use weakly supervised learning to extract feature vectors that are representative of the WSI. This strategy is not suitable for all scenarios, especially when learning the overall WSI-level information. We leveraged a strategy based on self-supervised contrastive learning [32] to extract features from the WSIs. This framework enables robust representations to be learned without the need for manual labels. Our approach involves using contrastive learning to train a convolutional neural network (CNN) that produces embedding representations by maximizing agreement between two differently augmented views of the same image patch via a contrastive loss in the latent space (Fig. 1(c)). The training starts with tiling the WSIs from the training set into patches and randomly sampling a mini-batch of K patches. Two different data augmentation operations are applied to each patch (p), resulting in two augmented patches (p_i and p_j). The pair of augmented patches derived from the same patch is denoted as a positive pair. For a mini-batch of K patches, there are 2K augmented patches in total. Given a positive pair, the other 2(K − 1) augmented patches are considered as negative samples. Subsequently, the CNN is used to extract representative embedding vectors (f_i, f_j) from each augmented patch (p_i, p_j). The embedding vectors are then mapped by a projection head to a latent space (z_i, z_j), where the contrastive learning loss is applied. The contrastive loss for a positive pair of augmented patches (i, j) is defined as:

ℓ_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2K} 1_{[k≠i]} exp(sim(z_i, z_k)/τ) ],   (1)

where 1_{[k≠i]} ∈ {0, 1} is an indicator function evaluating to 1 if and only if k ≠ i, τ denotes a temperature parameter, and sim(u, v) = uᵀv/‖u‖‖v‖ denotes the dot product between L2-normalized u and v (i.e., cosine similarity).
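This normalized-temperature contrastive loss can be sketched in NumPy; the pairing convention (rows 2m and 2m+1 of `z` are the two augmented views of patch m) and the function name are illustrative assumptions:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Contrastive loss over a mini-batch of 2K augmented patches in the
    latent space; rows 2m and 2m+1 of z form the positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize -> cosine sim
    sim = z @ z.T / tau                               # pairwise similarities / temperature
    np.fill_diagonal(sim, -np.inf)                    # indicator 1[k != i]
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(len(z))
    return -log_prob[idx, idx ^ 1].mean()             # average over all positive pairs
```

With a low temperature and well-separated positive pairs, the loss approaches zero; unrelated embeddings yield a large loss.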
For model training, the patches were densely cropped without overlap and treated as individual images. The final loss was computed across all positive pairs, including both (i, j) and (j, i), in a mini-batch. After convergence, we kept the feature extractor and used it for our GTP model to compute the feature vectors of the patches from the WSIs. GTP uses these computed feature vectors as node features in the graph construction phase. Specifically, we obtained the node feature matrix F = [f_1, f_2, …, f_N]ᵀ, where f_i is the D-dimensional embedding vector obtained from the Resnet trained using contrastive learning and N is the number of patches from one WSI. Note that N is variable since different WSIs contain different numbers of patches. As a result, each node in F corresponds to one patch of the WSI. We defined an edge between a pair of nodes in F based on the spatial locations of their corresponding patches on the WSI. If patch i is a neighbor of patch j on the WSI (Fig. 1(b)), then GTP creates an edge between node i and node j and sets A_ij = A_ji = 1; otherwise A_ij = A_ji = 0. GTP uses the node feature matrix F and the adjacency matrix A to construct a graph representing each WSI.
We implemented the graph convolutional (GC) layer, introduced by Kipf & Welling [33], to handle the graph-structured data. The GC layer performs message propagation and aggregation in the graph, and is defined as:

H_{m+1} = σ(Â H_m W_m), with Â = D̃^{−1/2} Ã D̃^{−1/2},   (2)

where Â is the symmetric normalized adjacency matrix of A, σ is a nonlinear activation function, and M is the number of GC layers. Here, Ã = A + I is the adjacency matrix with a self-loop added to each node, and D̃ is a diagonal matrix with D̃_ii = Σ_j Ã_ij. H_m is the input of the m-th GC layer (m = 1, …, M) and H_1 is initialized with the node feature matrix F. Additionally, W_m ∈ ℝ^{C_m × C_{m+1}} is the matrix of learnable filters in the GC layer, where C_m is the dimension of the input and C_{m+1} is the dimension of the output.
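A single GC propagation step of this form can be sketched as follows; the ReLU activation is an assumption for illustration, and `gc_layer` is an illustrative name:

```python
import numpy as np

def gc_layer(A, H, W):
    """One GC step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    A: (N, N) adjacency, H: (N, C_in) node features, W: (C_in, C_out)."""
    A_tilde = A + np.eye(len(A))                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)           # ReLU activation (assumed)
```

Stacking M such layers lets each node aggregate information from its M-hop neighborhood on the WSI graph.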
The GC layer of GTP enables learning of node embeddings through propagating and aggregating the needed information. However, it is not trivial for a model to learn hierarchical features that are crucial for graph representation and classification. To address this limitation, we introduced a transformer layer that selects the most significant nodes in the graph and aggregates information via the attention mechanism. Transformers use a Self-Attention (SA) mechanism to model the interactions between all tokens in a sequence [34] by allowing the tokens to interact with each other ("self") and find out whom they should pay more attention to ("attention"); the addition of positional information further exploits the sequential order of the tokens. Notably, the Vision Transformer (ViT) enables the application of transformers to 2D images [14]. Inspired by these studies, we propose a transformer layer to interpret our graph-structured data. While the SA mechanism has been extensively used in the context of natural language processing, we extended the framework to WSI data. Briefly, the standard qkv self-attention [34] is a mechanism to find the words of importance for a given query word in a sentence, and it receives as input a 1D sequence of token embeddings. For the graph, the feature nodes are treated as tokens in a sequence and the adjacency matrix is used to denote the positional information. Given that x ∈ ℝ^{N × D} is the sequence of patches (feature nodes) in the graph, where N is the number of patches and D is the embedding dimension of each patch, we compute q (query), k (key), and v (value):

[q, k, v] = x U_qkv, U_qkv ∈ ℝ^{D × 3D_h}.   (3a)

The attention weights A_ij are based on the pairwise similarity between two patches of the sequence, i.e., their respective query q_i and key k_j:

A = softmax(q kᵀ / √D_h), A ∈ ℝ^{N × N},   (3b)

SA(x) = A v.   (3c)

Multi-head Self-Attention (MSA) combines the knowledge explored by k SA operations, called 'heads', and projects their concatenated outputs (Eq. 3d).
D_h (Eq. 3a) is typically set to D/k to facilitate computation and to keep the number of parameters constant when k changes.
MSA(x) = [SA_1(x); SA_2(x); …; SA_k(x)] U_msa, with U_msa ∈ ℝ^{k·D_h × D}.   (3d)

The goal of the transformer layer is to learn the mapping ℍ → T, where ℍ is the graph space and T is the transformer space. We define the mapping ℍ → T as:

z_0 = [x_class; f_1, f_2, …, f_N],   (4a)

where x_class is a learnable classification token prepended to the feature nodes (Eq. 4a), whose state at the output of the transformer layer, z_L^(0), serves as the mapping T → Y:

y = LN(z_L^(0)).   (5)

Recently, position embeddings were added to the patch embeddings to retain positional information [14]. Position embedding explores absolute position encoding (e.g., sinusoidal or learnable absolute encoding) as well as conditional position encoding. However, learnable absolute encoding is commonly used in problems with fixed-length sequences and does not meet the requirements of WSI analysis, because the number of patches that can be extracted varies among WSIs. To address this, Islam and colleagues showed that the addition of zero padding can provide absolute position information for convolution [35]. In our work, we do not leverage position embeddings in the same fashion as ViTs or via inclusion of zero padding. Rather, we used position embedding in the form of an adjacency matrix for the graph convolutional layers. The adjacency matrix of the WSI graph accounts for the spatial information and inter-connectivity of the nodes, which is preserved when performing graph convolutions. To reduce the number of inputs to the transformer layer, a pooling layer was added between the graph convolution and transformer layers (see below). The positional information between the pooled adjacency matrix and the original patch adjacency matrix is preserved by the dense learned assignment used in pooling the features. With this framework, we avoided the need to add an additional positional encoder in the transformer and instead preserved positional information in the GC layers of GTP, thus reducing the complexity of our model.
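The qkv self-attention and multi-head projection described above can be sketched in NumPy (all function and parameter names are illustrative; shapes follow the Eq. 3 conventions):

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, U_qkv):
    """Single-head qkv self-attention: x is (N, D), U_qkv is (D, 3*D_h)."""
    D_h = U_qkv.shape[1] // 3
    q, k, v = np.split(x @ U_qkv, 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(D_h))  # (N, N) pairwise attention weights
    return attn @ v                         # (N, D_h)

def msa(x, head_weights, U_msa):
    """Multi-head self-attention: run k heads, concatenate their outputs,
    and project back to dimension D with U_msa of shape (k*D_h, D)."""
    heads = [self_attention(x, U) for U in head_weights]
    return np.concatenate(heads, axis=-1) @ U_msa
```

Each attention row sums to one, so every token's output is a convex combination of the value vectors of all tokens.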
The softmax function is typically used as a row-by-row normalization function in transformers for computer vision tasks [36], [37]. The standard self-attention mechanism requires the calculation of similarity scores between each pair of nodes, resulting in memory and time complexity that are quadratic in the number of nodes (O(n²)). Since a WSI graph contains thousands of nodes, applying the transformer layer directly to the convolved graphs is not trivial and may lead to out-of-memory issues during model training. We therefore added a min-cut pooling layer [38] between the GC and transformer layers, which reduced the number of inputs from thousands to hundreds of nodes. Alternatively, mean pooling could be used to reduce the number of nodes, but it may lose information and, unlike min-cut pooling, contains no learnable parameters. Min-cut pooling preserves the local information of neighboring nodes while reducing the number of nodes. The idea behind min-cut pooling is to take a continuous relaxation of the min-cut problem and implement it as a pooling layer with a custom loss function. By minimizing this custom loss, the pooling layer learns to find min-cut clusters on any given graph and aggregates the clusters to reduce the graph size. Moreover, because the pooling layer is part of the entire GTP model, the classification loss being minimized during training influences the min-cut layer as well. In so doing, our graph-transformer was able to accommodate thousands of patches as input, underscoring the novelty of our approach and its application to WSI data.
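The pooling step can be sketched as follows; the cut and orthogonality losses follow the min-cut pooling formulation of Bianchi et al. [38], and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mincut_pool(X, A, S_logits):
    """Pool N nodes into N_pool clusters with a learned soft assignment S.
    Returns pooled features, pooled adjacency, and the auxiliary loss that
    pushes S toward min-cut clusters; this loss is minimized jointly with
    the downstream classification loss."""
    S = softmax(S_logits)                 # (N, N_pool), rows sum to 1
    X_pool = S.T @ X                      # cluster features
    A_pool = S.T @ A @ S                  # cluster connectivity
    D = np.diag(A.sum(axis=1))
    cut_loss = -np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
    SS = S.T @ S
    ortho_loss = np.linalg.norm(
        SS / np.linalg.norm(SS) - np.eye(S.shape[1]) / np.sqrt(S.shape[1]))
    return X_pool, A_pool, cut_loss + ortho_loss
```

In GTP, `S_logits` would come from a learned layer over the convolved node features, so the assignment is trained end to end.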

C. Class Activation Mapping
To understand how GTP processes WSI data and identifies regions that are highly associated with the class label, we proposed a novel class activation mapping technique on graphs (Fig.  2). In what follows, we use the term GraphCAM to refer to this technique. Our technique was inspired by the recent work by Chefer and colleagues [39], who used the deep Taylor decomposition principle to assign local relevance scores and propagated them through the layers by maintaining the total relevancy across layers. In a similar fashion, our method computes the class activation map from the output class to the input graph space, and reconstructs the final class activation map for the WSI from its graph representation.
Let A^(l) represent the attention map of the MSA block l in Eq. 3b. Following the propagation procedure of relevance and gradients by Chefer and colleagues [39], GraphCAM computes the gradient ∇A^(l) and layer relevance R^(n_l) with respect to a target class for each attention map A^(l), where n_l is the layer that corresponds to the softmax operation in Eq. 3b of block l. The transformer relevance map C_t is then defined as a weighted attention relevance:

C_t = Ā^(1) · Ā^(2) ⋯ Ā^(B), with Ā^(l) = I + E_h[(∇A^(l) ⊙ R^(n_l))^+],   (6)

where ⊙ is the Hadamard product, E_h is the mean across the 'heads' dimension, (·)^+ denotes retaining only the positive values, B is the number of MSA blocks, and I is the identity matrix to avoid self-inhibition for each node.
The node features pooled by the min-cut pooling layer are computed as X_pool = Sᵀ X, where S ∈ ℝ^{N_g × N_t} is the dense learned assignment, and N_g and N_t are the numbers of nodes before and after the pooling layer, respectively. To yield the graph relevance map C_g from the transformer relevance map C_t, GraphCAM maps C_t back to the nodes of the graph via the dense learned assignment: C_g = S C_t. Finally, GraphCAM reconstructs the final class activation map on the WSI using the adjacency matrix of the graph and the coordinates of the patches.
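A simplified sketch of this relevance propagation is shown below; for illustration, the attention maps themselves stand in for the layer relevances R (the full method of Chefer and colleagues [39] propagates relevance through all layers), and token 0 is assumed to be the class token:

```python
import numpy as np

def graphcam(attn_maps, attn_grads, S):
    """Chain per-block relevance maps A_bar = I + mean_h[(grad * attn)^+],
    take the class-token row, and map it back to graph nodes via the
    pooling assignment S of shape (N_g, N_t). Each attention map / gradient
    is (heads, n, n) with n = N_t + 1 tokens (class token plus pooled nodes)."""
    n = attn_maps[0].shape[-1]
    C = np.eye(n)
    for attn, grad in zip(attn_maps, attn_grads):
        A_bar = np.eye(n) + np.maximum(grad * attn, 0.0).mean(axis=0)
        C = A_bar @ C
    c_t = C[0, 1:]     # relevance of each pooled node for the target class
    return S @ c_t     # relevance of each original patch node
```

The returned per-patch relevances would then be placed back at the patch coordinates to render the heatmap over the WSI.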

D. Data and Code Availability
All the WSIs and corresponding clinical data can be downloaded freely from CPTAC, TCGA and NLST websites. Python scripts and manuals are made available on GitHub (https://github.com/vkola-lab/tmi2022).

III. Experiments
We performed several experiments to evaluate our GTP framework. The NLST data (≈1.8 million patches) was used exclusively for contrastive learning to generate patch-specific features, which were used to represent each node. The GTP framework was trained on the CPTAC data (2,071 WSIs) using 5-fold cross-validation, and the TCGA data (2,082 WSIs) was used as an independent dataset for model testing using the same hyperparameters.
We also conducted ablation studies to evaluate the contribution of various components on the overall GTP framework. By blocking out the GTP components, we were left with a framework that is comparable to the state-of-the-art in the field. Finally, we used GraphCAMs to identify salient regions on the WSIs, and explored their validity in terms of highlighting the histopathologic regions of interest.

A. Experimental Settings
Each WSI was cropped to create a bag of 512 × 512 non-overlapping patches at 20× magnification, and background patches with non-tissue area > 50% were discarded. We used Resnet18 as the CNN backbone for the feature extractor. We adopted the Adam optimizer with an initial learning rate of 0.0001, a cosine annealing scheme for learning rate scheduling, and a mini-batch size of 512. We kept the trained feature extractor and used it to build the graphs. We
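The background-filtering rule (discard patches with more than 50% non-tissue area) can be sketched as below; the near-white cutoff of 220 is an assumed heuristic, not a value reported in the paper:

```python
import numpy as np

def is_tissue(patch, max_background=0.5, white_level=220):
    """Keep a patch only if at most half of its pixels are near-white
    background. patch: (H, W, 3) uint8 RGB. A pixel counts as background
    when it is bright in all three channels (white_level is assumed)."""
    background = patch.min(axis=-1) > white_level
    return background.mean() <= max_background
```

In practice, such a filter is applied to every 512 × 512 tile before feature extraction, so only tissue tiles become graph nodes.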

B. Ablation Studies
To evaluate the effectiveness of our proposed GTP framework, we performed several ablation studies and compared the GTP model performance with other state-of-the-art methods (Table II). Since our framework leverages graph-based learning and transformer architectures, we selected a graph-based method, AttPool [15], and a transformer-based framework, TransMIL [27], for comparison, as they both retain spatial information to make a WSI-level prediction. To make a fair comparison, we used the same contrastive learning-based model as the feature extractor for all methods. When implementing AttPool and TransMIL, we fine-tuned the hyperparameters used in the previously published work to achieve the best performance on the CPTAC and TCGA cohorts. Next, we removed the transformer component, trained only the graph portion, and compared it with the full GTP framework to study the performance of the graph-based part of our model. Due to computational constraints, it was not feasible to remove the graph component and train only the transformer; an out-of-memory issue occurred when the number of patches exceeded 4,760 at a batch size of 1 on a workstation with a GeForce RTX 2080 Ti GPU (11 GB memory).
We also performed a detailed analysis to evaluate the effect of contrastive learning on GTP model performance by conducting studies with and without it (Table III). We adopted a supervised learning-based model, Resnet [40], instead of the contrastive learning-based model. Resnet⋆ represents the model trained on ImageNet [13]. Resnet† represents Resnet⋆ fine-tuned on the same NLST data that was used for contrastive learning. To train Resnet† via supervised learning, we assigned the WSI label to all patches belonging to the same WSI. We also added an unsupervised learning model, a convolutional autoencoder (CAE) [41], for comparison. Additionally, to make a fair comparison in terms of domain specificity, we added a contrastive learning model trained on STL-10 [42] to compare with the model trained on NLST. STL-10 is an image recognition dataset for developing unsupervised feature learning, whose images were acquired from labeled examples on ImageNet, whereas NLST is a lung cancer imaging dataset. In essence, these ablation studies allowed us to fully evaluate the power of the GTP framework for WSI-level classification.

C. Computing Infrastructure
We used PyTorch (v1.9.0) and a NVIDIA 2080Ti graphics card with 11 GB memory on a GPU workstation to implement the model. The training speed was about 2.4 iterations/s, and it took less than a day to reach convergence. The inference speed was < 1 s per WSI with a batch size of 2.

D. Performance Metrics
For the classification task (LUAD vs. LSCC vs. normal), we generated receiver operating characteristic (ROC) and precision-recall (PR) curves based on model predictions on the CPTAC test and full TCGA datasets. For each ROC and PR curve, we also computed the area under the curve (AUC), precision, recall, specificity, and accuracy. Since we used 5-fold cross-validation, we took the curves from the different folds and calculated the mean AUCs and their variance. DeLong's statistical test was used to assess whether the AUCs of two different models were significantly different. GraphCAMs were used to generate visualizations and gain a qualitative understanding of model performance.

E. Expert Annotations
A randomly selected set of WSIs (n = 10 for LUAD and n = 10 for LSCC) was uploaded to a secure, web-based software platform (PixelView; deepPath, Boston, MA) for expert review. Using an Apple Pencil and an iPad, E.J.B. annotated several histologic features of LUAD and LSCC. Tumor regions of LUAD were annotated by their LUAD tumor subtypes (solid, micropapillary, cribriform, papillary, acinar, and lepidic). For both LUAD and LSCC, other histologic features of the tumor were also annotated, including necrosis, lymphatic invasion, and vascular invasion. Non-tumor regions were annotated as normal or premalignant epithelium, normal or inflamed lung, stroma, and cartilage. The annotation of each region reflects its most predominant histologic feature. The annotations were grouped into tumor and non-tumor categories, exported as binary images, and processed to quantify the extent of overlap between the model-derived GraphCAMs and the pathologist-driven annotations. Intersection over union (IoU) was used as the metric to measure the overlap between the model and the expert.
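The IoU metric used to compare the binarized GraphCAMs against the expert masks can be computed as:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks, e.g., a GraphCAM
    thresholded at some probability versus the expert tumor annotation."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = (a | b).sum()
    return (a & b).sum() / union if union else 1.0
```

Sweeping the binarization threshold and recording the maximum IoU per slide mirrors how the overlap values are reported in the Results.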

IV. Results
The GTP framework that leveraged contrastive learning followed by fusion of a graph with a transformer provided accurate prediction of WSI-level class label (Table II). High model performance was observed on the normal vs. LUAD vs. LSCC task on the CPTAC test dataset but dropped slightly on the TCGA dataset. High model performance was also confirmed via the ROC and PR curves generated on both the CPTAC and TCGA datasets for all the classification tasks (Fig. 3). On each task, the mean area under the ROC and PR curves was high (all > 0.9) on the CPTAC test data. For the TCGA dataset, which was used for external testing, the mean area under the ROC and PR curves dropped slightly, especially for the LUAD and LSCC classification tasks. The model leaned towards incorrectly classifying a few LSCC and LUAD cases but correctly classified most of the WSIs with no tumor.
The GTP framework achieved the best performance compared with other state-of-the-art methods such as TransMIL [27] and AttPool [15] (Table II). The AttPool framework selects the most significant nodes in the graph and aggregates information via the attention mechanism. The TransMIL framework uses the self-attention mechanism of the transformer to model the interactions between all patches from the WSI for information aggregation. We submit that the graph structure along with graph pooling enabled GTP to capture short-range associations, while the transformer helped capture long-range associations. Also, since we were able to perform batch training instead of processing a single sample per iteration, the model achieved faster convergence and also accounted for potential variability between samples during training.
The GT-based class activation maps (GraphCAMs) identified WSI regions that were highly associated with the output class label (Fig. 4). We also observed a high degree of overlap between the expert-identified regions of interest and the binarized GraphCAMs. Specifically, the maximum value of the intersection over union (IoU) was 0.857 at a threshold probability of 0.6 for the LUAD case (Row 1 in Fig. 4), and an IoU of 0.733 at a threshold probability of 0.2 for the LSCC case (Row 2 in Fig. 4). Additionally, the expert pathologist selected 20 cases from the TCGA cohort (10 LUAD and 10 LSCC) and annotated the tumor regions on them. We then estimated the overlap between the model-identified regions of interest (via GraphCAMs) and the pathologist annotations, resulting in a high degree of agreement (mean of maximum IoU values = 0.817 ± 0.165). Importantly, the degree of overlap remained fairly consistent as the threshold for binarization varied, supporting the notion that our approach to identifying important WSI regions has clinicopathologic significance. We also observed that the same set of WSI regions was highlighted by our method across the various cross-validation folds (Fig. 5), indicating the consistency of our technique in identifying salient regions of interest. Also, since we can generate a class-specific probability for each GraphCAM, our approach allows for better appreciation of the model performance and its interpretability in predicting an output class label. We must note, however, that in certain cases when the model fails to predict the class label, the GraphCAMs may not result in interpretable findings (Fig. 6).
Ablation studies revealed that our GTP framework, which used contrastive learning and combined a graph with a transformer, served as a superior model for WSI-level classification (Table III). When contrastive learning was replaced with a pre-trained architecture (ResNet18, with and without fine-tuning), the model performance on the 3-label classification task dropped. The reduction in performance was evident on both the CPTAC and TCGA datasets. The model performance also dropped when we trained a convolutional auto-encoder [41] in lieu of contrastive learning. These results imply that the feature maps generated via contrastive learning encoded a variety of visual information for GT-based classification with a fair degree of generalizability. We also studied the effect of the domain specificity of the dataset on model performance by comparing contrastive learning models trained on STL-10 and NLST (Table III). STL-10 [42] is an image recognition dataset for developing unsupervised feature learning, with images acquired from labeled examples on ImageNet. The contrastive learning models trained on NLST performed significantly better than those trained on STL-10, indicating that the domain knowledge learned by contrastive learning helps improve the generalizability of our GTP model.
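For reference, the contrastive objective in this family of methods is typically the normalized temperature-scaled cross-entropy (NT-Xent) loss. The NumPy sketch below is a minimal illustration under the assumption that rows 2i and 2i+1 of the input are the two augmented views of sample i; the temperature value is illustrative, not the paper's exact setting.

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2K projected embeddings, where rows 2i and 2i+1
    are the two augmented views of sample i. Returns the mean loss."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z)
    # index of each row's positive (its sibling augmented view)
    pos = np.array([i + 1 if i % 2 == 0 else i - 1 for i in range(n)])
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Minimizing this loss pulls the two views of each patch together while pushing apart views of different patches, which is how the domain-specific NLST pre-training above yields informative patch embeddings.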
To analyze the impact of different model hyperparameters and graph configurations on classification performance, we performed a series of ablation studies (Table IV). In general, the consistency of the results generated using different choices of hyperparameters demonstrates the robustness of our proposed model. When the dimension of the hidden state was reduced by half (i.e., from 128 to 64), model performance degraded. Similar degradation was observed when the number of GCN layers was reduced from 3 to 1. In essence, additional GCN layers increase the receptive field of the nodes and can provide longer-range information during model training. Increasing the number of min-cut nodes improved performance, since representing the data with more tokens typically contributes to higher prediction accuracy [14]. Also, we found that 3 transformer blocks were sufficient to integrate information across the entire set of pooled nodes from the graph. It must be noted that using more parameters for model training might alter the performance, but would be feasible only at a higher computational cost; for instance, the cost of the self-attention blocks increases quadratically with the number of tokens. Nevertheless, to find a trade-off between computational efficiency and model effectiveness, we tested various parameters, including different forms of graph construction. Increasing the patch size lowered the accuracy. Using 4-node instead of 8-node connectivity reduced the receptive field, also resulting in lower accuracy. Choosing smaller patch sizes increased the size of the graph, which increased the computational cost. Lastly, when the graph was constructed with a 10% overlap between patches, model performance was slightly reduced.
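The 8-node connectivity studied above can be sketched directly: two retained patches are linked exactly when their grid coordinates differ by at most one in each direction. The helper below is a hypothetical illustration, not the paper's code; dropping the diagonal offsets would yield the 4-node variant that reduced accuracy.

```python
import numpy as np

def build_adjacency(coords):
    """8-connectivity adjacency for retained (non-background) patches.
    coords: list of (row, col) grid positions of kept patches.
    Two patches are adjacent if both grid offsets are at most 1."""
    n = len(coords)
    A = np.zeros((n, n))
    index = {rc: i for i, rc in enumerate(coords)}
    # the 8 neighbouring grid offsets (exclude the patch itself)
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    for (r, c), i in index.items():
        for dr, dc in offsets:
            j = index.get((r + dr, c + dc))
            if j is not None:
                A[i, j] = 1
    return A
```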
These findings indicate that our proposed GTP framework along with the selected hyperparameters is capable of reasonably integrating information across the entire WSI to predict WSI-level output of interest.

V. Discussion
In this work, we developed a deep learning framework that integrates graphs with vision transformers to generate an efficient classifier that differentiates normal WSIs from those with LUAD or LSCC. Across a range of model performance metrics, our approach achieved classification performance that exceeded that of other deep learning architectures incorporating various state-of-the-art configurations. Also, our novel class activation mapping technique allowed us to identify salient WSI regions that were highly associated with the output class label of interest. Finally, we found that our model-identified regions of interest closely matched pathologist-derived assessments on the WSIs. Thus, our findings represent novel contributions to the field of interpretable deep learning while simultaneously advancing the fields of computer vision and digital pathology.
The field of computational pathology has made great strides recently due to advancements in vision-based deep learning. Still, owing to the sheer size of pathology images generated at high resolution, WSI assessment that can integrate spatial signatures along with local, region-specific information for prediction of tumor grade remains a challenge. A large body of work has focused on patch-level models that may accurately predict tumor grade but fail to capture spatial connectivity information. As a result, identification of important image-level features via such techniques may lead to inconsistent results. Our GTP framework tackles precisely this scenario by integrating WSI-level information via a graph structure and thus represents a key advancement in the field.
One of the novel contributions of our work is the generation of graph-based class activation maps (GraphCAMs), which highlight the WSI regions associated with the output class label. Unlike other saliency mapping techniques such as attention rollout [43] or layer-wise relevance propagation [44], GraphCAMs can generate class-specific heatmaps. This is a major advantage because an image may contain information pertaining to multiple classes, and in these scenarios, identification of class-specific feature maps becomes important. This is especially true in real-world settings such as pathology images containing lung tumors. Class-specific features could be useful, for example, in identifying rare mixed-histology adenosquamous tumors that contain both LUAD and LSCC patterns [45]. In such cases, well-known supervised deep learning classifiers such as convolutional neural networks that use the overall WSI label for classification at the patch-level, or even at the WSI-level, may not perform well and may even misidentify the regions of interest associated with the class label. By generating class-specific CAMs learned at the WSI-level, our GTP approach provides a better way to identify regions of interest on WSIs that are highly associated with the corresponding class label.
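To make the contrast with attention rollout concrete, a gradient-weighted relevancy update in the spirit of class-specific transformer explanations can be sketched as below. This is a simplified single-block step, not the exact GraphCAM procedure, and all names are illustrative. Because the attention map is weighted by the gradient of a chosen class score, different target classes yield different relevancy maps from the same attention, which plain attention rollout cannot do.

```python
import numpy as np

def relevancy_update(R, attn, grad):
    """One transformer-block relevancy update: the attention map is
    weighted by the gradient of the target class score, clamped to
    positive values, head-averaged, row-normalized, and accumulated
    into the running relevancy R.
    attn, grad: (heads, N, N); R: (N, N), initialized to the identity."""
    cam = np.maximum(grad * attn, 0.0).mean(axis=0)    # class-specific weighting
    cam = cam / np.maximum(cam.sum(axis=1, keepdims=True), 1e-12)
    return R + cam @ R
```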
Our study has a few limitations. We leveraged contrastive learning to generate patch-level feature vectors before constructing the graph, which turned out to be a computationally intensive task. Future studies can explore other feature extraction techniques that may lead to improved model performance. Our graph was constructed by dividing the WSI into image patches, followed by creation of nodes using the embedding features from these patches. Alternative ways could be explored to define the nodes and create graphs that are more congruent and spatially connected. While we have demonstrated the applicability of GTP to lung tumors, extension of this framework to other cancers is needed to fully appreciate its role in assessing WSI-level correlates of disease. In fact, our method is not specific to cancer and could be adapted to other computational pathology applications.
In conclusion, our GTP framework produced an accurate, computationally efficient model by capturing regional and WSI-level information to predict the output class label. GTP can tackle high-resolution WSIs and predict multiple class labels, leading to the generation of interpretable, class-specific findings. Our GTP framework could be scaled to WSI-level classification tasks on other organ systems and has the potential to predict response to therapy, cancer recurrence and patient survival.

(a) Each whole slide image (WSI) was divided into patches. Patches that predominantly contained the background were removed, and the remaining patches were embedded as feature vectors by a contrastive learning-based patch embedding module. The feature vectors were then used to build the graph, followed by a transformer that takes the graph as input and predicts the WSI-level class label. (b) Each selected patch was represented as a node, and a graph was constructed on the entire WSI using the nodes with an 8-node adjacency matrix. Here, two sets of patches of an WSI and their corresponding subgraphs are shown. The subgraphs are connected within the graph representing the entire WSI.
(c) We applied three distinct augmentation functions (random color distortions, random Gaussian blur, and random cropping followed by resizing back to the original size) to the same sample in a mini-batch. If the mini-batch size is K, we ended up with 2 × K augmented observations in the mini-batch. The ResNet received an augmented image and produced an embedding vector as output. A projection head, a multilayer perceptron (MLP) with 2 dense layers, was then applied to the embedding vector to produce the inputs to contrastive learning. In this example, we considered K = 3 samples in a mini-batch.

Gradients and relevance are propagated through the network and integrated with an attention map to produce the transformer relevancy maps. The transformer relevancy maps are then mapped to graph class activation maps via reverse pooling.

For each WSI, we generated GraphCAMs and compared them with annotations from the pathologist. The first column contains the original WSIs; the second and third columns contain the GraphCAMs and the pathologist's annotations, respectively; and the fourth column contains the binarized GraphCAMs based on the threshold from the intersection over union (IoU) plot in the last column. The first row shows an LUAD case and the second row an LSCC case.

GraphCAMs generated on WSIs across the runs performed via 5-fold cross-validation are shown. The first column shows the original WSIs and the other columns show the GraphCAMs with prediction probabilities for the cross-validated model runs. The first row shows a sample WSI from the LUAD class and the second row shows an WSI from the LSCC class. The colormap represents the probability by which an WSI region is associated with the output label of interest.

The first row shows a sample WSI from the LUAD class where the model prediction was LSCC, and the second row shows an WSI from the LSCC class where the model prediction was LUAD. The first column shows the original WSI, and the second and third columns show the generated GraphCAMs along with prediction probabilities. The bold font underneath certain GraphCAMs indicates the model-predicted class label for the respective cases. Since this is a 3-label classification task (normal vs. LUAD vs. LSCC), the LUAD and LSCC probability values do not add up to 1.