Knowledge Distillation in Histology Landscape by Multi-Layer Features Supervision

Automatic tissue classification is a fundamental task in computational pathology for profiling tumor micro-environments. Deep learning has advanced tissue classification performance at the cost of significant computational power. Shallow networks have also been trained end-to-end using direct supervision; however, their performance degrades because they fail to capture robust tissue heterogeneity. Knowledge distillation has recently been employed to improve the performance of shallow networks, used as student networks, through additional supervision from deep neural networks used as teacher networks. In the current work, we propose a novel knowledge distillation algorithm to improve the performance of shallow networks for tissue phenotyping in histology images. For this purpose, we propose multi-layer feature distillation such that a single layer in the student network gets supervision from multiple teacher layers. In the proposed algorithm, the sizes of the feature maps of two layers are matched using a learnable multi-layer perceptron. The distance between the feature maps of the two layers is then minimized during the training of the student network. The overall objective function is computed by summing the loss over multiple layer combinations, weighted by a learnable attention-based parameter. The proposed algorithm is named Knowledge Distillation for Tissue Phenotyping (KDTP). Experiments are performed on five different publicly available histology image classification datasets using several teacher-student network combinations within the KDTP algorithm. Our results demonstrate a significant performance increase in the student networks trained with the proposed KDTP algorithm compared to direct supervision-based training methods.


I. INTRODUCTION
THE development of modern slide scanners for capturing multi-gigapixel Whole Slide Images (WSIs) has enabled significant growth of computational pathology [9], [13], [30], [35], [45], [49]. In clinical practice, these WSIs are considered a gold standard for cancer grading, diagnosis, and prognosis [47]. These WSIs have been leveraged by many machine learning techniques to help clinicians and pathologists assess the degree of malignancy of cancer by automatically analyzing the tumor micro-environment [47], [54]. A typical WSI may contain tens of thousands of pixels along each dimension at the highest magnification level. Such enormous sizes of WSIs pose significant challenges to machine learning techniques due to the increased demand for computational power and storage capacity. To handle this challenge, WSIs have often been divided into patches of relatively smaller size which are then processed by the machine learning techniques as shown in Fig. 1. The main aim of machine learning techniques is to assist pathologists in improving their diagnostic performance by increasing reproducibility and reducing inter-observer variations [11], [28], [37], [53].
Automatic tissue phenotyping in histology images is one of the important tasks in computational pathology [4], [19], [20], [44]. One of its aims is to learn cancer biomarkers within the tumor-infiltrating lymphocytes landscape for better cancer diagnosis, grading, prognosis, and evaluating response-to-treatment [13], [22], [30], [32]. It also has an important role in profiling intra-tumor heterogeneity, epigenetics, and cancer progression [34]. Four different examples of tissue classifications are shown in Fig. 1. This fundamental problem has been addressed by many machine learning researchers. However, wide variations of textures, tissue structure, and heterogeneity in histology images pose significant challenges to machine learning techniques [22]. In order to capture such heterogeneity, deep neural networks have been employed which require a large number of annotated training samples to learn rich feature representations. Such deep neural networks have obtained excellent results; however, they require expensive computational resources in addition to a huge volume of annotated training WSIs [45], [47]. Therefore, such tissue classification tools are not feasible on devices with limited resources, e.g., embedded devices.
In order to reduce the amount of training data as well as computational resources, knowledge distillation techniques have recently been proposed that effectively train a lightweight student network from a heavyweight teacher network [14]. The generalization ability of the student model can be improved by training it to mimic the feature representations and match the predictions of the teacher model. Recently, such techniques have also obtained significant attention in the machine learning community for object classification [6], action recognition [31], and object tracking applications [42]. In computational pathology, knowledge distillation can reduce the resource requirements at inference time, thus improving the response time and reducing the cost of equipment. Also, the generalization capability of a student model trained with knowledge distillation is much better than that of one trained on just the tissue phenotyping data. This is because the teacher network leverages the benefits of large-scale datasets such as ImageNet for pre-training. This knowledge is then transferred to the student model in the context of tissue phenotyping, resulting in improved performance. However, the strength of knowledge distillation techniques is not fully explored in computational pathology research for the purpose of training lightweight models for tissue classification.
Fig. 1 (partial caption). ... [41]. (c) Gastrointestinal cancer classification showing tissue images of micro-satellite stable on the left and micro-satellite instable on the right [23].
In the current work, we bridge this research gap by proposing a novel knowledge distillation algorithm for the histology image classification task. Most initial knowledge distillation techniques proposed using the predictions of the teacher model as targets for the student model [17], [57]. Although this produces good results, the information at the final layer is quite abstract. The student model only gets the opportunity to learn the information kept by the final layers, ignoring the rich information contained in the intermediate layers of the teacher model [14]. In order to exploit this information, feature-map-based knowledge distillation has been proposed in which the feature maps of a student layer are matched with the feature maps of a particular teacher layer [1], [14]. These methods improved on the initial knowledge distillation work; however, the supervision provided by the teacher network is still limited to the number of layers in the student network. In order to handle this drawback, we propose multi-layer supervision for a single student layer. More precisely, we propose that each student layer be supervised by multiple teacher layers, providing better knowledge distillation compared to the existing feature-based techniques.
In the proposed algorithm, a lightweight student network is trained to mimic the rich feature representations of a heavyweight teacher network which is pre-trained using conventional schemes. Each layer in the student model gets supervision from multiple layers of the teacher model. The existing feature map-based knowledge distillation techniques use consecutive teacher layers to supervise the corresponding student layers in the same order. We instead propose a distributed supervision covering the whole spectrum of feature maps in the teacher model. It is obtained by providing backward links from the later teacher layers to the earlier student layers such that fewer student layers cover most of the teacher layers. The multi-layer supervision enables the student layers to encode rich information which was not possible from direct training of the student model.
In order to make the distributed supervision more effective, an attention mechanism is exploited which facilitates better knowledge distillation from multiple teacher layers to a single student layer. For this purpose, we compute self-similarity between different student and teacher layers separately. The similarity matrices are then non-linearly transformed into queries and keys such that the overall algorithm performance improves [48]. This is obtained by using two different fully connected neural networks. The queries and keys are then projected to obtain self-attention weights which apportion the supervision of different teacher layers to a particular student layer. These self-attention weights transfer the rich semantic information contained in the later layers of the teacher model to the earlier layers of the student model through knowledge distillation, resulting in significant performance improvements.
The proposed algorithm is dubbed Knowledge Distillation for Tissue Phenotyping (KDTP). A large number of experiments are performed on five different tissue classification datasets [8], [19], [22], [23], [41] using many combinations of the teacher-student models with the KDTP algorithm. In some of these combinations, we observe that the performance of the student model improves even beyond that of the teacher model. For instance, ResNet-18, when used as the student network and trained using our proposed KDTP algorithm, has consistently outperformed the ResNet-50 used as the teacher model. This demonstrates the effectiveness of using the proposed algorithm for the tissue classification task. The main contributions of the current work are as follows:
1) In this work, we improve the tissue classification performance using a knowledge distillation algorithm that includes both multi-layer supervision and response-level distillation.
2) A novel multi-layer self-attention-based feature maps distillation is proposed which facilitates multiple teacher layers to supervise a single student layer.
3) A novel forward links-based distributed knowledge distillation is proposed which distributes the teacher supervision to each student layer.
4) Extensive teacher-student model combinations are tested on five different datasets to validate the effectiveness of the proposed algorithm.
The rest of this paper is organized as follows: Section II presents a literature review on tissue phenotyping and knowledge distillation methods. Section III explains the proposed algorithm in detail. Section IV presents the exhaustive experimental evaluations while Section V draws the conclusion and future directions of the current work.

II. LITERATURE REVIEW
We divide the related work into two different sections to briefly summarize state-of-the-art tissue phenotyping and knowledge distillation methods.

A. Tissue Phenotyping Methods
The classical tissue phenotyping approaches compute local texture features such as local binary patterns and Gabor features which are then used for classifier training [2], [24], [40]. For instance, Kather et al. used a multi-texture feature analysis method for tissue phenotyping of eight distinct classes in CRC histology images [24]. Similarly, a dictionary learning-based approach utilizing Gabor features has also been proposed by Sarkar et al. [40]. Other classical texture feature-based approaches are also proposed in [2].
Moving towards the deep learning era, state-of-the-art tissue phenotyping performance has advanced [22], [45], [52]. In end-to-end deep learning-based methods, a Convolutional Neural Network (CNN) is trained on a set of training images for the task of patch-based tissue phenotyping. For instance, Bejnordi et al. employed three different networks to classify stromal and epithelium tissues from breast cancer WSIs [12]. In some other methods, CNNs have been used only as a feature extractor for training a separate classifier [3]. For instance, the AlexNet architecture [27] was used for deep feature extraction by Yu et al. [52]. The extracted features are then used to train a linear support vector machine classifier for histology image segmentation. Some studies also consider fine-tuning the existing trained networks [22], [40]. For instance, Kather et al. fine-tuned a VGG-19 network [43] on nine distinct tissue classes for the estimation of tumor-stroma scores, which are then used in a large-scale study for survival prediction analysis [22]. Some other researchers have recently proposed biologically more meaningful features based on cellular interactions for tissue classification [19], [20]. Han et al. proposed a weakly supervised semantic segmentation method [15]. They used patch-level labels for the estimation of pixel-level labels using weak supervision for tissue semantic segmentation. Li et al. proposed a pyramidal deep broad learning method for tissue classification [29].
In the current work, we propose to fine-tune a student network using a pre-trained teacher network for the purpose of tissue phenotyping. Our approach is based on multiple types of knowledge distillation supervision including multi-layer feature maps-based and network prediction-based supervision. To the best of our knowledge, no such knowledge distillation technique containing multi-layer feature supervision has previously been proposed for tissue phenotyping in histology images.

B. Knowledge Distillation Techniques
Earlier knowledge distillation methods relied on the predictions of a larger teacher neural network to distill knowledge to the student network [14], [17]. For instance, Hinton et al. used the predictions of the teacher network as soft targets for the student network for the image classification problem [17]. Zagoruyko et al. used the attention maps of the teacher network to train the student network [55]. Thus the attention is transferred from the teacher to the student, improving the student's classification performance. Wang et al. also transferred attention using selected features for knowledge distillation [50]. The importance of the features is dynamically established during the knowledge distillation step. Chen et al. used the logits for knowledge transfer in the object detection task [7]. Zhang et al. employed the heatmaps generated by the teacher model for knowledge distillation to the student network in the human pose estimation task [56]. Zhang et al. extended the idea of using a single teacher network towards using multiple teachers or students [57]. Their work proposed mutual learning of multiple deep networks using logits. These early studies reported improved performance in different tasks; however, these methods rely on the final output of the teacher network, which is difficult for the student network to learn, especially at the initial and intermediate layers [14].
In addition to the teacher network response, feature maps at the intermediate layers have also been used for distilling knowledge to the student network [1]. A variety of feature-based knowledge distillation methods have been proposed in the literature [1], [10], [14]. For instance, Romero et al. directly matched the feature activations of the teacher model and the student model for knowledge distillation [1]. Passalis and Tefas distilled knowledge by using the probability distribution in the feature space [38]. Kim et al. proposed an improved form of the intermediate representation for better knowledge transfer using feature maps [26]. Jin et al. used the concept of hint layers to better supervise the student model [21]. Chen et al. proposed to adaptively assign attention weights to different teacher layers, which are then used to distill knowledge to the student model in a cross-layer manner [6].
Despite significant progress in knowledge distillation research, its applications in computational pathology are quite sparse [5], [25], [36]. Chaudhury et al., proposed mutual learning of teacher and student networks for breast cancer classification [5]. Marini et al., also proposed a knowledge distillation method for Gleason score classification in prostate cancer images [36]. Recently, Dipalma et al., proposed resolution-based distillation for improving histology image classification [10]. Ke et al., used the self-distillation model for identifying patch-level MSI and MSS in histology images [25]. In contrast to these knowledge distillation approaches, we propose multi-layer supervision for each student layer which is distributed on multiple teacher layers. The student supervision is distributed over all intermediate layers of the larger teacher model by exploiting an attention mechanism. To the best of our knowledge, our proposed algorithm is novel not only in computational pathology applications but also in general knowledge distillation research.

III. PROPOSED METHODOLOGY
In this section, we explain the proposed Knowledge Distillation for Tissue Phenotyping (KDTP) algorithm in detail. Our KDTP algorithm consists of one deeper teacher network and one shallower student network. Both networks are pre-trained on ImageNet dataset [39] for the natural image classification task. Both networks are then fine-tuned on different histology image datasets for classification tasks shown in Fig. 1. The fine-tuned student network is then further trained using the proposed knowledge distillation algorithm consisting of teacher response supervision as well as intermediate representations supervision. Our proposed knowledge distillation algorithm is shown in Fig. 2. The details of the proposed algorithm are discussed in the following subsections.
1) Teacher Response-Based Knowledge Distillation: Let $\mathcal{D} = \{(d_i, y_i)\}_{i=1}^{n}$ be the training dataset consisting of $n$ tissue instances from $c$ distinct tissue classes, where $d_i$ is the feature vector and $y_i$ is the corresponding ground-truth label in the form of a one-hot encoded vector. The teacher and student logits are normalized using a soft-max layer with a softening parameter $0 < \sigma \le 1.0$ to obtain their respective responses $r_{s/t}(i, j)$ [14], where $g(i, j)$ is the logit corresponding to the $j$-th class and $i$-th instance in a batch and $r_{s/t}$ is the normalized response of the student or teacher model. The response $r_t \in \mathbb{R}^{b \times c}$ of the teacher model for each input tissue instance $d_i$ in a batch is used to supervise the corresponding student response $r_s \in \mathbb{R}^{b \times c}$ using the Kullback-Leibler (KL) divergence as [14]:

$$\mathcal{L}_{KL}(r_s, r_t) = \frac{1}{b} \sum_{i=1}^{b} \sum_{j=1}^{c} r_t(i, j) \log \frac{r_t(i, j)}{r_s(i, j)}, \tag{2}$$

where $b$ is the mini-batch size. In addition to the KL divergence, the multi-class cross-entropy classification loss function $\mathcal{L}_{CE}(y_i, r_s)$ is also used to train the student network [14], [17]. The combined response-based loss function $\mathcal{L}_r$, referred to as (3) in the sequel, is minimized while training the student network; it combines $\mathcal{L}_{CE}$ and $\mathcal{L}_{KL}$, where $\alpha$ is the hyper-parameter that sets the relative importance of the two loss terms.
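The response-based supervision described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the exact forms of the soft-max normalization and of the combined loss are not recoverable from this text, so the sketch assumes that (a) the logits are multiplied by $\sigma$ before the soft-max, which is one reading consistent with a "softening" parameter in $(0, 1]$, and (b) the combined loss has the form $\mathcal{L}_{CE} + \alpha \mathcal{L}_{KL}$. The function names and the default value of `alpha` are hypothetical.

```python
import math

def softened_softmax(logits, sigma):
    """Soft-max over one instance's logits.
    ASSUMPTION: logits are scaled by sigma (0 < sigma <= 1) to soften the output."""
    scaled = [sigma * g for g in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(r_t, r_s, eps=1e-12):
    """Batch-averaged KL divergence between teacher and student responses (b x c lists)."""
    b = len(r_t)
    return sum(t * math.log((t + eps) / (s + eps))
               for row_t, row_s in zip(r_t, r_s)
               for t, s in zip(row_t, row_s)) / b

def cross_entropy(y, r_s, eps=1e-12):
    """Batch-averaged cross-entropy between one-hot labels and student responses."""
    b = len(y)
    return -sum(yi * math.log(si + eps)
                for row_y, row_s in zip(y, r_s)
                for yi, si in zip(row_y, row_s)) / b

def response_loss(student_logits, teacher_logits, y, sigma=0.25, alpha=1.0):
    """ASSUMED combined form: L_r = L_CE + alpha * L_KL (alpha value is hypothetical)."""
    r_s = [softened_softmax(g, sigma) for g in student_logits]
    r_t = [softened_softmax(g, sigma) for g in teacher_logits]
    return cross_entropy(y, r_s) + alpha * kl_divergence(r_t, r_s)

# Usage: a batch of two instances and two classes
y = [[1, 0], [0, 1]]
student = [[2.0, 0.0], [0.0, 3.0]]
teacher = [[1.5, 0.5], [0.2, 2.0]]
loss = response_loss(student, teacher, y)
```

When the teacher and student logits coincide, the KL term vanishes and only the cross-entropy term remains, which is a quick sanity check for any implementation of this loss.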

2) Teacher Intermediate Representations-Based Knowledge Distillation: The intermediate representation supervision is obtained by minimizing a distance measure between the feature maps of the teacher and the student at intermediate layers. Let $s_p(i) \in \mathbb{R}^{c_p \times h_p \times w_p}$ and $t_q(i) \in \mathbb{R}^{c_q \times h_q \times w_q}$ be the feature maps of the $p$-th student layer and $q$-th teacher layer for the $i$-th tissue image, where $c$, $h$, and $w$ represent the number of channels, the height, and the width of the respective feature maps. The feature-based knowledge distillation is obtained by minimizing the following loss function [14], [17]:

$$\mathcal{L}_f(p, q, i) = \mathrm{dist}\big(f_s(s_p(i)),\, f_t(t_q(i))\big), \tag{4}$$

where $\mathrm{dist}(\cdot, \cdot)$ is the chosen distance measure, and $f_s(\cdot)$ and $f_t(\cdot)$ are the transformation functions that match the dimensions of the teacher and student feature maps using pooling operations and a Multi-Layer Perceptron (MLP). Since a single student layer gets supervision from multiple teacher layers, the overall feature-based distillation loss is given by:

$$\mathcal{L}_F = \sum_{i=1}^{b} \sum_{p=1}^{p_s} \sum_{q=1}^{q_t} n_{p,q}(i)\, \mathcal{L}_f(p, q, i), \tag{5}$$

where $p_s$ and $q_t$ are the total numbers of layers in the student and teacher networks. The parameter $n_{p,q}(i)$ is an attention weight for each instance in a batch and is learned in an end-to-end manner as discussed in the following section. The overall objective function of the proposed algorithm is given by:

$$\mathcal{L} = \mathcal{L}_r + \beta\, \mathcal{L}_F, \tag{6}$$

where $\beta$ is a hyper-parameter to be learned on the training dataset.
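To make the attention-weighted feature loss (5) and the overall objective (6) concrete, the following sketch computes both for features that have already been transformed to a common shape and flattened. It rests on assumptions that are not confirmed by the text: the distance measure is taken to be the mean squared error, the loss is averaged over the batch, and all function names are hypothetical.

```python
def mse(u, v):
    """ASSUMED distance measure: mean squared error between equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def feature_distillation_loss(student_feats, teacher_feats, attn):
    """Attention-weighted feature loss over every student/teacher layer pair.

    student_feats[i][p]: transformed, flattened features of student layer p, instance i
    teacher_feats[i][q]: transformed, flattened features of teacher layer q, instance i
    attn[i][p][q]:       normalized attention weight n_{p,q}(i)
    ASSUMPTION: the result is averaged over the batch.
    """
    total = 0.0
    for s_layers, t_layers, n in zip(student_feats, teacher_feats, attn):
        for p, s_p in enumerate(s_layers):
            for q, t_q in enumerate(t_layers):
                total += n[p][q] * mse(s_p, t_q)
    return total / len(student_feats)

def kdtp_objective(l_r, l_f, beta=2.5e-3):
    """Overall objective: response-based loss plus beta times the feature loss."""
    return l_r + beta * l_f

# Usage: one instance, one student layer supervised by two teacher layers
student_feats = [[[1.0, 1.0]]]
teacher_feats = [[[1.0, 1.0], [3.0, 3.0]]]
attn = [[[0.5, 0.5]]]
l_f = feature_distillation_loss(student_feats, teacher_feats, attn)  # 0.5*0 + 0.5*4
total = kdtp_objective(l_r=1.0, l_f=l_f)
```

Because the attention weights for a fixed student layer sum to one, the weighted sum stays on the same scale as a single layer-pair distance regardless of how many teacher layers supervise that student layer.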

A. Learning Transformation Functions
In order to compute the similarity between teacher and student intermediate representations as shown by (4), the spatial dimensions of the two layers should be transformed to a common subspace using f s (·) and f t (·). These transformations are obtained by first applying a pooling operation on the larger height and width of any of the two layers to match it with the smaller dimension. Then, we use an MLP to reduce the larger number of channels to the smaller ones in the two layers. This MLP consists of three sequential layers each comprising 3 × 3 convolutional filters. The number of filters in each layer is selected such that the number of target channels is obtained at the output. The weights of these MLPs are learned in an end-to-end fashion while training the overall network. In order to reduce the number of these MLPs for the deeper teacher-student combinations, we employ them at the block level instead of the layer level.
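The two-step shape matching described above (pool the spatially larger map, then reduce the larger channel count) can be illustrated as follows. This is a deliberately simplified sketch: a single per-pixel linear map with a fixed weight matrix stands in for the learnable three-layer MLP, average pooling is assumed as the pooling operation, and the spatial sizes are assumed to divide evenly. All names are hypothetical.

```python
def shape(fmap):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

def avg_pool(fmap, out_h, out_w):
    """Average-pool a C x H x W map to C x out_h x out_w.
    ASSUMPTION: H and W are exact multiples of out_h and out_w."""
    c, h, w = shape(fmap)
    kh, kw = h // out_h, w // out_w
    return [[[sum(fmap[ch][i * kh + di][j * kw + dj]
                  for di in range(kh) for dj in range(kw)) / (kh * kw)
              for j in range(out_w)]
             for i in range(out_h)]
            for ch in range(c)]

def project_channels(fmap, weight):
    """Per-pixel linear channel mixing (weight is C_out x C_in).
    Stand-in for the learnable MLP described in the text."""
    c, h, w = shape(fmap)
    return [[[sum(row[ch] * fmap[ch][i][j] for ch in range(c))
              for j in range(w)]
             for i in range(h)]
            for row in weight]

def match_feature_maps(s, t, weight):
    """Bring a student map s and a teacher map t to a common shape."""
    (cs, hs, ws), (ct, ht, wt) = shape(s), shape(t)
    h, w = min(hs, ht), min(ws, wt)
    s, t = avg_pool(s, h, w), avg_pool(t, h, w)  # pooling is a no-op for the smaller map
    if cs > ct:
        s = project_channels(s, weight)
    elif ct > cs:
        t = project_channels(t, weight)
    return s, t

# Usage: a 1 x 4 x 4 student map and a 2 x 2 x 2 teacher map
s = [[[float(i * 4 + j) for j in range(4)] for i in range(4)]]
t = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]
s_m, t_m = match_feature_maps(s, t, weight=[[0.5, 0.5]])
```

In a trained system the channel-reduction weights would be learned jointly with the student network; the fixed matrix here only shows where that learnable component sits in the pipeline.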

B. Attention Mechanism
Layer semantics in a deep neural network vary with depth. Earlier layers provide local semantics while the later layers provide global context. For effective feature map-based knowledge distillation, an attention mechanism is required to guide the supervision process. It is required to identify the effectiveness of a particular teacher layer as the supervisor of a specific student layer. For this purpose, we compute the similarity between the feature maps for each teacher layer and the student layer within a particular batch. More specifically, the feature map at the $p$-th student layer $s_p \in \mathbb{R}^{c_p \times h_p \times w_p}$ is vectorized as $s_p \in \mathbb{R}^{c_p h_p w_p \times 1}$. For the $i$-th instance of the tissue image within the batch $b$, the similarity map is given by:

$$\rho_s(i, p) = S_p^{\top} s_p(i),$$

where $S_p \in \mathbb{R}^{c_p h_p w_p \times b}$ is the matrix of feature maps at layer $p$ for a full batch and $\rho_s(i, p) \in \mathbb{R}^{b}$ is the similarity of the $i$-th instance $s_p(i)$ with all other vectors in the batch $b$. Similarly, for the teacher network, the similarity of the same image $i$ within the same batch $b$ for the $q$-th layer is given by:

$$\rho_t(i, q) = T_q^{\top} t_q(i),$$

where $T_q \in \mathbb{R}^{c_q h_q w_q \times b}$ is the matrix of feature maps at layer $q$ for a full batch and $\rho_t(i, q) \in \mathbb{R}^{b}$ is the similarity of the $i$-th instance $t_q(i)$ with all other vectors in the batch. The batch similarity vectors are then transformed using two different fully connected networks, $\theta_s(i, p) = FC_s(\rho_s(i, p))$ and $\theta_t(i, q) = FC_t(\rho_t(i, q))$, where $\theta_s(i, p) \in \mathbb{R}^{z}$ and $\theta_t(i, q) \in \mathbb{R}^{z}$ are transformed similarity vectors having dimension $z < b$, the batch size. Each of these fully connected networks shares its parameters across all batches and all images. These fully connected networks are learned in an end-to-end fashion to minimize the student network loss. Motivated by self-attention mechanisms in [48], [51], the attention weights are computed by using the exponential of the similarity function as:

$$a_{p,q}(i) = \exp\big(\theta_s(i, p)^{\top} \theta_t(i, q)\big).$$

The sum of attention weights for a particular student layer across all teacher layers is given by:

$$A_p(i) = \sum_{q=1}^{q_t} a_{p,q}(i),$$

where $q_t$ is the number of teacher layers. The normalized attention weights are then given by:

$$n_{p,q}(i) = \frac{a_{p,q}(i)}{A_p(i)}.$$

In this formulation, the normalized attention weights for a particular instance and a fixed student layer sum to one across all teacher layers. This ensures that the feature magnitude is not amplified by the use of attention weights. The attention weight $n_{p,q}(i)$ is then used in (5) for the computation of the intermediate representation loss.
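The attention computation described in this subsection can be sketched as three small steps: a batch similarity vector per layer, a fully connected transformation, and a normalization across teacher layers. The sketch below assumes inner-product similarity throughout and uses a single linear layer with fixed weights as a stand-in for the learned networks $FC_s$ and $FC_t$; these choices and all names are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_similarity(feats, i):
    """Similarity of instance i with every instance in the batch.
    feats: list of b flattened feature vectors from one layer.
    ASSUMPTION: similarity is the plain inner product."""
    return [dot(feats[i], f) for f in feats]

def linear(weight, x):
    """Single linear layer (z x b weight); stand-in for the learned FC_s / FC_t."""
    return [dot(row, x) for row in weight]

def attention_weights(theta_s, theta_t):
    """Normalized weights n[p][q] for one instance.
    theta_s[p], theta_t[q]: transformed similarity vectors of length z."""
    n = []
    for ts in theta_s:
        scores = [dot(ts, tt) for tt in theta_t]
        m = max(scores)  # shifting by the max leaves the normalized result unchanged
        e = [math.exp(v - m) for v in scores]
        z = sum(e)
        n.append([v / z for v in e])
    return n

# Usage: a batch of three instances, one student layer, two teacher layers
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rho = batch_similarity(feats, 2)                        # [1.0, 1.0, 2.0]
theta = linear([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]], rho)  # length z = 2 < b = 3
n = attention_weights([[1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```

Each row of `n` sums to one, which mirrors the property stated in the text that the weights for a fixed student layer and instance are normalized across all teacher layers.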

IV. EXPERIMENTAL EVALUATIONS
In this section, we evaluate the proposed KDTP algorithm on five publicly available benchmark datasets including Invasive Ductal Carcinoma (IDC) classification in breast cancer histology images [8], Colon cancer classification into high, low, and normal grades [41], tissue phenotyping using CRC-TP dataset [19], Kather's Colon Cancer dataset [22], and classification of microsatellite stability/instability in gastrointestinal cancer [23]. The results are compared with several baseline individual teacher and student networks as discussed in the following sub-sections.

A. Teacher-Student Architectures
We employ a number of teacher-student combinations based on well-known deep networks including VGG [43], ResNet [16], MobileNet [18], and ShuffleNet [33] for evaluation. For rigorous evaluation of the proposed algorithm, the shallow and deeper versions of these networks are employed in our experiments. For the case of the student network, we employed VGG-8, VGG-13, VGG-19, ShuffleNetV1, ShuffleNetV2, and MobileNetV2. For the case of the teacher network, we used ResNet-8, ResNet-34, ResNet-50, VGG-19, and ShuffleNetV2.
All networks are pre-trained on the ImageNet dataset. We first fine-tuned teacher models on the aforementioned histology image datasets for the tissue classification task. Also, the last layer of each student model is fine-tuned to a particular histology image dataset while the rest of the network weights are kept frozen. These networks are then used for the evaluation of the proposed KDTP algorithm. We set a momentum of 0.9 in all our experiments for network training using stochastic gradient descent. We also employed data augmentation techniques including horizontally and vertically flipped images, rotation using five different angles, and image blurring. We set the initial learning rate to 0.01 and the batch size to 64 in all architectures. We fine-tuned all the student and teacher models for 240 epochs.

B. Training Details of KDTP
To minimize our proposed KDTP loss function (6), we set the hyper-parameter β to 2.5 × 10^−3 and the softening parameter σ in (3) to 0.25. The transformation functions used in (5) consist of a stack of three layers with 1 × 1, 3 × 3, and 1 × 1 convolutions to match the dimensions of teacher and student feature maps. The transformation functions are learned in an end-to-end manner. The fully connected layers of the attention mechanism FC_s and FC_t are also learned in an end-to-end manner.

C. Datasets
1) CRC-TP Dataset [19]: The CRC-TP dataset, proposed by Javed et al., consists of 280 k patches belonging to seven distinct tissue classes including tumor, stroma, complex stroma, smooth muscle, necrotic, normal benign, and lymphocytes. The dataset is generated using 20 H&E stained WSIs of 20 distinct CRC patients. Each patch in this dataset consists of 150 × 150 pixels extracted at 20× magnification level. We employed the same training and testing splits of the seven tissue phenotypes provided by the respective authors.
2) Breast Cancer Dataset [8]: This dataset is proposed by Cruz-Roa et al. and is used to classify positive and negative patches of IDC. It consists of 277,524 patches extracted at 40× resolution level from 162 WSIs. The size of each patch is 50 × 50 pixels. Out of those, 198,738 patches belong to negative IDC and 78,786 patches belong to positive IDC. Our algorithm is evaluated on this dataset for the binary classification problem using 70% of the patches for training and 30% for testing.
3) Kather's Colon Cancer Dataset (KCCD) [22]: It contains nine different tissue classes: Muscle, Normal colon mucosa (Norm), Tumor colorectal adenocarcinoma epithelium (Tumor), Background, Adipose, Mucus, Lymphocytes (Lympho), Debris, and Complex stroma, distributed over 100 K training samples and 7.18 K testing samples. Each sample has a resolution of 224 × 224 pixels and is extracted at 20× magnification level.
4) Gastrointestinal Cancer Classification Dataset [23]: This dataset contains 218,578 unique tissue patches derived from histological images of gastric cancer patients in the TCGA cohort [46]. All images are derived from formalin-fixed paraffin-embedded (FFPE) diagnostic slides. This dataset is used for the binary classification of Micro-satellite Instability (MSI) and Stability (MSS). The training and testing images of each class are provided by the original author. The training and testing splits of MSI consist of 50,285 and 27,904 unique tissue images. The MSS training and testing splits contain 50,285 and 90,104 samples.
5) Colorectal Cancer Grading Dataset (CCGD) [41]: The extended CRC dataset consists of 300 visual fields with an average size of 5000 × 7300 pixels [41]. This dataset is used for three-class classification of tissue images into normal, low-grade, and high-grade cancer. Similar to [41], we have also performed a three-fold cross-validation experiment. For each class in each fold, we extracted 25,000 patches each of size 224 × 224 from the visual fields. Each fold is once used for training, testing, and validation.

D. Performance Measures
The tissue image classification performance is evaluated using the weighted average F score. For a particular class $z$, we compute the $F_z$ score as:

$$F_z = \frac{2\, TP_z}{2\, TP_z + FP_z + FN_z},$$

where $TP_z$ denotes the True Positives, which are the number of tissue images belonging to class $z$ and also predicted as class $z$; $FN_z$ denotes the False Negatives, which are the number of tissue images belonging to class $z$ but predicted as some other class; and $FP_z$ denotes the False Positives, which are the number of tissue images not belonging to class $z$ but predicted as class $z$. The aim is to maximize the $F_z$ measure so that its value is close to one. The weighted F measure is computed as a weighted average of $F_z$ over all classes as given below:

$$F = \sum_{z=1}^{c} p_z F_z,$$

where $p_z = n_z / n$ is the probability of the $z$-th class, $n_z$ is the number of samples in that class, and $n$ is the total number of tissue samples.
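As a worked illustration of this measure, the sketch below computes the per-class F score and its class-frequency-weighted average from two label lists. It assumes that $F_z$ is the standard F1 score (the harmonic mean of precision and recall); the function name is hypothetical.

```python
def weighted_f_score(y_true, y_pred):
    """Weighted average of per-class F scores, weighted by class frequency n_z / n.
    ASSUMPTION: F_z is the standard F1 score, 2TP / (2TP + FP + FN)."""
    n = len(y_true)
    total = 0.0
    for z in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == z and p == z)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == z and p != z)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != z and p == z)
        denom = 2 * tp + fp + fn
        f_z = 2 * tp / denom if denom else 0.0
        n_z = tp + fn  # number of samples whose true class is z
        total += (n_z / n) * f_z
    return total

# Usage: four tissue patches, two classes
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = weighted_f_score(y_true, y_pred)
```

In this example class 0 has $F_0 = 4/5$ with weight $3/4$ and class 1 has $F_1 = 2/3$ with weight $1/4$, so the weighted score is $0.6 + 1/6 \approx 0.767$.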

E. Variants of the Proposed Algorithm
In addition to the proposed KDTP algorithm, we have also evaluated the performance of two other variants. The first variant is $\mathrm{KDTP}_r$, which minimizes $\mathcal{L}_r$ given by (3), i.e., the cross-entropy loss and the KL divergence between the logits.
The second variant is $\mathrm{KDTP}_{1-1}$, in which one layer of the student is projected to only one layer of the teacher model as in most of the existing methods. It minimizes the cross-entropy loss, the KL divergence as in (3), and the feature matching loss between only the corresponding layers as in (4). In the $\mathrm{KDTP}_{1-1}$ variant, the later layers of the wider teacher model are not used, because the number of layers in the student models is less than that of the teacher models.
In the proposed KDTP algorithm, (6) is minimized which includes cross-entropy loss, KL-divergence, and feature matching loss with attention as given in (5). Table I shows the performance comparison of the proposed algorithm with other baseline methods. In all our experiments, we observe that the KDT P r variant is consistently better than the corresponding student model. The proposed KDTP algorithm has even performed better than KDT P r . In most cases, the KDTP is even more accurate than the teacher model. For the case of ResNet-18/ResNet-50, the proposed KDTP has obtained 86.10% weighted average F score which is 4.30% better than the corresponding teacher model. Table II shows the comparative results of the proposed algorithm with other KD-based methods in terms of weighted average F score. In all experiments, the proposed KDTP algorithm has remained more accurate than all variants including KDT P r , KDT P 1−1 , and the student model. In some of the cases such as ResNet-18 as student and ResNet-50 as teacher network, the proposed KDTP has obtained 83.10% weighted average F score which is even higher than the only teacher model. A similar trend has been observed when student networks were VGG-8 and VGG-13 and the teacher network was ResNet-50. Compared to the teacher network, the maximum performance gained is 4.33% for the case of MobileNetV2 as a student and ShuffleNetV2 as a teacher. This demonstrates the effectiveness of our algorithm for histology image classification tasks. Table III shows the performance comparison of our proposed algorithm in terms of weighted average F score. The proposed algorithm variant KDT P r is more accurate than the only student model in all experiments. The second variant KDT P 1−1 which involves both feature-based and response-based knowledge distillation further improves the tissue image classification performance even beyond the teacher model. The final proposed KDTP algorithm has remained the most accurate among all variants. 
This is because of the multi-layer supervision obtained from the teacher model. The maximum performance gain of the KDTP algorithm over the student model is 13.12%, for MobileNetV2 as the student model and ResNet-18 as the teacher model. The performance of KDTP compared to the teacher

I. Evaluation on Gastrointestinal Cancer Classification Dataset
Table IV compares the proposed algorithm with other KD-based methods in terms of weighted average F score. The proposed KDTP algorithm is consistently the best performer among all variants. As on the other datasets, the teacher-student combination of ResNet-50 and ResNet-18 obtains the best performance, 80.22%, which is 4.62% higher than that of the teacher network. The same trend is observed for other teacher-student combinations such as VGG-13 and VGG-8, and ShuffleNetV2 and MobileNetV2.
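The weighted average F score used throughout these comparisons can be illustrated with a minimal sketch. It assumes the common support-weighted definition, in which each class's F1 score is weighted by its share of the true labels; the function name `weighted_f_score` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

def weighted_f_score(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (assumed definition)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator >= 1 because n_cls >= 1
        score += (n_cls / total) * f1
    return score

# Toy labels for three hypothetical tissue classes.
y_true = ["tumor", "tumor", "stroma", "stroma", "stroma", "debris"]
y_pred = ["tumor", "stroma", "stroma", "stroma", "debris", "debris"]

# Every class has F1 = 2/3 in this toy example, so the weighted score is 2/3.
print(round(weighted_f_score(y_true, y_pred), 4))
```

Weighting by class support keeps frequent tissue types from being under-represented in the final score, which matters when the classes are imbalanced.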

J. Evaluation on Colorectal Cancer Grading Dataset
Table V compares the proposed algorithm with existing KD-based methods and with the teacher and student models in terms of weighted average F score. As on the aforementioned datasets, the proposed KDTP algorithm maintains its superiority over the other variants. The best-performing teacher-student pair is again ResNet-50 and ResNet-18, which obtains an F score of 94.44%, higher than all other teacher-student combinations. It is also 7.24% higher than the teacher model alone, which obtained 87.20%. This shows the effectiveness of the proposed KDTP algorithm in enabling a smaller network, ResNet-18, to outperform a deeper network, ResNet-50.

K. Comparison With SOTA Methods
We have also compared the proposed KDTP algorithm with existing State-of-the-Art (SOTA) methods, including the knowledge distillation methods proposed by Hinton et al. [17], Zagoruyko et al. [55], and Chen et al. [6]. For a fair comparison, we evaluated these methods using ResNet-18 as the student model and ResNet-50 as the teacher model. The source codes released by the original authors are used for our implementation. All methods are trained on the CRC-TP, Kather's colon cancer, and colorectal cancer grading datasets in the same way as our proposed algorithm. The results of the trained student models are compared in Table VI. The proposed algorithm consistently outperforms the compared methods on all three datasets for the tissue image classification task.
TABLE VI: Performance comparison of the proposed KDTP algorithm with SOTA methods on three different datasets for tissue classification. Results are reported using the weighted average F score on the seven, nine, and three distinct classes of CRC-TP, KCCD, and CCGD, respectively. The best two performances are shown in red and blue, respectively.

L. Computational Time Analysis
During testing, only the student model is employed for all teacher-student combinations. Therefore, the computational time depends only on the size of the student model. For ResNet-18 as the student model, an average time of 1.31 seconds is observed for an image patch of 224 × 224 pixels. This demonstrates that the proposed KDTP algorithm provides significant performance gains at a low computational cost.
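An average per-patch time of this kind could be measured along the following lines. This is only a sketch of one possible protocol: the helper `mean_inference_time`, the warm-up step, and the `dummy_student` stand-in are all assumptions for illustration, and the paper does not describe how its timing was obtained.

```python
import time

def mean_inference_time(predict, patches, warmup=1):
    """Average wall-clock seconds per patch for a callable `predict`."""
    for patch in patches[:warmup]:
        predict(patch)  # warm-up calls are run but not timed
    start = time.perf_counter()
    for patch in patches:
        predict(patch)
    elapsed = time.perf_counter() - start
    return elapsed / len(patches)

# Hypothetical stand-in for a student network: averages the pixel values.
def dummy_student(patch):
    return sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))

# Four synthetic single-channel patches of 224 x 224 pixels.
patches = [[[0.5] * 224 for _ in range(224)] for _ in range(4)]
seconds_per_patch = mean_inference_time(dummy_student, patches)
print(f"{seconds_per_patch:.6f} s per patch")
```

Because only `predict` is timed, the teacher network never enters the measurement, which matches the observation that test-time cost depends on the student alone.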

V. DISCUSSION AND CONCLUSION
In this paper, we proposed the KDTP algorithm for improving the performance of shallow networks on the task of tissue phenotyping, a fundamental clinical pathology task for analyzing the tumor micro-environment for better cancer grading and survival analysis. Automatic tissue phenotyping has been well investigated using deep neural networks. However, the practical deployment of these networks faces challenges such as the need for excessive memory and computational resources, which may not be available in clinical settings. On the other hand, computationally less expensive neural networks have shown degraded performance for tissue phenotyping. To enable these low-complexity networks to perform well in clinical applications, we propose the use of knowledge distillation, which has not been well explored in computational pathology. In this technique, supervision from deeper networks is utilized for better training of shallower networks. For this purpose, we have proposed the KDTP algorithm, which is applied to many teacher-student combinations in which the teacher is a deeper neural network and the student is a shallower one. The KDTP algorithm is evaluated on five different histology image classification datasets: CRC-TP, breast cancer, Kather's colon cancer, gastrointestinal cancer, and colorectal cancer grading.
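The kind of objective summarized above, with cross-entropy, KL-divergence, and attention-weighted feature matching terms, can be sketched in plain Python as follows. This is not the paper's (5) or (6), which are not reproduced in this section: the temperature, the weights `alpha` and `beta`, the softmax normalization of the attention parameters, and the mean-squared feature distance are all assumptions, and the learnable multi-layer perceptron that matches feature sizes is omitted by assuming the features already share one size.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    return -math.log(softmax(student_logits)[label])

def kl_divergence(teacher_logits, student_logits, temperature):
    p = softmax(teacher_logits, temperature)  # softened teacher distribution
    q = softmax(student_logits, temperature)  # softened student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def feature_loss(student_feat, teacher_feats, attn_params):
    # One student layer is supervised by several teacher layers.
    # Attention parameters are normalized with a softmax (assumption).
    weights = softmax(attn_params)
    dists = [
        sum((s - t) ** 2 for s, t in zip(student_feat, feat)) / len(student_feat)
        for feat in teacher_feats
    ]
    return sum(w * d for w, d in zip(weights, dists))

def kd_objective(student_logits, teacher_logits, label, student_feat,
                 teacher_feats, attn_params, temperature=4.0,
                 alpha=1.0, beta=1.0):
    return (cross_entropy(student_logits, label)
            + alpha * kl_divergence(teacher_logits, student_logits, temperature)
            + beta * feature_loss(student_feat, teacher_feats, attn_params))

# Toy values: three classes, one student layer, two teacher layers.
loss = kd_objective(
    student_logits=[1.0, 0.2, -0.5], teacher_logits=[2.0, 0.1, -1.0], label=0,
    student_feat=[0.1, 0.4, 0.3], teacher_feats=[[0.2, 0.5, 0.1], [0.0, 0.4, 0.6]],
    attn_params=[0.0, 0.0],
)
print(loss)
```

In a real training loop the attention parameters and the projection layers would be optimized jointly with the student weights; here they are fixed numbers so that the structure of the sum stays visible.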
The trained shallow networks have performed significantly better than their directly supervised versions as well as their teachers. For example, MobileNetV2 is trained under the supervision of ResNet-50 as the teacher model. As a result, we observed significant performance improvements in MobileNetV2 on all datasets. On the CRC-TP dataset, it originally obtained a weighted average F score of 70.16%. Once we retrained this network using the proposed KDTP algorithm, its performance increased to 83.33% on the same dataset. The shallow network thus scores even higher than the teacher network, which obtained 81.80%.
The same teacher-student combination, when applied to the breast cancer dataset, improved performance from 68.55% to 79.11%. Similarly, on Kather's colon cancer dataset, the performance of MobileNetV2 improved from 76.10% to 92.11%. On the gastrointestinal cancer dataset, the performance of MobileNetV2 increased from 65.30% to 74.44%, and on the colorectal cancer grading dataset it increased from 70.33% to 78.88%. These performance improvements are obtained without any additional computational cost at test time. Thus, our experiments demonstrate the significance of knowledge distillation algorithms in the field of computational pathology.
We also evaluated another teacher-student combination, ResNet-50 and ResNet-18, on all five datasets. On the CRC-TP dataset, the performance of ResNet-18 improved from 80.11% to 86.10%. On the breast cancer dataset, its performance improved from 77.22% to 82.10%. On Kather's colon cancer dataset, its performance improved from 87.22% to 95.51%. On the gastrointestinal cancer dataset, its performance improved from 73.33% to 80.22%. Similarly, on the colorectal cancer grading dataset, its performance improved from 84.55% to 94.44%. In all of these experiments, we observed that the shallower network, ResNet-18, outperformed its teacher, the deeper ResNet-50, by a significant margin. These experiments also demonstrate that shallower networks can outperform deeper networks if trained properly using our proposed KDTP algorithm. Note that shallower networks are easier to deploy in clinical settings due to their reduced resource requirements. Such a scheme can be potentially beneficial for the deployment of deep neural networks on resource-constrained hardware due to the reduction in computational and memory requirements.
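The improvements quoted in this discussion can be tabulated with a few lines of arithmetic. The sketch below only restates the weighted average F scores given above for the two student networks (both with ResNet-50 as the teacher) and computes the absolute gain in percentage points; the dictionary layout and the helper `absolute_gains` are illustrative choices, not part of the paper.

```python
# Weighted average F scores (%) quoted in the discussion:
# (student trained alone, student trained with KDTP).
reported = {
    "ResNet-18": {
        "CRC-TP": (80.11, 86.10),
        "Breast cancer": (77.22, 82.10),
        "Kather's colon cancer": (87.22, 95.51),
        "Gastrointestinal cancer": (73.33, 80.22),
        "Colorectal cancer grading": (84.55, 94.44),
    },
    "MobileNetV2": {
        "CRC-TP": (70.16, 83.33),
        "Breast cancer": (68.55, 79.11),
        "Kather's colon cancer": (76.10, 92.11),
        "Gastrointestinal cancer": (65.30, 74.44),
        "Colorectal cancer grading": (70.33, 78.88),
    },
}

def absolute_gains(scores):
    """Gain in percentage points for each dataset, rounded to two decimals."""
    return {name: round(after - before, 2)
            for name, (before, after) in scores.items()}

gains = {student: absolute_gains(scores) for student, scores in reported.items()}

for student, per_dataset in gains.items():
    for dataset, gain in per_dataset.items():
        print(f"{student:12s} {dataset:26s} +{gain:.2f}")
```

For both students the gain is positive on every dataset, ranging from 4.88 points (ResNet-18 on the breast cancer dataset) to 16.01 points (MobileNetV2 on Kather's colon cancer dataset).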
In conclusion, we have presented a knowledge distillation algorithm that can perform the histology image classification task in an automated and robust manner. The ability to automatically classify tissue images of various types has a direct bearing on downstream analysis in pathology. It holds great potential not only for expediting the diagnostic process in clinics but also for extending our understanding of tissue and cellular characteristics, leading to improved patient care and management. In the future, this technique may potentially be used for the discovery of low-cost cancer biomarkers.