Utilizing Knowledge Distillation in Deep Learning for Classification of Chest X-Ray Abnormalities

Automatic screening and diagnosis of lung abnormalities from chest X-ray images has recently been drawing attention from the computer vision and medical imaging communities. Previous studies of deep neural networks have predominantly demonstrated the effectiveness of binary lung disease classification. However, an individual radiologist must interpret and report on large numbers of medical images daily, each of which can be labeled with a variety of existing or suspected pathologies; this makes it challenging to maintain consistently high diagnostic accuracy. In this paper, we present a competitive study of knowledge distillation (KD) in deep learning for classification of abnormalities in chest X-ray images. This method aims either to distill knowledge from cumbersome teacher models into lightweight student models or to self-train these student models, to generate weakly supervised multi-label lung disease classifications. Our approach was based on multi-task deep learning architectures that, in addition to multi-class classification, supported saliency map visualizations of the pathological regions where an abnormality was located. A self-training KD framework, in which the model learned from itself, was shown to outperform both the well-established baseline training procedure and the normal KD, achieving AUC improvements of up to 6.39% and 3.89%, respectively. Through application to the publicly available ChestX-ray14 dataset, we demonstrated that our approach efficiently overcame the interdependency of 14 weakly annotated thorax diseases and achieved state-of-the-art classification compared with the current deep learning baselines.


I. INTRODUCTION
Lung diseases, which can escalate from simple thoracic ailments into cancers, are one of the leading causes of death worldwide. Chest radiography is the most common medical imaging technique used to diagnose them, owing to its efficiency in the identification and detection of cardiothoracic, pulmonary, and interstitial diseases; it currently occupies a significant role in lung disease treatment practices [1]. Accurate analysis of the large quantities of patient health information represents a major challenge for radiologists because the timely reporting of potential findings is necessary for effective treatment. The overlapping of tissue structures in X-ray images and the low contrast resolution at which lesions must be distinguished from surrounding tissues greatly increase the complexity of interpretation. This results in a certain number of missed detections and diagnoses. The wide applicability and interpretational difficulties of chest X-ray images have led to the introduction of computer-aided detection (CAD) systems into medical imaging practices. CAD systems are predominantly divided into four steps: preprocessing, Region of Interest (ROI) segmentation, ROI feature extraction, and disease identification; however, there is a vital need for them not only to process large numbers of medical images automatically, but also to enhance the certainty of accurate disease prediction.

The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The extensive and successful developments of artificial intelligence (AI), along with the accumulation of large numbers of medical images, have opened up the promising possibility of building CAD systems that integrate AI techniques. In particular, deep learning-based methods have achieved remarkable performances in various image-recognition tasks, including image classification [2]-[5] and semantic segmentation [6]-[9]; these methods have been proposed for re-application to anatomical and pathological medical imaging domains.
Advanced deep learning, in combination with the construction of large medical databases, has recently enabled these algorithms to surpass the performances of conventional medical techniques, particularly in tasks such as pulmonary nodule detection [10], detection of lymph node metastases in breast cancer [11], cerebral micro-bleeding detection [12], [13], skin cancer classification [14], [15], pneumonia diagnoses from radiographs [16], diabetic retinopathy detection [17], cardiologist-level arrhythmia classification [18], and cerebral micro-bleeding identification [19].
Whilst the general trend for deep learning models such as convolutional neural networks (CNNs), which learn features in an end-to-end manner with respect to millions of parameters, is towards deeper, wider, and more complex architectures, the expensive computation costs limit the capability of a deep learning solution in real-world applications. In transfer learning, for instance, the weights are transferred from a pre-trained model to a new network, which requires the architecture of the new network to match that of the pre-trained one. This means that the new network must be as sophisticated as the old one. Hence, it is arduous to deploy a cumbersome model in many applications. For instance, self-driving vehicles and mobile robots have limited memory and power resources. Even when these are in abundant supply, for example when a data system is hosted in a network cloud, effective deep networks serving clients at a lower cost are still necessary. In addition, data privileges or privacy issues can restrict access to the source data domain in real transfer learning problems. It is therefore essential to transfer the knowledge of a network trained on the source data while accessing only the training data of the target domain.
Therefore, a recent study [20] proposed a knowledge distillation (KD) procedure to capture and transfer the knowledge of a trained teacher model to a student model. Typically, a teacher network exhibits a greater learning capacity and higher performance and can be used to teach a lower-capacity student network by providing soft-targets. Dark knowledge describing the similarity of privileged information from different classes can be transferred from these soft-targets, to enhance the performance of the student model. This process guides the training of a student network, and further uses an additional distillation loss to encourage the student model to mimic some aspects of the teacher model. Originally motivated by resource-efficient neural network compression tasks [20], KD procedures have found a variety of applications in such areas as adversarial defense [21], privileged learning [22], and learning with noisy data [23]. To extend this idea of mimicking the softened class scores provided by the teacher model, Fit-Nets [24] added hints to guide the intermediate layers' training. Liu et al. [25] introduced a supervisory signal for KD in the form of spatial attention, by computing the sum of squared activations along the channel dimension; this intuitively encouraged the student model to produce similar normalized spatial attention maps to the teacher model. As expected, recent works have expanded the scope of KD, for example by using semi-supervised adaptive distillation for a learning-efficient detector [54], knowledge adaptation for segmenting semantic regions [55], and a teacher assistant (an intermediary between teachers and students) for KD improvements [56]. With these new findings from the deep learning community, it is of great interest and importance to find ways of exploiting KD performances in medical imaging fields.
Meanwhile, the manual marking of the pathologically abnormal areas of X-ray images, performed by expert radiologists, typically requires more effort than simply labeling them. In other words, the bounding boxes for disease localization tasks are much more descriptive and informative than a single class label. As a consequence, chest X-ray datasets such as ChestX-ray8 and ChestX-ray14 [26] have recently been published. These provide comprehensive disease labels along with a small subset of abnormal region annotations, which are suitable for weakly supervised learning problems. Therefore, designing models for such small numbers of annotated masks is a crucial step toward clinical applications. Many attention-based mechanisms have recently been developed and have demonstrated the feasibility of the localization and recognition of multiple objects, in spite of using only simple class labels during training [27]. In addition, identifying regions containing unexpected and unique abnormalities within an image is of critical importance. Saliency mapping techniques [28], [29] identified such regions as being distinctive by using primitive signature features such as texture, shape, and color. Rendered as a heat-map in which hot regions correspond to a considerable impact on the model's final decision, saliency maps represent an important step toward understanding chest X-ray images and further improving models' classification performances.
For the first time in thorax multi-class classification (to the best of our knowledge), we address this problem by using the promising performances of KD approaches to support the automatic classification of 14 abnormalities appearing in chest X-rays, along with saliency map visualizations to ensure the accurate identification of abnormal regions. The main contributions of this paper are summarized as follows:
• We utilized a variety of saliency mapping techniques, including vanilla/guided back-propagation, smooth gradients, and SmoothGrad integrated gradients to better understand our deep learning model's decision-making process.
• We proposed different KD training approaches, including original basic training, standard KD (deeper teachers teach lower-cost students), reversed KD (lower-cost students teach deeper teachers), defective KD (teachers trained over the first 50 iterations teach lower-cost students), and self-training KD (models teach themselves); we then compared their respective classification performances.

The remainder of the paper is organized as follows. Section 2 describes the relevant recent works on chest X-ray lesion classification. In Section 3, we describe our proposed approaches for saliency map visualization and different KD training methods for thorax multi-classification. In Section 4, we introduce the ChestX-ray14 dataset and summarize our obtained results. The paper concludes in Sections 5 and 6 with a discussion and suggestions for future works, respectively.

II. RELATED WORKS
Aside from the various screening methods applied to detect suspected lung diseases, the promising results obtained from implementing deep learning techniques in chest X-ray image analysis tasks have recently attracted much attention [30], [31]. Several open-access datasets of chest X-ray images have allowed scientists to train, verify, fine-tune, and evaluate their new deep learning algorithms; these datasets have included chest X-rays with and without lung cancer nodules from the Japanese Society of Radiological Technology [32], frontal and lateral chest radiographs of disease annotations from the Indiana dataset [33], two databases (from Montgomery County and Shenzhen Hospital) to improve the CAD of pulmonary diseases [34], normal versus tuberculosis cases from the Royal Tropical Institute [35], and ChestX-ray14, the largest publicly available database at present, containing annotations of 14 different lung diseases [26]. The TUNA-Net framework was proposed in [36] for pneumonia recognition on two public chest X-ray datasets; this model adapted the labeled adult chest X-rays in the source domain such that they appeared as though they had been taken from pediatric X-rays in an unlabeled target domain. TUNA-Net achieved a 96.3% AUC (the area under the receiver operating characteristic (ROC) curve) value in binary pediatric pneumonia classification. Salehinejad et al. [37] employed deep convolutional generative adversarial networks (DCGAN) to generate artificial images from five common pathological classes, then applied it to chest X-rays. The authors reported that data augmentation using these synthesized images increased the diversity of the training data, substantially improving the generation performance and classification of unseen data.
Many deep learning models have been proposed to achieve outstanding classification results on the ChestX-ray14 dataset. Rajpurkar et al. [30] proposed CheXNet, a 121-layer convolutional neural network for pneumonia classification. A 14-disease classification task was also attempted and competitive results were obtained under their proposed method. They also compared the performances of four radiologists on a subset of 420 annotated images against the CheXNet model and found that CheXNet exceeded the average radiologist performance, as measured by the F1 metric. A unified weakly supervised multi-label image classification and localization framework was introduced by Wang et al. [26] to evaluate the ChestX-ray8 dataset. After implementing a variety of pre-trained deep models and excluding the fully connected and soft-max layers, a transition layer, global pooling layer, prediction layer, and loss layer were all inserted. This approach facilitated the identification of plausible spatial regions due to the combination of activations from the transition layer and weights from the prediction inner-product layer. Their initial quantitative classification and localization results were promising, although the procedure remained too computationally strenuous for full implementation as an automated high-precision CAD system.
More recently, a variety of deep learning-based techniques have sought to approach the ChestX-ray14 problem. ChestNet [38] contained two main branches: a classification branch, which served as a unified network with a pre-trained ResNet-152 model to manage the complexities of handling local handcrafted features; and an attention branch, which explored the correlations between different disease labels and allowed for the localization of abnormal regions. In its performance comparison, it was shown to outperform three state-of-the-art deep learning models employing official patient-wise splits without extra training data. TieNet [39] was introduced to first classify ChestX-ray14 images by extracting distinctive X-ray images and embedded texts from corresponding reports; it was later converted into a chest X-ray-reporting system in a simulation, to output disease classifications with a preliminary report. It achieved an average AUC of over 90%, which was an improvement of 6% compared to the baseline on an unseen and hand-labeled OpenI dataset. A multi-level attention model, implemented as an end-to-end trainable CNN-recurrent neural network (RNN) to highlight the meaningful regions, was also built in this study.
A fully convolutional recognition network [40] improved AUC scores in classifications of most diseases compared to the reference models, and achieved remarkable prediction scores for disease localization. Wang et al. [57] introduced Thorax-Net, which contained two branches for 14-label prediction and abnormality localization. The classification branch used a pre-trained ResNet-152 and the attention branch was equipped with several convolutional layers and the gradient-weighted class activation mapping (Grad-CAM) module. This procedure yielded AUC scores of 0.788 and 0.896 by using the patient-wise official split and image-wise random split, respectively. It obtained higher AUCs compared to other deep models trained with no external data. Ho and Gwak [41] proposed a pre-trained DenseNet-121 model to localize pathologically abnormal areas and a handcrafted, deep feature integration approach to classifying 14 disease classes. The authors demonstrated that their proposed methods could efficiently manage interdependencies between class annotations and achieved superior classifications to the then-current reference baseline on the ChestX-ray14 dataset.

From the existing reports on ChestX-ray14, the transferal of features extracted from pre-trained models is seen to be preferable. However, the trend in model compression, in which a larger pre-trained model is built to allow the smaller model to learn complex features whilst minimizing the computation and memory costs, has not yet been investigated for X-ray datasets. In particular, a large and complex network or an ensemble model is first trained and extracts important feature information from the given data, thereby producing targeted predictions. A small network is then trained with the help of this more cumbersome model. The small model is able to produce comparable results or replicate the cumbersome model's results.
Therefore, we propose different KD training strategies for 14-disease classification, as well as a variety of saliency mapping techniques for abnormal feature visualizations in X-ray images.

III. PROPOSED APPROACHES
We conducted extensive experiments to examine both the dominant features visualized by saliency maps and the common features of dark knowledge in KD.

A. SALIENCY MAPS
As the most common technique for interpreting deep neural networks (DNNs), saliency maps [42], [43] represent the gradient of the output class with respect to the input, based on a score function. They indicate how changes in the input image pixels correspond to changes in the output. Pixels with positive gradient values are those for which a small increase in intensity raises the output value. Thus, visualizing these gradients provides an intuitive measure of attention. In our design, given an input vector $x \in \mathbb{R}^d$ and a model described by the function $S : \mathbb{R}^d \to \mathbb{R}^{14}$, an explanation method yields an explanation map $E : \mathbb{R}^d \to \mathbb{R}^d$, which maps inputs to objects of the same shape. Each dimension of the map is then associated with the relevance or importance of the corresponding input dimension to the final output.

1) GRADIENT [44]
The gradient of the scalar logit for a specific class with respect to the input is expressed as

$$E_{grad}(x) = \frac{\partial S}{\partial x}. \tag{1}$$

2) GUIDED BACK-PROPAGATION (GBP) [45]
GBP modifies how the gradient is back-propagated through ReLU units. Using $f^l, f^{l-1}, \ldots, f^0$ as the feature maps derived during the forward pass of a DNN and $R^l, R^{l-1}, \ldots, R^0$ as the intermediate representations obtained during the backward pass (more concisely, $f^l = \mathrm{relu}(f^{l-1}) = \max(f^{l-1}, 0)$ and $R^{l+1} = \partial f^{out} / \partial f^{l+1}$), GBP aims to achieve zero outputs for all negative gradients; the mask is then computed as

$$R^l = 1_{R^{l+1} > 0} \, 1_{f^l > 0} \, R^{l+1}, \tag{2}$$

where $1_{R^{l+1} > 0}$ retains only positive gradients and $1_{f^l > 0}$ retains only positive activations.
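The guided back-propagation rule described above can be illustrated on a single ReLU layer. The following is a minimal sketch in plain Python, not the implementation used in this work; the function names and the example activations and gradients are hypothetical. It contrasts the ordinary ReLU backward pass (activation mask only) with the guided variant (activation mask and positive-gradient mask).

```python
def relu_forward(f_prev):
    """Forward pass of a ReLU layer: f^l = max(f^{l-1}, 0)."""
    return [max(v, 0.0) for v in f_prev]


def vanilla_relu_backward(f_l, r_next):
    """Ordinary ReLU backward pass: keep gradients where the activation is positive."""
    return [r if f > 0.0 else 0.0 for f, r in zip(f_l, r_next)]


def guided_relu_backward(f_l, r_next):
    """Guided back-propagation: keep a gradient entry only where both the
    forward activation f^l and the incoming gradient R^{l+1} are positive."""
    return [r if (f > 0.0 and r > 0.0) else 0.0 for f, r in zip(f_l, r_next)]


# Hypothetical pre-activations and upstream gradients for one layer.
f_l = relu_forward([-1.0, 2.0, 3.0, 0.5])
r_next = [0.7, -0.4, 1.5, 0.2]

vanilla = vanilla_relu_backward(f_l, r_next)
guided = guided_relu_backward(f_l, r_next)
```

In this example the second unit is active but receives a negative gradient, so only the guided rule zeroes it; the first unit is inactive and is zeroed by both rules.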

3) INTEGRATED GRADIENTS (IG) [46]
Gradient saturation is addressed by summing the gradients over scaled versions of the input. IG for an input $x$ is defined as

$$E_{IG}(x) = (x - \bar{x}) \times \int_{0}^{1} \frac{\partial S(\bar{x} + \alpha (x - \bar{x}))}{\partial x} \, d\alpha, \tag{3}$$

where $\bar{x}$ is the baseline input, typically set to zero, representing an absence of features in the input sample $x$.

4) SMOOTHGRAD (SG) [47]
SG seeks to alleviate noise and visual diffusion by averaging over the explanations of noisy versions of an input. Given an explanation $E$ and a sample $x$, the SG explanation $E_{SG}$ is defined as

$$E_{SG}(x) = \frac{1}{N} \sum_{i=1}^{N} E(x + g_i), \tag{4}$$

where $N$ is the number of noisy samples and the noise vectors $g_i \sim \mathcal{N}(0, \sigma^2)$ are drawn independently and identically distributed from the normal distribution.
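The gradient, IG, and SG explanations described above can be tried on a toy score function whose gradient is known in closed form. This is an illustrative sketch only: the quadratic score, its weights, the step count, and the noise level are all assumptions chosen for readability, and a real DNN would obtain gradients by automatic differentiation instead. Passing the IG function as the explainer of SG gives one possible reading of "SmoothGrad integrated gradients".

```python
import random

WEIGHTS = [0.5, -1.0, 2.0]  # hypothetical toy parameters


def score(x):
    """Toy class score S(x) = sum_i w_i * x_i^2, standing in for a DNN logit."""
    return sum(w * v * v for w, v in zip(WEIGHTS, x))


def gradient(x):
    """Vanilla gradient explanation dS/dx (analytic for the toy score)."""
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]


def integrated_gradients(x, baseline=None, steps=50):
    """Midpoint Riemann-sum approximation of the integrated-gradients integral."""
    if baseline is None:
        baseline = [0.0] * len(x)  # zero baseline: absence of features
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, gradient(point))]
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, total)]


def smoothgrad(x, explain=gradient, sigma=0.1, n=2000, seed=0):
    """Average an explanation over n noisy copies of x with N(0, sigma^2) noise."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        total = [t + e for t, e in zip(total, explain(noisy))]
    return [t / n for t in total]


x = [1.0, 2.0, -1.0]
grad = gradient(x)
ig = integrated_gradients(x)
sg = smoothgrad(x)
sg_ig = smoothgrad(x, explain=integrated_gradients, n=200)
```

For this toy score the IG attributions sum to `score(x) - score(baseline)`, which is a convenient sanity check on the integral approximation.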

B. KNOWLEDGE DISTILLATION (KD)
In the standard KD model [20], knowledge is encoded and transferred in the form of softened class scores. The total loss for training the student model is given by

$$L = (1 - \alpha) \, L_{CE}(y, \sigma(z_s)) + \alpha \, L_{CE}\left(\sigma(z_T / T), \sigma(z_s / T)\right), \tag{5}$$

where $L_{CE}(\cdot, \cdot)$ represents the cross-entropy; $y$ represents the one-hot vector of ground truths; $\sigma$ is the soft-max function; $z_s$ and $z_T$ are the output logits of the student and teacher models, respectively; $\alpha$ is a balancing hyper-parameter; and $T$ is the temperature hyper-parameter. In (5), the first term denotes the cross-entropy loss using ground truth labels whilst the second term encourages the student model to mimic the softened class scores from the teacher model. As shown in the standard KD from Fig. 1, the student model was trained using the predictions of the teacher model along with the ground truth hard labels. A variant of the soft-max function including a temperature parameter $T$ was used to produce the soft labels as

$$p_i = \frac{\exp(I_i / T)}{\sum_{j} \exp(I_j / T)}, \tag{6}$$

where $I$ is the input logits to the soft-max layer, and a higher value of $T$ produces a smoother probability distribution over the 14 classes. Thus, the total loss function $L$ is a combination of the KD loss (soft loss) $L_{soft}$, i.e., the cross-entropy loss between the soft predictions of the teacher and the student, and the hard loss $L_{hard}$, given as

$$L = \alpha \, L_{soft} + (1 - \alpha) \, L_{hard}. \tag{7}$$

T. K. K. Ho, J. Gwak: Utilizing Knowledge Distillation in Deep Learning for Classification of Chest X-Ray Abnormalities

However, it is commonly understood that if we reverse the KD operation, the teacher will not be significantly improved because the student model is too weak to learn and transfer useful knowledge. Also, using a poorly-trained teacher model that has been trained for only the first 50 epochs may yield worse performances than normal KD or reverse KD procedures. Finally, if we self-train the model, it may achieve better results compared to all of the above strategies.
For example, a model with 90% accuracy that is trained from itself would learn from softened class targets carrying a 10% error. To address these concerns, we followed the KD procedure illustrated in Fig. 1 and conducted all experiments pertaining to the five main training strategies: base training (simply training normal DNNs in an end-to-end manner); standard KD (training a teacher model to teach a student model); reversed KD (training a student model to teach a teacher model); defective KD (poorly training a teacher over the first 50 epochs to then teach a student model); and self-training KD (training a model to teach itself) (Fig. 2). To feasibly conduct all KD training approaches, we selected six types of DNN models with identical input sizes to examine our proposed training methods, namely MobileNet-v2 [2], VGG-19 [3], ResNet-32, ResNet-50, and ResNet-152 [4], and DenseNet-121 [49]. The first four models (MobileNet-v2, VGG-19, ResNet-32, ResNet-50) were used as student models; they are all relatively small and simple models, though sufficiently powerful either to learn X-ray features from themselves and from more complex teacher models (ResNet-152, DenseNet-121) or to transfer the distilled knowledge to deeper networks.
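The distillation loss described in this subsection, a hard cross-entropy term on the ground truth plus a soft term on temperature-scaled teacher scores, can be sketched in a few lines of plain Python. This is a hedged illustration rather than the training code used in the experiments: the `(1 - alpha)` / `alpha` weighting, the default temperature, and the example logits are assumptions, and some formulations additionally scale the soft term by a power of the temperature.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled soft-max; a larger temperature gives a smoother output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def kd_loss(y, z_student, z_teacher, alpha=0.5, temperature=4.0):
    """Weighted sum of the hard loss (ground truth) and the soft loss (teacher)."""
    hard = cross_entropy(y, softmax(z_student))
    soft = cross_entropy(softmax(z_teacher, temperature),
                         softmax(z_student, temperature))
    return (1.0 - alpha) * hard + alpha * soft


# Hypothetical three-class example.
y = [1.0, 0.0, 0.0]
z_student = [2.0, 1.0, 0.1]
z_teacher = [3.0, 0.5, 0.2]
loss = kd_loss(y, z_student, z_teacher)
```

Setting `alpha=0.0` recovers base training on hard labels only, while `alpha=1.0` trains the student purely on the teacher's softened scores.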

A. CHESTX-RAY14 DATASET
We evaluated our proposed approaches on the publicly available ChestX-ray14 dataset [26]; this is considered to be the largest up-to-date collection of front-view chest radiographs, containing a total of 112,120 X-ray images acquired from 30,805 unique patients. Each image is marked with one or more pathological labels denoting 14 diseases, based on radiology reports with over 90% accuracy. In addition, 984 images carry annotations provided by board-certified radiologists.

Proper data division is essential for a sound evaluation of our proposed methods. Under the patient-wise official split, all images from the same patient are present in only one of the training, validation, and testing subsets. By contrast, an image-wise random split divides all X-ray images into three subsets without considering the subject from whom each image was acquired. Because there is an average of 3.6 images per patient, radiographs from the same subject are then likely to appear in the training, validation, and testing sets simultaneously, which leads to much better apparent performance than under the patient-level split. However, this should not be accepted in pattern classification tasks: it makes consistent benchmarks burdensome to establish, and having patient samples from the testing set appear in the training data is sometimes known as "cheating". In addition, because of the impact of randomness, experiments must be conducted multiple times to average the AUC scores. Given these problems, we solely utilized the patient-level split, which provides more appropriate criteria for evaluating any model in thorax disease prediction.
Using the patient-wise official split, we divided the data from 30,805 unique patients into 70% for training, 10% for validation, and 20% for testing. We also augmented the training and validation datasets using randomized horizontal flipping procedures. Python 3.6.10 with TensorFlow 2.1.0, CUDA 9, and cuDNN 7.5 deep learning dependencies were used for implementing both (i) the visualizations from the different saliency mapping techniques and (ii) the 14-category classifications based on the five KD training strategies. We conducted our experiments within a total computation time of one week, using an i7-4770K 4-core CPU, a GeForce GTX 1070 GPU, and 32 GB of memory.
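The key property of a patient-wise split is that patients, not images, are assigned to subsets, so no patient contributes images to more than one subset. The sketch below illustrates this with a seeded random shuffle of patient IDs; the shuffle is an assumption standing in for the official patient list, which is not reproduced here, and the file names and patient IDs are hypothetical.

```python
import random


def patient_wise_split(image_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole patients to train/val/test, then collect their images."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),  # remainder goes to testing
    }
    return {
        name: sorted(img for img, pid in image_to_patient.items() if pid in ids)
        for name, ids in groups.items()
    }


# Hypothetical records: 10 patients with 3 images each.
records = {
    f"img_{p:02d}_{k}.png": f"patient_{p:02d}"
    for p in range(10)
    for k in range(3)
}
split = patient_wise_split(records)
```

An image-wise split would instead shuffle the keys of `records` directly, which lets images of one patient leak across subsets.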

B. SALIENCY MAP VISUALIZATION
In this section, we discuss three selected saliency mapping techniques, including GBP, SmoothGrad, and SmoothGrad integrated Gradients. We assessed their efficacy in visualizing distinguishable thorax diseases. The main purpose of this saliency mapping task was to attempt to visualize the measure of attention for abnormal regions that were not originally annotated in the 984 ground-truths from our dataset. The findings from our saliency mapping algorithms may significantly help radiologists make decisions concerning the locations of abnormal regions, despite the lack of prior annotations for the X-ray images. From our observation, the InceptionV3 model was seen to better visualize the hot attention areas on the ChestX-ray14 dataset (with higher AUC scores) than other deep models (both other students and teacher models). Fig. 4 shows four examples of the thorax abnormalities-without any ground-truth annotations-identified by the InceptionV3 model, with AUC scores of 0.854 for Effusion, 0.739 for Fibrosis, 0.762 for Hernia, and 0.768 for Nodule classes. Concerning the efficacy of the saliency mapping methods, the integration of SmoothGrad and Gradients outperformed others; it produced more easily obtainable and clearer images for further disease analysis.
Our knowledge of thorax symptoms, along with the observations from Panel (d), demonstrated that the generated pleural effusion images indicated a hot attention region localized in the lower right-hand side, where the hot region was seen to track along the lateral wall and the right costophrenic angle was obscured by a meniscus. This finding was confirmed by experienced radiologists as the correct screening analysis of the chest X-ray images. Similarly, pulmonary fibrosis (Row 2) exhibited an increase of subpleural reticular markings with lower lobe predominance across both lungs. In Row 3, there was a mild opacification of the bilateral right-hand lobes, with an air-fluid level consistent with a hiatal hernia. The upper left-hand lobe pulmonary nodule was slightly marked in Row 4 of Fig. 4. The foci of the saliency maps were indeed on abnormally affected regions, and the disturbance effects and noise could be reduced. These findings from the SmoothGrad integrated Gradients saliency mapping method were remarkably conducive to analyzing and distinguishing different thorax diseases, even in the absence of annotation labels.
To validate the potential of saliency mapping techniques, and of SmoothGrad integrated Gradients in particular, we compared its ability to localize abnormal regions against our previous study [41], which used class activation maps (CAM) extracted from the weight activations of the last convolutional layers of the pre-trained DenseNet-121 model (see Fig. 5). The blue bounding boxes denote the ground truths, drawn from the 984 available annotations in the ChestX-ray14 dataset. Although we had previously verified the ability of CAM methods, inaccurate localizations of abnormal regions were inevitable in several instances. Those instances could nevertheless be located with relatively high certainty by our proposed saliency map technique. Specifically, we show eight samples that, except for the cardiomegaly sample, were wrongly located by the preceding CAM method (the red area represents the most indicative pathology region, whereas blue indicates normal regions); by contrast, all eight samples were precisely highlighted by the SmoothGrad integrated Gradients method when compared with the ground truths. This demonstrates a significant improvement of saliency maps over the previous baseline disease visualization technique, and the approach could be extended to broader X-ray analysis.

C. THORAX MULTI-CLASS CLASSIFICATION RESULTS
In this section, we describe the extensive experimental results for the 14 thorax disease category classification using base training and the KD training strategies. In Table 1, results are shown for the six pre-trained deep models used for normal transfer learning (referred to as base training). DenseNet-121 obtained the best average AUC score with 80.97%, followed by ResNet-152 (79.01%), VGG-19 (76.17%), ResNet-50 (71.66%), MobileNet-v1 (67.10%), and ResNet-32 (66.05%). This suggests that the more complex and deeper models outperformed the smaller and simpler models when dealing with the challenging multi-class classification of chest X-ray images.
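The average AUC reported here is the mean of the per-disease AUC values. As a minimal sketch of how such a figure can be computed, the code below uses the rank interpretation of AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half) and averages it over label columns. The function names and the tiny two-class example are hypothetical; a library routine would normally be used in practice.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(label_matrix, score_matrix):
    """Per-class AUC over the columns (one column per disease) and their mean."""
    n_classes = len(label_matrix[0])
    aucs = [
        roc_auc([row[c] for row in label_matrix],
                [row[c] for row in score_matrix])
        for c in range(n_classes)
    ]
    return sum(aucs) / n_classes, aucs


# Hypothetical multi-label example: 4 images, 2 disease classes.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.4, 0.5]]
average, per_class = mean_auc(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration but not for a test set of this dataset's size.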
First, as expected, Standard KD outperformed the base training method. The student model was significantly improved by learning from the teacher. In particular, MobileNet-v1, VGG-19, ResNet-32, and ResNet-50 achieved 7.02%, 1.22%, 8.01%, and 3.86% AUC improvements, whereas a decrease of 0.21% was observed when DenseNet-121 was taught by ResNet-152. Similarly, when DenseNet-121 acted as a teacher model, it significantly improved the performances of all the student models (improvements of 7.84%, 2.5%, 10.36%, and 7.03% with MobileNet-v1, VGG-19, ResNet-32, and ResNet-50, respectively); it even outperformed the ResNet-152 teacher model. This indicates that weak student models can be significantly enhanced by superior teacher models.
Meanwhile, as mentioned above, we assumed that as the teacher model becomes more accurate, its soft probabilities capture the underlying target class distribution more fully and therefore deliver better supervision to the student model. That is, smaller and less accurate models cannot be good teacher models. We therefore conducted Reversed KD experiments to settle this issue. In the majority of experiments, the teacher models were not improved by Reversed KD training strategies (teachers were taught and trained by students), except in the case of Reversed ResNet-152/DenseNet-121. Therefore, we confirm our hypothesis that the student models were incapable of transferring effective knowledge to the teacher models. Moreover, we explored the Defective KD training strategies, in which the teacher model was trained over the first 50 iterations, with the defective knowledge transferred to student models; we observed that student models could be greatly improved even with distilled knowledge from poorly-trained teacher models. For instance, MobileNet-v2, ResNet-32, and ResNet-50 student models were improved (by 3.08%, 3.60%, and 1.12% AUC, respectively) with ResNet-152, whilst they achieved 5.08%, 8.11%, and 7.1% AUC score improvements compared to the base training with DenseNet-121. Although the poorly-trained teacher model performed less accurately than the Standard KD (as expected), its capacity for transferring knowledge to lower-cost student models was higher than that of both base training and Reversed KD in most of the experiments. Defective KD could be used to generate the soft targets of the model, where these learned soft targets then guide the teacher model's regularization processes.
To better demonstrate the distillation approach, we considered updating the output distribution of the teacher model using information from itself or from simpler models; this is the so-called Self-training KD framework, in which there is no teacher model. The Self-training KD method is applicable when either a teacher model is unavailable or only limited computational resources are available. Concretely, the model was first trained in the normal way to obtain a pre-trained model, which was then used for self-training by transferring the soft targets, as described in Eq. (5). Formally, we minimized the Kullback-Leibler (KL) divergence of the logits between model M and its pre-trained model M^t, using a loss function in which D_KL is the KL divergence; q is the ground-truth label; the output probabilities of M and M^t are computed from their respective output logits (z^t denotes the output logits of the pre-trained model); τ is the temperature; and α is the weight parameter used to balance the two terms. We trained five baseline models: MobileNet-v2, VGG-19, ResNet-32, ResNet-50, and DenseNet-121. The baseline models were trained for 200 iterations with an initial learning rate of 0.1 and an SGD optimizer (with a momentum of 0.9 and a weight decay of 5e-4), and a grid search was used to find the optimum hyper-parameter values. Column 3 in Table 1 shows that Self-training KD consistently outperformed the base training approach. For example, MobileNet-v2, VGG-19, ResNet-32, ResNet-50, and DenseNet-121 increased their accuracy performances by 3.58%, 0.98%, 3.29%, and 3.35%, respectively. However, Standard KD outperformed the self-training methods because the weak models by themselves transferred inefficient knowledge, except in the case of DenseNet-121. From our observations, Self-training KD with DenseNet-121 obtained the highest average AUC (82.56%), followed by Reversed KD with ResNet-152/DenseNet-121 (80.97%) and Defective KD with DenseNet-121/DenseNet-121 (80.21%). Fig. 6 shows the stable and accurate performance of Self-training KD with DenseNet-121 in terms of the training and validation accuracies, as well as the accuracy improvement compared with the base training method.
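A self-training objective of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Eq. (5): it assumes the loss takes the common form (1 - α) H(q, p) + α D_KL(p^t_τ || p_τ), with softmax outputs and no additional τ² scaling. All function names are hypothetical, and a multi-label setting may instead call for per-class sigmoid outputs.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(q, p, eps=1e-12):
    """H(q, p) between the ground-truth distribution q and the prediction p."""
    return -sum(qi * math.log(pi + eps) for qi, pi in zip(q, p))


def kl_divergence(p_ref, p, eps=1e-12):
    """D_KL(p_ref || p) for two discrete distributions."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p_ref, p) if a > 0.0)


def self_training_kd_loss(logits, pretrained_logits, q, alpha=0.5, tau=4.0):
    """Assumed form: (1 - alpha) * H(q, p) + alpha * D_KL(p_t_tau || p_tau)."""
    p = softmax(logits)                          # prediction of model M
    p_tau = softmax(logits, tau)                 # softened output of M
    pt_tau = softmax(pretrained_logits, tau)     # softened output of M^t
    hard_term = cross_entropy(q, p)              # supervision from labels
    soft_term = kl_divergence(pt_tau, p_tau)     # supervision from M^t
    return (1.0 - alpha) * hard_term + alpha * soft_term


# Hypothetical example: three classes, ground truth is class 0.
q = [1.0, 0.0, 0.0]
student_logits = [2.0, 0.5, -1.0]
pretrained_logits = [3.0, 0.0, -0.5]
loss = self_training_kd_loss(student_logits, pretrained_logits, q)
```

When the model reproduces its pre-trained logits exactly, the KL term vanishes and only the weighted cross-entropy remains, so α controls how strongly the pre-trained outputs regularize the model.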
Although the KD frameworks (Standard KD and Defective KD) demonstrated significant improvements over the stand-alone base models, a student model, even when trained with rich input information, is insufficient to produce a well-trained teacher model, as the results of the Reversed KD approach show. In addition, training with the Self-training framework incurred greater time consumption, computational cost, and resource burden than any of the simple base training methods. Table 2 reports the execution time for each approach (base training and three types of KD approaches). In general, the training time of each KD approach with the ResNet-152 teacher model was shorter than that with the DenseNet-121 teacher model. Moreover, among six base models, the base training approach [...] (Fig. 6, right), the time cost differed for each iteration of the two frameworks. DenseNet-121-based Self-training KD consumed twice as much training time as base training, which sometimes even exhausted our computational resources.
To justify the potential of the proposed methods, we compared our best results, achieved by the Self-training DenseNet-121 model, with five state-of-the-art deep learning frameworks on the ChestX-ray14 dataset by evaluating the per-class AUC scores and the average AUC scores, as shown in Table 3. The highest AUC score in each row is shown in boldface. Guendel et al. [58] and Wang et al. [57] reported exceptional classification results using an image-wise random split; however, that split does not consider which subject a radiograph was acquired from, so radiographs from the same subject could appear in both the training and testing sets. We therefore excluded those results from Table 3. Instead, we included the results of studies using the official patient-wise data split, which is deemed a fairer and more appropriate evaluation of CAD for the classification of thorax diseases. Note that Guendel et al. [58] trained their model not only on the ChestX-ray14 dataset but also on an external set of 180,000 images from the PLCO dataset [59]. The diagnosis performances presented in Table 3 indicate that our proposed framework obtained very competitive results, with the highest per-class AUC scores in seven disease classes and the highest average AUC score.
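The per-class and average AUC scores used in this comparison can be computed as in the following minimal sketch. It assumes the average is the unweighted mean over classes and uses the pairwise (Mann-Whitney) definition of AUC; the function names and the toy labels and scores are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_class_and_mean_auc(label_matrix, score_matrix):
    """Return ([AUC per class], unweighted mean AUC) for multi-label outputs."""
    n_classes = len(label_matrix[0])
    aucs = []
    for c in range(n_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        aucs.append(binary_auc(labels, scores))
    return aucs, sum(aucs) / n_classes


# Toy example: four images, two disease classes (illustrative values only).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.7], [0.8, 0.4], [0.3, 0.6]]
aucs, mean_auc = per_class_and_mean_auc(y_true, y_score)
```

The pairwise loop is quadratic in the number of images and is meant only to make the definition explicit; a rank-based implementation would be preferable at the scale of ChestX-ray14.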

V. DISCUSSION AND FUTURE WORKS
We demonstrated the suitability of saliency mapping techniques for visualizing the abnormal regions of chest X-ray images, as well as the competitive distilling performance achieved by transferring knowledge both from large, highly regularized models into smaller ones and from a model into itself, to classify 14 pathological thorax diseases. However, our work has several notable limitations. First, although we attempted to evaluate a comprehensive machine-human annotated chest X-ray dataset, simulating the practical clinical challenges of handling over 100,000 images, it was difficult to correctly visualize and discriminate the 14 classes with a deep learning framework when the database was unbalanced and weakly supervised. The appearance of a thorax disease is usually accompanied by other related diseases visible in chest X-ray images; for instance, pneumothorax is often associated with pneumonia. The low rate of agreement between multiple radiologists in this dataset revealed a large bias, and the diagnoses should be decided by a majority vote of these experts. Therefore, there is a need to utilize external training datasets and independent validation sets, such as MIMIC-CXR [60] or PLCO [59], to verify the generalizability of our proposed frameworks.
Second, we analyzed the output of saliency maps, SmoothGrad Integrated Gradients in particular, which offered a good visual representation for isolating the abnormal regions and could further assist the deep network in classification decisions. However, there was no certainty that the saliency map results (Fig. 4) correctly localized abnormalities, owing to the lack of disease annotations in ChestX-ray14. This means we were not able to compare our outcomes with real ground truths. In addition, we evaluated the maps on only a very limited number of annotated X-ray images, which might lead to a strong localization bias. That is, the integration of center bias and background information was not always helpful for cases in which the abnormal areas (saliency targets) were unidentifiable in the margins of the X-ray images, or for cases with multiple diseases in the X-ray image. Thus, it is critically important to design an attention visualization model not only to facilitate generalization but also to help diagnose the models' failures by identifying biases, or fair and bias-free outcomes, from the datasets, as was done in [51].
Third, our extensive experiments demonstrated the potential of KD strategies in chest X-ray disease classification. Although Self-training KD outperformed base training and Standard KD in terms of classification results, its long training time and enormous computational cost are substantial shortcomings of the Self-training KD framework. Besides, our KD model independently extracted instance features as the distilled knowledge from specific layers of the teacher models, without considering the instance's relationship to the student models or the inference procedure. It is difficult for student models to directly fit all the layer outputs from teachers. Therefore, it is necessary to create new KD designs that can help reduce the intra-class variances and magnify inter-class differences in the feature space, as well as prevent significant performance drops when teachers and students have different architectures, as seen in [52]. It might also be better to replace the process of mimicking the teacher's representation space with that of preserving the pairwise similarities in the student's own representation space [53].
Lastly, there are 60,412 normal images and 51,708 images with one or more labels, which gives rise to the problem of interdependency among labels. For example, an image indicating the presence of edema possibly also shows both consolidation (air space opacification) and pleural effusion (abnormal fluid in the pleural space). This disturbed the training of our proposed models and produced lower AUC scores, since the proposed method did not recognize these interdependencies effectively and consequently predicted pathological outcomes across all thorax categories ineffectively. Therefore, an approach is required that allows distillation at different internal points across the teacher and enables the student to learn and compress the abstraction in the hidden layers systematically. With proper internal representations, the student may outperform its conventional approach on ground-truth labels, soft labels, or both. Because we observed that a poorly trained teacher could remarkably enhance the student (as shown by the Defective KD results), it is justifiable to interpret KD as a regularization term and to scrutinize KD from the perspective of Label Smoothing Regularization (LSR) [62]. LSR can mitigate the over-confidence problem and improve model calibration by replacing the one-hot labels with smoothed labels. The loss under the smoothed label can be split into two parts: the first part is the ordinary cross-entropy between the one-hot label distribution and the output; the second part corresponds to a virtual teacher that provides soft targets through a uniform distribution. This furnishes efficient regularization for the student and may help overcome the issue of interdependency among thorax labels.
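The two-part view of LSR described above can be checked numerically with a minimal sketch. It assumes the standard single-label formulation, in which the smoothed label is (1 - α) q + α u with u uniform over K classes; a multi-label setting such as ChestX-ray14 would need a per-class adaptation. The function names and the example distribution are hypothetical.

```python
import math


def smooth_labels(one_hot, alpha):
    """Mix a one-hot label with the uniform distribution: (1 - alpha) * q + alpha * u."""
    k = len(one_hot)
    return [(1.0 - alpha) * q + alpha / k for q in one_hot]


def cross_entropy(target, p):
    """H(target, p) for two discrete distributions."""
    return -sum(t * math.log(pi) for t, pi in zip(target, p) if t > 0.0)


def kl_divergence(a, b):
    """D_KL(a || b) for two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def entropy(a):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(ai * math.log(ai) for ai in a if ai > 0.0)


# Hypothetical four-class example (illustrative values only).
q = [0.0, 1.0, 0.0, 0.0]        # one-hot ground truth
p = [0.10, 0.70, 0.15, 0.05]    # model output probabilities
alpha = 0.1
u = [1.0 / len(q)] * len(q)     # uniform "virtual teacher"

smoothed = smooth_labels(q, alpha)

# Loss computed directly on the smoothed label.
lsr_loss = cross_entropy(smoothed, p)

# Same loss split into the ordinary cross-entropy and the virtual-teacher term.
# H(u, p) = D_KL(u || p) + H(u), and H(u) does not depend on the model.
split_loss = (1.0 - alpha) * cross_entropy(q, p) + alpha * (kl_divergence(u, p) + entropy(u))
```

The two quantities agree, which is why LSR can be read as distillation from a uniform teacher: only the KL term depends on the model, and it penalizes over-confident outputs.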

VI. CONCLUSION
In this work, we proposed KD training strategies along with three types of saliency mapping techniques, with the aim of correctly classifying and visualizing 14 pathological thorax diseases from the public ChestX-ray14 dataset. Our experiments demonstrated the feasibility of implementing different KD training strategies, suggesting that the target models into which the distilled knowledge is transferred can be enhanced by the Self-training KD method when difficulties arise in choosing superior teachers or when limited computation resources are available. Also, the results of the saliency mapping algorithms show promise in highlighting abnormal regions, despite the unbalanced and limited annotations of pathologies. These capabilities can further serve as a powerful tool with which clinicians or radiologists can review and interpret the decision-making processes of CAD algorithms in thorax disease diagnoses.