ThoraX-PriorNet: A Novel Attention-Based Architecture Using Anatomical Prior Probability Maps for Thoracic Disease Classification

Computer-aided disease diagnosis and prognosis based on medical images is a rapidly emerging field. Many Convolutional Neural Network (CNN) architectures have been developed by researchers for disease classification and localization from chest X-ray images. It is known that different thoracic disease lesions are more likely to occur in specific anatomical regions compared to others. This article aims to incorporate this disease and region-dependent prior probability distribution within a deep learning framework. We present the ThoraX-PriorNet, a novel attention-based CNN model for thoracic disease classification. We first estimate a disease-dependent spatial probability, i.e., an anatomical prior, that indicates the probability of occurrence of a disease in a specific region in a chest X-ray image. Next, we develop a novel attention-based classification model that combines information from the estimated anatomical prior and automatically extracted chest region of interest (ROI) masks to provide attention to the feature maps generated from a deep convolution network. Unlike previous works that utilize various self-attention mechanisms, the proposed method leverages the extracted chest ROI masks along with the probabilistic anatomical prior information, which selects the region of interest for different diseases to provide attention. The proposed method shows superior performance in disease classification on the NIH ChestX-ray14 dataset compared to existing state-of-the-art methods while reaching an area under the ROC curve (%AUC) of 84.67. Regarding disease localization, the anatomy prior attention method shows competitive performance compared to state-of-the-art methods, achieving an accuracy of 0.80, 0.63, 0.49, 0.33, 0.28, 0.21, and 0.04 with an Intersection over Union (IoU) threshold of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7, respectively. 
The proposed ThoraX-PriorNet can be generalized to different medical image classification and localization tasks where the probability of occurrence of the lesion is dependent on specific anatomical sites.

Thoracic disorders are one of the major health concerns worldwide, as the heart and lungs, two vital human organs, are located within the thorax. In 2017, around 544.9 million people were affected by chronic respiratory illness [1], a thoracic disease, leading to 3.9 million deaths [2]. Various medical imaging modalities, e.g., X-ray, Magnetic Resonance Imaging (MRI), and Computed Tomography (CT), can diagnose different thoracic disorders. The chest X-ray (CXR) remains the most commonly performed and widely available radiological diagnostic method to assess and diagnose thoracic diseases. The chest radiograph is an X-ray projection image of the thoracic cavity used to diagnose conditions affecting the chest, its contents, and nearby structures. It is one of the most effective and low-cost methods for diagnosing thoracic diseases. Since CXR is a projection imaging method providing a 2D image of the 3D thoracic structure, anatomical structures overlap in the resulting image. Therefore, diagnosis of diseases from CXR images depends heavily on the skill and experience of the radiologist [3]. However, in many underserved regions of the world, the number of skilled radiologists is insufficient. In such scenarios, automated CXR image interpretation using artificial intelligence (AI) can significantly benefit health systems. This is true even if the algorithms do not make fully autonomous decisions and are only used to assist physicians.
©2023 IEEE. This article has been accepted for publication in IEEE ACCESS. See http://www.ieee.org/publications_standards/publications/rights/index.html for copyright information.
However, it is of paramount importance that machine learning models be explainable for radiologists to trust them. Thus, providing an accurate location for the predicted pathologies is a prerequisite for computer-aided diagnosis. Due to the lack of pixel-level ground truth annotation data, however, deep learning models suffer from sub-optimal optimization. A number of weakly supervised disease localization methods have been proposed in recent years to solve this problem. In the literature, different attention-based approaches [4]-[6] have been used for medical disease diagnosis, where the model traditionally learns to identify and focus on the regions of interest containing the lesions using activated feature maps from the classifiers. These methods are data-driven and are generally agnostic to the human anatomy and its role in identifying the diseased regions. They do not take into account the typical occurrence areas for a specific pathology, and thus they often fail to predict the lesion region as recognized by radiologists. Intuitively, radiologists do not search all parts of a chest X-ray image when diagnosing a patient for thoracic diseases. Instead, they concentrate on the areas related to the symptoms of the patient's disease.
Different thoracic disease lesions have unique characteristics and are identified in specific regions of a chest radiographic image. For example, when identifying pneumonia, a radiologist looks for white spots in the lungs that show the characteristics of infection. In contrast, the opacity features of pleural effusion manifest in the pleural space, not inside the lung region. Similarly, the cardiomegaly pathology is associated with the heart. Thus, we may consider that the diagnostic features of different thoracic diseases have a higher probability of occurrence in certain anatomical regions of the chest X-ray. Conversely, specific disease features may have a zero probability of occurrence in certain anatomical regions (e.g., observing consolidation features outside the lungs). Therefore, to reliably detect and localize thoracic diseases, we require deep learning-based models not only to learn the disease-specific features but also to focus on the specific anatomical regions where the likelihood of the disease is highest. However, existing studies predict only the most discriminative areas for pathology localization and classification, without considering prior knowledge of the distribution of regions where a pathology most frequently appears. Although Chen et al. [7] and Kamal et al. [8] utilized lung segmentation-based attention mechanisms, disease-specific anatomical prior knowledge was not considered within the attention mechanism and abnormality localization.
Considering the limitations of previous works in this area, we propose a novel model architecture using two types of attention for disease classification: chest region of interest mask-based attention and disease-specific anomaly-based attention. The main contributions of this paper are as follows:
• We propose the concept of a novel probabilistic anatomical prior map that provides a spatial probability distribution of disease occurrence within X-ray images. To the best of our knowledge, the idea of disease-specific anatomical prior probability maps generated by aggregating disease ROI masks has not been explored in previous research works.
• We developed an end-to-end model, ThoraX-PriorNet, a novel attention-based architecture that focuses on specific regions of an X-ray image informed by both disease-specific anatomical prior probability maps and lung region-of-interest (ROI) masks.
• We conducted a thorough experimental evaluation to compare the performance of the proposed ThoraX-PriorNet model with existing methods. Detailed ablation studies conducted using the anatomy prior attention module (APAM) demonstrate the effectiveness of the proposed method in accurately detecting thoracic diseases.
The rest of this paper is organized as follows. Section II reviews the related works in the thoracic disease classification and weakly supervised localization tasks. Section III presents our proposed approach in detail. Section IV discusses our experimental settings, such as datasets, data preparation, and training scheme. We conduct comprehensive experiments in Section V, including ablation studies, performance comparison with state-of-the-art methods, and statistical analysis, for both classification and localization tasks. In Section VI, we conclude this paper.

A. ATTENTION
Attention mechanisms that selectively attend to zones of an image with a high probability of exhibiting particular diseases can yield a substantial performance improvement for machine learning models [9]. Chest X-rays are frequently employed for diagnosing respiratory and cardiovascular conditions, and precise interpretation of these images is imperative for effective treatment [10]-[15]. Attention modules available in the computer vision literature can be divided into two main categories. One includes the Squeeze-and-Excitation (SE) approach, which adaptively re-calibrates channel-wise feature responses by explicitly modeling inter-dependencies between channels [16]. The other is the Gather-Excite (GE) method, which efficiently aggregates feature responses from a large spatial extent and excites, redistributing the pooled information to local features [17]. Chen et al. [18] presented a non-local (NL) attention module to utilize the local relationship for capturing long-range dependencies. Wang et al. [19] introduced a triplet attention model that can learn channel-wise, element-wise, and scale-wise attention simultaneously. This approach helps to capture distinctive information relevant to the task of classifying thorax diseases. Ullah et al. [10] incorporated channel-wise attention as a layer at multiple positions in their feed-forward network for COVID-19 classification. Zhang et al. [20] presented attention guided by different parts of the lung. Kamal et al. [8] used a lung segmentation mask to provide attention in the lung region of a chest X-ray image. To overcome the domain mismatch of the lung segmentation dataset, they used a GAN model to segment the lungs, and the resulting masks were later used for providing attention.
Although adding attention modules to a network enhances model performance, most existing approaches mainly focus on learning the attention map using global CXR images, without considering disease-specific lung regions. To address this constraint, the proposed method generates a disease-specific probabilistic map from the provided bounding-box annotations. We then provide probabilistic map-guided and lung mask-guided attention to focus on specific regions of the chest X-ray image for thoracic disease evaluation.

B. WEAKLY SUPERVISED LEARNING
Achieving success in supervised learning demands sophisticated network engineering and an enormous quantity of precisely labeled training data [21]. Weakly supervised learning is becoming increasingly important in medical chest X-ray analysis, as it can alleviate the need for the extensive and precise annotations required for supervised learning. Wang et al. [22] introduced the ChestX-ray14 dataset, together with a baseline for evaluating weakly supervised lesion localization. Furthermore, numerous studies have investigated disease localization on CXR images [22]-[24] without directly utilizing ROI labels. Notably, prior research on localization, such as Ye et al.'s [14] use of probabilistic-CAM pooling and Ouyang et al.'s [25] use of hierarchical attention for weakly supervised abnormality localization, has incorporated attention mechanisms in the architecture. Ullah et al. [10] utilized Grad-CAM to produce a COVID-19 heatmap, with the aim of presenting classification outcomes that are supported by clinical evidence and thus applicable to clinical practice. Employing saliency techniques, such as Class Activation Mapping (CAM), Grad-CAM [26], Grad-CAM++ [27], Eigen-CAM [28], and similar methods, to produce heatmaps can be highly beneficial in furnishing clinical evidence. Rozenberg et al. [29] achieved high localization performance by learning to localize the lesion areas with limited annotation derived from a small fraction of masked data. Zhu et al. [30] proposed a convolutional attention-based network named PCAN that is pathology-aware and capable of capturing the variations in lesion size and location by generating pixel-wise diagnoses and pixel-wise weights. Han et al. [31] leveraged two views, i.e., radiomic and global image features, for training a framework for classifying and localizing thoracic diseases. To extract the radiomic features, they exploited Grad-CAM generated by the image classifier backbone through a feedback loop mechanism. Xiao et al. [32] improved the performance of ViTs by pre-training with 266,340 chest X-rays using Masked Autoencoders, reconstructing missing pixels from a small part of each image. Li et al. [33] utilized an adaptive ViT with a DenseNet architecture and a feature pyramid structure to model the inter-patch and patch-wise long-range dependencies and obtain fine-grained feature maps.
However, the previous methods from the literature depend on the discriminative power of deep convolutional networks and predict the area of a chest X-ray that is most responsible for classification as the lesion area, without considering prior knowledge of the distribution of disease occurrence areas in a chest X-ray image. Instead, we developed an end-to-end novel attention-based architecture named ThoraX-PriorNet, which focuses on specific regions of a chest X-ray image guided by typical disease-specific spatial anatomical prior probability maps.

III. PROPOSED METHOD
This section describes our proposed approach, a deep learning-based novel classification architecture named ThoraX-PriorNet that utilizes both the chest ROI mask and a disease-specific anatomical prior probability map for pathology classification and localization. We also describe in detail the extraction of the chest ROI mask and the generation of the disease-specific anatomical prior probability map.

A. GENERATING DISEASE-SPECIFIC ANATOMICAL PRIOR PROBABILITY MAP
We compute the disease-specific anatomical prior probability maps by identifying the spatial regions of the CXR images where the lesions are most likely to occur. To construct these maps, we use the NIH Chest X-ray dataset, which includes 880 bounding-box annotated images identifying the regions of the abnormality [22]. First, for a particular disease, we create a binary image in which pixels inside the bounding box are set to 1 and all other pixels are set to 0. Out of the eight pathologies, seven (atelectasis, effusion, infiltrate, mass, nodule, pneumonia, and pneumothorax) can occur symmetrically in the lungs. Leveraging this behavior, we apply horizontal flipping to the bounding boxes of these seven diseases to generate new annotations. We then take the sum of all binary images of a particular disease to generate an unnormalized probability map. Finally, we normalize the pixel values of the unnormalized probability map by dividing them by the maximum pixel value within that probability map. The normalized mask is used in the network as the anatomical prior probability map for providing disease-specific attention.
First, we obtain the unnormalized raw probability map. Let $I^k_c(i, j)$ denote the value at pixel position $(i, j)$ of the $k$-th binary mask image constructed from the bounding-box annotated ground truth for the disease class $c$. The raw map $M_c$ is generated as follows:

$$M_c(i, j) = \sum_{k=1}^{N_c} I^k_c(i, j)$$

where $N_c$ indicates the number of CXR images available for the disease class $c$. Next, we normalize the raw map $M_c$ to obtain the final anatomical prior probability map $M^p_c$:

$$M^p_c(i, j) = \frac{M_c(i, j)}{\max_{i, j} M_c(i, j)}$$

Here, the max operation identifies the maximum pixel value of the raw probability map $M_c$. These anatomical prior probability maps were generated for all eight diseases for which bounding-box annotations are available. Fig. 2 shows the generated disease-specific anatomical prior probability maps for the eight abnormalities. In the strictest sense, the obtained maps $M^p_c(i, j)$ do not represent an actual probability distribution. First, the regions are obtained from bounding boxes, which are larger than the actual disease regions. Second, a probability distribution requires that the integral over the entire map equal unity, which is not enforced here. In the actual implementation, the map's relative intensity values are more important than the absolute values. For disease classes whose bounding-box annotations are not available, we used $M^p_c(i, j) = 1$.
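The aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `prior_map`, the `(x, y, w, h)` box format, and the `symmetric` flag are assumptions introduced here.

```python
import numpy as np

def prior_map(boxes, height, width, symmetric=True):
    """Build a max-normalized prior map from (x, y, w, h) boxes (assumed format)."""
    raw = np.zeros((height, width), dtype=np.float64)
    for x, y, w, h in boxes:
        mask = np.zeros((height, width), dtype=np.float64)
        mask[y:y + h, x:x + w] = 1.0      # binary image: 1 inside the box
        raw += mask
        if symmetric:                      # horizontally flipped annotation
            raw += mask[:, ::-1]
    peak = raw.max()
    # Classes without any annotation fall back to an all-ones map.
    return raw / peak if peak > 0 else np.ones_like(raw)
```

Setting `symmetric=False` would correspond to the one pathology for which flipping is not applied.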

B. CHEST ROI MASK GENERATION
We employ the well-established U-Net [34] segmentation model to extract the lung regions from the input CXR images. We train the model using the 247 images from the JSRT dataset [35]. For some images, the segmentation model produces undesirable small islands. To address this issue, we binarize the segmentation results and apply post-processing to remove the unwanted islands based on the anatomical characteristics of the lungs. Since all other islands are small compared to the lung islands, we keep only the two largest islands, which represent the right and left lungs. The sternum region is also important for some thoracic diseases and contains crucial information for classification. To retain this region, we use the convex hull operation [36]. Finally, we use morphological expansion to retain further information from the pleural regions. The overall chest ROI mask generation flow chart is provided in Fig. 3. Some of the CXR images and their corresponding generated masks are shown in Fig. 4. These post-processing operations are represented by the post-processing block in the full ThoraX-PriorNet architecture in Fig. 1.
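Two of the post-processing steps, keeping the two largest islands and morphological expansion, can be illustrated with the following sketch. It is written under stated assumptions: 4-connectivity and a 3×3 cross structuring element are choices made here, the helper names are hypothetical, and the convex hull step is omitted for brevity.

```python
import numpy as np
from collections import deque

def label_islands(mask):
    """4-connected component labelling via breadth-first search."""
    labels = np.zeros(mask.shape, dtype=int)
    sizes, current = {}, 0
    for r, c in zip(*np.nonzero(mask)):
        if labels[r, c]:
            continue
        current += 1
        labels[r, c] = current
        queue, count = deque([(r, c)]), 0
        while queue:
            y, x = queue.popleft()
            count += 1
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
        sizes[current] = count
    return labels, sizes

def keep_two_largest(mask):
    """Keep only the two largest islands (assumed to be the two lungs)."""
    labels, sizes = label_islands(mask)
    keep = sorted(sizes, key=sizes.get, reverse=True)[:2]
    return np.isin(labels, keep)

def dilate(mask, steps=1):
    """Binary dilation with a 3x3 cross, repeated `steps` times."""
    out = mask.astype(bool)
    for _ in range(steps):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out
```

In practice, library routines for connected components, convex hulls, and dilation would likely replace these hand-written helpers.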

C. ANATOMICAL PRIOR ATTENTION MODULE (APAM)
In this section, we describe the anatomical prior attention module (APAM), which takes a feature map and a mask (chest ROI mask or anomaly probability map) as inputs and generates an attention map by applying spatial attention to the feature map. The APAM framework is illustrated in Fig. 5. First, we multiply the feature map with the input mask to generate a masked feature map. Then, we take the weighted sum of the feature map and the masked feature map to retain information from the region outside the mask, since some disease predictions may depend on features of the unmasked region. The weights are generated from the feature map and the masked feature map through a CNN. To learn the weights, we use a network similar to the channel-wise attention module described in [37]. However, unlike [37], we aggregate spatial information from both the feature map and the masked feature map.

Let $F \in \mathbb{R}^{C \times H \times W}$ be the feature map generated by the backbone CNN and $M_{inp} \in \mathbb{R}^{1 \times H \times W}$ be the input mask (chest ROI mask or anomaly probability map) resized to the spatial dimension of $F$. We pass the feature map $F$ into two pooling layers: global average pooling (AvgPool) and global max pooling (MaxPool). The corresponding outputs are denoted $F_{avg}$ and $F_{max}$, respectively, where $F_{avg}, F_{max} \in \mathbb{R}^{C \times 1 \times 1}$. Let $F_m \in \mathbb{R}^{C \times H \times W}$ be the masked feature map, produced by multiplying the feature map $F$ with the input mask $M_{inp}$:

$$F_m = F \odot M_{inp}$$

Here, $\odot$ denotes element-wise multiplication. We obtain $M_{avg}, M_{max} \in \mathbb{R}^{C \times 1 \times 1}$ by passing $F_m$ through the global average pooling and global max pooling layers in the same way. Furthermore, instead of a shared multi-layer perceptron (MLP), we use separate MLPs for all four spatial context descriptors ($F_{avg}$, $F_{max}$, $M_{avg}$, $M_{max}$). After passing the spatial context descriptors through the CNN, the network produces the required channel weighting values, $W \in \mathbb{R}^{C \times 1 \times 1}$. These values are computed with the blocks $\mathrm{CLR}_1, \mathrm{CLR}_2, \ldots, \mathrm{CLR}_4$ and $\mathrm{CS}$, where each CLR block is a convolutional layer followed by a leaky ReLU activation layer, and the CS block is a convolutional layer followed by a sigmoid activation layer. In the CS block, we use the sigmoid activation function so that the components of $W$ lie within the range $[0, 1]$. For the CLR blocks, we use the leaky ReLU with a negative slope of 0.2 to mitigate the vanishing gradient problem [38]. Finally, we generate the attention map $A \in \mathbb{R}^{C \times H \times W}$ as the weighted sum of $F$ and $F_m$, with the weights given by $W$.
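To make the data flow of APAM concrete, the sketch below implements one possible reading of the module in NumPy. Two details are assumptions made here and are not confirmed by the text: the outputs of the four CLR blocks are combined by summation before the CS block, and the attention map is formed as `W * F_m + (1 - W) * F`. The `params` dictionary and its layout are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apam(F, M_inp, params):
    """F: (C, H, W) features; M_inp: (1, H, W) mask; params: weight dict."""
    F_m = F * M_inp                                   # masked feature map
    descriptors = [F.mean(axis=(1, 2)), F.max(axis=(1, 2)),
                   F_m.mean(axis=(1, 2)), F_m.max(axis=(1, 2))]
    # One CLR block per descriptor; a 1x1 conv on a Cx1x1 tensor is a matmul.
    # ASSUMPTION: the four CLR outputs are summed.
    hidden = sum(leaky_relu(params["clr"][i] @ d)
                 for i, d in enumerate(descriptors))
    W = sigmoid(params["cs"] @ hidden)[:, None, None]  # each entry in [0, 1]
    # ASSUMPTION: W weights the masked map and (1 - W) weights the full map.
    return W * F_m + (1.0 - W) * F, W
```

With an all-ones mask the module reduces to the identity on `F`, which gives a quick sanity check of the weighted-sum formulation.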

D. CLASSIFICATION AND LOCALIZATION
First, we extract a feature map from the input image with a CNN backbone; we use DenseNet-121 [39] for feature extraction. We then use APAM to generate attention maps from the extracted feature map. Using the image-specific chest ROI mask described previously, APAM generates the ROI attention map $A_{ROI}$, where $W_{ROI}$ denotes the weight generated by APAM from the feature map $F$ and the masked feature map $F_{ROI}$. Next, we use $K$ disease-specific anatomy prior probability maps ($K$ = number of abnormalities) with APAM to generate $K$ disease-specific attention maps $A^c_p$, where $W^c_p$ denotes the weight generated by APAM from the feature map $F$ and the masked feature map $F^c_p$ of abnormality $c$. To predict the probability of each disease, the image-specific ROI attention map and the disease-specific attention map of that disease are concatenated channel-wise to produce a disease-specific concatenated map $A^c_{cat} \in \mathbb{R}^{2C \times H \times W}$. Each concatenated map is passed through its own global pooling layer and then a $1 \times 1$ convolutional layer to generate the probability of that disease. We use the same convolutional layers on the concatenated feature maps to generate the individual heatmaps using the CAM method. The schematic of the proposed ThoraX-PriorNet architecture is shown in Fig. 1.
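The relationship between the per-disease classifier and its CAM heatmap can be sketched as follows. This is an illustrative reading, assuming global average pooling (the text only says "global pooling") and a single output channel per disease; `disease_head` and its arguments are names introduced here.

```python
import numpy as np

def disease_head(A_roi, A_prior, w, b):
    """A_roi, A_prior: (C, H, W); w: (2C,) 1x1-conv weights; b: scalar bias."""
    A_cat = np.concatenate([A_roi, A_prior], axis=0)   # (2C, H, W)
    pooled = A_cat.mean(axis=(1, 2))                   # global average pooling
    logit = float(w @ pooled + b)                      # raw class score
    cam = np.tensordot(w, A_cat, axes=1)               # (H, W) heatmap
    return logit, cam
```

Under these assumptions the spatial mean of the heatmap plus the bias equals the logit, which is why the same 1×1 convolution weights can serve both classification and localization.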

E. LOSS FUNCTION
We concatenate the predicted raw values from each of the pathology-specific classifiers and pass them through a sigmoid layer to generate the probabilities, $p^s = [p^s_1, \ldots, p^s_i, \ldots, p^s_c]$. Here, $c$ represents the number of pathologies present in a dataset. The ground truth of each chest X-ray is expressed as a $c$-dimensional label vector, $L = [l_1, \ldots, l_i, \ldots, l_c]$, where $l_i \in \{0, 1\}$ denotes whether pathology $i$ is present (1) or absent (0). We optimize the weight parameters of our model by minimizing the binary cross-entropy loss between the predicted probabilities $p^s$ and the label vector $L$.
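As a small worked example, the multi-label binary cross-entropy can be computed as below. The averaging over pathologies and the clipping constant `eps` are assumptions made for this sketch, since the normalization used by the authors is not stated in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(logits, labels, eps=1e-7):
    """Mean binary cross-entropy over pathologies (averaging is an assumption)."""
    total = 0.0
    for z, l in zip(logits, labels):
        p = min(max(sigmoid(z), eps), 1.0 - eps)   # clip for numerical safety
        total -= l * math.log(p) + (1 - l) * math.log(1.0 - p)
    return total / len(labels)
```

A logit of zero for every class gives a probability of 0.5 and therefore a loss of ln 2 regardless of the labels.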

IV. IMPLEMENTATION DETAILS

A. DATA RESOURCES
We evaluate the proposed ThoraX-PriorNet architecture on the NIH ChestX-Ray14 and CheXpert datasets. These data resources are briefly described below.
NIH ChestX-Ray14: The NIH ChestX-Ray14 dataset contains 112,120 frontal chest X-ray images from 30,805 unique patients [22]. All these images are annotated for 15 classes (14 diseases along with "No Findings"). Within this dataset, 880 images are additionally annotated with bounding boxes for the localization of 8 diseases. In our classification experiments, we use 70%, 10%, and 20% of the data for training, validation, and testing, respectively. We train and test our model on the classification data for all 15 classes. We use the bounding-box annotated data of the 8 classes to assess the disease localization performance of our model. Note that there is no patient overlap among the training, validation, and testing sets. The 880 images with bounding-box information are not included in the training or validation splits.
CheXpert: The CheXpert dataset [40] is a chest X-ray dataset containing class label annotations for 14 classes (13 diseases along with "No Findings"). In addition to positive and negative labels for each class, the dataset also contains an uncertainty label for some images. The dataset consists of 224,316 chest X-ray images for training and 230 chest X-ray images for validation. We use only frontal-view chest X-ray images from this dataset, which amounts to about 200,000 images for training and 200 images for validation. We use this dataset for the classification of five thoracic diseases, namely atelectasis, cardiomegaly, consolidation, edema, and pleural effusion.

B. DATA PREPARATION
The chest X-ray images in a dataset generally exhibit diverse variations, such as rotations, shifts, and different scales, making it challenging for deep learning models to localize the lesion areas. To address this problem, we utilize the alignment module [41] to perform spatial alignment on all the images as well as on the bounding-box images used for generating abnormality masks. Given the input image $I$, the alignment module $\phi$ transforms $I$ to $\phi(I)$. The canonical chest X-ray image, known as the target image $T$, is generated by randomly selecting two thousand normal chest X-ray images and averaging them into a single image. To provide $\phi(I)$ with an aligned structure, we minimize the feature reconstruction loss [42] between $\phi(I)$ and $T$. The backbone of the alignment module is a ResNet-18 architecture. The output of the alignment network is a set of affine transformation parameters. Finally, the affine transformation is applied to the original chest X-ray images to generate aligned chest X-ray images. Fig. 6 shows some examples of original and aligned X-ray images.
We first normalize the pixel values of the chest X-ray images with the mean and standard deviation of pixels from the ImageNet dataset [43]. Next, we resize the image to 586×586 pixels. Afterward, the training images are randomly cropped to 512×512 pixels [30], [44]. The validation and test images are center-cropped to 512×512 pixels. We use the same resizing and cropping method for the corresponding anatomy prior maps and chest ROI masks. Following [44], [45], we use test-time augmentation, taking the average of the probabilities of ten cropped sub-images (four corner crops, one central crop, and their horizontally flipped versions) as the final prediction. For CheXpert dataset preparation (image augmentation, dealing with class imbalance, uncertain labels, etc.), we use the same procedure described in [14]. We use the same disease-specific anatomy prior maps computed from the NIH dataset for the CheXpert dataset.
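The ten-crop test-time augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' code: `predict` stands for any hypothetical function that maps a crop to a vector of class probabilities.

```python
import numpy as np

def ten_crops(image, size):
    """image: (H, W) or (H, W, C) array -> list of ten size x size crops."""
    h, w = image.shape[:2]
    corners = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]     # four corners + center
    crops = [image[t:t + size, l:l + size] for t, l in corners]
    return crops + [c[:, ::-1] for c in crops]         # add horizontal flips

def tta_predict(image, predict, size=512):
    """Average the predicted probabilities over the ten crops."""
    return np.mean([predict(c) for c in ten_crops(image, size)], axis=0)
```

For a 586×586 input and `size=512`, each crop has the same resolution as the training crops.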

C. TRAINING PARAMETERS
Table 1 shows the hyperparameters used for training and evaluation of the deep learning model. These include the number of epochs, batch size, loss function, optimizer, learning rate, learning rate scheduler, and weight decay rate. We use an exponential moving average scheme with an alpha rate of 0.997 for updating the model weights. In addition, we perform gradient accumulation with a step of eight iterations.
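The interplay of gradient accumulation and the exponential moving average can be sketched with a toy training loop. Several details are assumptions made here: plain gradient descent stands in for the optimizer listed in Table 1, the EMA is updated once per optimizer step using the convention `ema = alpha * ema + (1 - alpha) * w`, and `grad_fn` is a hypothetical callback returning gradients for a batch.

```python
def ema_update(ema_weights, weights, alpha=0.997):
    """Blend the running average with the current weights."""
    return [alpha * e + (1.0 - alpha) * w for e, w in zip(ema_weights, weights)]

def train(grad_fn, batches, weights, lr=0.1, accum_steps=8, alpha=0.997):
    ema = list(weights)
    accum = [0.0] * len(weights)
    for step, batch in enumerate(batches, start=1):
        grads = grad_fn(weights, batch)
        # Average gradients over `accum_steps` mini-batches.
        accum = [a + g / accum_steps for a, g in zip(accum, grads)]
        if step % accum_steps == 0:        # one optimizer step per 8 batches
            weights = [w - lr * a for w, a in zip(weights, accum)]
            accum = [0.0] * len(weights)
            ema = ema_update(ema, weights, alpha)
    return weights, ema
```

Accumulating over eight iterations emulates a batch eight times larger than what fits in memory, while the EMA copy provides a smoothed set of weights.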

D. ACTIVATION MAP AND BOUNDING BOX GENERATION
We use class activation maps (CAM) for heatmap generation. To generate bounding boxes from the heatmap for evaluating localization performance, we first convert the activation map to a binary mask using binary thresholding with a threshold value of 127. Next, we use the algorithm introduced by [46] to find the contours of the regions inside the binary mask, and we construct bounding boxes around the contours by taking the extreme boundary values of the contours as the edges of the bounding boxes.
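A simplified version of this heatmap-to-box step is sketched below. It departs from the described procedure in two labeled ways: the heatmap is min-max rescaled to the 0-255 range before thresholding (an assumption implied by the threshold of 127), and a single box is drawn around all above-threshold pixels rather than one box per contour found with the algorithm of [46].

```python
import numpy as np

def heatmap_to_box(heatmap, threshold=127):
    """heatmap: (H, W) array -> (x_min, y_min, x_max, y_max), inclusive, or None."""
    lo, hi = heatmap.min(), heatmap.max()
    if hi == lo:                                   # flat map: nothing to localize
        return None
    scaled = 255.0 * (heatmap - lo) / (hi - lo)    # ASSUMPTION: 8-bit rescaling
    ys, xs = np.nonzero(scaled > threshold)        # binary thresholding
    if ys.size == 0:
        return None
    # Extreme coordinates of the activated region form the box edges.
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

A per-contour variant would apply the same extreme-value rule to each connected region separately.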

E. EVALUATION METRICS
We use ROC-AUC (Receiver Operating Characteristic-Area Under Curve), also abbreviated as AUC, to measure the classification performance of our model on the NIH test data. Furthermore, to report the localization performance of our models on the 880 bounding-box annotated images of the NIH dataset, we use the ratio of the number of cases with correct localization to the total number of cases in each class. Here, we use the IoU (Intersection over Union) between the predicted bounding box and the ground truth to detect correct localization, following prior work [22], [25], [47].
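The localization metric can be written out as a short sketch. The box format `(x_min, y_min, x_max, y_max)` with exclusive maximum edges and the use of `>=` when comparing against the threshold are assumptions made here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(predicted, ground_truth, threshold):
    """Fraction of cases whose predicted box reaches the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)
```

Evaluating `localization_accuracy` at thresholds from 0.1 to 0.7 yields the kind of accuracy-per-threshold figures reported for the localization task.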

A. DISEASE CLASSIFICATION

1) Ablation Study
We conducted several ablation studies of our trained model on the NIH ChestX-ray14 dataset for different thoracic abnormalities. First, we evaluate the impact of the attention masks, i.e., the probabilistic abnormality mask and the chest ROI mask, on the classification performance. Table 2 shows the reported results. The baseline denotes the vanilla DenseNet-121 model without the APAM block. The baseline model showed a mean AUC (%) score of 84.30 and performed better for classifying diseases like Atelectasis, Nodule, Consolidation, and Edema. Afterward, we added the APAM block and gradually introduced the different types of attention masks. Table 2 demonstrates that all three ThoraX-PriorNet variants achieve better classification scores than the baseline. We obtained the most significant jump in classification results when we used APAM with the probabilistic disease-specific masks, i.e., an AUC (%) score of 84.69. Incorporating both types of attention masks yields a slightly lower AUC (%) score of 84.67, but it improves the performance for pathologies like Effusion, Infiltration, Fibrosis, and Pleural Thickening.
Abbreviations used in the tables: Atel = Atelectasis, Card = Cardiomegaly, Effu = Effusion, Infi = Infiltration, Nodu = Nodule, Pne1 = Pneumonia, Pne2 = Pneumothorax.
Next, we conducted an ablation study to explore the impact of the input image size on the classification performance. We resize the input image to three different sizes, 256×256, 420×420, and 586×586, and crop 224×224, 368×368, and 512×512 patches from them as inputs, respectively (random crop during training, center crop during inference). Results on the NIH Chest X-ray dataset are shown in Table 3. We can see that increasing the input image resolution improves the classification performance. However, the improvement from 368×368 to 512×512 is smaller than that from 224×224 to 368×368. More specifically, we observe that the increase in AUC score for small lesions, such as nodules, is significant at the higher resolution. Finally, we conducted an ablation study on the spatial dimension of the feature map and the attention masks. We downscale the attention masks and upsample the feature maps to an intermediate size before using them in the APAM block. For the input image dimension of 512×512, the feature map size is 16×16. We also performed experiments by resizing the feature map to 32×32 and 48×48. The results are reported in Table 4. We observe that increasing the spatial dimension of the final feature map does not yield improvements in the classification performance. The 48×48 model has the same level of classification performance as the 16×16 model. However, the 48×48 model improves the localization performance, which will be demonstrated in a later section.

2) Performance Comparison with SOTA Methods
Table 5 compares the AUC score of ThoraX-PriorNet with other state-of-the-art (SOTA) models on the NIH ChestX-ray14 dataset. Here, we observe that the proposed model's performance is superior to existing SOTA methods in terms of the mean AUC score. More specifically, it shows performance improvements for diseases such as Atelectasis, Effusion, Infiltration, Mass, Consolidation, Edema, and Pleural Thickening.
Table 6 shows the comparison of our proposed model with existing state-of-the-art models on the CheXpert dataset. Here, we have used the same probabilistic masks that were generated for training on the NIH Chest X-ray dataset and for providing disease-guided attention. The results show that the proposed method provides superior results for diseases like cardiomegaly and edema, whereas its performance on atelectasis, consolidation, and effusion is slightly lower than that of the compared approaches. However, the overall mean AUC score is better than that of the other models. Our method achieves an AUC score of 90.62%.

1) Ablation Study
We have also conducted several ablation studies on the NIH ChestX-ray14 dataset to explore the impact of different aspects of our trained model on localization performance. First, we evaluate the impact of different types of attention masks. The results are reported in Table 7. We can observe a notable performance improvement after including the APAM module. Our proposed ThoraX-PriorNet outperforms the baseline model by large margins at all T(IoU) thresholds. The APAM block utilizing both disease probabilistic maps and chest ROI maps achieves overall better results, especially at the higher thresholds, compared to the APAM block using only one type of attention mask.
The impact of different input image resolutions on the localization performance is demonstrated in Table 8. We can observe that increasing the spatial dimension of the input image greatly enhances the localization performance. More specifically, we observe that increasing the spatial dimension shows greater performance improvement in the localization tasks for diseases with small spatial features (e.g., mass, nodule, pneumothorax). However, large lesions, such as cardiomegaly, do not benefit. The impact on localization performance of different intermediate sizes of the feature maps and attention maps is reported in Table 9. Similar to the input spatial dimension, we can observe that the 48×48 model achieved overall better localization performance compared to the other models.

2) Performance Comparison with SOTA Methods
Table 10 shows the quantitative comparison of the localization score of ThoraX-PriorNet with previous SOTA models. Note that Han et al. and Rozenberg et al. utilize bounding box information in their pipelines. As a result, their models are not directly comparable to ours and other SOTA models. In spite of that, our proposed method shows comparable performance at lower T(IoU) thresholds despite not using bounding box supervision. Our proposed ThoraX-PriorNet achieved improvements of 2.56%, 18.87%, 22.50%, and 6.45% at
We have extracted the activation maps for eight different diseases from the NIH ChestX-ray8 dataset and plotted them in Fig. 8 to visualize the localization of the proposed model. The red boxes denote the ground truth boxes, while the green boxes denote the predicted boxes. We can observe that our model can identify and localize the abnormal findings.
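To make the T(IoU)-based localization accuracy concrete, a minimal Python sketch is given here. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples and that a prediction counts as correct when its IoU with the ground truth box is at least the threshold; both conventions, as well as the function names, are assumptions for illustration and may differ from the evaluation code used for the reported results.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(pairs, thresholds):
    """Fraction of (ground_truth, predicted) box pairs with IoU >= each threshold.

    The ">=" hit criterion is an assumption made for this sketch.
    """
    if not pairs:
        raise ValueError("at least one box pair is required")
    ious = [iou(gt, pred) for gt, pred in pairs]
    return {t: sum(1 for v in ious if v >= t) / len(ious) for t in thresholds}
```

Calling `localization_accuracy(pairs, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])` on a list of box pairs would give one accuracy value per threshold, in the same layout as the threshold-wise scores discussed in this section.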

C. STATISTICAL ANALYSIS
To perform statistical analysis, we have conducted a 10-fold cross-validation and used Nadeau and Bengio's corrected t-test method [61] for calculating the p-values. The results for the baseline and the proposed method are reported in Table 11. The baseline model achieves an average AUC (%) score of 84.41 with a standard deviation of 0.26, while our proposed method achieves 84.61±0.27. The statistical test yields a p-value of 0.017, indicating a statistically significant improvement of the proposed method over the baseline.
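For readers unfamiliar with the corrected resampled t-test, the sketch below shows the commonly used form of the statistic, in which the variance of the per-fold score differences is scaled by (1/k + n_test/n_train) rather than 1/k. The function name and argument layout are assumptions for illustration; the per-fold scores are not listed in this section and have to be supplied by the user.

```python
import math
import statistics


def corrected_resampled_t(scores_a, scores_b, test_train_ratio):
    """Corrected resampled t-statistic for paired per-fold scores.

    scores_a, scores_b : per-fold scores of the two models (same folds).
    test_train_ratio   : n_test / n_train, e.g. 1/9 for 10-fold cross-validation.
    Returns (t_statistic, degrees_of_freedom).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (k - 1 denominator)
    corrected_var = (1.0 / k + test_train_ratio) * var_diff
    if corrected_var == 0:
        raise ValueError("score differences have zero variance")
    return mean_diff / math.sqrt(corrected_var), k - 1
```

A two-sided p-value can then be read from a Student t distribution with k − 1 degrees of freedom, for instance with `2 * scipy.stats.t.sf(abs(t), df)` if SciPy is available.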

D. COMPUTATIONAL COMPLEXITY ANALYSIS
The average time to process a single chest X-ray image during the testing phase, along with the floating point operation count, for the input image dimension of 512×512 is reported in Table 12. Our proposed ThoraX-PriorNet takes an average of 8.74 ms to process a test image and requires 28.1 GFLOPs of computation to perform this task.

E. ANALYSIS OF GENERALIZABILITY OF THE PROBABILISTIC ABNORMALITY MASKS
Different chest X-ray-based thoracic disease datasets may have diverse affine variations, such as rotations, shifts, and different scales. To address the affine variations, we have utilized the alignment module [41]. In addition, the chest X-ray datasets may have intrinsic variations among them due to patient demographics, geographical diversity, class imbalances, different exposure settings and imaging protocols, scanner intrinsic variations, and so on, inherent to medical datasets. However, in our experiments, we are not utilizing or training on different datasets together, a task that is reserved for domain adaptation and generalization methods [62]-[64]. Here, we generate the disease-prior masks by aggregating the referenced spatial positions from the bounding boxes to obtain a probabilistic map. The domain variations due to exposure shift, different imaging protocols, or machine intrinsic variations are not propagated through the generated abnormality masks. However, we do acknowledge that the number and quality of the ground truth bounding boxes, class imbalance, patient demographics, or geographical diversity may have an effect on the generated probabilistic map, which may influence the performance of the proposed model.
We have conducted experiments to evaluate the generalizability of the disease-prior probabilistic abnormality masks generated from a particular thoracic disease dataset. For this experiment, we have chosen the NIH chest X-ray14 [22] and the VinDr-CXR [65] datasets, as they provide bounding box annotations. We have performed the experiment for the six common pathologies between them, i.e., Atelectasis, Cardiomegaly, Pleural Effusion, Infiltration, Nodule/Mass,
First, we train the vanilla DenseNet-121 model on both datasets without using the aligned images. Afterward, we train the vanilla DenseNet-121 with the aligned images. Finally, we train our proposed ThoraX-PriorNet, utilizing the dataset-specific abnormality masks from the NIH chest X-ray14 and VinDr-CXR datasets, one at a time. The results are reported in Table 13. The average improvement is calculated over the thresholds, where $n$ is the number of thresholds, $S_i$ is the performance at a particular threshold $i$, and $S_i^{\mathrm{ref}}$ is the performance of the vanilla DenseNet-121 at threshold $i$.
We can observe that adding the alignment module improves the performance of the vanilla DenseNet-121 on both datasets. Our proposed ThoraX-PriorNet achieves significantly improved scores compared to the vanilla DenseNet-121 using either set of abnormality masks. However, we can notice that the disease-prior masks from the VinDr-CXR dataset yield the highest performance in both cases. In particular, on the VinDr-CXR test dataset, the improvement for ThoraX-PriorNet is 39.77% with the VinDr-CXR disease-prior masks, compared to 17.52% with the NIH chest X-ray14 disease-prior masks. We hypothesize that this is due to two reasons. First, the quality of the probabilistic maps differs, as VinDr-CXR has a much higher number of available bounding box annotations. Second, the demographic and class ratio differences between NIH chest X-ray14 and VinDr-CXR may have an effect on the performance. Nevertheless, considering the average improvement in performance compared to the vanilla DenseNet-121 with and without aligned images, our proposed model can achieve a significant improvement with either of the disease-prior probabilistic abnormality masks, demonstrating the efficacy of utilizing the APAM block.
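The exact expression for the average improvement is not reproduced in this text. One reading that is consistent with the stated definitions of n, S_i, and the reference score, and with improvements being quoted in percent, is the mean relative improvement over the reference across thresholds. The sketch below implements that reading only; it is an assumption, not a confirmed restatement of the formula behind Table 13, and the function name is hypothetical.

```python
def average_improvement(scores, ref_scores):
    """Mean relative improvement over a reference, in percent (assumed form).

    scores     : S_i, performance of the evaluated model at each threshold i.
    ref_scores : reference performance (vanilla DenseNet-121) at the same thresholds.
    """
    if len(scores) != len(ref_scores) or not scores:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores)
    relative = [(s - r) / r for s, r in zip(scores, ref_scores)]
    return 100.0 * sum(relative) / n
```

Under this reading, a model scoring 0.6 and 0.3 against reference scores of 0.5 and 0.2 would show relative gains of 20% and 50%, for an average improvement of 35%.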

F. ROC CURVES
The performance of clinical diagnostic systems is primarily measured by their specificity and sensitivity. ROC curves are generally used to assess the diagnostic performance of a clinical system by converting the continuous test results into a decision on the presence or absence of pathology and to demonstrate the trade-off between clinical sensitivity and specificity for every possible cut-off for the clinical test. The ROC curves for each pathology on the NIH chest X-ray dataset are shown in Fig. 10 to visually represent the diagnostic performance of the proposed method.
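The construction of an ROC curve from continuous scores can be sketched in a few lines of Python. Each distinct score is used as a cut-off, the resulting sensitivity (true positive rate) and 1 − specificity (false positive rate) form one point of the curve, and the area under the curve is obtained with the trapezoidal rule. The function names and the toy inputs in the test are illustrative assumptions, not the evaluation code used for Fig. 10.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per cut-off.

    labels : 0/1 ground truth for one pathology.
    scores : predicted probabilities for the same images.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    # Sweep the cut-off from the highest score down to the lowest.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Multiplying the returned area by 100 gives the percentage AUC convention used for the classification results in this article.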

G. ANALYSIS OF RANDOM CROPPING AUGMENTATION
We have utilized random cropping augmentation following previous studies [30], [66], as it has shown improved performance in thoracic disease detection in the literature. In addition, we have also performed the alignment of images (where the images are transformed to align their spatial structure with the anchor image [41]) to ensure that the random cropping technique reliably encompasses all regions of interest within the images. The anchor image is constructed by taking an average of 2000 normal images. In Fig. 11, we plot five different random cropping windows of size 512×512 on the anchor image of 586×586 dimensions (the four outermost corners and one centered). We can observe that the random cropping windows can encompass the region of interest.
We have also conducted experiments to assess the impact of random cropping augmentation on the performance. The results are reported in Table 14. We can observe that the model performs better when random cropping is utilized. This is intuitive because the random cropping technique significantly augments the training data. Another benefit of utilizing random cropping during training is that we can use test time augmentation (TTA) that consists of different cropping windows. We have followed the procedure mentioned in [44], [45] and applied TTA based on random cropping, i.e., utilizing the average probabilities of ten cropped sub-images (four corner crops, one central crop, and their horizontally flipped versions) as the final prediction. The results are reported in Table X. We can observe that TTA with random cropping can enhance the performance further.
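The ten-crop test time augmentation described above can be sketched as follows. The sketch builds the four corner crops and the center crop, adds the horizontal flip of each, and averages the per-class probabilities over the ten views. Images are represented as nested lists, and `predict` stands for any callable that maps a crop to a list of class probabilities; both are assumptions made for illustration and do not reflect the authors' code.

```python
def five_crop_offsets(size, crop_size):
    """(top, left) offsets of the four corner crops and the center crop."""
    margin = size - crop_size
    center = margin // 2
    return [(0, 0), (0, margin), (margin, 0), (margin, margin), (center, center)]


def crop_patch(image, top, left, crop_size):
    """Cut a square patch out of a 2-D image stored as a list of rows."""
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


def hflip(image):
    """Horizontally flip a 2-D image stored as a list of rows."""
    return [row[::-1] for row in image]


def ten_crop_predict(image, crop_size, predict):
    """Average class probabilities over five crops and their horizontal flips.

    `predict` is a hypothetical model wrapper: crop -> list of probabilities.
    """
    views = []
    for top, left in five_crop_offsets(len(image), crop_size):
        patch = crop_patch(image, top, left, crop_size)
        views.append(patch)
        views.append(hflip(patch))
    probs = [predict(view) for view in views]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
```

For a 586×586 image and a 512×512 crop, the five offsets are (0, 0), (0, 74), (74, 0), (74, 74), and (37, 37), which matches the four corner windows and the centered window plotted on the anchor image.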

H. DISCUSSIONS
We make several observations by analyzing the extensive experimental evaluation results described in the previous sections. Our studies show that incorporating attention mechanisms like those in the proposed ThoraX-PriorNet can enhance the performance of thoracic disease classification and localization. The classification AUC has improved from 84.30% to 84.67% with the inclusion of both chest ROI mask and disease-specific mask-based attention in the ThoraX-PriorNet architecture. The improvement in localization is more noticeable: from 0.74 to 0.80 with an IoU threshold of 0.1, 0.56 to 0.63 with an IoU threshold of 0.2, 0.41 to 0.49 with an IoU threshold of 0.3, 0.26 to 0.33 with an IoU threshold of 0.4, 0.14 to 0.22 with an IoU threshold of 0.5, 0.07 to 0.11 with an IoU threshold of 0.6, and 0.03 to 0.04 with an IoU threshold of 0.7. We can also observe that utilizing increased input image spatial resolution or increased feature map dimension shows more notable performance improvement in the localization tasks for diseases with small spatial features (e.g., mass, nodule, pneumothorax).
In addition, we have performed the statistical analysis and found the results statistically significant. We have also conducted experiments on the generalizability of the disease-specific prior probabilistic abnormality masks generated from a specific dataset. We observe that although the quality and quantity of the ground truth boxes can affect the generated probabilistic map, our proposed attention mechanism based on the disease-specific probabilistic abnormality masks can achieve superior performance compared to the vanilla deep learning architecture.

VI. CONCLUSION
In this work, we present a novel architecture, ThoraX-PriorNet, which provides attention using disease-specific anatomy prior probability maps and chest ROI masks to simultaneously address the CXR image classification and abnormality localization problems. We evaluated our method on two publicly available datasets, NIH ChestX-ray14 and Stanford CheXpert, and compared the results with recent state-of-the-art methods. Extensive experiments show that ThoraX-PriorNet performs better by a good margin when considering both classification and localization tasks in a single model, and also when evaluated on multiple datasets.

FIGURE 1. A schematic of the proposed ThoraX-PriorNet architecture for disease classification from CXR utilizing both lung segmentation attention and disease-specific attention. The model consists of three components: the lung segmentation attention module, the disease-specific attention module, and the concatenation for classification. The lung segmentation U-Net model generates a lung ROI mask, which is then used to provide lung-mask-guided attention. The disease-specific probability mask is used along with the feature map to provide disease-specific attention. Finally, the concatenated feature maps are used to make the final disease classification.

FIGURE 2. Disease-specific anatomical prior probability maps generated for the 8 diseases for which the bounding box annotations are available in the NIH dataset.

FIGURE 3. A flow-diagram of the chest ROI mask generation module.

FIGURE 4. Examples of some generated chest ROI masks. Top panel: Example input CXR images. Bottom panel: Corresponding chest ROI masks of the example CXR images.

FIGURE 5. A schematic diagram of the Anatomy Prior Attention Module (APAM): A) The mask is multiplied with the feature map to generate a masked feature map; B) The feature map and the masked feature map are used to produce a channel weighting vector. Here, GMP = Global Max Pooling and GAP = Global Average Pooling; C) The channel weighting vector is used to produce the final weighted feature map, i.e., the attention map.

FIGURE 6. Examples of original chest X-ray images and aligned chest X-ray images.

FIGURE 7. Illustration of the training and validation loss and AUC curves on the NIH ChestX-Ray14 dataset.

FIGURE 8. Examples of some disease localization by our proposed method. The first column of each sample: Input CXR image with the ground truth bounding box (red color) and the predicted bounding box (green color). The second column of each sample: Corresponding activation map from the proposed model.
(a) Disease-specific anatomical prior probability maps generated from the NIH Chest X-ray14 dataset (b) Disease-specific anatomical prior probability maps generated from the VinDr-CXR dataset

FIGURE 11. Five different random cropping windows on the anchor image. The red window represents the random cropping window.
MD. IQBAL HOSSAIN earned his Bachelor of Science degree in Biomedical Engineering from Bangladesh University of Engineering and Technology (BUET) in 2022. Since mid-2022, he has served as a Research Assistant at the mHealth Lab within the Biomedical Engineering department at BUET, Bangladesh. Subsequently, in 2023, he embarked on his Ph.D. journey in Imaging Science at Washington University in St. Louis. His research focuses on explainable artificial intelligence and medical computer vision.
MOHAMMAD ZUNAED (Student member, IEEE) completed his B.Sc. and M.Sc. in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology. He is currently working as a research assistant at the mHealth Lab, Bangladesh University of Engineering and Technology, under the supervision of Dr. Taufiq Hasan. Earlier, he worked as a lecturer in the Electrical and Electronic Engineering department at the Daffodil International University.
MD. KAWSAR AHMED received his B.Sc. degree in Biomedical Engineering from Bangladesh University of Engineering and Technology (BUET), Bangladesh, in 2021. He is working as a Lecturer in the Department of Biomedical Engineering, BUET, Bangladesh. His research interests include machine learning / AI for biomedical engineering, medical imaging, and medical instrumentation and device design.

TABLE 1. Hyperparameters of the deep learning model used for training and evaluation.

TABLE 2. Ablation Study: Impact of different types of attention masks on the AUC (%) scores of our trained models on the NIH dataset. The best results are shown in red font.

TABLE 3. Ablation Study: Impact of input image spatial resolution on the AUC (%) scores of our trained models on the NIH dataset. The best results are shown in red font.

TABLE 4. Ablation Study: Effect of resizing feature and anatomy prior maps on the AUC (%) scores of our trained models on the NIH dataset. The best results are shown in red font.

TABLE 5. Comparison of AUC (%) scores of our best performing model with state-of-the-art methods on the NIH dataset. The best results are shown in red font.

TABLE 6. Comparison of disease classification AUC scores (%) of the proposed model and SOTA models on the CheXpert dataset. The best results are shown in red font.

TABLE 7. Ablation Study: Impact of different types of attention masks with respect to disease localization performance using different T(IoU) thresholds on the NIH dataset. The best results are shown in red font.

TABLE 9. Ablation Study: Effect of resizing feature and anatomy prior maps with respect to disease localization performance using different T(IoU) thresholds on the NIH dataset. The best results are shown in red font.

TABLE 10. Comparison of disease localization accuracy of the best performing proposed model with state-of-the-art methods. The best results are shown in red font.

TABLE 11. Statistical analysis between the baseline and proposed model for a 10-fold cross-validation using Nadeau and Bengio's corrected t-test method [61].

TABLE 12. Computational cost parameters of ThoraX-PriorNet for a single image of 512×512 dimension during the test phase on the NIH ChestX-Ray14 dataset.

TABLE 13. Evaluation of the generalizability of the disease-specific anatomical prior probability maps across different thoracic disease datasets.

TABLE 14. Effect of random cropping augmentation on our proposed method with or without test time augmentation.