Deep Learning Radiographic Assessment of Pulmonary Edema: Training with Serum Biomarkers

A major obstacle faced when developing convolutional neural networks (CNNs) for medical imaging is the acquisition of training labels: most current approaches rely on manually prescribed labels from physicians, which are time-consuming and labor-intensive to attain. Clinical biomarkers, often measured alongside medical images and used in diagnostic workup, may provide a rich set of data that can be collected retrospectively and utilized to train diagnostic models. In this work, we focused on the blood serum biomarkers BNP and BNPP, indicative of acute heart failure (HF) and cardiogenic pulmonary edema, paired with the chest X-ray imaging modality. We investigated the potential for inferring BNP and BNPP from chest radiographs. For this purpose, a CNN was trained using 27748 radiographs to automatically infer BNP and BNPP, and achieved strong performance (AUC = 0.90, SEN = 0.88, SPEC = 0.81, r = 0.79). Since radiographic features of pulmonary edema may not be visible on low-resolution images, we also assessed the impact of image resolution on model learning and performance, comparing CNNs trained at five image sizes (64 × 64 to 1024 × 1024). Although comparable AUC values were obtained at different resolutions, our experiments using three activation mapping techniques (saliency, Grad-CAM, XRAI) revealed considerable growth of in-lung attention with increased resolution. The highest-resolution models focus attention on the lungs, as is necessary for radiographic diagnosis of pulmonary edema. Our results emphasize the need to utilize radiographs of near-native resolution for optimal CNN performance, a property not fully captured by summary metrics like AUC.


Introduction
Pulmonary edema is a condition characterized by excess fluid in the lungs, often caused by congestive heart failure (HF), among other etiologies (Staub, 1974; Murray, 2011). Due to their wide availability and ability to suggest alternative diagnoses with similar features, chest radiographs are commonly used to monitor the progression of pulmonary edema (Hammon et al., 2014; Halperin et al., 1985). However, radiographic assessment of pulmonary edema is a challenging visual task, especially in mild and moderate cases, even for expert subspecialty cardiothoracic radiologists. Accurate assessment of pulmonary edema is crucial for guiding and monitoring response to treatment.

Recently, several groups (Lakhani and Sundaram, 2017; Wang et al., 2020; Hurt et al., 2020a,b; Hwang et al., 2019; Rajpurkar et al., 2018) have reported the application of deep convolutional neural networks (CNNs) to classify chest radiographs for various pathologies, including pneumonia, pulmonary edema, pneumothorax, and many others. While these early works show the promise of CNNs for radiographic interpretation, most lack the specificity and granularity of diagnosis typically required for diagnostic utility. One obstacle that impedes the development of CNNs for analysis of medical images is the need to assemble ground truth data based on expert opinion. This can be labor- and time-intensive, and for challenging tasks like assessment of pulmonary edema, it can be difficult to ensure the reliability of image annotation.

Herein, we explored the potential to infer B-type natriuretic peptide (BNP) and NT-pro B-type natriuretic peptide (BNPP) from chest radiographs (Figure 1), proposing an objective source of ground truth for training CNNs to perceive variations in the severity of pulmonary edema. BNP and BNPP are biomarkers measured from blood serum and may be included in the diagnostic workup of suspected cardiogenic pulmonary edema (Ware and Matthay, 2000).
Elevated values of BNP and BNPP are indicative of atrial stretch, observed in acute heart failure and pulmonary edema (Huang et al., 2016; Ray et al., 2005).
We further observed that in the published literature, many CNN algorithms have been trained and evaluated on low-resolution images, commonly provided in public databases (Pan et al., 2019; Jaeger et al., 2014; Seah et al., 2019). However, many of the radiographic characteristics of pulmonary edema, including interstitial Kerley B lines and peribronchial cuffing, lie near the native resolution of chest radiographs. In this work, we investigated the ability of CNNs to infer BNP/BNPP from chest radiographs of pulmonary edema when trained using different image sizes (64 × 64 to 1024 × 1024).

Dataset
With institutional review board approval and a waiver of informed consent, we constructed a dataset of 27748 frontal chest radiographs with BNP or BNPP laboratory values from 16401 patients at our institution. We included all radiographs and laboratory measurements from Nov 4th 2014 to Dec 1st 2020 for patients who underwent measurement of either BNP or BNPP.

CNN Training
Two-Stage Training: A two-stage pipeline was used to train a bifurcated CNN to jointly predict BNP and BNPP, shown in Figure 2. All CNNs were trained using the Adam optimizer with a fixed learning rate of 1e-5 for 50 epochs and a batch size of 16. In the first stage of training, a ResNet152v2 model (He et al., 2016), pretrained on the ImageNet dataset (Deng et al., 2009), was modified to infer BNPP from a chest radiograph. For training, a custom loss function based on mean absolute error (MAE) was used. Given a dataset of n input radiographs, we defined the loss over the dataset as:

MAE_BNPP = (1/n) Σ_{k=1}^{n} AE^k_BNPP,    AE^k_BNPP = |ln(1 + y^k_BNPP) − ln(1 + ŷ^k_BNPP)|    (1)

where y^k_BNPP is the lab-measured BNPP value and ŷ^k_BNPP is the inferred BNPP value for the k-th input radiograph in the dataset. The BNPP values range from 0 to 70,000 pg/mL and are exponentially distributed, with a small number of values significantly higher than the mean. To account for this and prevent overfitting to outliers with the MAE loss, we used a log transformation of the measured and inferred BNPP values when calculating AE^k_BNPP (Cano-Espinosa et al., 2018).
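In code, Equation (1) amounts to an MAE over log1p-transformed values. A minimal NumPy sketch (the actual training used TensorFlow; the function name is ours):

```python
import numpy as np

def log_mae_loss(y_true, y_pred):
    """Mean absolute error between log-transformed BNPP values (Equation (1)).

    The log transform compresses the exponentially distributed BNPP range
    (0-70,000 pg/mL) so a few extreme outliers do not dominate the loss.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = ln(1 + x), as in Equation (1)
    return float(np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred))))

# Identical predictions give zero loss; outliers are damped by the log.
print(log_mae_loss([100.0, 5000.0], [100.0, 5000.0]))  # 0.0
```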
In the second stage of training, an additional fully connected layer was added at the last layer to predict both BNP and BNPP from a radiograph. The weights acquired from the first stage of training were frozen, except for the last fully connected layers. Both the BNP and BNPP datasets were used to train the model in stage 2. Because the BNP dataset was significantly smaller than the BNPP dataset (n = 1423 vs. n = 26667, respectively), a scheduler was used to balance the number of BNP-labeled and BNPP-labeled radiographs in each minibatch of training examples. This ensured that in each epoch the entire BNP training set was used (n = 1124), while an equal number of BNPP-labeled images was randomly sampled without replacement from the BNPP dataset. We further modified our custom MAE loss function from stage 1 (Equation (1)) to train both tasks simultaneously. To deal with missing BNP or BNPP measurements, we ignored the outcome with the missing measurement in the loss function using binary flags α_k and β_k:

(α_k, β_k) = (1, 1) if y^k_BNPP and y^k_BNP are available; (1, 0) if only y^k_BNPP is available; (0, 1) if only y^k_BNP is available; (0, 0) otherwise.    (2)

Training at Multiple Resolutions: To explore the effect of image resolution on model performance, we trained five CNNs with similar architectures at input resolutions of 64 × 64, 128 × 128, 256 × 256, 512 × 512, and 1024 × 1024. Images were cropped at their larger dimension to equal height and width, then downscaled to the desired resolution with bilinear interpolation using the Python OpenCV 4.5.1.48 library. A single NVIDIA V100 GPU was used to train the lower-resolution models (64 × 64 to 512 × 512), while 8 NVIDIA V100 GPUs from an NVIDIA DGX cluster, running in an NGC container on the Singularity runtime environment, were used to train the 1024 × 1024 CNN. Synchronous distributed training was performed using TensorFlow 2.1.0 with the mirrored strategy.
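The masked two-task loss with the binary flags α_k and β_k can be sketched as follows (a NumPy sketch; the normalization over the number of available labels is our assumption, and function names are ours):

```python
import numpy as np

def joint_log_mae(y_bnpp, y_bnp, p_bnpp, p_bnp):
    """Stage-2 loss: log-MAE over BNPP and BNP, masking missing labels.

    y_* hold measured values, with np.nan where the lab was not drawn;
    alpha/beta are the binary flags of Equation (2). Assumes at least
    one label is present in the batch.
    """
    y_bnpp, y_bnp = np.asarray(y_bnpp, float), np.asarray(y_bnp, float)
    p_bnpp, p_bnp = np.asarray(p_bnpp, float), np.asarray(p_bnp, float)
    alpha = (~np.isnan(y_bnpp)).astype(float)  # 1 if BNPP measured
    beta = (~np.isnan(y_bnp)).astype(float)    # 1 if BNP measured
    ae_bnpp = np.abs(np.log1p(np.nan_to_num(y_bnpp)) - np.log1p(p_bnpp))
    ae_bnp = np.abs(np.log1p(np.nan_to_num(y_bnp)) - np.log1p(p_bnp))
    # Each task contributes only where its label exists.
    total = alpha * ae_bnpp + beta * ae_bnp
    return float(total.sum() / (alpha.sum() + beta.sum()))
```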

CNN Evaluation
CNNs were evaluated in terms of area under the receiver operating characteristic curve (AUROC or AUC) and Pearson r. ROC curves were computed after binary thresholding of BNP and BNPP measurements, according to previously established screening thresholds for acute heart failure detection (greater than 400 for BNPP, greater than 100 for BNP) (Kim and Januzzi Jr, 2011). To assess the effect of resolution on CNN activation, we applied three activation mapping techniques (saliency (Simonyan et al., 2014), grad-CAM (Selvaraju et al., 2017), and XRAI (Kapishnikov et al., 2019)) to each trained CNN. Activation maps were generated for each radiograph in the BNPP test set (n = 2691).
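The thresholded-AUC evaluation can be sketched as follows (a minimal pairwise Mann-Whitney implementation for clarity; in practice a library routine such as scikit-learn's roc_auc_score would typically be used, and the function name here is ours):

```python
import numpy as np

def auc_from_values(y_meas, y_pred, threshold):
    """AUROC of continuous inferred values against lab values binarized
    at a screening threshold (e.g., 400 for BNPP, 100 for BNP)."""
    y = np.asarray(y_meas, float) > threshold  # positive = elevated biomarker
    s = np.asarray(y_pred, float)
    pos, neg = s[y], s[~y]
    # Probability that a random positive outscores a random negative,
    # counting ties as 1/2 (equivalent to the Mann-Whitney U statistic).
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```

This assumes the test set contains at least one value on each side of the threshold.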

Quantitative Analysis of CNN Attention
To measure the degree of CNN attention within the lungs, we propose two metrics: lung area attention (AA) and lung blur sensitivity (BS), both of which utilize lung masks from a separately developed lung segmentation CNN (Figure 3a). The lung segmentation CNN was trained using 302 radiographs and their manually annotated lung masks, based on the U-Net implementation (Ronneberger et al., 2015).
Area Attention: We define lung area attention (AA) as the proportion of the highly activated pixels in the activation map that overlap with the lung segmentation mask (Figure 3-b):

AA(x) = |T(heatmap(x)) ∩ mask(x)| / |T(heatmap(x))|

where x is the input chest radiograph, heatmap(x) is the normalized activation map from inference on x, T(·) selects the pixels above a threshold set to the mean pixel value across all activation maps from a single model and technique, and mask(x) is the lung mask. Intuitively, a CNN with a high average lung AA value across the test set has focused mostly within the lungs rather than on the rest of the image.
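The lung AA metric reduces to a masked overlap ratio. A minimal sketch (the threshold is passed explicitly here; per the definition above it would be the mean pixel value across all activation maps for a given model and technique, and the function name is ours):

```python
import numpy as np

def lung_area_attention(heatmap, lung_mask, threshold):
    """Fraction of highly activated pixels that fall inside the lung mask."""
    hot = np.asarray(heatmap, float) > threshold  # "highly activated" pixels
    mask = np.asarray(lung_mask, bool)
    if hot.sum() == 0:
        return 0.0  # no pixel exceeds the threshold
    return float((hot & mask).sum() / hot.sum())

# An activation map that is hot only inside the lungs yields AA = 1.0.
```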
Blur Sensitivity: We define blur sensitivity (BS) as another way to estimate attention (Figure 3-c). Lung BS measures the sensitivity of the CNN to blurring the region denoted by a lung mask:

BS = AUC(ŷ, y) − AUC(blur(ŷ, b), y)

where ŷ is the vector of values inferred by a trained CNN for the entire test set, blur(ŷ, b) is the vector of values inferred when each image in the test set has its lungs blurred with a Gaussian kernel of size b, and AUC(ŷ, y) is the AUC computed for a vector of inferred values ŷ against the ground truth vector y. We increased the Gaussian kernel sizes with the image size to ensure a similar effect relative to the field of view. A model that relies on high-resolution details within the lungs will have a larger lung BS value.

The relationship between measured laboratory values and the respective values inferred by the CNNs is shown in Figure 4, for both the BNP and BNPP test sets at different input image resolutions. Correlation was relatively stronger for BNP than for BNPP at all image resolutions (r = 0.642-0.762 for BNP and r = 0.587-0.697 for BNPP), even though the BNP training and evaluation sets were much smaller. The Pearson correlation coefficient between measured and inferred laboratory values increased with input image resolution, with the greatest effect at lower image resolutions. For BNP, peak Pearson r was 0.787 at image size 512 and decreased slightly to 0.762 at 1024. For BNPP inference, peak Pearson r was 0.699 at 256 and plateaued at higher image sizes.

CNN Evaluation
The relationship between input image resolution and the AUC obtained for BNP and BNPP prediction, thresholded at 100 and 400 respectively for acute heart failure detection, is shown in Figure 5. Increasing the image size from 64 to 1024 resulted in continuous increases in AUC (0.817 to 0.903 for BNP and 0.797 to 0.863 for BNPP), with the greatest improvement between the lowest resolutions. Using Youden's J index on each AUROC at image sizes 64-1024, we measured sensitivity for BNP (0.618-0.882) and BNPP (0.728-0.815), as well as specificity for BNP (0.904-0.810) and BNPP (0.728-0.723).
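Selecting the operating point by Youden's J index can be sketched as follows (an exhaustive search over candidate cutoffs; the function name is ours):

```python
import numpy as np

def youden_operating_point(y_true, scores):
    """Sensitivity and specificity at the threshold maximizing Youden's J,
    where J = sensitivity + specificity - 1."""
    y = np.asarray(y_true, bool)
    s = np.asarray(scores, float)
    best = (-1.0, 0.0, 0.0)  # (J, sensitivity, specificity)
    for t in np.unique(s):   # every observed score is a candidate cutoff
        pred = s >= t
        sens = (pred & y).sum() / y.sum()
        spec = (~pred & ~y).sum() / (~y).sum()
        j = sens + spec - 1
        if j > best[0]:
            best = (j, sens, spec)
    return float(best[1]), float(best[2])
```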

CNN Activation Mapping
Averaged Activation Maps: The three strategies of saliency, grad-CAM, and XRAI were then applied to the CNNs trained at all image sizes to assess the activation maps for consistent trends. The resulting maps at each resolution were averaged over all test cases for each strategy and are presented as average activation maps in Figure 6.

Quantitative Analysis of Model Attention
Area Attention: We calculated the average lung AA over all images in the test set (n=2691) for the five CNNs trained at different input resolutions, using three activation mapping techniques (saliency, grad-CAM, XRAI) (Figure 6-B). Overall, increasing input resolution led to increasing average lung AA (0.40 to 0.64 for saliency, 0.26 to 0.80 for grad-CAM, 0.33 to 0.72 for XRAI). The greatest changes in average lung AA were observed when input resolution was increased from 512 × 512 to 1024 × 1024 (0.46 to 0.64 for saliency, 0.58 to 0.80 for grad-CAM, and 0.54 to 0.72 for XRAI). At the 64 × 64 image size, average lung AA < 0.5 indicates that the CNN trained at this resolution focused less than half of its attention within the lungs. In contrast, all techniques yielded average lung AA > 0.5 at the 1024 × 1024 image size, indicating that this model focused mostly inside the lungs. Our analyses also suggested that average lung AA is independent of BNP or BNPP values (Appendix E): lung attention was consistent regardless of BNPP, but varied greatly with input resolution.
Blur Study: We calculated lung BS based on the AUC obtained from the test set (n=2691). Figure 6-C plots the average lung BS for the five CNNs trained at different image resolutions. Overall, increasing input resolution increased lung BS from 0.01 to 0.13. For the models trained at lower image resolutions (64 × 64 to 256 × 256), lung BS < 0.02 indicates that blurring the lungs caused trivial changes in AUC. The higher-resolution models trained at 512 × 512 and 1024 × 1024 exhibited significantly higher lung BS values of 0.06 and 0.13, respectively.
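The lung BS computation can be sketched as follows (assuming inference on the lung-blurred images has already been run; the pairwise AUROC helper and function names are ours):

```python
import numpy as np

def _auc(scores, labels):
    """Pairwise (Mann-Whitney) AUROC helper; ties count 1/2."""
    s, y = np.asarray(scores, float), np.asarray(labels, bool)
    pos, neg = s[y], s[~y]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def lung_blur_sensitivity(y_true, pred_orig, pred_blurred):
    """Drop in AUROC when inference is repeated on lung-blurred inputs.

    pred_blurred holds the model's outputs on images whose lung region
    was blurred with a Gaussian kernel scaled to the image size.
    """
    return float(_auc(pred_orig, y_true) - _auc(pred_blurred, y_true))
```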

Discussion
In this work, we demonstrated the feasibility of inferring BNP and BNPP from chest radiographs. A modified ResNet152v2 CNN was developed using staged training to deal with multiple datasets of different sizes. Optimal performance was achieved at larger input image sizes, which highlights the importance of spatial detail for inferring BNP and BNPP values. At the 1024 × 1024 image size, thresholding the inferred values at BNP > 100 and BNPP > 400, AUROCs were 0.903 and 0.863, while Pearson r values were 0.762 and 0.697. By applying three activation mapping techniques (saliency, grad-CAM, XRAI) and two proposed quantitative metrics (lung AA, lung BS) to our CNNs, we confirmed that increasing input resolution increased model attention to the lungs, the most clinically relevant region of the radiograph. To keep our observations general, we employed the ResNet152v2 model architecture with minimal modifications; we chose this architecture due to its superior performance in our preliminary experiments (Appendix B).

Few prior investigators have begun to explore the application of CNNs to infer blood serum biomarkers from chest radiographs. As detecting pulmonary edema on a chest radiograph is challenging even for expert radiologists, assembling a dataset based on expert opinion may be inconsistent, as well as time- and labor-intensive. This work instead utilizes serum biomarkers as objective data to drive neural network training. Seah et al. (2019) showed initial feasibility of using BNP for this task at 128 × 128 resolution, which resulted in an AUROC of 0.82 on their test set, compared to our result of 0.903. Unlike our proposed model, their model's attention was predominantly outside of the lungs. Other investigators who developed tools for detecting pulmonary edema from chest radiographs achieved similar AUROCs, ranging from 0.814 to 0.924 (Rajpurkar et al., 2018; Cicero et al., 2017; Sabottke and Spieler, 2020), with a variety of CNN architectures.
We thus expanded on these works and showed that while some performance (in terms of AUROC) is maintained at lower image sizes, CNNs require higher resolution to ensure that their inferences are the result of attention to the lungs. Our results provide more insight into the effect of image resolution on CNN learning. Future work can focus on developing novel architectures for this task, or on relating BNP and BNPP values to radiologist grades of pulmonary edema.

Appendix F. Choice of Resolution
A potential limitation of our work is that we did not experiment with resolutions higher than 1024 × 1024, even though the native resolution of our chest radiographs was as high as 4700 × 4700. For our experiments, we selected resolutions that encompass the full range of input resolutions commonly used when training CNNs on chest radiographs. 1024 × 1024 was selected as the maximum resolution in our work for two reasons: (1) it is the maximum resolution of images in the commonly used public NIH ChestX-ray14 dataset (Jaeger et al., 2014) and RSNA-Pneumonia dataset (Pan et al., 2019);
(2) the compute resources required for training increase two-fold with each step up in resolution (Table 6). Training a ResNet152v2 on 1024 × 1024 images pushed the memory limits of our available hardware. Future work may be directed at studying the performance gains at even higher resolutions.