Uncertainty-Aware Deep Learning Methods for Robust Diabetic Retinopathy Classification

Automatic classification of diabetic retinopathy from retinal images has been increasingly studied using deep neural networks with impressive results. However, there is a clinical need for estimating uncertainty in the classifications, a shortcoming of modern neural networks. Recently, approximate Bayesian neural networks (BNNs) have been proposed for this task, but previous studies have only considered the binary referable/non-referable diabetic retinopathy classification applied to benchmark datasets. We present novel results for 9 BNNs by systematically investigating a clinical dataset and a 5-class classification scheme, together with benchmark datasets and the binary classification scheme. Moreover, we derive a connection between an entropy-based uncertainty measure and classifier risk, from which we develop a novel uncertainty measure. We observe that the previously proposed entropy-based uncertainty measure improves performance on the clinical dataset for the binary classification scheme, but not to such an extent as on the benchmark datasets. It improves performance in the clinical 5-class classification scheme for the benchmark datasets, but not for the clinical dataset. Our novel uncertainty measure generalizes to the clinical dataset and to one benchmark dataset. Our findings suggest that BNNs can be utilized for uncertainty estimation in classifying diabetic retinopathy on clinical data, though proper uncertainty measures are needed to optimize the desired performance measure. In addition, methods developed for benchmark datasets might not generalize to clinical datasets.


Main
Deep neural networks have achieved impressive results in a wide variety of problems, ranging from large-scale image classification [1] to natural language understanding [2] and medical image segmentation [3]. However, the standard methods have been found to produce over-confident predictions, meaning that they are poorly calibrated [4]. In classification tasks, a poorly calibrated network can place a high probability on one of the classes even when the predicted class is incorrect, whereas a well-calibrated classifier would place less probability mass on uncertain classes. The issue of uncertainty estimation is especially important in the medical domain, in order to trust confident model predictions for screening automation and to refer uncertain cases for manual intervention by a medical expert. In this work, we refer to classifiers that can indicate their uncertainty as robust classifiers.
Over the past few years, the automatic classification of diabetic retinopathy using deep neural networks has attracted growing interest [5, 6, 7, 8]. More recently, the focus has turned to developing robust deep learning methods for the classification task, most commonly using the approximate Bayesian deep learning approach, which approximates the Bayesian neural network (BNN) posterior distribution in a computationally scalable manner. Previous works have considered a variety of aspects, from studying the benefits of uncertainty estimates [9] to the algorithmic development of robust methods [10, 11]. Although the variety of studied algorithms has been diverse [9, 12], the used datasets have been benchmark datasets. This leaves open the question of whether the algorithms generalize to clinical data. In addition, these recent works have mainly focused on the classification of diabetic retinopathy using binary classification schemes, i.e. "referable vs. non-referable" (RDR) or "healthy vs. any" diabetic retinopathy. However, in clinically oriented approaches there has been a shift towards the 5-class proposed international diabetic retinopathy classification system (PIRC) [8, 13, 14]. In order for these approaches to have clinical use, it is of paramount importance that the algorithms generalize to clinical datasets and diabetic retinopathy grading systems, which remain unstudied so far.
The common aspect among the works focusing on robust methods is the use of uncertainty information to simulate a referral process, introduced by Leibig et al. [9]. Each prediction is associated with an uncertainty estimate, and the least certain predictions are referred to experts, while the more certain predictions are used for evaluation. This process mimics a situation in which the automated system asks for human intervention for uncertain cases, i.e. refers them to an expert. In practice, the holdout test set predictions are ordered according to their uncertainty and several referral levels are defined corresponding to a percentile of referred examples, i.e. a 10% referral level means that the 10% most uncertain examples are left out of the evaluation.
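In code, the referral process can be sketched as follows; `refer` is a hypothetical helper name, and the toy labels and uncertainties are illustrative only:

```python
import numpy as np

def refer(y_true, y_pred, uncertainty, referral_level):
    """Keep the (1 - referral_level) most certain predictions.

    Hypothetical helper illustrating the referral process of Leibig et al.:
    examples are ranked by uncertainty, and the most uncertain fraction is
    referred to a human expert and excluded from evaluation.
    """
    n = len(uncertainty)
    n_keep = n - int(np.floor(referral_level * n))
    keep = np.argsort(uncertainty)[:n_keep]  # most certain first
    return y_true[keep], y_pred[keep]

# Toy example: 10 predictions, refer the 30% most uncertain.
y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1])
unc    = np.array([.1, .2, .9, .8, .1, .2, .7, .3, .2, .1])
yt, yp = refer(y_true, y_pred, unc, referral_level=0.3)
print(len(yt))  # 7 examples retained for evaluation
```

In this toy example, the three misclassified examples happen to be the three most uncertain ones, so the retained accuracy improves after referral, which is exactly the behaviour the referral curves measure.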
In Leibig et al. [9], a Monte Carlo (MC) dropout neural network was used for two binary classification tasks: classification of any diabetic retinopathy and RDR. They used the EyePACS dataset [15] for training and testing and evaluated the out-of-distribution performance with the Messidor dataset [16]. They observed improved robustness in comparison to a baseline standard neural network. Filos et al. [12] conducted a methodologically more extensive study for the classification of RDR. The EyePACS dataset was used for training and testing, and out-of-distribution performance was evaluated with the APTOS dataset [17]. The work examined a number of approximate Bayesian deep learning methods, such as MC dropout, Mean Field Variational Inference (MFVI), deep ensembles, and the MC dropout ensemble. They observed that the approximate Bayesian methods outperformed the standard neural network in all the experiments where network uncertainty was utilized. Farquhar et al. [10] proposed the Radial BNN method. The model was benchmarked against MFVI, MC dropout, and deep ensembles using the EyePACS dataset, as both the training and test sets, for the RDR classification task. A single Radial BNN was found to outperform the MC dropout and MFVI, but not the deep ensemble. Overall, an ensemble of Radial BNNs turned out to outperform all other methods at all referral levels. In Band et al. [18], the RDR task was examined using the EyePACS dataset for training and testing, and out-of-distribution performance was evaluated with the APTOS dataset, similar to Filos et al. [12]. They also studied how the robust deep learning methods generalize when no images of severe and proliferative diabetic retinopathy are used in training. The considered approximate deep learning methods were MFVI, Radial BNN, Function-Space Variational Inference, MC dropout, and Rank-1 Parameterized BNN, and ensembles of them; deep ensembles were also used. It was found that the MC dropout ensemble performed best in the within-distribution and the MFVI ensemble in the out-of-distribution experiments.
In this study, our objective is to analyze robust neural networks, i.e. networks that are inherently well calibrated, for the task of diabetic retinopathy classification on both the RDR and PIRC classifications. For this reason, we leave out the analysis of post-hoc methods, such as the test-time augmentation introduced in Ayhan and Berens [11] and Ayhan et al. [19], neural network softmax temperature scaling, and probability binning strategies [4]. Many measures of uncertainty have been used in previous works. In Leibig et al. [9], the standard deviation of the output of the BNN and the entropy of the posterior predictive distribution were considered as measures of uncertainty and found to perform similarly. In Filos et al. [12], Ayhan et al. [19], and Band et al. [18], the entropy was also selected as the measure of uncertainty. On the other hand, the mutual information between the parameters of the model and the output was considered as the measure of uncertainty in Farquhar et al. [10].
In the present work, we explore the benefits of uncertainty estimates for a clinical dataset of a Finnish hospital on both the binary RDR and the 5-class PIRC classification schemes. We investigate 9 different approximate Bayesian methods to extensively analyze recently proposed methods.
In addition, we study the out-of-distribution performance using three benchmark datasets: the EyePACS [15], the Messidor-2 [16, 20], and the APTOS [17]. We observe that the entropy uncertainty estimates improve AUC performance in the binary RDR system, but in the case of the 5-class PIRC system, the QWK improves only on the benchmark datasets. For the clinical dataset and the PIRC system, we observe less robust classifier performance using entropy-based uncertainty. To improve the quality of the uncertainty estimates, we additionally propose a novel classifier-risk-based uncertainty measure, which improves the within-distribution uncertainty performance on both the Finnish hospital dataset and the EyePACS dataset. As far as we know, clinical diabetic retinopathy severity schemes and clinical workflow datasets have not been studied before using robust neural networks.

Approximate Bayesian methods
Here we consider 9 different BNN approaches. The methods approximate the true distribution of the neural network parameters with different mechanisms. These methods are deep ensembles [21], Monte Carlo (MC) dropout [22], Mean Field Variational Inference (MFVI) [23], Generalized Variational Inference (GVI) [24], and Radial BNN [10]. We also use the ensembling approach for MC dropout, MFVI, GVI, and Radial BNN. Our baseline is a standard neural network trained with dropout and L2-regularization, which is equivalent to the maximum a posteriori (MAP) estimate if a fully factorized normal prior, with the variance inversely proportional to the regularization weight, is used on the network parameters.
The BNNs produce an estimate of the posterior distribution, which can be used to take the uncertainty in the parameters into account when making a prediction [25]. To measure the uncertainty, we use the entropy of the posterior predictive distribution, which has also been used in previous works. We show that this measure optimizes the negative log-likelihood when used in the referral process; this is because the referral process is a type of reject-option classification with the uncertainty used as a risk measure. We propose a novel class of uncertainty measures based on this risk interpretation, from which we derive the QWK-Risk measure, used as another uncertainty measure for the 5-class PIRC system.
The utility of the uncertainty information is evaluated in a similar manner as in previous works. We compute the uncertainty as the entropy or as the QWK-Risk of the posterior predictive distribution and reject some proportion of the most uncertain cases, which is called the referral level. The performance is evaluated for the binary RDR task using the area under the receiver operating characteristic curve (AUC), and for the 5-class PIRC task using the quadratic weighted Cohen's Kappa (QWK), similar to Krause et al. [13] and Ruamviboonsuk et al. [14]. Both the AUC and QWK are evaluated at 0% (no referral), 30%, and 50% referral levels, and are presented on a scale with a maximum of 100 to improve readability.
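The entropy used here can be computed directly from the Monte Carlo estimate of the posterior predictive distribution; a minimal sketch, where the array shapes and the small numerical example are assumptions for illustration:

```python
import numpy as np

def predictive_entropy(mc_probs):
    """Entropy of the posterior predictive distribution.

    mc_probs: array of shape (N_samples, n_examples, n_classes) holding
    the softmax outputs of N Monte Carlo samples (e.g. dropout masks,
    variational samples, or ensemble members).
    """
    p = mc_probs.mean(axis=0)                     # posterior predictive
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

# Two MC samples, two examples, binary task.
mc = np.array([[[0.9, 0.1], [0.6, 0.4]],
               [[0.8, 0.2], [0.4, 0.6]]])
h = predictive_entropy(mc)
# The second example (mean probability 0.5/0.5) is maximally uncertain.
```

The small constant inside the logarithm guards against log(0) for one-hot predictive distributions.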

Datasets
We use the following four datasets for our experiments: the EyePACS [15], KSSHP [26], Messidor-2 [20], and APTOS [17]. The EyePACS and APTOS datasets were introduced for two different Kaggle competitions on diabetic retinopathy detection. Messidor-2 is a common benchmark dataset that was introduced for research in computer-assisted diagnosis of diabetic retinopathy. The EyePACS, APTOS, and Messidor-2 datasets have been widely used in the literature for training and analyzing robust neural networks. In order to examine whether the results generalize to clinical hospital datasets, we use the non-public KSSHP dataset, which was collected from clinical workflow data of the Central Finland Health Care District. All the used datasets originate from different countries: EyePACS from the USA, KSSHP from Finland, APTOS from India, and Messidor-2 from France, allowing for an extensive analysis of the generalization of the models under distribution shift by country.

RDR classification
A visual illustration of the retained performance for the models trained on RDR classification is presented in Figure 1A, and detailed RDR results in Table 1. On the full test set, i.e. the 0% referral level, our EyePACS-trained models had better results on all of the benchmark sets in comparison to previous works, thus setting a higher standard as our baseline performance. Indeed, our worst models are better than the best models in previous works, such as Leibig et al. [9] on EyePACS. Also, all of the KSSHP-trained models had a high AUC at the 0% referral level on the KSSHP test set; indeed, every model outperforms the corresponding model trained and tested on the EyePACS dataset. However, the generalization of the KSSHP-trained models to the Messidor-2 and APTOS sets was worse in every comparison than that of the EyePACS-trained models.
The uncertainty-enabled referral process results for the EyePACS-trained models are similar to Leibig et al. [9], as we observe the referral process to systematically increase the AUC on the EyePACS test set and on the Messidor-2 set, and, as a novel result, we observe that the referral process also increases the AUC on the clinical KSSHP dataset. We observe different behaviour on the APTOS dataset than in Filos et al. [12], but similar to some models in Band et al. [18], as the AUC of all models systematically decreases when utilizing the uncertainty. Surprisingly, we observe strong uncertainty estimates for the deterministic MAP network, unlike Leibig et al. [9] and Filos et al. [12], who observed generally worse uncertainty estimates for the point-estimate approach. This difference is likely due to the strong regularization we used in training our MAP network.
It turns out that the models trained with the KSSHP data benefited differently from the uncertainty referral than those trained using the EyePACS set. Specifically, on the KSSHP test set, from the 0% to the 30% referral level, the MFVI and Radial ensemble models exhibit slightly decreased performance. In contrast, the GVI model performance stayed constant across this range but decreased from 30% to 50%. In addition, the uncertainty estimates do not generalize to the EyePACS test set to such an extent as they did when using the EyePACS-trained models on the KSSHP set. We see that on the EyePACS set, the uncertainty estimates given by the MAP network, deep ensemble, and Radial ensemble cannot be used to improve the performance by referral, and the MFVI and GVI models do not improve beyond the 30% referral rate. However, all the models perform well on the Messidor-2 dataset at all referral levels, with the GVI ensemble even attaining 100.0 AUC at the 50% referral level. In the case of the APTOS dataset, we observe similar performance as with the EyePACS-trained models, as the performance decreases along the referral rate, with the exception of the baseline network improving from 0% to 30%.
In terms of overall performance, we observe that the best results for the EyePACS, KSSHP, and Messidor-2 datasets were obtained by referring data. For EyePACS and KSSHP, the best results were for within-distribution trained models. The best performance on EyePACS was obtained with the MC dropout model, which reaches 98.4 AUC at the 50% referral level, and on KSSHP with the MC dropout ensemble, which reached 98.9 AUC at the 50% referral level. The best out-of-distribution performance on the Messidor-2 set was obtained with the KSSHP-trained GVI ensemble at the 50% referral level, reaching 100.0 AUC. For the APTOS set, referral did not improve performance, as the EyePACS-trained MFVI ensemble at the 0% referral level reached the highest AUC of 96.1.

PIRC classification
As no previous works have utilized robust neural networks for the 5-class clinical PIRC scheme, we cannot directly compare the absolute performance. However, standard deep learning methods have been utilized, with data similar to our benchmark datasets, in Krause et al. [13]. In that study, a deterministic Inception-v4 model was trained using over 1.6 million images from EyePACS-affiliated clinics, 3 eye hospitals in India (one of them the same as the origin of the APTOS dataset), and the Messidor-2 dataset. The test set consisted of EyePACS-originating images, on which the model achieved a QWK of 84.0. A visual illustration of the PIRC results is presented in Figure 1B, with full results in Table 2. When no images are referred, the best performance is achieved with different ensembles: the deep ensemble achieves 81.3 QWK on the EyePACS dataset and 81.1 QWK on the KSSHP dataset. Similarly, the MFVI ensemble achieves 83.9 QWK on the Messidor-2 dataset, and the MC dropout ensemble achieves 87.9 QWK on the APTOS dataset. The EyePACS-trained models had better out-of-distribution performance than the KSSHP-trained models.
When using entropy as the measure of uncertainty, the only EyePACS-trained models that improved within-distribution performance for all referral levels were GVI, Radial, and the GVI ensemble. These models also surpassed the 84.0 QWK when referring ≥ 30% of images. In addition, the deep ensemble achieved 86.8 QWK and the Radial ensemble 84.6 QWK at the 30% referral level. All the models except the MC dropout consistently improve on the Messidor-2 dataset and reach over 84.0 QWK when referring ≥ 30% of images. On the APTOS dataset, all models consistently improve at all referral levels and surpass the 84.0 QWK when referring ≥ 30% of images. However, on the KSSHP dataset the models degrade consistently, apart from GVI and the deep ensemble from the 0% to the 30% referral level, and no model reaches competitive performance on KSSHP at any referral level.
When using the KSSHP data for training, we get significantly worse results compared to the EyePACS-trained models, as no model consistently improves on the within-distribution set and, overall, no model reaches the 84.0 QWK. Indeed, only the MFVI improves from the 0% to the 30% referral level. In terms of EyePACS generalization, all models improve from 0% to 30%, and Radial, the MFVI ensemble, and the Radial ensemble improve also from the 30% to the 50% referral level. However, no model reaches over 80.0 QWK. All models consistently improve on the Messidor-2 dataset, and GVI, Radial, the MC dropout ensemble, and the Radial ensemble reach over 84.0 QWK at the 50% referral level. However, the MFVI has generally poor performance, as its QWK only increases from 45.7 to 54.5. On the APTOS dataset, MC dropout, GVI, Radial, and all ensemble models improve from the 0% to the 30% referral level. From the 30% to the 50% referral level, GVI and all ensembles except the GVI ensemble improve. Even though some models improve in the referral process, no model reaches over 80.0 QWK, the deep ensemble being closest with 79.9 QWK at the 50% referral level.
The overall best performance for the EyePACS set is obtained using the EyePACS-trained Radial when referring 50% of images, a QWK of 87.9, and for the KSSHP set using the KSSHP-trained MFVI and referring 30% of images, a QWK of 81.6. For the out-of-distribution sets, the best Messidor-2 results are obtained using the EyePACS-trained Radial when referring 50% of images, a QWK of 94.3, and for the APTOS set using the EyePACS-trained MFVI ensemble when referring 50% of images, a QWK of 97.3. Since the uncertainty estimation is motivated from the point of view of clinical interest, it is extremely concerning that the methods appear not to work in a similar manner when trained on a clinical dataset as on the benchmark datasets, especially on the clinical set itself.

QWK-Risk as alternative uncertainty measure
The results for our proposed QWK-Risk uncertainty measure are presented in Figure 1C and Table 3. We can see that the QWK-Risk-based method systematically improves both the within-distribution test results and the out-of-distribution results for the two training sets. Within distribution on the EyePACS set, all models reach over 84.0 QWK when referring ≥ 30% of images. No EyePACS-trained model reaches the 84.0 QWK on the KSSHP set. However, QWK-Risk enables the deep ensemble to reach ≥ 80.0 QWK at all referral levels. In addition, the deterministic MAP network, the MC dropout ensemble, and the MFVI ensemble reach ≥ 80.0 QWK at the 50% referral level. On the Messidor-2 set, the performance of some models decreases and some improves in comparison to the entropy-based uncertainty; most notably, the GVI no longer reaches over 84.0 QWK, but all other models reach over 84.0 QWK at referral levels ≥ 30%. A systematic decrease in performance is seen on the APTOS dataset, and now only the MC dropout ensemble and the GVI ensemble reach over 84.0 QWK when referring 30% of images.
From the clinical perspective, an important finding is that, using the QWK-Risk, all our models improve with referral on the clinical KSSHP dataset. Using the QWK-Risk as the uncertainty measure, the best performance for the EyePACS set is obtained using the EyePACS-trained MC dropout ensemble and Radial ensemble when referring 50% of images, a QWK of 92.6; for the KSSHP set using the KSSHP-trained GVI ensemble and referring 50% of images, a QWK of 89.9; for Messidor-2 using the EyePACS-trained MC dropout ensemble and MFVI ensemble when referring 50% of images, a QWK of 94.3; and for the APTOS set using the EyePACS-trained MC dropout ensemble when referring 0% of images, a QWK of 87.9.

Table 3: PIRC results using QWK-Risk as the uncertainty measure. Mean and standard deviation computed for 100 bootstrap resamples of the test data. Relative improvement to the previous referral level is denoted with a green up-pointing triangle, orange equals sign, or red down-pointing triangle for increasing, equal, or worse performance, respectively.

Conclusions
We have replicated the usefulness of entropy-based uncertainty estimation in the RDR classification task for most robust neural networks when training with the EyePACS benchmark dataset, and demonstrated the same finding, to a smaller extent, for the KSSHP clinical hospital dataset. We also observe that the quality of the uncertainty estimates in out-of-distribution tests decreases when the models are trained with the KSSHP set. In addition, we show that entropy is less suitable as a measure of uncertainty in the clinical 5-class PIRC classification task for the EyePACS and KSSHP datasets.
We have proposed and demonstrated that an uncertainty measure based on the classifier risk under the quadratic weighted Cohen's kappa provides a useful measure for within-distribution uncertainty quantification that improves uncertainty-based retained performance in comparison to entropy. However, the QWK-Risk decreases the out-of-distribution quality of the uncertainty estimates for the Messidor-2 and APTOS datasets.
From a clinical perspective, the classifiers should be able to indicate uncertainty, such that the cases for which the classifier is confident can be automatically classified, while the difficult cases can be manually verified, and possibly corrected, by medical experts. Since most of the benchmark datasets only permit research use, for real-world use cases the classifiers should be trainable on "in-house" hospital datasets. When training a classifier on a hospital dataset, we expect that it can be utilized for future cases within the same hospital. In our studies, we did not observe competitive performance on the KSSHP set with the clinical PIRC system when entropy was utilized to quantify uncertainty. This highlights the concern that methods developed using benchmark datasets might not generalize to the clinical setting. However, the proposed QWK-Risk uncertainty measure enabled the models to surpass the state of the art when 30% of the most uncertain cases were referred, which demonstrates that robust neural networks with a more appropriate uncertainty function may be utilized for clinical datasets and clinical use cases.
The degradation and variability in out-of-distribution performance may partly be caused by different grading conventions, and other sources of grading variability may prevent correct evaluation. There are several mechanisms causing variable grading. Firstly, the frequency of photographic screening may affect the distribution of retinopathy severity, with advanced grades of retinopathy being rarer when screening is more frequent: patients are referred and treated before the advanced stages and followed up with clinical visits instead of fundus photography. Secondly, there are no grading schemes available for the classification of treatment outcomes, and as a result, severe retinopathy may be arbitrarily classified into many different grades after treatment. The third reason for variable grading is the poor quality of some retinal images. The grading of poor-quality photographs may reflect preconceived notions of patient status, based on factors such as age and type of diabetes, availability of treatments, and previous treatments.
Our future work includes a more fine-grained analysis of the out-of-distribution performance of robust neural networks. We will evaluate a larger dataset of diabetic retinopathy images from the Helsinki region in Finland. In addition to the country-level distribution shift, our aim is to examine the within-country, hospital-region distribution shift using this data together with the KSSHP dataset. Additionally, our future work includes the utilization of some computationally more intensive solutions for robust deep learning. Moreover, we will examine the impact of joint training on data from different regions on performance and robustness. Lastly, we aim to examine multimodal inputs with the fusion of multi-view retinal images for the task of even finer-grained diabetic retinopathy grading.

Datasets
All our datasets consist of color images of the human retina and are graded using the following 5-class PIRC system for the severity of diabetic retinopathy: no diabetic retinopathy (class 0), mild diabetic retinopathy (class 1), moderate diabetic retinopathy (class 2), severe diabetic retinopathy (class 3), and proliferative diabetic retinopathy (class 4). From the PIRC system, we derive the binary referable/non-referable diabetic retinopathy (RDR) classification, where non-referable is defined as no or mild diabetic retinopathy (class ≤ 1) and referable as moderate diabetic retinopathy or worse (class ≥ 2). We chose to include the binary RDR system for comparison with previous works.
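The derivation of the RDR label from the PIRC grade is a simple thresholding; a one-line sketch, where `pirc_to_rdr` is a hypothetical name:

```python
import numpy as np

def pirc_to_rdr(pirc_grades):
    """Derive the binary RDR label from the 5-class PIRC grade:
    non-referable = grade <= 1 (no or mild DR), referable = grade >= 2."""
    return (np.asarray(pirc_grades) >= 2).astype(int)

print(pirc_to_rdr([0, 1, 2, 3, 4]))  # [0 0 1 1 1]
```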
The division of data into training, validation, and test sets was performed for the EyePACS and KSSHP image datasets. For the EyePACS dataset, the official training set was used for training, the "Public" set for validation, and the "Private" set as the test set. For the KSSHP dataset, we used 70%, 10%, and 20% of images for the training, validation, and test sets, respectively. For the splits, we used stratified pseudo-random sampling to preserve the PIRC class distribution of the original set, while ensuring that images from the same patient cannot reside in more than one set. Due to the low number of images in the APTOS and Messidor-2 datasets, 3662 and 1744 respectively, they were used only as out-of-distribution test sets. Complete descriptions of the training, validation, and test sets are given in Table 4 and the test set class distributions in Table 5. The images were resized to a standard size of 512 × 512 and, to reduce known variability, preprocessed with the steps described in the supplement.
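The patient-level constraint on the split can be sketched as below; this simplified helper (`patient_level_split` is a hypothetical name) assigns whole patients to sets and omits the PIRC-class stratification used in the actual protocol:

```python
import random
from collections import defaultdict

def patient_level_split(image_ids, patient_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split images so that no patient appears in more than one set.

    Simplified sketch of a KSSHP-style 70/10/20 split; the real protocol
    additionally stratifies by PIRC class.
    """
    by_patient = defaultdict(list)
    for img, pat in zip(image_ids, patient_ids):
        by_patient[pat].append(img)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n = len(patients)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    parts = (patients[:cut1], patients[cut1:cut2], patients[cut2:])
    return [[img for p in part for img in by_patient[p]] for part in parts]

# Toy data: 10 patients with 2 images each.
imgs = [f"p{i // 2}_img{i}" for i in range(20)]
pats = [i // 2 for i in range(20)]
train, val, test = patient_level_split(imgs, pats)
```

Splitting patients rather than images avoids leakage, since two images of the same eye or patient are highly correlated.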

Approximate Bayesian Deep Learning
The approximate Bayesian deep learning models take the uncertainty into account by computing the posterior predictive distribution

p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ. (1)

Here x and y denote the image and target, respectively, D the training data, and θ the model parameters. The predictions are weighted averages using the posterior distribution p(θ | D) [27, 28].
For standard neural networks, the maximum likelihood (ML) or maximum a posteriori (MAP) estimates are typically used, which completely ignore the uncertainty in the parameters. The exact solution of Equation (1) is intractable for deep neural networks, and Markov Chain Monte Carlo is prohibitively expensive for real-world scenarios; thus, approximations need to be utilized. We selected the deep ensemble, MC dropout, MFVI, GVI, and Radial BNN as our approximate Bayesian methods. The posterior predictive distribution can be inexpensively approximated with these methods using an approximate posterior distribution q(θ) and Monte Carlo integration with N samples:

p(y | x, D) ≈ (1/N) Σ_{n=1}^{N} p(y | x, θ_n), θ_n ∼ q(θ). (2)

For all our experiments, we use a VGG16-type network as the base architecture, identical to [10]. Our baseline network is a MAP solution, regularized with L2 weight decay and dropout. Model implementation and optimization are described in detail in the Supplementary Material.
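Equation (2) amounts to averaging the per-sample softmax outputs rather than the logits; a minimal sketch, where the array shapes are assumptions:

```python
import numpy as np

def posterior_predictive(sample_logits):
    """Monte Carlo estimate p(y|x,D) ≈ (1/N) Σ_n p(y|x,θ_n), θ_n ~ q(θ).

    sample_logits: shape (N, n_examples, n_classes), one entry per
    posterior sample (dropout mask, VI sample, or ensemble member).
    """
    # Numerically stable softmax per sample...
    z = sample_logits - sample_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # ...then average AFTER the softmax, as in Equation (2).
    return probs.mean(axis=0)

logits = np.random.default_rng(0).normal(size=(8, 4, 3))  # 8 samples, 4 examples, 3 classes
p = posterior_predictive(logits)
```

Averaging after the softmax is what makes the predictive distribution broader (better calibrated) when the posterior samples disagree.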

Deep Ensembles
The deep ensemble is a collection of multiple ML or MAP neural network models. The models are trained using different initializations in order to produce a diverse set of models. The predictions of the models are averaged to produce the prediction of the ensemble model, which corresponds to heuristically setting the posterior as a uniform distribution over the set of models in Equation (2).

Monte Carlo Dropout -MC Dropout
The Monte Carlo dropout method introduced in Gal and Ghahramani [22] allows approximate Bayesian inference at test time for networks that have been trained using the dropout regularization method. Dropout works by sampling a binary mask r = [r_1, ..., r_d] from a Bernoulli distribution, r_i ∼ Bern(p), and masking the activations h of a layer by computing the Hadamard product r ⊙ h [29], which is equivalent to masking rows of the weight matrix and elements of the bias vector [22]. The posterior predictive distribution is then computed using the dropout distribution in Equation (2).
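A minimal sketch of MC dropout for a single linear layer; the function name and the inverted-dropout scaling convention are assumptions, as the paper does not specify these implementation details:

```python
import numpy as np

def mc_dropout_forward(h, w, b, p=0.5, n_samples=20, rng=None):
    """Monte Carlo dropout for one linear layer.

    At test time the Bernoulli mask is NOT switched off: each forward pass
    samples a fresh mask and applies it to the activations h, which is
    equivalent to masking rows of the weight matrix w.
    """
    rng = np.random.default_rng(rng)
    outs = []
    for _ in range(n_samples):
        # Inverted-dropout scaling keeps the expected activation unchanged.
        r = rng.binomial(1, 1 - p, size=h.shape) / (1 - p)
        outs.append((h * r) @ w + b)
    return np.stack(outs)  # (n_samples, out_dim): samples to average in Eq. (2)
```

Each of the `n_samples` passes plays the role of one posterior sample θ_n in Equation (2).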

Mean Field Variational Inference -MFVI
The mean field variational approximations assume that the neural network parameters are independent and typically that the approximate posterior and prior are Gaussians [10, 12, 23]. The evidence lower bound (ELBO) can then be maximized, which provides a lower bound for the log marginal likelihood, and for the diagonal multivariate Gaussian case the KL term has a simple closed form [23]:

KL[N(μ_j, σ_j²) || N(m_j, s_j²)] = log(s_j / σ_j) + (σ_j² + (μ_j − m_j)²) / (2 s_j²) − 1/2.

Here KL[· || ·] denotes the Kullback-Leibler (KL) divergence, the prior is N(m_j, s_j²), and the variational posterior is N(μ_j, σ_j²) for a parameter θ_j. The remaining expected log-likelihood term is computed using Monte Carlo integration and the reparametrization trick [30]. Similar to the MC dropout, the posterior predictive distribution is computed using samples from the fitted approximate posterior distribution. We used a multivariate standard normal distribution as the prior in all experiments.
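The closed-form KL term for diagonal Gaussians can be written as a small helper; a sketch using the notation above:

```python
import numpy as np

def kl_diag_gaussians(mu, sigma, m, s):
    """KL[N(mu, sigma^2) || N(m, s^2)] summed over independent parameters,
    i.e. the closed-form KL term of the mean field ELBO."""
    return np.sum(np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5)
```

With a standard normal prior (m = 0, s = 1), as used in the experiments, the expression simplifies to Σ_j (−log σ_j + (σ_j² + μ_j²)/2 − 1/2).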

Radial Bayesian Neural Networks -Radial BNN
The multivariate normal distribution has the so-called "soap bubble" pathology in high dimensions, meaning that most of the probability mass is concentrated on a thin shell far from the mean, resulting in samples with a high norm. In Farquhar et al. [10], the high norm is proposed to be the problem in training deep neural networks using the MFVI approach with Gaussian posteriors. To address this issue, they propose a novel "Radial BNN" posterior distribution. The proposed distribution is constructed such that the samples have the same expected norm as a univariate standard normal distribution, regardless of the dimension. The sampling process is defined as follows for a single weight vector:

w = μ + σ ⊙ (ε / ||ε||) · r, ε ∼ N(0, I), r ∼ N(0, 1).
The resulting random variable w avoids the soap-bubble pathology; however, it has no closed-form probability density function or KL divergence. The authors observe that a stochastic estimate of the KL divergence can be computed similarly to [28], allowing optimization of the ELBO up to a constant. A multivariate standard normal distribution was selected as the prior, similar to MFVI.
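The Radial BNN sampling step is straightforward to implement; a sketch assuming elementwise mu and sigma (the function name is hypothetical):

```python
import numpy as np

def radial_sample(mu, sigma, rng=None):
    """Draw one weight sample from the Radial BNN posterior:
        w = mu + sigma * (eps / ||eps||) * r,  eps ~ N(0, I),  r ~ N(0, 1).
    The direction eps/||eps|| is uniform on the unit sphere and the radial
    distance r is a 1-D standard normal, avoiding the Gaussian 'soap bubble'.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(np.shape(mu))
    r = rng.standard_normal()
    return mu + sigma * (eps / np.linalg.norm(eps)) * r
```

Because the radial coordinate is one-dimensional, the expected distance from the mean does not grow with the number of parameters, unlike for a diagonal Gaussian.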

Generalized Variational Inference -GVI
In Knoblauch et al. [24], a novel optimization-centric view of posterior inference is proposed. The problem of finding posterior distributions is viewed in terms of the so-called "Rule-of-Three", which separates the loss, the divergence, and the space of feasible solutions as different aspects of the optimization procedure. Different configurations result in different posterior inference methods, for example standard Bayesian posterior inference and variational inference. This view allows for a principled use of divergences that are robust to mis-specification of the prior. Since the diagonal standard normal prior is typically chosen for computational convenience (rather than to incorporate any prior knowledge about good values of the neural network weights) [24], this approach is promising. We select the robust divergence as Rényi's α-divergence D_AR^α[· || ·] with α = 0.5, the loss function as the negative log-likelihood, and the space of feasible solutions as the mean field normal posteriors and priors, where the prior was selected as the diagonal standard normal. The total minimization objective is then the expected negative log-likelihood under q(θ) plus the divergence

D_AR^α[q(θ) || p(θ)] = 1/(α(α − 1)) log ∫ q(θ)^α p(θ)^(1−α) dθ.
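The α-divergence term can be estimated with Monte Carlo samples from q, since ∫ q^α p^(1−α) dθ = E_q[(p/q)^(1−α)]; a sketch with a 1-D Gaussian sanity check (the function and argument names are assumptions):

```python
import numpy as np

def renyi_alpha_div_mc(q_sample, q_logpdf, p_logpdf, alpha=0.5, n=100_000, seed=0):
    """Monte Carlo estimate of Rényi's alpha-divergence
        D_AR^alpha[q || p] = 1/(alpha*(alpha-1)) * log ∫ q^alpha p^(1-alpha) dθ,
    using ∫ q^alpha p^(1-alpha) dθ = E_q[(p/q)^(1-alpha)] with θ ~ q.
    """
    rng = np.random.default_rng(seed)
    theta = q_sample(rng, n)
    log_ratio = (1 - alpha) * (p_logpdf(theta) - q_logpdf(theta))
    # Log-mean-exp for numerical stability.
    m = log_ratio.max()
    log_integral = m + np.log(np.mean(np.exp(log_ratio - m)))
    return log_integral / (alpha * (alpha - 1))

# Sanity checks with 1-D standard normals.
logpdf_q = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)        # N(0, 1)
logpdf_p = lambda x: -0.5 * (x - 1)**2 - 0.5 * np.log(2 * np.pi)  # N(1, 1)
draw_q = lambda rng, n: rng.standard_normal(n)
d_same = renyi_alpha_div_mc(draw_q, logpdf_q, logpdf_q)  # q = p: divergence is 0
d_diff = renyi_alpha_div_mc(draw_q, logpdf_q, logpdf_p)  # analytic value is 0.5
```

For α = 0.5 and two unit-variance Gaussians one mean apart, the analytic divergence is 0.5, which the Monte Carlo estimate should approach for large n.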

Uncertainty estimation
Many uncertainty estimates have been used in previous works. The entropy of the posterior predictive distribution has been a typical choice, and it has been observed to work well for the binary RDR classification. However, for the 5-class PIRC classification, we observe that entropy does not systematically improve the referral-process QWK on the clinical dataset. We therefore seek alternative measures of uncertainty that give more informed rejection rules.

We measure the performance of our 5-class PIRC classifiers using the quadratic weighted Cohen's kappa [31]: κ_QWK = 1 − (Σ_{j,i} w_{j,i} O_{j,i}) / (Σ_{j,i} w_{j,i} E_{j,i}), where w_{j,i} = (j − i)² / (K − 1)² are the quadratic weights, O is the observed confusion matrix, and E is the confusion matrix expected from the marginals of O. For a single prediction-target pair, the observed and expected matrices coincide in their only nonzero entry, and thus κ_QWK will always be zero. Thus we need the confusion matrix to be a non-single-entry matrix. For this purpose, we utilize an initial estimate of the confusion matrix C, computed on the validation set of the corresponding training set the model was trained on, and define the confusion matrix as the sum of the initial confusion matrix and a single-entry matrix S_{j,i} with 1 at index (j, i). The entry (j, i) denotes a prediction-target pair where the predicted label is j and the target label is i. We propose the loss to be the negative expected QWK over the posterior predictive distribution:

ℓ(j | x) = −Σ_i p(y = i | x) κ_QWK(C + S_{j,i}). (16)

The final QWK-Risk uncertainty estimate is then the loss (16) evaluated at the predicted label. The QWK-Risk can be interpreted as the expected negative QWK value for an input example x, similar to entropy being the expected negative log-likelihood. The QWK-Risk can then be directly used in the referral process.
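The construction above can be sketched in code as follows. This is a minimal reading of the described procedure; the function names and toy inputs are our own, not the authors' reference implementation:

```python
import numpy as np

def qwk(conf):
    """Quadratic weighted Cohen's kappa from a K x K confusion matrix
    (rows: predicted label j, columns: target label i)."""
    K = conf.shape[0]
    idx = np.arange(K)
    w = np.subtract.outer(idx, idx) ** 2 / (K - 1) ** 2   # quadratic weights
    expected = np.outer(conf.sum(1), conf.sum(0)) / conf.sum()
    return 1.0 - (w * conf).sum() / (w * expected).sum()

def qwk_risk(probs, pred, conf_val):
    """QWK-Risk: negative expected QWK after adding the single-entry matrix
    S[pred, i] to the validation confusion matrix, weighted by p(y = i | x)."""
    K = len(probs)
    risk = 0.0
    for i in range(K):
        S = np.zeros((K, K))
        S[pred, i] = 1.0                  # prediction-target pair (pred, i)
        risk -= probs[i] * qwk(conf_val + S)
    return risk

conf_val = 10.0 * np.eye(5)               # toy validation confusion matrix
confident = np.array([0.96, 0.01, 0.01, 0.01, 0.01])
uniform = np.full(5, 0.2)
# A confident predictive distribution yields lower (better) risk than a
# uniform one, so referring by QWK-Risk rejects the uniform case first.
print(qwk_risk(confident, 0, conf_val) < qwk_risk(uniform, 0, conf_val))  # True
```

Referral then proceeds by rejecting the examples with the highest QWK-Risk, exactly as with entropy.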
Data availability
The KSSHP dataset is a non-open dataset to which local and EU jurisdictional restrictions (such as GDPR) apply. Inquiries may be sent to the corresponding author.

Code availability
All the deep learning implementations are available through publicly available repositories. The MAP networks and dropout networks were implemented using standard PyTorch [36] library functions. The diagonal multivariate normal distributions used for MFVI and GVI, as well as the Kullback-Leibler divergence, were implemented using the PyTorch distributions package torch.distributions, available through the main PyTorch package. The Rényi α-divergence was implemented according to the description in Knoblauch et al. [24] and is presented in the Supplementary Section ??. The Radial BNN was reimplemented following the code accompanying the original paper [10], which also includes code for the VGG-style network architecture used in this study.

Figure 1:
Figure 1: Retained performance using (A) posterior predictive entropy to refer on the RDR task, (B) posterior predictive entropy to refer on the PIRC task, and (C) QWK-Risk to refer on the PIRC task. Each column shows which dataset was used for training and each row which test dataset was used for the results.
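The retained-performance curves in these panels follow a simple referral procedure: sort the test examples by the chosen uncertainty measure, refer the most uncertain fraction for manual review, and evaluate the metric on the rest. A minimal sketch (using per-example accuracy as a stand-in for the metric; names are ours):

```python
import numpy as np

def retained_performance(uncertainty, correct, referral_level):
    """Score only the (1 - referral_level) fraction with the lowest uncertainty."""
    n_keep = int(round(len(uncertainty) * (1.0 - referral_level)))
    keep = np.argsort(uncertainty)[:n_keep]   # indices of most certain examples
    return float(np.mean(correct[keep]))

unc = np.array([0.1, 0.9, 0.3, 0.7])          # toy per-example uncertainties
correct = np.array([1, 0, 1, 0])              # toy per-example correctness
print(retained_performance(unc, correct, 0.0))  # 0.5  (no referral)
print(retained_performance(unc, correct, 0.5))  # 1.0  (two most uncertain referred)
```

If the uncertainty measure is informative, retained performance should rise monotonically with the referral level, which is the pattern the figure examines.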

Table 1:
RDR results. Mean and standard deviation computed for 100 bootstrap resamples of the test data. Change relative to the previous referral level is denoted with a green up-pointing triangle, orange equals sign, or red down-pointing triangle for improved, equal, or worse performance, respectively.

Table 2:
PIRC results using posterior predictive entropy as the uncertainty measure. Mean and standard deviation computed for 100 bootstrap resamples of the test data. Change relative to the previous referral level is denoted with a green up-pointing triangle, orange equals sign, or red down-pointing triangle for improved, equal, or worse performance, respectively.

With QWK-Risk as the uncertainty measure, models reach ≥ 84.0 QWK for referral levels ≥ 30% when trained and tested on the KSSHP set. Indeed, the GVI ensemble even reaches 89.9 QWK for the within-KSSHP-distribution test when 50% of examples are referred. The generalization of the uncertainty estimates to the EyePACS set also increases, and now the deep ensemble, MC dropout ensemble, and MFVI ensemble reach ≥ 80.0 QWK; however, no model reaches 84.0 QWK. The QWK-Risk decreases the performance for the Messidor-2 and APTOS datasets in comparison to entropy.

Table 4:
Number of images in each subset for each dataset.

Table 5:
Class distribution for PIRC and RDR classification schemes of the test sets.