Uncertainty Analysis of Deep Kernel Learning Methods on Diabetic Retinopathy Grading

Diabetic retinopathy (DR) is a leading cause of vision loss. Therefore, screening for early signs and assessing the severity of DR is crucial and extensively studied. To support clinicians, screening could be automated by algorithms that can, for example, refer difficult decisions to specialists for further investigation. However, frequently used neural networks (NNs) typically do not know when they do not know, and approximate Bayesian NNs often likewise fail to provide well-calibrated uncertainty estimates. Thus, in this work, we investigate whether and how to use deep kernel learning (DKL), which we design as a hybrid combination of the state-of-the-art EfficientNet-B0 and a Gaussian process (GP) layer, to improve the quality of uncertainty estimates in referral-based DR screening. To this end, we first analyze the necessity of recently proposed extensions to the DKL framework for resolving miscalibrated uncertainties, despite the theoretical suitability of GPs for uncertainty quantification. Our subsequent comprehensive comparison of the curated DKL's performance to that of the most common approximate Bayesian NNs shows that our DKL models particularly improve the detection of near out-of-distribution (OOD) samples containing other eye diseases through epistemic uncertainty information, but also enhance the calibration of aleatoric in-distribution uncertainty and the diagnostic performance. Hence, DKL can provide a substantial benefit for safety-critical medical applications like automated DR screening, particularly by potentially reducing the risk of missing diseases other than DR due to the improved near-OOD detection performance.


I. INTRODUCTION
Patients suffering from diabetes have a high risk of developing diabetic retinopathy (DR) [1], which can lead to loss of visual acuity [2]. Despite being a frequent cause of vision loss, DR can be treated well if detected early [1], [2], [3]. Hence, it is important to screen for early signs of DR [3]. However, the number of patients with diabetes, i.e. patients at risk of developing DR, is expected to grow severely in the future and impose an increased burden on the healthcare system [1], [3], [4]. An automatic screening system could reduce the workload of ophthalmologists and facilitate more streamlined access to screening, finally benefiting both healthcare providers and patients. With the state-of-the-art performance of deep neural networks (DNNs), a promising solution to this problem is the usage of convolutional neural networks (CNNs): recent studies show that CNNs can perform comparably to clinicians in detecting DR [5], [6], leading to the first AI systems approved by the FDA [5]. However, in contrast to humans, who tend to perform well under uncertainty, deterministic DNNs usually extrapolate widely and lack calibrated uncertainties [7], [8], [9]. Hence, to minimize erroneous predictions, mitigate the risk of patients not receiving the required intervention, and lower the likelihood of missing pathologies or diseases other than DR that could have been detected by specialists, calibrated uncertainty information is essential: it allows the model to reject highly uncertain samples and refer patients to specialists for further inspection.
To solve this issue, recent work compares CNNs and several Bayesian neural network (BNN) methods, e.g., Monte Carlo dropout (MCD) [10], deep ensembles (DEs) [11], and other approximations using variational inference (VI), for the task of either detecting or grading DR. The authors highlight the feasibility of Bayesian methods and the benefit of their uncertainty calibration for being informative about model failures [12], [13], [14], [15], [16]. MCD- and DE-based approaches are observed to frequently provide good calibration, whereas more sophisticated VI approaches may sometimes even perform worse than regular CNNs [12], [14]. However, both of these simple baselines in particular have shortcomings, potentially causing inconsistent uncertainty calibration under distribution shift [12], [17], [18]. As Gaussian processes (GPs) typically provide well-calibrated uncertainty estimates but lack the feature extraction abilities of CNNs, Leibig et al. [14] analyze the use of a GP trained on the latent features of a fixed CNN to detect DR. Acknowledging that a GP would theoretically be a promising alternative, they, however, find MCD to outperform the GP-based approach and assume this to be caused by the loss of full-stack information when training the CNN and GP independently.
Based on these observations, the goal of this work is to investigate whether deep kernel learning (DKL) [19], i.e. a hybrid deep learning model combining these two methodologies in an end-to-end trainable architecture, can exploit the uncertainty quantification abilities of GPs for uncertainty-informed DR screening. In particular, we analyze whether and how to use stochastic variational deep kernel learning (SVDKL) [19] to improve the quality of uncertainty estimates compared to the estimates provided by standard approximate Bayesian convolutional neural networks (BCNNs), while maintaining high diagnostic performance and allowing both patients affected by DR and those with uncertain diagnoses to be referred to experts. To the best of our knowledge, we are the first to provide a comprehensive evaluation of the SVDKL framework and recently proposed variants [20], [21], [22], [23] for the tasks of DR severity (sDR) grading and referable DR (rDR) detection, which we compare to the most commonly applied approximate BCNN methods on both in-distribution (ID) and out-of-distribution (OOD) data while disentangling aleatoric and epistemic uncertainty estimates.
The remainder of this paper is structured in the following way: First, in section II, we provide an overview of the study data as well as the most important fundamentals and applied methodologies of the article. Herein, we provide a more detailed view of the advantages and practical applications of Bayesian deep learning, as well as our implementation and validation setup. Second, in section III, we present the results for (i) the performance and quality of the uncertainty estimates of SVDKL and three extensions to the method in section III-A, (ii) a comparison of the extended SVDKL method to standard BNNs in section III-B, and (iii) an ablation study that examines the effect of the individual extensions on the model performance and uncertainty calibration in more detail in section III-C. Finally, we discuss the conducted experiments and the obtained results in section IV and conclude in section V.

A. STUDY DATA
In this study, we use the publicly available Kaggle Diabetic Retinopathy Detection (EyePACS) [24] database, collected in the USA, and the Indian Diabetic Retinopathy Image Dataset (IDRiD) [25], which both comprise ophthalmoscopic color fundus images with expert annotations for the five-class (C = 5) DR severity, i.e., no (0), mild (1), moderate (2), severe (3), and proliferative DR (4). As near-OOD data, we employ the Retinal Fundus Multi-disease Image Dataset (RFMID) [26], which also contains retinal fundus images but comprises a multitude of other eye diseases in addition to DR. As a far-OOD dataset, the Society for Imaging Informatics in Medicine/International Skin Imaging Collaboration (SIIM-ISIC) [27] dataset is used, i.e. a task-unrelated medical dataset containing dermoscopic images. We define the sets of the ID and OOD datasets as D_ID = {EyePACS, IDRiD} and D_OOD = {RFMID, SIIM-ISIC}. For preprocessing, ID images are cropped to a square containing the visible retinal disc, whereas OOD images are padded to a square to retain their aspect ratio. Subsequently, all images are resampled to (I_h, I_w) = (224, 224) pixels and the local color mean is removed using a Gaussian filter [28]. An example image per dataset prior to and after preprocessing is given in Fig. 1. Images are standardized to the statistics of the training data prior to feeding them to the networks.
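The geometric part of this preprocessing can be sketched as follows. This is a minimal NumPy sketch under our own naming; the 224 × 224 resampling and the Gaussian-filter local color-mean removal (e.g. via OpenCV or PIL) are omitted for brevity.

```python
import numpy as np

def crop_to_square(img):
    """Center-crop an H x W x C image to a square (ID images: the retinal disc fills the crop)."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def pad_to_square(img, fill=0):
    """Zero-pad an H x W x C image to a square, retaining the aspect ratio (OOD images)."""
    h, w = img.shape[:2]
    s = max(h, w)
    out = np.full((s, s) + img.shape[2:], fill, dtype=img.dtype)
    top, left = (s - h) // 2, (s - w) // 2
    out[top:top + h, left:left + w] = img
    return out
```

After this step, both ID and OOD images are square and can be resampled to the common (224, 224) resolution.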

B. MAXIMUM A POSTERIORI ESTIMATION
Typically, two uncertainty types are distinguished: the irreducible aleatoric uncertainty caused by data noise, and the reducible epistemic uncertainty caused by a lack of knowledge about the true model [29], [30]. Both are important to enable transparent usage within the high-stakes medical domain [9], [31]. However, standard CNNs usually do not provide sufficiently calibrated uncertainty of either type: they are typically trained via maximum a posteriori (MAP) estimation, θ_MAP = arg max_θ [log p(y | X, θ) + log p(θ)], with the CNN's likelihood p(y | X, θ) and the prior probability distribution over the model weights p(θ) [8]. As MAP estimates do not account for parameter uncertainty, regularly trained CNNs can only express aleatoric uncertainty [8]. Furthermore, due to their flexibility, they are prone to overfitting [32], which can lead to overconfident decisions, i.e. they often fail to provide even well-calibrated aleatoric uncertainty estimates [7].

D. DEEP KERNEL LEARNING
Whereas obtaining well-calibrated uncertainty estimates by using BNNs remains difficult, stochastic variational deep kernel learning (SVDKL) [19] is a promising approach to combine the feature extraction capabilities of CNNs with the reliable uncertainty representation of GPs. To this end, Wilson et al. propose to use a GP layer consisting of J GPs, g(x) = [g_1(x), ..., g_J(x)] with g_j ~ GP(0, k_j) ∀j ∈ {1, ..., J}, that operates on the CNN's latent space h(x, θ_FE) ∈ R^Q. Thereby, each kernel function k_j = k_{θ_GP}(h_j, h'_j) takes a subset h_j of the latent activations as input. The SVDKL's likelihood p(y | f) is computed by, firstly, correlating the outputs of the individual GPs through the final linear layer F ∈ R^{C×J}, which resembles an additive GP layer f(x) = Fg(x), and, subsequently, applying the nonlinear softmax activation. By using stochastic VI and a sparse GP approximation, i.e. introducing a variational distribution q(u) over the inducing variables u = {u_1, ..., u_J}, additionally restricting the inducing points to a grid, and exploiting the structure of the kernel matrix, Wilson et al. derive an efficient and scalable sampling scheme (5) as an approximation to the evidence lower bound (ELBO). Thereby, the Kullback-Leibler (KL) divergence minimizes the difference between the variational distribution q(u) and the prior p(u). This sampling scheme allows for the joint optimization of the CNN's parameters θ_FE, the variational parameters θ_u, and the GP's kernel hyperparameters θ_GP through automatic backpropagation and minibatch training, which only requires drawing S samples from the variational distribution for a minibatch of size B and, hence, only adds a small overhead to the regular CNN training and inference procedure. Wilson et al.
observed improved performance over standard DNNs, as well as over independently training a GP on top of pre-extracted features of a CNN, for several benchmarks. While not explicitly analyzing the uncertainty, they reason that SVDKL would automatically yield well-calibrated uncertainties.
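To illustrate the additive GP layer f(x) = Fg(x) described above, the following NumPy sketch draws prior samples for a batch of latent features, with each GP acting on one latent dimension through an RBF kernel. This is a toy prior-sampling sketch under our own naming, not the trained sparse variational model.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, lengthscale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 l^2)), evaluated for all pairs of rows."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def gp_layer_prior_samples(h, F, n_samples=8, jitter=1e-6):
    """Sample f(x) = F g(x), where g_j ~ GP(0, k_j) acts on the j-th latent
    feature of h (B x Q); returns softmax class probabilities per sample."""
    B, Q = h.shape
    C, J = F.shape
    assert J == Q, "one GP per latent dimension in this sketch"
    g = np.empty((n_samples, B, J))
    for j in range(J):
        K = rbf_kernel(h[:, j:j + 1], h[:, j:j + 1]) + jitter * np.eye(B)
        g[:, :, j] = rng.multivariate_normal(np.zeros(B), K, size=n_samples)
    f = g @ F.T                                # correlate GP outputs: (S, B, C)
    e = np.exp(f - f.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)        # softmax likelihood per sample
```

In practice, this layer is implemented with a variational GP (e.g. in GPyTorch) so that q(u) is learned jointly with the feature extractor rather than sampled from the prior.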

1) PATHOLOGICAL BEHAVIOR OF DKL
However, recent studies showed that SVDKL can be prone to overfitting and, hence, produce unreliable uncertainty estimates despite the GP layer [20], [21], [23]. This was observed to be caused by the increased model complexity [20] and by the optimization of the ELBO, which can drive the GPs' kernels to overly correlate the training data, leading to feature collapse [21], [23]. Whereas this overfitting could implicitly be prevented by stochastic minibatch training [21], a fully Bayesian treatment, i.e. additionally making the feature extractor approximately Bayesian, should explicitly fix this issue [20], [21], [22], [23]. Hence, Tran et al. [20] apply MCD to the feature extractor of the SVDKL model to induce an approximate Bayesian posterior over the latent variables and observe this to mitigate the issue. More recently, Liu et al. [22] as well as van Amersfoort et al. [23] propose to tackle the feature collapse by constraining the CNN's feature extractor to be approximately bi-Lipschitz, i.e. by constraining the spectral norm (SN) of the residual blocks within the feature extractor to be smaller than a constant. They reason that this drives the latent representation to be distance aware, i.e. it forces the mapping from the input space to the latent space to be both sensitive and smooth with respect to changes in the input.
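The spectral norm constraint is commonly enforced by estimating each weight matrix's largest singular value via power iteration and rescaling the weights when it exceeds a chosen bound. The following is our own minimal sketch of that idea; practical implementations, such as the one of [23], additionally reshape convolution kernels into matrices and reuse the power-iteration vectors across training steps.

```python
import numpy as np

def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of a 2-D weight matrix W
    by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

def apply_sn(W, coeff=0.95, n_iter=50):
    """Rescale W so its spectral norm is at most `coeff`, softly bounding
    the layer's Lipschitz constant."""
    sigma = spectral_norm(W, n_iter)
    return W if sigma <= coeff else W * (coeff / sigma)
```

Applied per convolutional group, as described for the SN-SVDKL model below, this keeps each layer's expansion bounded while leaving already well-conditioned layers untouched.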

E. IMPLEMENTATION DETAILS 1) BASELINE SETUP
In this work, we use the EfficientNet-B0 [35] as the CNN baseline model. We train the model using L2-norm regularization and minimize the negative log-likelihood (NLL) objective function with the number of data points B of the current minibatch.
To build the BNN baselines, we add Monte Carlo dropout to the baseline CNN (MCD-CNN), which is implemented using regular elementwise Bernoulli dropout applied to the latent features. Additionally, we add spatial dropout [36] to each of the building blocks within the convolutional backbone of the EfficientNet and exploit stochastic model depth during inference to improve the Monte Carlo sampling [37], [38]. Subsequently, we build deep ensembles from both the baseline CNN (DE-CNN) and the model using MCD (DE-MCD-CNN). All deep ensemble models consist of five model instances. Following the ensemble building approach of Band et al. [12], the DE members are sampled without replacement from the pool of trials conducted per non-DE method to reduce computational costs.

2) SVDKL IMPLEMENTATION
We implement SVDKL using the GPyTorch [39] toolbox and follow the setup proposed by Wilson et al. [19]: The backbone of a pre-trained baseline CNN is used as the feature extractor transforming the input to the latent space h(x, θ_FE) ∈ R^Q, upon which a layer with J = Q independent GPs is placed. Each SVDKL model is trained by minimizing the negative approximate ELBO according to (5), first finetuning only the variational parameters and then training the full model end-to-end. To build the spectral normed SVDKL model (SN-SVDKL), we adapt the implementation of van Amersfoort et al. [23] to handle the grouped convolutions that are extensively used within the EfficientNet, i.e. applying spectral norm to each convolutional group individually. In addition, using the sigmoid activation σ(x), we replace the Swish activations of the EfficientNet with their Lipschitz-continuous equivalent [40], LipSwish(x) = x σ(x)/1.1.

3) TRAINING SETUP

All model instances are solely trained on the EyePACS training split using the Adam optimizer from T = 10 randomly drawn seeds and PyTorch's [41] automatic mixed precision (AMP) package. Accordingly, ten DEs are created per model configuration as described above. Model convergence at training time is tracked by measuring the area under the receiver operating characteristic curve (AUROC) on the public EyePACS data split, reserving the private holdout split for model testing.
A plateau scheduler is implemented to halve the learning rate after ten epochs without improvement. Additionally, early stopping is used to terminate the training of individual trials after 20 epochs without improvement. For MCD, we draw a single, three, and S = 32 Monte Carlo samples during training, validation, and test time, respectively. Similarly to the class weighting applied by [12], [14], and [15], we counter the strong class imbalance of the EyePACS data by randomly oversampling the minority classes with a probability corresponding to their inverse relative class frequency. In addition, we apply online data augmentation (random horizontal and vertical flipping, image rotation, translation, and scaling) to improve model generalization. Augmentation, model, and optimization hyperparameters were sequentially optimized using the Ax-platform [42] toolbox with either a multi-objective optimization for SN-based models or otherwise a Gaussian process expected improvement approach.
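The per-sample oversampling probabilities can be derived from the inverse relative class frequencies, e.g. to drive a weighted random sampler. A sketch under our own naming (not the training code of the paper):

```python
import numpy as np

def oversampling_weights(labels):
    """Per-sample draw probabilities inversely proportional to the relative
    class frequency, so each severity grade is drawn roughly equally often."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = counts / counts.sum()
    inv = {c: 1.0 / f for c, f in zip(classes, freq)}
    w = np.array([inv[y] for y in labels])
    return w / w.sum()
```

With such weights, e.g. passed to PyTorch's WeightedRandomSampler, each class receives the same expected share of a training epoch regardless of its raw frequency.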

F. EVALUATION
To verify our working hypothesis, i.e., that the SVDKL framework can provide a benefit for uncertainty calibration, we evaluate each method for the task of predicting the five DR severity stages (sDR). As the calibration of uncertainty was observed to be task-dependent [12], [13] and, in particular, the detection of referable DR (rDR) is of high importance within the screening setting, we additionally generalize the severity grading to this binary detection task by aggregating the predictive probabilities for the severity stages 2 to 4 into the class referable DR and stages 0 and 1 into the class non-referable DR, and define the set of task settings Q = {sDR, rDR}.
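This aggregation simply sums the predicted probability mass on both sides of the referable threshold, e.g.:

```python
import numpy as np

def to_rdr(probs_sdr):
    """Aggregate five-class severity probabilities (grades 0-4) into
    [non-referable (grades 0-1), referable (grades 2-4)] probabilities."""
    probs_sdr = np.asarray(probs_sdr)
    return np.stack([probs_sdr[..., :2].sum(-1),
                     probs_sdr[..., 2:].sum(-1)], axis=-1)
```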

1) IN-DISTRIBUTION DIAGNOSTIC PERFORMANCE
We use the (macro) AUROC, accuracy (ACC), NLL, and quadratic weighted Cohen's kappa (QWK) [43] as ID performance measures. In contrast to other standard metrics, which are agnostic to the severity of misclassifications, the QWK score is a measure of reliability. It computes an inter-rater agreement score based on Cohen's kappa [44], which quadratically penalizes misclassifications based on the distance to the target class. When applied to binary classification tasks, it reverts to computing the regular kappa score. We define the set of ID diagnostic performance metrics as M_ID,P = {AUROC, ACC, NLL, QWK}.
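The QWK can be computed from the confusion matrix O and the expected matrix E under rater independence, using quadratic weights w_ij = (i − j)^2 / (C − 1)^2. A self-contained sketch consistent with [43], [44] (not the paper's evaluation code; scikit-learn's cohen_kappa_score with weights='quadratic' provides an equivalent):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic penalty w_ij = (i - j)^2 / (C - 1)^2."""
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    w = np.array([[(i - j) ** 2 for j in range(n_classes)]
                  for i in range(n_classes)], float) / (n_classes - 1) ** 2
    # expected confusion matrix under independence of the raters' marginals
    E = np.outer(O.sum(1), O.sum(0)) / O.sum()
    return 1.0 - (w * O).sum() / (w * E).sum()
```

Because w_ij grows quadratically with the grade distance, confusing grade 0 with grade 4 is penalized sixteen times more heavily than confusing adjacent grades.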

2) IN-DISTRIBUTION UNCERTAINTY
We analyze each method's ID uncertainty calibration according to the expected calibration error (ECE) [7] and the area under the accuracy rejection curve (AUARC) [45], for which we define the set of ID uncertainty metrics M_ID,U. The ECE summarizes the deviation of the observed model confidence from a perfectly calibrated model in a single scalar value. The score is based on the assumption that the observed accuracy within a sample population should match the mean model confidence. The AUARC metric, in contrast, mimics the screening setting that allows referring samples with high uncertainty for further inspection. To compute the metric, the accuracy rejection curve is constructed by varying the fraction of referred samples, whereby the accuracy is computed on the remaining samples. An ideal model would exhibit uncertainty estimates that are perfectly correlated with erroneous predictions [12]; hence, the AUARC is expected to increase with better model calibration, i.e. it measures how well the uncertainty estimates align with the model's failures. For the ID task setting, mainly the total and aleatoric predictive uncertainties are relevant, for which we compute two independent scores, i.e. AUARC-total (AUARCt) and AUARC-aleatoric (AUARCa). We compute the total uncertainty from a given set of s ∈ {1, ..., S} Monte Carlo predictions p_s(y | f(x*, θ)) for a test sample x* as

U_tot = H[(1/S) Σ_{s=1}^S p_s(y | f(x*, θ))],

which equals computing the entropy H of the mean predictive distribution [12], [29]. Similarly, the aleatoric uncertainty is defined as the average entropy computed over the given set of predictive samples [12], [29], i.e.

U_aleat = (1/S) Σ_{s=1}^S H[p_s(y | f(x*, θ))].
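Both calibration metrics can be sketched as follows. These are our own minimal versions (binned ECE and a simple rejection-curve average), not the exact implementations of [7], [45].

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: |accuracy - mean confidence| per
    confidence bin, weighted by the bin's share of samples."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (confidences > lo) & (confidences <= hi)
        if m.any():
            err += m.sum() / n * abs(correct[m].mean() - confidences[m].mean())
    return err

def auarc(uncertainty, correct):
    """Area under the accuracy-rejection curve: refer the most uncertain
    samples first and average the accuracy over the retained fractions."""
    order = np.argsort(np.asarray(uncertainty))        # most certain first
    correct = np.asarray(correct, float)[order]
    accs = [correct[:k].mean() for k in range(len(correct), 0, -1)]
    return float(np.mean(accs))
```

A model whose errors concentrate among its most uncertain samples retains high accuracy after referral and thus scores a high AUARC.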
3) OUT-OF-DISTRIBUTION UNCERTAINTY

As we expect a well-calibrated model to exhibit high uncertainty on unfamiliar samples that have not been part of the training data, we collect each model's predictive uncertainty estimates for the OOD datasets and measure how well the OOD samples can be separated from the EyePACS samples solely based on the model's predictive uncertainty. To this end, we analyze the AUROC performance for detecting the OOD samples based on each model's total (AUROCt) and epistemic uncertainty estimates (AUROCe). Thereby, we compute the epistemic uncertainty as [12], [29]

U_epist = U_tot − U_aleat, (13)

and equalize the number of samples of the EyePACS and the OOD dataset by random subsampling to create a balanced setting. We define the set of metrics to measure the OOD uncertainty calibration as M_OOD = {AUROCt, AUROCe}.
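The full entropy-based decomposition can be written compactly for a single test sample. A NumPy sketch of the estimators defined above (function names are ours):

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a (batch of) categorical distribution(s)."""
    return -(p * np.log(p + eps)).sum(axis)

def decompose_uncertainty(mc_probs):
    """mc_probs: (S, C) Monte Carlo predictive distributions for one sample.
    U_tot   = H[mean_s p_s]   (total predictive uncertainty)
    U_aleat = mean_s H[p_s]   (aleatoric)
    U_epist = U_tot - U_aleat (epistemic, eq. (13))"""
    mc_probs = np.asarray(mc_probs)
    u_tot = entropy(mc_probs.mean(0))
    u_aleat = entropy(mc_probs, axis=-1).mean()
    return u_tot, u_aleat, u_tot - u_aleat
```

When all Monte Carlo samples agree, the epistemic term vanishes; disagreement between samples, as expected on OOD inputs, drives it up.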

G. STATISTICS
We apply multiple analyses of variance (ANOVAs) to assess whether the observed ID results in Table 1 and the OOD outcomes in Fig. 3 show statistically significant differences for each applied metric individually. If statistically significant differences are found, post hoc two-sample t-tests are applied to compare each of the selected SVDKL-based models to all baseline models. To estimate statistical significance in the ablation study, the Wilcoxon signed-rank test is applied to the distribution of the performance differences in Fig. 4, independently for each factor and metric. To correct for the occurrence of Type-I errors, the Bonferroni-Holm correction is applied in both test settings. A significance level of α = 0.05 is used to determine statistical significance.
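The Bonferroni-Holm (Holm step-down) correction compares the k-th smallest of the m p-values against α/(m − k) and stops at the first failure. Our own minimal sketch (statsmodels' multipletests with method='holm' provides an equivalent, well-tested implementation):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject decision per hypothesis under Holm's step-down
    procedure: the k-th smallest p-value (k = 0, 1, ...) is tested
    against alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

Compared to the plain Bonferroni bound α/m for every test, the step-down scheme is uniformly more powerful while controlling the family-wise error rate at the same level.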

A. BENEFIT OF SVDKL EXTENSIONS
Firstly, we explore the occurrence of the pathological behavior recently observed in the literature [20], [21], [22], [23], causing uncalibrated model uncertainty when using the basic SVDKL, and analyze the benefit of the proposed extensions for the aleatoric and epistemic uncertainty calibration on both ID and OOD data. To this end, we introduce the set A_I of all combinations of SVDKL and the proposed extensions.

1) IN-DISTRIBUTION ANALYSIS
To analyze the ID uncertainty calibration, we compute the average performance over all t ∈ T randomly seeded trials for each SVDKL model in A_I evaluated for all configurations contained in the set D_ID × M_ID,U × Q. The results are displayed in Fig. 2. From these, we overall observe both the AUARCt and AUARCa to be increasingly improved by the proposed extensions for the sDR grading task throughout both ID datasets. As a result, our findings show the SVDKL models extended with both SN and MCD and combined into deep ensembles (DE-SN-MCD-SVDKL) to provide superior performance, i.e. we observe a relative performance gain in the AUARCt of about 12% on the EyePACS test data from the basic SVDKL model to the fully extended model architecture. Despite being less pronounced, this also applies to the binary rDR detection task, for which we observe a relative performance gain of about 2% for the same comparison as above. In contrast to this clear trend, the measured ECE exhibits high noise, making its interpretation difficult: For the sDR grading on the EyePACS dataset, a general trend of the ECE score to worsen when adding any extension to the plain SVDKL model can be observed. However, for the rDR detection and on the IDRiD data, the ECE improves analogously to the AUARC, except for the SN-SVDKL for the binary and the DE-SN-MCD-SVDKL for the multiclass task.

2) OUT-OF-DISTRIBUTION ANALYSIS
To validate the alignment of uncertainty with the occurrence of unfamiliar samples, we compute the AUROC performance according to section II-F3 for discriminating the ID EyePACS samples from those of the OOD datasets solely based on the captured uncertainty estimates. To this end, analogously to the above analysis, we compute the average performance over all independent, randomly initialized trials t ∈ T for each SVDKL model in A_I evaluated for all configurations in the set D_OOD × M_OOD × Q, for which the results are displayed in Fig. 2b.

[FIGURE 2 caption, partially recovered: ... (6), and DE-SN-MCD-SVDKL (7) for both the sDR grading (top row) and rDR detection (bottom row) tasks. For better visual clarity, the non-DE (left) and DE-based (right) models are grouped within each subplot. For the ID analysis in (a), the displayed performance is normalized to the basic SVDKL's (model 0) average performance on the EyePACS data. Note that the negative ECE (NECE) is displayed, and pairwise identical scales are applied to the AUARC plots of the sDR and rDR tasks as well as to the NECE, for better comparison and to retain the qualitative differences of the relative performance increases. For the OOD analysis in (b), the displayed OOD performance is normalized to the AUROC performance corresponding to a random referral.]

Matching the findings of van Amersfoort et al. [23], we on average observe a poor, near-random separation performance, i.e. the relative performance gain over a random baseline falls below 11% when using the basic SVDKL model to detect the RFMID samples based on the total predictive uncertainty. Similarly, the results show a low detection performance for the OOD samples of the SIIM-ISIC dataset, i.e. the relative performance gain over a random baseline falls below 26%. However, the epistemic uncertainty-informed OOD detection performs significantly better. In particular, for the sDR grading, we observe an average relative performance gain of approximately 53% for the detection of the SIIM-ISIC data using the plain SVDKL, corresponding to a detection performance of AUROCe ≈ 77%, which is typically considered fairly good. Furthermore, by adding the proposed extensions, we observe a striking improvement of the OOD detection for both the near- (RFMID) and far-OOD (SIIM-ISIC) datasets, showing a better alignment of the observed uncertainty with the occurrence of OOD samples. Thereby, in particular, the epistemic uncertainty benefits from the extended model architectures and sampling schemes. With this, the observed relative performance gain over a random referral for the fully extended SVDKL model (DE-SN-MCD-SVDKL) in the sDR setting is on average 64% and 95%, corresponding to a detection performance for the RFMID and SIIM-ISIC samples of AUROCe ≈ 82% and AUROCe ≈ 98%, respectively, i.e. a very high, near-perfect detection rate. Similar to the findings from the ID performance analysis, we observe the OOD sample detection to be impaired when using spectral normed SVDKL models. Following this ambiguity, the best detection performance is achieved by extending the SVDKL model with MCD and building DEs (DE-MCD-SVDKL).

B. BASELINE COMPARISON
As we observe the basic SVDKL method to lack sufficient uncertainty calibration, we select the overall best-performing DE-SN-MCD-SVDKL and DE-MCD-SVDKL models for the subsequent baseline comparison, for which we define the set of models A_II, and subsequently analyze whether the selected SVDKL-based models can add value to the ID diagnostic performance and uncertainty calibration, as well as to the quality of the uncertainty estimates on the OOD data.

1) IN-DISTRIBUTION ANALYSIS
The results for the ID baseline comparison are presented in Table 1. These show the observed average performance over all randomly seeded trials t ∈ T for each model in A_II evaluated for each task setting configuration in D_ID × (M_ID,P ∪ M_ID,U) × Q, with the standard deviation enclosed in brackets. The presented results show an overall higher mean performance for the SVDKL-based models, which partially (most frequently for the AUROC, QWK, and AUARCt scores) exhibits a statistically significant improvement over all compared baseline models. The largest gain is visible for the QWK, which is, for example, improved by 2.1% for the multiclass sDR task on the IDRiD data when comparing the DE-SN-MCD-SVDKL to the DE-MCD-CNN, suggesting that the predictions are more reliable. That is, they are better aligned with the target severity grades, and severe misclassifications are observed less frequently compared to the baselines. Similarly, we find the AUARC performance for the multiclass sDR task to increase by 1.8-1.9% when using the DE-SN-MCD-SVDKL over the DE-CNN on the EyePACS data. Thereby, we observe the AUARCa to be slightly outperformed by the AUARCt, indicating that the additional epistemic uncertainty information allows us to reliably identify samples within the ID test sets that are unfamiliar from model training. As in the previous analysis, the ECE score is an exception to this trend: the lowest mean is achieved using the plain CNN, and it increases for the BNN baselines as well as the SVDKL-based models for the sDR grading task on the EyePACS dataset. In contrast, the score decreases or is approximately on par with the plain CNN for the rDR detection task and on the IDRiD dataset. Interestingly, the findings show a reduced ACC, and consequently also a reduced AUARC performance, for the sDR grading on the IDRiD data throughout the methods. Nevertheless, we observe both the baselines and the SVDKL-based models to generalize well overall, i.e. they are mostly agnostic to the country and target shifts imposed by the IDRiD data and to the task generalization to the rDR detection. Furthermore, the presented results outline that using either of the fully extended SVDKL-based architectures can improve over the standard approximate BCNNs concerning both diagnostic performance and uncertainty calibration, as measured through the AUARC.

2) OUT-OF-DISTRIBUTION ANALYSIS
Analogously to the OOD analysis of the previous section, in the following, we evaluate each method's ability to detect unfamiliar image samples by using the total and epistemic predictive uncertainty estimates. However, to obtain a more detailed measure of the near-OOD uncertainty calibration, i.e. whether higher uncertainty estimates are indeed related to samples containing unfamiliar pathologies and diseases, we introduce two new subsets of the RFMID data: The first, the RFMID-O subset, only contains samples affected by any disease other than DR. In contrast, the second subset, which we refer to as RFMID-DR, only comprises subjects who are healthy or suffer from DR but no other disease, to mimic the ID data. In addition to the latter subset, we also add the ID IDRiD data, both serving as a baseline for which we would ideally observe a lower detection performance compared to the true OOD datasets due to the data and task similarity of the RFMID-DR and IDRiD to the EyePACS data. The results of this analysis are displayed in Fig.
3, which shows the average performance over all randomly seeded trials t ∈ T for each model in A_II, along with its standard deviation, evaluated for all possible configurations. As expected, both the approximate BCNNs as well as the selected SVDKL models outperform the baseline CNN in detecting samples originating from the far-OOD SIIM-ISIC dataset within both the rDR detection and sDR grading when using the total uncertainty estimate. Thereby, the approximate BNNs provide a strong baseline and partly outperform the SVDKL models. This difference is, however, less pronounced for the epistemic uncertainty-based OOD detection, for which we observe near-perfect OOD detection for both the baselines and the SVDKL models, with an average AUROC > 94%. Referring to the more difficult task of distinguishing between the EyePACS and RFMID data, both SVDKL-based models show a statistically significantly higher mean detection performance based on the epistemic uncertainty estimate (AUROCe), i.e. we observe an improvement of +2.0-3.0% for the sDR grading and +1.8-3.1% for the rDR detection task. Overall, we also observe the performance for the RFMID-O samples to be higher than for the full RFMID data, outlining that the models are indeed more uncertain on the samples containing unfamiliar pathologies related to eye diseases other than DR. Accordingly, we measure a lower detection performance for separating the EyePACS data from the RFMID-DR subset and the IDRiD data, i.e. we observe the Bayesian baselines and SVDKL models to yield well-calibrated uncertainty estimates that correlate well with the semantic distance to the training dataset. Furthermore, comparing the two task settings in general, we observe a higher detection performance within the sDR grading than for the rDR task. Interestingly, we observe a fairly good AUROCe performance for the sDR grading on the IDRiD data across all methods, i.e.
the IDRiD samples can be separated quite well from the EyePACS data and thereby exhibit lower uncertainty. Moreover, the findings show a worse-than-random detection performance across all compared methods within the rDR setting, i.e. all models are equally more confident on the IDRiD than on the EyePACS data. This, on the one hand, suggests the distribution shift between both datasets to be larger than expected and, on the other hand, that relevant information about the model, i.e. epistemic, uncertainty is lost due to the binary aggregation.

C. ABLATION STUDY
Following the previous analysis, we observed the extended SVDKL models to benefit both the diagnostic performance and the quality of the uncertainty estimates. In particular, we observe the fully extended SVDKL models to provide valuable information for detecting the near-OOD samples of the RFMID dataset. Nevertheless, the standard approximate BCNNs yield a strong baseline performance. Hence, in the following, we aim to disentangle the individual effects of the extensions and compare these to the performance improvement that can be achieved by using an SVDKL-based BNN. To this end, we deploy every possible combination of SVDKL- and CNN-based models with all the extensions examined in this paper and evaluate their impact on the performance within both the ID and OOD analysis. In detail, we define the set A_III of every possible combination of base model and extension. Analogously to the preceding analyses, we then compute the average performance scores P_z of all randomly seeded trials t ∈ T for each model a ∈ A_III, evaluated on every evaluation configuration z ∈ Z, whereby Z_ID = D_ID × (M_ID,P ∪ M_ID,U) and Z_OOD = D_OOD × M_OOD. Subsequently, we compute the average performance differences

ΔP_{z,a,a'} = P_{z,a} − P_{z,a'}    (19)

for paired trials that share the same evaluation configuration z ∈ Z and opposing model configurations a and a'. Thereby, opposing model configurations correspond to either using a CNN- or SVDKL-based model and dis- or enabling a
146180 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
single extension (SN, DE, MCD) while keeping the other settings fixed. To give an example, we compute the differences for the configuration pairs (a, a') = (SN-MCD-SVDKL, SN-MCD-CNN) and (SN-MCD-SVDKL, SN-SVDKL). Finally, we aggregate the differences ΔP_{z,a,a'} over the task settings Q and test datasets D_ID/OOD into a single set P per metric M_ID/OOD and extension or base model, for which we provide the box-violin plots displayed in Fig. 4.
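The paired-difference computation of Eq. (19) can be sketched as follows; the results table and configuration names are hypothetical placeholders, not values from the paper:

```python
# Hypothetical results table P[(z, a)]: mean score for evaluation
# configuration z and model configuration a (values are made up).
P = {
    ("sDR/EyePACS/ACC", "SN-MCD-SVDKL"): 0.86,
    ("sDR/EyePACS/ACC", "SN-MCD-CNN"): 0.84,
    ("rDR/EyePACS/ACC", "SN-MCD-SVDKL"): 0.90,
    ("rDR/EyePACS/ACC", "SN-MCD-CNN"): 0.88,
}

def paired_deltas(P, a, a_prime):
    """Delta P_{z,a,a'} = P_{z,a} - P_{z,a'} over all configurations z
    for which both (opposing) model configurations were evaluated."""
    shared = {z for (z, m) in P if m == a} & {z for (z, m) in P if m == a_prime}
    return {z: P[(z, a)] - P[(z, a_prime)] for z in sorted(shared)}

deltas = paired_deltas(P, "SN-MCD-SVDKL", "SN-MCD-CNN")
```

Pairing over shared configurations z before aggregating isolates the effect of the single toggled factor from the remaining experimental variation.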
The results of the ablation analysis show the deep ensembles to have a strong, significant impact that mostly exceeds the effect of every other extension. Similarly, but less pronounced, adding Monte Carlo dropout significantly improves both the quality of uncertainty and the model performance across the metrics. Again, the ECE constitutes an exception to this trend. Considering the impact of using SVDKL, we overall observe similar proportions of the conducted trials to either decrease or improve the model performance and uncertainty calibration. This suggests that, under specific circumstances, using an SVDKL-based method can hurt the model performance compared to not using SVDKL at all. We attribute this to the fact that, e.g., the basic SVDKL model performs at best on par with or worse than the baseline CNN, i.e. for the sDR grading on the EyePACS dataset the performance is decreased by 0.4-22.9% for all ID metrics (M_ID,P ∪ M_ID,U). Nevertheless, the results also show a substantial proportion of trials for which adding SVDKL indeed induces a positive effect, e.g., for the ACC, QWK, NLL, ECE, AUROCt, and AUROCe, which is in line with the findings of the baseline comparison. Similarly, constraining the feature extractor to be approximately bi-Lipschitz by applying spectral normalization had a partly negative impact, i.e. a high proportion of the trials showed a performance decrease. Nonetheless, particularly for the AUROCe, the performance could be improved within some trials by additionally constraining the feature extractor to be approximately bi-Lipschitz.

IV. DISCUSSION
Our findings confirm the occurrence of the pathologies caused by the basic SVDKL as observed in the literature [20], [21], [22], [23]. This highlights the necessity of regularizing the SVDKL's feature extractor to produce a distance-aware mapping, either by enforcing the mapping to be approximately bi-Lipschitz or by fully Bayesian modeling. We particularly observe the latter to improve both the ID and the OOD uncertainty calibration, i.e. to better align the uncertainty estimates with the model's failures or the occurrence of OOD samples. Because building DEs and applying MCD aim to improve model diversity, and applying SN guides the model towards learning a distance-aware latent space representation, both of which prevent OOD samples from being mapped into the same region as ID samples, we expectedly observe the improvement to be greater for the epistemic than for the aleatoric uncertainty estimates. Nevertheless, by using the extended architectures over the basic SVDKL model, we could also improve the aleatoric uncertainty estimates. In contrast to the consistent performance improvement observed from using MCD and DE, our findings revealed the spectral norm to exhibit instabilities, and only sporadic improvements in the quality of uncertainty could be achieved. With this, our results, unexpectedly, do not fully support the overall positive findings of Liu et al. [22]. On the one hand, this might be caused by the method lacking guarantees for a true bi-Lipschitz mapping [23], i.e. too loose bounds on the spectral norm might not sufficiently prevent the feature collapse and could consequently cause arbitrarily wrong uncertainty estimates. On the other hand, too tight bounds could prevent the feature extractor from learning a meaningful feature mapping. In accordance with this, we experienced the bound on the spectral norm to be difficult to optimize, potentially leading to an unreliable mitigation of the feature collapse pathologies.
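The spectral-norm constraint discussed above can be illustrated with a minimal, framework-free sketch: the largest singular value of a weight matrix is estimated by power iteration and the weights are rescaled whenever it exceeds a Lipschitz bound c. In practice one would rely on a library routine such as PyTorch's torch.nn.utils.spectral_norm; the functions below are purely illustrative:

```python
def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of weight matrix W (list of rows)
    by power iteration, as used for spectral normalization."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(n_iter):
        # u <- W v (normalized), v <- W^T u (normalized)
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        nu = sum(x * x for x in u) ** 0.5
        u = [x / nu for x in u]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = sum(x * x for x in v) ** 0.5
        v = [x / nv for x in v]
    # sigma_max is approximated by u^T W v
    return sum(u[i] * W[i][j] * v[j] for i in range(rows) for j in range(cols))

def project_spectral_norm(W, c=1.0):
    """Rescale W so that its spectral norm is at most c (the Lipschitz bound)."""
    s = spectral_norm(W)
    return W if s <= c else [[w * c / s for w in row] for row in W]
```

The difficulty discussed above corresponds directly to the choice of c: too large and feature collapse is not prevented, too small and the rescaling destroys expressiveness.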
Regarding the findings of the ID analysis in section III-B, we observe the extended SVDKL-based architectures (DE-SN-MCD-SVDKL and DE-MCD-SVDKL) to mostly outperform, or at least be on par with, both the deterministic and the Bayesian baseline models across all analyzed settings and metrics. Furthermore, our findings highlight the models' ability to generalize well across distribution shifts and different task settings. As competing with the state-of-the-art diagnostic performance is beyond the scope of this work, both the baseline and SVDKL models do not fully reach the performance reported in the latest literature [12], [13], [14], [15], [46]. Nonetheless, our findings suggest that SVDKL can improve the diagnostic performance over the presented baselines, and we additionally expect the performance of the SVDKL framework to improve accordingly by applying standard scaling methods for deep learning, such as using high-resolution images, higher-capacity base models, and more sophisticated training schemes. We particularly assume that using high-resolution images would significantly improve the diagnostic performance according to [12], [13], [14], [15], [35], and [47].
Considering the ID uncertainty calibration, the obtained results for the ECE are very ambiguous, limiting profound reasoning on the model calibration based on this specific score. Despite the ECE being a frequently used metric for measuring model calibration, this ambiguity resembles the inconsistency of the metric observed in recent studies [34], [48]. Indeed, we observe an improvement in the ID uncertainty calibration, measured through the AUARC, for the extended SVDKL models. Moreover, our results show a significant improvement in the near-OOD detection task, i.e. on the RFMID dataset, from using the extended SVDKL models. This demonstrates that the SVDKL framework is more effective than the approximate BCNNs at detecting minor distribution shifts and highlights the model's high sensitivity to subtle alterations in the input space, which can be measured through the model's epistemic uncertainty estimates. With this, our findings suggest that the extended SVDKL models are more likely to recognize their lack of knowledge in the DR screening application, which, as a result, could minimize the risk of missing pathologies the network was not trained to detect. This is especially critical because specialists are only involved upon request, as indicated either by the model's uncertainty or when referable DR is detected; a missed pathology would otherwise prevent patients from receiving necessary therapeutic interventions. Similar to the findings by Hüllermeier et al. [30], we observe the epistemic uncertainty to be task-dependent, i.e. it partly collapses due to the aggregation from the DR severity to the referable DR task setting, limiting the ability to separate ID from OOD samples. This emphasizes the significance of expanding the DR severity grading task beyond the regularly analyzed referable DR detection to enable more insightful decision-making and improve clinical usability [13], [49].
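The epistemic/aleatoric split relied on above is commonly obtained from stochastic forward passes (MC dropout samples or ensemble members) via the mutual-information decomposition of the predictive entropy; a minimal sketch (the function name is ours):

```python
import math

def decompose_uncertainty(probs):
    """Split total predictive uncertainty into aleatoric and epistemic parts
    from S stochastic forward passes. probs: S x C list of per-pass class
    probability vectors.

    total     = H[mean_s p_s]      (predictive entropy)
    aleatoric = mean_s H[p_s]      (expected entropy)
    epistemic = total - aleatoric  (mutual information)
    """
    S, C = len(probs), len(probs[0])
    entropy = lambda p: -sum(x * math.log(x) for x in p if x > 0.0)
    mean_p = [sum(p[c] for p in probs) / S for c in range(C)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in probs) / S
    return total, aleatoric, total - aleatoric
```

Two confident but contradicting passes yield zero aleatoric but maximal epistemic uncertainty, which is exactly the disagreement signal exploited for near-OOD detection; under binary aggregation such disagreement can cancel out, matching the observed task dependence.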
Comparing the obtained quality of the ID and OOD uncertainty estimates to the related literature is difficult due to the lack of a standardized, quantitative evaluation protocol.
Frequently, only qualitative results are provided that highlight the overall importance and value of using Bayesian approaches, i.e. of allowing deep learning models to communicate their uncertainties, for example by showing the distribution of observed uncertainties for erroneous versus correct predictions, or for ID versus OOD samples, and by reporting the AUROC and QWK performance for distinct fractions of referred patients [13], [14], [15]. In contrast, Band et al. [12] also report quantitative results for the AUARC performance and the ECE score on the EyePACS dataset, which were best for a ResNet-50-based model that is similar to the DE-MCD-CNN used in this work. Whereas their observed ECE of 2.0 ± 0.0 is slightly lower than that of our DE-SN-MCD-SVDKL architecture, which achieves a score of 3.8 ± 0.7, the model proposed by Band et al. performs similarly with respect to the AUARC metric, i.e. 97.3% vs. (97.0 ± 0.1)% in our work. However, Band et al. compute the AUARC score by varying the referral threshold instead of the fraction of referred samples, which renders the AUARC score more sensitive to the model temperature [34] and limits the comparability to the AUARC results reported in our work.
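The referral-fraction variant of the AUARC mentioned above can be sketched as follows (a simplified illustration under our own naming, not the authors' evaluation code):

```python
def auarc_referral_fraction(uncertainty, correct):
    """Area under the accuracy-vs-referral curve, sweeping the *fraction* of
    referred samples: at each step the most uncertain remaining sample is
    referred, and accuracy is recomputed on the retained samples."""
    order = sorted(range(len(correct)), key=lambda i: uncertainty[i])
    accs = []
    for keep in range(len(order), 0, -1):  # refer 0, 1, ..., N-1 samples
        retained = order[:keep]  # the `keep` least uncertain samples
        accs.append(sum(correct[i] for i in retained) / keep)
    return sum(accs) / len(accs)  # simple average over referral fractions
```

Because only the ranking of the uncertainties enters this variant, it is invariant to monotone rescaling (e.g. temperature), whereas sweeping an uncertainty threshold directly is not, which is the comparability caveat noted above.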

V. CONCLUSION
Because deterministic deep neural networks (DNNs) generally do not provide calibrated uncertainty estimates, and standard approximate Bayesian neural networks (BNNs) often yield insufficient uncertainty calibration as well, this work analyzes whether the promising stochastic variational deep kernel learning (SVDKL) approach, which combines DNNs with Gaussian processes into an end-to-end trainable model, can improve the quality of the predictive uncertainty on the tasks of DR severity grading and referable DR detection. Our experiments confirm the occurrence of the feature collapse of SVDKL observed in the literature and highlight the need to adopt, e.g., fully Bayesian feature extractors into the SVDKL framework to obtain well-calibrated ID and OOD uncertainty estimates [20], [21]. Moreover, our findings show that our proposed model, i.e. a fusion of the EfficientNet-B0 and the extended SVDKL framework, yields a gain in both uncertainty calibration and diagnostic performance over either of the proposed approximate BCNNs. Whereas the use of DEs and MCD alone already provides simple and fast improvements in uncertainty calibration, additionally deploying SVDKL with proper extensions and tuning can improve upon those results, albeit at the expense of increased computational complexity. Particularly for safety-critical medical applications, such as the automated DR screening analyzed in this work, which require the deep learning model to have a good understanding of its limitations, the additional application of SVDKL turns out to be beneficial, as it provides better near-OOD detection performance through the epistemic uncertainty. These results also highlight the benefit of disentangling epistemic from aleatoric uncertainty in order to detect unknown inputs. Along with the improved reliability measured through
the QWK and the observed good alignment of the ID uncertainty with the occurrence of erroneous predictions, we find our EfficientNet-B0-based SVDKL models to have the potential to reduce the risk of severe misclassifications and of missing pathologies or diseases in a referral-based screening setting that could otherwise have been detected by a specialist, and possibly even to improve the diagnostic performance; this remains to be analyzed in more detail on real clinical data in future research. Furthermore, improving the uncertainty calibration of SVDKL with more sophisticated and less computationally demanding methods than MCD and deep ensembling to create fully Bayesian SVDKL models would be of great interest to increase the practical benefits and facilitate translation into clinical application.
of the two: Let D = {X, y} be a dataset with X ∈ R^{N×I_c×I_h×I_w}, y ∈ R^N, N samples, and the input image's number of channels I_c, height I_h, and width I_w. Furthermore, we define a CNN as an approximation to the dataset's underlying, unknown function according to y_i ≈ f(h(x_i, θ_FE), θ_FC) for i ∈ {1, . . ., N}, with the softmax-activated final fully connected layer (FC) f(·), the convolutional feature extractor (FE) h(·), and the model weights θ = {θ_FE, θ_FC}. Then, from a Bayesian perspective, model training yields a maximum a posteriori (MAP) estimate

θ̂ = arg max_{θ ∈ Θ} p(y | X, θ) p(θ),
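Maximizing p(y | X, θ) p(θ) is equivalent to minimizing the summed negative log terms; with an isotropic Gaussian prior over the weights, the prior term reduces to ordinary L2 weight decay. A small illustrative sketch (the function name and parameters are ours):

```python
def neg_log_posterior(nll, theta, tau=1.0):
    """Negative log of p(y | X, theta) * p(theta): the MAP training objective.
    With an isotropic Gaussian prior p(theta) = N(0, tau^2 I), the prior term
    reduces to L2 weight decay, up to an additive constant."""
    l2 = sum(t * t for t in theta)  # squared L2 norm of the weights
    return nll + l2 / (2.0 * tau * tau)
```

This makes explicit that standard regularized training yields only a point estimate of θ, which is why the Bayesian extensions considered in this work are needed for calibrated uncertainty.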

FIGURE 2. Relative performance [%] of the applied metrics comparing the SVDKL model variants, i.e. SVDKL (0), SN-SVDKL (1), MCD-SVDKL (2), SN-MCD-SVDKL (3), DE-SVDKL (4), DE-SN-SVDKL (5), DE-MCD-SVDKL (6), and DE-SN-MCD-SVDKL (7), for both the sDR grading (top row) and the rDR detection (bottom row) task. For better visual clarity, the non-DE (left) and DE-based models (right) are grouped within each subplot. For the ID analysis in (a), we normalize the displayed performance to the basic SVDKL's (model 0) average performance on the EyePACS data. Note that we display the negative ECE (NECE) and apply pairwise identical scales to the AUARC plots of the sDR and rDR task, as well as to the NECE, for better comparison and to retain the qualitative differences of the relative performance increases. For the OOD analysis in (b), the displayed OOD performance is normalized to the AUROC performance corresponding to a random referral.

TABLE 1. Comparison of the ID test performance for the EyePACS and IDRiD datasets and both the sDR grading and the rDR detection task. The highest mean values per metric are displayed in bold. Statistical significance according to section II-G is indicated with an asterisk.

FIGURE 4. Distribution of the performance differences between using a CNN or SVDKL base model and any model with or without each main factor, i.e. SVDKL ( ), SN ( ), MCD ( ), and DE ( ), for (a) the ID and (b) the OOD performance metrics. The results are displayed as box-violin plots showing the median and IQR, accompanied by the violins' KDE estimates covering the 1.5-IQR range. More extreme points are considered outliers and are marked by a diamond (♦). Statistical significance estimated according to section II-G is indicated with an asterisk above the respective violin. A higher delta indicates that adding the base model or extension improved the respective metric. Note that we display the log-likelihood (LL) and negative ECE (NECE) in (a) for better comparison, and that the H1 hypothesis of the reported statistical test corresponds to whether a statistically significant majority of paired differences improves or decreases the individual performance metric with respect to the selected base model or extension.