AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge

The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.


Introduction
Glaucoma is one of the main causes of irreversible blindness and impaired vision in the world. It affects the optic nerve, which connects the eye with the brain, and leads to progressive visual field damage. This damage initially passes unnoticed by the patient. Only in later stages will glaucoma patients experience visual loss. According to estimates, by 2040, over 110 million people will have varying degrees of visual impairment caused by glaucoma [1], with 10% becoming blind in both eyes and 25% in one eye [2]. Many people experience visual impairment from glaucoma because it is often not detected until later stages [3,4]. Current treatments of glaucoma cannot repair the damage, but can only halt or slow the progression of the condition [5]. Implementing screening programs to identify patients early on for treatment can alleviate the consequences of the disease. Artificial intelligence (AI) may be the enabling technology for the cost-effective implementation of these programs by automatically detecting perimetric glaucoma (i.e., glaucoma in which there is already visual field damage) in color fundus photographs (CFPs) [6,7,8,9,10].
Existing AI solutions have been shown to drop in performance in real-world screening practice due to comorbidities, poor quality images, different ethnicities, or unexpected out-of-distribution (OOD) samples [9]. Ad-hoc quality check modules have been added to AI solutions to overcome this performance drop, but recent research has indicated that these quality checks are not sufficiently accurate when deployed in real-world settings [11]. To allow a safe and effective deployment in screening, the reliability and robustness of such solutions need to be assessed. Medical image analysis challenges often focus exclusively on performance metrics that are potentially unrealistic and overestimated due to the use of test sets that do not represent real-world scenarios. Moreover, metrics that measure reliability and robustness are often neglected because they are difficult to estimate on the provided test sets.
With the aim of developing solutions that overcome the aforementioned issues related to robustness in glaucoma screening, we organized the Artificial Intelligence for RObust Glaucoma Screening (AIROGS) challenge. The goal of this challenge was to evaluate the feasibility of developing a state-of-the-art, reliable AI solution that takes a CFP as input and provides as output the likelihood of referable glaucoma, accompanied by outputs for robustness (i.e., predicting whether the input image can be graded reliably or not).
To encourage the development of solutions that are robust to any kind of ungradable and unexpected input data and are equipped with inherent robustness mechanisms, the training set we provided was a subset of the full AIROGS dataset where only gradable images were included and ungradable images excluded. The test set, however, is unfiltered, containing all images found in screening settings (gradable and ungradable), representing a real-world scenario.
AIROGS was part of the International Symposium on Biomedical Imaging (ISBI) 2022 challenge program. It reopened after the results were presented at ISBI 2022, and submissions can still be made on Grand Challenge.
Our challenge, along with the dataset we made publicly available, distinguishes itself from previous glaucoma challenges. First, to the best of our knowledge, our dataset is the largest publicly available CFP dataset with glaucoma labels by a large margin. In total, our dataset contains 112,732 CFPs, exceeding the size of other publicly available datasets containing CFPs with glaucoma, of which the sizes range from 22 to 2,000 CFPs [12,13,14,15,16,17,18,19,20]. It is a highly diverse dataset as it originates from 500 screening centers across the United States of America and was acquired with a large variety of cameras. Second, the AIROGS challenge is the first challenge to emphasize robustness in glaucoma screening. Third, AIROGS is one of the first challenges on grand-challenge.org that requires participants to submit an algorithm (a Type 2 challenge), rather than a file with their predictions on the test set (a Type 1 challenge), as is done in more traditional challenges. This makes human intervention in the generation of test set results impossible, reducing the possibility of cheating. Moreover, it greatly improves reproducibility, allowing everyone to reuse the trained algorithms that were submitted and apply them to new data in a cloud-based environment. Fourth, this reproducibility enabled testing of the participating algorithms on three external datasets: two for evaluating the screening task and one for evaluating robustness.

The Rotterdam EyePACS AIROGS dataset
The Rotterdam EyePACS AIROGS dataset contains 112,732 CFPs from 60,071 subjects and 500 different sites with a heterogeneous ethnicity. The images were originally acquired for a diabetic retinopathy screening program [21].
For grading of the CFPs, all graders were trained and then selected for this task using the European Optic Disc Assessment Trial (EODAT) [22], containing 110 stereoscopic optic nerve photographs, in which all glaucomatous eyes had reproducible visual field defects on standard automated perimetry. Ninety experienced ophthalmologists and optometrists were examined and those who scored at least 85% overall accuracy and 92% specificity were selected to label images for the present study. Eventually, 30 out of 90 candidates passed.
For each eye, three images were taken by the camera operators to reduce the number of ungradable eyes. When labeling the images, graders classified one eye at a time. The labeling tool first presented the first CFP for each eye, upon which graders could choose from the options "Referable glaucoma" (RG), "No referable glaucoma" (NRG) or "Ungradable" (U). If a grader selected U for the first image, the tool showed the next CFP. The third image was presented if the second image was also deemed U. Each eye was scored by two separate graders, who were both unaware of the identity of the other grader. If the two graders agreed on the label of a CFP, this became the final label. If they disagreed, the image was scored by one of the glaucoma specialists who passed the EODAT test with at least 95% accuracy. The final label was then based on this specialist's judgement.
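The labeling protocol above can be sketched in a few lines of Python. This is an illustrative sketch only, not the labeling tool used in the study; the function names `images_presented` and `final_label` are hypothetical.

```python
from typing import List, Optional, Tuple

RG, NRG, U = "RG", "NRG", "U"

def images_presented(labels_in_order: List[str]) -> List[Tuple[int, str]]:
    """Return the (index, label) pairs a grader actually sees for one eye.

    labels_in_order holds the label the grader would give to each of the
    (up to three) images of the eye. The next image is only shown if the
    current one was labeled U.
    """
    shown = []
    for index, label in enumerate(labels_in_order[:3]):
        shown.append((index, label))
        if label != U:
            break
    return shown

def final_label(grader_a: str, grader_b: str, specialist: Optional[str] = None) -> str:
    """Final label of one CFP: agreement wins, otherwise the specialist decides."""
    if grader_a == grader_b:
        return grader_a
    if specialist is None:
        raise ValueError("graders disagree; a specialist label is required")
    return specialist
```

For example, an eye whose first image is ungradable and whose second image shows referable glaucoma leads to two presented images, and the third is never shown.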
The graders were instructed to select RG if they found glaucomatous signs which they expected to be associated with visual field defects on standard automated perimetry. The signs that could be selected were "appearance neuroretinal rim superiorly", "appearance neuroretinal rim inferiorly", "baring of the circumlinear vessel superiorly", "baring of the circumlinear vessel inferiorly", "disc hemorrhage(s)", "retinal nerve fibre layer defect superiorly", "retinal nerve fibre layer defect inferiorly", "nasalisation (nasal displacement) of the vessel trunk", "laminar dots" and "large cup". If the graders did not expect any glaucomatous visual field defects, NRG was to be selected, ignoring comorbidities such as age-related macular degeneration and diabetic retinopathy. If there was not enough information visible in the CFP to decide between RG and NRG, graders were instructed to select U.
The graders were not only evaluated at the start, but were also periodically monitored during the grading process. If their sensitivity dropped below 80% or their specificity dropped below 95%, they were removed from the study and all images they had labeled were re-graded by one of the remaining graders. If a grader wrongly classified a CFP as U while its final label was NRG or RG, their specificity or sensitivity went down, respectively. In the end, 20 graders remained.
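As a rough illustration of the monitoring rule, the sketch below computes a grader's sensitivity and specificity against the final labels, counting a wrongly assigned U as an error. The function names are hypothetical, and skipping images whose final label is U is an assumption that the text does not state.

```python
RG, NRG, U = "RG", "NRG", "U"

def grader_sensitivity_specificity(grader_labels, final_labels):
    """Sensitivity and specificity of one grader against the final labels.

    A U given to a final-RG image counts as a missed positive, and a U given
    to a final-NRG image counts as a missed negative. Images whose final
    label is U are skipped (assumption). At least one RG and one NRG image
    are required, otherwise a division by zero occurs.
    """
    pairs = list(zip(grader_labels, final_labels))
    on_rg = [g for g, f in pairs if f == RG]
    on_nrg = [g for g, f in pairs if f == NRG]
    sensitivity = sum(g == RG for g in on_rg) / len(on_rg)
    specificity = sum(g == NRG for g in on_nrg) / len(on_nrg)
    return sensitivity, specificity

def stays_in_study(sensitivity, specificity, min_sens=0.80, min_spec=0.95):
    """A grader is removed once either value drops below its minimum."""
    return sensitivity >= min_sens and specificity >= min_spec
```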

External datasets
The participants uploaded their trained algorithms, rather than a file with predictions on our test set, to our challenge platform. This enabled us to reuse the developed models on external data after the challenge ended. To evaluate model generalization and to demonstrate this reusability, we applied all trained algorithms to three external datasets: Retinal Fundus Glaucoma Challenge (REFUGE) [18], Glaucoma grAding from Multi-Modality imAges (GAMMA) [20] and Diabetic Retinopathy Image Database (DRIMDB) [23]. The former two are datasets with positive and negative glaucoma CFPs, which we used for externally evaluating the screening performance. We used the latter dataset, which contained different types of ungradable images, to externally evaluate the robustness.
The REFUGE test set contained 400 CFPs, of which 40 CFPs showed glaucoma and 360 CFPs did not. The definition of glaucoma was glaucomatous damage in the optic nerve head area and reproducible glaucomatous visual field defects, which is similar to our definition of glaucoma described earlier [18].
The GAMMA dataset is a multi-modal dataset with optical coherence tomography scans and CFPs for each eye. We used the CFP data from the 100-sample training set as only that subset of the GAMMA dataset had publicly available labels. We defined positive glaucoma as the union of the intermediate and advanced glaucoma stages. These stages were defined using the mean deviation (MD) from the visual field reports as follows: an MD between -6 dB and -12 dB for the intermediate stage and an MD worse than -12 dB for the advanced stage [20]. This resulted in 50 negative and 50 positive glaucoma samples.
DRIMDB is a dataset with 125 "Good" CFPs, 69 "Bad" CFPs and 22 "Outlier" CFPs. As the "Good" class was not necessarily defined as gradable for glaucoma, we manually confirmed that in all "Good" CFPs the optic disc (OD) was well visible. We defined both "Bad" and "Outlier" classes to be ungradable and the "Good" class to be gradable. "Bad" images were CFPs with low image quality and "Outlier" images were non-retinal images.

Challenge setup
The AIROGS challenge consisted of four phases (see Fig. 1). The Training Phase opened on the 1st of December 2021 and closed on the 4th of March 2022, providing the participants with approximately three months to develop their solutions. At the start of this phase, the training set was released and has since been available for download under the CC BY-NC-ND licence on Zenodo.
To ensure fair competition and to encourage the development of inherent robustness mechanisms, teams were not permitted to use additional fundus image training data. This restriction included weights pre-trained on fundus image data and also applied to pre-processing steps such as OD segmentation. Manually labeling the challenge data and using the resulting annotations during training was allowed.
To have their algorithms tested, participants needed to wrap their trained algorithm in a Docker container and submit it to our challenge platform. This allows the submitted algorithms to be run on data that is not directly accessible by the participating teams. Example code for generating such a containerized submission can be found on GitHub. Preliminary Test Phase 1 opened and closed simultaneously with the Training Phase and served as a check of whether the submitted algorithms could be run on the challenge platform and produced their output in the expected format. Algorithms were tested on 10 images from the training set for this check. All algorithms were executed on the challenge platform using an NVIDIA T4 GPU (16 GB VRAM) with 8 CPUs (32 GB RAM).
The test set was and still is closed, meaning the image data and the labels are private and cannot be downloaded.
Preliminary Test Phase 2 opened on the 1st of February 2022 and we allowed three submissions per team to this phase, as it used 10% of the test set for evaluation. All challenge metrics were also computed and reported back to the participants. The Final Test Phase opened simultaneously with Preliminary Test Phase 2, but algorithms were tested on 100% of the test data and only one submission per team was allowed. The challenge metrics computed for this phase were used for the final team ranking.
The algorithms were expected to produce four outputs, of which two were related to glaucoma screening performance (i.e., image classification of RG and NRG) and the other two to robustness (i.e., the identification of U). The glaucoma screening outputs were a likelihood score for RG (O1) and a binary decision for RG (O2, positive if RG and negative if NRG). The ungradability outputs were a binary decision on whether the image is ungradable (O3, positive if ungradable and negative if gradable) and a non-thresholded scalar value that is positively correlated with the likelihood of ungradability (O4), e.g., the entropy of a probability vector produced by a machine learning model or the variance of an ensemble. Output O2 was not used in the evaluation pipeline for the challenge leaderboard, but it was requested by the challenge organizers for further analysis.
The evaluation was also based on the two aspects of screening performance and robustness, with two metrics per aspect. Screening performance was evaluated using the standardized partial area under the receiver operating characteristic curve [24] (90-100% specificity) for RG (pAUC_S), and the sensitivity at 95% specificity (SE@95SP_S). These metrics were based on these specificity ranges, as a high specificity is required for cost-effective glaucoma screening due to its relatively low prevalence [25,26]. pAUC_S and SE@95SP_S are both based on output O1. For evaluating the robustness, we determined the model's agreement with the human reference on ungradability using Cohen's kappa score (κ_U), calculated using output O3. Furthermore, we calculated the area under the receiver operating characteristic curve using the human reference for ungradability as the true labels and output O4 as the target scores (AUC_U).
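The metrics described above can be illustrated with a small, dependency-free Python sketch. It is not the official evaluation code. The standardization of the partial AUC follows the common McClish-style rescaling, under which a chance-level curve maps to 0.5 and a perfect curve to 1.0, and it is an assumption that the challenge used exactly this variant. All function names are hypothetical.

```python
def roc_points(labels, scores):
    """ROC curve as (fpr, tpr) lists; labels are 0/1, higher score = more positive."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Samples with tied scores move the operating point together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def sensitivity_at_specificity(labels, scores, specificity=0.95):
    """Highest sensitivity among operating points with at least the given specificity."""
    fpr, tpr = roc_points(labels, scores)
    return max(t for f, t in zip(fpr, tpr) if 1.0 - f >= specificity)

def standardized_partial_auc(labels, scores, max_fpr=0.10):
    """Partial AUC over FPR in [0, max_fpr], rescaled so chance = 0.5 and perfect = 1."""
    fpr, tpr = roc_points(labels, scores)
    area = 0.0
    for k in range(1, len(fpr)):
        x0, y0, x1, y1 = fpr[k - 1], tpr[k - 1], fpr[k], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the curve at the FPR limit.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    min_area = 0.5 * max_fpr ** 2   # area under the chance diagonal
    max_area = max_fpr              # area under a perfect curve
    return 0.5 * (1.0 + (area - min_area) / (max_area - min_area))

def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) label lists; undefined if chance agreement is 1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)
```

A specificity range of 90-100% corresponds to `max_fpr=0.10`. In practice, an established library implementation would normally be preferred over hand-written code like this.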
To determine the final ranking, we first ranked all participants on the four individual metrics pAUC_S, SE@95SP_S, κ_U, and AUC_U, resulting in the rankings R_pAUC_S, R_SE@95SP_S, R_κ_U, and R_AUC_U, respectively. The final score S_final was then calculated as the mean of those rankings:

$$S_{\mathrm{final}} = \frac{R_{\mathrm{pAUC}_S} + R_{\mathrm{SE@95SP}_S} + R_{\kappa_U} + R_{\mathrm{AUC}_U}}{4}$$

The final ranking (later also referred to as Mean position) was based on S_final, where a lower value for S_final resulted in a higher ranking.
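The ranking scheme can be sketched as follows. The handling of ties within a single metric (tied teams share the better rank) is an assumption, as the text does not specify it. The function names and example values are hypothetical.

```python
def rank_descending(values_per_team):
    """Rank teams on one metric: 1 = highest value; tied teams share a rank (assumption)."""
    ordered = sorted(values_per_team.values(), reverse=True)
    return {team: ordered.index(value) + 1 for team, value in values_per_team.items()}

def final_scores(metrics):
    """metrics maps metric name -> {team: value}; returns {team: mean rank over metrics}."""
    rankings = [rank_descending(per_team) for per_team in metrics.values()]
    teams = list(next(iter(metrics.values())).keys())
    return {team: sum(r[team] for r in rankings) / len(rankings) for team in teams}

# Hypothetical values for two teams, for illustration only.
example = {
    "pAUC_S": {"A": 0.90, "B": 0.80},
    "SE@95SP_S": {"A": 0.85, "B": 0.80},
    "kappa_U": {"A": 0.70, "B": 0.80},
    "AUC_U": {"A": 0.99, "B": 0.95},
}
scores = final_scores(example)  # lower mean rank = better final position
```

In this example, team A is first on three metrics and second on one, giving a mean rank of 1.25 against 1.75 for team B.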
We calculated 95% confidence intervals (CIs) with non-parametric bootstrapping using 1000 iterations [27]. The code for evaluating submissions can be found on GitHub. The performance of human graders was calculated by comparing the labels given by the individual graders (excluding the two glaucoma specialists, since the final labels were equal to their decision in case of disagreement) to the final labels as defined in Section 2.1. To compute the performance of all human graders combined, each image was weighted equally in the calculation of the metrics. We also evaluated ensembles of participating algorithms, which were generated by averaging the outputs of these algorithms.
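A minimal sketch of a non-parametric bootstrap and of output averaging is given below. It uses the percentile method, which is an assumption, since the text only states that non-parametric bootstrapping with 1000 iterations was used. `accuracy_at_half` is a stand-in metric for illustration and not one of the challenge metrics.

```python
import random

def bootstrap_ci(labels, scores, metric, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of metric(labels, scores), resampling images with replacement.

    Assumes labels contain both classes; otherwise no valid resample exists.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(metric(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_iter)]
    upper = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper

def ensemble_average(outputs):
    """Average per-image outputs of several algorithms (one aligned list per algorithm)."""
    return [sum(per_image) / len(per_image) for per_image in zip(*outputs)]

def accuracy_at_half(labels, scores):
    """Stand-in metric: fraction of images classified correctly at threshold 0.5."""
    return sum(int(s >= 0.5) == y for s, y in zip(scores, labels)) / len(labels)
```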

Participating Methods
Fifteen teams submitted a working solution to the Final Test Phase, of which one team did not opt in to contribute to the current paper. In this section, we present the methods of the fourteen participating teams. More extensive descriptions are available on the AIROGS challenge website and a selection of the participating methods was included in the ISBI challenge proceedings [36,37,38]. Tables 2 and 3 summarize the participating methods in a structured manner.

The PUMCH-eye team proposed an approach with five trained models in their workflow. The first model (M_disc) was a segmentation model with ResNet101-UperNet [39] as the backbone that segmented the OD in the input CFP. For the development of this model, they manually labeled the OD in 40 images. In case M_disc successfully detected the OD, they computed the center c and the diameter d of the segmentation to crop the input image around c with size 3d. This cropped image was then fed into a vision transformer [40] for the binary classification of RG and NRG. If the OD detection was unsuccessful, they fed the original input image to a different vision transformer for binary classification of RG and NRG.
The team also developed a vessel segmentation model (M_vessel) with 40 images in which they manually annotated vessels. They trained a ResNet-18 (R_vessel) which took the output of M_vessel as input data, using the first 500 images in the training dataset and 100 manually selected images in the training set with relatively poor image quality. This classification model served as one of the inputs for ungradability classification. The second input was taken from M_disc. The ungradability likelihood output (O4) was then defined as the output likelihood of the binary classification model R_vessel (i.e., O_vessel) or, if M_disc could not detect an OD, as the output of R_vessel plus 0.75. O3 was positive if O4 was at least 0.95 and negative otherwise.
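The decision rule for O3 and O4 described above can be written as a short sketch. This is a reading of the text and not the team's code; the function and parameter names are hypothetical.

```python
def pumch_ungradability(o_vessel, od_detected, offset=0.75, threshold=0.95):
    """Return (O3, O4) from the vessel-based classifier output and the OD detection result.

    o_vessel: output likelihood of the vessel-based ungradability classifier.
    od_detected: whether the disc segmentation model found an optic disc.
    """
    o4 = o_vessel if od_detected else o_vessel + offset
    o3 = o4 >= threshold  # positive (ungradable) if O4 is at least the threshold
    return o3, o4
```

With these values, an image without a detected OD is flagged as ungradable as soon as the vessel-based likelihood reaches roughly 0.2, whereas an image with a detected OD needs a likelihood of at least 0.95.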

RWTH-CuP [37]
The RWTH-CuP team proposed an approach with two steps consisting of cropping around the OD by employing a detection network, followed by an ensemble of transformers (Swin Transformer-B [41] and DeiT-S [42]) and convolutional neural networks (EfficientNet-B4 [43] and EfficientNetV2-M [44]) that classifies the cropped image. They manually labeled the OD and its environment in 3,221 CFPs to develop this detection network, for which they trained a YOLOv5 [45] object detector network.
For ungradability classification, the team used a hybrid approach. As the probability that an image is ungradable is high if the OD could not be found by the object detector network, they employed the confidence score of the YOLOv5 detection model as one of the ungradability measures. To capture other ungradability causes, such as blurred depictions of the OD, they trained an additional classifier on a manually selected subset of the CFPs in the development set. The team considered the 4,000 CFPs with the lowest confidence score of the object detector and manually selected 600 images that were assumed by the team to be very close to being classified as ungradable. They used these images together with another set of 2,000 high quality images to train an EfficientNet-B4 [43] ungradability classification model. O4 was then defined as (1 − c) + g, where c is the object detection confidence and g the output of the ungradability classification model. The binary O3 output value was determined using a cut-off manually determined by a medical doctor in 20,000 images from the development set for which O4 was computed.
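As a small illustration of the combined score, the sketch below computes O4 = (1 − c) + g and thresholds it. The cut-off value is not given in the text, so `cutoff` is a required, hypothetical parameter, and the strict comparison is an assumption.

```python
def rwth_ungradability(detection_confidence, classifier_output, cutoff):
    """Return (O3, O4), with O4 = (1 - c) + g and O3 = O4 above the cut-off."""
    o4 = (1.0 - detection_confidence) + classifier_output
    o3 = o4 > cutoff
    return o3, o4
```

A confidently detected OD (c close to 1) with a low classifier output g gives an O4 close to 0, while a missed OD with a high classifier output gives an O4 close to 2.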

Eyelab [46]
The Eyelab team employed a two-stage approach for glaucoma classification. The first step was to detect and crop the OD area and the second step was a vision transformer [47] that classified the cropped image from the first step. For the detection model, they trained a YOLOv5 [45] model using semi-automatically generated labels. Their method for ungradability detection was based on whether the optic disc detection model from the first step found an optic disc to be present.

Tien [48]
Tien used an ensemble of an EfficientNet [43] and DenseNet [49] for the classification of RG and NRG. For the ungradability task, they used an autoencoder network and a blending engine. They used the reconstruction error as a measure of the likelihood of ungradability. The higher the reconstruction error, the more likely it is that the image is ungradable. The blending engine used the probability output from the binary classification model as a weight factor for the reconstruction error. The weight was highest, with a value of 1, when the probability was 0.5, and lowest when the prediction was certain (a probability of either 0 or 1).
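The text only fixes the end points of the blending weight (1 at a probability of 0.5, lowest at 0 and 1). The sketch below shows one possible weight function with that behaviour. The linear shape and the minimum weight of 0 are assumptions for illustration and are not taken from the team's description.

```python
def blending_weight(p_rg):
    """Weight that is 1 at p = 0.5 and falls linearly to 0 at p = 0 or p = 1 (assumed shape)."""
    return 1.0 - abs(2.0 * p_rg - 1.0)

def blended_ungradability(reconstruction_error, p_rg):
    """Reconstruction error scaled by the uncertainty of the screening classifier."""
    return blending_weight(p_rg) * reconstruction_error
```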
The first model was trained on the available training set for the screening task. The second model was tasked with identifying out-of-distribution data, i.e., ungradable images in this case. For this, ungradable images were simulated by applying four image transformations (brightness, gamma, saturation and blur) online to the data, with such a strength that they would destroy image content and render the images useless for diagnostic purposes. The ungradability detection model was trained on a mixture of gradable images (sampled directly from the original training set) and ungradable images (simulated).
After training, this model was applied to the training set, where all images were expected to be gradable. The threshold that would classify 0.1% of the training set as ungradable was selected for ungradability detection.
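Selecting a threshold that flags a fixed fraction of the (gradable) training images can be sketched as follows. The function name and the strict-inequality convention are assumptions; with tied scores, slightly fewer images than requested may be flagged.

```python
def threshold_for_flag_rate(scores, flag_rate=0.001):
    """Threshold such that scores strictly above it make up about flag_rate of the data."""
    ordered = sorted(scores)
    n_flag = int(round(flag_rate * len(ordered)))
    # The threshold is the largest score that is not flagged.
    return ordered[len(ordered) - n_flag - 1]
```

For 1,000 training images and a flag rate of 0.1%, this returns the second-highest score, so exactly one image lies above the threshold when all scores are distinct.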

FMS-CETCV [52]
The FMS-CETCV team used a binary classifier with ResNet-50 as backbone for classifying RG and NRG. They used focal loss [53] to account for the class imbalance in the training set.
For the classification of ungradable images, they used a self-supervised learning approach, inspired by the work of Oza et al. [54], where a one-class classification method was presented for unsupervised anomaly detection. The one-class classifier builds a feature space by extracting the features of the training samples, which contain only positive samples (i.e., gradable images). They used an encoder with ResNet-18 as backbone, which was trained on the AIROGS training set, which only contains gradable images. The feature space produced by this encoder is then used by a Gaussian anomaly classifier to distinguish gradable and ungradable images.

ICT_HCI [55]
Team ICT_HCI used ResNet-50 for their RG and NRG classification model. During inference, they made five random 512x512 crops of the image and provided all crops separately to the model, obtaining five scores. If the maximum of the five scores was greater than 0.9, they used it as the output of the model; otherwise they took the mean of the five scores as the output of the model.
The team used the minimum class probability of the two classes RG and NRG as the likelihood for ungradability. If and only if the ungradability likelihood was greater than 0.1, they set O3 to be positive.
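Both rules translate directly into code. In the sketch below, the two class probabilities are assumed to sum to 1 (as with a softmax output), which the text does not state explicitly. The function names are hypothetical.

```python
def aggregate_crop_scores(crop_scores, high=0.9):
    """Use the maximum crop score if it exceeds `high`, otherwise the mean of all crops."""
    top = max(crop_scores)
    if top > high:
        return top
    return sum(crop_scores) / len(crop_scores)

def min_probability_ungradability(p_rg, threshold=0.1):
    """Return (O3, O4), where O4 is the smaller of the two class probabilities."""
    o4 = min(p_rg, 1.0 - p_rg)
    o3 = o4 > threshold
    return o3, o4
```

Under this rule, any image whose RG probability lies strictly between 0.1 and 0.9 is marked as ungradable, so only confident predictions are treated as gradable.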

SK [56]
The SK team employed ResNet-RS [57] for RG and NRG classification. They replaced the final linear layer of ResNet-RS with a single linear layer with two channel outputs.
For ungradability classification, they used an inference-time OOD energy-based method [58] combined with activation rectification [59]. The energy-based method uses a scoring function based on energy, instead of softmax, to discriminate in-distribution (ID) and OOD data. In activation rectification, outsized activations are attenuated by clipping the activations at an upper limit. This is based on the observation that the mean activation for ID data is well-behaved with a near-constant mean and standard deviation, whereas the mean activation for OOD data has significantly larger variations across units and is biased towards having sharp positive values. After rectification, the output distributions for ID and OOD data become much better separated.
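The two ingredients can be sketched without a deep learning framework. The energy score below uses the common formulation E(x) = −T · log Σ exp(f_i(x)/T) over the logits f(x), and rectification clips the penultimate-layer activations before the final linear layer. The temperature, the clipping value and all function names are assumptions for illustration, and this is not the team's implementation.

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy of a logit vector; higher (less negative) energy suggests OOD input."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    # Numerically stable log-sum-exp.
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp

def rectify(activations, clip):
    """Clip activations at an upper limit to attenuate outsized values."""
    return [min(a, clip) for a in activations]

def linear_logits(activations, weights, biases):
    """Final linear layer: one weight row and one bias per output class."""
    return [sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)]

def rectified_energy(activations, weights, biases, clip, temperature=1.0):
    """Energy score computed from logits of the rectified activations."""
    return energy_score(linear_logits(rectify(activations, clip), weights, biases), temperature)
```

A confident prediction with one dominant logit yields a lower energy than a flat logit vector, so a threshold on the energy can serve as a binary ungradability decision.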
For the robustness task, they used the detection model confidence, an autoencoder, and a variational autoencoder (VAE) [64]. They combined these three components of their pipeline into a final ungradability score O4 = (1 − c) · s · p_autoencoder · p_vae, where c refers to the detection network confidence and p_autoencoder and p_vae refer to the mean squared error between the input and output of the autoencoder and VAE, respectively.

UPRetina-UR [65]
The UPRetina-UR team used ResNet-RS-50 [57] for the classification of RG and NRG. They oversampled cases with RG during training to account for the class imbalance.
They employed a closed-set classification approach for the ungradability task based on the method proposed by Vaze et al. [66]. They applied test-time augmentation to obtain five predictions that are averaged to produce O4.

OPTIMATeam [38]
OPTIMATeam used the first two blocks from the Inception-V3 [63] network for the classification of RG and NRG. They only used these two blocks to reduce the receptive field size, which was necessary for their ungradability approach.
The ungradability approach was based on the direct modeling of the uncertainty following the evidential deep learning approach [67]. They used Deep Dirichlet uncertainty estimation as the ungradability score O4. To set a threshold for getting a binary value for O3 based on O4, they made the assumption that diagnosis is only possible if the OD has enough image quality, as the main structural manifestation of glaucoma occurs in that region. They applied Grad-CAM [68] on the trained model for the screening task and occluded the region where Grad-CAM was greater than 0.5. This allowed them to produce ID and OOD samples in their validation set, with which they computed the threshold for the binary ungradability decision. In particular, they constructed a receiver operating characteristic (ROC) curve using their values for O4 with these ID and OOD samples. The ROC threshold where the sensitivity was 0.5 was used to calculate O3.
The ungradability output O4 was the sum of the variances between all models in the ensemble for the positive and negative class probabilities. O3 was positive if O4 exceeded 0.2 and negative otherwise.
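An ensemble-variance score of this kind can be sketched with the standard library. Whether the population or the sample variance was used is not stated, so the population variance below is an assumption, as is deriving the negative class probability as one minus the positive one. The function name is hypothetical.

```python
from statistics import pvariance

def ensemble_variance_ungradability(positive_probs, threshold=0.2):
    """Return (O3, O4) from the per-model probabilities of the positive class.

    O4 is the variance across models of the positive class probability plus
    the variance across models of the negative class probability.
    """
    negative_probs = [1.0 - p for p in positive_probs]
    o4 = pvariance(positive_probs) + pvariance(negative_probs)
    o3 = o4 > threshold
    return o3, o4
```

Under these assumptions O4 is at most 0.5, which is reached when half of the models predict 0 and the other half predict 1.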

YC [71]
The YC team used two DenseNet-121 networks to classify RG and NRG in the CFPs. The first network was trained with the full CFP as input and the second network used a version of the CFP that was cropped around the optic disc as input. After the last convolutions of these networks, a fully connected layer with dropout was added. The outputs of these fully connected layers were then concatenated and used as the input to another fully connected layer with dropout, which was followed by the final layer of the network. For cropping the CFPs around the optic disc, they trained a U-Net [72] with a DenseNet-121 [49] backbone. To train this segmentation network, they first roughly annotated the position of the optic disc in 101 CFPs in the training set. Subsequently, they generated reference segmentation maps using a probability density function of the multivariate normal distribution around the annotated optic disc position.
They used Monte-Carlo dropout [73] with 20 predicted probabilities per image for the robustness task. They then used a one-sample Wilcoxon test to test whether the mean of the predicted probabilities was 0.5. The team defined ungradability for predicting glaucoma as the logarithm of the p-value of this Wilcoxon test.

Mirazzak [74]
Team Mirazzak used an ensemble of ConvNeXts [75] and a vision transformer for the screening performance task.
For the ungradability task, they employed the regret function, which was proposed by Bibas et al. [76] as the generalization error of an explicit expression of the predictive normalized maximum likelihood learner.
If the value of the regret function was high, the samples were considered OOD and they marked them as ungradable.

Results
This section presents the glaucoma screening performance and robustness of the fourteen participating teams. The final rankings and mean positions of the teams are shown in the first plot of Fig. 2. Four teams shared a rank with another team, since their mean positions were exactly equal, causing there to be two teams for each of the ranks #2 and #11.

Glaucoma screening performance
The glaucoma screening performance of the participating teams is summarized in Fig. 2, showing pAUC_S and SE@95SP_S in the second and third plot, respectively. The highest scores for pAUC_S and SE@95SP_S were 0.90 (95% CI: 0.89-0.91) and 0.85 (95% CI: 0.83-0.87), respectively. These scores were both achieved by team PUMCH-eye.
Fig. 3a and Fig. 3b show the pAUC_S and SE@95SP_S for the ensembles when averaging the RG likelihood output O1 of the best M participants in terms of the relevant metric. An optimal pAUC_S of 0.91 (95% CI: 0.90-0.92) was achieved at M = 3. At M = 2, an optimal value for SE@95SP_S was reached, which was 0.87 (95% CI: 0.85-0.89).
Fig. 4a shows the partial ROC curves between 90% and 100% specificity for all participants. The plot also presents the sensitivity and specificity of the human graders with a 95% CI. These were 0.86 (95% CI: 0.84-0.87) and 0.94 (95% CI: 0.94-0.95), respectively.
Figure 4: ROC curves for both challenge tasks. The sensitivity and specificity of all human graders on the AIROGS test set combined are indicated with black lines. The width and height of the black horizontal and vertical lines, respectively, are 95% CIs. In (a), the partial ROC curve (90%-100% specificity) for screening is shown, with 1,602 positive (RG) and 8,134 negative (NRG) images from the AIROGS test set. In (b), the ROC curve for robustness is shown with 1,554 positive (ungradable) and 9,736 negative (gradable) images from the AIROGS test set.
In Fig. 5, we compare the performance on the REFUGE test set of the final AIROGS algorithms, which were trained on the AIROGS train set, to the performance of the algorithms that were submitted to the REFUGE challenge, which were trained on the REFUGE train set. The top three REFUGE algorithms achieved AUCs of 0.99, 0.98 and 0.96. For the AIROGS algorithms, the best three AUCs were 0.98, 0.97 and 0.97. The mean ± std. dev. AUCs of all REFUGE and AIROGS algorithms were 0.94 ± 0.04 and 0.95 ± 0.02, respectively. Fig. 6 presents the relation between the two glaucoma screening performance metrics of all participating AIROGS algorithms on the AIROGS test set and that performance on the REFUGE test set. For both metrics, almost all AIROGS algorithms (except for team PUMCH-eye for SE@95SP_S) scored higher on REFUGE than on AIROGS. Of all AIROGS participants, the best pAUC_S and SE@95SP_S on REFUGE were 0.94 and 0.88, respectively.
In Fig. 7, the relation between the glaucoma performance of all participating AIROGS algorithms on the AIROGS test set and that performance on GAMMA is shown. For both screening metrics, all AIROGS algorithms scored higher on GAMMA than on AIROGS. Of all AIROGS participants, the best pAUC_S and SE@95SP_S on GAMMA were 1.0 and 1.0, respectively.

Robustness
The robustness metrics of the participating teams are summarized in Fig. 2, showing κ U and AU C U in the fourth and fifth plot, respectively.The highest scores for κ U and AU C U were 0.82 (95% CI: 0.80 -0.84) and 0.99 (95% CI: 0.98 -0.99), respectively.These scores were achieved by team Temirgali and RWTH-CuP, respectively.
Fig. 3c and Fig. 3d show the κ U and AU C U for the ensembles when averaging output O 3 and output O 4 , respectively, of the M best algorithms in terms of these respective metrics.An optimal κ U of 0.85 (95% CI: 0.84 -0.86) was achieved at M = 6.Also at M = 6, an optimal value for AU C U was reached, which was 0.99 (95% CI: 0.99 -0.99).
In Fig. 4b, ROC curves for robustness are shown for all participants. The plot also presents the sensitivity and specificity of the human graders for separating ungradable from gradable images, with a 95% CI. These were 0.95 (95% CI: 0.94-0.96) and 0.97 (95% CI: 0.97-0.97), respectively.
The results on the external DRIMDB dataset are shown in Fig. 8, indicating the relation between the ungradability metrics κ_U and AUC_U of all participating AIROGS algorithms on DRIMDB and those metrics on AIROGS. Of all AIROGS participants, the best κ_U and AUC_U on DRIMDB were 0.94 and 1.0, respectively.

Discussion
AI models have been shown to be effective at detecting glaucoma in CFPs, but most studies lack evidence of robustness to real-world scenarios in which unexpected OOD data can be presented due to various causes. To this end, we relied on the community to develop robust AI solutions for glaucoma screening based on the largest multi-center real-world CFP dataset with glaucoma labels. We organized the AIROGS challenge around this dataset, ensuring the resulting algorithms are reusable in a cloud-based environment. We evaluated these algorithms on ungradable data, although participants could train only on gradable data, a restriction intended to encourage robustness to any kind of ungradable data. We also evaluated them on other publicly available datasets to assess their generalization.

Overall findings
The team with the highest SE@95SP_S reached expert-level screening performance on the AIROGS test set with a sensitivity of 0.85 (95% CI: 0.83-0.87) at 95% specificity, similar to the sensitivity of 0.86 (95% CI: 0.84-0.87) at a specificity of 0.94 (95% CI: 0.94-0.95) of human graders. The highest pAUC_S that was achieved by any of the teams was 0.90 (95% CI: 0.89-0.91). Ensembling the different participating methods improved the screening performance even further, to 0.91 (95% CI: 0.90-0.92) and 0.87 (95% CI: 0.85-0.89) for pAUC_S and SE@95SP_S, respectively. Seven out of fourteen teams exceeded the minimum performance of 80% sensitivity and 95% specificity that was required of the human graders, who were periodically monitored during the grading process. This shows these models can provide a similar performance to human graders for glaucoma screening, suggesting that AI can potentially play a role in an automated screening process.
We also evaluated the screening performance of the algorithms on two external test sets. Even though the algorithms were trained on AIROGS data, they achieved very high performance on the two external test sets, showing reproducible results in different sets and populations. On average, the participating AIROGS algorithms scored slightly higher on the REFUGE dataset than the REFUGE participants. We found that the participating algorithms scored substantially higher on these external datasets than on the AIROGS test set, indicating the value of a challenging real-world dataset. This strong generalization of the developed solutions also shows the potential of models trained on our dataset to be successfully implemented in screening programs with limited to no loss of performance.
The robustness to ungradable data in the AIROGS test set was evaluated for each team using the metrics κ_U and AUC_U. The teams that performed the best in terms of these metrics achieved 0.82 (95% CI: 0.80-0.84) and 0.99 (95% CI: 0.98-0.99) for κ_U and AUC_U, respectively. Human experts reached a higher κ_U of 0.85 (95% CI: 0.84-0.86) for this task. Moreover, they achieved a sensitivity of 0.95 (95% CI: 0.94-0.96) and a specificity of 0.97 (95% CI: 0.97-0.97) for detecting ungradable cases, while the team with the best AUC_U achieved a lower sensitivity of 0.90 (95% CI: 0.88-0.92) at 97% specificity. Although the teams achieved relatively high performance, they still performed worse than human experts on the robustness task. This shows this task was especially challenging, possibly because the participating teams could not use ungradable development data or because their robustness approaches focused on specific forms of ungradability.
We also assessed robustness on the external DRIMDB dataset. The best-scoring team on this dataset achieved 0.94 and 1.0 for κ_U and AUC_U, respectively. These two metrics were lower on the AIROGS dataset for that team. This also indicates very strong generalization to other datasets for the robustness task. The high ungradability detection performance also indicates robustness to other diseases in the image, as diabetic retinopathy was prevalent in the gradable subset of DRIMDB and the best algorithms did not classify these images as ungradable.
A large difference in performance between participating teams can be observed, both for the screening and the robustness task. Therefore, we think it is important to identify which methodological choices were made predominantly by top-performing teams. One of the most notable differences between the top three participants and the rest was the use of transformers. Outside of the top three, only the last-placed team used a transformer. Moreover, all of the top three participants manually labeled ODs for training either a segmentation or detection model to crop around the OD during pre-processing. Even though this was also done by two other teams, this seems like an effective strategy to achieve higher screening performance. A likely reason for the effectiveness of this approach is that most glaucoma-related imaging features can be found on or around the OD. This shows how a priori medical knowledge can still be of value even when large amounts of data are available. A less important factor appears to be the number of manually labeled ODs. A possible reason for this could be that the OD detection or segmentation network is not required to be extremely accurate, as combining a rough localization of the OD with a large enough padding margin could also suffice to crop the image during pre-processing.
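The idea that a rough OD localization plus a generous padding margin suffices for cropping can be sketched as below. This is not any team's implementation: the function name, the box convention (left, top, right, bottom in pixels) and the default margin of 0.5 times the box size are all assumptions made for illustration.

```python
def padded_crop_box(od_box, image_size, margin=0.5):
    """Expand a rough OD bounding box by a relative margin and clip it
    to the image borders.

    od_box:     (left, top, right, bottom) in pixels
    image_size: (width, height) in pixels
    margin:     padding on each side as a fraction of the box size
    """
    left, top, right, bottom = od_box
    width, height = image_size
    pad_x = (right - left) * margin
    pad_y = (bottom - top) * margin
    return (
        max(0, round(left - pad_x)),
        max(0, round(top - pad_y)),
        min(width, round(right + pad_x)),
        min(height, round(bottom + pad_y)),
    )
```

With a large margin, a localization error of a fraction of the disc diameter still leaves the OD and its surroundings inside the crop, which is consistent with the observation that the number of labeled ODs mattered less.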
Since the development set only consisted of images that were labeled gradable (either RG or NRG) and the use of external fundus data was prohibited, all teams came up with an uncertainty or OOD detection method based on the gradable data for the robustness task. The ungradability methods of the top three participants in terms of mean position, κ_U and AUC_U all revolved around the confidence of a neural network that localized the OD. Of the other participants, only the ninth-placed team had such an approach. Apart from these methods based on OD detection, only team UPF+AIML implemented a different robustness technique that was also based on domain knowledge. This gives the impression that solutions based on domain knowledge are more effective for robustness than more general OOD detection solutions. However, it still needs to be evaluated whether such approaches are robust for other general tasks (not glaucoma screening) or other sources of OOD data.
For calculating the κ_U metric, the participants were required to output a binary decision on ungradability. A popular approach, especially among the top participants, was to manually identify relatively low-quality images in the development set and base a threshold for this binary output on that subset. This technique was employed by the best three, fifth, tenth and twelfth teams in terms of κ_U. This indicates that this could be a successful approach, although not in general, as the accuracy of this binary value also depends strongly on the quality of the scalar ungradability output O_4 that is being thresholded. The difference between the rankings in terms of κ_U and AUC_U stood out for one team in particular. Team YC ranked only eleventh for AUC_U (which depended on the scalar output O_4), but ranked fourth in terms of κ_U (which depended on the binary output O_3), indicating that the approach they used for thresholding their scalar value was highly effective. The difference between their AUC_U and κ_U rankings was 7 positions, while the next largest difference was only 3. Team YC indeed came up with a relatively sophisticated method for binarizing O_4 compared to others, based on a Wilcoxon one-sample test to statistically test whether the mean of the predicted probabilities from a Monte-Carlo drop-out approach was 0.5 or not.
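A binarization rule in the spirit of the one described for team YC could look like the sketch below, which uses `scipy.stats.wilcoxon` as a one-sample signed-rank test on the Monte-Carlo drop-out predictions shifted by 0.5. The details are assumptions rather than the team's actual method: the function name, the significance level of 0.05, and the decision to flag an image as ungradable when the test cannot reject a location of 0.5 are all hypothetical. Note also that the signed-rank test addresses the location of a symmetric distribution, which coincides with the mean only under symmetry.

```python
import numpy as np
from scipy.stats import wilcoxon


def flag_ungradable(mc_probabilities, alpha=0.05):
    """Return True if the Monte-Carlo drop-out predictions for one image
    are not significantly different from 0.5 (hypothetical decision rule)."""
    diffs = np.asarray(mc_probabilities, dtype=float) - 0.5
    if np.all(diffs == 0):
        # The test is undefined when every difference is zero; treat the
        # image as maximally uncertain.
        return True
    p_value = wilcoxon(diffs).pvalue
    # Failing to reject "location = 0.5" is read as an uncertain prediction.
    return bool(p_value >= alpha)
```

The appeal of such a rule is that the binary decision follows from a significance level rather than from a hand-tuned cut-off on the scalar score.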

Strengths and limitations
The dataset presented in this paper substantially exceeds what was publicly available before in terms of number of images and patients. The dataset is also highly diverse because of the large number of different sites, cameras and ethnicities. The quality of the labels was controlled by the initial and periodic evaluation of human graders and by the fact that each image was labeled independently by two trained graders and, in case of disagreement, by a highly experienced reader. The participants submitted their solutions as containerized algorithms, allowing reproducibility, facilitating inference on other data, and preventing manual manipulation of the test set.
One of the rules of the AIROGS challenge was the prohibition of the use of external fundus data for development. A limitation of this work is the fact that we cannot be sure whether any of the teams used such data in their development process. A possible approach to prevent this and make the process fairer is to have participants submit a containerized algorithm for training, which would be trained by the challenge organizers with private challenge training data. Nevertheless, with such an approach it would still be challenging and time-consuming for the challenge organizers to verify that the training containers do not contain any weights pre-trained on other data.
The dataset used for the challenge is diverse, but improvements could still be made in that respect. All screening sites were based across the United States of America, raising the question whether a more generalizable model could be obtained with data from across the world. On the other hand, we showed that many algorithms trained on the AIROGS dataset performed at least as well on three external test sets, of which two originated from China and one from Turkey, as on our internal test set.
Not all research groups working in the field of retinal image analysis participated in this challenge, and many teams that joined the challenge did not submit a solution to the Final Test Phase. Possible reasons for this include that many teams saw that their results could not match those already present on the leaderboard, that the barrier to wrapping a solution in a Docker container was too high for some teams, or that they were not able to finish in time. Therefore, we would like to stress that the challenge is still open, and we are curious to see if the community can make further improvements. After all, especially for the robustness task, there seems to be room for improvement, given the gap with the human grader performance.

Future directions
Based on the solutions that were presented by the teams, we think it would be valuable to combine methodologies from different participants and to work further on their ideas. For example, as we mentioned before, team YC apparently had a highly effective method for thresholding their ungradability scores, as their κ_U was very high compared to their AUC_U. A possible future direction would be to combine methods that perform well in terms of AUC_U with the binarization technique from team YC. Moreover, we observed that algorithms which scored high in terms of robustness used domain knowledge for this aspect of the challenge. Possible future directions could be to explore other ways to incorporate domain knowledge into an ungradability method. This observation also leads to the question whether there are more fields in medical image analysis in which domain knowledge can be leveraged for uncertainty estimation and OOD detection.
Next to a decision on RG and NRG presence, the graders were asked to indicate which clinical glaucomatous features were present in the eyes they classified as RG, as listed in Section 2.1 and further described by [77]. This information was not yet included in the dataset release for this challenge, as it fell outside the scope of this challenge. Future solutions and challenges could be developed with this information, possibly resulting in more explainable algorithms.
This challenge only focused on classification based on a single CFP. It may be interesting to explore the effect on screening performance and robustness of including various types of metadata in our dataset, which we have available but have not yet published. This metadata, although missing for some images, includes the camera type, age and anonymous patient identification (which can be used to link two eyes to a single patient).

Conclusions
We presented the results of community-acquired algorithms tested on real-world data for robust glaucoma screening from CFP. The best algorithms performed similarly in terms of screening to the carefully trained and selected human graders, and were shown to be effective at flagging images that could not be graded. Methodological choices predominantly made by the best teams included, for the screening task, the use of vision transformers and the incorporation of optic disc detection models in pre-processing and, for the robustness task, out-of-distribution detection approaches based on domain knowledge. We hope the unprecedented size and real-world nature of the dataset we released and the algorithms that were developed using this dataset will help towards implementing robust AI for glaucoma screening.

Figure 1: Overview of all phases in the AIROGS challenge. A world map is shown for each phase that indicates with red circles from which countries the teams that participated in that phase originated. A circle is shown for each country from which at least one team participated and its size represents the number of teams that joined from that country. The relevant subset of the AIROGS dataset for each phase is shown at the bottom of the figure. *All phases reopened for new submissions after the winning teams were announced.

Figure 2: Final rankings of all participating teams. The teams are sorted by their final ranking and therefore also by their mean position. The mean position is shown in the left plot and the four challenge metrics are shown in the other four plots. The κ_U of all human graders is indicated with a red dotted line. The width of the horizontal lines in all plots and the shaded area in the plot for κ_U are 95% CIs. We consistently use the same colors to refer to teams in other figures in this manuscript.

Figure 5: Comparison of the AIROGS and REFUGE algorithms, tested on the REFUGE test set, visualized as violin and swarm plots. The final algorithms that were developed for the REFUGE challenge itself and for the AIROGS challenge are shown on the left and right, respectively. The AIROGS algorithms were only trained on the AIROGS train set and were not retrained with the REFUGE dataset.

Figure 6: Performance of the participating AIROGS algorithms on the REFUGE dataset, compared to their performance on the AIROGS dataset. Both screening metrics (a) pAUC_S and (b) SE@95SP_S are shown.
of the three CFPs that were available for each eye, we only included the RG or NRG photograph in the dataset if it was available. Otherwise, only one of the U photographs was used. We split the data into a training set of 101,442 CFPs and a test set of 11,290 CFPs, ensuring that data from patients in the training set was not in the test set. We randomly sampled patients when making the split, oversampling patients with ungradable and RG CFPs for the test set, such that approximately 1,600 RG and 1,600 U photographs ended up in the test set. Since we were interested in AI solutions that can identify ungradable data without training on ungradable data, we left out all U photographs that ended up in the AIROGS training set. Table 1 shows statistics about RG, NRG and U prevalence, age, sites and cameras for the full dataset, the training set and the test set. Approval from the Institutional Review Board of the Rotterdam Eye Hospital was obtained to conduct this research.

Table 2: Method overview from all participating teams for the screening task. OD = optic disc.

Table 3: Method overview from all participating teams for the ungradability task and the deep learning frameworks they used. OD = optic disc. AE = autoencoder. VAE = variational autoencoder. rec. error = reconstruction error. OOD = out-of-distribution.