Exploring Bias in Sclera Segmentation Models: A Group Evaluation Approach

Bias and fairness of biometric algorithms have been key topics of research in recent years, mainly due to the societal, legal and ethical implications of potentially unfair decisions made by automated decision-making models. A considerable amount of work has been done on this topic across different biometric modalities, aiming at better understanding the main sources of algorithmic bias or devising mitigation measures. In this work, we contribute to these efforts and present the first study investigating bias and fairness of sclera segmentation models. Although sclera segmentation techniques represent a key component of sclera-based biometric systems with a considerable impact on the overall recognition performance, the presence of different types of biases in sclera segmentation methods is still underexplored. To address this limitation, we describe the results of a group evaluation effort (involving seven research groups), organized to explore the performance of recent sclera segmentation models within a common experimental framework and study performance differences (and bias), originating from various demographic as well as environmental factors. Using five diverse datasets, we analyze seven independently developed sclera segmentation models in different experimental configurations. The results of our experiments suggest that there are significant differences in the overall segmentation performance across the seven models and that among the considered factors, ethnicity appears to be the biggest cause of bias. Additionally, we observe that training with representative and balanced data does not necessarily lead to less biased results. Finally, we find that in general there appears to be a negative correlation between the amount of bias observed (due to eye color, ethnicity and acquisition device) and the overall segmentation performance, suggesting that advances in the field of semantic segmentation may also help with mitigating bias.


I. INTRODUCTION
Ocular biometrics represents a branch of biometric recognition technology that exploits various characteristics of the eye for automatic identity inference [1]. Recognition techniques based on ocular traits have been successfully applied for access control applications, user-friendly verification schemes on mobile devices, as well as large scale identity-management programs, e.g., Aadhaar [2]. Research on ocular biometrics has long been focused on iris recognition technology, but more recently also expanded into other (visible) ocular modalities, such as the periocular region [3] and the vasculature of the sclera [4]. The sclera in particular has seen considerable interest, mainly due to its appealing characteristics, i.e.: (i) unlike iris recognition, sclera recognition performs best in the visible spectrum [5], and, hence, does not require any specialized acquisition hardware; (ii) the vasculature of the sclera is considered to be highly discriminative and stable over time; and (iii) while the presence of contact lenses can (purposely/inadvertently) degrade the performance of recognition techniques based on the iris or the periocular region, it has only a limited effect on sclera recognition models [5], [6].
A typical sclera recognition procedure consists of four main steps: sclera segmentation, vessel enhancement, feature extraction and matching. Each of these steps is critical for the overall accuracy and trustworthiness of the recognition procedure and has to ensure consistent performance across diverse data characteristics, e.g., gender, ethnicity, acquisition device, gaze direction. The recent interest in sclera biometrics has led to considerable advances in all four steps and has, among others, resulted in powerful segmentation models [7], [8], [9], novel recognition techniques [4], [5], [10], and multi-biometric systems with impressive performance characteristics [11], [12]. However, to the best of our knowledge, the literature fails to address an important issue in this field: the bias and fairness of sclera-oriented biometric algorithms [13].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To address this gap, we describe in this paper a group evaluation effort, organized as a follow-up event to the 2020 edition of the annual Sclera Segmentation Benchmarking Competition (SSBC) [1], which focuses on the assessment of one of the key components of sclera-based recognition systems, i.e., sclera segmentation models. While SSBC 2020 studied the performance of modern sclera-segmentation models in mobile environments, the goal of the group evaluation was to benchmark segmentation performance across more diverse image characteristics, but more importantly to explore in a comprehensive manner the bias and fairness of contemporary sclera-segmentation techniques. This is a vital avenue of research, as questions of unfair treatment and bias in automated decision-making models have recently been a highly controversial and heavily researched topic in academia, industry, as well as society in general. Bias in machine learning algorithms has also been highlighted as one of the key topics of future research in various national and international strategies and acts [14]. It is, therefore, paramount to understand what kind of performance differentials can be expected from current state-of-the-art segmentation models, as this may impact the bias and fairness of all downstream tasks, including the final decisions made. Motivated by the importance of this topic, the group evaluation aimed at investigating the effect of different demographic and environmental characteristics, as well as the impact of training data on segmentation performance. Multiple segmentation models were developed for the evaluation, including extensions of some models that took part in SSBC 2020, but also novel approaches, developed specifically for the group evaluation. The models were benchmarked under a common experimental framework to provide answers to the following research questions:
• Q1: How well do contemporary models perform in the task of sclera segmentation with diverse input images?
• Q2: Which subject/data characteristics represent the most critical source of bias for sclera segmentation models?
• Q3: What impact do training data characteristics have on the bias exhibited by the segmentation models?
• Q4: Can we mitigate the bias exhibited by the segmentation models without losing segmentation accuracy?
The combined research efforts of multiple research groups helped provide answers to these questions and led to the following contributions that are presented in this work:
• A report on the current state-of-the-art in sclera segmentation, with a rigorous (independent) analysis of the main factors affecting segmentation performance of a representative sample of current segmentation models over five datasets with diverse image characteristics.
• A comprehensive evaluation of (algorithmic and representation) bias (and fairness) of sclera segmentation models across two environmental and two demographic factors. This includes novel performance measures for quantifying bias and a novel (public) dataset of ocular images.
• The introduction of several new (sclera) segmentation models developed exclusively for the group evaluation.

A. Bias and Fairness in Biometrics
Automated biometric recognition techniques can today be found in a variety of application areas that have an immediate impact on people's lives, including online banking, healthcare, access control, or surveillance and security [13], [15]. Because the (automated) decisions made by biometric systems have potentially critical consequences for individuals, it is paramount that the recognition techniques be free of biases and render fair decisions for all. While there is no (single) established definition of bias and fairness in the literature, we adopt here the formulation of Drozdowski et al. [13], who define an algorithm as being biased if it leads to significant performance differences for different subsets of data, where the subsets can be based on subject-specific (e.g., pose, expression), demographic (e.g., age, gender, ethnicity), or environmental (e.g., illumination, capture device) factors. The concept of fairness, on the other hand, can be viewed as an algorithmic property related specifically to demographic bias and is defined by Mehrabi et al. [15] as "the absence of prejudice or favoritism toward an individual or group based on their innate or acquired characteristics". Studies on bias and fairness in biometrics have been a central research topic in recent years [13], largely due to societal, legal and ethical implications of potentially unfair decisions made by automated machine learning models [16].
A considerable amount of work has been done to investigate (demographic) bias and fairness in face recognition systems, e.g., [17], [18], and [19], and potentially sensitive face-related tasks, such as age estimation [20], face image quality assessment [21], privacy protection [22], and face-morph detection [23] to name a few examples. Similar studies were also presented for fingerprints [24], [25], finger vein [26], and palm print [27] recognition systems among others. While much of this work aimed at identifying the presence of bias in various (learning-based) biometric systems and algorithms (e.g., [17], [20], [24], and [26]), a small number of works also tried to investigate causes of the observed performance differentials for different data groups, e.g., [19] and [28]. The insight and observations made by these studies provided critical understanding of the bias-related behavior of existing biometric algorithms and contributed towards various bias mitigation measures, e.g., [29], [30], and [31].
There has also been work exploring bias in the context of ocular biometrics. Krishnan et al., for example, investigated the presence of age and gender bias in recognition systems relying on the periocular regions in [32] and [33], respectively. Fang et al. [34] sought to quantify demographic bias in presentation attack detection (PAD) for iris recognition systems, and Gorodnichy and Chumakov [35] explored age-induced performance differentials in biometric systems based on the iris. While these works presented empirical studies on the bias and fairness of different algorithms related to ocular biometrics, they have been limited to the iris and the periocular region only. Studies related to emerging ocular modalities, such as the sclera, on the other hand, are still largely missing from the literature. Given this limitation, we present a comprehensive analysis in this paper, focused on the overall performance and, above all, the bias of sclera segmentation models w.r.t. different demographic and environmental factors. Such segmentation models represent key components of sclera-based recognition systems and are, therefore, expected to have a considerable impact on their recognition performance.

B. Sclera Segmentation
The goal of sclera segmentation is to identify the region-of-interest (ROI) in the input image as accurately as possible, and, consequently, to ensure that all downstream tasks are applied only to relevant parts of the image that contain (discriminative) vascular patterns needed for identity inference. Several specific challenges make sclera segmentation a difficult task, including: (i) the low contrast between the foreground (i.e., the sclera) and the background (i.e., the surrounding region), which makes using traditional binarization techniques infeasible; (ii) the wide range of appearance variations caused by subject-specific and demographic factors such as eye color, ethnicity, sex/gender, or health; and (iii) the effects of external factors, e.g., the imaging device or ambient lighting.
The evolution of sclera segmentation models has been documented and largely driven by a series of Sclera Segmentation Benchmarking Competitions (SSBC), held as part of major biometrics-oriented meetings and conferences [1], [48], [49], [50], [51], [52]. These competitions introduced segmentation benchmarks for the community [49], [50], examined segmentation performance under changes in gaze direction [51], in cross-sensor and cross-resolution settings [48], [52], and in mobile environments [1]. Segmentation models for the sclera (as well as for the pupil and iris) were also studied in the scope of Facebook's OpenEDS challenge [47], which aimed to compare existing models with data collected using head-mounted displays. In this paper, we further contribute to these efforts through a group assessment organized during 2021 as a follow-up to the 2020 edition of SSBC. As sclera segmentation models have matured and are now used in real-world applications, the goal of the evaluation was not only to investigate the performance of state-of-the-art models in various settings but also to better understand their behavior in terms of bias and fairness.
III. BENCHMARKING METHODOLOGY
A comprehensive experimental framework was designed to facilitate the group evaluation. This included the selection/collection of suitable datasets and the definition of common experimental protocols and appropriate performance measures. In this section we describe this experimental framework and provide details on the benchmarking methodology used throughout the group evaluation.

A. Datasets
Five dedicated datasets were utilized for the group evaluation. The datasets contain ocular images that differ in terms of acquisition device, gaze direction, ambient conditions, image quality, and demographics, and, hence, allow investigating various aspects of the developed segmentation models. Details on the datasets are given below.
• The Multi-Angle Sclera Dataset (MASD) [36] contains 2624 RGB images of 164 eyes from 82 different subjects, captured using a DSLR camera (specifically, a NIKON D800 with a 28-300 mm lens). The images were manually cropped to a resolution of 7500 × 5000 pixels to extract the relevant region of interest (ROI). The images in MASD were acquired under 4 different gaze directions (left, right, straight, and up) and with 4 distinct images per gaze direction for each eye. The dataset contains images of male and female subjects captured in different lighting conditions and at different times of the day. It is annotated with high-quality manually generated sclera masks and is publicly available.
• The Sclera Mobile Dataset (SMD) [37] contains 500 RGB images of 50 eyes from 25 subjects (10 images per eye) acquired with an 8MP (3264 × 2448) mobile phone (Micromax Canvas Knight A350) rear camera. The dataset is approximately gender-balanced with 12 male and 13 female subjects and comes with variations in the age and skin color of the subjects. Images in the dataset were captured in different lighting conditions and with image noise to more accurately represent realistic scenarios in which sclera segmentation methods need to operate. SMD ships with manually generated sclera annotations and is also publicly available.
• The Sclera Liveness Dataset (SLD) represents a novel dataset, captured specifically for the group evaluation, and consists of 108 genuine RGB images from both eyes of 27 individuals (in other words, 54 different eyes). For each eye, 2 sample images were captured. The dataset contains blurred images and images with blinking eyes. It includes both male and female subjects, of different ages and different skin tones. The images in SLD were taken at different times of the day to model natural environment-induced variations. Differences in image quality (blur, lighting condition, etc.) and acquisition conditions were included intentionally in the dataset to facilitate investigations into the performance of the segmentation models in non-ideal scenarios. High-resolution images (3264 × 2448) are included in the dataset. All images were captured using a mobile phone (Lenovo K3 Note) with an 8MP rear camera and are stored in JPEG format. SLD is publicly available.
• The Sclera Blood Vessels, Periocular, and Iris (SBVPI) [5], [9] dataset consists of 1858 RGB images of 110 eyes (i.e., 55 subjects) captured with a DSLR camera (specifically, a Canon EOS 60D with macro lenses). The images were manually cropped to extract the desired ROI while maintaining their aspect ratio, then rescaled to 3000 × 1700 pixels to maintain a consistent image size across the entire dataset. Images in the dataset were captured at the highest resolution and quality settings available in the camera and in a laboratory environment. Similarly to MASD, the dataset contains images taken under 4 different gaze directions, with a minimum of 4 images per direction for each subject. The appearance variability in SBVPI is due to identity, eye color, gender, and age. Manually generated markups of the sclera and periocular regions are present for all images. SBVPI is publicly available for research purposes.
• The Mobile Ocular Biometrics In Unconstrained Settings (MOBIUS) [1] dataset comprises 16717 RGB images of 200 eyes from 100 subjects. The images were manually cropped to obtain the relevant ROI and resized to a resolution of 3000 × 1700 pixels to keep a consistent image size across the dataset. A subset of 3542 images from 35 subjects (70 eyes) is designated for segmentation research and contains high-quality manually generated (and later cleaned with a semi-automatic correction procedure [53]) annotations of the sclera, iris, and pupil regions. The dataset again contains 4 gaze directions for each eye, but exhibits a significantly higher degree of variability than other datasets due to the use of 3 different mobile phone cameras (Sony Xperia Z5 Compact, Apple iPhone 6s, and Xiaomi Pocophone F1) for image capture, and 3 ambient settings (i.e., sunny outside; inside with good illumination; and inside with poor illumination). Additionally, data about the subjects (e.g., identity, gender, eye color, age, eyewear, eye conditions and allergies) is also available to facilitate research into various data characteristics and their impact on segmentation performance.
We note that all datasets were collected with consenting subjects. A few illustrative example images from the experimental datasets and the corresponding sclera masks are presented in Fig. 1. A high-level comparison is given in Table I. The SBVPI and MOBIUS datasets are publicly available on request for research purposes; for more information, visit sclera.fri.uni-lj.si.

B. Evaluation Setup
1) Experimental Protocols:
The research groups participating in the evaluation were given access to images from all five datasets. For the MASD, SMD, and SBVPI datasets both the raw images and the ground truth segmentation masks were made available, whereas only the raw images were released for SLD and MOBIUS, while the ground truth remained sequestered. Based on this data, the participants were asked to develop sclera segmentation models under two distinct experimental protocols, i.e.:
• The Complete Training Data (CTD) protocol, where the segmentation models were trained on the full MASD, SMD, and SBVPI datasets (for a total of 4982 images from 162 subjects). The results for the group evaluation under this protocol were generated on the SLD and MOBIUS datasets. Since different datasets were used for training and testing in all experiments conducted under this protocol, there was no overlap in subjects between the training and testing data.
• The Limited Training Data (LTD) protocol, where only specific training data could be used to learn the models and the results had to be generated on predefined test datasets. This protocol resulted in multiple models with different train-test data configurations, depending on the bias aspect being explored in a given experiment. Details about the specific training and testing data used under various configurations of this protocol are provided in Section V.
The above protocols were designed for the analysis of different aspects of the developed segmentation models, as detailed in the experimental section.
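The CTD training totals quoted above follow directly from the per-dataset figures given in the dataset descriptions. The short check below is only a worked arithmetic example; the dictionary layout and the function name `ctd_totals` are illustrative and not part of the evaluation framework.

```python
# Image and subject counts as stated in the dataset descriptions.
TRAINING_SETS = {
    "MASD": {"images": 2624, "subjects": 82},
    "SMD": {"images": 500, "subjects": 25},
    "SBVPI": {"images": 1858, "subjects": 55},
}


def ctd_totals(datasets):
    """Sum image and subject counts over all datasets used for training."""
    images = sum(d["images"] for d in datasets.values())
    subjects = sum(d["subjects"] for d in datasets.values())
    return images, subjects


total_images, total_subjects = ctd_totals(TRAINING_SETS)
print(total_images, total_subjects)  # 4982 162

# MASD consistency check: 164 eyes x 4 gaze directions x 4 images each.
masd_images = 164 * 4 * 4
print(masd_images)  # 2624
```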
2) Result Generation: Two types of results were requested for the analysis: (i) binarized (black-and-white) segmentation masks, with white pixels corresponding to the sclera region and black pixels to other image areas, and (ii) probabilistic segmentation maps, with the pixel intensities corresponding to the "probability" that the pixel belongs to the sclera region. Both types of results were submitted for all models trained under the CTD and LTD experimental protocols. A sample submission is shown in Fig. 2. These results were ultimately compared to the (sequestered) ground truth information for scoring purposes. In all experiments, the scoring was done with fixed-size images and ground truth masks, rescaled to 480 × 360 pixels, to ensure a common evaluation setting.
Fig. 2. Illustration of the results generated for the group evaluation. For each input image (left), a probabilistic (middle) and binary segmentation mask (right) had to be generated and submitted for scoring.
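As a rough illustration of how a probabilistic map can be turned into the two requested result types and brought to the common scoring size, consider the minimal sketch below. It is not the organizers' scoring code: the nearest-neighbour interpolation, the `>=` comparison, the 0.5 default threshold, and the reading of 480 × 360 as width × height are all assumptions made for this example, and the function names are hypothetical.

```python
def binarize(prob_map, threshold=0.5):
    """Turn a probabilistic map (values in [0, 1]) into a 0/1 mask.

    The threshold and the >= comparison are assumptions of this sketch.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def resize_nearest(image, out_width, out_height):
    """Nearest-neighbour rescaling of a 2D list (rows of pixels)."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[(y * in_height) // out_height][(x * in_width) // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


# Toy 2 x 2 probabilistic prediction.
prob_map = [[0.1, 0.9], [0.6, 0.4]]
mask = binarize(prob_map)
# Rescale to the common scoring size (assumed to be width 480, height 360).
scored = resize_nearest(mask, 480, 360)
```

Nearest-neighbour rescaling keeps a binary mask binary, which is why it is used here; a probabilistic map could equally be rescaled with a smoother interpolation before thresholding.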

C. Scoring Criteria
The main goal of the group evaluation is to analyze two key aspects of recent (sclera) segmentation models: (i) the overall segmentation performance, and (ii) the exhibited biases. Two sets of performance indicators are, therefore, used to report results of the group evaluation.
1) Overall Segmentation Performance: In accordance with standard evaluation methodology [1], [48], we use the following indicators to score segmentation performance:
• Precision, i.e., the proportion of correctly identified sclera pixels in relation to all pixels determined as belonging to the sclera by a given model: $\frac{TP}{TP + FP}$ [54].
• Recall, i.e., the proportion of correctly identified sclera pixels in relation to all pixels marked as belonging to the sclera region in the ground truth: $\frac{TP}{TP + FN}$ [54].
• F1-score, i.e., the harmonic mean of precision and recall: $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ [1].
• Intersection over Union (IoU) or Jaccard index, i.e., the quotient between the size of the intersection of the predicted and actual sclera regions, and the size of the union of the two, computed as: $\frac{TP}{TP + FP + FN}$.
Here, $TP$, $FP$, and $FN$ stand for the number of true positives, false positives and false negatives generated by the models with respect to the ground truth. Additionally, we report complete precision-recall (PR) curves [55], [56] based on the computed probabilistic predictions and the corresponding Area Under the precision-recall Curve (AUC) [57] as another aggregate performance indicator that provides a more holistic view on the performance of the evaluated models.
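The four indicators above can be computed from pixel-wise counts of true positives, false positives and false negatives. The following minimal sketch assumes binary masks stored as nested lists of 0/1 values; the function names and the convention of returning 0.0 when a denominator is zero are choices made for this example, not details taken from the evaluation.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives pixel-wise."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def segmentation_scores(pred, truth):
    """Precision, recall, F1 and IoU for one predicted/ground-truth mask pair."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy example: TP = 2, FP = 1, FN = 1.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
scores = segmentation_scores(pred, truth)
print(scores)  # precision = recall = F1 = 2/3, IoU = 0.5
```

In the toy example the F1-score (2/3) is higher than the IoU (1/2), which holds in general whenever the prediction is imperfect, since F1 = 2·IoU / (1 + IoU).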
2) Bias Evaluation: Bias is commonly quantified through a measure of performance (or error) dispersion across different subgroups of the evaluation data [18], [29]. Following this established practice, we report the standard deviation (STD) and mean absolute deviation (MAD) of the computed performance indicators as two measures of bias in our experiments [30], i.e.:
• Standard Deviation (STD), defined as the square root of the average squared deviation between the performance of specific subgroups $p_g$ and the mean performance across all groups $\bar{p}$:
$$\mathrm{STD} = \sqrt{\frac{1}{G} \sum_{g=1}^{G} \left(p_g - \bar{p}\right)^2}$$
• Mean Absolute Deviation (MAD), defined as the average absolute deviation between the performance on specific subgroups $p_g$ and the mean performance across all groups $\bar{p}$:
$$\mathrm{MAD} = \frac{1}{G} \sum_{g=1}^{G} \left|p_g - \bar{p}\right|$$
In the above equations, $G$ refers to the number of different subgroups in the data, $p_g$ denotes the group-specific performance (in our case the F1-score), and $\bar{p} = \frac{1}{G} \sum_{g=1}^{G} p_g$. While the two measures capture similar aspects of the bias, STD gives larger importance to outliers (worst case scenario), whereas MAD is influenced more by the majority of subgroups.
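The two dispersion measures described above can be sketched in a few lines. The example below takes a list of group-level scores (e.g., one F1-score per eye-color group) and uses the population form of the standard deviation, i.e., division by G, in line with the "average squared deviation" wording; the function names and the example scores are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def std_bias(group_scores):
    """STD: square root of the average squared deviation from the group mean."""
    m = mean(group_scores)
    return math.sqrt(sum((p - m) ** 2 for p in group_scores) / len(group_scores))


def mad_bias(group_scores):
    """MAD: average absolute deviation from the group mean."""
    m = mean(group_scores)
    return sum(abs(p - m) for p in group_scores) / len(group_scores)


# Hypothetical F1-scores: three groups perform alike, one is an outlier.
group_f1 = [0.9, 0.9, 0.9, 0.5]
std_value = std_bias(group_f1)
mad_value = mad_bias(group_f1)
print(round(std_value, 4), round(mad_value, 4))
```

With these made-up scores the single poorly served group pushes STD (about 0.173) above MAD (0.15), which illustrates the remark that STD weighs outliers more heavily.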
In general, STD and MAD quantify the performance variations across different data subgroups (i.e., bias) but ignore the innate variations of the data that also cause performance differences. Based on this observation and the insights from [58], we, therefore, propose and introduce two disparity measures that weigh the computed group-specific dispersion against the observed dispersion on some reference data:
• Control Group Disparity (CGD), which we define as the ratio between the standard deviation of the performance scores between different data subgroups and the corresponding standard deviation computed on control groups:
$$\mathrm{CGD} = \frac{\sqrt{\frac{1}{G} \sum_{g=1}^{G} \left(p_g - \bar{p}\right)^2}}{\sqrt{\frac{1}{G} \sum_{c=1}^{G} \left(p_c - \bar{p}_C\right)^2}}$$
where there are $G$ control groups in total, and each control group $c$ matches the size of one of the original attribute-specific data subgroups, but contains randomly chosen samples. Additionally, $\bar{p}_C = \frac{1}{G} \sum_{c=1}^{G} p_c$.
• Fisher Disparity (FSD), which we define as the ratio between the standard deviation of the performance scores across different data subgroups and the mean standard deviation within the subgroups, i.e.:
$$\mathrm{FSD} = \frac{\sqrt{\frac{1}{G} \sum_{g=1}^{G} \left(p_g - \bar{p}\right)^2}}{\frac{1}{G} \sum_{g=1}^{G} \sqrt{\frac{1}{n_g} \sum_{i=1}^{n_g} \left(p_i - p_g\right)^2}}$$
where $p_i$ is the performance on the $i$-th data instance (i.e., a single image) and $n_g$ is the number of images in the $g$-th subgroup.
Both FSD and CGD consider reference variations when quantifying bias, but do so based on different assumptions and, therefore, provide complementary information w.r.t. the observed performance differences across data subgroups.
TABLE II. HIGH-LEVEL COMPARISON OF THE SEGMENTATION MODELS DEVELOPED FOR THE GROUP EVALUATION. THE MODELS EXHIBIT DIVERSITY ACROSS THE BASE ARCHITECTURE USED, THE FORMAT OF THE INPUT DATA, THE USE OF AUGMENTATION STRATEGIES, NORMALIZATION PROCEDURE, PROBLEM FORMULATION, LEARNING OBJECTIVE AND COMPLEXITY.
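The two disparity measures, CGD and FSD, can be illustrated with the sketch below, which works on per-image scores grouped by attribute. Several points are assumptions of this example rather than details confirmed by the text: the group-level performance is taken to be the mean of the per-image scores, control groups are formed by randomly partitioning the pooled images into groups of matching sizes, population standard deviations (division by the count) are used throughout, and all function names are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def pstd(values):
    """Population standard deviation (division by the number of values)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def make_control_groups(groups, seed=0):
    """Randomly re-partition the pooled per-image scores into groups whose
    sizes match the original attribute-specific groups (an assumption)."""
    pooled = [score for group in groups for score in group]
    random.Random(seed).shuffle(pooled)
    controls, start = [], 0
    for group in groups:
        controls.append(pooled[start:start + len(group)])
        start += len(group)
    return controls


def control_group_disparity(group_scores, control_scores):
    """CGD: dispersion across attribute groups over dispersion across controls."""
    return pstd(group_scores) / pstd(control_scores)


def fisher_disparity(groups):
    """FSD: dispersion across group means over the mean within-group dispersion."""
    between = pstd([mean(g) for g in groups])
    within = mean([pstd(g) for g in groups])
    return between / within


# Hypothetical per-image F1-scores for two attribute groups.
groups = [[0.90, 0.92, 0.88], [0.70, 0.72, 0.68]]
fsd = fisher_disparity(groups)
controls = make_control_groups(groups, seed=0)
cgd = control_group_disparity([mean(g) for g in groups],
                              [mean(c) for c in controls])
```

Values well above 1 for either ratio suggest that the attribute explains more of the performance spread than random grouping or within-group variation would. Neither function guards against a zero denominator, which can occur when all reference scores are identical.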
IV. SUMMARY OF DEVELOPED MODELS
Seven research teams developed segmentation models for the group evaluation, i.e., the Chinese Academy of Sciences (CAS), the Democritus University of Thrace (DUTH), the Warsaw University of Technology (WUT), the Federal University of Parana (UFPR), the Hochschule Darmstadt (HDA), Fraunhofer IGD (IGD), and the Norwegian University of Science and Technology (NTNU). More details on the submitted models are provided in the following section, along with links to the corresponding source code repositories, to ensure reproducibility of the results and to provide additional implementation details on all developed segmentation approaches.

A. Model Descriptions
ScleraSegNet (CAS). The CAS group designed an attention-assisted U-Net-based [59] model for sclera segmentation [60], called ScleraSegNet. The model incorporates modules for channel- and spatial-attention into both the central bottleneck and the skip-connection part of the base U-Net architecture. This helps improve the sensitivity of the model to foreground/background pixels and also alleviates the interference of noise factors. ScleraSegNet is trained with images resized to a width of 600 pixels (regardless of the original size of the images in the training data), while maintaining the original aspect ratio. Heavy data augmentation is performed, including random resizing, blurring, translating, flipping, rotating and cropping (to 321 × 321 pixels) to avoid overfitting. Binary cross-entropy is utilized as the loss function in training. When generating the binary masks for the evaluation, the binarization threshold is set to 0.5. ScleraSegNet is available from github.com/xiamenwcy/ScleraSegNet.
ScleraU-Net2 (DUTH). The DUTH group developed a novel U-Net-inspired model based on the ScleraU-Net architecture designed initially for the SSBC 2020 competition [1]. Compared to the original U-Net, ScleraU-Net2 has a reduced number of convolutional layers and, therefore, exhibits decreased network complexity. This leads to a more lightweight architecture that is better tailored towards the sclera segmentation problem. Specifically, ScleraU-Net2 comprises 8 filter kernels in its first convolutional layer, compared to the 64 kernels of the original U-Net. For the subsequent layers the number of filters is doubled after every pooling operation. Another key improvement in ScleraU-Net2 is the use of Group Normalization (GN) after each convolutional layer. Group normalization [61] is used as a replacement for Batch Normalization (BN) and is paramount for models trained with relatively small batch sizes, where BN layers may fail to properly capture the distribution parameters, resulting in poor normalization with adverse effects on generalization. Finally, the activations of all convolutional layers are replaced with GELUs [62], for which improved performance across many vision tasks has been reported in the literature. Training is performed with a fixed learning rate of $10^{-4}$ and a batch size of 6. Data augmentation is applied in an online fashion, by random horizontal flipping and a limited amount of rotation and shear. The probability maps are converted to binary maps using a fixed threshold of 0.5. ScleraU-Net2 is available from github.com/georgezampoukis/ScleraU-Net2_SSBC.
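To make the ScleraU-Net2 design choices more concrete, the sketch below illustrates three of the ingredients mentioned above in plain Python: the filter-doubling schedule, group normalization computed on a single sample, and the GELU activation. It is not taken from the DUTH implementation. The number of resolution levels (5, as in the original U-Net), the number of groups, the omission of the learnable scale and shift, and all function names are assumptions of this example.

```python
import math


def filter_schedule(first_layer_filters, levels=5):
    """Filters per resolution level when the count doubles after each pooling."""
    return [first_layer_filters * 2 ** i for i in range(levels)]


def group_norm(channels, num_groups, eps=1e-5):
    """Group normalization for ONE sample.

    `channels` is a list of C channels, each a flat list of activations.
    Statistics are computed per group of channels within the sample, so the
    result does not depend on the batch size (unlike batch normalization).
    The learnable affine parameters are omitted for brevity.
    """
    if len(channels) % num_groups:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(channels) // num_groups
    normalized = []
    for start in range(0, len(channels), size):
        group = channels[start:start + size]
        values = [v for channel in group for v in channel]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.extend([[(v - mu) * scale for v in channel] for channel in group])
    return normalized


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


unet_filters = filter_schedule(64)       # original U-Net
lightweight_filters = filter_schedule(8)  # 8 filters in the first layer
normalized = group_norm([[1.0, 3.0], [5.0, 7.0]], num_groups=1)
```

Under the assumed five levels, starting from 8 instead of 64 filters gives an eightfold reduction in channel width at every level, which is the main source of the reduced complexity described in the text.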
MU-Net (WUT). The main idea behind the approach of the WUT group is to utilize a light-weight architecture designed for mobile computing that allows for efficient learning of segmentation models with limited training data. Along these lines, the WUT group designed a U-Net-like encoder-decoder model, named MU-Net, with a MobileNetV2 [63] encoder pretrained on ImageNet. The model is fine-tuned for sclera segmentation using the provided training data, augmented with horizontal flips, and the standard binary cross-entropy learning objective. Because the encoder model, MobileNetV2, has fewer parameters than the encoder of the original U-Net, the entire model converges quickly, while also ensuring high segmentation accuracy and good generalization. The model produces a probabilistic segmentation prediction for each pixel location. A binarization threshold is, therefore, chosen by iterating over the validation images and fixing the threshold at the value that achieves the highest F1-score to produce the binary segmentation results required for the evaluation.
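The threshold selection described for MU-Net can be sketched as a simple grid search over candidate thresholds. This is an illustration only, not the WUT code: the candidate grid (0.01 to 0.99), pooling pixels over all validation images before computing the F1-score, and the function names are assumptions made here.

```python
def f1_at_threshold(prob_maps, truths, threshold):
    """F1-score over all validation pixels after thresholding the predictions."""
    tp = fp = fn = 0
    for prob_map, truth in zip(prob_maps, truths):
        for prob_row, truth_row in zip(prob_map, truth):
            for p, t in zip(prob_row, truth_row):
                predicted = p >= threshold
                if predicted and t:
                    tp += 1
                elif predicted and not t:
                    fp += 1
                elif t and not predicted:
                    fn += 1
    denominator = 2 * tp + fp + fn
    # 2TP / (2TP + FP + FN) is equivalent to the harmonic mean of P and R.
    return 2 * tp / denominator if denominator else 0.0


def select_threshold(prob_maps, truths, candidates=None):
    """Return the candidate threshold with the highest validation F1-score."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: f1_at_threshold(prob_maps, truths, t))


# Toy validation set: one 1 x 4 image.
val_probs = [[[0.2, 0.4, 0.6, 0.8]]]
val_truth = [[[0, 1, 1, 1]]]
best = select_threshold(val_probs, val_truth)
best_f1 = f1_at_threshold(val_probs, val_truth, best)
```

When several thresholds tie, `max` keeps the first one in the candidate list, so the lowest of the tied thresholds is returned.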
FCN8 (UFPR). Inspired by the work of Long et al. [64], the UFPR group designed a segmentation model, FCN8, based on a Fully Convolutional Network (FCN). Due to the fully convolutional structure, the model is applicable to input images of arbitrary size and produces segmentation results of corresponding size. For FCN8, an architecture similar to the one proposed by Teichmann et al. [65] is utilized and relies on a VGG-16 model (without the FC layers) in the encoder and a three-layer decoder for upsampling. A unique aspect of FCN8 is the design of the bottleneck module in the encoder-decoder architecture, which retains the spatial dimension instead of compressing all of the information of the input image into a vectorized latent representation. This design choice allows for the use of a simple decoder that does not need to learn decoding spatial information from the latent representation and can, therefore, be trained efficiently with a limited amount of data. FCN8 is learned with a cross-entropy training objective.
CGANs2020CL (HDA). The HDA group contributed an approach that framed the sclera segmentation task as a patch-based image-translation problem [66] and used a Conditional Generative Adversarial Network (cGAN) as the basis for the segmentation. The goal of the cGAN in this setting is to implement a mapping from the gray-scale real-valued (ocular image) domain to the binary sclera domain. The backbone of the model is a ResNet-101 [67] learned from scratch using only the provided training data. To avoid overfitting and ensure that a well performing model with good generalization capabilities is learned, aggressive data augmentation is performed using the Imgaug library. Here, various augmentation strategies were considered, including rotations, flipping, cropping, color manipulations and others. The model is trained using a weighted sum of the GAN and $L_1$-reconstruction losses.
RGB-SS-Eye-MS 9 (IGD). The IGD team developed a model that extends the multi-scale eye segmentation solutions (Eye-MS) from [68]. The model is a convolutional neural network (CNN) that refines the segmentation progressively using different input resolutions. Each of the refinement modules consists of two convolutional layers, followed by a normalization layer and a LReLU non-linearity. RGB-SS-Eye-MS is trained with the Intersection over Union (IoU) loss using the SGD optimizer with a learning rate of 0.1 and a batch size of 32. Heavy data augmentation in the form of random cropping, horizontal flipping, brightness and contrast changes, blurring and noise infusions is employed to improve generalization. The predicted segmentation is rounded to the nearest integer values to generate binary segmentation masks.
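For illustration, an IoU-based training objective and the rounding step can be sketched as below. The exact loss formulation used by the IGD team is not specified here, so the commonly used soft (differentiable) IoU is assumed, and the function names and toy values are hypothetical. In actual training the loss would be written in an automatic-differentiation framework; NumPy is used only to show the arithmetic.

```python
import numpy as np

def soft_iou_loss(pred, target, eps=1e-7):
    """1 - soft IoU between a probability map and a binary mask (assumed formulation)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return 1.0 - (intersection + eps) / (union + eps)

def binarize(pred):
    """Round each probability to the nearest integer (0 or 1)."""
    # Note: np.rint rounds exact ties (0.5) to the nearest even integer, i.e. to 0.
    return np.rint(pred).astype(np.uint8)

# Hypothetical 2x2 example.
target = np.array([[0, 1], [1, 1]])
pred = np.array([[0.1, 0.8], [0.6, 0.4]])

loss = soft_iou_loss(pred, target)   # 1 - 1.8 / 3.1
mask = binarize(pred)                # [[0, 1], [1, 0]]
```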
ScleraMaskRCNN 10 (NTNU). The solution designed by the NTNU group, called ScleraMaskRCNN, follows the two-stage approach from Mask-RCNN originally proposed in [69]. In the first stage, ScleraMaskRCNN generates region proposals for the sclera in the input image and in the second stage then classifies these proposals into the most likely class (i.e., sclera/other). A pixel-level mask is also computed in the second stage to facilitate the (instance) segmentation procedure. The model uses a ResNet-101 [67] as the backbone for feature extraction and is trained using a joint objective that combines losses for classification/localization and segmentation-mask prediction. No augmentation is used during training.

B. Comparative Analysis
A high-level comparison of the developed segmentation models is presented in Table II. In accordance with recent trends in the segmentation literature [1], [9], [48], [70], all of the contributed models use deep learning to efficiently capture the complex appearance variations present in the ocular images. The majority of solutions rely on encoder-decoder architectures (e.g., U-Net-based, FCN) with an information bottleneck as the basis for segmentation, but custom designs (CRN) and Mask-RCNN models are also represented among the contributed models. Notably, all models learn from color as well as texture information, except for the solution from the HDA group that relies exclusively on texture (i.e., processes gray-scale images). The developed models also differ in terms of problem statement (semantic vs. instance segmentation and segmentation vs. image translation) and corresponding learning objectives. Finally, we observe considerable differences in the number of trainable parameters, ranging from 409K for the most light-weight model to 138M for the heaviest one. Overall, the developed models represent a rich and diverse set of segmentation techniques for the group evaluation.

V. EXPERIMENTS AND RESULTS
In this section, we present the results of the group evaluation that: (i) analyze the performance of different sclera segmentation models over multiple test datasets, (ii) investigate performance differences of the models across various data subgroups and training configurations, and (iii) study the correlations between bias and overall segmentation performance. We make our evaluation code publicly available to ensure the reproducibility of our results. 11

A. Segmentation Performance
In the first series of experiments, we benchmark the developed models with respect to the overall segmentation performance. We consider the complete-training-data (CTD) protocol for these experiments and use the (unseen) MOBIUS and SLD test images for scoring. We separately analyze results based on: (i) the submitted binary segmentation masks, where the participating groups performed the binarization procedure on their own, and (ii) the probabilistic masks, processed independently by the organizers of the group evaluation.
1) Results on Binary Masks: In Fig. 3 we show the average $F_1$ scores obtained by the developed models together with the corresponding standard deviations computed from n = 5 (disjoint) stratified subsets of images sampled from each of the two test datasets. These results provide insight into the performance of the segmentation models, but also into the variability of the observed scores. More detailed results across the remaining performance indicators are given in Table III. Here, only the mean scores are reported to keep the results uncluttered. The models are sorted with respect to the harmonic $F_1$ mean calculated across the two evaluation datasets.
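The two summary statistics used above, i.e., the mean and standard deviation over disjoint image subsets and the harmonic mean of the two dataset-level $F_1$ scores, can be sketched as follows. The function names and numeric values are hypothetical, and a plain random split is used as a stand-in for the stratified sampling of the actual evaluation.

```python
import numpy as np

def subset_mean_std(per_image_f1, n_subsets=5, seed=0):
    """Split per-image F1 scores into disjoint subsets; return mean and std of the subset means."""
    rng = np.random.default_rng(seed)
    # Assumption: random disjoint split instead of the stratified split used in the paper.
    shuffled = rng.permutation(np.asarray(per_image_f1, dtype=float))
    subset_means = np.array([s.mean() for s in np.array_split(shuffled, n_subsets)])
    return subset_means.mean(), subset_means.std()

def harmonic_mean(a, b):
    """Harmonic mean of two scores; penalizes models that do well on only one dataset."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

# Hypothetical dataset-level F1 scores for one model.
f1_mobius, f1_sld = 0.8, 0.6
h = harmonic_mean(f1_mobius, f1_sld)   # ≈ 0.686, below the arithmetic mean of 0.7

# Hypothetical per-image scores for the subset statistics.
scores = np.random.default_rng(1).uniform(0.5, 0.9, size=100)
mean_f1, std_f1 = subset_mean_std(scores)
```

Sorting by the harmonic rather than the arithmetic mean favors models with balanced performance on MOBIUS and SLD.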
As can be seen, RGB-SS-Eye-MS and CGANs2020CL perform the best overall in this setting with harmonic $F_1$ means of 0.778 and 0.765, respectively, followed closely by ScleraU-Net2 with a score of 0.742, mostly due to the more consistent performance across both test datasets. FCN8 and ScleraSegNet exhibit a slightly weaker performance than ScleraU-Net2 in terms of the harmonic $F_1$ mean with scores of 0.741 and 0.739, respectively, but achieve the best and second best performance on MOBIUS. MU-Net and ScleraMaskRCNN yield the sixth and seventh best results and rank behind the best performing models with corresponding harmonic $F_1$ means of 0.729 and 0.538. It is interesting to note that among the top performers, models using a fixed thresholding procedure for generating the binary masks (ScleraU-Net2 and ScleraSegNet) result in more consistent $F_1$ scores across the datasets than models using dynamic thresholding (RGB-SS-Eye-MS, CGANs2020CL and FCN8). Nonetheless, finding a good trade-off between precision and recall scores appears to be challenging for all models regardless of the thresholding strategy used, as evidenced by the difference between the two performance scores and their variability across MOBIUS and SLD in Table III. Overall, we observe that 6 of the 7 submitted models are within a performance difference of less than 0.05 in terms of the harmonic $F_1$ mean. However, larger performance variations are observed with other (individual) performance scores (recall, precision, IoU) over each of the two datasets.
2) Results on Probabilistic Masks: To get better insight into the performance of the segmentation models, we generate precision-recall curves from the probabilistic segmentation masks and show these together with the optimal operating point in terms of $F_1$ score in Fig. 4. Additionally, we also visualize the operating points that correspond to the binary masks in the same graph. Numerical results computed based on the curves are summarized in the right part of Table III. Several observations can be made based on these results:
• Binarization. Using an optimal threshold for generating segmentation masks from the probabilistic predictions in general improves results for all models in terms of the harmonic $F_1$ mean. Additionally, the binary operating points are often not located on the PR curves due to different strategies used for either producing the probabilistic predictions or determining the binarization threshold. This suggests that even with fixed segmentation models, efficient mechanisms for segmentation mask generation are critical for performance and suitable trade-offs between precision and recall scores.
[...] means of 0.863 and 0.856 across the two datasets, respectively. ScleraSegNet ranks third with a score of 0.830, whereas the rest perform weaker. These results suggest that mechanisms that allow for efficient training with limited training data (e.g., heavy augmentation, use of pretrained models) lead to the most competitive segmentation performance.
3) Qualitative Results: In Figs. 5 and 6 we visualize the (binary) segmentation masks generated by the evaluated models for a few example images from the two experimental datasets that produced the best and the worst segmentation results across the evaluated models. We can see that for the well-performing samples in Fig. 5, all evaluated models generate competitive results and generalize well across different gaze directions. For the more challenging samples from Fig. 6, on the other hand, the segmentation models produce very different errors. While some are able to identify the sclera region reasonably well in the presence of partially closed eyes and eyelash occlusions (e.g., see the results for CGANs2020CL and FCN8), others struggle to locate parts of the sclera region or introduce visible artifacts.

B. Bias Analysis
As emphasized by Mehrabi et al. in [15], biases come in various shapes and forms and may raise issues related to the fairness of automated decisions made by machine learning algorithms. To better understand the behavior of sclera segmentation models in this regard, we explore two types of biases in this section, i.e.:
• Algorithmic Bias: The first type of bias originates from the machine learning algorithms and is typically associated with the design choices made, the optimization objective used, the regularizations considered and similar algorithm-specific characteristics [71]. We study algorithmic bias in the following sections by comparing the developed models on subgroups of the test samples with fixed and predefined training data.
• Representation/Sampling Bias: The second type of bias stems from the way the data is sampled from a population during the data collection process [15], [71]. Unrepresentative training data or potential biases in the data are typically inherited by the machine learning models and (may) eventually lead to unfair decisions. We investigate representation bias within the group evaluation by analyzing models learned with different sets of training data.
To ensure a comprehensive analysis, experiments are conducted with subgroups defined based on different data characteristics. Specifically, we consider subgroups generated based on demographic (eye color and ethnicity) as well as environmental factors (acquisition device and gaze direction), which represent two of the main groups of data characteristics most critical from a bias perspective according to [13]. The selection of characteristics is also motivated by the annotations available in the datasets utilized for the group evaluation. We note that all experiments are performed with stratified subgroups to mitigate issues related to different sample sizes.
1) Algorithmic Bias: When investigating algorithmic bias, we consider $F_1$ scores computed from the binary segmentation masks as the basis for the analysis. The binarization threshold is, thus, set automatically during training, similarly to a real-world operational scenario.
Eye-Color Bias. An important ocular characteristic, also often associated with race, is the eye color of the individuals. The vast majority of people of Asian and African origins, for example, have brown eyes, whereas people of Caucasian origin typically exhibit a wider spectrum of eye colors. To explore the impact of eye color on segmentation performance, we conduct an analysis on the test images from the MOBIUS dataset with (stratified) subgroups that correspond to subjects with brown, green, blue and gray eyes. The models scored in the analysis are trained using the CTD protocol, so all eye colors are well represented in the training data. Fig. 7(a) shows that all models underperform with green eyes, where green-color-specific $F_1^{\text{eye}}$ scores between 3.4% (ScleraU-Net2) and 14.8% (ScleraMaskRCNN) below the average performance across all images $F_1^{\text{all}}$ are observed, and $F_1^{\text{eye}}$ scores between 8.4% (ScleraU-Net2) and 22.0% (ScleraMaskRCNN) below the best performing eye color are seen. The (differential) results for the remaining eye colors are closer in general, with blue and gray eyes consistently yielding the highest scores for all tested models and brown eyes resulting in somewhat lower but still above-average $F_1^{\text{eye}}$ values. These systematic performance differentials are unexpected, especially given the fact that the blue- and gray-colored eyes are less represented in the provided training data than the green and brown eyes, suggesting that eye color represents a critical image characteristic with considerable impact on the segmentation performance and fairness of the evaluated models, regardless of their design. While the presented $F_1$ scores provide an initial idea about the performance differentials due to eye color, they are based on selected subgroup samples that may contain additional sources of variability that affect performance [19].
Since these sources cannot be easily accounted for (as they are in general unknown), we report the proposed bias disparities, CGD and FSD, in the right part of Fig. 7(a). As can be seen, all models exhibit a CGD score above 1, suggesting that the variability in segmentation performance due to color variations is larger (even though moderately so) than the variability seen in randomly sampled subgroups. The lowest performance differentials are seen with the ScleraU-Net2 model with a CGD score of 1.64 and the largest for the CGANs2020CL approach with a CGD value of 8.23. Interestingly, this model is also the only one trained on gray-scale images. The fact that the only model working with gray-scale images exhibited by far the highest degree of eye-color bias suggests that using color information (for training and at run-time) is beneficial for stable results across different eye colors. A similar ranking can also be observed when normalizing the bias scores using within-group variations in FSD. Here, MU-Net and ScleraU-Net2 exhibit the most stable performance across the eye-color subgroups, whereas CGANs2020CL again results in the largest performance differences. In Fig. 7(b) we compare the segmentation models with respect to all four bias scores and relative to the average performance across models (dashed line). In this relative comparison, MU-Net and FCN8 are the only two models that yield below-average bias scores across all performance indicators. On the other end of the spectrum are the CGANs2020CL and ScleraSegNet models, which exhibit above-average bias scores with all considered measures, exceeding the bias scores of the best performing models by a factor of more than 2.
Fig. 8. Differential performance and bias scores with respect to ethnicity on a chimeric test dataset (MOBIUS+SMD+SLD). The best disparity score in (a) is presented in bold, the second best is underlined. In (b), the disparities (CGD and FSD) are normalized to the range of STD and MAD for visualization purposes. The mean value of each bias score is shown as a dashed line and lower values imply better performance/behavior, i.e., ↓. The figure is best viewed electronically and in color.
Fig. 9. Illustration of MOBIUS images captured with three different acquisition devices in an indoor setting. Note that the devices produce images of different characteristics in terms of color tone, sharpness, and focus.
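The subgroup differentials quoted in the eye-color analysis above (percentages below the average or below the best-performing subgroup) can be sketched as follows. The sketch assumes that the percentages are relative differences; the function name and all numeric values are hypothetical and are not taken from Fig. 7.

```python
def relative_differentials(group_f1, overall_f1):
    """Relative difference (in %) of each subgroup F1 to the overall and to the best subgroup F1."""
    best = max(group_f1.values())
    return {
        name: {
            # Negative values mean the subgroup scores below the reference.
            "vs_overall_pct": 100.0 * (score - overall_f1) / overall_f1,
            "vs_best_pct": 100.0 * (score - best) / best,
        }
        for name, score in group_f1.items()
    }

# Hypothetical subgroup scores for a single model.
eye_color_f1 = {"brown": 0.80, "green": 0.72, "blue": 0.84, "gray": 0.84}
diffs = relative_differentials(eye_color_f1, overall_f1=0.80)

for color, d in diffs.items():
    print(f"{color:>5}: {d['vs_overall_pct']:+.1f}% vs. overall, {d['vs_best_pct']:+.1f}% vs. best")
```

Such raw differentials do not account for other sources of variability in the subgroups, which is why the analysis additionally relies on the CGD and FSD disparities.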
Ethnicity Bias. The datasets used for the group evaluation contain individuals of different ethnicities. MOBIUS predominantly consists of Caucasian (white) subjects, whereas SMD and SLD contain subjects of Indian descent. To explore ethnicity-related bias, we construct a chimeric dataset from the test images in the MOBIUS, SMD and SLD datasets. We note that the three datasets also differ to some extent in image characteristics other than ethnicity (e.g., due to the capturing equipment and lighting), so cross-talk from other attributes may be present in the reported results. This cross-talk needs to be taken into account when interpreting results, but is accounted for (partially) by the disparity measures. To ensure that there is no overlap between the training and testing images, we use the limited-training-data (LTD) protocol for the analysis with MASD and SBVPI (having a total of 4482 images from 137 distinct subjects) serving as the training data.
From the results in Fig. 8(a) 12 we observe that several of the tested models produce considerable performance differences for subjects of different ethnicities. RGB-SS-Eye-MS, ScleraU-Net2, and ScleraSegNet, for example, show a difference of 29.9%, 37.2% and 41.8% in the ethnicity-specific $F_1$ scores, [...] differences between the two ethnicities of below 5%. When looking at the disparity measures, we notice that ethnicity induces significantly larger performance variations than eye color on average, with CGD scores reaching values above 10 for most models and FSD scores above 0.5. While the performance of the fairest (most unbiased) model, CGANs2020CL, is comparable to the best model from Fig. 7, we still observe performance differentials that are larger than in the control group. As illustrated in Fig. 8(b), three (valid) models achieve below-average performance differences across the ethnicities when considering all four bias scores, i.e., CGANs2020CL, FCN8 and ScleraMaskRCNN. It needs to be noted, though, that the overall segmentation performance is quite different for the three models, with $F_1^{\text{all}}$ scores of 0.666, 0.770 and 0.597 for CGANs2020CL, FCN8 and ScleraMaskRCNN, respectively.
Acquisition-Hardware Bias. Next, we focus on performance differentials induced by the acquisition devices. For this part of the analysis, we again consider the test images from the MOBIUS dataset, which come from three different mobile phones. We use the CTD protocol to make sure examples from all capture devices are present in equal amounts in the training data. We note that, in general, all acquisition devices generate images of reasonable quality but with differences in color tone, sharpness and focus, as shown in Fig. 9.
The results in Fig. 10(a) show that the performance differences due to the capture device are overall larger than those originating from eye color, but are below the differentials observed for ethnicities. In general, most models (except RGB-SS-Eye-MS) perform strongest with images from the Xiaomi phone, whereas the other two acquisition devices produce mixed rankings across the models. We see performance differences in the range of 5.1% (RGB-SS-Eye-MS) to 15.4% (CGANs2020CL) between the best and worst device-specific $F_1^{\text{hdw}}$ scores and observe that even when normalized against reference data variations (in CGD and FSD), the acquisition device still has a considerable impact on segmentation performance. When comparing the models in terms of all four bias scores in Fig. 10(b), we notice below-average performance differentials in terms of all four scores for the RGB-SS-Eye-MS and ScleraU-Net2 models and above-average differentials for the CGANs2020CL and ScleraSegNet models. Both of the strongest models in this experiment frame the sclera segmentation task as a semantic segmentation problem and are among the lighter models in terms of trainable parameters, which helps to generate stable segmentation results with limited performance variations across different capture devices.
Fig. 10. Differential performance and bias scores w.r.t. acquisition hardware on the MOBIUS dataset. The best disparity CGD and FSD scores in (a) are presented in bold, the second best are underlined. In (b), the disparities (CGD and FSD) are normalized to the range of STD and MAD for visualization purposes. The mean value of each bias score is shown as a dashed line and lower values imply better performance/behavior, i.e., ↓. The figure is best viewed electronically and in color.
Fig. 11. Differential performance and bias scores with respect to eye gaze evaluated on the MOBIUS dataset. For the disparities in (a) the best score is presented in bold, the second best is underlined. In (b), the disparities (CGD and FSD) are normalized to the range of STD and MAD for visualization purposes. The mean value of each bias score is shown as a dashed line and lower values imply better performance/behavior, i.e., ↓. The figure is best viewed electronically and in color.
Gaze-Direction Bias. Another potential source of differences in segmentation performance is the gaze direction in which the eye was imaged. To explore the impact of gaze on segmentation performance, we conduct an analysis on the test images from the MOBIUS dataset with (stratified) subgroups that correspond to images captured with subjects looking up, left, straight, and right. The models scored in the analysis are trained using the CTD protocol, so all gaze directions are well represented in the training data, as both the MASD and SBVPI datasets (which form the majority of the training data in the CTD protocol) contain images with varying gaze directions; refer to Table I for details on the dataset characteristics.
The results in Fig. 11(a) show that the worst performance is fairly consistently achieved with the straight gaze direction, where straight-gaze-specific $F_1^{\text{gaze}}$ scores between 2.5% (MU-Net) and 19.5% (ScleraU-Net2) below the average performance across all images $F_1^{\text{all}}$ are observed. Note that this result appears despite the fact that the straight gaze direction is slightly overrepresented in the provided training data (since the 500 images in SMD are captured in the straight gaze direction only). The upwards gaze direction consistently results in the best segmentation performance, with upward-gaze-specific $F_1^{\text{gaze}}$ scores between 6.1% (ScleraU-Net2) and 21.4% (ScleraMaskRCNN) above $F_1^{\text{all}}$. The left and right directions result in roughly equivalent performances across the board, mostly falling between the performances on the upwards- and the straight-gaze directions.
The (poor) results with the straight-gaze direction can be attributed to the fact that under this direction, the sclera commonly appears in the form of two distinct areas of roughly the same size with possibly different brightness values due to the external illumination conditions. This also explains why the worst individual performances were observed with the sunny and well-lit samples in Fig. 6, even though the images captured in sunny weather and well-lit rooms achieve better segmentation performance than images captured in poorly-lit rooms on average [1]. Conversely, with the upwards-gaze direction the sclera typically takes the form of a single contiguous area with potential gradual changes in illumination and contrast, leading to above-average segmentation results. The weaker results with the straight-direction images are somewhat in conflict with findings for the feature extraction stage, where matching with the straight-gaze direction images was shown to lead to better recognition results than matching with other gaze direction images [5]. This implies that in real-world scenarios, where sclera segmentation is still a relatively difficult problem (unlike in the laboratory conditions explored in [5]), a balance in the performance of the methods addressing these two steps has to be achieved for a successful overall recognition pipeline.
As all models exhibit a CGD score significantly above 1, we can conclude that the variability in segmentation performance due to gaze variations is larger than the variability seen in randomly sampled subgroups. The lowest performance differentials are seen with the FCN8 model with a CGD score of 4.42 and the largest for the ScleraU-Net2 approach with a CGD value of 18.9. A similar ranking can also be observed when normalizing the bias scores using within-group variations in FSD. Here, MU-Net and ScleraMaskRCNN exhibit the most stable performance across the gaze-direction subgroups, whereas ScleraU-Net2 again results in the largest performance differences. In Fig. 11(b) we compare the segmentation models with respect to all four bias scores and relative to the average performance across models (dashed line). In this relative comparison, FCN8 and ScleraSegNet are the only two models that yield below-average bias scores across all performance indicators. On the other end of the spectrum is ScleraU-Net2, which exhibits above-average bias scores with all considered measures, exceeding the bias scores of the best performing models by a factor of 4.
2) Representation Bias: Performance differentials across different data subgroups are often ascribed to biased (or unbalanced) training data [13], [15]. To provide insight into this issue, we study the representation (or sampling) bias in the context of performance differentials induced by ethnicities in the next series of experiments. For the analysis we consider two configurations of the limited-training-data (LTD) experimental protocol: (i) in the first configuration, the SBVPI data with exclusively Caucasian subjects (i.e., 1858 images from 55 subjects) is used as the training data, and (ii) in the second configuration, MASD (having exclusively Indian subjects) as well as SBVPI (for a combined 4482 images from 137 subjects) are utilized for training. The test set consists of images from the MOBIUS, SMD and SLD datasets, which were not seen during training. We note that the MU-Net model did not converge properly using the limited amount of training data available in this experiment and is, therefore, excluded from the following analysis.
Several observations can be made from the results in Table IV: (i) The segmentation performance increases for both ethnicities in terms of $F_1^{\text{etn}}$ scores when adding more training data for the majority of models, i.e., SBVPI → MASD+SBVPI, suggesting that the added training samples contribute towards better segmentation results. (ii) The performance with Caucasian subjects is higher for nearly all models regardless of the training data used. The only notable exception here is ScleraMaskRCNN, which performs better with Indian subjects when the mixed-ethnicity data is used for training. (iii) While the performance differentials between the two ethnicities range between 7.5% (FCN8) and 32.2% (ScleraU-Net2) in terms of $F_1^{\text{etn}}$ scores when the models are trained on the SBVPI data, the range of performance differentials changes to between 1.65% (CGANs2020CL) and 41.8% (ScleraSegNet) when both MASD and SBVPI are utilized for the training procedure. As also seen from Fig. 12, where differences in the bias scores due to the training data are presented, i.e., $\mathrm{Bias} = \psi_{\mathrm{masd+sbvpi}} - \psi_{\mathrm{sbvpi}}$; $\psi \in \{\mathrm{STD}, \mathrm{MAD}, \mathrm{CGD}, \mathrm{FSD}\}$, several models (CGANs2020CL and FCN8) are able to significantly reduce the performance differences with more representative training examples in addition to improving their overall $F_1$ scores, whereas others (e.g., RGB-SS-Eye-MS, ScleraU-Net2, and ScleraSegNet) improve segmentation performance but also increase the performance differences. This observation is consistent with prior work studying representation bias in other problem domains, e.g., [72]: informative training data may help to reduce performance differentials with well-designed and trained models, but this is by no means guaranteed.

C. Bias vs. Segmentation Performance
In the final analysis we investigate the relationship between algorithmic bias across eye color, gaze direction, ethnicity, and capture device and the overall segmentation performance. The analysis for each of the four factors is conducted with the same experimental setup in terms of training and testing data as in the corresponding experiments from Section V-B. Thus, all models are trained on the same data to ensure a fair evaluation.
In Fig. 13 we plot the calculated CGD disparities against the $F_1$ scores for each experiment and as a function of the model size, i.e., the number of model parameters. The figure, thus, captures the trade-off the developed models offer in terms of bias, segmentation performance and model footprint. In an ideal setting, the models would have low bias (CGD scores on the y-axis), high performance ($F_1$ scores on the x-axis) and a low parameter count (circle areas), and would as such be located at the lower right in the presented graphs. To capture the relationships between the bias and performance scores, we fit a line to the data points in a least-squares manner. Since certain models performed much worse on certain training configurations, possibly due to insufficient training or errors in training data handling (see for instance Fig. 8(a) and Table IV), we eliminate outliers with an absolute $F_1$ z-score above 2 (i.e., models with an $F_1$ score that is more than 2 standard deviations from the mean $F_1$ score) and fit the line to the remaining data.
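The outlier removal and line fitting described above can be sketched as follows. This is an illustrative reconstruction and not the released evaluation code: the function name and the data points are hypothetical, and the population standard deviation is assumed for the z-scores.

```python
import numpy as np

def fit_bias_vs_performance(f1, cgd, z_max=2.0):
    """Drop models whose F1 z-score exceeds z_max in magnitude, then fit CGD = slope * F1 + intercept."""
    f1 = np.asarray(f1, dtype=float)
    cgd = np.asarray(cgd, dtype=float)

    # z-scores of the F1 values (population standard deviation, an assumption).
    z = (f1 - f1.mean()) / f1.std()
    keep = np.abs(z) <= z_max

    # Ordinary least-squares line through the remaining points.
    slope, intercept = np.polyfit(f1[keep], cgd[keep], deg=1)

    # Coefficient of determination (R^2) of the fitted line.
    residuals = cgd[keep] - (slope * f1[keep] + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((cgd[keep] - cgd[keep].mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2, keep

# Hypothetical (F1, CGD) pairs for seven models; the last one is a poorly trained outlier.
f1_scores = [0.74, 0.76, 0.77, 0.78, 0.75, 0.73, 0.30]
cgd_scores = [5.3, 4.7, 4.6, 4.5, 5.0, 5.4, 1.0]

slope, intercept, r2, keep = fit_bias_vs_performance(f1_scores, cgd_scores)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}, kept={int(keep.sum())} models")
```

A negative slope in such a fit corresponds to the situation where better performing models exhibit lower CGD disparities.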
As can be seen, there is a weak but consistent negative correlation between the performance differentials and overall segmentation performance for all considered factors. This suggests that better performing models tend to produce smaller performance differences over data subgroups. With improvements in visual segmentation techniques, reductions in the performance differentials may, therefore, also be expected. Active research on reducing bias with existing models is nevertheless a key concern going forward. If we look at the performance-bias trade-off from the perspective of model size, we observe that the largest model, FCN8, is consistently among the best models located at the bottom right of Fig. 13, while the smaller models (MU-Net, CGANs2020CL, and ScleraU-Net2) trend more toward the left (low performance) and top (high bias) of the graphs. This observation may suggest that model scaling can also have a beneficial impact on the segmentation models, similarly to what has been observed recently in other areas, where larger models were found to have a significant edge over their smaller counterparts [73], [74], [75].
Fig. 13. Scatter plots of the CGD disparities relative to the $F_1$ values achieved by the models. Lines fitted to the points in the graphs are also shown, along with their corresponding $R^2$ scores. The areas of the circles around the points represent the model sizes in terms of the number of parameters.

VI. DISCUSSION
The following observations were made based on the available sample of models with respect to the research questions laid out in the introductory section of the paper.
Q1: How well do contemporary sclera-segmentation models perform with diverse input images?
Significant performance differences were observed across the evaluated models. While many of the best performing models (RGB-SS-Eye-MS, CGANs2020CL, FCN8, ScleraU-Net2 or ScleraSegNet) achieved $F_1$ scores above 0.7 on the challenging MOBIUS dataset (mobile setting, different devices, gaze directions and environments), some of the weaker models yielded $F_1$ scores closer to 0.5. Similarly, on the newly collected SLD dataset, $F_1$ scores varied from above 0.8 for the strongest models to 0.55 for the weakest one. Nonetheless, the results suggest that given the current state of technology, it is possible to train segmentation models that generalize well across data characteristics and produce usable segmentation results even with challenging input images.
Q2: What are the most critical sources of bias?
Different characteristics were taken into account when exploring algorithmic bias with the developed segmentation models, including eye color, ethnicity, acquisition hardware and gaze direction. The largest performance differences were observed across ethnicities, where 6 out of 7 tested models exhibited a clear preference for Caucasian subjects, despite the fact that the ethnicity groups were approximately equally represented in the training data (with a slight under-representation for Caucasian subjects). The bias due to eye color was overall the lowest in our experiments. Nonetheless, all 7 models performed worst with green eyes and 6 out of the 7 models performed best with gray eyes, suggesting that eye color represents a systematic (yet limited) source of algorithmic bias in sclera segmentation models. The bias scores observed with different acquisition devices were overall higher than what was observed due to eye color in the experiments, but the ranking w.r.t. devices was not consistent across the segmentation models. While 6 out of the 7 models (all except RGB-SS-Eye-MS) performed best with images captured by the Xiaomi phone, the ranking on the other two phones was mixed, implying that, while the acquisition hardware is still a significant source of bias, various segmentation methods respond differently to the image characteristics introduced by the capturing hardware. The bias scores observed for gaze directions were comparable to the scores observed for the acquisition devices, with 6 out of 7 models exhibiting the worst performance with the straight-gaze direction, and similarly 6 out of 7 performing best for the upwards-gaze direction, again pointing to the presence of systematic bias with respect to gaze directions.

Q3: What impact do training data characteristics have on the bias exhibited by the segmentation models?
To study the impact of training data characteristics on the segmentation accuracy and ethnicity bias, two different training configurations were explored: (i) one that contained only Caucasian subjects, and (ii) another one that contained an approximately balanced number of images of Indian and Caucasian subjects. Two main observations were made. The overall segmentation performance of 6 of the 7 models improved with the larger, more representative training dataset. However, only 3 out of the 7 models also managed to reduce the performance differences between the Caucasian and Indian subjects; for 2 models the results were mixed, whereas for the last 2, the bias in fact increased with the balanced dataset. This confirms prior observations [29], [72] that balanced datasets do not automatically lead to unbiased performance, as algorithmic bias is not necessarily related only to unbalanced training data.
Q4: Can we mitigate algorithmic bias without degrading segmentation performance?
There appeared to be a consistent (albeit weak) negative correlation between segmentation accuracy and the CGD bias score across all of our bias experiments. This suggests that improving the overall segmentation performance of the models also tends to reduce their inherent bias on average. Advances in semantic segmentation can therefore be expected to also address bias and fairness issues to a certain degree.
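As an illustration of how such an accuracy-bias relationship could be checked, the sketch below computes a Pearson correlation coefficient between per-model F1 scores and per-model bias scores. The choice of Pearson correlation, the helper name `pearson_r`, and the numbers are all assumptions made for this example; the values are synthetic and are not results from the evaluation.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must be 1-D, equally long, with >= 2 values")
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic per-model values (one entry per hypothetical model).
f1_per_model = [0.50, 0.60, 0.70, 0.80]
bias_per_model = [0.40, 0.30, 0.20, 0.10]

# The synthetic data is perfectly linear, so r equals -1 up to rounding;
# real measurements would typically give a weaker correlation.
r = pearson_r(f1_per_model, bias_per_model)
```

With only a handful of models, a coefficient of this kind is a coarse indicator and is sensitive to individual outliers, so it is best read as a trend rather than a precise estimate.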
VII. CONCLUSION
In this paper, we presented the results of a group evaluation, organized to benchmark the performance and bias of sclera segmentation models under a common experimental setting. Seven research groups participated in the effort and contributed seven distinct models to the evaluation for scoring.
The results of the group evaluation suggest that contemporary models are able to deliver useful segmentation performance with diverse input images and that more accurate models consistently also achieve lower bias scores with respect to different factors. Increasing the model complexity was also observed to lead to better performance and lower bias. Given such results, recent advances in model architectures (such as transformers) may help provide better performance-bias trade-offs in the future. However, the improvement in this case may come at the cost of higher memory usage and computational intensity, which could be problematic for applications running on less-capable hardware.
As part of our future work, we plan to explore correlation-based measures for quantifying bias, applicable to groups of (machine learning) models. Such measures are expected to provide additional insights into the behavior of the models and help identify important trends and model/data characteristics affecting performance and performance differentials across subgroups of data.