Performance Evaluation of Retinal OCT Fluid Segmentation, Detection, and Generalization Over Variations of Data Sources

Retinal Optical Coherence Tomography (OCT) is a non-invasive cross-sectional scan of the eye that provides qualitative 3D visualization of the retinal anatomy. It is used to study the retinal structure and the presence of pathologies. The advent of retinal OCT has transformed ophthalmology, and it is currently paramount for the diagnosis, monitoring, and treatment of many eye diseases, including macular edema, which impairs vision severely, and glaucoma, which can cause irreversible blindness. However, the quality of OCT images can vary among device manufacturers. Deep learning methods have been successful in the medical image segmentation community, but it is not yet clear whether this level of success generalizes across images collected from different device vendors. In this study, we provide a comprehensive review of current deep learning segmentation methods applied to OCT images. Furthermore, to investigate the problem of variation in data sources across OCT device vendors, we analyse a selection of the most representative methods that address this problem, including those at the top of the RETOUCH competition such as nnUNet and its variant nnUNet_RASPP, SAM and its variant SAMedOCT, and IAUNet_SPP_CL, alongside other state-of-the-art algorithms. The algorithms were validated on the RETOUCH challenge dataset, which was acquired from three device vendors across three medical centers from patients suffering from two retinal disease types. Experimental results show that, across the tasks of segmentation, detection and generalisation on retinal images, fine-tuned large foundation models such as SAMedOCT demonstrate promising performance, but specifically designed and trained models such as nnUNet and nnUNet_RASPP still offer a slight advantage overall. In addition, nnUNet_RASPP obtained the best performance, with a mean Dice score of 82.3% for fluid segmentation.


I. INTRODUCTION
Macular edema is an eye condition that occurs when blood vessels leak into the part of the retina called the macula (the central area at the back of the eye where vision is sharpest), thereby impairing vision severely. Many eye diseases can cause this, including age-related macular degeneration (AMD) and diabetic macular edema (DME).

The associate editor coordinating the review of this manuscript and approving it for publication was Thomas Canhao Xu.
A recent study [42] indicates a rise in retinal diseases in Europe, with more than 34 million and 4 million people affected by AMD and DME, respectively. AMD is most common among older people (50 years and above). The early stage of AMD is asymptomatic and slow to progress to the late stage, which is more severe and less common. DME is the thickening of the retina caused by the accumulation of intraretinal fluid in the macula, and it is most common among diabetic patients. Currently there is no cure for these diseases, and anti-vascular endothelial growth factor (anti-VEGF) therapy is the main treatment. This requires repeated injections, which are expensive and hence a socio-economic burden on most patients and the healthcare system. Early diagnosis and active monitoring of the progress of these diseases are therefore vital, because doctors can give behavioural advice, such as a change of diet or regular exercise, that helps slow progression or, in some cases, prevents the disease from reaching a later stage. Today this is mostly done manually, which is laborious, time-intensive and prone to error. An automatic and reliable tool is therefore crucial in this process, and to exploit the qualitative features of the retinal OCT modality efficiently. Also, the presence of eye motion artifacts in OCT lowers the signal-to-noise ratio (SNR) due to speckle noise. To circumvent this problem, device manufacturers have to find a balance between achieving high SNR, image resolution and scanning time. Hence the quality of the images varies among device vendors, which creates the need for an automated tool with high performance that can generalise across images from all device vendors.
To address the above issues, in this work we investigate the most representative methods for this problem, in particular those that have appeared in the leading positions of the RETOUCH challenge [11], on the three tasks of segmentation, detection and generalisation over multiple data sources.
Our main contributions are as follows: 1) We provide a comprehensive review of representative models addressing the problem of retinal fluid segmentation and detection from OCT images, evaluating their generalization performance across variations in data sources. 2) Through the use of a blinded evaluation, we demonstrate that large foundation models, including SAM and its variants SAMed and SAMedOCT, show promising performance in the segmentation task following a routine fine-tuning process for this specific problem. 3) We illustrate that, at the current stage, specifically designed and trained deep networks such as nnUNet and its variant nnUNet_RASPP maintain a slight advantage in segmentation, detection, and generalization performance across all tasks for this particular problem.
The rest of the paper is organized as follows. A brief review of previous studies is provided in Section II. Section III presents the main methods included in this study, mostly those that have been submitted to the RETOUCH competition [11] and achieved leading positions in the performance table. The experiments, with results and visualisation, are presented in Section IV. Finally, the conclusion with our contributions is given in Section V.

II. BACKGROUND
OCT was first developed in the early 1990s but only became commercially available in 2006, after which it rapidly became popular due to its high image resolution [21].
Before the era of deep learning, research on retinal OCT image analysis had been ongoing for many years and can be grouped into various categories, including probabilistic modelling [34], [35], graph-cut [33], [63], [65], [66], Markov Random Fields [64], [75], and level set [21], [22]. Comprehensive reviews and studies of retinal OCT images conducted in the past include [14], [38], [69], [74].
Recently, deep learning models have provided enhanced solutions to these problems. UNet [60] is one such model for medical image segmentation. Its architecture has a U-shape and consists of an encoder, a bottleneck and a decoder block. It is an end-to-end framework in which the encoder is used to extract features from the input images/maps, and the decoder is used for pixel localisation. At the end of the decoder path is a classification layer that classifies each pixel into one of the segmented classes. Between the encoder and decoder paths, the bottleneck ensures a smooth transition from the encoder to the decoder. The encoder, decoder and bottleneck are each made up of a series of convolutional layers arranged in a specific order.
The DA-PSPNet, introduced in [77], is designed for the automatic layering of retinal OCT images. It utilizes a dual attention mechanism to segment seven retinal layers in retinal OCT B-scan images. The architecture comprises a ResNet [29] backbone with dilated convolution to extract shallow features, a self-attention mechanism for capturing contextual information, and residual blocks. Additionally, the authors integrated a pyramid pooling module at varying scales into the network's architecture to gather global information. Evaluation of the algorithm was conducted on the 2014_BOE_Srinivasan dataset [68], published by Duke University. Experimental results demonstrate that the proposed architecture surpasses SOTA algorithms in the segmentation of the seven OCT layers.
The RFS-Net, presented in [27], is built on a VGG-16 [67] backbone. It incorporates an atrous spatial pyramid pooling (ASPP) module with four parallel branches featuring dilation rates ranging from 1 to 7, aimed at capturing global information. The model addresses the vanishing gradient problem through the use of residual connections. Inspired by GoogLeNet [70], it employs an inception module for dimensionality reduction and includes an expanding module to recover high-level features from the preceding modules. Evaluation of the model was conducted on three datasets: RETOUCH [11], Gholami [26], and Kermany [37]. Experimental results indicate that the RFS-Net demonstrates performance comparable to SOTA algorithms.
Various deep learning methods for automatic choroidal segmentation in OCT images are outlined in [41]. The employed architectures include the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and the UNet. The CNN and RNN are categorized as patch-based networks, while the UNet [60] is classified as a semantic segmentation network. The algorithm underwent evaluation on a private dataset comprising spectral domain OCT (SD-OCT) images collected from 101 children during four different visits over an 18-month period. Experimental results demonstrate that the algorithms achieved performance comparable to SOTA methods.

31720 VOLUME 12, 2024
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Retinal fluid segmentation using volumetric deep neural networks on OCT scans is introduced in [7], employing the 3D UNet [60] architecture with five blocks in both the encoder and decoder paths. Each block comprises two convolution layers with (3 × 3 × 3) convolutions, followed by a rectified linear unit (ReLU). Batch normalization precedes the ReLU unit, and a max-pooling of (2 × 2 × 2) is applied. The algorithm underwent validation on the RETOUCH dataset, achieving a Dice score of 0.79 for intraretinal fluid (IRF), which was the most identified fluid.
An automated quantification of pathological fluids in neovascular Age-Related Macular Degeneration (nAMD) using fully connected neural networks (FCNN) is presented in [48]. The architecture encompasses an encoder path for capturing image features, a decoder path for synthesizing information from the feature maps to make pixel predictions, a classification layer for pixel classification, a residual block to address the vanishing gradient problem, and a squeeze-excite block for capturing global contextual features. The network has a depth of five, corresponding to four downsampling steps, and a kernel size of 7 × 7. The evaluation was conducted on a private dataset comprising two subsets: set 1 includes 107 SD-OCT volumes (49 B-scans, 6 × 6 mm cube examination) from patients with nAMD, acquired with the Heidelberg Spectralis device; set 2 was acquired from 42 eyes of 40 patients during routine nAMD check-ups. Experimental results indicate that the algorithm achieved an area under the curve of 0.97, 0.95, and 0.99 for the detection of intraretinal fluid (IRF), subretinal fluid (SRF), and pigment epithelial detachment (PED), respectively, along with corresponding Dice scores of 0.73, 0.67, and 0.82.
The biomarker-infused global-to-local network (Bio-Net) is presented in [81] for the automatic segmentation and visualization of the choroid in OCT through knowledge-infused deep learning. Bio-Net comprises a shadow localization and an elimination pipeline. The authors employed the UNet for generating the shadow location mask and the Deshadow-Net for eliminating shadows. The architecture of Deshadow-Net draws inspiration from the two-stage generative adversarial network (GAN) presented in [54]. Initially, the region of interest (ROI), retinal pigment epithelium layer, and choroid regions are extracted. The UNet is utilized for ROI segmentation, and the Deshadow-Net is employed to eliminate unwanted pixels or shadows. The algorithm underwent evaluation on a private dataset consisting of 30 OCT volumes from 30 subjects. Experimental results demonstrate that the model surpassed current state-of-the-art algorithms by a significant margin.
An ensemble learning approach named DelNet, utilizing a combination of Convolutional Neural Network (CNN) models based on fully convolutional networks (FCN), is introduced in [8] for the segmentation of retinal layers in OCT images. The architecture of DelNet draws inspiration from ensemble stack generalization [76] in deep learning methods. The authors aggregate outputs from a series of FCNs (base networks) into a single output for image segmentation, aiming to construct a robust model by combining knowledge from weak learners. This approach is employed to address the high variability among retinal layers. The method underwent validation on the Duke Cyst DME dataset [18], and experimental results indicate that the algorithm's performance is comparable to SOTA algorithms.
An automated approach for fluid segmentation in retinal OCT images utilizing attention mechanisms is introduced in [45]. The method comprises an encoding path and two decoding paths. The encoder is responsible for feature extraction, the first decoder extracts semantic information, and the second decoder predicts distance maps. Both the encoder and decoder paths consist of three blocks, each containing a convolutional layer with a padding of 1, a batch normalization layer, and a ReLU layer. In the encoding path, a max-pooling layer with a stride of 2 is employed for down-sampling. For image reconstruction in the decoder path, skip connections similar to those used in [44] are employed. Attention mechanisms inspired by [10] and [32] are incorporated into the network's architecture. The method underwent evaluation on the Multivendor dataset [8], comprising 500 scans from four OCT scan devices (Cirrus, Nidek, Spectralis, Topcon), with the respective number of scans being 57, 159, 53, and 231. Experimental results demonstrate that the proposed approach significantly outperformed the baseline methods.
The cascade dual-branch deep neural network for retinal layer and fluid segmentation in OCT, incorporating a relative positional map, is introduced in [47]. The method comprises two UNet architectures stacked in series. The first UNet is dedicated to segmenting the Region of Interest (ROI), defined as the area between the inner limiting membrane (ILM) and the Bruch membrane (BM). The second UNet is employed for segmenting six retinal layers and fluid within the ROI. A Random Forest classifier is used at the classification layer for pixel classification. Both the encoder and decoder paths consist of four convolutional blocks with feature sizes of 64, 128, 256, and 512. The kernel size is set to 3 × 3, and a 2 × 2 max-pooling layer with a stride of 2 is utilized. The method underwent validation on a private dataset comprising 58 3D volumes acquired using the Zeiss Cirrus 5000 HD-OCT. Each volume consists of 245 B-scans, resulting in a total of 14,210 B-scans. Experimental results demonstrate that the model's performance is comparable to SOTA algorithms.
DeepRetina, a method for layer segmentation in OCT images, is presented in [43]. The method consists of three main components: Xception65 [19], which serves as the backbone to extract and learn the characteristics of retinal layers; the ASPP module, with dilation rates of 2, 6, 12, and 18, to capture global information; and an Encoder-Decoder module [9], [17] that further enhances the segmentation map by eliminating misclassified pixels. The algorithm is evaluated on two datasets. The first is a private dataset from the Shenzhen Eye Hospital affiliated with Jinan University and the Shenzhen University School of Medicine, consisting of 280 volumes, resulting in a total of 11,200 images (40 B-scans per volume). The second is the Duke dataset, with 110 OCT B-scan images spanning 10 different patients. Experimental results demonstrate that the method's performance is superior to the baseline models.
The Deep-ResUNet++ is presented in [55] for simultaneous segmentation of layers and fluids in retinal OCT images. The approach incorporates residual connections, ASPP blocks and squeeze-and-excitation blocks into the traditional 2D UNet [60] architecture to simultaneously segment 3 retinal layers, 3 fluids and 2 background classes from 1136 B-scans from 24 patients suffering from wet AMD. The algorithm is validated on the Annotated Retinal OCT Images (AROI) dataset [51], which is publicly available.
The CoNet (Coherent Network) is presented in [56] for the simultaneous segmentation of 7 layers, 2 backgrounds and 1 fluid in retinal OCT images using the Duke DME dataset [18], obtaining a mean Dice score of 88%. The CoNet uses the standard UNet as the backbone with a reduced depth and incorporates an atrous spatial pyramid pooling (ASPP) block at the input layer to capture global contextual features. The ASPP block uses 4 parallel filters with different dilation rates.
A clinical application for the diagnosis and referral of retinal diseases is proposed in [25], in which 14,884 OCT B-scans collected from 7,621 patients are used to train a framework consisting of two main parts: the segmentation model (3D UNet [20]) and the classification model.
A combination of a Convolutional Neural Network (CNN) and a Graph Search (GS) method is presented in [24]. The framework segments nine layer boundaries in 60 retinal OCT volumes (2915 B-scans, from 20 human eyes) obtained from patients suffering from dry AMD. The CNN is used to extract the layer boundary features, while the GS is used for pixel classification.
In [57], another deep learning approach for retinal OCT segmentation is proposed, combining an FCN for segmentation with Gaussian Processes for post-processing. The method is validated on the University of Miami dataset [73], which consists of 50 volumes from 10 patients suffering from diabetic retinopathy. The approach is divided into two main steps: pixel classification using the FCN, and post-processing using Gaussian Processes.
Another CNN-based approach for the simultaneous segmentation of layers and fluid is presented in [61]. The authors presented a 2D UNet-like architecture with a reduced depth for the segmentation of 10 classes, consisting of 8 layers, 1 background and 1 fluid, from 10 patients suffering from Diabetic Macular Edema (DME). The Duke DME dataset [18] is used to validate the algorithm.
The nnUNet, a self-configuring framework, is introduced in [31]. The framework aims to eliminate the "trial and error" of manual parameter setting by using the dataset's characteristics to determine and automatically set some of the model's key parameters, such as the batch size. The framework uses the standard UNet [60] and is evaluated on 11 biomedical image segmentation challenges consisting of 23 datasets for 53 segmentation tasks.
An extended version of the nnUNet [31] is presented by McConnell et al. in [49], integrating residual, dense, and inception blocks into the network for medical image segmentation on multiple datasets. The algorithm is evaluated on eight datasets consisting of 20 target anatomical structures.
ScSE nnU-Net, another extended version of the nnUNet [31], is presented in [78] for the segmentation of head and neck cancer tumors. It extends the original nnUNet by incorporating spatial and channel squeeze-and-excitation blocks into the network's architecture. The algorithm uses nnUNet to extract features from the input images/maps and then the squeeze-and-excitation blocks to further suppress the weaker pixels. The method was validated on the HECKTOR 2020 dataset consisting of 201 cases and a test set of 53 cases.
The exploration of advanced architectural variations of the nnUNet is detailed in [50].
The MA-SAM, an adaptation of SAM for volumetric images, is introduced in [15]. Originally, SAM was developed and trained on 2D images. The authors of MA-SAM incorporate critical third-dimensional or temporal knowledge during fine-tuning to effectively adapt SAM and leverage the volumetric structure inherent in medical imaging. This adaptation involves integrating a series of 3D adapters into the 2D transformer blocks within the SAM architecture. These adapters are employed to extract essential volumetric or temporal insights necessary for medical image analysis. The algorithm's performance is evaluated on four medical image segmentation tasks across 10 public datasets, encompassing CT, MRI, and surgical video data.
The nnFormer (not-another transFormer), a combination of the transformer and nnUNet, is introduced in [83]. The authors integrated the transformer architecture into the nnUNet pipeline, capitalizing on its pre-processing and self-parametrization techniques. The network's architecture comprises an encoder, a decoder, a bottleneck in between, and skip attention blocks at every convolutional layer. The algorithm is evaluated on three public medical datasets: BraTS [52], Synapse [1], and ACDC [2]. The authors drew the following conclusions: 1) nnFormer outperformed other transformer-based architectures significantly, and 2) nnFormer and nnUNet demonstrate a high level of complementarity.
The DconnNet [80], which employs a directional connectivity modeling scheme for segmentation, was assessed on the training (subset) dataset of the RETOUCH Grand Challenge using 3-fold cross-validation, achieving a Dice score of 87.7.

A. LIMITATIONS AND BENEFITS
While deep learning methods have demonstrated enhanced performance in medical image segmentation, detection, and classification tasks, their reliance on large datasets poses a significant challenge. The limited availability of public data, driven by privacy concerns, exacerbates this issue. Additionally, the manual segmentation/annotation of datasets is labor-intensive and time-consuming. Furthermore, the RETOUCH challenge is not publicly available, and each participating team can submit a maximum of three models for evaluation, thus limiting our ability to assess other SOTA algorithms like [4], [5], [6], and [59]. Moreover, a notable challenge in the field is the lack of consistent evaluation metrics. While some authors employ metrics like Dice Similarity (DS) and Intersection over Union (IoU), others use alternative measures such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and mean unsigned error. On the positive side, a significant advantage lies in the ability to fine-tune large foundation models with limited computational resources, thereby enhancing the generalization performance of the models. Another benefit is the availability of several small annotated datasets from different anatomic regions of the body.
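To make the overlap metrics named above concrete, the following is a minimal sketch of Dice Similarity and IoU for a single binary mask pair. The function names, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; they are not taken from the RETOUCH evaluation code.

```python
def dice_score(pred, truth):
    """Dice = 2*|A and B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def iou_score(pred, truth):
    """IoU = |A and B| / |A or B| for flat binary masks (0/1 values)."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Same assumed convention for the empty case.
    return 1.0 if union == 0 else inter / union


# Toy example: one overlapping pixel, two pixels marked in each mask.
pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(dice_score(pred, truth), iou_score(pred, truth))
```

For a single mask pair the two scores are linked by Dice = 2·IoU / (1 + IoU), so they rank predictions identically; the error-based measures (MAE, RMSE) are not interchangeable with them in this way.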

III. METHODS
The RETOUCH grand challenge [11] is a competition focused on the segmentation and detection of three retinal fluids from retinal OCT images. The training dataset comprises 70 raw images with corresponding masks, while the testing dataset includes 40 raw images without their corresponding masks. To ensure fairness in comparison, the organizers employed a blinded evaluation by retaining the masks or ground truth of the testing dataset, and participants can submit their predictions via email for evaluation. In adherence to competition requirements, each submission must be accompanied by a written paper explaining the methods employed. The results of the submission are communicated to the teams via email and are also published on the organizer's website alongside the accompanying papers. The RETOUCH challenge, initially organized in conjunction with MICCAI 2017 in Quebec, Canada, featured the participation of eight teams. Subsequently, the competition transitioned to an online format, and it remains ongoing, continuing to accept submissions [3].
In this section we provide details of the methods that are in the leading positions of the RETOUCH competition, including the nnUNet, nnUNet_RASPP, SAMedOCT, and IAUNet_SPP_CL, as well as the standard UNet and other methods.

A. UNet
The UNet is an end-to-end architecture for medical image segmentation. It consists of three main parts: the encoder, the decoder, and the bottleneck between them. The encoder captures contextual information (feature extraction) and, by employing strided convolutions, halves the size of the feature map after every convolutional block as we move down the encoding path. Pixel localisation is done in the decoder. As we move up the decoder path, the size of the feature map is doubled after every convolutional block by employing transposed convolutions, and for the reconstruction process the up-sampled feature maps are concatenated with the corresponding maps from the encoder path. The bottleneck serves as a bridge linking the encoding and decoding paths. It consists of a convolutional block that ensures a smooth transition from the encoder path to the decoder path. In the encoding path, decoding path and bridge layer, each convolutional block consists of a convolutional layer that converts the pixels of the receptive field into a single value before passing it to the next operation, followed by instance normalisation to prevent over-fitting during training. This is followed by a LeakyReLU activation function to mitigate vanishing gradients. A high-level diagram illustrating the architecture of the standard UNet is shown in FIGURE 1.
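The halving and doubling of the feature map described above can be traced with simple arithmetic. The sketch below is illustrative only: the function name, the 512-pixel input side and the depth of 4 are assumed values, and it assumes padding is chosen so that only the stride-2 operations change the spatial size.

```python
def unet_feature_sizes(input_size, depth):
    """Return (encoder_sizes, decoder_sizes) as feature-map side lengths.

    encoder_sizes[0] is the input resolution and encoder_sizes[-1] is the
    bottleneck; each encoder step halves the side length (stride-2
    convolution) and each decoder step doubles it (transposed convolution).
    """
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")
    encoder = [input_size // (2 ** i) for i in range(depth + 1)]
    # Walk back up from just above the bottleneck to the input resolution.
    decoder = encoder[-2::-1]
    return encoder, decoder


enc, dec = unet_feature_sizes(512, 4)
print("encoder:", enc)  # 512 -> 256 -> 128 -> 64 -> 32 (bottleneck)
print("decoder:", dec)  # 64 -> 128 -> 256 -> 512
```

Each decoder size matches an encoder size one level up, which is what allows the up-sampled maps to be concatenated with the corresponding encoder maps.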

B. nnUNet AND nnUNet_RASPP
The nnUNet is a self-configuring, automatic pipeline for medical image segmentation that determines and chooses the best model hyper-parameters given the data and the available hardware, thus alleviating the trial and error of manual parameter setting. Given the training data, the framework extracts the "data fingerprint", such as modality, shape, and spacing, and, based on the hardware (GPU memory) constraints, determines the network topology, image re-sampling methods, and input-image patch sizes. After training is complete, the framework determines whether post-processing is needed. The framework uses the standard UNet as the network's architecture. Please refer to the original publication [31] for more information. Inspired by the success of nnUNet, we have introduced an enhanced architecture, nnUNet_RASPP, which incorporates residual connections and an ASPP block into the network's architecture to address the problem of data source variation. Residual connections were incorporated in every convolutional layer in both the encoding and decoding paths to reduce the training error rate, learn more complex features, and combat the problem of vanishing gradients. The ASPP was incorporated at the input layer of the encoding path to mitigate the problem of fluid variance. Incorporating these techniques into the standard nnUNet improves the overall performance of the network. The diagram of nnUNet_RASPP is shown in FIGURE 2.A.
The residual connection, developed in [28], is a technique used to combat the problem of vanishing gradients. The UNet architecture uses the chain rule for back-propagation during training. This process can sometimes lead to vanishing gradients, and one way to circumvent this is to introduce residual connections into the network's architecture. As indicated in [28], residual connections reduce the training error rate as the depth of the network increases. nnUNet automatically determines the depth of the network, and the introduction of residual connections into the network's architecture further reduces the training error rate and allows the network to learn complex features, thus improving the overall performance. The diagram of the residual connection is shown in FIGURE 2.B.
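A toy calculation can illustrate why the chain rule may produce vanishing gradients and how a skip path counteracts it. The sketch below assumes scalar layers with a fixed local slope, which is a deliberate simplification and not the nnUNet_RASPP blocks themselves; the function name and the slope value of 0.1 are hypothetical. For a plain layer y = f(x) the local derivative is f'(x), whereas for a residual layer y = x + f(x) it is 1 + f'(x).

```python
def chain_gradient(local_slopes, residual=False):
    """Multiply per-layer derivatives, as the chain rule does in backprop.

    local_slopes: assumed derivative f'(x) of each layer's transformation.
    residual:     if True, each layer is y = x + f(x), so its derivative
                  is 1 + f'(x) instead of f'(x).
    """
    grad = 1.0
    for slope in local_slopes:
        grad *= (1.0 + slope) if residual else slope
    return grad


slopes = [0.1] * 10  # ten layers, each with a small (hypothetical) slope
print("plain:   ", chain_gradient(slopes))
print("residual:", chain_gradient(slopes, residual=True))
```

In this toy setting the plain product shrinks geometrically with depth, while the identity term in each residual layer keeps the product away from zero.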
ASPP [16] is a technique used to capture global contextual features by applying parallel filters with different dilation rates to a given input feature map. ASPP can be used to solve problems with high variability. To the best of our knowledge, this is the first time that ASPP has been integrated into nnUNet to solve this particular problem. The diagram of the ASPP block is shown in FIGURE 2.C.
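The idea of parallel filters with different dilation rates can be sketched in one dimension. The code below is a simplified illustration under stated assumptions: it works on a 1D signal with a single shared odd-length kernel, uses zero padding, and the rates (1, 2, 3) are hypothetical; the actual ASPP block operates on 2D or 3D feature maps with learned kernels per branch.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D dilated cross-correlation with zero padding."""
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes an odd-length kernel")
    half = (len(kernel) - 1) * dilation // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def aspp_1d(signal, kernel, rates):
    """Run parallel branches, one per dilation rate, and stack the outputs."""
    return [dilated_conv1d(signal, kernel, r) for r in rates]


impulse = [0, 0, 0, 1, 0, 0, 0]
for rate, branch in zip((1, 2, 3), aspp_1d(impulse, [1.0, 1.0, 1.0], (1, 2, 3))):
    print(rate, effective_kernel_size(3, rate), branch)
```

Larger rates widen the span each branch sees without adding weights, which is how the parallel branches gather context at several scales from the same input.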

C. SAMedOCT
The SAMedOCT [12] is inspired by and adapted from the Segment Anything Model (SAM) [39]. SAM is a "foundation model" for image segmentation developed by researchers at Meta. SAM gained prominence for its ability to enable zero-shot transfer to various segmentation tasks, having been trained on over 1 billion masks from 11 million diverse images. Due to its extensive training dataset, SAM demonstrates the capability to generalize to new tasks beyond those encountered during training. SAM comprises three main components: 1) an Image Encoder built from the Vision Transformer (ViT) [23], which preprocesses high-resolution inputs and runs once per image; 2) a Prompt Encoder that embeds dense prompts (i.e., masks) using convolutions, which are then summed element-wise with the image embedding; 3) a Mask Decoder that efficiently maps the image embedding, prompt embeddings, and an output token to a mask.
Focal and Dice losses are employed during training. SAMed, a variant of SAM adapted for medical segmentation, is introduced in [82]. SAMed is derived from SAM by freezing the image encoder and adopting a low-rank-based fine-tuning strategy (LoRA) [30]. This strategy approximates the low-rank update of the parameters in the image encoder and fine-tunes the lightweight prompt encoder and the mask decoder of SAM. SAMed was evaluated on the Synapse multi-organ segmentation dataset, achieving remarkable results. Building on the success of SAMed, SAMedOCT was adapted from SAMed to address the challenges posed by the RETOUCH grand challenge.
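The low-rank update at the heart of LoRA can be sketched with small matrices. In the snippet below, a frozen weight W is combined with a trainable update B·A of rank r, so that the effective weight is W + B·A. The helper names, the 768 × 768 layer size and the rank of 4 are illustrative assumptions, not values reported for SAMed or SAMedOCT.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-`rank` update."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(w, b, a, scale=1.0):
    """Frozen weight W plus the scaled low-rank update B @ A."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


full, lora = lora_param_counts(768, 768, 4)  # hypothetical layer size and rank
print(full, lora)

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (toy 2 x 2 example)
b = [[1.0], [2.0]]             # 2 x 1, trainable
a = [[3.0, 4.0]]               # 1 x 2, trainable
print(lora_effective_weight(w, b, a))
```

Because only A and B receive gradients, the number of trainable parameters grows with the rank rather than with the full layer size, which is what makes fine-tuning a large image encoder feasible with limited computational resources.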

D. IAUNet_SPP_CL
IAUNet_SPP_CL, a combination of a graph-theoretic method, a fully convolutional neural network (FCN), and a curvature regularization loss function, is presented in [79]. The graph-theoretic method is employed in the preprocessing stage to delineate layers and regions of interest (ROI), while the FCN is utilized for fluid segmentation, employing the standard attention UNet as the backbone.
The authors enhanced the architecture by introducing spatial pyramid pooling (SPP) modules with four pooling maps at different scales in parallel, concatenating the original input after bilinear interpolation to enhance the network's capability to segment multi-scale objects. The curvature regularization loss function is applied to smooth boundaries and eliminate unnecessary holes within the predicted fluid lesions.

E. SFU
The SFU, a three-part framework based on a CNN and a Random Forest (RF), was developed in [46]. The first part of the framework pre-processes the images, the second part consists of a 2D UNet architecture for feature extraction, and the third part uses an RF classifier to classify the pixels. At the segmentation layer, axial motion between scans was corrected using cross-correlation by applying bounded variation 3D smoothing. This correction aimed to reduce the effect of speckle while preserving and enhancing the boundaries between retinal layers. To prevent overfitting during training, a dropout layer was introduced before the 1 × 1 convolutional layer. Additionally, to address data limitations, data augmentation techniques such as flipping, rotation, and zooming were applied during preprocessing.

F. UMN
The UMN, a combination of a CNN and a graph-shortest path (GSP) method, is presented in [58]. The CNN is used for the segmentation of the region of interest (ROI), and the GSP is then used for the segmentation of the layers and fluid from the ROI. B-scans were extracted from the 3D volumes for training. At the segmentation layer, the initial step involved segmenting the layers as the ROI to efficiently detect the presence of fluids. Extracting the ROI helped reduce training time, as training the network on the entire image would be more time-consuming. The GSP was employed for pixel classification, mapping each pixel in the image to one node in the graph. Only local relationships between pixels were considered, and an 8-regular graph was constructed using the 8 neighbors of each pixel.
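The pixel-to-node mapping with 8-neighbor connectivity can be sketched as follows. This is a minimal illustration, not the UMN implementation: the function names are hypothetical, edge weights are omitted, and border pixels are simply given fewer neighbors, which is an assumption about boundary handling that the description above does not specify.

```python
def eight_neighbors(row, col, height, width):
    """Return the in-bounds 8-connected neighbors of pixel (row, col)."""
    neighbors = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the pixel itself
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                neighbors.append((r, c))
    return neighbors


def build_pixel_graph(height, width):
    """Map every pixel to one node, linked to its 8-connected neighbors."""
    return {(r, c): eight_neighbors(r, c, height, width)
            for r in range(height) for c in range(width)}


graph = build_pixel_graph(3, 3)
print(len(graph[(1, 1)]), len(graph[(0, 0)]), len(graph[(0, 1)]))
```

Interior pixels have exactly 8 neighbors, so the graph is 8-regular away from the image border; a shortest-path search would then run over edge weights derived from the image, which this sketch leaves out.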

G. MABIC
The MABIC, a standard double-UNet architecture, is proposed in [36]. The method utilizes two UNet architectures connected in series, where the output of the first UNet serves as an input to the second UNet. The initial part takes raw images as input to extract the ROI. Additionally, in this initial part, dropout and maxout activation are applied at each layer to enhance accuracy and prevent overfitting. The subsequent part takes the extracted ROI and the segmentation mask as input. Importantly, there are no fully connected layers between encoding and decoding layers in the latter part.

H. RMIT
The RMIT, an approach using a combination of a deep neural network and an adversarial loss function, is presented in [72].
The authors adapted the architecture from the standard UNet by incorporating a batch normalization layer in each block of convolutions. They introduced dropout at each skip connection to prevent overfitting and incorporated an adversarial loss function to estimate the loss during training.

I. RETINAI
The RetinAI, introduced in [62], is a standard 2D UNet with residual connections. The network was trained on B-scans.
As part of the preprocessing, all the B-scans were normalized to the same resolution, and horizontal flip, shear, rotation, shift, and Gaussian noise were applied for data augmentation. Categorical cross-entropy was used as the loss function during training.
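A small sketch of paired augmentation shows one practical detail of such a pipeline: geometric transforms must be applied identically to the B-scan and its mask, whereas intensity noise is added to the image only. The function name, parameter values, and the wrap-around shift via `np.roll` are assumptions for illustration; shear and rotation are omitted for brevity.

```python
import numpy as np

def augment(image, mask, rng, max_shift=8, noise_std=0.05):
    """Apply a random horizontal flip, a horizontal shift, and Gaussian noise.

    Flip and shift are applied to both image and mask so that labels stay
    aligned with pixels; noise is added to the image only.
    """
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    shift = int(rng.integers(-max_shift, max_shift + 1))
    image = np.roll(image, shift, axis=1)  # wrap-around shift: a simplification
    mask = np.roll(mask, shift, axis=1)
    image = image + rng.normal(0.0, noise_std, size=image.shape)
    return image, mask

# Toy B-scan and mask with one labelled region.
rng = np.random.default_rng(42)
toy_mask = np.zeros((8, 16), dtype=int)
toy_mask[2:5, 3:9] = 1
toy_image = toy_mask.astype(float)
aug_image, aug_mask = augment(toy_image, toy_mask, rng)
```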

J. SVDNA
A noise adaptation approach based on singular value decomposition (SVDNA) [40] is introduced as an unsupervised technique for noise transfer in the domain adaptation of retinal OCT images. The pipeline comprises two phases. In the first phase, SVDNA is employed to generate masks, which are subsequently used to train a supervised segmentation network in the second phase. The model's performance was evaluated online, achieving a mean DS of 0.71 on the hidden test dataset. The authors did not publish the AVD scores.

VOLUME 12, 2024
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

IV. EXPERIMENTS AND PERFORMANCE EVALUATION
A. DATASET
The methods were validated on the MICCAI 2017 RETOUCH grand challenge dataset [11]. The dataset is publicly available and consists of 112 OCT volumes of patients suffering from early AMD and DME, collected with devices from 3 manufacturers (Cirrus, Spectralis and Topcon) at 3 clinical centers: Medical University of Vienna (MUV) in Austria, and Erasmus University Medical Center (ERASMUS) and Radboud University Medical Center (RUNMC) in the Netherlands. Examples of the dataset are shown in FIGURE 3. The training set consists of 70 volumes: 24, 24, and 22 acquired with Cirrus, Spectralis, and Topcon, respectively. Both the raw volumes and the annotated masks of the training set are publicly available. The testing set consists of 42 OCT volumes, 14 per device vendor. The raw volumes of the testing set are publicly available, but the corresponding annotated masks are held by the organizers of the challenge. Submission and evaluation of predictions on the testing dataset are arranged privately with the organizers, and the results are sent to the participants.
Manual annotation was done by 6 expert graders from 2 medical centers: MUV (4 graders supervised by an ophthalmology resident), and RUNMC (2 graders supervised by a retinal specialist). The dataset is annotated with 4 classes: 1 background class labelled as 0 and 3 fluid classes, namely Intraretinal Fluid (IRF) labelled as 1, Subretinal Fluid (SRF) labelled as 2 and Pigment Epithelium Detachments (PED) labelled as 3. The RETOUCH dataset is particularly interesting because of its high level of variability. It was collected using devices from multiple vendors, the sizes and number of B-scans vary per device vendor, and it was collected and annotated in multiple clinical centers. Also, for fair comparison, the annotated testing set is held by the organizers and submissions are capped at a maximum of 3 per participating team.

B. TRAINING AND TESTING
Training was done on the 70 OCT volumes of the training set (both raw and mask volumes). The estimated probabilities and predicted segmentation of the testing set (42 raw volumes) were submitted to the challenge organizers for blinded evaluation against the ground truth masks.
Also, to further evaluate the robustness and generalisability of the methods, the algorithm was trained on OCT volumes from two vendor devices and tested on the third. In this case, OCT volumes from the third vendor device were not seen during training. For this experiment, two sets of weights were generated: (1) training on 46 OCT volumes from Spectralis (24 OCT volumes) and Topcon (22 OCT volumes) and evaluating on 14 OCT volumes from the Cirrus testing set, and (2) training on 48 OCT volumes from Cirrus (24 OCT volumes) and Spectralis (24 OCT volumes) and evaluating on 14 OCT volumes from the Topcon testing set. Again, the same environment settings were used to conduct all the experiments.
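The bookkeeping of this cross-vendor protocol can be checked with a few lines of Python. The volume counts come from the dataset description; the dictionary and function names are illustrative only.

```python
# Volumes per vendor in the RETOUCH training and testing sets.
TRAIN_VOLUMES = {"Cirrus": 24, "Spectralis": 24, "Topcon": 22}
TEST_VOLUMES = {"Cirrus": 14, "Spectralis": 14, "Topcon": 14}

def leave_one_vendor_out(held_out):
    """Return (training vendors, number of training volumes, number of test volumes)
    when `held_out` is excluded from training and used only for testing."""
    train_vendors = [v for v in TRAIN_VOLUMES if v != held_out]
    n_train = sum(TRAIN_VOLUMES[v] for v in train_vendors)
    return train_vendors, n_train, TEST_VOLUMES[held_out]

# The two configurations reported in the text.
for vendor in ("Cirrus", "Topcon"):
    print(vendor, leave_one_vendor_out(vendor))
```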
In the detection task, the estimated probability of the presence of each fluid type is plotted using the receiver operating characteristic (ROC) curve. The area under the curve (AUC), which measures the ability of a binary classifier to distinguish between classes, is used as the evaluation metric. The AUC gives a score between 0 and 1, with 1 being the perfect score and 0 the worst. For the segmentation task, two evaluation metrics are used to measure the performance of the algorithms: 1) The Dice Score (DS) [13], [53], [71], which is twice the size of the intersection divided by the sum of the sizes of the two sets. It measures the overlap of the pixels in the range from 0 to 1, with 1 being the perfect score and 0 the worst. 2) The Absolute Volume Difference (AVD) [71], which is the absolute difference between the predicted volume and the ground truth volume. The value ranges from 0 to 1, with 0 being the best result and 1 the worst. The equation to calculate the DS is shown in Eqn (1) and that for the AVD in Eqn (2):
$$DS = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{1}$$
$$AVD = \big|\,|X| - |Y|\,\big| \tag{2}$$
where X is the predicted segmentation, Y is the ground truth (mask), ∩ is the intersection, and | | denotes the size of a set or, for the outer bars in Eqn (2), the absolute value.
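A minimal NumPy sketch of the two segmentation metrics follows. It computes the Dice Score as 2|X∩Y| / (|X| + |Y|) and the AVD as the absolute difference in labelled volume. The function names, the `voxel_volume` parameter, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the challenge's own evaluation code and any normalisation it applies are not reproduced here.

```python
import numpy as np

def dice_score(pred, truth, label):
    """Dice Score for one class: 2|X ∩ Y| / (|X| + |Y|)."""
    p = pred == label
    t = truth == label
    denom = int(p.sum()) + int(t.sum())
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * int(np.logical_and(p, t).sum()) / denom

def absolute_volume_difference(pred, truth, label, voxel_volume=1.0):
    """Absolute difference between predicted and true volume for one class.

    `voxel_volume` (assumed parameter) converts voxel counts to physical volume.
    """
    n_pred = int((pred == label).sum())
    n_true = int((truth == label).sum())
    return abs(n_pred - n_true) * voxel_volume

# Toy 4 x 4 example with classes 1 (IRF) and 2 (SRF).
truth = np.zeros((4, 4), dtype=int)
pred = np.zeros((4, 4), dtype=int)
truth[0:2, 0:2] = 1   # 4 true IRF pixels
pred[0:2, 1:3] = 1    # 4 predicted IRF pixels, 2 overlapping
truth[3, :] = 2       # 4 true SRF pixels
pred[3, 0:2] = 2      # 2 predicted SRF pixels, all overlapping

print(dice_score(pred, truth, 1), absolute_volume_difference(pred, truth, 1))
print(dice_score(pred, truth, 2), absolute_volume_difference(pred, truth, 2))
```

The toy example shows why the two metrics complement each other: class 1 has a zero volume difference but only half of its pixels overlap, while class 2 overlaps wherever it is predicted yet misses half the true volume.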
C. RESULTS
In this section, we report the performance for the detection task, measured by the Area Under the Curve (AUC), and for the segmentation task, measured by the Dice Score (DS) and Absolute Volume Difference (AVD), for nnUNet_RASPP and the baseline nnUNet. We also compare our results to the current state-of-the-art (SOTA) architectures. The segmentation performance grouped by segment classes per algorithm measured in DS is illustrated in Table 2 with the corresponding diagram in FIGURE 4, and that measured in AVD is illustrated in Table 3 [...] IRF (0.019 and 0.021 compared to 0.042) and SRF (0.017 and 0.016 compared to 0.020) AVD scores than SAMedOCT.
5) Apart from the IRF class, nnUNet_RASPP has the best DS in every class when compared to the other models/teams.
6) IAUNet_SPP_CL and nnUNet_RASPP jointly achieve the second-best mean AVD score of 0.036.
7) We observed that, overall, the CNN/DNN models exhibit slightly better performance than the foundation model (SAMedOCT [12]). We believe this is because SAMedOCT is constructed with ViT as a backbone, and ViTs are more data-hungry than CNNs due to their ability to model long-range dependencies, as explained in [23].
A detailed breakdown of the DS and AVD per vendor device, trained on the entire 70 volumes and tested on the held-out 42 cases of the testing set, is shown in Table 5 with the corresponding diagrams in FIGURE 5. We noticed the following:
1) nnUNet_RASPP outperformed the baseline nnUNet and the state-of-the-art models on two (Cirrus and Spectralis) of the 3 devices in both DS and AVD. The nnUNet_RASPP model came in second place on the third device (Topcon), with a marginal difference from the baseline model, nnUNet.
2) nnUNet_RASPP and nnUNet were the only two algorithms to maintain a consistently high level of performance and generalisability across all classes and data sources in both DS and AVD. Both models consistently occupied the top 2 spots in performance per segment class and vendor device.
Table 6, with its corresponding diagrams in FIGURE 6, shows the results measured in DS and AVD when trained on 2 vendor devices from the training set and tested on the third device from the held-out testing set. In this case, because submissions of predicted segmentations on the testing set are capped at a maximum of 3 per team, results for nnUNet are unavailable. Here we noticed that:
1) nnUNet_RASPP outperformed the current SOTA architecture, scoring a mean DS of 0.86 (10% higher than the second best) on the Cirrus device and 0.81 (6% higher than the second best) on the Topcon device;
2) nnUNet_RASPP also obtained the best AVD scores, scoring a mean of 0.0114 and 0.0878 on the Cirrus and Topcon devices respectively;
3) nnUNet_RASPP still maintained its high level of robustness and generalisability, with a consistently high level of performance measured in DS and AVD.
The detection performance grouped by segment classes per algorithm measured by the AUC is illustrated in Table 4 with the corresponding diagram in FIGURE 7. Here the nnUNet obtained a perfect AUC score of 1 for all three fluid classes, and nnUNet_RASPP obtained AUC scores of 0.93, 0.97, and 1.0 for the IRF, SRF, and PED respectively.
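To make the detection scores concrete, the AUC can be computed directly from per-volume fluid-presence probabilities as the fraction of positive/negative pairs that are ranked correctly, with ties counted as one half. The sketch below uses made-up probabilities rather than values from the tables, and the function name is an assumption.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    `labels` are 1 where the fluid is present and 0 otherwise;
    `scores` are the estimated probabilities of presence.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float(greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical probabilities for four volumes.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
imperfect = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(perfect, imperfect)
```

An AUC of 1 therefore means every volume containing the fluid received a higher probability than every volume without it.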
The visualizations using orange arrows to highlight the fine details captured by nnUNet_RASPP when trained on

V. CONCLUSION
In this work, we have investigated the problems of detection and segmentation of multiple fluids in retinal OCT volumes acquired from multiple device vendors using SOTA deep learning methods. We have included the most representative methods that appeared in the leading positions of the RETOUCH competition, and evaluated their performance based on the hidden test results, where the testing datasets are not available to the competition participants. The RETOUCH datasets are among the largest for the underlying problems.
They demonstrate a high level of variability because the data were collected from devices of various vendors.
The key findings of the work include the following:
1) Firstly, it provides a comprehensive review of the representative models that address the problem of retinal fluid segmentation and detection from OCT images, and of generalisation performance over variations of data source.
2) Secondly, through a blinded evaluation where the ground truth on the testing datasets is withheld by the competition organisers, it is demonstrated that large foundation models, including SAM and its variants SAMed and SAMedOCT, exhibit promising performance in the segmentation task through a routine fine-tuning process.
3) Thirdly, at the current stage, the specifically designed and trained deep networks such as nnUNet and nnUNet_RASPP still offer a slight advantage in performance on all tasks of segmentation, detection and generalisation. We believe nnUNet_RASPP's slight outperformance in this particular problem is achieved by incorporating residual blocks to learn complex features and reduce the training error rate, and an ASPP to capture global information. Moreover, it has an inherently simple architecture with fewer model parameters than the foundation models and therefore offers better real-time performance.
The methods included in this study provide useful information for further diagnosis and for monitoring the progression of retinal diseases such as AMD, DME and glaucoma. In the future, we aim to leverage the availability of various small public medical image datasets to assess the performance of these methods on a more heterogeneous dataset (a combination of datasets originating from different modalities and anatomical regions).
by McConnell et al. They investigated eight advanced variants of the nnUNet, evaluating the methods across eight datasets encompassing 20 anatomical structures. The architectural variations include residual, dense, inception, and attention gates, resulting in six novel nnUNet variations: Residual-nnUNet, Dense-nnUNet, Inception-nnUNet, Spatial-Single-Attention-nnUNet, Spatial-Multi-Attention-nnUNet, and Channel-Spatial-Attention-nnUNet. Their findings indicate that no single architecture universally performs best for all medical image segmentation problems. Therefore, the selection of a network's architecture should be tailored to the specific problem at hand. Additionally, the study concluded that the Standard-nnUNet and Baseline-nnUNet are optimal for problems or datasets featuring a singular anatomical region, while the other architectures excel in scenarios involving spatially imbalanced datasets or problems with multiple anatomical regions.
The Segment Anything Model (SAM) is introduced in [39] by Kirillov et al., drawing inspiration from the concept of zero-shot and few-shot generalization in natural language processing (NLP). Developed by researchers at Meta, SAM is positioned as a foundation model for image segmentation. It is trained on an extensive dataset consisting of 1 billion masks and 11 million images, utilizing the Vision Transformer (ViT) architecture [23]. Researchers and developers have the flexibility to fine-tune the SAM model and leverage the pre-trained model for specific segmentation tasks during training.

FIGURE 1. An illustration of the standard UNet architecture used in nnUNet.

TABLE 1. A summary of previous work including the authors, year, segmentation, dataset and evaluation metrics: Dice Score (DS), Mean Average Error (MAE), Mean Average Difference (MAD), Receiver Operating Characteristic curve (ROC), Root Mean Squared Error (RMSE), and Intersection over Union (IoU).

FIGURE 2. A high-level illustration of the nnUNet_RASPP architecture with B, a residual connection block to address the vanishing gradient problem, where X is an input and F(X) is a function of X; and C, an ASPP block of multiple parallel filters at different dilation rates or frequencies to capture global information.

FIGURE 3. B-scan examples of raw images (column 1) and their corresponding annotated masks (column 2) of OCT volumes taken from the 3 device vendors (rows): Cirrus, Spectralis and Topcon. The classes are coloured as follows: black for the background, blue for the Intraretinal Fluid (IRF), yellow for the Subretinal Fluid (SRF) and red for the Pigment Epithelium Detachments (PED).

FIGURE 4. Segmentation performance comparison by DS on the right and AVD on the left of the nnUNet_RASPP and baseline nnUNet, together with the SOTA algorithms, grouped by the segment classes when trained on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set.

FIGURE 5. Segmentation performance comparison by DS on the right and AVD on the left of the nnUNet_RASPP and baseline nnUNet, together with the SOTA algorithms, grouped by the segment classes when trained on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set per device.

FIGURE 6. Segmentation performance comparison by DS on the right and AVD on the left of the nnUNet_RASPP, together with the SOTA algorithms, grouped by the segment classes when trained on 2 vendor devices from the training set and tested on the third device from the held-out testing set.

FIGURE 7. Detection performance comparison by AUC of the nnUNet_RASPP and baseline nnUNet, together with the state-of-the-art algorithms, grouped by the segment classes when trained on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set.

FIGURE 8. Examples of B-scans illustrating the predicted output of nnUNet_RASPP, in order of the raw inputs, annotated masks and predicted outputs in columns, when trained on the training set of two vendor devices and tested on the training set of the third vendor device (Cirrus and Topcon in row 1 and row 2 respectively). Fine details captured by the model are indicated with orange arrows.

FIGURE 9. An example of a B-scan illustrating the predicted output of nnUNet_RASPP, in order of the raw input, annotated mask and predicted output, when zoomed out to highlight the fine details captured by the model using orange arrows. This is captured when trained on the training set of the Spectralis and Topcon devices and tested on the training set of the Cirrus device.

TABLE 2. Segmentation table of the Dice Scores (DS) by segment classes (columns) and teams (rows) for training on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set.

TABLE 3. Segmentation table of the Absolute Volume Difference (AVD) by segment classes (columns) and teams (rows) for training on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set.

TABLE 4. Detection table of the Area Under the Curve (AUC) by segment classes (columns) and teams (rows) for training on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set.

TABLE 5. Segmentation table of the Dice Score (DS) and Absolute Volume Difference (AVD) by segment classes (columns) and teams (rows) for training on the entire 70 OCT volumes of the training set and tested on the held-out 42 OCT volumes from the testing set per device.

TABLE 6. Generalisation table of the DS and AVD by segment classes (columns) and teams (rows) trained on 46 or 48 OCT volumes from 2 device sources and evaluated on 14 OCT volumes from the testing set of the third device, which was not seen during training.