Dual-Branch U-Net Architecture for Retinal Lesions Segmentation on Fundus Image

Deep learning has found widespread application in diabetic retinopathy (DR) screening, primarily for lesion detection. However, this approach encounters challenges such as information loss due to convolutional operations, shape uncertainty, and the high similarity between different lesion types. These factors collectively hinder the accurate segmentation of lesions. In this paper, we introduce a novel dual-branch U-Net architecture, referred to as Dual-Branch U-Net (DB U-Net), tailored to small-scale lesion segmentation. The approach has two branches: one employs a U-Net to capture the shared characteristics of lesions, while the other uses a modified U-Net, termed U2Net, in which two decoders share a common encoder. U2Net generates probability maps for lesion segmentation and for the corresponding boundary segmentation. DB U-Net fuses the outputs of the U2Net and U-Net branches, concatenating their segmentation maps to produce the final result. To mitigate data imbalance, we employ the Dice loss as the loss function. We evaluate our approach on the publicly available DDR, IDRiD, and E-Ophtha datasets. DB U-Net achieves AUPR values of 0.5254 and 0.7297 for microaneurysm and soft exudate segmentation, respectively, on the IDRiD dataset. These results outperform other models, highlighting the potential clinical utility of our method in identifying retinal lesions from fundus images.


I. INTRODUCTION
Diabetes federations predict that the number of people with diabetes will rise from 463 million to 700 million over the next 25 years if sufficient measures are not taken to combat the spread of diabetes [1]. As a common chronic complication of diabetes, DR remains one of the top five causes of irreversible blindness in adults [2]. In clinical practice, a fundus image is a projection of the fundus captured by a monocular camera onto a 2D plane. A fundus image shows anatomical structures such as the optic disc (OD), macula, fovea, and blood vessels, as well as lesions related to DR: microaneurysms (MAs), hemorrhages (HEs), hard exudates (EXs), and soft exudates (SEs). MAs are the first visible signs of DR; they are small swellings in the tiny blood vessels of the retina that appear as tiny, round, red spots. HEs form due to blood leakage and appear as small red dots or spots. EXs and SEs are bright objects [3], as illustrated in Figure 1. Fundus images can be acquired noninvasively and economically, making them well suited to large-scale screening, and retinal lesions can be visualized on them [4]. Thus, ocular screening with fundus images is important in the diagnosis of DR, so that treatment can be administered in time to prevent vision loss.
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeswari Sundararajan.
Lesions in retinal fundus images are characterized by their shape, texture, and location, which are the main indicators for evaluating the evolution of the disease [5]. Quantitative analysis of fundus images is important, and visual analysis also plays an important role in disease diagnosis during screening. However, manual analysis is laborious, and the diagnosis of early diabetic retinopathy relies on automated lesion segmentation. The automated lesion detection procedure nevertheless faces certain constraints.
1) Retinal fundus images suffer from noise, uneven illumination, and variable low contrast, and they contain confounding structures such as retinal vessels, drusen, and the optic disc. In addition, lesions occupy a far smaller area than the background, which makes them challenging to detect.
2) The shape, texture, and color of lesions make them difficult to distinguish: MAs and HEs are almost identical in shape and color and both appear as red dots in the fundus image, while SEs and EXs both appear as bright spots.
Many studies [6], [7], [8] have misclassified lesions, and it is difficult to establish appropriate classes for lesion detection from retinal fundus images. Deep learning-based methods, especially fully convolutional neural networks, have achieved great success in medical image segmentation. Fully convolutional networks (FCNs) [9] and U-Net [10] have played an important role in segmentation, particularly of medical images. Building on the success of U-Net, we implemented a novel approach to improve retinal lesion segmentation from color fundus images. U-Net contains an encoder and a symmetric decoder. The encoder extracts features, and its feature maps are connected to the decoder. A stack of a convolutional layer, batch normalization, a rectified linear unit (ReLU) layer, and a subsequent max-pooling layer is the basic unit for extracting features for retinal lesion segmentation. As the number of layers increases, lesion information is extracted as high-level semantic information, while part of the low-level semantics carries information on color, texture, and shape; such information is not the main target of segmentation, yet it plays an important role in segmenting tiny lesions. Takikawa et al. [11] propose a two-stream CNN architecture for semantic segmentation that explicitly handles shape information in a separate processing branch, which runs in parallel to the classical stream. Furthermore, Zhou et al. propose a bilateral-branch network (BBN) [12], each branch of which performs its own duty separately. The feature maps from the base network are extracted to perform different segmentation tasks simultaneously, and this approach may work well on other similar tasks.
Furthermore, the proposals of John et al. [13], [14] underscore the advantages of auxiliary information and of a dual-branch architecture in acquiring contextual features, which improves lesion segmentation performance. We propose an end-to-end dual-branch U-Net lesion segmentation framework (DB U-Net) whose two branches are a U-Net and a modified U-Net, respectively. The output of the U-Net is supervised by the red/bright lesion ground truth (GT) to learn the features common to red/bright lesions. The modified U-Net, named U2Net, has two decoders: one biased toward lesion segmentation and one biased toward lesion boundaries. To preserve low-level information that might be lost during down-sampling, U2Net incorporates boundary information through the extra decoder. The outputs of the two branches are combined with the image patch to create feature maps, which are fed into the fusion module as the final step of DB U-Net. The novelty of this paper lies in its comprehensive approach to the challenges of retinal lesion segmentation in diabetic retinopathy: it combines deep learning techniques with architectural modifications to improve the accuracy of automatic lesion detection in retinal fundus images. Our main contributions are as follows.
1) We address the challenges inherent in automatic lesion detection in retinal fundus images, including image quality, small lesion sizes, and the similar appearance of different lesions. We apply deep learning techniques, specifically fully convolutional networks (FCNs) and U-Net, to overcome these challenges and to enhance the accuracy of retinal lesion segmentation.
2) We present architectural modifications to improve segmentation performance, including a two-stream CNN architecture that explicitly considers shape information and a dual-branch U-Net architecture (DB U-Net) featuring two distinct branches. Additionally, we propose a supervised learning approach in which one output is guided by the ground truth (GT) of specific lesion groups (red/bright), which facilitates learning the features common to these lesions.
3) We incorporate boundary information through an extra decoder in U2Net to preserve low-level details during down-sampling, potentially enhancing segmentation accuracy. In addition, DB U-Net ends with a fusion module that combines feature maps from the two branches with the image patches, which aims to further improve the model's ability to capture essential features for segmentation.
4) We evaluate on three publicly accessible datasets: IDRiD [15], E-Ophtha [16], and DDR [17]. Through ablation studies, we analyze the impact of various design choices on lesion segmentation performance. Comparative assessments reveal that DB U-Net achieves competitive results in comparison to state-of-the-art segmentation models such as U-Net and DeepLab v3+ [18]. This stands as a significant contribution, as our method consistently outperforms alternative approaches across diverse databases.
This paper is organized as follows. Section II discusses related work. Section III presents the architecture of the proposed model. Section IV describes the datasets and the experimental configuration. Section V analyzes the quantitative results and experimental performance. Finally, we present conclusions in Section VI.

II. RELATED WORK
Table 1 provides a review of previous research and a comparative analysis of existing work, followed by a detailed discussion of these earlier contributions.
Table 1 emphasizes the variety of datasets, lesion types, neural network architectures, and distinctive features employed across studies on detecting and segmenting retinal lesions. Each study employs distinct techniques to enhance model accuracy, illustrating the continuous progress in this domain. The following paragraphs describe each method in turn, outlining its contributions and limitations.
Haloi et al. [19] present a novel method for early diabetic retinopathy screening, focusing on MA detection in color fundus images. They employ deep neural networks with dropout training and max-out activation functions, eliminating the need for preprocessing and manual feature extraction. Although the authors claim substantial improvements, quantitative evidence is lacking. The method achieves state-of-the-art accuracy on benchmark datasets. Still, it faces limitations, including data diversity, interpretability, computational demands, false positives/negatives, dataset-dependent performance, lack of detailed comparisons, data balance, and robustness to noise, highlighting the need for further research and refinement.
Chudzik et al. [20] present an innovative approach to automated MA detection in fundus images, a crucial component of diabetic retinopathy screening. They employ a patch-based fully convolutional neural network with batch normalization layers and use the Dice loss function, simplifying the process to just three processing stages, in contrast to methods requiring up to five. Notably, the paper demonstrates successful knowledge transfer between datasets within the MA detection domain, potentially enhancing adaptability across different data sources. The method's evaluation on popular datasets, including E-Ophtha, DIARETDB1, and ROC, shows robust performance, surpassing state-of-the-art methods on the FROC metric. It achieves high sensitivities at low false positive rates, enhancing its promise for diabetic retinopathy screening. However, the study does not explicitly address potential limitations such as further diversity evaluation, model interpretability, handling of class imbalance, and comprehensive comparisons with existing methods, factors essential for assessing its real-world relevance.
Kou et al. [21] present the deep recurrent U-Net (DRU-Net), a deep learning approach for MA segmentation in diabetic retinopathy diagnosis. MAs are critical indicators, but manual annotation is cumbersome, prompting the need for automation. DRU-Net combines U-Net, deep residual models, and recurrent convolutions to enhance feature accumulation, addressing the challenges of low contrast and small MAs. It achieves impressive results on the E-Ophtha and IDRiD datasets, notably an accuracy of 0.9999 and an AUC of 0.9943 on E-Ophtha, and an AUC of 0.987 on IDRiD. It outperforms U-Net, FCNN, and ResU-Net, establishing itself as a state-of-the-art MA segmentation method. However, potential limitations include computational demands, generalizability, interpretability, and data augmentation, which should be considered for practical use beyond specific datasets.
Sarhan et al. [22] present a two-stage deep learning approach for MA segmentation in diabetic retinopathy detection, underscoring the importance of deep learning in fundus image analysis. MAs are vital markers of diabetic retinopathy progression. Their method uses multiple input scales, so that features are considered at various resolutions, which is crucial for accommodating variations in MA size. Additionally, selective sampling enhances computational efficiency by focusing the model's attention on key image regions. A triplet embedding loss is introduced to enhance the discriminative power of the classification model, resulting in a 30.29% relative improvement over fully convolutional neural networks (FCNs) in MA segmentation. However, the study lacks discussion of potential limitations, including computational complexity, generalizability, interpretability, and dataset diversity. These considerations are vital for assessing the method's practical applicability beyond its reported performance.
Theelen et al. [23] present a method aimed at improving CNN training for medical image analysis, with a focus on hemorrhage detection in color fundus images. They address the challenge of time-consuming CNN training by introducing selective sampling, dynamically choosing misclassified negative samples to prioritize informative data during the learning process. Their results show a substantial reduction in training time while maintaining or improving performance, achieving AUC values of 0.894 and 0.972 on two datasets and demonstrating potential for model generalization. However, limitations include the method's application specificity, questions about generalizability to diverse data sources, lack of detail on heuristic sampling criteria, and the need for clinical validation in real healthcare settings. These considerations are crucial when assessing the method's practical applicability beyond its promising yet specialized performance.
Zheng et al. [24] present a deep learning approach for detecting retinal exudates, an early indicator of diabetic retinopathy (DR). They tackle challenges in applying deep convolutional neural networks (DCNNs) by introducing an ensemble convolutional neural network (MU-net) to cope with limited labeled data and by adopting conditional generative adversarial networks (cGANs) to mitigate severe class imbalance. This strategy enhances model robustness and generalization across diverse datasets and clinical scenarios. The method demonstrates significant performance improvements, reflected in higher F1 scores at the lesion level and increased accuracy at the image level compared to non-cGAN approaches. However, limitations such as computational resource requirements, dataset diversity, and model interpretability must be considered when assessing its practical applicability beyond benchmark datasets.
Yan et al. [25] propose an innovative method for segmenting small lesions in high-resolution retinal images. They acknowledge the limitations of downsampling and patch-based methods and introduce mutually local-global U-nets to balance local and global context. While their method shows promise, quantitative comparisons with existing techniques are lacking, and they plan to collect more data for future research. They also suggest the model's potential for broader applications beyond retinal lesion segmentation, although concrete evidence is missing. Further validation and exploration are needed.
Guo et al. [26] make a significant contribution to DR and diabetic macular edema diagnosis by developing the L-Seg multi-lesion segmentation model. This model addresses challenges in diagnosing these conditions by simultaneously segmenting four types of lesions in fundus images. L-Seg is notable for being the first small object segmentation network capable of concurrently handling soft exudates, hard exudates, microaneurysms, and hemorrhages. The method incorporates a multi-scale feature fusion technique to enhance its performance. It introduces a multi-channel bin loss to address class imbalance and loss imbalance during training. Extensive evaluations on various datasets show L-Seg's superiority over other deep learning models and traditional methods, particularly in small lesion segmentation. Its limitations highlight the need for further research to ensure the model's applicability and robustness beyond the evaluated datasets and challenges.
Wang et al. [27] contribute to diabetic retinopathy diagnosis by addressing the challenging task of multiple lesion segmentation. Their work introduces a scale-aware attention (SAA) block designed to handle variations in lesion scale. Through extensive experimentation, they establish the superiority of the SAA block over existing attention mechanisms, achieving state-of-the-art results in the domain. However, the study falls short in terms of clinical validation and comprehensive comparisons with existing methods. Additionally, it does not consider computational resource requirements and scalability. These limitations highlight the necessity for further research and real-world validation to ascertain the practical applicability of the approach.
Liu et al. [28] introduce a dual-branch network designed to segment hard exudates in color fundus images. These exudates vary significantly in size, and class imbalance complicates their segmentation. The dual-branch network employs two branches with partially shared weights, allowing it to learn features and classifiers for hard exudates of different sizes. During training, they use a novel dual-sampling modulated Dice loss, prioritizing the segmentation of large exudates before addressing smaller ones. Their experimental evaluations, conducted on publicly available datasets for hard exudate segmentation, demonstrate the superior performance of the dual-branch network compared to existing methods that use both the class-balanced cross-entropy (CBCE) loss and the Dice loss. This suggests that their network architecture and loss function significantly enhance segmentation accuracy. However, certain limitations should be acknowledged. The study primarily focuses on showing the effectiveness of the dual-branch network and does not thoroughly explore limitations related to its clinical applicability. Additionally, the comparison with existing methods is somewhat limited, warranting further research for a more comprehensive evaluation of the network's real-world potential and drawbacks.
The comparative analysis of these methods highlights their innovative approaches to detecting and segmenting retinal lesions. Nevertheless, each method comes with its own set of limitations. Although these approaches show promise, they frequently lack thorough evaluations and do not fully account for real-world challenges such as diverse datasets, computational requirements, generalization, and interpretability. In light of these observations, we now introduce our proposed approach, which aims to address the limitations identified in the existing methods.

III. PROPOSED METHOD
The proposed method comprises several components, and the overall model is shown in Figure 2. Each component is elaborated below.

A. PROPOSED ARCHITECTURE
This section describes the DB U-Net model, shown in Figure 2. The network consists of two branches and a fusion module. The first branch is a U-Net and the second branch is U2Net. U2Net is a multi-task learning network that exploits boundary information: it segments the lesions and also predicts their boundaries. The fusion module then combines the branch feature maps with the image patches to produce a precise segmentation. During training, the U2Net branch predicts the target lesion and the corresponding boundary, while the U-Net branch is supervised by red and bright lesion labels to learn the common characteristics of similar lesions. The following subsections elaborate the sub-network architectures.

B. U-NET ARCHITECTURE
The main objective is to detect as many lesions as possible, and the U-Net branch learns the features common to red and bright lesions for this purpose. The U-Net is composed of an encoder and a decoder built from convolutional layers instead of fully connected layers, and it converts input images into binary maps. It takes an image patch $X \in \mathbb{R}^{H \times W \times 3}$ as input and outputs the segmentation of the red and bright lesions, denoted by $\hat{Y}_2$. Following LinkNet [29], a residual architecture can improve the final performance, so we replace the convolution block of the typical U-Net with a residual block [30]. Additionally, we halve the number of convolution kernels relative to the typical architecture to improve the trade-off between speed and accuracy. The residual block in the downsampling path is defined by Equation (1), where $X$ and $X_{out}$ are the input and output of the block. BN(·) denotes the batch normalization layer, ReLU(·) the rectified linear unit layer, and Conv(·) a convolutional layer; $\mathrm{Conv}_{1 \times 1}(\cdot)$ is an identity mapping function and Pooling(·) is a max-pooling function.
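The symbol definitions that accompany Equation (1) suggest how the residual block is composed. As an illustration only, one composition consistent with those definitions is sketched below; the number of convolutions in the main path and the position of the pooling step are assumptions, not details confirmed by the text.

```latex
% Assumed form of the residual block in the downsampling path (illustrative only)
X_{out} = \mathrm{Pooling}\Big( \mathrm{ReLU}\big( \mathrm{BN}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(X))))) + \mathrm{Conv}_{1 \times 1}(X) \big) \Big)
```

In this reading, the $1 \times 1$ convolution sits on the shortcut path so that its output can be added to the output of the main path before pooling.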

C. U2NET ARCHITECTURE
The proposed U2Net architecture is shown in Figure 3.
A deep neural network acts as a black box that extracts features from the input. The down-sampling path relies on an increasing number of layers to extract information from exudates, and detail such as color, shape, and the border is gradually phased out as depth grows [31]. In this work, we designed a modified U-Net model, named U2Net, to overcome this loss of information, as shown in Figure 3. The two decoders in U2Net share the features of a common encoder and predict the lesion and its boundary, respectively, as shown in Figure 2. The boundary information is introduced as an auxiliary task through the additional decoder to avoid the loss of low-level information and thus obtain a good segmentation of the exudates.
Like the U-Net branch, U2Net is built on U-Net with the convolutional blocks replaced by residual blocks and the number of convolution kernels per convolutional layer reduced by half. The green block in Figure 2 shows the input $X \in \mathbb{R}^{H \times W \times 3}$ and the two outputs of U2Net: the lesion segmentation map $\hat{Y}_1$ and the corresponding boundary segmentation map $\hat{Y}_b$.

D. ARCHITECTURE OF FUSION MODULE
The segmentation maps of the red/bright lesions and of the target lesion are obtained from U-Net and U2Net, respectively. The fusion module takes the image patch and these segmentation maps as input to produce the final segmented lesion. First, the red/bright lesion segmentation $\hat{Y}_2$ is concatenated with the image patch $X$ along the channel dimension to form a new feature map. This feature map is then processed by Atrous Spatial Pyramid Pooling (ASPP) [32], which contains multiple atrous convolutions with different sampling rates to capture image context at multiple scales. The ASPP output $\hat{Y}_3$ is supervised by the ground truth of the target lesion.
At this point we have two segmentations, $\hat{Y}_1$ and $\hat{Y}_3$. We assume that the regions where $\hat{Y}_1$ and $\hat{Y}_3$ differ deserve particular attention. To highlight these regions, we subtract $\hat{Y}_1$ from $\hat{Y}_3$ element-wise and take the absolute value, which encourages the fusion module to focus on the difference between the two maps. We concatenate the aforementioned segmentation maps and denote the result as $X_{concat} \in \mathbb{R}^{H \times W \times 6}$. A subsequent ASPP and a $1 \times 1$ convolution layer transform $X_{concat}$ into the final segmentation $\hat{Y} \in \mathbb{R}^{H \times W \times 1}$. The algorithm for the fusion module is presented in Algorithm 1.
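The data flow of the fusion module described above can be sketched in NumPy as follows. The sketch only tracks how the maps are stacked: `aspp_placeholder` is a hypothetical stand-in for the ASPP and $1 \times 1$ convolution layers (a real ASPP applies parallel atrous convolutions), and including the image patch in the second concatenation, which yields six channels, is an assumption.

```python
import numpy as np


def aspp_placeholder(features):
    """Hypothetical stand-in for ASPP + 1x1 conv: H x W x C -> H x W x 1.

    It averages the channels and applies a sigmoid so the output looks like
    a probability map. It is NOT an atrous spatial pyramid pooling layer.
    """
    mean = features.mean(axis=-1, keepdims=True)
    return 1.0 / (1.0 + np.exp(-mean))


def fuse(x, y1_hat, y2_hat, head_a=aspp_placeholder, head_b=aspp_placeholder):
    """Sketch of the fusion data flow for one patch.

    x      : H x W x 3 image patch
    y1_hat : H x W x 1 lesion map from the U2Net branch
    y2_hat : H x W x 1 red/bright lesion map from the U-Net branch
    """
    # Stack the image patch and the red/bright lesion map (4 channels).
    stage1 = np.concatenate([x, y2_hat], axis=-1)
    # First head produces the intermediate segmentation Y3.
    y3_hat = head_a(stage1)
    # Highlight the regions where the two segmentations disagree.
    diff = np.abs(y3_hat - y1_hat)
    # Assumed composition of X_concat: image, Y1, Y3 and |Y3 - Y1| (6 channels).
    x_concat = np.concatenate([x, y1_hat, y3_hat, diff], axis=-1)
    # Second head produces the final single-channel segmentation.
    y_final = head_b(x_concat)
    return y3_hat, x_concat, y_final
```

For a 320 × 320 patch, `x_concat` has shape `(320, 320, 6)` and the final map has shape `(320, 320, 1)` under these assumptions.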

E. LOSS FUNCTION
Algorithm 1 The Fusion Module Algorithm
Input: Image patch $X$, lesion segmentation map $\hat{Y}_1$ from U2Net, segmentation map of red/bright lesion $\hat{Y}_2$ from U-Net

During the training process, we simultaneously train the U-Net and U2Net sub-networks together with the fusion module, supervising both the segmentation and the boundary map predictions. The total loss function is given by Equation 3, where $L$ represents the total loss and $L_{Fusion}(Y, \hat{Y})$ is the loss of the fusion module, which measures the discrepancy between the ground truth $Y$ and the fused prediction $\hat{Y}$. Equation 3 combines the three loss components into a single loss that guides the training of the U-Net and U2Net sub-networks with the fusion module. The objective during training is to minimize the total loss $L$, which improves the accuracy of the segmentation and boundary map predictions.
All modules employ the Dice loss [33], which measures the extent of overlap between the ground truth (GT) and the segmented output and is computed from the Dice coefficient. This approach removes the need to fine-tune the balance between foreground and background elements in the data.
Given the inherent imbalance between lesion pixels and background pixels, adopting the Dice loss as the primary loss function is a strategic choice. In the Dice loss, $p_l(x)$ is the probability assigned to pixel $x$ of belonging to class $l$, and $g_l(x)$ is the ground truth label, which equals one for the correct class and zero for all other classes. This formulation addresses the challenge posed by unbalanced training data. Consequently, there is no need to introduce weighting parameters between classes, such as the background and the lesions, during training. This makes the loss function particularly suitable for binary segmentation tasks.
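To make the role of the Dice loss concrete, the following NumPy sketch implements one common soft Dice formulation for a single binary map and sums it over the supervised outputs. The squared-term denominator, the smoothing constant `eps`, and the unweighted sum in `total_loss` are assumptions chosen for illustration; the exact variant and weighting used in this work may differ.

```python
import numpy as np


def dice_loss(p, g, eps=1e-7):
    """Soft Dice loss for one binary map (assumed variant).

    p : predicted probabilities in [0, 1]
    g : ground-truth labels in {0, 1}
    Returns 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2)), smoothed by eps.
    """
    p = np.asarray(p, dtype=np.float64).ravel()
    g = np.asarray(g, dtype=np.float64).ravel()
    intersection = np.sum(p * g)
    denominator = np.sum(p * p) + np.sum(g * g)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def total_loss(y1_hat, y1, yb_hat, yb, y2_hat, y2, y_hat, y):
    """Assumed unweighted sum of the three loss components."""
    # U2Net branch: target lesion map plus boundary map.
    loss_u2net = dice_loss(y1_hat, y1) + dice_loss(yb_hat, yb)
    # U-Net branch: red/bright lesion map.
    loss_unet = dice_loss(y2_hat, y2)
    # Fusion module: final segmentation.
    loss_fusion = dice_loss(y_hat, y)
    return loss_unet + loss_u2net + loss_fusion
```

Because the loss is a ratio of overlap to total mass, a prediction that misses a single-pixel lesion is penalized almost as heavily as one that misses a large lesion, which is why no explicit class weights are needed.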

IV. EXPERIMENT

A. DATASET
The Indian Diabetic Retinopathy Image Dataset (IDRiD) [15] [...] Each image was divided into 24 patches of 320 × 320 pixels for training. For the segmentation of the other lesions, the cropped image was downscaled by a factor of 4 (t = 4) and divided into 6 patches of 320 × 320 pixels. DDR [17], proposed in 2019, is the largest dataset for DR screening, containing 13,673 images obtained from 147 hospitals covering 23 provinces in China. These images were captured using 42 types of fundus cameras with a 45-degree field of view and range in resolution from 1088 × 1920 to 3456 × 5184. For lesion segmentation, DDR provides 757 fundus images with pixel-level annotation: 383 images for training, 149 for validation, and 225 for testing. We scaled all fundus images by a factor of 4 (t = 4), and 320 × 320 patches were uniformly cropped from these images for training.
E-Ophtha [16] is a publicly available dataset that consists of two parts: E-Ophtha EX and E-Ophtha MA. In our experiments, only E-Ophtha MA is used. It consists of 148 images with MAs or small HEs and 233 healthy images, with resolutions ranging from 1440 × 960 to 2544 × 1696 pixels, and provides pixel-level annotations for MA segmentation. We used 100 images for training and the remaining 48 images for testing. For lesion segmentation, we scaled these images to 1360 × 2048 and then cropped them to 1280 × 1920. Each cropped image was divided into 24 patches of 320 × 320 pixels.
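The patch extraction described for E-Ophtha can be sketched as follows: a 1280 × 1920 image tiles exactly into 4 × 6 = 24 non-overlapping patches of 320 × 320 pixels. The helper names are hypothetical, and the use of a center crop is an assumption, since the text does not state where the crop window is placed.

```python
import numpy as np


def center_crop(image, out_h, out_w):
    """Crop the central out_h x out_w window (assumed crop position)."""
    h, w = image.shape[:2]
    if out_h > h or out_w > w:
        raise ValueError("crop size exceeds image size")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]


def split_into_patches(image, patch=320):
    """Tile an image into non-overlapping patch x patch blocks, row by row."""
    h, w = image.shape[:2]
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append(image[top:top + patch, left:left + patch])
    return patches


# Example: a scaled 1360 x 2048 image, cropped and tiled as in the text.
scaled = np.zeros((1360, 2048, 3), dtype=np.uint8)
cropped = center_crop(scaled, 1280, 1920)
patches = split_into_patches(cropped, patch=320)
```

The same ground-truth mask would be cropped and tiled with identical coordinates so that each image patch stays aligned with its annotation.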

B. PREPROCESSING
The original fundus images pose serious challenges such as uneven illumination, highly variable contrast, and background noise from the acquisition devices. To tackle these issues, we developed an image enhancement method inspired by [5] and [34]. We apply histogram equalization and contrast-limited adaptive histogram equalization (CLAHE) [35] to the brightness channel of the fundus image in LAB color space. Concretely, a Gaussian filter is first used to remove the noise caused by the device and zoom. Next, we transform the color space of the images from RGB to LAB. Histogram equalization redistributes the pixel intensities using the information of the whole image to improve contrast, but it also amplifies background noise. CLAHE limits this noise amplification and retains detail by limiting the contrast. After pre-processing, the original fundus images are split into multiple image patches of uniform resolution, which are then used for data augmentation.
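As a minimal illustration of the equalization step, the sketch below applies global histogram equalization to a single 8-bit brightness channel with NumPy. It covers only the global variant; the Gaussian filtering, the RGB-to-LAB conversion, and CLAHE (which additionally works on local tiles and clips the histogram) are not implemented here, and the function name is hypothetical.

```python
import numpy as np


def equalize_histogram(channel):
    """Global histogram equalization of one uint8 channel.

    Maps intensities through the normalized cumulative histogram so that
    the occupied range is stretched to [0, 255].
    """
    channel = np.asarray(channel, dtype=np.uint8)
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    # Cumulative count at the smallest intensity that actually occurs.
    cdf_min = cdf[cdf > 0][0]
    total = channel.size
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return channel.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    # Apply the lookup table to every pixel.
    return lut[channel]
```

Because the mapping is computed from the whole image, small intensity fluctuations in flat background regions are stretched along with the lesions, which is the noise amplification that the contrast limit in CLAHE is meant to restrain.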

C. DATA AUGMENTATION
Data augmentation improves model robustness and accuracy by artificially enlarging the available training data. We use two kinds of data augmentation: geometric and lighting-based. The former includes vertical flip, horizontal flip, and rotation, which are applied to both the input image and its corresponding ground truth segmentation. The latter adjusts the input brightness by gamma correction and applies only to the input image. Before training, several transformations are randomly combined and applied to each input image patch. We used publicly available databases to validate the proposed method; each provides images with ground truth or pixel-level annotations, which we used to validate the method and compare its effectiveness.
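The pairing rule described above (geometric transforms on both image and mask, brightness changes on the image only) can be sketched as follows. The flip probabilities, the restriction to rotations by multiples of 90 degrees, and the gamma range are assumptions made for the example, not settings reported in this work.

```python
import numpy as np


def augment(image, mask, rng, gamma_range=(0.7, 1.5)):
    """Randomly augment one square patch and its ground-truth mask.

    image : H x W x 3 float array with values in [0, 1]
    mask  : H x W array of labels
    rng   : numpy.random.Generator
    """
    # Geometric transforms are applied identically to image and mask.
    if rng.random() < 0.5:  # vertical flip
        image, mask = image[::-1], mask[::-1]
    if rng.random() < 0.5:  # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))  # rotate by k * 90 degrees (assumed angles)
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    # Gamma correction changes brightness and touches the image only.
    gamma = rng.uniform(gamma_range[0], gamma_range[1])
    image = np.clip(image, 0.0, 1.0) ** gamma
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```

A call such as `augment(patch, label, np.random.default_rng(0))` returns a new pair in which every lesion pixel of the mask still lies on the corresponding pixel of the image.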

D. EXPERIMENT DETAIL
The experimental environment is a single Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz and an NVIDIA GeForce GTX 1080Ti with 16G video memory. The proposed model is implemented in Python 3.5 and PyTorch 1.12. Each model was trained for 1,000 epochs. We used the Adam optimizer with a batch size of 8, a maximum of 1,000 iterations, a momentum of 0.9, a weight decay of 5e-4, and an initial learning rate of 1e-3. In the comparative experiments, we employed ResNet-101 as the backbone of DeepLab v3+ and adopted COCO-pretrained weights as the initial weights. The learning rate was initialized to 1e-4 to fine-tune DeepLab v3+.

E. EVALUATION METRICS
The performance of the model for lesion segmentation was evaluated by the area under the precision-recall curve (AUPR) [36], as AUPR was the evaluation metric of the ISBI IDRiD challenge [15] in 2018. The precision-recall (PR) curve is a plot of the PPV (y-axis) against the Sen (x-axis) for different probability thresholds, which in our experiments are set to 33 equally spaced values from 0 to 1. AUPR is recommended for imbalanced binary classification tasks, where the area under the ROC curve (AUC) may provide an excessively optimistic view of the performance [36], [37]. Other evaluation metrics are defined in Table 2 [24].
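As an illustration of this metric, the sketch below computes a PR curve at 33 equally spaced thresholds and integrates it with the trapezoidal rule. The function names, the integration rule, and the convention that precision equals 1 when no pixel is predicted positive are assumptions; the challenge's official evaluation code may differ in these details.

```python
import numpy as np

def pr_curve(prob, truth, n_thresholds=33):
    """Precision (PPV) and recall (Sen) at equally spaced probability thresholds."""
    prob = np.asarray(prob, dtype=float).ravel()
    truth = np.asarray(truth).ravel().astype(bool)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    precision, recall = [], []
    for t in thresholds:
        pred = prob >= t
        tp = np.count_nonzero(pred & truth)
        fp = np.count_nonzero(pred & ~truth)
        fn = np.count_nonzero(~pred & truth)
        # Assumed convention: precision is 1 when nothing is predicted positive.
        precision.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
        recall.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return np.array(precision), np.array(recall)

def aupr(prob, truth, n_thresholds=33):
    """Area under the PR curve, integrated over recall with the trapezoidal rule."""
    precision, recall = pr_curve(prob, truth, n_thresholds)
    # Recall never increases as the threshold grows, so reversing the arrays
    # orders the points by non-decreasing recall.
    r, p = recall[::-1], precision[::-1]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))

score = aupr([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
```

In practice `prob` would be the flattened probability map of one lesion type and `truth` the corresponding binary annotation.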

V. EXPERIMENT RESULT
In this section, we present the results of our proposal on three public datasets. We first provide an ablation study to show the effectiveness of each component. We then compare our method with a broad range of other models and show its qualitative results. AUPR was used as the main evaluation metric, as in the 2018 ISBI grand challenge. An illustration of lesion segmentation results on a fundus image from IDRiD is shown in Figure 4.

A. ABLATION STUDIES
Tables 3 to 5 report the impact of different modules on the results on the IDRiD, DDR, and E-Ophtha MA datasets, respectively.
1) The external decoder was employed in our ablation studies. We used the conventional U-Net as a baseline model and took the output of U2Net as a segmentation result to assess lesion segmentation performance. Compared to the U-Net, the external decoder, which is specifically designed to learn boundary features, yielded a significant improvement in AUPR for all lesions except SEs. This observation underscores the valuable contribution of auxiliary information in enhancing the network's performance, particularly for small-scale lesion areas.
2) Incorporating the dual-branch architecture into our research framework, we labeled the final segmentation output as DB U-Net to evaluate its performance.
In Table 3, it is evident that DB U-Net yielded the highest AUPR value of all the methods under consideration. The segmentation results of DB U-Net, as illustrated in Table 4, effectively distinguished similar abnormal regions while maintaining a high sensitivity to lesions. This architecture, featuring dual branches and a fusion module, improves the balance between detection and classification. However, when examining the results in Table 4, we noticed that U2Net achieved the highest AUPR, whereas DB U-Net outperformed the others in F1 score. This indicates that our model performs best at a specific threshold value. It is also important to note that DB U-Net exhibited some instability under changes in illumination, resolution, or lesion size. This instability may arise from the branch responsible for red/bright lesion segmentation, which introduces information about similar lesions. Furthermore, at low resolution, the fusion module may struggle to differentiate the target lesion in the fused feature map.
3) The Fusion module plays a pivotal role in improving the performance of DB U-Net, our proposed method, and yields better network performance than the alternative methods. It merges information from the different branches of the network, enabling a more holistic understanding of the data. By integrating this module into the architecture, DB U-Net can extract and use valuable information from multiple sources, leading to more accurate and robust disease detection. In short, the Fusion module serves as a hub that combines the strengths of the model's components, and its role in information fusion is instrumental in the overall success of the DB U-Net model.

B. COMPARATIVE ANALYSIS: PERFORMANCE ON THE PUBLIC E_OPHTHA_EX DATASET
Our evaluation on the publicly available e_ophtha_EX dataset shows that our proposed method outperforms other state-of-the-art techniques in several respects, as shown in Table 6. Specifically, our method improves sensitivity (by 3.06% to 9.72%), precision (by 1.06% to 5.61%), and F1-score (by 2.02% to 8.08%) compared to recent studies. Our method is competitive with the leading method by Zheng et al. regarding specificity and accuracy, but shows a slight gap in sensitivity, precision, and F1-score.
Our method thus performs strongly on critical metrics, making it a strong contender in medical image analysis applications, albeit with some nuanced differences compared to the top-performing alternative. We also conducted a computation time analysis: our proposed method achieved a computation time of 0.141 seconds, whereas no other method in the study reported its timing.

C. COMPARATIVE ANALYSIS WITH TOP 10 IDRID LESION SEGMENTATION TEAMS
In this section, we employed AUPR (area under the precision-recall curve) as the evaluation metric, in line with the criteria used in the IDRiD challenge. The IDRiD challenge, hosted by the IEEE International Symposium on Biomedical Imaging (ISBI) conference, focuses on analyzing fundus images. To gauge the effectiveness of our method, we conducted a comparative analysis against the top 10 teams participating in the lesion segmentation competition of the IDRiD challenge. As illustrated in Table 7, our proposed approach secured the top position in microaneurysm (MA) segmentation, ranked second in hemorrhage (HE) segmentation, and achieved first place in both hard exudate and soft exudate segmentation. It is worth noting that the top-performing teams in the challenge adopted different network architectures for each specific segmentation task. Additionally, they faced the complexity of fine-tuning numerous hyperparameters during the training phase. Consequently, these high-performing teams were obliged to test four distinct models for each corresponding segmentation task during the evaluation phase. In contrast, our study adopted a single unified network architecture, requiring only minor adjustments to the hyperparameter settings. Despite this streamlined approach, our proposed method achieves results that are on par with those of the top-performing teams.

D. OVERALL COMPARATIVE ANALYSIS
In Table 8 and Table 9, we compare against published state-of-the-art methods on the IDRiD, DDR, and E-Ophtha (MA) datasets.
In IDRiD, we compare the AUPR scores of our framework with those of other published deep learning methods: L-Seg [26], Local-Global U-Net [25], Multi-scale Net [22], and DeepLab v3+. L-Seg is an end-to-end multi-lesion segmentation model with a multi-scale feature fusion method; it proposes a novel multi-channel bin loss to handle both class-imbalance and loss-imbalance problems. The current state-of-the-art methods in Table 8 report various evaluation metrics (e.g., AUC, AUPR, F1). The IDRiD grand challenge provided a great opportunity to compare our performance using standardized metrics. From Table 8, we observe that DB U-Net achieved the best performance of 0.5254 and 0.7297 on MA and SE segmentation and ranked No. 3 and No. 2 on HE and EX segmentation, respectively. As shown in Table 10 and Table 9, DB U-Net achieved the highest AUPR value on MA, HE, and EX segmentation on DDR and E-ophtha MA. The improvement in fundus lesion segmentation shows that the method can effectively handle both data imbalance and lesion segmentation against a complex background.

VI. DISCUSSION
For lesion segmentation, several modified architectures, as well as other effective methods, have been employed. In our work, we explore whether the dual-branch architecture and auxiliary information improve the performance of our proposal in terms of precision and sensitivity. Some limitations remain.
Our experiments show that the aforementioned methods lead to a highly effective architecture that significantly boosts performance on lesion segmentation, especially for scattered and small objects. However, our proposal fails to distinguish target lesions from the background or from other lesions that belong to the same group (see Figure 4). This indicates that our work falls short in capturing the global context. Introducing multi-scale information may alleviate this issue, as in [25], where a segmentation framework integrates the decoder parts of a global-level U-Net and a patch-level one.
As mentioned before, MAs are the earliest clinical signs of DR, but the ratio of object pixels to background pixels is only approximately 0.10% on IDRiD. In the DDR and E-ophtha datasets, the ratio is 0.02% and 0.01%, respectively. Because the PR curve rather than the ROC curve is used to evaluate models, misclassified pixels have an enormous impact on the AUPR value, even though they are of no consequence to DR screening in medical practice. This naturally raises the question of the lack of standardized evaluation metrics.
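To make the sensitivity of precision to class imbalance concrete, the sketch below derives the precision implied by a given sensitivity and false positive rate at the MA pixel ratios quoted above. The function name `precision_at` and the operating point (sensitivity 0.8, false positive rate 0.1%) are hypothetical values chosen for illustration, and the quoted object-to-background ratios are treated as approximate prevalences.

```python
def precision_at(prevalence, sensitivity, false_positive_rate):
    """Precision (PPV) implied by class prevalence, sensitivity, and FPR."""
    tp = prevalence * sensitivity                    # expected true positive fraction
    fp = (1.0 - prevalence) * false_positive_rate    # expected false positive fraction
    return tp / (tp + fp) if tp + fp > 0 else 0.0

# MA pixel ratios quoted in the text, used here as approximate prevalences.
for name, ratio in [("IDRiD", 0.0010), ("DDR", 0.0002), ("E-ophtha", 0.0001)]:
    ppv = precision_at(ratio, sensitivity=0.8, false_positive_rate=0.001)
    print(f"{name}: precision = {ppv:.3f}")
```

At a fixed operating point on the ROC curve, precision falls as the lesion ratio shrinks, which is why a small number of misclassified background pixels can dominate the AUPR on these datasets.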

VII. CONCLUSION
As diabetic retinopathy is a chronic eye disease, timely treatment is of great significance for patients. Computer-aided diagnostic technology based on deep learning plays an important role in disease screening. In this paper, we propose a network with a dual-branch architecture to improve the segmentation of scattered and small lesions in fundus images. We introduce edge information and a parallel architecture to address the issue of segmenting lesions of various sizes. We evaluated our work on public datasets and obtained competitive performance, which demonstrates the effectiveness of our proposed network. However, the fusion module is constrained by the quality of the segmentation maps from the branches. We found that optimizing the feature space through auxiliary information helps the model focus on small regions. Furthermore, the dual-branch structure makes it possible to compare the predictions of the two branches and to regard the areas where they differ as easily confused areas. The purpose of the dual branches is to mine the complementary information of features and obtain a better feature representation to improve the final segmentation performance.
Many problems in lesion segmentation remain to be solved. More research is necessary to further explore the practical application of automatic diabetic retinopathy diagnosis systems. For example, labels are more difficult to obtain for medical tasks. Patient privacy, the need for professional labeling, and other factors limit the scale of the datasets. Under these constraints, weakly supervised learning, semi-supervised learning, few-shot learning, and even zero-shot learning can reduce the dependence of the model on data. How to obtain a better and more robust model with less data is a promising research direction.

FIGURE 1. Illustration of the retinal image highlighting normal structures (optic disc) and abnormalities associated with DR in different colors: MAs, HEs, SEs, and EXs.

FIGURE 2. An overview of the proposed segmentation framework, highlighting the three modules in different colors. The two branches of DB U-Net are composed of U-Net and U2Net, respectively. The fusion module takes the segmentation maps from the two branches as input and outputs the final segmentation result, denoted as Ŷ.

FIGURE 3. An overview of U2Net. Dual decoders share the same latent representation from the encoder. Two upsampling paths take care of lesion and boundary segmentation.
$L_{U\text{-}Net}(Y_b, \hat{Y}_b)$ is the loss function specific to the U-Net sub-network, which measures the error between the ground truth segmentation $Y_b$ and the predicted segmentation $\hat{Y}_b$. $L_{U2Net}(Y_2, \hat{Y}_2)$ is the loss function specific to the U2Net sub-network, which measures the error between the ground truth segmentation $Y_2$ and the predicted segmentation $\hat{Y}_2$.
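Since this work employs the Dice loss to mitigate data imbalance, each of the branch losses above can be illustrated with a minimal soft Dice loss in NumPy. The function name `dice_loss` and the smoothing constant `eps` are assumptions, and the way the two branch losses are weighted and summed is not shown because it is not specified in this passage.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a probability map and a binary ground truth map."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    # eps (an assumed smoothing term) avoids division by zero on empty maps.
    dice = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dice

ground_truth = np.array([[1, 0], [1, 0]])
prediction = np.array([[0.9, 0.1], [0.8, 0.2]])
loss = dice_loss(prediction, ground_truth)
```

The loss depends only on the overlap between prediction and annotation, so the large number of correctly classified background pixels does not dilute the gradient from small lesions.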

FIGURE 4. Figure (a) displays the original color fundus image. Figure (b) provides a visual representation of segmentation maps, presenting the outcomes of lesion segmentation on the IDRiD dataset. In this representation, microaneurysms (MAs) are denoted in blue, hard exudates (EXs) in green, hemorrhages (HEs) in red, and soft exudates (SEs) in yellow. This illustration highlights the results of segmenting four distinct types of retinal lesions using our proposed deep learning model, organized from left to right: MAs, HEs, SEs, and EXs within fundus images. The uppermost row of images represents the ground truth (GT) reference. The second, third, and fourth rows correspond to the lesion segmentation achieved by the first stage (U-Net), the second stage (U2Net), and the third stage (referred to as Fusion, employing the DB U-Net model), respectively. The grayscale value assigned to each pixel reflects the probability of the presence of a lesion.

The AUPR scores of the compared methods are summarized in Table 8.
We can observe that Local-Global U-Net performed well on HE and EX segmentation. The Local-Global U-Net is an efficient network that combines local details and the global context by integrating the decoder parts of a global-level U-Net and a patch-level one. It achieved AUPR values of 0.711 and 0.889, higher than those of the other models. Similarly, Multi-scale Net introduces multi-scale information by using multi-scale input with an embedding triplet loss. The triplet loss minimizes the distance between lesion patches while increasing the distance between a lesion patch and a healthy one. Multi-scale Net reports an AUPR value of 0.4196. DeepLab v3+ [18] extends DeepLabv3 [44] by adding an effective decoder module and applies depthwise separable convolution to both the ASPP and decoder modules. In our experiments, it takes the ResNet-101 model as a backbone network. DeepLab v3+ shows poor performance in HE and SE segmentation on IDRiD. From Table 10, we observe that there is no obvious gap compared with other models on DDR. Its performance on different datasets may be limited by the scale of the training data. As mentioned before, DDR is the largest dataset and has enough data for training, while IDRiD provides less data for HE and SE segmentation.
Table 10 offers a comparative analysis of various segmentation methods using AUPR values on the DDR dataset. Overall, U-Net and ResUNet deliver moderate performance, with ResUNet excelling in HE and SE segmentation. DenseUNet stands out with strong MA, HE, and SE segmentation results, while UNet++ shows superior performance in EX and SE segmentation. Att-UNet exhibits consistent but not outstanding results across all metrics, while PSPNet falls behind with lower AUPR values. DeepLab v3+ also registers relatively low AUPR values, indicating suboptimal performance on the DDR dataset. L-Seg is referenced but lacks specific AUPR values. In contrast, EfficientNet-B0+SAA displays impressive results, especially in EX and SE segmentation. Dual PSPNet+DSM lacks a specific AUPR value in Table 10, making its performance unclear. The proposed method emerges as a strong contender, particularly excelling in HE segmentation, albeit with the highest parameter count among all methods.
IFLYTEK-MIG and VRT are teams from the IDRiD competition. They achieved AUPR values of 0.5017 and 0.4951 and ranked No. 1 and No. 2 on the MA segmentation task of the competition, respectively. IFLYTEK-MIG proposed a cascaded CNN-based approach with U-Net containing three stages: a coarse segmentation model, a cascade classifier, and a fine segmentation model. VRT modified U-Net so that the upsampling layers have the same number of feature maps as the layers they are concatenated with, and adjusted the number of downsampling layers according to the type of lesion.

TABLE 1. Summary of several studies on lesion detection/segmentation.

TABLE 3. Performance of various models on IDRiD dataset (best AUPR values are shown in bold).

TABLE 4. Performance of various models on DDR dataset.

TABLE 5. Performance of various models on E-Ophtha (MA) dataset.

TABLE 6. Evaluation of exudate detection on e_ophtha_EX dataset.

TABLE 7. Comparative analysis with top 10 IDRiD lesion segmentation teams.
L-Seg ranked No. 3 on SE and No. 4 on HE segmentation. In Table 8 and Table 10, L-Seg is the only end-to-end unified framework that generates multi-lesion segmentation results, and it shows competitive performance compared with DeepLab v3+, U2Net, and DB U-Net on the DDR dataset.

TABLE 8. AUPR values of other published methods on IDRiD dataset. The results of iFLYTEK-MIG* and VRT* are taken from the leaderboard of the IDRiD Challenge.

TABLE 9. Performance of other published methods on E-Ophtha-MA dataset.

TABLE 10. AUPR values of other published methods on DDR dataset.