Retinal Vessel Segmentation Using Deep Learning: A Review

This paper presents a comprehensive review of retinal blood vessel segmentation based on deep learning. The geometric characteristics of retinal vessels reflect the health status of patients and help to diagnose diseases such as diabetes and hypertension. Accurate diagnosis and timely treatment of these diseases can prevent blindness. Recently, deep learning algorithms have been rapidly applied to retinal vessel segmentation because of their higher efficiency and accuracy compared with manual segmentation and other computer-aided diagnosis techniques. In this work, we review recent publications on retinal vessel segmentation based on deep learning. We survey the proposed methods, especially their network architectures, and identify trends in model design. We summarize the obstacles and key aspects of applying deep learning to retinal vessel segmentation and indicate future research directions. This article will help researchers to construct more advanced and robust models.


I. INTRODUCTION
The retina contains the only deep microvascular system that can be observed non-invasively, through fundus imaging. The retinal vessel map contains abundant geometric characteristics, such as vessel diameter, branch angles, and branch lengths. These geometric characteristics reflect clinical and pathological features, which are used to diagnose hypertension, diabetes, and atherosclerosis [1]-[4]. Ophthalmologists examine retinal blood vessels to diagnose diseases related to vascular lesions, including diabetic retinopathy (DR) and diabetic maculopathy (MD). These are leading causes of blindness worldwide. Retinal image assessment has therefore become an indispensable step in the identification of retinal pathology.
A retinal fundus image shows the structures of the retina, such as the retinal blood vessel tree, optic disk (OD), fovea, macula, and abnormal structures, as shown in Figure 1. The retinal blood vessel tree is composed of the central retinal artery, vein, and their branches. Abnormalities may include microaneurysms (MAs), haemorrhages, exudates and cotton wool spots [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojie Ju.
Precise identification and diagnosis of eye abnormalities and their timely treatment are vital in preventing blindness. Initially, trained experts would manually segment the retinal blood vessels, but that was an expensive, tedious and time-consuming process [7]. Moreover, the complexity of the images causes inconsistency between the vessel maps segmented by different experts [8], owing to the low contrast between vessels and background, uneven illumination, various abnormalities and variation in vessel width and shape. These facts have inspired the development of automatic retinal vessel segmentation with minimal human intervention.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Several supervised and unsupervised methods have been developed and used to automate the segmentation of retinal vessels. Earlier, unsupervised methods were the most common approach for automatically segmenting the retinal vessels; they do not rely on any annotation for segmentation [9], [10]. These methods are roughly divided into matched filtering [11]-[13], vascular tracing based segmentation [14]-[16] and model-based segmentation methods [17]. Unsupervised methods show some defects in their performance because they cannot benefit from the hand-labelled ground truth. Unlike unsupervised methods, supervised models are trained using annotations and can benefit from the ground truth. Supervised models conduct retinal vessel segmentation in two stages: feature extraction and pixel classification. Features can be further divided into handcrafted features and automatically learned features. In traditional machine learning, the process of feature extraction from fundus images is manual, and some typical classifiers are adopted, such as the k-nearest neighbour classifier (KNN) [18] and the support vector machine (SVM) [19]. Manual feature selection can leverage domain knowledge well, but it lacks generalization ability since it is application-specific and cannot learn new features automatically [20].
Deep learning, especially convolutional neural networks (CNNs), has gained much attention for image analysis [21], [22]. Deep learning methods learn features automatically from massive data with less human intervention. They have better generalization ability and recognition capability because they can learn patterns at different levels automatically and are not limited to a specific application. In 2012, Krizhevsky, et al. [23] proposed AlexNet for image recognition. For image segmentation and identification, VGGNet [24] and GoogleNet [25] were introduced. Long, et al. [26] proposed fully convolutional networks (FCN) for semantic image segmentation, which made dense predictions in a sliding window fashion and thus sped up the segmentation.
Several review articles on retinal blood vessel segmentation have been published [10], [27]-[30]. However, Mookiah, et al. [10], Khan, et al. [28], Badar, et al. [29] and Li, et al. [30] did not focus on deep learning methods for vessel segmentation, whereas the techniques discussed in Soomro, et al. [27] were published several years ago. Therefore, in this review article, we discuss publications from the past six years on retinal blood vessel segmentation based on deep learning.
All the papers were retrieved by conducting iterative and exhaustive searches in the IEEE Xplore, Springer Link and ScienceDirect databases.
This article is organized as follows. In section II, we discuss deep learning and convolutional neural networks. In section III, we introduce the datasets used for retinal vessel segmentation and performance evaluation metrics for proposed models. In section IV, we analyze the existing models for retinal segmentation based on deep learning. In section V, we discuss retinal vessel segmentation according to the analysis of existing models. Section VI concludes the article and points out future research directions.
In this section, we discuss the most widely used CNN architectures for computer vision tasks.
[…]tion layers, such as 13, 16 and 19. Finally, VGG19, with 19 weight layers, achieved top results in the ImageNet challenge of 2014. Szegedy, et al. [25] introduced GoogleNet, which contains 22 layers and adopts the Inception module [45].

1) CNN ARCHITECTURE COMPONENTS
CNN architectures are composed of hierarchically structured layers with optimized parameters. This section describes the main components of a CNN.

a: CONVOLUTIONAL LAYER
The convolutional layer is the main layer in CNNs; it extracts features from input data or feature maps. It contains several stacked convolution kernels that perform convolution operations. In the convolution operation (see Figure 4), a convolution kernel slides from left to right and from top to bottom, and at each position it is multiplied elementwise with a specific region of the input or feature map and the products are summed to produce a single value; this is known as feature extraction. The specific region is called the receptive field. All such regions share the same kernel, which is known as weight sharing; it reduces the complexity of the model and makes the training process easier. Mathematically, the feature map z generated by a convolution kernel can be expressed as:
$$z = W * x + b$$
where x is the input image, W is the convolution kernel, * denotes the convolution operation, and b is the bias of the convolutional layer.
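As an illustration, the sliding-window operation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation of any reviewed method: it assumes a single channel, stride 1 and no padding, and the function name `conv2d` and the sample values are hypothetical. As in most deep learning frameworks, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
def conv2d(x, w, b=0.0):
    """Single-channel 2D convolution with stride 1 and no padding.

    x: input image as a list of rows, w: kernel as a list of rows, b: scalar bias.
    The kernel is not flipped (cross-correlation), as is usual in CNNs.
    """
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    z = []
    for i in range(out_h):        # slide from top to bottom
        row = []
        for j in range(out_w):    # slide from left to right
            # Elementwise product of the kernel with its receptive field, then sum.
            s = sum(w[u][v] * x[i + u][j + v]
                    for u in range(kh) for v in range(kw))
            row.append(s + b)
        z.append(row)
    return z


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = [[1, 0],
     [0, -1]]          # the same kernel is reused at every position (weight sharing)
z = conv2d(x, w, b=0.5)
print(z)
```

Because the same four kernel weights are applied at every position, the layer has far fewer parameters than a dense mapping from the 3 × 3 input to the 2 × 2 output would need.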

b: BATCH NORMALIZATION
The input or feature maps generated by convolutional layers may vary greatly in scale. When very large or very small values are sent to the activation function, training suffers from vanishing/exploding gradients [38], [46]. To address this problem, batch normalization [47] was proposed; it accelerates the training process by normalizing the input of the activation function over each mini-batch, which reduces internal covariate shift. Generally, batch normalization is performed before the activation function, but it can also be applied after the activation function, depending on the application.
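The normalization step can be sketched as follows for a single feature observed over one mini-batch. This is a minimal sketch assuming the standard training-time formulation (subtract the batch mean, divide by the batch standard deviation, then apply a learnable scale and shift); the names `gamma`, `beta` and `eps` are the usual conventions rather than symbols taken from this article, and running statistics for inference are omitted.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]      # hypothetical pre-activation values
normed = batch_norm(batch)
print(normed)                      # roughly zero mean and unit variance
```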

c: ACTIVATION FUNCTION
An activation function is a mathematical function that maps its input non-linearly, and it is applied to improve the feature representation ability of networks. It often follows a convolutional layer and takes feature maps as input. The sigmoid function [48] is a prevalent choice of activation function, which is defined as:
$$y = \frac{1}{1 + e^{-x}}$$
where x is the input and y represents the output. The sigmoid function suffers from the vanishing gradient problem for very large or very small inputs. ReLU [49] is another frequently used activation function. It is expressed as:
$$y = \max(0, x)$$
where x is the input of the ReLU function and y represents its output. It preserves the positive part of feature maps and sets the negative part to 0. ReLU alleviates the vanishing gradient problem since its gradient is 1 whenever the input is positive, no matter how large it is. However, when the input is negative, both the output of ReLU and its gradient are equal to 0. This can reduce overfitting, but in some cases it also hinders learning, because neurons with zero gradient are effectively disconnected. LeakyReLU (LReLU) was proposed to address the problem of zero gradients for negative inputs [50]. It preserves the positive part fully, and it also preserves the negative part, scaled by a ratio λ (between 0 and 1). It is expressed as:
$$y = \begin{cases} x, & x \ge 0 \\ \lambda x, & x < 0 \end{cases}$$
so that when its input is negative, both the output and the gradient are nonzero. Generally, the Softmax function is applied to the final layer as the activation function for multi-class classification tasks. It is expressed as:
$$y(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}$$
where x is the input vector and x_i is its ith component. A K-dimensional output corresponds to K-class classification. y(x)_i is the output, which represents the probability that the input vector is classified into the ith class.
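The four activation functions discussed above can be written compactly in Python. This is a minimal sketch using the standard definitions; the default value of the LeakyReLU ratio (`lam=0.01`) is an assumption chosen for illustration, and the Softmax subtracts the maximum input purely for numerical stability, which does not change its output.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    # Keeps the positive part and sets the negative part to 0.
    return max(0.0, x)


def leaky_relu(x, lam=0.01):
    # Negative inputs are scaled by lam instead of being zeroed.
    return x if x >= 0 else lam * x


def softmax(xs):
    m = max(xs)                                  # stability shift
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), relu(-3.0), leaky_relu(-2.0, lam=0.1))
print(softmax([1.0, 2.0, 3.0]))                  # K = 3 class probabilities
```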

d: POOLING LAYER
The feature map output by the convolutional layer records the position of pixels precisely, so it is very sensitive to the location of features. This high sensitivity means that a small change in the position of features, such as a rotation or shift, leads to a different map, which decreases the robustness of CNNs. Usually, a pooling layer is applied after the convolutional layer to reduce the reliance on precise feature positions and to improve the shift-invariance of CNNs. At the same time, pooling operations reduce the resolution of feature maps and thereby the computational burden.
Pooling operations can be categorized as max-pooling [51], average pooling [52], and sum pooling [53], [54]. Figure 5 shows how pooling operations work: a sliding window is placed on the feature map, and the maximum, average, or sum of the values in this window is calculated as output. In particular, if the size of the pooling window equals the size of the feature map, the operation is referred to as global pooling; otherwise, it is local/regional pooling.
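The three pooling variants differ only in how the values inside the window are reduced, which the following sketch makes explicit. It is a minimal illustration assuming a square window and a single-channel feature map; the function name `pool2d` and the sample feature map are hypothetical.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Max, average or sum pooling over a single-channel feature map."""
    reducers = {
        "max": max,
        "avg": lambda vals: sum(vals) / len(vals),
        "sum": sum,
    }
    reduce_fn = reducers[mode]
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            window = [fmap[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 2, 1, 0],
        [3, 4, 5, 6]]
print(pool2d(fmap, mode="max"))            # local 2 x 2 max pooling
print(pool2d(fmap, mode="avg"))
print(pool2d(fmap, size=4, mode="max"))    # window = whole map: global pooling
```

With a 2 × 2 window and stride 2, the 4 × 4 map is reduced to 2 × 2, which is the resolution reduction mentioned above.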

e: FULLY CONNECTED LAYER
Fully connected layers (FCs) operate on flattened feature maps and generate specific semantic information. Each neuron in a fully connected layer is connected to all the neurons in the previous layer, so all activations can be computed with a matrix multiplication followed by the addition of a bias.
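The matrix form of a fully connected layer can be sketched as below. This is a minimal sketch with hypothetical weights and inputs; the helper names `flatten` and `fully_connected` are illustrative and no activation function is applied.

```python
def flatten(fmap):
    """Turn a 2D feature map into a 1D vector, row by row."""
    return [v for row in fmap for v in row]


def fully_connected(x, weights, biases):
    """Compute y = W x + b; each output neuron sees every input activation."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, biases)]


features = flatten([[1.0, 2.0],
                    [3.0, 4.0]])
W = [[0.5, 0.5, 0.5, 0.5],      # one row of weights per output neuron
     [1.0, 0.0, -1.0, 0.0]]
b = [0.0, 1.0]
y = fully_connected(features, W, b)
print(y)
```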

2) LOSS FUNCTION
The loss function is used to evaluate the difference between the predicted result and the desired result. An appropriate loss function measures the difference between the prediction and the label properly and guides a fast and correct training process. The following are some popular loss functions used in CNN architectures. For multi-class classification tasks, the most used loss function is the cross-entropy loss. It is given as:
$$L_{CE} = -\sum_{i=1}^{M} c_i \log(p_i)$$
where M is the number of classes, c_i is the true label indicating whether the input belongs to the ith class (so it is 0 or 1), and p_i is the probability predicted by the network that the input belongs to the ith class.
Cross-entropy can also be applied to binary classification, since the sigmoid function is a special case of the Softmax function. In this case it is known as the binary cross-entropy loss function, and it is expressed as:
$$L_{BCE} = -\left[ y \log(p) + (1 - y) \log(1 - p) \right]$$
where y is the ground truth and p is the predicted value. The Dice coefficient (DSC) is a statistical indicator that can be used to evaluate the similarity between two images. It is represented as:
$$DSC = \frac{2\,|GT \cap SR|}{|GT| + |SR|}$$
where |GT| represents the magnitude of the ground truth, |SR| represents the magnitude of the segmentation result, and |GT ∩ SR| represents the number of elements common to GT and SR. Based on the Dice coefficient, Dice loss (DSL) is another loss function widely applied to image segmentation tasks. Dice loss is expressed as:
$$DSL = 1 - DSC$$
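The loss functions of this subsection can be sketched for small examples as follows. This is a minimal sketch under stated assumptions: labels and masks are flat lists, a small constant `eps` (not part of the definitions) guards the logarithm, and the Dice loss is taken in its common form of one minus the Dice coefficient. All function names are hypothetical.

```python
import math


def cross_entropy(c, p, eps=1e-12):
    """Multi-class cross-entropy for a one-hot label c and probabilities p."""
    return -sum(ci * math.log(pi + eps) for ci, pi in zip(c, p))


def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for one ground truth y in {0, 1} and prediction p."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def dice_coefficient(gt, sr):
    """DSC between two binary masks given as flat lists of 0/1 values."""
    inter = sum(g * s for g, s in zip(gt, sr))     # |GT ∩ SR|
    total = sum(gt) + sum(sr)                      # |GT| + |SR|
    return 2.0 * inter / total if total else 1.0


def dice_loss(gt, sr):
    return 1.0 - dice_coefficient(gt, sr)


print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
print(binary_cross_entropy(1, 0.5))
print(dice_loss([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because the Dice loss depends on the overlap rather than on every pixel equally, it is less dominated by the large number of background pixels than cross-entropy is.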

B. FULLY CONVOLUTIONAL NETWORKS (FCNS)
Long, et al. [26] proposed fully convolutional networks (FCNs), which replaced fully connected layers with up-sampling layers. The feature maps were up-sampled to the same size as the input images, and thus dense predictions were made by the network. The proposed FCN architecture is shown in Figure 6. Compared with traditional CNNs, FCNs can predict each pixel in an image or image patch, so they are faster and more suitable for image segmentation tasks.

C. U-NET
Ronneberger, et al. [55] proposed U-net, which has a symmetrical encoder-decoder structure and skip connections from the encoding path to the decoding path. Features are extracted in the encoder and images are reconstructed in the decoder. Skip connections send low-level feature maps generated in the encoder directly to the decoder. Since low-level feature maps contain local information while high-level feature maps contain global information, U-net integrates both and thus makes better predictions. Figure 7 shows the U-net architecture.

III. DATABASE AND EVALUATION METRICS FOR RETINAL VESSEL SEGMENTATION
The retina is located in the inner layer of the eye wall. A digital fundus camera attached to a low-power microscope is used to acquire retinal fundus images. The pupil of the human eye is the entry/exit point for the illumination and imaging light beams of the fundus camera. Retinal fundus images can also be obtained with the EasyScan camera, which is based on Scanning Laser Ophthalmoscopy (SLO) [56]. SLO has the advantage of lower light exposure and offers better contrast between vessels and background due to its confocal design [57]. There are many publicly available databases for retinal vessel segmentation [10]; here we introduce only the main ones. DRIVE [18], STARE [58], CHASE_DB1 [59] and HRF [6] are the most used publicly available databases. DRiDB [60] and ARIA [61], [62] are also available for retinal vessel segmentation but have been used less in recent years. Images in these six databases were obtained by color fundus photography. In addition, two other databases, IOSTAR [63] and RC-SLO [64], whose samples were obtained by SLO, can also be used for retinal vessel segmentation. Table 1 gives brief information on these databases.
Generally, pixels in the field of view (FOV) of fundus images are classified as vessel pixels (positive) or non-vessel pixels (negative). To assess the classification, the predicted pixel labels are compared with the ground truth labels. On this basis, there are four basic pixel measures, i.e., TP (true positives), FP (false positives), FN (false negatives), and TN (true negatives). Table 2 defines these four measures at the pixel level.
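The four pixel measures, and some metrics commonly derived from them, can be computed as in the sketch below. The formulas used are the standard definitions of accuracy, sensitivity, specificity and precision and are assumed here for illustration; the function names and the toy label vectors are hypothetical.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, FN and TN for binary labels (1 = vessel, 0 = non-vessel)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def basic_metrics(tp, fp, fn, tn):
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),      # true positive rate (recall)
        "specificity": tn / (tn + fp),      # true negative rate
        "precision": tp / (tp + fp),
    }


truth = [1, 1, 1, 0, 0, 0, 0, 0]            # flattened ground truth inside the FOV
pred = [1, 1, 0, 1, 0, 0, 0, 0]             # flattened binary prediction
counts = confusion_counts(truth, pred)
metrics = basic_metrics(*counts)
print(counts, metrics)
```

Since vessel pixels are a small minority of the FOV, accuracy alone can look high even when many vessel pixels are missed, which is why sensitivity is usually reported alongside it.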
Several evaluation metrics are defined to evaluate the performance of segmentation networks. Some of the prevalent metrics are listed in Table 3. In addition, the receiver operating characteristic curve (ROC curve) is a plot that summarizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a model under different thresholds. Therefore, the ROC curve can be utilized to compare different models under an identical threshold or a specific model under different thresholds. Similar to the ROC curve, the Precision-Recall curve (PR curve) illustrates the trade-off between Precision and Recall. The areas under these curves, AUC-ROC (the area under the ROC curve) and AUC-PR (the area under the PR curve), can be used to evaluate the overall performance of the networks.
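The construction of the ROC curve and its area can be sketched as follows. This is a minimal sketch assuming that each distinct predicted score is used as a threshold and that the area is obtained with the trapezoidal rule; the function names and the toy scores are hypothetical.

```python
def roc_points(truth, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(truth)
    neg = len(truth) - pos
    points = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(truth, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(truth, scores) if s >= th and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


truth = [1, 1, 0, 1, 0, 0]                   # 1 = vessel pixel
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]      # predicted vessel probabilities
curve = roc_points(truth, scores)
print(curve)
print(auc(curve))
```

The resulting value equals the fraction of (vessel, non-vessel) pixel pairs in which the vessel pixel receives the higher score, which is why AUC-ROC does not depend on any single threshold.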

IV. EXISTING MODELS FOR RETINAL VESSEL SEGMENTATION
In this section, we categorize and analyze various methods for retinal vessel segmentation according to their network architecture.

A. CNN FOR RETINAL VESSEL SEGMENTATION
In earlier work, some researchers adopted CNNs with only a few layers to segment vessels. We review 7 CNNs and summarize their performance evaluations in Table 4.
Fan and Mo [65] applied a 5-layer CNN to vessel segmentation and extracted image patches from the green channel as input. According to their comparison of the R, G and B channels, the green channel provides better vessel-background contrast than the red and blue channels. They used the L2-norm as the loss function and adopted an optimized threshold to generate the binary vessel map.
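Two steps that recur in these pipelines, taking the green channel as input and thresholding a probability map into a binary vessel map, can be sketched as below. This is a minimal illustration and not the code of [65]: images are represented as nested lists, and the threshold of 0.5 is a placeholder assumption, whereas the authors used an optimized threshold whose value is not stated here.

```python
def green_channel(rgb_image):
    """Extract the G values from an image given as rows of (R, G, B) tuples."""
    return [[pixel[1] for pixel in row] for row in rgb_image]


def binarize(prob_map, threshold=0.5):
    """Turn a vessel probability map into a binary vessel map."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


rgb = [[(10, 200, 30), (0, 50, 0)],
       [(5, 120, 7), (1, 2, 3)]]             # hypothetical 2 x 2 fundus patch
green = green_channel(rgb)

prob_map = [[0.2, 0.7],
            [0.5, 0.49]]                     # hypothetical network output
vessel_map = binarize(prob_map)
print(green, vessel_map)
```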
Liskowski and Krawiec [66] proposed a CNN with 6 layers for retinal vessel segmentation. They applied global contrast normalization (GCN) and zero-phase component analysis (ZCA whitening) to training images in the pre-processing phase. GCN reduced the uneven illumination in images and ZCA abstracted features from universal characteristics and thus focused on the higher-order correlations.
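The effect of global contrast normalization on uneven illumination can be illustrated with a short sketch. It assumes one common formulation of GCN (subtract the patch mean and divide by the patch standard deviation); the exact variant and parameters used in [66] are not given in this review, the names `scale` and `eps` are assumptions, and ZCA whitening is not shown.

```python
import math


def global_contrast_normalize(patch, scale=1.0, eps=1e-8):
    """Subtract the patch mean and divide by the patch standard deviation."""
    pixels = [v for row in patch for v in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((v - mean) ** 2 for v in pixels) / len(pixels))
    return [[scale * (v - mean) / max(std, eps) for v in row] for row in patch]


dark = [[10, 20],
        [30, 40]]
bright = [[110, 120],
          [130, 140]]        # same structure under stronger illumination
print(global_contrast_normalize(dark))
print(global_contrast_normalize(bright))
```

Both patches map to the same normalized values, which shows how a constant illumination offset is removed before the patch reaches the network.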
Khalaf, et al. [67] constructed a CNN with 7 layers. They divided the pixels in an image into three classes (background, large vessel and small vessel) to reduce the intra-class variance. They extracted the green channel of images and applied adaptive histogram equalization (AHE) and top-hat filtering to it in the pre-processing phase. The green channel and AHE increased image contrast and suppressed noise, and top-hat filtering enhanced vessels in training images.
Vengalil, et al. [68] proposed to fine-tune an existing network, DEEPLAB-COCO-LARGEFOV, using retinal fundus image patches. They replaced the last layer with a convolutional layer and applied a threshold to obtain the final vessel maps. They did not adopt any image processing technique because they believed it might lead to undesired outcomes or harm vessel structures.
Tan, et al. [69] constructed a 7-layer CNN to make predictions for multiple objects in fundus images, including optic disc, fovea and retinal vessels. They extracted image patches in different channels with different sizes and resized them. Utilizing multiple channels can provide more information which is helpful for multi-object classification.
Guo, et al. [70] proposed a CNN with 6 layers and introduced a reinforcement sample learning scheme that trained the network on samples with poor performance. The proposed scheme allows networks to be trained with fewer epochs and less training time while also increasing network performance.
Uysal and Güraksin [71] proposed a CNN model with several convolutional layers, and they also introduced transposed convolution to up-sample feature maps. Their proposed model made pixel-wise predictions but did not perform well.
From Table 4 we can see that most CNNs achieved only about 94% segmentation accuracy. We suppose that this is because these CNNs have only a few convolutional layers and lack strong feature representation capacity, so they can segment only the basic vessel structure and misclassify most vessel boundaries and thin vessels. As a result, they have been used less in recent years.

B. FCN FOR RETINAL VESSEL SEGMENTATION
FCNs can make dense and excellent predictions for each pixel in an image patch [26]. In this survey, we review 7 FCNs and list their performance evaluations in Table 4.
Oliveira, et al. [75] proposed an FCN and added skip connections to propagate features from shallow layers to deeper layers. They also explored the multiscale nature of the vascular system by using the stationary wavelet transform (SWT), which added extra channels to the input. Their results illustrated that deep learning methods can benefit from domain knowledge.
Jiang, et al. [74] used a network based on the fully convolutional version of AlexNet. They applied Gaussian smoothing to reduce the discontinuity between the FOV and the replaced region. The segmented vessels were thicker than the ground truth, so Jiang, et al. [74] applied a 9 × 9 filter to refine the result and reduce noise in the post-processing phase.
Li, et al. [77] constructed an FCN with skip connections and introduced active learning to retinal vessel segmentation. Active learning used fewer manually labelled samples to improve the segmentation accuracy of blood vessels. The performance of the proposed model was increased in the iterative training process.
The consecutive down-sampling operations in the encoder lead to a loss of information that is critical for determining vessel boundaries and thin vessels. Luo, et al. [72] proposed a size-invariant fully convolutional neural network (SIFCN) to reduce this effect. They kept the size of the feature maps constant in each layer through padding and the choice of strides, and thus reduced the loss of information.
Atli and Gedik [78] proposed a fully convolutional network and were the first to use up-sampling and down-sampling to capture thin and thick vessels, respectively. Their proposed model produced some over-segmentation and did not perform very well, especially on the STARE database.
FCNs have more convolutional layers that can extract high-level features; therefore, FCNs have performed better than other architectures, as shown in Table 4 and Table 5. However, the segmentation results produced by FCNs are not fine enough, and the edges of segments are blurry and overly smooth. FCNs also ignore the spatial consistency of pixels in pixel-wise segmentation. A conditional random field (CRF) [79] can be introduced to improve the segmentation of FCNs [80], [81].

C. U-NET FOR RETINAL VESSEL SEGMENTATION
U-net has a symmetric architecture in which skip connections send feature maps directly from the encoder to the decoder [55]. Low-level feature maps contain rich detailed information while high-level feature maps carry better global information; therefore, U-net can capture both local and global information to make better decisions. In this survey, we review 32 U-shaped networks and list their performance evaluations in Table 6.
Guo, et al. [79] proposed a U-net and introduced structured dropout to regularize it. The proposed structured dropout is inspired by DropBlock [80] and discards continuous regions of feature maps in a ratio. Sule and Viriri [81] proposed a U-net and applied transpose convolution to the expanding path to recover the lost information.
Zhang and Chung [82] regarded retinal vessel segmentation as a multi-class classification task and introduced an edge-aware mechanism. They divided pixels into 5 classes: background, thick vessels, thin vessels, background near thick vessels and background near thin vessels. The network can pay more attention to the boundary areas of vessels in this way. They leveraged deep supervision to ease optimization.
Mishra, et al. [83] proposed a simple U-net and introduced data-aware deep supervision to improve thin vessel segmentation. They computed the average retinal vessel width of the input and matched it with the layer-wise effective receptive fields to find the layers that extract vessel features best, and then added auxiliary layers there.
Laibacher, et al. [84] proposed a U-shaped network for retinal vessel segmentation which was built on pre-trained components of MobileNetV2. It was the first network to run in real-time on high resolution images. It utilized bottleneck modules and bilinear up-sampling to reduce the number of parameters so that the model could be employed on mobile and embedded systems. The network was trained by using a hybrid loss that combined binary cross-entropy and Jaccard index.
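A hybrid loss of this kind can be sketched as below. The way the two terms are combined here, a weighted sum of the binary cross-entropy and one minus a soft Jaccard index with a hypothetical weight `alpha`, is an assumption made for illustration; the exact combination and weighting used in [84] are not specified in this review.

```python
import math


def bce(truth, probs, eps=1e-7):
    """Mean binary cross-entropy over pixels; probabilities are clamped for safety."""
    total = 0.0
    for y, p in zip(truth, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(truth)


def soft_jaccard(truth, probs, eps=1e-7):
    """Jaccard index computed directly on probabilities (differentiable form)."""
    inter = sum(y * p for y, p in zip(truth, probs))
    union = sum(truth) + sum(probs) - inter
    return (inter + eps) / (union + eps)


def hybrid_loss(truth, probs, alpha=0.5):
    # alpha is an illustrative weight, not a value taken from the reviewed paper.
    return alpha * bce(truth, probs) + (1 - alpha) * (1.0 - soft_jaccard(truth, probs))


truth = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.4, 0.6, 0.3, 0.7]
print(hybrid_loss(truth, good), hybrid_loss(truth, bad))
```

The cross-entropy term scores every pixel independently, while the Jaccard term rewards overlap with the vessel mask as a whole, so the two terms penalize different kinds of error.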
Jin, et al. [85] introduced deformable convolution to retinal vessel segmentation. The deformable convolution block adjusted the receptive fields adaptively by learning offsets and therefore captured retinal vessels of various shapes and scales. The proposed deformable U-net produced better performance than U-net and the deformable convolution network [86] on DRIVE, STARE and CHASE_DB1, as well as on two other datasets: WIDE [87] and SYNTHE [88].
Similar to Luo, et al. [72], Wang, et al. [89] also aimed to reduce the information loss caused by consecutive down-sampling layers. They introduced a feature refinement path to U-net which sent low-level feature maps to high-level layers in the encoder and decoder, respectively. The proposed feature refinement path can improve the detailed representation ability of the encoder and the discriminative ability of the decoder. Yin, et al. [90] instead proposed to add multi-scale grayscale images to each stage of the encoder and decoder to reduce information loss and help information recovery.
Dharmawan, et al. [91] proposed a new directionally sensitive blood vessel enhancement method that combined CLAHE with a new matched filter to detect micro vessels. The new matched filter was based on a multi-scale and orientation-modified Dolph-Chebyshev type I function. Their method detected more micro vessels than common CLAHE but still produced many mistakes.
Residual learning [92] was also introduced to increase the depth of networks as well as alleviate vanishing/exploding gradients. It was applied to building blocks [93]- [97] or skip connections [97], [98].
Dilated convolution [99] was also introduced to retinal vessel segmentation to enlarge the receptive fields [100]- [103]. Lopes, et al. [100] also tested the effect of different downsampling techniques, that is, max-pooling, convolution with 2×2 kernel and convolution with 3×3 kernel. They obtained better results when using convolution as down-sampling operations, which is consistent with Soomro, et al. [76].
Jiang, et al. [101] arranged dilated rates deliberately to obtain a dense sampling of input and thus avoid the chess-board effect. They also introduced depthwise separable convolution [104] to reduce the computation cost and the number of parameters.
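The two quantities involved here, the receptive field enlarged by dilation and the parameter count reduced by depthwise separable convolution, follow from simple arithmetic that can be sketched as below. The formulas are the standard ones for stride-1 convolutions with biases ignored; the dilation rates (1, 2, 5) and channel counts are illustrative assumptions and are not the settings of [101].

```python
def effective_kernel(k, d):
    """Span covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)


def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += effective_kernel(k, d) - 1
    return rf


def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    # Depthwise k x k filter per input channel, then a 1 x 1 pointwise convolution.
    return k * k * c_in + c_in * c_out


print(receptive_field([3, 3, 3], [1, 1, 1]))     # plain 3 x 3 stack
print(receptive_field([3, 3, 3], [1, 2, 5]))     # same depth with dilation
print(standard_conv_params(3, 64, 128), separable_conv_params(3, 64, 128))
```

Three dilated layers cover a much larger region than three plain layers with the same number of weights, and choosing rates without a common factor helps the sampled positions cover the input densely.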
Soomro, et al. [103] introduced morphological transform and fuzzy C-means to the pre-processing of images. In the post-processing phase, they applied morphological reconstruction to remove small objects in segmented results. Mou, et al. [97] introduced probability regularized walk (PRW) algorithm to reconnect fractured vessels. PRW is an extension of the random walk algorithm [105] on probability maps.
There is a black ring around the field of view (FOV) in fundus images. Networks should pay more attention to the FOV since the black ring does not contain any information. The attention mechanism [106] has been applied to locate the region of interest (ROI) and strengthen feature representations in retinal vessel segmentation. Luo, et al. […] [114] designed attention modules to strengthen feature representations, and their attention maps were learned by the networks instead of being assigned by experts.
Yan, et al. [115] introduced a novel joint loss to alleviate the highly unbalanced pixel ratio between thick and thin vessels in fundus images. They divided vessels into thin vessels and thick vessels to alleviate this imbalance. The joint loss includes a pixel-wise loss and a segment-level loss, the latter placing more emphasis on the thickness consistency of thin vessels.
Nasery, et al. [116] proposed a new data augmentation approach. They leveraged vignetting masks to create more annotated fundus images. Their method just adjusted the illumination condition of images but did not change the geometric and morphologic characteristics.
Galdran, et al. [117] proposed a new metric for retinal vessel segmentation and tested it using U-net. They introduced normalized mutual information to evaluate the segmentation quality. The new metric was applied to raw vessel probability map and can instruct the selection of threshold to binarize the vessel probability map.
Alvarado-Carrillo, et al. [118] focused on the curvilinear structures in vessels, so they proposed Distorted Gaussian Matched Filters (D-GMFs) with adaptive parameters and added them to the beginning and end of a U-net. They did not conduct an ablation study so we cannot know the effect of their proposed D-GMF Adaptive Unit.
Considering the large variations in vessel scale and the semantic variations in fundus images, Wu, et al. [119] proposed to adjust the receptive field adaptively to capture multi-scale features. They also adaptively fused features to extract more semantic information. They obtained a good result but still need to pay more attention to thin vessels.
From Table 6 we can see that U-net can achieve about 96% segmentation accuracy, which is higher than that of FCNs. U-net reuses low-level information by concatenating feature maps, which also increases the computational burden, so the input is usually a small image patch cropped from the whole image. The network cannot identify pixels as well in this setting because an image patch contains less information than a whole image, and it is also constrained by a limited receptive field, although dilated convolution can enlarge the receptive field.

D. MULTI-MODEL NETWORK FOR RETINAL VESSEL SEGMENTATION
Many researchers have found that the prediction capability of a single model is limited, so they proposed multi-model networks for stronger prediction ability. Most of them followed the spirit of U-net and FCN and employed an encoder-decoder structure to form sub-models. We review 19 multi-model networks and summarize their performance evaluations in Table 7.
Some research segmented thin/thick vessels or vessel boundaries/centers separately, then fused the segments to complete a whole segmentation [122]- [125]. These methods can be regarded as coarse-and-fine segmentation because thick/thin vessels or boundary/center vessels were segmented concurrently and separately.
Yan, et al. [122] proposed a three-stage segmentation network for retinal vessels using three sub-networks. The segmentation of the whole vessel tree was divided into three sub-tasks: thick vessel segmentation using FCN, thin vessel segmentation using U-net and fusion of segmentations.
Sathananthavathi and Indumathi [123] also proposed a coarse-and-fine strategy. They constructed 2 parallel FCNs: the first FCN was larger and trained with the ground truth to extract thick and moderate vessels, while the second FCN was smaller and trained with the skeletonized ground truth to extract thin vessels legibly. The outputs of both FCNs were integrated to generate the overall vessel segmentation.
Yang, et al. [124] proposed an improved U-net whose encoder was used as a backbone to extract features. They arranged 2 decoders to segment thin and thick vessels, respectively. Finally, they added a fusion network to fuse the outputs of the two decoders.
Wang, et al. [121] constructed a U-net which is composed of one encoder and three decoders. They used one decoder to generate a coarse probability map and divided an image into 'hard' or 'easy' regions according to the probability map. They used two other decoders to segment vessels in 'easy' and 'hard' regions independently. Finally, they fused all feature maps produced by 3 decoders to generate the final vessel map. They also introduced an attention gate to give more weight to vessel feature responses in decoders.
Tian, et al. [125] proposed a multi-model network to learn high-and low-frequency information, respectively. They applied Gaussian high-pass filter and Gaussian lowpass filter to original fundus images to obtain high-frequency or low-frequency information. They sent obtained highand low-frequency information to two sub-networks with encoder-decoder structures, respectively. They fused the output of two sub-networks to get the final vessel map.
More researchers proposed coarse-to-fine segmentation by cascading several sub-networks. The following sub-network can inherit the learning experiences of previous sub-models [126]- [132]. Generally, they added intra-and inter-skip connections to send low-level feature maps and learned knowledge to deeper layers and sub-networks. The followed sub-network segmented vessels coarsely and the following sub-network refined vessel maps. The following sub-model used segmented results of previous sub-models and original images as input. Wu, et al. [128] added an auxiliary layer to the followed network to get an auxiliary loss, so their model was trained by main supervision and auxiliary supervision.
Guo, et al. [133] introduced an incremental learning strategy by cascading five CNNs. They trained each CNN using the same samples as the previous ones and enhanced it by additionally feeding the samples on which the previous CNN had performed poorly. The final decision for each pixel was made using a voting scheme over the results of the multiple CNNs.
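As a rough illustration of pixel-wise voting, the sketch below fuses binary vessel maps by simple majority. The source does not state the exact voting rule used in [133], so the majority rule, the function name and the toy maps are assumptions.

```python
def majority_vote(binary_maps):
    """Fuse binary vessel maps; a pixel is vessel if over half of the models agree."""
    n_models = len(binary_maps)
    height, width = len(binary_maps[0]), len(binary_maps[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            votes = sum(m[y][x] for m in binary_maps)
            row.append(1 if 2 * votes > n_models else 0)
        fused.append(row)
    return fused

# Hypothetical 2x2 outputs of five cascaded CNNs (1 = vessel, 0 = background).
maps = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[0, 0], [1, 1]],
    [[1, 1], [0, 1]],
    [[0, 0], [0, 1]],
]
fused = majority_vote(maps)
print(fused)  # [[1, 0], [0, 1]]
```

Using an odd number of models, such as five, avoids ties under this rule.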
Lian, et al. [108] observed that existing models usually apply only global pre-processing operations to images, which loses local information. They therefore applied global and local operations simultaneously to enhance image contrast.
Tang, et al. [134] proposed a network with five identical and parallel sub-networks for ensemble learning. The input of each sub-network was a grayscale image obtained by combining the R and G channel data in a different proportion. The probability maps produced by the five sub-models were averaged to generate the final segmentation result.
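The two ingredients of this ensemble, channel mixing and probability averaging, can be sketched as follows. The mixing weights, the 0.5 threshold and the toy probability maps are assumptions for illustration; the proportions actually used in [134] are not given here.

```python
def mix_red_green(rgb_image, red_weight):
    """Build a grayscale image as red_weight * R + (1 - red_weight) * G."""
    return [[red_weight * r + (1.0 - red_weight) * g for (r, g, _b) in row]
            for row in rgb_image]

def average_maps(prob_maps):
    """Average several probability maps of identical size pixel by pixel."""
    n = len(prob_maps)
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    return [[sum(m[y][x] for m in prob_maps) / n for x in range(width)]
            for y in range(height)]

def binarize(prob_map, threshold=0.5):
    """Turn a probability map into a binary vessel map."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

# One hypothetical RGB pixel and five hypothetical mixing proportions.
rgb = [[(0.8, 0.4, 0.1)]]
inputs = [mix_red_green(rgb, w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Hypothetical outputs of the five sub-networks on a 1x2 image.
prob_maps = [[[0.9, 0.2]], [[0.7, 0.4]], [[0.8, 0.6]], [[0.6, 0.1]], [[0.5, 0.2]]]
mean_map = average_maps(prob_maps)   # approximately [[0.7, 0.3]]
vessel_map = binarize(mean_map)      # [[1, 0]]
```

Averaging keeps the soft confidence of each sub-model, so a single overconfident model cannot flip a pixel on its own.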
Zou, et al. [135] also formulated the task as a multi-class classification task to detect thin vessels with a width of less than 2 pixels. They constructed three networks: one for generating labels, one for retinal vessel segmentation, and one for label simplification.
Cherukuri, et al. [136] proposed a domain-enriched network composed of two parts: a representation network to extract geometric features from fundus images and a residual task network to make pixel-level predictions using the obtained features. Their method performed well, but some non-vessel pixels were still identified as vessel pixels.
To consider the graphical structure of vessel shape, Shin, et al. [137] proposed a vessel graphic network that combined a graph neural network (GNN) [41] with a CNN to jointly utilize both local appearance and global vessel structure. They did not obtain a good result because their model misclassified many non-vessel pixels as vessel pixels.
Tajbakhsh, et al. [138] proposed an error correction mechanism that can learn from segmentation mistakes. The proposed network is divided into three sub-networks: a U-net to produce an initial segmentation, a network to produce diverse but representative error patterns, and another U-net to correct the mistakes in the initial segmentation map.
From Table 7 we can see that multi-model networks achieve about 96.3% segmentation accuracy, a slight improvement over single networks. However, multi-model networks are also more difficult to train and have a higher computational burden.

E. GENERATIVE ADVERSARIAL NETWORK (GAN) FOR RETINAL VESSEL SEGMENTATION
GAN [40] is a type of deep unsupervised learning model, which is composed of a generator and a discriminator. In this survey, we review 13 GANs and list their performance evaluations in Table 8.
Son, et al. [148] explored several models for the discriminators: pixel-GAN, patch-GAN and image-GAN. The results indicate that patch-GAN performs better than others including the one with a single generator.
Park, et al. [149] chained two U-nets in the generator and used residual convolution blocks as building blocks in both the generator and the discriminator. They used automatic color equalization (ACE) to enhance images in the pre-processing phase and the Lanczos resampling method to smooth the vessel branches and reduce false negatives in the post-processing phase.
GANs based on semi-supervised learning were also explored to address the lack of annotated data. Huo, et al. [150] proposed a semi-supervised framework that combined a GAN with a self-training scheme, and they adopted the particle swarm optimization (PSO) [151] algorithm to choose the hyperparameters in semi-supervised learning, since self-training is sensitive to hyperparameters. They obtained 0.9550/0.8419 of AUC_ROC and AUC_PR on the DRIVE database when using 0.1 labelled and 0.9 unlabelled data. Lahiri, et al. [152] also trained a GAN based on semi-supervised learning to learn from both labelled and unlabelled data. They used only 3K annotated image patches to make patch-wise predictions and obtained 0.95/0.96 accuracy and 0.96/0.94 AUC on the DRIVE and STARE databases, respectively. Their models outperformed a simple U-net and used less annotated data, but they did not segment vessels as accurately as other improved models based on supervised learning.
From Table 8 we can see that GANs produced about 96% segmentation accuracy, which is similar to U-net. Park, et al. [149] obtained the best performance. Compared with CNNs, GANs require the generator and the discriminator to be trained alternately, which makes training more cumbersome.

F. OTHER NETWORK FOR RETINAL VESSEL SEGMENTATION
Researchers also proposed networks that cannot be categorized into the foregoing classes due to their unique architectures. The performance of these methods is listed in Table 9. Ngo and Han [153], Guo, et al. [154], and Li, et al. [155] adopted multiple input branches to capture multi-scale spatial information. The feature maps generated by all branches are then combined to make predictions. In addition, Guo, et al. [154] applied the K-dimensions tree integrated with the Hessian matrix to reconnect broken segments in the post-processing stage. Some broken vessels were reconnected and the vessel map became cleaner after post-processing. Accuracy and sensitivity increased while specificity decreased after post-processing. Li, et al. [155] introduced sparse variables into the label design and improved the cross-entropy loss function to address the sample imbalance. Li, et al. [155] obtained better sensitivities than [153] and [154] because they addressed the class imbalance issue.
The holistically-nested edge detection (HED) network made a significant advance in edge detection [162]. It is a single-stream network with multiple side outputs, and final predictions are made by fusing the multi-scale side outputs. Inspired by the HED network, Mo and Zhang [20] and Guo, et al. [158] integrated feature maps generated at different stages of their networks to generate the final probability map. Lin, et al. [156] and Hu, et al. [157] also constructed single-stream networks based on VGG Net and applied fully connected CRFs [163] to get the final binary segmentation result. The CRFs utilized multi-scale feature maps from different stages to make full use of spatial contextual information. Acting as a global smoothness regularizer, CRFs can mitigate noise and edge blurring.
Feng, et al. [159] proposed a cross-connected network (CcNet) that has two parallel paths. They used two CRM (convolution-ReLU-Max pooling) modules as building blocks and formed cross-connections between these two paths. They sent the feature maps produced by each module in the upper path to each module in the lower path. They concatenated all feature maps generated in the lower path to generate the final vessel maps. Owing to these cross-connections, CcNet can learn multi-scale features efficiently.
Zhuo, et al. [160] used three dense blocks to form a straight network and added two bottleneck blocks between them to reduce model complexity and computation cost. Similar to Luo, et al. [72], they also maintained the size of the feature maps by removing down-sampling layers to reduce the information loss of tiny vessels. In addition, arguing that the existing evaluation metrics should not be weighted equally because of the great imbalance between vessel and non-vessel pixels, Zhuo, et al. [160] proposed a new evaluation index named fusion score, which converts multiple evaluation metrics into a single target and is expressed in Equation 9. They obtained fusion scores of 0.8339 and 0.8449 on the DRIVE and STARE databases, respectively.
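The trade-off reported for post-processing (higher accuracy and sensitivity, lower specificity) can be reproduced on a toy example. The sketch below computes the three metrics from binary masks using their standard definitions; the masks are hypothetical and are not data from [154].

```python
def confusion_counts(pred, truth):
    """Count true/false positives and negatives for binary masks (1 = vessel)."""
    tp = fp = tn = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 0:
                tn += 1
            else:
                fn += 1
    return tp, fp, tn, fn

def segmentation_metrics(pred, truth):
    """Return accuracy, sensitivity and specificity; assumes both classes occur in truth."""
    tp, fp, tn, fn = confusion_counts(pred, truth)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical 2x5 masks: a broken vessel before and a reconnected one after.
truth = [[1, 1, 1, 0, 0], [0, 0, 0, 0, 0]]
before = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
after = [[1, 1, 1, 1, 0], [0, 0, 0, 0, 0]]

m_before = segmentation_metrics(before, truth)  # accuracy 0.8, sensitivity 1/3, specificity 1.0
m_after = segmentation_metrics(after, truth)    # accuracy 0.9, sensitivity 1.0, specificity 6/7
```

Reconnection recovers missed vessel pixels, which raises sensitivity, but the single extra false positive lowers specificity.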
To reduce the high-frequency information loss caused by consecutive down-samplings in the encoder, Noh, et al. [161] introduced a scale-space approximated CNN (SSA-Net). It is a single-stream network with residual connection and skip connections. They inserted up-sampling layers in the feature generation phase to generate size-invariant feature maps and thus reduce spatial scale-space distortions.
CLAHE is widely applied to fundus images to enhance image contrast, which has two parameters: size of the contextual region and clip limit. Most researchers just used default values for CLAHE, but Aurangzeb, et al. [164] introduced PSO to CLAHE to find optimal parameter values. They did not propose a new network but just applied their method to existing models.
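A minimal particle swarm sketch shows how two such parameters could be searched. The objective here is a hypothetical stand-in with a known optimum, since a real objective would score CLAHE-enhanced images; the swarm size, inertia `w`, acceleration constants `c1` and `c2`, and the parameter bounds are all assumptions rather than the settings of Aurangzeb, et al. [164].

```python
import random

def pso(objective, bounds, n_particles=20, n_iters=100,
        w=0.7, c1=1.4, c2=1.4, seed=0):
    """Minimize objective over box bounds with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus pulls toward the personal and global best positions.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_objective(params):
    """Hypothetical cost with its minimum at region size 8 and clip limit 2."""
    region_size, clip_limit = params
    return (region_size - 8.0) ** 2 + (clip_limit - 2.0) ** 2

bounds = [(2.0, 16.0), (1.0, 10.0)]  # assumed search ranges
best_params, best_value = pso(toy_objective, bounds)
```

In practice the region size would be rounded to an integer, and the objective would be replaced by an image-quality or segmentation score computed on enhanced images.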
From Table 9 it can be observed that Noh, et al. [161] obtained the best performance among these methods, and that these methods produced about 95.5% segmentation accuracy. These models did not have a strong learning capacity because most of them have only a few convolutional layers, and their architectures may not be well suited to this task.

V. DISCUSSION
In this survey, we have reviewed 89 deep learning models for retinal vessel segmentation, which indicates that deep learning has been widely applied to segment retinal vessels. Earlier, researchers applied simple CNNs for vessel segmentation on the DRIVE and STARE databases [69], [165]. Since then, many researchers have proposed various improved models for retinal vessel segmentation. FCNs and U-nets were the most widely used to make dense predictions because of their excellent performance [73], [79], [81]. Later, different improvement modules, such as residual blocks, dilated convolution and attention mechanisms, were introduced into U-net to improve the performance of the proposed models [101], [102], [107]. On the other hand, researchers also proposed multi-model networks to get a stronger identification capability [122], [128], and others introduced GANs to vessel segmentation [128]. Multi-branch networks and HED-shaped networks were also used for vessel segmentation [153], [166].

A. CHALLENGES IN RETINAL VESSEL SEGMENTATION USING DEEP LEARNING
According to the existing research, the following challenges are encountered when using deep learning for retinal vessel segmentation:
1) Lack of well-labelled training samples. Although a large number of fundus images exist, annotated data are very difficult to obtain because annotation requires professional doctors and takes a significant amount of time and cost.
2) Low quality of existing image samples. Low image quality hinders deep learning models from learning good feature representations. Image noise, uneven illumination, low contrast (especially for thin vessels), centerline reflection and various structures (pathological regions, fovea, macula, optic disc) decrease the performance of the proposed models.
3) Class imbalance in training samples. The unequal numbers of positive and negative examples available for training degrade the performance of networks. Class imbalance exists not only between foreground and background but also between thick and thin vessels. Deep learning models tend to classify boundary pixels as non-vessel pixels because non-vessel pixels far outnumber vessel pixels. Networks perform worse on thin vessels than on thick vessels because the misclassification of thin-vessel pixels has less influence on the total loss.

B. KEY ASPECTS FOR SUCCESSFUL RETINAL VESSEL SEGMENTATION
From the analysis of existing methods, a successful model should be able to detect vessels under uneven illumination, under low contrast and across the various regions of fundus images. At the same time, it should be robust to image variations and have strong generalization ability. We identify the following key aspects for successful and robust retinal vessel segmentation:
1) Raw image enhancement. Applying image enhancement techniques in the pre-processing phase, such as the conversion of RGB images to grayscale images, normalization, contrast limited adaptive histogram equalization (CLAHE) [167] and gamma correction, increases the image quality [73], [93], [107]. Morphological operations [168] can also be adopted to increase the quality of images [76], [123].
2) Data augmentation. The publicly available databases are too small to train a network, so regular data augmentation techniques [169], such as rotating, flipping, shifting, mirroring and cropping images into image patches, can be used to enlarge the training dataset [23], [107], [108], [122], [156]. Transfer learning can also be leveraged for this task, for example with VGGNet [20], [156]- [158], ResNet [158] or a fully convolutional version of AlexNet [74].
3) A well-designed model. A well-designed model can capture more spatial information, reduce the loss of local information and reuse low-level feature maps for accurate segmentation. According to the segmentation results, U-nets and multi-model networks perform better than CNNs and FCNs because they have more convolutional layers and can therefore extract features better. In addition, skip connections help to reuse low-level information, which is very important for identification. Some of the proposed GANs also obtained high accuracy. Dilated convolution is a good option to enlarge the receptive field and capture more spatial information while keeping the same number of parameters [99]. Residual learning can increase network depth and alleviate network degradation at the same time [92]. Dense connections can make full use of the feature maps generated by all previous layers and thus decrease model complexity and mitigate vanishing gradients [170].
4) Proper loss function. A proper loss function can lead models to pay more attention to vessels, especially thin vessels. Researchers can adopt improved loss functions, such as the weighted cross-entropy loss function, to address the imbalance problem [82], [125], [158].
5) Vessel map enhancement. The segmentation result contains noise and isolated small vessels, which can be eliminated in the post-processing phase with a matched filter or a morphological transform [74], [103]. The vessel segments are broken in some cases, and fractured vessels can be reconnected by techniques such as PRW and the K-dimensions tree [97], [154]. Better visualization of the vessel map helps ophthalmologists diagnose disease more easily.
6) Abundant validation. Models should not be verified on a single database only; networks should also be cross-validated to evaluate their generalization ability. In cross-validation, a network is trained using samples from one dataset but tested using another dataset [20], [75], [123], [128]. Mixed validation can be conducted as a further check: a network is trained using mixed samples from several databases and tested using the remaining samples from these databases [74].
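The effect of a weighted cross-entropy loss on imbalanced data can be checked with a small numerical sketch. The weighting rule used here (vessel terms multiplied by the background-to-vessel pixel ratio) is one common choice and an assumption; it is not necessarily the scheme of [82], [125] or [158], and the toy probabilities are hypothetical.

```python
import math

def weighted_bce(probs, labels, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with vessel-pixel terms scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical patch: 1 vessel pixel and 9 background pixels.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
pos_weight = labels.count(0) / labels.count(1)  # 9.0

miss_vessel = [0.1] * 10          # vessel pixel predicted as background
find_vessel = [0.9] + [0.1] * 9   # vessel pixel predicted correctly

# Penalty for missing the vessel pixel, without and with class weighting.
plain_gap = weighted_bce(miss_vessel, labels) - weighted_bce(find_vessel, labels)
weighted_gap = (weighted_bce(miss_vessel, labels, pos_weight)
                - weighted_bce(find_vessel, labels, pos_weight))
```

With these numbers the penalty for missing the single vessel pixel is about nine times larger under weighting (roughly 1.98 versus 0.22), which is the mechanism by which such losses draw attention to rare vessel pixels.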
From the analysis of the reviewed articles, although researchers have proposed several models and strategies to improve the performance of networks, such as the incremental learning strategy [133], various improvement modules [93], [101], [121] and coarse-to-fine segmentation [127], no model can yet segment vessels perfectly. Open problems include the segmentation of vessel boundaries and thin vessels, of the background between two close vessels, of vessels in the presence of abnormalities and various structures, and of vessels at crossings, as well as robust segmentation across different databases. In addition, the segmented vessels are still fractured and broken in most results, which invites further research on reconnecting fractured vessels.
Although deep learning has been widely applied to retinal vessel segmentation, there are still some limitations. Compared with human beings, deep learning has less generalization capacity. Compared with conventional methods, such as matched filtering methods and vessel tracing methods, deep learning is less interpretable, and it needs massive data and GPUs for training, which are unavailable or too expensive for users in some cases.

VI. CONCLUSION
Geometric characteristics of retinal vessels reflect clinical and pathological features. The ophthalmologist uses vessel maps to diagnose diseases, such as DR and MD. Precise diagnoses of eye abnormalities and their timely treatment are important in preventing global blindness.
Computerized automatic segmentation of retinal blood vessels is motivated by the fact that manual segmentation is expensive and time-consuming. In the past, researchers proposed different methods for automatic retinal vessel segmentation. Unsupervised models are limited by their accuracy. Machine learning algorithms require handcrafted features and are thus limited by their generalization ability. Currently, deep learning models are widely applied to image segmentation, including the segmentation of retinal images, since they do not need any handcrafted features and outperform existing unsupervised methods.
This article reviews publications from the past six years on retinal vessel segmentation based on deep learning. The main contribution of our work is to analyze recent models and identify new trends in retinal vessel segmentation. It will help researchers and industry practitioners to develop robust models for retinal vessel segmentation.