Attention Guided U-Net With Atrous Convolution for Accurate Retinal Vessels Segmentation

The accuracy of retinal vessel segmentation is of great significance for the diagnosis of cardiovascular diseases such as diabetes and hypertension. In particular, the segmentation accuracy at the ends of vessels is affected by the area outside the retina in the fundus image. In this paper, we propose an attention guided U-Net with atrous convolution (AA-UNet), which guides the model to separate vessel and non-vessel pixels and reuses deep features. Firstly, AA-UNet regresses a bounding box around the retinal region to generate an attention mask, which is used as a weighting function and multiplied with the differential feature map in the model so that the model pays more attention to the vessel region. Secondly, atrous convolution replaces ordinary convolution in the feature layer, which increases the receptive field and reduces the amount of computation. Then, we add two shortcuts to the atrous convolution in order to reuse features, so that vessel details become more prominent. Our model achieves accuracies of 0.9558/0.9640/0.9608 and AUCs of 0.9847/0.9824/0.9865 on the DRIVE, STARE and CHASE_DB1 datasets, respectively. The results show that our method improves the accuracy of retinal vessel segmentation and outperforms other representative retinal vessel segmentation methods.


I. INTRODUCTION
The retinal blood vessels are an important part of the systemic microcirculatory system, and their morphological changes are closely related to the severity of cardiovascular diseases such as diabetes and hypertension. In diabetics, long-term hyperglycemia and metabolic disorders may cause systemic microcirculation disorders. Lesions of the small blood vessels and of the microvasculature are the two common pathological changes in diabetes. The retinal blood vessels in the fundus are the most vulnerable, and the resulting condition is diabetic retinopathy (DR) [1]. Swelling of the retinal blood vessels in a diabetic patient must therefore be noted [2]. Hypertensive retinopathy (HR) is caused by hypertension [3]. Physiological changes driven by arterial blood pressure are a major cause of its clinical signs in the fundus; in hypertensive patients, vessel stenosis or increased vessel curvature caused by arterial blood pressure can be observed [4]. With the development of computer vision in recent years, computer-aided diagnosis has become a new technology with which ophthalmologists directly examine retinal images [5]. Retinal vessel segmentation, which provides information about the shape and crossings of the blood vessels, not only enhances DR/HR monitoring but also reduces inter- and intra-observer variability [6]. Moreover, blood vessels are also used as markers for retinal images of the same patient from different sources [7]. In all of these cases, the extraction of blood vessels from retinal images is important for the early diagnosis of some serious diseases. This has motivated the development of more accurate retinal vessel segmentation algorithms for earlier and better diagnosis.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
The retina lies behind the vitreous body and forms the concave surface of the fundus. The retinal vascular structure is extremely complex, with high curvature and varied morphology [8], which makes retinal vessel segmentation very challenging. In the past decades, many methods for retinal vessel detection have been proposed. They can be divided into two types: manual segmentation and algorithmic segmentation. The former is extremely time-consuming and laborious, and places high demands on staff. Using algorithms to segment the retinal vessels automatically reduces this manual burden, so research on retinal vessel segmentation algorithms is essential. However, because the target vessels are hard to distinguish from the image background, the vessel structures are highly complex, and the images are noisy, segmenting retinal blood vessels accurately and efficiently remains a challenging task.
In this paper, we propose a new network based on U-Net that improves end-to-end, pixel-to-pixel segmentation of retinal vessel images in deep networks. U-Net is designed as a U-shaped structure comprising a contraction path and an expansion path. The upsampling layers have a large number of feature channels; they are stacked symmetrically with conventional convolution layers to capture context information and propagate it to higher-resolution layers, which yields a more accurate segmentation [9]. U-Net can both locate objects using high-level features and segment them accurately using low-level features, and it shows excellent performance in medical imaging tasks. However, many studies show that U-Net has weaknesses. We consider its limitations to be the following. First, because the retinal blood vessels are located in a specific area of the fundus, uneven illumination introduces non-vessel noise. Second, U-Net has many layers, so feature accumulation reaches a limit and tiny vessels may not be extracted.
Based on these challenges, and inspired by the recently proposed variability convolution networks [10]-[12], the main contributions of this paper are as follows:
1) The proposed method adds an attention module between the feature layer and the layer preceding the output layer, guiding the model to better separate vessel and non-vessel features.
2) Two convolutions in the feature layer are replaced by atrous convolutions, and two shortcuts are added in the feature layer. This reduces the amount of computation, increases the receptive field, and reuses the features of the feature layer to recover more details.
The results show that AA-UNet outperforms other methods for retinal vessel segmentation and is well suited as a computer-aided system for early diagnosis and clinical operation of retinal diseases.
The rest of this paper is organized as follows. Section II briefly reviews the recent literature on retinal vessel segmentation methods. Section III gives a detailed description of the proposed AA-UNet method for retinal vessel feature extraction. Experimental results are presented and discussed in Section IV. Finally, Section V summarizes the paper.

II. RELATED WORK
The purpose of retinal vessel segmentation in fundus images is to locate and identify retinal vascular structures. With the development of computer vision in recent years, many kinds of intelligent algorithms have been applied to retinal vessel segmentation. Depending on the learning model, retinal vessel segmentation algorithms can be divided into supervised and unsupervised methods. Both are briefly reviewed below.

A. UNSUPERVISED METHOD
Unsupervised methods do not use pre-labeled training samples; in most cases they directly use filter responses or model-based techniques. There have been many studies on retinal vessel segmentation. Zana and Klein [13] proposed an algorithm that detects vessel patterns in a noisy environment with an accuracy of 0.9377. Chanwimaluang et al. [14] proposed a method free of manual intervention for locating and extracting retinal vessels. It consists of four steps: entropy-based thresholding, matched filtering, length filtering and vascular intersection detection; however, this approach is rather cumbersome. Fraz et al. [15] extracted blood vessels from the retinal image by combining the detected vessel centerlines with the morphological plane section. Martinez-Perez et al. [16] proposed an automatic retinal vessel segmentation method based on multi-scale feature extraction. Niemeijer et al. [17] compared multiple vessel segmentation algorithms and showed that their accuracy can reach 0.9416.
Zhang et al. [18] proposed an unsupervised retinal vessel segmentation algorithm based on a text dictionary; with appropriate preprocessing, better performance can be obtained. Raja and Vasuki [19] proposed a method that automatically segments retinal blood vessels after removing the disc area of the retinal image. Hassan et al. [20] proposed a vessel segmentation method combining k-means clustering and mathematical morphology. However, this method is not suitable for handling vessels of various widths and may lose tiny structures. Oliveira et al. [21] used three filters, namely the Gabor wavelet filter, the matched filter and Frangi's filter, to enhance the blood vessels. Jouandeau et al. [22] proposed an adaptive segmentation method. Waters et al. [23] proposed a method that models vessels as trenches. After illumination correction, trenches are detected by their high curvature and oriented in a specific direction; an improved region growing method is then used to extract the complete vascular structure. In this method, the empirically set average illumination threshold may introduce bias. Zardadi et al. [24] proposed a fast unsupervised method to detect vessels in fundus images automatically. They enhance the vessels in all directions, then obtain the activation function of the cellular response, classify each pixel with an adaptive threshold algorithm, and finally perform morphological post-processing. However, several points are still incorrectly segmented as vessel, which affects the final performance of the algorithm. The performance of these unsupervised methods on the DRIVE and STARE datasets is shown in Tab.1. The performance of unsupervised vessel segmentation is not particularly good [25]. Moreover, owing to the influence of external factors, the vascular features used by these algorithms usually need careful design.

B. SUPERVISED METHOD
Supervised learning builds an optimal prediction model from manually labeled data and uses that model to map each input to its corresponding output. It has been widely used in segmentation. For the purpose of this study, it requires two components: a feature extractor that produces a feature vector for each pixel, and a classifier that assigns a label to each extracted vector. A variety of feature extractors have been presented, such as Gabor filters [26] and Gaussian filters [27]. Many classifiers are used to handle different tasks, such as KNN classifiers [28], Support Vector Machines (SVM) [29], Artificial Neural Networks (ANN) [30] and AdaBoost [31].
Deep learning is a family of algorithms and architectures that solves tasks on data such as images and text using back propagation and multi-layer neural networks. One of its contributions is that features are learned automatically by the deep layers instead of being handcrafted [32].
Neural network structures such as convolutional neural networks, recurrent neural networks and deep networks have been widely used in many areas, including medical imaging, satellite observation and text understanding, where they have been shown to produce state-of-the-art results.
In recent years, there have been several studies of blood vessel segmentation based on neural networks. Wang et al. [33] proposed an algorithm that combines a convolutional neural network (CNN) and a random forest (RF) to perform segmentation. Fu et al. [34] recast retinal vessel segmentation as a holistically-nested edge detection (HED) problem using a deep learning architecture; they used a fully convolutional neural network to generate a probability map of blood vessels. Orlando et al. [35] proposed a new red lesion detection method that combines deep learning with domain knowledge, using random forest classifiers to identify true lesion candidates. Jiang et al. [36] proposed a vessel segmentation method with pixels as the main features. They trained a neural network classifier on selected training data, in which each pixel is represented by an 8-D vector, and unlabeled pixels are classified according to this vector. Aliahmad et al. [37] evaluated supervised retinal vessel segmentation results at different ages; the experiments show that segmentation results differ between age groups. Fu et al. [38] treated boundary detection as the main object of segmentation, combining a deep CNN with conditional random field (CRF) layers. Tan et al. [39] proposed a neural network model to segment and identify exudates, hemorrhages, and small aneurysms. A single-layer CNN is used to segment vessel features that could not otherwise be accurately segmented, demonstrating the feature extraction ability of deep learning. Moreover, Tan et al. [40] proposed a 7-layer CNN to segment the optic disc, fovea and blood vessels automatically and simultaneously. Xu et al. [41] used a model that automatically detects DR; to improve performance, they applied data augmentation in preprocessing. Budak et al. [42] proposed a new CNN model, a densely connected and cascading multi-encoder/decoder (DCCMED) network, for retinal vessel extraction from fundus images. DCCMED contains a series of multi encoder-decoder CNNs and connects certain layers to the corresponding input of the subsequent encoder-decoder block in a feed-forward fashion. The method outperforms other methods on Kaggle datasets. The performance of these supervised methods on the DRIVE and STARE datasets is shown in Tab.2.
In summary, deep learning methods are superior to unsupervised methods for retinal vessel segmentation. Although many methods exist, segmentation accuracy still needs to be improved, which is of great significance for the early detection of diseases. Designing a structure that is both more accurate and more efficient is even more challenging. Therefore, we propose the AA-UNet model to segment retinal vessels efficiently and automatically.

III. METHODOLOGY
The purpose of this study is to establish a deep learning model for retinal vessel segmentation in fundus images. Based on U-Net, we propose AA-UNet for the retinal vessel segmentation task. AA-UNet adds an attention module to the original U-Net. In the feature layer, the convolutions are replaced by atrous convolutions and two shortcuts are added, so that the model separates fundus image pixels and identifies vessel details better.

A. U-NET ARCHITECTURE
U-Net has a structure similar to FCN [43]. The difference is that U-Net uses a symmetric encoder-decoder structure consisting of a compression path and an expansion path. The compression path (the resolution-reducing process shown in the left half of Fig.1) follows a normal CNN model with convolution and pooling layers. The input is a gray image of size 572 × 572. The expansion path (the right half of Fig.1) contains convolution and up-convolution layers, so that the feature channels are reduced to 1 channel while the feature maps are restored to the original image size [44]. The skip connections between the contraction path and the expansion path allow the model to capture both global and local information. Finally, the model uses a 1 × 1 convolution to map the feature vector to the required number of classes, followed by a softmax layer that outputs the probability of target and background to complete the segmentation. This simple symmetric U-Net structure has achieved excellent performance in a variety of biomedical segmentation applications.

B. ATTENTION MODULE
We find that retinal vessels appear in a circular area on the retina. To build the model, an attention module was added to U-Net to estimate the most likely region of retinal vessels. A circular bounding box is used to estimate the coordinates of the region. It is shown in the lower right corner of Fig.2.
In the feature layer of the network, we add an attention module connected after the fully connected layer. The attention module predicts the coordinates of the circular bounding box region. The left vertex of the circular region is (x1, y1) and the right vertex is (x2, y2), so the regression part of the attention module outputs an array (x1, y1, x2, y2). We use the mean square error (MSE) as the loss function, as in equation (1):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_{ti} - Y_{pi}\right)^{2} \tag{1}$$
where n is the number of arrays, and Y_{ti} and Y_{pi} represent the groundtruth and prediction of the i-th array, respectively.
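A minimal sketch of the regression loss referred to as equation (1), assuming it is the usual mean of squared differences between groundtruth and predicted values. The function name `mse_loss` and the sample box coordinates are illustrative and do not come from the paper.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error between groundtruth and predicted box arrays."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical (x1, y1, x2, y2) arrays on a 512 x 512 image.
gt_box = [30.0, 20.0, 480.0, 490.0]
pred_box = [32.0, 18.0, 478.0, 492.0]
loss = mse_loss(gt_box, pred_box)  # every coordinate is off by 2 pixels
```

Because the loss is an average over coordinates, a uniform error of 2 pixels per coordinate contributes the same value regardless of how many boxes are in the batch.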
A mask map P is then generated from the predicted array. We use the mask map to guide retinal vessel segmentation, focusing on the circular region that contains the retinal vessels. The weight inside the mask is set to 1 and the weight outside to θ. Finally, the mask is multiplied with the feature map before the network output layer, and the fused feature is used for back propagation. This process is given by equation (2), where M(x, y) denotes the feature at coordinate (x, y) and θ is the weight outside the mask map, with range [0, 1]. We evaluated the segmentation performance for different values of θ on the DRIVE, STARE and CHASE_DB1 datasets, as displayed in Fig. 3, and found that θ = 0.3 gives the best segmentation performance.
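The mask weighting can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the circular region is taken to be the circle inscribed in the box (x1, y1, x2, y2), the feature map is assumed to have shape (height, width, channels), and the names `attention_mask` and `apply_mask` are hypothetical.

```python
import numpy as np

def attention_mask(box, height, width, theta=0.3):
    """Weight 1 inside the circle inscribed in `box`, theta outside it."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    radius = min(x2 - x1, y2 - y1) / 2.0
    ys, xs = np.mgrid[0:height, 0:width]          # pixel row/column grids
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    return np.where(inside, 1.0, theta)

def apply_mask(feature_map, mask):
    """Multiply every channel of an (H, W, C) feature map by the mask."""
    return feature_map * mask[..., np.newaxis]

# Toy example: 8 x 8 feature map with 3 channels, all ones.
mask = attention_mask((2, 2, 6, 6), 8, 8, theta=0.3)
features = np.ones((8, 8, 3))
fused = apply_mask(features, mask)
```

With θ = 0 the area outside the circle is suppressed entirely, while θ = 1 leaves the feature map unchanged; intermediate values such as the 0.3 reported in the text only down-weight the outside region.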
The complete AA-UNet model is shown in Fig. 2. All convolution layers use zero padding, and binary cross entropy J(x, z) is used as the loss function of the segmentation network. It is given by equation (3), where m represents the number of pixels per image, and x and z represent the estimated foreground probability and the corresponding groundtruth label, respectively.
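A small sketch of the binary cross entropy J(x, z) described above, assuming the common form that averages over the m pixels of an image. The averaging and the clipping constant `eps` (added for numerical stability) are assumptions, and the function name is illustrative.

```python
import numpy as np

def binary_cross_entropy(x, z, eps=1e-7):
    """Mean binary cross entropy between probabilities x and labels z."""
    x = np.clip(np.asarray(x, dtype=np.float64), eps, 1.0 - eps)  # avoid log(0)
    z = np.asarray(z, dtype=np.float64)
    return float(-np.mean(z * np.log(x) + (1.0 - z) * np.log(1.0 - x)))

# Two-pixel toy image: an uninformative prediction versus a confident one.
labels = [1.0, 0.0]
uncertain = binary_cross_entropy([0.5, 0.5], labels)
confident = binary_cross_entropy([0.9, 0.1], labels)
```

The uninformative prediction of 0.5 for every pixel gives a loss of log 2, and the loss shrinks as the predicted probabilities move toward the correct labels.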

C. ATROUS CONVOLUTION WITH SHORTCUT
In order to segment the details of the retinal vessels accurately, we add shortcuts to the atrous convolution, which enlarges the receptive field and reuses the features of the feature layer. The orange part of Fig. 2 shows the feature layer of the network. Atrous convolution enlarges the receptive field exponentially while keeping the number of parameters unchanged, so each convolution output contains more information. Compared with traditional convolution, atrous convolution introduces an additional parameter, the dilation rate d; it is given by equation (4). Moreover, we add two shortcuts to the atrous convolution to reuse the feature information, so that details are preserved better than before.
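To make the role of the dilation rate d concrete, the sketch below implements a one-dimensional atrous convolution with zero padding, plus an identity shortcut. It is an illustration only: the network in the paper is two-dimensional, the exact wiring of its two shortcuts is shown in Fig. 2 rather than in the text, and the identity addition used here is an assumption. All function names are hypothetical.

```python
import numpy as np

def effective_kernel_size(k, d):
    """Span covered by a k-tap kernel whose taps are spaced d samples apart."""
    return k + (k - 1) * (d - 1)

def atrous_conv1d(signal, kernel, d=1):
    """Zero-padded, same-length 1D atrous correlation (odd kernel length)."""
    signal = np.asarray(signal, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    k = len(kernel)
    pad = (effective_kernel_size(k, d) - 1) // 2
    padded = np.pad(signal, pad)                  # zeros on both sides
    out = np.zeros_like(signal)
    for i in range(len(signal)):
        for j in range(k):
            out[i] += kernel[j] * padded[i + j * d]   # taps are d apart
    return out

def atrous_block_with_shortcut(signal, kernel, d=2):
    """Assumed wiring: add the input back onto the atrous output."""
    return np.asarray(signal, dtype=np.float64) + atrous_conv1d(signal, kernel, d)

impulse = [0, 0, 0, 1, 0, 0, 0]
response = atrous_conv1d(impulse, [1, 2, 3], d=2)
with_shortcut = atrous_block_with_shortcut(impulse, [1, 2, 3], d=2)
```

A 3-tap kernel with d = 2 covers 5 input samples while still using only 3 weights, which is the sense in which the receptive field grows without adding parameters.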

D. STRUCTURE OF AA-UNET
The whole AA-UNet model is shown in Fig. 2. It contains a compression path (the gray part of Fig. 2) and an expansion path (the blue part of Fig. 2). AA-UNet takes a 3-channel image of size 512 × 512 as input. The contraction path produces the feature map, and the bottom of the feature layer (marked in orange in Fig. 2) uses two atrous convolution operations to enlarge the receptive field and reduce the amount of computation. Two shortcuts reuse the features of the feature layer to pinpoint the vessels. We also add an attention module that estimates the vessel region as a bounding box, from which the mask is built. The expansion path decodes the feature map, and in the final prediction step the attention mask is integrated into the model. The algorithm of the proposed method is shown in Tab.3.

IV. EXPERIMENTS
Because the proposed model consists of a contraction path and an expansion path, we train it in two steps.
First, we train the weights of the contraction path to compute the bounding box coordinates in the feature layer, which are used to generate the mask map. The MSE loss function is used to train these weights. Second, we keep the weights of the contraction path fixed and train the weights of the expansion path. To do so, we compute the mask map from the predicted coordinates (x1, y1, x2, y2) and multiply it with the last layer of the expansion path. Finally, binary cross entropy is used as the loss function for training.
The experiments are trained and tested in the TensorFlow framework on an NVIDIA GeForce GTX 1070Ti GPU. Because of the lack of data, we also use data augmentation for preprocessing.

A. DATASETS AND EVALUATION METRIC
1) DATASETS
We test the performance of AA-UNet on the DRIVE, STARE, and CHASE_DB1 datasets, which are representative datasets for the retinal vessel segmentation task. Original images and the corresponding groundtruth images are shown in Fig.4.
The DRIVE (Digital Retinal Images for Vessel Extraction) dataset consists of 40 fundus images [45]. We selected 20 images for training and the other 20 images for testing. These images were taken by a Canon camera with a 45° field of view (FOV). The size of each image is 584 × 565.
The STARE (Structured Analysis of the Retina) dataset consists of 20 fundus images [46]. We selected 10 images for training and the other 10 images for testing. These images were taken by a TopCon camera with a 35° FOV. The size of each image is 605 × 700.
The CHASE_DB1 (CHASE) dataset consists of 28 retinal images taken from the eyes of 14 children [47]. We selected 20 images for training and the other 8 images for testing. These images were taken by a Nidek camera with a 30° FOV. The size of each image is 960 × 999.
For ease of training, we resized the original images of all three datasets to 512 × 512, and all convolution operations use zero padding. Since the datasets are very small, we expanded the training set with a series of random augmentations: rotations, horizontal and vertical flips, elastic distortions, and changes in brightness, contrast, saturation and hue. Image augmentations are commonly used in biomedical image analysis tasks, especially when working with small datasets, as they can improve accuracy and generalization [48]. Rotations range from −30 to +30 degrees. Brightness and contrast range from 0.7 to 1.3, and saturation and hue from 0.95 to 1.05. Elastic transformations are governed by the grid size and the magnitude, for which we selected values of 8 × 8 and 1, respectively [49].
We use the following metrics to evaluate the model: Accuracy (ACC), Precision (P), True Positive Rate (TPR), True Negative Rate (TNR), and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). ACC is a comprehensive metric that measures the ratio of correctly classified pixels to the total number of pixels in the dataset. P is the positive predictive value, i.e., the proportion of true positives among all pixels predicted as positive. TPR, also known as sensitivity, measures the proportion of positives that are correctly identified. TNR, also known as specificity, measures the proportion of negatives that are correctly identified. These metrics follow their standard definitions. In addition, we use the harmonic mean F-measure (F1) and Jaccard Similarity (JS) to evaluate performance. Here, the ground truth is abbreviated as GT, and the segmentation result is abbreviated as SR.
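As a hedged illustration of the metrics described above, the sketch below computes ACC, P, TPR, TNR, F1 and JS from a binary segmentation result (SR) and ground truth (GT), assuming the standard confusion-matrix definitions. The function name and the toy masks are illustrative; AUC is omitted because it requires probability maps rather than binary masks.

```python
import numpy as np

def segmentation_metrics(sr, gt):
    """Pixel-wise metrics for binary masks SR (prediction) and GT (label)."""
    sr = np.asarray(sr).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = int(np.sum(sr & gt))      # vessel pixels predicted as vessel
    tn = int(np.sum(~sr & ~gt))    # background predicted as background
    fp = int(np.sum(sr & ~gt))     # background predicted as vessel
    fn = int(np.sum(~sr & gt))     # vessel pixels that were missed
    # Note: masks with no positives would divide by zero below.
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "P": precision,
        "TPR": tpr,
        "TNR": tn / (tn + fp),
        "F1": 2 * precision * tpr / (precision + tpr),
        "JS": tp / (tp + fp + fn),  # |GT ∩ SR| / |GT ∪ SR|
    }

# Toy 2 x 4 masks: one hit, one miss, one false alarm, five true negatives.
gt_mask = [[1, 1, 0, 0], [0, 0, 0, 0]]
sr_mask = [[1, 0, 1, 0], [0, 0, 0, 0]]
scores = segmentation_metrics(sr_mask, gt_mask)
```

In this toy case ACC is 0.75 even though only half of the vessel pixels are found, which illustrates why TPR, F1 and JS are reported alongside accuracy when background pixels dominate.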

B. COMPARISON OF AA-UNET AND U-NET
The AA-UNet and U-Net models are compared on the representative datasets DRIVE, STARE, and CHASE_DB1. Each dataset is divided into a training set, a validation set, and a test set. Binary cross entropy is used as the loss function and Adam as the optimizer. We set the batch size to 60 and trained 100 in total. In order to achieve fast convergence and avoid over-fitting, we set the initial learning rate to 0.001 and use a dynamic learning strategy to adjust the learning rate. The learning rate will be reduced
VOLUME 8, 2020
We also tested the inference time of the model on the DRIVE, STARE and CHASE_DB1 datasets. Both U-Net and AA-UNet are tested on an NVIDIA GTX 1070Ti. U-Net has 31.03M parameters, and its inference time is 22ms/21ms/25ms on DRIVE, STARE, and CHASE_DB1, respectively. AA-UNet has 28.25M parameters, and its inference time is 6ms/6ms/7ms on DRIVE, STARE, and CHASE_DB1, respectively. The inference time comparison is shown in Tab.7. In addition, we use the ROC curve to evaluate the models, as shown in Fig.5. The solid red line corresponds to the model with higher accuracy than the one shown by the solid blue line.
The AA-UNet curve lies toward the upper left, while the U-Net curve lies below it. Fig.5 therefore shows that the proposed AA-UNet has the largest area under the ROC curve.
To further examine the segmentation results of the model, we present segmentation maps of the retinal vessels in Fig.6, Fig.7 and Fig.8. The proposed AA-UNet model produces more detailed vessel segmentations. It detects vessel details that are missed by the U-Net model and thus completes the segmentation more effectively. In addition, Fig. 9 shows enlarged views of parts of the vessel images. Several vessel ends are not detected by the U-Net model. Because the retinal vessel structure is complex, most algorithms cannot segment it accurately, whereas the proposed AA-UNet model reuses the detailed features and can therefore capture the intersection details that are difficult to segment. Our model thus obtains good results on retinal vessel segmentation.

C. COMPARE WITH EXISTING METHODS
The proposed AA-UNet is also compared with several recently published methods, where traditional segmentation algorithms are denoted by TSA and deep neural network segmentation algorithms by DSA. Tab.8, Tab.9 and Tab.10 list the type of each compared algorithm, its year of publication, and its segmentation performance on the DRIVE, STARE and CHASE_DB1 datasets. In general, the deep neural network algorithms are better than the traditional segmentation algorithms at retinal vessel segmentation. As shown in the tables, our model performs best on DRIVE and CHASE_DB1, with ACC reaching 0.9558 and 0.9608 and AUC reaching 0.9847 and 0.9865, respectively. The results show that the AA-UNet model achieves advanced performance compared with these published methods.
Tab.8 compares AA-UNet with existing methods on the DRIVE dataset. The images of this dataset are clear, and the deep network algorithms have obvious advantages. Our model is 7% higher in TPR and 2.2% higher in AUC than the traditional segmentation algorithm, and it is optimal or suboptimal compared with the other algorithms. Tab.10 compares AA-UNet with existing methods on the CHASE_DB1 dataset. The images of this dataset have the lowest quality: the image structure is highly complex, and the contrast between the blood vessels and the background is weak. Our model is 9.7% higher in TPR and 3.3% higher in AUC than the traditional segmentation algorithm, and it is almost optimal compared with the other algorithms. Finally, we used the models trained on these three datasets to test on the HRF dataset, where the accuracy reached 0.9574/0.9698/0.9559, respectively. This verifies the generality of the model.

V. CONCLUSION
Neural networks that learn features to complete high-level tasks have been widely used in various fields, especially in medical imaging. In this paper, we proposed the AA-UNet model for the segmentation of retinal blood vessels. To test its performance, the AA-UNet and U-Net models were trained and compared on the DRIVE, STARE and CHASE_DB1 datasets.
We replaced the convolutions of the feature layer with atrous convolutions and added two shortcuts accordingly. In this way, the model increases the receptive field in the feature layer, reduces the amount of computation, and reuses the features of the feature layer. We reduced the number of parameters by a small amount while increasing accuracy and decreasing the inference time. The inference times are 6ms/6ms/7ms on DRIVE, STARE and CHASE_DB1, respectively. In addition, our model better separates vessel and non-vessel regions thanks to the attention module added to the feature layer. The detailed portions of the retinal vessels in the fundus images are segmented, especially vessel crossings that are not easily detected. Our model improves significantly on the CHASE_DB1 dataset, which has a highly complex vascular structure and weak contrast between the blood vessels and the background; there, the accuracy and AUC are 0.9608 and 0.9865, respectively. The results show that our model achieves a significant improvement, both subjectively and objectively, on DRIVE, STARE and CHASE_DB1.
This network can segment retinal blood vessels and thereby help diagnose certain serious diseases. The method does not require any handcrafted features. Automated diagnostic systems are already widely used in clinical applications, such as DR/HR monitoring during the life cycle. The proposed AA-UNet model provides a common framework for the segmentation of retinal vessels. In the future, new retinal vessel data will be used to validate our model. The segmentation of retinal vessels can also be extended to 3D imaging to obtain more accurate vessel segmentation results, which would be a very valuable research area.
YAN LV received the B.E. degree from the Xuzhou University of Technology, Xuzhou, China, in 2017. He is currently pursuing the master's degree in control science and engineering with Heilongjiang University. His main research interests are pattern recognition, computer vision, and biomedical image processing.
HUI MA received the Ph.D. degree in pattern recognition and intelligent system from Harbin Engineering University, China, in 2011. Until 2017, she conducted her Postdoctoral Research work in pattern recognition with Heilongjiang University, China, where she is currently an Associate Professor. Her current research interests include image processing, pattern recognition, and machine learning.
JIANIAN LI received the B.E. degree from Harbin Huade University, in 2016. He is currently pursuing the master's degree in control science and engineering with Heilongjiang University. His main research interests include biometric identification, pattern recognition, and deep learning.
SHUANGCAI LIU received the B.E. degree from the Chengdu University of Information Technology, in 2017. He is currently pursuing the master's degree in control science and engineering with Heilongjiang University. His main research interests are pattern recognition, computer vision, and machine learning.