VG-DropDNet: A Robust Architecture for Blood Vessel Segmentation in Retinal Images

Adding layers to the U-Net architecture leads to additional parameters and network complexity. The Visual Geometry Group (VGG) architecture with a 16-layer backbone can mitigate this problem through small convolutions. A Densely Connected Network (DenseNet) can be used to avoid redundant feature learning in VGG by directly connecting each layer to the feature maps of the previous layers, and adding a dropout layer protects DenseNet from overfitting. This study proposes VG-DropDNet, an architecture that combines VGG, DenseNet, and U-Net with a dropout layer for retinal blood vessel segmentation. VG-DropDNet is applied to the Digital Retinal Images for Vessel Extraction (DRIVE) and STructured Analysis of the Retina (STARE) datasets. On DRIVE, it achieves an accuracy of 95.36%, a sensitivity of 79.74%, and a specificity of 97.61%. The F1-score of 0.8144 on DRIVE indicates that VG-DropDNet has high precision and recall, and the IoU of 68.70% shows that the segmented images closely resemble their ground truth. On STARE, the results are excellent: an accuracy of 98.56%, a sensitivity of 91.24%, a specificity of 92.99%, and an IoU of 86.90%, showing that the proposed method is accurate and robust for retinal blood vessel segmentation. The Cohen's Kappa coefficient obtained by VG-DropDNet is 0.8386 on DRIVE and 0.98 on STARE, indicating that its results are consistent and precise on both datasets. Overall, the results on the two datasets indicate that VG-DropDNet is effective, robust, and stable for blood vessel segmentation in retinal images.

In DenseNet, the feature maps from the previous layers are used as input for the next layer. The output feature maps of the next layer have the same number of filters (depth) as the previous feature maps, which reduces the total number of parameters of the DenseNet architecture. Reusing feature maps in DenseNet also prevents the network from over-learning features [12], [17], [20]. DenseNet builds dense blocks using concatenation operations to reuse feature maps from previous layers [20], [21]. A dense block consists of three sequential operations, namely batch normalization, the ReLU activation function, and convolution [20]. Wang et al. used DenseNet for retinal blood vessel segmentation and obtained high accuracy and specificity of 96.09% and 99.04%, respectively, but the sensitivity remained low at 75.39% [20]. Cheng et al. combined the DenseNet and U-Net architectures for retinal blood vessel segmentation and obtained high accuracy and specificity of 95.59% and 98.34%, but low sensitivity [22]. Unfortunately, these studies did not assess F1-score and IoU performance.

The VGG and DenseNet architectures are deep neural networks with many parameters [23]. Architectures with many parameters result in complex models and long computation times [24], which can lead to new problems, including overfitting. Overfitting occurs when the model has a low error on the training data but a large error on data that has never been seen (testing data) [25]. An overfitted model memorizes the patterns in the training data too closely but fails on the test data due to high error, and thus fails to capture the overall trend of the relationship [25]. To overcome complex models and overfitting, dropout techniques can be used [25], [26]. Dropout is a technique to avoid overfitting while accelerating the learning process [25]. It is a regularization strategy for neural networks in which certain neurons are randomly selected and not used during training [27]. These neurons can be discarded at any time: the contributions of the dropped neurons are temporarily suspended, and no new weights are applied to them during training [25]. Removing a neuron from the network in this way deletes it only temporarily. Several studies have shown that adding dropout to DenseNet improves performance in image segmentation. Lee and Lee used a Fully Convolutional DenseNet (FC-DenseNet) with dropout for semantic segmentation and obtained a high accuracy of 91.5%, although the IoU value remained low at 66.9% [26]. Wang et al. used FC-DenseNet, dropout, and scSE modules for pneumothorax (PTX) segmentation and achieved high accuracy, sensitivity, and specificity, with an F1-score greater than 88%, but did not report IoU [21]. Bui et al. used 3D-SkipDenseSeg with dropout on the convolution layers for infant brain MRI segmentation and produced a high F1-score above 90%, although no other performance measure was reported [28].

Gamma correction transforms the brightness of each pixel according to (1):

O = I^γ (1)

where I and O are the input and output images, respectively, with their original brightness values mapped into [0, 1].
γ is the gamma value: if γ < 1 (encoding gamma), the output image looks brighter, while if γ > 1 (decoding gamma), the output image looks darker [31].

Contrast Limited Adaptive Histogram Equalization (CLAHE) is used to improve image quality by employing a clip limit that acts as a contrast limiter. CLAHE divides the input image into several local regions called tiles. The clip limit restricts the maximum value of the histogram to a specified value, and the truncated pixels are redistributed across the histogram within each local region [32]. The number of pixels at each gray level is divided equally, as expressed in (2) [33]:

N_avg = (N_CR−XP × N_CR−YP) / N_gray (2)

where N_avg is the average number of pixels, N_gray is the number of gray levels in the tile, N_CR−XP is the number of pixels along the X-axis of the tile, and N_CR−YP is the number of pixels along the Y-axis of the tile. Equation (3) is used to calculate the clip limit [33]:
N_CL = N_CLIP × N_avg (3)

where N_CL is the clip limit and N_CLIP is the clip limit input value, with a range of 0 to 1.
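As an illustration of these two enhancement steps, the following is a minimal Python sketch using OpenCV; the gamma value, clip limit, and tile grid size shown are assumed example values, not the settings reported in this study.

```python
# Sketch of the contrast-enhancement steps: gamma correction, then CLAHE.
import cv2
import numpy as np

def gamma_correction(gray, gamma=1.2):
    # O = I**gamma with brightness mapped to [0, 1]; gamma > 1 darkens
    normalized = gray.astype(np.float32) / 255.0
    return np.uint8(np.power(normalized, gamma) * 255.0)

def enhance(gray):
    corrected = gamma_correction(gray)
    # CLAHE clips each tile's histogram and redistributes the truncated
    # pixels, as in (2) and (3); note that OpenCV's clipLimit is a
    # multiplier of the average bin height, not the 0-to-1 N_CLIP above
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(corrected)
```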

The bottom-hat transformation combines the closing operation and image subtraction [34]. The closing operation refines objects by connecting adjacent pixel fragments and removing small holes in the center of the image; it is carried out in two stages, namely dilation and erosion.
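A minimal sketch of this step using OpenCV, whose "black-hat" operator is the bottom-hat transformation (closing minus the original image); the 11 × 11 elliptical structuring element is an illustrative assumption.

```python
import cv2

def bottom_hat(enhanced_gray):
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))
    # closing = dilation followed by erosion; subtracting the original
    # image leaves the small dark structures (the vessels) highlighted
    return cv2.morphologyEx(enhanced_gray, cv2.MORPH_BLACKHAT, kernel)
```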
The feature map size produced by a convolution is given by n_out = (n_in + 2P − K) / S + 1, where n_out is the feature map size, n_in is the input height, P is the padding, K is the kernel height, and S is the stride.
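For example, with this formula a 3 × 3 convolution with same padding (P = 1) and stride 1 preserves a 64-pixel input, while a 2 × 2 max pooling with stride 2 halves it:

```python
def conv_output_size(n_in, kernel, padding, stride):
    # n_out = (n_in + 2P - K) / S + 1
    return (n_in + 2 * padding - kernel) // stride + 1

print(conv_output_size(64, kernel=3, padding=1, stride=1))  # 64: same padding preserves size
print(conv_output_size(64, kernel=2, padding=0, stride=2))  # 32: 2x2 max pooling halves it
```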

The output layer uses the sigmoid activation function. The sigmoid function is widely used for outputs with two (binary) classes: where a linear function would return values in the range of negative infinity (−∞) to positive infinity (+∞), the sigmoid maps them into the range (0, 1). Because the sigmoid activation function occupies only the range from 0 to 1 and is asymptotic at both values, it is very useful for binary classification with 0 and 1 as outputs [39].

The final normalization of an input x_(i,j) in one mini-batch is calculated using equation (14) [45], which in its standard form is

x̂_(i,j) = (x_(i,j) − μ_B) / √(σ_B² + ε) (14)

where μ_B and σ_B² are the mean and variance of the mini-batch and ε is a small constant added for numerical stability.

The purpose of the training stage is to produce the best weights to be used in the testing stage. The data used are 20 retinal images from the DRIVE dataset and 20 retinal images from the STARE dataset as training data; the training data are then divided into training and validation data for each dataset. The process that occurs in the training stage can be seen in Fig.2. Based on Fig.2, the original image is enhanced using gamma correction, CLAHE, and the bottom-hat transformation. The circular random patching technique is then applied to augment the data: each pre-processed image is divided into 10,000 small patches of 64 pixels × 64 pixels, so that a total of 200,000 patches is obtained for each of the STARE and DRIVE training sets. The total data is divided into 64% training data, 16% validation data, and 20% testing data for each dataset.
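A minimal sketch of this patching and splitting step, assuming the field-of-view (FOV) masks distributed with the datasets are used to keep patch centers inside the circular retina region (the exact sampling rule is an assumption, as are the placeholder arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

def random_patches(image, fov_mask, n_patches=10_000, size=64):
    h, w = image.shape[:2]
    patches = []
    while len(patches) < n_patches:
        y = int(rng.integers(0, h - size))
        x = int(rng.integers(0, w - size))
        # keep only patches whose center lies inside the circular FOV
        if fov_mask[y + size // 2, x + size // 2]:
            patches.append(image[y:y + size, x:x + size])
    return np.stack(patches)

image = np.zeros((584, 565), dtype=np.uint8)  # placeholder pre-processed DRIVE image
fov_mask = np.ones((584, 565), dtype=bool)    # placeholder FOV mask

# 64% / 16% / 20% split as stated above: hold out 20% for testing,
# then take 20% of the remaining 80% (= 16% overall) for validation.
patches = random_patches(image, fov_mask)
train_val, test = train_test_split(patches, test_size=0.20, random_state=0)
train, val = train_test_split(train_val, test_size=0.20, random_state=0)
```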

The weights from the training process are used in the testing process to see how well the architecture segments blood vessels in retinal images given new input data from the STARE dataset and the DRIVE test data. New data here means data that has never been used in the training process. For each dataset, 20 images are used. An illustration of the process carried out in the proposed method can be seen in Fig.3.

Based on Fig.3, the input image first undergoes the ordered patching technique and the same enhancements as in the training process. After that, the weights from the training results are used for the testing process. Because the results of the testing process are still in the form of small image patches, a reconstruction step is carried out. The results of the blood vessel segmentation are then compared with the ground truth of both datasets to measure the performance of the proposed VG-DropDNet architecture.
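A minimal sketch of the ordered patching and reconstruction at test time, assuming non-overlapping row-major 64 × 64 tiles and zero-padding to a multiple of the patch size (details not specified in the text); `model` stands for the trained VG-DropDNet:

```python
import numpy as np

def segment_full_image(model, image, size=64):
    h, w = image.shape[:2]
    pad_h, pad_w = -h % size, -w % size              # pad up to multiples of 64
    padded = np.pad(image, ((0, pad_h), (0, pad_w)))
    out = np.zeros(padded.shape, dtype=np.float32)
    for y in range(0, padded.shape[0], size):        # ordered, row-major tiling
        for x in range(0, padded.shape[1], size):
            tile = padded[y:y + size, x:x + size][None, ..., None]
            out[y:y + size, x:x + size] = model.predict(tile, verbose=0)[0, ..., 0]
    return (out[:h, :w] > 0.5).astype(np.uint8)      # crop back and binarize
```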

In blood vessel segmentation, each pixel is classified into two categories, namely blood vessels and non-vessels (background). Four outcomes can be read from the confusion matrix, namely TP, FN, FP, and TN, where a true positive (TP) is a correctly classified blood vessel pixel. This study evaluates the performance of the proposed architecture using confusion-matrix-based measures, including accuracy (Acc), sensitivity, specificity, and IoU. The accuracy measures the overall correctness of an architecture [66]. However, accuracy is not always suitable for measuring model performance, especially on unbalanced data; the class imbalance between blood vessel pixels and background pixels in the dataset requires other performance measures [56]. Sensitivity assesses performance in predicting blood vessel pixels [66]. Specificity assesses performance in predicting background pixels [56]. IoU measures the similarity between the predicted segmentation image and the ground truth image [55]. Performance measures suitable for unbalanced pixel classes are the F1-score, G-mean, and Matthews Correlation Coefficient (MCC) [75]. The F1-score assesses performance on unbalanced datasets and is widely used in segmentation [56]. The G-mean is a statistical measure of the balance between blood vessel pixels and the background [59]. The MCC is the correlation between the predicted pixels and the ground truth [75]. The Cohen's Kappa coefficient measures the intensity of the agreement between the segmentation result and the ground truth.

The data used in this study are BGR images from the DRIVE and STARE datasets, with dimensions of 565 pixels × 584 pixels for DRIVE and 700 pixels × 605 pixels for STARE. The image enhancement process can be seen in Fig.5. As shown in Fig.5, the input data is a BGR image, so it is first converted to an RGB image. Furthermore, to simplify processing, the image is converted to grayscale. After that, the image contrast is improved using gamma correction and CLAHE so that the retinal blood vessel network looks clearer. Because the image from the contrast enhancement process still contains noise that can interfere with segmentation, the bottom-hat transformation is applied.

The VG-DropDNet architecture is proposed as a new architecture that combines the advantages of the VGG, DenseNet, and dropout architectures to overcome the limitations of U-Net, so that the proposed architecture is robust and has stronger performance for blood vessel segmentation in retinal images.

In Fig.6, it can be seen that the architecture adds connecting lines between the encoder path and the decoder path. The number of max-pooling layers used in this architecture is 5, so the input image should have a size of 64 pixels × 64 pixels. This size is used because at each max-pooling layer the image size is halved. If a 32 pixels × 32 pixels input size were used, the feature size in the last layer would shrink to zero, so the features could not be used as input for the connecting lines.
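A quick check of this size argument: halving the spatial size five times takes a 64 × 64 input to 2 × 2 at the deepest level, while a 32 × 32 input would collapse. The filter counts below follow the VGG16 progression and are an assumption here.

```python
size = 64
for filters in (64, 128, 256, 512, 512):
    size //= 2  # each 2x2 max pool halves the spatial size
    print(f'{filters:>3} filters -> {size}x{size} feature map')
# prints 32x32, 16x16, 8x8, 4x4, and finally 2x2
```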

The bridge path is based on the DenseNet architecture and uses three dense blocks, with the expectation that more features will be obtained. However, DenseNet's concatenation could cause the model to overfit; to overcome this, a dropout layer is added to each dense block. In the first and second dense blocks, the convolution operation and the ReLU activation function are performed twice using a 3 × 3 kernel, same padding, and 512 filters. After that, the convolved feature map enters the dropout layer with a drop rate of 0.5, so that 50% of the feature map is temporarily omitted; the values of the omitted feature map entries are set to 0 to avoid overfitting during training. The third dense block takes as input the concatenation of the outputs of the first and second dense blocks. An example of a simple calculation in the dropout layer can be seen in Fig.8, and a minimal code sketch of the dense block is given at the end of this discussion.

In Fig.11(a), it can be seen that the error (loss) curves of the VG-DropDNet architecture for the training and validation data on the DRIVE dataset decrease steadily up to the 50th epoch, with the loss value close to 0. The loss curves in Fig.11(a) show no overfitting on the DRIVE dataset: the loss values of the training data and validation data have a similar distribution, which indicates that the model trained on the DRIVE dataset is able to recognize patterns in data it has never been trained on. On the STARE dataset, Fig.11(b) likewise shows that the loss curves of the VG-DropDNet architecture for the training and validation data decrease up to the 50th epoch, with the training loss on STARE close to 0.2. Training on the STARE dataset also shows no overfitting between training and validation data, indicating that the model trained on STARE predicts well on other image data. The accuracy curves for the training and validation data can be seen in Fig.12, with the DRIVE and STARE datasets shown in Fig.12(a) and Fig.12(b), respectively. From Fig.12, it can be seen that the accuracy increases with the number of epochs up to the 50th epoch. The accuracy curves on the DRIVE and STARE datasets show no overfitting because the training and validation accuracies have a similar distribution.

The comparison between the segmentation results and the ground truth can be seen in Table 1. The proposed architecture successfully segments the blood vessels in the retinal images compared to the ground truth images, but some thin retinal vessels present in the ground truth are not detected in the segmented images, as in files Test_07, Im0324, Test_03, Im0003, and Im0004.
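The following is a minimal Keras sketch of the bridge's dense blocks with dropout, using only the details given above (two 3 × 3 convolutions with ReLU, 512 filters, same padding, a drop rate of 0.5, and concatenation feeding the third block); any wiring beyond those stated details is an assumption.

```python
from tensorflow.keras import Input, Model, layers

def dense_block(x, filters=512, drop_rate=0.5):
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    # 50% of the feature map is temporarily zeroed during training
    return layers.Dropout(drop_rate)(x)

encoder_output = Input(shape=(2, 2, 512))        # 64x64 input after 5 max pools
block1 = dense_block(encoder_output)
block2 = dense_block(block1)
merged = layers.Concatenate()([block1, block2])  # DenseNet-style feature reuse
bridge = dense_block(merged)
model = Model(encoder_output, bridge)
```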

The performance values are obtained by comparing the segmentation results at the testing stage with the ground truth images using a confusion matrix. The Area Under the Curve (AUC) summarizes the ROC curve as a number between 0 and 1 to measure performance in binary classification; the closer the AUC value is to one, the better the model works. Fig.13 shows the ROC curves and AUC values for DRIVE and STARE, with the AUC calculated from the ROC. From Fig.13(a), it can be seen that the ROC curve on the DRIVE dataset is closer to the diagonal line of the random classifier than the curve for the STARE dataset in Fig.13(b). As a result, the AUC value on DRIVE (0.9309) is smaller than on STARE (0.9696). The higher AUC on STARE means that the model works better on that dataset. STARE has a larger image size, so the patching process performed on STARE provides a greater opportunity to obtain more and clearer blood vessel features from the retinal images, which can explain why the performance on STARE is better than on DRIVE. Fig.13(a) shows the ROC and AUC of VG-DropDNet on the DRIVE dataset, and Fig.13(b) shows the ROC and AUC of VG-DropDNet on the STARE dataset.
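For reference, a hedged sketch of the pixel-wise evaluation described in this section, computed with scikit-learn from the sigmoid probabilities and the ground truth; the variable names and the 0.5 binarization threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, f1_score,
                             jaccard_score, matthews_corrcoef,
                             cohen_kappa_score, roc_auc_score)

def evaluate(probs, truth, threshold=0.5):
    y_true = truth.ravel().astype(int)
    y_prob = probs.ravel()
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sen = tp / (tp + fn)                   # sensitivity: vessel pixels found
    spe = tn / (tn + fp)                   # specificity: background pixels found
    return {
        'accuracy':    accuracy_score(y_true, y_pred),
        'sensitivity': sen,
        'specificity': spe,
        'iou':         jaccard_score(y_true, y_pred),
        'f1':          f1_score(y_true, y_pred),
        'g_mean':      np.sqrt(sen * spe),  # balance between the two classes
        'mcc':         matthews_corrcoef(y_true, y_pred),
        'kappa':       cohen_kappa_score(y_true, y_pred),
        'auc':         roc_auc_score(y_true, y_prob),  # area under the ROC curve
    }
```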

The VG-DropDNet architecture is used to segment retinal blood vessels in this work. The confusion matrix is used to calculate the architecture's performance in terms of accuracy, sensitivity, specificity, and IoU. The comparison of the segmentation results of this study with other studies on the DRIVE dataset is shown in Table 3, which contains results from the last five years of research using the DRIVE dataset for blood vessel segmentation on retinal images. Table 4 compares results from the last four years of studies using the STARE dataset for blood vessel segmentation on retinal images. As shown in Table 4, the proposed method achieves the highest accuracy; in terms of sensitivity and specificity, the proposed method also outperforms the other methods. The number of blood vessel and non-vessel (background) pixels in the images is not balanced. Statistical performance measures for unbalanced data are the MCC, G-mean, and F1-score: the MCC measures the correlation between the two classes, and the Geometric Mean (G-mean) measures the balance between the majority and minority classes.

The F1-score measures the balance between precision and recall of the classification. The comparison results are shown in Table 5 and Table 6. In Table 5, the highest MCC and F1-score for DRIVE are obtained by Nagdeote and Prabhu [75]; however, the highest G-mean is obtained by the VG-DropDNet architecture. The MCC of the VG-DropDNet architecture is higher than the results of Zhuo et al. [72] and Orlando et al. [71].

Although the MCC is lower than the results of Nagdeote and Prabhu [75] and Guo et al. [73], the VG-DropDNet architecture on DRIVE achieves a good balance between the blood vessel and background classes.

Based on the results and discussion, it can be concluded that the VG-DropDNet architecture provides excellent performance for blood vessel segmentation in retinal images, with an accuracy above 90%, and it can be recommended for blood vessel segmentation. The results of the VG-DropDNet architecture on the DRIVE dataset are strong even though the IoU obtained is still below 70%; IoU is still rarely reported in other studies, and the comparison in Table 3 shows that the IoU on DRIVE is the highest reported result. On the STARE dataset, the VG-DropDNet architecture gives excellent results, with all performance measures above 86%. The Cohen's Kappa coefficient obtained by the VG-DropDNet architecture on both the DRIVE and STARE datasets is above 0.8, which indicates that the blood vessel segmentations generated by VG-DropDNet have a strong intensity of agreement with the ground truth images provided by experts in the DRIVE and STARE datasets. The VG-DropDNet architecture also provides a good balance for the segmentation of blood vessels in retinal images, as seen from the MCC and G-mean values close to 1. The segmentation images generated by VG-DropDNet are accurate and valid based on the performance results for accuracy, sensitivity, specificity, IoU, MCC, G-mean, and F1-score. The segmented images from VG-DropDNet are binary images containing only the retinal blood vessels; they can be used for the classification of retinal disorders that require examination of abnormalities in the retinal blood vessels, and can thus support medical applications that detect retinal disorders automatically.

For future work, there are two focuses. The first is improving and merging the VG-DropDNet architecture with other deep learning architectures and different pre-processing stages to improve the performance results that are not yet optimal. The second is using the segmentation images from VG-DropDNet for the classification of retinal disorders based on retinal blood vessel abnormalities, so that this research can be applied in real medical practice. Despite the success of VG-DropDNet in separating lesions from blood vessels, there are cases where the architecture does not perform well, especially on retinal images with severe diabetic retinopathy involving very large lesions or high vascular tortuosity. To overcome these limitations, learned morphological operation parameters could be used instead of predefined transformations to separate the different components of the retinal image at the post-processing stage; by learning the morphological operations of the image, complex components of retinal images can be captured well.