U-net Based Method for Automatic Hard Exudates Segmentation in Fundus Images Using Inception Module and Residual Connection

Diabetic retinopathy (DR) is an eye abnormality caused by chronic diabetes that affects patients worldwide. Hard exudate is an important and observable sign of DR and can be used for early diagnosis. In this paper, an automatic hard exudate segmentation method is proposed to help ophthalmologists diagnose DR at an early stage. We utilize the SLIC superpixel algorithm to generate sample patches, thus overcoming the difficulty of the limited and imbalanced dataset. Furthermore, a U-Net based network architecture with inception modules and residual connections is proposed to conduct end-to-end hard exudate segmentation, and focal loss is utilized as the loss function. Extensive experiments have been conducted on the IDRiD dataset to evaluate the performance of the proposed method. The reported sensitivity, specificity, and accuracy reach 96.38%, 97.14%, and 97.95% respectively, which demonstrates the effectiveness and superiority of our method. The achieved segmentation results indicate the potential of the method for clinical diagnosis.


I. INTRODUCTION
Diabetic Retinopathy (DR) is a serious ocular abnormality associated with chronic diabetes. Patients suffering from it will possibly lose their sight gradually and even go blind [1]. Although DR is treatable with timely diagnosis and intervention, the symptoms of vision impairment can be easily overlooked in the early stage of the disease. Thus, it is necessary to arrange regular examinations for diabetics to delay or relieve the risk of blindness. However, the limited number of clinicians currently is far from enough for the diagnosis of the large number of patients, as there have been more than 400 million diabetics all over the world [2]. Therefore, an automatic diagnosis technology needs to be developed to aid medical specialists.
Hard exudate (HE) is regarded as one of the most prominent features caused by DR. It is formed by macromolecular substances leaking from blood vessels into the eyeball after the retinal vessels are ruptured. As shown in Figure 1, HE can be observed as bright spots or clumps with sharp edges in the fundus image. Nevertheless, various factors such as uneven illumination and equipment noise constrain the accuracy and effectiveness of DR diagnosis through raw fundus images [3], [4]. Automatic detection of HE is therefore a useful and necessary auxiliary diagnostic approach for DR. Traditional image processing approaches to HE detection fall into four main categories: threshold-based [5]-[8], cluster-based [9]-[11], morphological-based [12]-[17], and region-growing [18]-[22]. Machine learning methods have also been introduced to detect HE [23]-[28], where professional knowledge is required to design hand-crafted features. These methods require exhausting work and complicated parameter settings, yet they commonly fail to generalize well. Without the need for expert knowledge, deep learning methods, especially convolutional neural networks (CNNs) [29]-[32] and U-Net [33]-[35], have been widely explored in recent years. Compared to traditional methods, detection and segmentation based on deep learning perform better in generalization ability and robustness. Despite this, CNN-based approaches fail to achieve satisfactory efficiency because of their large time and computational cost. In addition, though faster and more efficient than CNNs, current U-Net based methods have not yet matched the results of state-of-the-art methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In order to solve the above-mentioned problems and develop a more efficient and accurate method for HE segmentation, we propose a novel U-Net based network architecture to implement end-to-end segmentation. As for the difficulty of limited and imbalanced training data, a SLIC superpixel algorithm is applied to generate sample patches to enlarge the dataset, with the relationship among adjacent pixels preserved.
The proposed method consists of the following three steps. After image preprocessing, the SLIC superpixel algorithm is applied to segment the images, and then sample patches are generated based on the superpixel. Then, the network architecture is proposed to predict the segmentation results of each patch. Finally, the neighboring prediction results of patches are spliced together to complete images for the final results. Extensive experiments are conducted on the publicly available dataset IDRiD [36] to verify the performance of the proposed method.
The main contributions of our work can be summarized as follows: (1) superpixels are utilized to cluster the images by treating image pixels with regionally similar characteristics as natural entities. Hence, patches extracted on this basis carry more contextual information and are beneficial for further segmentation; (2) the proportion of sampled patches is carefully chosen and the focal loss function is applied, both of which help to overcome the imbalance of the dataset; (3) the proposed network architecture with inception modules and residual connections is capable of extracting multi-scale features and combining low-level and high-level features, thus achieving better performance.
The rest of this paper is organized as follows: Section II analyzes the advantages and disadvantages of related research. Section III presents the proposed method in detail. Section IV validates the performance of our method with extensive experiments. Finally, the whole process and future work are concluded in Section V.

II. RELATED WORK
Automatic detection and segmentation of hard exudates have been studied before. The developed methods include traditional methods and deep learning based methods.

A. TRADITIONAL METHODS
In threshold-based segmentation, local or global grayscale is regarded as the dominant characteristic. García et al. [8] first applied global and adaptive thresholds simultaneously to segment candidate regions, and then employed a series of features and radial basis functions (RBF) to classify the true regions. However, it is difficult to select an appropriate and accurate threshold because the brightness and contrast of the image are not consistent.
Clustering-based methods first partition the image according to given rules, then repeatedly recompute the within-cluster distances and update the cluster centers until every center converges. Osareh et al. [10] adopted a fuzzy C-means (FCM) clustering algorithm to separate the exudates from the background. These algorithms are usually sensitive to noise and computationally intensive. Also, the choice of the initial centers is important but difficult, as the location and characteristics of the class centers are unknown.
Morphological approaches identify exudates using extracted brightness and grayscale characteristics. Harangi and Hajdu [16] integrated mathematical morphology and active contours into a novel framework to segment exudates from retinal images. Zhang et al. [17] selected lesion candidates using morphological operators for subsequent detection, in which multi-feature classification was applied. Although these methods are computationally fast and effective, they do not take other characteristics into consideration, which results in high noise sensitivity.
Segmentation based on region growing has proved feasible, especially when combined with an artificial neural network [19]. In these methods, spatial grayscale contiguity is used as the feature for segmentation [20]. Additionally, edge detection can be employed at the same time to refine the extraction of hard exudates [21]. Nevertheless, region growing tends to over-segment, and the algorithm is relatively time-consuming.
Giancardo et al. [24] extracted feature vectors based on color, wavelets and exudate probability, so that diabetic macular edema can be automatically diagnosed according to exudates. In [25], Sánchez et al. improved Fisher's linear discriminant analysis using color features for classification. Besides, Harangi et al. [27] selected the most appropriate descriptors out of more than 50 other ones to train a boosted naive Bayes classifier.
These methods usually share a common idea that multidimensional feature vectors are constructed for each pixel or clustering of pixels. However, these feature vectors, which are often built on the basis of color, shape, size, and any other relevant information, require exhausting work and complicated parameter settings. In addition, poor generalization poses another challenge.

B. DEEP LEARNING METHODS
Deep learning methods, especially convolutional neural networks (CNNs) [29]- [32] and U-net [33]- [35], have been applied for exudates detection. They have been widely adopted in recent years because they do not require hand-crafted features.
Prentašić and Lončarić [29] built an 11-layer neural network model and used the output to generate an exudate probability map. Then, the outputs of anatomical landmark detection were incorporated to optimize the exudate probability map. Gondal et al. [30] realized both image-level and lesion-level detection by improving the well-performing o_O CNN architecture [31], removing the dense layers and adding a global average pooling (GAP) layer to the traditional CNN architecture. Also employing the o_O architecture, Quellec et al. [32] introduced a CNN visualization based solution where optimized heatmaps were produced for more accurate CNN predictions.
Although CNN based methods have achieved high accuracy, most of them use the sliding-window algorithm to overcome the difficulty of limited and imbalanced training data, which makes them inefficient. U-Net [33] is a commonly used backbone network in medical imaging. It consists of contracting and expanding units, which can capture both local and global context information of the images. Zabihollahy et al. [34] used a modified U-Net to conduct segmentation and detection of the hard exudates and removed the optic disc for better results. To make full use of detailed information and context perception simultaneously, Yan et al. [35] combined local and global U-Net decoders so that the two streams could be enhanced mutually. Although methods based on U-Net are fast and efficient, their segmentation results leave much room for improvement compared to the traditional state-of-the-art methods.
In this paper, a U-Net backbone is adopted to implement end-to-end segmentation, which largely reduces the time consumption, and the architecture is designed for better performance. In addition, superpixels are applied to extract data samples, which overcomes the difficulty of the small and imbalanced dataset.

III. PROPOSED METHOD
A. PREPROCESSING
Since we are only interested in the retinal fundus, the redundant black background around fundus images needs to be cropped. Given that the image width is w, we select the largest pixel value from the w/64 leftmost and the w/16 rightmost parts of the image, and add 10 to this value to obtain a threshold. Regions with pixel values larger than the threshold are regarded as the foreground, and then the minimum bounding rectangle (MBR) of the foreground area can be obtained. The area inside the MBR is cropped from the original image for further processing. After cropping, the size of the images is approximately 2848 × 3400. While the red channel is more saturated and the blue channel is darker, only the green channel [37] is adopted because it shows the highest contrast between the blood vessels and the background, which helps to reduce the interference of blood vessels in the process of classification [38]. To enhance the contrast between exudates and non-exudates, an image enhancement algorithm named Contrast Limited AHE (CLAHE) [39], which is a variant of adaptive histogram equalization (AHE), is adopted to reallocate lightness values with a clip limit of 8.0 and a grid size of 8 × 8. Subsequently, we apply gamma correction to compensate for the loss of brightness caused by uneven illumination and to enhance the contrast. For a given input image I, the formulation is given by
$$f(I) = I^{\gamma}$$
where the gray-level coefficient γ is set to 1/2.2 and f(I) is the output image. Figure 2 shows four different stages during the image preprocessing. It can be seen that after the preprocessing, the image contrast between background and hard exudates is more obvious, which is beneficial for further segmentation.
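The border cropping and gamma correction steps described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: it assumes a single-channel 8-bit image held in a NumPy array, normalises intensities to [0, 1] before applying the power law, and leaves out CLAHE. The function names `crop_black_border` and `gamma_correct` are hypothetical.

```python
import numpy as np


def crop_black_border(gray, offset=10):
    """Crop a 2-D uint8 image to the bounding rectangle of its foreground.

    The threshold is the largest value found in the leftmost w/64 and
    rightmost w/16 columns, plus `offset` (10 in the text).
    """
    h, w = gray.shape
    left_strip = gray[:, : max(1, w // 64)]
    right_strip = gray[:, w - max(1, w // 16):]
    threshold = int(max(left_strip.max(), right_strip.max())) + offset

    rows, cols = np.nonzero(gray > threshold)
    if rows.size == 0:  # no foreground found: return the image unchanged
        return gray
    return gray[rows.min(): rows.max() + 1, cols.min(): cols.max() + 1]


def gamma_correct(channel, gamma=1 / 2.2):
    """Apply f(I) = I ** gamma to a uint8 channel.

    Scaling to [0, 1] first is an assumption; the text only gives gamma.
    """
    normalised = channel.astype(np.float64) / 255.0
    return np.round(255.0 * normalised ** gamma).astype(np.uint8)
```

With γ = 1/2.2 < 1, mid-range intensities are brightened while 0 and 255 stay fixed, which matches the stated goal of compensating for lost brightness.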

B. SAMPLING
In order to overcome the difficulty of the small dataset, sampling is performed on the original images to obtain more patches as samples. At the same time, we utilize the SLIC [40] superpixel algorithm to guide the patch generation process, instead of directly cropping rectangular patches from the original images. The term superpixel was proposed by Ren and Malik [41] to describe the aggregation of individual pixels that share similar low-level features, such as color, brightness, and texture [40]. It regards pixels as natural entities regionally so that the connection relationships within images can be preserved for further patch extraction.
We adopt the Simple Linear Iterative Clustering (SLIC) algorithm proposed by Achanta et al. [44], which has better performance than previous approaches [41]-[43] in adhering to image boundaries, speed, and memory efficiency. Given a customized region size and regularity, the SLIC algorithm can generate uniform and compact superpixels with similar features and clear boundaries. Specifically, a larger region size results in larger superpixels, and the compactness coefficient relates to the regularity of superpixels. In our experiment, the number of segments and the compactness coefficient are set to 150 and 0.5, respectively, which suits the patch size. As shown in Figure 3, the yellow polygons are superpixel subregions obtained after segmentation through the SLIC algorithm. If a superpixel $X_i$, $i \in (1, N)$, contains hard exudate pixels, we regard $X_i$ as a hard-exudate superpixel. Otherwise, $X_i$ is treated as non-exudate background.
In this way, original images are represented in the superpixel level, which takes the information of adjacent regions into consideration. Therefore, the region of each extracted patch possesses similar features, which provides a conducive foundation for further hard exudates segmentation due to the clustering effect [45].
Although the superpixel can effectively interpret and represent image information, its irregular shape prevents it from being fed directly to the network. To overcome this problem, we generate sample patches with the superpixel centroids as the centers of the generated patches. Thus, a set of N patches $Y = (Y_1, Y_2, \ldots, Y_i, \ldots, Y_N)$ is obtained based on the superpixels $X = (X_1, X_2, \ldots, X_i, \ldots, X_N)$, $i \in (1, N)$, where each image patch $Y_i \in \mathbb{R}^{h \times w}$ is a rectangle of height h and width w. Both h and w are set to 256, and therefore the size of each patch is 256 × 256.
As illustrated in Figure 3, the red rectangle is one of the patches obtained based on the superpixels outlined in yellow. Similarly, the patches are classified into ones with and without hard exudates, as shown in Figure 4. If a superpixel $X_i$ contains hard exudate pixels, the relevant patch $Y_i$ will be considered a hard exudate patch, and vice versa. Finally, a total of 900 patches are generated for training, and in order to avoid the imbalance between positive and negative samples, the proportion of hard exudate patches to background ones is set to 2:1.
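As a rough illustration of centroid-based patch generation, the sketch below takes a precomputed superpixel label map (for example, one produced by an SLIC implementation) and cuts a fixed-size window around each superpixel centroid. The function name `centroid_patches` is hypothetical, and shifting windows inward at the image border is an assumption, since the text does not say how border superpixels are handled.

```python
import numpy as np


def centroid_patches(image, labels, size=256):
    """Return {label: patch}, each patch centred on a superpixel centroid.

    `image` and `labels` are 2-D arrays of the same shape, both at least
    `size` pixels in each dimension. Windows that would cross the image
    border are shifted inward (an assumption, not stated in the text).
    """
    h, w = labels.shape
    half = size // 2
    patches = {}
    for label in np.unique(labels):
        rows, cols = np.nonzero(labels == label)
        cy, cx = int(rows.mean()), int(cols.mean())  # superpixel centroid
        top = min(max(cy - half, 0), h - size)
        left = min(max(cx - half, 0), w - size)
        patches[int(label)] = image[top: top + size, left: left + size]
    return patches
```

Every returned patch has the same rectangular shape regardless of how irregular its superpixel is, which is what lets the patches be batched as network input.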
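The 2:1 sampling described above works out to 600 hard exudate patches and 300 background patches out of 900. A minimal sketch of such a draw is given below; the function name `sample_patches`, the use of uniform random sampling without replacement, and the fixed seed are all assumptions made for illustration.

```python
import random


def sample_patches(he_patches, bg_patches, total=900, ratio=(2, 1), seed=0):
    """Draw `total` patches with a fixed hard-exudate : background ratio."""
    n_he = total * ratio[0] // sum(ratio)
    n_bg = total - n_he
    if n_he > len(he_patches) or n_bg > len(bg_patches):
        raise ValueError("not enough candidate patches for the requested ratio")
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return rng.sample(he_patches, n_he), rng.sample(bg_patches, n_bg)
```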

C. NETWORK ARCHITECTURE
In this subsection, we propose a network architecture with U-Net [33] as the backbone for the segmentation of hard exudates, which is shown in Figure 5. It takes the generated patches as input and outputs binary images as segmentation results. Both the input and output images are of size 256 × 256.
The proposed 9-unit network architecture is composed of a contracting path and an expansive path. The former path extracts features of the input patches, and then the latter path carries out the process of up-convolution. Between the contracting path and the expansive path, skip connections serve as bridges for information propagation. These skip connections are capable of combining low-level details with high-level semantic information. The contracting path consists of five basic units, where a max-pooling layer is inserted between every two units along the path to down-sample the feature map. Correspondingly, there are four basic units in the expansive path with an up-sampling layer and a concatenation layer before each unit, thus making use of feature maps from both the lower level and the relevant contracting path. In the end, we use a 1 × 1 convolution layer with a sigmoid activation to generate the output prediction results.
Each of the abovementioned units in both paths is constructed as a residual unit with an inception [46] module, as shown in Figure 6(b). It consists of an inception module instead of a common convolution module, followed by batch normalization (BN) [47] and LeakyReLU activation [48]. By using LeakyReLU as the activation function, the network converges at an earlier stage. In addition, borrowing from ResNet [49], we add an identity mapping to the unit. Figure 6(a) shows a residual unit with plain convolution. While it is effective in most cases, the severely uneven and imbalanced distribution of hard exudates in the patches makes it necessary to extract features at different scales. Therefore, the convolution operation in U-Net is substituted by an inception module, which is shown in Figure 6(b), with its internal structure shown in Figure 7. A 1 × 3, a 3 × 1, and two 3 × 3 filter kernels are designed to extract multi-scale features. These features are then combined, and a 1 × 1 convolution kernel is used for dimension reduction. The inception module is not only able to efficiently reduce the number of parameters [50], but also capable of increasing the expression ability of the network by introducing more linear mappings. Moreover, the direct connection of the original U-Net is replaced by the residual connection in the proposed architecture. As illustrated in Figure 6, we denote the input of the residual unit as $X_i$ and the residual function as $f(\cdot)$. Therefore, the output of the residual unit is $f(X_i) + X_i$ after the addition.
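To make the layout of the nine units concrete, the following sketch computes the spatial size of the feature map at each unit. It assumes 2 × 2 max-pooling with stride 2 and size-preserving convolutions inside each unit, neither of which is stated explicitly; the assumption is consistent with the input and output both being 256 × 256. The function name `unet_feature_sizes` is hypothetical.

```python
def unet_feature_sizes(input_size=256, contracting_units=5, pool=2):
    """Spatial side length of the feature map at each unit of the two paths.

    One pooling step sits between consecutive contracting units, and each
    expansive unit is preceded by an up-sampling step that undoes one pooling.
    """
    contracting = [input_size // pool ** i for i in range(contracting_units)]
    # The expansive path mirrors the contracting path, excluding the bottleneck.
    expansive = contracting[-2::-1]
    return contracting, expansive
```

Under these assumptions the contracting path runs 256, 128, 64, 32, 16 and the expansive path runs 32, 64, 128, 256, so each expansive unit can be concatenated with the contracting unit of the same size.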
In addition, dropout [51] is utilized between the two 3 × 3 filter kernels to avoid overfitting, and the dropout rate is set to 0.5.
The imbalance of the data consists of two aspects. First, most of the sampled patches only contain background, and patches containing hard exudates are a small minority. As described before, this can be solved by choosing appropriate proportions of samples. The second aspect comes from the imbalanced distribution of hard exudates within each patch that contains them. In most of these patches, hard exudates account for only a small proportion of the pixels, mostly less than 10%, which makes accurate segmentation even more difficult. In order to solve this problem, Focal Loss (FL) [52] is introduced in our proposed network as the loss function, which is shown below:
$$FL(p_s) = -w_s (1 - p_s)^{\alpha} \log(p_s)$$
where
$$p_s = \begin{cases} p, & s = 1 \\ 1 - p, & s = 0, \end{cases}$$
and p is defined as the estimated probability for the class with label s = 1. $w_s$ is derived by
$$w_s = \begin{cases} w, & s = 1 \\ 1 - w, & s = 0, \end{cases}$$
with w dealing with the weight of positive and negative samples. The variable s in the above two formulas denotes the type of ground-truth class, for which s = 1 means that the class is an easily classified sample, and otherwise it is difficult to classify. The exponent α (α ≥ 0) works as a tunable focusing parameter to reduce the share of the loss dominated by a large number of simple samples. Compared with the cross-entropy (CE), the focal loss tackles the imbalance between not only positive and negative samples but also hard and simple examples, and is thus more suitable for our scenario.
Finally, we apply the Adam (adaptive moment estimation) optimizer [53] to train the network, with a learning rate of 1e-4. The network takes 800 epochs to converge.
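A per-pixel version of the focal loss can be sketched as below, assuming it takes the usual form FL = -w_s (1 - p_s)^α log(p_s), with p_s = p and w_s = w for s = 1, and p_s = 1 - p and w_s = 1 - w for s = 0. The default values α = 2 and w = 0.25 are placeholders, not settings reported in this paper, and the clipping constant `eps` is added only for numerical safety.

```python
import math


def focal_loss(p, s, alpha=2.0, w=0.25, eps=1e-7):
    """Focal loss for one pixel.

    p: predicted probability of the class labelled s = 1.
    s: ground-truth label, 0 or 1.
    alpha: focusing parameter (alpha >= 0); w: class weight.
    alpha and w defaults are illustrative assumptions.
    """
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    p_s = p if s == 1 else 1.0 - p
    w_s = w if s == 1 else 1.0 - w
    return -w_s * (1.0 - p_s) ** alpha * math.log(p_s)
```

Setting α = 0 reduces the expression to a class-weighted cross-entropy, while a larger α shrinks the contribution of pixels that are already classified with high confidence.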

IV. EXPERIMENTS
A. DATASET
IDRiD (Indian Diabetic Retinopathy Image Dataset) [36] is a public dataset available online. This dataset provides pixel-level annotations of typical diabetic retinopathy lesions and normal retinal structures. A total of 81 images are labeled with pixel-level ground truths, among which 54 images form the training set and 27 images form the testing set. Images in this dataset have a resolution of 4288 × 2848 pixels and a 50° field of view.

B. EVALUATION METRICS
Metrics including sensitivity, specificity, and accuracy are adopted to evaluate the performance of the proposed method. The related formulas are defined as follows:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP (True Positive) and TN (True Negative) represent the numbers of correctly detected hard exudate and non-exudate background pixels respectively, while FP (False Positive) and FN (False Negative) denote the numbers of pixels wrongly detected as exudates and as background.
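For reference, the three metrics can be computed from flattened binary masks as in the short sketch below, using the standard definitions sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + TN + FP + FN). The function name `segmentation_metrics` is hypothetical, and returning 0.0 when a denominator is empty is a convention chosen here, not one taken from the paper.

```python
def segmentation_metrics(pred, truth):
    """Pixel-level sensitivity, specificity, and accuracy.

    `pred` and `truth` are equal-length sequences of 0/1 labels
    (1 = hard exudate, 0 = background).
    """
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return sensitivity, specificity, accuracy
```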

C. RESULTS AND ANALYSIS
The training and testing process of our experiment is described in Figure 8. After preprocessing and sampling, we generate 900 sample patches of size 256 × 256 for training through the SLIC superpixel algorithm. Then we classify them into hard exudate patches and non-exudate background patches, which are used to train the network. For testing, in order to obtain the final segmentation results, we splice the neighboring patch predictions together into complete images, during which we take the average value of overlapped areas and set the threshold value to 0.5. The first experiment compares patches that are randomly cropped with patches generated using the SLIC superpixel algorithm. Figure 9(b) is the enlarged image of the blue rectangle in Figure 9(a), where the red rectangle is a randomly cropped patch and the pink patch is generated through the above-mentioned method. Random crops fail to consider the contextual information of pixels, and may introduce irrelevant parts in the same patch, thus undermining the segmentation performance. On the contrary, patches extracted based on superpixels possess the necessary information for better segmentation, because the SLIC algorithm has clustered the regions with similar features into superpixels. Table 1 shows the experimental results based on different patch generation methods.
In terms of the size of the patches, we first choose a possible range of patch sizes, and then carry out experiments to determine the best one. As depicted in the four subfigures of Figure 10, we mark a patch containing exudates and a background patch with red and blue rectangles respectively. The corresponding performance is given in Table 2, where we can see that the segmentation results improve when the size increases from 128 × 128 to 256 × 256, but begin to fall gradually when the size increases further to 360 × 360 and 480 × 480. The patch size of 256 × 256 performs best because it is large enough to include the essential information for segmentation but not so large as to introduce unnecessary interference.
In addition, the time consumption of the SLIC segmentation algorithm per image for different patch sizes is shown in Figure 11, measured with Python 3.6 and the skimage module on a desktop with a 2.60 GHz CPU and 16 GB of RAM. It can be seen that the time consumption varies only slightly, which means that different patch sizes do not influence the efficiency of the sampling process.
Experiments regarding the proportion of hard exudate patches (HE) to background patches (BG) have been carried out to overcome the imbalance of data between positive and negative samples. As can be seen from Table 3, when the proportion of HE to BG is 2:1, the sensitivity and accuracy are the highest, giving the best segmentation result. When the proportion is 1:2, the patches that only contain background are relatively excessive, which leads to lower accuracy and sensitivity. In addition, when the proportion is 1:1, the same number of hard exudate and background patches are sampled. However, this makes training less targeted, and therefore weakens the network's ability to distinguish the hard exudates from interfering structures, such as the optic disc and vessels. Hence, the ratio of 2:1 is chosen for its highest sensitivity and accuracy.
Meanwhile, we have conducted experiments to compare the performance of different network architectures and demonstrate the effectiveness of the proposed network. As can be seen in Table 4, whether U-Net is combined with a residual connection or an inception module, the result scores higher in all three metrics (sensitivity, specificity, and accuracy), especially when the inception module is applied. The improvement brought by the inception module can be attributed to its better ability to extract features at multiple scales, and to the increase in the width of the network. The building unit in the architecture of the U-Net + residual network is shown in Figure 6(a), which is a plain 3 × 3 convolution unit of U-Net with an identity mapping. This structure, though optimized with batch normalization, fails to achieve a satisfying result mainly because its filter kernels all have the same size and therefore cannot extract features effectively in this scenario. In our proposed network, the U-Net backbone is optimized with both residual connections and inception modules, achieving the best performance with 96.38% sensitivity, 97.14% specificity, and 97.95% accuracy.
In the end, we compare the pixel-level segmentation results with other studies to demonstrate the effectiveness of the proposed method. The latest studies conducted on the same IDRiD dataset with different methods, such as U-Net based, CNN based and HED [56] based ones, are selected and compared. As can be seen in Table 5, our method outperforms the others in terms of sensitivity, specificity, and accuracy. Furthermore, we have achieved a significant improvement compared with the previous U-Net based method [34]. Figure 12 shows an original RGB color retinal image and its corresponding pixel-level segmentation result, where TP is marked in red, FP is marked in blue, and FN is marked in green.
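The splicing step, in which overlapping patch predictions are averaged and then thresholded at 0.5, might look roughly like the sketch below. The function name `stitch_predictions`, the `(top, left)` corner convention, the use of `>=` at the threshold, and treating uncovered pixels as background are all assumptions.

```python
import numpy as np


def stitch_predictions(patches, corners, image_shape, threshold=0.5):
    """Merge per-patch probability maps into one binary mask.

    patches: list of 2-D float arrays with values in [0, 1].
    corners: list of (top, left) positions of each patch in the full image.
    Overlapping areas are averaged before thresholding.
    """
    total = np.zeros(image_shape, dtype=np.float64)
    count = np.zeros(image_shape, dtype=np.float64)
    for patch, (top, left) in zip(patches, corners):
        h, w = patch.shape
        total[top: top + h, left: left + w] += patch
        count[top: top + h, left: left + w] += 1
    # Pixels covered by no patch keep a mean of 0 and end up as background.
    mean = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    return (mean >= threshold).astype(np.uint8)
```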

V. CONCLUSION
In this paper, we propose a novel method of hard exudate segmentation based on an optimized U-Net architecture. Firstly, we generate sample patches based on the SLIC superpixel algorithm and classify them into hard exudate patches and non-exudate background patches. Then, we feed them to our network, where the U-Net backbone is improved by utilizing both residual connections and inception modules. At last, we splice the neighboring patch predictions together into complete images for the final segmentation results. The proposed method is evaluated on the public dataset IDRiD with a series of comparative experiments. The sensitivity, specificity, and accuracy reach 96.38%, 97.14%, and 97.95% respectively, which demonstrates superior performance among current methods. Future work could extend the network with attention gates.