HRED-Net: High-Resolution Encoder-Decoder Network for Fine-Grained Image Segmentation

Accurate segmentation of fine-grained information is an important step in medical image analysis applications. With the development of encoder-decoder-based networks, various network structures and algorithms have made significant progress in semantic segmentation tasks. This work aims to present a novel high-resolution encoder-decoder network (HRED-Net) for fine-grained image segmentation that is highly accurate for small-scale targets. We design a multiscale context connection module to extract feature information without reducing the resolution, and propose a multiresolution fusion model to fine-tune the final results. In addition, these modules are trained together with a detail-oriented loss function to enhance the model’s perception of fine-grained parts. Through experiments on the DRIVE dataset, we found a balance between these modules, and our comparison results show that in addition to the extraction of multiscale features, the fusion of multiresolution prediction information is also beneficial for fine-grained segmentation. Our method yielded significant improvements in the accuracy and sensitivity in retinal vessel and lung segmentation tasks.


I. INTRODUCTION
Focusing on the details of image segmentation is an ongoing challenge, and accurate segmentation of medical images, including shapes, locations, and sizes, provides scientific assistance to doctors for making accurate diagnoses. Convolutional neural network (CNN)-based algorithms have made important contributions to the field of medical imaging, involving various aspects such as retinal blood vessel segmentation [1]-[4], pathological slice segmentation [5]-[7], organ segmentation [8]-[10], and tumor segmentation [11]-[13].
Due to the limitations of the standardization of clinical data collection programs and some manual interventions in the data collection process [14], fine-grained segmentation [15] of medical images is challenging. The first limitation is low tissue contrast: fine-grained targets tend to be similar to background pixel values, causing inconsistencies or disappearance at the extended ends. The second limitation is noise interference: due to the similar physical properties at tissue junctions and to flowing tissue fluid, medical images are often accompanied by impurities and uncertain shadows.
The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson.
The third limitation is the imbalance between foreground and background: fine-grained targets, such as tumors, blood vessels, and nerves, deserve the most attention in medical images, yet they occupy only a small fraction of the image. For ease of understanding, we select the retinal vessel segmentation task as an example, although similar situations exist in other tasks, such as organ and tumor segmentation. In Fig. 1, the left panel shows the collected retinal vessel images and the right panel shows the corresponding ground truth. The figure confirms the three limitations above: 1) the boxed regions have low contrast and blurred blood vessel contours, and the targets are interrupted at the positions indicated by the arrows, 2) irregular shadows are distributed throughout the image, and 3) the foreground occupies less than 0.1 of the image.
To address these problems, many supervised and unsupervised segmentation methods have been proposed [16]-[21], including threshold processing, level-set, maximum entropy partition, and manual marking methods. These methods, however, depend heavily on the pixels in the region and have difficulty distinguishing fuzzy regions. Recently, many researchers have made various attempts with deep learning methods, and they have proposed new ideas for image segmentation. These studies can be grouped into three categories: 1) those proposing encoder-decoder feature extraction structures [5], [16] and implicitly using multiresolution features, 2) those proposing multiple resolution [22] combinations and explicit combinations of multiresolution information, and 3) those proposing multimodal feature extraction methods, including expanding the network width [23], increasing the network depth [24], and increasing the receptive fields of convolution [25].
We extract different scales of information at a single resolution to extend the existing encoder network; additionally, we add an explicit fusion of multiresolution information to fine-tune the final results. During training, we also use a detail-oriented loss function to improve the sensitivity. We summarize the main contributions in the following five points: 1) Through careful analysis and experimental verification, an encoder module with residuals is used to extract deep semantic information and improve the semantic segmentation capability of the encoder-decoder structure. 2) A multiscale detail enhancement module is proposed to extract deep semantic information without reducing the resolution, and it is placed at the most effective location, determined through careful analysis and experiments, to separate fine-grained targets. 3) We provide a shortcut from the low-resolution prediction maps to the final prediction and use them to fine-tune the final results. 4) We propose a detail-oriented loss function that combines the weighted cross-entropy loss function and the Dice loss function to focus on the fine-grained parts. 5) We compare U-Net, SegNet, and the context encoder network (CE-Net) with our proposed network under the same DRIVE dataset inputs, and perform extensive comparisons with the state-of-the-art methods on other retinal vessel and lung segmentation tasks.

II. RELATED WORKS
With the development of deep learning, CNNs have facilitated medical image segmentation tasks. To optimize the details of the segmentation targets, two strategies are often used in the literature: 1) improving the feature recognition and semantic reasoning capabilities of the network and inferring the attribution of pixel points by learning local and global information, and 2) improving the prediction ability of the network with multiresolution features and effectively integrating local and global context information.
To improve the logical reasoning and expression ability of the network, researchers have explored the use of patch-based CNNs in end-to-end learning, which showed early promise in engineering applications. As a representative contribution, Long et al. [26] made a major breakthrough when fully convolutional networks (FCNs) were introduced to address pixelwise prediction problems. FCNs define a skip layer that concatenates a deep, coarse layer with a global context and a shallow, fine layer with high-frequency details, leading to sharper boundaries between different classes. Then, Ronneberger et al. [5] proposed U-Net with a 13-layer Visual Geometry Group (VGG13) framework, and Badrinarayanan et al. [27] proposed SegNet, which is topologically identical to the VGG16 framework; they both collected information at different resolutions for pixelwise segmentation, and these methods work well in medical segmentation tasks on small datasets. The skip connection provides a bridge for direct delivery of different resolution information between the encoder and decoder paths.
The deep convolution algorithm also has good performance in the medical field. To take full advantage of the prediction results at different network levels in one network, Guo et al. [28] and Liskowski and Krawiec [1] predicted the same resolution at different stages of the short-connection network. This short-connection approach passes low-level semantic information to a higher level to refine the high-level prediction and passes structural information to the lower level to reduce the noise at the lower level. Gu et al. [29] combined an inception module and dilated convolutions to form a context extraction module that links the encoder and decoder parts and captures more high-level information through the receptive fields of different branches.
To improve the segmentation details, multiresolution fusion is widely used as an effective means of medical semantic segmentation. In reference [30], multiple U-Nets were connected into a chain, in which different resolution prediction maps were reused to improve the final accuracy. Feng et al. [31] proposed a more complicated method for connecting prediction maps at different stages, in which the predictions in the primary path and the two branch paths are cross-connected, exhibiting strong robustness in image segmentation tasks.
VOLUME 8, 2020
Typically, better results can be obtained via hybrid approaches. In reference [9], a high-resolution pathway block was used as a skip connection to fine-tune the final prediction map, and low-resolution prediction maps were also used to improve the top resolution. Both the dilated convolution kernels and low-resolution information were combined to obtain feature information with high recognition accuracy. To solve the problem of fuzzy boundary detection, Xie and Tu [32] proposed holistically-nested edge detection (HED), gradually reducing the resolution of the predicted images by FCNs, and then fusing them with weights. Lin et al. [22] exploited a multiscale pyramidal model and defined one pyramid for each feature map; the top-down convolutional pathway produces strong semantic information, and the bottom-up convolutional pathway yields accurate activation of the local information. These edge detection methods are important references for semantic segmentation.
In fine-grained segmentation tasks, Mavroudi et al. [33] used a temporal conditional random field module for fine-grained action segmentation. Zhao et al. [34] extracted fine-grained information with an improved pyramid neural network. To further improve the temporal convolutional encoder-decoder network, Nie and Shen [35] proposed a semantic-guided method to acquire accurate boundary information. In addition, many researchers have enhanced semantic segmentation by means of large receptive fields. Vo and Verma [36] combined two deep convolutions with multiple filter sizes for identifying fine-grained features. Zhou et al. [37] integrated fine-grained information from multiple scales with parallel multiresolution modules. Yang et al. [38] proposed a multiscale recurrent neural network to refine the details of boundary shapes.
The above literature has contributed to the semantic segmentation of medical images, and we can learn from these studies to further improve the segmentation performance. In the next sections, we describe a method to solve the problem and show the effectiveness of the proposed method through experimental comparisons.

III. METHOD
In this section, we describe our high-resolution encoder-decoder network (HRED-Net) in detail, and the entire process is shown in Fig. 2. The proposed network consists of four strategies, each of which is introduced in a separate subsection. First, an enhanced feature extraction module similar to U-Net is constructed, in which the feature information of different resolutions is fully extracted. Next, and most importantly, a multiscale pathway is used to improve the extraction of image details. Then, we elaborate on the multiresolution fusion module, which refines the network output for accurate segmentation of the targets. Finally, we force the network to focus more on the medical image features and design a hybrid loss function for fine-grained parts.

A. ENHANCED FEATURE EXTRACTION MODULE
Our feature extraction module learns from the efficient encoder-decoder structure [5], which provides local and global context information by extracting features at multiple resolutions.
To extract higher-resolution features, we use deeper convolution layers than U-Net. As shown in Fig. 2, each encoder module in our method is a three-layer convolution with a residual structure. The residual structure is easy to optimize [24]. Each convolution kernel uses batch normalization [39] and a rectified linear activation unit before the weight layer [40]. After that, max pooling is performed to achieve translation invariance at a low resolution.

B. MULTISCALE PATHWAY
Our second block is a multiscale pathway that connects the encoder and decoder. This module is designed to obtain high-resolution feature information and improve the perception of fine-grained parts.
Although the pooling operation provides translation invariance, it gradually reduces the resolution of the image, which makes small targets disappear or become hard to identify in the underlying layers. The current solution is the skip connection pathway, which combines the coarse layers with the corresponding fine layers. However, for fine-grained segmentation targets, the deep layers cannot generate semantic information for the disappearing parts, which harms the continuity of the segmentation target. To overcome this limitation, we propose a multiscale module. As shown in Fig. 3, we first perform a 1 × 1 convolution, a 3 × 3 convolution, a 5 × 5 convolution, and 3 × 3 max pooling in parallel, and in this way, we extract features of different scales at the same resolution. Then, we apply a 1 × 1 convolution to increase the depth of the network in every branch. Finally, we superimpose the features of all branches and use a 3 × 3 convolution for dimensionality reduction. Since our goal is to increase the sensitivity of the network to small isolated parts, no larger convolutions, such as 7 × 7 or 9 × 9 convolutions, are used here.
Our module draws on the advantages of the Inception module [23], [41]. This operation has three advantages: 1) it expands the depth of the network and increases the nonlinearity of the network, 2) it increases the width of the network and improves the adaptability of the network to different scales, and 3) it operates at the same resolution with large receptive fields, to obtain a wider range of information without losing details.
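The statement that the parallel branches operate at the same resolution can be checked with the standard output-size formula for convolution and pooling layers. The following is a minimal sketch, assuming stride 1 and zero padding of (k − 1)/2 for every odd kernel size k; the padding scheme is an assumption, since the text does not state it, and the function names and the 272 × 272 feature-map size are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def branch_output_sizes(size, kernels=(1, 3, 5, 7)):
    """Output size of parallel branches, assuming stride 1 and 'same' padding.

    The kernel sizes are illustrative odd values, not the authors' exact list.
    """
    sizes = {}
    for k in kernels:
        name = f"conv{k}x{k}"
        sizes[name] = conv_output_size(size, k, stride=1, padding=(k - 1) // 2)
    # 3 x 3 max pooling with stride 1 and padding 1 also keeps the resolution.
    sizes["maxpool3x3"] = conv_output_size(size, 3, stride=1, padding=1)
    return sizes


if __name__ == "__main__":
    # Hypothetical 272 x 272 feature map.
    print(branch_output_sizes(272))
    # For contrast, 2 x 2 max pooling with stride 2 halves the resolution.
    print(conv_output_size(272, 2, stride=2))
```

Under these assumptions every branch returns a map of the input size, so the branch outputs can be stacked along the channel axis without resampling, whereas a strided pooling layer would discard spatial detail.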

C. MULTIRESOLUTION FUSION MODULE
Our third block is a weighted multiresolution fusion module for fine-tuning the final results. The multiresolution fusion module is designed to effectively fuse high-level global features and low-level local features to refine high-resolution output maps. Fig. 4(a) and Fig. 4(b) represent two multiresolution network strategies; the difference between them is whether they explicitly utilize multiresolution prediction maps. By combining these two structures, we propose a multiresolution fusion module, which is shown in Fig. 4(c).
In U-Net, the encoder provides a channel for extracting semantic information from different scales, the decoder implements a refinement process, and the skip connections combine coarse, deep features with fine, shallow features [42]. In this manner, U-Net allows the deep semantic features to guide the subsequent fusion but does not explicitly produce multiresolution predictions. In contrast, HED explicitly produces predictions for each level of features, and then a weighted fusion layer automatically combines outputs from multiple scales. To inherit the advantages of the two networks, we follow the main outline of the U-Net architecture and then fine-tune the baseline with a multiresolution structure. Deep layers are helpful for instance detection. In U-Net, the transfer of deep features to the top is a long process, and the information is constantly adjusted during communication with the upper layers. Furthermore, semantic segmentation is a process of feature aggregation, which is assigned heuristically, and information discarded at other levels may contribute to the final prediction. Therefore, a shortcut is needed for the deep features to reach the final prediction.
Inspired by the structure of HED and the pyramid scene parsing networks [43], we provide a multiresolution fusion model for the deep layers. The difference from HED is the weight distribution for each resolution: we distinguish the top layer from the other layers, so that the top prediction map is dominant and the other prediction maps are secondary. As shown in Fig. 2, we add a 1 × 1 convolution before each prediction map, which not only automatically selects the scale but also changes the number of channels. We convolve the low-resolution maps to reduce their number of channels to a quarter of that of the original resolution [44] and superimpose them onto the original resolution prediction map; in this way, a shortcut to the final result is provided for results of different resolutions.
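The idea of letting the top prediction map dominate while lower-resolution maps contribute as secondary terms can be illustrated with a small sketch. It is not the authors' implementation: in the network the combination is produced by convolutions, whereas here fixed weights, nearest-neighbour upsampling, and the names `upsample_nearest`, `fuse_predictions`, and `top_weight` are all assumptions chosen for illustration.

```python
def upsample_nearest(pred, factor):
    """Nearest-neighbour upsampling of a 2D map (list of rows) by an integer factor."""
    out = []
    for row in pred:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_predictions(top, low_res_maps, top_weight=0.7):
    """Weighted sum of the top map and upsampled low-resolution maps.

    top_weight is a hypothetical value; the remaining weight is shared
    equally among the low-resolution maps so that the top map dominates.
    Each low-resolution map must divide the top map size evenly.
    """
    h, w = len(top), len(top[0])
    fused = [[top_weight * v for v in row] for row in top]
    side_weight = (1.0 - top_weight) / len(low_res_maps)
    for pred in low_res_maps:
        factor = h // len(pred)
        up = upsample_nearest(pred, factor)
        for i in range(h):
            for j in range(w):
                fused[i][j] += side_weight * up[i][j]
    return fused


if __name__ == "__main__":
    top = [[1.0] * 4 for _ in range(4)]   # full-resolution prediction
    low = [[0.0, 1.0], [1.0, 0.0]]        # half-resolution prediction
    for row in fuse_predictions(top, [low]):
        print(row)
```

With fixed weights the low-resolution maps can only nudge the top prediction, which mirrors the dominant and secondary roles described above; learned weights would let the network choose this balance itself.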

D. DETAIL-ORIENTED LOSS FUNCTION
As an end-to-end segmentation framework, our target is to train the proposed network to predict where each pixel belongs; i.e., this is a pixel-level classification problem. In this two-class task, the pixel values near the boundary are usually very similar, which makes pixels easy to misclassify. Therefore, we model the task as a regression problem that estimates the probability of each pixel belonging to the target, which is more accurate than modeling it as a hard classification problem.
The ratio of foreground to background is often unbalanced in medical images. Li et al. [45] showed that not all pixels are equal and that more weight should be given to the pixels of interest; to balance the pixel frequency between the region of interest and the background, we choose the weighted cross-entropy loss.
Here, N is the total number of pixels; p_i and y_i are the predicted probability of the positive class and the label of pixel i, respectively; and α denotes the weight of the positive pixels.
In addition, the Dice coefficient loss is meaningful for clinical applications, as more focus should be placed on the overlap between the prediction and the ground truth in medical images. We then define the joint morphological segmentation loss as a combination of the weighted cross-entropy loss and the Dice loss, in which λ is the balancing weight. In our experiments, we set λ = 1.5, which empirically gave preferable results.
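The loss described in this subsection can be sketched as follows. The exact equations are not reproduced in this text, so the forms below are assumptions: a mean binary cross-entropy with weight α on the positive term only, a soft Dice loss written as one minus the Dice coefficient, and λ applied to the Dice term. The value α = 2.0, the smoothing constant `eps`, and the function names are hypothetical; only λ = 1.5 comes from the text.

```python
import math


def weighted_cross_entropy(p, y, alpha, eps=1e-7):
    """Mean binary cross-entropy with weight alpha on the positive pixels."""
    total = 0.0
    for p_i, y_i in zip(p, y):
        p_i = min(max(p_i, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += alpha * y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return -total / len(p)


def dice_loss(p, y, eps=1e-7):
    """One minus the soft Dice coefficient between predictions and labels."""
    intersection = sum(p_i * y_i for p_i, y_i in zip(p, y))
    return 1.0 - (2.0 * intersection + eps) / (sum(p) + sum(y) + eps)


def detail_oriented_loss(p, y, alpha=2.0, lam=1.5):
    """Assumed combination: weighted cross-entropy plus lam times the Dice loss."""
    return weighted_cross_entropy(p, y, alpha) + lam * dice_loss(p, y)


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    print(detail_oriented_loss([0.9, 0.1, 0.8, 0.2], labels))
    print(detail_oriented_loss([0.1, 0.9, 0.2, 0.8], labels))
```

In this form the cross-entropy term penalizes each pixel individually, with missed foreground pixels costing more when α > 1, while the Dice term scores the overlap of the whole prediction, which is less affected by the large number of background pixels.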

IV. EXPERIMENTS AND RESULTS
In this section, we evaluate the proposed method through experimental comparisons. First, we describe in detail the processing of the raw data, the implementation details, and the evaluation metrics. Then, we evaluate the modules we have proposed using the DRIVE dataset. Finally, we demonstrate the effectiveness by comparing our method with the state-of-the-art methods.

A. DATASET AND IMAGE PREPROCESSING
There are many publicly available benchmark datasets for image segmentation. We evaluate the proposed algorithm on three datasets: DRIVE [46], STARE [47], and LUNA.
The DRIVE database contains 20 RGB training images and 20 testing images with the size of 565 × 584 (the retinal vessel occupies a radius of 540 pixels). We cropped the original images to 544 × 544 pixels with a reference mark, and then training and testing are performed for all of the images of this size. The STARE dataset contains 20 retinal fundus images with labels. We automatically generated marks based on the images. The original image size is 700 × 605 pixels, and we cropped the images to 656 × 544 pixels, and trained and tested images of this size. The LUNA dataset contains 534 2D samples with corresponding labels, and all images have a resolution of 512 × 512. The first two datasets are blood vessel segmentation datasets in which the segmentation targets are fine-grained and scattered, and the last dataset is an organ segmentation task, which is used to verify the generalization ability of our proposed network.
Due to the limitations in the number of training images, we augmented the dataset to enhance the expressive power of the training data. First, we adjusted the orientation of the images, including vertical flipping and horizontal flipping. Next, image preprocessing, which mainly involved random rotation, random shear, width shift and height shift, was implemented. Finally, we randomly processed the training images, including scaling from 0.9 to 1.1 and channel transformation. We also adopted noise reduction strategies on all the training and testing images. All the images underwent normalization, and then contrast-limited adaptive histogram equalization (CLAHE) and gamma correction were performed. The effect of noise reduction on the image is shown in Fig. 5. The segmentation targets become clearer, and the difference in the brightness decreases from the top row to the bottom row.
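Two of the intensity steps mentioned above, normalization and gamma correction, are simple enough to sketch directly. The version below is only an illustration: min-max scaling is an assumed form of normalization, the gamma value is hypothetical, CLAHE is omitted because it requires an image-processing library, and the function names are not from the original work.

```python
def normalize(pixels):
    """Min-max normalization of pixel intensities to [0, 1] (assumed form)."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(v - lo) / (hi - lo) for v in pixels]


def gamma_correct(pixels, gamma):
    """Gamma correction of intensities already scaled to [0, 1]."""
    return [v ** gamma for v in pixels]


if __name__ == "__main__":
    row = [0, 64, 128, 255]              # one row of a hypothetical 8-bit image
    scaled = normalize(row)
    print(gamma_correct(scaled, 0.8))    # gamma < 1 brightens dark regions
```

A gamma below 1 lifts dark intensities more than bright ones, which is one way to reduce the brightness differences across an image before it is passed to the network.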

B. IMPLEMENTATION DETAILS
The datasets we use have the same standard training and test sets. All the images are subjected to the same preprocessing and cropping processes. On this basis, the training data are additionally augmented to compensate for the lack of data. We randomly divided the augmented training data into 80% for training and 20% for validation. All training ends with early stopping. We adopt the adaptive moment estimation (Adam) optimizer [48] with a learning rate of 0.0001 and exponential decay rates β_1 = 0.9 and β_2 = 0.999. All the experiments are run on an NVIDIA Titan XP GPU.
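To make the optimizer settings concrete, a single Adam update for one scalar parameter can be sketched as below, using the learning rate and the two β values quoted above. The stabilizing constant `eps = 1e-8` is the commonly used default and is an assumption here, as is the function name `adam_step`.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    w, m, v = 1.0, 0.0, 0.0
    for step in range(1, 4):
        w, m, v = adam_step(w, grad=2.0, m=m, v=v, t=step)
        print(step, w)
```

With a constant gradient, each update moves the parameter by approximately the learning rate regardless of the gradient magnitude, which is why a small value such as 0.0001 gives slow but stable progress.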

C. EVALUATION METRICS
Subtle differences determine the quality of fine-grained segmentation, so selecting appropriate evaluation indicators is important for evaluating the segmentation results effectively. In this respect, we draw on the experience of other researchers, who have provided references for fine-grained segmentation: Angelova and Zhu [49] and Zhao et al. [50] used the accuracy as the metric of fine-grained segmentation, and Zhang et al. [51] chose the precision and recall as metrics. We choose the accuracy (Acc) as an indicator to evaluate fine-grained segmentation and use the sensitivity (Sen) and specificity (Spe) instead of the precision and recall for the medical images.
The accuracy is widely used to measure the percentage of correctly predicted pixels and is defined as:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
Here, TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.
The sensitivity and specificity often appear in pairs and are important metrics for binary segmentation. They measure the proportions of foreground and background pixels that are correctly predicted, respectively, and are defined as:
$$\mathrm{Sen} = \frac{TP}{TP + FN}, \qquad \mathrm{Spe} = \frac{TN}{TN + FP}$$
In terms of model evaluation, we use the area under the curve (AUC) as a metric. The AUC is widely used as an essential indicator to confirm the effectiveness of machine learning algorithms and is obtained by integrating the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false-positive rate (FPR):
$$\mathrm{AUC} = \int_{0}^{1} y(x)\,dx$$
Here, x is the FPR and y is the TPR. We calculated the AUC by using the implementation provided in the scikit-learn Python library.
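The metrics in this subsection follow their standard definitions and can be computed from the four confusion counts. The sketch below is a plain-Python illustration with hypothetical function names; the trapezoidal AUC is a stand-in for the scikit-learn routine used in the experiments and assumes the ROC points are already sorted by increasing FPR.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for binary (0/1) pixel labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, tn, fp, fn


def segmentation_metrics(pred, truth):
    """Accuracy, sensitivity, and specificity from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),   # fraction of foreground pixels recovered
        "Spe": tn / (tn + fp),   # fraction of background pixels recovered
    }


def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve given as points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    truth = [1, 0, 0, 0, 1, 1]
    print(segmentation_metrics(pred, truth))
    print(auc_trapezoid([0.0, 0.5, 1.0], [0.0, 0.9, 1.0]))
```

Because background pixels dominate in vessel images, a high accuracy can hide missed vessels, which is why the sensitivity is reported alongside it.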

D. ABLATION ANALYSIS OF THE PROPOSED MODULE
To evaluate the performance of the proposed modules, we designed a series of experiments, training and testing them on the DRIVE dataset. We first evaluate the effectiveness of the proposed modules and the loss function to determine the optimal combination, and then further confirm the performance against some classic networks under the same conditions.

1) EVALUATING THE MULTISCALE PATHWAY MODULE
To compare the effects of the multiscale pathway on different locations of the skip connection, we designed a comparative experiment. The experiment uses U-Net as a reference object, and the multiscale pathway module is placed in the four skip connections of the network, called SED-Net 1, SED-Net 2, SED-Net 3, and SED-Net 4. The difference between them is the number of channels convolved, which is the same as the number of channels in the corresponding encoder module.
We trained and tested all the networks under the same conditions as described in section A, and the test results are shown in Table 1. The table lists the parameters of the network and the corresponding AUC score, accuracy, and sensitivity for each network. From the experimental results, we can conclude that 1) our networks outperform U-Net, and 2) the module performs best at the second encoder. There are three reasons for this phenomenon. The first reason is the resolution: in SED-Net 3 and SED-Net 4, the reduction in the image resolution affects the identification of fine-grained parts. The second reason is the receptive fields: operating at the same resolution with large receptive fields, our method obtains a wide range of information without losing details. The third reason is the width and depth: expanding the depth of the network increases its nonlinearity, and increasing the width of the network improves its adaptability to different scales.
For a more intuitive understanding of the effects of the proposed module, we compared the prediction maps of our proposed networks and U-Net in detail (see Fig. 6). We show the images predicted by the two algorithms as well as the original image, comparing the segmentation effects on the fine-grained targets. In the selected samples, our prediction maps show the following: 1) high sensitivity to fine-grained parts (see the marked parts), where we recognized the fine-grained contours and maintained their integrity, and 2) better performance than U-Net on low-contrast parts (see the parts marked by ellipses), where the contrast between the foreground and background is low but we still segmented the blood vessel contours accurately.

2) EVALUATING THE MULTIRESOLUTION FUSION MODULE
To verify the performance of our multiresolution fusion module, we added a fusion part to the SED-Net 2 network and called this network HRED-Net. We trained and tested the two networks under the same conditions as described in section A, and the results are shown in Table 2. The table lists the parameters of the network and the corresponding AUC score, accuracy, and sensitivity for each network.
From Table 2, we conclude that the fusion module 1) yields an approximate 0.5% improvement in the sensitivity with a 0.014 M parameter increase, and 2) increases both the AUC score and accuracy by 0.1%. This result supports the effectiveness of the multiresolution fusion module, indicating that directly connecting the multiresolution prediction results has a slight fine-tuning effect on the final prediction results.

3) EVALUATION OF THE DETAIL-ORIENTED LOSS FUNCTION
To evaluate the effectiveness of the detail-oriented loss function on the proposed network, we designed and tested a comparative experiment with different loss functions. In the numerical results shown in Table 3, all four metrics (sensitivity, specificity, accuracy, and AUC) showed clear differences on the DRIVE dataset, and the Dice coefficient loss performed better than the weighted cross-entropy loss. The detail-oriented loss function achieved the best segmentation scores. The sensitivity improved the most (0.4%), the specificity and accuracy improved by more than 0.2%, and the AUC score improved by 0.1%.

4) COMPARISON WITH THE STATE-OF-THE-ART METHODS
To further validate the effectiveness of our proposed module, we compare the proposed HRED-Net with the state-of-the-art algorithms on the same processed images described in section A. We run the code provided by the authors on the preprocessed images, and the results are shown in Table 4.
In Table 4, our network still performs better than the other methods under the same conditions with fewer parameters and shows the best accuracy and sensitivity, reflecting its advantages in fine-grained segmentation. This is mainly due to the following two points: 1) our loss function focuses more on the segmentation details, and 2) our network handles prediction information in a multiscale and multiresolution way, enhancing the segmentation details.
Intuitive effects can be reflected by predictive comparisons, as shown in Fig. 7. All networks can segment the trunk of the blood vessel well and show differences in the small branches. U-Net and SegNet have visible breaks and discontinuities in the small parts, CE-Net's segmentation results are more complete than the results from the other methods but lack the necessary details, and our predictions are closer to the label in the details than the other methods.

E. RESULTS
To further prove the validity of our proposed network, we experimented on different datasets and compared our results with other state-of-the-art approaches. All of the datasets were trained and tested under the same conditions as described in section B.
First, we compared our method with other machine learning methods on the DRIVE dataset and listed the results of the human observer. The results of the human observer come from people trained by an ophthalmologist [44] and can be used to measure the effectiveness of machine learning methods. From Table 5, we can see that our proposed network increases the sensitivity from 0.8309 to 0.8730 (by 4.2%) and the accuracy from 0.9576 to 0.9644, and we also achieve a specificity of 0.9742 and an AUC of 0.9796.
Then, we evaluated the accuracy and sensitivity of our method with the state-of-the-art algorithms on the STARE dataset. From the comparison results shown in Table 6, we can see that our proposed method achieves a sensitivity
Finally, we evaluated the performance on the LUNA database, which contains 534 2D samples with corresponding labels. All images have a resolution of 512 × 512. Different from the previous two datasets, the segmentation targets are concentrated in this dataset. We compared the proposed method with U-Net [5], CE-Net [29] and the recurrent residual CNN based on U-Net (R2U-Net) [53]. From the comparisons shown in Table 7, our HRED-Net increases the sensitivity from 0.9832 to 0.9917 and the accuracy from 0.9918 to 0.9923, and our network also achieves an AUC of 0.9879 and a specificity of 0.9935.
Tables 5, 6, and 7 above show the scores achieved by each model, which illustrate the effectiveness of our proposed network. These results further demonstrate that our proposed modules and the detail-oriented loss function are beneficial for all the target segmentation tasks. To obtain a more intuitive understanding of these scores, we show example results in Fig. 8, which displays the original image, the U-Net segmentation results, our segmentation results, and the ground truth. We can conclude that our proposed algorithm achieves clear results in retinal vessel segmentation and lung organ segmentation.

V. CONCLUSION
Details are essential for medical image segmentation. In this paper, we proposed a multiscale connection encoder-decoder network that focuses on fine-grained parts. Our network draws on the encoder-decoder structure from U-Net, and it consists of an enhanced encoder module, a multiscale fine-grained extraction module, a decoder module and a multiresolution fusion module. We also added a detail-oriented loss function to the network. In the multiscale pathway, we extracted the location information of the small targets by increasing the width and the receptive field of the convolution. Through comparative experiments, we found that the multiscale module performed best in the second branch. Moreover, the multiresolution fusion path lets the low-resolution prediction maps participate directly in the final prediction, retaining complete semantic features. Our comparative experiments show that the proposed method improves fine-grained part segmentation with fewer parameters, including retinal vessel segmentation and lung CT segmentation.