ELU-Net: An Efficient and Lightweight U-Net for Medical Image Segmentation

Recent years have witnessed a growing interest in U-Net and its improved variants. U-Net is one of the classic semantic segmentation networks with an encoder-decoder architecture and is widely used in medical image segmentation. Among its successors, U-Net++ improves on U-Net with an architecture of nested and dense skip connections, and U-Net 3+ improves on U-Net++ by taking advantage of full-scale skip connections and deep supervision on full-scale aggregated feature maps. Each network architecture has its own advantages in the use of the encoder and decoder. In this paper, we propose an efficient and lightweight U-Net (ELU-Net) with deep skip connections. The deep skip connections include same- and large-scale skip connections from the encoder to fully extract the features of the encoder. In addition, the proposed ELU-Net is studied with different loss functions to improve the learning of the brain tumor sub-regions WT (whole tumor), TC (tumor core) and ET (enhancing tumor), and a new loss function, DFK, is designed. The effectiveness of the proposed method is demonstrated on the brain tumor dataset used in the BraTS 2018 Challenge and the liver dataset used in the ISBI LiTS 2017 Challenge.


I. INTRODUCTION
Deep learning has developed rapidly and is now widely used in the medical, industrial, agricultural, and transportation fields. Owing to its efficient performance, it plays an increasingly large role in productivity and daily life, and has received growing attention from researchers. Its application in medical imaging has become a current research hotspot.
Among the most widely used techniques in medical image analysis is semantic segmentation, which is used for the automatic segmentation and recognition of organs and lesions. Typical image semantic segmentation algorithms include FCN [1], SegNet [2], U-Net [3], PSPNet [4], the Deeplab series [5][6][7], and DANet [8]. Among them, U-Net is particularly suitable for medical segmentation tasks due to its unique architecture. Many researchers have built improvements on this basis and have achieved encouraging results. For example, U-Net and its improved versions [9] are used to separate out bladder cancer cells [10], predict skin lesions [11], and segment gallstones [12], the liver [13], liver tumors [14], and brain tumors [15,16].
In the U-Net family, U-Net uses skip connections to combine the high-level semantic feature maps from the decoder with the corresponding low-level detailed feature maps from the encoder. U-Net++ introduces the nested and dense skip connections from DenseNet [17] to strengthen the plain skip connections and reduce the semantic gap between the encoder and decoder [18]. In U-Net 3+, each decoder layer incorporates both smaller- and same-scale feature maps from the encoder and larger-scale feature maps from the decoder to capture fine-grained details and coarse-grained semantics in full scales [19]. Despite their better performance, these variants still do not exploit sufficient information from the full scales of the encoder, and therefore do not realize the full potential of the U-shaped network in semantic segmentation. In addition, they have a large number of parameters. We focus on realizing the potential of a U-shaped network with fewer parameters.
Two datasets were selected to verify the effectiveness of the proposed method. The first is the brain tumor dataset from the Brain Tumor Segmentation Benchmark (BraTS) 2018 Challenge, a multi-class task (https://aistudio.baidu.com/aistudio/datasetdetail/64660). The second is the liver dataset from the Liver Tumor Segmentation Benchmark Challenge organized by the 2017 IEEE International Symposium on Biomedical Imaging (ISBI LiTS 2017 Challenge), a two-class task (https://aistudio.baidu.com/aistudio/datasetdetail/79729). Glioma, a type of brain tumor that includes LGG (low-grade glioma) and HGG (high-grade glioma), has received wide attention from researchers [20] because it is difficult to recognize and is the subject of public challenges. In the BraTS 2018 Challenge, four brain MRI modalities (Flair, T1, contrast-enhanced T1, and T2) are available for prediction, and the segmentation task consists of three nested brain tumor sub-regions: WT (whole tumor), TC (tumor core) and ET (enhancing tumor), as shown in Figure 1. In the ISBI LiTS 2017 Challenge, the segmentation target is the liver region.

II. RELATED WORK
Strategies for improving U-Net have received extensive attention. On the basis of U-Net, a residual module was introduced to reduce the complexity of the network architecture [21]; a U-Net network with a residual module was connected in series to avoid or minimize the natural information loss that occurs following image shrinkage [22]; Two-Pathway-Residual (TPR) blocks were designed to replace linear blocks in the U-Net network to solve the gradient degradation problem [23]; non-linear multi-level residual blocks were incorporated into skip connections to reduce the semantic gap [24]; an attention gate was introduced to focus on the target [25]; a weighted attention mechanism was introduced into the U-Net network containing residual modules [26], and on this basis, a tightly connected network was introduced to improve the utilization of model feature information and reduce the complexity of network learning parameters [27]; a small change was designed so that each layer in the encoder was connected with the same-size layer in scaled original image pyramids to capture the large-scale detailed information and small-scale contour information [28]; based on the U-Net with a ResNet50 convolution block, the feature maps of different scales obtained by the decoder were used to obtain the segmentation output through a feature pyramid network [29]; a dual-channel encoder was designed to obtain a larger receptive field and retain spatial information, comprising a context channel using multi-scale convolution and a spatial channel using a large convolution kernel [30]; a dual encoder was designed to extract features simultaneously from the zero-filled k-space data and the undersampled image for reconstruction, which provides a better representation at the bottleneck region and supplements the decoder with skip connections [31].
Furthermore, Abd-Ellah [32] designed a two-parallel U-Net with asymmetric residual blocks to extract local and global features in parallel paths. Wang [33] proposed a wide residual network and pyramid pool network (WRN-PPNet), in which the wide residual network (WRN) was used to extract features of multimodal brain tumor slices and the pyramid pool network (PPNet) was used to obtain the global prior representation at different levels. Tan [34] replaced the ordinary architecture in U-Net with deep separable convolutional layers to distinguish the spatial correlation and appearance correlation of the mapped convolutional channel, and then introduced a residual skip connection to strengthen the propagation of features and increase the convergence speed. Myronenko [35] added ResNet-based skip connections and designed a VAE (Variational AutoEncoder) architecture based on the U-Net, which won 1st place on the BraTS 2018 challenge validation dataset with Dice scores of 0.91000, 0.86680 and 0.82330 for WT, TC and ET, respectively. Ahmad [36] proposed a multi-scale hierarchical-based U-Net, which introduced a hierarchical block for merging features to extract multi-scale information.
In this paper, we propose an efficient and lightweight U-Net (ELU-Net) with deep skip connections. Our main contributions are three-fold: (i) devising a novel ELU-Net that makes full use of the full-scale features from the encoder by introducing deep skip connections, which incorporate same- and large-scale feature maps of the encoder; (ii) discussing the effect of different loss functions and their combinations on feature learning, and designing a new loss function to maximize the performance of the proposed network; (iii) conducting extensive experiments on the brain tumor dataset from the BraTS 2018 Challenge and the liver dataset from the ISBI LiTS 2017 Challenge, where ELU-Net, with the fewest parameters, matches or outperforms many typical algorithms.
The remainder of this paper is organized as follows: Section 3 describes the ELU-Net network architecture and the calculated parameters for ELU-Net with Vgg16 and ResNet34. Section 4 discusses the ELU-Net with different loss functions for the brain tumor dataset and the design of a novel loss function by combination. Section 5 conducts extensive experiments on the liver dataset to verify the effectiveness of the proposed network and compares it with other representative state-of-the-art methods. In the final section, some concluding comments are made.
Figure 2 gives simplified overviews of U-Net, U-Net++, U-Net 3+ and the proposed ELU-Net. The U-Net, consisting of an encoder and a decoder, is the most popular convolutional network architecture for biomedical image segmentation; it predicts a segmentation mask at the pixel level rather than an image-level classification. First, the image is fed into the encoder, where each layer extracts higher-level features by down-sampling the output of the previous encoder layer. Second, the output of each encoder layer is passed to the corresponding decoder layer, where it is concatenated with the up-sampled feature maps from the last encoder layer or the previous decoder layer, up-sampling being used to keep the scales consistent, in order to classify the pixels. Finally, the output of the last decoder layer is activated by the softmax to produce the segmentation result.

III. METHODS
The superiority and effectiveness of U-Net are well known. Based on U-Net, U-Net++ replaces the plain skip connection with the nested and dense skip connection, and U-Net 3+ replaces the plain skip connection with the full-scale skip connection. Furthermore, the proposed ELU-Net replaces the initial plain skip connection with a deep skip connection.
By analyzing the U-Net family of architectures, it is not difficult to find that the features of the encoder are the key to image segmentation. To fully extract the features of the encoder, the deep skip connection includes a plain skip connection from the corresponding encoder layer and skip connections from all deeper encoder layers except the last encoder layer. The deep skip connections make it possible to fully capture fine-grained details and coarse-grained semantics in the encoder.
Formally, the parameters of the ELU-Net architecture are formulated as follows: let $N$ refer to the total number of encoder layers, and let $i$ index the down-sampling layers along the encoder. Each encoder layer $X_{En}^{i}$ extracts the smaller-scale feature maps from the encoder to obtain the higher semantic feature. Each decoder layer $X_{De}^{i}$ incorporates all the larger- and same-scale feature maps from the encoder except for the last encoder layer, and larger-scale feature maps from the decoder or the last encoder layer to obtain the segmentation result with pixel classification.
Here, each decoder layer $X_{De}^{i}$ is computed from three operations: a feature aggregation mechanism, realized by two convolution operations followed by a batch normalization and a ReLU activation function; an upsampling operation based on bilinear interpolation; and a concatenation operation, denoted by [·].
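The formula for $X_{De}^{i}$ is not preserved in this text. A form consistent with the verbal description could be written as below. The symbols $\mathcal{H}$ (feature aggregation) and $\mathcal{U}$ (bilinear upsampling), as well as the case split, are assumptions made for illustration and are not necessarily the authors' notation. For $i = N-1$ the list of deeper encoder maps is empty, and $X_{De}^{N}$ is the last encoder layer.

```latex
X_{De}^{i} =
\begin{cases}
X_{En}^{i}, & i = N \\[4pt]
\mathcal{H}\Big(\Big[\, X_{En}^{i},\ \mathcal{U}\big(X_{En}^{i+1}\big),\ \dots,\ \mathcal{U}\big(X_{En}^{N-1}\big),\ \mathcal{U}\big(X_{De}^{i+1}\big) \Big]\Big), & i = 1, \dots, N-1
\end{cases}
```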
Preferably, the total number of encoder layers $N$ takes a value of 5. Take the decoder layer $X_{De}^{2}$ as an example: it receives the same-scale encoder layer $X_{En}^{2}$, the larger-scale encoder layers $X_{En}^{3}$ and $X_{En}^{4}$, and the larger-scale decoder layer $X_{De}^{3}$. It thus incorporates four feature maps at the same resolution from the encoder and decoder to seamlessly merge the shallow fine-grained information with deep semantic information.
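The wiring of the deep skip connections can be sketched as a small helper that lists which feature maps are concatenated at a given decoder layer. This is only an illustration of the description above: the function name and the string labels (`En3`, `up(...)`) are hypothetical, and no tensors are involved.

```python
def deep_skip_sources(i, n_layers=5):
    """Return labels of the feature maps concatenated at decoder layer i (1-indexed)."""
    if not 1 <= i < n_layers:
        raise ValueError("decoder layers are indexed 1 .. n_layers - 1")
    # Same-scale encoder map: the plain skip connection.
    sources = [f"En{i}"]
    # All deeper encoder maps except the last encoder layer, each upsampled.
    sources += [f"up(En{k})" for k in range(i + 1, n_layers)]
    # The next deeper decoder output; for the deepest decoder layer this
    # role is played by the last encoder layer.
    deeper = f"De{i + 1}" if i + 1 < n_layers else f"En{n_layers}"
    sources.append(f"up({deeper})")
    return sources


for layer in range(4, 0, -1):
    print(layer, deep_skip_sources(layer))
```

With `n_layers=5`, layer 2 receives four inputs, which matches the example given in the text.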
where $n_0$ represents the number of channels of the input and $n_c$ represents the number of output categories of the segmentation result.
It is worth mentioning that the fifth encoder layer is not necessarily the deepest level: $N$ can be 5, 6, or any other number of layers. It is not clear, however, whether a larger value of $N$ leads to a better segmentation result, since this depends on the type of dataset and the available computing power. In this paper, 5 layers are sufficient for the segmentation task, which keeps the computing requirements modest.

A. LOSS FUNCTION
To get the most out of the proposed ELU-Net, it is necessary to choose a suitable loss function [37], which evaluates the degree of matching between the predicted label of the segmentation result and the true label. Cross entropy loss (CE), a widely used loss function, is given by Eq. (6), where $y_{true}$ represents the true label for the segmentation task. In the loss definitions up to Eq. (14), the remaining symbols denote the sigmoid function, the two hyperparameters of the Tversky loss, the balance factor of the focal loss, and the exponential factor of the focal loss.
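For orientation, the commonly used binary forms of four of the losses named above can be sketched as follows. These are textbook definitions, not the authors' Eqs. (6) to (14), which are not preserved here. The default values `alpha`, `beta`, `gamma` and the guard `EPS` are assumptions of this sketch.

```python
import math

EPS = 1e-7  # numerical guard against log(0) and division by zero (assumption)


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def cross_entropy(y_true, y_pred):
    """Mean binary cross entropy over pixels; y_pred holds probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = _clip(p)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5):
    """1 - TP / (TP + alpha * FP + beta * FN), with soft counts."""
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    return 1.0 - (tp + EPS) / (tp + alpha * fp + beta * fn + EPS)


def dice_loss(y_true, y_pred):
    """Dice loss is the Tversky loss with alpha = beta = 0.5."""
    return tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5)


def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """Cross entropy down-weighted by (1 - p_t) ** gamma and balanced by alpha."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = _clip(p)
        p_t = p if t == 1 else 1.0 - p
        a_t = alpha if t == 1 else 1.0 - alpha
        total -= a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)
```

Setting `gamma=0` and `alpha=0.5` reduces the focal loss to half the cross entropy, which is a convenient sanity check.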

B. DATASET AND EVALUATION METRIC
The brain tumor dataset used in the BraTS 2018 Challenge, a multi-class task, was selected to study the effect of the proposed ELU-Net with the 9 loss functions and their combinations. It contains 210 HGG cases and 75 LGG cases. Slices were extracted for each case according to the size of the brain MRI, excluding those whose corresponding segmentation result contains fewer than 20 pixels classified as ET. Of the extracted slices, 90% were used for training and 10% for testing the ELU-Net.
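The slice selection rule and the 90/10 split can be sketched as below. The helper names are hypothetical. The ET label value 4 follows the usual BraTS labelling and is an assumption here, as is the use of a seeded random shuffle, since the text does not say how the split was drawn.

```python
import random


def keep_slice(mask, et_label=4, min_et_pixels=20):
    """Keep a slice only if its mask (a list of rows) has enough ET pixels."""
    et_pixels = sum(row.count(et_label) for row in mask)
    return et_pixels >= min_et_pixels


def split_slices(slices, train_fraction=0.9, seed=0):
    """Shuffle the slices reproducibly and split them into train and test parts."""
    shuffled = list(slices)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

A slice with exactly 20 ET pixels is kept, because the text excludes only slices with fewer than 20.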
We used the Adam algorithm for optimization. The learning rate was initially set to 1e-4 and was reduced to one tenth of its previous value whenever the evaluation metric on the validation set had not improved for 20 consecutive epochs. The weight decay was set to 0.002.
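The learning-rate rule described above can be sketched in plain Python. The class name is hypothetical, and a higher metric is assumed to be better, as with the Dice coefficient. The text does not state which framework was used; deep learning libraries provide comparable plateau schedulers whose exact patience semantics may differ slightly from this sketch.

```python
class PlateauSchedule:
    """Cut the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-4, factor=0.1, patience=20):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None
        self.stale_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric and return the learning rate to use next."""
        if self.best is None or metric > self.best:
            self.best = metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0  # start counting again at the new rate
        return self.lr
```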
The input of ELU-Net concatenates 4 slices corresponding to the four brain MRI modalities (Flair, T1, contrast-enhanced T1, and T2). Each slice has 3 channels, giving $n_0 = 12$. The output of ELU-Net has 4 channels ($n_c = 4$), corresponding to the background and three categories: enhancing tumor (ET), peritumoral edema (ED), and non-enhancing tumor (NET). These categories constitute the three nested brain tumor sub-regions WT, TC and ET of the image segmentation result; under the standard BraTS convention, WT = ET ∪ ED ∪ NET, TC = ET ∪ NET, and ET is the enhancing tumor alone.
A confusion matrix was introduced to evaluate the results, and its basic parameters are listed in Table I. In Table I, TP is the number of cases in which both the true value and the predicted value are positive. FN is the number in which the true value is positive and the predicted value is negative. FP is the number in which the true value is negative and the predicted value is positive. TN is the number in which both the true value and the predicted value are negative. The Dice coefficient was used as the evaluation metric for each segmentation result; in terms of the quantities in Table I, its standard form is
\[ \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \]
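As an illustration of the evaluation, the sketch below maps per-pixel labels to the nested regions and computes the Dice coefficient from the confusion counts. The label values (1 for NET, 2 for ED, 4 for ET) follow the usual BraTS annotation and are an assumption here; the network's own channel indices may differ. Returning 1.0 when a region is absent from both masks is also a choice of this sketch.

```python
def nested_regions(label, et=4, ed=2, net=1):
    """Return membership of one pixel label in (WT, TC, ET)."""
    return (label in (et, ed, net), label in (et, net), label == et)


def dice(tp, fp, fn):
    """Dice coefficient 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def dice_for_region(true_labels, pred_labels, region_index):
    """Dice for one nested region: 0 = WT, 1 = TC, 2 = ET."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        t_in = nested_regions(t)[region_index]
        p_in = nested_regions(p)[region_index]
        if t_in and p_in:
            tp += 1
        elif p_in:
            fp += 1
        elif t_in:
            fn += 1
    return dice(tp, fp, fn)


true_labels = [0, 1, 2, 4, 4]
pred_labels = [0, 1, 1, 4, 2]
for name, idx in (("WT", 0), ("TC", 1), ("ET", 2)):
    print(name, round(dice_for_region(true_labels, pred_labels, idx), 4))
```

Because the regions are nested, a pixel mislabelled as ED instead of ET still counts as correct for WT but not for TC or ET.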

C. DISCUSSION
First, under the cross entropy loss (CE), we compared the segmentation results obtained for the proposed ELU-Net with Vgg16 and ResNet34. The Dice coefficients of the three nested brain tumor sub-regions on the BraTS 2018 validation dataset are listed in Table II.
TABLE II. THE DICE COEFFICIENTS FOR THE SEGMENTATION RESULT OBTAINED FOR THE ELU-NET WITH VGG16 AND RESNET34
In Table II, the ELU-Net with ResNet34 achieves the better performance, exceeding Vgg16 by 0.03721 on average.
To further exploit the advantage of the ELU-Net with ResNet34, another 8 loss functions and their combinations were compared, and the corresponding Dice coefficients on the BraTS 2018 validation dataset are listed in Table III. As given in Table III, the combination of dice loss and focal loss achieves the best average performance, 0.87095; among the single loss functions, the focal loss achieves the best average performance, 0.86937, followed by the dice loss, 0.86863. For the individual sub-regions, the cross entropy loss, the focal loss, and the combination of dice loss and focal loss show the best performance for WT, TC and ET, with 0.93619, 0.86115 and 0.81850, respectively.
Based on the combination of dice loss and focal loss, a novel loss function, DFK, was designed that combines dice loss, focal loss and KL divergence loss. The Dice coefficients for the three nested brain tumor sub-regions on the BraTS 2018 validation dataset were obtained by training the proposed ELU-Net with Vgg16 and ResNet34 under the DFK loss, and are compared with other state-of-the-art methods in Table IV. Some segmentation results obtained using the ELU-Net with ResNet34 and DFK on the BraTS 2018 validation dataset are shown in Figure 4 and indicate good performance.
As given in Table IV, with the help of DFK, the ELU-Net with ResNet34 achieves the best average performance among the compared state-of-the-art methods. For the individual sub-regions, the ELU-Net with ResNet34 shows a clear advantage on WT, 0.93498, and competitive performance on TC and ET, 0.86023 and 0.81779, which are only 0.00657 and 0.00551 below the best results obtained by the other state-of-the-art methods, while using fewer parameters. According to Eq. (5), the numbers of parameters for the ELU-Net with Vgg16 and ResNet34 are only 644,864 and 1,678,144, respectively.
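The exact DFK formula is not preserved in this text, so the following is only a sketch of how three loss terms might be combined. The unweighted default `weights=(1.0, 1.0, 1.0)`, the helper names, and the direction of the KL divergence (true distribution against predicted distribution) are all assumptions, not the authors' definition.

```python
import math


def kl_divergence(p_true, p_pred, eps=1e-7):
    """KL(p_true || p_pred) for two discrete class distributions at one pixel."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(p_true, p_pred)
        if t > 0
    )


def combined_loss(dice_term, focal_term, kl_term, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of precomputed dice, focal and KL loss values."""
    w_dice, w_focal, w_kl = weights
    return w_dice * dice_term + w_focal * focal_term + w_kl * kl_term
```

Keeping the weights as a parameter makes it easy to test how sensitive the result is to the balance between the three terms.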
In addition, the ELU-Net with ResNet34 performs better than with Vgg16, by 0.03709 on average and by 0.02333, 0.03163 and 0.05632 on WT, TC and ET, respectively, which demonstrates the effectiveness and advantage of the residual network. With the ResNet34 backbone, compared with the combination of dice loss and focal loss, the DFK loss performs better on WT and TC, by 0.00103 and 0.00037, and slightly worse on ET, by 0.00071. In summary, the ELU-Net with ResNet34 and DFK loss shows the better overall performance.
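The averages quoted in this discussion can be checked with a few lines of arithmetic. The per-region values below are copied from the text; only the variable names are introduced here.

```python
# Dice coefficients of ELU-Net (ResNet34 + DFK) on the BraTS 2018 validation data.
dfk_resnet34 = {"WT": 0.93498, "TC": 0.86023, "ET": 0.81779}
# Reported gains of the ResNet34 backbone over Vgg16, per sub-region.
gain_over_vgg16 = {"WT": 0.02333, "TC": 0.03163, "ET": 0.05632}
# Best ET result, reported for the dice + focal combination.
et_dice_focal = 0.81850

mean_dice = sum(dfk_resnet34.values()) / len(dfk_resnet34)
mean_gain = sum(gain_over_vgg16.values()) / len(gain_over_vgg16)
et_gap = et_dice_focal - dfk_resnet34["ET"]

print(round(mean_dice, 5), round(mean_gain, 5), round(et_gap, 5))
```

The three results agree with the reported mean Dice of 87.100%, the average gain of 0.03709, and the ET difference of 0.00071.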
Moreover, using the trained weights of the ELU-Net with ResNet34 and DFK, the Dice coefficients of the three nested brain tumor sub-regions were obtained separately on the HGG and LGG validation data, and are listed in Table V. As given in Table V, the ELU-Net with ResNet34 and DFK performs better on the HGG than on the LGG: on the HGG it exceeds the result on the whole dataset by 0.00812 and 0.02434 on TC and ET, respectively, and falls below it by 0.00456 on WT. For the LGG, however, a better performance on WT is obtained, which exceeds that on the whole dataset by 0.02113.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

IV. COMPARISON WITH A LIVER DATASET
The liver dataset from the ISBI LiTS 2017 Challenge was selected to further validate the effectiveness of the proposed ELU-Net. It contains 131 contrast-enhanced 3D abdominal CT scans, of which 120 volumes were used for training and 11 for testing. For each volume, the 3 most obvious slices were extracted according to the size of the liver. The other hyperparameters were the same as in the brain tumor experiments described above.
The input of ELU-Net is 1 slice with 3 channels (n 0 =3), and the output is 2 channels (n c =2) including the background and liver region.
The proposed ELU-Net with Vgg16 and ResNet34 was quantitatively compared with the other representative state-of-the-art methods based on the Dice coefficients.
Moreover, it is worth mentioning that all results were directly obtained from a single-model test without relying on any post-processing tools and each network was optimized by the loss function proposed in its own article. The input of our segmentation network is directly obtained from the extracted slices without relying on any filtering and enhancement processing. The comparison result is shown in Table VI.
As given in Table VI, the ELU-Net with ResNet34 and DFK greatly improves the performance of the U-shaped network. It has the fewest parameters (1.68M), yet it outperforms the other state-of-the-art network architectures, even those with ResNet101. Its Dice value exceeds that of the U-Net 3+ with ResNet101, Hybrid loss and CGM by 0.615% and that of the U-Net 3+ with Vgg16 by 1.865%, while its parameter count is only 3.86% of that of the U-Net 3+ with ResNet101 (43.55M) and 6.23% of that of the U-Net 3+ with Vgg16 (26.97M), which demonstrates that the ELU-Net is both lightweight and effective. Some segmentation results obtained for the ISBI LiTS 2017 validation dataset are shown in Figure 5 (panels: CT, ground truth, ours with ResNet34+DFK).
In addition, the ELU-Net with Vgg16 and DFK (644,200 parameters) also shows considerable performance, which demonstrates the effectiveness and advantage of our network architecture.
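The parameter ratios quoted from Table VI follow directly from the reported model sizes, as the short check below shows. The dictionary keys and the helper name are introduced here for illustration; the sizes in millions are those given in the text.

```python
# Parameter counts in millions, as reported in the text.
params_m = {
    "ELU-Net (ResNet34)": 1.68,
    "U-Net 3+ (ResNet101)": 43.55,
    "U-Net 3+ (Vgg16)": 26.97,
}


def percent_of(small, large):
    """Express `small` as a percentage of `large`."""
    return 100.0 * small / large


elu = params_m["ELU-Net (ResNet34)"]
for name in ("U-Net 3+ (ResNet101)", "U-Net 3+ (Vgg16)"):
    print(name, round(percent_of(elu, params_m[name]), 2))
```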

V. CONCLUSION
In this study, we proposed a novel ELU-Net with deep skip connections that makes full use of the features from the encoder to realize an efficient and lightweight segmentation network architecture. The Vgg16 and ResNet34 backbone networks were evaluated with many loss functions and their combinations to get the most out of our method, and a new loss function combining dice loss, focal loss and KL divergence loss, DFK, was designed based on the exponential and logarithmic advantages. The experimental results obtained for the brain tumor and liver datasets demonstrate the effectiveness and outstanding performance of the proposed ELU-Net architecture with fewer parameters. The ELU-Net with ResNet34 and DFK shows Dice coefficients of 93.498%, 86.023% and 81.779% for WT, TC and ET, with an average value of 87.100%, for the BraTS 2018 validation dataset, and a value of 97.365% for the ISBI LiTS 2017 validation dataset.