Multi-Task Refined Boundary-Supervision U-Net (MRBSU-Net) for Gastrointestinal Stromal Tumor Segmentation in Endoscopic Ultrasound (EUS) Images

The diagnosis of the risk level of gastrointestinal stromal tumors (GISTs) is of great clinical significance. Radiologists commonly use the morphology of GISTs in endoscopic ultrasound (EUS) images to diagnose the risk level. Hence, accurate segmentation of GISTs in EUS images is a crucial factor influencing the diagnosis. U-net, an elegant network, has been widely used for medical images. However, owing to its plain architecture and complicated up-sampling path, the classical U-net does not perform well when segmenting GISTs in EUS images with diverse sizes, heavy shadows and ambiguous boundaries. Hence, this paper proposes a novel multi-task refined boundary-supervision U-net (MRBSU-net) for GIST segmentation in EUS images. In our network, a multi-task refined U-net (RU-net) is designed to deal with heavy shadows and diverse sizes. The boundary cross entropy in the loss function of the multi-task RU-net boosts the influence of small tumors, and the refinement prevents the noise in EUS images from propagating to the higher-resolution layers. We then design a refined boundary-supervision U-net (RBSU-net) to solve the ambiguous-boundary problem. The boundary supervision in the RBSU-net leads the network to focus on finding the boundary in the down-sampling part and segmenting the region in the up-sampling path. Finally, we place the multi-task RU-net in front of the RBSU-net to increase the stability of the network; the whole is called MRBSU-net. Extensive experiments have been designed to evaluate the performance of the proposed network, including comparisons with the traditional U-net, a generative adversarial network (GAN) and Deep Attentional Features (DAF). The proposed method performs best among all the compared methods, which suggests that it could potentially be used in the clinic.


I. INTRODUCTION
Gastrointestinal stromal tumors (GISTs) are uncommon tumors of the GI tract. They arise from very early forms of specialized cells called the interstitial cells of Cajal [1]. Several imaging modalities are applied in the diagnosis and follow-up treatment of GISTs, such as computerized tomography (CT), magnetic resonance imaging (MRI) and endoscopic ultrasound (EUS). Because it is real-time, non-invasive, inexpensive and radiation-free, endoscopic ultrasound is used to assess patient-specific gastrointestinal structure and function [2]. The risk level plays a significant role in preoperative treatment, as it determines whether a patient should undergo targeted therapy. GISTs are divided into a lower risk group (LRG) and a higher risk group (HRG). In particular, HRG patients should receive targeted therapy before surgery, whereas LRG patients need not. Recently, a computer-aided diagnosis model has been proposed to classify the risk level [2]. However, this model depends on GISTs segmented manually by radiologists, which consumes a great deal of time. Moreover, manual segmentation depends heavily on the experience of the radiologist and varies considerably among observers. Additionally, to quantitatively assess the features of GISTs, the tumors must be differentiated from the background. Hence, an accurate and robust delineation or segmentation model is critical for reducing human intervention in the diagnosis of GISTs.
Owing to the nature of US images, in particular acoustic shadows, speckle noise, simple structure, ambiguous boundaries and poor contrast [3], [4], US images are difficult to segment automatically [5]. Only a limited number of automatic segmentation methods, usually based on machine learning, have been proposed so far. As medical records are increasingly digitalized, machine learning methods that have succeeded on computer vision tasks find an opportune moment in this field [6]. Deep learning methods have demonstrated remarkable accuracy and efficiency in segmenting a range of objects [7], [8], and deep learning has become a useful research approach in many fields; several segmentation methods based on convolutional neural networks (CNNs) have been proposed for medical imaging [9]-[11]. Point-wise networks, such as patch-based CNN pixel classification [12], are time consuming. To overcome this disadvantage, many end-to-end networks have been developed. One of the most popular is the fully convolutional network (FCN), the basis of end-to-end semantic segmentation [13]. O. Ronneberger proposed U-net in 2015 to segment medical images, achieving a profound impact [14]. U-net often achieves state-of-the-art performance on medical image segmentation. To decrease the loss of information during up-sampling, U-net propagates context information to the higher-resolution layers by concatenating earlier feature maps with the current ones. Although this elegant architecture solves the problem of various locations, it also transfers shadow information to the higher-resolution layers. Moreover, in addition to the usual problems of US images, gastro-EUS (G-EUS) images have their own: 1) GISTs have diverse sizes, as shown in Figure 1, which challenges a traditional U-net; the conventional region cross-entropy loss does not work on small tumors.
Obviously, even if such tumors were segmented as background, the effect on the region loss would be tiny.
In this paper, we propose a novel multi-task refined boundary-supervision U-net (MRBSU-net) with a modified loss function to improve the segmentation of GISTs of small size and with ambiguous boundaries in EUS images.
Our main contributions are: 1) A refined U-net (RU-net) is proposed to decrease the influence of heavy shadows. The encoder part of U-net is shallow, and the decoder part has so many feature channels that trashy information is propagated to the higher-resolution layers through the skip connections; for instance, noise and shadows are magnified and propagated in the decoder part [15]. RU-net reduces the concatenation between the earlier layers and the higher-resolution layers, which prevents shadow information from propagating upward. Meanwhile, the convolutional layers in the up-sampling path are cut down.
2) To intensify the influence of the loss function on small tumors, we design a multi-task RU-net that consists of two RU-nets with a shared down-sampling part, a variant of the Deep Contour-Aware Network (DCAN) [16]. A boundary loss is added to the loss function to increase the penalty for misjudging small tumors. Summing the boundary loss and the region loss also brings a bonus: the multi-task RU-net obtains boundary and region information separately without any post-processing.
3) The advantages of fusing the tumor and its boundary are well known [17], but boundary details require deep supervision [18]. Because boundary information is lost along the down-sampling path, we propose a refined boundary-supervision U-net (RBSU-net) that focuses on finding the boundary in the down-sampling part and segmenting the region in the up-sampling path. A boundary-based loss is added in the final encoder part to reinforce the boundary constraint for tumors with ambiguous boundaries.

4) Finally, we concatenate the feature maps from the two branches of the multi-task RU-net before classification and feed them to the RBSU-net to increase the stability of the network.

This paper is organized as follows: Section 2 describes our MRBSU-net method. The materials and experiments are introduced in Section 3. Section 4 presents the experimental results. Discussion and conclusions are given in Sections 5 and 6, respectively.

II. METHOD

A. OVERVIEW
The proposed method aims to segment GISTs from the background. Given the limitations of EUS images, the method should be able to differentiate GISTs from shadows and should not be sensitive to noise. In this section, we describe the architecture of the proposed MRBSU-net in detail, as shown in Figure 3. We start by introducing the U-net for end-to-end training. We then use multi-task contextual features with auxiliary supervision to generate good likelihood maps of GISTs. Finally, we concatenate the two probability maps from the two branches and feed them to a refined boundary-supervision U-net (RBSU-net) block followed by a softmax in the final classification layer.
B. MULTI-TASK RU-net

U-net has achieved state-of-the-art performance on image segmentation, especially medical image segmentation [14]. A network can be trained end to end like U-net, which takes an image as input and outputs the probability map directly. Its elegant architecture propagates context information to the higher-resolution layers, where the missing context is extrapolated from the input image. This solves the problem of various locations. The down-sampling path is designed to extract deep, invisible information, while the up-sampling path computes the score pixel by pixel. However, the shallow depth of U-net and its relatively small receptive field cause poor performance: the shallow depth and the complicated convolutional layers lead to misjudged shadows and blurred boundaries.
Hence, to overcome the limitations of U-net, we design an RU-net that includes an encoder part and a decoder part, as shown in Figure 4. The encoder part contains 5 convolutional blocks with max pooling, which are widely used in CNNs for image classification [12]. Each block consists of 1-3 convolutional layers similar to those in the VGG-16 network [19]. The encoder part extracts features. Different from the original U-net architecture, we simplify the last three up-sampling blocks by replacing them with linear interpolation and remove the convolutional layers in every up-sampling block, which smooths the boundary and avoids learning redundant information. Moreover, a sum operation instead of concatenation combines the conciseness of FCN-8s with the information at the mirrored position of the input image.
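The up-sampling-by-interpolation and sum-based skip connection described above can be sketched as follows. This is an illustrative numpy reconstruction under our own assumptions, not the authors' TensorFlow code; `bilinear_upsample_2x` and `sum_skip` are hypothetical helper names.

```python
import numpy as np

def bilinear_upsample_2x(x):
    """Up-sample a 2-D feature map by a factor of 2 with separable linear interpolation."""
    h, w = x.shape
    rows = np.linspace(0, h - 1, 2 * h)
    cols = np.linspace(0, w - 1, 2 * w)
    # interpolate along rows, then along columns
    tmp = np.stack([np.interp(rows, np.arange(h), x[:, j]) for j in range(w)], axis=1)
    return np.stack([np.interp(cols, np.arange(w), tmp[i, :]) for i in range(2 * h)], axis=0)

def sum_skip(decoder_map, encoder_map):
    """RU-net-style skip: up-sample the decoder map, then add the mirrored encoder map.
    Unlike concatenation, the channel count does not grow, so less noise is carried upward."""
    return bilinear_upsample_2x(decoder_map) + encoder_map

low = np.array([[0.0, 1.0], [2.0, 3.0]])   # coarse decoder feature map
skip = np.ones((4, 4))                      # mirrored encoder feature map
fused = sum_skip(low, skip)                 # fused map, shape (4, 4)
```

The design point is that summation keeps the feature dimensionality fixed, whereas concatenation doubles the channels and gives noisy encoder activations their own pathway into the decoder.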
The classification scores of U-net rely on the intensity information of the given input. However, a network with a single receptive input size cannot deal properly with large differences in shape. Hence, we propose a multi-task learning network instead of treating segmentation as an individual pixel-wise classification problem. As shown in Figure 5, two of the above RU-nets share the encoder part, and two different decoding branches are used for the two tasks, tumor region classification and tumor boundary classification. The RU-net with multi-task features extracted from input A can be trained by minimizing the loss Loss_1, a combination of the losses L_E1 and L_R1:

Loss_1(W) = L_E1(W) + L_R1(W),

where W represents the parameters of the neural network, L_E is the loss of the tumor edge and L_R is the loss of the tumor region. All of these losses are cross-entropy losses:

L_x = - Σ_n Σ_i log P_x(l_x(a_n,i); a_n,i, W),

where L_x represents the loss function of each task, a_n,i is the i-th pixel in the n-th image and P_x(l_x(a_n,i); a_n,i, W) is the predicted probability of a_n,i belonging to class l_x.
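As a minimal sketch of Loss_1 for a single image (numpy only, with `task_cross_entropy` and `multitask_loss` as hypothetical helper names, assuming softmax probability maps as inputs):

```python
import numpy as np

def task_cross_entropy(probs, labels, eps=1e-12):
    """Cross entropy of one task: -sum over pixels of log P(true class).
    probs: (H, W, C) softmax output; labels: (H, W) integer class map."""
    h, w = labels.shape
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.sum(np.log(p_true + eps)))

def multitask_loss(edge_probs, edge_labels, region_probs, region_labels):
    """Loss_1 = L_E1 + L_R1: the boundary and region branches are summed,
    so a misjudged small tumor is penalised by both terms."""
    return (task_cross_entropy(edge_probs, edge_labels)
            + task_cross_entropy(region_probs, region_labels))
```

Because every tumor has a boundary of comparable length regardless of its area, the boundary term gives small tumors a loss contribution that the region term alone would not.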

C. MULTI-TASK RBSU-net (MRBSU-net)
Although the multi-task RU-net handles the various sizes of GISTs, it is still unstable and cannot avoid the loss of boundary information during down-sampling, because the encoder part causes spatial information loss. It therefore remains difficult to segment GISTs with ambiguous boundaries and heavy noise. Thus, we propose an RBSU-net to handle cases with ambiguous boundaries. Owing to the information loss in such a deep network, we add auxiliary boundary supervision in the first up-sampling block of the RU-net to supervise the network deeply, as shown in Figure 6. The architecture is similar to the RU-net: the encoder part contains 5 convolutional blocks with max pooling, and each block includes 1-3 convolutional layers. The auxiliary boundary supervision reinforces the boundary at an early stage so as to give accurate contextual features to the higher-resolution layers. The loss is:

Loss_2(W) = L_E2(W) + L_R2(W).

The overview of the proposed network is shown in Figure 3. To deal with the diverse sizes of GISTs, we place a multi-task RU-net in front of the RBSU-net, which offers good likelihood maps of both the region and the boundary of GISTs. The outputs of the two branches are then concatenated as the input. The attention of layers E1 and E2 is steered towards the boundary, whereas layers R1 and R2 focus on the regions belonging to the tumor. Therefore, the total loss of the proposed network is:

Loss(W) = L_E1(W) + L_R1(W) + L_E2(W) + L_R2(W).

D. DATA PREPROCESSING
Normalization and data augmentation are the two parts of the data preprocessing. All of the data are normalized to the range 0 to 1. The dataset is augmented tenfold by rotation and flipping. All the images are resized to 256 × 256.
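The preprocessing steps above can be sketched as follows. This is an illustrative numpy version; the paper reports a tenfold augmentation, while the rotation/flip group below yields eight variants, so the exact augmentation recipe is an assumption.

```python
import numpy as np

def normalize01(img):
    """Scale image intensities into [0, 1]."""
    img = img.astype(float)
    return (img - img.min()) / (img.max() - img.min() + 1e-12)

def rot_flip_augment(img):
    """Rotation and flip variants of one image (the 8 dihedral transforms)."""
    rots = [np.rot90(img, k) for k in range(4)]
    return rots + [np.fliplr(r) for r in rots]

raw = np.arange(16, dtype=float).reshape(4, 4) * 10.0
norm = normalize01(raw)          # values now span [0, 1]
batch = rot_flip_augment(norm)   # 8 augmented copies of the normalized image
```

Resizing to 256 × 256 would be an additional interpolation step applied to each augmented image.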

E. IMPLEMENTATION DETAILS
The MRBSU-net is trained with an optimizer using a learning rate of 10^-4, a batch size of 8 and 200 epochs. All the kernels are initialized randomly with values from 0 to 1.
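These hyper-parameters can be collected as a small configuration sketch. The optimizer itself is not named in the text, so its choice (e.g. Adam) would be an assumption; `TRAIN_CONFIG` and `init_kernel` are hypothetical names.

```python
import numpy as np

# Hyper-parameters reported in the text; the optimizer is not named in the paper,
# so treating it as Adam (or any specific optimizer) would be an assumption.
TRAIN_CONFIG = {"learning_rate": 1e-4, "batch_size": 8, "epochs": 200}

rng = np.random.default_rng(seed=0)

def init_kernel(shape):
    """Random kernel initialization with values drawn uniformly from [0, 1)."""
    return rng.uniform(0.0, 1.0, size=shape)

k = init_kernel((3, 3, 16))  # e.g. a 3x3 kernel with 16 output channels
```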

III. MATERIALS AND EXPERIMENTS

A. MATERIALS
In this study, 905 patients underwent biopsy and gastroscopy for GIST detection from 2008 to 2018 in 19 different hospitals [2]. The G-EUS images were collected from one of the following ultrasound machines: Olympus EU-ME1, Olympus EU-ME2, Fuji SU9000, Fuji SU8000 and Olympus Alpha 10. All patients provided written informed consent for their imaging and clinical data to be donated for research [2]. Basic information on all patients was recorded in the information center.

B. EVALUATION INDEX
To evaluate the proposed approach, 545 test images are used. The Dice similarity coefficient (DSC), mean absolute deviation (MAD), and Hausdorff distance (HD) are applied to evaluate the results of the proposed method [15].
DSC is a statistic used to gauge the similarity of two samples by measuring the overlap between the segmentation result and the ground truth:

DSC = 2 n(A ∩ B) / (n(A) + n(B)),

where n(·) is the total number of pixels in an area, A is the segmented area of the segmentation result and B is the segmented area of the ground truth. The MAD and HD are defined as the average and the maximum of the surface distance errors (SDE) over the whole field, respectively:

MAD = (1/2) [ (1/N_X) Σ_{x∈X} d(x, Y) + (1/N_Y) Σ_{y∈Y} d(y, X) ],

HD = max { max_{x∈X} d(x, Y), max_{y∈Y} d(y, X) },

where x and y are points in the regions X and Y, d(x, Y) is the minimum distance from point x to the region Y, d(y, X) is the minimum distance from point y to the region X, and N_X and N_Y are the sizes of regions X and Y, respectively.
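A minimal numpy implementation of these three metrics for binary masks might look as follows (an illustrative sketch using brute-force boundary distances, not an optimized implementation; `dsc`, `boundary_pixels` and `mad_hd` are our own helper names):

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient: 2 * n(A and B) / (n(A) + n(B))."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def boundary_pixels(mask):
    """Coordinates of foreground pixels with at least one background 4-neighbour."""
    m = np.pad(mask.astype(bool), 1)
    core = m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:]
    return np.argwhere(mask.astype(bool) & ~core).astype(float)

def mad_hd(a, b):
    """Mean absolute deviation and Hausdorff distance between mask boundaries."""
    pa, pb = boundary_pixels(a), boundary_pixels(b)
    d = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(axis=-1))
    d_ab, d_ba = d.min(axis=1), d.min(axis=0)   # d(x, Y) and d(y, X)
    mad = 0.5 * (d_ab.mean() + d_ba.mean())
    hd = max(d_ab.max(), d_ba.max())
    return mad, hd
```

Identical masks give a DSC of 1 and zero surface distances, while any misalignment of the contours increases MAD and HD.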

C. SEGMENTATION EXPERIMENTS ON DIFFERENT IMPROVEMENTS AND METHODS
To illustrate the effect of each improvement and assess the segmentation quality, the following improved methods and other deep-learning-based methods are compared:
1. U-net
2. Refined U-net (RU-net)
3. RU-net with multi-task contextual features (multi-task RU-net)
4. RBSU-net
5. Generative adversarial network (GAN) [20]
6. Deep attentional features (DAF) [21]
All experiments are implemented in MATLAB R2015b and Python 3.5 on a 3.06-GHz Intel(R) Xeon(R) CPU and an Nvidia Titan Xp GPU. The U-net, RU-net, multi-task RU-net, RBSU-net and MRBSU-net are trained and tested in TensorFlow.

D. CLINICAL GISTs CLASSIFICATION EXPERIMENTS
To verify the practicability of the proposed method, we classify GISTs using the segmentation results of the proposed model. Differentiating HRG GISTs from LRG GISTs is of great significance, and this part illustrates the influence of the segmentation method on clinical diagnosis. 439 high-throughput features are designed to characterize GISTs in EUS images [2]. The tumor boundaries are defined first and the features are then extracted; hence, all of these depend on the segmentation result. In this part, we extract features from the outputs of the proposed method. A least absolute shrinkage and selection operator (LASSO) model is again used to select features. After that, a random forest (RF) classifier is applied to differentiate HRG cases from LRG ones. Cross-validation is included in the training stage. Because of the unbalanced dataset, we randomly chose 30 LRG cases and 24 HRG cases as the validation set, and the others were used as the training set. The experiment was bootstrapped 10 times. More details can be found in our previous work [4].
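A hedged sketch of this feature-selection-plus-classification pipeline, using scikit-learn on synthetic stand-in data: the real 439 EUS features, the LASSO penalty and the selection threshold are not specified in the text, so the `alpha` value, the forced 10-feature cap and the random data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in for the 439 high-throughput features of the 54 validation cases
# (30 LRG + 24 HRG); real features would come from the segmented tumors.
X = rng.normal(size=(54, 439))
y = np.array([0] * 30 + [1] * 24)            # 0 = LRG, 1 = HRG

# LASSO-based selection, capped at 10 features as later reported in the results
selector = SelectFromModel(Lasso(alpha=0.01, max_iter=10000),
                           max_features=10, threshold=-np.inf).fit(X, y)
X_sel = selector.transform(X)                # shape (54, 10)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_sel, y)
pred = clf.predict(X_sel)
```

In practice the classifier would be fitted on the training split with cross-validation and evaluated on the held-out validation cases, as the text describes.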
In this step, about three-fourths of the patients at each of the four risk levels from each hospital are again selected as the training set [2]. The dataset is divided into another 3 parts.

IV. RESULTS

The model becomes stable with a low loss after 200 epochs, as shown in Figure 7. TABLE 3 shows the evaluation indices of the proposed MRBSU-net and the compared methods. Among the traditional deep learning methods, U-net, the state-of-the-art network, performs best.

A. OVERVIEW
In TABLE 3, the proposed method achieves a mean DSC of 0.92±0.056, a mean HD of 11.72±10.73 pixels and a mean MAD of 3.14±2.37 pixels, the best results on all three metrics among the compared methods. The multi-task RU-net achieves a mean DSC of 0.88±0.15, a mean HD of 15.89±18.83 pixels and a mean MAD of 5.12±11.24 pixels, while the RBSU-net achieves a mean DSC of 0.85±0.21, a mean HD of 14.73±24.69 pixels and a mean MAD of 6.56±14.14 pixels. All of the P values are on the order of 10^-10. These details in TABLE 3 show that the multi-task improvement enhances the DSC, which reflects the overall segmentation performance, while the boundary supervision helps to find the region of interest. The timing results in TABLE 3 show that the proposed model costs only 0.09 second more per image in the testing stage. Hence, we concatenate the two branches of the multi-task RU-net and feed them to the RBSU-net. TABLE 3 also reports the metrics of the proposed model and the other improved methods. Compared with U-net, RU-net increases the DSC by 10% and decreases the HD and MAD by 15.57 and 5.99, respectively. Compared with RU-net, the multi-task RU-net and the RBSU-net improve all three metrics to different degrees. The DSC rises by 6% for the multi-task RU-net and by 3% for the RBSU-net. The HD is reduced by 7.11 by the RBSU-net and slightly less, about 5.95, by the multi-task RU-net, whereas the MAD improves more for the multi-task RU-net, dropping by 2.69 against 1.25 for the RBSU-net. MRBSU-net, which combines the multi-task RU-net with the RBSU-net, performs best, with a mean DSC of 0.92±0.056, a mean HD of 11.72±10.73 pixels and a mean MAD of 3.14±2.37 pixels. Figure 8 and Figure 9 show five examples of small and normal size, respectively, and illustrate the segmentation results of the proposed method and the other improved methods; the proposed method performs best among all the methods. The shadows in the outputs of RU-net are clearly reduced, but misjudgments still exist (Figure 9), and Figure 8 shows that these methods remain unstable on small tumors.

B. EFFECTS OF THE CONCATENATION AND BOUNDARY-SUPERVISION
To illustrate the effects of the concatenation and the boundary supervision, we list the outputs of each up-sampling block of the RU-net, the advanced RU-net, the RBSU-net and the MRBSU-net (Figure 10). For a case that RU-net cannot handle accurately (Figure 10), the multi-task RU-net eliminates the shadow in the second up-sampling block, while the RBSU-net eliminates it already in the first up-sampling block. Combining the multi-task RU-net and the RBSU-net increases the stability of the network. Moreover, we also show the outputs of layers E1, R1 and R2 of the proposed MRBSU-net (Figure 11) to demonstrate the effect of the concatenation. The heat of the misjudged part quickly returns to the background level, proving that the multi-task RU-net also works at removing shadows. All the outputs are presented as heat maps to make the effects clearer.

D. CLINICAL GISTs CLASSIFICATION RESULTS
We distinguish 32 HRG cases from 145 LRG cases. This part compares the results segmented by the proposed method with those segmented manually. The verification experiment on the proposed method uses the same features and classifier as for manual segmentation. 10 features remain after feature selection. With the RF classifier, the accuracy, the area under the receiver operating characteristic (ROC) curve (AUC), the sensitivity and the specificity reach 0.791, 0.808, 0.750 and 0.800, respectively. The classification results are shown in TABLE 5, and the ROCs of the two segmentations are shown in Figure 12. Although the metrics are still lower than those obtained with manual segmentation, these results indicate that the method could be useful as a starting point for manual segmentation, reducing the effort required from radiologists. The classification results show that the proposed method is of diagnostic value.

V. DISCUSSION
We propose a novel MRBSU-net method to segment GISTs automatically in EUS images by connecting a multi-task RU-net to an RBSU-net. The concatenated input and the boundary supervision play an important role in the performance improvement, especially in images with heavy shadows. Moreover, the clinical application also performs well. Although the DSC of U-net is a little lower than that of the GAN [20], the other two metrics of U-net, HD and MAD, are much better than those of the GAN, and all metrics of U-net and the GAN are better than those of DAF [21]. These results confirm that U-net performs well on medical images, providing the basis for our improvements of U-net.
The RU-net shows an obvious improvement over U-net in Figure 8 and Figure 9: it eliminates the interference of shadows. Compared with U-net, the performance of RU-net shows that reducing the convolutional layers refines the boundary, making it smoother. Owing to its plain architecture, U-net does not perform well on EUS image segmentation: compared with RU-net, the down-sampling path of U-net is not deep enough to extract features, and the up-sampling path has too many channels, which propagates the noise in EUS images. However, RU-net is not stable on small tumors (Figure 8). For small tumors, RU-net cannot differentiate tumors from the background; for images with heavy shadows, RU-net cannot distinguish the tumor from heavy noise. Clearly, the multi-task RU-net and the RBSU-net have a certain effect on small tumors owing to the boundary supervision in the loss function, as shown in Figure 8. Adding the boundary cross entropy to the loss function of the multi-task RU-net boosts the influence of small tumors when they are mistaken for background. Compared with the weight of the region cross entropy, that of the boundary cross entropy is much higher, which helps differentiate small tumors from the background. Moreover, it eliminates small false positives, as shown in Figure 9. The boundary supervision in the RBSU-net leads the network to focus on finding the boundary in the down-sampling part and segmenting the region in the up-sampling path; hence, it works on large false positives, as shown in Figure 9. However, both methods are still unstable on small tumors. The fifth example in Figure 8 illustrates a typical case of over-supervision by the RBSU-net, while RU-net mistakes the noise near the contour for the tumor.
The proposed method concatenates the two branches of the multi-task RU-net and feeds them to the RBSU-net, on the basis of the empirical evidence in Figure 8, to increase the stability of the segmentation on small tumors and to resolve this over-supervision. Column 5 in Figure 9 shows that the RBSU-net does not work on tumors with internal noise, especially noise that resembles a boundary. This explains, from another perspective, the complementary effect of combining the multi-task RU-net and the RBSU-net. Figure 8 and Figure 9 show the segmentation results of five examples of small and normal-size GISTs. The segmentation performance on tumors with clear boundaries is almost identical among the five methods, but in the other cases the results diverge considerably. The MRBSU-net is more sensitive to the intensity variation adjoining the contour, owing to the localization information it learns. Hence, the proposed model has the best segmentation accuracy and robustness among all the methods.

B. EFFECTS OF THE CONCATENATION AND BOUNDARY-SUPERVISION
To illustrate the effect of the concatenation, we calculate the DSC of the RBSU-net and the MRBSU-net, listed in TABLE 3. These results indicate that the outputs assisted by the multichannel input are much closer to the ground truth. The number of missed detections, the T-test and the P value in TABLE 3 also show the significant improvement of the proposed model. The comparison of testing time per image among all methods in TABLE 3 and TABLE 4 illustrates that the proposed method achieves better performance at a low time cost, which lays the foundation for practical application. The improvement from the multichannel input can be seen clearly in Figure 10: the output of the second up-sampling block of the MRBSU-net has already located the tumor, while that of the RBSU-net has not. These improvements are also confirmed by the classification experiment, as shown in TABLE 5, in which the AUC, accuracy, sensitivity and specificity of the MRBSU-net method reach higher values, showing that the proposed method could be used in the clinic. All of this proves the effect of the concatenation. Although the results of the automatic segmentation method are slightly worse than those of manual segmentation, they still meet the requirements of clinical application. Compared with the tedious and time-consuming manual segmentation by radiologists, the automatic segmentation method does have significant practical value.
In TABLE 3, the results of the RBSU-net are all better than those of the RU-net, which shows the effect of the boundary supervision. Figure 11 shows the outputs of three important nodes, the two outputs from the branches of the multi-task RU-net and the final output, as heat maps. These figures show how the MRBSU-net method works and the effect of every improvement.

C. CLINICAL GISTs CLASSIFICATION APPLICATIONS
We are glad to see that the classification based on the MRBSU-net segmentation performs well on such an unbalanced dataset, notwithstanding that the metrics are still lower than those obtained with manual segmentation [2] because of several missed cases, as shown in TABLE 3 and TABLE 4. The experiments demonstrate that the proposed method produces good diagnostic performance in the classification of GIST risk levels. This suggests that the method could be used in practice, meaning that doctors could replace the tedious manual segmentation with a valuable reference produced by the proposed method. Handling the remaining extreme cases will be our next challenge.

VI. CONCLUSION
In this study, we propose an automatic segmentation method for EUS images, combining a multi-task RU-net with an RBSU-net, named MRBSU-net. Compared with U-net and other improved U-nets, the MRBSU-net has the best performance on all quantitative indices, DSC, MAD and HD. The multi-task RU-net handles the various sizes of GISTs by intensifying the influence of the loss function on small tumors. The concatenation provides an initial boundary and an initial tumor region, which is more efficient than a single-channel input. By minimizing the boundary-supervision energy function, the network focuses on finding the region of interest. As a result, the network performs better and more stably on cases with ambiguous boundaries, shadows and size variation by making use of both the multi-task RU-net and the RBSU-net. Moreover, the segmentation results perform well in the tumor classification experiment.
However, in some extreme cases, such as tumors with complicated internal texture, the proposed method cannot segment well. Our future work will focus on improving the structure of the deep learning network to handle all GIST cases in EUS.

YUANYUAN WANG received the B.Sc., M.Sc., and Ph.D. degrees in electronic engineering from Fudan University, Shanghai, China, in 1990, 1992, and 1994, respectively. From 1994 to 1996, he was a Postdoctoral Research Fellow with the School of Electronic Engineering and Computer Science, University of Wales, Bangor, U.K. In May 1996, he returned to the Department of Electronic Engineering, Fudan University, as an Associate Professor. He was promoted to Full Professor in May 1998. He is currently the Director of the Biomedical Engineering Center, Fudan University. He is the author or coauthor of six books and 500 research articles. His research interests include medical ultrasound techniques and medical image processing.