Contour-Aware Polyp Segmentation in Colonoscopy Images Using Detailed Upsampling Encoder-Decoder Networks

Colorectal cancer has become one of the most common causes of cancer mortality worldwide, with a five-year survival rate of over 50%. Additionally, the potential of some common polyp types to progress to colorectal cancer is considered high. Colonoscopy is the most common method for finding and removing polyps. However, during colonoscopy, a significant number of polyps are missed as a result of human error. Thus, this study was primarily motivated by the need to obtain an early and accurate diagnosis of polyps detected in colonoscopy images. In this paper, we propose a new polyp segmentation method based on an architecture of multi-model deep encoder-decoder networks called MED-Net. Not only does this architecture obtain multi-level contextual information by extracting discriminative features at different effective fields-of-view and multiple image scales, it also upsamples more accurately to produce better predictions. It is also able to capture more accurate polyp boundaries by using multi-scale effective decoders. Moreover, we present a complementary strategy for improving the method's segmentation performance based on a combination of a boundary-aware data augmentation method and an effective weighted loss function. The purpose of this strategy is to allow our deep learning network to sequentially focus on poorly defined polyp boundaries, which are caused by the non-specular transition zone between the polyp and non-polyp regions. To provide a general view of the proposed method, our network was trained and evaluated on four well-known datasets: CVC-ColonDB, CVC-ClinicDB, ASU-Mayo Clinic Colonoscopy Video Database, and ETIS-LaribPolypDB. Our results show that MED-Net significantly outperforms state-of-the-art methods.


I. INTRODUCTION
According to the report of the American Cancer Society [50], the number of newly diagnosed colorectal cancer cases in the United States was approximately 143,460 in 2012, and this number is increasing quite rapidly every year. In addition, according to a study by the Spanish Association against Cancer, between 28,500 and 33,800 new cases are diagnosed each year, around 20,000 in men and 14,000 in women [46]. Moreover, colorectal cancer is the fourth leading cause of cancer deaths worldwide. Thus, the objective of early colorectal cancer detection is to increase the survival rate and improve cancer control systems [47].
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.
In the last two decades, colonoscopy technology has made noteworthy contributions to the development of diagnosis systems for colorectal diseases [59]. Colonoscopy is used to screen patients to detect inflamed tissue and abnormal growths such as tiny polyps. Since the potential of some common polyp types to progress to colorectal cancer is considered high, colonoscopy is a useful method in that it enables physicians to excise polyps and analyze them to detect signs of colorectal cancer. Moreover, polyps are dangerous because the resulting cancer is frequently fatal if discovered in the late stages. Besides, colorectal cancer researchers claim that these polyps frequently do not cause symptoms due to their size. Thus, colonoscopy is highly recommended as one of the best diagnostic and therapeutic tools for detecting and preventing colorectal cancer. However, its sensitivity and specificity are not 100%, and physicians can therefore miss 17% to 28% of colorectal polyps during colonoscopy [50]. Even when patients undergo more than one diagnostic test, the polyps in their colon may remain undetected. Undetected polyps are usually smaller than 9 mm in diameter, a size that endoscopists cannot easily see and detect. An additional reason why some polyps are not detected is that they are located in the danger area of the left colon or even behind a fold. They may also be too flat and blurred in appearance to be observed visually. Hence, more than 60% of colorectal cancer cases are attributed to missed polyps [37]. These polyps can develop into malignant tumors, and therefore missing them can increase the patient's risk of developing cancer. Besides, the field of semantic segmentation has recently made remarkable contributions to the development of scene understanding and object recognition [57]. Thus, the objective of this study was to develop an automatic polyp segmentation tool that can help clinical endoscopists detect tiny and flat polyps more effectively.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Computer-aided polyp segmentation systems can reduce the endoscopic miss rates of colorectal polyps and help clinical endoscopists localize the most complex colon polyps [23]. Such a system also supports the diagnosis procedure and can reduce the operation cost significantly. Based on this computer-aided system, more reliable classification and segmentation methods have been developed for the assessment of different polyp types [28]. These methods can automatically segment polyp regions into different categories, and their performance rate is high. Therefore, many researchers have developed fast and precise polyp segmentation algorithms for providing early indications of colorectal cancer. However, their results are still far from meeting clinical requirements. The failures are related mainly to the appearance of polyps, which have a large variety of sizes and shapes. Moreover, the non-specular transition zone between a polyp and its surrounding area does not show a significant change in texture or color that would enable endoscopists to distinguish it from all other non-polyp regions. Furthermore, the boundaries of wrinkles and folds in the colon are similar to those of polyps [hoerter2020artificial]. To solve these main problems, we focused on building a deep convolutional neural network to generate discriminative features that are focused on polyp boundary regions and tiny polyps.
Over the last few years, deep fully convolutional neural networks (FCNNs) [12], [36], [42], [57] have led to dramatic developments in semantic segmentation research, which aims to recognize and understand the content of an image at the pixel level [65], [66]. FCNNs [36] have also become a dominant research direction for improving polyp segmentation because of their computational efficiency for discriminative feature extraction. They can automatically generate segmentation maps for polyps of any size. Ronneberger et al. [45] improved FCNNs by introducing a new deep learning network, namely U-Net, which integrates an FCNN into an encoder-decoder architecture. This network achieved noteworthy results for segmenting biomedical images. Chen et al. [13] proposed investigating contour information to simultaneously segment and separate clustered objects. Because the contour features are emphasized, the objects are separated much better, and thus the segmentation performance is significantly improved [21]. Their results show the important role of a model that learns contours in cell, tumor, or organ segmentation. These approaches inspired us to find a better solution by utilizing the advantages of a model that explicitly learns polyp boundaries to achieve accurate polyp segmentation. However, the polyp segmentation task has its own challenges that are related to a polyp's nature and environment. These challenges motivated us to develop a novel multi-model deep encoder-decoder network, called MED-Net, that is focused on tackling the main issues of polyp segmentation.
The first issue of polyp segmentation concerns polyps' wide variety of shapes, sizes, colors, and textures. First, the polyp size has a direct relation with the miss rates of colorectal examinations, because doctors usually cannot easily evaluate small adenomas that are smaller than 9 mm in diameter. In addition, the physical size of polyps is not easy to estimate, because it depends on the distance between the polyps and the colonoscopy camera [18]. If the doctor accidentally misses a small polyp, as mentioned above, it can later develop into dangerous colorectal cancer. Second, colon polyps can appear in different shapes, for example, as radially protuberant or flat polyps. Flat polyps are frequently attached to the colon wall, which makes them difficult to detect, because their boundaries are not clear in many cases. Finally, and not least important, polyps may appear in different shades and colors because of the uncontrolled illumination from the camera moving in the colon. These differences render computer-aided polyp detection algorithms considerably less effective in real environments. Thus, our proposed MED-Net is designed such that its inputs are multi-resolution images. It consists of a cascade architecture of dilated convolutions and includes an effective decoder module. This network architecture was inspired by a previous deep encoder-decoder network [16], the DeepLabv3+ network [14], and our previous work [38], with the advantage of the upsampling method of Tian et al. [54] for segmenting objects. First, the cascade architecture of dilated convolutions is used at the end of our network to extract multi-scale context information in local regions without increasing the number of training parameters [67]. This architecture can also effectively recover detailed information related to polyp boundaries that is lost when the data pass through many convolution and pooling layers. Second, because the size and shape of a polyp can vary across images, we apply multi-scale input images to the same deep convolutional neural network to search for multi-scale objects [56]. These two techniques in combination can enlarge the receptive field size without resulting in a loss of important information. Finally, the decoder module gradually recovers sharp polyp boundaries.
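The claim that a cascade of dilated convolutions enlarges the receptive field without adding parameters can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming stride-1 layers with square kernels; the function names and the example rates are hypothetical and are not taken from MED-Net.

```python
from typing import Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Extent covered along one axis by a single dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def cascade_receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stride-1 dilated convolutions applied one after another."""
    field = 1
    for dilation in dilations:
        # each layer widens the field by (effective kernel extent - 1)
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


def weights_per_layer(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Kernel weights of one 2D convolution; the dilation rate does not appear."""
    return kernel_size * kernel_size * in_channels * out_channels


# Three 3x3 layers: plain convolutions versus an assumed 1-2-4 dilation cascade.
plain = cascade_receptive_field(3, [1, 1, 1])
dilated = cascade_receptive_field(3, [1, 2, 4])
```

With these assumed rates, the dilated cascade covers a wider window than three plain 3x3 layers, while `weights_per_layer` returns the same count for both because the dilation rate is not one of its inputs.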
The second issue of polyp segmentation concerns the poorly defined polyp boundaries caused by the non-specular transition zone between the polyp and non-polyp regions. The improved U-Net network proposed by Chen et al. [13] for gland segmentation inspired us to find a better segmentation method for polyps by exploring the complementary features of their boundaries. However, gland contours are very distinct and can be easily predicted and segmented by Chen et al.'s improved U-Net network, whereas polyp boundaries are poorly defined and can render this network considerably less effective in the colon environment. For this reason, instead of masking polyp boundaries by fixed contours to explore the complementary information, as mentioned in [13], we randomly mask polyp boundaries and their neighboring regions, which do not differ widely. These masked regions provide important information for segmenting polyps precisely. Moreover, our proposed augmentation technique allows hundreds of boundary-aware polyp patterns to be generated randomly from each training polyp image. Thus, instead of extracting only one contour for each training image, as presented in [13], our deep learning network can focus on extracting the most important features in many different parts of the polyp boundaries for each training image. In addition, so that our proposed network is focused on extracting useful features for segmenting polyp boundaries, we propose using two boundary-aware loss functions for training our polyp boundary segmentation method. The first loss function is a new adaptive weighted loss function for focusing on learning the important parts of a polyp, such as its boundaries. The second loss function is an attention-based loss function that allows the network to focus more on the most relevant local regions that include polyps and their boundaries. This function is computed by minimizing the Euclidean distance between the polyp ground truth and its predicted regions. As a result, the predicted polyp boundary can coincide with its ground truth. After the training processes, our encoder-decoder network can achieve a considerably better accuracy rate because of the effectiveness of these loss functions.
The third issue of polyp segmentation is recovering the prediction resolution, which is considered a major obstacle to achieving more accurate predictions. Furthermore, by observing a clear and detailed result, clinicians can make better decisions and thereby increase patients' survival rate. To provide a better prediction mask, instead of predicting each pixel from a local receptive field with parameters shared across spatial locations, Liu et al. [35] proposed fully-connected fusion, which uses fully connected (fc) layers that have different properties. Fc layers are location sensitive, since predictions at different spatial locations are produced by different sets of parameters. Hence, fc layers can adapt to these locations. Moreover, the prediction at each location is made with global information.
Fully convolutional networks (FCNNs) [36] have achieved great success in dense pixel prediction applications such as semantic segmentation. Although the shared convolutional computation makes training and inference computationally very efficient, this approach still has drawbacks due to its several stages of strided convolutions [60]. The latter can cause the loss of fine image structure information and thus poor predictions. To overcome the shortcomings of simple upsampling methods, Chen et al. [14] applied atrous convolutions to achieve large receptive fields while still maintaining a higher-resolution feature map. The encoder part of DeepLabv3+ is used to extract information-rich features, which are meaningful for medical images, especially in colorectal cancer segmentation, and the decoder fuses low-level features to capture the fine-grained information lost by the convolution and pooling operations in CNNs. To produce the best performance, Chen et al. [14] reduced the overall stride of the encoder by a factor of four and fused features at a downsampling ratio of 4. As a result, this design can lower the performance of the combination of features aggregated in the decoder. Because it applies simple bilinear upsampling, the method of Chen et al. [14] has limited capability in recovering the pixelwise prediction accurately. Bilinear upsampling does not take into account the correlation among the predictions of each pixel, since it is data independent [54]. Furthermore, it is oversimple and has an inferior upper bound in terms of reconstruction. Besides, because of the massive computation in both the encoder and decoder paths, common networks usually have slow inference [4]. These flaws inspired us to propose a better method for assisting doctors. First, we adopted the DeepLabv3+ encoder, which employs multiple atrous convolutions [24], [41], to extract features from images at numerous resolutions; we then combined it with a new upsampling technique and, finally, with two loss functions to produce the best performance.
We tested our proposed algorithm and its competitors on a challenging dataset to evaluate the performance of our polyp segmentation method. The results of extensive experiments show that our algorithms significantly outperform state-of-the-art algorithms. In particular, our paper presents several novel properties of our proposed method:
• We propose a novel encoder-decoder architecture, namely MED-Net, which extracts the most useful visual features from multi-scale image inputs and recovers the full-resolution prediction by applying a data-dependent upsampling method.
• We introduce a new boundary-focused data augmentation method for randomly generating a high number of boundary-aware polyp patterns from each training image. This method contributes to the improvement of MED-Net.
• We propose a new adaptive weighted loss function to boost the segmentation performance of MED-Net.
• We present an attention-based loss function that allows the network to focus more on the polyps and their boundaries. The combination of the loss functions leads to a better performance, because the network can focus on iteratively learning polyp boundaries.
The remainder of this paper is organized as follows. In Section II, we briefly present some related state-of-the-art algorithms for polyp segmentation, which motivated our research. In Section III, we describe our proposed method of boundary-focused data augmentation and the processing of MED-Net. In this section, we also present our two novel loss functions. In Section IV, the experimental results obtained on challenging databases are presented. We conclude this paper, outlining our intentions for future work, in Section V. A discussion of the shortcomings of our method is presented in Section VI.

II. RELATED WORKS
In this section, we briefly review and discuss the state-of-the-art algorithms of colorectal cancer segmentation and related fields.
Recently, computer-aided diagnosis (CAD) has been rapidly developed to assist doctors in diagnosing patients faster and more accurately at many hospitals. In general, CAD is used to provide objective results to facilitate medical image diagnosis. One of the major CAD applications is the precise segmentation of cancer tumors, organs, and polyps. The CAD system can also be incorporated into the diagnostic process of polyp detection and segmentation to decrease inter-observer variation, effectively provide biopsy recommendations, and reduce unnecessary false-positive biopsies. There have been a number of research efforts in the field of polyp segmentation using CAD systems [6].
Conventional CAD systems use image processing methods for segmenting polyps automatically. Bernal et al. [8] employed a watershed algorithm to segment and classify polyp candidate regions. Jia [29] used a K-means clustering method to segment and localize polyp contours. For obtaining accurate boundaries and segmentations of polyps in each image, Breier et al. [11] presented two different approaches based on active contours and active rays, which can achieve a high segmentation performance. They also used the Chan-Vese segmentation method to improve their segmentation performance. Ganz et al. [20] proposed utilizing the contour information to segment polyps precisely. They used the Hough transform technique to detect the region of interest (ROI) before using an ultrametric contour map (UCM) for polyp segmentation. To increase the sensitivity of their method, they applied an ellipse fitting algorithm to extract the polyp boundaries in each image. This method can achieve both a high specificity and a high sensitivity on some challenging datasets where the polyp is present in the center of the image. An ellipse fitting algorithm was also effectively used by Hwang et al. [26] to handle the false positive problem by setting valid polyp boundaries. Besides, Akbari et al. [2] proposed a polyp segmentation method comprising two main stages. In the first stage, the authors generated candidate regions of probable polyps with an FCNN-8S network [2]. In the second stage, they used Otsu thresholding and selected the largest connected component to segment the polyp region among all candidate regions.
Some researchers have developed polyp segmentation methods based on extracting features from image patches. Each patch is then classified into polyp and non-polyp classes. Tajbakhsh et al. [52] developed a new algorithm to extract oriented patches based on edge maps and classify them into polyp and non-polyp classes using a two-stage random forest classifier. To address the wide variety of polyp shapes, Tajbakhsh et al. [52] presented an ensemble of three convolutional neural networks to classify input patches based on color, texture, and shape features, respectively.
Over the last few years, FCNNs [36] have been shown to be among the best algorithms for improving semantic segmentation because of their computational efficiency for dense prediction. Most state-of-the-art methods for semantic image segmentation using FCNNs are based on the idea of adding convolutional layers at the end of networks instead of using any fully connected layers. Chen et al. [13] combined an FCNN with a fully connected conditional random field to sharpen object boundaries and improve segmentation performance. However, one of the drawbacks of FCNNs is their low-resolution output responses. To obtain higher-resolution predictions, we improved on the FCNN by combining a cascade architecture of dilated convolutions with a deep network architecture for multi-resolution image inputs. Our network can obtain multi-scale information, recover the spatial resolution, and increase the field-of-view. Chen et al. [14] improved their network by using convolution filters and pooling operations at multiple rates and multiple effective fields-of-view to extract better multi-scale contextual information. They then used a second network to refine the segmentation regions, gradually recover the spatial information, and sharpen object boundaries. In the field of polyp segmentation, Zhang et al. [63] combined a DCNN with a random forest classifier to segment potential polyp regions. Akbari et al. [2] also used an FCNN to find potential polyp candidates and then segmented polyp regions by using the Otsu thresholding method, which improves segmentation accuracy. In the first stage of post-processing, they use Otsu thresholding to convert the probability map produced by FCNN-8S into a binary image; they then find the largest connected component and consider it the most probable location of the polyp in the colonoscopy image.
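The post-processing attributed to Akbari et al. [2] above (Otsu thresholding of a probability map, then keeping the largest connected component) can be illustrated with a short NumPy sketch. This is not their implementation: the function names, the 256-bin histogram, and the choice of 4-connectivity are assumptions made for illustration.

```python
from collections import deque

import numpy as np


def otsu_threshold(prob_map: np.ndarray, bins: int = 256) -> float:
    """Threshold in [0, 1] that maximizes the between-class variance."""
    hist, edges = np.histogram(prob_map.ravel(), bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2.0
    weight_bg = np.cumsum(hist).astype(float)      # pixels in bins 0..k
    weight_fg = weight_bg[-1] - weight_bg          # pixels in bins k+1..end
    cum_sum = np.cumsum(hist * centers)
    mean_bg = cum_sum / np.maximum(weight_bg, 1e-12)
    mean_fg = (cum_sum[-1] - cum_sum) / np.maximum(weight_fg, 1e-12)
    between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    best_bin = int(np.argmax(between))
    return float(edges[best_bin + 1])              # upper edge of the last background bin


def largest_connected_component(binary: np.ndarray) -> np.ndarray:
    """Keep only the largest 4-connected foreground region of a boolean mask."""
    height, width = binary.shape
    visited = np.zeros(binary.shape, dtype=bool)
    best = []
    for row, col in zip(*np.nonzero(binary)):
        if visited[row, col]:
            continue
        visited[row, col] = True
        queue = deque([(int(row), int(col))])
        component = []
        while queue:                                # breadth-first flood fill
            r, c = queue.popleft()
            component.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and binary[nr, nc] and not visited[nr, nc]:
                    visited[nr, nc] = True
                    queue.append((nr, nc))
        if len(component) > len(best):
            best = component
    result = np.zeros(binary.shape, dtype=bool)
    if best:
        rows, cols = zip(*best)
        result[list(rows), list(cols)] = True
    return result


def segment_polyp(prob_map: np.ndarray) -> np.ndarray:
    """Binarize a probability map with Otsu's method, then keep the largest region."""
    binary = prob_map >= otsu_threshold(prob_map)
    return largest_connected_component(binary)
```

The flood fill is written in plain Python for clarity; on full-resolution frames a library routine for connected-component labeling would be the more practical choice.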
More recently, in order to understand a scene, each piece of visual information has to be associated with an entity while considering the spatial information. Image segmentation can also be performed using generative adversarial networks (GANs) [44]. GANs have shown outstanding results in many generative tasks that replicate rich real-world content such as images. They are inspired by game theory: two models, a generator and a critic, compete with each other while making each other stronger at the same time. The discriminator model estimates the probability that a given sample comes from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones. Meanwhile, the generator outputs synthetic samples given a noise variable input. It is trained to capture the real data distribution so that its generated samples can be as real as possible or, in other words, can trick the discriminator into assigning them a high probability. GAN-based segmentation methods have been proposed in the literature. In [40], the authors replace the traditional discriminator with a fully convolutional multiclass classifier. The classifier assigns to each input image pixel one label that corresponds to a semantic class or to a fake/real mark. In this way, they can use unlabeled images during the training process. Zhang et al. [64] used a GAN to build a Task Driven Generative Adversarial Network (TD-GAN) to achieve simultaneous synthesis and parsing for unseen real X-ray images. By using the strength of GANs, their entire model pipeline does not require any annotations from the X-ray image domain.

III. PROPOSED METHOD
In this section, we describe our approach for polyp segmentation in detail. The entire training process of MED-Net is illustrated in Figure 1. Our approach includes three basic steps. First, we present novel augmentation methods that effectively increase the number of training images by generating additional images from those in the original limited training dataset. In this step, we also implement our proposed new boundary-focused data augmentation method for randomly generating a large number of boundary-aware polyp patterns. This method contributes to the improvement of polyp boundary segmentation. In the second step, all the augmented images are used to train MED-Net to extract the most discriminative deep features of polyps from multi-scale image inputs. In the final step, we present an adaptive weighted loss function and an attention-based loss function that can effectively improve the segmentation performance of MED-Net.

A. BOUNDARY-FOCUSED DATA AUGMENTATION
Deep neural nets are powerful machine learning systems that tend to work well when trained on massive amounts of data [43]. Therefore, data augmentation is an effective technique for improving the performance of modern image segmentation. However, given the specific requirements of the medical imaging field, the limited size of databases is a major obstacle for researchers to overcome [22], [55]. More specifically, for an automated polyp segmentation method using deep learning networks, data augmentation is an essential step for improving segmentation performance. Because the endoscopy procedures involving color calibration and controlling the camera's movement are not consistent, the appearance of endoscopy images differs significantly across laboratories. The data augmentation step brings endoscopy images into an extended space that can cover all their variances. Since access to data is restricted because of privacy concerns, polyp segmentation networks were frequently trained with an insufficient training dataset. The cancer classification performance was hindered by this lack of training data. Recent studies have demonstrated the effectiveness of data augmentation, in which additional data are generated from the original limited training dataset to increase the amount of training data. By augmenting training data, we can also reduce the over-fitting problem of training models. In this study, we used mainly geometric augmentation techniques, including reflection, random cropping, translation, rotation, and the elastic distortion introduced by Wong et al. [61]. Since the color of endoscopy images varies significantly across laboratories as a result of technicians' varying technical skills, we applied an effective color constancy method, namely, gray world, which assumes that the scene in an image is, on average, a neutral gray and that the source of the average reflected color is the color of the light. For this reason, the illuminant color cast can be effectively estimated by computing the average color and comparing it to gray values. In this algorithm, the illuminant colors are computed by using the mean of each color channel of the image.
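The gray-world correction described above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions: the input is an H x W x 3 `uint8` RGB image, the illuminant is estimated from the per-channel means, and each channel is rescaled so that its mean matches the overall gray level. The function name `gray_world` is hypothetical.

```python
import numpy as np


def gray_world(image: np.ndarray) -> np.ndarray:
    """Apply gray-world color constancy to an H x W x 3 uint8 image."""
    pixels = image.astype(np.float64)
    # Illuminant estimate: the mean of each color channel.
    channel_means = pixels.reshape(-1, 3).mean(axis=0)
    # Under the gray-world assumption the scene average should be a neutral gray.
    gray = channel_means.mean()
    gains = gray / np.maximum(channel_means, 1e-6)
    balanced = pixels * gains  # per-channel gain, broadcast over all pixels
    return np.clip(np.rint(balanced), 0, 255).astype(np.uint8)
```

For example, an image with a uniform reddish cast (R=200, G=100, B=50) is mapped to a neutral gray, because all three channel means are pulled to the same value.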
Since the texture or color of the non-specular transition zone between a polyp and its surrounding area does not differ significantly from that of a polyp, conventional segmentation methods are not able to segment the polyp boundaries. This results in a poor segmentation performance. Moreover, as previously mentioned, the boundaries of wrinkles and folds in the colon are similar to those of a polyp, and a polyp can be partly overlapped and hidden by these wrinkles and folds. To address the boundary appearance problem and improve the learning ability of our encoder-decoder network, we present a new boundary-focused data augmentation approach that can be combined with most existing deep convolutional neural networks to boost classification performance. In the training dataset, every image generates a number of augmented images by randomly selecting a circle region of an arbitrary diameter and an arbitrary circle center for each of its augmented images, provided that this center is located in the polyp region and the circle does not cover the whole polyp region. Each pixel within the circle region is set to a random value. Some typical examples of the results of the boundary-focused data augmentation approach are shown in Figure 2.

Algorithm 1 Boundary-Focused Augmentation Algorithm
Input: Input image $I$; polyp region $S$; circle region $S_e$; circle center $C(x_C, y_C)$; minimum radius $R_{max}$ from $C(x_C, y_C)$; circle radius $r_i$, $r_i \in \mathrm{Randint}(R, R_{max})$, where $R_{max}$ is the shortest distance from the circle center $C(x_C, y_C)$ to the polyp boundary;
... (0, 128); $I^* \leftarrow I$; return False; end
The entire procedure of the boundary-focused data augmentation method is shown in Algorithm 1. In particular, this method aims to randomly select a circle region $S_e$ in the polyp region $S$ and to set the values of all the pixels in this region randomly, as illustrated in Figure 3.
In the initial step, we find the contour of the polyp in the label images. Once the contour is found by the method of Bradski et al. [9], the raw moment of order $(p+q)$ of the 2D continuous function $f(x, y)$ is defined in Equation 1. To reduce noise, we apply $p, q \in \{0, 1, 2, 3\}$ and adapt the definition to the scalar (greyscale) image. The central moments of the polyp are then defined, and the polyp center coordinates, denoted $(\bar{x}, \bar{y})$, are calculated in Equation 4, where $(x_m, y_m)$ denotes the point with the lowest moments starting from the centroid inside the polyp region $S$. In the second stage, a circle center $C(x_C, y_C)$ is randomly chosen in the polyp area under the condition shown in Equation 5. Subsequently, the distance $R$ from the circle center $C(x_C, y_C)$ to the polyp boundary is computed. If the distance $R$ is smaller than the allowed maximum distance $R_{max}$, which is determined from $C(x_C, y_C)$ to the nearest point, a circle radius $r_i$ ranging from $R$ to $R_{max}$ is randomly selected. The selected circle region $S_e$ is defined by the circle center $C(x_C, y_C)$ and the circle radius $r_i$. In the third step, each pixel in $S_e$ of the raw image is assigned a random greyscale value ranging from 0 to 128. However, if the distance $R$ is longer than the maximum distance $R_{max}$, this process is repeated from the beginning of the first step until a proper centroid is found.
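The procedure above can be sketched in NumPy as follows. The sketch assumes the standard discrete image moments, $M_{pq} = \sum_x \sum_y x^p y^q I(x, y)$, with centroid $(\bar{x}, \bar{y}) = (M_{10}/M_{00}, M_{01}/M_{00})$, and adopts one possible reading of the radius rule: the radius is drawn between a hypothetical `min_radius` and the distance from the chosen center to the nearest non-polyp pixel. The function names, `min_radius`, `max_tries`, and the `None` fallback are illustrative assumptions and are not part of Algorithm 1.

```python
import numpy as np


def centroid_from_moments(mask: np.ndarray):
    """Centroid (x, y) of a boolean mask from raw moments: (M10 / M00, M01 / M00)."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))
    return xs.sum() / m00, ys.sum() / m00


def boundary_focused_augment(image, mask, rng, min_radius=2.0, max_tries=50):
    """Overwrite one random disc inside the polyp region with random dark values.

    image: H x W or H x W x C uint8 array; mask: H x W boolean polyp region.
    Returns the augmented copy, or None if no valid disc is found
    (an assumed fallback for this sketch).
    """
    fg_rows, fg_cols = np.nonzero(mask)
    bg_rows, bg_cols = np.nonzero(~mask)
    if len(fg_rows) == 0 or len(bg_rows) == 0:
        return None
    rows, cols = np.indices(mask.shape)
    for _ in range(max_tries):
        pick = rng.integers(len(fg_rows))
        cy, cx = fg_rows[pick], fg_cols[pick]       # random center inside the polyp
        # Distance from the center to the nearest non-polyp pixel.
        r_max = np.sqrt((bg_rows - cy) ** 2 + (bg_cols - cx) ** 2).min()
        if r_max <= min_radius:
            continue                                # too close to the boundary; retry
        radius = rng.uniform(min_radius, r_max)
        disc = (rows - cy) ** 2 + (cols - cx) ** 2 <= radius ** 2
        if disc[mask].all():
            continue                                # the disc must not hide the whole polyp
        out = image.copy()
        noise = rng.integers(0, 129, size=image.shape, dtype=np.uint8)  # values in 0..128
        out[disc] = noise[disc]
        return out
    return None
```

Calling the function repeatedly with the same image and a shared `np.random.default_rng()` generator yields many differently masked copies of one training image, which is the effect the augmentation is meant to achieve.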

B. FEATURE EXTRACTION USING A MULTI-MODEL DEEP ENCODER-DECODER NETWORK
We propose MED-Net, which was inspired by recent research on semantic segmentation [25], [63], DeepLabv3+ [14], and the upsampling methods of Lee and Carlberg [32] and Tian et al. [54]. The objective of this study was to build an ensemble of deep encoder-decoder networks (DEDNs) that can be trained to distinguish the most discriminative features for the polyp segmentation task [58]. Moreover, the network is also able to give accurate predictions by applying the recent upsampling method presented by Tian et al. [54]. A DEDN ensemble is a learning paradigm in which many DEDNs are jointly employed to solve a specific problem. Our study demonstrated that an ensemble of multiple deep encoder-decoder networks significantly outperforms single simple deep encoder-decoder networks. This is because our proposed DEDN ensemble has competitive advantages that are useful for increasing the prediction accuracy rate.
First, we can apply multi-scale input images to our network ensemble in which each scale is passed through at least one DEDN. This ensemble not only expands the receptive field in the original image to cover better global features but also extracts better multi-scale local features.
Second, our ensemble of DEDNs constitutes a reliable technique to increase the segmentation performance. Because each deep training model presents several local minima, multiple training processes of different DEDNs can improve the distribution of errors in each class. Thus, combining their outputs leads to an improved performance on the overall task.
Third, in this study, we employed the encoder part of the state-of-the-art DeepLab model, namely, the DeepLabv3+ [14]. Chen et al. [14] demonstrated in the PAS-CAL VOC 2012 challenge that the DeepLabv3+ encoderdecoder model yields a state-of-the-art performance [19]. Later we combine it with the decoder part which is known as DUpsample [54], the network is shown in Fig 4. This is because its encoder module can capture rich contextual information from several parallel atrous convolution layers with different rates and its decoder module is able to recover effectively missing boundaries caused by the pooling or convolutions with striding operations in the encoder module [31], [34]. The main advantage of the new upsampling layer lies in that with a relatively lower resolution feature map [54]. Therefore, our DEDN achieve even better segmentation performance by significantly reducing computation complexity. Let G ∈ R h×w×c is the final product of the encoder part and Y ∈ 0, 1, . . . , a h×w be the label map, in which h and w is the high and the weight, respectively. While, c is the number of channels in final output and a denotes for the number of classes. However, in this study, we highly focus on binary segmentation, therefore, Y ∈ 0, 1 h×w . Precisely, G is typically of a factor of 16 or 32 in spatial size of the ground-truth Y [54]. However, semantic segmentation task strongly requires per-pixel prediction, G needs to be upsampled to the spatial size of Y before going to the computing process of the training loss function. In other words, by integrating DUpsampling mehtod we can make the component DEDN avoid overly reducing the overall strides of the encoder and it also significantly reduces the computation time and memory footprint of the semantic segmentation method. Unlike in our previous research studies [38], in which we fully applied the architecture of Chen et al. [14], we created a new DEDN by merging two state-of-the-art approaches in segmentation. 
In addition, instead of using pre-trained deep learning models only to extract discriminative features, we trained every component DEDN, starting from its pre-trained model, on our augmented training dataset. This was possible because we could generate enough augmented images to train these networks and thus avoid the overfitting problem. After these training processes, our models extract considerably better features than their pre-trained counterparts. The training steps are explained in detail below.
In the data augmentation step, hundreds of new augmented training images are generated from each original training image by means of random cropping, rotation, elastic distortion, and translation. This new dataset is used to train each component network with differently scaled input images: each DEDN model is trained with one of three types of scaled input images, and each input image is a rescaled image from the augmented dataset. Moreover, the fully connected layers are removed from each model so that each network can accept an input image of arbitrary size.
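The preparation of the three scaled inputs can be sketched as follows. This is only a minimal illustration under stated assumptions: `resize_nearest`, `make_multiscale_inputs`, and `SCALES` are hypothetical names, the three sizes are the resolutions quoted in the experiments (250 × 200, 384 × 288, and 500 × 400 pixels), and nearest-neighbour sampling is used for brevity because the interpolation method is not stated in the paper.

```python
import numpy as np

# (width, height) of the three input scales; assumed from the experiments section.
SCALES = [(250, 200), (384, 288), (500, 400)]


def resize_nearest(image, width, height):
    """Resize an H x W (x C) array with nearest-neighbour sampling."""
    src_h, src_w = image.shape[:2]
    # Map every target row/column back to a source row/column index.
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return image[rows[:, None], cols]


def make_multiscale_inputs(image):
    """Return one rescaled copy of `image` for each component network."""
    return [resize_nearest(image, w, h) for (w, h) in SCALES]
```

Each element of the returned list would feed one component DEDN, so the three networks see the same augmented image at three resolutions.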
Each component DEDN encoder module summarizes the features of a scaled input image in the low-level feature map taken from the output of the last block. This feature map is then passed through four parallel atrous convolution layers with different rates to capture multi-scale context information and generate convolutional feature maps at multiple scales. Notably, the encoder output feature map not only contains 256 channels and rich semantic information, but also carries features extracted at arbitrary resolutions, because we apply both atrous convolution and a multi-resolution input dataset. This design boosted the performance of the model, as demonstrated in detail in Section IV.
With regard to the decoder path, instead of applying the decoder of Chen et al. [14], we used data-dependent upsampling [54], which compensates for the weaknesses of the former. In addition, the feature map of each component DEDN branch is rescaled so that its resolution matches that of the other feature maps. All DEDN feature maps are then fused into a shared feature map. Since each polyp in an image is prominent in different feature maps, we use a maxout layer to obtain the competitive and dominant features from all the DEDN feature maps and integrate these features into the shared feature map. The entire architecture of our feature extraction model is illustrated in Figure 1. The data-dependent upsampling method is used to recover the pixel-wise segmentation prediction from the raw outputs of the convolutional decoder. This method avoids overly decreasing the overall stride of the encoder, which significantly reduces the computation and memory footprint. Moreover, it allows the decoder to downsample the features to the lowest feature-map resolution before merging them.
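The maxout fusion of the branch feature maps can be illustrated with a short sketch. It assumes that the feature maps have already been rescaled to a common shape; `maxout_fuse` is a hypothetical helper name, and the element-wise maximum shown here is the standard maxout operation rather than the authors' exact layer.

```python
import numpy as np


def maxout_fuse(feature_maps):
    """Fuse feature maps of identical shape (H, W, C) by element-wise maximum.

    For every position and channel, the strongest response among the
    branches is kept in the shared feature map.
    """
    stacked = np.stack(feature_maps, axis=0)  # (num_branches, H, W, C)
    return stacked.max(axis=0)                # (H, W, C)
```

Because the maximum is taken per position and per channel, a polyp that is prominent in only one branch still dominates the shared feature map at its location.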
Applying data-dependent upsampling remarkably reduces the computation and also extends the design space of feature aggregation, which allows better feature aggregation. The following step is to integrate DUpsampling into the encoder-decoder framework to obtain an end-to-end trainable system. Although DUpsampling can be performed as a convolution with a 1×1 kernel, incorporating it directly into the framework encounters difficulties in optimization.
So far, we have shown that the DUpsampling method adopted in our MED-Net can replace the bilinear upsampling used in semantic segmentation, which is less effective in making the training converge. Incorporating DUpsampling into the encoder-decoder framework then yields an end-to-end trainable system.
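To make the statement that DUpsampling can be performed as a 1×1 convolution concrete, the following sketch applies one linear projection to every low-resolution feature vector and rearranges the result into a block of output pixels. It is an illustration under assumptions, not the authors' implementation: `dupsample` is a hypothetical name, and the projection matrix `weight`, which is learned from data in [54], is treated here as a given array.

```python
import numpy as np


def dupsample(features, weight, ratio):
    """Upsample (h, w, c) features to (h*ratio, w*ratio, num_classes).

    weight has shape (c, ratio * ratio * num_classes); it stands in for
    the learned projection of [54].
    """
    h, w, _ = features.shape
    num_classes = weight.shape[1] // (ratio * ratio)
    # A 1x1 convolution is the same linear projection applied at every pixel.
    projected = features @ weight  # (h, w, ratio*ratio*num_classes)
    blocks = projected.reshape(h, w, ratio, ratio, num_classes)
    # Place the ratio x ratio block of pixel (i, j) at rows i*ratio.. and cols j*ratio..
    return blocks.transpose(0, 2, 1, 3, 4).reshape(h * ratio, w * ratio, num_classes)
```

With a ratio of 16 or 32, a single matrix product on the small feature map replaces convolutions on a full-resolution map, which is where the savings in computation and memory come from.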
Besides, deep neural networks are nonlinear methods that offer increased flexibility and can scale in proportion to the amount of training data available. A drawback of this flexibility is that they learn via a stochastic training algorithm and are therefore sensitive to the specifics of the training data. In other words, a neural network may find a different set of weights each time it is trained, which in turn produces different predictions. Neural networks thus usually have a high variance, which can be harmful when trying to develop a final model. One successful approach to reducing this variance is to train multiple models, which strongly motivated us to build MED-Net. Moreover, differences in random initialization, random selection of mini-batches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors [5]. Alternatively, the weights of multiple networks can be averaged, in the hope of obtaining a single new model with better overall performance than any original model; that is, instead of forming an ensemble by averaging the outputs of networks in model space, one averages these points in weight space and uses a network with the averaged weights [15], [27]. The predictions in our experiments show that the ensemble is better than any single network. Because the datasets vary across our training strategies, the variance is significantly diminished.
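Two simple ways of combining the outputs of several component networks in model space are sketched below: averaging the predicted probability maps, and the majority voting mentioned in the conclusion. The function names are hypothetical, and the sketch assumes that all outputs have already been resized to a common resolution.

```python
import numpy as np


def average_probabilities(prob_maps):
    """Average the per-pixel polyp probabilities of several networks."""
    return np.mean(np.stack(prob_maps, axis=0), axis=0)


def majority_vote(masks):
    """Label a pixel as polyp when more than half of the binary masks agree."""
    votes = np.sum(np.stack(masks, axis=0), axis=0)
    return (2 * votes > len(masks)).astype(np.uint8)
```

With three component networks, `majority_vote` marks a pixel as polyp when at least two networks do, so an error made by a single network is outvoted.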
We train each network with a stochastic gradient algorithm running on a GeForce GTX 1080 Ti, utilizing the TensorFlow [1] distributed machine learning system. We fine-tune the networks using RMSProp with a weight decay of 0.001. We use a learning rate of 0.007, decayed every two epochs at an exponential rate of 0.94. The training process for each component network takes 350,000 steps. Direct training of our encoder-decoder network with a conventional loss function may not result in good segmentation performance. Instead, the multi-level contextual features extracted by our network are trained by minimizing the overall loss $L_a$, which is a combination of an adaptive weighted loss $L_{apt}$ and an attention-based loss $L_{att}$.
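The learning-rate schedule can be written out as a small function. This sketch assumes a staircase decay, that is, the rate is multiplied by 0.94 once every two epochs rather than decayed continuously; the paper does not say which variant was used, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch, base_lr=0.007, decay_rate=0.94, epochs_per_decay=2):
    """Staircase exponential decay of the learning rate.

    The rate stays constant within each block of `epochs_per_decay`
    epochs and is multiplied by `decay_rate` at the start of the next block.
    """
    return base_lr * decay_rate ** (epoch // epochs_per_decay)
```

Under this assumption, epochs 0 and 1 use a rate of 0.007, and epochs 2 and 3 use 0.007 × 0.94 = 0.00658.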

C. ADAPTIVE WEIGHTED LOSS FUNCTION
We use a new adaptive weighted loss function, inspired by [3], [62], to boost the segmentation performance of the MED-Net. First, we compute the soft-max value in the feature channel k at the pixel position $z \in \Omega$ with $\Omega \subset \mathbb{Z}^2$:
$$p_k(z) = \frac{\exp(u_k(z))}{\sum_{k'=1}^{K} \exp(u_{k'}(z))}$$
where K is the number of classes and $u_k(z)$ is the activation in feature channel k at pixel position z. The adaptive weighted loss $L_{apt}$ is then computed from these soft-max values using the following quantities: $\omega : \Omega \to \mathbb{R}$ denotes the weight map, $l : \Omega \to \{1, \ldots, K\}$ is the true label of each pixel, and $p_{gt}(z)$ denotes the true probability of the current pixel z. We aim to build a weight map $\omega$ that gives high weights only to the pixels that belong to the polyp in a training image or to the polyp boundary in an augmented training image. As a result, the network treats these pixels as more important than the other pixels in the background. The weight map is computed from two components: $\omega_o : \Omega \to \mathbb{R}$, the basic weight map that balances the polyp and the background, and $\omega_e : \Omega \to \mathbb{R}$, the boundary-focused weight map. The latter is computed from a regularization parameter $n_e$, the polyp area $S_{po}$, and the entire image area $S$. In our experiments, we set $n_e = 0.5$.
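The role of the weight map can be illustrated with a generic pixel-weighted soft-max cross-entropy. This is a hedged sketch, not the paper's exact $L_{apt}$: the way $p_{gt}$ enters the loss and the way $\omega_o$ and $\omega_e$ are combined are not reproduced here, the weight map is simply passed in as an array, labels are indexed from 0 rather than 1, and all function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable soft-max over the last (class) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def weighted_cross_entropy(logits, labels, weight_map):
    """Generic pixel-weighted cross-entropy (assumed form, see lead-in).

    logits: (H, W, K) activations u_k(z)
    labels: (H, W) integer true labels in [0, K)
    weight_map: (H, W) per-pixel weights omega(z)
    """
    probs = softmax(logits)
    rows, cols = np.indices(labels.shape)
    p_true = probs[rows, cols, labels]  # probability of the true class per pixel
    return float(-(weight_map * np.log(p_true)).sum())
```

Raising the weights of polyp and boundary pixels increases their share of the total loss, which is the mechanism by which the network is pushed to treat them as more important than background pixels.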

D. ATTENTION-BASED LOSS FUNCTION
To provide better segmentation results, we also aim to measure the similarity between a predicted polyp mask and a ground truth polyp mask. This measurement allows the network to focus more on the polyps and their boundaries. In these binary masks, pixel values in the polyp area are set to one and those in the background are set to zero. We denote the predicted value of pixel z by $u_{pred}(z)$ and the true value of pixel z by $u_{true}(z)$. The attention-based loss $L_{att}$ is computed from these values over the polyp region, where N is the total number of pixels in the polyp region, z is the current pixel position in the polyp region, and $K_{att}$ is a regularization parameter. By using this loss function, the difference between the predicted and the true polyp boundary can be iteratively reduced. Moreover, the combination of the above loss functions can lead to better segmentation performance, because the network focuses on iteratively learning polyp boundaries.

E. ABLATION STUDIES
To analyze the effect of each contribution on the segmentation performance, we performed an ablation study in which different parts of the system were removed, for instance one or two of the three main contributions. To make all ablation experiments comparable, we split the CVC-ColonDB [48] dataset into training and testing parts with a ratio of 80% and 20%, respectively. For instance, in the first experiment, we adopted DeepLabv3+ [14] as the main network architecture with Xception [17] as the backbone, used the softmax cross-entropy loss on the logits of each scale, and kept the augmentation method throughout the process. Strikingly, all experiments that applied the boundary-focused data augmentation method tended to achieve a higher Dice score than the others. Table 1 shows that while a single contribution only slightly affects the performance of the model, the combination of two contributions always performs better.
Furthermore, we performed four experiments to show how we chose the number of component networks. To analyze the performance of each configuration, we kept all the remaining contributions, and the training/testing split was unchanged from the former experiments. As shown in Table 2, the MED-Net with three component networks significantly outperformed the other three configurations.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we demonstrate the effectiveness of our proposed polyp segmentation methods. We used three databases to evaluate our methods and compared our results with those of state-of-the-art algorithms.

A. DATASETS
To demonstrate the performance of our proposed approach, we used well-known datasets from the MICCAI 2015 polyp detection challenge [52]. For comparison purposes, we separated the datasets into a training and a testing dataset according to the recommendation of the MICCAI challenge guidelines: CVC-CLINIC for training and ETIS-Larib for testing. Furthermore, we also report the results on a second publicly available dataset (CVC-ColonDB). In addition, to show the effectiveness of our method on additional datasets, we used further dataset combinations for different goals; for example, we used a combination of CVC-ColonDB and CVC-ClinicDB, which contains 912 images with associated polyps, to compare our method with that of Li et al. [33].
The datasets are briefly described in the following paragraphs.
• CVC-ClinicDB [7] contains 612 images, all of which show at least one polyp. The segmentation labels were obtained from 25 colorectal videos by selecting 29 sequences from these videos. The images in this database have three channels, are stored in TIFF format with a dimension of 384 × 288 pixels, and come with corresponding label maps.
• CVC-ColonDB [48] contains 379 frames from 15 different colonoscopy sequences, each of which shows at least one polyp. These frames were selected to maximise the visual difference between them; hence, this dataset covers as many types of polyp appearance as possible. The images are in RGB with a resolution of 574 × 500 pixels.
• ETIS-LaribPolypDB [8] contains 196 images, which are generated from 34 videos. The size of each image is 1225 × 966 pixels. This dataset contains 44 different polyps with various sizes and shapes. It is one of the most challenging datasets in the polyp field: not only does it contain few images, it also includes a vast range of polyp appearances.
• ASU-Mayo Clinic Colonoscopy Video Database [53] contains 20 short colonoscopy videos, of which 10 show a unique polyp (positive shots) and the other 10 show no polyps (negative shots). These videos were selected to display maximum variation in colonoscopy procedures, with two types of resolution.

B. CALCULATION METRICS
We used the Jaccard index, also known as the Intersection over Union (IoU), as the main metric to evaluate our approach's performance. Furthermore, to provide a general view of the effectiveness of our method, we also employed the Dice score to describe our results. We calculated the mean IoU, that is, the mean of the per-class IoU values computed over a validation/test set with Equation 11:
$$\mathrm{IoU} = \frac{|PR \cap GT|}{|PR \cup GT|} \tag{11}$$
where PR represents the binary mask produced by the segmentation model and GT the ground truth mask, while ∩ denotes the intersection and ∪ the union of PR and GT.
In addition, we also report the results of three common segmentation evaluation metrics: mean pixel precision, mean pixel recall, and mean accuracy.
Precision measures exactness: it is the proportion of the pixels labelled as polyp by the segmentation method that are also polyp pixels in GT, as given in Equation 12:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$
Recall, also known as sensitivity, is the fraction of ground truth polyp pixels that have been retrieved, as given in Equation 13:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{13}$$
Accuracy, one of the most common evaluation metrics, simply reports the percentage of pixels in the image that were correctly classified, as given in Equation 14:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$
The Dice coefficient is a statistic for comparing the similarity of predicted images and label images; it ranges between 0 and 1 and is given in Equation 15:
$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{15}$$
In Equations 12 to 15, the pixel counts are defined as follows. If a pixel of a polyp was correctly segmented, it was counted as a true positive (TP). Every pixel segmented as belonging to a polyp that fell outside a polyp label was counted as a false positive (FP). Finally, every polyp pixel that was not detected was counted as a false negative (FN), and the remaining pixels were counted as true negatives (TN).
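The pixel counts and the metrics built on them can be computed from a pair of binary masks as in the following sketch. The function names are hypothetical, the formulas are the standard definitions of IoU, Dice, precision, recall, and accuracy, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
import numpy as np


def confusion_counts(pred, gt):
    """Count TP, FP, FN, and TN pixels for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))    # polyp pixels correctly segmented
    fp = int(np.sum(pred & ~gt))   # predicted polyp outside the label
    fn = int(np.sum(~pred & gt))   # polyp pixels that were missed
    tn = int(np.sum(~pred & ~gt))  # remaining background pixels
    return tp, fp, fn, tn


def segmentation_metrics(pred, gt):
    """Return IoU, Dice, precision, recall, and accuracy for one mask pair."""
    tp, fp, fn, tn = confusion_counts(pred, gt)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

Averaging these per-image values over a test set gives the mean scores reported in the result tables, assuming per-image averaging is what the mean refers to.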

C. COMPARISON WITH OTHER STATE-OF-THE-ART APPROACHES
In every comparison experiment, we upsampled and downsampled the training dataset to feed it into three different training phases. Not only did we apply our proposed augmentation method, but we also trained the network using other effective methods. For instance, we cropped each image from the center to remove the black regions generated by the camera at the image corners, and we rotated images by random angles from 0 to 360 degrees. Our many experiments showed that using grayscale images for both the training and the testing phase is always better than using RGB images. Therefore, we proceeded to use grayscale images for all the datasets. Subsequently, we stored the datasets in the TFRecords format to optimize model training. Furthermore, all training processes were executed on TensorFlow with a GTX 1080 Ti GPU.
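The grayscale conversion and center cropping can be sketched as follows. The paper does not specify how the grayscale values were computed or how large the crop was, so the luminance weights (the common ITU-R BT.601 values) and the crop size arguments are assumptions, and both function names are hypothetical. Random rotation is omitted for brevity.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB image to (H, W) using assumed BT.601 weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights


def center_crop(image, crop_h, crop_w):
    """Keep the central crop_h x crop_w window, discarding the dark borders."""
    h, w = image.shape[:2]
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]
```

The same two steps would be applied to the test images so that they match the format of the training data.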

1) RESULT ON THE ETIS-LARIB DATASET
First, we consider the most challenging dataset, which is provided by the ETIS laboratory and Lariboisiere Hospital-APHP. To challenge the previous results on the ETIS-Larib dataset, the methodology of Brandao et al. [10] follows the same data guidelines and restrictions as those given in the 2015 MICCAI sub-challenge. In the training phase, the authors used 4664 images and their corresponding labels, each containing at least one polyp, and later used ETIS-Larib [8] to evaluate the performance of the model. As mentioned above, we first converted all images in the CVC-ColonDB dataset [48] and the frames extracted from the ASU-Mayo Clinic Colonoscopy Video Database [53] into grayscale images. Once this was done, the images were resized from 384 × 288 pixels to 500 × 400 pixels and 250 × 200 pixels, respectively. Subsequently, all three sets were fed into the three component networks for training. Table 3 shows the results of our model in terms of polyp segmentation: our MED-Net outperformed three FCNN networks and our previous work. More specifically, the proposed model achieved both the highest precision and the highest recall among the models. The experimental results show again that the fused method, because of its ability to aggregate multi-scale contextual information, outperforms all the other approaches. Moreover, examples of the polyp segmentation results of the three FCNNs are depicted in Figure 5, which shows that our model can delineate the tumor boundary accurately, a result that the other models cannot achieve. Finally, in the testing phase, we converted the RGB images in ETIS-Larib [8] to grayscale and resized them to the same size as the images in the training stage so that they would be compatible with the model.
TABLE 3. Comparison of proposed method and three fully convolutional neural networks in terms of mean pixel precision and recall for the ETIS-Larib dataset [8].

2) RESULT ON THE CVC-ColonDB DATASET
Second, we appraised the performance of our proposed approach on the CVC-ColonDB dataset. All images in this database show polyps of different shapes, and all were annotated by expert clinicians. We employed the methods presented in [2] and [63] to analyze the comparative effectiveness of our method. Whereas the method presented by Akbari et al. [2] applies a post-processing procedure to the probability map produced by the network to improve the results, we did not apply any post-processing during the testing phase. We also used two well-known encoder-decoder networks: U-Net [45], which is regarded as a standard in the medical field, and DeepLabv3+ [14], one of the state-of-the-art networks. In the training step, following the training data selection used by Akbari et al. [2], we arbitrarily selected 200 images for training and used the remainder of the dataset for testing. After that, we again upsampled these training images from 384 × 288 pixels to 500 × 400 pixels, and then downsampled them to create a smaller dataset in which the images were exactly 250 × 200 pixels. Table 4 shows that our approach outperformed the other two methods completely. Our model achieved a Dice score of 0.908, which is 0.012 better than that of our previous approach. This is because our method can capture multi-level and multi-scale features, which is the advantage of using multiple-model deep encoder-decoder networks.

3) RESULT ON THE CVC-ClinicDB DATASET
Third, we evaluated the performance of our model on the CVC-ClinicDB dataset [7]. From this database we randomly selected 430 images (about 70% of the total dataset) for training, and the remaining 182 images were used as the test set; thus, there is no intersection between the training and the test set. The same training and testing strategy was applied to the networks of Chen et al. [14], Li et al. [33], and Ronneberger et al. [45]. Table 5 shows that the performance of our model is significantly higher than that of the competitors in all scores, especially in the Dice score, which is the main one. While our method could clearly segment the tumor in these testing images, the comparison methods could not locate the tumor position perfectly.

V. CONCLUSION
The segmentation of tumour epithelium in histopathology is an essential first step for biomarker assessment, tumour quantification, and prognosis determination in colorectal cancer. In this paper, we proposed a novel deep learning approach that uses a multi-model deep encoder-decoder network, called MED-Net, for colorectal polyp segmentation. Within MED-Net, we presented new component networks that apply the DUpsample method, known as one of the most powerful approaches in segmentation tasks, to make the model's performance more robust. In addition, a new data augmentation method that can be applied in polyp segmentation approaches was presented. Furthermore, we described the use of a new adaptive weighted loss function and an attention-based loss function to boost the performance of the model. Our method outperformed the state-of-the-art polyp segmentation methods on three datasets. The key advantage of the proposed method over existing methods is that it employs an ensemble of encoder-decoder networks trained to extract visual features from multi-scale images.
Our proposed method consists of two training stages. In the first training stage, strong data augmentation methods are used to improve the segmentation performance of the deep learning networks. To achieve better segmentation performance despite the challenging problem that only a limited number of colon cancer databases are accessible, we apply not only a novel augmentation method but also other effective methods. MED-Net is trained to extract multi-context information from multi-scale training images. It was able to extract both global and local features of colorectal cancer polyps appropriately, and it could thus greatly improve the performance of different types of methods for detecting malignant polyps. The best model was built by combining three trained models with different resolutions using the majority voting strategy. Moreover, we introduced two loss functions that strongly supported MED-Net in achieving its results.
Our experimental results demonstrate the superiority of the proposed method to state-of-the-art polyp segmentation approaches. Our future work will be focused on improving the IoU measure by adopting distinctive deep learning methods, such as batch normalization and residual networks, which may provide a good tool that will further improve the segmentation performance of our method. Furthermore, we also plan to apply our approach to a real-time system.

VI. DISCUSSION
Our proposed approach still makes errors, but we are attempting to improve on existing research results by means of a variety of methods. In colorectal polyp segmentation, a good classification algorithm and high-quality data complement each other. When segmenting a high-resolution image, we must upsample the image and then downsample it to recover the original resolution. Thus, we intend to explore the convolutional neural network structure with the objective of devising a network with a wider and deeper structure. In particular, we will focus on the encoder part in order to extract the most meaningful features. Training time is also a concern: there is a tradeoff between time and performance, and the training time is a drawback of this study because of the number of component networks in MED-Net and the encoder-decoder structure of each.