Automatic Polyp Segmentation Using Modified Recurrent Residual Unet Network

Colorectal cancer is a dangerous disease with a high mortality rate. Early detection of polyps increases the likelihood of successful treatment. Unet-based network models have shown success in medical image segmentation, including the analysis of polyps in colonoscopy images. However, traditional Unet and Unet-based models are often large, requiring high-performance systems for training and deployment. Designing models that combine compact size with high performance is therefore an important goal. In this study, we modified the Residual Recurrent Unet architecture to reduce the model size while preserving performance. The proposed model allows flexible changes to the number of filters in the convolution units. By exploiting the strengths of the residual and recurrent structures, which reuse convolutional functions, the new model is not only smaller but also outperforms the traditional Unet model and other baselines. Evaluations were performed on three public colonoscopy image datasets: CVC-ClinicDB, ETIS-LaribPolypDB, and CVC-ColonDB. The Dice score reached 94.59% on CVC-ClinicDB, 92.73% on ETIS-LaribPolypDB, and 93.31% on CVC-ColonDB. The experimental results of the proposed network were better than those reported in recent related studies. Because the introduced model is smaller than the traditional model yet delivers outstanding performance, it is well suited for developing applications on low-performance devices.


I. INTRODUCTION
COLORECTAL cancer has the third-highest rate of new cases of all cancers and the second-highest mortality rate in 2020 [1]. Polyps are the primary cause of colorectal cancer, and early detection of malignant polyps makes treatment more effective. Colonoscopy is one of the most effective solutions for the early detection of polyps, which helps treat patients and save lives [2]. The polyps present in colonoscopy images are highly diverse in size, shape, and color. It is therefore tremendously challenging to detect and distinguish the polyps that appear in these images [3]. This task requires plenty of time and effort from doctors, so automatic, computer-aided processing would be the best solution to this challenge [4].
To improve the effectiveness of colorectal cancer screening and treatment, polyps inside the colon must be detected early and accurately. Accordingly, detecting and monitoring the size of polyps plays a significant role. With developments in convolutional neural networks (CNNs) and computer vision, polyp segmentation is becoming more efficient and attracting the attention of researchers [5]. Methods for the detection and segmentation of polyps were introduced quite early using traditional image processing techniques. Karkanis et al. [6] suggested using color wavelet features to detect polyps in colonoscopy videos. Recently, Deeba et al. [7] proposed a three-part system that performs image enhancement, saliency detection, and feature extraction to classify polyps in colonoscopy images.
The rapid development of deep learning has produced many outstanding results in medical image processing, including the polyp segmentation problem [8]. Commonly used CNN architectures for medical image segmentation, including Fully Convolutional Networks (FCN) [9], Unet [10], and Deeplab [11], have achieved superior results compared to traditional methods. Some recently proposed methods combine multiple deep learning models to enhance the system's accuracy [12]-[16]. Yamada et al. [12] introduced a system that used deep learning to diagnose colonoscopy in real time: video frames from the endoscopy system are passed to the Faster R-CNN [17] method to detect and recognize polyps, and the results are displayed on screen with an audible warning. Guo et al. [13] proposed an ensemble based on pre-trained versions of three well-known models, Unet, PSPNet [18], and SegNet [19]; they applied transfer learning to Unet and SegNet, with VGG-16 [20] as the backbone. Nguyen et al. [14] proposed a multi-model deep encoder-decoder network that feeds multi-scale input images to multiple models, collecting more contextual information and capturing contours more precisely. Banik et al. [15] introduced a multi-model fusion for the polyp segmentation challenge: an enhanced CNN, namely a Dual-Tree Wavelet pooled CNN, fused with a Local Gradient Weighting-embedded level set method. Kang and Gwak [16] proposed an ensemble based on the Mask R-CNN [21] model, applying transfer learning with two different backbone structures and combining the two resulting models. Fan et al. [22] proposed a Parallel Reverse Attention network for polyp segmentation, using a parallel partial decoder to combine the high-level features.
Global maps were generated and treated as the region of interest for the next step, where a reverse attention module focuses on the boundary. This combination improved segmentation performance.
Unet is a CNN architecture that achieves outstanding results in medical image segmentation problems [23]. Many Unet-based deep learning models have been modified and upgraded to improve efficiency. Ibtehaz and Rahman [24] proposed MultiRes-Unet, which modified the Unet node structure with a residual structure and replaced the skip connections with a forward structure based on the residual connection. Jha et al. [4] proposed ResUnet++, combined with Conditional Random Fields and test-time augmentation; this network model was based on Unet combined with the residual structure [25]. Safarov and Whangbo [26] proposed A-DenseUnet, based on the Unet++ structure [27], applying dilated convolutions with different rates and an attention mechanism to enhance contextual features. Recently, the polyp segmentation task has attracted considerable research on deep learning models. Mahmud et al. [28] proposed PolypSegNet, which modifies the encoder-decoder architecture to reduce the semantic gap in the skip connections of the conventional Unet structure. Lin et al. [29] introduced RefineUnet, improved from the FCN architecture, in which skip connections are generated from the intermediate layers. Yeung et al. [30] applied the Unet architecture combined with short-range skip connections and deep supervision, and introduced a dual attention gate. Trinh et al. [31] compared the performance of deep learning models for polyp segmentation; the most accurate network required sufficiently high-performance devices. Sasmal et al. [32] proposed an adaptive Markov random field for the polyp segmentation task, an unsupervised learning method whose advantage is lower time consumption, although its efficiency does not surpass that of supervised methods.
In our previous research, we proposed TMD-Unet [33], based on a multiple-layer U^n-Net [34], multiple-input features, and dense skip connections for medical image segmentation. Polyp segmentation was also used for the evaluation and testing of the TMD-Unet model; however, that evaluation was performed on only one dataset. In this study, we focus on analyzing and evaluating the model on multiple datasets with multiple evaluation criteria.
Combination and ensemble methods have enhanced system efficiency. However, their disadvantage is that they consume a lot of computation time and memory. Unet-based network models applied to the polyp segmentation problem have also significantly improved performance. Nevertheless, the size of these models has not improved significantly compared to previous models. Liang and Hu [35] introduced the Recurrent Convolutional Neural Network (RCNN), which reuses convolutional kernels, improving the ability to extract contextual information. The main advantage of the RCNN is that it can keep the model size in check while increasing the effective depth of the network. We were inspired by the recurrence residual Unet model [36], which combines the advantages of the RCNN, the residual structure, and the Unet architecture for medical image segmentation. In this study, we propose the mRR-Unet network model, which changes the connectivity in the nodes of the traditional Unet architecture to reduce the network size while keeping the amount of feature information. In the proposed model, we use a 1×1 convolution at the output of each convolution node that doubles the number of feature maps, which are then used for the subsequent layers and skip connections. In summary, our essential contributions in this study are:
-We proposed a new network model for polyp segmentation named mRR-Unet that combines the strengths of recurrent convolution, the residual structure, and the Unet architecture. The proposed models show a performance improvement and a significant reduction in network model size compared to previous models.
-We implemented two versions of mRR-Unet: mRR1-Unet and mRR2-Unet. The digit in each model's name represents the number of iterations of the convolutional unit in the node of the Unet architecture.
-To validate the effectiveness of the proposed models, we also re-built the traditional Unet and recurrence residual Unet models in different versions for comparison. The models were evaluated on three popular colonoscopy image datasets: CVC-ClinicDB, ETIS-LaribPolypDB, and CVC-ColonDB.

II. MODIFIED RESIDUAL RECURRENT UNET MODEL (MRR-UNET)
The proposed model is inspired by the Unet model, with the nodes on the encoder and decoder sides built on the residual network and recurrent convolution architecture. In this study, we build the model with five levels, following the traditional Unet. In this section, the details of the proposed models are described.

A. MODEL ARCHITECTURE
FIGURE 1 shows the details of mRR-Unet. The model is based on the conventional Unet. The nodes on the Unet topology are designed to consist of a recurrent convolution layer (RCL) along with a double unit (D). We included the double unit to improve the quality of the feature maps from the encoder nodes: it increases the number of feature maps to ensure the quality of the features. With this design, the network size is significantly reduced while the number of features available to the model is preserved. The structure of the RCLs block is depicted in FIGURE 2. Each RCLs block consists of two 3×3 convolution units, C1 and C2. The C2 unit is reused according to the recurrent structure and combined with the residual connection. The outputs of the RCLs are expressed as follows:

f_1 = C_1(I),    f_n = C_2(f_{n-1}) + I,  n > 1    (1)

where f_n indicates the output feature of the n-th convolution unit, C_1(·) and C_2(·) represent convolution units 1 and 2, the addition of I realizes the residual connection, and I denotes the input of the RCLs.
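The recurrence of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the callables `conv1` and `conv2` stand in for the 3×3 convolution units C1 and C2 (real convolutions with weights and activations are abstracted away), and `rcl_block` is a hypothetical name.

```python
import numpy as np

def rcl_block(x, conv1, conv2, t=2):
    """Sketch of the recurrent residual unit of Eq. (1):
    f_1 = C1(I); f_n = C2(f_{n-1}) + I, reusing conv2's weights t times.
    t = 1 corresponds to mRR1-Unet, t = 2 to mRR2-Unet."""
    f = conv1(x)           # f_1 = C1(I)
    for _ in range(t):     # C2 is reused according to the recurrent structure
        f = conv2(f) + x   # residual connection back to the block input I
    return f
```

With toy stand-ins for the convolutions (e.g., `conv1 = lambda a: 2 * a`), the function makes the weight reuse explicit: only C1 and C2 carry parameters, however many iterations are unrolled.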

B. THE CONNECTIVITY
In this study, we deploy the proposed model in two versions, mRR1-Unet and mRR2-Unet, where the digits '1' and '2' indicate the number of iterations of the convolution block C2 in the RCLs. Let F_i ∈ R^{n×n×k} be the output feature of the RCL unit, with F_i = f_n; the output of the node is then

χ_i = [F_i, C_{1×1}(F_i)] ∈ R^{n×n×2k}

where n×n is the size of the feature map and k is the number of filters of the convolution in the node.
The [·] denotes the concatenation function and C_{1×1}(·) the 1×1 convolution. The number of filters in the convolution units is the main factor determining the network size. In this study, the 1×1 convolution was used to reduce the computation time by decreasing the number of parameters while keeping the feature maps; the effectiveness of 1×1 convolution is demonstrated in [37], [38]. By using the 1×1 convolution, the output features are retained while the network size is reduced. In the decoder nodes of mRR-Unet, the input of the nodes is defined as follows:

I_D^i = B([χ_E^i, U(χ_D^{i-1})])

where B is the batch normalization function, χ_E^i denotes the output feature from the i-th encoder node, U(·) denotes the up-convolution, and χ_D^{i-1} is the output feature from the previous decoder node. In this study, we built the architecture with five layers, including four encoder-decoder layers and one transition layer. Details of the number of layers, the number of filters in the network nodes, the convolution types, and the functions used in the mRRn-Unet model are shown in TABLE 1.
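The channel bookkeeping of the node output and the decoder input can be checked with a shape-level sketch. This is one plausible reading of the double unit, not the authors' code: `double_unit` and `decoder_input` are hypothetical names, the 1×1 convolution is modeled as a per-pixel channel matmul, and batch normalization is omitted.

```python
import numpy as np

def double_unit(F, w):
    """Sketch of the double unit D: the RCL output F (H x W x k) is
    concatenated with a 1x1 convolution of itself, giving 2k channels.
    w has shape (k, k): a 1x1 conv is a per-pixel linear map over channels."""
    F1x1 = F @ w                                 # 1x1 convolution
    return np.concatenate([F, F1x1], axis=-1)    # chi_i in R^{H x W x 2k}

def decoder_input(skip, prev, up):
    """Decoder node input: concatenate the encoder skip feature with the
    up-convolved previous decoder feature (batch norm omitted here)."""
    return np.concatenate([skip, up(prev)], axis=-1)
```

Running the sketch with an identity 1×1 kernel confirms the doubling: a (4, 4, 3) feature becomes (4, 4, 6), and a decoder node concatenating a skip feature with an upsampled previous output doubles channels again.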

III. EXPERIMENTAL SETUP

A. DATASETS
The datasets used in this study consist of three popular polyp segmentation datasets.
-CVC-ClinicDB [39] is a three-channel color image dataset, which includes 612 images in tiff format extracted from 25 different colonoscopy videos. The dataset was supplied by the 2015 MICCAI sub-challenge on automatic polyp detection. Most of the images are 384×288 in size. Each image contains polyps that are diverse in size and points of view. The annotation labels of the polyps are also included in the dataset.
-CVC-ColonDB [40] provides 300 frames from 15 short colonoscopy videos. The images are in RGB format with a size of 574×500. Region-of-interest annotations are supplied for all images: the pixels belonging to the polyps are marked, and the ground truth is provided as binary images. The images in the dataset cover polyps of various sizes and types.
-ETIS-LaribPolypDB [41] provides 196 frames with the size of 1225×966, which are extracted from 34 different videos. There are a total of 44 polyps included in the dataset. The polyps that appeared in the frames are diverse in shapes and sizes.
B. TRAINING SETUP
For training and inference, the original images were scaled to 224×224. This strategy was applied to all the datasets used in this study. Downscaling reduces the image resolution and may cause information loss; however, the networks require a fixed-size input and consistent settings, so training, testing, and performance comparisons were all done under the same configuration to ensure a fair evaluation of the models. Because of the limited number of images, data augmentation was applied during training. The image processing techniques used for creating new data consist of rotation in the range of 50 degrees, height and width shifts in the range of 0.2, horizontal flipping, and shearing in the range of 0.5. After applying the augmentation techniques, the network models were trained with a batch size of eight images, and each epoch consisted of one thousand batches. This strategy was applied to all datasets. FIGURE 3 presents some examples from the datasets used to evaluate the proposed network models.
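The preprocessing and augmentation settings above can be collected in one place. This is a framework-agnostic sketch, not the authors' training script; the keyword names follow the conventions of Keras' `ImageDataGenerator`, which is an assumption about how such settings are usually expressed.

```python
# Augmentation and input settings as stated in the text (values from the paper;
# parameter names follow Keras ImageDataGenerator conventions, an assumption).
augmentation = dict(
    rotation_range=50,        # random rotation within +/- 50 degrees
    width_shift_range=0.2,    # horizontal shift up to 20% of the width
    height_shift_range=0.2,   # vertical shift up to 20% of the height
    shear_range=0.5,          # shearing intensity
    horizontal_flip=True,     # random horizontal flipping
)
TARGET_SIZE = (224, 224)      # all images rescaled before training/inference
BATCH_SIZE = 8                # eight images per batch
BATCHES_PER_EPOCH = 1000      # one thousand batches per epoch
```

Keeping these values in a single dict makes it easy to apply the same strategy to all three datasets, as the text requires.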
The early-stopping mechanism was applied to save training time. The hyper-parameters of the training process were set as follows: 150 epochs, a dropout rate of 0.2, and an initial learning rate of 3 × 10^-4. A learning rate schedule was used to update the learning rate every ten epochs, determined by the following expression:

learning rate = IRL × 0.9^(NoE/10)

where IRL is the initial learning rate and NoE is the epoch number.
The loss function used in this study is a combination of the dice loss and the weighted cross-entropy (WCE). The hybrid loss function aims to solve the imbalance between positive and negative classes. The expression of the loss function is as follows:

L_hybrid = L_DL + L_WCE    (7)

where L_hybrid represents the total loss, and L_DL and L_WCE are the dice loss and the weighted cross-entropy loss, respectively:

L_DL = 1 - (2 Σ_i y_gt^i y_pred^i + ε) / (Σ_i y_gt^i + Σ_i y_pred^i + ε)

L_WCE = -(1/P) Σ_i [ω y_gt^i log(y_pred^i) + (1 - ω)(1 - y_gt^i) log(1 - y_pred^i)]

where y_gt^i indicates the ground-truth label of pixel i and y_pred^i denotes the predicted probability that pixel i belongs to the positive class. P is the number of pixels, ω denotes the weight of the foreground class, set to 0.3 in this study, and ε is a small value that prevents division by zero, set to 10^-8.
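The hybrid loss can be sketched with numpy. This is an illustrative reconstruction under the paper's settings (ω = 0.3, ε = 10^-8), not the authors' exact implementation: the reduction details (mean vs. sum, per-batch handling) are assumptions, and `hybrid_loss` is a hypothetical name.

```python
import numpy as np

def hybrid_loss(y_gt, y_pred, omega=0.3, eps=1e-8):
    """Sketch of L_hybrid = L_DL + L_WCE for a binary mask.
    y_gt: ground-truth labels in {0, 1}; y_pred: predicted probabilities."""
    y_gt = y_gt.ravel().astype(float)
    y_pred = np.clip(y_pred.ravel().astype(float), eps, 1 - eps)
    # Dice loss: 1 minus the (eps-stabilized) dice coefficient
    inter = np.sum(y_gt * y_pred)
    dice = (2 * inter + eps) / (np.sum(y_gt) + np.sum(y_pred) + eps)
    l_dl = 1.0 - dice
    # Weighted cross-entropy: foreground term weighted by omega = 0.3
    l_wce = -np.mean(omega * y_gt * np.log(y_pred)
                     + (1 - omega) * (1 - y_gt) * np.log(1 - y_pred))
    return l_dl + l_wce
```

A perfect prediction drives both terms to (nearly) zero, while an inverted prediction produces a large loss, which is the behavior the hybrid formulation relies on to handle class imbalance.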
All experiments were performed on a personal computer with NVIDIA GeForce RTX 3070 Graphics Processing Unit, Intel Core i7 11370H CPU, and 24GB of RAM.

C. EVALUATION METRICS
The metrics used in this study include the Dice score (Dice), F1-score, mean Intersection over Union (mIoU), Precision, and Recall. The metrics are expressed as follows:

Dice = 2|Y_true ∩ Y_pred| / (|Y_true| + |Y_pred|)

IoU = |Y_true ∩ Y_pred| / |Y_true ∪ Y_pred|

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 × Precision × Recall / (Precision + Recall)

where Y_true indicates the set of ground-truth values, Y_pred denotes the set of predicted values, and TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative cases.
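For binary masks, all five metrics reduce to counts from the confusion matrix. A minimal sketch (hypothetical helper names; a small ε guards against empty masks, an assumption about edge-case handling):

```python
import numpy as np

def confusion(y_true, y_pred):
    """TP/TN/FP/FN counts for binary masks with values in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

def metrics(y_true, y_pred, eps=1e-8):
    """Precision, Recall, F1, Dice, and IoU from the confusion counts."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)   # equals F1 for binary masks
    iou = tp / (tp + fp + fn + eps)
    return dict(precision=precision, recall=recall, f1=f1, dice=dice, iou=iou)
```

Note that for a single binary mask the Dice score and the F1-score coincide; mIoU is obtained by averaging the IoU over classes or images.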

IV. SEGMENTATION RESULTS
To prove the effectiveness of the proposed network models, we employed five different models: the traditional Unet, the residual recurrent Unet with a filter number of 16 and t = 1 (RR16-1), RR16-2 (t = 2), the residual recurrent Unet with a filter number of 32 (RR32-1), and RR32-2 (t = 2). All models were trained and evaluated with the same settings. We conducted the training and inference process fifteen times for every model on every dataset to ensure precision.
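Repeating training and inference fifteen times yields a sample of scores per model and dataset that can be summarized before comparison. A sketch of one common way to aggregate such runs (normal-approximation 95% interval; the paper's exact statistical procedure is not specified here, and `summarize_runs` is a hypothetical name):

```python
import numpy as np

def summarize_runs(scores, z=1.96):
    """Aggregate repeated runs (e.g., 15 train/eval repetitions) into a
    mean and an approximate 95% confidence interval (normal approximation)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = z * scores.std(ddof=1) / np.sqrt(len(scores))  # sample std
    return mean, (mean - half, mean + half)
```

Comparing the intervals (or running a significance test on the paired samples) is what supports claims such as "the difference is statistically significant" in the results below.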

A. THE SEGMENTATION RESULT ON THE CVC-CLINICDB DATASET
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

The segmentation results of the models on the CVC-ClinicDB dataset show that mRR2 achieved the best values (94.64%). Compared to Unet, the evaluated values increased by 2.68%, 2.64%, 4.6%, 3.18%, and 1.81% in Dice score, F1-score, mIoU, Recall, and Precision, respectively. RR32-2 reached the second-best values among the models. mRR1 scores lower than RR32-2 but still outperforms the traditional Unet and the other models; although the performance of mRR1 was not better than that of RR32, its model size was significantly smaller.

Both mRR1 and mRR2 obtained better results than the residual recurrence models on the three crucial metrics, Dice, F1-score, and mIoU, and the residual recurrence models in turn outperform the conventional Unet. In summary, the mRR models achieved the best results on the ETIS-LaribPolypDB dataset.

We also computed the confidence intervals between mRR and the traditional Unet and RRUnet. mRR2 achieved statistically significant improvements compared with the others; for mRR1, the differences compared with RR32-1 and RR32-2 are non-significant on the CVC-ClinicDB dataset. The p-value statistics table shows the reliability of the proposed model.

D. THE QUALITATIVE RESULTS
FIGURE 4 illustrates the qualitative results of the models implemented in this study. In all cases, the Unet model performed the worst, while the mRR2 model produced the best segmentations on both small and large polyps. In some cases, when the polyp size is small, the traditional Unet cannot recognize the polyps, whereas both RRUnet and mRR segment small polyps well. As seen in FIGURE 4, mRR showed better segmentation results than the other models, including on extensive and differentiated polyps.

Citation information: DOI 10.1109/ACCESS.2022.3184773

FIGURE 5(a) depicts the training curves of the models over the validation loss value. The mRR model shows stability during training compared to the other models.
The residual recurrence models also show more stability than the conventional Unet, and the proposed model reached the smallest validation loss value. FIGURE 5(b) depicts the training curve on the validation dice value. The proposed model achieves a better validation value than Unet and the residual recurrence models; the residual recurrence models also outperform Unet, but the proposed model has the best results.

B. FEATURE MAP VISUALIZATION
Convolutional units with many filters significantly increase the network size, and for datasets with few samples this can easily lead to overfitting. In this study, we proposed reducing the number of filters in the convolution layers in the nodes of the Unet architecture, which reduces the network size while still ensuring sufficient features for the model. FIGURE 6 illustrates the feature maps at the network nodes of the Unet structure.
Observing FIGURE 6, we can see that the features passing through the network nodes of mRR still retain complete information, leading to results of comparable quality to Unet. FIGURE 7 illustrates the comparison between the models' size, performance, and inference time. As shown in FIGURE 7, although the mRR model is smaller than the Unet and RR32 models, its performance is superior. Smaller models (smaller bubbles) have lower inference times, yet mRR1 and mRR2 outperform the rest of the models. The RR32 models are much larger than the rest, but their evaluation results are not superior. The proposed model proved effective on all datasets: with the modification of the node output, the model size is reduced and the feature maps are used more efficiently. TABLE 7 compares the performance of the proposed models with current models on the datasets used in this study. Many Unet-based models perform polyp segmentation; here, the proposed models are compared with DoubleUnet [42], ResUnet++ [4], TMD-Unet [33], MED-net [14], and FCN-8s [43]. As observed in TABLE 7, the segmentation results of mRR2 are better than those of previous models on most of the evaluation metrics. Among single-model methods, the proposed models have smaller sizes while still achieving better segmentation results on all datasets.
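The size saving behind the node design can be made concrete with a parameter count. A quick arithmetic sketch with illustrative channel numbers (not taken from the paper's tables); `conv_params` is a hypothetical helper:

```python
def conv_params(c_in, c_out, k):
    """Number of weights plus biases in a k x k convolution layer."""
    return c_in * c_out * k * k + c_out

# Producing 64 output channels from 64 input channels: a 3x3 convolution
# needs roughly nine times the parameters of a 1x1 convolution, which is
# the saving the 1x1 double-unit path exploits (illustrative numbers).
p3 = conv_params(64, 64, 3)   # 3x3 path
p1 = conv_params(64, 64, 1)   # 1x1 path
```

Concretely, the 3×3 layer holds 36,928 parameters against 4,160 for the 1×1 layer, so widening the feature maps through 1×1 convolutions instead of additional 3×3 units keeps the model compact.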

VI. CONCLUSION
This study proposed a new network model that modifies the recurrence residual Unet. The proposed model is based on reusing convolutional units, which reduces the network size while still achieving better results. The number of filters in the convolution nodes is a key factor in feature map optimization. In this paper, we used double blocks to help reduce the network size while preserving the quality of the feature maps. Our proposed model obtains better results with a smaller size, which proves its effectiveness for segmenting polyps. Applying models with a smaller size but assured performance will support research on devices with limited performance.