
ConvSegNet: Automated Polyp Segmentation From Colonoscopy Using Context Feature Refinement With Multiple Convolutional Kernel Sizes


Abstract:

Colorectal cancer occurs in the rectum of humans, and early detection has been shown to reduce its mortality rate. Colonoscopy is the standard procedure for detecting the presence of polyps in the rectum, and accurate segmentation of polyps from colonoscopy images often provides helpful information for early diagnosis and treatment. Although existing deep learning models often achieve high segmentation performance when tested on the same dataset used for training, their performance often degrades when applied to out-of-distribution datasets, leading to low model generalization or overfitting. This challenge is often associated with the quality of the features learnt from the input images. In this work, a novel Context Feature Refinement (CFR) module is proposed to address the challenge of low model generalization and segmentation performance. The CFR module is built to extract contextual information from the incoming feature map by using multiple parallel convolutional layers with progressively increasing kernel sizes. Using multiple parallel convolutions with different kernel sizes helped to extract more efficient multi-scale contextual information and thus enabled the network to effectively identify and segment small and fine details, as well as larger and more complex structures in the input images. Extensive experiments on three public benchmark datasets, CVC-ClinicDB, Kvasir-SEG, and BKAI-NeoPolyp, showed that the proposed ConvSegNet model achieved Jaccard, Dice, and F2 scores of 0.8650, 0.9177, and 0.9328 on CVC-ClinicDB; 0.7936, 0.8618, and 0.8855 on Kvasir-SEG; and 0.8045, 0.8747, and 0.8909 on BKAI-NeoPolyp, respectively. Also, the ConvSegNet model achieved improved generalization performance compared to the benchmark polyp segmentation models. Code is available at https://github.com/AOige/ConvSegNet.
Published in: IEEE Access ( Volume: 11)
Page(s): 16142 - 16155
Date of Publication: 14 February 2023
Electronic ISSN: 2169-3536

This work is licensed under a Creative Commons Attribution 4.0 License (CC BY). For more information, see https://creativecommons.org/licenses/by/4.0/.
SECTION I.

Introduction

Image segmentation is an essential task in biomedical imaging, and it has seen application across various clinical areas. Image segmentation is a technique for splitting images into easily analyzable and interpretable Regions of Interest (ROI). In recent times, deep neural networks, especially the convolutional neural network (CNN), have improved image segmentation compared to shallow networks [1], [2]. This improvement has found applicability in numerous segmentation areas, such as brain tumors [3], skin cancers [4], COVID-19 [5], and lung cancer [6], among others. Recently, researchers have intensified their focus on polyp segmentation from colonoscopy due to the mortality associated with colorectal cancer [7], [8]. Colorectal cancer is the second most common cancer type among women and the third most common among men [9]. Generally, polyps indicate the presence of colorectal cancers in the rectum, and early detection and removal are essential to mitigate mortality. Polyps are abnormal tissues generated from the mucus membrane; they have been found to be present in 50% of individuals who undergo colonoscopy screening, and their frequency often increases with age [10]. However, detecting polyps from colonoscopy manually is quite laborious, and the miss rate is between 14% and 30%, with polyp type and size being the determining factors [11].

In most cases, polyps may be hidden from the line of vision during manual inspection and, sometimes, might be present in the operator’s range of view but remain undetected [12]. These challenges have prompted the development of real-time Artificial Intelligence (AI) algorithms, as seen in [13]. The polyp segmentation technique in this scenario strives to accurately delineate the polyp border from the surrounding mucosa and detect polyps. Also, various forms of noise, such as shadows, blurriness, and reflections, can be present in colonoscopy images, which can also affect the detection of polyps [14]. Recently, several deep learning models have been proposed to effectively extract cogent features to aid the segmentation of polyps from colonoscopy [15], [16], [17], [18]. However, general limitations of image segmentation models are the quality of features extracted from input images and the low segmentation performance achieved when tested on out-of-distribution datasets, which leads to low model generalization [19]. Due to the varying sizes and types of polyps, several models have been proposed to address the issues of low-quality feature extraction and low model generalization in polyp segmentation when tested on new colonoscopy images. However, this remains a challenging area in polyp segmentation from colonoscopy. In this work, we propose ConvSegNet, an image segmentation model that uses a novel Context Feature Refinement (CFR) module to address low model generalization and segmentation performance. The novel CFR module is built to extract quality features by applying multiple parallel convolutional layers with different kernel sizes in the decoder block. This unique structure enables the network to effectively capture multi-scale context features, which allows it to identify and segment small and fine details as well as larger and more complex structures in the image. This is crucial for achieving high-quality segmentation results.

Specifically, our main contributions are fourfold:

  • The CFR module leverages progressively increasing kernel sizes to extract contextual information from feature maps.

  • The proposed model improved the quality of extracted features and is efficient in terms of speed and size, as it achieved improved segmentation performance with few parameters at a standard frame rate (Frames Per Second, FPS).

  • Comprehensive experiments were done on three datasets to evaluate the ConvSegNet model against other benchmark polyp segmentation methods using six standard performance metrics.

  • Lastly, we have expanded the standard benchmarks for polyp segmentation, which can be used to create clinically useful procedures.

The rest of this paper is organized as follows. Section II discusses the state-of-the-art polyp and biomedical image segmentation models. Section III describes the architecture of the proposed ConvSegNet model, Section IV describes the datasets, performance evaluation metrics, and experimental results, and Section V concludes.

SECTION II.

Related Works

Various traditional methods have been adopted for polyp segmentation. For example, Hwang et al. [20] presented a polyp detection approach based on the elliptical shape of virtually all small polyps. Segmentation was done based on watershed image segmentation and an ellipse-fitting method, and then curvature matching and contour distance were used to separate the ellipses of the polyp and non-polyp zones. Ameling et al. [21] considered textural cues and local binary patterns for polyp segmentation. In [22], these two methods were combined by considering shape and texture for polyp segmentation. The shape feature was utilized to accurately identify polyps with curving borders, while the texture information was used to discriminate polyps from non-polyp structures. Also, various Machine Learning (ML) models have been proposed, as seen in [23], [24], and [25], among others. As shown in Figure 1, ML techniques involve data pre-processing, handcrafted feature extraction, and feature selection before the classification phase. In contrast, deep learning models skip most of these phases and achieve better results. The introduction of deep learning models, especially CNNs, has been prompted by several limitations of machine learning models, including issues with automated segmentation of biomedical images, considerable changes in form, size, texture, and in some cases colour of the ROI between patients, and poor contrast between areas [26], [27].

FIGURE 1. Classifier approach: (a) machine learning, (b) deep learning.

CNN architectures have improved semantic image segmentation, with most of the existing architectures based on U-Net [28], a modified architecture developed for biomedical image segmentation. The U-Net comprises an encoding network that captures image context and a symmetrical decoding network that allows the localization of salient regions. Several other models have been proposed based on the U-Net architecture. UNet++ [29] expands the U-Net by including skip connections that close the semantic gap between the encoder’s and decoder’s feature maps before fusion. In [30], the ResUNet architecture was proposed as a semantic segmentation neural network that integrates the strengths of U-Net and residual neural networks. This combination allowed the residual unit to facilitate network training. The skip connections within the residual unit and between low and high levels of the network facilitated information propagation without degradation, which allowed the architecture to be designed with few parameters while still achieving comparable semantic segmentation performance. Based on these architectures, various image segmentation models have been proposed. Polyps come in a variety of shapes and sizes.

A small colorectal polyp may lack distinguishing textural characteristics in its early stages, making it easy to confuse with normal intestinal tissue. Therefore, some biomedical image segmentation models might perform well on other biomedical image datasets but perform below par when applied to polyp segmentation, as seen in [16], [31], [32], [33], and [34]. For this reason, more polyp-specific segmentation architectures are being proposed. The following section presents some existing state-of-the-art architectures for polyp segmentation from colonoscopy.

A. Polyp Segmentation Architectures

Jha et al. [35] developed the ResUNet++ model to integrate residual units with Atrous Spatial Pyramid Pooling (ASPP) and a squeeze-and-excitation block based on channel attention. Yeung et al. [12] proposed Focus U-Net, a dual-attention-gated model that combined spatial and channel-based attention and used a hybrid focal loss to address class imbalance in polyp datasets. The model was tested on several polyp benchmarking datasets and evaluated against U-Net and Attention U-Net [36], and it showed improved performance. However, the model ignored computational and generalization efficiency and focused solely on segmentation performance. Kim et al. [37] presented a U-Net-based architecture with extra encoder and decoder modules called UACA-Net. Foreground, background, and uncertain region maps are calculated for each representation using saliency maps computed by a prediction module in the UACA-Net, and the next prediction module computes the relationship between each representation and employs it.

Mahmud et al. [38] proposed the PolypSegNet architecture based on an encoder-decoder design. The features aggregated into each unit layer were produced by several successive depth-dilated inception blocks. Rather than connecting different levels of the encoder and decoder separately, contextual information at different scales from all encoder unit layers was fed through PolypSegNet’s deep fusion skip module to generate skip interconnections with each decoder layer. This addressed computational efficiency. However, the generalization performance of the model is quite low due to the plain skip interconnections, because plain skip connections tend to combine semantically diverse low- and high-level convolutional features, resulting in hazy feature maps.

Zunair and Hamza [32] presented Sharp U-Net, which does away with plain skip connections. Before merging the encoder and decoder features, a depthwise convolution of the encoder feature map with a sharpening kernel filter was used instead of the simple skip connection. They were thus able to create a sharpened intermediate feature map of the same size as the encoder map. The model was also able to smooth out artefacts throughout the network layers during the early training phases by applying the sharpening filter layer. Experiments were done on polyp, COVID-19, lung, and three other datasets. Even though the model achieved higher Jaccard and Dice scores on the five other datasets, results on polyp segmentation from colonoscopy were relatively low, with a Jaccard of 83.98% and a Dice of 90.05%. The low performance can be attributed to the varying sizes and shapes of polyps that the sharpening kernel filter might ignore due to the semantic gap between the encoder and decoder features.

Zhao et al. [39] proposed the MSNet architecture to segment polyps from colonoscopy images. They combined lower-order and higher-order cross-level complementary information with level-specific information to increase multi-scale feature representation by pyramidally concatenating numerous subtraction units. Even though the model achieved high segmentation performance when trained and tested on the same polyp dataset, its generalization performance was low, and the number of parameters was relatively high. Also, in [40], a Context Extractor Module was proposed, consisting of a DAC block and an RMP block. The DAC block utilized three different dilated convolutions with a fixed 3×3 kernel size, while the RMP block used a multiple-pooling strategy with different pooling windows (2×2, 3×3, 5×5, and 6×6) and then performed upsampling to obtain equal spatial dimensions for concatenation. However, with this approach, positional information is lost, which directly influences the quality of the features learnt.

Some methods based on the inception module have also been proposed to improve the segmentation of polyps from colonoscopy. For example, Qadir et al. [41] used two ensemble models based on inception, benchmarked on the CVC-ColonDB dataset [22]; the model achieved a recall of 72.59%, precision of 80%, Jaccard of 61.24%, and a Dice score of 70.42%. However, the inception module uses 1×1, 3×3, and 5×5 convolutions together with some pooling operations, which also caused positional information loss over a broad range of features from the input, thereby affecting the segmentation and generalization performance of such models. Tomar et al. [42] proposed FANet for polyp segmentation from colonoscopy. FANet is an attention feedback model that unifies the mask from the previous epoch with the feature map of the current training epoch, using the previous-epoch mask to provide hard attention to the learned feature maps at several convolutional layers. Even though the model achieved high performance on some other datasets, results on polyp datasets were not state-of-the-art, due to the max pooling applied to the input mask before scaling. Also, the FANet model is parameter-heavy.

In a bid to develop lightweight segmentation models, Valanarasu and Patel [43] proposed UNeXt, a convolutional multilayer perceptron (MLP)-based network for image segmentation. The model was designed with an early convolutional stage and an MLP in the latent stage. Experiments showed that the UNeXt model was able to improve segmentation performance with minimal model parameters. Also, Li-SegPNet was proposed in [44]. The model utilized a unique encoder block with modified triplet attention to harness cross-dimensional interaction in feature maps. To solve the issue of segmenting objects at various sizes, the authors employed spatial pyramid pooling and used an attention-gating-based modified skip connection to overcome the semantic discrepancy between the encoder and decoder. The model was evaluated on CVC-ClinicDB and Kvasir-SEG, and the results showed that it performed best on medium-sized polyps, with below-par performance on smaller polyps. This limitation can be attributed to the pooling operations in their architecture [38].

In this paper, we propose the ConvSegNet model, which uses progressively increasing kernel sizes of 1×1, 3×3, 7×7, and 11×11 in the CFR module, without pooling operations. By doing this, a broader range of features can be extracted progressively from the input, which helps to capture more discriminative features. The novelty of the proposed ConvSegNet model lies in the Context Feature Refinement (CFR) module used in the decoder block. The CFR module consists of four parallel convolutional layers with progressively increasing kernel sizes; this unique structure enables the network to effectively capture multi-scale context features. A detailed description of the proposed ConvSegNet model is presented in the next section.

SECTION III.

Methodology

To address the challenges of the existing architectures, we propose a novel Context Feature Refinement (CFR) module to extract contextual information from the incoming feature map by applying multiple parallel convolutional layers with different kernel sizes. This section presents the data processing method, model block diagram, and the architecture of the proposed ConvSegNet model.

A. Data Pre-Processing

The images and masks were resized to 256 × 256 pixels, followed by pixel value normalization. Data augmentation techniques such as random rotation, horizontal flipping, vertical flipping, and coarse dropout were used to improve the robustness of the input data.
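As a concrete illustration of this pipeline, the following sketch uses the albumentations library; the specific probabilities, rotation limit, and CoarseDropout parameters are assumptions for illustration rather than values reported in the paper.

```python
# Illustrative pre-processing/augmentation pipeline (assumed parameters).
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.Resize(256, 256),                    # resize images and masks to 256 x 256
    A.Rotate(limit=35, p=0.3),             # random rotation
    A.HorizontalFlip(p=0.5),               # horizontal flipping
    A.VerticalFlip(p=0.5),                 # vertical flipping
    A.CoarseDropout(max_holes=8, max_height=32, max_width=32, p=0.3),
    A.Normalize(),                         # pixel value normalization
    ToTensorV2(),
])

# Usage: sample = train_transform(image=image, mask=mask)
```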

B. Context Feature Refinement Module

The Context Feature Refinement (CFR) module in the decoder block is shown in Figure 2. The CFR module is built to extract contextual information from the incoming feature map by applying multiple parallel convolutional layers with progressively increasing kernel sizes, as shown in Equation 1, where the feature map O_{x} is given as:
\begin{equation*} O_{x}=b_{x}+\sum \limits _{r} F_{xr}\ast I_{r} \tag{1}\end{equation*}
where F_{xr} is the convolutional kernel, I_{r} is the input, b_{x} is the bias term, and \ast denotes the convolution operation. We then concatenate the outputs of these layers and pass them through a 1×1 convolution to refine the features. The CFR module begins with four parallel convolutional layers with 1×1, 3×3, 7×7, and 11×11 kernel sizes, respectively, as shown in Figure 2.

FIGURE 2. Decoder block with context feature refinement module.

Using different kernel sizes increases the receptive field during the convolution operation, which helps to better capture contextual features from the input feature map. Zero padding was used in our module to ensure that all the feature maps have the same spatial dimensions, allowing easy concatenation into a single feature map. After that, each convolutional layer is followed by batch normalization and a ReLU activation function, which is given in Equation 2:
\begin{equation*} f\left ({x }\right)=\max \left ({0,I_{x} }\right) \tag{2}\end{equation*}
Next, we concatenate the outputs of the four ReLU layers along the channel axis and pass them through a 1×1 convolutional layer, which is again followed by a batch normalization and ReLU layer.
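To make this structure concrete, the following PyTorch sketch implements the CFR module as described: four parallel convolutions with kernel sizes 1, 3, 7, and 11, zero ("same") padding so all branches keep the input's spatial size, batch normalization and ReLU on each branch, channel-wise concatenation, and a final 1×1 refinement convolution. The class name and channel arguments are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class CFRModule(nn.Module):
    """Context Feature Refinement: parallel 1x1/3x3/7x7/11x11 convolutions."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        def branch(k):
            # zero padding of k // 2 keeps every branch at the input resolution
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
        self.branches = nn.ModuleList([branch(k) for k in (1, 3, 7, 11)])
        self.refine = nn.Sequential(
            nn.Conv2d(4 * out_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # concatenate the four context branches along the channel axis, then refine
        return self.refine(torch.cat([b(x) for b in self.branches], dim=1))
```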

C. ConvSegNet

Our model uses the novel Context Feature Refinement (CFR) module to extract contextual information from the incoming feature map. The proposed architecture is fed with an RGB image passed to the encoder, which consists of a pre-trained ResNet50. The ResNet50 is used to extract features at different levels from different blocks with varying resolutions. Each feature map is then passed into a 3×3 convolutional layer to reduce the number of feature channels to 64. The convolutional layer is further followed by a batch normalization layer and a ReLU activation function. Next, the network is followed by four decoder blocks, each taking the previous feature maps as the main input and a skip connection (indicated by a red arrow in Figure 2 and Figure 3).

FIGURE 3. Architecture of the ConvSegNet.
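As a rough sketch of this encoder stage, the code below pulls multi-resolution feature maps from a torchvision ResNet50 and reduces each to 64 channels with a 3×3 convolution, batch normalization, and ReLU. Which backbone stages are tapped, and the torchvision weights argument, are assumptions for illustration.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # pre-trained ResNet50
        # stages producing feature maps at progressively lower resolutions
        self.stage1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        self.stage2 = nn.Sequential(backbone.maxpool, backbone.layer1)
        self.stage3 = backbone.layer2
        self.stage4 = backbone.layer3
        # 3x3 convolutions reducing each feature map to 64 channels
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 64, 3, padding=1),
                          nn.BatchNorm2d(64), nn.ReLU(inplace=True))
            for c in (64, 256, 512, 1024)
        ])

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4):
            x = stage(x)
            feats.append(x)
        return [r(f) for r, f in zip(self.reduce, feats)]
```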

Each decoder block begins with an upsampling layer, where the spatial dimensions (height and width) of the incoming feature map are increased by a factor of two using bilinear interpolation. This is followed by a concatenation of the upsampled feature map with the feature map from the skip connection. Using these skip connections helps to provide additional information to the decoder, allowing it to generate better semantic features, while also providing additional paths for a better flow of gradients during backpropagation.

The concatenated feature maps are then passed through the novel context feature refinement module, which uses four convolutional layers with varying kernel sizes to extract contextual information from the input feature. Next, we concatenate the contextual information and pass it through a convolutional layer for refinement. The refined feature acts as the output of the decoder block, which is further passed to the next decoder. In the last decoder block, we used the low-level features from the input image: the input image is passed through a convolutional layer and used as the skip connection. By doing this, we were able to take advantage of the low-level features to generate high-quality semantic features. The output from the last decoder block is then passed through a 1×1 convolutional layer followed by a sigmoid activation function, which generates a binary segmentation mask.
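A decoder block following this description could be sketched as below, reusing the CFRModule sketch given earlier: bilinear upsampling by a factor of two, concatenation with the skip feature along the channel axis, and refinement through the CFR module. The channel arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        self.cfr = CFRModule(in_channels + skip_channels, out_channels)  # from the sketch above

    def forward(self, x, skip):
        x = self.up(x)                    # double the spatial resolution
        x = torch.cat([x, skip], dim=1)   # fuse with the skip-connection feature
        return self.cfr(x)                # multi-scale context refinement
```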

SECTION IV.

Experimental Results

This section presents the dataset description, the details of our implementation, and the standard performance metrics considered in the evaluation of the proposed model against benchmark architectures. We then present a detailed comparison of the performance of the proposed model based on quantitative and generalization experiments. Ablation studies are also presented, as well as experiments on two other non-polyp datasets, to demonstrate the extensibility of the proposed ConvSegNet model.

A. Datasets

According to the literature, six publicly available datasets have been used for benchmarking polyp segmentation from colonoscopy: Kvasir-SEG [45], CVC-ClinicDB [46], ETIS-Larib [44], CVC-ColonDB [22], BKAI-NeoPolyp [47], and CVC-300 [48]. Of these six, only CVC-ClinicDB, Kvasir-SEG, BKAI-NeoPolyp, and ETIS-Larib contain manually labelled ground truth masks. Among these four, studies have shown that Kvasir-SEG and CVC-ClinicDB are the most used datasets for fair generalization evaluation since they are both in standard definition. ETIS-Larib is in high definition and has only 196 images, while BKAI-NeoPolyp has 1200 images. For this reason, we have chosen Kvasir-SEG with 1000 images, CVC-ClinicDB with 612 images, and BKAI-NeoPolyp with 1200 images as the benchmark datasets to evaluate the proposed ConvSegNet model.

1) Kvasir-SEG Dataset

This dataset was extracted from the polyp class in the Kvasir dataset [49]. Kvasir-SEG contains 1000 polyp images, their accompanying masks, and bounding box information taken by electromagnetic imaging devices. The segmentation task can be done with the images and their ground truths, whereas the detection task can be done with the bounding box information. The images in this dataset range in resolution from 332×487 to 1920×1072 pixels. Samples of the images and the annotated masks from this dataset are shown in Figure 4.

FIGURE 4. Sample images with ground truth in the Kvasir-SEG dataset.

2) CVC-ClinicDB

The CVC-ClinicDB dataset is an open-access dataset consisting of 612 images with a resolution of 384×288 pixels extracted from colonoscopy sequences. Samples of the images and the annotated masks from this dataset are shown in Figure 5.

FIGURE 5. Sample images with ground truth masks in the CVC-ClinicDB dataset.

3) BKAI-NeoPolyp

The BKAI-NeoPolyp dataset consists of 1200 polyp images, with 1000 images for training and 200 for testing. Samples of the dataset and the ground truth masks are shown in Figure 6.

FIGURE 6. Sample images with ground truth masks in the BKAI-NeoPolyp dataset.

B. Implementation Details

For a fair comparison, the state-of-the-art benchmark architectures and the proposed ConvSegNet were implemented using the PyTorch framework and trained on an RTX 3090 GPU. To train the model for polyp segmentation, we selected three polyp datasets: Kvasir-SEG, CVC-ClinicDB, and BKAI-NeoPolyp. First, we split each dataset for proper training. For Kvasir-SEG, we followed the official split of 880/120: 880 images and their masks were used for training the model, while the remaining 120 images and their masks were used for testing. For CVC-ClinicDB, we split the dataset in an 80:10:10 ratio, where 80% of images and masks were used for training and the rest were used for validation and testing; for BKAI-NeoPolyp, 1000 images were used for training and 200 for testing. For a fair comparison, we used the same set of hyperparameters to train all the models: the Adam optimizer with a learning rate of 1e-4 (0.0001), a maximum of 200 epochs with an early stopping mechanism to stop training once the model stopped improving, a combination of Dice loss and binary cross-entropy as the loss function, and a batch size of 16.
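The loss and optimizer setup described above can be sketched as follows; the Dice smoothing constant and the equal (unweighted) sum of the Dice and binary cross-entropy terms are assumptions, since the paper only states that a combination of the two losses was used.

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    """Binary cross-entropy plus soft Dice loss (assumed equal weighting)."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits, targets):
        bce = self.bce(logits, targets)
        probs = torch.sigmoid(logits).flatten(1)
        targets = targets.flatten(1)
        inter = (probs * targets).sum(dim=1)
        dice = (2.0 * inter + self.smooth) / (
            probs.sum(dim=1) + targets.sum(dim=1) + self.smooth)
        return bce + (1.0 - dice.mean())

# criterion = DiceBCELoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # batch size 16, up to 200 epochs
```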

C. Performance Metrics

For model evaluation, six standard metrics were used to compare the performance of the proposed ConvSegNet model with the existing state-of-the-art: Jaccard, Dice score, Recall, Precision, Accuracy, and F2-measure.

The Jaccard index (JI, Equation 3) is the ratio of the overlapping area between the predicted and ground truth segmentations to the area of their union, where S denotes a segmentation.
\begin{equation*} Jaccard=\frac {S_{GroundTruth}\cap S_{Automated}}{S_{GroundTruth}\cup S_{Automated}}=\frac {TP}{TP+FP+FN} \tag{3}\end{equation*}
The Dice Score (DSC), also called the F1 measure, measures the overlap between the predicted and ground truth segmentations, as shown in Equation 4.
\begin{equation*} DSC=\frac {2\times TP}{2\times TP+FP+FN} \tag{4}\end{equation*}
As shown in Equation 5, Precision is important in biomedical segmentation because it considers the ratio of correctly predicted disease pixels to the total number of pixels predicted as disease.
\begin{equation*} Precision=\frac {TP}{TP+FP} \tag{5}\end{equation*}
Equation 6 shows the Recall, which considers the ratio of disease pixels in the ground truth that the segmentation model is able to segment correctly.
\begin{equation*} Recall=\frac {TP}{TP+FN} \tag{6}\end{equation*}
As shown in Equation 7, the model’s accuracy considers the percentage of image pixels that are correctly classified.
\begin{equation*} Accuracy=\frac {TP+TN}{TP+TN+FP+FN} \tag{7}\end{equation*}
The F2 measure, as shown in Equation 8, is more focused on the recall than the precision, and it is suitable when it is more important to correctly classify as many positive samples as possible.
\begin{equation*} F2=\frac {5\times Precision\times Recall}{4\times Precision+Recall} \tag{8}\end{equation*}
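For reference, the six metrics in Equations 3-8 can be computed from binarized prediction and ground-truth masks as in the sketch below; the small epsilon guarding against division by zero is an implementation assumption.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-7):
    """Compute Jaccard, Dice, recall, precision, accuracy, and F2 from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "jaccard": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "recall": recall,
        "precision": precision,
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "f2": 5 * precision * recall / (4 * precision + recall + eps),
    }
```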

D. Experiments

This section presents the results of the experiments conducted on the proposed ConvSegNet model and the existing methods. For a fair comparison, we considered six standard state-of-the-art deep learning architectures as benchmark models: U-Net [28], ResU-Net [30], U-Net++ [29], HardDNet-MSEG [50], FANet [42], and UNeXt [43]. We trained all models with the same hyperparameters and hardware. We performed quantitative and generalization experiments to evaluate the performance of the proposed ConvSegNet model against the benchmark architectures.

The quantitative experiments focused on training and testing on the same dataset, while the generalization experiments considered training on one dataset and testing on another. Also, recent state-of-the-art polyp segmentation models were used to evaluate the generalization performance of our ConvSegNet model.

1) Quantitative Results

The results from training and testing on the CVC-ClinicDB dataset for U-Net, ResU-Net, U-Net++, HardDNet-MSEG, FANet, UNeXt, and the proposed ConvSegNet are presented in Table 1. The proposed ConvSegNet model achieved a Jaccard score of 0.8650, which outperformed the U-Net, ResU-Net, U-Net++, HardDNet-MSEG, FANet, and UNeXt architectures, which recorded Jaccard scores of 0.8428, 0.7892, 0.8337, and 0.8388, respectively. Likewise, a much better Dice score of 0.9177 was achieved by the ConvSegNet model.

TABLE 1. Comparison of Our Model With Benchmark Models on CVC-ClinicDB.

A recall of 0.9518 was also achieved by ConvSegNet, which was better than the benchmark architectures, with U-Net++ achieving the closest recall at 0.9129, an improvement of 0.0389 in recall over existing architectures. The highest precision on the CVC-ClinicDB dataset, 0.9216, was achieved by the HardDNet-MSEG benchmark, exceeding the proposed ConvSegNet model by 0.0168. This is because the foreground (positive) regions are similar to the negative regions in the whole colonoscopy images; hence, the benchmark models exhibited asymmetric errors by having more false positives than false negatives. However, a state-of-the-art accuracy and F2-measure of 0.9881 and 0.9328, respectively, were achieved by the proposed ConvSegNet model, which outperformed the U-Net by 0.002 and 0.0347, ResU-Net by 0.0088 and 0.0606, U-Net++ by 0.0022 and 0.0302, and HardDNet-MSEG by 0.001 and 0.039, respectively. A visual representation of the input colonoscopy images and the segmented polyp regions obtained using the benchmark architectures and ConvSegNet on CVC-ClinicDB is presented in Figure 7.

FIGURE 7. Visual representation of the input colonoscopy images and the segmented polyp regions obtained using the benchmark architectures and ConvSegNet on CVC-ClinicDB.

Results on the Kvasir-SEG dataset are presented in Table 2. Similar to the results on the CVC-ClinicDB dataset, the proposed ConvSegNet model achieved a better Jaccard score of 0.7936 compared to the U-Net, ResU-Net, U-Net++, HardDNet-MSEG, FANet, and UNeXt architectures, which recorded 0.7472, 0.6634, 0.7419, 0.7459, 0.6941, and 0.6284, respectively. The Dice score achieved by the ConvSegNet model also outperformed the benchmark architectures by differences of 0.0354, 0.0976, 0.039, 0.0358, 0.0803, and 0.1300, respectively. ConvSegNet also outperformed the benchmarks in recall, with 0.9124 compared to 0.8504 for U-Net, 0.8025 for ResU-Net, 0.8437 for U-Net++, 0.8485 for HardDNet-MSEG, 0.8452 for FANet, and 0.7840 for UNeXt.

TABLE 2. Comparison of Our Model With Benchmark Models on Kvasir-SEG.

Except for the U-Net architecture, which had a precision score of 0.8703 due to the foreground (positive) regions being similar to the negative regions in the colonoscopy images, the ConvSegNet achieved better precision than ResU-Net, U-Net++, and HardDNet-MSEG. ConvSegNet also outperformed the benchmark architectures in terms of Accuracy and F2-measure, with improvements of 0.0107 and 0.0502 over U-Net, 0.0276 and 0.1115 over ResU-Net, 0.0126 and 0.056 over U-Net++, 0.0125 and 0.8855 over HardDNet-MSEG, 0.0397 and 0.0853 over FANet, and 0.0409 and 0.1348 over UNeXt, respectively. A visual representation of the input colonoscopy images and the segmented polyp regions obtained using the benchmark architectures and ConvSegNet on Kvasir-SEG is presented in Figure 8.

FIGURE 8. Visual representation of the input colonoscopy images and the segmented polyp regions obtained using the benchmark architectures and ConvSegNet on Kvasir-SEG.

Table 3 shows the results of the experiment on the BKAI-NeoPolyp dataset. As shown, the ConvSegNet outperformed the benchmarks by achieving a Jaccard of 0.8045, Dice score of 0.8747, recall of 0.9068, accuracy of 0.9922, and F2 of 0.8909, outperforming the U-Net by 0.0446, 0.0461, 0.0773, 0.0019, and 0.0645, respectively. However, the U-Net architecture had the highest precision at 0.8999. ResU-Net had a Jaccard of 0.6589, Dice of 0.7433, recall of 0.7447, precision of 0.871, accuracy of 0.9843, and F2 of 0.7387. U-Net++ recorded a Jaccard of 0.7563, Dice of 0.8275, recall of 0.8388, precision of 0.8942, accuracy of 0.9895, and an F2 measure of 0.8308. HardDNet-MSEG was also outperformed by the ConvSegNet, at 0.6734 Jaccard, 0.8305 Dice, and 0.7528 F2. Also, FANet had 0.7578 Jaccard, 0.8305 Dice, and 0.7528 F2, while UNeXt had 0.4680 Jaccard, 0.5622 Dice, and 0.5692 F2 scores. Visual results of the segmentation performance are shown in Figure 9.

TABLE 3. Comparison of Our Model With Benchmark Models on BKAI-NeoPolyp.
FIGURE 9. Visual representation of the input colonoscopy images and the segmented polyp regions obtained using the benchmark architectures and ConvSegNet on the BKAI-NeoPolyp dataset.

2) Generalization Results

We conducted two experiments to test the generalization ability of the proposed ConvSegNet and compare it against the benchmark architectures. In the first experiment, the whole Kvasir-SEG dataset was used for training, while the models were tested on CVC-ClinicDB. In the second generalization experiment, the whole CVC-ClinicDB dataset was used for training, and Kvasir-SEG was used as the test set. The results of the first experiment are shown in Table 4, while Table 5 shows the results of the second generalization experiment.

TABLE 4. Generalization Comparison of ConvSegNet and Benchmark Architectures (Train: Kvasir-SEG, Test: CVC-ClinicDB).
TABLE 5. Generalization Comparison of ConvSegNet and Benchmark Architectures (Train: CVC-ClinicDB, Test: Kvasir-SEG).

Results of the first generalization experiment, in which the models were trained on the Kvasir-SEG dataset and tested on CVC-ClinicDB, showed that our model outperformed all the benchmark architectures. The Jaccard score of 0.7003 showed that ConvSegNet outperformed the benchmark models by differences of 0.157, 0.2036, 0.1528, 0.0946, 0.1658, and 0.3102 over U-Net, ResU-Net, U-Net++, HardDNet-MSEG, FANet, and UNeXt, respectively. Also, the Dice score of 0.7764 achieved with the proposed model was better than the Dice scores of the other models. Likewise, the recall of 0.8078, precision of 0.8625, accuracy of 0.9685, and F2 measure of 0.7891 achieved with the ConvSegNet model outperformed the other benchmark architectures.

Similarly, as shown in Table 5, the second generalization experiment which trained on CVC-ClinicDB and tested on Kvasir-SEG showed that the proposed ConvSegNet model outperformed the benchmarking architectures. The Jaccard score of 0.6139 achieved by the proposed ConvSegNet is almost a 30% improvement over HardDNet-MSEG, which achieved the best performance of 0.4338 among the benchmarking models. Also, the ConvSegNet’s Dice score, Recall, Precision, Accuracy, and F2 measure were relatively better than the benchmarking architectures.

3) Ablation Studies

Ablation studies were conducted to investigate the effect of the CFR module. Experiments were carried out on a baseline model with the same hyperparameters as the ConvSegNet but without the novel CFR module. The results of our ablation studies are shown in Table 6 and Table 7.

TABLE 6. Ablation Studies on CVC-ClinicDB.
TABLE 7. Ablation Studies on Kvasir-SEG.

As shown in Table 6, the baseline model without the CFR module, trained and tested on the CVC-ClinicDB dataset, achieved a Jaccard of 0.7560, Dice of 0.8463, recall of 0.9586, precision of 0.7662, accuracy of 0.9786, and F2 measure of 0.9074. This result showed that the introduced CFR module allowed more contextual information to be learnt. The results of the ablation studies on the Kvasir-SEG dataset are presented in Table 7.

As shown in Table 7, the CFR module in the ConvSegNet model improved the performance of the baseline model. The baseline model, however, had the same F2 measure as the ConvSegNet model and a higher recall compared to the 0.9124 achieved by the ConvSegNet on the Kvasir-SEG dataset.

4) Computational Evaluation

A comparison of the size of the segmentation models, the number of FLOPs, and the frames per second (FPS) was also done. We evaluated the benchmark models and the proposed ConvSegNet model and present the results in Table 8.

TABLE 8. Computational Comparison of ConvSegNet and Benchmark Models.

As shown in Table 8, the U-Net architecture had 31.04M parameters, HardDNet-MSEG had 33.34M, while ResU-Net and U-Net++ had 8.22M and 9.16M, respectively. Also, FANet and UNeXt had 7.72M and 1.47M model parameters. The 15.58M parameters of the proposed ConvSegNet are minimal compared to the segmentation performance achieved using the model, making the ConvSegNet model less bulky than U-Net and HardDNet-MSEG. The U-Net, ResU-Net, U-Net++, HardDNet-MSEG, FANet, and UNeXt benchmarks had FLOPs of 54.75, 45.72, 34.65, 6.02, 94.75, and 569.56, respectively, while the ConvSegNet had 135.98 FLOPs. Likewise, the U-Net benchmark model ran at 156.83 frames per second, ResU-Net at 196.85, U-Net++ at 126.14, HardDNet-MSEG at 42, FANet at 44, and UNeXt at 88.89 FPS, while ConvSegNet recorded 64 FPS, showing that the ResU-Net model is faster than the other benchmarks and the ConvSegNet model.

E. Discussion

We proposed a novel segmentation model called ConvSegNet for the segmentation of polyps from colonoscopy. The quantitative and generalization experiments showed that the proposed ConvSegNet architecture outperformed the benchmark architectures on which we performed experiments, demonstrating that the proposed ConvSegNet model can extract a broad range of features progressively from input images, which enables more significant features to be captured and makes our network more robust. The generalization experiments also showed that the ConvSegNet model was able to generalize better than the benchmark models, achieving improved performance scores over them.

Ablation studies were also carried out to investigate the effect of the CFR module on the network by excluding the CFR module from the segmentation model; we termed this model the baseline, and the results were presented in Table 6 and Table 7. The baseline model was trained and tested on the same benchmark datasets, and the results showed that the inclusion of the CFR module improved the segmentation performance on the CVC-ClinicDB dataset, with improvements of 12.60% in Jaccard, 7.78% in Dice index, 15.31% in precision, 0.96% in accuracy, and 2.72% in F2 score. On the Kvasir-SEG dataset, improvements of 6.57% in Jaccard, 3.17% in Dice index, 9.44% in precision, and 0.73% in accuracy were recorded. This showed that the model achieved better performance when the CFR module was included in the segmentation network.

The comparison of the computational cost and efficiency of the ConvSegNet model with the benchmarks showed that the ConvSegNet has less computational complexity than two of the benchmark models (U-Net and HardDNet-MSEG). The complexity of ResU-Net, U-Net++, FANet, and UNeXt is far lower than that of the proposed ConvSegNet, but the ConvSegNet model outperformed them in segmentation performance. Also, the processing speed of the proposed ConvSegNet model was relatively low compared to four of the benchmark models. However, the 64 FPS achieved by the ConvSegNet model is standard and outperformed the recent HardDNet-MSEG benchmark, which recorded 42 FPS.

SECTION V.

Conclusion

In this work, the ConvSegNet architecture, based on context feature refinement with multiple kernel sizes, was trained for polyp segmentation from colonoscopy. The novel Context Feature Refinement module is proposed to address low model generalization and segmentation performance. The module is built to extract contextual information from the incoming feature map by applying multiple parallel convolutional layers with different kernel sizes. The outputs of these layers are then concatenated and passed through a 1×1 convolution for feature refinement. In this way, we were able to take advantage of the low-level features to generate high-quality semantic features. By using different kernel sizes, we increased the receptive field during the convolution operation, which helped to better capture contextual features from the input feature map. The proposed ConvSegNet model improved polyp segmentation in terms of quantitative and generalization performance compared to the benchmark models in this study. The method proposed in this work can be further improved in terms of speed, segmentation performance, and robustness. Even though the ConvSegNet model was trained for polyp segmentation, the architecture can easily be extended to other biomedical image segmentation tasks. For future work, we plan to explore transformer models to guide the segmentation model in extracting more contextual information and to explore ways to increase the processing speed of the segmentation model.
