Colorectal Disease Classification Using Efficiently Scaled Dilation in Convolutional Neural Network

Computer-aided diagnosis systems developed by computer vision researchers have helped doctors recognize several endoscopic colorectal diseases more rapidly, which allows appropriate treatment and increases the patient’s survival rate. Herein, we present a robust architecture for endoscopic image classification using efficient dilation in a convolutional neural network (CNN). It has a large receptive field at the deep layers, with the dilation factor first increasing and then decreasing to preserve spatial details. We argue that dimensionality reduction in CNNs can cause the loss of spatial information, resulting in missed polyps and confusion between similar-looking images. Additionally, we use a regularization technique called DropBlock to reduce overfitting and deal with noise and artifacts. We compare and evaluate our method using various metrics: accuracy, recall, precision, and F1-score. Our experiments demonstrate that the proposed method achieves an F1-score of 0.93 on the Colorectal dataset and an F1-score of 0.88 on the KVASIR dataset. Experiments show that the proposed method achieves higher accuracy than traditional methods when classifying endoscopic colon diseases.


I. INTRODUCTION
Colorectal cancer is one of the most common and deadly cancers worldwide. Colon diseases such as adenoma, adenocarcinoma, Crohn's disease, and ulcerative colitis are considered significant factors in the evolution of cancer [1]. According to the American Cancer Society, nearly 80,000 people died in 2016 because of different types of cancers related to the gastrointestinal tract [2]. An accurate diagnosis of each disease is essential for early detection and effective treatment of colorectal cancer, which improves the patient's survival rate. Advances in technology make it possible for artificial intelligence (AI)-based computer vision approaches to assist doctors in colorectal classification tasks. Research on artificial intelligence in medical imaging has been ongoing for decades and has shown its effectiveness in that domain [3], [4]. Other studies include automatic classification of breast cancer [5] and skin cancer [6], detection of gastric cancer [7] and hookworms [8], and recognition of brain tumors [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.
Traditional machine learning focused on handcrafted feature-based methods, which rely on image color, shape, and texture information. These methods followed the same approach which requires a feature extraction process and the use of several classifiers for the classification process. However, feature extraction was difficult because of a lack of illumination, blurring, variations in viewpoint, and even colon insufflation.
In recent years, inspired by the great success of deep learning (DL) in computer vision [11]-[13], interest in applying deep learning to endoscopic image analysis has been increasing. Despite this, obtaining a large amount of balanced data remains challenging in the medical field. Even though transfer learning can be applied to mitigate this problem [14]-[16], it still suffers from some serious issues. One example is the excessive use of downsampling in the deep layers of a pre-trained network, which may work well for natural image classification on datasets such as ImageNet [17] but not in the medical domain, where images exhibit high intra-class variance and low inter-class variance. We argue that the small feature maps at the deeper layers of the network contain abstract information that is not enough to represent endoscopic features, which leads to confusion on small polyps and similar-looking images. We think that larger output feature maps will represent the features more explicitly, which would improve the classification. Another example is the risk of overfitting when feeding a small number of samples to pre-trained CNNs: the network may learn the noise and incidental detail in the data, which can negatively impact the performance on unseen data. On the one hand, achieving high accuracy on similar-looking classes with a limited dataset is difficult but a key demand. On the other hand, a pre-trained network used without effective regularization is very prone to overfitting. This greatly limits the use of traditional CNNs in the medical world. Even a small classification error in endoscopic images can have serious consequences in the medical domain. For example, Crohn's disease and ulcerative colitis share similar features that are characterized by chronic inflammation of the digestive tract, and a spurious prediction of either disease is not acceptable in clinical settings.
It is therefore desired to conceive more effective endoscopic image classification architectures that can effectively recover the fine details in the medical images.
To address the need for more accurate classification in the medical domain, we propose a novel method that increases the receptive field in the deep layers through the effective use of dilated convolution. The underlying assumption of our architecture is that the model can capture more fine details when dilation is used in the last layers, and that performance improves when high-resolution feature maps are passed to the classification layer. The dilation is added in increasing and then decreasing order to avoid gridding artifacts. We deduce that using only increasing dilation factors cannot aggregate important spatial features of small polyps and similar-looking images and is therefore detrimental to images of such classes. In addition, a DropBlock [18] regularization layer is added after each dilated convolution to deal with noise and artifacts and to reduce overfitting. It drops a contiguous region of a feature map together, which forces the network to look elsewhere to fit the data and hence helps regularize the model. Since the dataset suffers from various artifacts such as specular reflection, artificial devices, and motion blur, we think that the DropBlock method can handle such artifacts.
Our paper begins with an introduction and motivation for the proposed approach. Next, we present related studies on the endoscopic classification of colorectal diseases (in Section II) with a summary of our contributions. Section III presents a comprehensive description of our proposed method for the classification of several colorectal diseases. In Section IV, we describe the data collection, the performance metrics, and the network training. Section V presents our experimental results and a performance analysis of our proposed system. Section VI interprets and describes the significance of our findings. Finally, Section VII concludes the paper with a summary of our contributions.

II. RELATED WORK
In this section, we describe various feature extraction and classification methods, including handcrafted feature-based methods and deep learning-based algorithms that have been proposed to classify endoscopic images of colorectal diseases.

A. COLONIC POLYPS
Colonic polyps are considered a major precursor of colon cancer. In an early study, Häfner et al. [19] introduced texture analysis methods based on local fractal dimension (LFD) for the classification of colonic polyps. They proposed three LFD-based approaches that additionally extracted shape and gradient information of the image to improve classification; these methods were tested on different datasets. Next, a filter bank-based texture analysis method was proposed for the classification of colonic polyps [20]. Different types of polyps were differentiated using the filter masks of the filter bank. Häfner et al. proposed a novel color texture operator that was based on a noise-robust local binary pattern variant for automatic classification of endoscopic images [21]. They quantified the similarity of neighboring pixels by constructing a color vector field from an image and used a k-nearest neighbors classifier for classification. Wimmer et al. [22] tested several wavelet-based approaches on 11 endoscopic polyp databases, proposed three wavelet-based feature extraction approaches, and found them acceptable for automatic classification of colonic polyps. Tamaki et al. [23] proposed a local feature-based recognition system: a bag-of-visual-words representation of local features followed by a support vector machine (SVM) classifier. The authors of [24] integrated a Gabor filter and a monogenic local binary pattern to generate a new feature that represented shape and edge information at multiple resolutions while preserving color information. Linear discriminant analysis was then used to reduce the feature dimensions, and an SVM was used as the classifier. Stehle et al. [25] proposed a classification algorithm for colonic polyps: they implemented two segmentation algorithms, and the obtained features were used to classify the polyps.
Recently, CNNs have been used instead of handcrafted features for automatic feature extraction and classification [37]. Pogorelov et al. [15] combined deep neural networks, information retrieval, and analysis of global and local image features for multiclass classification, detection, and localization of various gastrointestinal diseases. Zhang et al. [16] proposed a transfer learning approach by using the features learned from non-medical datasets using deep CNN; subsequently, they used low-level features to detect and localize colorectal polyps. Shin and Balasingham [36] have shown that CNN outperformed handcrafted feature-based methods after comparing them on three public polyp databases. Nadeem et al. [32] integrated texture and deep learning features for the classification of gastrointestinal diseases. Urban et al. [33] designed a deep CNN to detect polyps and evaluated the results with an expert colonoscopist. All polyps identified in the expert review were also detected by their proposed method.
Wimmer et al. [31] applied three pre-trained CNN architectures to endoscopic image databases, and SVM was subsequently used to classify colonic polyps and celiac disease. They concatenated and combined the features from several layers and experimented with classification. Their approach outperformed other CNN-based approaches.

B. CROHN'S DISEASE AND ULCERATIVE COLITIS
Mahapatra et al. [38] proposed a supervised learning approach for automatic identification and localization of the regions affected by Crohn's disease in abdominal magnetic resonance volumes. They used intensity statistics, texture anisotropy, and shape asymmetry of the 3D regions as features to distinguish between normal and affected regions. In [26], D. Mahapatra et al. performed similar tasks with the use of low-level features such as intensity and texture. Wei et al. [27] used a visual codebook to accurately detect colitis on contrast-enhanced computed tomography scans. Ahmed et al. [28] defined a neuro-fuzzy-based approach that combined a backpropagation neural network-fuzzy classifier with a neuro-fuzzy model to diagnose Crohn's disease. They used factor analysis as a dimensionality reduction technique and performed experiments on different levels of the fuzzy partition.
Mossotto et al. [29] proposed three unsupervised ML models that used endoscopic data only, histological data only, and combined endoscopic/histological data, achieving an accuracy of 71, 76.9, and 82.7 percent, respectively. Han et al. [39] developed a novel pathway-based approach that used genes to calculate individualized pathway scores for the classification of ulcerative colitis and Crohn's disease.
Pogorelov et al. [40] presented a dataset named "KVASIR," where different diseases were classified using global features, deep CNNs, and deep transfer learning. Alammari et al. [30] proposed an approach that used endoscopic domain knowledge and a deep CNN to classify the severity of ulcerative colitis. Stidham et al. [34] found that the accuracy of a deep CNN was comparable to experienced human reviewers for the classification of endoscopic severity of ulcerative colitis. Ozawa et al. [35] showed the robustness of a GoogLeNet CNN architecture based on a computer-aided diagnosis (CAD) system for identifying endoscopic inflammation severity in ulcerative colitis. Maeda et al. [41] developed a CAD system for predicting persistent histologic inflammation associated with ulcerative colitis. Table 1 summarizes the problems in existing classification approaches. Previous methods have at least one of the following weaknesses:

C. LIMITATIONS OF RELATED WORK
• Dependence on a fixed set of handcrafted features, which requires deep knowledge of the image characteristics [19], [21]-[27]. These methods rely on texture analysis, where a limited set of local descriptors computed from an image is fed into a classifier such as an SVM or a random forest. Despite a good level of accuracy in some works, these techniques generalize and transfer poorly under inter-dataset variability.
• Reliance on both endoscopic and histological data, which limits the practical utility of these algorithms [29], [41] since histological data might not be available in all scenarios.
• Lack of insight into the features the network learned during the training process [15], [30], [31], [33]-[36], [40].
Contrary to previous works, our approach does not rely on handcrafted features or histological data but uses a deep neural network that combines efficient dilation with an effective regularization method for endoscopic image classification. Further, evaluating the proposed method on a diverse and large number of classes and images makes the model more reliable. In this paper, we present the following major contributions:
1) We increase the receptive field in the deep layers of the network through the efficient use of dilated convolution to preserve the spatial information. We apply the dilation factor in increasing and then decreasing order to aggregate the spatial details of tiny features such as polyps.
2) We further validate the use of a regularization technique called DropBlock to avoid overfitting and to handle noise and several artifacts such as specular reflection, artificial devices, and motion blur.
3) Finally, we evaluate our proposed deep neural network on our colorectal dataset, which includes five classes, and we additionally evaluate it on another endoscopic dataset, KVASIR [40]. We show that our approach is promising for endoscopic image classification.
VOLUME 8, 2020
Figure caption: Overview of the original ResNet50 architecture [42]. At stage 1, the feature map is downsampled by a convolutional layer with strides = 2, which is followed by batch normalization and a ReLU layer. Within each stage, the number of filters used by the layers is the same. Each stage has a convolutional (Conv) block and an identity (Id) block. The identity block contains three sets of a convolutional layer, and the convolutional block has one extra layer to match the input and output dimensions.

III. METHODOLOGY
Our CNN model for extracting features from endoscopic images is based on the transfer learning approach: rather than training a new CNN architecture, we reuse a pre-trained network. It is widely known that the features extracted from the activations of a CNN trained in a fully supervised manner for a large-scale object recognition task can be repurposed for a novel generic task. Moreover, our training set contains only a few hundred images, which is insufficient for state-of-the-art CNN architectures that have millions of parameters to train. Tajbakhsh et al. [14] demonstrated that a pre-trained CNN with adequate fine-tuning outperformed or, in the worst case, performed similarly to a CNN trained from scratch. Therefore, we fine-tuned the last layer on the endoscopic dataset: each model is initialized with ImageNet pre-trained weights, and the last fully connected layer is updated during training. We use the ResNet50 architecture as a baseline model. In extensive experiments, we found that ResNet50 achieved better performance than other existing CNN architectures [12], [43]-[46] on the colorectal dataset. Fig. 2 shows a basic residual network block with the inclusion of the identity connection. The residual block learns the following function:
$$H(x) = F(x) + x$$
where $F(x)$ is represented by the stacked non-linear layers and $x$ is the identity mapping. The ResNet50 architecture consists of five stages of blocks, in which small chunks of the network are connected through skip (shortcut) connections to form an extensive network. Two main types of blocks are used, depending on whether the input and output dimensions are the same or different. When the input activation has the same size as the output activation, the block is formulated as:
$$y = F(x, W_i) + x$$
where $x$ and $y$ indicate the input and output vectors of the layers considered. The function $F(x, W_i)$ represents the residual mapping to be learned. Fig. 1 shows an example of an identity block, where the upper path is the shortcut path and the lower path is the main path. Similarly, when the input and output dimensions do not match, we add a convolutional layer to the shortcut path by using the following formula:
$$y = F(x, W_i) + W_s x$$
where $W_s$ denotes the convolutional layer on the shortcut path. Usually, each identity block contains three sets of convolutional layers followed by batch normalization and the ReLU activation function. Similarly, the convolutional block includes the same number of layers with one extra convolutional layer added.
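The two shortcut variants described above can be illustrated with a minimal, framework-free sketch. It treats activations as plain vectors, and the names `identity_block`, `projection_block`, and `w_s` are hypothetical choices for this illustration, not identifiers from the authors' implementation. The residual function is passed in as an arbitrary callable standing in for the stacked layers.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def identity_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Identity shortcut: y = F(x) + x (input and output sizes must match)."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut needs matching dimensions")
    return [a + b for a, b in zip(fx, x)]


def projection_block(x: Vector,
                     residual_fn: Callable[[Vector], Vector],
                     w_s: Matrix) -> Vector:
    """Projection shortcut: y = F(x) + W_s x, used when dimensions differ."""
    fx = residual_fn(x)
    shortcut = matvec(w_s, x)  # W_s maps x to the output dimension
    if len(fx) != len(shortcut):
        raise ValueError("projection must match the residual output size")
    return [a + b for a, b in zip(fx, shortcut)]
```

In a real network the shortcut projection would be a strided 1×1 convolution rather than a dense matrix; the sketch only shows why a projection is needed whenever the residual branch changes the output size.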
Based on our assumption, we need to keep the network from down-sampling and preserve complex spatial information up to the last layers. We achieve this by removing down-sampling and introducing dilation. We adopt a network architecture [47] that was designed for ImageNet classification and make several modifications to fit the network to our purposes.

A. APPROACH
In this section, we describe how our proposed model learns and represents the endoscopic features of colon diseases effectively. For this, we use dilated convolutions with different atrous rates in the final layers. The proposed architecture consists of five groups of layers with the convolution and identity blocks described in Section III. Let $G^i$ be a group of layers, where $i = 1, \ldots, 5$. We denote the $j$th layer in group $i$ as $G^i_j$. Let $f^i_j$ represent the filter associated with layer $G^i_j$. The output of $G^i_j$ in the original model is
$$(G^i_j * f^i_j)(p) = \sum_{a+b=p} G^i_j(a)\, f^i_j(b)$$
where $p$ ranges over the positions of the feature map. We use dilated convolutions in the final two groups of convolutional layers. For Stage 4, we replace the convolution operators by dilated convolutions with an atrous rate of 2 for all layers of the block:
$$(G^4_j *_2 f^4_j)(p) = \sum_{a+2b=p} G^4_j(a)\, f^4_j(b)$$
for all $j > 0$. In the first layer of the block in Stage 5, which is $G^5_1$, we perform the same transformation:
$$(G^5_1 *_2 f^5_1)(p) = \sum_{a+2b=p} G^5_1(a)\, f^5_1(b)$$
By analogy, we use a dilation factor of 4 in the remaining blocks of $G^5_j$:
$$(G^5_j *_4 f^5_j)(p) = \sum_{a+4b=p} G^5_j(a)\, f^5_j(b)$$
for all $1 < j < 4$. Similarly, we use a dilation factor of 2 in the fourth block of Stage 5:
$$(G^5_j *_2 f^5_j)(p) = \sum_{a+2b=p} G^5_j(a)\, f^5_j(b)$$
for $3 < j < 5$. Then, a non-residual block with ordinary convolution is added at the end, followed by the global average pooling layer (as in the original architecture), which reduces the output feature maps to a vector, and a 1×1 convolution maps this vector to a vector that contains the prediction scores for all classes. The overall proposed and modified architecture is illustrated in Fig. 3, and pseudocode is shown in Algorithm 1.
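To make the effect of the atrous rate concrete, the following is a minimal one-dimensional sketch of a dilated convolution in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: it uses the cross-correlation form that deep learning frameworks apply, no padding, and the hypothetical function names `dilated_conv1d` and `effective_kernel_size`.

```python
from typing import List


def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return rate * (kernel_size - 1) + 1


def dilated_conv1d(signal: List[float], kernel: List[float], rate: int = 1) -> List[float]:
    """Valid (unpadded) dilated cross-correlation.

    out[p] = sum_b signal[p + rate * b] * kernel[b]
    With rate = 1 this reduces to an ordinary convolution layer.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    span = effective_kernel_size(len(kernel), rate)
    return [
        sum(signal[p + rate * b] * kernel[b] for b in range(len(kernel)))
        for p in range(len(signal) - span + 1)
    ]
```

A 3-tap kernel covers 3 input samples at rate 1, 5 at rate 2, and 9 at rate 4, so the receptive field grows without adding weights and without reducing the resolution of the feature map, which is the property the proposed architecture relies on.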
The layer-wise details of the architecture are explained in Fig. 4, which lists each layer of both the original and the proposed architecture in sequential order. Our proposed model consists of fifty-seven layers: fifty-six convolutional layers, each followed by batch normalization that normalizes the feature map and by an activation function called the rectified linear unit (ReLU). Because only the convolutions in the later layers of the network are dilated, the shape and structure of the earlier layers are the same in both models. For both models, the first convolutional layer generates a feature map of size 112 × 112 × 64 after applying 64 different filters of size 7 × 7 × 3 over the input image of size 224 × 224. Then, a max-pooling layer is used, which processes the input feature map by applying a filter of size 3 × 3 pixels to generate a feature map of size 56 × 56 × 64. For the original model, downsampling is accomplished by the first 1 × 1 convolution layer with a stride of 2 in the layers of Stage 3, Stage 4, and Stage 5. However, in our proposed model, we set the stride to 1 and replace the 3 × 3 convolution with a 3 × 3 dilated convolution after Stage 3. We gradually increase and then decrease the dilation rate in the convolutional layers of Stage 4 and Stage 5 and remove the residual connection at the final two layers. Then, the final feature vector of size 1 × 1 × 2048 is generated by applying the global average pooling layer. The two non-residual blocks with decreasing dilation are added in the proposed method to overcome the problem of gridding artifacts. Gridding artifacts occur when a feature map has higher-frequency content than the sampling rate of the dilated convolution. With these blocks, the model does not propagate gridding artifacts from the earlier layers. The converted network generates an output of size 28 × 28 after the G 5 layers, which gives the global average pooling layer more values to aggregate. This helps the classifier recognize features that cover only a tiny part of the given image.
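The feature-map sizes quoted above can be traced with a few lines of arithmetic. The sketch below assumes "same"-style padding, so that a layer with stride s maps a side length n to ceil(n / s); the stride lists and function names are illustrative assumptions derived from the description in the text, not taken from the authors' code.

```python
import math
from typing import List


def same_padded_size(size: int, stride: int) -> int:
    """Output side length of a 'same'-padded layer with the given stride."""
    return math.ceil(size / stride)


def trace_sizes(input_size: int, strides: List[int]) -> List[int]:
    """Return the side length after the input and after each strided step."""
    sizes = [input_size]
    for stride in strides:
        sizes.append(same_padded_size(sizes[-1], stride))
    return sizes


# Steps: first convolution, max-pooling, Stage 2, Stage 3, Stage 4, Stage 5.
ORIGINAL_STRIDES = [2, 2, 1, 2, 2, 2]   # original ResNet50 keeps downsampling
PROPOSED_STRIDES = [2, 2, 1, 2, 1, 1]   # stride 1 (with dilation) after Stage 3

original = trace_sizes(224, ORIGINAL_STRIDES)   # ends at 7 x 7
proposed = trace_sizes(224, PROPOSED_STRIDES)   # ends at 28 x 28
```

Under these assumptions, global average pooling aggregates 7 × 7 = 49 positions per channel in the original model and 28 × 28 = 784 positions in the proposed one.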
Moreover, adding more non-residual blocks at the end of the network increases the network size, which can cause overfitting or cause training to get stuck in poor local minima, and our limited dataset aggravates this risk. Also, the presence of heavy background noise and artifacts in the endoscopic images is one of the challenges encountered during classification. To overcome these problems, we use the DropBlock [18] method, which is beneficial for regularizing convolutional networks. It drops a contiguous region of a feature map together, unlike Dropout [48], which drops features independently at random. We applied DropBlock in all blocks of Stages 4 and 5 after each convolutional layer. The pseudocode is given in Algorithm 2. DropBlock has two main parameters, the block size $u$ and $\gamma$. The block size $u$ is the side length of the contiguous region to be dropped, while $\gamma$ controls how many units to drop. We use a fixed block size of $u = 7$ (a 7 × 7 region) after the convolutional layer. We compute $\gamma$ with the following formula:
$$\gamma = \frac{1 - \text{keep\_prob}}{u^{2}} \cdot \frac{v^{2}}{(v - u + 1)^{2}}$$
where keep_prob is the probability of keeping each activation unit, as in dropout. We sample the initial binary mask from a Bernoulli distribution with a mean of 1 − keep_prob. $v$ is the size of the feature map, and $(v - u + 1)^{2}$ is the size of the valid seed region. In our experiments, we use keep_prob = 0.9 in all layers and compute the value of $\gamma$ accordingly.
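As a rough illustration of the DropBlock parameters described above, the sketch below computes γ and builds a drop mask in plain Python. It follows the general procedure of the DropBlock paper [18], in which block centres are sampled with rate γ inside the valid seed region; this is an assumption about the details, and the function names are hypothetical rather than the authors' Algorithm 2.

```python
import random
from typing import List


def dropblock_gamma(keep_prob: float, block_size: int, feat_size: int) -> float:
    """gamma = (1 - keep_prob) / u^2 * v^2 / (v - u + 1)^2."""
    valid_seed_area = (feat_size - block_size + 1) ** 2
    return (1.0 - keep_prob) / block_size ** 2 * feat_size ** 2 / valid_seed_area


def dropblock_mask(feat_size: int, block_size: int, gamma: float,
                   rng: random.Random) -> List[List[int]]:
    """Return a feat_size x feat_size mask: 1 keeps a unit, 0 drops it."""
    half = block_size // 2
    tail = block_size - 1 - half          # cells of the block after its centre
    mask = [[1] * feat_size for _ in range(feat_size)]
    # Centres are restricted so that every block lies fully inside the map.
    for r in range(half, feat_size - tail):
        for c in range(half, feat_size - tail):
            if rng.random() < gamma:      # this position becomes a block centre
                for rr in range(r - half, r - half + block_size):
                    for cc in range(c - half, c - half + block_size):
                        mask[rr][cc] = 0
    return mask
```

With keep_prob = 0.9 and u = 7 as stated in the text, and assuming a 28 × 28 feature map (v = 28) in Stages 4 and 5, this gives γ ≈ 0.0033. In training, the masked activations would additionally be rescaled by the ratio of total to kept units so that their expected magnitude is preserved.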

IV. EXPERIMENTAL PROTOCOL
A. DATA COLLECTION
1) COLORECTAL DATA
The dataset was provided by Gill Hospital, South Korea, and it contains 3,515 endoscopic images in five classes: 634 with adenocarcinoma, 775 with adenoma, 563 with Crohn's disease, 773 with ulcerative colitis, and 770 normal images. The original image sizes range from 400×400 to 2000×2000 pixels. Therefore, the images are resized according to the requirements of each CNN architecture. The image data is normalized with the default properties required for each architecture. Because of the small amount of available data, we perform data augmentation to increase the number of images before training the networks. The classes are originally imbalanced, so the minority classes are augmented more heavily to obtain a balanced dataset. This is a standard way to reduce overfitting during training. Several augmentation techniques such as flipping, scaling, rotating, zooming, contrast normalizing, and shearing were used. Each image was first rotated at different angles, and each rotated image was then flipped (horizontally and vertically) and zoomed. Before augmentation, we split the total dataset and separated 80 images for validation and testing purposes; the remaining images were used for training. The details of the colorectal dataset are presented in Table 2.
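The idea of augmenting minority classes more heavily can be sketched as follows. The class counts come from the text (before the validation/test split), while the per-class target of 3,000 images and the function name `augmentation_factors` are purely hypothetical values chosen for illustration; the paper does not state the target it used.

```python
import math
from typing import Dict

# Image counts per class as reported in the text (before splitting).
CLASS_COUNTS: Dict[str, int] = {
    "adenocarcinoma": 634,
    "adenoma": 775,
    "crohns_disease": 563,
    "ulcerative_colitis": 773,
    "normal": 770,
}


def augmentation_factors(counts: Dict[str, int], target_per_class: int) -> Dict[str, int]:
    """Augmented variants to generate per original image in each class.

    Each class ends up with at least `target_per_class` images when the
    originals are kept alongside their augmented variants.
    """
    return {
        name: max(0, math.ceil(target_per_class / n) - 1)
        for name, n in counts.items()
    }


# Hypothetical target, not taken from the paper.
factors = augmentation_factors(CLASS_COUNTS, target_per_class=3000)
```

The smallest class (Crohn's disease) receives the most variants per image, which is the balancing behaviour the paragraph describes.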

2) KVASIR DATASET
The KVASIR dataset includes 4,000 endoscopic images of gastrointestinal diseases in eight different classes, each containing 500 images. The dataset consists of several sets of images in each category, including anatomical landmarks (such as Z-line, pylorus, or cecum) and pathological findings (such as esophagitis, polyps, or ulcerative colitis). Some sets are related to the removal of lesions, including dyed and lifted polyps and dyed resection margins. Images with different resolutions from 720 × 576 to 1920 × 1072 pixels are included in the dataset. Pogorelov et al. [40] performed a baseline evaluation of this dataset with three different approaches: classification using global features, deep convolutional neural networks, and deep transfer learning. We split the dataset in a 50:50 ratio to make a fair comparison with the original paper, and we compare the results of our proposed method with these existing approaches.

B. PERFORMANCE METRICS
We use performance evaluation metrics such as accuracy (ACC), recall, precision, and F1-score to evaluate classifiers, computed as follows:
$$\mathrm{ACC} = \frac{P_T + N_T}{P_T + N_T + P_F + N_F}$$
$$\mathrm{Precision} = \frac{P_T}{P_T + P_F}$$
$$\mathrm{Recall} = \frac{P_T}{P_T + N_F}$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where ACC and F1 are accuracy and F1-score, respectively; $P_T$ and $N_T$ are the number of true positives and true negatives, respectively; $P_F$ and $N_F$ are the number of false positives and false negatives, respectively. Specifically, ACC is the proportion of correctly classified samples. Precision is the proportion of samples predicted as positive that are truly positive. Recall is the proportion of actual positives that are correctly classified. The F1-score is the harmonic mean of precision and recall.
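The four metrics can be computed directly from the confusion counts, as in the short sketch below. The function names are illustrative, and returning 0.0 for an undefined ratio is a convention assumed here rather than something specified in the paper.

```python
from typing import Dict


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }
```

As a consistency check on the reported numbers, a recall of 92.8% and a precision of 93.2% give an F1-score of about 0.93, which matches the value stated for the colorectal dataset.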

C. NETWORK TRAINING
The implementation is based on Keras with a TensorFlow backend. The training set is used to train the model and learn the parameters. The validation set is used to tune the model and monitor it during training: to automatically adjust the learning rate and to decide whether to stop early according to the performance at a given training step. The test set is used to evaluate the recognition and generalization ability of the proposed model. We initialize the network with the pre-trained weights of ResNet50 and use stochastic gradient descent with a batch size of 16. The learning rate starts from 0.001 and is divided by 10 when the patience level exceeds 8. We use a weight decay of 0.0001 and a momentum of 0.9 without an accelerated gradient.
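The optimization settings above can be sketched without any framework. The following is one possible reading of the description, not the authors' training code: the schedule monitors a validation loss (an assumption, since the monitored quantity is not named), divides the learning rate by 10 once the number of non-improving checks exceeds 8, and the update applies L2 weight decay with classical (non-Nesterov) momentum. The class and function names are hypothetical.

```python
from typing import Tuple


class PlateauSchedule:
    """Divide the learning rate by 10 after more than `patience` stale checks."""

    def __init__(self, lr: float = 0.001, factor: float = 0.1, patience: int = 8):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss: float) -> float:
        """Record one validation result and return the learning rate to use."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait > self.patience:   # "patience level exceeds 8"
                self.lr *= self.factor
                self.wait = 0
        return self.lr


def sgd_momentum_step(weight: float, grad: float, velocity: float, lr: float,
                      momentum: float = 0.9,
                      weight_decay: float = 1e-4) -> Tuple[float, float]:
    """One SGD update with L2 weight decay and classical momentum."""
    grad = grad + weight_decay * weight          # add the weight-decay term
    velocity = momentum * velocity - lr * grad   # no Nesterov look-ahead
    return weight + velocity, velocity
```

In Keras, comparable behaviour is usually obtained with the `ReduceLROnPlateau` callback and the `SGD` optimizer, although the callback's exact patience semantics may differ by one check from the reading used in this sketch.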

V. EXPERIMENTAL RESULTS AND EVALUATION
A. PERFORMANCE EVALUATION OF COLORECTAL DATASET
In this section, we compare the performance of the proposed method with existing related methods that were used in endoscopic image classification. Because few works address colorectal diseases using deep learning, we compare the proposed work with the methods used for other similar tasks. Table 3 presents the results obtained when different fine-tuned CNNs were trained on the colorectal dataset for classification. The experiments are performed with the same parameters and the same number of augmented and validation sets. All the architectures achieved similar results, but the best results are obtained by ResNet50. Table 4 shows the performance of each method as measured by the F1-score.
Our model outperforms the existing methods, achieving an F1-score of 0.93. The method of Zhang et al. [16] achieves good accuracy on the normal class, which shows its capability to discriminate between the normal and disease classes, but it fails to achieve a similar result on similar-looking disease classes. The method of Shin and Balasingham [36], with three layers, achieves an F1-score of 0.836, indicating that it is not deep enough to learn the complex patterns of the images. Meanwhile, the method of Stidham et al. [34], with 159 layers, achieves an F1-score of 0.89, which shows its powerful discriminative capability, but its result is similar to those of the other existing CNN methods.

B. PERFORMANCE EVALUATION OF KVASIR DATASET
The results obtained are presented in Table 5 with the precision, recall, F1-score, and accuracy measures estimated from the TP, FP, TN, and FN cases. We compare our proposed method with the existing approaches. It can be noticed that the proposed methodology achieved 95.7% accuracy on this dataset with an F1-score of 0.88, which is slightly better than the 2-layer CNN and 3% better than the Inception model using TFL.
TABLE 6. Ablation studies for the dilated rate at stage 4 and stage 5. We evaluated the proposed method on the given values.

C. ABLATION STUDY
We investigate the effectiveness of our contributions by comparing our full model with the baselines under the same experimental setting. For each ablation experiment, we omit different dilation rates at the convolutional layers of the Stage 4 and Stage 5 blocks. The results obtained are displayed in Table 6. We report the F1-score, precision, and recall for each tested configuration.
From the results reported in Table 6, we can draw the following conclusions:
1) Simply adding dilated convolution at the end layers does not improve the classification performance; instead, it worsens it. Similarly, the addition of a convolutional block with a dilation rate of 2 at stage 4 does not make any difference. Further, the F1-score drops from 0.91 to 0.89 and 0.90 when significant changes are made in the dilation rate (4th and 6th rows) at stage 4 and stage 5, showing that the network suffers from the gridding artifact problem.
2) The F1-score of the network at the complementary branch (5th and 7th rows) is higher than that of the corresponding network in the 4th and 6th rows, demonstrating the effectiveness of using DropBlock as a regularization method.
3) The network with increasing and decreasing dilation rates at the end layers improves the F1-score of the baseline network from 0.91 to 0.92. Further, adding the DropBlock regularization method improves the network performance from 0.92 to 0.93. This shows that our approach is better at identifying small polyps and recognizing similar-looking images.
Table 4 shows that our proposed deep model achieved the best classification performance on the provided dataset. The results of this study show that removing downsampling and preserving features at the last blocks of the CNN increases performance, with a 92.8% recall rate and a 93.2% precision rate. The results in Table 3 indicate that applying transfer learning to medical datasets is not always beneficial. Because a standard pre-trained CNN uses progressive down-sampling, it is not helpful for datasets with high inter-class similarity and intra-class variation. Fig. 5 shows examples of endoscopic images from different classes on which the best-performing model, ResNet50, fails. When the polyps in the endoscopic colon images are very small, the network cannot classify them because of the downsampling used in existing CNNs. Fig. 5a shows that adenoma is misclassified as normal because of the loss of spatial information caused by the continuous reduction in resolution, which leaves tiny feature maps of size 7 × 7 at the end.
Similarly, as the learned features are more class-specific at deep layers, features shared by images of different classes can make the classification process more difficult. For example, some adenomas, which are polyps, tend to progress towards adenocarcinoma, and they might share a similar shape with continuous inflammation. Fig. 5b shows that the model confuses adenocarcinoma with polyps. In the last two sets of images in Fig. 5, the model detects only some patterns of each class, and the network uses such features during the classification process. In contrast, our method preserves information up to the end layers. This confirms that deep models with effective use of dilated convolutional layers at the end have an advantage in classifying endoscopic images over fine-tuning a state-of-the-art CNN architecture and over several other methodologies.

VI. DISCUSSION
One benefit of the proposed method is that it copes with the noise and artifacts present in the image. Fig. 6 shows the significance of using the DropBlock regularization method with the dilated convolution. The probability score increases from 56% to 88%, indicating that the proposed method focuses on and covers more specific and essential regions and is less sensitive to noise and artifacts.
Our method showed a powerful ability to extract useful features from the endoscopic images. We observed the features learned by the CNNs at the last layers using the class activation map approach [49]. When the images are hard to distinguish, the other methods produced a large fluctuation in accuracy rates, whereas the proposed method achieved better results for similar-looking images. Additionally, Table 5 shows that the proposed method achieved a high F1-score of 0.88 with a 92% recall rate on the KVASIR dataset, which indicates a high capacity for recognizing the disease classes. Our proposed convolutional neural network is more accurate and stable than other popular traditional and deep models for endoscopic image classification.

VII. CONCLUSION
In this paper, we investigated the use of deep learning for colorectal endoscopic image classification. We showed that the features represented by the layers before the global average pooling are insufficient because of excessive downsampling, which causes loss of spatial information. We applied an efficient technique to preserve the spatial information in the final layers: specifically, using dilated convolutional layers in increasing and then decreasing order. In addition, using the DropBlock regularization method at the deeper stages helped the network attend to specific regions with less sensitivity to noise and artifacts. We observed an improvement in classification, which indicates that the proposed model captured more detailed and subtle differences between similar-looking images. Finally, with extensive experiments and comparisons on the KVASIR dataset, we demonstrated that our proposed deep convolutional neural network has superior performance in endoscopic image classification. In our future work, we will apply our neural network architecture to other endoscopic image classification problems. We also plan to extend our approach by using earlier feature layers and deep features with dilated convolution to tackle image classification problems in other domains.