
GIFCOS-DT: One Stage Detection of Gastrointestinal Tract Lesions From Endoscopic Images With Distance Transform



Abstract:

This study aims at developing a computer-aided diagnostic system based on deep learning techniques for detecting various typical lesions during endoscopic examinations in the human gastrointestinal tract. We propose a lesion detection model, named GIFCOS-DT, that is built upon a one-stage backbone for object detection (Fully Convolutional One-Stage Object Detection - FCOS). To deal with the diverse shapes and appearance of the lesions, we introduce a new loss function based on the Distance Transform that better describes elongated or curved lesions than common loss functions such as Intersection over Union or centerness loss. We then deploy the detection model on an embedded device that connects to the endoscopic machine to assist endoscopists during examinations. A multithreading technique is employed to accelerate the processing times of all steps of the system. Extensive experiments have been conducted on two challenging datasets, the benchmark dataset (Kvasir-SEG) and our newly collected dataset (IGH_GIEndoLesion-SEG), which include various typical lesions of the gastrointestinal (GI) tract (reflux esophagitis, esophageal cancer, Helicobacter pylori-negative gastritis, Helicobacter pylori-positive gastritis, gastric cancer, duodenal ulcer, and colorectal polyps). Experimental results show that our proposed methods outperform the original FCOS by 4.2% and 7.2% on Kvasir-SEG and our collected dataset respectively in terms of the average AP_{50} score. On the Kvasir-SEG dataset, GIFCOS-DT outperforms state-of-the-art detectors such as Faster R-CNN, DETR, YOLOv3, and YOLOv4. Our developed supporting system for lesion detection can run at 14.85 FPS on an embedded Jetson AGX Xavier or 31.92 FPS on an RTX 3090. The detection results for various types of lesions are promising, especially for malignant lesions such as gastric cancers. The proposed system can be deployed as an assistant tool in endoscopy to reduce missed detection of lesions.
Published in: IEEE Access ( Volume: 12)
Page(s): 163698 - 163714
Date of Publication: 05 November 2024
Electronic ISSN: 2169-3536


SECTION I.

Introduction

Automated detection of lesions in gastrointestinal endoscopy images is crucial in assisting doctors in the diagnosis of gastrointestinal (GI) tract-related diseases. Automating lesion detection can help improve the accuracy and efficiency of the diagnostic process. However, multiple factors contribute to the complexity of identifying and categorizing lesions, including the difficulty of distinguishing lesions with similar characteristics and issues with image contrast, clarity, and artifacts (e.g., bubbles, fluids, blood, lens defects).

Recent object detection methods have been widely built upon convolutional neural networks [1], [2], [3], [4], [5], [6]. Most of them rely on loss functions such as Intersection over Union (IoU), Cross-Entropy (CE), and Centerness loss for optimizing the models’ weights. However, these methods are primarily designed for objects with well-defined shapes within rectangular bounding boxes. Digestive tract lesions, in contrast, present a unique challenge due to their diverse shapes and irregularities. Unlike conventional objects, these lesions can vary significantly in shape and appearance, exhibiting traits such as elongation, concavity, or irregular edges. As a result, the application of standard loss functions proves inadequate in accurately detecting and characterizing such lesions.

To address these issues, we propose a method that builds upon the FCOS (Fully Convolutional One-Stage Object Detection) architecture, known for its anchor-free approach to object detection [7]. FCOS has demonstrated superior performance compared to many other detectors, including both two-stage detectors (e.g., Faster R-CNN [1]) and one-stage detectors (such as variants of YOLO [2], SSD [3], DSSD [4], RetinaNet [5], and CornerNet [6]), particularly on the COCO large-scale detection benchmark. However, the original FCOS employs a loss function comprised of focal loss, IoU loss, and centerness loss, which are optimized for detecting compact objects, a characteristic not commonly observed in gastrointestinal lesions. To overcome this limitation, we introduce a novel loss function utilizing Distance Transform, tailored specifically to detect gastrointestinal lesions.

Our key contributions can be summarized as follows:

  • We propose a new detection model, namely GIFCOS-DT (GastroIntestinal Fully Convolutional One-Stage Object Detection using Distance Transform), that extends the existing FCOS model with a novel loss function, the distance transform loss;

  • We introduce a new benchmark dataset, named IGH_GIEndoLesion-SEG, containing six typical lesion categories of the upper gastrointestinal tract for the evaluation of lesion detection from endoscopic images;

  • We develop an end-to-end assisting system that directly connects the endoscopic machine and the detection module. To enhance practicality, we deploy buffering techniques to accelerate computational processing time on edge devices.

  • We validate our proposed detection model on two benchmark datasets with various lesion categories. Additionally, we deploy our supporting system on various platforms for comparisons and providing recommendations for real-world implementation.

The remaining sections of this paper are organized as follows: Section II briefly reviews related works on existing methods for lesion detection from endoscopic images. The proposed framework for gastrointestinal tract disease detection using distance transform is described in Section III. Section IV reports experimental results on one public dataset for polyp detection (Kvasir-SEG) and one self-collected dataset of six lesions (IGH_GIEndoLesion-SEG). Finally, further discussions and conclusions are presented in Section V.

SECTION II.

Related Works

Lesion detection from endoscopic images has become a highly attractive topic in recent years, with significant efforts directed toward developing effective algorithms and computer-aided systems. Initially, research efforts concentrated on simple identifiable characteristics of lesions such as color and structure, employing models that learn from hand-crafted features. In recent years, however, methods based on Convolutional Neural Networks (CNNs) have garnered significant research interest due to their accuracy and versatility. In this section, we survey existing techniques for detecting lesions of the gastrointestinal tract using deep-learning models. We then summarize some Computer-Aided Diagnosis (CAD) systems for lesion detection using artificial intelligence.

A. Lesion Detection from Endoscopy Images

In the realm of early gastric cancer detection, several pioneering endeavors have significantly contributed to the advancement of lesion detection methods in gastrointestinal endoscopy. Hirasawa et al. devised a method employing a Single Shot MultiBox Detector (SSD) to automate the detection of early gastric cancer lesions and delineate the extent of invasion [8]. The SSD has been extensively applied in gastric cancer detection [8], [9], as well as in identifying erosions and ulcerations [10]. Additionally, Khryashchev et al. conducted a comparative study on SSD and RetinaNet for analyzing pathologies in endoscopic images of the stomach [11].

Similarly, Sakai et al. proposed an approach leveraging CNNs to discern gastric cancer regions from normal areas by analyzing finely cut patches of endoscopic images [12].

Shibata et al. introduced a method harnessing the capabilities of Mask R-CNN, designed for both object detection and segmentation, to detect the presence of early gastric cancer and extract invasive regions [13]. Furthermore, Teramoto et al. proposed a sophisticated U-Net R-CNN model, employing two CNNs for segmentation and classification tasks [14]. Initially, the U-Net model was employed to delineate regions indicative of early gastric cancer, followed by classification utilizing a separate CNN model.

Ghatwary et al. introduced a novel 3D Sequential DenseConvLstm network for extracting spatiotemporal features from input videos [15]. Their model combined 3D Convolutional Neural Network (3D CNN) and Convolutional LSTM (Long Short-Term Memory) to effectively capture both short and long-term spatiotemporal patterns [15]. The resulting feature map is employed by a region proposal network and ROI (Region of Interest) pooling layer to generate bounding boxes identifying abnormal regions in each frame of the video. Additionally, they investigated a post-processing technique called Frame Search Conditional Random Field (FS-CRF) to enhance model performance by recovering missing regions in neighboring frames within the same video clip.

Gao et al. employed YOLOv5 for the detection of colorectal lesions [16]. Teramoto et al. proposed a cascade model comprising two stages for gastric cancer detection and characterization [17]. The initial stage employs a diverse set of image classification deep models, such as VGG-16, InceptionV3, ResNet, and DenseNet, followed by a segmentation model (U-Net). Ahmad et al. introduced an automated approach that enhances the YOLOv7 object detection algorithm through the incorporation of an attention block for gastric lesion detection [18]. Among the three attention mechanisms tested (Squeeze and Excitation (SE), Convolutional Block Attention Module (CBAM), and Global Local Attention Mechanism (GLAM)), YOLOv7 enhanced by SE achieved the highest accuracy across four categories: gastric cancer, ulcers, adenomas, and healthy tissues. In another study [19], Xiao et al. proposed a Deep Convolutional Generative Adversarial Network (DCGAN) architecture to augment a dataset obtained from wireless capsule endoscopy. They subsequently employed three deep models, namely SSD, YOLOv4, and YOLOv5, to detect four categories: ulcer, polyp, fresh blood, and erosion.

Overall, these pioneering works demonstrate the significant impact of automated methodologies in the field of early gastric cancer detection, presenting promising approaches for enhancing diagnostic capabilities. The majority of recent methods for detecting lesions from endoscopic images rely on conventional CNN-based object detectors. These methods primarily focus on identifying lesions with convex shapes, such as gastric cancer, ulcers, adenomas, and polyps. Detecting other lesions, such as esophagitis and duodenal ulcers, poses a greater challenge and is often overlooked. Nevertheless, early detection holds promise in facilitating treatment and mitigating the severity of these diseases. This paper addresses various types of lesions and aims to improve detection accuracy and efficiency in practical deployment scenarios.

B. Computer-Aided Diagnosis Systems for Lesion Detection

With the rapid advance of artificial intelligence (AI), some AI-based systems are now commercially available for lesion detection in the gastrointestinal tract. In the field of AI-assisted colonoscopy, several systems have been introduced in the past five years. Notable examples include EndoBRAIN by Cybernet Systems Corporation (Tokyo, Japan) in 2018 [20] and GI Genius by Medtronic (Dublin, Ireland) in 2019 [21]. In 2020, many other AI systems appeared, such as EndoBRAIN-EYE, DISCOVERY by Pentax Medical Company (Tokyo, Japan), ENDO-AID by Olympus Corporation (Tokyo, Japan), CAD EYE by Fujifilm (Tokyo, Japan), and Wise Vision by NEC Corporation (Tokyo, Japan) [22]. Wise Vision has an image analysis terminal to display the results of polyp detection. It runs on an NVIDIA Quadro RTX 5000 with a Blackmagic Design DeckLink Mini Recorder, and the endoscopic device is connected to the image terminal through an HD-SDI or 3G-SDI cable. In 2021, EndoScreener was introduced by Wision A.I. (Shanghai, China) [23]. The core module in EndoScreener is SegNet, which is integrated with Olympus 190-series high-definition white-light colonoscopes. The detection results are displayed on either a dual-monitor or a single-monitor setup. The latency is approximately 46.56 ± 2.79 ms, but the specific configuration of the computer on which it runs is not disclosed. In [24], a real-time AI system is developed for the detection of cancer in Barrett’s esophagus. The system captures images from the live camera stream and provides both a global prediction (classification) and a dense prediction (segmentation). The core modules, ResNet for classification and DeepLab v3+ for segmentation, are deployed on a desktop with two NVIDIA Titan X graphics processing units. He et al. presented a real-time application of a deep model for early gastric cancer detection [25].

In summary, integrating AI models into real commercial products is essential. Real-time experiments have demonstrated that AI can significantly aid in preventing oversights by endoscopists. However, most systems currently target a single type of lesion, and evaluations across different devices are often overlooked. This paper introduces a model that addresses multiple lesion types and proves its feasibility for deployment on various devices, thereby setting the stage for broader future implementation.

SECTION III.

Unified Framework for Lesion Detection

A. General Framework

Our proposed framework for GI lesion detection is illustrated in Figure 1. It comprises two main stages: i) the model development (training and evaluation) stage and ii) the deployment stage. In the first stage, the proposed model is constructed with the new loss function and then trained and validated. In the second stage, a computer-aided system is deployed on a dedicated computing device in a clinical scenario. The system is connected directly to an endoscopic machine to capture endoscopic images, detect lesions in those images, and display the detection results to endoscopy doctors through a Graphical User Interface (GUI). In the following sections, we will present in detail our detection model and the deployment of the model on edge devices in a practical application.

FIGURE 1. Our proposed framework for lesion detection.

B. Image Representation Based on Distance Transform

Note that in the training and evaluation stage, each sample in the dataset contains a pair consisting of an original RGB image I and a ground-truth mask G, a binary image representing the lesions and the background. The binary mask is manually annotated by expert doctors, and such masks are widely provided in many datasets for detection and segmentation tasks. As mentioned earlier, the lesions naturally exhibit complex appearance and often non-convex shapes. When rectangular bounding boxes are used as ground truth to enclose these lesions, a significant portion of background pixels may also be included within the boxes. This can mislead the learning algorithm and result in incorrect detections, and the center of the bounding box may be very far from the real center of the lesion region.

To tackle this issue, we first transform the binary image using the distance transform [26]. The distance transform is an operator that assigns to each point its distance to the closest boundary. It thus produces an intensity image in which the largest value corresponds to the pixel farthest from its nearest boundary. As mentioned previously, G is the binary mask such that G(x,y) = 1 for lesion pixels and G(x,y) = 0 for background pixels. The distance transform map D(x,y) is computed from G as follows:\begin{equation*} D(x,y) = \min _{(x', y') \in \mathcal {B}}\sqrt {(x-x')^{2} + (y-y')^{2}} \tag {1}\end{equation*}

where \mathcal {B} is the lesion boundary and (x', y') are the coordinates of the boundary pixel nearest to the pixel (x, y) . The higher the value of D(x,y) , the farther the point is from the boundary. Figure 2 illustrates six examples, showing the original image I (left), the ground-truth mask G (middle) of the lesion provided by manual annotation of expert doctors, and the corresponding distance transform map D (right), arranged row by row. Points in the central area of the map D are the brightest because they are farthest from the boundaries. This figure also illustrates variations in the shapes of lesions, with many regions exhibiting crescent moon-like shapes (as in the 4th and 5th images). The final example (6th image) includes two adjacent lesions, resulting in overlapping rectangular bounding boxes. This overlap can pose challenges for detection algorithms.
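As a concrete illustration of eq. (1), the ground-truth map D can be computed from the binary mask G with OpenCV's distance transform. The snippet below is a minimal sketch, not the authors' exact preprocessing code; the normalization to [0, 1] follows the description of the distance transform loss in Section III-D.

```python
import cv2
import numpy as np

def distance_transform_map(mask: np.ndarray) -> np.ndarray:
    """Compute a normalized distance transform map D from a binary lesion mask G.

    mask: array with non-zero values for lesion pixels and 0 for background.
    Returns a float32 map in [0, 1]; lesion pixels far from the boundary get the
    highest values, background pixels stay at 0.
    """
    binary = (mask > 0).astype(np.uint8)
    # Euclidean distance from each non-zero pixel to the nearest zero pixel,
    # which coincides with the distance to the lesion boundary.
    dist = cv2.distanceTransform(binary, distanceType=cv2.DIST_L2, maskSize=5)
    if dist.max() > 0:
        dist = dist / dist.max()  # normalize to [0, 1] as used by the DT loss
    return dist.astype(np.float32)
```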

FIGURE 2. Illustration of original images, annotated masks (ground truth) and their corresponding distance transform. We notice the variation in the shapes of the lesions, which are commonly non-convex.

C. Architecture of the Proposed Detection Model

Fully Convolutional One-Stage Object Detection (FCOS) is a well-known anchor-free detection model in the literature [7]. A major advantage of FCOS is that it can predict a box for each object candidate from all ground-truth pixels using a center-ness loss function. However, FCOS has a limitation: it primarily focuses on the center of the bounding box. To address this, we introduce a new loss function that pays attention to all points in the central area of the lesion, ensuring a fair focus on all important points. We refer to our new architecture as GIFCOS-DT, where GI and DT stand for GastroIntestinal and Distance Transform, respectively.

The architecture of GIFCOS-DT is illustrated in Figure 3. It is composed of four main components.

  • Firstly, each RGB image I \in \mathbb {R}^{W\times H \times N} is manually annotated by doctors to generate a binary mask G \in \mathbb {R}^{W\times H \times 1} , which is then converted to a gray-scale map D \in \mathbb {R}^{W\times H \times 1} by the Distance Transform operator explained in Section III-B (the yellow block in Figure 3). W, H, and N represent the width, height, and number of channels of the original image I, respectively.

  • Secondly, the RGB image I passes through an encoder to generate three feature maps F_{1}, F_{2}, F_{3} . In this paper, we utilize ResNet-50 [27] because it offers a good trade-off between accuracy and computation time. As the input image goes through the layers of the backbone (the black block in Figure 3), the resolution of the output feature maps is progressively reduced.

  • Thirdly, as we detect lesions of different sizes, we generate different levels of feature maps. As in [7], we use five levels of feature maps P_{1}, P_{2}, P_{3}, P_{4}, P_{5} , where P_{1}, P_{2}, P_{3} are produced from F_{1}, F_{2}, F_{3} by 1\times 1 convolutional layers with top-down connections. P_{4} and P_{5} are generated by applying a convolutional layer with stride 2 on P_{3} and P_{4} , respectively. The features P_{i} are organized in a pyramid that feeds the heads (the green block in Figure 3).

  • Lastly, the features P_{i} of the Feature Pyramid Network (FPN) [28] go through five heads, each responsible for predicting the object class at a certain scale, the important pixels within the object's bounding box, and the four regression values of the bounding box. Each FPN output thus has a head that produces these three predictions about the lesions in the image.

FIGURE 3. Architecture of our proposed GIFCOS-DT.

The output of the backbone usually has a small spatial size, posing a challenge for detecting small objects. Therefore, right after the backbone, the model employs the Feature Pyramid Network (FPN) to address this issue. The FPN combines backbone features extracted at various layers in a bottom-up pathway with a top-down pathway. In the bottom-up pathway, the spatial resolution decreases while the semantic information increases; the top-down pathway increases the layer sizes again to facilitate the detection of small objects. These layers are connected to the corresponding bottom-up layers through lateral connections to preserve the extracted semantic information. FPN outperforms single-scale architectures because it maintains strong features at different scales.
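The pyramid construction described above can be sketched in PyTorch as follows. The channel widths, the ordering of F_1, F_2, F_3 (F_1 assumed to be the finest), and the use of nearest-neighbor upsampling are illustrative assumptions rather than the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Builds P1..P5 from backbone maps F1, F2, F3 (F1 assumed finest, F3 coarsest).
    Channel counts match a ResNet-50 backbone; other choices are illustrative."""

    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions on F1, F2, F3
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # stride-2 convolutions producing the two extra levels P4 and P5
        self.p4_conv = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.p5_conv = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, f1, f2, f3):
        l1, l2, l3 = (lat(f) for lat, f in zip(self.lateral, (f1, f2, f3)))
        # Top-down pathway: upsample coarser maps and merge them with finer ones
        p3 = l3
        p2 = l2 + F.interpolate(p3, size=l2.shape[-2:], mode="nearest")
        p1 = l1 + F.interpolate(p2, size=l1.shape[-2:], mode="nearest")
        # Extra pyramid levels: P4 from P3, P5 from P4 (stride-2 convolutions)
        p4 = self.p4_conv(p3)
        p5 = self.p5_conv(p4)
        return p1, p2, p3, p4, p5
```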

The internal architecture of a Head can be seen in detail in Figure 3. The Head is shared across levels and has three outputs (a minimal sketch is given after this list):

  • Classification output {\mathcal {O}}_{c} \in \mathbb {R}^{H \times W \times C} with H \times W being the size of the input image and C being the number of classes. This output represents the prediction of whether there is an object of a given class or not in a region of interest.

  • Distance-transform output {\mathcal {O}}_{d} \in \mathbb {R}^{H \times W \times 1} contains the value of each pixel, where pixels closer to the boundary will have lower values and pixels closer to the center will have higher values.

  • Regression output {\mathcal {O}}_{b} \in \mathbb {R}^{H \times W \times 4} with the same resolution as the input image; each position holds four values l^{*}, t^{*}, r^{*}, b^{*} corresponding to the distances from that position to the left, top, right, and bottom edges of the bounding box.
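The following is a minimal PyTorch sketch of one such head with its three branches. The tower depth (four convolutions), the channel width (256), and the activations applied to the distance-transform and regression outputs are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GIFCOSDTHead(nn.Module):
    """One detection head with classification, distance-transform, and box-regression
    branches. Channel width (256) and tower depth (4 convs) are illustrative."""

    def __init__(self, in_channels=256, num_classes=7):
        super().__init__()
        def tower():
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)  # O_c
        self.dt_pred = nn.Conv2d(in_channels, 1, 3, padding=1)               # O_d (single layer, parallel to classification)
        self.bbox_pred = nn.Conv2d(in_channels, 4, 3, padding=1)             # O_b: (l*, t*, r*, b*)

    def forward(self, feature):
        cls_feat = self.cls_tower(feature)
        reg_feat = self.reg_tower(feature)
        cls_out = self.cls_logits(cls_feat)             # class scores per location
        dt_out = torch.sigmoid(self.dt_pred(cls_feat))  # distance-transform map in [0, 1]
        reg_out = torch.relu(self.bbox_pred(reg_feat))  # non-negative box distances
        return cls_out, dt_out, reg_out
```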

D. Loss Function

Regarding the loss function, instead of using the original losses {\mathcal {L}}_{\text {cls}} and {\mathcal {L}}_{\text {reg}} as in the FCOS model, we introduce a new distance loss function {\mathcal {L}}_{\text {DT}} . The overall new loss function is constructed from three loss functions {\mathcal {L}}_{\text {cls}} , {\mathcal {L}}_{\text {DT}} , and {\mathcal {L}}_{\text {reg}} corresponding to the three branches: Classification, Distance Transform, and Regression in Figure 3.

1) Classification Loss

In the object classification branch, the Focal loss [5] function is utilized for training. Focal loss is built on the Cross-Entropy loss function, the difference being that it reduces the emphasis on samples that the network has already learned well and pays more attention to hard-to-learn samples. We calculate the Focal loss using formula (2):\begin{equation*} {\mathcal {L}}_{\text {cls}}(p_{x,y}, c_{x, y}) = -\alpha _{t}(1 - c_{x,y})^{\gamma }\log (c_{x,y}) \tag {2}\end{equation*}

where \alpha _{t} is a coefficient that balances the proportion of bounding boxes between classes; c_{x,y} = {\mathcal {O}}_{c}(x, y) is the predicted probability for the true class when p_{x,y} = 1 and c_{x,y} = 1- {\mathcal {O}}_{c}(x, y) when p_{x,y} = 0 ; and \gamma is a coefficient that focuses training on hard-to-distinguish samples. As \gamma increases, the relative contribution of the term (1 - c_{x,y})^{\gamma } \log (c_{x,y}) for hard samples grows, so the model learns more from the difficult samples.
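The per-location focal loss of eq. (2) can be written as a short function. This is a generic sketch of focal loss using the values α_t = 0.25 and γ = 2.0 stated in Section IV-B, not the exact training code.

```python
import torch

def focal_loss(pred_logits: torch.Tensor, targets: torch.Tensor,
               alpha_t: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss following eq. (2).

    pred_logits: raw classification outputs O_c of shape (N, C, H, W).
    targets:     ground-truth labels p (0/1) of the same shape, 1 for the true class.
    """
    prob = torch.sigmoid(pred_logits)
    # c_{x,y}: probability assigned to the correct outcome at each location
    c = prob * targets + (1.0 - prob) * (1.0 - targets)
    # alpha_t for positive locations, (1 - alpha_t) for negative ones
    alpha = alpha_t * targets + (1.0 - alpha_t) * (1.0 - targets)
    loss = -alpha * (1.0 - c) ** gamma * torch.log(c.clamp(min=1e-6))
    return loss.sum()
```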

2) Distance Transform Loss

In Section III-B, we presented how the original image I is paired with its mask G and the ground-truth distance transform map D. In each of the five heads, we added a single-layer branch in parallel with the classification branch to predict the distance transform map {\mathcal {O}}_{d} of the original image I. The values of both D and {\mathcal {O}}_{d} are normalized to [0, 1]. The distance transform branch is trained with the distance transform loss {\mathcal {L}}_{DT} , defined as the Cross-Entropy loss [29] between the ground-truth map D and the predicted map {\mathcal {O}}_{d} , as in (3). This loss is incorporated into the overall loss function in eq. (5).\begin{equation*} {\mathcal {L}}_{\text {DT}}(D({x,y}),{\mathcal {O}}_{d}{(x,y)}) = - D({x,y}) \cdot \log ({\mathcal {O}}_{d}({x,y})) \tag {3}\end{equation*}

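A minimal sketch of eq. (3), assuming both maps are already normalized to [0, 1]; the summation over positions and the division by N_pos happen later in eq. (5).

```python
import torch

def distance_transform_loss(d_gt: torch.Tensor, d_pred: torch.Tensor) -> torch.Tensor:
    """Cross-entropy-style loss between the ground-truth distance map D and the
    predicted map O_d (both in [0, 1]), following eq. (3); summed over positions."""
    eps = 1e-6
    return (-d_gt * torch.log(d_pred.clamp(min=eps))).sum()
```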

3) Regression Loss

In the third branch, the model predicts the bounding box regression using the IoU loss function [30]. The output of this branch is {\mathcal {O}}_{b} \in \mathbb {R}^{H \times W \times 4} , with the same resolution as the input; at each position, the network regresses a 4D vector \hat {\mathbf {bb}}_{x,y} = (l^{*}, t^{*}, r^{*}, b^{*}) , whose values correspond to the distances from that position to the left, top, right, and bottom edges of the predicted bounding box \hat {\mathbf {bb}}_{x,y} . These four parameters are converted back into box coordinates and compared with the ground-truth bounding box \mathbf {bb}_{x,y} . The IoU loss is defined as in (4):\begin{align*} {\mathcal {L}}_{\text {reg}}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y}) = -\ln \left ({{ \frac {\text {Intersection}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y})}{\text {Union}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y})} }}\right ) \tag {4}\end{align*}

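Because both boxes are encoded as distances (l, t, r, b) from the same location, the IoU in eq. (4) can be computed directly from those distances, as in the standard FCOS-style formulation. The sketch below is illustrative rather than the authors' code.

```python
import torch

def iou_loss(pred_ltrb: torch.Tensor, target_ltrb: torch.Tensor) -> torch.Tensor:
    """Negative log-IoU loss for boxes encoded as distances (l, t, r, b) from a
    location, as in eq. (4). Both tensors have shape (N, 4)."""
    pl, pt, pr, pb = pred_ltrb.unbind(dim=-1)
    tl, tt, tr, tb = target_ltrb.unbind(dim=-1)

    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)

    # Overlap of two boxes that share the same reference location
    inter_w = torch.min(pl, tl) + torch.min(pr, tr)
    inter_h = torch.min(pt, tt) + torch.min(pb, tb)
    intersection = inter_w * inter_h
    union = pred_area + target_area - intersection

    iou = intersection / union.clamp(min=1e-6)
    return -torch.log(iou.clamp(min=1e-6)).sum()
```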

4) Loss Function of GIFCOS-DT

Finally, the loss function of the proposed model is the combination of the three loss functions {\mathcal {L}}_{\text {cls}} , {\mathcal {L}}_{\text {DT}} , and {\mathcal {L}}_{\text {reg}} as (5):\begin{align*} & {\mathcal {L}}_{\text {GIFCOS-DT}}(\{\mathbf {p}_{x,y}\}, \{D(x,y)\}, \{\mathbf {bb}_{x,y}\}) \\ & = \frac {\alpha }{N_{\text {pos}}} \Big (\sum _{x,y} {\mathcal {L}}_{\text {cls}}(\mathbf {p}_{x,y}, c_{x,y}) + \sum _{x,y} \mathbb {1}_{\{c_{x,y} > 0\}} {\mathcal {L}}_{\text {reg}}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y})\Big ) \\ & \quad + \frac {1-\alpha }{N_{\text {pos}}} \sum _{x,y} {\mathcal {L}}_{\text {DT}}(D(x,y), {\mathcal {O}}_{d}(x,y)) \tag {5}\end{align*}

where \alpha is a balancing coefficient among the loss values and N_{\text {pos}} is the total number of pixels in the image, N_{\text {pos}} = H \times W . For each position in the image, we compute the three loss values {\mathcal {L}}_{\text {cls}} , {\mathcal {L}}_{\text {DT}} , and {\mathcal {L}}_{\text {reg}} . \mathbb {1}_{\{c_{x, y} > 0\}} is an indicator function, being 1 if c_{x, y} > 0 and 0 otherwise.

In the original paper [7], {\mathcal {L}}_{cls} and {\mathcal {L}}_{reg} were used as the two main components of the loss function. The authors then introduced a center-ness loss {\mathcal {L}}_{cnt} , a Cross-Entropy loss designed to account for the distance between the predicted centers and the true object centers. Thus, FCOS was trained with the total loss function {\mathcal {L}}_{FCOS-CNT} = {\mathcal {L}}_{cls} + {\mathcal {L}}_{cnt} + {\mathcal {L}}_{reg} . In our work, as mentioned above, we replace the center regressor with a distance transform regressor and introduce the new distance transform loss function. In the experimental section, we compare the performance of our proposed model, GIFCOS-DT trained with {\mathcal {L}}_{GIFCOS-DT} , against that of FCOS trained with {\mathcal {L}}_{FCOS-CNT} .
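For completeness, eq. (5) can be assembled from the three terms sketched above. The tensor layouts and the positive-location mask are assumptions for this sketch; focal_loss, iou_loss, and distance_transform_loss are the illustrative helpers defined earlier, with α = 0.5 by default and N_pos = H × W as stated in the text.

```python
import torch

def gifcos_dt_loss(cls_logits, cls_targets, reg_pred, reg_targets,
                   dt_pred, dt_gt, alpha: float = 0.5) -> torch.Tensor:
    """Combined loss of eq. (5): classification and regression terms weighted by
    alpha, the distance-transform term weighted by (1 - alpha), normalized by N_pos.

    cls_logits, cls_targets: (N, C, H, W); reg_pred, reg_targets: (N, 4, H, W);
    dt_pred, dt_gt: (N, 1, H, W). Layouts are assumptions for this sketch.
    """
    n_pos = dt_gt.shape[-2] * dt_gt.shape[-1]   # N_pos = H * W

    l_cls = focal_loss(cls_logits, cls_targets)                           # eq. (2)
    # Regression is applied only at positive locations (indicator 1_{c > 0})
    pos = (cls_targets.sum(dim=1) > 0).reshape(-1)
    l_reg = iou_loss(reg_pred.permute(0, 2, 3, 1).reshape(-1, 4)[pos],
                     reg_targets.permute(0, 2, 3, 1).reshape(-1, 4)[pos])  # eq. (4)
    l_dt = distance_transform_loss(dt_gt, dt_pred)                         # eq. (3)

    return (alpha / n_pos) * (l_cls + l_reg) + ((1.0 - alpha) / n_pos) * l_dt
```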

SECTION IV.

Experiments

A. Datasets

Our proposed method is evaluated on two challenging datasets: the Kvasir-SEG dataset and the IGH_GIEndoLesion-SEG dataset. The Kvasir-SEG dataset is commonly used to evaluate the detection and segmentation of polyps in endoscopic images [31]. In addition, we collected a new dataset with six typical lesion categories of the upper gastrointestinal tract. In the following, we briefly describe Kvasir-SEG and detail our process for collecting the new dataset, which we name the IGH_GIEndoLesion-SEG dataset.

Both datasets present many challenges in lesion detection, such as the high number of lesions, as well as lesion regions that are complex and difficult to distinguish due to diverse structures, colors, and sizes. Additionally, many lesion areas contain bubbles, blood, and bright glare, further complicating detection.

1) Kvasir-SEG Dataset

The main purpose of the Kvasir-SEG dataset [31] is to facilitate research and development of advanced methods for polyp segmentation, detection, localization, and classification. The dataset contains 1000 polyp images and their corresponding ground truth from Kvasir Dataset v2. The resolution of the images varies from 332 \times 487 to 1920 \times 1072 pixels. The images and their corresponding masks are stored in two separate folders with the same filename, and image files are encoded using JPEG compression. The images in this dataset are normalized to a resolution of 1280 \times 995 in our experiments.

2) IGH_GIEndoLesion-SEG Dataset

To test the model's detection of various diseases of the gastrointestinal tract, we collected a new dataset containing endoscopic images of gastrointestinal tracts at several hospitals in Hanoi City, Vietnam, from 2022 to 2024. All images were captured using Fujifilm high-resolution endoscopy systems, including the EPX-3500HD, LASEREO, and 7000 systems. These systems provide four different light modes: White Light Imaging (WLI), Flexible Spectral Imaging Color Enhancement (FICE), Blue Light Imaging (BLI), and Linked Color Imaging (LCI), which enhance lesion detection and characterization. To ensure quality, the collected images must clearly depict the lesions, be free from excessive foam or mucus, lack blurring, and have good contrast. For images of esophageal and gastric cancers, the endoscopists must review the histopathological results to confirm the diagnosis.

To enhance diversity, the collected lesions must include multiple subtypes that vary in characteristics and severity. These subtypes align with international classifications commonly used in endoscopy assessments. For example, reflux esophagitis has five subtypes based on the Los Angeles classification [32]. Gastritis has six subtypes according to the Sydney classification [33] and the Kyoto classification [34], which include raised erosions, flat erosions, blood streaks, redness streaks, atrophy, and nodularity. Duodenal ulcers are classified into six subtypes based on the Sakita and Fukutomi classification [35]. Both esophageal cancer and gastric cancer have six subtypes based on the classifications by the Japanese Gastric Cancer Association [36], [37]. Additionally, images were collected using various light modes (WLI, FICE, BLI, LCI) to capture different aspects of the lesions. To ensure quality, endoscopists reviewed the collected images, confirming their diversity and discarding any that were of poor quality or overly similar. To protect patient privacy, identifying information (e.g., name, age) was removed before labeling and using the images for training, either by cropping or masking the images. Our data collection protocol received approval from the Scientific Committee of Vietnam's Ministry of Science and Technology.

Our medical and technical experts closely cooperated to create an online platform for manual labeling and delineation of lesions. Only endoscopy images without patient information were uploaded to this platform. Figure 4 illustrates our labeling process: the doctor labeled not only the lesions on the image but also the light mode and the specific location. Doctors also delineated the lesions and labeled each annotation with the lesion subtype according to the international classifications. The labeling and delineations were validated by experts with over 5 years of experience in the field. The final dataset comprises 5211 pairs of original images and expert-annotated ground-truth binary masks from 2543 patients. The resolution of the images is 1280\times 995 .

FIGURE 4. Illustration of Graphical User Interface for labeling the endoscopic images.

Table 1 describes the variation of lesions and light modes in the IGH_GIEndoLesion-SEG dataset. Table 2 summarizes the number of images for each lesion category in both experimental datasets (Kvasir-SEG and our collected IGH_GIEndoLesion-SEG). Figure 5 illustrates original images and bounding boxes of several lesion categories in the IGH_GIEndoLesion-SEG and Kvasir-SEG datasets, respectively.

TABLE 1. The number of images for each type of lesion and light mode in the IGH_GIEndoLesion-SEG dataset.
TABLE 2. Summary of the number of samples for each type of lesion in our dataset IGH_GIEndoLesion-SEG and Kvasir-SEG dataset.
FIGURE 5. Illustration of the original images with overlaid annotated lesion bounding boxes in two datasets. It is noted that the lesions in our IGH_GIEndoLesion-SEG dataset (two rows above) are more challenging than polyps in the Kvasir-SEG dataset (last row).

3) Data Splitting and Evaluation Metrics

We conducted three experiments to evaluate the performance of the proposed model. The first two experiments evaluate the proposed method on two separate datasets (the Kvasir-SEG dataset and the IGH_GIEndoLesion-SEG dataset), while the last experiment evaluates it on a new dataset that mixes the two aforementioned datasets. We split the samples of each lesion category in the experimental datasets into three separate parts, namely the train, validation, and test sets, with a ratio of 7:1:2. Since we are addressing object detection rather than segmentation, we need to derive a bounding box for each segmented region to serve as ground truth for training and testing our detection models; Figure 5 illustrates the resulting rectangular bounding boxes. To assess the performance of our proposed method, we employ standard metrics commonly used in object detection, including Area Under the Curve (AUC) and Average Precision (AP). For object detection, a true positive is confirmed if the Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box is higher than a specific threshold. We compute AP_{25}, AP_{50}, AP_{75} , which represent the Average Precision at 25%, 50% and 75% IoU thresholds.
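For clarity, the true-positive criterion above amounts to a corner-format IoU check against the chosen threshold (0.25, 0.50, or 0.75). The helper below is a minimal sketch of that check, not the evaluation code used in the paper.

```python
def box_iou(box_a, box_b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, iou_threshold: float = 0.5) -> bool:
    """A prediction counts as a true positive if its IoU with the ground truth
    exceeds the chosen threshold (0.25, 0.50, or 0.75 for AP_25/AP_50/AP_75)."""
    return box_iou(pred_box, gt_box) >= iou_threshold
```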

B. Implementation Details

We configured the network with an input image size of 1280\times 995 and used ResNet50-FPN as the backbone. The model was designed to identify the following classes (diseases): Reflux Esophagitis, Esophageal Cancer, H.P-Negative Gastritis, H.P-Positive Gastritis, Gastric Cancer, Duodenal Ulcer, and Polyp. The model was trained from scratch using the SGD optimizer with a batch size of 8 and a learning rate of 0.01 for 150 epochs. In our experiments, the coefficient \alpha _{t} in eq. (2) is set to 0.25 and \gamma is set to 2.0. In eq. (5), the coefficient \alpha is set to 0.5 by default; an ablation study with different values of this parameter is reported in Section IV-C4.
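As a reference, these settings correspond roughly to the following PyTorch configuration. The stand-in model and the omitted data-loading code are placeholders, and any hyper-parameter not listed above (e.g., momentum or weight decay) is deliberately left out because it is not reported.

```python
import torch
import torch.nn as nn

# Stand-in for the GIFCOS-DT detector (ResNet50-FPN backbone in the paper);
# a small convolution is used here only so the snippet runs on its own.
model = nn.Conv2d(3, 7, kernel_size=3, padding=1)

# Hyper-parameters reported in Section IV-B
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch_size = 8
num_epochs = 150

# Training loop skeleton (data loading and loss computation omitted):
# for epoch in range(num_epochs):
#     for images, targets in train_loader:
#         loss = gifcos_dt_loss(...)   # eq. (5)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```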

C. Experimental Results

1) Results on Kvasir-SEG Dataset

Table 3 shows the performance of our proposed method GIFCOS-DT compared with state-of-the-art detectors such as Faster R-CNN, YOLOv3+spp, YOLOv4, DETR [38] and the FCOS. The results obtained by state-of-the-art (SOTA) detectors are reported in [39] while we re-implemented and tested DETR ourselves. In terms of accuracy metrics, GIFCOS-DT consistently outperforms FCOS, with the highest improvement seen in AP_{75} , which is more than 10.5% higher, and the average AP is also higher by 5.9%. Furthermore, GIFCOS-DT achieves the highest accuracy in terms of AP_{25} , with AP_{50} and average AP ranking second, slightly behind the top position by 0.2% and 3%, respectively.

TABLE 3. Comparison of our proposed method, GIFCOS-DT, with the existing state-of-the-art detection models. The best scores are highlighted in bold while the second-best scores are underscored.

The GIFCOS-DT model with a ResNet50 backbone focuses its detections on the central regions of lesions; additionally, larger lesions may be split into smaller regions. Moreover, the DarkNet53 backbone provides better performance than ResNet50 at higher IoU thresholds, such as 0.75. Therefore, in terms of AP_{75} , the results of GIFCOS-DT with a ResNet50 backbone are lower than those of the YOLOv3+spp and YOLOv4 models with DarkNet53.

Figure 6 illustrates two detection results generated by FCOS and GIFCOS-DT for a single polyp case in the Kvasir-SEG dataset. The first row shows the original image and the binary mask of the polyp manually delineated by doctors. In shape, the polyp appears as a large round mass with some smaller round ones attached to its edges. Based on this segmentation, the bounding box is determined as a white rectangle overlaid on the original image in the bottom row. The middle row displays the maps obtained by the distance transform in GIFCOS-DT and by the center-ness transform in FCOS, respectively. We observe that FCOS with center-ness loss considers only the center of the entire region, while GIFCOS-DT takes the centers of each component region into account. As a result, GIFCOS-DT generates three candidate regions (one corresponding to the biggest polyp and two others corresponding to the small attached polyps). These results appear more reasonable than those generated by FCOS, where the two detected regions overlap significantly. It is also noted in this example that FCOS detected two candidates (two yellow bounding boxes), where the smaller one lies completely inside the bigger one; according to the evaluation metric computation, FCOS therefore generated one false positive. Our model generated three candidates (green bounding boxes), so according to the evaluation metric computation, GIFCOS-DT produced two false positives (the two smaller bounding boxes). However, these two smaller bounding boxes correspond to the two smaller components of the polyp, so if the doctor had annotated them separately, the detected bounding boxes would become true positives. As a result, with our method the detection is more practical and the obtained accuracy is improved.

FIGURE 6. Comparison of polyp detection by the original FCOS and our proposed GIFCOS-DT on the Kvasir-SEG dataset. White bounding boxes represent the ground truth labeled by doctors, yellow boxes are detections by FCOS, and green boxes are detections by our model.

2) Results on IGH_GIEndoLesion-SEG Dataset

Table 4 presents the comparative results between the original FCOS model and our proposed GIFCOS-DT model. Regarding the AP_{25} metric, all classes show improvements, with an average increase of over 10%. Specifically, the ES, NG, and PG classes exhibit slight improvements, while the GC class demonstrates a notable 7.2% increase. The most significant improvements are observed in the DU and EC classes, with increases of 13.4% and 24.5%, respectively. Similarly, for the AP_{50} metric, improvements are observed in all classes, with an average increase of 7.2%. The EC class shows the highest improvement, with a 25% increase. As for the AP_{75} metric, improvements are seen in most classes, except for a slight decrease of 0.2% in the DU class and no change in the PG class. The highest improvement is still in the EC class, with an 8% increase, while the average result shows a 1.8% increase. Finally, in terms of the AUC metric, the GIFCOS-DT model outperforms the original FCOS model in all classes, with an average increase of 3.7%. The highest improvement is observed in the EC class, with a 9.5% increase. Overall, the results of GIFCOS-DT compared to FCOS on the IGH_GIEndoLesion-SEG dataset show a significant improvement across all classes, with the EC class standing out prominently.

TABLE 4. Performance comparison of GIFCOS-DT (ours) with the original FCOS on the IGH_GIEndoLesion-SEG dataset. The best scores are highlighted in bold.

3) Results on the Mixed Dataset

The two datasets (Kvasir-SEG and IGH_GIEndoLesion-SEG) are mixed to create a new challenging dataset for evaluating the proposed model. The polyps in Kvasir-SEG are considered the 7th lesion category in addition to the six categories in the IGH_GIEndoLesion-SEG dataset. The objective of this experiment is to assess the robustness of the proposed method across different datasets and with an expanded range of lesion categories. Table 5 presents the results obtained by FCOS and our proposed GIFCOS-DT. We observe that GIFCOS-DT produces higher precision for all lesion categories, including polyp; even when the set of lesion categories is extended, the performance of GIFCOS-DT remains consistently higher. In terms of AUC, GIFCOS-DT exhibits a slightly lower value than FCOS only in the gastric cancer category.

TABLE 5. Performance comparison of GIFCOS-DT (ours) with the original FCOS on the mixed dataset. The best scores are highlighted in bold.

Figure 7 illustrates the ROC curves for GIFCOS-DT and the original FCOS. The ROCs generated by GIFCOS-DT exhibit higher values compared to those produced by FCOS. Furthermore, both models perform well in terms of ROC for the Kvasir polyp and gastric cancer categories. The integration of the distance transform makes the network better suited to gastrointestinal lesion data. Additionally, the ROC curves allow us to select a suitable threshold that balances the trade-off between the true alarm rate and the false alarm rate. By choosing a lower IoU threshold (around IoU = 0.2), we can effectively identify six classes (ES, PG, EC, GC, DU and Polyp) with a very low false alarm rate (<0.1). On the other hand, the NG class requires a higher threshold (around IoU = 0.3) to achieve a balanced trade-off with a false alarm rate of approximately 0.2.

FIGURE 7. ROC of each lesion in the mixed dataset.

In Table 5, the results of bounding box regression remain relatively low for certain classes. Using the AP_{50} measurement, ESO achieved only 28.0%, NG scored 13.3%, PG reached 20.6%, and DU 23.8%. These lesions often pose challenges for visual identification due to their indistinctive appearance and lack of specific shapes. Conversely, as the severity of the disease increases, such as in cases of Esophageal Cancer, Gastric Cancer, and Polyp, the results improve significantly: EC achieved an AP_{50} of 64.2%, GC scored 85.8%, and Polyp reached 86.6%. These severe diseases are more easily recognizable to the naked eye due to their distinctive structures and appearances. In terms of characteristics and shapes, Polyp is notably easier to identify than Esophagitis, HP-negative gastritis, HP-positive gastritis, and Duodenal Ulcers. Polyps typically exhibit distinctive features, such as raised, round-shaped protrusions on the digestive tract surface, making them easier to distinguish. With the AP_{50} evaluation metric, GIFCOS-DT yields higher results in all classes compared to FCOS. Additionally, in terms of AP_{25} , GIFCOS-DT shows only a 0.6% lower result in the GC class compared to FCOS. For the AP_{75} metric, GIFCOS-DT achieves results approximately 2-5% higher in the EC, GC, and Polyp classes, as well as in the average score, while for the remaining classes it is approximately equal to FCOS, with negligible differences.

Figure 8 shows that FCOS predicts bounding boxes that tend to be larger or smaller than the lesion area itself, while GIFCOS-DT predicts results closer to the lesion area. Because FCOS learns from the center of the box determined by the doctor, it is very difficult to balance the lesion area and the background area when regressing the bounding box, so the predicted FCOS boxes tend to include a large background region. In contrast, GIFCOS-DT adheres to the centers of the lesion regions delineated by the doctor, so a lesion may be divided into several smaller boxes. In reality, these areas are close together and could be connected, so doctors mark them as one large box. Therefore, although the current AP result is already higher than that of FCOS, our model can produce boxes that fit the lesions even more tightly than the manual annotation suggests.

FIGURE 8. Detection results by FCOS and GIFCOS-DT, where the white bounding boxes represent the ground truth labeled by doctors, yellow boxes represent detections by FCOS, and green bounding boxes represent detections by our model.

4) Ablation Study

a: Role of Distance Transform Loss

We conducted an ablation study on the IGH_GIEndoLesion-SEG dataset because this dataset contains different types of lesions with elongated shapes. We vary the value of \alpha in eq. (5) over the values 0.2, 0.4, 0.5, 0.6, 0.8, and 1. In the experimental results above, GIFCOS-DT was trained with \alpha = 0.5 for 150 epochs. In this ablation study, we vary the values of \alpha and train GIFCOS-DT for only 100 epochs to compare the contribution of {\mathcal {L}}_{DT} to the overall loss function.

Table 6 shows the AP_{50} results for six types of lesions with different values of \alpha . The higher the value of \alpha , the smaller the impact of the distance transform loss in the total loss. When \alpha = 1 , {\mathcal {L}}_{GIFCOS-DT} consists only of {\mathcal {L}}_{cls} and {\mathcal {L}}_{reg} without {\mathcal {L}}_{DT} . The results show that the average AP_{50} is highest when \alpha = 0.5 , indicating that the contributions of {\mathcal {L}}_{cls}, {\mathcal {L}}_{reg} and {\mathcal {L}}_{DT} are balanced. Without {\mathcal {L}}_{DT} (\alpha = 1 ), the AP_{50} drops to 31.4%, highlighting the significant contribution of the distance transform loss to the overall performance of the model.

TABLE 6. Comparison of AP_{50} of GIFCOS-DT on the IGH_GIEndoLesion-SEG dataset with different values of \alpha . The best scores are highlighted in bold while the second-best scores are underscored.

b: Comparison with Transformer-Based Model

We compared our GIFCOS-DT model with the original FCOS and the DETR model [38]. Notably, while both FCOS and GIFCOS-DT rely on convolutional operators, DETR incorporates both convolutional and transformer operators. Figure 9 shows the AP_{50} of the three models on the three datasets. While DETR yields higher performance than FCOS, both perform worse than our proposed GIFCOS-DT.

FIGURE 9. Comparison of AP_{50} of FCOS, DETR and GIFCOS-DT on three datasets.

SECTION V.

Building Tool for Detection of Upper GI Diseases

In this section, we present our deployment of an assistive tool for the detection of upper gastrointestinal tract diseases on an edge device. First, we present our hardware setting. Then we explain how our system connects to the endoscopy machine to capture the endoscopic images processed by the GIFCOS-DT model on the edge device, and we report its computational time. To ensure effective deployment of our model, we set up the environment on edge devices such as the Jetson AGX Xavier and the NVIDIA GeForce RTX 3090. This process involves configuring Python 3.7 along with essential libraries such as OpenCV and PyTorch. Our model, configured with a ResNet50-FPN backbone, recognizes the classes Reflux Esophagitis, Esophageal Cancer, H.P-negative gastritis, H.P-positive gastritis, Gastric Cancer, Duodenal Ulcer, and Polyp from images sized 1280\times 995 pixels. This setup ensures smooth deployment and efficient inference on edge devices for real-world usage.

A. Hardware Equipment: Connection and Interface

To deploy the lesion detection model in a practical application, we establish connections between the equipment and create interfaces among them. In our implementation, we connect the Fujifilm Eluxeo 7000 endoscopy system with a Jetson AGX Xavier edge device. To capture the images taken from the endoscope, we utilize an AverMedia CL311-M2 grabber, which is connected to the HDMI output of the Fujifilm Eluxeo 7000; the grabber is in turn connected to the Jetson Xavier via a PCIe x4 interface. The maximum input resolution that the AverMedia CL311-M2 can handle is 1920 \times 1080 , with a frame rate of up to 60 fps. Figure 10 illustrates the main components of the entire system, and Figure 11 shows a real picture of our deployment at the Institute of Gastroenterology and Hepatology (IGH) in Hanoi, Vietnam.
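On the software side, frames delivered by such a capture card can be read with OpenCV. The device index and the V4L2 backend below are assumptions that depend on how the card is enumerated on the Jetson; this is a minimal sketch, not the system's actual capture code.

```python
import cv2

# The capture card typically appears as a video device; index 0 is an assumption.
cap = cv2.VideoCapture(0, cv2.CAP_V4L2)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
cap.set(cv2.CAP_PROP_FPS, 35)

ok, frame = cap.read()
if ok:
    # frame is a BGR image; patient information would be cropped/masked here
    # before the frame is handed to the detection stage.
    print("captured frame of size", frame.shape)
cap.release()
```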

FIGURE 10. The schematic diagram of our AI-assisted endoscopy diagnostic system.

FIGURE 11. The real-world connection setup between the edge device and the endoscopic machine at the Institute of Gastroenterology and Hepatology (IGH), Hanoi, Vietnam.

B. Deployment of Deep Model for Lesion Detection on Edge Device

The prediction process involves detecting lesions in each image extracted from the streaming video. Depending on the capture speed, a certain number of images per second are generated. For example, if the capture speed is 30 fps, then 30 images will be produced in one second. Subsequently, the model predicts the results for these images, which are then displayed on the screen alongside the images. Typically, this prediction process can be divided into three main tasks:

  • Capture and load images from the video stream (T1): The initial images are read and pre-processed. At this stage, multiple images are grouped into a batch so that the model can predict the entire batch if GPU capacity allows.

  • Perform detection process (T2): This process takes the batch of input images and performs detection with GIFCOS-DT to output the bounding boxes of lesion candidates. This process typically takes the most time due to the dense and complex calculations on the matrices.

  • Display image stream and detection results (T3): After transferring the detection results, bounding boxes are superimposed onto the image, and both are then displayed on a monitor, aiding in doctors’ investigations.

In traditional setups, tasks progress sequentially, moving from one to the next once completed. While suitable for lower-end systems, this method compromises speed and image quality due to slower processing and reduced FPS. In environments with ample resources, parallel processing can improve prediction times. Multithreading subdivides tasks into manageable units, each overseen by a separate thread, thereby boosting efficiency through concurrent execution, as illustrated in Figure 12. Nevertheless, within a single core, workload imbalances may arise, constraining the potential benefits. Pipelining, as depicted in Figure 13, is applied as a highly effective technique to facilitate concurrent execution of the prediction stages. Unlike conventional approaches, pipelining eliminates idle periods between stages, thereby increasing the number of predictions completed within a specified timeframe. While it does not decrease the time required to predict a single image, pipelining significantly reduces the overall processing time for multiple images. Because image detection is subdivided into sequential tasks with equal time slots allocated to each stage, congestion is mitigated, thus enhancing efficiency without compromising detection quality.
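A minimal sketch of the three-stage pipeline (T1 capture, T2 detection, T3 display) using Python threads and bounded queues is given below. The `read_frame`, `detect`, and `display` callables are placeholders for the capture code, GIFCOS-DT inference, and the GUI; the sketch illustrates the buffering scheme, not the actual deployed code.

```python
import queue
import threading

frames = queue.Queue(maxsize=8)    # buffer between T1 (capture) and T2 (detection)
results = queue.Queue(maxsize=8)   # buffer between T2 (detection) and T3 (display)
STOP = object()                    # sentinel used to shut the pipeline down

def t1_capture(read_frame, num_frames=100):
    """T1: grab frames from the video stream and push them to the detection stage."""
    for _ in range(num_frames):    # a real system loops until the examination ends
        frames.put(read_frame())
    frames.put(STOP)

def t2_detect(detect):
    """T2: run the detector (placeholder `detect`) on each frame."""
    while True:
        frame = frames.get()
        if frame is STOP:
            results.put(STOP)
            break
        results.put((frame, detect(frame)))

def t3_display(display):
    """T3: overlay the bounding boxes and show the frame (placeholder `display`)."""
    while True:
        item = results.get()
        if item is STOP:
            break
        display(*item)

def run_pipeline(read_frame, detect, display):
    threads = [threading.Thread(target=t1_capture, args=(read_frame,)),
               threading.Thread(target=t2_detect, args=(detect,)),
               threading.Thread(target=t3_display, args=(display,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    # Dummy stages standing in for capture, GIFCOS-DT inference, and the GUI.
    run_pipeline(read_frame=lambda: "frame",
                 detect=lambda f: ["box"],
                 display=lambda f, boxes: None)
```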

FIGURE 12. Diagram of tasks when predicting consecutive images taken from endoscopy machine.

FIGURE 13. Detection task is subdivided into three sub-tasks performed in a pipeline setup.

C. Runtime System Evaluation

Figure 14 illustrates some detection results obtained when deploying the developed system. The data were captured from two patients with gastric cancer and esophageal cancer, respectively. The results show that the system works and correctly detects the frames containing diseases confirmed by doctors.

FIGURE 14. Frames showing detection of gastric cancer and esophageal cancer lesions in real-time streams captured by endoscopic machines.

1) Influence of Resolution on Recognition Accuracy

When deploying such a system, the computational time is a critical factor because the capturing speed of the endoscopic machine is very high; as a result, doctors require fast automatic detection. This section presents the results obtained when deploying the system in real-world scenarios. We integrated the GIFCOS-DT model simultaneously on two devices, a Jetson Xavier and a desktop computer equipped with a GeForce RTX 3090 graphics card, within the complete system from data acquisition to prediction and result display. To optimize image quality and the FPS rate, we configured the capture card to capture frames at 1920\times 1080 resolution and 35 FPS. After removing patient information, the system received images at 1280\times 995 resolution. While this is the original resolution of the endoscopic image stream, inputting it directly into the detection module would be computationally expensive. Hence, we downscaled the input images to 640\times 512 for evaluation, albeit at the cost of reduced model accuracy. For instance, at 1280\times 995 resolution, the AP_{50} for detecting Polyp lesions is 86.6%, dropping to 71.4% at 640\times 512 , a reduction of 15.2%.

Figure 15 compares the detection capabilities of the model with high-resolution input images (1280\times 995 ) versus low-resolution input images (640\times 512 ) when detecting directly on edge devices through several examples. It can be observed that while the model still detects lesions at lower resolutions on edge devices, the accuracy decreases. Specifically, there are more false positives with small bounding boxes due to the low-quality input, which leads to confusion with bright spots, blood stains, or saliva. These noise artifacts also cause the model to miss some lesion areas. Nevertheless, the model still effectively detects larger lesions.

FIGURE 15. Comparison of detecting six types of lesions using the GIFCOS-DT model with high-resolution input images ($1280 \times 995$) versus a model detecting directly on edge devices with low-resolution input images ($640 \times 512$).

2) Framerate of Integrated Models

For the performance evaluation on edge devices, we assessed GIFCOS-DT with input image resolutions of $1280 \times 995$ and $640 \times 512$. We also constructed the processing stream with two typical techniques, sequential processing and pipeline processing, and evaluated the models on two hardware devices, a GeForce RTX 3090 and a Jetson AGX Xavier, which yields the eight configurations reported in Table 7. Notably, GIFCOS-DT on the GeForce RTX 3090 achieves significantly lower latency and higher frame rates than on the Jetson AGX Xavier. On the RTX 3090, the highest frame rate is 31.92 FPS, while on the Jetson it reaches only 14.85 FPS. The latency on the RTX 3090 (24 GB, 35.38 TFLOPS FP32) is at least 0.087 seconds, whereas on the Jetson AGX Xavier (16 GB, 1,410 GFLOPS FP32) it is 0.199 seconds, reflecting the hardware gap between the two devices.

TABLE 7. Performance evaluation of the GIFCOS-DT model on two hardware platforms: RTX 3090 and AGX Xavier. The best scores are highlighted in bold and the second-best scores are underlined.

It is worth noting that the pipeline technique improves frame rates by approximately 2-3 times compared to sequential processing. For instance, the GIFCOS-DT model with a $1280 \times 995$ input resolution on the RTX 3090 achieves 28.04 FPS using the pipeline technique, compared to 10.10 FPS with sequential processing. However, the pipeline latency is about 20 ms higher: with balanced slots it equals three times the largest time slot, whereas the sequential latency is simply the sum of the three stages.
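This trade-off can be checked with a back-of-the-envelope calculation. In the sketch below the three stage times are assumed values, chosen only to fall in the same range as the RTX 3090 row of Table 7; they are not the measured breakdown.

stages = [0.025, 0.036, 0.026]            # assumed capture, detect, display times (seconds)

seq_latency = sum(stages)                 # sequential: a frame passes all stages back to back
seq_fps = 1.0 / seq_latency

pipe_slot = max(stages)                   # pipeline: every stage gets the largest time slot
pipe_latency = len(stages) * pipe_slot    # a frame waits one full slot per stage
pipe_fps = 1.0 / pipe_slot                # a new frame completes every slot

print(f"sequential: latency {seq_latency:.3f} s, {seq_fps:.1f} FPS")   # 0.087 s, 11.5 FPS
print(f"pipelined:  latency {pipe_latency:.3f} s, {pipe_fps:.1f} FPS") # 0.108 s, 27.8 FPS

With these assumed numbers the pipeline gains roughly 2.4x in frame rate while paying about 20 ms of extra latency, matching the pattern reported above.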

Maintaining equal time slots in the pipeline prevents buffer overflow; the pipeline is effective only when the slots are balanced. For instance, with the same GIFCOS-DT model and input image size on the Jetson AGX Xavier, an overly large prediction time slot causes imbalance, resulting in a five-fold increase in latency with minimal frame-rate improvement. For low-latency, high-frame-rate, resource-rich scenarios, GIFCOS-DT with a $1280 \times 995$ input resolution, the pipeline technique, and the RTX 3090 achieves a latency of 0.107 seconds and a frame rate of 28.04 FPS. Accepting a slight drop in accuracy with a $640 \times 512$ input reduces the latency to 0.094 seconds and raises the frame rate to 31.92 FPS. Conversely, for lightweight edge devices, GIFCOS-DT with a $640 \times 512$ input and sequential processing on the Jetson AGX Xavier keeps the latency at 0.199 seconds while achieving a frame rate of 5.03 FPS.

3) Limitations of the Proposed Model

In this study, the GIFCOS-DT model has been integrated and deployed on a device with limited computational resources, an embedded Jetson AGX Xavier, chosen for its compactness and affordability. However, to make the integrated edge device suitable for daily clinical use, the current performance of 14.85 FPS still requires further improvement. To address this issue, the model's streaming procedure can be optimized further to reach real-time processing speeds. Additionally, the quality of the endoscopic images should be carefully evaluated. In practical endoscopic examinations, the abnormality detection model may struggle with contaminating objects in the image, such as water bubbles or food particles. Furthermore, due to the movement of the endoscope, images may become blurred or show lesions at too close a range. These real-world challenges suggest future directions, such as implementing a pre-screening procedure to evaluate image quality or designing end-to-end models that address both image quality and the detection of abnormal regions.
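One possible form of the image-quality pre-screening mentioned above is a simple sharpness check that rejects blurred frames before detection. The variance-of-Laplacian measure and the threshold in the sketch below are illustrative assumptions, not part of the deployed system.

import cv2

BLUR_THRESHOLD = 100.0   # hypothetical threshold; it would need tuning on endoscopic data

def is_sharp_enough(frame_bgr) -> bool:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    focus_measure = cv2.Laplacian(gray, cv2.CV_64F).var()   # low variance suggests blur
    return focus_measure >= BLUR_THRESHOLD

Only frames passing such a check would be forwarded to the detection stage; blurred or too-close frames would be skipped or flagged.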

SECTION VI.

Conclusion

This paper introduced the novel GIFCOS-DT model with the Distance Transform loss function for detecting lesions of diverse shapes and sizes in endoscopic images of the gastrointestinal tract, not limited to the polyp shapes addressed in recent studies. The GIFCOS-DT model outperformed the original FCOS model on two challenging datasets: in terms of the average AP_{50} score, it surpasses FCOS by 4.2% on the Kvasir-SEG dataset, by 7.2% on the IGH_GIEndoLesion-SEG dataset, and by 3.9% on a mixture of both datasets. Furthermore, the results of GIFCOS-DT exceed those of other networks such as YOLOv3, YOLOv4+spp, DETR, and Faster R-CNN. Additionally, we have developed a supporting tool that acquires data from endoscopy machines through a capture card and feeds it into the AI module. This module is built using two methods, sequential acquisition and pipeline acquisition; the pipelining technique reduces the per-frame processing time to approximately one-third of that of the sequential method. Regarding integration for practical applications, we have optimized the processing time to meet the strict computation-time and display-speed requirements demanded by physicians. In the future, we will explore techniques to improve image quality before detection and incorporate a tracking algorithm to enhance the detection rate in video streams.

