Introduction
Automated detection of lesions in gastrointestinal endoscopy images is crucial for assisting doctors in the diagnosis of gastrointestinal (GI) tract-related diseases. Automating lesion detection can help improve the accuracy and efficiency of the diagnostic process. However, multiple factors contribute to the complexity of identifying and categorizing lesions, including the difficulty of distinguishing lesions with similar characteristics and issues with image contrast, clarity, and artifacts (e.g., bubbles, fluids, blood, lens defects).
Recent object detection methods have been widely built upon convolutional neural networks [1], [2], [3], [4], [5], [6]. Most of them rely on loss functions such as Intersection over Union (IoU), Cross-Entropy (CE), and Centerness loss for optimizing the models’ weights. However, these methods are primarily designed for objects with well-defined shapes within rectangular bounding boxes. Digestive tract lesions, in contrast, present a unique challenge due to their diverse shapes and irregularities. Unlike conventional objects, these lesions can vary significantly in shape and appearance, exhibiting traits such as elongation, concavity, or irregular edges. As a result, the application of standard loss functions proves inadequate in accurately detecting and characterizing such lesions.
To address these issues, we propose a method that builds upon the FCOS (Fully Convolutional One-Stage Object Detection) architecture, known for its anchor-free approach to object detection [7]. FCOS has demonstrated superior performance compared to many other detectors, including both two-stage detectors (e.g., Faster R-CNN [1]) and one-stage detectors (such as variants of YOLO [2], SSD [3], DSSD [4], RetinaNet [5], and CornerNet [6]), particularly on the COCO large-scale detection benchmark. However, the original FCOS employs a loss function comprised of focal loss, IoU loss, and centerness loss, which are optimized for detecting compact objects, a characteristic not commonly observed in gastrointestinal lesions. To overcome this limitation, we introduce a novel loss function utilizing Distance Transform, tailored specifically to detect gastrointestinal lesions.
Our key contributions can be summarized as follows:
We propose a new detection model, namely GIFCOS-DT (GastroIntestinal Fully Convolutional One-Stage Object Detection using Distance Transform), that extends the existing FCOS model with a novel loss function, the distance transform loss;
We introduce a new benchmark dataset covering six lesion categories of the upper gastrointestinal tract for the evaluation of lesion detection from endoscopic images, named IGH_GIEndoLesion-SEG;
We develop an end-to-end assisting system that directly connects the endoscopic machine and the detection module. To enhance practicality, we deploy buffering techniques to accelerate computational processing time on edge devices.
We validate our proposed detection model on two benchmark datasets with various lesion categories. Additionally, we deploy our supporting system on various platforms to compare them and provide recommendations for real-world implementation.
The remaining sections of this paper are organized as follows: Section II briefly reviews related works on existing methods for lesion detection from endoscopic images. The proposed framework for gastrointestinal tract disease detection using distance transform is described in Section III. Section IV reports experimental results on one public dataset for polyp detection (Kvasir-SEG) and one self-collected dataset (IGH_GIEndoLesion-SEG) covering six lesion categories. Finally, further discussions and conclusions are presented in Section V.
Related Works
Lesion detection from endoscopic images has become a highly attractive topic in recent years, with significant efforts directed toward developing effective algorithms and computer-aided systems. Initially, research efforts concentrated on simple identifiable characteristics of lesions such as color and structure, employing models that learn from hand-crafted features. In recent years, however, methods based on Convolutional Neural Networks (CNNs) have garnered significant research interest due to their accuracy and versatility. In this section, we survey existing techniques for detecting lesions of the gastrointestinal tract using deep-learning models. We then summarize some Computer-Aided Diagnosis (CAD) systems for lesion detection using artificial intelligence.
A. Lesion Detection from Endoscopy Images
In the realm of early gastric cancer detection, several pioneering endeavors have significantly contributed to the advancement of lesion detection methods in gastrointestinal endoscopy. Hirasawa et al. devised a method employing a Single Shot MultiBox Detector (SSD), to automate the detection of early gastric cancer lesions and delineate the extent of invasion [8]. The SSD has been extensively applied in gastric cancer detection [8], [9], as well as in identifying erosions and ulcerations [10]. Additionally, Vladimir Khryashchev et al. conducted a comparative study on SSD and RetinaNet for analyzing pathologies in endoscopic images of the stomach [11].
Similarly, Sakai et al. proposed an approach leveraging CNNs to discern gastric cancer regions from normal areas by analyzing finely cut patches of endoscopic images [12].
Shibata et al. introduced a method harnessing the capabilities of Mask R-CNN, designed for both object detection and segmentation, to detect the presence of early gastric cancer and extract invasive regions [13]. Furthermore, Teramoto et al. proposed a sophisticated U-Net R-CNN model, employing two CNNs for segmentation and classification tasks [14]. Initially, the U-Net model was employed to delineate regions indicative of early gastric cancer, followed by classification utilizing a separate CNN model.
Ghatwary et al. introduced a novel 3D Sequential DenseConvLstm network for extracting spatiotemporal features from input videos [15]. Their model combined 3D Convolutional Neural Network (3D CNN) and Convolutional LSTM (Long Short-Term Memory) to effectively capture both short and long-term spatiotemporal patterns [15]. The resulting feature map is employed by a region proposal network and ROI (Region of Interest) pooling layer to generate bounding boxes identifying abnormal regions in each frame of the video. Additionally, they investigated a post-processing technique called Frame Search Conditional Random Field (FS-CRF) to enhance model performance by recovering missing regions in neighboring frames within the same video clip.
Gao et al. employed YOLOv5 for the detection of colorectal lesions [16]. Teramoto et al. proposed a cascade model comprising two stages for gastric cancer detection and characterization [17]. The initial stage employs a diverse set of image classification deep models, such as VGG-16, InceptionV3, ResNet, and DenseNet, followed by a segmentation model (U-Net). Ahmad et al. introduced an automated approach that enhances the YOLOv7 object detection algorithm through the incorporation of an attention block for gastric lesion detection [18]. Among the three attention mechanisms tested (Squeeze and Excitation (SE), Convolutional Block Attention Module (CBAM), Global Local Attention Mechanism (GLAM)), YOLOv7 enhanced by SE achieved the highest accuracy across four categories: gastric cancer, ulcers, adenomas, and healthy tissues. In another study [19], Xiao et al. proposed a Deep Convolutional Generative Adversarial Network (DCGAN) architecture to augment the dataset obtained from wireless capsule endoscopy. They subsequently employed three deep models, namely SSD, YOLOv4, and YOLOv5, to detect four categories: Ulcer, Polyp, Fresh Blood, and Erosion.
Overall, these pioneering works demonstrated the significant impact of automated methodologies in the field of early gastric cancer detection, presenting promising approaches for enhancing diagnostic capabilities. The majority of recent methods for detecting lesions from endoscopic images relied on conventional CNN-based object detectors. These methods primarily focused on identifying lesions such as gastric cancer, ulcers, adenomas, and polyps with convex shapes. Detecting other lesions, such as esophagitis and duodenal ulcers, poses a greater challenge and is often overlooked. Nevertheless, early detection holds promise for facilitating treatment and mitigating the severity of these diseases. This paper deals with various types of lesions and aims to improve detection accuracy and efficiency in practical deployment scenarios.
B. Computer-Aided Diagnosis Systems for Lesion Detection
With the increasing advance of artificial intelligence (AI), some AI-based systems are now commercially available for lesion detection of the gastrointestinal tract. In the field of AI-assisted colonoscopy, several systems have been introduced in the past five years. Notable examples include EndoBRAIN by Cybernet Systems Corporation (Tokyo, Japan) in 2018 [20] and GI Genius by Medtronic (Dublin, Ireland) in 2019 [21]. In 2020, many other AI systems appeared, such as EndoBRAIN-EYE, DISCOVERY by Pentax Medical Company (Tokyo, Japan), ENDO-AID by Olympus Corporation (Tokyo, Japan), CAD EYE by Fujifilm (Tokyo, Japan), and Wise Vision by NEC Corporation (Tokyo, Japan) [22]. Wise Vision has an image analysis terminal to display the results of polyp detection. It runs on an NVIDIA Quadro RTX 5000 with a Blackmagic Design DeckLink Mini Recorder. The cable connecting the endoscopic device and the image terminal is an HD-SDI or 3G-SDI cable. In 2021, EndoScreener was introduced by Wision A.I. (Shanghai, China) [23]. The core module in EndoScreener is SegNet, which is integrated with Olympus 190-series high-definition white-light colonoscopes. The detection results are displayed on either a dual-monitor setup or a single-monitor setup. The latency is approximately
In summary, integrating AI models into real commercial products is essential. Real-time experiments have demonstrated that AI can significantly aid in preventing oversights by endoscopists. However, most systems currently target a single type of lesion, and evaluations across different devices are often overlooked. This paper introduces a model that addresses multiple lesion types and proves its feasibility for deployment on various devices, thereby setting the stage for broader future implementation.
Unified Framework for Lesion Detection
A. General Framework
Our proposed framework for GI lesion detection is illustrated in Figure 1. It comprises two main stages: i) the model development (training and evaluation) stage and ii) the deployment stage. In the first stage, the proposed model is constructed with the new loss function and then trained and validated. In the second stage, a computer-aided system is deployed on a dedicated computing device in a clinical scenario. The system is connected directly to an endoscopic machine to capture endoscopic images, detect lesions in those images, and display the detection results to endoscopy doctors through a Graphical User Interface (GUI). In the following sections, we will present in detail our detection model and the deployment of the model on edge devices in a practical application.
B. Image Representation Based on Distance Transform
Note that in the training and evaluation stage, each sample in the dataset is a pair of an original RGB image I and a ground truth G, a binary image separating the lesions from the background. The binary image is manually annotated by expert doctors; such masks are widely provided in many datasets for detection and segmentation tasks. As mentioned earlier, the lesions naturally exhibit complex appearance, often with non-convex shapes. When using rectangular bounding boxes as ground truth to enclose these lesions, a significant portion of the background pixels may also be included within the bounding boxes. This acts as a confounding factor for the learning algorithms and may result in incorrect detections. Moreover, the center of the bounding box may be very far from the real center of the lesion region.
To tackle this issue, we first transform the binary image using the distance transform [26]. The distance transform is an operator that assigns to each point its distance to the closest boundary. In this way, it produces an intensity image in which larger values correspond to pixels farther from their nearest boundary. As mentioned previously, G is the binary mask; let $\mathcal{B}$ denote the set of boundary pixels of the lesion regions in G. The distance transform map D is then defined as \begin{equation*} D(x,y) = \min _{(x', y') \in \mathcal {B}}\sqrt {(x-x')^{2} + (y-y')^{2}} \tag {1}\end{equation*}
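As a concrete illustration, the following Python sketch computes such a map from a binary mask using SciPy's Euclidean distance transform; the function name and the optional normalization step are our own assumptions, not part of the original implementation.

```python
# Hedged sketch: computing the distance-transform map D of Eq. (1) from a binary
# ground-truth mask G (lesion = 1, background = 0).
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_transform_map(mask: np.ndarray) -> np.ndarray:
    """mask: HxW binary array (1 inside lesions). Returns an HxW float map in which
    each lesion pixel holds its Euclidean distance to the nearest boundary."""
    # distance_transform_edt measures the distance to the nearest zero pixel,
    # i.e. to the background, which coincides with the lesion boundary.
    dist = distance_transform_edt(mask.astype(bool))
    # Optional normalization to [0, 1] so the map can serve as a soft target
    # (an assumption; Eq. (1) itself is unnormalized).
    if dist.max() > 0:
        dist = dist / dist.max()
    return dist
```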
Illustration of original images, annotated masks (ground truth) and their corresponding distance transform. We notice the variation in the shapes of the lesions, which are commonly non-convex.
C. Architecture of the Proposed Detection Model
Fully Convolutional One-Stage Object Detection (FCOS) is a well-known anchor-free detection model in the literature [7]. The main advantage of FCOS is that it can predict a box for each object candidate from every ground-truth pixel, guided by a center-ness loss function. However, FCOS has a limitation: it primarily focuses on the center of the bounding box. To address this, we introduce a new loss function that pays attention to all points in the central area of the lesion, ensuring a fair focus on all important points. We refer to our new architecture as GIFCOS-DT, where GI and DT stand for GastroIntestinal and Distance Transform, respectively.
The architecture of GIFCOS-DT is illustrated in Figure 3. It is composed of four main components.
Firstly, the RGB image $I \in \mathbb {R}^{W\times H \times N}$, after being annotated manually by doctors to generate a binary mask $G \in \mathbb {R}^{W\times H \times 1}$, is converted to a gray-scale image $D \in \mathbb {R}^{W\times H \times 1}$ by the Distance Transform operator explained in Section III-B (as shown in the yellow block in Figure 3). $W$, $H$, and $N$ represent the width, height, and number of channels of the original image $I$, respectively.
Secondly, the RGB image $I$ goes through an encoder to generate three feature maps $F_{1}, F_{2}, F_{3}$. In this paper, we utilized ResNet-50 [27] because it keeps a good trade-off between accuracy and computation time. The input image goes through the layers of the backbone (the black block in Figure 3), and the output feature maps have their resolution reduced through the layers.
Thirdly, as we detect lesions of different sizes, we generate different levels of feature maps. As in [7], we use five levels of feature maps $P_{1}, P_{2}, P_{3}, P_{4}, P_{5}$, where $P_{1}, P_{2}, P_{3}$ are produced from $F_{1}, F_{2}, F_{3}$ followed by $1\times 1$ convolutional layers with top-down connections. $P_{4}$ and $P_{5}$ are generated by applying a convolutional layer with stride 2 on $P_{3}$ and $P_{4}$, respectively. The features $P_{i}$ are organized in a pyramid that is successively fed to the heads (as shown in the green block in Figure 3).
Lastly, the pooled features $P_{i}$ of the Feature Pyramid Network (FPN) [28] go through five Heads, each responsible for predicting the class of the object at a certain size, the important pixels within the bounding box of the object, and the regression of the four values of the bounding box. Each output of the FPN has a head that predicts these three outputs about the lesions in the image.
The output of the backbone usually has a small spatial size, posing a challenge for detecting small objects. Therefore, right after the backbone, the model employs the Feature Pyramid Network (FPN) to address this issue. The FPN combines information extracted from the backbone at various layers in a bottom-up pathway with top-down processing to detect objects. In the bottom-up pathway, the spatial size decreases while the semantic information increases. The top-down pathway increases the layer sizes again to facilitate the detection of small objects. These layers are connected to the corresponding layers on the bottom-up side through lateral connections to preserve the extracted semantic information. FPN outperforms other block-based architectures because it maintains strong features at different scales.
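The construction of the pyramid described above can be sketched in PyTorch as follows. The channel widths of the ResNet-50 stages and the pyramid width of 256 are assumptions borrowed from common FCOS configurations, not values stated in the paper.

```python
# Minimal FPN sketch following the textual description (not the authors' exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral projections of the backbone features F1, F2, F3.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # P4 and P5 are produced by stride-2 convolutions on top of P3 and P4.
        self.extra4 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.extra5 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, f1, f2, f3):
        # Top-down pathway: upsample the coarser level and add the lateral feature.
        p3 = self.lateral[2](f3)
        p2 = self.lateral[1](f2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lateral[0](f1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        # Extra pyramid levels for larger strides.
        p4 = self.extra4(p3)
        p5 = self.extra5(p4)
        return p1, p2, p3, p4, p5
```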
The internal architecture of a Head can be seen in detail in Figure 3. Each shared Head across levels will have three outputs:
Classification output ${\mathcal {O}}_{c} \in \mathbb {R}^{H \times W \times C}$, with $H \times W$ being the size of the input image and C being the number of classes. This output represents the prediction of whether there is an object of a given class or not in a region of interest.
Distance-transform output ${\mathcal {O}}_{d} \in \mathbb {R}^{H \times W \times 1}$, which contains the value of each pixel, where pixels closer to the boundary will have lower values and pixels closer to the center will have higher values.
Regression output ${\mathcal {O}}_{b} \in \mathbb {R}^{H \times W \times 4}$ with the same resolution as the input image; each position has four values $l^{*}, t^{*}, r^{*}, b^{*}$ corresponding to the distances from that position to the left, top, right, and bottom edges of the bounding box.
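A minimal PyTorch sketch of one shared head producing these three outputs is given below. The number of convolutional layers, the use of GroupNorm, and the default number of classes are assumptions following the public FCOS design rather than the authors' exact implementation.

```python
# Hedged sketch of a detection head with classification, distance-transform,
# and box-regression branches.
import torch
import torch.nn as nn

class GIFCOSDTHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=7, num_convs=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower, self.reg_tower = tower(), tower()
        self.cls_out = nn.Conv2d(in_channels, num_classes, 3, padding=1)  # O_c
        self.dt_out = nn.Conv2d(in_channels, 1, 3, padding=1)             # O_d
        self.reg_out = nn.Conv2d(in_channels, 4, 3, padding=1)            # O_b: l, t, r, b

    def forward(self, p):
        c = self.cls_tower(p)
        r = self.reg_tower(p)
        return (self.cls_out(c).sigmoid(),   # per-class probabilities
                self.dt_out(c).sigmoid(),    # distance-transform map in [0, 1]
                self.reg_out(r).exp())       # positive box offsets (FCOS-style)
```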
D. Loss Function
Regarding the loss function, instead of using the original losses of FCOS (the focal loss for classification, the IoU loss for box regression, and the center-ness loss), GIFCOS-DT keeps the first two terms and replaces the center-ness loss with the proposed distance transform loss. The three components are detailed below.
1) Classification Loss
In the object classification branch, the Focal loss [5] function is utilized for training. Focal loss is constructed based on the Cross-entropy loss function, but the difference is that it reduces the emphasis on samples that the network has already learned well and pays more attention to hard-to-learn samples. We calculate the Focal Loss using the formula (2):\begin{equation*} {\mathcal {L}}_{\text {cls}}(p_{x,y}, c_{x, y}) = -\alpha _{t}(1 - c_{x,y})^{\gamma }\log (c_{x,y}) \tag {2}\end{equation*}
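For reference, a hedged implementation of Eq. (2) in PyTorch might look as follows; the default values of α and γ (0.25 and 2.0) are taken from the RetinaNet paper and may differ from the settings used here, and the inclusion of the negative-sample term follows the standard focal-loss formulation.

```python
# Hedged sketch of the focal loss used in the classification branch.
import torch

def focal_loss(pred_probs, targets, alpha=0.25, gamma=2.0, eps=1e-6):
    """pred_probs, targets: tensors of the same shape with values in [0, 1]."""
    p = pred_probs.clamp(eps, 1.0 - eps)
    # Positive locations: -alpha * (1 - p)^gamma * log(p), as in Eq. (2).
    pos = -alpha * (1.0 - p).pow(gamma) * p.log() * targets
    # Negative locations: -(1 - alpha) * p^gamma * log(1 - p).
    neg = -(1.0 - alpha) * p.pow(gamma) * (1.0 - p).log() * (1.0 - targets)
    return (pos + neg).sum()
```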
2) Distance Transform Loss
In Section III-B, we presented that the original image I has a corresponding mask G and a ground-truth distance transform map D. In each of the five heads, we added a single-layer branch in parallel with the classification branch to predict the distance transform map ${\mathcal {O}}_{d}$. The distance transform loss is computed as \begin{equation*} {\mathcal {L}}_{\text {DT}}(D(x,y),{\mathcal {O}}_{d}(x,y)) = - D(x,y) \cdot \log ({\mathcal {O}}_{d}(x,y)) \tag {3}\end{equation*}
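A possible implementation of Eq. (3) is sketched below; averaging over pixels and clamping the prediction for numerical stability are our own assumptions.

```python
# Hedged sketch of the distance-transform loss of Eq. (3): a pixel-wise
# cross-entropy-style term weighted by the ground-truth distance value D(x, y).
import torch

def distance_transform_loss(dt_target, dt_pred, eps=1e-6):
    """dt_target: ground-truth distance map D (normalized to [0, 1]).
    dt_pred: predicted map O_d after a sigmoid, same shape."""
    dt_pred = dt_pred.clamp(eps, 1.0 - eps)
    loss = -(dt_target * dt_pred.log())   # Eq. (3), evaluated per pixel
    return loss.mean()
```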
3) Regression Loss
In the third branch, the model predicts the bounding box regression using the IoU loss function [30]. The output of this branch is the predicted box at each location; the loss penalizes the lack of overlap between the ground-truth box $\mathbf {bb}_{x,y}$ and the predicted box $\hat {\mathbf {bb}}_{x,y}$: \begin{align*} {\mathcal {L}}_{\text {reg}}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y}) = -\ln \left ({{ \frac {\text {Intersection}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y})}{\text {Union}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y})} }}\right ) \tag {4}\end{align*}
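The following sketch evaluates Eq. (4) for FCOS-style (l, t, r, b) box encodings; it follows the standard formulation of the IoU loss and is not the authors' code.

```python
# Hedged sketch of the IoU regression loss for boxes sharing the same anchor location.
import torch

def iou_loss(pred, target, eps=1e-6):
    """pred, target: tensors of shape (N, 4) holding the (l, t, r, b) distances
    from a location to the four sides of the box."""
    pl, pt, pr, pb = pred.unbind(dim=1)
    tl, tt, tr, tb = target.unbind(dim=1)
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    # Overlap of the two boxes around the shared location.
    inter_w = torch.min(pl, tl) + torch.min(pr, tr)
    inter_h = torch.min(pt, tt) + torch.min(pb, tb)
    inter = inter_w * inter_h
    union = pred_area + target_area - inter
    iou = inter / union.clamp(min=eps)
    return -torch.log(iou.clamp(min=eps)).mean()
```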
4) Loss Function of GIFCOS-DT
Finally, the loss function of the proposed model is the combination of the three loss functions: \begin{align*} {\mathcal {L}}_{\text {GIFCOS-DT}}(\{\mathbf {p}_{x,y}\}, \{D(x,y)\}, \{\mathbf {bb}_{x,y}\}) & = \frac {\alpha }{N_{\text {pos}}} \Big (\sum _{x,y} {\mathcal {L}}_{\text {cls}}(\mathbf {p}_{x,y}, c_{x,y}) + \sum _{x,y} \mathbb {1}_{\{c_{x,y} > 0\}} {\mathcal {L}}_{\text {reg}}(\mathbf {bb}_{x,y}, \hat {\mathbf {bb}}_{x,y})\Big ) \\ & \quad + \frac {1-\alpha }{N_{\text {pos}}} \sum _{x,y} {\mathcal {L}}_{\text {DT}}(D(x,y), {\mathcal {O}}_{d}(x,y)) \tag {5}\end{align*}
In the original paper [7], the classification and regression terms are simply summed with equal weight and normalized by $N_{\text {pos}}$; here, the coefficient $\alpha$ additionally balances these detection terms against the distance transform term.
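Putting the three terms together, Eq. (5) can be computed as in the sketch below; the default value of α is a placeholder, since the actual value is studied in the ablation of Section IV.

```python
# Hedged sketch of the total GIFCOS-DT loss of Eq. (5).
def gifcos_dt_loss(cls_loss, reg_loss, dt_loss, num_pos, alpha=0.5):
    """cls_loss, reg_loss, dt_loss: already-summed per-location losses (scalars).
    num_pos: number of positive (foreground) locations."""
    num_pos = max(num_pos, 1)                        # avoid division by zero
    detection_part = alpha * (cls_loss + reg_loss) / num_pos
    dt_part = (1.0 - alpha) * dt_loss / num_pos
    return detection_part + dt_part
```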
Experiments
A. Datasets
Our proposed method is evaluated on two challenging datasets, the Kvasir-SEG dataset and the IGH_GIEndoLesion-SEG dataset. The Kvasir-SEG dataset is commonly used to evaluate the detection and segmentation of polyps in endoscopic images [31]. In addition, we collect a new dataset with six typical lesion categories of the upper gastrointestinal tract. In the following, we briefly describe Kvasir-SEG and detail our process for collecting the new dataset, which we name the IGH_GIEndoLesion-SEG dataset.
Both datasets present many challenges in lesion detection, such as the high number of lesions, as well as lesion regions that are complex and difficult to distinguish due to diverse structures, colors, and sizes. Additionally, many lesion areas contain bubbles, blood, and bright glare, further complicating detection.
1) Kvasir-SEG Dataset
The main purpose of the Kvasir-SEG dataset [31] is to facilitate research and development of advanced methods for polyp segmentation, detection, localization, and classification. The dataset contains 1000 polyp images and their corresponding ground truth from Kvasir Dataset v2. The resolution of the images varies from
2) IGH_GIEndoLesion-SEG Dataset
To test the detection capability of the model for various diseases of the gastrointestinal tract, we collected a new dataset that contains endoscopic images of gastrointestinal tracts acquired at several hospitals in Hanoi City, Vietnam from 2022 to 2024. All images were captured using Fujifilm high-resolution endoscopy systems, including the EPX-3500HD, LASEREO, and ELUXEO 7000 systems. These systems provide four different light modes: White Light Imaging (WLI), Flexible Spectral Imaging Color Enhancement (FICE), Blue Light Imaging (BLI), and Linked Color Imaging (LCI), which enhance lesion detection and characterization. To ensure quality, the collected images must clearly depict the lesions, be free from excessive foam or mucus, lack blurring, and have good contrast. For images of esophageal and gastric cancers, the endoscopists must review the histopathological results to confirm the diagnosis.
To enhance diversity, the collected lesions must include multiple subtypes that vary in characteristics and severity. These subtypes align with international classifications commonly used in endoscopy assessments. For example, reflux esophagitis has five subtypes based on the Los Angeles classification [32]. Gastritis has six subtypes according to the Sydney classification [33] and the Kyoto classification [34], which include raised erosions, flat erosions, blood streaks, redness streaks, atrophy, and nodularity. Duodenal ulcers are classified into six subtypes based on the Sakita and Fukutomi classification [35]. Both esophageal cancer and gastric cancer have six subtypes based on the classifications by the Japanese Gastric Cancer Association [36], [37]. Additionally, images were collected using various light modes (WLI, FICE, BLI, LCI) to capture different aspects of the lesions. To ensure quality, endoscopists reviewed the collected images, confirming their diversity and discarding any that were of poor quality or overly similar. To protect patient privacy, identifying information (e.g. name, age) was removed before labeling and using the images for training, either by cropping or masking the images. Our data collection protocol received approval from the Scientific Committee of the Vietnam’s Ministry of Science and Technology.
Our medical and technical experts closely cooperated to create an online platform for manual labeling and delineation of lesions. Only endoscopy images without patient information were uploaded to this platform. Figure 4 demonstrates our labeling process: the doctor labeled not only the lesions on the image but also the light mode and the specific anatomical location. Additionally, doctors delineated the lesions and annotated their subtypes according to the international classifications on each image. The labeling and delineations underwent validation by experts with over 5 years of experience in the field. The final dataset comprises 5211 pairs of original images and their ground-truth binary masks from 2543 patients, annotated by experts. The resolution of images is
Table 1 describes the variation of lesions and light modes in the IGH_GIEndoLesion-SEG dataset. Table 2 summarizes the number of images for each category of lesion in both experimented datasets (Kvasir-SEG and our collected dataset IGH_GIEndoLesion-SEG). Figure 5 illustrates some original images and bounding boxes of some lesion categories in our IGH_GIEndoLesion-SEG dataset and Kvasir-SEG dataset respectively.
Illustration of the original images with overlaid annotated lesion bounding boxes in two datasets. It is noted that the lesions in our IGH_GIEndoLesion-SEG dataset (two rows above) are more challenging than polyps in the Kvasir-SEG dataset (last row).
3) Data Splitting and Evaluation Metrics
We conducted three experiments to evaluate the performance of the proposed model. The first two experiments evaluate the proposed method on the two separate datasets (the Kvasir-SEG dataset and the IGH_GIEndoLesion-SEG dataset), while the last experiment evaluates it on a new dataset mixed from the two aforementioned datasets. We split the samples of each lesion category in the experimented datasets into three separate parts, comprising the train, validation, and test sets, with a ratio of 7:1:2. Since we are addressing object detection rather than segmentation, we need to derive a bounding box for each segmented region to serve as ground truth for training and testing our detection models. Figure 5 illustrates the bounding boxes determined as rectangles. To assess the performance of our proposed method, we employ standard metrics commonly used in object detection, including Area Under the Curve (AUC) and Average Precision (AP). For the object detection problem, a true positive is confirmed if the Intersection over Union (IoU) between the predicted bounding box and the ground truth bounding box is higher than a specific threshold. We compute
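For illustration, the sketch below derives one rectangle per connected lesion region in a binary mask and evaluates the IoU used for the true-positive criterion; the use of OpenCV connected components is our choice and not necessarily how the ground-truth boxes were generated.

```python
# Hedged sketch: boxes from segmentation masks and IoU between two boxes.
import cv2
import numpy as np

def masks_to_boxes(mask: np.ndarray):
    """mask: HxW binary array. Returns a list of (x1, y1, x2, y2) boxes,
    one per connected lesion region."""
    num, labels = cv2.connectedComponents(mask.astype(np.uint8))
    boxes = []
    for k in range(1, num):                      # label 0 is the background
        ys, xs = np.where(labels == k)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-6)
```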
B. Implementation Details
We configured the network with an input image size of
C. Experimental Results
1) Results on Kvasir-SEG Dataset
Table 3 shows the performance of our proposed method GIFCOS-DT compared with state-of-the-art detectors such as Faster R-CNN, YOLOv3+spp, YOLOv4, DETR [38], and FCOS. The results obtained by state-of-the-art (SOTA) detectors are reported in [39], while we re-implemented and tested DETR ourselves. In terms of accuracy metrics, GIFCOS-DT consistently outperforms FCOS, with the highest improvement seen in
The GIFCOS-DT model with a ResNet50 backbone focuses its detections on the central regions of lesions; additionally, larger lesions may be split into smaller regions. Moreover, the DarkNet53 backbone provides better performance than ResNet50 at higher IoU levels, such as 75. Therefore, at
Figure 6 illustrates the detection results generated by FCOS and GIFCOS-DT for a single polyp case in the Kvasir-SEG dataset. The first row shows the original image and the binary mask of the polyp manually delineated by doctors. In shape, the polyp appears as a large round mass with some smaller round masses attached to its edges. Based on this segmentation, the bounding box is determined as a white rectangle overlaid on the original image in the bottom row. The middle row displays the maps obtained by the distance transform in GIFCOS-DT and by the center-ness transform in FCOS, respectively. We observe that FCOS with the center-ness loss considers only the center of the entire region, while GIFCOS-DT takes the center of each component region into account. As a result, GIFCOS-DT generates three candidate regions (one corresponding to the largest polyp and two corresponding to the small attached polyps). These results appear more reasonable than those generated by FCOS, where the two detected regions overlap significantly. Note that FCOS detected two candidates (two yellow bounding boxes) where the smaller one lies completely inside the bigger one, so according to the evaluation metric computation, FCOS generated one false positive. Our model generated three candidates (green bounding boxes), so, by the same computation, GIFCOS-DT produced two false positives (the two smaller bounding boxes). However, these two smaller boxes correspond to the two smaller components of the polyp; if the doctor had annotated them separately, they would become true positives. As a result, the detections of our method are more practical and the obtained accuracy improves.
Comparison of polyp detection by the original FCOS and our proposed GIFCOS-DT on the Kvasir-SEG dataset. The white bounding boxes represent the ground truth labeled by doctors, the yellow boxes are the detection results of FCOS, and the green bounding boxes are the detection results of our model.
2) Results on IGH_GIEndoLesion-SEG Dataset
Table 4 presents the comparative results between the original FCOS model and our proposed GIFCOS-DT model. Regarding the
3) Results on the Mixed Dataset
Two datasets (Kvasir-SEG and IGH_GIEndoLesion-SEG) are mixed to create a new challenging dataset for evaluating the proposed model. The polyps in Kvasir-SEG are considered as the
Figure 7 illustrates the ROC curves for GIFCOS-DT and the original FCOS. The ROCs generated by GIFCOS-DT exhibit higher values compared to those produced by FCOS. Furthermore, both models perform well in terms of ROC for the Kvasir polyp and gastric cancer classes. The integration of the distance transform makes the network better suited for gastrointestinal lesion data. Additionally, this ROC curve allows us to select a suitable threshold that balances the trade-off between the true alarm rate and the false alarm rate. By choosing a lower IoU threshold (around IoU = 0.2), we can effectively identify six classes (ES, PG, EC, GC, DU, and Polyp) with a very low false alarm rate (<0.1). On the other hand, the NG class requires a higher threshold (around IoU = 0.3) to achieve a balanced trade-off with a false alarm rate of approximately 0.2.
In Table 5, the results of bounding box regression remain relatively low for certain classes. Using the
Figure 8 shows that FCOS predicts bounding boxes that tend to be larger or smaller than the lesion area itself, whereas GIFCOS-DT predicts boxes closer to the lesion area. With its center-ness formulation, FCOS tries to learn from the center of the box determined by the doctor; when regressing the bounding box, it is difficult to balance the lesion area against the background area, so the predicted box tends to include a large background region. Meanwhile, GIFCOS-DT sticks to the centers of the lesion regions delineated by the doctor, so a lesion may be divided into several areas and thus several smaller boxes. In reality, these areas are close and may be connected, which is why doctors mark them as one large box. Therefore, although the current AP of GIFCOS-DT is already higher than that of FCOS, it could produce boxes that fit the lesions even better than the manual annotations suggest.
Detection results by FCOS and GIFCOS-DT, where the white bounding boxes represent the ground truth labeled by doctors, yellow boxes represent detections by FCOS, and green bounding boxes represent detections by our model.
4) Ablation Study
a: Role of Distance Transform Loss
We conducted an ablation study on the IGH_GIEndoLesion-SEG dataset because this dataset contains different types of lesions with elongated shapes. We vary the value of
Table 6 shows the
b: Comparison With Transformer-Based Models
We compared our model GIFCOS-DT with the original FCOS and DETR model [38]. Notably, while both FCOS and GIFCOS-DT rely on convolutional operators, DETR incorporates both convolutional and transformer operators. Figure 9 shows the
Building Tool for Detection of Upper GI Diseases
In this section, we present our deployment of an aided tool for the detection of upper gastrointestinal tract diseases on an edge device. First, we present our hardware setting. Then we explain how our system is connected to the endoscopy machine to capture the endoscopic images to be processed by the GIFCOS-DT model on the edge device, and we report its computational time. To ensure effective deployment of our model, we set up the environment on the Jetson AGX Xavier edge device and on a desktop computer equipped with an NVIDIA GeForce RTX 3090 GPU. This process involves configuring Python 3.7 along with essential libraries such as OpenCV and PyTorch. Our model, configured with a ResNet50-FPN backbone, recognizes classes including Reflux Esophagitis, Esophageal Cancer, H.P-negative gastritis, H.P-positive gastritis, Gastric Cancer, Duodenal Ulcer, and Polyp, from images sized
A. Hardware Equipment: Connection and Interface
To deploy the lesion detection model in a practical application, we establish connections between the pieces of equipment and create interfaces among them. In our implementation, we connect the Fujifilm Eluxeo 7000 endoscopy system with a Jetson AGX Xavier edge device. To capture the images produced by the endoscope, we utilize the AverMedia CL311-M2 grabber, which is connected to the HDMI output of the Fujifilm Eluxeo 7000; the grabber itself is attached to the Jetson Xavier through a PCIe x4 interface. The maximum input resolution that the AverMedia CL311-M2 can handle is
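On the software side, frames delivered by the grabber can be read like any other video device, for example with OpenCV as sketched below; the device index and the configured resolution and frame rate are assumptions that depend on how the capture card enumerates on the Jetson.

```python
# Hedged sketch: reading frames from the capture card on the edge device via OpenCV.
import cv2

cap = cv2.VideoCapture(0)                        # capture card assumed at /dev/video0
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)          # assumed Full HD input
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
cap.set(cv2.CAP_PROP_FPS, 30)

ok, frame = cap.read()                           # one BGR frame from the endoscope feed
```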
The real-world connection setup between the edge device and the endoscopic machine at the Institute of Gastroenterology and Hepatology (IGH), Hanoi, Vietnam.
B. Deployment of Deep Model for Lesion Detection on Edge Device
The prediction process involves detecting lesions in each image extracted from the streaming video. Depending on the capture speed, a certain number of images per second are generated. For example, if the capture speed is 30 fps, then 30 images will be produced in one second. Subsequently, the model predicts the results for these images, which are then displayed on the screen alongside the images. Typically, this prediction process can be divided into three main tasks:
Capture and load images from the video stream (T1): The initial images are read and pre-processed. At this stage, multiple images are grouped into a batch so that the model can predict the entire batch if GPU capacity allows.
Perform detection process (T2): This process takes the batch of input images and performs detection with GIFCOS-DT to output the bounding boxes of lesion candidates. This step typically takes the most time due to the dense and complex matrix computations.
Display image stream and detection results (T3): After transferring the detection results, bounding boxes are superimposed onto the image, and both are then displayed on a monitor, aiding in doctors’ investigations.
In traditional setups, tasks progress sequentially, moving from one to the next once completed. While suitable for lower-end systems, this method compromises speed and image quality due to slower processing and a reduced FPS. In environments with ample resources, parallel processing can improve prediction times. Multithreading subdivides the tasks into manageable units, each overseen by a separate thread, thereby boosting efficiency through concurrent execution, as illustrated in Figure 12. Nevertheless, within a single core, workload imbalances may arise, constraining the potential benefits. Pipelining, as depicted in Figure 13, is applied as a highly effective technique to facilitate concurrent execution of the prediction processes. Unlike conventional approaches, pipelining eliminates idle periods between stages, thereby increasing the number of predictions completed within a given timeframe. While it does not decrease the time required to predict a single image, pipelining significantly reduces the overall processing time for multiple images. Because image detection is subdivided into sequential tasks and each process is allocated an equal time slot, congestion is mitigated, thus enhancing efficiency without compromising detection quality.
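The sketch below illustrates how the three tasks T1-T3 can be overlapped with Python threads and bounded queues; the queue sizes and the detector interface (a hypothetical predict method returning boxes) are assumptions meant only to show the structure of the pipeline, not the deployed code.

```python
# Hedged sketch of the pipelined execution of T1 (capture), T2 (detect), T3 (display).
import threading
import queue
import cv2

frames_q, results_q = queue.Queue(maxsize=8), queue.Queue(maxsize=8)

def capture_task(cap):                           # T1: grab and pre-process frames
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames_q.put(frame)

def detect_task(detector):                       # T2: run GIFCOS-DT on each frame
    while True:
        frame = frames_q.get()
        boxes = detector.predict(frame)          # hypothetical inference call
        results_q.put((frame, boxes))

def display_task():                              # T3: overlay boxes and show the stream
    while True:
        frame, boxes = results_q.get()
        for (x1, y1, x2, y2) in boxes:
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.imshow("GIFCOS-DT", frame)
        cv2.waitKey(1)

# Each stage runs in its own daemon thread so capture, detection, and display overlap:
# threading.Thread(target=capture_task, args=(cap,), daemon=True).start(), and so on.
```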
Diagram of tasks when predicting consecutive images taken from endoscopy machine.
C. Runtime System Evaluation
Figure 14 illustrates some detection results obtained when deploying the developed system. The data were captured from two patients with gastric cancer and esophageal cancer, respectively. The figure shows that the system works and is able to correctly detect the frames containing lesions confirmed by doctors.
Frames showing detection of gastric cancer and esophageal cancer lesions in real-time streams captured by endoscopic machines.
1) Influence of Resolution on Recognition Accuracy
When deploying such a system, computational time is a critical factor because the capturing speed of the endoscopic machine is very high; as a result, doctors require the automatic detection to run at a correspondingly high speed. This section presents the results obtained when deploying the system in real-world scenarios. We integrated the GIFCOS-DT model simultaneously on two devices, a Jetson Xavier and a desktop computer equipped with a GeForce RTX 3090 graphics card, within the complete system from data acquisition to prediction and result display. To optimize image quality and the FPS rate, we configured the Capture Card to capture frames at
Figure 15 compares the detection capabilities of the model with high-resolution input images (
Comparison of detecting six types of lesions using the GIFCOS-DT model with High-resolution input images (
2) Framerate of Integrated Models
In the performance evaluation of experiments on edge devices, we assessed GIFCOS-DT with input image resolutions of
It is worth noting that implementing the pipeline technique enhances frame rates by approximately 2-3 times compared to sequential processing. For instance, the GIFCOS-DT model with a (
Maintaining equal time slots in the pipeline technique prevents buffer overflow. When slots are balanced, the pipeline is effective. For instance, with the same GIFCOS-DT model and input image size on the Jetson AGX Xavier, a large prediction time slot causes imbalance, resulting in a five-fold increase in latency and minimal frame rate improvement. For low-latency, high frame rate, and resource-rich scenarios, using GIFCOS-DT with a (
3) Limitation of the Proposed Model
In this study, the GIFCOS-DT model has been integrated and deployed on a device with low computational resources, an embedded Jetson Xavier, chosen for its compactness and affordability. However, to make the integrated edge device suitable for daily clinical applications, the current performance of 14.58 fps still requires further improvement. To address this issue, the model's streaming procedure can be optimized to achieve real-time processing speeds. Additionally, the quality of the endoscopic images should be carefully evaluated. In practical endoscopic examinations, the abnormality detection model may struggle with contaminating objects in the image, such as water bubbles or food particles. Furthermore, due to the movement of the endoscope, images may become blurred or capture lesions at too close a range. These real-world challenges suggest future directions, such as implementing a pre-screening procedure to evaluate image quality or designing end-to-end models that address both image quality and the detection of abnormal regions.
Conclusion
This paper introduced the novel GIFCOS-DT model with the Distance Transform loss function for detecting lesions of various shapes and sizes in endoscopic images of the gastrointestinal tract, not limited to polyp shapes as in recent studies. The GIFCOS-DT model outperformed the original FCOS model across two challenging datasets. Specifically, it achieves better results with an average