A Novel Approach for Increased Convolutional Neural Network Performance in Gastric-Cancer Classification Using Endoscopic Images

Gastric cancer is the third-most-common cause of cancer-related death in the world. Fortunately, it can be detected using endoscopy equipment. Computer-aided diagnosis (CADx) systems can help clinicians distinguish cancer from other gastric diseases more accurately. In this paper, we present a CADx system that distinguishes and classifies gastric cancer from pre-cancerous conditions, such as gastric polyps, gastric ulcers, gastritis, and bleeding. The system uses a deep-learning model, Xception, which involves depth-wise separable convolutions, to classify images as cancerous or non-cancerous. The proposed method consists of two preprocessing steps: Google's AutoAugment for data augmentation, and the simple linear iterative clustering (SLIC) superpixel and fast and robust fuzzy c-means (FRFCM) algorithms for image segmentation. Together, these approaches yield a feasible method of distinguishing and classifying cancers from other gastric diseases. Based on biopsy-supported ground truth, the area under the receiver operating characteristic curve (i.e., Az) is measured on the test sets. The Az of the proposed classification model is 0.96, an improvement of 0.06 over the Az of 0.90 obtained with the original data. Our method is fully automated, requiring no manual specification of regions of interest for testing and using a random selection of images for model training. This methodology may play a crucial role in selecting effective treatment options without the need for a surgical biopsy.


I. INTRODUCTION
According to the worldwide gastric-cancer incidence statistics released in 2018 by the International Agency for Research on Cancer, the world's leading cancer research institute, gastric cancer is the fifth most frequently diagnosed cancer and the third leading cause of cancer death. Fig. 1 shows the world gastric cancer incidence rate for 2018 [1]. As can be seen, eastern Asia and Eastern Europe have the highest incidences.
Ailments such as gastritis, gastric ulcers, and gastric bleeding have been identified as precancerous lesions that lead to gastric cancer. Because these gastric lesions are asymptomatic until they progress to cancer, early diagnosis and treatment are the best ways to reduce the incidence, and regular endoscopy is the preferred method for an accurate diagnosis. Then, the lesions can be treated and/or removed [2], [3]. (The associate editor coordinating the review of this manuscript and approving it for publication was Pasquale De Meo.)
Because the number of images taken by endoscopy equipment is ever-increasing, and the quality of the images is constantly improving, any doctor who relies on looking directly at images to make a diagnosis experiences mounting fatigue. Furthermore, different doctors use different mental heuristics based on experience. Thus, owing to subjective biases and occasional lassitude, misdiagnoses sometimes occur. Computer-aided diagnosis (CADx) systems provide doctors with objective information, which tends to increase diagnostic accuracy [4], [5]. Currently, various CADx systems are used in the study of gastroscopy. Kanesaka et al. [6] used a support vector machine to build a CADx system that facilitates early gastric-cancer detection in narrow-band endoscopy images; their system achieved an accuracy of 96.3% for cancer versus non-cancer classification. Khryashchev et al. [7] used an endoscopy image-analysis algorithm based on a convolutional neural network single-shot multibox detector [8]. They classified gastroscopy images into three classes: normal mucosa, non-cancerous pathology, and cancer; the mAP value for detection was 0.875. Li et al. [9] developed a model that distinguished early gastric cancers from non-cancerous lesion images based on the Inception-v3 network. They compared the performance of two experts, two non-experts, and a CNN, showing that the CNN's accuracy was excellent but that its specificity and positive predictive value (PPV) were lower than those of the experts. Cho et al. [10] collected endoscopic white-light images of pathologically confirmed gastric lesions and classified them into five categories: advanced gastric cancer, early gastric cancer, high-grade dysplasia, low-grade dysplasia, and non-neoplasm. They pre-trained three CNN models using a training dataset. The studies in [6], [7], [9] used narrow-band imaging (NBI) instead of white-light endoscopy (WLE) images.
WLE is the most common imaging mode for endoscopy and is therefore more broadly useful in routine practice.
CADx systems that use deep learning have developed significantly. However, they typically rely on large labeled datasets and significant expert knowledge and incur a high computational cost. Data collection is time-consuming and costly because patients require privacy and researchers require institutional review board (IRB) approval [11].
Data augmentation is a representative solution to the data-collection problem. Augmentation improves performance while mitigating deep-learning overfitting [11], [12]. Such methods augment data by transforming the colors and shapes of existing images.
Kim et al. [13] studied CADx systems and applied two segmentation methods to gastroscopy data. They trained a deep-learning model using the Inception-V3 network and classified images as normal or abnormal. They also proposed an augmented CADx system that applied 90°, 180°, and 270° rotations of the same data [14]. Chowdhary et al. [15] proposed a segmentation model for the detection of cancer in an image. They segmented the image using an intuitionistic possibilistic fuzzy c-means (IPFCM) algorithm, which combines the intuitionistic fuzzy c-means (IFCM) and possibilistic fuzzy c-means (PFCM) algorithms.
Ergashev and Cho [16] randomly applied a data augmentation method in the range of 0.9 to 1.1 for the brightness and color contrast of the images. Zhu et al. [17] performed data augmentation in which each image was rotated and flipped to expand the amount of data eightfold. Asperti and Mastronardo [18] used methods such as rotation, width and height shifting, shear, and zoom, applied randomly within set parameter ranges. Frid-Adar et al. [19] obtained augmented samples by randomly stretching images in the dataset horizontally or vertically, which became the input of an adapted deep CNN for image recognition. Zhao et al. [20] presented an automated data augmentation method for synthesizing labeled medical images. They trained independent spatial and appearance transform models to generalize a dataset of labeled images based on VoxelMorph, which learns to output a smooth displacement vector field that registers one image to another by jointly optimizing an image-similarity loss and a displacement-field smoothness term [21]. Their methods essentially generate a homeomorphic mapping domain with CNNs to change the probability distribution of datasets.
The works in [14], [16]-[20] show that applying data augmentation improves results. However, the augmentation method differed across these works. In [14], [17], the augmentation methods were chosen by the authors, and the parameters were also adjusted manually. The studies in [16], [18], [19] set the augmentation methods themselves, with parameter values applied randomly. These methods can augment numerous images, but the results can be biased or lack objectivity.
The purpose of this study is to investigate fully automated methods for classifying abnormal gastric endoscopic images into cancerous and non-cancerous lesions using a deep-CNN scheme (i.e., Xception). Two approaches are developed and applied: the SLIC superpixel and FRFCM algorithms for segmentation, and Google's AutoAugment for data augmentation. The AutoAugment method, recently released by Google as a data augmentation tool, derives parameters for optimal augmentation policies via reinforcement learning on typical datasets such as CIFAR-10 [22]. The major contributions of this work are highlighted as follows:
• A new method for gastric-lesion recognition based on a deep-learning network that can successfully classify cancer from pre-cancerous endoscopic images is presented.
• We also propose a novel method that combines image segmentation and augmentation algorithms.
• Our method is fully automated without manual specification of the region of interest for testing and random selection of images for model training.
• Experimental results demonstrate that the efficiency and effectiveness of the proposed method are superior to those of the basic deep-learning model.
The remainder of this article is organized as follows: Section II provides a detailed description of the gastroscopic imaging dataset used in the experiments and the proposed methodology. The experimental results are reported in Section III. Section IV discusses these findings. Finally, Section V presents the conclusions.

A. DATABASE
A dataset was collected with IRB approval from the files of patients who had undergone gastric endoscopic imaging at the Department of Internal Medicine at Gyeongsang National University Hospital, South Korea. All endoscopic image data were selected by internists and verified via medical examination and biopsy. A total of 470 endoscopic images were collected from 69 patients: 250 images were used for training and 220 for testing. The training set included 34 cancer and 216 non-cancer images, and the test set included 29 cancer and 191 non-cancer images. The types of lesions listed in Table 1 include ulcers, gastric cancers, submucosal tumors (SMT), and polyps. The ''others'' in Table 1 include blood clots, hematin, gastric xanthoma, etc. Example endoscopic images are shown in Fig. 2.

B. DATA AUGMENTATION
A typical problem in deep learning-based applications (e.g., medical image analysis) is that there are not enough training data to prevent the overfitting of the numerous neural-network parameters. A typical approach to handling this problem is artificially augmenting datasets using label-preserving transformations, such as random image translation, horizontal/vertical flipping, shearing, rotation, and cropping [18]. A random combination of these transformations can be used to gain a manifold increase in the amount of training data, which is useful for minimizing the parameter-overfitting problem. Recently, the Google Brain team released the AutoAugment tool, which searches for improved data augmentation policies [22]. The algorithmic process is shown in Fig. 3. A controller recurrent neural network (RNN) samples an augmentation policy S, which a child network applies to the dataset during training; the resulting accuracy R is fed back to update the controller RNN so that it converges on the best augmentation policy. By applying the optimized data augmentation policies, high accuracies have been achieved on public datasets (e.g., CIFAR-10, CIFAR-100, SVHN, and ImageNet). In our experiments, a variant of the policy learned from the CIFAR-10 data is used; its 25 sub-policies expand the training dataset 25-fold.
Each of the 25 augmentation sub-policies applies two operations in order. The operations available to the augmentation policy are Shear X/Y, Translate X/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout, and SamplePairing. Each operation carries two parameters: the probability that the operation is applied and the magnitude with which it is applied. Table 2 shows the CIFAR-10 augmentation policy we used. The first sub-policy specifies a sequential application of Invert followed by Contrast. The probability of applying Invert is 0.1, and the Invert operation does not use magnitude information. Contrast is then applied with a probability of 0.2 and, when applied, has a magnitude of 6 out of 10. These operations are applied in the specified order. There are 2.9 × 10^32 possible augmentation policies [22], from which sub-policies are randomly selected and applied to the training data. We repeat the learning and classification to find the best-performing policy.
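To make the probability/magnitude mechanics concrete, the following minimal Python sketch applies one two-operation sub-policy in the style of Table 2. The operation implementations and the magnitude-to-contrast-factor mapping are illustrative assumptions, not the actual AutoAugment code.

```python
import random
import numpy as np

def invert(img, _magnitude):
    """Invert pixel intensities; AutoAugment's Invert ignores magnitude."""
    return 255 - img

def contrast(img, magnitude):
    """Scale contrast about the mean intensity; the mapping of magnitude
    0-9 to a contrast factor here is an illustrative assumption."""
    factor = 0.1 + 1.8 * magnitude / 9.0
    mean = img.astype(float).mean()
    out = mean + factor * (img.astype(float) - mean)
    return np.clip(out, 0, 255).astype(np.uint8)

OPS = {"Invert": invert, "Contrast": contrast}

def apply_sub_policy(img, sub_policy, rng=random):
    """Apply the two ordered (operation, probability, magnitude) steps
    of one sub-policy; each step fires only with its own probability."""
    for name, prob, magnitude in sub_policy:
        if rng.random() < prob:
            img = OPS[name](img, magnitude)
    return img

# First CIFAR-10 sub-policy of Table 2: Invert (p=0.1), then Contrast (p=0.2, mag=6)
sub_policy = [("Invert", 0.1, None), ("Contrast", 0.2, 6)]
augmented = apply_sub_policy(np.zeros((8, 8), dtype=np.uint8), sub_policy)
```

In the full method, each training image is passed through all 25 sub-policies, producing the 25-fold expansion described above.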

C. SEGMENTATION METHOD: SLIC SUPERPIXEL
The SLIC superpixel is a considerably fast clustering method [24]. It is fast because clustering is performed using only the cluster information within a local area, without requiring the cluster information of the entire image. The pixels in the image are thus clustered into k superpixels using the procedure shown in Fig. 4.
When the user inputs k clusters, the SLIC superpixel algorithm uniformly arranges the cluster-center coordinates in the image at interval S, as shown in Fig. 4(a). Here, S is calculated using (1), where N is the number of pixels in the image.
The advantage of a superpixel algorithm is that it maintains cluster edge information. If the cluster-center coordinates are placed at exactly regular intervals, a center may coincide with an edge, which can cause the loss of edge information. To solve this problem, the image is converted to the CIELAB color space and each center is moved to the lowest-gradient position in its neighborhood [24]. When the central coordinate arrangement is complete, we calculate d_c and d_s for all pixels ((2) and (3)).
Here, l_j, a_j, b_j represent the CIELAB color information of the jth cluster center, and l_i, a_i, b_i represent the CIELAB color information of the ith image pixel. (x_j, y_j) is the center coordinate of the jth cluster, and (x_i, y_i) is the coordinate of the ith pixel. d_c is the CIELAB color distance between cluster c_j and pixel i within the 2S range, as shown in Fig. 4(b), and d_s is the spatial distance between cluster c_j and pixel i within the 2S range. Once d_c and d_s are obtained, the total distance D can be calculated with (4).
Here, m is the weight parameter. Pixel i is assigned to the cluster with the smallest D among the clusters in range. After all pixels i have been assigned, we update the l_j, a_j, b_j, x_j, and y_j of every cluster to optimize the clustering: l_j, a_j, b_j of the jth cluster are updated to the mean l_i, a_i, b_i of the pixels assigned to it, and (x_j, y_j) is moved to the center of gravity of those pixels. This process is repeated until the (x_j, y_j) of all clusters no longer change [24].
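Equations (1)–(4) did not survive typesetting here; in the standard SLIC formulation [24] they take the following form (a reconstruction, so indices should be checked against the original):

```latex
S = \sqrt{N / k} \qquad (1)

d_c = \sqrt{(l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2} \qquad (2)

d_s = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2} \qquad (3)

D = \sqrt{d_c^2 + \left( \frac{d_s}{S} \right)^2 m^2} \qquad (4)
```

The weight m in (4) trades off color similarity against spatial proximity: larger m yields more compact, spatially regular superpixels.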

D. SEGMENTATION METHOD: FRFCM
The fuzzy c-means (FCM) algorithm with spatial constraint (FCM_S) is an effective algorithm for image segmentation.
However, it is not robust, especially when little is known about the noise in the image. The fast and robust fuzzy c-means (FRFCM) algorithm is more advanced than other FCM algorithms. Lei et al. [25] proposed this new, fast and robust FCM framework for image segmentation by integrating local spatial and gray-level information. The objective function of FRFCM is defined as (5).
The parameter u_kl represents the fuzzy membership of gray value l with respect to cluster k, and ξ_l is the lth gray level. Utilizing the Lagrange-multiplier technique, the constrained optimization problem can be converted to the unconstrained optimization problem (6).
The parameter λ is a Lagrange multiplier. The minimization problem is therefore converted to finding the saddle point by taking the derivatives of the Lagrangian (6) with respect to the parameters u_kl and v_k. By minimizing (5), we obtain the corresponding solutions as follows.
Using (7), we can obtain a membership partition matrix U = [u_kl]_(c×q). To obtain a stable U, (7) and (8) are repeated until max{|U^(t) − U^(t+1)|} < ε is satisfied, where ε is an exceedingly small error threshold. Because u_kl^(t) is the fuzzy membership of gray value l with respect to cluster k, a new membership partition matrix U = [u_kl]_(c×N) corresponding to the original image can then be obtained, where N is the total number of pixels in the image.
To speed up the algorithm's convergence and obtain a better membership partition matrix, the median filter is applied as in (9).
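Since equations (5)–(9) were lost in extraction, the following reconstruction follows the FRFCM formulation of Lei et al. [25]; symbols match the text (u_kl, ξ_l, v_k; γ_l is the number of pixels with gray level l and q the number of gray levels), but readers should verify against the original:

```latex
J_m = \sum_{l=1}^{q} \sum_{k=1}^{c} \gamma_l \, u_{kl}^{m} \, \lVert \xi_l - v_k \rVert^2 \qquad (5)

\tilde{J}_m = \sum_{l=1}^{q} \sum_{k=1}^{c} \gamma_l \, u_{kl}^{m} \, \lVert \xi_l - v_k \rVert^2
            - \lambda \left( \sum_{k=1}^{c} u_{kl} - 1 \right) \qquad (6)

u_{kl} = \frac{\lVert \xi_l - v_k \rVert^{-2/(m-1)}}
              {\sum_{j=1}^{c} \lVert \xi_l - v_j \rVert^{-2/(m-1)}} \qquad (7)

v_k = \frac{\sum_{l=1}^{q} \gamma_l \, u_{kl}^{m} \, \xi_l}
           {\sum_{l=1}^{q} \gamma_l \, u_{kl}^{m}} \qquad (8)

U' = \mathrm{med}(U) \qquad (9)
```

where med(·) denotes median filtering of the membership partition. Operating on the q gray levels rather than the N pixels is what makes the update loop fast.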

E. DEEP LEARNING MODEL: XCEPTION NETWORK
Xception is a CNN model that was released in 2016. The network builds on the Inception module to reduce connectivity between nodes, further separating the search for cross-channel relationships from the search for local spatial information. Fig. 5 shows an extreme version of the Inception module that separately applies 1 × 1 and 3 × 3 convolution operations to the channels of the resulting feature map; this module thus calculates a feature map per channel. Xception uses the depth-wise separable convolution created by modifying this operation. The depth-wise separable convolution performs a convolution on each channel and then applies a 1 × 1 convolution to the result. As shown in Fig. 6(a), conventional convolutions create each feature map from all channels and local information together, whereas the depth-wise separable convolution, shown in Fig. 6(b), creates one feature map per channel and then performs a 1 × 1 convolution to adjust the number of output feature maps [26]. The 1 × 1 convolution is a pointwise convolution. The differences from the extreme version of the Inception module are the order of operations and the presence or absence of an intermediate nonlinear activation function.
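As a concrete illustration of the operation described above, here is a minimal NumPy sketch of a depth-wise separable convolution (stride 1, no padding); it is a didactic reimplementation, not Xception's actual code:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Depth-wise separable convolution sketch.

    x:            (H, W, C_in)  input feature map
    depthwise_k:  (kH, kW, C_in) one spatial kernel per input channel
    pointwise_w:  (C_in, C_out)  1x1 (pointwise) convolution weights
    """
    H, W, C = x.shape
    kH, kW, _ = depthwise_k.shape
    oH, oW = H - kH + 1, W - kW + 1
    # Depthwise step: each channel is convolved with its own kernel only,
    # so no cross-channel information is mixed here.
    dw = np.zeros((oH, oW, C))
    for c in range(C):
        for i in range(oH):
            for j in range(oW):
                dw[i, j, c] = np.sum(x[i:i+kH, j:j+kW, c] * depthwise_k[:, :, c])
    # Pointwise step: the 1x1 convolution mixes information across channels
    # and sets the number of output feature maps.
    return dw @ pointwise_w  # shape (oH, oW, C_out)
```

For a kH × kW kernel with C_in input and C_out output channels, this factorization uses kH·kW·C_in + C_in·C_out weights instead of kH·kW·C_in·C_out for a conventional convolution, which is the source of Xception's efficiency.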
In a previous study [27], we found that Xception produced the best results in gastric medical-image classification among four deep-learning models: Xception, Inception-V3, ResNet-101, and Inception-ResNet-V2. We also found that the CIFAR-10 augmentation policy was the best for this classification among the ImageNet, SVHN, and CIFAR-10 policies. Based on these results [27], we selected the Xception network for this study.

F. EVALUATION METHODS
As a pre-processing step, images are augmented and segmented using AutoAugment and the SLIC superpixel, respectively. First, each image is augmented into 25 new images using the CIFAR-10 AutoAugment policy. Second, each of these 25 images is segmented into nine areas using the SLIC superpixel and FRFCM algorithms. Ground-truth information for the segmented areas is provided by the internist. Fig. 7 shows the process of image augmentation and segmentation. During the test step, we use a score to represent the classification results. If the classification model classifies at least one-third of the segmented areas as cancer, the entire image is classified as cancer. The size of gastric lesions can vary significantly: a lesion can cover all of the segmented regions, only half of them, or just one. Therefore, classification is conducted using (10), and the classification threshold value was set through experimentation:

(Number of cancerous segmented areas) / (Total number of segmented areas) ≥ 1/3. (10)

Fig. 8 presents a flowchart of the proposed system. First, the training data are augmented and segmented to generate the classification model. Then, the segmented test data are fed to the model to determine whether each image is cancerous or non-cancerous.
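The image-level decision rule in (10) can be sketched in a few lines of Python (the function name and label encoding below are our own):

```python
def classify_image(patch_labels, threshold=1/3):
    """Image-level decision rule of Eq. (10): flag the whole image as cancer
    when at least `threshold` of its segmented areas are classified as cancer.

    patch_labels: per-segment predictions, 1 = cancer, 0 = non-cancer.
    """
    fraction = sum(patch_labels) / len(patch_labels)
    return "cancer" if fraction >= threshold else "non-cancer"
```

With nine segments per image, three or more cancer-labeled segments are enough to flag the image, which keeps small lesions (covering only a few patches) detectable.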

III. RESULTS
In the field of medical image analysis, a small amount of training data is always a restrictive concern. Lesions of various sizes can be found in endoscopic images. Thus, a model lacking augmentation and segmentation is insufficient for distinguishing and classifying gastric cancer from other gastric diseases. We therefore proposed using AutoAugment and two segmentation algorithms to create a CADx system for gastric-lesion diagnosis, as shown in Fig. 8. In the models with augmentation, a total of 25 augmentation sub-policies are applied, each performing two operations in order. In the models with segmentation, we used the SLIC superpixel and FRFCM to segment images, and training was performed based on the internist's ground truth for the segmented areas. The number of segmentation areas per image was set to nine. After segmentation, nine images are generated from one image; each retains the original pixel values in one segmented region, with all other regions set to zero. Fig. 9 shows the segmentation results of the two segmentation methods. Fig. 9(a) is the original endoscopy image, in which the blue line indicates the lesion area, i.e., the ground truth defined by an internist. Fig. 9(b) and (c) are images with SLIC superpixel and FRFCM segmentation applied, respectively. Clearly, the results of the two segmentation methods differ. The SLIC superpixel restricts the clustering space during segmentation, but FRFCM has no area limit; therefore, its clustering patches can be very long and wide and may even be split into widely separated parts.
The performance comparisons of the two algorithms are shown in Figs. 10 and 11, which present the total results of gastric-lesion classification using the original data, augmentation only, segmentation only, and both augmentation and segmentation. The area under the receiver operating characteristic (ROC) curve was 0.90 for the original data, 0.92 with augmentation only, 0.94 with SLIC superpixel segmentation only, 0.92 with FRFCM segmentation only, 0.96 with augmentation and SLIC superpixel segmentation, and 0.94 with augmentation and FRFCM segmentation. As shown by the ROC curves, the proposed algorithm performed the best.
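For readers unfamiliar with Az, it equals the probability that a randomly chosen cancer image receives a higher score than a randomly chosen non-cancer image. A minimal NumPy sketch of that rank-based estimate follows (our own helper, not the evaluation code used in the paper):

```python
import numpy as np

def area_under_roc(scores, labels):
    """Empirical Az via the rank (Mann-Whitney) statistic: the probability
    that a randomly chosen positive scores above a randomly chosen negative.
    Ties count as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]   # cancer scores
    neg = scores[labels == 0]   # non-cancer scores
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

An Az of 0.96 therefore means the model ranks a cancer image above a non-cancer image 96% of the time.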
The proposed methods were executed on a 14-core 2.20-GHz Xeon CPU with 128 GB of memory running the CentOS 7.7 Linux operating system. Two NVIDIA Tesla T4 GPUs with 16 GB of device memory were used. The network architecture was implemented in MATLAB 2019a using the Deep Learning Toolbox. On the original data, the program took ∼1 h to train for 100 epochs; with augmentation, ∼1 d for 100 epochs; with segmentation, ∼3 h; and with both augmentation and segmentation, ∼10 d for 47,000 iterations. The mini-batch size was 60, and the initial learning rate was 0.001.

IV. DISCUSSION
The results show that the models trained using augmentation and/or segmentation performed better than the one trained without them. This offers a solution to one of the biggest problems of deep learning in medical image analysis, which relies heavily on large amounts of data. In Figs. 10 and 11, the Az values of the two segmentation algorithms were higher than or equal to those of augmentation. One possible reason is that transferring lesion-location information to the network via segmentation was more effective. Figs. 10 and 11 also show that the results of the SLIC superpixel are higher than those of FRFCM. In this study, the SLIC superpixel is clearly the best overall performer. It is the most memory-efficient and the fastest method, segmenting a 471 × 407 image in 0.29 s, whereas FRFCM takes 11.32 s. It shows outstanding boundary adherence, whereas the segmented shapes produced by FRFCM spread widely across the image. The segmented shape of the SLIC superpixel is also more focused on lesions than that of FRFCM. As shown in Fig. 9, the SLIC superpixel, with its restrictions on the clustering area, can cover a lesion with about four patches, whereas FRFCM requires at least five. In gastroscopy images, the SLIC superpixel method was more suitable for detecting cancers because they spread widely from a single site rather than from many sites. This could be one reason why the SLIC superpixel performed better than FRFCM. The SLIC superpixel is also easy to use, its only parameter being the number of required superpixels.
The classification results obtained after applying the segmentation methods are better than those obtained using only the original image data, demonstrating that learning through area labeling is effective. This also shows potential for future detection studies, which could provide lesion-region information using the probabilities of each patch.
We have proposed a performance improvement for computer-aided diagnosis (CADx) systems that uses white-light gastric endoscopic images instead of narrow-band imaging (NBI) magnification. NBI is a newly developed optical technology applied in the area of gastrointestinal endoscopy. The technique highlights blood vessels by using wavelengths that are readily absorbed by hemoglobin, maximizing contrast. While adding NBI images may improve the performance of the proposed system, NBI is still used only as an auxiliary tool in hospitals. We plan to study comparative models using NBI images with different deep-CNN networks.
Google's AutoAugment provides the best data augmentation policy for a given dataset using reinforcement learning. A subset of policies obtained by AutoAugment, which were excellent for the CIFAR-10 dataset, was applied to improve the performance of our gastric-cancer classification task. However, substantial computational resources are required to produce optimal data augmentation for a given training dataset. Thus, a fast and flexible data augmentation solution based on Bayesian optimization instead of reinforcement learning should be researched in the future. In addition to methods of finding and applying optimal augmentation policies, a generative adversarial network (GAN), which is a model architecture for training a generative model, is under investigation. We also plan to study comparative models using other AutoAugment policies with different deep-CNN networks.

V. CONCLUSION
Deep CNNs are powerful and useful algorithms that commonly work well when trained using a large amount of data. The shortage of labeled training data in gastric-cancer research is a major problem that limits the application of CNNs in the field of medical image analysis. In this paper, we proposed a CADx system that can distinguish and classify gastric cancers from other gastric diseases. The system used the Xception network, which involves depth-wise separable convolutions. The proposed method used two algorithms: Google's AutoAugment for augmentation and the SLIC superpixel and FRFCM algorithms for image segmentation. Analyzing the classification performance of the system, we found that augmentation and segmentation enabled the classification model to perform better because the applied data augmentation policy prevented overfitting. Additionally, the segmented images, expertly identified as cancer or non-cancer by the internist, helped the training focus on the features of the lesion. In this study, we achieved an area under the ROC curve of 0.96, which is 0.06 higher than that achieved when the two methods were not applied, indicating that it is meaningful to train on the divided areas of augmented images. This methodology may play a vital role in selecting effective treatment options and predicting survival without the need for a surgical biopsy.
Future work includes verifying the results of this study by applying the CADx system to a larger, independent dataset, expanding the deep-learning model, and comparing other segmentation and augmentation methods. The major questions of the impact of the number of segments and of the classification threshold for labeling segmented areas as cancer will also need to be addressed in future studies.