Semi-supervised Bladder Tissue Classification in Multi-Domain Endoscopic Images

Objective: Accurate visual classification of bladder tissue during Trans-Urethral Resection of Bladder Tumor (TURBT) procedures is essential to improve early cancer diagnosis and treatment. During TURBT interventions, White Light Imaging (WLI) and Narrow Band Imaging (NBI) techniques are used for lesion detection. Each imaging technique provides different visual information that allows clinicians to identify and classify cancerous lesions. Computer vision methods that use both imaging techniques could improve endoscopic diagnosis. We address the challenge of tissue classification when annotations are available in only one domain, in our case WLI, and the endoscopic images form an unpaired dataset, i.e. there is no exact equivalent for every image in both the NBI and WLI domains. Method: We propose a semi-supervised Generative Adversarial Network (GAN)-based method composed of three main components: a teacher network trained on the labeled WLI data; a cycle-consistency GAN to perform unpaired image-to-image translation; and a multi-input student network. To ensure the quality of the synthetic images generated by the proposed GAN, we perform a detailed quantitative and qualitative analysis with the help of specialists. Conclusion: The overall average classification accuracy, precision, and recall obtained with the proposed method for tissue classification are 0.90, 0.88, and 0.89 respectively, while the same metrics obtained in the unlabeled domain (NBI) are 0.92, 0.64, and 0.94 respectively. The quality of the generated images is reliable enough to deceive specialists. Significance: This study shows the potential of using semi-supervised GAN-based bladder tissue classification when annotations are limited in multi-domain data. The dataset is available at https://zenodo.org/record/7741476#.ZBQUK7TMJ6k


I. INTRODUCTION
Urinary tract cancer comprises different types of lesions ranging from benign tumors to aggressive neoplasms with high mortality. With 164,000 patients reported in 2021, it is among the top 10 most common cancers worldwide [1]. Muscle Invasive Bladder Cancer originates on the inner surface of the bladder and is more likely to metastasize than Non-Muscle Invasive Bladder Cancer (NMIBC) [2]. The gold standard for Bladder Cancer (BC) diagnosis is cystoscopy. If abnormal tissue is found, patients should undergo Trans-Urethral Resection of the Bladder Tumor (TURBT) [3]. This procedure consists of the insertion of an endoscope in the urinary tract and the removal of visible tumor lesions.
The World Health Organization (WHO) has defined a stratification of urothelial carcinomas according to their propensity for invasion, which can be generalized into two main classes: High-Grade Carcinoma (HGC) and Low-Grade Carcinoma (LGC) [4]. Visual classification of BC is a challenging task. The shapes of high-grade and low-grade tumors are quite similar in some cases, and the visual difference between healthy tissue and non-tumor lesions is not trivial [2]. In fact, definitive diagnosis, staging, and grading of cancer are only possible after histological analysis of the resected tissue [5].
The use of imaging techniques other than White Light Imaging (WLI), such as Narrow Band Imaging (NBI), can improve the differentiation of tumorous lesions from normal tissue [6], [7]. Samples of different bladder tissue in both image domains are depicted in Fig. 1. In NBI, a white light source is filtered into two narrow bands of 415 nm and 540 nm. At these wavelengths, the hemoglobin reflection spectra present a global and a local maximum respectively [8]. This increases the contrast between the surface mucosa, the capillaries, and the blood vessels in the submucosa, therefore improving bladder cancer diagnosis by highlighting visual structures that are hard to notice when using only WLI [9]. Typically, during TURBT procedures an initial inspection using WLI is carried out. Subsequently, in a second inspection, the anatomical structures deemed suspicious are examined using NBI for confirmation. In some cases, the use of NBI by itself could be more efficient than WLI in the detection of NMIBC [9].
Despite the current advances in optical methods and their implementation in new devices, miss rates are reported to be between 10 and 20% [10]. The clinical interest in endoscopic tissue classification is related to the actions to be performed during surgery, as well as the follow-up treatment. The development of computer-aided diagnosis (CAD) systems for BC classification could help clinicians reduce current misclassification rates, which are related to incomplete excision of tumorous tissue and to cancer recurrence, with reported rates of 75% [11]. For example, identifying a high-grade tumor in real time could lead to the resection of a wider and deeper section of the tissue to avoid future recurrences.
In recent years, Deep-Learning (DL)-based methods have shown promising results in the analysis of endoscopic images. Most of the currently available datasets for endoscopic image analysis focus on colonoscopy [12], [13] and consist mainly of WLI data. Recently, a few studies that also include NBI data have stressed the advantage of using multi-domain data in the colonoscopy scenario [14]-[16].
In the case of the urinary system, only a few studies have addressed the task of tissue classification from endoscopic images [17]-[20]. Except for the study presented in [20], where Blue Light (BL) imaging is used, these studies use only WLI. Multi-domain image classification implies several challenges, especially when the data and annotations are not evenly distributed across the different domains and some of the classes are under-represented [21].
In the specific case of TURBT, some of these challenges include the fact that it is visually difficult to differentiate between lesions, so the diagnosis is inconclusive [22]. Furthermore, because multi-imaging endoscopes can collect only one imaging type at a time, it is not possible to have equivalent pairs of WLI and NBI images. Usually, an initial examination of the bladder is carried out using WLI, and the lesions and anatomical structures deemed to be potentially cancerous tissue are examined again with NBI when this modality is available, which is not always the case. An additional challenge is related to the imbalance of data in terms of the different classes and types of tissue. Non-Suspicious Tissue (NST) usually receives less attention during interventions, therefore less image data is collected from it than from lesions, in both WLI and NBI. Furthermore, non-cancerous lesions such as cystitis or other types of bladder inflammation appear less commonly in the initial inspection during TURBT. All this contributes to the fact that most of the datasets (including ours) are imbalanced not only in terms of image domains but also in terms of tissue classes.
In this work, we focus on the task of bladder tissue classification in multi-domain images from TURBT procedures, with special emphasis on the fact that annotations exist in only one of these image domains. Considering that most state-of-the-art computer vision methods are sensitive to changes in domain [23], and the specific challenges existing in endoscopic image classification, we propose a GAN-based semi-supervised approach which comprises three main components: 1) a teacher network trained on the labeled WLI images; 2) a cycle-consistency GAN to perform the unpaired image-to-image translation; and 3) a multi-input multi-domain image classifier trained in a semi-supervised way. We show that with our method it is possible to obtain satisfactory classification results even when annotations from one domain are not available.
To ensure that the images produced with the proposed translation network are consistent with the structural and pathological features of the source domain, we perform a detailed quantitative and qualitative analysis of the generative models. Additionally, we validate their quality with the help of specialists familiar with the TURBT procedure. In order to allow future research in the task of bladder tissue classification, and to ease benchmarking of future methods, we will release the dataset upon publication.

II. RELATED WORK

A. Tissue Classification in Endoscopy
The analysis of endoscopic images has been developing rapidly in recent years thanks to the availability of new public datasets [24], [25]. In the specific task of tissue classification, different models and techniques have been proposed with a special focus on the gastrointestinal (GI) tract. The existing methods range from feature extraction models [26], [27], to the use of transfer learning and pre-trained CNNs [28], [29], and to more complex methods that target the specific challenges present when working with GI endoscopic images [30]-[33].
In the case of the bladder, Ikeda et al. [19] proposed the use of 2-step transfer learning by first fine-tuning their models on 8728 gastroscopic images, and then re-training the models on 2102 cystoscopy WLI images, using the GoogLeNet model for the task of binary classification of images with and without NMIBC. Yang et al. [18] compared the use of 3 different Convolutional Neural Networks (CNNs) as well as the platform EasyDL. The models used were LeNet, AlexNet and GoogLeNet. Their dataset includes 1200 cystoscopy images with cancer and 1150 without. Shkolyar et al. [17] proposed CystoNet, a CNN for bladder cancer detection and binary classification. In their study, they used 2335 WLI frames of normal or benign bladder mucosa and 417 frames of histologically confirmed papillary urothelial carcinoma to train the network. In [34] the use of a Generative Adversarial Network (GAN) is proposed to perform data augmentation, after which AlexNet and VGG16 are trained with the real and augmented data. In total, 202 images from a Confocal Laser Endomicroscope were used in their experiments. In [20] Ali et al. proposed the use of pre-trained models for the task of cancer malignancy, grading, and invasiveness classification on BL photodynamic cystoscopy images. The dataset was composed of 261 BL images and the pre-trained models used were VGG16, ResNet-50, MobileNetV2, and InceptionV3. On top of the pre-trained models, a shallow network was added to perform the classification.

B. Image to Image Translation
Since their introduction, GANs have become an outstanding method for different tasks in DL applications. GANs have been used for different purposes on endoscopic images, such as the generation of synthetic images to improve polyp detection, or the construction of SLAM models to predict depth maps in colonoscopy [35], [36].
One of the applications of GANs is image-to-image translation. This task can be summarized as the mapping of an image in domain A to another domain B. In our case, these domains correspond to NBI and WLI. These types of models have been applied in diverse biomedical and endoscopic image tasks such as the translation between optical colonoscopy images and virtual colonoscopy images [37], the mapping between cadaveric and live images [38], and the adaptation between phantom images and real endoscopic videos, among others [39], [40].
Using image-to-image translation with a focus on classification has been previously explored in other fields such as emotion classification, melanoma classification, and breast mass classification, among others. In this regard, Yoo et al. [41] proposed a joint learning approach using a mini-batch strategy and adaptive fade learning to use the generated images in the classifier, with application to visually similar data. Likewise, Zhang et al. [42] and Mabu et al. [43] proposed the use of cycle consistency for classification in retinal pathology identification and opacity classification in CT scans respectively.

C. Semi-Supervised Image Classification
A common characteristic of medical image datasets is the lack of large annotated sets [44]. During the last few years, semi-supervised learning methods have progressed as a good alternative to leverage the large amount of unlabeled data. One of the most common paradigms of semi-supervised learning is the use of Teacher-Student Networks (TSN) [45]. In this type of model, a teacher network is trained on the labeled data, and a student network is trained on the unlabeled data using the predictions given by the teacher. Training in semi-supervised mode allows the student model to learn features from unlabeled datasets [46].
In the endoscopic scenario, few studies have been carried out using semi-supervised learning. Du et al. [47] implemented a semi-supervised contrastive learning method for esophageal disease classification in a small dataset. Golhar et al. [48] proposed the use of an unsupervised jigsaw learning method for GI lesion classification, obtaining an improvement in accuracy of 9.8% with respect to supervised methods. Guo et al. [49] proposed the use of a combination of a discriminative angular loss and a Jensen-Shannon divergence loss for semi-supervised wireless-capsule endoscopic image classification. Shi et al. [50] implemented a TSN for the 3D reconstruction of stereo endoscopic images.
Recently, semi-supervised GAN-based models have been proposed for image classification in different fields such as natural images and hyper-spectral image classification [51]-[54]. However, in the field of endoscopic images it remains an unexplored topic.
Unlike the studies presented in [55]-[59], where cycle-consistency translation has been implemented as a way of augmenting the datasets, we use image translation inside a semi-supervised training loop to improve the classification performance in the unlabeled domain. Furthermore, existing GAN-based semi-supervised methods are mainly focused on the classification of images of the same domain.
In this work, we propose a synergic semi-supervised GAN-based method that not only enables the exploitation of unlabeled data but also performs image translation to alleviate the dataset's domain imbalance. This allows the proposed network to achieve better generalization even in an image domain where labels are not available.

III. METHODS
Our overall goal is to improve tissue classification of endoscopic bladder images when labels are limited to only one domain, and there is no identical equivalent for every image in each domain. In our case, the endoscopic images are available in the WLI and NBI domains, and labels exist only for the WLI images.

A. Problem Statement
The proposed method consists of three main components: 1) a cycle-consistency translation network to translate every image in the dataset and obtain equivalent paired images in both domains (NBI and WLI); 2) a teacher network trained on the labeled WLI data; and 3) a multi-input multi-domain classifier trained as the student network in a TSN semi-supervised way. A schematic of the proposed model is depicted in Fig.

B. Cycle-consistency Translation Network
The unpaired image-to-image translation network is a generative adversarial network based on the CycleGAN architecture [60]. Two generators $G_{AB}$ and $G_{BA}$ are trained to learn the mappings between the domains $A$ = WLI and $B$ = NBI, such that $G_{AB}: A \rightarrow B$ and $G_{BA}: B \rightarrow A$. $D_A$ and $D_B$ are the two discriminators trained to distinguish between the real and fake images of each domain. The proposed model uses three main losses: the adversarial loss $\mathcal{L}_{adv}$, the cycle-consistency loss $\mathcal{L}_{cyc}$, and a similarity loss $\mathcal{L}_{sim}$.
The cycle loss $\mathcal{L}_{cyc}$ is defined over the indices $p, q$, which represent the domain of the image and the domain to which it is translated. The adversarial loss $\mathcal{L}_{adv}$ is defined for each generator $G_{pq}$ and discriminator $D_p$. To preserve the fine-grained details, such as the capillaries and inner blood vessels, that are related to the intrinsic pathology of each image domain and are an essential visual cue for diagnostic assessment, we propose adding a similarity loss $\mathcal{L}_{sim}$ to the cycle-consistency network. This loss is defined over $x_A \in A$ and $x_B \in B$, the images from the $A$ and $B$ domains, where $i$ is the index over a set of $N$ images, and $\hat{x}_A$ and $\hat{x}_B$ are the images translated by the generators. $F(x, \hat{x})$ is the structural similarity (SSIM) between images $x$ and $\hat{x}$ proposed in [61] as:

$$F(x, \hat{x}) = \frac{(2\mu_x \mu_{\hat{x}} + c_1)(2\sigma_{x\hat{x}} + c_2)}{(\mu_x^2 + \mu_{\hat{x}}^2 + c_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + c_2)}$$

where $\sigma_{x\hat{x}}$ is the covariance between $x$ and $\hat{x}$:

$$\sigma_{x\hat{x}} = \frac{1}{m-1} \sum_{j=1}^{m} (x_j - \mu_x)(\hat{x}_j - \mu_{\hat{x}})$$

$m$ is the number of pixels; $x_j$ and $\hat{x}_j$ are the $j$th pixels of $x$ and $\hat{x}$ respectively; $\mu_x$, $\mu_{\hat{x}}$ and $\sigma_x$, $\sigma_{\hat{x}}$ are the mean intensities and standard deviations of $x$ and $\hat{x}$; and $c_1$ and $c_2$ are stabilizing constants that avoid singularities when $\mu_x^2 + \mu_{\hat{x}}^2 \approx 0$ and $\sigma_x^2 + \sigma_{\hat{x}}^2 \approx 0$ respectively. The overall objective function of the generative network is then defined as

$$\mathcal{L}(G_{AB}, G_{BA}, D_A, D_B) = \mathcal{L}_{adv}(G_{AB}, D_A) + \dots$$

where $\lambda_i$ are the hyper-parameters that balance the impact of the losses. The generators are trained to minimize the overall function.
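As an illustration of the similarity term, the following minimal sketch computes the global SSIM statistic $F(x, \hat{x})$ from the means, variances, and covariance described above, using plain Python on flattened images. Turning it into a loss as 1 − SSIM averaged over the image set, the default values of `c1` and `c2` (the common choice for intensities in [0, 1]), and the function names are assumptions made for illustration; the text only states that $\mathcal{L}_{sim}$ is built on $F$.

```python
def ssim_global(x, x_hat, c1=1e-4, c2=9e-4):
    """Global SSIM between two images given as flat lists of intensities in [0, 1].

    c1 and c2 are the stabilizing constants; the defaults are an assumption.
    """
    if len(x) != len(x_hat) or len(x) < 2:
        raise ValueError("images must have the same size and at least 2 pixels")
    m = len(x)
    mu_x = sum(x) / m
    mu_h = sum(x_hat) / m
    # Unbiased variances and covariance over the m pixels.
    var_x = sum((a - mu_x) ** 2 for a in x) / (m - 1)
    var_h = sum((b - mu_h) ** 2 for b in x_hat) / (m - 1)
    cov = sum((a - mu_x) * (b - mu_h) for a, b in zip(x, x_hat)) / (m - 1)
    numerator = (2 * mu_x * mu_h + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_h ** 2 + c1) * (var_x + var_h + c2)
    return numerator / denominator


def similarity_loss(originals, translations):
    """Assumed form of the similarity loss: 1 - mean SSIM over N image pairs."""
    scores = [ssim_global(x, xh) for x, xh in zip(originals, translations)]
    return 1.0 - sum(scores) / len(scores)


image = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2]
inverted = [1.0 - v for v in image]
print(ssim_global(image, image))       # identical images give the maximum score
print(similarity_loss([image], [inverted]))
```

In a training loop this quantity would be computed on tensors by the deep-learning framework so that gradients flow to the generators; the list-based version only makes the arithmetic explicit.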

C. Semi-supervised Classification
Initially, the teacher model $C_T$ is trained on WLI images in a fully supervised way. This can be seen as disconnecting the branch that goes from the input image $x$ to the Cycle-Consistency Translation Network in Fig. 2, and training the network to optimize eq. 7, substituting the pseudo-labels $\hat{y}^{T}_{i}$ with the labels $y_{A_i}$ from set $X_A$. Afterward, the student model $C_S$ is trained on the labeled and unlabeled data using the predictions $\hat{y}^{T}$ obtained from the teacher. The student network is a multi-input classifier that takes 3 images as input, as depicted in Fig. 2-(C). The first one, $x$, is the original image from either the WLI ($x_A$) or NBI ($x_B$) domain; the other two images are the ones generated by the generators $G_{AB}$ and $G_{BA}$ respectively. In the branch that takes $x$ as input, random data augmentation operations are applied, which include random crop, random rotation, and flipping. Backbone networks $b_1$, $b_2$, and $b_3$ are used to extract the features of each of the 3 input images. In our case, we used ResNet-101 trained on ImageNet as the backbone. The features extracted by each of the backbones are processed separately using a shallow network composed of 3 Fully Connected (FC) layers. The outputs from these layers are concatenated, and the class prediction is then performed in the final layer. The classifier was trained to optimize the categorical cross-entropy loss defined as:

$$\mathcal{L}_{CE} = -\sum_{i} \hat{y}^{T}_{i} \log(\hat{y}_{i}) \qquad (7)$$

where $\hat{y}_i$ is the predicted output from the student model, $\hat{y}^{T}_{i}$ is the corresponding pseudo-label provided by the teacher network, and $i$ refers to the index over the classes.
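The pseudo-label objective can be made concrete with a small plain-Python sketch of the categorical cross-entropy between the teacher's pseudo-label and the student's predicted class probabilities. Whether the teacher output is hardened to a one-hot vector or used as a soft distribution is not stated in the text; the `harden` helper, the `eps` clamp, and the function names are assumptions for illustration.

```python
import math

CLASSES = ["HGC", "LGC", "NTL", "NST"]  # the four tissue classes of the dataset


def harden(teacher_probs):
    """Turn teacher probabilities into a one-hot pseudo-label (assumed choice)."""
    best = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return [1.0 if k == best else 0.0 for k in range(len(teacher_probs))]


def pseudo_label_cross_entropy(student_probs, pseudo_label, eps=1e-12):
    """Categorical cross-entropy: -sum_i pseudo_label_i * log(student_probs_i)."""
    if len(student_probs) != len(pseudo_label):
        raise ValueError("prediction and pseudo-label must have the same length")
    # eps avoids log(0) when the student assigns zero probability to a class.
    return -sum(t * math.log(max(s, eps))
                for s, t in zip(student_probs, pseudo_label))


teacher_output = [0.10, 0.70, 0.10, 0.10]   # teacher favors LGC
student_output = [0.25, 0.25, 0.25, 0.25]   # untrained, uniform student
label = harden(teacher_output)
print(CLASSES[label.index(1.0)], pseudo_label_cross_entropy(student_output, label))
```

For the supervised teacher stage, the same function applies with the ground-truth WLI label in place of the pseudo-label.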

D. Dataset
For this study, endoscopic videos from 23 patients undergoing TURBT were collected, as well as the respective histopathological analysis of the resected lesions. The matching between the visual data and the histological results was done with the aid of an expert surgeon by analyzing the videos frame by frame. The sections of the bladder from which lesions were resected during the surgical intervention were then identified. To avoid the ambiguity of having multiple lesions of multiple types, only the frames in which individual lesions appeared were used in the dataset. This procedure was performed on all the WLI video clips as well as on the NBI video data available for 3 patients. In total, 4 classes were defined. Taking into consideration the general classification of BC as defined in [2] by the WHO and the International Society of Urological Pathology (ISUP), two categories were considered for cancerous tissue: Low-Grade Cancer (LGC) and High-Grade Cancer (HGC). Additionally, 2 further categories were considered: No Tumor Lesion (NTL), which comprises cystitis caused by infections or other inflammatory agents, and Non-Suspicious Tissue (NST). The detailed statistics of the dataset are shown in Table I.
The videos were acquired at the European Institute of Oncology (IEO) in Milan, Italy. Each patient signed an informed consent document approved by the IEO and in accordance with the Helsinki Declaration. No personal data was recorded. To determine whether the use of more data helps to achieve better generalization when training the GAN networks, we used additional data from the datasets presented in [14], [28], which contain endoscopic images from colonoscopy in the NBI and WLI domains, and in [62], which contains unlabeled data from TURBT, also in the NBI and WLI domains.

E. Model Implementation
The model was trained in three steps. First, the cycle-consistency GAN was trained for 150 epochs with an initial learning rate of 2e−4 and a batch size of 1. The λ hyperparameters were set to λ1 = λ2 = 2.0 and λ3 = λ4 = 1.0. The second step consisted of training the teacher classifier using the labeled dataset $X_A$. Once the GAN model and the teacher network were trained, the multi-input classifier was trained with an initial learning rate of 1e−5 and a batch size of 32. The models were implemented using TensorFlow 2.5 in Python 3.6 and deployed on an NVIDIA GeForce GTX 1080 GPU. The training of the classifiers was repeated 10 times for each of the different experiments carried out in this study.
For performance benchmarking of the classifiers, a hold-out strategy was used: 4 randomly chosen patient cases were held out as the test dataset. The rest of the dataset was divided randomly in a 75/25 ratio for training/validation. In the case of the GAN models, only the training dataset used for supervised classification was used during the training of the different combinations described in Table II. For the semi-supervised training, apart from the labeled WLI images and unlabeled NBI images, all NBI cystoscopy images described in [62] were added to the training dataset. The test dataset for the semi-supervised task remained the same as the one used to test the performance of the teacher model.
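The hold-out protocol can be sketched as follows. The sketch holds out whole patients for testing and then splits the remaining frames 75/25 at random; the text does not say whether the 75/25 split is done per frame or per patient, so the frame-level split, the fixed seed, and all names are assumptions for illustration.

```python
import random


def patient_holdout_split(frames_by_patient, n_test_patients=4,
                          val_fraction=0.25, seed=0):
    """Hold out whole patients for testing, then split remaining frames 75/25.

    frames_by_patient maps a patient id to the list of its frames.
    Returns (train, val, test, test_patients).
    """
    rng = random.Random(seed)
    patients = sorted(frames_by_patient)
    test_patients = set(rng.sample(patients, n_test_patients))
    test = [f for p in patients if p in test_patients
            for f in frames_by_patient[p]]
    rest = [f for p in patients if p not in test_patients
            for f in frames_by_patient[p]]
    rng.shuffle(rest)  # assumed frame-level random split of the remaining data
    n_val = round(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test, test_patients


# Toy example: 23 patients with 8 frames each, frames tagged by patient id.
toy = {f"p{i:02d}": [(f"p{i:02d}", j) for j in range(8)] for i in range(23)}
train, val, test, held_out = patient_holdout_split(toy)
print(len(train), len(val), len(test), sorted(held_out))
```

Holding out complete patients for the test set keeps frames of the same lesion from appearing in both training and test data.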

F. Evaluation protocol
Each of the different modules that comprise the proposed method was evaluated separately, and the best components of each one were chosen.
In contrast with other DL models that are trained to minimize a loss function, GAN models are trained to converge to an equilibrium between the generator and the discriminator networks. For this reason, there is no objective loss function to train this type of model and compare their performance objectively [51]. However, some quantitative techniques have been proposed to assess the performance of GAN models [63].
Table II legend: DA: our dataset. DB: dataset described in [62]. DC: dataset described in [28]. DD: dataset described in [14].
1) Quantitative Evaluation of the Generators: Generator models are usually evaluated based on the quality of the images they generate. However, this type of evaluation might not fully show the performance of the models and might be subjective due to biases of the reviewer [63]. In this regard, some authors have proposed the use of different metrics, such as the Inception score, to quantitatively evaluate the quality of the generated images [51]. In our specific case, we have the limitation that the dataset does not consist of natural images, such as the ones in the ImageNet dataset, and therefore we cannot apply the Inception score directly. We use instead the Fréchet Inception Distance (FID) proposed in [64] to quantify the performance of each trained generator, defined as:

$$d^2\big((m, C), (m_\omega, C_\omega)\big) = \lVert m - m_\omega \rVert_2^2 + \mathrm{Tr}\big(C + C_\omega - 2 (C C_\omega)^{1/2}\big)$$

where $m$, $C$ are respectively the mean and covariance obtained from the last pooling layer of an Inception model using sample images produced by the generative model, and $m_\omega$, $C_\omega$ are the corresponding ones using images from the original dataset.
We also analyze how the amount of data affects the quality of the images and the classifiers' performance. For this purpose, we use 3 different combinations of datasets coming from 4 different sources. The composition of the datasets is shown in Table II.
To measure how the models depend on the amount of data used, we analyze the sensitivity to noise of each of the generative models trained on the different datasets, as proposed in [65]. We added zero-mean Gaussian noise $\mathcal{N}(0, \sigma)$ with $\sigma \in \{0.025, 0.05, 0.075, 0.1, 0.2\}$ to the translation result before reconstruction. We compute the Mean Square pixel Error (MSE) of the reconstructed image with respect to the original image $x_i$ and calculate the sensitivity (SN) from this error at each noise level. We compared the sensitivity for each of the generators in the proposed Cycle Similarity network (CSi-GAN) and the baseline CycleGAN.
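To make the structure of the Fréchet distance concrete, the sketch below evaluates it for the simplified case in which both feature covariances are diagonal, so that the matrix square root reduces to an element-wise square root of the variances. This simplification, the toy feature vectors, and the function names are assumptions for illustration; the full FID uses the complete covariance matrices of Inception features and a matrix square root.

```python
import math


def feature_stats(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(d)]
    var = [sum((f[k] - mean[k]) ** 2 for f in features) / (n - 1)
           for k in range(d)]
    return mean, var


def frechet_distance_diagonal(mean_g, var_g, mean_r, var_r):
    """Frechet distance between two Gaussians with diagonal covariances.

    ||m_g - m_r||^2 + sum_k (var_g_k + var_r_k - 2 * sqrt(var_g_k * var_r_k))
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_g, mean_r))
    trace_term = sum(vg + vr - 2.0 * math.sqrt(vg * vr)
                     for vg, vr in zip(var_g, var_r))
    return mean_term + trace_term


# Toy 2-D "features" standing in for Inception pooling activations.
generated = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
real = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
m_g, v_g = feature_stats(generated)
m_r, v_r = feature_stats(real)
print(frechet_distance_diagonal(m_g, v_g, m_r, v_r))
```

Lower values indicate that the generated-image feature distribution is closer to that of the original images, which is why the model with the lowest score is preferred.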
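The noise-sensitivity protocol can be sketched in plain Python as below: noise is added to the translated image, the result is mapped back to the source domain, and the MSE against the original is averaged over the images; the area under the resulting curve is then obtained with the trapezoidal rule. Treating σ as a standard deviation, averaging the per-image MSE, the trapezoidal integration, and the toy stand-in generators are all assumptions for illustration.

```python
import random


def mse(a, b):
    """Mean squared pixel error between two flat images of equal length."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def noise_sensitivity(images, translate, reconstruct, sigma, rng):
    """Average reconstruction MSE after perturbing the translation with N(0, sigma)."""
    errors = []
    for x in images:
        noisy = [v + rng.gauss(0.0, sigma) for v in translate(x)]
        errors.append(mse(reconstruct(noisy), x))
    return sum(errors) / len(errors)


def trapezoid_auc(xs, ys):
    """Area under a piecewise-linear curve through the points (xs, ys)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))


# Toy stand-ins for the two generators: an invertible intensity mapping.
def to_b(x):
    return [1.0 - v for v in x]


def to_a(x):
    return [1.0 - v for v in x]


rng = random.Random(0)
images = [[0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.5, 0.7]]
sigmas = [0.025, 0.05, 0.075, 0.1, 0.2]
curve = [noise_sensitivity(images, to_b, to_a, s, rng) for s in sigmas]
print(trapezoid_auc(sigmas, curve))
```

A smaller area under the curve means that the reconstruction degrades less as noise increases, which is the sense in which lower values are better.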
2) Evaluation by Medical Specialists: Once the different GAN models were trained, the one with the best FID score was selected for human evaluation. With this analysis, we intended to confirm that the quality of the generated images is good enough to deceive experts, as well as to have a baseline to compare the classification performance of the models with that of specialists.
To qualitatively evaluate the utility of the images an online survey was set up where medical experts were asked to complete two tasks.
In the first task, 20 pairs of randomly selected images were shown to the participants. Each image pair corresponded to two images from the same domain; one of them was an original image taken with the endoscope while the other was an image translated by the GAN. The participants were asked to identify which one was the original and which one was the generated one. For this task, NBI and WLI image pairs were evenly distributed, with 10 samples for each case. In the second task, 40 pairs of images were shown to the participants. The clinicians were asked to classify the images according to the 4 classes explained in section III-D. Each image pair corresponded to one of the following options, distributed in a 50/50 ratio: 1) A pair of images that showed the same anatomical region at different times. In this case, the pair could correspond to two images of the same region and the same domain or two images of the same region but from different domains, i.e. (NBI, NBI), (WLI, WLI) and (NBI, WLI). The possible cases were evenly distributed. 2) Again, two images of the same anatomical region at different times, but in this case one of the images was domain translated. The images used in this task were randomly chosen, ensuring an even distribution of the 4 different tissue classes. Image pairs from options 1) and 2) were randomly ordered across the survey.
3) Evaluation of the Classifiers: Once the GAN models were trained, we incorporated them into the general workflow, using them as the base backbone to produce the multi-domain input images that feed the student classifier. The training was performed first in a fully supervised manner and then in a semi-supervised way using the previously trained teacher. To select the teacher model, diverse pre-trained models previously used in the literature were trained, and the one with the best performance metrics was chosen as the teacher. We also performed ablation studies to demonstrate the utility of each of the elements of the proposed method. In the final stage, we trained the multi-input classifier in a fully supervised way, using each of the previously trained generative models, to determine whether there is a correlation between the classification performance and the quality of the generated images.

G. Evaluation Metrics for Classification
To evaluate the classification performance of the proposed method we used the following metrics: accuracy (Acc), precision (Prec), recall (Rec), and F1-score. Additionally, as proposed in [49], we also evaluated the model using the Matthews correlation coefficient (MCC) and Cohen's kappa (CK) statistic, which have been shown to be effective for benchmarking the diagnostic reliability of classifiers [66]. The Mann-Whitney U-test was used to determine statistical significance. In the case of the users' experiments, the same metrics were used to evaluate their performance. Additionally, for the users' task of identifying the real images from the fake ones, the Area Under the Curve (AUC) of the Receiver Operating Characteristic curve was used.
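For reference, the two agreement statistics can be computed directly from label lists, as in the plain-Python sketch below. It uses the standard multi-class generalization of MCC and the usual definition of Cohen's kappa; the function names and the convention of returning 0.0 when a statistic is undefined are choices made for this illustration.

```python
import math
from collections import Counter


def _agreement_counts(y_true, y_pred):
    """Shared counts: correct predictions, sample size, and class-count products."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k, p_k = Counter(y_true), Counter(y_pred)
    sum_pt = sum(t_k[k] * p_k[k] for k in set(t_k) | set(p_k))
    sum_pp = sum(v * v for v in p_k.values())
    sum_tt = sum(v * v for v in t_k.values())
    return c, s, sum_pt, sum_pp, sum_tt


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient."""
    c, s, sum_pt, sum_pp, sum_tt = _agreement_counts(y_true, y_pred)
    denom = math.sqrt((s * s - sum_pp) * (s * s - sum_tt))
    return (c * s - sum_pt) / denom if denom else 0.0  # 0.0 when undefined


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    c, s, sum_pt, _, _ = _agreement_counts(y_true, y_pred)
    p_o = c / s
    p_e = sum_pt / (s * s)
    return (p_o - p_e) / (1.0 - p_e) if p_e != 1.0 else 0.0  # 0.0 when undefined


truth = ["HGC", "HGC", "LGC", "LGC"]
preds = ["HGC", "LGC", "LGC", "LGC"]
print(multiclass_mcc(truth, preds), cohen_kappa(truth, preds))
```

Both statistics equal 1 for perfect agreement and are close to 0 for chance-level predictions, which makes them more informative than accuracy on imbalanced class distributions such as the one in this dataset.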

IV. RESULTS AND DISCUSSION
This section is divided into two main subsections. First, we evaluate the performance of the image-translation network quantitatively and qualitatively. Then we analyze the results of the classification network and the influence that the quality of the generated images, as well as the different components of the system, has on the overall system.

A. Evaluation of the GAN models
The first set of results corresponds to the qualitative assessment of the synthetically generated images. Samples of randomly chosen images generated by the different trained GAN models are shown in Fig. 3. A visual comparison shows that the amount and diversity of training data improve the quality of the images. We can observe that the addition of data helps the network learn the existence of other objects that do not correspond to anatomical structures in the body, such as tools or bubbles. The shortcoming whereby the networks tend to make external structures disappear, by coloring them with the same hue as the rest of the background, is more perceptible when models are trained with small datasets (D1). Furthermore, in these cases, the network also presents some noticeable flaws, since the generated images sometimes present black dots scattered at diverse points. Nevertheless, the use of only external data (D2) also alters the hue of the translation. This could be linked to the fact that the external data comes mainly from GI images, which present different tints and anatomical formations than the ones present in the bladder. In general, for both CycleGAN and CSi-GAN, the use of the more general dataset (D3), which comprises data from the same anatomical target and external data, produces the best quality images. However, some image artifacts such as specularities, reflections, interlacing, etc. still appear in the generated images without being present in the original one. The most significant improvement comes from using the $\mathcal{L}_{sim}$ loss to train the GANs. The fine-grained details, such as small vessels, are better preserved and highlighted after the translations, and the loss also helps to reduce the amount of noise in the image. Similar behaviors can be observed in the video material attached to this manuscript.

1) Quantitative Evaluation of the GAN:
To evaluate the quality of the images generated by the GAN models the FID score and the AUC of the sensitivity curve were used.The results obtained for both metrics are shown in Table III.The model that obtains the best metrics for both cases, i.e. lower values, is the proposed CSi-GAN when trained on D3.In the case of FID score there is a clear difference between CycleGAN and CSi-GAN regardless of the dataset used for training, with CSi-GAN obtaining in general better results.In the case of the AUC of the Sensitivity curve, the difference between the two models is not that obvious.This could be associated with the fact that neither of the networks is designed from the origin to be noise-resistant.However, there is a clear tendency that the addition of data makes CSiGAN more resistant to the addition of noise than its counterpart CycleGAN.This might be related to the fact that even if the addition of more data helps CycleGAN to generalize better in domain translation the lack of a structural loss inhibits it to discern properly between the correct information to produce a satisfactory translation, and the information that seems useful but is just noise.This could also explain the reason why CycleGAN obtains better metrics when trained on dataset D2 than on D3 since the quality of the images of D2 is higher and less noisy.
2) Evaluation by Medical Specialists: In order to perform a more exhaustive analysis, a protocol was implemented to acquire feedback from expert clinicians in the field of endoscopy as described in sec.III-F2.A total of 20 physicians from 10 different institutions familiar with TURBT participated in the study.Of this, 15 corresponded to Expert Surgeons (ES) and 5 to Residents (RE).For this analysis we choose the generative model which obtained the best FID score and AUC values, i.e.CSi-GAN trained on dataset D3, to generate the synthetic images.
The results regarding the ability of surgeons to discern between real and synthetic images are shown in Table IV. The results are split into 3 categories to evaluate each translation separately (WLI → NBI and NBI → WLI), and therefore each generator independently, as well as the overall performance of the GAN (ALL). For both groups of participants (ES and RE), the results are slightly better for the translation WLI → NBI in all metrics. This might be related to the fact that there are more sample images in the WLI training dataset than in the NBI one, so the generator $G_{AB}$ is able to generalize better and produce better-quality images than its counterpart $G_{BA}$. The overall AUC is 0.59 for ES and 0.52 for RE, meaning that their performance is only marginally better than what a random binary classifier could achieve, which confirms that the quality of the generated images is good enough to trick experts in the area.
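To make the "marginally better than random" reading concrete, the following minimal sketch computes a rank-based AUC for a real-versus-synthetic discrimination task. It is an illustration only: the function name, the label encoding (1 = synthetic, 0 = real), and the toy scores are assumptions, and the text does not state that the study computed its AUC this way.

```python
def auc_from_scores(labels, scores):
    """Rank-based AUC, i.e. the normalized Mann-Whitney U statistic.

    labels: 1 for a synthetic image, 0 for a real one (hypothetical encoding).
    scores: a rater's confidence that each image is synthetic.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one synthetic image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # synthetic image ranked above a real one
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up ratings, not data from the study.
labels = [1, 1, 1, 0, 0, 0]
informed = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # rater separates the classes
guessing = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]   # rater cannot tell them apart
print(auc_from_scores(labels, informed))    # 1.0
print(auc_from_scores(labels, guessing))    # 0.5
```

An AUC of 0.5 corresponds to a rater who cannot separate the two groups at all, so the reported values of 0.59 and 0.52 sit close to that chance level.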
Concerning the tissue classification task, results are shown in Fig. 4. In the case of Acc there was an average improvement of 8% when a pair consisting of a real image and a synthetic one was shown, compared with when 2 real images were shown. In the case of Prec the improvement was 19%, while no improvement or decrease was observed in the case of Rec. For the F-1 score and MCC the improvements were 16% and 17% respectively. However, no statistical significance was found. This is in accordance with the results obtained in the previous analysis, meaning that the generated images do not affect the specialists' performance on tissue classification.

Table V: Comparison of using different pre-trained models in the proposed GAN-based multi-input classifier. The average ± standard deviation for each metric is presented in terms of the type of data in the test dataset (WLI and NBI) and the combination of both (ALL), for each of the models. The numbers in bold indicate the cases that obtained the best metrics.

B. Tissue Classification Evaluation
Results regarding tissue classification are divided into three parts. First, we show that the use of our proposed GAN method for image translation improves, in general, the performance of tissue classification using different backbones previously used in the literature as simple fine-tuned classification networks. Next, we show that the use of semi-supervised learning, in general, further improves the classification performance. Finally, we perform an ablation analysis of the proposed model.

1) GAN-based Tissue Classification:
To test the generalization of our method, we compare the use of different networks (VGG16, VGG19, Inception V3, DenseNet, ResNet-50, and ResNet-101) trained in a fine-tuning fashion against the implementation of these same networks in our GAN-based classification method. CSi-GAN trained on D3 was chosen as the translation network. Results in terms of ACC, MCC and F-1 score are shown in Table V. Overall, the proposed GAN-based method obtains better metrics than the baseline networks. In the majority of cases, there is little or no improvement when the input image is in the WLI domain. This uneven behavior in terms of classification improvement might be related to the fact that WLI images are more similar to the natural-image dataset on which the models were originally pre-trained (ImageNet). However, there is a noticeable improvement in the classification of NBI images, where most of the baselines perform poorly.
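The metrics reported in Table V can all be derived from a confusion matrix. The sketch below computes accuracy, a macro-averaged F-1 score, and the multi-class MCC. The function name and the toy matrix are illustrative, and the macro averaging of F-1 is an assumption, since the text does not state which averaging scheme was used.

```python
import math


def classification_metrics(cm):
    """Accuracy, macro F-1 and multi-class MCC from a confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                        # per true class
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # per predicted class

    acc = correct / total

    f1_scores = []
    for c in range(k):
        tp = cm[c][c]
        prec = tp / pred_counts[c] if pred_counts[c] else 0.0
        rec = tp / true_counts[c] if true_counts[c] else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro_f1 = sum(f1_scores) / k   # assumption: unweighted mean over classes

    # Multi-class generalization of the Matthews correlation coefficient.
    num = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    den = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = num / den if den else 0.0
    return acc, macro_f1, mcc


# Toy 2-class matrix, not data from the study.
acc, macro_f1, mcc = classification_metrics([[5, 1], [2, 4]])
print(round(acc, 3), round(macro_f1, 3), round(mcc, 3))
```

Unlike accuracy, MCC uses every cell of the matrix, which makes it more informative than accuracy alone when classes are imbalanced, as they are in this dataset.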
2) Semi-supervised Classification:
We compared GAN-based classification trained in a fully supervised way against semi-supervised classification. In both cases, only the weights of the Multi-Input classifier were trained, while those of the Cycle-Consistency Network remained constant. For these experiments, CSi-GAN pre-trained on each of the datasets $D_k$ was used. The results of these experiments are shown in Fig. 5 in terms of ACC, F-1 score, and MCC. On average, semi-supervised training improved ACC, F-1 score, and MCC by 8%, 6%, and 9% respectively over fully supervised training with CSi-GAN. This shows the potential of using GAN-based semi-supervised learning for bladder tissue classification. The confusion matrices of the best model obtained are shown in Fig. 6.
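One common way a teacher network is used in semi-supervised training is to assign pseudo-labels to unlabeled samples and add the confident ones to the student's training set. The sketch below illustrates that general idea only. The teacher interface, the confidence threshold of 0.9, and the toy data are all assumptions, and this section does not specify how the student was actually trained.

```python
def pseudo_label(teacher, unlabeled, threshold=0.9):
    """Return (sample, label) pairs for samples the teacher is confident about.

    teacher: callable returning a list of class probabilities (hypothetical interface).
    threshold: minimum top-class probability to accept; 0.9 is an arbitrary choice.
    """
    accepted = []
    for x in unlabeled:
        probs = teacher(x)
        best = max(range(len(probs)), key=probs.__getitem__)  # index of top class
        if probs[best] >= threshold:
            accepted.append((x, best))
    return accepted


def build_student_training_set(labeled, teacher, unlabeled, threshold=0.9):
    """Labeled samples plus confidently pseudo-labeled ones."""
    return list(labeled) + pseudo_label(teacher, unlabeled, threshold)


def toy_teacher(x):
    # Stand-in for a trained network: two-class probabilities from a scalar feature.
    p = min(max(x, 0.0), 1.0)
    return [1.0 - p, p]


labeled_wli = [(0.05, 0), (0.95, 1)]   # toy labeled samples (domain with annotations)
unlabeled_nbi = [0.02, 0.5, 0.97]      # toy unlabeled samples
train_set = build_student_training_set(labeled_wli, toy_teacher, unlabeled_nbi)
print(train_set)  # the ambiguous sample 0.5 is left out
```

Filtering by confidence trades coverage for label quality: a higher threshold admits fewer unlabeled samples but reduces the chance of training the student on wrong labels.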
3) Ablation Results:
In this case, we compared the base model, the proposed CSi-GAN model trained in a fully supervised way, and the same model trained in a semi-supervised way (Se-CSiGAN). We also analyzed the influence of each of the inputs of the multi-domain classifier model. For this purpose, we trained the network with each of the individual branches (b1, b2, b3) separately. The statistical significance was calculated with respect to the base model, ResNet-101. Classification results obtained by medical experts, stratified between specialists and residents, are shown as a reference point. The results of the ablation experiments are shown in Tables VI and VII. From these results, we can see that, in general, all the models obtain better results than the specialists, and the major improvement comes from the use of a semi-supervised approach. However, the improvement obtained when using domain translation in the domain for which there are no labels is also noticeable. As expected, the integration of both results in the best performance and considerably improves the detection of underrepresented classes. This behavior is most clearly noticeable in the case of the NTL class, which in our dataset has the smallest number of samples and, in contrast to NST, could easily be misclassified as a tumorous lesion.

Table VI: Ablation results. The average ± standard deviation for each metric is presented in terms of the type of data in the test dataset (WLI and NBI) and the combination of both (ALL), for each of the models. To have a reference point, the results obtained from physicians are also shown, divided by specialists and residents. The table shows in which cases Domain Translation (DT) and Unlabeled Data (UD) were used during the training. The experiments to examine the impact of each of the branches (b1, b2, b3) in the multi-input classifier were performed in a fully supervised (FS) way in order to analyze only the effects of the translations performed by the GAN. The ablation result corresponding to branch b1 is equivalent to the baseline (ResNet-101) result since the inputs from CSi-GAN are not used. The Cohen's Kappa (CK) statistic is reported as an overall benchmark of the classifier.
An additional analysis was performed in order to determine whether the quality of the GAN-translated images influences the classifier performance. The metrics Acc, F-1 score, and MCC, obtained by training the multi-input classifier in a fully supervised way using both CycleGAN and CSi-GAN, are compared against the FID score for each of the translation networks. The results of this comparison are shown in Fig. 7. Even though the gap in FID score between the generators from CycleGAN and CSi-GAN is easy to notice, and the best classification metrics are obtained when using CSi-GAN with more data (D3), this improvement is minimal. Furthermore, CycleGAN trained on D2 obtains similar metrics. The comparison against the classification metrics does not show a conclusive result, and further research is needed to determine the correlations that could lead to best practices and parameter choices when training GAN models.

V. CONCLUSION
In this paper, we propose a novel semi-supervised GAN-based learning method to address the problem of endoscopic image classification in the NBI and WLI imaging domains. The proposed method proves effective in a scenario with domain and class imbalance and, in general, performs better than specialists and baseline methods. The method leverages unlabeled data in a domain different from the one where annotations exist, which is a very common case in biomedical data, where annotated data is limited. This could ease the transition to clinical practice and its implementation for computer-aided BC diagnosis. The results obtained also show that the quality of the synthetic images generated with the proposed method is good enough to deceive clinical experts. Nevertheless, additional research needs to be carried out to find accurate metrics to assess the quality of generated images objectively and to determine to what extent it is related to classification performance.
Future work includes further validation on multi-center data, the acquisition of data from other imaging domains, which could help to better assess the generalization of the method, and the development of lesion detection methods that could differentiate specific image regions that correspond to the lesion and non-lesion.

Ethical Approval
The proposed study is a retrospective study. No personal data was recorded. The collection of data was in accordance with the ethical standards of the Istituto Europeo di Oncologia and with the 1964 Helsinki declaration, revised in 2000. All the subjects involved in this research were informed and agreed to data treatment before the intervention.

Informed consent
Written informed consent was obtained from all patients included in the study.

Figure 1: Sample images of the different classes in the bladder tissue classification dataset. From left to right: High-Grade Carcinoma (HGC), Low-Grade Carcinoma (LGC), No Tumor Lesion (NTL), and Non-Suspicious Tissue (NST).

Let us define a dataset $X = X_A \cup X_B$ composed of the union of two subsets: $X_A = \{(x_{A_1}, y_{A_1}), \dots, (x_{A_n}, y_{A_n})\}$, composed of $n$ labeled images $x_i$ belonging to domain $A$, and $X_B = \{x_{B_1}, x_{B_2}, \dots, x_{B_m}\}$, composed of $m$ unlabeled images $x_j$ belonging to domain $B$. Initially, a classifier $C$ is trained in a fully supervised fashion on $X_A$. This classifier works as a teacher model $C_T$ at a later stage. We propose the use of cycle-consistency image translation to deal with the issue of an unpaired and imbalanced dataset. For each image $x_A \in A$ we generate an equivalent translation $x_{AB} \in B$, and for every $x_B \in B$ we generate an equivalent translation $x_{BA} \in A$. The translated images $x_{AB}$ and $x_{BA}$ are produced by the generators $G_{AB}$ and $G_{BA}$ respectively. An advantage of using cycle-consistency GANs is that an additional image is generated, which corresponds to the reconstruction back to the original image. This can be used as additional data to train the student classifier. Therefore, for every image $x_A$ we have two extra images, $x_{AB}$ and $x_{ABA}$, and likewise for $x_B$ we have $x_{BA}$ and $x_{BAB}$. We then train a multi-input classifier $C_S$ which takes as input $C_S(x_A, x_{AB}, x_{ABA})$ or $C_S(x_B, x_{BA}, x_{BAB})$, depending on the domain of the input data.
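The construction of the three classifier inputs described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the generators are passed in as plain callables, and the toy generators simply shift pixel values so that the round trip is easy to follow.

```python
def make_classifier_inputs(x, domain, g_ab, g_ba):
    """Build the three inputs of the multi-input classifier for one image.

    domain: "A" or "B", the domain the image x comes from.
    g_ab, g_ba: callables standing in for the generators G_AB and G_BA.
    """
    if domain == "A":
        translated = g_ab(x)               # x_AB: translation into domain B
        reconstructed = g_ba(translated)   # x_ABA: cycle back to domain A
    elif domain == "B":
        translated = g_ba(x)               # x_BA: translation into domain A
        reconstructed = g_ab(translated)   # x_BAB: cycle back to domain B
    else:
        raise ValueError("domain must be 'A' or 'B'")
    return x, translated, reconstructed


# Toy generators acting on a list of pixel intensities (illustrative only).
def toy_g_ab(img):
    return [v + 1 for v in img]


def toy_g_ba(img):
    return [v - 1 for v in img]


x_a = [10, 20, 30]
print(make_classifier_inputs(x_a, "A", toy_g_ab, toy_g_ba))
# ([10, 20, 30], [11, 21, 31], [10, 20, 30])
```

With real generators the reconstruction is only approximately equal to the input, which is what lets it act as an additional, slightly different view of the same image.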

Figure 2: Proposed method. The network has three main elements. A) Cycle-Consistency Translation Network that translates the image from NBI to WLI and vice versa. B) Teacher network. C) Multi-input network that performs the tissue classification task based on the features from both image modalities. The classification makes use of backbone networks that extract the features from each of the inputs to the classifier. The features are processed using Fully Connected (FC) layers, which are later concatenated to perform the prediction in the final layer.

Figure 3: Samples of the generated images for the 4 classes in the 2 domains using each of the GAN models. For each model trained on the 3 different datasets (D1, D2, D3) two images are shown: 1) the image translated to the complementary domain, and 2) the reverse translation back to its original domain.

Figure 4: Box plot comparison of the surgeons' performance in the tissue classification task. Blue boxes correspond to the case in which surgeons were shown a pair of real images $\{x_i, x_j\}$. Orange boxes correspond to cases in which a pair consisting of a real image and its translation to the opposite domain was shown.

Figure 5: Box plot comparison of Acc, F-1 score and MCC of the proposed model trained in a fully supervised vs. semi-supervised way using CSi-GAN pre-trained on D1, D2 and D3. The results for each metric are divided in terms of the type of data in the test dataset (WLI and NBI) and the combination of both (ALL). Statistical significance using the Mann-Whitney U test is denoted with *: p < 0.05, **: p < 0.01, ***: p < 0.001.

Figure 6: Confusion matrices of the best model obtained. a) Analysis on the complete test data (WLI + NBI). b) Analysis only on the WLI test data. c) Analysis on the NBI data. It is important to note that, due to the scarcity of annotated NBI data, the NBI test dataset was composed only of HGC and LGC images.

Figure 7: Comparison of the different GAN models when used as backbone for training the multi-input classifier. The results are shown in terms of FID vs. ACC, F-1 score, and MCC.

Table I: Composition of the dataset considering two light modalities: White Light Imaging (WLI) and Narrow Band Imaging (NBI).
*The total number of patient cases does not correspond to the sum of the second column since some of the patients had more than one type of lesion.

Table II: Dataset composition used for training the GAN models. D1 corresponds to our dataset described in Sec. III-D. D2 corresponds to a dataset composed only of external sources. D3 corresponds to the union of all the previously mentioned datasets.

Table III: FID scores and AUC of the sensitivity curves for each of the GAN models trained on the different datasets. The results are divided in terms of the two generators $G_{AB}$ and $G_{BA}$. The numbers in bold indicate the cases that obtained the best metrics.

Table IV: Average results ± standard deviation from the specialist evaluation regarding their ability to discern between real and generated images. Results are divided in terms of the two different groups, Expert Surgeon (ES) and Resident (RE), and by the type of translation performed by each generator network, i.e. WLI → NBI and NBI → WLI, as well as the overall performance (ALL) of the GAN.

Table VII: Ablation results in terms of each of the classes in the dataset. The average ± standard deviation of each metric is presented for each of the 4 classes. The experiments to examine the impact of each of the branches (b1, b2, b3) in the multi-input classifier were performed in a fully supervised (FS) way in order to analyze only the effects of the translations performed by the GAN. The ablation result corresponding to branch b1 is equivalent to the baseline (ResNet-101) result since the inputs from CSi-GAN are not used.
874±0.034 0.364   0.918±0.047 0.218   0.747±0.060 0.121   0.864±0.070 0.003
Rec 0.919±0.046 0.953   0.840±0.042 0.791   0.840±0.078 0.233   0.865±0.030 0.360
F-1 0.895±0.027 0.107   0.892±0.016 0.065   0.781±0.045 0.128   0.853±0.028 0.445