Improving Colonoscopy Lesion Classification Using Semi-Supervised Deep Learning

While data-driven approaches excel at many image analysis tasks, the performance of these approaches is often limited by a shortage of annotated data available for training. Recent work in semi-supervised learning has shown that meaningful representations of images can be obtained from training with large quantities of unlabeled data, and that these representations can improve the performance of supervised tasks. Here, we demonstrate that an unsupervised jigsaw learning task, in combination with supervised training, results in up to a 9.8% improvement in correctly classifying lesions in colonoscopy images when compared to a fully-supervised baseline. We additionally benchmark improvements in domain adaptation and out-of-distribution detection, and demonstrate that semi-supervised learning outperforms supervised learning in both cases. In colonoscopy applications, these metrics are important given the skill required for endoscopic assessment of lesions, the wide variety of endoscopy systems in use, and the homogeneity that is typical of labeled datasets.


I. INTRODUCTION
Colorectal cancer is the second leading cause of cancer death and will cause a predicted 53,200 deaths in the United States in 2020 [1]. Optical colonoscopy is considered the gold standard for detecting and preventing colorectal cancer, with approximately 15 million procedures performed annually [2]. Screening procedures are used to inspect the large intestine and rectum for precancerous lesions so that they may be removed prior to the onset of carcinoma. These lesions come in a variety of geometries and textures, each with an associated risk of progressing to a cancerous state [3]. Colonoscopists analyze optical images to visually classify lesions, using cues such as color, shape, and vasculature patterns in conjunction with published guidelines [4]-[6]. Improving the reliability of lesion classification from images and de-skilling this task could reduce the costs, time, and other resources associated with histopathology. Further, lesions which are benign in nature may be left in place, eliminating the associated risks of polyp removal [7].
In the past decade, deep learning models have achieved astounding success in the computer vision field on tasks such as image classification and object recognition, surpassing human-level performance in some cases. In medical imaging, these models have outperformed traditional image processing techniques in a variety of fields such as radiology, histopathology, retinopathy, and mammography. Most of these models are trained in a supervised fashion, requiring large quantities of expertly annotated medical data to achieve optimal performance. In the medical imaging field, compiling annotated data is particularly time consuming, expensive, fraught with privacy concerns, and limited by the availability of expert annotators. In contrast, unsupervised methods have shown that meaningful representations can be extracted from unlabeled data, which is often plentiful. In this work, we leverage the advantages of both labeled and unlabeled data using a semi-supervised learning paradigm to improve the performance of colonoscopy lesion classification.
Semi-supervised learning (SSL) is an emerging area of research that aims to learn a supervised objective while enriching the encoded features through an unsupervised task. Recent works have shown marked improvement over purely supervised training, especially with small quantities of labeled data [8]-[10]. SSL involves simultaneously training an unsupervised proxy task and a supervised task. Many proxy tasks involve applying some type of transformation to an image, then tasking the network with predicting the transformation. In this way, the network learns to encode information into a feature space which may enhance the performance of the supervised task. One example of a proxy task is applying a known rotation to an image, then tasking the network with estimating the degree of rotation.
In this paper, we use a jigsaw puzzle as the proxy task for SSL, as was first proposed by [11]. In this task, an input image is cut into an N × N grid, and the resulting tiles are reshuffled into an order defined by a randomly selected pseudo-label. The network then learns to encode the shuffled image into a feature vector which allows it to accurately predict the tile order. The unsupervised jigsaw task ideally enriches the encoder's resultant feature vectors, making them more discriminative for the supervised lesion classification task. Using this method, we find that a semi-supervised learning model outperforms a purely supervised model in lesion classification. While most semi-supervised learning research focuses solely on improvements in accuracy, trained models also benefit from improved robustness and generalizability. We therefore also investigate the jigsaw method's effect on domain adaptation and out-of-distribution detection in colonoscopy, which are important metrics when deploying models to real-world clinical settings. Specifically, the contributions of this study are:
1) To the best of our knowledge, this is the first research applying semi-supervised learning to colonoscopy lesion classification.
2) We demonstrate that a jigsaw-puzzle-solving task can effectively leverage unlabeled data to significantly improve the performance of lesion classification.
3) We show that semi-supervised learning also improves performance in analyzing domain-shifted images and detecting out-of-distribution samples at inference.

A. LESION CLASSIFICATION
Polyp classification is a widely researched problem in the medical image analysis community [12], [13]. Previous work has used traditional methods for hand-crafted feature extraction using color, texture, and 3D features for polyp classification in videos [14]. More recent research uses deep learning models, which have shown significant improvements in classification accuracy. Most use transfer learning [15] with off-the-shelf models such as ResNet [16] and Inception [17]-[19]. Others have combined traditional methods with deep learning approaches, such as fused wavelets and convolutional neural network features [20]. Multi-modal fusion of pixel-level information, such as color and depth, has also been shown to improve classification accuracy [21], [22]. Still, all of these methods exclusively utilize data with ground truth annotations [23].

B. SELF-SUPERVISED & SEMI-SUPERVISED LEARNING
Self-supervised and semi-supervised learning are highly active areas of artificial intelligence research. These methods exploit unlabeled data for effective representation learning. Recent semi-supervised works have achieved performance comparable to conventional fully supervised networks while requiring only a small fraction of labeled data. To learn from data without manual annotations, self-supervised methods employ proxy tasks where pseudo-labels can be generated using known transformations or data manipulations.
According to [24], there are four common types of proxy tasks:
• Generation-based methods: Some part of the data is deliberately removed, and the network is tasked with predicting the missing data. Examples include image colorization [15], image inpainting [25], and video generation from single frames using generative adversarial networks (GANs) [26].
Recent works have shown that semi-supervised learning methods improve model robustness and generalizability, as well as the ability to measure uncertainty [38], [39]. Deep learning models are notorious for silently providing incorrect predictions when test samples are drawn from a distribution other than the distribution used for training. Surrogate methods have been incorporated into the inference pipeline, drawing on the network's prediction probabilities to determine an out-of-distribution score for test samples [40]. The success of semi-supervised learning in medical imaging is dependent on deploying networks that can handle a wide distribution of samples and that have a mechanism for appropriately handling samples which the network is ill-conditioned to classify.

1) SEMI-SUPERVISED LEARNING IN MEDICAL IMAGING
Since labeled data in the medical imaging community is particularly scarce, researchers in this field have long explored unsupervised methods. Cheplygina et al. [41] present a comprehensive review of semi-supervised and self-supervised methods employed in medical imaging. Popular approaches include self-labeling and co-training, where a classifier is first trained on the available labeled data and is then used to generate pseudo-labels on unlabeled data. The classifier is then retrained using the newly generated labeled data. This method is especially popular where precise labeling is cumbersome, such as pixel-level segmentation tasks with applications in neuro [42]-[44], heart [45], and retinal [46] imaging. More recent works have employed state-of-the-art semi-supervised and self-supervised techniques across a wide range of applications, such as consistency regularization for skin lesion classification and thorax disease diagnosis [47], unsupervised anomaly detection for white matter lesion segmentation [48], and image synthesis with GANs for data augmentation in glaucoma assessment [49].

FIGURE 1: The proposed semi-supervised learning model uses lesion type labels for a supervised loss and jigsaw index pseudo-labels for an unsupervised loss. This model is sequentially trained in a supervised phase then an unsupervised phase for each iteration.

2) JIGSAW PUZZLE SOLVING
The original semi-supervised jigsaw approach proposes decomposing an image into patches, shuffling the patches, then individually feeding the patches to a Siamese network [11]. The network predicts the shuffled patch order as a pretext task, and it is later fine-tuned on the downstream, supervised task using labeled data. Many variations of the jigsaw task have been explored, including for videos [50], three-dimensional data [51], and negative sample inclusion for increased difficulty [52]. Specifically in medical imaging, the jigsaw paradigm has been applied to imaging of the brain and pancreas [51], [53], [54].
In this work, we adapt the jigsaw proxy task for improving the performance of a supervised classifier [39]. To the best of our knowledge, this work is the first to explore semi-supervised learning for lesion classification in colonoscopy. The most similar prior art is [55], which performs medical instrument segmentation on endoscopy images using image colorization as the pretext task.

III. METHODS
Our problem statement is defined as follows: given a colonoscopy image of a lesion, we attempt to classify it into one of two classes: neoplastic/precancerous or non-neoplastic. Our dataset consists of labeled and unlabeled image sets, $D = D_L \cup D_U$, where $D_L = \{(x_i, y_i)\}_{i=1}^{N_l}$ is the set of labeled lesion images with $N_l$ as the total number of labeled images, and $D_U = \{x_i\}_{i=1}^{N_u}$ is the set of unlabeled lesion images, where $N_u$ is the total number of unlabeled images. A detailed description of the classes and dataset is given in Section IV-A. The goal is to leverage the unlabeled data $D_U$ using the jigsaw task to improve the performance of lesion classification.

A. ARCHITECTURE
As shown in Figure 1, our model consists of ResNet-18 as a shared feature encoder with two classifier heads: one for supervised lesion classification and a second for jigsaw classification. Our deep model is denoted by $f$, where the shared feature extractor is parameterized by $\theta_e$ and the supervised and unsupervised classifier heads by $\theta_s$ and $\theta_u$, respectively. The network trains in two phases: a supervised phase that minimizes the supervised loss $L_S$, followed by an unsupervised phase that minimizes the jigsaw loss $L_U$. The parameters of the network are learned by alternating training between the supervised and unsupervised tasks on each iteration. The following sections describe the two training phases in detail.

1) SUPERVISED PHASE
The main supervised objective is to classify colonoscopy lesion images into neoplastic vs. non-neoplastic classes. We aim to minimize the supervised classification loss $L_S$, which is the weighted cross-entropy loss between the target label $y_i$ and the model prediction $f(x_i | \theta_e, \theta_s)$ with $(x_i, y_i) \in D_K$. In our experiments to assess the effectiveness of semi-supervised learning, we report the performance of the network trained on various fractions of the labeled dataset. Consequently, $D_K \subseteq D_L$ is obtained by selecting $k$ percent of the labeled data, where $k$ varies logarithmically, i.e. $k \in \{100, 50, 25, 12.5, 6.25\}$. A detailed description of how data selection is performed is given in Section IV-C. The cross-entropy loss function is weighted to account for the class imbalance in the dataset. Formally, the supervised loss function is defined as

$$L_S = -\frac{1}{|D_K|} \sum_{i=1}^{|D_K|} \sum_{c \in \{0,1\}} w_c \, y_{i,c} \log p(y_{i,c} | x_i, \theta_e, \theta_s) \quad (1)$$

where $|D_K|$ is the number of images in the selected labeled dataset, the weight $w_c = 1/freq(c)$ is the inverse frequency of class $c$ in the dataset $D_K$, $y_{i,c}$ is the one-hot encoded target label for the $i$-th image, and $p$ is the posterior probability obtained by taking the softmax of the output logits from $f$. In this phase, only the parameters of the feature encoder $\theta_e$ and the supervised fully connected layer $\theta_s$ are updated.
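As a minimal sketch of the class-weighted cross-entropy described above, the following NumPy function computes inverse-frequency weights from the labels in a batch and averages the weighted per-sample losses. The function name `weighted_cross_entropy` and the choice to normalize by the number of samples (rather than by the sum of weights, as some library implementations do) are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def weighted_cross_entropy(logits, labels):
    """Mean class-weighted cross-entropy with inverse-frequency weights.

    logits: array of shape (n, num_classes); labels: integer array of shape (n,).
    Normalizing by n is an assumption of this sketch.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n, num_classes = logits.shape
    # Log-softmax with max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # w_c = 1 / freq(c), computed from the labels that are present.
    freq = np.bincount(labels, minlength=num_classes) / n
    weights = np.zeros(num_classes)
    weights[freq > 0] = 1.0 / freq[freq > 0]
    # Weighted negative log-likelihood of the true class for each sample.
    per_sample = -weights[labels] * log_probs[np.arange(n), labels]
    return per_sample.mean()
```

With inverse-frequency weights, each class contributes equally to the total loss regardless of how many samples it has, which is the purpose of the weighting in an imbalanced dataset.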

2) UNSUPERVISED PHASE
Following each supervised phase, an unsupervised phase is trained using the entire dataset $D$. In this phase, the objective of the network is to learn to solve the jigsaw task. As shown in Figure 2, we first decompose an image into a 3 × 3 grid of tiles. Then, a patch of 0.75-0.9 times the original tile size is cropped from each tile at a random offset. The patches are then scaled back to the original tile size, reordered according to a selected permutation index $\mathcal{P}$, and concatenated to reform a 222 × 222 input image $z$. This transformation prevents the network from using low-level cues such as continuity of edges, color, or texture when estimating the patch order. Instead, the network is forced to learn high-level, global primitives such as shape.
With 9 grid positions, there are 9! possible patch permutations, creating far too many labels for the network to learn. To make the classification task achievable for the network, we select a small subset of $P$ permutations with maximal Hamming distance from one another [11]. An index is assigned to each permutation, which then functions as a pseudo-label. The jigsaw task is then formulated as a classification problem, tasking the network to learn to correctly predict the pseudo-label $\mathcal{P} \in \{0, 1, 2, ..., P\}$ of $z$. Here, the zero index refers to the unscrambled, original image case.
We use a weighted cross-entropy loss as the unsupervised loss $L_U$. When creating a mini-batch for training in the unsupervised phase, we keep the scrambled-to-unscrambled image ratio equal to $s : (1 - s)$, where $s \in [0, 1]$. In the jigsaw shuffler, the permutation index for the scrambled images is drawn from the uniform distribution $U\{1, P\}$. Hence, the frequency of occurrence for permutation indices is $freq = ((1 - s), s/P, s/P, ..., s/P)$, where $freq(\mathcal{P})$ is the frequency of permutation index $\mathcal{P}$. The inverse of the frequency is used as a scalar weighting in the cross-entropy loss, $w_{\mathcal{P}} = 1/freq(\mathcal{P})$. The unsupervised loss is defined as follows:

$$L_U = -\frac{1}{|D|} \sum_{i=1}^{|D|} \sum_{\mathcal{P}=0}^{P} w_{\mathcal{P}} \, y_{i,\mathcal{P}} \log p(y_{i,\mathcal{P}} | z_i, \theta_e, \theta_u) \quad (2)$$

where $|D|$ is the total number of images in the training dataset, $z_i$ is the $i$-th recomposed image, $y_{i,\mathcal{P}}$ is the one-hot encoded pseudo-label vector, and $p(y_{i,\mathcal{P}} | z_i, \theta_e, \theta_u)$ is the prediction probability for the $\mathcal{P}$-th permutation. Minimization of the unsupervised loss involves only learning the feature encoder $\theta_e$ and the unsupervised head $\theta_u$. The overall training loss $L_{total}$ is then:

$$L_{total} = L_S + \lambda L_U \quad (3)$$

where $\lambda$ is a scalar weight applied to the unsupervised loss.
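The permutation subset and the inverse-frequency weights described above can be sketched as follows. This is an illustrative greedy heuristic that repeatedly picks the candidate farthest (in Hamming distance) from the permutations already chosen; it may differ from the exact selection procedure of [11]. The function names `select_permutations` and `permutation_weights`, and the `seed` parameter, are hypothetical.

```python
import itertools
import numpy as np

def select_permutations(num_perms, num_tiles=9, seed=0):
    """Greedily pick `num_perms` tile orders that are far apart in Hamming distance.

    Returns an array of shape (num_perms + 1, num_tiles); row 0 is the
    identity (unscrambled) order, rows 1..P are the scrambled orders.
    """
    rng = np.random.default_rng(seed)
    identity = np.arange(num_tiles)
    all_perms = np.array(list(itertools.permutations(range(num_tiles))))
    # Exclude the identity: pseudo-label 0 is reserved for the unscrambled image.
    all_perms = all_perms[(all_perms != identity).any(axis=1)]
    chosen = [all_perms[rng.integers(len(all_perms))]]
    # min_dist[j]: Hamming distance from candidate j to its nearest chosen permutation.
    min_dist = (all_perms != chosen[0]).sum(axis=1)
    while len(chosen) < num_perms:
        best = int(min_dist.argmax())
        chosen.append(all_perms[best])
        min_dist = np.minimum(min_dist, (all_perms != all_perms[best]).sum(axis=1))
    return np.vstack([identity] + chosen)

def permutation_weights(num_perms, s):
    """Inverse-frequency weights for pseudo-labels 0..P given scrambled fraction s."""
    freq = np.array([1.0 - s] + [s / num_perms] * num_perms)
    return 1.0 / freq
```

For example, with P = 30 permutations and s = 0.6, the unscrambled class receives weight 1/0.4 = 2.5 and each scrambled class receives weight 1/0.02 = 50.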
In the unsupervised phase, ordered and shuffled images are mixed. During the supervised phase, input images remain ordered, just as they are presented during testing. When training is complete, the unsupervised head is discarded, and only the trained feature encoder and supervised lesion classification head are used for testing.
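To make the jigsaw shuffling procedure concrete, here is a small NumPy sketch of the tile, crop, rescale, and reorder steps. Several details are assumptions for illustration only: the crop fraction is applied to the tile side length, resizing uses nearest-neighbour indexing as a stand-in for whatever interpolation the actual pipeline uses, and the name `jigsaw_shuffle` and its parameters are hypothetical.

```python
import numpy as np

def jigsaw_shuffle(image, perm, grid=3, rng=None, scale_range=(0.75, 0.9)):
    """Cut a square (H, W, C) image into grid x grid tiles, crop and rescale
    each tile, and place tile perm[d] at output position d."""
    rng = np.random.default_rng() if rng is None else rng
    tile = image.shape[0] // grid  # assumes side length divisible by `grid`
    patches = []
    for r in range(grid):
        for c in range(grid):
            t = image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            # Random crop covering a fraction of the tile side, at a random offset.
            size = int(round(tile * rng.uniform(*scale_range)))
            oy, ox = rng.integers(0, tile - size + 1, size=2)
            crop = t[oy:oy + size, ox:ox + size]
            # Nearest-neighbour resize of the crop back to the tile size.
            idx = np.arange(tile) * size // tile
            patches.append(crop[idx][:, idx])
    out = np.empty_like(image[:grid * tile, :grid * tile])
    for dst, src in enumerate(perm):
        r, c = divmod(dst, grid)
        out[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = patches[src]
    return out
```

For a 222 × 222 input, each tile is 74 × 74 pixels. Passing the identity order `range(9)` corresponds to pseudo-label 0, although the tiles are still cropped and rescaled unless `scale_range=(1.0, 1.0)` is used.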

B. DOMAIN ADAPTATION
This section describes experiments to assess how semi-supervised learning impacts the domain generalizability of a model. In the context of colonoscopy, domain adaptation would be useful when applying a network to new endoscope types or manufacturers, to endoscopes with imaging performance that varies over time (e.g. dirty optics), or to new imaging modes. We experimentally withhold a target domain of data from the supervised task and only include it in the unlabeled set for the unsupervised task. We can then assess the domain adaptability of the network by testing on labeled samples from the target domain.
In colonoscopy, two widely used imaging modalities are White Light Imaging (WLI) and Narrow Band Imaging (NBI). The network training approach remains the same as was described in the previous section, with the only exception being the data used in each phase. In the supervised phase, we use the labeled WLI images from $D_{L-WLI}$, whereas in the unsupervised phase we use both the labeled WLI images and the unlabeled NBI images, i.e. $D_{L-WLI} \cup D_{U-NBI}$. In the testing phase, we use labeled NBI images $D_{L-NBI}$.

C. OUT-OF-DISTRIBUTION DETECTION
In out-of-distribution detection, the goal is to identify test samples which do not belong to the distribution on which the model was trained. These out-of-distribution samples can then be rejected to avoid unreliable inference. A pretrained semi-supervised learning model can act as an efficient out-of-distribution detector. In this experiment, we train a classifier using in-distribution samples on the main objective of lesion classification, and then later test its performance as an out-of-distribution detector. We consider white light images to be in-distribution samples, and NBI images are treated as out-of-distribution samples. In the supervised phase, we use labeled white light images from $D_{L-WLI}$. For the unsupervised phase, we use unlabeled and labeled white light images, i.e. $D_{L-WLI} \cup D_{U-WLI}$. To use the classifier as an out-of-distribution detector, we utilize the posterior probabilities $p(y|x)$. It is shown in [56], [57] that the prediction softmax probabilities for out-of-distribution samples appear roughly uniform in distribution, whereas in-distribution samples have a more 'peaky' distribution with a higher maximum softmax probability $\max_c p(y = c|x)$. An out-of-distribution detector score $\kappa$ based on the posterior probabilities and the auxiliary jigsaw loss is defined as

$$\kappa = KL[U \,\|\, p(y|x)] - \sum_{\mathcal{P}=0}^{P} w_{\mathcal{P}} \, y_{i,\mathcal{P}} \log p(y_{i,\mathcal{P}} | z_i, \theta_e, \theta_u) \quad (4)$$

where $KL[U \,\|\, p(y|x)]$ is the KL-divergence between the uniform distribution and the prediction softmax probabilities, and the second term is the unsupervised loss for image $x$ as defined in Equation 2.
KL divergence measures the difference between two probability distributions. If two probability distributions are similar, the KL divergence between them is low, whereas a high value indicates that they are starkly different. The KL divergence between distributions $P(y)$ and $Q(y)$ is defined as:

$$KL[P \,\|\, Q] = \sum_{y} P(y) \log \frac{P(y)}{Q(y)}$$

where $y$ ranges over the support of the distribution, i.e. $y \in \{0, 1\}$ for this case. In the baseline experiment, the OOD score is

$$\kappa = KL[U \,\|\, p(y|x)]$$

For the semi-supervised learning case, we also add the jigsaw cross-entropy loss. For testing, we use unseen WLI images as the negative class (label = 0) and NBI images as the positive class (label = 1).
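The scoring rule described in this section can be sketched in plain Python as the KL divergence from the uniform distribution to the classifier's softmax output, optionally plus a weighted jigsaw cross-entropy term. The function names, the clipping constant `1e-12`, and the way the jigsaw term is passed in (a probability vector, a true pseudo-label, and a scalar weight) are assumptions of this sketch rather than details taken from the paper.

```python
import math

def kl_uniform_to_pred(probs):
    """KL[U || p] for a list of softmax probabilities p."""
    u = 1.0 / len(probs)
    # Clip probabilities away from zero to keep the logarithm finite.
    return sum(u * math.log(u / max(p, 1e-12)) for p in probs)

def ood_score(class_probs, jigsaw_probs=None, jigsaw_label=0, jigsaw_weight=1.0):
    """Baseline score when jigsaw_probs is None; otherwise adds the weighted
    cross-entropy of the jigsaw head for the known pseudo-label."""
    score = kl_uniform_to_pred(class_probs)
    if jigsaw_probs is not None:
        score += -jigsaw_weight * math.log(max(jigsaw_probs[jigsaw_label], 1e-12))
    return score
```

Because the pseudo-label of a recomposed test image is known by construction, the jigsaw term needs no ground truth annotation at inference time.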

A. DATASET
The colonoscopy video data used in this paper was collected at the Johns Hopkins Hospital using a protocol approved by the Johns Hopkins Institutional Review Board (#IRB00184221).Video segments were analyzed and cropped from patient procedure video data, retrospectively, to limit included frames to those containing lesions that were biopsied by the endoscopist.Tissue biopsies were collected from suspected lesions, and ground truth labels derived from histopathology analysis were later paired with the respective video segments.A total of 108 patients were enrolled in the study.A total of 132 videos with corresponding ground truth labels were collected, with each video segment featuring a unique lesion.Video annotations were recorded by two medical trainees and verified by an experienced gastroenterologist.An additional 112 videos with no ground truth classification were cropped and extracted for training the semi-supervised model.
Videos were further categorized into two classes: "neoplastic/precancerous" and "non-neoplastic". Using the histologic labels, adenomas and serrated adenomas were assigned to the neoplastic/precancerous class (n=110), while hyperplastic polyps were assigned to the non-neoplastic class (n=22). The videos include a diverse distribution of imaging parameters, such as varied video processors, illumination modes (WLI/NBI), and scope manufacturers and models with high- and standard-definition resolutions. Videos were separated into training and testing sets with equal class balance between sets. Derived image frames were stored in separate containers to prevent class leakage. Repetitive image frames resulting from minimal camera motion were discarded. A frame-wise summary of the dataset is given in Table 1.

B. IMPLEMENTATION DETAILS
All experiments are implemented using the PyTorch library [58] on a server equipped with an NVIDIA RTX 2080Ti 11GB GPU, an Intel Xeon Processor W-2123 3.6 GHz CPU, and 64 GB of RAM. We use the JiGen repository [39] as our base code for development. All experiments utilize ResNet-18 [59] as the feature encoder. The fully connected layers are 512 × 2 for the supervised branch and 512 × P for the unsupervised branch, similar to the FCN classifier in ResNet-18, differing only in the number of output nodes.
The network weights are initialized using the pre-trained ImageNet ResNet-18 weights available in the PyTorch library. Data augmentation for whole images includes random vertical flip, random horizontal flip, random rotation into {0°, 90°, 180°, 270°}, and random crops of size [0.8, 1.0] (all p=0.5). The images are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. The augmented images are finally resized to 222 × 222. In the case of the unsupervised phase, the whole-image transformations are applied before the jigsaw shuffler. No color transformations are applied, as polyp color is a discriminative feature among the classes. The Adam optimizer [60] with weight decay (L2 penalty) is used for training the network. The initial learning rate is kept as 0.0001. The ratio of frame-wise class frequency is 0.83:0.17 for the neoplastic to non-neoplastic classes, the inverse of which is used as weights in the supervised weighted cross-entropy in Equation 1. The scrambled-to-unscrambled image ratio s : (1 − s) used in Equation 2 is kept as 0.6:0.4.

C. VARYING THE QUANTITY OF LABELED DATA
To test the efficacy of our semi-supervised learning approach, we evaluate its performance as a function of the quantity of labeled data $D_K$ used for training. We train the network using k% of the total labeled training data, where k varies logarithmically, k = {100, 50, 25, 12.5, 6.25}. For each k, we perform a five-fold cross validation. To split the dataset, we first select 20% of the total labeled data for validation. This validation split is done at the video level to prevent images of the same polyp from mixing between the train and validation sets. Next, we choose k% of the remaining labeled data as our supervised training dataset $D_K$. Thus, the validation dataset for a particular fold remains the same for all values of k. We use the selected labeled dataset $D_K$ for training the supervised phase, but for all values of k we use the whole training dataset $D$ (excluding the validation images) for the unsupervised phase. On average, there are 819 images in the validation set.
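The splitting protocol above can be illustrated with a short sketch that holds out roughly 20% of videos per fold and then keeps k% of the remaining videos for supervised training. Selecting the k% subset at the video level, the fixed shuffle seed, and the name `split_fold` are assumptions made for this example; the paper only states that the validation split is done at the video level.

```python
import random

def split_fold(video_ids, fold, k_percent, n_folds=5, seed=0):
    """Return (supervised_train_videos, validation_videos) for one fold and one k."""
    vids = sorted(set(video_ids))
    random.Random(seed).shuffle(vids)   # same order for every fold and every k
    val = vids[fold::n_folds]           # about 20% of videos when n_folds=5
    val_set = set(val)
    train = [v for v in vids if v not in val_set]
    # Keep the first k% of the remaining videos, so smaller subsets are nested
    # inside larger ones and the validation set does not depend on k.
    n_keep = max(1, round(len(train) * k_percent / 100))
    return train[:n_keep], val
```

Because the validation videos are chosen before the k% subset, comparisons across values of k within a fold use identical validation data.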
We perform an ablation study to measure the performance of SSL when compared to a baseline model.The baseline model is also a ResNet-18, and it is architecturally the same as the SSL model (described in III-A), but without the jigsaw head.The baseline model uses the same weighted cross entropy loss that the SSL model uses in the supervised phase (Equation 1).When comparing the performance of the SSL model and the baseline, both models use the same validation data and the selected labeled data for supervised training D K .
The hyperparameters which gave the best performance for both models are reported. For the baseline model, an initial learning rate of 0.0001 is used for all cases except the 100% model, where 0.001 is used. For weight decay, the 100% model uses 0.005, the 50% and 25% models use 0.05, the 12.5% model uses 0.2, and the 6.25% model uses 0.005. For the SSL models, an initial learning rate of 0.0001 is used. The number of jigsaw classes (P) is 30 for all cases except 100%, which uses 100 classes. The weight decay values are 0.005 for 100%, 0.05 for 50%, 0.07 for 25% and 12.5%, and 0.2 for 6.25%. The unsupervised loss weights λ are 1 for 100% and 50%, 2 for 25%, and 1.5 for 12.5% and 6.25%. In the low data regime, λ is also multiplied by 1.5 every 5 epochs so that the unsupervised phase keeps pace with the supervised phase, which learns quickly because of the small labeled dataset.
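A minimal sketch of the λ schedule mentioned above, assuming zero-indexed epochs, a step increase at epochs 5, 10, 15, and so on, and a combined objective of the form supervised loss plus λ times unsupervised loss. The function names are hypothetical.

```python
def unsup_loss_weight(epoch, base_lambda, factor=1.5, every=5):
    """Lambda after multiplying by `factor` once per `every` completed epochs."""
    return base_lambda * factor ** (epoch // every)

def total_loss(sup_loss, unsup_loss, lam):
    """Combined objective: supervised loss plus lambda-weighted unsupervised loss."""
    return sup_loss + lam * unsup_loss
```

For instance, starting from λ = 1.5, the weight stays at 1.5 for epochs 0 to 4 and becomes 2.25 for epochs 5 to 9.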
We evaluate the classification performance with five commonly used metrics: accuracy, F1 score, sensitivity, specificity, and precision. Accuracy is the ratio of correct predictions over the total number of test samples. Since our data has an uneven class distribution, we also use F1 score for evaluation. F1 score is the harmonic mean of precision and recall. Sensitivity is the ratio of correctly predicted positive samples to the total number of positive samples (neoplastic/precancerous class). Similarly, specificity is the ratio of correctly classified negative samples to the total number of negative samples (non-neoplastic class). Precision is the ratio of correctly predicted positives to all predicted positives. Definitions are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F1 Score} = \frac{2TP}{2TP + FP + FN}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$

where true positive $TP$ is the number of correct predictions for the positive class, while true negative $TN$ is the number of correct predictions for the negative class. False negative $FN$ is the number of samples incorrectly classified to the negative class, whereas false positive $FP$ is the number of samples incorrectly classified to the positive class. Figure 4 plots the median metrics and the standard deviation across the five-fold cross validation as a function of the fraction of labeled training data.
The semi-supervised improvement over the baseline indicates that adding a jigsaw-solving auxiliary task is beneficial. This improvement could be attributed to SSL enabling the network to learn more discriminative features, such as shape, while learning the jigsaw task. Superior performance in the low data regime, and even the extra boost with 100% labeled data, indicates that the jigsaw task effectively leverages unlabeled data. It is worth noting that the baseline outperforms SSL on the specificity metric. For our use case of precancerous lesion classification, sensitivity is more important than specificity, as missing precancerous lesions may lead to delayed treatment, a worse prognosis, and ultimately a reduced survival rate.
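The five metrics defined in this section can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration; the function name `classification_metrics` and the choice to return NaN when a denominator is zero are assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, F1, sensitivity, specificity, and precision from confusion counts."""
    def ratio(num, den):
        # Return NaN instead of raising when a class is absent.
        return num / den if den else float("nan")
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }
```

With an imbalanced dataset such as this one, a classifier can reach high accuracy and sensitivity while specificity remains low, which is why the metrics are reported together.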

D. DOMAIN ADAPTATION
The goal of this experiment was to test the domain generalizability of semi-supervised learning. We train the model on labeled white light images (n=2326) and unlabeled NBI images (n=961), and then test the model using labeled NBI images (n=1685). The architecture, training protocol, and testing protocol remain the same as in the previous subsection IV-C. For the ablation study, the baseline model described in IV-C was used. For training the baseline model, we use the same set of white light labeled images (n=2326) as in SSL training. The hyperparameters used are an initial learning rate of 0.0001 and weight decay of 0.005 for both cases. In SSL, the number of jigsaw classes (P) was 100 and the unsupervised loss weight λ = 1 was used.
The results for the domain adaptation experiment are reported in Table 3. To reduce the influence of random initialization, we report the mean values for 3 runs initiated with different random seeds. We observe that the semi-supervised model exceeds the baseline in accuracy, F1 score, and sensitivity by 1.92%, 0.02, and 0.07, respectively. This superior performance suggests that the semi-supervised method takes advantage of unlabeled target images to learn domain-invariant feature representations. This may be enabled by the jigsaw puzzle solver learning the spatial correlation of images.

E. OUT-OF-DISTRIBUTION DETECTION
In this experiment, we test semi-supervised performance as an out-of-distribution detector. In our problem setup, we treat white light images as in-distribution samples and NBI images as out-of-distribution. The SSL and baseline models, and their training as lesion classifiers as described in IV-C, were used in this experiment as well. The training set for the baseline and SSL consisted of 1921 labeled white light images, with the SSL model additionally using 1518 unlabeled WLI images. We used the same hyperparameters as described in IV-D for training the in-distribution models.
During inference, the out-of-distribution detector score κ for the baseline is the KL-divergence between the prediction probabilities and the uniform distribution. For SSL, we add the jigsaw loss to the KL-divergence term to compute κ as described in Equation 4. The test set consists of 416 white light images (label = 0) and 1685 NBI images (label = 1). The OOD scores and the labels are used to generate a Receiver Operating Characteristic (ROC) curve. The Area Under the Receiver Operating Characteristic (AUROC) is then used as a metric to determine the efficacy of the OOD detector. The AUROC can be interpreted as the probability that the OOD score κ for an out-of-distribution sample is greater than that for an in-distribution sample.
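The probabilistic interpretation of AUROC given above can be computed directly by comparing every positive (out-of-distribution) score with every negative (in-distribution) score, counting ties as half. This pairwise sketch is quadratic in the number of samples and is meant only as an illustration; the function name `auroc` is hypothetical.

```python
def auroc(scores, labels):
    """Probability that a positive sample scores higher than a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins for the positive class; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A detector that assigns scores at random yields a value near 0.5 under this definition, while perfect separation of the two groups yields 1.0.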
Figure 5 shows the results for OOD detection. The ROC curve for the model with the median AUROC among three runs is reported. The SSL model has an AUROC of 0.71, compared to 0.53 for the baseline. This shows that the

V. CONCLUSION
In this paper, we explore semi-supervised learning to utilize unlabeled data and improve lesion classification in colonoscopy images. We developed a phased training model using a jigsaw-solving task and observed improved performance in metrics including accuracy and F1 score when compared with a purely supervised model. These data demonstrate that the addition of a jigsaw task helps the encoder generate discriminative features. We find that a semi-supervised learning model performs significantly better than a fully supervised method, especially in the low data regime. These results suggest that unsupervised learning is strongly regularizing the model.
While the focus of semi-supervised learning works has traditionally been on accuracy metrics, in this paper we also study the effect of SSL on the generalizability and uncertainty of the model.In terms of generalizability, we show SSL's superior performance to supervised methods for domain adaptation.SSL improves performance on the target domain, using only unlabeled target distribution images.We also show that SSL models are better out-of-distribution detectors as compared to supervised models.This uncertainty measurement can simply be obtained from the prediction probabilities and jigsaw loss without requiring any architectural modifications.
We would like to emphasize that the point of this study is not to present jigsaw-based semi-supervised learning as the best-in-class model for the accuracy, domain adaptation, or OOD detection problems. Instead, we aim to establish proof-of-concept that adding an auxiliary semi-supervised task to supervised methods can significantly improve colonoscopy image analysis. In medical image analysis in general, the paucity of labeled data makes semi-supervised learning an important paradigm. Additionally, since domain generalization and out-of-distribution detection are important challenges in many practical clinical scenarios, semi-supervised learning holds significant promise to facilitate the translation of artificial intelligence techniques to real-world applications.
Future work to expand and further validate this general approach includes exploring additional semi-supervised learning tasks, such as image colorization and patch prediction, or even a combination of these proxy tasks in a multitask learning setup. To understand the dependence of the supervised objective on the semi-supervised learning proxy, performance on a variety of colonoscopy challenges, such as polyp detection and segmentation, should be evaluated, along with additional proxy tasks. It is possible that the jigsaw task may not be optimal for improving the performance of lesion detection, for instance. The domain adaptation evaluation may be expanded by assessing SSL not only across imaging modalities but also across different endoscopes with varying resolutions, illumination parameters, and frame rates. A deeper analysis of out-of-distribution detection, particularly for different types of out-of-distribution samples and the 'harder' near-distribution anomalies, is an important future step. Lastly, it would be valuable to explore how the SSL improvements change as the sizes of both the labeled and unlabeled datasets increase.

FIGURE 2: Overview of the jigsaw shuffler procedure for generating shuffled images with a pseudolabel for unsupervised learning.

FIGURE 3: Illustrative colonoscopy polyp images showing difference in the WLI & NBI modalities

FIGURE 4: Results comparing semi-supervised learning against baseline as function of fraction of labeled training data.Median and standard deviation for 5-fold cross validation are reported.

TABLE 2: Descriptive statistics comparing the performance of semi-supervised learning against the baseline as a function of labeled data percentage. The median values across 5-fold cross-validation are reported.

TABLE 3: A comparison of Baseline and Jigsaw Pretext Semi-Supervised Learning for Domain Adaptation.