A Study on Self-Supervised Pretraining for Vision Problems in Gastrointestinal Endoscopy

Solutions to vision tasks in gastrointestinal endoscopy (GIE) conventionally use image encoders pretrained in a supervised manner with ImageNet-1k as backbones. However, the use of modern self-supervised pretraining algorithms and a recent dataset of 100k unlabelled GIE images (Hyperkvasir-unlabelled) may allow for improvements. In this work, we study the fine-tuned performance of models with ResNet50 and ViT-B backbones pretrained in self-supervised and supervised manners with ImageNet-1k and Hyperkvasir-unlabelled (self-supervised only) in a range of GIE vision tasks. In addition to identifying the most suitable pretraining pipeline and backbone architecture for each task, out of those considered, our results suggest three general principles. Firstly, self-supervised pretraining generally produces more suitable backbones for GIE vision tasks than supervised pretraining. Secondly, self-supervised pretraining with ImageNet-1k is typically more suitable than pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy. Thirdly, ViT-Bs are more suitable in polyp segmentation and monocular depth estimation in colonoscopy, ResNet50s are more suitable in polyp detection, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation. We hope this work draws attention to the complexity of pretraining for GIE vision tasks, informs the development of more suitable approaches than the convention, and inspires further research on this topic to help advance this development. Code available: https://www.github.com/ESandML/SSL4GIE.


I. INTRODUCTION
Gastrointestinal endoscopy (GIE) is a procedure for screening and treating various digestive disorders that involves the insertion of a thin, flexible tube with a camera and light at the end, known as an endoscope, into either the mouth (gastroscopy) or anus (colonoscopy or sigmoidoscopy) of the patient. The endoscope is then traversed through the gastrointestinal tract as it transmits images of the inner lining to a monitor, where the endoscopist can inspect them for abnormalities and perform any necessary interventions. However, this poses several challenges for the endoscopist, such as the high volume and complexity of visual information, the variability and subtlety of the lesions, and the need for real-time decision making [1].
To help overcome these challenges, computer vision has been identified as offering a promising set of tools for assisting endoscopists with various aspects of data analysis. Such aspects may be framed as traditional computer vision tasks such as image classification, object detection, semantic segmentation, and monocular depth estimation, among others, where the current state-of-the-art solutions for these tasks use deep learning models trained on large amounts of data.

A. RELATED WORK
While large datasets suitable for training models to perform image classification with everyday images exist, most notably the publicly available ImageNet-1k [2] but also the privately held JFT-300M [3], [4] and JFT-3B [5], the datasets available for other computer vision tasks and distributions of images, particularly GIE images [6], are notably smaller. It has become clear that the amount of data a model is trained on has a strong influence on its performance [7], and efforts have therefore been made to identify ways in which the largest available datasets can be leveraged in the training of models for tasks for which these large datasets do not include suitable annotations, and which may involve images of a dissimilar distribution. A now well-established approach [8] is to train (pretrain) an image classifier from random initialisation with the ImageNet-1k dataset (1.2M everyday images), remove the classification layer and add any decoder components required for the intended (downstream) task to the then pretrained image encoder, and train (fine-tune) the resulting model with a dataset which does include suitable annotations for the downstream task. Encoders used in this manner are often referred to as backbones.
The approach of pretraining backbones on image classification with ImageNet-1k may however be limiting for two main reasons. Firstly, the model will learn to make high-level abstractions during pretraining, and since this pretraining is task-specific, these abstractions may not generalise well and may need to be unlearned during fine-tuning. For example, the ground truth class of many images in ImageNet-1k refers to objects in the foreground, and training a model to classify images on this basis may lead to the model learning to pay less attention to the background, which could contain information that is useful for the downstream task. Secondly, image classification datasets require annotations which can be expensive to produce, limiting the degree to which we can leverage more data in pretraining. This is particularly true of GIE images [6], which are especially expensive to annotate, and the use of which in pretraining may be beneficial when the downstream task involves such images.
With the aim of addressing these limitations, a significant amount of research into self-supervised pretraining has been undertaken in recent years, leading to a range of popular algorithms [9]-[19]. Self-supervised pretraining algorithms set task-agnostic objectives that require models to predict targets extracted from the input data, which can allow for the learning of generalisable high-level feature recognition. Additionally, since this paradigm of learning does not require annotations, it provides the potential for leveraging a much larger amount of data and/or data of a more similar distribution to that involved in the downstream task.
A significant amount of research into self-supervised pretraining with everyday images [9]-[19], as well as several modalities of medical images [20]-[27], has now been undertaken. However, it is still the convention in GIE to employ backbones that have been pretrained in a supervised manner with ImageNet-1k. A set of 99,417 unlabelled GIE images (Hyperkvasir-unlabelled) was however included in the recently released Hyperkvasir dataset [28] which, while much smaller than ImageNet-1k, is significantly larger than other datasets of GIE images. This data should allow for the self-supervised pretraining of GIE-specific backbones, which may be better suited to some tasks in GIE than the described convention. Additionally, self-supervised pretraining with datasets of everyday images, e.g. ImageNet-1k, may also provide opportunities for improvements.

B. CONTRIBUTIONS
This paper presents a study on pretraining encoders for use as backbones in solutions to vision tasks in GIE. We consider twelve encoders, each of a ResNet50 [29] or ViT-B [30] architecture and pretrained with one of six pipelines, including two self-supervised pretraining algorithms per architecture, each used separately with both ImageNet-1k and Hyperkvasir-unlabelled, as well as baselines of supervised pretraining with ImageNet-1k and random initialisation (not pretrained). We use state-of-the-art methods for adapting and fine-tuning each encoder for a range of vision tasks in GIE, namely: anatomical landmark recognition, pathological finding characterisation, polyp detection, polyp segmentation, and monocular depth estimation in colonoscopy; and we compare the resulting models on the basis of their fine-tuned performance using well-established metrics. The overall workflow of our experimentation is illustrated in Fig. 1.
In addition to identifying which architecture and pretraining pipeline (algorithm and data) is most suitable for each task, our results suggest that self-supervised pretraining with ImageNet-1k consistently allows for better performance than supervised pretraining with ImageNet-1k, across all considered tasks and architectures. We also demonstrate that self-supervised pretraining with ImageNet-1k is typically more suitable than self-supervised pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy, where the similarity of the pretraining data to the downstream data appears to be more critical than the amount of pretraining data. Additionally, we find that ViT-B backbones are typically more suitable for polyp segmentation and monocular depth estimation in colonoscopy, that ResNet50 backbones are more suitable for polyp detection, and that both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation.
While a number of studies have experimented with self-supervised pretraining for certain GIE vision tasks before [31]-[35], only two [34], [35] have compared self-supervised pretraining against the convention of supervised pretraining with ImageNet-1k. Additionally, in their experiments with GIE vision tasks, these works either compared self-supervised pretraining against supervised pretraining of a different architecture with the same dataset, or the same architecture with a different dataset. Our work is therefore the first to compare self-supervised pretraining against supervised pretraining for the same encoder architecture and pretraining data, in terms of fine-tuned performance on GIE vision tasks. Furthermore, we consider a much wider scope of self-supervised pretraining algorithms and GIE vision tasks than these previous works, each of which focuses on a single task, and are the first that we know of to experiment with self-supervised pretraining for polyp detection and monocular depth estimation in colonoscopy. Beyond the value of these results in isolation, this wide scope allows us to expose the general principles revealed by our analysis.

II. INVESTIGATED SELF-SUPERVISED PRETRAINING ALGORITHMS
Self-supervised algorithms for pretraining image encoders for use as backbones can be grouped into four families [36]:
• Deep metric learning (DML)-based self-supervised pretraining algorithms train an encoder to describe semantically similar images with quantifiably similar representations, and semantically dissimilar images with quantifiably dissimilar representations. This is typically achieved by creating positive pairs, which are distorted variants of the same image, and negative pairs, which are distorted variants of different images, and training the encoder with a contrastive loss that is minimised through a reduction in the distance or angle between the representations of positive pairs, and an increase in the distance or angle between the representations of negative pairs.
• Self-distillation-based self-supervised pretraining algorithms train an encoder to describe a variant of an image with a representation that allows for a representation of a different variant of the image, produced by another encoder, to be predicted. As a means of avoiding collapse, which occurs when both encoders learn to output the same representation for all images, the second encoder is typically an exponential moving average of the encoder being optimised, though collapse can be avoided through a Siamese network with a stop-gradient on one branch [19].
• Canonical correlation analysis (CCA)-based self-supervised pretraining algorithms train an encoder to describe an image in such a way that each feature of its representation is informative of a distinct attribute of the image. This is typically achieved with a loss function that encourages the encoder to maintain a certain amount of variance for each feature in the representation, while establishing uncorrelatedness between features.
• Masked image modelling (MIM)-based self-supervised pretraining algorithms aim to reproduce the success of masked language modelling (MLM) pretraining algorithms, first introduced for pretraining the transformer-based text encoder BERT [37], in the domain of vision. MIM algorithms are therefore typically used with ViT architectures, which are also inspired by BERT, where the image is split into patches that are treated as a sequence of visual tokens akin to the sequence of word tokens used to represent input text for BERT. In both MLM and MIM, input tokens are randomly masked, and a model is trained to reconstruct these tokens based on the information contained in the remaining tokens.
The rest of this section presents the selection of algorithms considered in our experimentation, which we ensured spanned these four families of self-supervised algorithms. We illustrate and provide a definition of the key details of each algorithm, where we use $f_\theta$ to denote the image encoder being optimised for use as a backbone, and explain how we obtained and used encoders pretrained using each algorithm with either ImageNet-1k or Hyperkvasir-unlabelled. Note that any training performed as part of this work was done on an ASUS ESC8000-G4 GPU server with 6× NVIDIA RTX A6000 48GB GPUs. Due to the number of GPUs and amount of memory, the batch sizes used are in multiples of 6 and, for pretraining, are the maximum that we could allow for.

A. MOCO V3
MoCo v3 [14], illustrated in Fig. 2, is the latest iteration of the momentum contrast (MoCo) algorithm, which started as an example of DML. While the distinguishing feature of all iterations of MoCo is the momentum encoder and projector $g_\phi \circ f_\phi$, which is used to compute a representation for one image variant in each pair, rather than using the online encoder and projector $g_\theta \circ f_\theta$ to compute both representations as is more conventional in DML, e.g. SimCLR [9], MoCo v3 incorporates a prediction head $h_\theta$. The resulting algorithm can be framed as either a DML algorithm that incorporates the principle of self-distillation, or a self-distillation algorithm which uses a contrastive loss. As such, we consider MoCo v3 as a representative of both the DML and self-distillation families.
FIGURE 2. Visualisation of the MoCo v3 algorithm. Shown for a per-GPU batch size of 2, and 3 GPUs. We use $g_\theta$ to denote the projector and $h_\theta$ to denote the predictor.
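We define a batch of positive pairs of image variants on a single GPU as $\{(x_{i,1}, x_{i,2})\}_{i=1}^{N_b}$. We then define the representations used by MoCo v3 as:

$$q_{i,j} = h_\theta(g_\theta(f_\theta(x_{i,j}))), \quad i = 1, \dots, N_b \text{ and } j = 1, 2 \tag{1}$$

$$k_{i,j} = g_\phi(f_\phi(x_{i,j})), \quad i = 1, \dots, N_{Gb} \text{ and } j = 1, 2 \tag{2}$$

where $N_{Gb} = N_G N_b$, where $N_G$ is the number of GPUs, and the representations $\{k_{i,j}\}_{i=N_b+1,j=1}^{N_{Gb},2}$ are gathered from the other GPUs (see Fig. 2), where they are computed in the same manner as $\{k_{i,j}\}_{i=1,j=1}^{N_b,2}$ on different image variants, i.e. $\{(x_{i,1}, x_{i,2})\}_{i=N_b+1}^{N_{Gb}}$. The loss function used by MoCo v3 for a batch on a single GPU can then be defined:

$$\mathcal{L}_{MC3} = \frac{2\tau}{N_b} \sum_{i=1}^{N_b} \left[ \mathcal{L}_{INCE}\left(q_{i,1}, \{k_{j,2}\}_{j=1}^{N_{Gb}}\right) + \mathcal{L}_{INCE}\left(q_{i,2}, \{k_{j,1}\}_{j=1}^{N_{Gb}}\right) \right] \tag{3}$$

where $\tau$ is the temperature parameter, a constant positive scalar, and $\mathcal{L}_{INCE}$ is the InfoNCE loss [38], which is defined:

$$\mathcal{L}_{INCE}\left(q_i, \{k_j\}_{j=1}^{N}\right) = -\log \frac{e^{\mathrm{CoSim}(q_i, k_i)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{CoSim}(q_i, k_j)/\tau}} \tag{4}$$

where CoSim is the cosine similarity:

$$\mathrm{CoSim}(u, v) = \frac{u \cdot v}{\lVert u \rVert_2 \lVert v \rVert_2} \tag{5}$$

Note that Fig. 2 can be seen as illustrating

$$\frac{2\tau}{N_b} \sum_{i=1}^{N_b} \mathcal{L}_{INCE}\left(q_{i,1}, \{k_{j,2}\}_{j=1}^{N_{Gb}}\right) \tag{6}$$

whereas $\mathcal{L}_{MC3}$ makes this symmetrical.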
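As an illustration of the contrastive objective described above, the following is a minimal pure-Python sketch of an InfoNCE loss with cosine similarity and its symmetrised form. It is not the official MoCo v3 code. The function names, the list-of-lists data layout, and the temperature value of 0.2 are assumptions made for this example, and the `2 * tau` scaling is assumed from the MoCo v3 reference implementation rather than taken from this text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, tau=0.2):
    """Per-query InfoNCE losses, where keys[i] is the positive for queries[i].

    `keys` may be longer than `queries` (e.g. keys gathered from other GPUs);
    the additional keys act purely as negatives.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine_similarity(q, k) / tau for k in keys]
        # Numerically stable log-sum-exp over all keys (the denominator).
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return losses


def symmetric_contrastive_loss(q1, q2, k1, k2, tau=0.2):
    """Symmetrised loss: queries of variant 1 against keys of variant 2 and
    vice versa, averaged over the local batch and scaled by 2 * tau
    (an assumption based on the MoCo v3 reference implementation)."""
    first = info_nce(q1, k2, tau)
    second = info_nce(q2, k1, tau)
    return 2 * tau * sum(a + b for a, b in zip(first, second)) / len(q1)
```

With queries that align with their positive keys the loss is small, and it grows when the positives are swapped for unrelated keys, which is the behaviour the contrastive objective relies on.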
The algorithm has been designed to work effectively for optimising both ResNet and ViT architectures of $f_\theta$. For ViT architectures, the patch embedding layer is frozen as a random linear projection for stability reasons, and the class token [cls] is taken as the output of $f_\theta$. The projectors $g_\theta$ and $g_\phi$, and the predictor $h_\theta$, are defined as multilayer perceptrons (MLPs) composed of fully connected layers, batch normalisation and ReLU activations.
We consider the use of MoCo v3 for pretraining both ResNet50 and ViT-B architectures, and use the torchvision implementation of ResNet50 and the ViT-B implementation from the official MoCo v3 codebase.For the encoders pretrained using MoCo v3 with ImageNet-1k, we use the weights provided by the authors.We then used the implementation of MoCo v3 in the official codebase to pretrain encoders with Hyperkvasir-unlabelled, modifying the code only for loading Hyperkvasir-unlabelled and to change the batch size from 4096 to 1536/768 (ResNet50/ViT-B).When fine-tuning the ViT-B models, we unfreeze the patch embedding layer.

B. BARLOW TWINS
Barlow Twins [13], illustrated in Fig. 3, is an example of a CCA algorithm. Barlow Twins trains a model to maintain a certain amount of variance for each feature and to establish uncorrelatedness between features with a loss function that encourages an identity empirical cross-correlation matrix between representations of two distorted variants of the same image. Other examples of CCA differ mainly in the loss function. For example, the loss function used by VicReg [16] encourages the variance of features to be maintained, and uncorrelatedness between features to be established, on representations of individual variants of an image directly, as well as minimising the Euclidean distance between representations of two variants of the same image.
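We define a batch of positive pairs of image variants on a single GPU as $\{(x_{i,1}, x_{i,2})\}_{i=1}^{N_b}$. We then define the representations used by Barlow Twins as:

$$z_{i,j} = g_\theta(f_\theta(x_{i,j})), \quad i = 1, \dots, N_b \text{ and } j = 1, 2 \tag{7}$$

which may also be written as $(z_{i,j,k})_{k=1}^{d} = z_{i,j}$. These representations are normalised to give:

$$\hat{z}_{i,j,k} = \frac{z_{i,j,k} - \mu_{j,k}}{\sigma_{j,k}} \tag{8}$$

where $\mu_{j,k}$ and $\sigma_{j,k}$ are the mean and standard deviation of the $k$th feature over the batch for the $j$th variant. The elements of the empirical cross-correlation matrix $(c_{k,l})_{k=1,l=1}^{d,d}$ can then be defined:

$$c_{k,l} = \frac{1}{N_b} \sum_{i=1}^{N_b} \hat{z}_{i,1,k} \, \hat{z}_{i,2,l} \tag{9}$$

which is averaged across GPUs, the result of which we denote $(\bar{c}_{k,l})_{k=1,l=1}^{d,d}$. Finally, the Barlow Twins loss can be defined:

$$\mathcal{L}_{BT} = \sum_{k=1}^{d} \sum_{l=1}^{d} \left[ \mathbb{1}_{k=l} \left(1 - \bar{c}_{k,l}\right)^2 + \lambda \, \mathbb{1}_{k \neq l} \, \bar{c}_{k,l}^{\,2} \right] \tag{10}$$

where $\lambda$ is a constant positive scalar and $\mathbb{1}$ is an indicator function.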
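The steps described above (normalise each feature over the batch, form the empirical cross-correlation matrix, and penalise its deviation from the identity) can be sketched in pure Python as follows. This is a single-device illustration rather than the official implementation: the function names, the use of the population standard deviation, and the value `lam=0.005` are assumptions made for the example, and averaging across GPUs is omitted.

```python
import math


def standardise(z):
    """Normalise each feature to zero mean and unit standard deviation over
    the batch. Assumes no feature is constant across the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for k in range(d):
        column = [row[k] for row in z]
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
        for i in range(n):
            out[i][k] = (z[i][k] - mean) / std
    return out


def cross_correlation(z1, z2):
    """Empirical cross-correlation matrix between two batches of
    representations (rows: features of z1, columns: features of z2)."""
    a, b = standardise(z1), standardise(z2)
    n, d = len(a), len(a[0])
    return [
        [sum(a[i][k] * b[i][l] for i in range(n)) / n for l in range(d)]
        for k in range(d)
    ]


def barlow_twins_loss(c, lam=0.005):
    """Pull the diagonal of c towards 1 and the off-diagonal towards 0."""
    d = len(c)
    on_diagonal = sum((1.0 - c[k][k]) ** 2 for k in range(d))
    off_diagonal = sum(
        c[k][l] ** 2 for k in range(d) for l in range(d) if k != l
    )
    return on_diagonal + lam * off_diagonal
```

For identical variants with uncorrelated features the cross-correlation matrix is the identity and the loss is zero, whereas redundant (perfectly correlated) features are penalised only through the off-diagonal term weighted by `lam`.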
The algorithm has been designed to work effectively for ResNet architectures of $f_\theta$. The projector $g_\theta$ is defined as an MLP composed of fully connected layers, batch normalisation and ReLU activations.
We consider the use of Barlow Twins for pretraining ResNet50 architectures, for which we use the torchvision implementation.For the ResNet50 pretrained using Barlow Twins with ImageNet-1k, we use the weights provided by the authors.We then used the implementation of Barlow Twins in the official codebase to pretrain a ResNet50 with Hyperkvasir-unlabelled, modifying the code only for loading Hyperkvasir-unlabelled and to change the batch size from 2048 to 1536.

C. MAE
Masked autoencoders (MAE) [12], illustrated in Fig. 4, is a particularly popular example of the MIM family. It differs from other popular MIM algorithms on two main fronts.
Firstly, examples such as BEiT [15], PeCo [39], and SimMIM [40] use an arbitrary token in place of the masked tokens in the input to $f_\theta$, whereas MAE simply omits them. Notably, this is only possible with ViTs due to the use of position embeddings that explicitly inform a model of the specific patch of an image that a token corresponds to. For reconstruction, this does however require the insertion of an arbitrary token at each position of a masked token in the output of $f_\theta$, and the processing of the resulting sequence of tokens by a decoder $g_\theta$, which has a smaller ViT architecture. Secondly, BEiT and PeCo use the discrete variational autoencoder introduced as a component of DALL-E [41] to quantise all possible image patches into a finite set of visual tokens akin to a vocabulary of words, rather than directly using the patches as visual tokens. This allows the reconstruction to be framed as classifying which token in this finite set the masked token should be, closely following BERT. MAE however takes a more conventional approach to image reconstruction and frames it as a regression problem.
As is typical for a ViT, an image is first divided into a sequence of flattened non-overlapping patches that are projected by a patch embedding layer and translated by a position embedding to produce the sequence of visual tokens $(v_i)_{i=1}^{N_p}$, which is concatenated with the [cls] token and fed into the first block. Before concatenating with the [cls] token however, MAE generates a set of uniformly distributed random values $\{\alpha_i \sim U(0, 1)\}_{i=1}^{N_p}$ and computes the permutation $\sigma$ which sorts the set into reverse order, i.e. $\alpha_{\sigma(i)} \geq \alpha_{\sigma(i+1)}$ for $i = 1, \dots, N_p - 1$. For a proportion of masking $\gamma \in [0, 1]$, selected to ensure that $\gamma N_p - \lfloor \gamma N_p \rfloor = 0$, the sequence passed forward is then $(v_{\sigma(i)})_{i=1}^{(1-\gamma)N_p}$. In contrast to MIM algorithms that replace rather than omit the masked tokens from the input to $f_\theta$, it is important in MAE that the same number of tokens in each input are masked, i.e. $\gamma N_p$ is constant, to allow for batching. If the sequence of visual tokens, i.e. omitting the [cls] token, in the output of $f_\theta$ is denoted $(u_i)_{i=1}^{(1-\gamma)N_p}$, we then create the sequence $(z_i)_{i=1}^{N_p}$, where:

$$z_{\sigma(i)} = \begin{cases} u_i & \text{if } i \leq (1-\gamma)N_p \\ m & \text{otherwise} \end{cases} \tag{11}$$

where $m$ is a learnt arbitrary token. The tokens in $(z_i)_{i=1}^{N_p}$ are then translated by another position embedding and fed through the decoder blocks with the [cls] token. The output of the decoder blocks is then fed through a prediction head and the [cls] token is removed, leaving the sequence of reconstructed flattened patches for the entire image $(\hat{y}_i)_{i=1}^{N_p}$. Denoting the sequence of ground truth flattened patches $(y_i)_{i=1}^{N_p}$, in which the features have been zero-centred and scaled to unit variance for each patch independently, the loss function is defined:

$$\mathcal{L}_{MAE} = \frac{1}{\gamma N_p d_p} \sum_{i=(1-\gamma)N_p+1}^{N_p} \left\lVert \hat{y}_{\sigma(i)} - y_{\sigma(i)} \right\rVert_2^2 \tag{12}$$

where $d_p$ is the dimensionality of a patch. The loss is then averaged over all images in the batch on a single GPU, and the update to the model is averaged over GPUs, as is typical in distributed supervised learning.
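The masking procedure described above can be illustrated with a short pure-Python sketch: draw one uniform random value per patch, sort the patch indices by those values in descending order, keep a fixed number of tokens, re-insert a mask token at the omitted positions, and compute the reconstruction error on the masked patches only. The function names and data layout are hypothetical, and the choice that the kept tokens are the first ones in the sorted order is an assumption made for this example.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Return (kept, masked) patch indices for MAE-style random masking."""
    num_masked = round(mask_ratio * num_patches)
    if abs(num_masked - mask_ratio * num_patches) > 1e-9:
        raise ValueError("mask_ratio * num_patches must be an integer")
    alphas = [rng.random() for _ in range(num_patches)]
    # sigma sorts the random values into descending order.
    sigma = sorted(range(num_patches), key=lambda i: alphas[i], reverse=True)
    num_kept = num_patches - num_masked
    # Assumption: the first num_kept indices under sigma are the visible ones.
    return sigma[:num_kept], sigma[num_kept:]


def reinsert_mask_tokens(encoded, kept, num_patches, mask_token):
    """Place encoder outputs back at their original positions and fill every
    masked position with the shared mask token."""
    full = [mask_token] * num_patches
    for token, index in zip(encoded, kept):
        full[index] = token
    return full


def masked_mse(predicted, target, masked):
    """Mean squared error averaged over patch features and masked patches."""
    total = 0.0
    for i in masked:
        d_p = len(target[i])
        total += sum((p - y) ** 2 for p, y in zip(predicted[i], target[i])) / d_p
    return total / len(masked)
```

Because every image in a batch keeps the same number of tokens, the visible sequences all have equal length and can be stacked into a single tensor, which is why the masking proportion must give an integer number of masked patches.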
As mentioned, MAE has been designed for pretraining ViT architectures specifically. A notable distinction between the use of ViT in MAE and in MoCo v3 is that the loss is computed on the processed visual tokens in MAE, whereas it is computed on the processed [cls] token in MoCo v3.
We consider the use of MAE for pretraining ViT-B architectures, for which we use the implementation from the official MAE codebase. For the ViT-B pretrained using MAE with ImageNet-1k, we use the weights provided by the authors. We then used the implementation of MAE in the official codebase to pretrain a ViT-B with Hyperkvasir-unlabelled, modifying the code only for loading Hyperkvasir-unlabelled and to change the batch size from 4096 to 768.

III. BASELINES
For each of the considered encoder architectures, ResNet50 and ViT-B, we consider two baselines to compare the discussed self-supervised pretraining pipelines against. Most importantly, we consider supervised pretraining with ImageNet-1k, representing the conventional approach for pretraining image encoders for use as backbones in solutions to GIE vision tasks. We then consider no pretraining, i.e. fine-tuning from random initialisation. We use the torchvision implementation and weights for ResNet50, and the timm implementation and weights for ViT-B.
We note that we do not directly compare against the state-of-the-art methods for each task. While our primary aim is to study the relative effectiveness of different pretraining pipelines, which such comparisons would not be suitable for due to the need for consistency in all other details, we believe that this would still be informative. However, we cannot compare against previously reported results due to the lack of standardisation in the benchmarks, with different works using different splits and different evaluation methodologies, and re-implementing these methods to allow for a direct comparison would be too time-consuming. To the best of our knowledge, the state-of-the-art for each task uses either a convolutional neural network or some derivative of ViT that has been pretrained in a supervised manner with ImageNet-1k as a backbone, and as such we consider models with a ResNet50 or ViT-B backbone that has been pretrained in a supervised manner with ImageNet-1k as representative of the state-of-the-art.

IV. IMAGE CLASSIFICATION
Image classification is the problem of determining which, out of a predefined set of classes, a given image should be assigned to. In the context of GIE, the predefined set of classes may cover, for example, possible anatomical landmarks, pathological findings, or categories of polyps. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in two of these image classification tasks, namely anatomical landmark recognition and pathological finding characterisation.

A. DATA
The data used in our image classification experiments is taken from the Hyperkvasir-labelled dataset [28], which does not share any instances with Hyperkvasir-unlabelled. We specifically used the anatomical landmarks and pathological findings subsets, the classification of which we treated as two separate problems. For each subset, we combined the data for the upper and lower gastrointestinal tract, and applied a random 80%/10%/10% training/validation/test split, where the validation data is used to determine whether to save the weights after each epoch of training on the training data, and the test data is reserved for evaluating the model after fine-tuning. The number of instances of each class, in total and in each split, is given in Table 1.

B. DECODERS
In image classification, it is typical to simply add a linear classifier to the final representation computed by an encoder to allow for prediction. Following convention, we implement this as a fully connected layer that maps the final representation to a vector of logits, one for each possible class, which is softmax normalised prior to computation of the loss. For the ViT-B models, we use the output [cls] token as the final representation.

C. FINE-TUNING PROCEDURE
We separately train each model to perform both anatomical landmark recognition and pathological finding characterisation through the same procedure. We use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. The loss is then computed using a cross entropy loss function which, due to the significant class imbalance in the data, is weighted with a value of $N_D/(N_i N_c)$ for the $i$th class, where $N_D$ is the total number of images in the dataset, $N_i$ is the number of images in the $i$th class, and $N_c$ is the number of classes. Note that these numbers are for the entire dataset, rather than the training set. This weighting ensures that the total sum of weights across all instances is $N_D$, for consistency with unweighted cross entropy. We use the macro F1-score (mF1) as the validation metric:

$$\mathrm{mF1} = \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i + \epsilon} \tag{13}$$

where $TP_i$ is the number of true positives for the $i$th class, $FP_i$ is the number of false positives, $FN_i$ is the number of false negatives, and $\epsilon = 10^{-8}$. The transformations applied to the validation images include the same resizing and normalisation applied to the training images. Finally, the model is trained on this basis for 50 epochs, with the parameters saved after each epoch that leads to an improvement in mF1 on the validation set, with any batch normalisation synchronised across GPUs.
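The class weighting described above can be checked with a few lines of Python. The sketch below computes the weight $N_D/(N_i N_c)$ for each class and verifies that the weights, summed over all instances, equal $N_D$. The function name and the class counts in the usage note are hypothetical and are not taken from Table 1.

```python
def class_weights(class_counts):
    """Weight N_D / (N_i * N_c) for each class i, given per-class counts."""
    n_d = sum(class_counts)   # total number of images in the dataset
    n_c = len(class_counts)   # number of classes
    return [n_d / (n_i * n_c) for n_i in class_counts]


def total_instance_weight(class_counts):
    """Sum of the weights over all instances; equals N_D by construction,
    since each class contributes N_i * N_D / (N_i * N_c) = N_D / N_c."""
    weights = class_weights(class_counts)
    return sum(w * n_i for w, n_i in zip(weights, class_counts))
```

For example, with illustrative counts of 10, 30 and 60 images, the rarest class receives the largest weight (about 3.33), and every class contributes the same total weight of $N_D/N_c$ to the loss.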

D. EVALUATION
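We evaluate the resulting image classification models using the corresponding test data, which is pre-processed in the same manner as the validation data, with four metrics, namely mF1 (as defined in (13)), mPrecision, mRecall, and Accuracy:

$$\mathrm{mPrecision} = \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{TP_i}{TP_i + FP_i + \epsilon} \tag{14}$$

$$\mathrm{mRecall} = \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{TP_i}{TP_i + FN_i + \epsilon} \tag{15}$$

$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{N_c} TP_i}{\sum_{i=1}^{N_c} \left(TP_i + FN_i\right)} \tag{16}$$

where $\epsilon = 10^{-8}$. For all metrics, a higher value indicates better performance. The results for anatomical landmark recognition are presented in Table 4 and the results for pathological finding characterisation are presented in Table 5.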
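As a worked illustration of these macro-averaged metrics, the following sketch derives per-class true positives, false positives and false negatives from a confusion matrix and averages the per-class scores. The function name and the confusion-matrix layout are assumptions made for this example, as is the placement of $\epsilon$ in the denominators only, where it guards against division by zero for classes with no predictions or no instances.

```python
def macro_metrics(confusion, eps=1e-8):
    """Macro-averaged F1, precision and recall, plus accuracy.

    confusion[t][p] is the number of instances of true class t that were
    predicted as class p.
    """
    n_c = len(confusion)
    n_total = sum(sum(row) for row in confusion)
    f1 = precision = recall = 0.0
    correct = 0
    for i in range(n_c):
        tp = confusion[i][i]
        fp = sum(confusion[t][i] for t in range(n_c)) - tp
        fn = sum(confusion[i]) - tp
        f1 += 2 * tp / (2 * tp + fp + fn + eps)
        precision += tp / (tp + fp + eps)
        recall += tp / (tp + fn + eps)
        correct += tp
    return {
        "mF1": f1 / n_c,
        "mPrecision": precision / n_c,
        "mRecall": recall / n_c,
        "Accuracy": correct / n_total,
    }
```

Because each class contributes equally to the macro averages regardless of its size, these metrics are less dominated by the majority classes than accuracy is, which matters for imbalanced data.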

[Table fragment. Columns: Operation; Image classification; Object detection; Semantic segmentation; Monocular depth estimation. First listed operation: 1) Pad to square.]

V. OBJECT DETECTION
Object detection is the problem of recognising and locating any objects of interest in an image. In the context of GIE, the objects of interest may be polyps, tools, artefacts, or disease. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in polyp detection specifically.

A. DATA
The data used in our object detection experiments is taken from the Kvasir-SEG dataset [45], which does not share any instances with Hyperkvasir-unlabelled. The dataset includes 1000 GIE images, each of which shows at least one polyp and is paired with both a set of bounding boxes, specifying the location and the horizontal and vertical dimensions of any polyps in the image, and a binary segmentation map indicating which pixels correspond to a polyp and which do not. While the segmentation maps were used in our semantic segmentation experiments, here we use the sets of bounding boxes. We applied a random 80%/10%/10% training/validation/test split, where the validation data is used to determine whether to save the weights after each epoch of training on the training data, and the test data is reserved for evaluating the model after fine-tuning.
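A random 80%/10%/10% split of this kind can be sketched as below. This is only an illustration: the function name, the fixed seed, the rounding of the split sizes, and the absence of any stratification are assumptions, and the text does not state how the split was actually drawn.

```python
import random


def split_indices(num_instances, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition instance indices into training, validation and
    test subsets. The test subset receives whatever remains after the
    training and validation sizes have been rounded."""
    indices = list(range(num_instances))
    random.Random(seed).shuffle(indices)
    num_train = round(fractions[0] * num_instances)
    num_val = round(fractions[1] * num_instances)
    train = indices[:num_train]
    val = indices[num_train:num_train + num_val]
    test = indices[num_train + num_val:]
    return train, val, test
```

For the 1000 images described above, this yields 800 training, 100 validation and 100 test instances, with no instance shared between subsets.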

B. DECODERS
For our object detection experiments, we implemented the listed backbones within a Faster R-CNN pipeline [46] with feature pyramid network (FPN) [47], for which we used the torchvision implementation. We used the existing implementation of the pipeline with a ResNet50 backbone, specifying that all layers of the backbone should be trainable.
For the ViT-B models, based on previous analyses of using ViT backbones in object detection [48], [49], we first modified the encoders to efficiently process larger image sizes by bilinearly interpolating the position embeddings and using non-overlapping window self-attention in all but the 3rd, 6th, 9th, and 12th blocks. Window attention, also known as restricted attention [50], independently applies attention to subsets of the sequence of visual tokens, where each subset corresponds to the tokens in a square window of the equivalent feature map, with no overlapping windows. We used 256 tokens in each subset, corresponding to a 16 × 16 window of a feature map. We then modified the Faster R-CNN pipeline to use the resulting encoders as backbones with a ViTDet FPN [49].
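To make the window partitioning concrete, the sketch below groups the flat token indices of a feature map into non-overlapping square windows and flags which blocks would use global rather than windowed attention. The function names are hypothetical, and the 64 × 64 token grid in the usage note assumes a 1024 × 1024 input with a patch size of 16, which is an assumption for illustration rather than a detail stated here.

```python
def window_partition(grid_height, grid_width, window):
    """Group flat token indices (row-major order) into non-overlapping
    window x window subsets of the feature map."""
    if grid_height % window or grid_width % window:
        raise ValueError("grid dimensions must be divisible by the window size")
    windows = []
    for top in range(0, grid_height, window):
        for left in range(0, grid_width, window):
            windows.append([
                row * grid_width + col
                for row in range(top, top + window)
                for col in range(left, left + window)
            ])
    return windows


def uses_global_attention(block_number, global_blocks=(3, 6, 9, 12)):
    """True for the blocks that attend over all tokens (1-indexed)."""
    return block_number in global_blocks
```

Under these assumptions, `window_partition(64, 64, 16)` gives 16 windows of 256 tokens each, so self-attention in a windowed block operates on sequences of 256 tokens instead of 4096, while the four global blocks still allow information to propagate between windows.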

C. FINE-TUNING PROCEDURE
In the fine-tuning of both model architectures, we use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. For the ResNet50 models, using the default pre-processing pipeline for the Faster R-CNN implementation, the images in a batch are then each resized with bilinear interpolation to a scale of $\min(800/\min(h, w), 1333/\max(h, w))$ of the original size $h \times w$, and then padded to $H \times W$, where $H$ is the maximum height of the resized images and $W$ is the maximum width of the resized images across the batch. For the ViT-B models, inspired by a previous analysis [48], the images are padded to 1024 × 1024. Since several images in the dataset have a height or width larger than 1024, these images are downsampled to half the resolution using bicubic interpolation with anti-aliasing before padding. Transformations are also applied to the bounding boxes in accordance with any spatial transformations applied to the image. The usual multi-task loss function for the Faster R-CNN pipeline is used to compute the loss, and we use AP@[.5:.95] as the validation metric for predicted bounding boxes that have a confidence score ≥0.05:

$$\mathrm{AP@[.5:.95]} = \frac{1}{10} \sum_{t \in T} \mathrm{AP@}t \tag{17}$$

where $T = \{0.5, 0.55, \dots, 0.95\}$ is the set of intersection over union (IoU) thresholds and AP@$t$ is the average precision at IoU threshold $t$. We compute AP@$t$ by first ranking all predicted bounding boxes with respect to the confidence score, from high to low. We then step through the predicted bounding boxes in rank order and assign the prediction to the true positives if it has an IoU with a target bounding box for the same image that is greater than the IoU threshold, and otherwise assign it to the false positives. At each rank, we then compute the precision and recall using the cumulative number of true positives and false positives and the total number of false negatives. We then determine a strictly monotonically increasing sequence of recall values $(r_i)_{i=1}^{N_r}$, with $r_1 = 0$, $r_{N_r} = 1$, and $(r_i)_{i=2}^{N_r-1}$ being the recall values (excluding 0 and 1) for ranks where false positives and resulting drops in the precision occur, and AP@$t$ is then:

$$\mathrm{AP@}t = \sum_{i=2}^{N_r} \left(r_i - r_{i-1}\right) p(r_i) \tag{18}$$

where $p(r_i)$ is the maximum precision value out of those which correspond to $r_i$, for $i = 2, \dots, N_r$. The transformations applied to the validation images include the same resizing and/or padding and normalisation applied to the training images, and a batch size of 1 is used to ensure the evaluation of a ResNet50 model on a particular instance is not influenced by other images in a batch (through the padding to $H \times W$). Finally, the model is trained on this basis for 200 epochs, with the parameters saved after each epoch that leads to an improvement in AP@[.5:.95] on the validation set, with any batch normalisation synchronised across GPUs.
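The AP computation described above can be sketched in pure Python as follows. This is an illustrative reading of the procedure, not the evaluation code used in the experiments. The function names and data layout are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, each target is assumed to be matched by at most one prediction, and the precision at a recall of 1 is taken to be 0 when that recall is never reached.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def label_predictions(predictions, targets, threshold):
    """TP/FP flags in descending confidence order.

    predictions: list of (image_id, confidence, box).
    targets: dict mapping image_id to a list of boxes.
    """
    matched = {img: [False] * len(boxes) for img, boxes in targets.items()}
    flags = []
    for img, _, box in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_index, best_iou = None, threshold
        for j, target in enumerate(targets.get(img, [])):
            overlap = iou(box, target)
            if not matched[img][j] and overlap > best_iou:
                best_index, best_iou = j, overlap
        if best_index is not None:
            matched[img][best_index] = True
        flags.append(best_index is not None)
    return flags


def average_precision(flags, num_targets):
    """Sum of (r_i - r_{i-1}) * p(r_i) over recall values 0, those where
    false positives occur, and 1; p(r) is the best precision at recall r."""
    tp = fp = 0
    best_precision, fp_recalls = {}, set()
    for is_tp in flags:
        tp, fp = tp + is_tp, fp + (not is_tp)
        recall = tp / num_targets
        best_precision[recall] = max(best_precision.get(recall, 0.0),
                                     tp / (tp + fp))
        if not is_tp and 0.0 < recall < 1.0:
            fp_recalls.add(recall)
    recalls = [0.0] + sorted(fp_recalls) + [1.0]
    return sum((r - r_prev) * best_precision.get(r, 0.0)
               for r_prev, r in zip(recalls, recalls[1:]))


def ap_50_95(predictions, targets):
    """Mean of AP@t over the ten IoU thresholds 0.5, 0.55, ..., 0.95."""
    num_targets = sum(len(boxes) for boxes in targets.values())
    thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    return sum(
        average_precision(label_predictions(predictions, targets, t), num_targets)
        for t in thresholds
    ) / len(thresholds)
```

As a small check of the arithmetic, a ranking of true positive, false positive, true positive, true positive against three targets gives recall values 0, 1/3 and 1 with best precisions 1 and 0.75, so AP@$t$ is 1/3 × 1 + 2/3 × 0.75 ≈ 0.833.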

D. EVALUATION
We evaluate the resulting object detection models using the test data, which is pre-processed in the same manner as the validation data, with AP@[.5:.95] (AP for conciseness), AP@.5 (AP$_{50}$), and AP@.75 (AP$_{75}$) computed for predicted bounding boxes with a confidence score ≥ 0.05. For all metrics, a higher value indicates better performance. The results are presented in Table 6, and some examples for predicted bounding boxes with a confidence score ≥ 0.5 are shown in Fig. 5.

VI. SEMANTIC SEGMENTATION
Semantic segmentation is the problem of determining which, out of a predefined set of classes, each pixel in an image should be assigned to. In the context of GIE, the predefined set of classes will typically include a background class that accounts for anything that is not of interest, as well as any classes that are of interest, for example, polyps, tools, artefacts, or disease. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in polyp segmentation specifically, which is notably a binary segmentation problem.

B. DECODERS
For our semantic segmentation experiments, we used the listed ResNet50 backbones with a DeepLabV3+ [52] decoder, using the segmentation-models-pytorch implementation. We then used the ViT-B backbones with the segmentation variant of the dense prediction transformer (DPT) [53] decoder, using the implementation provided in the official codebase.

C. FINE-TUNING PROCEDURE
We separately train each model to perform polyp segmentation with each dataset through the same procedure. We use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. Transformations are also applied to the segmentation maps in accordance with any spatial transformations applied to the image. The loss is then computed using the Dice loss function [54], and we use mDice as the validation metric:

$$\mathrm{mDice} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{2\,\mathrm{TP}_i + \epsilon}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i + \epsilon} \tag{19}$$

where $N_e$ is the number of instances in the validation/test set, $\mathrm{TP}_i$ is the number of true positives for the $i$th image, $\mathrm{FP}_i$ is the number of false positives, $\mathrm{FN}_i$ is the number of false negatives, and $\epsilon = 10^{-8}$. The transformations applied to the validation images include the same resizing and normalisation applied to the training images, with the validation maps also resized to $224 \times 224$. Finally, the model is trained on this basis for 200 epochs, with the parameters saved after each epoch that leads to an improvement in mDice on the validation set, with any batch normalisation synchronised across GPUs.

D. EVALUATION
We evaluate the resulting semantic segmentation models using the corresponding test data, where the images are pre-processed in the same manner as the validation images, but the segmentation maps are left at their original size.
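The predictions are therefore resized to this original size using bilinear interpolation prior to binarisation. We then use four metrics, namely the mDice (as defined in (19)), mIoU, mPrecision, and mRecall:

$$\mathrm{mIoU} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i + \epsilon}$$

$$\mathrm{mPrecision} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FP}_i + \epsilon}$$

$$\mathrm{mRecall} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FN}_i + \epsilon}$$

For all metrics, a higher value indicates better performance. The results for Kvasir-SEG are presented in Table 7 and the results for CVC-ClinicDB are presented in Table 8. Examples for Kvasir-SEG are shown in Fig. 6.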
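As an illustration of how these per-image metrics are averaged across a test set, the following is a minimal sketch operating on flattened binary masks (lists of 0/1 values). The helper names are hypothetical, and adding the smoothing term ε to both the numerator and the denominator is an assumption rather than a detail confirmed by the text.

```python
EPS = 1e-8  # smoothing term; its placement in each ratio is an assumption


def confusion_counts(prediction, target):
    """Count true positives, false positives, and false negatives for one image."""
    tp = sum(1 for p, t in zip(prediction, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(prediction, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(prediction, target) if p == 0 and t == 1)
    return tp, fp, fn


def mean_segmentation_metrics(predictions, targets, eps=EPS):
    """Compute each metric per image, then take the mean across images."""
    totals = {"mDice": 0.0, "mIoU": 0.0, "mPrecision": 0.0, "mRecall": 0.0}
    for prediction, target in zip(predictions, targets):
        tp, fp, fn = confusion_counts(prediction, target)
        totals["mDice"] += (2 * tp + eps) / (2 * tp + fp + fn + eps)
        totals["mIoU"] += (tp + eps) / (tp + fp + fn + eps)
        totals["mPrecision"] += (tp + eps) / (tp + fp + eps)
        totals["mRecall"] += (tp + eps) / (tp + fn + eps)
    return {name: value / len(predictions) for name, value in totals.items()}
```

Because the metrics are computed per image before averaging, a single badly segmented image affects the mean in proportion to one image, not in proportion to its pixel count.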

VII. MONOCULAR DEPTH ESTIMATION
Monocular depth estimation is the problem of predicting, for every pixel in an image, the length of the ray of light corresponding to that pixel between the camera and the object from which the ray originates. However, since the absolute scale of the scene can only be determined from the parallax observed with a second view, the problem is inherently ill-posed and only relative scale can be determined. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in monocular depth estimation in colonoscopy.

A. DATA
The data used for our depth estimation experiments is taken from the C3VD dataset [55], the only dataset that we know of which includes images captured with a clinical GIE camera (colonoscope, specifically) with paired ground truth depth maps. The dataset was collected by recording segments (sigmoid, descending, transverse, ascending, and cecum) of a high-fidelity 3D silicone phantom colon model with varying textures, emulating different patient-specific tissue features and vasculature patterns at varying optical depths, and varying illumination modes with a clinical colonoscope. Views of an equivalent 3D virtual colon model were then registered with key frames of the resulting videos, allowing for the rendering of a ground truth depth map for each frame, as well as a surface normal, optical flow, and occlusion map. Each video is also paired with ground truth camera pose, surface model, and coverage map. A total of 22 videos were recorded, with variation in the segment, camera pose, textures, and illumination, amounting to 10015 frames. We selected 18 videos (8610 frames) for training, 2 videos (977 frames) for validation, and 2 videos (528 frames) for testing, where the validation and test sets each include one randomly sampled video of the cecum and one randomly sampled video of the transverse segment, since the majority of videos were of one of these segments (8 of cecum and 9 of transverse).

B. DECODERS
For our monocular depth estimation experiments, we used the listed ViT-B backbones with the depth estimation variant of the dense prediction transformer (DPT) [53] decoder, using the implementation provided in the official codebase. Since there is no clear precedent for a decoder architecture for ResNet50-based depth estimation, we designed our own. (Popular dense prediction architectures that adopt certain details of ResNets in their design and which may be suitable for depth estimation, such as ResUNet [56] or ResUNet++ [57], do not actually use a ResNet encoder.) This decoder, designed to mirror the architecture of ResNet50, has three fusion levels. The first starts with the final feature maps output by a ResNet50 and halves the number of channels with a 1 × 1 convolutional layer followed by batch normalisation, before upsampling the resulting feature maps to twice the resolution with bilinear interpolation and concatenating them with the feature maps output by the previous level of the ResNet50. The concatenated features are then processed by three blocks that have the same design as the blocks used in each level of ResNet50. The second and third levels of the decoder follow the same logic as the first, except that they start with the output of the previous level of the decoder. A prediction head, which has the same design as the prediction head used in the depth estimation variant of the DPT decoder, is then used to predict a depth map from the output of the third level.

C. FINE-TUNING PROCEDURE
In the fine-tuning of both model architectures, we use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. Transformations are also applied to the depth maps in accordance with any spatial transformations applied to the image, with absolute depth values scaled to [0, 1]. The loss is then computed using the scale- and shift-invariant (SSI) mean squared error (MSE) [58] with a multiscale shift-invariant gradient matching term [59], which is computed only on the pixels that are covered by the lens (the corners are not covered; see examples in Fig. 7), and we use the mSSI-MSE for pixels covered by the lens as the validation metric:

$$\mathrm{mSSI\text{-}MSE} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{1}{N_v} \sum_{j=1}^{N_v} \left( s_i \hat{y}_{i,j} + t_i - y_{i,j} \right)^2$$

where $N_e$ is again the number of instances in the validation set, $N_v$ is the number of pixels covered by the lens in an image, $\hat{y}_{i,j}$ is the output value for the $j$th pixel covered by the lens in the $i$th image, $y_{i,j}$ is the corresponding target value, and $s_i$ and $t_i$ are the scale and shift computed using the closed form solution to the standard least squares problem:

$$\mathbf{h}_i = \arg\min_{\mathbf{h}} \sum_{j=1}^{N_v} \left( \vec{\hat{y}}_{i,j}^{\top} \mathbf{h} - y_{i,j} \right)^2 = \left( \sum_{j=1}^{N_v} \vec{\hat{y}}_{i,j} \vec{\hat{y}}_{i,j}^{\top} \right)^{-1} \left( \sum_{j=1}^{N_v} \vec{\hat{y}}_{i,j}\, y_{i,j} \right)$$

where $\mathbf{h}_i = (s_i, t_i)^{\top}$ and $\vec{\hat{y}}_{i,j} = (\hat{y}_{i,j}, 1)^{\top}$. The transformations applied to the validation images include the same padding, resizing, and normalisation applied to the training images, with the validation maps also padded and resized to $224 \times 224$ and depth values scaled to [0, 1]. Finally, the model is trained on this basis for 50 epochs, with the parameters saved after each epoch that leads to an improvement in mSSI-MSE on the validation set, with any batch normalisation synchronised across GPUs.
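To make the scale and shift alignment concrete, the following is a minimal sketch of the closed form least squares solution for a single image, written out for the 2 × 2 normal equations and restricted to pixels covered by the lens. It operates on flattened lists; the function names and the boolean `lens_mask` argument are hypothetical, and the per-image error shown assumes a plain mean of squared residuals.

```python
def scale_and_shift(prediction, target, lens_mask):
    """Solve min over (s, t) of sum_j (s * y_hat_j + t - y_j)^2 on covered pixels."""
    pairs = [(p, y) for p, y, m in zip(prediction, target, lens_mask) if m]
    a00 = sum(p * p for p, _ in pairs)  # sum of y_hat^2
    a01 = sum(p for p, _ in pairs)      # sum of y_hat
    a11 = float(len(pairs))             # number of covered pixels
    b0 = sum(p * y for p, y in pairs)   # sum of y_hat * y
    b1 = sum(y for _, y in pairs)       # sum of y
    det = a00 * a11 - a01 * a01         # zero if all covered predictions are equal
    scale = (a11 * b0 - a01 * b1) / det
    shift = (a00 * b1 - a01 * b0) / det
    return scale, shift


def ssi_mse(prediction, target, lens_mask):
    """Mean squared error on covered pixels after aligning scale and shift."""
    scale, shift = scale_and_shift(prediction, target, lens_mask)
    residuals = [scale * p + shift - y
                 for p, y, m in zip(prediction, target, lens_mask) if m]
    return sum(r * r for r in residuals) / len(residuals)
```

A prediction that differs from the target only by an affine transformation therefore scores an error of (numerically) zero, which is the sense in which the metric is scale- and shift-invariant.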

D. EVALUATION
We evaluate the resulting monocular depth estimation models using the test data, where the images are pre-processed in the same manner as the validation images. We load two target depth maps for each image: one that is pre-processed in the same manner as the validation depth maps, for computing the scale and shift for pixels covered by the lens, and one left at the original size and scale ([0cm, 10cm]), for computing the performance. We compute and apply the scale and shift for the prediction, then resize the result to $\max(h, w) \times \max(h, w)$, where $h$ and $w$ are the height and width of the original image, crop to $h \times w$ to remove values for padded pixels, clip values to [0, 1], set any values for pixels not covered by the lens to 0, and scale the resulting values to [0cm, 10cm]. We then use the three metrics used in the SimCol3D challenge [60], namely the arithmetic mean across the test set of the root MSE (mRMSE), the median relative absolute error (mMRAE), and the mean absolute error (mMAE), which are only applied to pixels covered by the lens:

$$\mathrm{mRMSE} = \frac{1}{N_e} \sum_{i=1}^{N_e} \sqrt{\frac{1}{N_V} \sum_{j=1}^{N_V} \left( \hat{y}_{i,j} - y_{i,j} \right)^2}$$

$$\mathrm{mMRAE} = \frac{1}{N_e} \sum_{i=1}^{N_e} \operatorname{median}_{j} \left( \frac{\left| \hat{y}_{i,j} - y_{i,j} \right|}{y_{i,j}} \right)$$

$$\mathrm{mMAE} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{1}{N_V} \sum_{j=1}^{N_V} \left| \hat{y}_{i,j} - y_{i,j} \right|$$

where $N_V$ is the number of pixels covered by the lens in an image at its original size, $\hat{y}_{i,j}$ is the value in the post-processed prediction for the $j$th pixel covered by the lens in the $i$th image at its original size, and $y_{i,j}$ is the corresponding target value. For all metrics, a lower value indicates better performance. The results are presented in Table 9, and some examples are shown in Fig. 7 with corresponding error maps shown in Fig. 8 to help visualise the differences.
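The post-processing and error computation described above can be sketched as follows, under simplifying assumptions: depth maps are flattened lists, the resize and crop steps are omitted, the scale and shift are supplied by the caller, and target depths on covered pixels are strictly positive. All function and argument names are hypothetical.

```python
import math
import statistics


def postprocess(prediction, scale, shift, lens_mask, max_depth_cm=10.0):
    """Align, clip to [0, 1], zero uncovered pixels, and rescale to centimetres."""
    output = []
    for value, covered in zip(prediction, lens_mask):
        aligned = min(max(scale * value + shift, 0.0), 1.0)
        output.append(aligned * max_depth_cm if covered else 0.0)
    return output


def image_errors(prediction_cm, target_cm, lens_mask):
    """RMSE, median relative absolute error, and MAE over covered pixels."""
    pairs = [(p, y) for p, y, m in zip(prediction_cm, target_cm, lens_mask) if m]
    abs_errors = [abs(p - y) for p, y in pairs]
    rmse = math.sqrt(sum(e * e for e in abs_errors) / len(pairs))
    mrae = statistics.median(abs(p - y) / y for p, y in pairs)
    mae = sum(abs_errors) / len(pairs)
    return rmse, mrae, mae


def mean_errors(predictions_cm, targets_cm, lens_masks):
    """Arithmetic mean of each per-image error across the test set."""
    per_image = [image_errors(p, y, m)
                 for p, y, m in zip(predictions_cm, targets_cm, lens_masks)]
    return tuple(sum(values) / len(per_image) for values in zip(*per_image))
```

`mean_errors` returns the triple (mRMSE, mMRAE, mMAE) in that order.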

VIII. ANALYSIS
The results presented in the previous sections primarily provide an indication of the ranking of the pretraining pipelines for each considered GIE vision task. Notably, there is some variation in this ranking, as illustrated in Fig. 9; however, the ViT-B encoder pretrained with MAE and ImageNet-1k most consistently allows for either the best, or highly competitive, downstream performance. Beyond this identification, these results provide evidence for more general principles regarding the pretraining of encoders for use as backbones in solutions to GIE vision tasks, which we reveal through the analysis presented in this section.

First, we demonstrate that self-supervised pretraining is generally more suitable than supervised pretraining. To assess this, we evaluate the relative improvement of each model that uses a backbone pretrained in a self-supervised manner with ImageNet-1k vs. the equivalent model (same architecture and task) that uses a backbone pretrained in a supervised manner with ImageNet-1k. To compute the relative improvement, we consider the primary metric for each task as mF1 (image classification), AP (object detection), mDice (semantic segmentation), and mRMSE (depth estimation), as defined in the discussion of each task. Then, for all but mRMSE, we take the absolute difference between the result and a perfect score of 1, in order to convert each score (higher is better) to a measure of error (lower is better). We do not do this for mRMSE since it is already a measure of error. We then compute the relative improvement using:

$$\%\mathrm{Improvement} = \frac{\delta_{SL} - \delta_{SSL}}{\delta_{SL}} \times 100$$

where $\delta_{SSL}$ is the error for a model with a backbone pretrained in a self-supervised manner and $\delta_{SL}$ is the error for an equivalent model (same architecture, pretraining data, and task) with a backbone pretrained in a supervised manner. The results of this analysis are visualised in Fig. 10.

Second, we compare self-supervised pretraining with Hyperkvasir-unlabelled against self-supervised pretraining with ImageNet-1k. To assess this, we use the same measures of error and evaluate the relative improvement using:

$$\%\mathrm{Improvement} = \frac{\delta_{IN} - \delta_{HK}}{\delta_{IN}} \times 100$$

where $\delta_{HK}$ is the error for a model with a backbone pretrained with Hyperkvasir-unlabelled and $\delta_{IN}$ is the error for an equivalent model (same architecture, pretraining algorithm, and task) with a backbone pretrained with ImageNet-1k.
Note that this analysis omits any results for supervised pretraining or no pretraining. We visualise the results of this analysis in Fig. 11, where it can be seen that self-supervised pretraining with ImageNet-1k generally provides better performance than self-supervised pretraining with Hyperkvasir-unlabelled, with exceptions including the anatomical landmark recognition models with MAE pretrained backbones, as well as all monocular depth estimation models. While the result for the anatomical landmark recognition models with MAE pretrained backbones shows only a marginal improvement for pretraining with Hyperkvasir-unlabelled vs. ImageNet-1k, the improvements for the depth estimation models are more substantial. This implies that the similarity of the pretraining data to the data used in the depth estimation experiments is much more critical than the amount of pretraining data, in comparison to other tasks. While this finding is significant for the development of solutions to vision tasks in GIE, it may have broader implications, and further work may find this to be true for monocular depth estimation in general.
Finally, we demonstrate that models with a ViT-B backbone are generally better than models with a ResNet50 backbone in polyp segmentation and monocular depth estimation in colonoscopy, generally worse in polyp detection, and generally similar in image classification. To assess this, we use the same measures of error used in the previous analyses and evaluate the relative improvement from using a ViT-B vs. a ResNet50 using:

$$\%\mathrm{Improvement} = \frac{\delta_{RN} - \delta_{VT}}{\delta_{RN}} \times 100$$

where $\delta_{VT}$ is the error for a model with a ViT-B backbone and $\delta_{RN}$ is the error for an equivalent model (same pretraining pipeline and task) with a ResNet50 backbone.
Note that this analysis omits any results for pretraining with Barlow Twins or MAE. We visualise the results of this analysis in Fig. 12, where it can be seen that the ResNet50 and ViT-B models perform similarly in anatomical landmark recognition and pathological finding characterisation, that the ResNet50 models perform better than the ViT-B models in polyp detection, and that the ViT-B models generally perform better in the dense prediction tasks of polyp segmentation and monocular depth estimation in colonoscopy. We further demonstrate the advantage of the ViT-B models over the ResNet50 models in dense prediction by visualising the distribution of performance across the Kvasir-SEG, CVC-ClinicDB, and C3VD test sets in Fig. 13, Fig. 14, and Fig. 15, respectively. Such visualisations are only suitable for these experiments since the metrics measure the performance on each instance prior to averaging, which is not the case for our image classification or object detection experiments. While we observe that ResNet50 models are typically better on polyp detection, we note that the polyp detection model with a ViT-B backbone pretrained with MAE and ImageNet-1k performs better than all but two models with ResNet50 backbones with respect to AP, and performs best with respect to AP$_{50}$, further emphasising the particular robustness of this pretraining pipeline. There is still much to understand about the relative strengths and weaknesses of these architectures, particularly in the context of domains where the availability of data is much lower than that of everyday images, such as GIE. However, these results provide useful insights into which architecture may be better suited to each considered task.
One final note we make is that, as expected, pretraining with any of the considered pipelines consistently leads to better fine-tuned performance than training on the downstream task from random initialisation.
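The relative improvement used throughout this section can be sketched as below, assuming it is the percentage reduction in error relative to the baseline model. The function names are hypothetical and the scores in the usage note are illustrative values, not results from this study.

```python
def to_error(result, higher_is_better=True):
    """Convert a score in [0, 1] to an error; mRMSE is already an error."""
    return abs(1.0 - result) if higher_is_better else result


def relative_improvement(error_new, error_baseline):
    """Percentage reduction in error of a model relative to its baseline."""
    return 100.0 * (error_baseline - error_new) / error_baseline
```

For example, a hypothetical mDice of 0.92 against a baseline mDice of 0.90 corresponds to errors of 0.08 and 0.10, and `relative_improvement(to_error(0.92), to_error(0.90))` gives an improvement of approximately 20%. A negative value indicates that the baseline performs better.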

IX. CONCLUSION
In this work, we studied the pretraining of image encoders for use as backbones in solutions to vision tasks in GIE, considering variation in encoder architecture, pretraining pipeline (data and algorithm), and downstream task. This was motivated by recent opportunities to improve on the convention of supervised pretraining of backbones on image classification with ImageNet-1k, namely modern self-supervised pretraining algorithms and Hyperkvasir-unlabelled, a relatively large dataset of unlabelled GIE images. We primarily identified the best pretraining pipeline and architecture, out of those considered, for each considered task by adapting the encoders to the tasks with state-of-the-art decoders, fine-tuning the resulting models on datasets that include suitable annotations for the tasks, and evaluating the performance on test sets with well-established metrics. Overall, we found that a ViT-B backbone pretrained using the MAE algorithm and ImageNet-1k was most robust. Additionally, our findings suggest three general principles regarding the pretraining of encoders for use as backbones in solutions to vision tasks in GIE, which we revealed through an analysis of the downstream performance. These include:
• Self-supervised pretraining generally produces more suitable backbones than supervised pretraining. This result is significant as it is still the convention to use backbones that have been pretrained on ImageNet-1k in a supervised manner, which implies that the current state-of-the-art could be improved upon through self-supervised pretraining. Additionally, this result contrasts with the results observed for tasks involving everyday images, where supervised pretraining typically leads to better performance.
• Self-supervised pretraining with ImageNet-1k generally produces more suitable backbones than self-supervised pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy, where the similarity of the pretraining data to the downstream data appears to be more critical than the amount of pretraining data. While this is a useful insight for the development of monocular depth estimation models for GIE, this finding may also be true for monocular depth estimation solutions in other domains.
• ResNet50 backbones are generally better for polyp detection, ViT-B backbones are generally better for polyp segmentation and monocular depth estimation in colonoscopy, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation.

We hope that this paper encourages further work on the topic of pretraining image encoders for use as backbones in solutions to vision tasks in GIE. Firstly, the scope of this work could be extended to more tasks and datasets, as well as decoder architectures and fine-tuning procedures. For example, we considered the Faster R-CNN object detection pipeline, which is a 2-stage detector, and it is worth investigating whether our findings are also true for 1-stage detectors. Additionally, we considered supervised fine-tuning for monocular depth estimation in colonoscopy, while self-supervised fine-tuning for monocular depth estimation is also a promising research avenue and may benefit from an investigation into pretraining. Also, the impact of existing pretraining pipelines on the hybrid architectures that combine both convolutional and transformer components, and that have found success in polyp segmentation, can be investigated. We believe that such research should lay the groundwork for the development of backbones that are better suited to tasks in GIE, which should allow for significant advancement in the state-of-the-art. Beyond extending the scope of this study and the further investigation of existing pretraining algorithms, we suggest that future work also studies the development of pretraining algorithms specifically for this domain, as well as for other encoder architectures.

ACKNOWLEDGMENT
Data Access Statement: this publication is supported by multiple datasets which are openly available as cited in the 'References' section of this paper. A persistent record of the software developed as part of the reported research is openly available from https://doi.org/10.17030/uclan.data.00000447.

FIGURE 1. The overall workflow of our experimentation.

FIGURE 3. Visualisation of the Barlow Twins algorithm. Shown for a per-GPU batch size of 2, and representations of dimensionality 4. We use $g_\theta$ to denote the projector.

FIGURE 4. Visualisation of the MAE algorithm. Shown for a ViT encoder that treats an image as a 4 × 4 grid of patch tokens, with 75% masking.

FIGURE 5. Targets (yellow bounding boxes) and predictions (green bounding boxes) for two randomly selected instances of the Kvasir-SEG test set.For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.

FIGURE 6. Targets and predictions for two randomly selected instances of the Kvasir-SEG test set.For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.

FIGURE 9. Ranking of the performance of each model on each task, as measured by mF1 (anatomical landmark recognition and pathological finding characterisation), AP (polyp detection), mDice (polyp segmentation), and mRMSE (monocular depth estimation in colonoscopy), where a better rank is represented by a greater distance from the centre. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

FIGURE 10. Improvement of self-supervised pretraining vs. supervised pretraining for same architecture and pretraining data (ImageNet-1k). For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, MoCo v3 with MC, Barlow Twins with BT, and MAE with MA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

FIGURE 11. Improvement of pretraining with Hyperkvasir-unlabelled vs. pretraining with ImageNet-1k for same architecture and self-supervised pretraining algorithm. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, MoCo v3 with MC, Barlow Twins with BT, and MAE with MA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

FIGURE 12. Improvement of ViT-B over ResNet50 for same pretraining pipeline (data and algorithm).For conciseness, we denote Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, supervised pretraining with SL, and no pretraining with NA-NA.Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

FIGURE 13. Distribution of Dice score (higher is better) across the test set for each Kvasir-SEG polyp segmentation model, visualised as box and violin plots. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. For clarity, the violin plots for ResNet50 models are coloured red and the violin plots for ViT-B models are coloured blue.

FIGURE 14. Distribution of Dice score (higher is better) across the test set for each CVC-ClinicDB polyp segmentation model, visualised as box and violin plots.For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.For clarity, the violin plots for ResNet50 models are coloured red and the violin plots for ViT-B models are coloured blue.

FIGURE 15. Distribution of RMSE (lower is better) across the test set for each C3VD monocular depth estimation model, visualised as box and violin plots. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. For clarity, the violin plots for ResNet50 models are coloured red and the violin plots for ViT-B models are coloured blue.

TABLE 1. Number of instances of each class, in total and in each split.

TABLE 4. Performance in anatomical landmark recognition. The best results for each architecture are highlighted as bold, and the best results overall are underlined.

TABLE 5. Performance in pathological finding characterisation. The best results for each architecture are highlighted as bold, and the best results overall are underlined.

Kvasir-SEG has already been discussed in the context of our object detection experiments, and we use the same training/validation/test split here. CVC-ClinicDB includes 612 GIE images, each of which shows at least one polyp and is paired with a binary segmentation map indicating which pixels correspond to a polyp and which do not. We applied a random 80%/10%/10% training/validation/test split, where the validation data is used to determine whether to save the weights after each epoch of training on the training data, and the test data is reserved for evaluating the model after fine-tuning.

TABLE 7. Performance in polyp segmentation with Kvasir-SEG. The best results for each architecture are highlighted as bold, and the best results overall are underlined.

TABLE 8. Performance in polyp segmentation with CVC-ClinicDB. The best results for each architecture are highlighted as bold, and the best results overall are underlined.

TABLE 9. Performance in monocular depth estimation in colonoscopy. The best results for each architecture are highlighted as bold, and the best results overall are underlined.