Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types

Transfer learning enables the re-use of knowledge learned on a source task to help in learning a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models: pre-training a model for image classification on the ILSVRC'12 dataset, and then fine-tuning it on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should \emph{include} the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.


INTRODUCTION
Transfer learning is omnipresent in computer vision.
The common practice is transfer learning through ILSVRC'12 pre-training: train on the ILSVRC'12 image classification task [4], copy the resulting weights to a target model, then fine-tune for the target task at hand. This strategy was shown to be effective on a wide variety of datasets and task types, including image classification [7], [20], [42], [47], [60], object detection [39], semantic segmentation [95], human pose estimation [43], [128], [131], and depth estimation [19], [33]. Intuitively, the reason for this success is that the network learns a strong generic visual representation, providing a better starting point for learning a new task than training from scratch. But can we do better than ILSVRC'12 pre-training? And what factors make a source task good for transferring to a given target task? Some previous works aim to demonstrate that a generic representation trained on a single large source dataset works well for a variety of classification tasks [59], [76], [127]. Others instead try to automatically find a subset of the source dataset that transfers best to a given target [37], [79], [82], [122]. By virtue of dataset collection suites such as VTAB [127] and VisDA [81], recently several works in transfer learning and domain adaptation experiment with target datasets spanning a variety of image domains [26], [59], [62], [76], [81], [127], [129]. However, most previous work focuses solely on image classification [26], [37], [62], [71], [82], [92], [127] or a single structured prediction task [45], [67], [106], [108], [122].

• Google Research. Primary contacts: mensink@google.com and jrru@google.com. * Equal contribution. Manuscript submitted in March 2021, revision submitted in September 2021, and accepted in November 2021 for publication in TPAMI. Copyright may be transferred without notice, after which this version may no longer be accessible.
In this paper, we go beyond previous works by providing a large-scale exploration of transfer learning across a wide variety of image domains and task types. In particular, we perform over 2000 transfer learning experiments across 20 datasets spanning seven diverse image domains (consumer, driving, aerial, underwater, indoor, synthetic, closeups) and four task types (semantic segmentation, object detection, depth estimation, keypoint detection). In many of our experiments the source and target come from different image domains, task types, or both. For example, we find that semantic segmentation on COCO [70] is a good source for depth estimation on SUN RGB-D [99] (Tab. 4e); and even that keypoint detection on Stanford Dogs [8], [53] helps object detection on the Underwater Trash [34] dataset (Tab. 5b). We then do a systematic meta-analysis of these experiments, relating transfer learning performance to three underlying factors of variation: the difference in image domain between source and target tasks, their difference in task type, and the size of their training sets. This yields new insights into when transfer learning brings benefits and which source works best for a given target.
At a high level, our main conclusions are: (1) for most target tasks we are able to find sources that significantly outperform ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should include the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types. The rest of our paper is organized as follows: Sect. 2 discusses related work. Sect. 3 discusses the three factors of variation. Sect. 4 details and validates the network architectures that we use. Sect. 5 presents our transfer learning experiments. Sect. 6 presents a detailed analysis of our results. Sect. 8 concludes our paper.

RELATED WORK
We review here related work on transfer learning in computer vision. We take a rather broad definition of this term, and include several families of works that transfer knowledge from one task to another, even though they are not explicitly positioned as 'transfer learning'. We pay particular attention to their experimental setting and analyses, as this is the aspect most related to our work. Domain adaptation. This family of works adapts image classifiers from a source domain to a target domain, typically containing the same classes but appearing in different kinds of images. Some of the most common techniques are minimizing the discrepancy in feature distributions [36], [71], [92], [101], [107], [110], [129], embedding domain alignment layers into a network [13], [73], [89], or directly transforming images from the target to the source domain [9], [91], [93]. Earlier works consider relatively small domain shifts, e.g. where the source and target datasets are captured by different cameras [92]. Recent works explored larger domain shifts, e.g. across clipart, web-shopping products, consumer photos and artist paintings [110], or across simple synthetic renderings and real consumer photos [62], [81], and even fusing multiple datasets spanning several modalities such as hand-drawn sketches, synthetic renderings and consumer photos [129]. Despite this substantial variation in domain, most works consider only object classification tasks, with relatively few papers tackling semantic segmentation [45], [67], [81], [103], [106], [124]. Few-shot learning. Works in this area aim to learn to classify new target classes from very few examples, typically 1-10, by transferring knowledge from source classes with many training samples. Typically the source and target classes come from the same dataset, hence there is no domain shift. There are two main approaches: metric-based and optimization-based. Optimization-based methods employ meta-learning [32], [80], [84], [85], [117].
These methods train a model such that it can later be adapted to new classes with few update steps. MAML [32] is the most prominent example of this line of work. Despite its success, [84] later showed that its performance largely stems from learning a generic feature embedding, rather than a model that can adapt to new data faster. Metric-based methods [57], [74], [98], [104], [111] aim at learning an embedding space which allows classifying examples of any class using a distance-based classifier, e.g. nearest neighbor [98]. Overall, the community seems to be reaching a consensus [17], [38], [46], [83], [100], [105]: the key ingredient to high-performing few-shot classification is learning a general representation, rather than sophisticated algorithms for adapting to the new classes. In line with these works, we study what representation is suitable for solving a target task. In contrast with these works, our focus is on more complex structured prediction tasks.
While most few-shot learning works focus on classification, there are a few exceptions, e.g. for object detection [113], [114], [118] and semantic segmentation [94]. These works follow a similar setup with source and targets coming from the same dataset. We also focus on structured prediction tasks, however, we tackle more realistic scenarios, where a model is adapted to new datasets, appearance domains, and task types. Transfer learning. A common practice in computer vision is to start from a neural network trained for image classification on ILSVRC '12 [4] as a generic source (Sect. 1). Because of its success, this strategy is the starting point in all our experiments, and should be seen as the baseline to beat for any transfer learning method.
While many works simply apply this strategy as a practical 'trick of the trade', several papers explore transfer learning in more depth, attempting to understand when it works, to find even better source datasets than ILSVRC '12, or to propose more sophisticated transfer techniques. The typical setting considers a very large source dataset: besides ILSVRC'12 [26], [37], [59], [76], [79], also ImageNet21k with 9M images [26], [59], [76], [82], [127], Open Images with 1.7M images [122], Places-205 with 2.5M images [37], and even JFT-300M with 300M images [26], [59], [76], [79], [82]. Several works [26], [59], [76], [127] report experiments on target datasets spanning different image domains, especially since the advent of the VTAB suite [127] which assembles datasets captured using a standard camera, as well as remote sensing, medical, and synthetic ones. Yet, each paper considers little variation in the source dataset, experimenting with 1-3 sources overall, and sometimes picking just one for each target dataset [37]. More importantly, the vast majority of reported experiments are on image classification, for both source and target datasets. Note how VTAB downgrades some structured tasks to simpler versions that can be expressed as classification, e.g. predict the depth of the closest object to the camera, as opposed to a depth value for each object or pixel.
In terms of method, most works follow the classical protocol of learning a feature representation from the source dataset, then fine-tune with a new task head on each target dataset in turn. However, they differ in how they select source images for a given target. Some aim to demonstrate that a single generic representation trained on the whole large source dataset works well for all target tasks [59], [127]. Others instead try to automatically find a subset of the source dataset that transfers best to a given target [37], [79], [82], [122]. The subsets correspond either to subtrees of the source class hierarchy [79], [82], or are derived by clustering source images by appearance [122]. After selecting a subset, they retrieve its corresponding pre-trained source model and fine-tune only that one on the target dataset.
Taskonomy [126] explores a converse scenario. They perform transfer learning across many task types, including several dense prediction ones, e.g. surface normal estimation and semantic segmentation. However, there is only one image domain (indoor scenes) as all the tasks are defined on the same images (one dataset). This convenient setting enables them to study which source task type helps the most which target task type, by exhaustively trying out all pairs. As a powerful outcome, they derive a taxonomy of task types, with directed edges weighted by how much a task helps another.

Fig. 1: We explore transfer learning across a wide variety of image domains and task types. We show here example images for the 20 datasets we consider, highlighting their visual diversity. We grouped them into manually defined image domains: consumer in orange, driving in green, synthetic in red, aerial in purple, underwater in blue, close-ups in yellow, and indoor in magenta.
Finally, several other papers perform transfer learning experiments on more complex tasks than image classification as a way to validate their proposed method, but typically only for a few source-target dataset pairs, and only within the same task type (e.g. from Open Images [63] to CityScapes [21] for instance segmentation [122]; and across pairs of ILSVRC '12 [4], COCO [70], Pascal VOC [2] for object detection [108]). Other related work. Finally we discuss two other directions of related work. Experimenting over a collection of datasets is getting more common, e.g. for robust vision approaches [3], [65], [86]. However, the general aim for robust methods is to learn a single model which performs well across several datasets for one task type.
A difficulty for the sequential learning of neural networks is their tendency toward catastrophic forgetting [24], [56], [96]. In our transfer learning setup, the new target model might have forgotten the old source tasks. However, forgetting is not relevant in our study, since we are interested in performance on the target task only. Moreover, we analyse under which conditions the target task can successfully re-use the knowledge from the source task.

FACTORS OF INFLUENCE
We study transfer learning at scale across three factors of influence: the difference in image domain between source and target tasks (Sect. 3.2), their difference in task type (Sect. 3.3), and the size of the source and target training sets (Sect. 3.4). To study these factors of influence, we introduce a collection of 20 datasets in Sect. 3.5, which we have chosen to cover these factors well.

Transfer Learning through pre-training
This paper explores transfer learning from a source task to a target task. We define a task as the combination of a task type (e.g. object detection, semantic segmentation) and a dataset (e.g. COCO). We follow the widespread practice of initializing the backbones of our networks with the weights obtained from image classification pre-training on ILSVRC'12 [7], [20], [39], [43], [47], [60], [95], [131]. This leads to the following process:
1) we train a model on ILSVRC'12 classification;
2) we copy the weights of the backbone of the ILSVRC'12 classification model to the source model, and randomly initialize the head of the source model, which is specific to the task type;
3) we train on the source task;
4) we copy the weights of the backbone of the source model to the target model, and again randomly initialize its head;
5) we train on the target training set;
6) we evaluate on the target validation set.
This protocol essentially defines a transfer chain: ILSVRC'12 → source task → target task. We compare these transfer chains to the default practice: ILSVRC'12 → target task. To ensure fair comparisons, we use a single set of ILSVRC'12 pre-trained weights throughout all experiments in this paper. Analogously, for each source task we create one set of pre-trained weights used throughout all our experiments.
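The transfer chain can be sketched in plain Python, with dicts standing in for networks; all helper names here are hypothetical illustrations, not the paper's codebase:

```python
import copy

def new_model(task_type):
    """A toy 'network': a backbone plus a task-type-specific head."""
    return {"backbone": {"lineage": []}, "head": {"task_type": task_type}}

def transfer(source_model, task_type):
    """Copy the source backbone into a fresh model; the head is
    re-initialized, since it is specific to the task type."""
    model = new_model(task_type)
    model["backbone"] = copy.deepcopy(source_model["backbone"])
    return model

def train(model, dataset):
    """Stand-in for fine-tuning until convergence on `dataset`."""
    model["backbone"]["lineage"].append(dataset)
    return model

# ILSVRC'12 -> source task -> target task
ilsvrc12 = train(new_model("classification"), "ILSVRC12")
source = train(transfer(ilsvrc12, "semantic segmentation"), "COCO")
target = train(transfer(source, "depth estimation"), "SUN RGB-D")
```

After the chain runs, the target backbone carries the training lineage ILSVRC'12 → COCO → SUN RGB-D, while its head was freshly initialized for the target task type.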

Image Domain
We want to study transfer learning across a wide range of different image domains. For this we considered many publicly available datasets and manually selected the following domain types: consumer photos, driving, indoor, aerial, underwater, close-ups, and synthetic. We deliberately use only domains from RGB image sensors, and exclude imagery from other sensors, such as CT scans, multi-spectral imagery, or lidar, because it is unclear how knowledge could be re-used when transferring across datasets acquired by different sensors.
To measure the difference between source and target domains, one way is to use these manually defined domain types. This results in a simple binary same-or-different measure. To offer a more fine-grained metric, we also use a continuous similarity measure obtained directly from the visual appearance of the source and target datasets. In particular, we extract image features from a backbone based on our multi-source semantic segmentation model (Sect. 5.1), where we attach a spatial average pooling layer to the backbone, resulting in a 720-dimensional image vector. The domain difference is then the average distance of a target image to its closest source image:

\[ D(T|S) = \frac{1}{|T|} \sum_{t \in T} \min_{s \in S} d(f_t, f_s) \tag{1} \]

where | · | denotes the cardinality, f denotes an image feature vector, and for d(·, ·) we use the Euclidean distance. From each dataset we sample 1000 images to compute D(T|S). The same images are used irrespective of whether the dataset is used as target or source. Due to the min operation in Eq. (1) this is an asymmetric measure, i.e. D(A|B) ≠ D(B|A). This measure enables more fine-grained analysis than using our manually defined domains.
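The domain-difference measure amounts to a nearest-neighbor distance averaged over target images. A minimal NumPy sketch, with feature extraction omitted:

```python
import numpy as np

def domain_difference(target_feats, source_feats):
    """D(T|S): mean over target images of the Euclidean distance to
    the closest source image in feature space.
    target_feats: |T| x d array; source_feats: |S| x d array."""
    diffs = target_feats[:, None, :] - source_feats[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)   # |T| x |S| pairwise distances
    return d.min(axis=1).mean()          # min over S, then mean over T
```

For example, with 1-D features T = {0, 5} and S = {0}, D(T|S) = 2.5 while D(S|T) = 0, illustrating the asymmetry of the measure.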

Task type
In this study we focus on structured prediction tasks, which all involve some form of spatial localization. Since training on these tasks yields spatially sensitive features, we hypothesise that these tasks could benefit each other. We consider four task types: 1) semantic segmentation: predict the class label for each pixel in the image. 2) object detection: predict tight bounding boxes around objects and predict their class labels. 3) keypoint detection: detect the image location of body joints or parts, for the human and dog classes in the datasets we consider. 4) depth estimation: predict for each pixel the distance from the camera to the physical surface (from a single image). These four task types significantly vary in nature: semantic segmentation and depth estimation are pixel-wise prediction tasks, whereas keypoint detection requires identifying sparse (but related) points in an image, and object detection requires predicting bounding boxes. Hence, while these tasks are all about spatial localisation, their different nature makes it interesting to study the influence of transferring from one task type to another.

Dataset Size
The size of the target training set is important (e.g. [59]). When a target dataset is very large, the effect of transfer learning is likely to be minimal: all the required visual knowledge can be gathered directly from this target dataset. Therefore we consider two transfer learning settings, one using only a small number of target images and one using the full target dataset. We believe that using a small target training set is most relevant in practice, given that we often want to train strong models from a small set of annotated images.
We also perform some experiments varying the size of the source training set. A source model trained on a larger dataset is likely to be more beneficial for transfer learning [47], [72], [102]. Hence, in some of our experiments we limit all source sets to a maximum number of images, sampled uniformly from the dataset. This allows us to study the influence of the source domain versus the source dataset size.

Dataset collection
To study these factors of influence on transfer learning we selected 20 publicly available datasets annotated for one (or more) of our task types, ensuring that they span various very different image domains and exhibit a large variety in dataset size. An overview of the datasets and the task types we use is given in Tab. 1. The variety in visual appearance is illustrated in Fig. 1.

Tab. 1: Overview of the 20 datasets used in this paper. They vary in size, task types (we consider semantic segmentation, object detection, keypoint detection, depth estimation), number of classes, and image domain. We included datasets from consumer imagery, driving, aerial, indoor, underwater, close-ups, and synthetic domains. Some datasets (such as KITTI [5]) have more annotation types available, but these are not used in this study.
Berkeley Deep Drive [123] also includes road segmentation annotations.
In our experiments, each dataset plays both roles of source and target, in different experiments in turn. This allows us to extensively study the aforementioned factors of influence for transfer learning over a wide range of image domains, both within task types and across task types, and for training sets of various size.

SETTING UP NETWORK ARCHITECTURES FOR TRANSFER LEARNING
This section describes our network setup. Three things are important for our study: a common framework with shared data augmentation, a common backbone, and high quality models specific to each task type.

Data normalization and augmentation
During training, various data normalization and augmentation techniques are typically used [22], [61], and these have a significant impact on model performance. We unify the data normalization and augmentation step, because it changes the input of the network, i.e. it influences the low-level statistics of the input imagery. This prevents variations in data normalization and augmentation from affecting transfer learning performance.
Data augmentation also influences what the model learns. For example, applying large rotation transformations or anisotropic scaling makes the network invariant to those aspects, which may be beneficial for some tasks but detrimental for others; e.g. fully rotation-invariant networks cannot distinguish a 6 from a 9. In this paper we do not want data normalization and augmentation to be confounding factors in transfer learning experiments. Hence we aim to keep data normalization and augmentation as simple as possible without compromising accuracy, and apply the same protocol to all experiments.
Illumination normalization. For each dataset we normalize the images so that each color channel has zero mean and standard deviation 1. This form of whitening minimizes illumination biases from the different datasets as much as possible.
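The per-channel whitening described above amounts to a few lines of NumPy; a minimal sketch (the helper name is ours):

```python
import numpy as np

def whiten(img, eps=1e-8):
    """Normalize each color channel of an image to zero mean and unit
    standard deviation. img: H x W x C float array."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True)
    return (img - mean) / (std + eps)  # eps guards against flat channels
```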
Data augmentation. To unify the data augmentation across all four task types considered, we ran many experiments to estimate the importance of common data augmentation techniques. We found the following augmentations to always have a positive (or neutral) effect and thus use them in all experiments: (1) random horizontal flipping of the image; (2) random rescaling of the image; and (3) taking a random crop. For object detection and keypoint detection, we consider only random crops fully containing at least one object. As in [65], we found that the image scales that lead to the best performance are intrinsic to the dataset, and hence dataset-dependent. Since we see this (partly) as a property of the image domain, we optimized the input resolution of the network for each task. Overall, the input resolutions during training range from 420 × 420 for Pascal Context to 713 × 713 for SUN RGB-D to 512 × 1024 for CityScapes, where the latter is a landscape format common to most driving datasets. We always evaluate at a single scale and resize each image such that one side matches the input resolution used for that task.
After careful consideration, we decided not to use image rotation, varying image aspect ratio, or any form of color augmentation. Image rotation is incompatible with object detection, which (generally) assumes axis-aligned ground-truth boxes. Moreover, we were able to reproduce the semantic segmentation performance of [65] even without rotation augmentation (see Sect. 4.3). Varying the image aspect ratio did not yield positive effects on semantic segmentation, and such augmentation is also uncommon for object detection and keypoint detection. Color augmentation (random changes in hue, contrast, saturation, and brightness) did not yield positive effects for us on semantic segmentation and object detection, while substantially slowing down training. Our Setup. To conclude, we consistently apply a limited number of data augmentation techniques across all experiments. The only difference across experiments is the input resolution, which is fixed per dataset and is thus consistent with the image domain factor of variation we want to study (Sect. 3.2). This uniform protocol effectively cancels out varying data augmentation as a potential causal factor influencing the performance of transfer learning.
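The three retained augmentations (horizontal flip, random rescale, random crop) can be sketched as follows; this is an illustrative NumPy version using nearest-neighbor rescaling, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=32, scale_range=(0.5, 2.0)):
    """Apply random horizontal flip, random rescale, and random crop."""
    if rng.random() < 0.5:                       # (1) horizontal flip
        img = img[:, ::-1]
    s = rng.uniform(*scale_range)                # (2) random rescale
    h, w = img.shape[:2]
    nh, nw = max(crop, int(h * s)), max(crop, int(w * s))
    img = img[np.arange(nh) * h // nh][:, np.arange(nw) * w // nw]
    y0 = rng.integers(0, nh - crop + 1)          # (3) random crop
    x0 = rng.integers(0, nw - crop + 1)
    return img[y0:y0 + crop, x0:x0 + crop]
```

For detection and keypoint tasks the paper additionally constrains the crop to contain at least one object, which this sketch omits.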

Network Architectures
Transfer learning through pre-training is only possible when the models for all task types share the same backbone architecture. To have meaningful results, we need to choose a backbone which works well across all task types we explore. In this section we outline the backbone architecture and the task-type-specific networks we use; these are illustrated in Fig. 2. More details are in Appendix A. Backbone Architecture. For the backbone we chose the recent high-resolution HRNetV2 [112] architecture. It extends the ResNet architecture to preserve high-resolution spatial features. Where a regular ResNet reduces the image resolution at every stage, HRNetV2 also keeps a parallel high-resolution branch, as illustrated in Fig. 2a. The resulting four output maps are combined into a single feature map by upscaling and concatenation. The backbone is pre-trained with supervised classification on ILSVRC'12 using the architecture shown in Fig. 2b.
HRNetV2 has two main advantages. First, it outperforms ResNet [44], ResNeXt [121], Wide ResNet [125], and Stacked Hourglass Networks [78] on three of the tasks we explore: semantic segmentation, keypoint detection, and object detection. Second, the high-resolution output feature maps allow the use of relatively shallow task-type-specific heads; see Tab. 2 for the number of trainable weights. Semantic Segmentation. For semantic segmentation, the task-type head is shown in Fig. 2c; it is an adaptation of the network head proposed in [112]. It consists of a linear classifier, a softmax layer, and a bi-linear up-sampling layer to produce the final predicted segmentation map at the input image resolution. Object Detection. For object detection we follow the CenterNet 1 approach of Zhou et al. [131] as shown in Fig. 2e. Each pixel is classified as being the center point of a bounding box of a specific class (i.e. the heatmap); additionally, the bounding box size and offsets are predicted. Using center points to predict bounding boxes results in a simpler model than a two-stage architecture like Faster-RCNN [87], while being about as accurate [27], [131]. 1. With CenterNet we denote a family of detection models which predict boxes via their center points [27], [131].

(Tab. 3: single-task baseline comparisons; residue of the table removed. Recoverable entries include ILSVRC'12 Top-1 accuracy of ResNet-50 [112] (76.9) and HRNetV2-W44 [112] (77.0), and depth estimation results of DORN [33] and D-DFN [19].)

Keypoint Detection. For keypoint detection we follow [131] as illustrated in Fig. 2f. This essentially uses the object detection architecture to predict a bounding box for each person/dog, and then predicts the location of the keypoints within this bounding box. Depth Estimation. For monocular depth estimation we mimic the architecture of the semantic segmentation head.
We use a single regression layer, followed by a soft-plus layer and a bi-linear upsampling layer, see Fig. 2d. The soft-plus activation log(exp(x) + 1) acts as a differentiable soft clipping to convert the logit values to depth, ensuring that the predictions are positive; it is also used in [40], [132].
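The soft-plus activation is simple to write down; note that the naive form log(exp(x) + 1) overflows for large x, so a numerically stable formulation is preferable in practice:

```python
import math

def softplus(x):
    """log(exp(x) + 1), computed stably: a smooth clipping that maps any
    real-valued logit to a strictly positive depth value."""
    # max(x, 0) + log(1 + exp(-|x|)) is algebraically equal to
    # log(exp(x) + 1) but never overflows.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

softplus(0) equals log 2 ≈ 0.693; for large positive x the output approaches x, and for very negative x it approaches 0 from above, so predictions remain strictly positive.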

Single Task Performances
In order to validate our setup, we compare the performance of the networks used for each task type to a set of baselines. More details are in Appendix A. Backbone. To validate our backbone architecture we use image classification on ILSVRC'12 and measure Top-1 accuracy. We compare our setup to a ResNet-50 and an HRNetV2-W44 [112] network. The results are in Tab. 3a. Our model performs best, reaching 79.5% Top-1 accuracy, validating our choice of architecture. Semantic Segmentation. To validate our setup, we compare on a subset of the MSEG dataset collection [65]. For this experiment we use the annotations provided in MSEG, which use fewer classes than our transfer setup. Training starts from an ILSVRC'12 pre-trained backbone. Performance is evaluated by the Intersection-over-Union averaged over classes (mIoU) [30]. During evaluation we process complete images at a single resolution only. The results are shown in Tab. 3e. Our models perform on par with those of [65], which uses a similar architecture, but with different normalisation and data augmentation steps and a different implementation. Object Detection. We validate our setup by training on COCO17 and evaluating performance on its 5K validation set, without using Non-Maximum Suppression (NMS). We compare to results reported in [131] and [112].
The evaluation is based on the COCO definition of mean Average Precision (mAP) [70]. The results are in Tab. 3d. We observe that the best results are obtained by using data augmentation during evaluation or by using a feature pyramid in the detection phase. Without these enhancements, our implementation performs close to the Hourglass ResNet-104 and outperforms the ResNet-101. Hence we conclude that we can base our transfer learning experiments on a strong, modern object detection framework. Keypoint Detection. We validate our setup on the keypoint detection task on the COCO dataset and report the mean Average Precision at 0.5 Object Keypoint Similarity (AP50) [1].
Results are in Tab. 3: our performance is comparable to that of [131] and thus strong enough for our transfer learning exploration. Depth Estimation. To validate our setup we compare monocular depth estimation on the NYUDepthV2 [97] dataset, using the root mean squared error (RMSE) metric and the δ < 1.25 accuracy, where δ = max(ẑ/z, z/ẑ) is a measure of relative accuracy defined in [64].
The results are in Tab. 3b, where we compare to two recent ResNet-based models: [33] uses depth-specific losses based on ordinal regression; and [19] uses a depth-specific network architecture, but with the same loss function as we do. Our light-weight depth prediction model outperforms both depth-specific models on RMSE and hence is well suited for our monocular depth estimation transfer learning experiments.
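The two depth metrics used above can be sketched as follows, assuming strictly positive predictions and ground truth:

```python
import numpy as np

def depth_metrics(pred, gt, thr=1.25):
    """Return (RMSE, delta accuracy): the root mean squared error and
    the fraction of pixels with max(pred/gt, gt/pred) < thr."""
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    delta = np.maximum(pred / gt, gt / pred)
    return rmse, float(np.mean(delta < thr))
```

For instance, a prediction that overestimates every depth by 20% still scores a delta accuracy of 1.0 at the 1.25 threshold, while doubling every depth scores 0.0.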

TRANSFER LEARNING EXPERIMENTS
In this section we describe our transfer learning experiments. We mainly conduct experiments in two settings: transfer learning with a small target training set and with the full target set. The analysis of the results across all experiments will be discussed in Section 6.

Setup
Transfer Chains. In our experimental setup we consider transfer chains: ILSVRC'12 → source → target. Specifically, we first train a single classification model on ILSVRC'12, whose weights we reuse in all our experiments. To train a source model S, we first copy the ILSVRC'12 backbone weights into it. We randomly initialize the task head of S.
Then we fine-tune until convergence on the source training set. This results in a single set of backbone weights per source task which we reuse in all our experiments. Analogously, we continue and copy the backbone weights of S into the backbone of T, randomly initialize the task head of T, and fine-tune until convergence on the target training set. Baseline. As baseline we consider the default in the community, which is starting from ILSVRC'12 pre-trained backbone weights: ILSVRC'12 → target. Evaluation. Our core question is: can we get additional gains over our baseline by picking a good source? We therefore measure improvements w.r.t. the baseline, which we call the relative transfer gain:

\[ r = \frac{m_{\text{transfer}} - m_{\text{baseline}}}{m_{\text{baseline}}} \tag{2} \]

where m denotes a metric specific to the task type (e.g. mean IoU for semantic segmentation), which is evaluated for all tasks on complete images at a single resolution only.
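The relative transfer gain described above can be computed as below; the sign flip for lower-is-better metrics (used for depth estimation, as discussed in the text) is folded into a flag, and the function name is ours:

```python
def relative_transfer_gain(m_transfer, m_baseline, lower_is_better=False):
    """Gain of ILSVRC'12 -> source -> target over the ILSVRC'12 -> target
    baseline, as a fraction of the baseline score. For metrics where lower
    is better (e.g. RMSE for depth estimation) the sign is flipped, so a
    positive r always means the transfer chain helped."""
    r = (m_transfer - m_baseline) / m_baseline
    return -r if lower_is_better else r
```

For example, improving mIoU from 50 to 55 gives r = 0.1, and reducing RMSE from 0.50 to 0.45 likewise gives r = 0.1.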
Since for depth estimation lower values of m mean better performance, we multiply r by −1 in that case. This notion of gain is similar in spirit to the one defined in Taskonomy [126]. However their gain is the percentage of test images for which the transfer model outperforms a model trained from scratch. Instead, our relative transfer gain metric refers to a much stronger baseline: a model fine-tuned from ILSVRC'12. Moreover, we evaluate how much better the target model becomes in terms of a standard metric specific to each task type, averaged over all test images. Multi-Source Models. Inspired by the success of using generic visual representations for various computer vision tasks [58], [59], [65], we include a multi-source model in our experiments.
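The relative transfer gain can be sketched as a small helper (assuming the percentage form stated above; the sign flip for depth estimation mirrors the text):

```python
def relative_transfer_gain(m_transfer, m_baseline, lower_is_better=False):
    """Relative transfer gain r (in percent) of a transfer-learned model
    over the ILSVRC'12 baseline, for a task-specific metric m."""
    r = 100.0 * (m_transfer - m_baseline) / m_baseline
    # for depth estimation, lower metric values are better, so flip the sign
    return -r if lower_is_better else r
```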
We train a multi-source model for a specific task type based on several datasets. For each dataset a separate head is attached to a single, common backbone:
• semantic segmentation is trained across iSAID, COCO, Mapillary, ScanNet, SUIM, vGallery, and vKITTI2;
• depth estimation is trained across all three depth datasets we consider: SUN RGB-D, vGallery, and vKITTI2;
• object detection is trained across all four object detection datasets: COCO, BDD, Pascal VOC, and Underwater Trash.
The multi-source models are trained using dataset interleaving at the batch level, i.e. each batch is sampled from a single dataset, alternating between datasets, and each head classifies only the images from its particular dataset. When doing transfer learning, only the weights of the common backbone are used as a source model. Training. We determined the number of training steps per dataset for all source models in preliminary experiments. We use these for source model training and in the full target training setting. We lower the number of steps for the small target training setting. For each experiment in this paper, we selected the best model from three learning rates. More training details are in Appendix A.
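Batch-level interleaving as described can be sketched as a round-robin iterator (a simplified illustration; the loader and head names are hypothetical):

```python
import itertools

def interleave_batches(dataset_loaders):
    """Round-robin over per-dataset batch loaders: each yielded batch
    comes from exactly one dataset, tagged so that only the matching
    task head receives it during training."""
    for name in itertools.cycle(list(dataset_loaders)):
        yield name, next(dataset_loaders[name])

# usage sketch: loaders = {"COCO": iter(...), "BDD": iter(...)}
# for name, batch in interleave_batches(loaders):
#     loss = heads[name](backbone(batch))
```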

Transfer learning with a small target training set
The first setting we study is transfer learning with a small target training set, in which we limit the number of annotated examples available for training. We deem this setting the most challenging and practically relevant: obtaining good performance for a structured prediction task by annotating only a few images. Transfer learning is particularly relevant in such a low-data regime.
Concretely, we limit the number of target training images to 150 per dataset (this is less than 3% of the available train data for 14 out of the 20 datasets). For COCO and ADE20K we make an exception and use 1000 images, which is still less than 1% of the available train data for COCO and 5% for ADE20K. The reason is that these datasets have a large number of classes following a long-tail distribution (COCO has 134 classes for segmentation, and ADE20K has 150); just 150 images did not properly cover all classes. In all experiments we use a seeded selection, such that for each dataset all models are fine-tuned using the exact same training samples.
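Seeded subset selection of this kind can be sketched as follows (illustrative only; the paper does not specify the exact sampling code):

```python
import random

def select_subset(image_ids, n, seed=0):
    """Deterministically pick n training images: with a fixed seed,
    every model for a given dataset sees the exact same subset."""
    rng = random.Random(seed)
    return rng.sample(sorted(image_ids), min(n, len(image_ids)))
```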

Experiment.
Our results are shown in Tab. 4 (all relative transfer gains). Positive transfer gains are in green, negative transfer gains are in purple, and no gains are represented by white. For ease of presentation, we group results by the target task type, which leads to multiple tables. Then we group by the source task type (two separate tables for semantic segmentation, blue vertical line in other experiments). Then we group by image domain (marked in blue above in Tab. 4a). Finally, within each image domain we order sources by their size. For completeness, absolute performance tables are available in Appendix C. We tried to keep all experiments as homogeneous as possible, while maintaining a wide scope of transfer across many diverse domains and task types. This leads to two peculiarities, which we clarify below.
First, in most experiments, the source training set and the target training set are disjoint (i.e. always when transferring within a task type, and most of the time across task types). However, when transferring across task types, a few of the source-target pairs use (partly) the same training set, e.g. COCO Object Detection as source for COCO Keypoint Detection in Tab. 4d or BDD Semantic Segmentation as source for BDD Object Detection in Tab. 4c. In these experiments the source models have been trained on all images of the training dataset for the source task type, while the target model is fine-tuned on only a small part of that training set, and for a different task type. This setting is relevant if one is interested in extending the annotation of a dataset from one task type to another (e.g. one has a lot of object bounding boxes, but very few depth masks for a dataset).
Second, when transferring within a task type, the multi-source models are only applied for target datasets which were not part of their training (hence the blank entries for this column in Tab. 4a). This is necessary for meaningful experiments, as otherwise some target training sets would simply be a subset of the images and annotations present in the multi-source.


Transfer learning with full target training set
In this setting we use the full available target training set. We expect transfer learning to have a lesser effect, given that the target training set contains more information.
We perform a study similar to Section 5.2, but focus more on transfer within a task type. The results are shown in Tab. 5, again grouped by respectively target task type, source task type, image domain, and source size. Absolute numbers can be found in Appendix C.

Small source and small target training set
In this last setting we limit the size of the target training set as in Section 5.2, and also limit the size of each source training set to 1500 samples. This enables studying transfer learning effects where sources differ in appearance domain, but not in the amount of labeled images available. For all except two datasets this implies a reduction of the source training set. We study transfer learning in this setting only for the semantic segmentation task type and only for within-task-type transfer. The results are shown in Tab. 6.

ANALYSIS
In this section we analyse our results across different settings, task types and domains. To facilitate the discussion, we distinguish four levels of relative transfer gains (as defined in Eq. (2)): VP very positive transfer effect, when r > 10; P positive transfer effect, when r > 2; I insignificant transfer effect, when −2 ≤ r ≤ 2; N negative transfer effect, when r < −2.
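These levels can be expressed directly in code (treating P as 2 < r ≤ 10 so that the four levels partition the range; this disjoint reading is our assumption):

```python
def transfer_level(r):
    """Classify a relative transfer gain r (in percent) into the four
    levels used in the analysis: VP, P, I (insignificant), or N."""
    if r > 10:
        return "VP"   # very positive transfer effect
    if r > 2:
        return "P"    # positive transfer effect
    if r >= -2:
        return "I"    # insignificant transfer effect
    return "N"        # negative transfer effect
```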
We report the percentage of experiments for each level, but we do not report insignificant transfer (I). Tab. 7 shows the main results, split into the small target training set and full target training set settings, and filtered by whether we transfer within/across image domain and within/across task type. A1: Classic ILSVRC'12 transfer learning always outperforms training a model from scratch. In all our experiments, starting from ILSVRC'12 outperforms training from scratch, and by a large margin: even when using the full target training set, transferring from ILSVRC'12 improves performance by 5%−46% (see Tab. 5a, right-most column). This confirms that ILSVRC'12 pre-training is a solid way of (starting) transfer learning, which explains why this practice is widespread. A2: For most target tasks there exists a source task which brings further benefits on top of ILSVRC'12 pre-training.
This can be seen in Tab. 4 and Tab. 5, by noticing that for almost any row there are green entries, indicating positive relative transfer gain over ILSVRC'12 for that source model. To quantify this observation, we have computed the number of target tasks for which there is a source leading to positive (P) or very positive (VP) transfer effect on top of ILSVRC'12 pre-training. To do this, for each target task we take the transfer gain brought by the best source (Tab. 8). In the small target training set regime, there is a positive effect for 85% of the target tasks, and very positive for 67%. Even in the full target training set regime, for 56% of the target tasks there is a positive effect, and very positive for 19%.
So although ILSVRC'12 pre-training is the de-facto standard way to do transfer learning, there is surprisingly much to gain from an additional transfer step, and this holds for all task types. Next, we analyze the factors that influence these benefits. A3: The image domain strongly affects transfer gains. From Tab. 4a we see that most positive gains occur when the source and target tasks are in the same image domain (within-domain transfer). Conversely, transfer across image domains often yields negative gains. For example, for the consumer datasets as target, all other domains are bad sources. This makes the image domain an important factor. As can be seen, the size of the source dataset also plays a role: for example, for driving as a target domain, the larger driving and consumer datasets are generally better sources than the smaller ones. However, this effect is far less important than the image domain. Finally, Tab. 4b shows transfer from other task types to segmentation (cross-task-type transfer). Here, most effects are negative.
To quantify, we first compare the individual effects of image domain and task type in Tab. 7. Out of all experiments in the small target training set, within-domain, cross-task-type setting, 43% yield positive transfer gains and 37% negative ones. Conversely, out of all experiments in the cross-domain, within-task-type setting, only 14% yield positive transfer gains while 55% yield negative ones. This pattern is repeated when using the full target training set. Hence we conclude that, given a target task, the image domain of the source is more important for achieving good transfer gains than its task type.
Above we considered image domains as manually defined, intuitive types. As presented in Section 3.2, we also consider a continuous measure of domain distance based on image appearance. Tab. 9 visualizes the domain distance for semantic segmentation datasets, with lighter colors indicating a smaller distance. This appearance distance correlates with the manually defined domains. One exception is the vKITTI2 dataset, which seems to be closer to other driving datasets than to the other synthetic dataset vGallery.
With such a continuous measure of domain distance we can compare its influence on transfer gains to the influence of source size. We first measure the rank correlation between domain distance and transfer gains using Kendall τ [52]. For the small target training set experiments this yields a correlation of 0.40, while the correlation between source size and transfer gains is 0.12. Therefore the image domain is a more important factor of influence than source size. A4: A source from a broader domain helps a target from a more specific domain contained within it. This can be qualitatively observed from Tab. 4a. For example, sources from the consumer domain achieve positive transfer not only on consumer targets, but also on driving or indoor. Indeed, the consumer datasets also contain images of street views and indoor scenes (Fig. 1). Conversely, if we look at the consumer domain as a target, none of the driving or indoor sources yield any positive transfer. The same pattern can also explain the success of the multi-source model: this model always yields positive transfer, and by design it contains images from each of the manually defined target domains.
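The Kendall τ rank correlation used above can be illustrated with a small self-contained implementation (the τ-a variant, which ignores ties; this simplification is ours and suffices for illustration):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a rank correlation between two equal-length lists:
    (concordant pairs - discordant pairs) / total pairs."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

In the paper's setting, xs would hold per-source domain distances (or source sizes) and ys the corresponding transfer gains.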
To visualize domain inclusion, we created a t-SNE plot of the activations of 150 images from each semantic segmentation dataset (Fig. 3a; same features as Section 3.2). The visualization shows that the consumer domain points are indeed scattered around the plot, covering almost all other domains. Other domains form more compact clusters, e.g. aerial in green, and driving in yellow/orange. Finally, we also quantitatively demonstrate that domain inclusion matters. To do so, we change the function used in Eq. (1) to match target images to source images. We have four strategies: (1) Assign each target image one-to-one to a source image such that the overall Earth Mover's Distance is minimized (i.e. matching is done through the Hungarian algorithm). This is a measure of overlap of the distributions of the two image domains. (2) Assign each target image to its closest source. This measures inclusion of the target dataset in the source dataset and is identical to Eq. (1). (3) Assign each source image to its closest target. A large distance here means that many source images are far from the target.
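Measure (2), inclusion of the target in the source, can be sketched in NumPy (assuming L2 distances between image feature vectors; the paper's exact distance function may differ):

```python
import numpy as np

def inclusion_distance(target_feats, source_feats):
    """Average distance from each target image feature to its closest
    source image feature (strategy 2: target-in-source inclusion)."""
    t = np.asarray(target_feats, float)   # shape (n_target, d)
    s = np.asarray(source_feats, float)   # shape (n_source, d)
    # pairwise L2 distances, shape (n_target, n_source)
    d = np.linalg.norm(t[:, None, :] - s[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```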
We correlate each of the four measures to transfer gains using Kendall-τ rank correlation. To remove any influence of task type and dataset size (both source and target), we do this in the small source and small target set setting (Section 5.4). The results are shown in Tab. 10. The assignment measure capturing inclusion (2) has the highest correlation with transfer gains. Interestingly, measure (3) has (very) low correlation. This suggests that even if the source domain has images far from the target domain, this has only a small influence on transfer gains. This confirms that the most important aspect for a source to yield positive transfer is that it should include the target image domain. A5: Multi-source models yield good transfer, but are outperformed by the largest within-domain source.
Tab. 4a shows that our multi-source semantic segmentation model yields positive transfer gains for all targets. This is in line with our previous observations that the source domain should include the target, while it is less important that it spans a much broader range. Indeed, by construction the source data of this multi-source model spans all domains. However, we also observe that the largest within-domain source almost always yields better transfer gains: in 7 out of these 10 experiments the largest within-domain source yields significantly better transfer (i.e. > 2%). Only

Fig. 3: Illustration of the appearance distribution of the datasets, according to different feature encoding networks, using t-SNE. Fig. 3a visualizes the semantic segmentation datasets using features extracted from the multi-source network. Fig. 3b and Fig. 3c visualize the COCO and SUN RGB-D datasets using networks trained on different task types (ILSVRC'12 indicates classification); panel (c) shows different networks on SUN RGB-D.
for the target SUN RGB-D the multi-source model is better. This is explainable since the multi-source model includes both COCO (consumer) and ScanNet (indoor) which are both good sources for SUN RGB-D. Hence the multi-source model arguably covers the domain of this dataset better than the largest single source model.

A6: Transfer across task types can bring positive transfer gains.
Depending on the choice of task types for the source and target, cross-task-type transfer can be beneficial: for 65% of the targets within the same image domain as the source, cross-task-type transfer results in positive transfer gains (Tab. 8). To study this effect more precisely, we split the results over different cross-task-type pairs in Tab. 11. The left column looks at cross-task-type transfer on the same dataset (the specific setting studied in detail in Taskonomy [126]).
Here we can see that object detection and keypoint detection help each other. Segmentation helps object detection but not vice versa. Semantic segmentation and keypoint detection hurt each other. When transferring to a different dataset, but still staying in the same image domain (right column), results show similar effects (except that keypoint detection stops helping object detection).
Our results confirm some of the observations in [126]: we also observe that some task types are easier to transfer from / to than others, and that transfer across task types is asymmetric, i.e. if task type A is beneficial for task type B, the reverse might not hold.
However, even when we observe gains through cross-task-type transfer within the same dataset, if there is another within-domain source with the same task type, it yields even better gains. Two examples are the object detection results for BDD and COCO as targets in Tab. 4c. This suggests that having a source with the same image domain and task type as the target is better than having annotations for another task type on the target dataset.
Tab. 7 shows that for cross-task-type transfer to work, the image domain of source and target should be the same: cross-domain, cross-task-type transfer yields negative transfer for 79% of the experiments, and positive for only 5%. Moreover, out of those 5%, all except one experiment have consumer images as a source, which includes the target domains of driving, indoor, and (arguably) underwater. Hence cross-task-type transfer can only be expected to work if the image domain of the source includes that of the target.
To understand how the task type influences the features learned by the source models, we visualize the features learned on the same dataset but for different task types using t-SNE (cosine distance, which removes arbitrary scaling effects). We do this for the exact same set of images for COCO (Fig. 3b) and SUN RGB-D (Fig. 3c). We observe that each model creates compact cluster representations, showing there is a larger distance between an identical image in the feature spaces of two different models than between two different images in the feature space of one model.

Tab. 10: Kendall-τ correlation between transfer gains and dataset distance for semantic segmentation in the small source and small target setting (Section 5.4). As distance we take the average distance between images from the source and target datasets, while varying the assignment method (i.e. the min function in Eq. (1)). The assignment with the highest correlation measures inclusion of the target by the source.

A7: Transfer within-task-type and within-domain yields very positive effects. Combining observations A3 and A6, our recommendation to obtain positive transfer is to use a source model trained on the same domain and for the same task type as the target task. In this setting, a total of 69% of all source-target pairs exhibits positive transfer (Tab. 7). Moreover, the best available source in this setting leads to positive transfer for 73% of the target tasks (and 64% very positive).

A8: For positive transfer, the source training set should be larger than the target training set. Quantitatively, we measure positive transfer in 60% of within-domain and within-task-type experiments where the source training set is larger than the target training set. Conversely, when the source training set is smaller than the target training set, only 5% of our experiments result in positive transfer.

A9: Transfer learning effects are larger for small target training sets. We expect this to hold because when the target task already offers a large training set, the target model can learn the required visual knowledge from the target task directly. We can clearly see such effects by comparing the within-domain segmentation results of the small and full target training set settings, i.e. Tab. 4a vs Tab. 5a. The absolute transfer gains are higher in the small target training set setting than in the full one, while the overall patterns remain roughly the same.
We now compare Tab. 4a and Tab. 5a quantitatively. We first examine positive transfer gains: out of all experiments with positive transfer gains in the full target setting, 80% also have a positive transfer in the small target setting. Vice versa, out of all experiments with positive transfer gains in the small target setting, 43% also have significant gains in the full target setting (for most of these experiments the gains become insignificant). When looking at negative transfer effects: if transfer effects are negative on a small target training set, these effects remain negative or become neutral in the full target set in 100% of these experiments.
Practically, this suggests a quick test for whether a source dataset may be beneficial for a target: first train on a small subset of the target training set. Then, if transfer effects are negative, discard this source for this target. If transfer effects are positive, explore this source further, since there is a good chance its benefits are kept when using the full target training set. A10: The source domain including the target is more important than the number of source samples. Here we aim to disentangle the size of the source training set from the breadth of the image domain it spans (e.g. COCO covers a broad image domain, as visualized in Fig. 3b). The influence of source size is eliminated in our experiment in Tab. 6, where we use a small training set for all sources (and a small target training set too). It is instructive to compare this to Tab. 4a, where full source sets are used.
Even after fixing the source size, most patterns stay roughly the same. For example, COCO remains a good source for other consumer and indoor datasets. Indeed, COCO is not only the largest but also the most diverse consumer dataset. Similarly, Mapillary remains a good source for most other driving datasets, while other driving sources are generally worse. Mapillary was designed to capture driving conditions across all continents, whereas the India Driving Dataset covers only Indian roads, Berkeley Deep Drive is US-only, and CityScapes is (mostly) German. CamVid consists of frames from four videos, and therefore has a narrow image domain. This suggests that many of the effects in Tab. 4a which we attributed to dataset size in A8 can in fact be attributed to the property of the source domain including the target domain. Of course, source size is generally correlated with the breadth of its domain (e.g. COCO and Mapillary were both designed to span broader domains), which is why care needs to be taken to disentangle their effects.

ADDITIONAL SCENARIOS
In this section we perform several additional experiments to verify whether our work generalizes to other scenarios. We only report the main observations, while details are provided in Appendix B. Fixed image resolution. We redo the segmentation experiments in the small target training setting (i.e. Tab. 4a) but now fix the image resolution to 713×713 pixels across all datasets. Results in Tab. A.1 (see Appendix) show highly similar transfer patterns and all our previous conclusions hold. ResNet50. We change the backbone to ResNet50 [44] and redo the experiments in the small target training setting for segmentation and partially for detection (i.e. we redo Tab. 4a and partially Tab. 4c). Results in Tab. A.2a (see Appendix) suggest that ResNet50 benefits more from transfer learning, possibly because transfer improves its ability for localized predictions. But again, we observe similar transfer patterns and all our previous conclusions hold. Self-supervised ILSVRC'12 training. We now redo the experiments with ResNet50, but instead start from a checkpoint obtained by self-supervised learning on ILSVRC'12 using the publicly available SimCLR V2 implementation [15]. We then train our sources fully supervised as before. Results in Tab. A.2b (see Appendix) show changes primarily in the baseline: directly training from self-supervised ILSVRC'12 weights is better than directly training from ILSVRC'12 fully supervised classification weights, as also found in [29], [133]. At the same time, the patterns are again very similar and our main conclusions remain unaltered. Self-supervised Transfer Chain. We do a single experiment where we use SimCLR [15] to create a self-supervised transfer chain: ILSVRC'12 self-supervised → COCO self-supervised → target. We find that this transfer chain is mildly beneficial for COCO image classification. However, for segmentation our results in Tab.
A.4 (see Appendix) show that this transfer chain is worse than directly training from ILSVRC'12 self-supervised weights for all target datasets. On average it is 0.07 IoU worse, and even for COCO as a target results are 0.02 IoU worse. This result suggests that self-supervised pre-training is biased towards image classification. Furthermore, image classification results of self-supervised models are not very predictive of performance on other tasks, as also shown in the dedicated study of [29].

CONCLUSION
In this paper we performed over 2000 transfer learning experiments across a wide variety of image domains and task types. Our systematic analysis of these experiments led to the following conclusions: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source task should include the image domain of the target task to achieve the best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.
Our findings provide support for the success of large-scale pre-training for transfer learning with a single very large source spanning a mixture of domains [51], [59], [72], [102], [127]. Since a good source set should include the target but can be arbitrarily broad, training on a massive dataset which covers most visual domains should yield a good source model for most target tasks. However, this form of pre-training inevitably requires transfer across task types, whose success depends on both the source and target task types. This suggests that future works exploring pre-training should also focus on structured prediction tasks.

Appendices
In these appendices we provide more details on the network architectures in Appendix A, generalization experiments in Appendix B, and more extensive experimental results in Appendix C. For all experiments, not only the relative transfer gain is shown, but also the absolute value of the task-specific performance metric. In Appendix D we discuss the additional computational cost of transfer chains. Finally, in Appendix E we discuss potential data overlap in our collection of datasets.

APPENDIX A BACKBONE AND TASK TYPE SPECIFIC NETWORK ARCHITECTURES
In this section we describe in detail the network architectures used throughout our study.

A.1 Backbone Architecture
Transfer learning through pre-training is only possible when the models for all task types share the same backbone architecture. To have meaningful results, we need to choose a backbone which works well across all task types we explore. Therefore we choose the recent high-resolution backbone HRNetV2 [112] (illustrated in Figure A.1). It has two main advantages. First, HRNetV2 was shown to outperform ResNet [44], ResNeXt [121], Wide ResNet [125], and stacked hourglass networks [78] on three of the tasks we explore: semantic segmentation, keypoint detection, and object detection. Second, its design allows using relatively shallow task-type-specific heads compared to the number of parameters in the backbone. In Table 2 (in the main paper) we provide an overview of the number of trainable parameters. The backbone consists of 69M parameters, making up 87%-99.99% of all parameters depending on the task type and the number of classes. Architecture. The HRNetV2 backbone extends the ResNet architecture to preserve high-resolution spatial features. A regular ResNet consists of blocks organised in four 'stages', each reducing the image resolution by a factor of 2. HRNetV2 follows this design, but after each stage it also keeps a parallel high-resolution branch (Figure A.1). All branches from one stage are fed into the next, so this next stage incorporates information from the representations at different resolutions.

A.2 Semantic Segmentation
Architecture. For semantic segmentation, we adopt the network head proposed in [112] (Figure A.3a). It consists of three stages: (1) a 1x1 convolution changing the dimensionality of the backbone output to the number of classes K; this implements a linear classifier at each pixel; (2) a softmax non-linearity; (3) a bilinear upsampling layer to produce the final predicted segmentation map of size H × W × K.
Training. The network is trained using the cross-entropy loss at each pixel, ignoring the background class: L_seg = − Σ_k δ[y_k ≠ bg] log p(y_k | x_k), where δ denotes the Dirac-delta function returning 1 iff the pixel depicts a non-background class, and p(y_k | x_k) is computed using the softmax function.
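A minimal NumPy sketch of this background-masked cross-entropy (per-pixel logits and integer labels; label 0 being background is our assumption for illustration):

```python
import numpy as np

def masked_cross_entropy(logits, labels, bg_class=0):
    """Per-pixel cross-entropy, ignoring pixels labeled as background.
    logits: (n_pixels, K) raw scores; labels: (n_pixels,) class ids."""
    logits = np.asarray(logits, float)
    labels = np.asarray(labels)
    # numerically stable log-softmax over the class dimension
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    mask = labels != bg_class                      # delta[y_k != bg]
    picked = log_probs[np.arange(len(labels)), labels]
    return -(picked * mask).sum() / max(mask.sum(), 1)
```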
Starting from ILSVRC'12 pre-trained weights, we train on the source dataset using SGD with momentum. We use multiple Google Cloud TPU-v3-8 accelerators with synchronized batch norm and a batch size of 32. For almost all source datasets we use a stepwise learning rate decay, and optimize per source the starting learning rate, the number of steps after which the learning rate is lowered (by a fixed factor of 10), and the total number of training steps. However, we found performance to be unstable for the small SUIM and KITTI datasets. For these we found that switching to a "poly" learning rate policy (base_lr × (1 − curr_step / max_steps)^p, with p = 0.9) stabilizes training, as also done in [14], [65].
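The "poly" schedule is simple enough to state in code:

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """'Poly' learning rate policy: decays base_lr to 0 over max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power
```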
Evaluation. Performance is evaluated by the Intersection-over-Union (IoU), averaged over classes [30]. During evaluation we process complete images at a single resolution only. We scale each image to match the resolution used during training.
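Class-averaged IoU can be computed from a confusion matrix; a NumPy sketch of this common formulation (averaging only over classes that occur, which is our simplification):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union over classes present in pred or gt.
    pred, gt: flat integer label arrays of equal length."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)           # rows: ground truth, cols: prediction
    inter = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - np.diag(conf)
    valid = union > 0                        # skip classes absent from both
    return (inter[valid] / union[valid]).mean()
```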

Results.
To validate our training setup, we compare the performance of our networks on a subset of the MSEG dataset collection [65]. While these datasets are also part of our collection, for this experiment we use the annotations provided in MSEG, which have fewer classes than in our transfer setup, c.f. Section 5 (in the main paper). As Figure A.3b shows, our models perform on par with those of [65], which also use a HRNetV2 backbone and a linear semantic segmentation layer, but have different normalisation and data augmentation steps and a different implementation.

A.3 Object Detection model architecture details
Architecture. We follow the CenterNet approach of Zhou et al. [131] (Figure A.4a). For each pixel in the feature map and for each class, a binary classifier evaluates whether this pixel is the center point of an object bounding box. Additionally, at these points we predict a translation-correcting offset (in x- and y-coordinates) and the size of the bounding box (width and height). Using center points to predict bounding boxes results in a simpler model than a two-stage architecture like Faster-RCNN [87], while being about as accurate [27], [131]. Following [112], [131] we add a feature adapter (neck) between the backbone and the head. This neck consists of a 1x1 convolution followed by a 3x3 convolution, batch normalization [48], and ReLU. The CenterNet detection head outputs three prediction maps, each the result of a 3x3 convolution, ReLU, and another 3x3 convolution. The three pixel-wise prediction maps are: M1 a K-channel heatmap Ŷ_xyc, one channel per class, where each pixel is assigned the likelihood of being the center of an object of a certain class, computed using a Gaussian kernel; M2 a two-channel offset map Ô_p representing the x- and y-offsets of the point to the real center of the bounding box p; and M3 a two-channel width/height map Ŝ_p, which predicts the width and height of an object centered at the pixel p.

With CenterNet we denote a family of object detection models which predict boxes via their center points with fully convolutional architectures [27], [131]. The paper of [27] is called CenterNet, while the GitHub repository of [131] is also called CenterNet (https://github.com/xingyizhou/CenterNet). We mostly follow the approach of [131].

Following [131] the final box predictions are obtained by combining the top T = 100 highest-scoring pixels from the class-specific heatmaps with their respective x/y offsets to define the box centers. The box dimensions are extracted from the width/height prediction map. Then, optionally, non-maximum suppression (NMS) is applied.
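The decoding step can be sketched as follows (a simplified NumPy illustration of top-T peak extraction; it omits the local-maximum filtering of the real implementation):

```python
import numpy as np

def decode_boxes(heatmap, offsets, sizes, top_t=100):
    """Turn CenterNet-style prediction maps into (x1, y1, x2, y2, class, score)
    boxes. heatmap: (H, W, K) center scores; offsets, sizes: (H, W, 2)."""
    h, w, k = heatmap.shape
    flat_idx = np.argsort(heatmap.reshape(-1))[::-1][:top_t]  # top-T pixels
    boxes = []
    for idx in flat_idx:
        y, x, c = idx // (w * k), (idx // k) % w, idx % k
        cx = x + offsets[y, x, 0]          # translation-corrected center
        cy = y + offsets[y, x, 1]
        bw, bh = sizes[y, x]               # predicted width and height
        boxes.append((cx - bw / 2, cy - bh / 2,
                      cx + bw / 2, cy + bh / 2,
                      c, heatmap[y, x, c]))
    return boxes
```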

Training. During training, three losses are computed, one for each head [131]:

L1 a logistic regression focal loss [66], [69] for the class heatmap, acting on all pixels:

L_cls = -(1/N) Σ_xyc { (1 - Ŷ_xyc)^α log(Ŷ_xyc) if Y_xyc = 1; (1 - Y_xyc)^β (Ŷ_xyc)^α log(1 - Ŷ_xyc) otherwise }   (A.2)

Here Y_xyc is a sum of Gaussians around all ground-truth object centers, projected to the low-resolution feature map. For α = 2 and β = 4 we use the default parameters provided in [131].

L2 an L1-loss for offset prediction, acting only on the N pixels that are close to center points of objects in the ground truth (those where Y_xyc = 1); here o_k are the ground-truth offsets:

L_off = (1/N) Σ_k |Ô_k − o_k|   (A.3)

L3 an L1-loss for width and height prediction, again acting only on the N pixels close to ground-truth centers; s_k are the ground-truth box sizes:

L_size = (1/N) Σ_k |Ŝ_k − s_k|

The final training objective is L = L_cls + λ_off L_off + λ_size L_size. We train using the same procedure as for semantic segmentation, but with a batch size of 128 and the Adam optimizer [54], until convergence.

Evaluation. We use the COCO definition of mean Average Precision (mAP) [70], which evaluates mAP at various IoU thresholds and averages them. For COCO we do not use non-maximum suppression, since this did not improve results on this dataset, as also observed by [131] (on other datasets NMS was useful; see details in Appendix C).

Results. We validate our setup by training on the COCO training set and evaluating performance on its validation set. We compare to results reported in [131], and to an HRNetV2-W48 backbone with a feature pyramid using the CenterNet variant of [112]. As can be seen in Figure A.4b, the best results are obtained by using data augmentation during evaluation (multi-scale and horizontal flip) or by using a feature pyramid. Without these enhancements, our implementation performs close to the Hourglass-104 results of [131] (39.0 mAP and 40.3 mAP, respectively). Hence we conclude that we can base our transfer learning experiments on a strong, modern object detection framework.
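As an illustration, the penalty-reduced focal loss for the class heatmap (loss L1) can be written in a few lines of NumPy. This is a sketch of the standard CenterNet formulation [131], not the paper's actual implementation; the function name is illustrative.

```python
import numpy as np

def heatmap_focal_loss(y_hat, y, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced focal loss over all heatmap pixels (sketch).

    y_hat: predicted heatmap values in (0, 1);
    y: Gaussian-smoothed targets, equal to 1 exactly at object centers.
    """
    pos = (y == 1)
    n = max(pos.sum(), 1)  # normalize by the number of object centers
    # hard positives: down-weight pixels the model already predicts well
    pos_term = ((1 - y_hat[pos]) ** alpha * np.log(y_hat[pos] + eps)).sum()
    # negatives: (1 - y)^beta reduces the penalty near true centers
    neg_term = ((1 - y[~pos]) ** beta * y_hat[~pos] ** alpha
                * np.log(1 - y_hat[~pos] + eps)).sum()
    return -(pos_term + neg_term) / n
```

A confident, correct heatmap yields a loss near zero; a confidently wrong one is penalized heavily.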

A.4 Keypoint Detection
Architecture. We again follow the CenterNet approach of [131] (Figure A.5a). The keypoint head architecture resembles the object detection architecture, yet with some differences. We create 6 pixel-wise prediction heads, each composed of a 3x3 convolution, ReLU, and another 3x3 convolution. We do not use a neck, but directly feed the feature maps produced by the backbone into these heads. The maps output by the first three heads are analogous to the object detection ones, but for a single class only, to detect the box around the person (or dog, depending on the dataset). The remaining three maps are: M4 a K-channel keypoint heatmap, where each channel is the output of a binary classifier predicting one of the K keypoints; M5 a keypoint offset map: a two-channel x/y offset map for a translation correction of the keypoint; M6 a keypoint allocation map: a 2 × K-channel map marking the displacement from the center of an object at this pixel to each of the K keypoints belonging to that object. We obtain keypoints for an image by processing these maps: first, a bounding box around the person/dog is obtained by combining the box centers (M1) with the offsets (M2) and the box dimensions (M3). Then, M4 and M5 are combined to create a set of candidate keypoint locations with accompanying confidence scores within the person/dog bounding box. Finally, M6 is used to associate each candidate keypoint with the person/dog object it belongs to.

(Caption of the architecture figure: H blocks indicate a series of convolutional layers; connections between blocks include strided convolutions and upsampling to the respective dimensions. Each row doubles the number of channels. We use C = 48 channels, resulting in an output of H/4 × W/4 × 720.)

Training. For M1-M3 we use the same losses L1-L3 as for object detection. For M4-M6 we add the following [131]: L4 a focal loss for multi-class joint classification, analogous to object detection loss L1 as described by Eq. (A.2); L5 an L1-loss for offset prediction, analogous to object detection loss L2 as described by Eq. (A.3); L6 an L1-loss to directly regress the initial keypoint locations w.r.t. the object centers predicted by M1+M2. We largely follow the training setup for object detection, using a batch size of 128 and the Adam optimizer [54]. We set the starting learning rate and number of steps per dataset.

Evaluation. We report the mean Average Precision at 0.5 Object Keypoint Similarity (OKS), as defined by the COCO challenge [1].

Results. To validate our setup we train for keypoint prediction on the COCO dataset. In Figure A.5b we report the performance of our network architecture and compare to [131]. They use as backbone a stacked hourglass [78] consisting of two consecutive hourglass models, which they call Hourglass-104. Their model yields 84.2 AP50. For completeness, we include the results of a more complex two-stage cascade keypoint detection approach [112], which obtains very good performance. In this paper we use the HRNetV2 backbone with the simpler single-stage CenterNet heads. Our model yields 81.1 mAP, which is close to [131] and thus strong enough for our transfer learning exploration.
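As a reference for this evaluation metric, Object Keypoint Similarity can be sketched as follows, assuming the COCO definition in which the object scale s² is the object area and each keypoint type has a falloff constant κ_i; function and argument names are illustrative.

```python
import numpy as np

def oks(pred, gt, visible, area, kappas):
    """Object Keypoint Similarity (sketch of the COCO definition).

    pred, gt: (K, 2) predicted / ground-truth keypoint coordinates;
    visible: (K,) boolean mask of labelled keypoints;
    area: object area (the scale s^2); kappas: (K,) per-keypoint constants.
    """
    d2 = ((pred - gt) ** 2).sum(axis=1)           # squared distances
    sim = np.exp(-d2 / (2 * area * kappas ** 2))  # per-keypoint similarity
    return sim[visible].mean()                    # average over labelled keypoints
```

mAP at 0.5 OKS then counts a predicted pose as correct when its OKS with the matched ground-truth pose is at least 0.5.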

A.5 Depth Estimation
Architecture. For monocular depth estimation, we modify the architecture for semantic segmentation. We propose to add a single regression layer on top of the output of the backbone, followed by a softplus layer and a bilinear upsampling layer to match the original resolution (Figure A.6a). The softplus activation converts the logit value x to depth as log(exp(x) + 1) and is also used in [40], [132]. It is a differentiable clipping function which ensures that depth predictions are positive.

Training. To train the network, we combine two losses following [19], [68]: • an L1-loss between the predicted depth map and the ground-truth map at each pixel; • a smoothness loss which encourages differences between neighbouring depth pixels to be the same in the predicted depth map and the ground truth. More precisely, we take the derivative in the x-direction for both the predicted and ground-truth maps and compare them with an L1-loss, and repeat this for the y-direction. This results in:

L = (1/N) Σ_i |y_i − ŷ_i| + (1/N) Σ_i (|∂_x y_i − ∂_x ŷ_i| + |∂_y y_i − ∂_y ŷ_i|)

where y denotes the ground-truth depth map, ŷ the predicted depth map, and N the number of pixels. For training we use a batch size of 64 and the Adam optimizer [54].

Evaluation. We validate on the NYUDepthV2 [97] dataset, using the root mean squared error (RMSE) metric and the δ < 1.25 accuracy, where δ = max(ẑ/z, z/ẑ) is a measure of relative accuracy defined in [64].

Results. We compare our proposed network to two recent ResNet-based models: [33], which uses depth-specific losses based on ordinal regression; and [19], which uses the same losses as we do, but with a depth-specific network architecture on top of the ResNet backbone. The results are presented in Figure A.6b. We observe that our model is slightly better than both others when measured using RMSE, while slightly behind on the δ < 1.25 measure. Hence, we conclude that our lightweight depth prediction head is well suited for monocular depth estimation.
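The combined training loss and the δ < 1.25 metric can be sketched in NumPy as follows; forward differences stand in for the x/y derivatives, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def depth_loss(y, y_hat):
    """L1 depth loss plus gradient-matching smoothness loss (sketch).

    y, y_hat: (H, W) ground-truth and predicted depth maps.
    """
    n = y.size
    l1 = np.abs(y - y_hat).sum() / n
    # compare x- and y-derivatives (forward differences) with an L1-loss
    dx = np.abs(np.diff(y, axis=1) - np.diff(y_hat, axis=1)).sum()
    dy = np.abs(np.diff(y, axis=0) - np.diff(y_hat, axis=0)).sum()
    return l1 + (dx + dy) / n

def delta_accuracy(z, z_hat, thr=1.25):
    """Fraction of pixels with delta = max(z_hat/z, z/z_hat) below thr."""
    delta = np.maximum(z_hat / z, z / z_hat)
    return (delta < thr).mean()
```

Note that a constant depth offset is penalized only by the L1 term, since it leaves all spatial derivatives unchanged.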

APPENDIX B ADDITIONAL SCENARIOS
In this section we perform several additional experiments to verify whether our findings generalize to different scenarios. In particular, while we optimized the image resolution per dataset (cf. Section 4.1 in the main paper), we explore what happens when we fix the image resolution across all datasets. In addition, we partially redo experiments using ResNet50 [44] as backbone. Finally, we explore self-supervision in our transfer learning chain.

Fixed image resolution. We redo the segmentation experiments in the small target training setting (i.e. Table 4a in the main paper), but now fixing the image resolution to 713 × 713 pixels across all datasets. Results are shown in Table A.1. More importantly, for all experiments with significant positive transfer gains, we find that the best source remains the best source. The only exception is SUN RGB-D as target: the best three sources are the same but in a different order. All of these sources yield high transfer gains. We conclude that our analysis (Section 6 in the main paper) holds when fixing the image resolution.

ResNet50. We change the backbone to ResNet50 [44] and redo the experiments in the small target training setting for segmentation (i.e. redo Table 4a in the main paper) and partially for detection (i.e. redo Table 4c in the main paper). Results are shown in Table A.2a. For segmentation, on average results are significantly worse (-0.11 IoU) compared to the HRNetV2 backbone used in our main experiments. But again, most transfer patterns stay the same. Quantitatively, for 81% of the experiments with significant positive or negative gains with the HRNetV2 backbone (Table 4a in the main paper), similar gains are observed for ResNet50. In fact, gains are stronger for 69% of all experiments, both in the positive and negative direction. There are some differences though: generally, for ResNet50 there are more source-target combinations with significant positive transfer gains.
This happens for iSAID as a target dataset, for vKITTI2 as a source for driving targets, and for consumer sources on consumer, driving, and indoor targets. Measured quantitatively, in 13% of the experiments where we previously observed negative transfer gains, we now see positive transfer gains. This suggests that ResNet50 benefits more from transfer learning. We hypothesize this is because its activations are at a lower resolution, so training on a segmentation source dataset makes the network better suited for localized predictions.
For detection (Table A.3) there are a few more changes when using ResNet50. Within detection, COCO remains a good source, VOC07 becomes a less good source (but not a negative one), while BDD now yields positive transfer instead of negative for VOC07 and Underwater Trash as targets. When using segmentation as a source, all but one of the sources which had positive transfer gains using HRNetV2 still yield positive transfer. More interestingly, the number of source-target combinations which yield positive transfer gains doubles (from 6 to 12). Again, this suggests that training on segmentation makes the ResNet50 model better suited for localized predictions.
To conclude, ResNet50 models perform worse than HRNetV2 in terms of absolute performance; however, ResNet50 generally benefits more from transfer learning, possibly because transfer improves its ability to make localized predictions. At the same time, the overall trends which we observed in Section 6 (in the main paper) still hold.

Self-supervised ILSVRC'12 training. We now redo the experiments with ResNet50, but instead start from a checkpoint obtained by self-supervised learning on ILSVRC'12 using the publicly available SimCLR V2 implementation [15]. Starting from these weights, we train the sources fully supervised as before. Results are shown in Table A.2b. We compare results with our other ResNet50 experiments (i.e. Table A.2a). We see that the best or top-3 sources remain the same for all experiments with positive transfer gains, while most previously observed patterns hold. However, there are noticeably fewer positive gains overall, some even turning into negative gains. This is especially visible for consumer as source and driving as target. But a closer look reveals that the most significant changes happen in the ILSVRC'12 baseline: results starting from self-supervised weights are on average 0.01 IoU higher than starting from fully supervised classification weights, while results for the full transfer chains are comparable. This manifests itself as lower measured transfer gains. The only exception is iSAID, where the self-supervised ILSVRC'12 baseline is 0.03 IoU lower, while results for the full transfer chain are comparable. Hence, here we measure higher transfer gains.
To conclude, the main observation is that self-supervised ILSVRC'12 pre-training generally yields slightly better segmentation results than fully supervised classification pre-training, which is in line with [29], [133]. Still, overall patterns remain the same and the overall conclusions in Section 6 (in the main paper) remain unaltered.

Self-supervised transfer chain. We perform a single experiment where we use SimCLR [15] to create a self-supervised transfer chain: ILSVRC'12 self-supervised → COCO self-supervised → target. We first verify whether the self-supervision works for COCO image classification as a target, following a standard self-supervised evaluation protocol [15], [18], [41]: we take the trained backbone, freeze its weights, attach a linear classifier, and train on COCO image classification. We do this for both ILSVRC'12 weights and ILSVRC'12 → COCO weights. We find that the additional self-supervised training on COCO leads to a small improvement in classification accuracy of 0.4%.
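The linear evaluation protocol can be sketched as softmax regression on frozen backbone features. This is a minimal illustration of the idea, not the SimCLR evaluation code; the function name and hyperparameters are illustrative.

```python
import numpy as np

def linear_probe(features, labels, n_classes, lr=0.1, steps=500):
    """Linear evaluation: train a linear classifier on frozen features.

    features: (N, D) backbone outputs (backbone weights stay frozen);
    labels: (N,) integer class ids. Returns training accuracy.
    """
    n, d = features.shape
    w = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * features.T @ (p - onehot) / n  # softmax-regression gradient
    preds = (features @ w).argmax(axis=1)
    return (preds == labels).mean()
```

Because only the linear weights are trained, the resulting accuracy measures how linearly separable the frozen representation is.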
We now test the transfer chain on segmentation for the small target training setting. Results are shown in Table A.4. Interestingly, we find that this chain is worse than directly training from ILSVRC'12 self-supervised weights for all target datasets. On average it is 0.07 IoU worse, while even for COCO as a target results are 0.02 IoU worse.
This result suggests that self-supervised pre-training using SimCLR [15] is biased towards image classification. Furthermore, image classification results on self-supervised models are not very predictive of performance on other tasks, as also shown in the dedicated study of [29].

APPENDIX C COMPLETE RESULT TABLES
This appendix reports more extensive results for all of our experiments, including both the absolute performance in the task-type-specific metric and the relative transfer gain calculated w.r.t. the ILSVRC'12 column of the absolute performance (cf. Eq. (2) in the main paper). The experiments and metrics are summarized as follows: • For semantic segmentation we use Intersection-over-Union (IoU), averaged over classes [30]. See Table A

APPENDIX D COMPUTATIONAL COSTS OF TRANSFER CHAINS
We performed the experiments in this paper on Google Cloud TPU-v3 chips. We calculate costs in terms of TPU-hours, defined as the number of TPUs multiplied by the computation time: • ILSVRC'12 pre-training (ImageNet) takes 364 hours.
• Training a COCO segmentation source model (largest consumer source) takes an additional 92 hours. • Training a Mapillary segmentation source model (i.e. the largest driving source) takes an additional 352 hours (due to higher resolution images). • Training a multi-source model takes 228 hours. This means that pre-training costs for our transfer chains are increased by 25%-97%, compared to the standard practice of pre-training on ILSVRC'12.
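The quoted 25%-97% range follows directly from the figures above; a quick check:

```python
# TPU-hour figures from the bullet list above
ilsvrc12_hours = 364          # ILSVRC'12 pre-training
coco_source_hours = 92        # extra cost of the COCO segmentation source
mapillary_source_hours = 352  # extra cost of the Mapillary source

low = 100 * coco_source_hours / ilsvrc12_hours        # smallest increase
high = 100 * mapillary_source_hours / ilsvrc12_hours  # largest increase
print(f"pre-training cost increase: {low:.0f}%-{high:.0f}%")  # 25%-97%
```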
However, the source models are trained only once. The same source models can then be repeatedly used for transfer learning to many target tasks. This means that the relative additional cost of pre-training becomes increasingly small as more researchers reuse the same source model for new target tasks.

APPENDIX E DATASET OVERLAP
When using 21 datasets there is always a risk that images from the test set of one dataset are used for training in another dataset, which is undesirable for fair evaluation. Since we have not quantified such overlap, we cannot guarantee there is none at all. Still, we believe the risk that such potential overlap significantly changes the results of our experiments is minimal, for the following reasons: 1) We evaluate structured prediction tasks, while the largest dataset (ILSVRC'12) has only classification annotations; thus even if an image has already been seen, it had a different type of annotation. 2) Within the same annotation type, datasets either differ in visual domain (e.g. underwater and aerial are not expected to have any overlap), or they are acquired in different geographical regions (e.g. BDD and IDD). 3) Since the collection consists of popular public benchmarks, major overlaps between these datasets would have been reported.