Self-supervised Domain Adaptation for Computer Vision Tasks

Recent progress in self-supervised visual representation learning has achieved remarkable success on many challenging computer vision benchmarks. However, whether these techniques can be used for domain adaptation has not been explored. In this work, we propose a generic method for self-supervised domain adaptation, using object recognition and semantic segmentation of urban scenes as use cases. Focusing on simple pretext/auxiliary tasks (e.g. image rotation prediction), we assess different learning strategies to improve domain adaptation effectiveness through self-supervision. Additionally, we propose two complementary strategies to further boost the domain adaptation accuracy of semantic segmentation within our method, consisting of prediction layer alignment and batch normalization calibration. The experimental results show adaptation levels comparable to the most studied domain adaptation methods, thus bringing self-supervision as a new alternative for achieving domain adaptation. The code is available at https://github.com/Jiaolong/self-supervised-da.


INTRODUCTION
Since supervised (deep) machine learning became the key to solving computer vision tasks, the availability of task ground truth (i.e. supervision information) associated with the raw data (i.e. images and videos) has been a major practical problem. Training an image or video classifier requires associating classes or attributes with whole images/videos [1]-[4], training an object detector requires manually drawing object bounding boxes [5], [6], training a CNN for semantic segmentation requires delineating the borders between the considered classes [7], [8], etc. This kind of ground truth (bounding boxes, class borders) is usually provided by human labeling, which is a costly process prone to errors due to subjectivity and fatigue. Therefore, procedures aiming at reducing human labeling, or alternatively at obtaining the most from a fixed budget for new labels, have become a research topic in themselves. This underlying aim appears under different names depending on the practical situation at hand, i.e. the learning conditions. Under this umbrella we find concepts such as active learning, self-labeling, transfer learning, domain adaptation, and self-supervision.
In active learning [9]-[11], the learner receives a set of unlabeled data (images, videos) for training an accurate visual model, which must be done while minimizing the labeling effort by choosing the best training data out of the total amount. This results in an iterative process where a human worker labels newly and automatically selected data in each cycle for model refinement. It contrasts with passive learning, where the training data is selected at random, eventually requiring a larger labeling budget.
In self-labeling [12]-[14], an initial visual model is trained on labeled data; afterwards, the model is applied to unlabeled data to self-collect samples, which are then used to refine the model by assuming that their labels correspond to the model predictions. This results in an iterative process that must avoid drifting towards systematic errors or easy samples.
In transfer learning [15], [16], a model is trained to perform a visual task (e.g. image classification) but aiming at reusing it to perform a new task (e.g. object detection) in a way that minimizes the amount of labeled data required to train for the new task (e.g. fine-tuning CNNs across tasks is a basic form of transfer learning).

Fig. 1. We learn a domain invariant feature representation by incorporating a pretext learning task which can automatically create labels from target domain images. The pretext and main task (e.g. object recognition or semantic segmentation) are learned jointly via multi-task learning. Solid lines indicate the forward data flow and dashed lines indicate optional data flow.
In domain adaptation [17]- [20], a model is trained to perform a visual task in a specific domain (e.g. semantic segmentation in synthetic images), however, we need to apply it to perform the same task in a correlated, but significantly different, domain (e.g. semantic segmentation in real-world images); which is done by reusing the previous knowledge (in the form of model or labeled data) for minimizing the labeling effort in the new domain.
Finally, self-supervised learning [21]-[23] focuses on learning visual models without manual labeling; more specifically, relatively simple auxiliary tasks, known as pretext tasks in this context, are created for training a generic visual model in the form of a CNN. The supervision consists in modifying the original visual data (e.g. a set of images) according to known transforms (e.g. image rotations [21]) and training the pretext CNN to predict such transforms; thus, the transforms are the labels/supervision for the pretext task. This pretext CNN is then concatenated with another task-specific CNN, the former acting as a generic feature extractor and the latter leveraging such features to create new ones specific to the main task of interest. Sometimes both CNN blocks are fine-tuned [24], and sometimes the pretext CNN block is frozen and only the task-specific CNN block is fine-tuned [23]. Overall, the idea is that we can have a high number of supervised samples for the pretext task, and this should compensate for a lower number of manually labeled samples for the main task.
Active learning can be naturally combined with transfer learning or domain adaptation [25]. Self-labeling can also be combined with transfer learning or domain adaptation [12]. Self-supervised learning, as usually performed, can be seen as a type of transfer learning (from the pretext task to the main task). What has not been explored, to the best of our knowledge, is how self-supervised learning can support domain adaptation. This is the main focus of this paper, i.e., can we incorporate self-supervision to learn a domain invariant feature representation? The goal of this work is not to propose new self-supervised learning methods, but to investigate how existing self-supervised representation learning methods can be used to address domain adaptation problems. With this aim, we design a multi-task learning method to jointly train the pretext and main tasks (Figure 1). The pretext task acts as a nexus between the source and target domains for learning a domain invariant feature representation for the main task. In this way, we have labels for the main task in the source domain, but we do not require labels for that task in the target domain. In other words, via self-supervised learning, we perform unsupervised domain adaptation.
Accordingly, and using object recognition and semantic segmentation of urban scenes as challenging main-task use cases, the main contributions of this work are three-fold:
• We propose a generic method for domain adaptation with self-supervised visual representation learning.
• Focusing on the image rotation prediction pretext task, we propose several variations and study their domain adaptation performance.
• We propose additional strategies to further boost self-supervised domain adaptation, including prediction layer alignment and batch normalization calibration.

This paper is organized as follows. In Section 2, we review related self-supervised representation learning and domain adaptation methods. In Section 3, we explain the proposed method. In Section 4, we conduct experiments on domain adaptation for object recognition as well as semantic segmentation via our method. Finally, Section 5 summarizes the work and future directions.

RELATED WORK
2.0.0.1 Self-supervised visual representation learning: An extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos is provided in [26]. Recent work on self-supervised representation learning mainly focuses on the design of pretext tasks. The work of [23] gives a comprehensive study of some state-of-the-art methods. A pretext task of predicting the relative location of image patches was first proposed in [27], where the patch ID is the supervision/label. This initial patch-based method has been followed by several variants [22], [28], [29]. Other works incorporate image colorization [30] or image inpainting [31] as pretext tasks. Yet other works focus on automatic ways of creating image samples with corresponding labels; for instance, in [24] the labels are classes derived from unsupervised image clustering, and in [21] the labels are image rotation angles, since four rotated versions are created from each original image. As compared in [23], the rotation prediction based method [21] has shown promising results for learning high-level image representations. The rotation based method is further improved in [32] by decoupling rotation-related and rotation-unrelated features. Therefore, in this work, we employ this pretext task, as well as the location of image patches in line with [27]. In [33], relative depth prediction is used as a self-supervised proxy task and has shown improvements on downstream tasks, including semantic segmentation and car detection; however, it relies on video data in order to obtain the relative depth.
2.0.0.2 Unsupervised domain adaptation: Numerous domain adaptation methods have been proposed for object recognition since [34]. After the pioneering works of [17], [18], semantic segmentation has also attracted increasing interest. Among existing domain adaptation methods, some try to align domains at the input level, including GAN-based methods [35]-[37] and image stylization ones [38]-[40]; some focus on feature-level adaptation [17], [41]-[44]; and others adapt the output space [45]-[48]. According to recent surveys [20], [49], most methods are built on the principle of domain adversarial training [50], with differences in how it is incorporated into the training of the segmentation network. Among the adaptation strategies we use to complement self-supervision, prediction layer alignment is similar to adversarial training for output space alignment.
In [14], iterative self-labeling and fine-tuning with spatial urban-scene location priors are used to perform domain adaptation. In [18], a curriculum learning style is applied, where super-pixels are computed in the source and target domains and their distributions must match as an auxiliary task during semantic segmentation training. The use of such an auxiliary task is similar in spirit to our multi-task learning approach with pretext tasks as a nexus between the source and target domains. However, neither our auxiliary tasks nor our complementary adaptation strategies are restricted to semantic segmentation, and they are far simpler than computing super-pixels. Compared to these works, our method is not specifically designed for semantic segmentation but is generic across computer vision tasks.
In [51], the self-supervised jigsaw puzzle method is used for object recognition domain generalization and adaptation. As we will see in the experimental section, our method outperforms the jigsaw puzzle based method on both object recognition and semantic segmentation tasks. For semantic segmentation, we compare our results to [14], [18], [35], [38]-[42], [47]. The final semantic segmentation accuracy we obtain in the target domain is superior to most of these methods, only behind [14], which is specific to semantic segmentation, and not by a large margin. Moreover, although it is out of the scope of this paper, our method can be complementary to some of the aforementioned ones, such as those based on adapting the input images via GANs.

METHOD
In this section, we first introduce our generic framework of self-supervised domain adaptation. Then, we present the considered pretext tasks. Finally, we introduce domain adaptation steps which complement self-supervision.

Self-supervised domain adaptation
3.1.1 Overview of the framework Taking semantic segmentation as an example of the main task, but without loss of generality, our method is shown in Fig. 1, where E denotes an encoder network (feature extractor) and S a decoder network (specific to the main task), so that E + S is a CNN for semantic segmentation. This CNN is trained end-to-end with source domain labeled samples, {X^s, Y^s}. We denote by P the network added to support the creation of a model for solving the pretext task. This model consists of the CNN E + P, where E is shared with the CNN of the main task. The pretext task training samples, {X^t, Y^t}, are automatically created from the target domain images, so that the training of E + P is also supervised.
The complete domain adaptation method is drawn in Algorithm 1, where we can see how the self-supervised domain adaptation is a joint training of models to perform the pretext and main tasks. During the forward propagation, both source and target domain samples pass through the shared encoder. Afterwards, the losses of the main task, L_seg, and of the pretext task, L_p, are computed; they are back-propagated and their gradients accumulated at the encoder. Because the encoder is trained with both source and target domain samples, it learns domain invariant feature representations. In the testing phase, we feed the target domain images to the encoder and pass the features to the decoder of the main task to obtain the predictions.
Data: Labeled source domain images {X^s, Y^s}, and unlabeled target domain images {X^t}
Result: Model trained for the main task in the target domain
Create samples for the pretext task;
for each training iteration do
    Load target mini-batch {x_i^t, y_i^t};
    Forward pass and compute L_p;
    Back-propagate L_p gradients through P and E;
    Update weights of P;
    Load source mini-batch {x_i^s, y_i^s};
    Forward pass and compute L_seg;
    Back-propagate L_seg gradients through S and E;
    Accumulate gradients from L_p and L_seg at E;
    Update weights of E and S;
end
Algorithm 1: Self-supervised domain adaptation

It is also possible to create pretext task samples from the source domain data, i.e., the dashed lines in Fig. 1. In this case, the pretext model can be trained with both source and target domain pretext task samples. We investigate this in Section 4.
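A minimal PyTorch-style sketch of one training iteration in the spirit of Algorithm 1 (the module, optimizer, and variable names are illustrative, not taken from the authors' released code):

```python
import torch
import torch.nn as nn

def train_step(encoder, seg_head, pretext_head,
               src_batch, tgt_batch, opt_main, opt_pretext, lambda_p=1.0):
    """One iteration of the joint self-supervised adaptation loop."""
    x_t, y_t = tgt_batch          # rotated target crops + rotation labels
    x_s, y_s = src_batch          # source images + segmentation labels

    opt_main.zero_grad()
    opt_pretext.zero_grad()

    # Pretext task: gradients flow through P and the shared encoder E.
    loss_p = nn.functional.cross_entropy(pretext_head(encoder(x_t)), y_t)
    (lambda_p * loss_p).backward()

    # Main task: gradients flow through S and accumulate at E.
    loss_seg = nn.functional.cross_entropy(seg_head(encoder(x_s)), y_s)
    loss_seg.backward()

    opt_pretext.step()   # update P
    opt_main.step()      # update E and S (with accumulated gradients at E)
    return loss_seg.item(), loss_p.item()
```

Note that the encoder receives gradients from both losses before its weights are updated, matching the gradient accumulation step of Algorithm 1.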

Pretext tasks
In this section, we first introduce the image rotation prediction pretext task. Inspired by the image-patch based methods [22], [27]- [29], we also take into account the spatial layout of the image and propose a new pretext task.
3.1.2.1 Image rotation prediction as pretext task: We select image rotation prediction as the pretext task due to its simplicity and its superior visual representation learning performance compared to other proposals [21]. Given a set of $N_t$ training images from the target domain $D_t = \{x_i^t\}_{i=1}^{N_t}$, similar to [21], we define the set of geometric transformations as 2D image rotations by 0, 90, 180 and 270 degrees. We denote the rotation function by $g(x, k)$, which rotates image $x$ by $k \cdot 90$ degrees, $k \in \{0, 1, 2, 3\}$. The geometric transformation prediction model $P$ takes the feature map from $E$ as input and outputs a probability distribution over all possible geometric transformations. The self-supervised training objective that the geometric transformation model must learn to solve is:

$$\min_{\theta_e, \theta_p} \frac{1}{N_t} \sum_{i=1}^{N_t} L_p(x_i^t; \theta_e, \theta_p), \quad (1)$$

where $\theta_e$ and $\theta_p$ are the parameters of the encoder $E$ and the pretext network $P$ respectively, and $L_p$ is the loss function defined as:

$$L_p(x_i^t; \theta_e, \theta_p) = -\frac{1}{4} \sum_{k=0}^{3} \log P_k\big(E(g(x_i^t, k); \theta_e); \theta_p\big), \quad (2)$$

where $P_k(\cdot)$ denotes the predicted probability of rotation $k$. By learning to predict image orientations, the convolutional neural network also implicitly learns to localize salient objects in the images and to recognize their orientations and object types [21]. Such implicitly learned knowledge contains semantic information about the target domain images, which is expected to improve the cross-domain feature representation power of the encoder network. In other words, the pretext task with target domain images helps the encoder to learn a domain invariant feature representation and, thus, helps to achieve domain adaptation.
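The pretext loss $L_p$ averages the prediction loss over the four rotated copies of each image. A sketch in PyTorch (encoder and head names are illustrative):

```python
import torch
import torch.nn.functional as F

def rotation_pretext_loss(encoder, pretext_head, x):
    """Average cross-entropy over the four rotated copies of a batch x.

    encoder maps images to feature maps; pretext_head maps features to a
    4-way rotation prediction (module names are illustrative).
    """
    losses = []
    for k in range(4):
        x_rot = torch.rot90(x, k, dims=(2, 3))       # rotate by k*90 degrees
        logits = pretext_head(encoder(x_rot))        # 4-way prediction
        target = torch.full((x.size(0),), k, dtype=torch.long)
        losses.append(F.cross_entropy(logits, target))
    return sum(losses) / 4.0
```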
The work of [21] uses full images from ImageNet [1]. However, the images from a specific domain are usually biased towards particular structures or patterns, especially at the full-image level. If we train a rotation prediction model with full images, the training process could find a trivial solution and thus fail to learn a domain invariant feature representation. To avoid this problem, we first randomly crop an image patch from the full image and then rotate this patch. In this way, we create more difficult and diverse samples for the pretext task.
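The crop-then-rotate sample creation can be sketched as follows (a minimal NumPy version; the function name and crop size are illustrative):

```python
import random
import numpy as np

def make_rotation_sample(img, crop=128, rng=random):
    """Randomly crop a patch from img and rotate it by k*90 degrees.

    Returns the rotated patch and the rotation index k in {0, 1, 2, 3},
    which serves as the self-supervised label.
    """
    h, w = img.shape[:2]
    top = rng.randrange(h - crop + 1)
    left = rng.randrange(w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    k = rng.randrange(4)                    # rotation label
    return np.rot90(patch, k).copy(), k
```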
3.1.2.2 Spatial-aware rotation prediction as pretext task: Beyond image rotation, we further propose to take the image spatial layout into account to create a more complex pretext task. As depicted in Fig. 2, instead of randomly cropping a patch from the full image, we first split the full image into four regions. From each region, we apply cropping and rotation operations as in the previous pretext task. We call this strategy spatial-aware rotation prediction. The dimension of a label is then extended from 4 (rotation angles) to 16 (spatial locations times rotation angles). This scheme encodes the geometric transform as well as the spatial layout information, which results in a more complex pretext task.
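The 16-way label can be formed as region index times four plus rotation index; a sketch under the assumption of a 2x2 region grid (function and variable names are illustrative):

```python
import random
import numpy as np

def make_spatial_rotation_sample(img, crop=128, rng=random):
    """Spatial-aware rotation pretext sample.

    The image is split into a 2x2 grid; a patch is cropped from one
    region and rotated by k*90 degrees. The label encodes both the
    region (0..3) and the rotation (0..3), giving 4 * 4 = 16 classes.
    """
    h, w = img.shape[:2]
    region = rng.randrange(4)              # 0:TL, 1:TR, 2:BL, 3:BR
    r0 = (region // 2) * (h // 2)          # region top
    c0 = (region % 2) * (w // 2)           # region left
    top = r0 + rng.randrange(h // 2 - crop + 1)
    left = c0 + rng.randrange(w // 2 - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    k = rng.randrange(4)
    label = region * 4 + k                 # 16-way label
    return np.rot90(patch, k).copy(), label
```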

Objective function for domain adaptation
Given a set of $N_s$ labeled training images from the source domain $D_s = \{x_i^s, y_i^s\}_{i=1}^{N_s}$, the segmentation network takes as input the feature maps $E(x_i^s)$ and outputs the segmentation predictions

$$\hat{y}_i^s = S(E(x_i^s; \theta_e); \theta_s) \in \mathbb{R}^{H \times W \times C},$$

where $C$ is the number of semantic categories, $H$ and $W$ are the height and width of the output respectively, and $\theta_e$ and $\theta_s$ convey the parameters of $E$ and $S$, respectively. The semantic segmentation training objective that we need to solve for $E$ and $S$ is:

$$\min_{\theta_e, \theta_s} \frac{1}{N_s} \sum_{i=1}^{N_s} L_{seg}(x_i^s, y_i^s; \theta_e, \theta_s), \quad (3)$$

where the segmentation loss is the cross-entropy loss, defined as:

$$L_{seg}(x_i^s, y_i^s; \theta_e, \theta_s) = -\sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} y_i^{s(h,w,c)} \log \hat{y}_i^{s(h,w,c)}. \quad (4)$$

With Eq. (1) and Eq. (3), the objective function that self-supervised domain adaptation must solve is:

$$\min_{\theta_e, \theta_p, \theta_s} \frac{1}{N_s} \sum_{i=1}^{N_s} L_{seg}(x_i^s, y_i^s; \theta_e, \theta_s) + \lambda_p \frac{1}{N_t} \sum_{i=1}^{N_t} L_p(x_i^t; \theta_e, \theta_p), \quad (5)$$

where $\lambda_p$ is the weight balancing the two losses. In this work, we simply set $\lambda_p = 1$ for our experiments. The training process follows Algorithm 1.

Complementary adaptation steps
In this section, we introduce two different strategies to complement self-supervised domain adaptation: adversarial training for prediction layer alignment, and batch normalization calibration.

Prediction layer alignment
The proposed pretext task learning is able to perform domain adaptation at the feature level; however, the predicted semantic labels may still not be well aligned. There has been some previous work tackling this problem [46], [47]. In this work, we also consider aligning the prediction layer to improve the domain adaptation performance. The main idea is illustrated in Fig. 3. For semantic segmentation, we simplify the decoder to a single up-sampling layer; in this way, the last layer of the encoder corresponds to the prediction layer. By placing a domain discriminator after the prediction layer, the commonly used domain adversarial training can be employed. We denote the discriminator by $D$ and its parameters by $\theta_d$. Given an input image $x_i$, the discriminator takes as input the feature maps $E(x_i)$ and performs a binary classification to distinguish whether the feature map comes from a source image or a target one. The training of $D$ is a standard supervised training, which minimizes the following 2D cross-entropy loss:

$$L_d(x_i; \theta_d) = -\sum_{h,w} \Big[ z \log D(E(x_i))^{(h,w)} + (1 - z) \log\big(1 - D(E(x_i))^{(h,w)}\big) \Big], \quad (6)$$

where $h, w$ index the output layer, $z = 0$ indicates that the sample is drawn from the target domain, and $z = 1$ that it is drawn from the source domain.
In order to learn a domain invariant feature representation, we want the encoder to fool the domain discriminator $D$, which is equivalent to minimizing the following adversarial loss for target samples:

$$L_{adv}(x_i^t; \theta_e) = -\sum_{h,w} \log D(E(x_i^t))^{(h,w)}. \quad (7)$$

$L_{adv}$ encourages fooling $D$ by optimizing $\theta_e$, while $L_d$ encourages improving the classification accuracy of $D$ by optimizing $\theta_d$. The optimization of Eq. (6) and Eq. (7) is essentially a domain adversarial training. Combining the self-supervised domain adaptation objective function of Eq. (5), the overall optimization problem that we solve is:

$$\min_{\theta_e, \theta_p, \theta_s, \theta_d} L_{seg} + \lambda_p L_p + \lambda_{adv} L_{adv} + \lambda_d L_d, \quad (8)$$

where $\lambda_{adv}$ and $\lambda_d$ are the weights balancing the corresponding losses. These hyper-parameters are tuned on the validation set and then fixed for all experiments; in this work, we set $\lambda_{adv} = 0.01$ and $\lambda_d = 1.0$. We show how this prediction layer alignment improves the self-supervised domain adaptation in Section 4.3.4.
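The two adversarial objectives $L_d$ and $L_{adv}$ can be sketched as follows in PyTorch. The discriminator architecture and function names are illustrative, not the authors' exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_discriminator(num_classes):
    """Fully convolutional domain discriminator over the prediction layer
    (illustrative architecture)."""
    return nn.Sequential(
        nn.Conv2d(num_classes, 64, 3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 3, padding=1))  # per-location source/target logit

def discriminator_loss(disc, pred_src, pred_tgt):
    """L_d: classify source maps as 1 and target maps as 0.

    detach() keeps discriminator updates from flowing into the encoder.
    """
    src_logits = disc(pred_src.detach())
    tgt_logits = disc(pred_tgt.detach())
    return (F.binary_cross_entropy_with_logits(src_logits, torch.ones_like(src_logits))
            + F.binary_cross_entropy_with_logits(tgt_logits, torch.zeros_like(tgt_logits)))

def adversarial_loss(disc, pred_tgt):
    """L_adv: the encoder tries to make target predictions look like source."""
    tgt_logits = disc(pred_tgt)
    return F.binary_cross_entropy_with_logits(tgt_logits, torch.ones_like(tgt_logits))
```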

Batch normalization calibration
Batch normalization (BN) was originally designed to reduce the internal covariate shift and speed up the training of deep neural networks. Given a mini-batch $B = \{z_{1 \ldots m}\}$ as input, a BN layer first calculates the mean and variance:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} z_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu_B)^2,$$

and normalizes each input as

$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$

where $\epsilon$ is a constant added to the mini-batch variance for numerical stability. The normalized values are then scaled and shifted by $\gamma \hat{z}_i + \beta$ to produce the output, where $\gamma$ and $\beta$ are learnable parameters.
For a trained source domain model, $\mu_B$ and $\sigma_B^2$ are statistics from source domain images, which may cause a domain shift when the model is applied to target domain images. Although both source and target data are passed through the BN layers during domain adaptation training, the statistics mixed from both domains can still be ambiguous for the target domain model. What we propose in this work is to re-calibrate these statistics to reduce the domain shift. Given a pretrained network, we keep all the learnable parameters fixed and feed forward the target domain training images. During this forward propagation, we re-calculate the mean and variance of each BN layer.
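In PyTorch, this calibration can be sketched by resetting the running statistics and forwarding target images with only the BN layers in training mode (the function name and loop are illustrative; a real loader would yield image batches):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def calibrate_bn(model, target_loader):
    """Re-estimate BN running statistics on target-domain images.

    All learnable parameters stay fixed; only the running mean/variance
    of each BatchNorm layer is refreshed via its moving average.
    """
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()  # restart mean/var estimation
            m.train()                # BN updates running stats on forward
    for x in target_loader:
        model(x)                     # forward pass only, no gradients
    model.eval()
    return model
```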
Our BN calibration is similar to the AdaBN method [52]. However, AdaBN adopts an online algorithm to estimate the mean and variance, while we simply use the common moving-average mean and variance available in existing deep learning frameworks. Moreover, AdaBN is applied at the inference stage, i.e. to the testing images, while we use BN calibration as a post-training process with target domain training images.

EXPERIMENTS AND RESULTS
In this section, we conduct experiments to validate the proposed domain adaptation method for both object recognition and semantic segmentation.

Implementation Details
We implement the proposed method using the PyTorch framework on a single GTX 1080 Ti GPU with 11 GB of memory. For object recognition, we use the code base of JiGen [51]^1 with the default hyper-parameters and the ResNet-18 and ResNet-50 architectures. The deep networks used in our semantic segmentation experiments are the ResNet-101 based DeepLab-v2 [53] and dilated residual networks (DRN) [54]. Specifically, we take the commonly used DRN-26 architecture in order to compare to other state-of-the-art methods. Both networks are initialized with ImageNet [1] pretrained weights.

Domain adaptation for object recognition
We first evaluate the proposed domain adaptation method against state-of-the-art methods on the Office [34] dataset. The evaluated strategies are described in Table 1. The results on the Office dataset based on ResNet-50 are reported in Table 2. Rot achieves the best average accuracy among all evaluated strategies, reaching the accuracies of state-of-the-art methods. MixRot and SPRot obtain similar accuracies, very close to Rot; MixRot even outperforms Rot on the D −→ W task. Because Adv has much lower accuracy than Rot, Rot+Adv does not improve over Rot. Rot+Adv+BN shows consistent improvements over Rot+Adv, but still cannot outperform the simplest method, Rot.
To better analyze the domain adaptation performance of Rot, we visualize the learned deep features by t-SNE [63] in Fig. 4 on the task A −→ W. Fig. 4 (a) and (b) show that categories are better discriminated by Rot than by the non-adapted ResNet-50 model. Fig. 4 (c) and (d) show that the source and target domains are aligned much better by Rot than by ResNet-50.
1. https://github.com/fmcarlucci/JigenDG
For object recognition, we also evaluate on the multi-source domain adaptation dataset PACS [64], which has 7 object categories and 4 domains (Photo, Art Paintings, Cartoon and Sketches). Fig. 5 shows sample images from the PACS dataset. We follow the same experimental settings as [51] and train our model considering three domains as source datasets and the remaining one as target. Following [51], we also compare to the domain discovery method DDiscovery [65] and to Dial [66]. We set three different random seeds and run each experiment three times; the final result is the average over the three repetitions. For a fair comparison, we run the jigsaw puzzle method with the same random seeds, denoted as Ours(Jigsaw). The results are shown in Table 3. We obtain similar conclusions to the Office dataset experiment. Our image rotation based self-supervised domain adaptation Rot outperforms all baselines. MixRot outperforms Rot on adaptation to art painting and photo, and SPRot outperforms Rot on adaptation to cartoon and photo, but their overall performance, i.e., average accuracy, is still lower than Rot's, showing that Rot is the most robust method. Again, due to the relatively low performance of Adv, Rot+Adv cannot further improve Rot. As in the previous experiment, Rot+Adv+BN consistently improves Rot+Adv. We also visualize the learned features by t-SNE in Fig. 6. From (a) and (b), we can see that Rot discriminates categories much better than the non-adapted method SRC. From (c) and (d), Rot shows clearly better domain alignment than SRC. The t-SNE visualization reveals the effectiveness of Rot for domain adaptation.

Domain adaptation for semantic segmentation
For semantic segmentation, we adapt semantic segmentation models from the source domain of synthetic images to the target domain of real-world images. For the synthetic datasets, we use SYNTHIA [67] and GTA5 [68], and for the target domain, we use the Cityscapes dataset [7]. The GTA5 [68] dataset is rendered from the Grand Theft Auto V video game. It consists of 24996 images with resolution of 1914 × 1052 and has 19 classes compatible with Cityscapes dataset. We use the full set of GTA5 as our source domain training set. For SYNTHIA dataset, we use the SYNTHIA-RAND-CITYSCAPES set [67] as the source domain training set, which contains 9400 images. We evaluate with the 16 common classes for SYNTHIA to Cityscapes domain adaptation. The training set of Cityscapes has 2975 images which are used as unlabeled target domain training samples. The validation set of Cityscapes has 500 samples which are used as our testing set.
We conduct ablation studies to understand the impact of each component of our self-supervised domain adaptation. If not otherwise specified, all the experiments in this section use ResNet-101 as backbone network and the domain adaptation is from GTA5 to Cityscapes.

Pretext task learning strategies
Table 1. Description of the evaluated methods.
Ours(Jigsaw): Self-supervised domain adaptation with the pretext task of solving jigsaw puzzles [51].
Ours(Rot): Self-supervised domain adaptation with the image rotation prediction pretext task.
Ours(MixRot): Same as Rot, but mixing in source domain samples during pretext task learning.
Ours(SPRot): Self-supervised domain adaptation with the spatial-aware rotation prediction pretext task.
Ours(Adv): Adversarial domain adaptation with prediction layer alignment.
Ours(Rot+Adv): Rot with Adv as a complementary strategy.
Ours(Rot+Adv+BN): Rot with Adv and batch normalization calibration as complementary strategies.

The first two rows in Table 4 show their domain adaptation results. As can be seen from Table 4, mixing source domain training data into the pretext learning (MixRot) yields even inferior results, which may be because the source samples dominate the mixed samples (24996 vs. 2975), making the model more source-domain oriented and reducing its domain-invariant representation power. Next, we would like to know whether the proposed spatial-aware rotation prediction pretext task is better than the simple rotation prediction strategy, i.e., the Rot method. Table 4 displays the results of the spatial-aware rotation prediction pretext task as SPRot. It turns out that the more difficult pretext task leads to worse domain adaptation performance. In our experience, the pretext task of SPRot has more difficulty converging than Rot, which may result in a failure to learn good feature representations. Therefore, how to design a proper pretext task for domain adaptation still needs more exploration.
We also compare our method Rot to the jigsaw puzzle based self-supervision [51]. The results are shown in Table 5, where SYN2CS denotes SYNTHIA to Cityscapes domain adaptation and GTA2CS denotes GTA5 to Cityscapes. Rot outperforms the jigsaw puzzle for both SYN2CS and GTA2CS. Especially for GTA2CS, the jigsaw puzzle shows a very limited gain (1.2 percentage points) while Rot still achieves 6.2 percentage points.

Input image size for pretext task learning
As the images from the Cityscapes dataset have a large resolution (e.g., 1024 × 2048), we are interested in which cropping size is best for self-supervised learning. In Table 4, we compare three different cropping sizes. The smallest cropping size (128 × 128) shows the worst performance, due to a field of view too small to learn good representations. Comparing the remaining two cropping sizes, we see that the larger one (400 × 400) does not further improve the performance. In fact, when we use the full image as input, the pretext learning easily gets stuck in a trivial solution, i.e. 100% prediction accuracy; as a result, the final model fails to perform domain adaptation. Thus, we believe that a proper cropping size is important to control the difficulty of learning pretext tasks.

Feature extraction layer
By default, the pretext task takes as input the features extracted from the last layer of the encoder. However, whether the last layer is best for domain adaptation is unclear. In this section, we train self-supervised domain adaptation models with different feature extraction layers, mainly comparing feature extraction from the middle and from the end of the encoder. Table 4 shows the corresponding results, where Middle represents feature extraction from a middle layer and Final uses features from the final layer of the encoder. Since, in this case, the decoder of the segmentation network is simply an up-sampling layer without any learnable parameters, the Final layer is actually the prediction layer of the segmentation network. As we can see from the results, the Middle model shows slightly better results, and we think the pretext task learning is not very sensitive to the choice of feature extraction layer.

Complementary adaptation strategies
Table 6 shows the results with different complementary strategies. The source domain model is denoted by SRC and the model trained with target domain samples is denoted by TAR; they represent the lower and upper bound of the accuracy, respectively. Rot is our baseline method. +Adv adds prediction layer alignment (Section 3.2.1), which improves Rot by about 1 percentage point. Fig. 7 shows the results on multiple datasets using multiple networks. BN calibration alone achieves surprisingly good results, and the best domain adaptation gain even reaches 6.8 percentage points. However, when combined with Rot or Rot+Adv, it only improves the results by 1 or 2 percentage points. This might be because Rot and Rot+Adv have already learned a domain invariant representation that effectively reduces the covariate shift, so BN calibration cannot contribute much more to the adapted model. The reason that Adv gives a consistent rise to the base method is that Adv further aligns the predicted label distributions, which is a more complementary adaptation to Rot than that provided by BN.

Qualitative analysis
Following [69], we also visualize the learned feature representations by t-SNE [63] in Fig. 8. For the non-adapted features, the classes are not well discriminated. They are better discriminated after the Rot adaptation, and Rot+Adv+BN discriminates the classes best. In Fig. 9, we illustrate some qualitative results of our models. Without domain adaptation, the source domain model SRC produces noisy segmentations. Rot shows significant improvements over SRC in terms of segmentation quality. The results of Rot+Adv+BN are less noisy and more accurate in their details than those of Rot.

Comparison to the state-of-the-art
Lastly, we compare our method to some recently published state-of-the-art works which use architectures similar to ours. The results are shown in Table 7. The compared methods cover a large variety of domain adaptation mechanisms, including input/feature/output level alignment methods as well as curriculum and self-labeling based methods. Some of these methods are also surveyed in [49]; we refer the reader to [49] for more details. The results in Table 7 show that our adapted models (Adapt) achieve accuracies comparable to the state-of-the-art. It is worth noting that some of these state-of-the-art methods obtain worse results than we obtain when training with the source data alone (SRC columns), so their relative gain is higher. On the other hand, with this work we aim at encouraging the use of pretext tasks for domain adaptation of semantic segmentation models, which, as mentioned before, can be complementary to other ideas. We also find that a deeper network (ResNet-101) achieves a better domain adaptation gain than a shallower one (DRN-26).

Discussion
The experimental results reveal several insightful observations. (1) Current deep learning methods learn good feature representations for a single domain but cannot remove the cross-domain discrepancy. (2) Self-supervised representation learning can help to reduce domain shift, and even the simplest image rotation prediction pretext task, Rot, can achieve performance comparable to state-of-the-art domain adaptation methods. (3) Rot turns out to be more robust than other alternatives, e.g., MixRot, SPRot and Jigsaw. (4) Rot is complementary to existing domain adaptation methods, e.g., adversarial based and batch normalization based ones.
The reasons that self-supervised learning helps domain adaptation are as follows: (1) Self-supervised learning involves source and target domain samples in a common supervised learning process, which helps to learn cross-domain feature representations; this can be verified from the feature visualizations of domain alignment, e.g., Fig. 4 and Fig. 6. (2) Because the self-supervised learning and the main task are trained jointly in a multi-task learning process, the model of the main task also learns from the cross-domain feature representations. As a result, the final model achieves domain adaptation in the target domain.
Based on our experiments, we also have the following findings about how to design a good self-supervised pretext task for domain adaptation: (1) As is common practice in many deep learning tasks, a deeper architecture achieves better self-supervised domain adaptation performance than a shallower one. (2) The better a pretext task performs at representation learning, the better it performs at domain adaptation, e.g., Rot vs. Jigsaw. (3) A more complex pretext task does not necessarily lead to better domain adaptation performance, e.g., SPRot is outperformed by Rot. (4) Adv and Adv+BN can further improve Rot if the performance of Adv is not worse than that of Rot. In this work, we only investigated several simple self-supervised learning strategies; we believe there is still a large space to explore for better self-supervised domain adaptation.

CONCLUSION
In this work, we have explored self-supervised learning for domain adaptation. We have shown that a simple image rotation prediction (pretext task) self-supervision can achieve state-of-the-art domain adaptation performance. We have studied several pretext tasks as well as complementary domain adaptation strategies. Taking object recognition and semantic segmentation of urban scenes as relevant use cases, we have performed an ablative analysis of the different components included in our overall domain adaptation procedure. As future work, we would like to investigate more pretext tasks and to apply our method to other relevant vision tasks.