Unsupervised Domain Adaptation With Debiased Contrastive Learning and Support-Set Guided Pseudolabeling for Remote Sensing Images

The variability in altitudes, geographical conditions, and weather across datasets degrades state-of-the-art (SOTA) deep neural network object detection performance. Unsupervised and semisupervised domain adaptation (DA) methods are effective solutions for bridging the gap between two different dataset distributions. The SOTA pseudolabeling process is susceptible to background noise, hindering optimal performance on target datasets. Existing contrastive DA methods overlook the bias introduced by false negative (FN) target samples, which misleads the entire learning process. This article proposes support-guided debiased contrastive learning for DA to properly label the unlabeled target dataset and remove the bias in target detection. We introduce: 1) a support-set curation approach to generate high-quality pseudolabels from the target dataset proposals; 2) a reduced distribution gap across datasets via domain alignment on local, global, and instance-aware features for remote sensing datasets; and 3) a novel debiased contrastive loss function that makes the model more robust to the variable appearance of a particular class across images and domains. The proposed debiased contrastive learning pivots on class probabilities to address the challenge of FNs in the unsupervised framework. Our model outperforms the compared SOTA models with minimum gains of +3.9%, +3.2%, +12.7%, and +2.1% mean average precision on the DIOR, DOTA, Visdrone, and UAVDT datasets, respectively.

We group the data variability along the following dimensions w.r.t. object localization and identification tasks, with two related to video content capture variability and two related to variability of the objects in the video:
1. Lighting conditions significantly change the captured video footage, even during a single drone flight. The changes can be due to the time of day, season, weather, and cloud distribution. Figure 2(a) shows the variations due to image capture time and lighting conditions; the pixel intensity distribution varies significantly.
2. Variation in object size is great within the same dataset due to the different areas captured (e.g., urban vs. rural). The objects in a frame can vary from under 0.01% to almost 70% of the entire frame. The variation is even higher between different datasets, as the footage is captured over multiple dates, terrains, and missions. Figure 2(b) (left) contains well-defined objects, while Figure 2(b) (right) contains many small (players and cars), densely packed objects.
3. Geographical variance of the terrestrial terrain captured in imagery from such high altitudes poses a critical challenge for object localization. Figure 2(c) illustrates the large geographical variance that can exist between two datasets.
4. Object distribution variations in images make it challenging to separate nearby objects and eliminate overlapping objects while performing non-maximum suppression (NMS).
5. Object labeling in aerial datasets is challenging, as it is hard to distinguish correct labels among small and densely packed objects [7]. Today, only a few aerial datasets exist that cover real-scenario object class diversity with a sufficient number of training examples.
Robust object localization in unseen datasets using deep neural networks requires large and diverse annotated training data. Generalizing a model trained on one aerial image dataset to another aerial dataset is not an efficient approach due to the high domain shift across datasets. Unsupervised domain adaptation methods offer a way to effectively transfer the knowledge gained from annotated data and trained models in the source domain to the target domain and to accelerate pseudo-labeling of objects in the target domain using the labels already available in the source domain.

Fig. 2: High-variability remote sensing frames: (a) lighting condition variations, (b) variations in object shape and scale, and (c) high variability due to geographical and weather changes.

II. RELATED WORK
Computer vision research tasks have been solved using the full potential of deep neural networks for consumer applications. Recent advances in the field show that the object detection task can be successfully solved for the drone-captured Visdrone dataset [8] and the COCO consumer image benchmark dataset [9]. The key to the success of DNNs is the automatic feature extraction strategy, which is more efficient at extracting semantic details and local features. There have been numerous works making object detection better and more efficient. Object detection architectures can be divided into two branches: 1) one-stage detectors and 2) two-stage detectors. One-stage detectors [10], [11], [12], [8] are by nature faster and more lightweight due to fewer learnable parameters and FLOPS. For generating region proposals, one-stage detectors use anchors of different scales and aspect ratios. On the other hand, two-stage detectors use a separate module called the Region Proposal Network (RPN), which is responsible for generating strong region candidates for object detection.
Object Detection in Remote Sensing Images: Shi et al. propose an anchor-free detector called the Centerness-Aware Network (CANet), which captures the symmetrical shape of objects in remote sensing videos [13]. Biswas and Tešić suggest a strong custom backbone and an image difficulty scoring technique [14] to help detect small and complex objects. Zhang et al. find that context-based feature extraction is more effective for detecting complex objects and scenes in overhead imagery [15]. The Global Context-Weaving Network incorporates a global context aggregation module and a feature refinement module [16], and transformer-based CNN encoders are used for better feature extraction [17]. Qingyun et al. perform extensive image augmentation to increase the number of samples in the minor classes. Zhu et al. modify the Darknet53 backbone with Cross Stage Partial DenseNet and add a transformer head in the detection layer, which attains state-of-the-art results on overhead drone images [8]. Overall, overhead video frame images require special care in anchor design for one-stage detectors, and a good RPN should be chosen in two-stage detectors to capture every small object from different levels of features.
Unsupervised Domain Adaptation: Training data for overhead images can differ significantly from the target domain in visual characteristics. For a labeled source dataset and an unlabeled target dataset, unsupervised domain adaptation methods generalize the model by aligning source and target [18]. Cheng adjusts the decision boundary, which is biased toward the source domain, for the target data and adds adversarial training in conjunction with image-to-image translation techniques [19]. Xiong et al. rely on source-free feature alignment at the image and instance levels to tackle the domain shift arising at both levels [20]. Ma et al. [21] minimize the domain discrepancy between source and target using a progressive domain mixup technique. Xu et al. introduce a semantic-aware mixup (SAM) for domain generalization, where whether to perform a mixup depends on the semantic and domain information [22]. Mattolin et al. implement confidence-based mixing of source and target domain images, where the confidence of an instance proposal is calculated from the objectness score and the bounding box uncertainty score of each instance proposal in the image [23]. A novel SemantIc-complete Graph MAtching (SIGMA) framework [24] was proposed for the domain adaptation task, which completes mismatched semantics and reformulates adaptation as graph matching. Its Graph-embedded Semantic Completion (GSC) module addresses mismatched semantics by producing hallucination graph nodes within the absent categories. However, the above methods do not handle the imbalanced dataset problem and the high domain gap present in remote sensing images.
Contrastive Learning for Domain Adaptation: It is hard to discriminate object classes in high-variability remote sensing images. Contrastive learning is a good fit here, as it contrasts samples against each other to learn commonalities

A. Contributions
In this paper, we propose a novel framework to address the high variability of remote sensing images for the object detection and labeling task in previously unseen datasets. We introduce debiased contrastive learning and create an object detection benchmark for remote sensing datasets. We show that it is very important to produce domain-invariant features while maintaining class-variant decision boundaries in the feature space. We employ the idea of N positive samples in domain adaptation, as it has proven very successful [30] in other representation tasks. We also carefully filter out the false negative examples that can disturb the learning process and result in poor performance. The methodology is outlined in Section III, and the novelties of the proposed method are:
1) The use of instance-level features from the source and target datasets for instance-level domain adaptation.
2) A novel support-guided progressive pseudo-labeling method for generating target domain instance labels.
3) Multiple positive samples in contrastive learning for domain adaptation and object detection tasks.
4) Debiased contrastive learning to reduce false negatives (FNs) when performing domain adaptation on imbalanced datasets.
We evaluate the proposed framework using the latest cross-domain detection benchmarks over four different high-altitude (DIOR and DOTA2.0) and low-altitude (Visdrone and UAVDT) remote sensing imagery datasets in Section IV and summarize the findings in Section V.

III. METHODOLOGY
Our baseline detection architecture is based on our previous work [29]. We use this architecture due to its efficient feature extraction strategy and its saliency-weighted custom focal loss function for remote sensing images. In our previous work, we used saliency information from each image to calculate the difficulty of each image: we tracked the number of neuron activations and the number of objects per image to derive the final saliency score. Based on this saliency/difficulty score, the loss function assigns a larger penalty to difficult images and a smaller one to easy images. The calculation details are presented in the previous paper [29].
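The difficulty-weighted penalty idea can be sketched as follows. This is a minimal illustration, not the exact formulation from [29]: the per-image `difficulty` score and the `(1 + difficulty)` weighting are assumptions standing in for the saliency computation described there.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Standard focal loss for one prediction with true-class probability p_t."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def saliency_weighted_focal_loss(p_t, difficulty, gamma=2.0, alpha=0.25):
    """Hypothetical saliency weighting: scale the focal loss by (1 + difficulty),
    so images scored as difficult (difficulty near 1) are penalized more
    heavily than easy ones (difficulty near 0)."""
    return (1.0 + difficulty) * focal_loss(p_t, gamma, alpha)

# The same prediction costs more on a difficult image than on an easy one.
easy = saliency_weighted_focal_loss(p_t=0.9, difficulty=0.1)
hard = saliency_weighted_focal_loss(p_t=0.9, difficulty=0.9)
```

Any monotone weighting of the base loss by the saliency score would realize the same intent; the linear form above is the simplest choice.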
Contrastive learning evaluates pair-to-pair relationships by measuring the similarities between different sample pairs, such as query-positive or query-negative. Here, the query is the subject feature, positive samples are augmented features similar to the subject, and negative samples are randomly selected features dissimilar to the subject feature. Image-level domain adaptation is a strong feature alignment that comes at the sacrifice of instance-level discriminability, as illustrated in Figure 4 (middle). Some categories are not equally transferable under strong feature alignment, particularly in cross-domain object detection tasks for RS that involve complex combinations of objects and their sizes, varied geographic conditions, and backgrounds that can dominate. Strong feature alignment ensures the image features from both the source and target datasets are domain invariant by overlapping the two distributions, as shown in Figure 4. However, our goal is to perform alignment at both the image and instance levels, as shown at the right end of Figure 4. We extract instance-level features from the RPN and feed them into our two new domain adaptation blocks. Where most previous works use traditional pseudo-labeling or random labeling such as one-vs-all for the target dataset, we use the advanced clustering technique K-means++ [31] for generating target labels due to its proven performance [32] on high-dimensional data. Moreover, instead of using a single example as the positive sample, we propose to use N positive samples for contrastive learning. Previous work [30] observed that using more than one positive case increases performance significantly. Finally, we perform progressive debiasing in our custom contrastive loss to remove the false negatives (FNs) from the target negative samples.

A. Unsupervised Domain Adaptation
The idea of DA comes from the scarcity of available annotated datasets and the different factors that introduce domain gaps among datasets. Remote sensing images captured in different parts of the world exhibit different geographical variances and background difficulties, as illustrated in Figure 2(c). Camera orientations, weather conditions, and image capture challenges degrade the performance of an object detection model trained on a different dataset. Unsupervised domain adaptation minimizes the domain gap between two datasets, called the source and target datasets. It is assumed that we have annotations for only the source dataset and that a large domain gap exists between them. The goal is to generate domain-invariant features at different levels of image features and perform better on unseen/target datasets.
In this paper, we perform unsupervised domain adaptation with a contrastive learning technique to align domains at the local, global, and instance levels. We also demonstrate the performance gain from our proposed debiased contrastive loss in the learning phase. We denote the source dataset as S and the target dataset as T. We use the CycleGAN network to produce synthesized images (see input images in Figure 3) from source to target and vice versa. The synthesized images from source to target are denoted as S′, where the object formation is the same as in the source image, but the pixel color emulates the target dataset. On the other hand, T′ denotes the target-to-source conversion, where object formations are those of the target and pixel color follows the source domain. Domain adaptation with contrastive learning is performed bidirectionally between (S, T′) and (T, S′) for better transferability and to minimize the domain discrepancies between the two datasets. Considering (S, T′) and (T, S′) as the source and target domain pairs, we take local features from the earlier stage of the backbone, representing pixel-level and texture information, and global features from the later part of the backbone, which represent a more abstract version of objects. The authors performed only local-global domain adaptation in the baseline paper [29]. However, we take it further to instance-level adaptation with pseudo-labeling in the target dataset.

Fig. 5: Contrastive learning with multiple positive cases and false negative filtering. Here, green connections denote higher similarity, and red connections denote lower similarity with the query case.

B. Debiased Contrastive Learning
Contrastive learning is a process of matching different distributions based on Query (Q) and Key (K) embeddings [33], [34]. The value of the contrastive loss function is lower when there is high similarity between the query (Q) and positive key (K+) pair and low similarity between the query (Q) and negative key (K−) pairs. Contrastive learning performs domain alignment by keeping similar points closer and different points distant, as illustrated in Figure 4. The most used formula for contrastive learning is outlined in Equation 1, where τ is a hyper-parameter known as temperature that penalizes the calculated similarities [35], [36].
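In the standard temperature-scaled form consistent with the symbols above (this is the common InfoNCE formulation; the authors' exact Equation 1 may differ in normalization), the loss reads:

```latex
\mathrm{CL} = -\log
\frac{\exp\left(\mathrm{sim}(Q, K^{+})/\tau\right)}
{\exp\left(\mathrm{sim}(Q, K^{+})/\tau\right)
 + \sum_{j=1}^{N}\exp\left(\mathrm{sim}(Q, K^{-}_{j})/\tau\right)}
\tag{1}
```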
The similarity can be calculated using cosine, Euclidean, or Wasserstein distance functions. The cosine similarity score is used in the experiments and is calculated for two features x and y as sim(x, y) = x^T y/(||x|| · ||y||). We calculate the query similarity CL in Equation 1 as a normalized sum of the similarity of the query vector Q to N negative samples.
In the baseline paper [29], the authors used Equation 1 for local and global domain adaptation, where only a single augmented image was used as the positive case. However, earlier research [30] shows that including more than one positive case in contrastive learning can better generalize the feature representation. Based on this idea, we modify the loss function in Equation 1 as follows. In Equation 2, M is the number of augmented positive samples for the query. We perform a cross product between the query and the positive cases, Q(1, size) × K+(M, size)′ = Sim(1, M), which gives a column vector with a dimension equal to the number of positive cases (M). Then we take the average of all the logits and compute a single scalar value as the final similarity score. Section IV shows that adding more than one positive case significantly improves performance across different datasets.
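The multi-positive step described for Equation 2 can be sketched as follows, assuming cosine similarity and a temperature τ: the M query-positive similarities are averaged into one scalar, which then plays the role of the positive logit in the usual contrastive form.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine similarity sim(x, y) = x^T y / (||x|| * ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def multi_positive_cl(query, positives, negatives, tau=0.07):
    """Contrastive loss with M positives: the M query-positive similarities
    are averaged into a single scalar (the Sim(1, M) -> scalar step), which
    then replaces the single positive logit in the standard loss."""
    s_pos = float(np.mean([cosine_sim(query, p) for p in positives]))
    num = np.exp(s_pos / tau)
    den = num + sum(np.exp(cosine_sim(query, n) / tau) for n in negatives)
    return float(-np.log(num / den))
```

The loss decreases as the positives align with the query and increases as any negative does, matching the behavior described for Equation 1.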
Another challenge for contrastive learning is imbalance. Table I shows that the real datasets are highly imbalanced. As samples for contrastive learning are selected randomly, we cannot control which class instances are picked in a mini-batch. This raises the chance of getting false negatives (FNs) among the negative samples, as illustrated in Figure 5. Earlier domain adaptation methods for consumer datasets do not deal with this problem because consumer datasets are usually nearly balanced. However, remote sensing datasets are often dominated by a few significant classes, which requires extra effort to attain optimal results. The number of FNs increases as we increase the number of negative samples in a mini-batch. In this light, we propose to filter out negative samples with high similarity scores to the query sample. In Figure 5, we see that 3 out of 4 images have a similarity score below 0.2, while one image is highly similar to the query image. To solve this issue, we first reject any FN case that matches the query above 70% similarity and replace its value with the average score of the remaining negatives in the mini-batch for better consistency and stable learning. Equation 3 outlines the Debiased Contrastive Learning (DCL) formula used throughout all experiments.
Here, DK−_j is calculated using the formula below:
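The debiasing step can be sketched as follows, assuming cosine similarity and the 70% threshold stated above: a negative whose similarity to the query exceeds the threshold is treated as a likely false negative, and its score is replaced with the average score of the remaining negatives in the mini-batch.

```python
import numpy as np

def cosine_sim(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def debiased_negative_scores(query, negatives, fn_threshold=0.7):
    """Sketch of the debiased keys DK^-_j: a negative whose similarity to the
    query exceeds the threshold is treated as a likely false negative, and its
    score is replaced with the average score of the remaining negatives."""
    scores = np.array([cosine_sim(query, n) for n in negatives])
    keep = scores <= fn_threshold
    if keep.any() and not keep.all():
        scores[~keep] = scores[keep].mean()  # replace suspected FN scores
    return scores

q = np.array([1.0, 0.0])
negs = [np.array([0.0, 1.0]),    # dissimilar: a true negative
        np.array([0.2, 1.0]),    # dissimilar: a true negative
        np.array([0.95, 0.1])]   # ~0.99 similarity: a likely false negative
scores = debiased_negative_scores(q, negs)
```

The debiased scores then enter the denominator of the contrastive loss in place of the raw query-negative similarities.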

C. Support-Set Guided Progressive Pseudo Labeling
As shown in Figure 3, we first take the feature vectors for all instances in a mini-batch from the RPN module to perform instance-level adaptation. We have ground truth (GT) for the source dataset region proposals, and we can use those GT class ids to separate positive and negative cases in contrastive learning. However, we do not have any GT for the target dataset, so we must generate labels for the target proposals to guide contrastive learning. During the early training phase, the RPN region proposal vectors for target datasets are prone to background noise, and many background scenes are mistaken for region proposals. This introduces unwanted noise into the system, so we introduce a step in the process that reduces the number of RPN region proposal false positives.

TABLE I: Class distributions of the DIOR [6], DOTA2.0 [37], [38], Visdrone [39], and UAVDT [40] datasets over different categories.
First, we take R samples from each of the C classes and create an R-shot support set to guide the labeling process. Here, the dimension of the R-shot support set is R × C. Then, we match all features in a mini-batch against the support set using the cosine similarity metric. Next, we keep the features that match any support sample above a defined threshold. As features are less useful during early epochs, we restrict the number of unlabeled features considered for labeling to minimize both computation time and the target instance contrastive loss. After every step size, we progressively increase the number of features for the pseudo-labeling task by some factor. The curated features are then used for target pseudo-labeling through a clustering method.
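The curation step can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity against every sample in the R-shot support set, keeping a target feature if it matches any support sample above the threshold; the toy support set and threshold are illustrative only.

```python
import numpy as np

def curate_features(features, support_set, threshold=0.7):
    """Keep only the target features whose cosine similarity to ANY sample
    in the R-shot support set passes the threshold. `support_set` has shape
    (C, R, D): C classes, R shots per class, D-dimensional features."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    supp = support_set.reshape(-1, support_set.shape[-1])
    supp = supp / np.linalg.norm(supp, axis=1, keepdims=True)
    sims = feats @ supp.T                    # (N, C*R) cosine similarities
    keep = (sims >= threshold).any(axis=1)   # match against any support sample
    return features[keep], keep

# Toy support set with C=2 classes and R=2 shots of 3-D features.
support = np.array([[[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
                    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]])
batch = np.array([[0.95, 0.05, 0.0],   # close to a class-0 support sample
                  [0.0, 0.0, 1.0]])    # matches nothing: likely background
kept, mask = curate_features(batch, support)
```

Only the curated features (`kept`) are passed on to the clustering-based pseudo-labeling step.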
K-means++ is an improved version of the original K-means clustering algorithm that selects better initial centroids in high dimensions and reduces the chance of the algorithm getting stuck in local optima compared to K-means [41]. Thus, we use K-means++ to generate pseudo-labels by clustering the deep features. The clustering performance of K-means++ is shown in Figure 6, and the value of K for clustering is selected empirically. The selection process for K is described later in Subsection IV-C and Table VI.
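A compact sketch of the K-means++ pseudo-labeling step, assuming Euclidean distances on the curated deep features (the paper's Cythonized implementation is not reproduced here):

```python
import numpy as np

def kmeans_pp(X, k, iters=20, seed=0):
    """Minimal K-means++ sketch for pseudo-labeling deep features:
    D^2-weighted seeding followed by standard Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Pick the next seed with probability proportional to the squared
        # distance to the nearest already-chosen center (the ++ step).
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated feature blobs should receive two consistent pseudo-labels.
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, _ = kmeans_pp(X, k=2)
```

The D²-weighted seeding is what distinguishes K-means++ from vanilla K-means and is the reason for its better behavior on high-dimensional features.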

D. Debiased Local Contrastive Learning
Local adaptation is a class-agnostic adaptation because we extract features at the pixel level of the source and target domains. From the architecture of our proposed model in Figure 3, we can see that the first step toward local domain adaptation is to generate synthesized images from both the source (S) and target (T) images in a mini-batch. For that, we use CycleGAN and pass both source and target images to generate the translated source (S′) and translated target (T′), respectively. Then we pass S, T′, T, S′ to the backbone for feature extraction. We save local features from an earlier layer of the backbone with dimension 256 × 100 × 100. Then we pass them into the bottleneck block, which reduces the feature dimension to 32 × 100 × 100, where the dimensions are C, W, and H, respectively. Next, we pass the output of the bottleneck layer to the multi-layer perceptron (MLP) block and make the final feature vector with a length of 1024. We reduce the size of each feature to reduce GPU memory requirements.
Let us represent the local features from S, T′, T, and S′ as α^S_i, α^{T′}_i, α^T_i, and α^{S′}_i, respectively, where i is the index in the mini-batch. As we perform bidirectional adaptation, for the adaptation of S and T′ we select a local feature α^S_i ∈ α^S as the query and choose different augmentations of the corresponding feature α^{T′}_i ∈ α^{T′} as the positive cases. The negative cases are all other local features α^{T′}_j ∈ α^{T′} in the mini-batch, where j ≠ i. The bidirectional local contrastive loss between (S, T′) and (T, S′) can be calculated from Equations 4 and 5.

```latex
\mathrm{DCL}_{S,T'} = -\log
\frac{\exp\!\left(\frac{1}{\mu}\sum_{m=1}^{\mu}\mathrm{sim}\!\left(\alpha^{S}_{i}, \alpha^{T'(m)}_{i}\right)/\tau\right)}
{\exp\!\left(\frac{1}{\mu}\sum_{m=1}^{\mu}\mathrm{sim}\!\left(\alpha^{S}_{i}, \alpha^{T'(m)}_{i}\right)/\tau\right)
 + \sum_{j=1}^{\nu}\exp\!\left(\mathrm{sim}\!\left(\alpha^{S}_{i}, D\alpha^{T'}_{j}\right)/\tau\right)}
\tag{4}
```

```latex
\mathrm{DCL}_{T,S'} = -\log
\frac{\exp\!\left(\frac{1}{\mu}\sum_{m=1}^{\mu}\mathrm{sim}\!\left(\alpha^{T}_{i}, \alpha^{S'(m)}_{i}\right)/\tau\right)}
{\exp\!\left(\frac{1}{\mu}\sum_{m=1}^{\mu}\mathrm{sim}\!\left(\alpha^{T}_{i}, \alpha^{S'(m)}_{i}\right)/\tau\right)
 + \sum_{j=1}^{\nu}\exp\!\left(\mathrm{sim}\!\left(\alpha^{T}_{i}, D\alpha^{S'}_{j}\right)/\tau\right)}
\tag{5}
```
In Equations 4 and 5 above, D stands for debiased, and m denotes the m-th augmentation out of μ augmentations for a particular image. Finally, the number of negative examples drawn from a mini-batch is denoted by ν. The total bidirectional local domain adaptation loss can be formulated by accumulating the loss for all query images in a mini-batch, as follows:
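A plausible form of the accumulated loss, consistent with the per-query losses above and the weight W1 introduced in Section III-F (the exact batch normalization used by the authors is an assumption here), is:

```latex
\mathrm{DCL}_{local} = W_{1}\sum_{i=1}^{B}\left(\mathrm{DCL}_{S,T'}(i) + \mathrm{DCL}_{T,S'}(i)\right)
\tag{6}
```

where B is the number of query images in the mini-batch.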

E. Debiased Global Contrastive Learning
Global domain adaptation works on a more abstract view of object features. We collect global image features from the last layer of the backbone; in this way, we get highly detailed features at a lower spatial resolution. As in local adaptation, we pass these 256 × 25 × 25 features to the bottleneck layer and reduce the dimension to 3 × 25 × 25. Next, the features are fed to the MLP block, and a feature vector with 1024 dimensions is computed. Following the notational format of Section III-D, we define the global features from S, T′, T, and S′ as β^S_i, β^{T′}_i, β^T_i, and β^{S′}_i, respectively. Again, i is the index in the mini-batch. The bidirectional global contrastive loss between (S, T′) and (T, S′) can be presented as in Equations 7 and 8.
The total bidirectional global domain adaptation loss can be formulated by accumulating the loss for all query images in a mini-batch, as follows:

F. Debiased Instance Contrastive Learning
Local-Global (LG) contrastive learning helps create domain-invariant features, as shown in Figure 4; the figure shows that image-level adaptation can remove the domain boundary and create a uniform domain feature space for the source and target datasets. However, it is also evident that no class discrepancy is maintained with image-level alignment alone, and there is overlap between the different class instances in the feature space. To solve this issue, we propose to perform debiased instance contrastive learning on the source and target datasets to achieve class discrepancy in the features. The effect of this learning is illustrated in Figure 4, where we can see a moderate separation line between the two classes.
Let us denote the source and target region proposals as Γ^S_i and Γ^T_i, respectively, and the corresponding classes as C^S_i and C^T_i, where i is the proposal index among P proposals. Instance-level contrastive learning can be formulated from Equations 10 and 11.
Equations 10 and 11 above represent the source and target instance losses, respectively. Here, μ and ν stand for the number of positive and negative samples, respectively, and i stands for the i-th proposal in the proposal set P. We define the class ids of the query, positive, and negative samples as qc, pc, and nc, respectively. The total instance contrastive loss can be formulated by accumulating the loss for all region proposals in a mini-batch. Confidence also tends to be less reliable at the early stage of the adaptation: the feature quality and objectness scores from the RPN for the target dataset are generally less reliable due to the large domain gap. In this light, we use the weights W1, W2, and W3 in Equations 6, 9, and 12, respectively, to perform progressive adaptation, giving less weight during the early stage of adaptation and progressively increasing the focus as object confidence scores and feature quality improve. The initial values for W1, W2, and W3 are set to 0.1, 0.1, and 0.01, respectively. The total loss for the detection and adaptation process is calculated by summing all loss components, as outlined in Equation 13.
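The progressive weighting can be sketched as follows. The paper specifies only the initial values 0.1, 0.1, and 0.01; the linear ramp and final values below are assumptions for illustration.

```python
def progressive_weights(epoch, total_epochs,
                        w_init=(0.1, 0.1, 0.01), w_max=(1.0, 1.0, 1.0)):
    """Hypothetical linear ramp for W1 (local), W2 (global), and W3 (instance):
    start small while target features and pseudo-labels are unreliable, then
    increase as adaptation progresses. The schedule itself is an assumption."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return tuple(w0 + t * (wm - w0) for w0, wm in zip(w_init, w_max))

def total_loss(det_loss, dcl_local, dcl_global, dcl_ins, epoch, total_epochs):
    """Weighted combination of the detection loss and the three DA losses,
    mirroring the accumulation behind Equations 6, 9, 12, and 13."""
    w1, w2, w3 = progressive_weights(epoch, total_epochs)
    return det_loss + w1 * dcl_local + w2 * dcl_global + w3 * dcl_ins
```

The instance weight W3 starts an order of magnitude smaller than W1 and W2, reflecting the text's observation that target pseudo-labels are least trustworthy early on.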
TotalLoss = SWFL(x, p_t, y) + DCL_local + DCL_global + DCL_Ins   (13)

IV. EXPERIMENTS

In this section, we evaluate our proposed debiased contrastive learning model against current state-of-the-art domain adaptation methods on four remote sensing image datasets. The experimental setup is described in Section IV-A, the comparison findings are summarized in Section IV-B, and extensive ablation studies over different factors and parameters are presented in Section IV-C.

A. Implementation Details and Setup
Implementation: We use an object classification pipeline similar to [29]: Darknet53 as the backbone, as it is shown to preserve semantic information from small objects better than residual-based feature extractor networks [11], [44]; an RPN heatmap-based approach to identify dense small objects and remove NMS; and Faster-RCNN [45] as the detection block. We used Python with PyTorch as the deep learning framework to implement the project. Our implementation is heavily based on the open-source computer vision library Detectron2 [46] and parts of the SOD [14] implementation. With debiased contrastive learning, we implemented three new DA modules for local, global, and instance domain adaptation. We also implemented a Cythonized K-means++ that is much faster than the Python implementation; the clustering time is recorded in Table VI.
In the CycleGAN network [47], load 800 and crop 640 were used for data augmentation. To train our DCLDA model, we resized all images to 800 × 800 pixels and set the mini-batch size to eight, so in total we send 8 × 4 = 32 images in a mini-batch to train the DCLDA model. The PyTorch color-jitter augmentation technique was used to create multiple augmented copies of the synthesized images as positive cases for image-level contrastive learning. During support-guided pseudo-labeling, we chose five samples per class and created a 5-shot support set. For the feature curation, we tried different values of the cosine similarity threshold and found that a 70% threshold achieves optimal performance across most of the experiments. We used 2× NVIDIA RTX 6000 GPUs with 49 GB of memory, an 11th-generation Intel Core i9-11900K @ 3.50 GHz × 16 CPU, and 167 GB of system memory to carry out all experiments.

Fig. 7: Source and target domain detection results using our DCLDA method.

Datasets: The DIOR dataset originally consisted of 24,500 Google Earth images from 80 countries. After selecting only common classes, the reduced dataset has 11,402 images. The images vary in quality and were captured in different seasons and weather conditions. The training set contains 10,888 images, and the testing set contains 512 images. The DOTA dataset comprises 2,430 overhead images with image sizes ranging from 800 × 800 to 29,200 × 27,620 pixels. The ground sample distance (GSD) in the dataset ranges from 0.1 to 0.87 m, and each image contains an average of 220 objects. For the experiments, we split high-resolution images into patches of 1024 × 1024 pixels with an overlap of 200 pixels. Considering only the ten common classes, the DOTA2.0 training set has 11,551 images, and the testing set has 3,488 images. Visdrone is a UAV dataset containing over 10,000 image frames from more than 6 hours of video, making it one of the largest drone datasets available. The experimental dataset includes three common object categories, and the images have resolutions ranging from 540p to 1080p. The training and testing sets contain 6,883 and 546 images, respectively. The UAVDT dataset contains over 80,000 frames in 179 videos captured by UAVs, making it one of the largest datasets available for object detection. The experimental dataset contains 10,000 images with three object categories and image resolutions ranging from 540p to 1080p. The dataset covers various weather conditions, including sunny, cloudy, and rainy. The → symbol illustrates the direction of domain adaptation: source → target.
Evaluation Metrics: To assess the effectiveness of our proposed approach in the target domain, we measure the Average Precision (AP) by considering both precision and recall for each object category. The mean AP (mAP) is then calculated as the average AP across all object categories. The mAP for all experiments was calculated with an IoU of 0.

B. Method Comparisons
We compare our DCLDA method with several current state-of-the-art techniques for the adaptive object detection task on two high-variability video image datasets and two high-variability image datasets. Specifically, we use CenterNet2 [9] as the source-only baseline, trained only with labeled source data and serving as the performance lower bound for comparisons. On the other hand, the oracle method is trained with labeled target data, serving as the performance upper bound. We also use feature alignment DA methods such as MGADA [42] and SAPNet [43] for comparison. The Visdrone and UAVDT video datasets are two high-variability video sets captured from UAVs, reported in Table III. Here we evaluate target dataset performance over three different categories. We not only show excellent performance on the target dataset but also achieve a 59.2% mean average precision (mAP) on the source dataset (see Table III), which is noteworthy. Our baseline method trained only on source data gives 26.4% mAP, whereas our DCLDA method achieves 41.5% mAP using debiased contrastive learning and pseudo-labeling. We also have a +2.1% gain margin over the best state-of-the-art ConfMix method. Moreover, using debiased contrastive learning, we shrink the performance gap between the oracle and our model from 30.5% to 15.4% compared to the baseline model.

C. Ablation study
In this section, we answer several questions. The first one is: does instance-level adaptation help on target data?
Table IV shows that instance domain adaptation improves mAP by 7.9% and 5.4% for the DOTA and UAVDT target datasets, respectively. The performance on the source dataset dropped slightly, by 1.5% for the DIOR dataset, after IDA (w/o curation) due to the increased number of loss functions and noise from the target instance labels. However, when we used the support set to curate the noisy features and guide the IDA process, we not only gained higher mAP on the target dataset but also made a stable recovery from the source dataset performance drop (see Table IV).
The second question is: how much do we benefit from using multiple positive samples? We claim that a single positive sample for contrastive learning does not suffice for high-variability overhead videos and imagery. Table V illustrates the gain: even two positive samples improve overall performance by roughly 2.0% on both target datasets.
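The debiasing idea, pivoting on class probabilities to suppress likely false negatives, can be sketched as a multi-positive InfoNCE-style loss. Everything below is an illustrative assumption rather than the paper's exact formulation: the function name, the weighting w_n = 1 - p_n(anchor class), and the temperature are our choices for the sketch.

```python
import numpy as np

def debiased_multi_pos_loss(anchor, positives, negatives, neg_probs,
                            anchor_class, tau=0.1):
    """Multi-positive contrastive loss with probability-based debiasing.

    anchor      : (D,) anchor feature
    positives   : (P, D) augmented views of the anchor's instance/class
    negatives   : (N, D) candidate negative features
    neg_probs   : (N, C) softmax class probabilities of each negative
    anchor_class: int, predicted class of the anchor
    Negatives likely to share the anchor's class (false negatives) are
    down-weighted by w_n = 1 - p_n(anchor_class).
    """
    def l2(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    a, pos, neg = l2(anchor), l2(positives), l2(negatives)
    pos_sim = np.exp(pos @ a / tau)               # (P,) similarity to positives
    neg_sim = np.exp(neg @ a / tau)               # (N,) similarity to negatives
    w = 1.0 - neg_probs[:, anchor_class]          # debiasing weights in [0, 1]
    denom_neg = (w * neg_sim).sum()
    # Average the InfoNCE term over all positive views.
    return float(-np.mean(np.log(pos_sim / (pos_sim + denom_neg))))
```

A negative that the classifier believes belongs to the anchor's class gets weight near zero, so it no longer pushes same-class features apart.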
Adding too many positive and negative examples can introduce more noise and ultimately hamper the results, as illustrated in Table V for 15 negative and eight positive cases. The study found that seven negatives and four positives give the optimal results for each dataset. The third question is: how many clusters should we set for pseudolabeling? Table VI shows that pseudolabeling with five clusters for DOTA and two for UAVDT achieves up to a 7.5% and 5.8% increase, respectively. Table I shows that five major classes dominate the DOTA labels; for UAVDT, a single dominant class with two minor classes separates the dataset into two clusters for target labeling.
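The cluster-count hyperparameter can be illustrated with a minimal k-means pseudolabeling sketch. The function name and the farthest-point initialization are our assumptions for the sketch, not the paper's implementation; k is the per-dataset hyperparameter studied in Table VI (e.g., 5 for DOTA, 2 for UAVDT):

```python
import numpy as np

def kmeans_pseudolabel(features, k, iters=20):
    """Assign cluster-based pseudolabels to unlabeled target features."""
    # Farthest-point initialization: robust for well-separated clusters.
    centers = [features[0]]
    for _ in range(k - 1):
        d = np.min([((features - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.stack(centers)

    for _ in range(iters):
        # Assign each feature to its nearest center (squared L2 distance).
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned features.
        for c in range(k):
            if (labels == c).any():
                centers[c] = features[labels == c].mean(axis=0)
    return labels, centers
```

The cluster ids then serve as pseudolabels for instance-level adaptation; choosing k to match the number of dominant classes is what the ablation in Table VI probes.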
Finally, we assess the efficacy of the different modules of the proposed DCLDA model. Table VII shows that each integrated module contributes a performance gain on the target datasets. We integrated the modules one by one and recorded the mAP on the experimental datasets. Integrating CycleGAN-based synthetic images for transfer learning first gains +1.8% and +1.4% mAP on DOTA and UAVDT, respectively. Next, we integrated the three contrastive learning modules (LDA, GDA, and IDA) incrementally; the results are presented in Table VII. The largest gain comes from integrating the IDA module: +11.5% and +10.4% mAP on DOTA and UAVDT, respectively. Finally, combining all proposed modules in the DCLDA architecture yields the optimal performance on both target datasets under careful hyperparameter selection.

V. CONCLUSION
This paper proposes debiased contrastive learning with support-set guided pseudolabeling for the unsupervised domain adaptation task. We show that remote sensing video frames and images exhibit significant domain shifts due to lighting conditions, weather changes, and geographical variance, and that careful design of the detection pipeline and an instance-aware domain adaptation method are required for optimal performance. Our contrastive learning method makes two significant improvements: first, debiased contrastive learning removes false negative samples using class-wise probability logits; second, multiple augmented positive cases add stability against object size and scale variation across images and datasets. We further show that a fast, support-guided pseudolabeling technique can improve target instance learning by eliminating noisy object features with little training-time overhead; our method takes only a second to label 4000 target features in a mini-batch. Finally, we validate our approach on four challenging high-variability datasets, showing significant gains over available state-of-the-art methods. On the UAVDT and DOTA target datasets, we outperform the latest state-of-the-art ConfMix method by +2.1% and +3.2% mAP, respectively.

Fig. 4: Contrastive learning alignments; different colors represent different domains, and shapes represent different categories.

Fig. 7(a) and (b): Detection performance of DCLDA trained on DIOR source data and tested on the DOTA target dataset.

TABLE I: Instance distribution statistics (test set) of the DIOR dataset.

TABLE II: Classwise performance comparisons (mAP) for the DIOR → DOTA benchmark (IoU = 0.5 at the non-maximum suppression stage), measured on both the DIOR (source) and DOTA (target) datasets.

TABLE IV: Source and target detection performance (mAP) with (w/) and without (w/o) instance domain adaptation (IDA).
Table II presents the performance comparison for the DIOR and DOTA satellite image datasets, showing classwise performance on the target dataset and overall performance on both source and target datasets. As Table II shows, our baseline achieves 66.6% and 35.4% mAP on the source and target datasets, respectively. We improve the baseline with image-level local and global domain adaptation and pseudolabeling-based instance adaptation, outperforming the other state-of-the-art models by a minimum margin of 3.2% on the target dataset; the gap between DCLDA and the oracle results is now narrowed. Table III and Figure 7(c) and (d) demonstrate the effectiveness of our method in detecting objects from challenging and less frequent categories, including trucks and buses. Table III also shows that a well-designed backbone can enhance performance by around +2.7% on the video target domain dominated by dense objects.

TABLE VII: Ablation study for the different modules of our DCLDA method. Here, CGAN = CycleGAN transfer learning, LDA = local domain adaptation, GDA = global domain adaptation, and IDA = instance-level domain adaptation.