Bridging the Domain Gap in Satellite Pose Estimation: a Self-Training Approach based on Geometrical Constraints

Recently, unsupervised domain adaptation in satellite pose estimation has gained increasing attention, aiming at alleviating the annotation cost of training deep models. To this end, we propose a self-training framework based on domain-agnostic geometrical constraints. Specifically, we train a neural network to predict the 2D keypoints of a satellite and then use PnP to estimate the pose. The poses of target samples are regarded as latent variables to formulate the task as a minimization problem. Furthermore, we leverage fine-grained segmentation to tackle the information loss caused by abstracting the satellite as sparse keypoints. Finally, we iteratively solve the minimization problem in two steps: pseudo-label generation and network training. Experimental results show that our method adapts well to the target domain. Moreover, our method won 1st place on the sunlamp task of the second international Satellite Pose Estimation Competition.


I. INTRODUCTION
Estimating the pose of an uncooperative spacecraft is crucial in numerous space missions, such as debris removal [1], on-orbit servicing [2], and assets refueling [3]. In the last two decades, numerous relative navigation systems have been proposed based on LiDARs [4], [5] and cameras [6]-[11]. Active sensors, including radars and LiDARs, require larger mass and higher power consumption than visual sensors. Moreover, a stereo system requires a large baseline and relies on robust feature matching to obtain depth information. Therefore, monocular vision-based navigation systems have gained increasing attention from academia and industry.
Traditionally, monocular satellite pose estimation approaches require hand-crafted feature extraction [6]-[9], which limits their performance in challenging environments involving occlusions, harsh lighting conditions, reflective materials, and complex structures. Recently, deep learning-based methods have achieved success in satellite pose estimation [11], thanks to the powerful feature representation ability of deep models. However, training neural networks requires large-scale datasets, while collecting real images of spacecraft and annotating their 6-DoF poses is time-consuming, laborious, and difficult. Therefore, recent deep models are trained and evaluated using synthetic images [11]-[15]. Due to the inherent discrepancy between real and synthetic images, deep models that are fully supervised by synthetic images usually show deteriorated performance when deployed in real scenarios. This issue is revealed by the large gap between scores on the real and synthetic leaderboards¹.
In computer vision, unsupervised domain adaptation (UDA) [16]-[21] is adopted in scenarios where the labels of real samples are scarce, by training deep models using labeled synthetic images and unlabeled real images. Motivated by these practices, Park et al. [22] created the next-generation spacecraft pose estimation dataset (SPEED+) with a focus on the synthetic-to-real domain gap in satellite pose estimation. Moreover, based on the SPEED+ dataset, the Advanced Concepts Team (ACT) of the European Space Agency (ESA) and the Space Rendezvous Laboratory (SLAB) at Stanford University co-organized the second international Satellite Pose Estimation Competition (SPEC2021)² to boost research on bridging the domain gap.
Different from common UDA tasks in computer vision, space missions use a calibrated camera to measure the pose of the same satellite under various environments. For UDA in satellite pose estimation in SPEC2021, only the environmental settings differ across domains, while the satellite structure and the camera parameters are the same (as shown in Fig. 1). Therefore, given a pose, the locations of keypoints and the satellite masks are the same in different domains; we refer to these as the domain-agnostic geometrical constraints. Besides, due to intense imaging noise, challenging illumination variations, and diverse poses, this task poses additional challenges compared with common UDA problems [17]-[19].
Several UDA approaches [20], [21], [23]-[26] have been proposed to explore self-training paradigms to improve performance on the target domain. Nonetheless, these methods are not specifically designed for UDA in satellite pose estimation, since the domain-agnostic geometrical constraints are not fully explored. Meanwhile, previous satellite pose estimation approaches [12], [15] represent a satellite as a set of 2D keypoints, and then estimate the satellite pose using perspective-n-point (PnP) algorithms [27]. However, as shown in Fig. 1, sparse keypoints stand for only semantic parts of the satellite. Such a sparse representation leads to a significant loss of information, which hampers knowledge transfer across domains.
Fig. 1: Images from the three domains: synthetic, lightbox (low contrast, highly noisy), and sunlamp (specular reflection). Top: the original images from [22] with the 2D keypoints and the reprojected wireframe model. Bottom: the fine-grained masks. Note that the visualization on the synthetic image is obtained using the ground-truth pose, while the poses used on the lightbox and sunlamp images are predicted by our network.
To tackle the above problems, we formulate UDA of satellite pose estimation as a minimization problem under a self-training framework. First, we formulate the geometrical constraints as a projection function, which maps the predefined 3D keypoints onto the source and target images using the same camera parameters. Based on the projection function, we propose a basic self-training framework by taking the poses of target samples as latent variables, which are jointly optimized with the network parameters. Second, we leverage fine-grained segmentation to extend the basic framework. Specifically, we enhance the geometrical constraints with a rendering function. Similar to the projection function, the rendering function maps the 3D mesh of the satellite to fine-grained masks, using the same camera parameters in different domains. Therefore, we take fine-grained segmentation as an auxiliary task of keypoint regression. Furthermore, as the masks provide dense descriptions with structural information, we perform adversarial training by aligning the predicted masks of the source and target samples. Finally, we iteratively optimize the network parameters and generate pseudo labels to solve the minimization problem. Experimental results demonstrate the effectiveness of our framework. Moreover, our method took 1st place on the sunlamp leaderboard and 3rd place on the lightbox leaderboard of the second international Satellite Pose Estimation Competition³.
³ https://kelvins.esa.int/pose-estimation-2021/ team: lava1302
Our contributions can be summarized as follows:
(1) We explore the domain-agnostic geometrical constraints to propose a self-training framework for UDA in satellite pose estimation.
(2) We leverage fine-grained masks to address the information loss problem caused by abstracting the satellite as sparse keypoints.
(3) Our method significantly improves the accuracy of satellite pose estimation without using real annotations.

II. RELATED WORK
Object pose estimation aims at recovering the 3D position and 3D rotation of an object in the camera-centered coordinate system. Traditional approaches [28], [29] rely on local features, suffering from texture-less objects and background clutter. Recently, CNN-based methods have dominated most object pose estimation tasks. Numerous approaches [30]-[35] have been proposed to estimate poses using putative 2D-3D correspondences and the PnP algorithms. To achieve better efficiency, several methods [36]-[39] are introduced to directly regress poses from monocular images. Other methods [40]-[42] learn the latent representations of rotation and recover poses by exploring the image retrieval paradigms. Since these methods focus on household objects in indoor scenarios [43], [44], they face significant challenges caused by wide-range depth variations and illumination changes in outer space [45].
Satellite pose estimation is a special case of object pose estimation. Spacecraft Pose Network (SPN) [14] is the first deep learning-based approach for satellite pose estimation. Specifically, the 3D rotation is recovered by discretizing the viewpoint space into bins, and then the 3D translation is estimated using the geometrical constraints. In other top-performing approaches, the pose estimation problem is formulated as a task of localizing semantic keypoints on the convex areas of a satellite by taking various representations, such as heatmap [12], vector [13], and set [15]. These methods crop the satellite from the input images using a well-trained object detector to address scale variations. To achieve better efficiency, Hu et al. [45] handle the scale problem in a single-stage way by introducing a sampling strategy. However, these methods are trained and tested on synthetic data. They usually undergo significant performance degradation when applied to real images due to the domain gap [46].
Unsupervised domain adaptation aims at addressing the domain mismatch problem. It is a promising direction to circumvent the laborious and time-consuming procedures of data annotation. Several UDA paradigms have been studied for different vision tasks. Adversarial learning aligns both domains with a discriminator. The alignment can be achieved at image-level [16], feature-level [17], [18], and output-level [19]. Self-training methods are introduced to utilize target samples to train the model by generating pseudo labels [23], [24], minimizing the entropy loss [20], [25], or employing the teacher-student framework [21], [26]. Nonetheless, it is challenging to adopt these approaches in UDA of satellite pose estimation, which is different from the common UDA tasks in computer vision.

III. METHOD
In this section, we first introduce the UDA task of satellite pose estimation (Sec. III-A) and the PnP-based solution to monocular satellite pose estimation (Sec. III-B). Then, we formulate the task as a minimization problem in a basic self-training framework (Sec. III-C), which is extended by leveraging fine-grained segmentation (Sec. III-D). We present the solution to the minimization problem (Sec. III-E). Finally, we give a mathematical proof of the geometrical constraints (Sec. III-F). The overview of our method is shown in Fig. 2.

A. Problem Formulation
In UDA of satellite pose estimation, the satellite structure and the camera parameters are the same in the source and target domains. The intrinsic matrix of the camera is denoted by $K$. We are given $N_s$ source images $X_s = \{I_s^i\}_{i=1}^{N_s}$ with 6-DoF pose annotations $T_s = \{[R_s^i | t_s^i]\}_{i=1}^{N_s}$, and $N_t$ unlabeled target images $X_t = \{I_t^j\}_{j=1}^{N_t}$. We assume that the poses in the source and target domains are sampled from the same distribution,
$$P(R_s, t_s) = P(R_t, t_t). \tag{1}$$
However, given a pose consisting of $R$ and $t$, the source and target images are sampled from different distributions,
$$P(I_s \mid R, t) \neq P(I_t \mid R, t). \tag{2}$$
It follows that the source and target samples are sampled from different joint distributions,
$$P(I_s, R_s, t_s) \neq P(I_t, R_t, t_t). \tag{3}$$
Equation (3) shows that the independent and identically distributed (i.i.d.) assumption is violated, which is referred to as the domain shift in satellite pose estimation. In this paper, we aim to train a neural network $G(I) \rightarrow [R|t]$ that reduces the distribution shift across domains.

B. PnP-based Solution
As suggested by Kisantal et al. [11], PnP-based methods significantly outperform direct regression in monocular satellite pose estimation. Therefore, we follow the PnP-based methods [12], [15] to estimate the satellite poses.
We assume the texture-less 3D mesh $M$ of the satellite is given. Hence, we select $N_p$ 3D landmarks $P = \{P_k\}_{k=1}^{N_p}$ on the mesh surface. Given the camera parameters and a pose, the 3D landmarks are reprojected onto 2D keypoints $\{p_k\}_{k=1}^{N_p}$, which are represented as a heatmap $H$. We construct a neural network consisting of a backbone $G_f$ and a heatmap head $G_h$ and train the network using source samples in a fully supervised way,
$$\min_{G_{f,h}} \sum_{i=1}^{N_s} \mathcal{L}_h(\hat{H}_s^i, H_s^i), \tag{4}$$
where $\hat{H}_s^i = G_h(G_f(I_s^i))$ is the predicted heatmap and $H_s^i$ is the ground truth. For heatmap regression, $\mathcal{L}_h$ is the adaptive wing loss [47] with the default parameter setting. During the inference stage, the predicted heatmap $\hat{H}$ is decoded into 2D keypoints, which are used to build putative 2D-3D correspondences $\{(\hat{p}_k, P_k)\}_{k=1}^{N_p}$. Finally, the pose is estimated by solving the PnP problem [15],
$$[\hat{R} | \hat{t}] = \arg\min_{R, t} \sum_{k=1}^{N_p} \phi\left( \left\| \lambda_k [\hat{p}_k^T, 1]^T - K (R P_k + t) \right\|_2 \right), \tag{5}$$
where $\lambda_k$ is the depth of landmark $P_k$ and $\phi(\cdot)$ is the Huber loss.
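The decoding step mentioned above can be illustrated with a minimal sketch. It assumes that each keypoint has its own heatmap channel, that the keypoint is read off at the channel maximum, and that the heatmap is at 1/4 of the input resolution (stride 4). The function name `decode_heatmaps` is hypothetical, and the actual decoder used with [15] may differ, for example by adding sub-pixel refinement.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one channel: rows (y) of columns (x)


def decode_heatmaps(heatmaps: List[Heatmap],
                    stride: float = 4.0) -> List[Tuple[float, float, float]]:
    """Return (x, y, peak) per keypoint channel, in input-image pixels.

    Assumption: argmax decoding, with `stride` mapping heatmap cells
    back to image pixels. Not necessarily the authors' decoder.
    """
    keypoints = []
    for hm in heatmaps:
        best_val, best_x, best_y = float("-inf"), 0, 0
        for y, row in enumerate(hm):
            for x, val in enumerate(row):
                if val > best_val:
                    best_val, best_x, best_y = val, x, y
        keypoints.append((best_x * stride, best_y * stride, best_val))
    return keypoints
```

The returned peak value can serve as a per-keypoint confidence when building the putative 2D-3D correspondences.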
Due to the domain shift, a neural network trained on the source domain usually suffers from performance degradation when applied to target images. To tackle this issue, we train the network using labeled source samples and unlabeled target samples (as described in Sec. III-C).

C. Basic Self-Training Framework
To fully exploit unlabeled target samples, we propose a self-training framework by leveraging the geometrical constraints. Specifically, we define the function that projects 3D landmarks onto the 2D heatmap as,
$$H = \mathcal{F}(R, t, P, K), \tag{6}$$
which provides the same geometrical constraints in the source and target domains, since the source and target samples share the same $P$ and $K$.
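As an illustration of the projection function, the following sketch projects 3D landmarks with the pinhole model and draws one heatmap per keypoint. The Gaussian heatmap shape, the value of `sigma`, and the function names are assumptions made for this example only; the text does not specify how the heatmap is rasterized.

```python
import math
from typing import List, Sequence, Tuple


def project_landmarks(R: Sequence[Sequence[float]], t: Sequence[float],
                      P: Sequence[Sequence[float]],
                      K: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Pinhole projection of 3D landmarks P (object frame) to pixels."""
    pts = []
    for X in P:
        # Camera-frame coordinates: Xc = R X + t
        xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
        # Homogeneous pixel coordinates: u = K Xc, then divide by depth
        u = [sum(K[r][c] * xc[c] for c in range(3)) for r in range(3)]
        pts.append((u[0] / u[2], u[1] / u[2]))
    return pts


def gaussian_heatmap(center: Tuple[float, float], width: int, height: int,
                     sigma: float = 2.0) -> List[List[float]]:
    """Assumed heatmap encoding: an isotropic Gaussian around the keypoint."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def F(R, t, P, K, width: int, height: int, sigma: float = 2.0):
    """Sketch of the projection function: pose and landmarks to heatmaps."""
    return [gaussian_heatmap(p, width, height, sigma)
            for p in project_landmarks(R, t, P, K)]
```

Because `P` and `K` are shared across domains, calling `F` with the same pose yields the same heatmaps regardless of whether the image comes from the source or the target domain.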
For source samples, we supervise the network using Eq. (4). The ground-truth heatmap $H_s$ is obtained by projecting landmarks $P$ using the function $\mathcal{F}$ with the pose annotation $[R_s|t_s]$. For unlabeled target samples, we take each target pose as a latent variable, which is sent to the function $\mathcal{F}$ to obtain the pseudo heatmap $\tilde{H}_t$. Therefore, we can unify Eq. (4) and (5) in an objective function to exploit unlabeled target images. Specifically, the task is formulated as a minimization problem by simultaneously optimizing the network parameters and the target poses $T_t = \{[R_t^j | t_t^j]\}_{j=1}^{N_t}$,
$$\min_{G_{f,h},\, T_t} \sum_{i=1}^{N_s} \mathcal{L}_h(\hat{H}_s^i, H_s^i) + \sum_{j=1}^{N_t} \mathcal{L}_h(\hat{H}_t^j, \tilde{H}_t^j), \quad \text{s.t. } \tilde{H}_t^j = \mathcal{F}(R_t^j, t_t^j, P, K),\ [R_t^j | t_t^j] \in SE(3) \cup \{0\}, \tag{7}$$
where the zero set $\{0\}$ is introduced to handle less confident predictions on target samples, which are ignored during network training by assigning their poses to 0.
Fig. 2: Overview of our self-training framework. The satellite is represented as a set of sparse keypoints and dense fine-grained masks. The source image $I_s$ is first transformed into a target-like source image using a well-trained CycleGAN [48]. The neural network consists of a backbone $G_f$, a heatmap head $G_h$, and a mask head $G_m$. The optimization comprises two iterative steps. In the first step, the pseudo mask $\tilde{Y}_t$ and pseudo heatmap $\tilde{H}_t$ are fixed, and the network $G_{f,h,m}$ is trained using multi-task losses. In the second step, the parameters of the neural network are fixed, and the predicted mask $\hat{Y}_t$ and heatmap $\hat{H}_t$ are used to generate pseudo labels. (Best viewed in color.)
The criterion to select confident predictions is presented in Sec. III-E.
Equation (7) can be solved iteratively. During optimization, the network $G_{f,h}$ is encouraged to predict accurate heatmaps on source and target images. Meanwhile, the target poses should converge to the ground truth. However, the heatmap-based representation of sparse keypoints introduces severe information loss. Consequently, the optimization process is prone to converging to local optima, leading to deteriorated performance.

D. Extended Framework with Multi-task Learning
To tackle the above problem, we note that the fine-grained masks in Fig. 1 provide rich dense descriptions with domain-agnostic structural context. Therefore, we apply segmentation as an auxiliary task of heatmap regression to improve pose estimation performance. In the following paragraphs, we first perform output-level alignment [19] using adversarial training, and then extend the minimization problem in Eq. (7) with the auxiliary task.
As shown in Fig. 2, we extend the basic network with a mask head $G_m$ to predict the fine-grained mask $\hat{Y} = G_m(G_f(I))$ for each image. These masks contain domain-agnostic information, which can be used to align the source and target samples by adversarial training. Hence, we build a discriminator $G_d$ to predict the domain label of each sample. The discriminator $G_d$ receives the detached mask predictions from both domains and is trained using the binary cross-entropy (BCE) loss on $\hat{Y}_s$ and $\hat{Y}_t$, the predicted masks of the source and target samples, respectively. To bridge the domain gap, we expect the masks predicted by the network $G_{f,m}$ to fool the discriminator. Therefore, we apply an adversarial loss $\mathcal{L}_{adv}$ to the predicted masks of the target images and backpropagate the gradient.
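The two losses in this paragraph can be sketched as follows. The label convention (source = 1, target = 0), the averaging over samples, and the function names are assumptions for illustration; the discriminator outputs are taken to be probabilities in (0, 1), one per sample.

```python
import math
from typing import Sequence

SOURCE, TARGET = 1.0, 0.0  # assumed domain-label convention


def bce(prob: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one probability and one label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(d_src: Sequence[float], d_tgt: Sequence[float]) -> float:
    """Train G_d to tell source masks from target masks.

    d_src / d_tgt are G_d outputs on detached source / target masks.
    """
    losses = [bce(p, SOURCE) for p in d_src] + [bce(p, TARGET) for p in d_tgt]
    return sum(losses) / len(losses)


def adversarial_loss(d_tgt: Sequence[float]) -> float:
    """Train G_{f,m} so that target masks are classified as source."""
    return sum(bce(p, SOURCE) for p in d_tgt) / len(d_tgt)
```

In an actual training loop the two losses are applied in alternation: the discriminator update uses detached masks, while the adversarial update lets gradients flow back into the backbone and mask head.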
Furthermore, we introduce a rendering function $\mathcal{R}$ to enhance the geometrical constraints. Given the pose $[R|t]$, the mask $Y$ can be obtained through the rendering function,
$$Y = \mathcal{R}(R, t, M, K). \tag{10}$$
Similar to the projection function $\mathcal{F}$, the rendering function $\mathcal{R}$ also provides the same geometrical constraints on source and target samples. This is because the 3D mesh of the satellite and the camera parameters are the same in different domains.
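For source samples, the network $G_{f,m}$ is supervised using the ground-truth masks. For unlabeled target images, the pseudo masks are obtained by sending the latent variables in Eq. (7), i.e., the target poses, to the rendering function $\mathcal{R}$. Therefore, the minimization problem depicted in Eq. (7) is extended with the tasks of segmentation and adversarial training,
$$\min_{G_{f,h,m},\, T_t} \sum_{i=1}^{N_s} \left[ \mathcal{L}_h(\hat{H}_s^i, H_s^i) + \lambda_m \mathcal{L}_m(\hat{Y}_s^i, Y_s^i) \right] + \sum_{j=1}^{N_t} \left[ \mathcal{L}_h(\hat{H}_t^j, \tilde{H}_t^j) + \lambda_m \mathcal{L}_m(\hat{Y}_t^j, \tilde{Y}_t^j) + \lambda_a \mathcal{L}_{adv}(\hat{Y}_t^j) \right], \tag{11}$$
$$\text{s.t. } \tilde{H}_t^j = \mathcal{F}(R_t^j, t_t^j, P, K), \quad \tilde{Y}_t^j = \mathcal{R}(R_t^j, t_t^j, M, K),$$
where $\lambda_m$ and $\lambda_a$ are the weights for the mask loss and the adversarial loss $\mathcal{L}_{adv}$, respectively. $\mathcal{L}_m$ is the cross-entropy loss for segmentation. We refer to $\tilde{Y}$ as the pseudo mask. Again, by optimizing the loss in Eq. (11), the network $G_{f,h,m}$ should perform well on the target domain, while the pseudo labels, including pseudo heatmaps, masks, and poses, should approximate the ground truth.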

E. Iterative Optimization and Pseudo Label Generation
We observe that the variables in Eq. (11) can be divided into two classes: the network parameters and the poses of target samples. Following [23], we adopt iterative procedures to optimize Eq. (11):
1) Fix the pseudo labels (or initialize $T_t$ as $\{0\}$) and train the network $G_{f,h,m,d}$;
2) Fix the network, optimize the poses $T_t$, and generate pseudo labels.
We take the combination of these two steps as one round and take several rounds to optimize Eq. (11).
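The alternation between the two steps can be summarized by the following control-flow sketch. The callables `train_step` and `estimate_pose`, and the use of `None` to stand for the zero pose, are placeholders introduced for this example; they are not part of the described implementation.

```python
from typing import Any, Callable, Dict, Hashable, Optional, Sequence, Tuple


def self_train(train_step: Callable[[Any, Dict[Hashable, Any]], Any],
               estimate_pose: Callable[[Any, Hashable], Tuple[Any, int]],
               target_ids: Sequence[Hashable],
               rounds: int = 3,
               n_th: int = 8) -> Tuple[Any, Dict[Hashable, Optional[Any]]]:
    """Alternate network training and pseudo-label generation.

    None plays the role of the zero pose: such samples are skipped
    when training the network in the next round.
    """
    pseudo: Dict[Hashable, Optional[Any]] = {j: None for j in target_ids}
    model = None
    for _ in range(rounds):
        # Step 1: pseudo labels fixed, train the network on source data
        # plus the target samples that currently have a pseudo pose.
        labelled = {j: p for j, p in pseudo.items() if p is not None}
        model = train_step(model, labelled)
        # Step 2: network fixed, re-estimate target poses and keep only
        # the confident ones (inlier count at least n_th).
        for j in target_ids:
            pose, n_in = estimate_pose(model, j)
            pseudo[j] = pose if n_in >= n_th else None
    return model, pseudo
```

With all pseudo poses initialized to the zero pose, the first round reduces to training on source samples only, which matches the pretraining stage described in the ablation study.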
In the first step, when the pseudo heatmaps and pseudo masks are fixed, the minimization problem in Eq. (11) is simplified as,
$$\min_{G_{f,h,m}} \sum_{i=1}^{N_s} \left[ \mathcal{L}_h(\hat{H}_s^i, H_s^i) + \lambda_m \mathcal{L}_m(\hat{Y}_s^i, Y_s^i) \right] + \sum_{j=1}^{N_t} \left[ \mathcal{L}_h(\hat{H}_t^j, \tilde{H}_t^j) + \lambda_m \mathcal{L}_m(\hat{Y}_t^j, \tilde{Y}_t^j) + \lambda_a \mathcal{L}_{adv}(\hat{Y}_t^j) \right]. \tag{12}$$
This is equivalent to training the neural network using the labeled source samples and the pseudo-labeled target samples. In the second step, when the network parameters are fixed, the minimization problem in Eq. (11) is simplified as,
$$\min_{T_t} \sum_{j=1}^{N_t} \left[ \mathcal{L}_h(\hat{H}_t^j, \tilde{H}_t^j) + \lambda_m \mathcal{L}_m(\hat{Y}_t^j, \tilde{Y}_t^j) \right], \quad \text{s.t. } \tilde{H}_t^j = \mathcal{F}(R_t^j, t_t^j, P, K),\ \tilde{Y}_t^j = \mathcal{R}(R_t^j, t_t^j, M, K). \tag{13}$$
This is equivalent to optimizing the target poses to minimize the loss, by simultaneously aligning the heatmaps and masks generated using functions $\mathcal{F}$ and $\mathcal{R}$ with the ones predicted by the network. However, for the classification loss $\mathcal{L}_m$, there is a trivial solution that ignores all pseudo heatmaps, i.e., $T_t = \{0\}$. Therefore, we leverage the predicted heatmaps to estimate poses and then generate pseudo labels. Specifically, we decode the predicted heatmap $\hat{H}_t$ of the target image into the 2D coordinates of keypoints, which are then used to generate putative correspondences $\{(\hat{p}_t^k, P_k)\}_{k=1}^{N_p}$. Next, the estimated pose $[\hat{R}|\hat{t}]$ and the number of inliers $N_{in}$ can be obtained by solving Eq. (5). We refer readers to our previous work [15] for the detailed solution to Eq. (5). Furthermore, we take $N_{in}$ as the confidence of the estimated pose and select the estimated pose as a pseudo pose if $N_{in} \geq N_{th}$. Otherwise, the pseudo pose is set to 0 and the corresponding sample is excluded during network retraining. A smaller $N_{th}$ allows more estimated poses to be used to generate pseudo labels for model training. Additionally, for target samples with $N_{in} < N_{th}$, we generate pseudo masks using the approach proposed by [24].
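The confidence test on $N_{in}$ can be sketched as below. The inlier definition used here (a keypoint whose reprojection lies within `pixel_thresh` pixels of the prediction) and the 5-pixel default are assumptions for illustration; the solver in [15] may define inliers differently. The default `n_th=8` follows the value reported in the implementation details.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def count_inliers(pred_kpts: Sequence[Point], reproj_kpts: Sequence[Point],
                  pixel_thresh: float = 5.0) -> int:
    """Count keypoints whose reprojection error is within pixel_thresh."""
    return sum(1 for (u, v), (ur, vr) in zip(pred_kpts, reproj_kpts)
               if math.hypot(u - ur, v - vr) <= pixel_thresh)


def select_pseudo_pose(pose: Any, pred_kpts: Sequence[Point],
                       reproj_kpts: Sequence[Point], n_th: int = 8,
                       pixel_thresh: float = 5.0) -> Tuple[Optional[Any], int]:
    """Keep the estimated pose only if it has at least n_th inliers.

    Returns (pose or None, n_in); None stands for the zero pose, i.e. the
    sample is excluded from the next round of network training.
    """
    n_in = count_inliers(pred_kpts, reproj_kpts, pixel_thresh)
    return (pose if n_in >= n_th else None), n_in
```

Here `reproj_kpts` are the landmarks projected with the estimated pose, so the count measures how well the pose explains the network's own keypoint predictions.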

F. Discussion
Our self-training framework is built upon the geometrical constraints, which are provided by the projection function $\mathcal{F}$ and the rendering function $\mathcal{R}$. In this section, we provide a mathematical proof of the geometrical constraints from the perspective of probability distributions. Since the functions $\mathcal{F}$ and $\mathcal{R}$ have the same form, we only present the proof for the function $\mathcal{F}$ in the remainder of this section.
Equation (6) has four input parameters: the rotation $R$ and translation $t$ of the satellite pose $[R|t]$, the 3D landmarks $P$, and the camera intrinsic matrix $K$. Since $P$ and $K$ are constant, if the poses of the satellite in the source and target domains are equal, the heatmaps in both domains are the same,
$$[R_s | t_s] = [R_t | t_t] \Rightarrow H_s = H_t. \tag{14}$$
Therefore, when conditioned on the satellite pose, the heatmap distributions in the source and target domains are the same,
$$P(H_s \mid R, t) = P(H_t \mid R, t). \tag{15}$$
On the other hand, the heatmap distributions can be marginalized over the satellite poses as follows,
$$P(H_s) = \int P(H_s \mid R, t)\, P(R_s, t_s)\, \mathrm{d}R\, \mathrm{d}t, \quad P(H_t) = \int P(H_t \mid R, t)\, P(R_t, t_t)\, \mathrm{d}R\, \mathrm{d}t. \tag{16}$$
By combining Eq. (1), Eq. (15), and Eq. (16), we conclude that the heatmaps in the source and target domains have the same distribution, i.e.,
$$P(H_s) = P(H_t). \tag{17}$$
Therefore, the geometrical constraint provided by the projection function $\mathcal{F}$ is domain-agnostic. Moreover, a similar conclusion also holds for the rendering function $\mathcal{R}$, i.e., $P(Y_s) = P(Y_t)$. The fine-grained mask provides a dense representation, while the keypoint heatmap is sparse and leads to significant information loss. Hence, we perform output-level adaptation to match the mask distributions between the source and target domains in Sec. III-D.

IV. EXPERIMENTAL RESULTS
We present experimental details and results in this section. We first introduce the dataset and the metrics used in the experiments in Sec. IV-A. Then, Sec. IV-B presents the details of implementation. Next, the key components of the proposed approach are studied in Sec. IV-C. We compare our approach with the state-of-the-art methods in Sec. IV-D. Finally, Sec. IV-E presents the runtime analysis.

A. Dataset and Metrics
Dataset. We conduct experiments on the SPEED+ [22] dataset to demonstrate the effectiveness of our method. The SPEED+ dataset comprises images of the Tango spacecraft from the PRISMA [49] mission, consisting of three distinct domains, i.e., synthetic, lightbox, and sunlamp. Each image has a resolution of 1920 × 1200 and contains a single object. The synthetic domain comprises 59,960 images labeled with poses, which are generated using an OpenGL-based simulator. The lightbox and sunlamp domains contain 6,740 and 2,791 images, respectively, of a model of the same spacecraft captured in a robotic simulation environment. The satellite in the lightbox domain is illuminated by several lightboxes to approximate the diffuse light of Earth, while the same object in the sunlamp domain is exposed to an arc lamp to simulate direct sunlight. Since the annotations for the lightbox and sunlamp domains are not released, there are two UDA tasks with respect to SPEED+: synthetic→lightbox and synthetic→sunlamp.
Metrics. We adopt the metrics used in SPEC2021. The rotation error is defined as the angle between the predicted quaternion $\hat{q}$ and the ground truth $q$, i.e., $E_q = 2 \arccos(\hat{q}^T q)$. The translation error is defined as the difference between the predicted value $\hat{t}$ and the ground truth $t$, i.e., $E_t = \|\hat{t} - t\|_2$. Given an image, the scores for rotation and translation are defined as $S_q = E_q$ and $S_t = E_t / \|t\|_2$, respectively. The overall score is given as
$$S = \begin{cases} 0, & \text{if } S_q < \theta_q \text{ and } S_t < \theta_t, \\ S_q + S_t, & \text{otherwise,} \end{cases} \tag{18}$$
where $\theta_q = 0.169^\circ$ and $\theta_t = 2.173 \times 10^{-3}$ are the thresholds, which are determined by the calibration results of the facility used to create the dataset [22].
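For concreteness, the per-image score can be computed as in the sketch below. Three choices here are assumptions rather than statements of the text: the rotation error is measured in radians and compared against the threshold converted from degrees, the absolute value of the quaternion inner product is taken so that $q$ and $-q$ (the same rotation) give zero error, and the quaternions are normalized before use. The function name `pose_score` is hypothetical.

```python
import math
from typing import Sequence

THETA_Q = math.radians(0.169)  # rotation threshold, converted to radians
THETA_T = 2.173e-3             # normalized translation threshold


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def pose_score(q_pred: Sequence[float], q_gt: Sequence[float],
               t_pred: Sequence[float], t_gt: Sequence[float]) -> float:
    """Per-image score S following Eq. (18), under the stated assumptions."""
    # Assumption: abs() handles the q / -q ambiguity; inputs are normalized.
    dot = abs(sum(a * b for a, b in zip(q_pred, q_gt)))
    dot /= _norm(q_pred) * _norm(q_gt)
    s_q = 2.0 * math.acos(min(1.0, dot))              # E_q in radians
    e_t = _norm([a - b for a, b in zip(t_pred, t_gt)])
    s_t = e_t / _norm(t_gt)                           # normalized E_t
    if s_q < THETA_Q and s_t < THETA_T:
        return 0.0
    return s_q + s_t
```

The dataset-level score would then be obtained by averaging `pose_score` over all evaluated images.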

B. Implementation Details
Mesh reconstruction and data preparation. Since the 3D mesh of the satellite is not provided in SPEED+, we reconstruct the 3D mesh $M$ and 3D landmarks $P$ using source samples. However, due to illumination variations and material discrepancies, the texture of the satellite is not domain-agnostic, while the geometrical model is the same for the source and target samples. Therefore, we reconstruct the texture-less 3D mesh of the satellite to leverage the geometrical constraints.
The pipeline for 3D mesh reconstruction and annotation is shown in Fig. 3. Note that the source images can have two types of background: the earth background and the black background. Since the earth background usually introduces noise during reconstruction, we first train a classifier to select images with a black background. Next, to tackle the scale issue, we select 1000 images using the criterion $4.5\,\mathrm{m} \leq \|t\|_2 \leq 5\,\mathrm{m}$, where $t$ is the translation of the satellite relative to the camera. We first reconstruct the coarse mesh using the Multi-View Stereo (MVS) approach provided by COLMAP [50], [51]. Next, we use the Neural Implicit Surfaces (NeuS) [52] approach to refine the mesh. The refined mesh is then annotated with 5 categories, including antenna 1-3, solar panel, and body. Meanwhile, we select 11 landmarks on the surface of the mesh, following previous works [12], [15]. Finally, given the ground-truth poses for source images and the pseudo poses for target images, we use Blender to render the mesh for fine-grained masks, and use the pinhole camera model [53] to obtain 2D keypoints.
Architecture details. Our network comprises four modules: a backbone, a mask head, a heatmap head, and a discriminator. We construct the backbone using a transformer-based HRNet network [54], i.e., HRFormer-S with 7.8M parameters. The backbone extracts feature maps at 1/4 resolution of the input images. The output channel number is 32. The heatmap head and the mask head are each constructed using an atrous-spatial-pyramid-pooling (ASPP) module [55]. The module consists of five parallel branches: a global average pooling layer, a 1 × 1 convolution layer, and three 3 × 3 atrous convolution layers with rates of (6, 12, 18). Then, all feature maps are concatenated into one feature, whose channel number is adjusted using a 1 × 1 convolution layer. The discriminator has five 4 × 4 convolutional layers with channel numbers (16, 32, 64, 128, 1). The first and the second layers have a stride of 1, while the others have a stride of 2. Each convolutional layer except the last one is followed by a leaky ReLU parameterized by 0.2.
Experimental details. We implement the network using the PyTorch library and train our model using the AdamW [56] optimizer. All images are resized to a resolution of 640 × 400. The source images are first translated into target-like images using a CycleGAN [48] to reduce the bias towards the source domain. During training, we apply different data augmentation strategies on source and target samples using Albumentations [57] with the default parameter setting. The data augmentation on target samples consists of random horizontal and vertical flipping, random translation, scaling, and rotation. On source samples, the additional augmentation includes random Gaussian noise and random blur. We adopt a multi-level learning strategy to enhance domain adaptation by applying the prediction heads after stages 3 and 4 of the backbone. The balance parameters of the losses in stages 3 and 4 are set to 0.1 and 1, respectively. During pseudo-label generation, the threshold $N_{th}$ is empirically set to 8.

C. Ablation Study
We conduct a series of ablation experiments to investigate the critical components of our approach, including self-training, adversarial training, mask prediction, and pseudo-label generation with the geometrical constraints. Due to the unavailability of pose labels of target samples, we manually annotate 100/50 images from the lightbox/sunlamp domain by selecting semantic keypoints and solving the PnP problem. We adopt the manually annotated samples as the validation set. For each setting, we take 3 rounds to optimize Eq. (11) and define the first round as the pretraining stage. During pretraining, we set the initial learning rate to 0.001 and train the network for 12 epochs with a batch size of 5. For the second and the third rounds, we decrease the learning rate to 0.0005 and optimize the network for 10 epochs, which contain 10,000 steps. The results are reported in Table I. The neural network trained after round $j$ under the $i$th setting is denoted by $G_i^j$. In particular, model $G_1^0$ is the trivial baseline, as it is trained on target-like images transformed using a CycleGAN and no other UDA method is used.
Self-training. In each setting, models $G^1$ and $G^2$ achieve smaller pose estimation errors than model $G^0$. This demonstrates that the self-training framework can improve model performance for the UDA task in satellite pose estimation.
Adversarial training. We construct a baseline $G_1$ within the basic framework by optimizing Eq. (7). Then, we extend the framework with adversarial training, by aligning the features extracted by the backbone. The model trained in the extended framework is denoted by $G_2$. For each round, model $G_2^i$ significantly outperforms the baseline $G_1^i$, $i = 0, 1, 2$. After the pretraining stage, model $G_2^0$ reduces the translation error from 1.0069 m/0.9986 m to 0.2793 m/0.5211 m for the lightbox/sunlamp domain. When round = 2, model $G_2^2$ reduces the scores by half for both domains. These results demonstrate the effectiveness of adversarial training.
Mask prediction. We further extend the baseline with multi-task learning by appending a mask head after the backbone. The mask head predicts a binary mask (in model $G_3$) or a fine-grained mask (in model $G_5$) of the satellite. Different from model $G_2$, we apply adversarial training by aligning the predictions of the mask head. Model $G_3$ achieves more accurate pose estimation results than model $G_2$ in all rounds except the last one in terms of total score $S$. Further experimental evidence is provided by the comparison between model $G_3$ and model $G_5$. When round = 0, model $G_5^0$ significantly reduces the estimation error, by nearly 30% and 50% in terms of score $S$ on the two domains. The key reason is that the mask head in model $G_5$ provides more fine-grained predictions and thus effectively enhances structural constraints and contextual information. Moreover, using the proposed self-training framework, model $G_5^2$ achieves the best pose estimation results in all metrics and on all tasks. Therefore, leveraging fine-grained segmentation as an auxiliary task has a positive impact on 2D keypoint regression, resulting in better satellite pose estimation performance across domains.
To better analyze the function of the fine-grained segmentation, we compare the mean square error (MSE) between the ground-truth and the predicted keypoints on the validation set after each epoch. Specifically, we report the MSEs of the pretrained models, including $G_1^0$, $G_3^0$, and $G_5^0$, and the results are shown in Fig. 4. Note that model $G_1^0$ always has the largest MSEs on the lightbox and sunlamp domains, while $G_3^0$ benefits from the binary segmentation task. Moreover, $G_5^0$ shows the highest accuracy by predicting the fine-grained masks. This illustrates that the fine-grained segmentation can effectively improve the domain adaptation performance, and thus prevent the optimization of Eq. (11) from converging to local optima.
Geometrical constraints. We also study the role of the geometrical constraints, which are used during pseudo-label generation. We directly generate pseudo heatmaps and pseudo masks according to model predictions. The model trained using this setting is denoted by $G_4$. In terms of the three metrics on both domains, models $G_4^1$ and $G_4^2$ show degraded performance compared to $G_5^1$ and $G_5^2$, respectively. The performance degradation can be ascribed to the annotation noise in pseudo labels. With the geometrical constraints, models $G_5^1$ and $G_5^2$ are trained using cleaner and more accurate pseudo labels and thus show superior performance. The visual comparison between pseudo labels generated with and without geometrical constraints is illustrated in Fig. 6.
Multi-task learning. We adopt adversarial training and mask segmentation to promote keypoint heatmap regression. In Fig. 5, we study the effectiveness of multi-task learning strategies. [...] has larger heatmap losses, it achieves better pose estimation accuracy than model $G_0^0$. In contrast, $G_5^0$ performs output-level adaptation by introducing fine-grained segmentation, resulting in the smallest losses and the best pose estimation accuracy on each domain. The reason is that the feature-level adaptation is performed in a high-dimensional space, leading to the alignment of easier patterns [19]. Consequently, the feature distributions cannot be effectively matched. This demonstrates the effectiveness of our multi-task learning strategies.

D. Comparison with the State-of-the-Art Methods
We take KRN [13] and SPNv2 [58] as the baseline methods. SPNv2 [58] is based on EfficientDet [59] and comprises three prediction heads: the EfficientPose head [60] for object presence, bounding box, target rotation, and translation; the heatmap head for the 2D heatmaps; and the segmentation head for the binary mask of the satellite. Other techniques employed by SPNv2 include multi-scale design [59], [60], extensive data augmentation [61], style augmentation [62], AdaBN [63], and entropy minimization. Different from these methods, we utilize the geometrical constraints to develop a self-training framework and explore the fine-grained segmentation to boost performance.
To achieve better performance, we add an upsampling layer followed by a 3 × 3 convolution after the backbone and thus increase the feature resolution by a factor of 2. We take multiple rounds to optimize Eq. (11) and then compare our method with KRN [13], SPNv2 [58], and the top-performing methods in SPEC2021. The results are listed in Table II. On the sunlamp domain, our approach outperforms all other methods in terms of translation and rotation scores, taking 1st place in the challenge. More importantly, our method surpasses KRN [13] trained using real annotations by more than half in terms of score S. On the lightbox domain, our approach shows competitive performance and takes 3rd place in the challenge. Figure 7 visualizes results estimated by the models trained with and without the proposed framework. Note that our model can handle intense imaging noise, complicated illumination, surface reflection, and pose variations.

E. Runtime Performance
We follow the experimental settings in the ablation study and compare the runtime performance of the models using different backbones and feature map resolutions. The experiments are conducted on a PC with an Nvidia RTX 3090 GPU. Table III reports the results. We first replace the transformer-based backbone with a CNN-based backbone, i.e., HRNet [64]. Although the running time becomes shorter, at the expense of a larger parameter size and memory consumption, the pose estimation score S degrades from 0.221/0.099 to 0.253/0.163 on lightbox/sunlamp. Furthermore, we add an upsampling layer followed by a 3 × 3 convolution after the backbone to increase the feature resolution by a factor of 2. We observe significant improvements in pose estimation at the expense of increased computation. However, the increase in running time is only about 0.35%, which is negligible, since the upsampling module and the prediction heads are very simple.
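To illustrate why the extra module is cheap, the following Python sketch applies a 2× upsampling followed by a 3 × 3 convolution to a single-channel feature map. It is a toy illustration only: nearest-neighbour interpolation, zero padding, and the single channel are assumptions, since the interpolation mode and channel widths of the actual module are not stated here, and the function names are hypothetical.

```python
def upsample2x(feature):
    """Nearest-neighbour upsampling of a 2D feature map by a factor of 2."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def conv3x3(feature, kernel):
    """3x3 cross-correlation with zero padding and stride 1."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-1, 2):
                for dx in range(-1, 2):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * feature[yy][xx]
            out[y][x] = acc
    return out


def upsample_block(feature, kernel):
    """Double the spatial resolution, then refine with a 3x3 convolution."""
    return conv3x3(upsample2x(feature), kernel)
```

An H × W input yields a 2H × 2W output, so the prediction heads operate on four times as many spatial locations while the block itself adds only one small convolution.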

A. Limitation
One apparent limitation of our method is that only the predicted heatmaps are used to generate pseudo poses in Sec. III-E, while the predicted masks are ignored. Another limitation is the sparse representation of the satellite as a set of keypoints. Note that the pose estimation could be less stable when these keypoints are invisible due to truncation, low light, or high reflection. The third limitation of this work is that the implementations of the projection function F and the rendering function R are non-differentiable and performed offline. Future work should incorporate a differentiable rendering engine to simultaneously achieve online self-training and pose refinement based on fine-grained masks.

B. Conclusion
This paper explores the domain-agnostic geometrical constraints to achieve unsupervised domain adaptation in satellite pose estimation. The task is formulated as a minimization problem in a self-training framework by taking the target poses as latent variables. Meanwhile, the fine-grained segmentation is introduced as an auxiliary task to improve performance. The experimental results demonstrate that our method adapts well to the target domains, taking 1st place on the sunlamp domain and 3rd place on the lightbox domain in SPEC2021.

Fig. 1: The characteristics of the UDA task in satellite pose estimation. The satellite structures are the same in different domains with different illumination conditions. The samples come from three distinct domains, including synthetic (well illuminated), lightbox (low contrast, highly noisy), and sunlamp (specular reflection). Top: the original images from [22] with the 2D keypoints and the reprojected wireframe model. Bottom: the fine-grained masks. Note that the visualization results on the synthetic image are obtained using the ground-truth pose, while the poses used on the lightbox and sunlamp images are predicted by our network.

Fig. 4: The mean square errors (MSE) between the predicted and ground-truth keypoints in the pretraining stage.

Fig. 6: Visualization of pseudo labels of target samples. The first column shows the pseudo keypoints generated without and with the geometrical constraints in green crosses and blue points, respectively. The red vectors illustrate the differences between the two types of pseudo keypoints. The second/last column shows pseudo masks generated without/with the geometrical constraints.

Fig. 7: Visualization of pose estimation on target samples. The results achieved with and without our framework are shown in green and red colors, respectively.

Qifeng Yu received the B.S. degree from Northwestern Polytechnic University, Xi'an, China, in 1981, the M.S. degree from the National University of Defense Technology, Changsha, China, in 1984, and the Ph.D. degree from Bremen University, Bremen, Germany, in 1996. He is a Professor with the National University of Defense Technology. He has authored three books and published over 100 articles. His main research fields include image measurement, vision navigation, and close-range photogrammetry. Dr. Yu is a member of the Chinese Academy of Sciences.

TABLE I: ABLATION STUDY OF THE KEY COMPONENTS ON THE SPEED+ DATASET.

TABLE II: COMPARISON WITH THE STATE-OF-THE-ART METHODS AND THE TOP-PERFORMING METHODS IN SPEC2021.

TABLE III: RUN-TIME ANALYSIS WITH DIFFERENT BACKBONES AND FEATURE RESOLUTIONS.