Fast and Accurate Spacecraft Pose Estimation From Single Shot Space Imagery Using Box Reliability and Keypoints Existence Judgments

Real-time 6DOF (6 Degree of Freedom) pose estimation of an uncooperative spacecraft is an important part of proximity operations, e.g., space debris removal, spacecraft rendezvous and docking, and on-orbit servicing. In this article, a novel, efficient deep-learning-based approach is proposed to estimate the 6DOF pose of an uncooperative spacecraft from monocular vision measurements. First, we introduce a new lightweight YOLO-like CNN to detect the spacecraft and predict, in real time, the 2D locations of the projected keypoints of a previously reconstructed 3D model. Then, we design two novel models for predicting the bounding box (bbox) reliability scores and the probability of keypoint existence. The two models not only significantly reduce false positives but also speed up convergence. Finally, the 6DOF pose is estimated and refined using Perspective-n-Point and a geometric optimizer. Results demonstrate that the proposed approach achieves 73.2% average precision and 77.6% average recall for spacecraft detection on the SPEED dataset after only 200 training epochs. For the pose estimation task, the mean rotational error is 0.6812°, and the mean translation error is 0.0320 m. The proposed approach achieves competitive pose estimation performance with an extremely lightweight model ($\sim 0.89$ million learnable weights in total) on the SPEED dataset while being efficient enough for real-time applications.


I. INTRODUCTION
Spacecraft 6DOF pose estimation is an important part of space proximity operations, e.g., space debris removal and on-orbit servicing. The main solution is to estimate the pose of the spacecraft through monocular cameras, stereo cameras, RGB-D images, or data with depth information such as point clouds obtained by LiDAR [1], [2]. Considering the size, mass, power, computation, particular mission scenario, and sustained future costs, monocular sensors can ensure rapid pose determination with lower power, mass, hardware complexity, and cost than LiDAR and stereo camera sensors. However, since the monocular camera cannot directly measure the relative distance, the calculation complexity is increased, and the monocular camera may be less robust against variations in the lighting conditions of the space environment and the Earth background. Therefore, the relative navigation sensor selection is still an open problem. At present, several navigation systems that use a satellite-borne monocular camera for close-range operations have been proposed to accomplish fast target tracking and pose estimation [3]-[7]. Traditional pose estimation methods rely on handcrafted 2D-2D or 2D-3D keypoint and descriptor correspondences [8], [9]. These algorithms work for objects with sufficient texture, but typically fail when dealing with weakly textured or textureless objects. To alleviate such problems, most recent approaches rely on supervised training with spacecraft pose annotations.
The associate editor coordinating the review of this manuscript and approving it for publication was Andrea F. Abate.
With the success of deep learning in object recognition, deep neural networks have gradually been applied to 6D object pose estimation. Multiple end-to-end CNN-based neural networks [10]-[13] have been proposed to map RGB images to 6D poses directly. Although end-to-end pose regression methods are simple, it is not clear whether such end-to-end algorithms learn enough feature representations for pose estimation. Therefore, the generalization performance of these models is not robust, and they are suitable only for specific cases. As an alternative to end-to-end algorithms, CNN-based keypoint regression methods were proposed to solve this problem. Such methods usually use CNN-based networks to regress 2D-3D keypoints and then use geometric algorithms such as PnP to generate a 6D pose [14]-[18]. However, these methods either have a large number of network parameters or require a complicated training procedure.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we propose an approach to implement spacecraft 6DOF pose estimation in real time via a learnable method and a geometric algorithm. The key component of the approach is a new, lightweight tiny-YOLOv3-based neural network for predicting the 2D locations of the projected keypoints of a 3D model reconstructed beforehand. The network model contains two sub-nets: a spacecraft detection sub-net and a keypoints regression sub-net. The proposed network not only achieves encouraging performance but is also extremely lightweight, and it can easily be modified for multiple tasks.
In this article, we use the SPEED dataset [19] to train and test our pose estimation model. The SPEED dataset contains more than 10000 space images with the ground truth pose of the ''Tango'' satellite [20], and was provided by the Kelvins Pose Estimation Challenge [21] of ESA. Six ''Tango'' image examples from SPEED, which have significant variations in view, lighting condition, image background, and object size, are shown in Fig.2. Each image in the SPEED dataset has only one known object (i.e., Tango).
The contributions of this article include the following three aspects. First, we propose a box reliability judgment model and append it to the spacecraft detection sub-net to improve the detection accuracy. Second, we construct a keypoints existence judgment model and add it to the keypoints regression sub-net to accelerate network convergence and improve the regression accuracy. Third, the spacecraft's 6D pose is estimated from the 2D-3D correspondences produced by the keypoints regression network using PnP with RANSAC, and the 6D pose is further refined by a geometric optimizer. The keypoints regression network architecture of our approach is shown in Fig.1.
In the rest of the paper, we review the related works in Section II and explain our algorithm in detail in Section III. Sections IV and V present the experiments and conclusions, respectively.

II. RELATED WORKS
We now review and summarize the existing works on 6D monocular vision-based spacecraft pose estimation.

A. TRADITIONAL METHODS
Most traditional methods use features extracted from the 2D image, feature matching, and Perspective-n-Point (PnP) [22] to estimate the spacecraft's 6DOF pose. The features can be divided into local keypoints, corners, edges, etc. The features and descriptors are usually produced by handcrafted detectors, e.g., SIFT [23], SURF [24], BRIEF [25], the Canny edge detector [26], and the Hough Transform [27]. Such techniques produce feature correspondences and then use geometric methods (e.g., PnP, EPnP [28]) to generate the 6DOF pose. Shi et al. [29] combine the SIFT and BRIEF methods to extract the target interest points in the image, and EPnP is used to obtain the initial pose. Rondao and Aouf [30] leverage the FREAK descriptor [31] in combination with the EDLines detector [32] to extract keypoints, corners, and edges and find the correspondences between features, and an EPnP solver is utilized to generate the initial pose. Sharma et al. [3] use Weak Gradient Elimination to alleviate the effect of the image background on the accuracy of pose estimation, and use the Sobel operators [33] and the Hough Transform to extract the features. They also show that, compared with the PnP method, the EPnP method performs better in terms of both pose accuracy and runtime. However, because only a small number of image samples was used, it cannot be established that this method is robust to arbitrary scenarios, nor whether it is suitable for the variable illumination conditions of a typical space environment. Furthermore, Pesce et al. [34] use the GFTT algorithm [35] to extract the feature points of the target 3D model, and adopt a combination of PCA [36] and RANSAC [37] to obtain the correspondences between the target image and the 3D model. Finally, the EPnP solver is used to initialize the pose.
Such methods are fast and invariant to perspective, scale, and illumination changes, but are still sensitive to extreme illumination, occlusion, and scene clutter. They only reliably handle textured objects in high-resolution images [38], and they fail under significant variation in lighting conditions, object size, and image background, because the pose estimation performance relies on the quality of the feature correspondences. Thus, it is necessary to explore a new method to solve these problems. Nevertheless, earlier research indicates that, given reasonable 2D-3D keypoint correspondences, geometric methods are able to estimate the 6D pose well. Therefore, we use geometric algorithms in our pose estimation approach.

B. MACHINE-LEARNING-BASED METHODS
In recent years, with the success of machine learning, especially deep learning, in object recognition and object detection, machine learning has gradually been applied to 6D object pose estimation.

1) NON-NETWORK-BASED METHODS
Shi et al. [39] proposed a PCA-based method. The PCA algorithm matches the object in the camera image to a stored matrix of images that has been transformed to its eigenspace. Although the method can reduce the dimension of the training dataset through the eigenvectors, it is only valid under the conditions that each image contains only one object, the object is not occluded, and the target object is viewed under weak perspective.
Capuano et al. [40] proposed an EKF-SLAM-based pose estimation approach that does not require any knowledge of the target. This method uses the Harris corner detector [41] to obtain image features, and a single-beam LIDAR measurement of the distance to the extracted features is assumed in order to recover the scale of the reconstructed map. The method estimates the angular velocity from optical flow with an EKF (Extended Kalman Filter). However, the attainable performance of this method might worsen when processing images of an orbiting target in unfavorable illumination conditions.

2) CNN-BASED METHODS
The advantage of CNNs over traditional algorithms is an increase in robustness to adverse illumination conditions, as well as a reduction in computational complexity [42].
Multiple end-to-end pose regression CNN-based neural networks [10]- [13] have been proposed to map RGB images to 6D poses directly. Although end-to-end poses regression methods are simple, it is not clear whether such end-to-end algorithms have learned enough feature representations for pose estimation. Therefore, the generalization performance of the above models is not robust, and such models are suitable for specified cases. To solve the above problems, several CNN-based keypoints regression methods were proposed.
Instead of handcrafted descriptors, CNN-based keypoints regression methods produce the 2D-3D keypoints by deep learning. Such methods usually use CNN-based networks to regress 2D-3D keypoints and then use geometric algorithms such as PnP to generate a 6D pose [14]-[18]. The object detection model is a critical part of 6D pose estimation in CNN-based keypoints regression methods. Object detection networks such as Faster-RCNN [43] and YOLO [44] are usually applied first, and keypoints regression is performed afterwards. References [14] and [38] perform semantic segmentation on the image and then predict the 3D bounding box of the object. These two methods may not need accurate 3D models. To pursue high prediction accuracy, the common approach is to build large deep neural networks, but this results in slow computation. Lightweight networks such as MobileNets [45], ShuffleNets [46], and tiny-YOLOv3 [47] make a better speed-accuracy trade-off achievable. In our approach, we construct a lightweight network based on the tiny-YOLOv3 architecture and the convolution patterns used in lightweight CNNs to predict the 2D projected locations of the 3D keypoints in each image.
In recent years, spacecraft monocular vision-based pose estimation has usually exploited CNN-based techniques. Sharma et al. [48] use AlexNet [49] as the baseline, and the three-dimensional texture model of the space object is also leveraged to calculate the relative pose. Reference [50] proposes a deep learning framework for pose estimation based on orientation soft classification, which allows modeling orientation ambiguity as a mixture of Gaussians. References [51] and [52] simplify the pose estimation problem into a simple regression problem. Reference [51] uses a deep CNN to regress the pose parameters of a three-axis stabilized satellite. In [52], a VGG-19 [53] based architecture is used to complete the pose regression task. The Spacecraft Pose Network (SPN) [19], [54] is the seminal work on SPEED. SPN is built on the region proposal network (RPN) of Faster-RCNN and uses a hybrid of classification and regression neural networks together with the Gauss-Newton iteration method for the pose estimation problem. References [55] and [56] detect the object by regressing a 2D bounding box, and then a separate location regression network is used to predict the 2D locations of the known surface keypoints. The extracted 2D keypoints can be used in conjunction with the corresponding 3D model coordinates to compute the relative pose via PnP. Neural Style Transfer (NST) is applied to randomize the texture of the spacecraft in synthetically rendered images in [56]. In [57], a simplified stacked hourglass architecture [15] is constructed for feature detection, and a Covariant Efficient Procrustes Perspective-n-Points (CEPPnP) [58] solver is combined with an EKF method to enable robust monocular pose estimation for close-proximity operations around an uncooperative spacecraft.

III. METHODOLOGY
Our approach aims to implement spacecraft 6DOF pose estimation in real time via a learnable method and a geometric algorithm. We first reconstruct the spacecraft's 3D wireframe with a few 3D keypoints that are closely related to the object features, using 12 manually selected multiview images (high image quality and different object orientations). The 3D structure of the spacecraft is solved from the explicit annotation of the 2D image locations corresponding to the known 3D keypoints and the triangulation method. We then construct a new lightweight spacecraft detection sub-net inspired by tiny-YOLOv3 to recognize the spacecraft and predict its 2D bounding box in the input image. The 2D keypoint locations are subsequently regressed from the input image and the predicted 2D bounding box by the keypoints regression sub-net. Lastly, we use the 2D-3D keypoint correspondences, PnP, and a geometric optimization algorithm to estimate and refine the spacecraft's 6D pose.

A. SPACECRAFT 3D RECONSTRUCTION
Since the SPEED dataset does not provide the spacecraft's 3D model, we need to reconstruct the target's 3D model from the images and poses given in the dataset. And the 3D model will be utilized in subsequent research. It should be noted that this step is in general not required if a 3D model is available. We reconstruct the spacecraft's 3D wireframe model with a few 3D keypoints from 2D monocular vision-based images.
We select a small number of 3D keypoints, $\{x_k\}_{k=1}^{M}$, which are closely related to the object features of the known 3D model prior [20], to depict the 3D object model. Note that for a satellite, it is best to select as keypoints the corner points or endpoints of the solar panels, the corner points of the satellite body, and the endpoints of distinctive antennas. In this article, we select $M = 11$ keypoints to represent the satellite: eight corners and three antenna endpoints. From the training set of the SPEED dataset, we manually selected $N = 12$ high-quality images with different object orientations and object sizes, and we carefully marked the 2D locations of the $M$ 3D keypoints on the $N$ images. Let $p_k^i$ denote the $k$-th keypoint on the $i$-th image. The 3D keypoints $\{x_k\}_{k=1}^{M}$ can be reconstructed by solving the objective (1), in which $K$ refers to the camera intrinsic matrix, and $R_i$ and $t_i$ are the ground truth rotation matrix and translation vector of the $i$-th image, respectively. We solve (1) with the CVX solver in MATLAB. The 3D keypoints, the 3D object wireframe, and the 2D locations of the keypoints projected in the image are shown in Fig.3.
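The reconstruction step can be illustrated with a small sketch. It assumes a standard linear (DLT-style) triangulation in place of objective (1), whose exact form and CVX formulation are not reproduced here; the helper names `project`, `solve3`, and `triangulate` are hypothetical and chosen only for illustration. Given the intrinsic matrix $K$, the known poses $(R_i, t_i)$, and the annotated pixel locations of one keypoint, it recovers that keypoint's 3D position in the least-squares sense.

```python
# Linear (DLT-style) triangulation of one 3D keypoint from several annotated views.
# Assumption: this least-squares form stands in for objective (1); it is not the
# CVX/MATLAB formulation used in the article.

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def project(K, R, t, x):
    """Pinhole projection of the 3D point x into pixel coordinates (u, v)."""
    xc = [r + ti for r, ti in zip(matvec(R, x), t)]  # point in the camera frame
    uvw = matvec(K, xc)
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, 3):
            f = M[r][c] / M[c][c]
            for j in range(c, 4):
                M[r][j] -= f * M[c][j]
    x = [0.0] * 3
    for r in (2, 1, 0):  # back substitution
        x[r] = (M[r][3] - sum(M[r][j] * x[j] for j in range(r + 1, 3))) / M[r][r]
    return x

def triangulate(K, poses, pixels):
    """poses: list of (R, t); pixels: list of (u, v). Returns the 3D point."""
    AtA = [[0.0] * 3 for _ in range(3)]
    Atb = [0.0] * 3
    for (R, t), (u, v) in zip(poses, pixels):
        KR = matmul(K, R)
        Kt = matvec(K, t)
        # Each view gives two linear equations in x:
        # (u * P3 - P1) . [x; 1] = 0 and (v * P3 - P2) . [x; 1] = 0, with P = K [R | t].
        for img, row in ((u, 0), (v, 1)):
            a = [img * KR[2][j] - KR[row][j] for j in range(3)]
            rhs = -(img * Kt[2] - Kt[row])
            for i in range(3):
                Atb[i] += a[i] * rhs
                for j in range(3):
                    AtA[i][j] += a[i] * a[j]
    return solve3(AtA, Atb)  # normal equations of the stacked system
```

With noise-free annotations from at least two views with distinct viewing directions, the recovered point matches the true keypoint up to numerical precision; with the 12 hand-annotated views described above, the same least-squares form would average out annotation noise.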

B. SPACECRAFT DETECTION WITH BOX RELIABILITY JUDGMENT

1) BOX RELIABILITY JUDGMENT MODELING
We first establish a spacecraft detection sub-net based on tiny-YOLOv3 to predict the 2D bounding box and category of the object in the input image. Tiny-YOLOv3 has attracted a lot of attention for its fast computing speed, which makes it applicable to mobile devices.
To preserve the computation speed while improving the accuracy of space image recognition, we add the SE attention mechanism of SENets [59] after the backbone of tiny-YOLOv3. As mentioned above, tiny-YOLOv3 generates the 2D bounding box (center coordinates $c_x$, $c_y$, and the height $h$ and width $w$ of the box), the objectness score $Conf_{obj}$, and the category of the object. $Conf_{obj}$ describes the probability that each grid cell of the image contains an object. Hence, we can regard $Conf_{obj}$ as an evaluation of the reliability of the center coordinates $c_x$, $c_y$ of the bounding box.
Tiny-YOLOv3 can thus predict the reliability of the center coordinates and deterministic coordinate values of the bounding box, but it does not predict the reliability of the bounding box size, i.e., the reliability of $w$ and $h$, which leaves the network unable to distinguish bounding boxes of different quality. Tiny-YOLOv3 is sensitive to scale and uses Mean Squared Error (MSE) to regress the box coordinates. Thus, predicted boxes of different quality may have the same L2-norms (or L1-norms) but different IoUs (Intersection-over-Union). Fig.4 illustrates this problem (the green rectangle represents the ground truth, and the red rectangle represents the prediction). In Fig.4, the L2-norms between each of the three predicted bboxes and the ground truth bbox are the same, and the center point of each predicted bbox lies in the same cell, yet the boxes have different IoUs and therefore different quality. Different predicted boxes with the same box regression loss prevent the model from being trained well and lead to inaccurate object positioning, which in turn affects the regression accuracy of the 2D keypoints.
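The mismatch between coordinate distance and IoU is easy to reproduce numerically. The sketch below uses made-up box values (not taken from Fig.4) in $(c_x, c_y, w, h)$ form: three predictions that are equally far from the ground truth in the L2 sense but overlap it to different degrees.

```python
import math

def corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes in (cx, cy, w, h) form."""
    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

def l2_distance(box_a, box_b):
    """Euclidean distance between the two 4-vectors (cx, cy, w, h)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(box_a, box_b)))

# Illustrative values only: same center, one side length off by 10 px each.
ground_truth = (50.0, 50.0, 40.0, 20.0)
predictions = {
    "wider": (50.0, 50.0, 50.0, 20.0),
    "taller": (50.0, 50.0, 40.0, 30.0),
    "narrower": (50.0, 50.0, 30.0, 20.0),
}

for name, box in predictions.items():
    print(name, l2_distance(box, ground_truth), round(iou(box, ground_truth), 3))
```

All three predictions have an L2 distance of 10 from the ground truth, while their IoUs are 0.8, about 0.667, and 0.75, so an MSE-style coordinate loss alone cannot rank them.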
To solve the above problem, we change the output of tiny-YOLOv3. The new network output is shown in Fig.5. In addition to the original four box coordinates, objectness score, and category, we add reliability scores for the width $w$ and height $h$ of the box, denoted by $Pro_w$ and $Pro_h$, respectively. We model $w$, $h$, $Pro_w$, and $Pro_h$ as a two-dimensional Gaussian distribution, where $(w, h)$ of the bounding box is the mean of the Gaussian distribution and the corresponding reliability score $(Pro_w, Pro_h)$ is the standard deviation. This 2D Gaussian model is used to construct the box reliability loss function. By predicting the reliability of $w$ and $h$ and defining a new regression strategy, the accuracy of the box prediction is improved. Compared with [60], our network has fewer outputs but achieves the same purpose.

2) LOSS FUNCTION OF BOX RELIABILITY
Let $p_w$ and $p_h$ denote the width and height of the anchor prior, and let $t_w$ and $t_h$ denote the width offset and height offset of the predicted 2D bounding box, respectively. We define the loss function of the box reliability score as (2).
In (2), $\hat{H}$ denotes the predicted box reliability 2D Gaussian distribution, $\lambda$ denotes the loss weight, and $g$ denotes a grid cell of the image grid. We generate the ground truth box reliability score $truth_{br}$ as a 2D Gaussian distribution with means equal to the width and height of the ground truth box and standard deviations of 1.0. Let $L_{OriDec}$ denote the loss function of the original tiny-YOLOv3. The new loss function of our spacecraft detection sub-net combines $L_{OriDec}$ and the box reliability loss using the weights $\lambda_{OriDec}$ and $\lambda_{br}$. According to the results of multiple training runs, we set $\lambda_{OriDec}$ and $\lambda_{br}$ to 1.0 and 10.0, respectively.
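The two weights $\lambda_{OriDec}$ and $\lambda_{br}$ suggest that the detection loss is a weighted sum of the original tiny-YOLOv3 loss and the box reliability loss. A sketch of that reading follows; the symbols $L_{Det}$ (combined detection loss) and $L_{br}$ (the box reliability loss of (2)) are hypothetical names introduced here, and the weighted-sum form itself is an assumption rather than the article's stated equation.

```latex
L_{Det} = \lambda_{OriDec}\, L_{OriDec} + \lambda_{br}\, L_{br},
\qquad \lambda_{OriDec} = 1.0, \quad \lambda_{br} = 10.0
```

Under this reading, the box reliability term is weighted ten times more heavily than the original detection loss.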

C. SPACECRAFT KEYPOINTS REGRESSION WITH KEYPOINTS EXISTENCE JUDGMENT
We then construct the keypoints regression sub-net (as shown in Fig.1) to regress the 2D projected locations of the 3D keypoints. The end of the keypoints regression sub-net abandons fully connected layers and instead uses convolutional layers to generate a predicted heatmap for each keypoint. We use the sigmoid function to map the heatmap values to [0, 1], and search for the location of the maximum in each heatmap to obtain the predicted keypoint coordinates, $\{(\hat{x}_k, \hat{y}_k)\}_{k=1}^{M}$. MSE is used to calculate the error of the image heatmap, as given in (4). In (4), $\lambda_{heatmap}$ refers to the loss weight, $M$ denotes the number of keypoints, $\mathbb{1}_k^{visible}$ indicates that the ground truth keypoint $p_k$ is present in the image, $\hat{h}_{p_k}$ is the predicted heatmap, and $truth_{h_k}$ denotes the ground truth heatmap of each keypoint. The ground truth heatmap in this article is defined as a 2D Gaussian distribution with means equal to the coordinates of the keypoint and standard deviations equal to 1.0, where $(x_k, y_k)$ is the coordinate of the ground truth keypoint $p_k$.
If the keypoints regression network is trained only by minimizing (4), predicted heatmaps of different quality can have the same L2-norm or L1-norm with respect to the ground truth heatmap. Three predicted heatmaps with the same L2-norm or L1-norm are shown in Fig.6. The top right and bottom left heatmaps produce the correct keypoint coordinate, but the bottom right heatmap cannot correctly predict the coordinate of the keypoint. To solve this problem, improve the accuracy of keypoints regression, and accelerate network convergence, we use logistic regression to assign an existence score to the location of the keypoint in the heatmap, i.e., the probability that a grid cell in the heatmap contains the keypoint. We derive an effective loss formula based on two expectations: i) the keypoint area of the predicted heatmap $\hat{h}_{p_k}$ should converge to the keypoint area of its ground truth $h_{p_k}$; ii) the difference between the predicted $\hat{h}_{p_k}$ and its ground truth $h_{p_k}$ in the non-keypoint area should converge to zero. The keypoints existence loss is formulated accordingly. In it, $\lambda$ is the loss weight; $W$ and $H$ are the width and height of the heatmap, respectively; $\mathbb{1}^{co}_{k,i,j}$ indicates that the grid cell $g_{k,i,j}$ contains a keypoint, a value determined by the grid cell of the ground truth heatmap where the keypoint is located; $\hat{h}_{p_k}(g_{k,i,j})$ denotes the value of grid cell $g_{k,i,j}$ in the predicted heatmap of keypoint $p_k$; and $\mathbb{1}^{noco}_{k,i,j}$ indicates that there is no keypoint in the grid cell $g_{k,i,j}$, a value determined by the grid cells of the ground truth heatmap other than the location of the ground truth keypoint.
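The heatmap target, the argmax decoding, and the visibility-masked MSE can be sketched as follows. This is a minimal pure-Python illustration, not the article's implementation: the unnormalized Gaussian (peak value 1, matching the [0, 1] sigmoid range), the per-cell and per-keypoint averaging, and the function names are assumptions; only $\sigma = 1.0$ and the weight 50.0 for $\lambda_{heatmap}$ come from the text.

```python
import math

def gaussian_heatmap(width, height, x_k, y_k, sigma=1.0):
    """Ground truth heatmap: 2D Gaussian centered on the keypoint (x_k, y_k).
    Assumption: unnormalized, so the peak value is 1."""
    return [
        [math.exp(-((x - x_k) ** 2 + (y - y_k) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def argmax_2d(heatmap):
    """Return (x, y) of the largest value in a heatmap stored as rows of columns."""
    best_xy, best_val = (0, 0), heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_xy, best_val = (x, y), val
    return best_xy

def heatmap_mse(pred, truth, visible, weight=50.0):
    """Visibility-masked MSE over M heatmaps.
    Assumption: averaged over cells and over keypoints; the article's exact
    normalization in (4) is not reproduced here."""
    total = 0.0
    for p, t, vis in zip(pred, truth, visible):
        if not vis:
            continue  # keypoints absent from the image contribute nothing
        n_cells = len(p) * len(p[0])
        total += sum(
            (pv - tv) ** 2
            for p_row, t_row in zip(p, t)
            for pv, tv in zip(p_row, t_row)
        ) / n_cells
    return weight * total / len(pred)
```

For example, `argmax_2d(gaussian_heatmap(8, 6, 5, 2))` recovers the keypoint cell `(5, 2)`. Because the argmax is a discrete search, it is used only for decoding and not inside the loss.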
The coordinates of the predicted keypoints, $\{(\hat{x}_k, \hat{y}_k)\}_{k=1}^{M}$, can be generated by searching for the location of the maximum in each predicted heatmap. However, this search for the maximum is not differentiable, so we cannot optimize the network parameters by minimizing the MSE between the predicted and ground truth keypoints. Therefore, we use MSE to estimate the error between the value of the predicted heatmap grid cell where the keypoint is located and the corresponding ground truth, which is denoted as $L_{co}$. We also estimate the error between the values of the remaining grid cells in the predicted heatmap and the corresponding ground truth, which is denoted as $L_{noco}$. These two losses help improve the accuracy of keypoints prediction and accelerate network convergence. In the definitions of $L_{co}$ and $L_{noco}$, $truth_{h_{i,j,k}}$ denotes the value of the grid cell in the ground truth heatmap in which the keypoint $p_k$ is located. The loss of the keypoints regression network combines these terms with their loss weights. After several training experiments, we set $\lambda_{heatmap}$ to 50.0, $\lambda_{kp}$ and $\lambda_{co}$ to 1.0, and $\lambda_{nokp}$ and $\lambda_{noco}$ to 10.0. We train the model (spacecraft detection sub-net and keypoints regression network) by minimizing the total loss $L$, which is defined for a single image; for a mini-batch, the loss is averaged. Our training strategy is ''first stage training-later merge training'', i.e., we train the spacecraft detection sub-net first and then train it together with the keypoints regression network.
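A toy example shows why splitting the error into a keypoint-cell term and a remaining-cells term helps. The sketch below is an assumption-laden illustration of $L_{co}$ and $L_{noco}$ for a single heatmap: the exact definitions and normalization in the article are not reproduced, the averaging over non-keypoint cells is a choice made here, and `existence_losses` is a hypothetical name. The weights 1.0 and 10.0 are the reported values of $\lambda_{co}$ and $\lambda_{noco}$.

```python
def existence_losses(pred, truth, kp_cell):
    """Split the squared error of one heatmap into the keypoint cell (l_co)
    and the mean over all remaining cells (l_noco).
    pred, truth: H x W lists of floats; kp_cell: (row, col) of the ground truth keypoint."""
    kp_cell = tuple(kp_cell)
    i0, j0 = kp_cell
    l_co = (pred[i0][j0] - truth[i0][j0]) ** 2
    l_noco, count = 0.0, 0
    for i, row in enumerate(pred):
        for j, val in enumerate(row):
            if (i, j) != kp_cell:
                l_noco += (val - truth[i][j]) ** 2
                count += 1
    return l_co, l_noco / count

def weighted_existence_loss(pred, truth, kp_cell, lambda_co=1.0, lambda_noco=10.0):
    l_co, l_noco = existence_losses(pred, truth, kp_cell)
    return lambda_co * l_co + lambda_noco * l_noco

# Toy 3x3 ground truth with the keypoint in the center cell.
truth = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
weak_peak = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]   # right place, low peak
stray_blob = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # spurious response elsewhere
```

Both predictions have the same plain sum of squared errors (0.25), yet the split reports `(0.25, 0.0)` for `weak_peak` and `(0.0, 0.03125)` for `stray_blob`, so the two failure modes can be weighted separately.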

D. POSE ESTIMATION
Finally, the 6D pose of the spacecraft is solved from the 2D-3D keypoint correspondences using PnP and a geometric optimizer. Although the proposed network can produce accurate 3D-2D correspondences, it cannot guarantee that all correspondences for each image are correct. Since the PnP method requires more than three pairs of accurate 3D-2D correspondences, it is better to use RANSAC to improve the robustness of the PnP method when the network detects some 2D keypoints of an image incorrectly (i.e., when outliers exist).
First, we use PnP with RANSAC (see footnote 2) to generate the initial value of the transformation matrix $\hat{E}_0 = [R_0 | t_0]$ and then perform bundle adjustment to optimize the pose. We optimize the 6D pose $\hat{E}$ by solving (12), where $K$ denotes the intrinsic matrix of the camera, $\hat{p}_k$ refers to the estimated 2D keypoints, and $x_k$ denotes the coordinates of the 3D keypoints of the object in the world coordinate system. The loss in (12) is the Log-cosh loss function. We use Levenberg-Marquardt (LM) [61] to solve (12).
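To make the refinement objective concrete, the sketch below evaluates a Log-cosh reprojection cost for a candidate pose. It assumes that the cost sums $\log\cosh$ of the per-axis pixel residuals between the estimated keypoints and the projected 3D keypoints; the exact form of (12) is not reproduced here, and `project` and `reprojection_cost` are hypothetical helper names. It only evaluates the cost and does not perform the Levenberg-Marquardt iterations.

```python
import math

def log_cosh(x):
    """Numerically stable log(cosh(x)): |x| + log(1 + exp(-2|x|)) - log(2)."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def project(K, R, t, x):
    """Pinhole projection of the 3D point x with pose (R, t) and intrinsics K."""
    xc = [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]
    uvw = [sum(K[i][j] * xc[j] for j in range(3)) for i in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def reprojection_cost(K, R, t, points_3d, keypoints_2d):
    """Sum of log-cosh residuals between predicted 2D keypoints and projections.
    Assumption: residuals are taken per pixel axis."""
    cost = 0.0
    for x, (u, v) in zip(points_3d, keypoints_2d):
        pu, pv = project(K, R, t, x)
        cost += log_cosh(pu - u) + log_cosh(pv - v)
    return cost
```

The Log-cosh function behaves like $r^2/2$ for small residuals and like $|r| - \log 2$ for large ones, which limits the influence of a badly regressed keypoint that survives the RANSAC stage.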

IV. EXPERIMENTS
In this section, we evaluate our method for predicting the 6D pose of the spacecraft on the SPEED dataset. We then compare our method against previous state-of-the-art monocular RGB-based approaches that do not use depth information for spacecraft 6D pose estimation on the SPEED dataset.

A. TRAINING AND TEST DATASET
There are 12000 synthetic images and 5 real images with ground truth pose in the training set, and 2998 synthetic images and 300 real images in the test set of the SPEED dataset [19]. The size of each image in the SPEED dataset is 1920 × 1200 px. The parameters of the camera used to capture the SPEED images are shown in Table 1. The camera parameters determine the intrinsic matrix of the camera, and we ignore the camera distortion in this article. The labels of the test set in SPEED are not provided, so we cannot conduct the evaluation on the test set. Therefore, we randomly select 80% of the synthetic images from the training set of the SPEED dataset as our training set, and the remaining images, including the 5 real images, are used as the test set. Fig.7 and Fig.8 show the distribution of $t$ and $q$ in our test images, respectively. $t$ denotes the relative position of the target body frame with respect to the camera frame, and $q$ represents the quaternion of the target. In Fig.8, the quaternion is parameterized as Euler angles.
2 We used the routine solvePnPRansac in Python-opencv.

B. EVALUATION METRICS
Spacecraft (object) detection performance is reported as the Intersection-over-Union (IoU) score, which is defined as the intersection area ($A_I$) divided by the union area ($A_U$) of the predicted 2D bounding box and the ground truth 2D bounding box, i.e., $IoU = A_I / A_U$. The pose estimation performance is reported as the rotation error $\xi_R$ and the translation error $\xi_T$ [21]. We define the rotation error as the angle of the rotation that aligns the estimated and ground truth orientations. Let $\hat{q}$ denote the estimated rotation quaternion and $q$ the ground truth quaternion of an image. The rotation error is then $\xi_R = 2 \arccos(|z_r|)$, where $z_r$ denotes the real part of $\hat{q} \cdot conj(q)$ and $conj(\cdot)$ denotes the conjugate. The translation error for each image is simply the L2-norm of the difference between the estimated and ground truth translation vectors. Let $\hat{t}$ and $t$ denote the predicted and ground truth translation vectors of an image; the translation error is then $\xi_T = \| t - \hat{t} \|_2$.
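The two pose metrics can be sketched in a few lines of Python. The sketch assumes scalar-first quaternions $(w, x, y, z)$, normalizes them defensively, and uses the standard relation between the real part of a unit quaternion and its rotation angle; the function names are hypothetical.

```python
import math

def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def quat_conj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def rotation_error_deg(q_est, q_gt):
    """Angle (degrees) of the rotation aligning the estimated and true orientations."""
    z_r = quat_mul(normalize(q_est), quat_conj(normalize(q_gt)))[0]
    z_r = min(1.0, abs(z_r))  # abs() handles the q / -q ambiguity; min() guards acos
    return math.degrees(2.0 * math.acos(z_r))

def translation_error(t_est, t_gt):
    """L2-norm of the difference between estimated and true translation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))
```

Taking the absolute value of $z_r$ matters because $q$ and $-q$ describe the same orientation; without it, a sign flip in the network output would be reported as a 360° error.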

C. TRAINING DETAILS
Our network is trained on an NVIDIA Titan X. The network is optimized by SGD with a momentum of 0.9 and a weight decay regularization of 0.0001. The batch size is 5 images. Training starts with weights from the backbone of tiny-YOLOv3 trained on the COCO dataset. The initial learning rate is set to 0.001 and decayed 2 times every 50 epochs. We use K-means to determine the width and height of six anchor priors (width, height): (30, 47), (42, 88), (55, 59), (73, 105), (103, 169), (172, 254). We construct the proposed model with the PyTorch framework.
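As a sketch of the learning-rate schedule, the function below assumes that ''decayed 2 times every 50 epochs'' means the rate is divided by 2 at every 50th epoch; this reading is an assumption, and the function name and parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay_factor=2.0, step=50):
    """Step schedule: divide the learning rate by decay_factor every `step` epochs.
    Assumption: this is one reading of the schedule described in the text."""
    return base_lr / (decay_factor ** (epoch // step))

# Learning rate at the start of each 50-epoch stage of a 240-epoch run.
schedule = [(epoch, learning_rate(epoch)) for epoch in range(0, 240, 50)]
```

Under this reading, the rate is 0.001 for epochs 0 to 49, 0.0005 for epochs 50 to 99, and so on, reaching 0.0000625 for epochs 200 to 239.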

1) POSE ESTIMATION
The training loss of the proposed model converges to 0.015 after 240 epochs. To illustrate the advantages of using the attention mechanism and the box reliability loss, we compared the proposed model with a model using only the attention mechanism and a model without the attention mechanism and box reliability judgment. After 240 iterations, the training loss of the model using only the attention mechanism converged to 0.7, while the training loss of the model without the attention mechanism and box reliability judgment converged to 1.315. The training losses of the three models are shown in Fig.9. From Fig.9 (a), we find that the model with the attention mechanism converges faster than the model without the attention mechanism and box reliability judgment. The output of the keypoints regression is a set of 11 heatmaps, so each channel of the convolutional layer output is very important for the regression results. By adding a channel attention mechanism, we can control the features of each channel and accelerate the convergence of the regression model. To further illustrate the effectiveness of the proposed box reliability judgment model, we randomly selected 10% of the data from the test set as the validation set, and evaluated the detection performance of the spacecraft detection sub-net with and without the box reliability judgment model on the validation set. The validation Mean Average Precision (mAP) and mean IoU (mIoU) during training are shown in Fig.10 (''No BoxRel'' and ''Box Rel'' denote the spacecraft detection sub-net without and with the bbox reliability judgment model, respectively). The results show that exploiting the box reliability judgment model improves the accuracy of spacecraft detection and speeds up the convergence of the model.
Since keypoints regression requires the bounding box information, it is necessary to add the reliability of the bbox to the output of spacecraft detection. The results show that our proposed model is effective.
We then verified the impact of the keypoints existence judgment model on the keypoints regression. First, we established a comparison model, i.e., the keypoints regression sub-net without the keypoints existence judgment model, and named it the ''NoKE model''. Then, the keypoints regression sub-net with the keypoints existence judgment model (referred to as the ''KE'' model) and the NoKE model were trained separately. The experimental environment and data are the same as those used for training the spacecraft detection sub-net. Finally, the heatmap regression loss and the total loss of the keypoints regression sub-net were obtained. The heatmap loss and total loss of the KE and NoKE models are compared in Fig.11.
The results show that the keypoints regression sub-net using the keypoints existence judgment model converges quickly and achieves high keypoints regression accuracy. The keypoints regression sub-net that does not use the keypoints existence judgment converges slowly, and its keypoints regression accuracy is low. Furthermore, we calculated the translation estimation error and orientation estimation error of the proposed network for each image in the test set, and calculated the 1 standard deviation (1σ) of the translation error and rotation error. The 1σ of the translation error is 0.025 m and the 1σ of the orientation error is 0.29325°. Fig.12 and Fig.14 show the translation estimation errors, ξ T , and the orientation estimation errors, ξ R , on the test set. We also counted the dispersion of the errors, as shown in Fig.12. The results show that all ξ T are below 0.12 m, and most of ξ R are below 1.5 degrees.
The $\xi_T$ in Fig.14 is decomposed into the errors along the x-axis, y-axis and z-axis. The x-axis and y-axis are aligned with the image plane axes, and the z-axis is aligned with the camera boresight direction. By convention, the image-plane x-axis and y-axis point to the right and downward, respectively. The horizontal axis of Fig.14 shows the image indices sorted by $\xi_T$ and $\xi_R$. As shown in Fig.14, the proposed method performs well in estimating the relative position along the x-axis and y-axis, with errors below 0.016 m, while the estimation errors of the relative position along the z-axis are below 0.1 m. Moreover, most of the orientation errors are concentrated between 0.25 and 1.25 degrees. Fig.13 shows the mean $\xi_T$ and mean $\xi_R$ as functions of the mean relative distance. According to the curves in Fig.13, the errors along the z-axis are roughly 10 times the errors along the x-axis and y-axis. Because the predicted bbox directly affects the estimation accuracy of the relative position along the x-axis and y-axis, the small errors along these axes indicate that the proposed spacecraft detection sub-net locates the target in the image accurately. For most relative distances, the mean $\xi_R$ is between 0.6 and 0.8 degrees. These experimental results show that the proposed method performs well in the pose estimation of space objects.
In addition, we compared our method against SPN [19], [54], HRNet-PE [55] (a name used only in this article) and URSONet [50]. Table 2 reports the results. The proposed method achieves competitive performance in both spacecraft detection and pose estimation. The rotational error is smaller than 1°, and the translation error is smaller than 1 meter.
References [19], [54] report experiments on the test dataset. Since the authors of [19], [54] did not release their program code, we used the experimental results reported in those papers for comparison.
Reference [55] used a training cross-validation dataset to train its network in order to improve the accuracy of pose estimation. Our training method does not use this mechanism, but still achieves competitive performance. Fig.15 and Fig.16 show the spacecraft detection, keypoints regression and pose estimation results on a sample of the test images. Owing to space limitations, we show the heatmaps of only 3 keypoints. In Fig.15, the images are img007510, img001058, img013511, img000068, and img013135 in the training set of the SPEED dataset. In Fig.16, the images are img010873, img007926, img012856, img007898, img007343, img007628, img007396, img007517, and img007582 in the training set of the SPEED dataset.

2) PERFORMANCE COMPARISON
To further evaluate the effectiveness of the proposed keypoints regression network, we replaced its spacecraft detection sub-net with 5 mainstream object detection networks: YOLO-v2 [62], YOLO-v3 [44], SSD [63], RetinaNet [64], and Faster R-CNN [43]. The training set and test set described above were used to train the 5 network models and to obtain the predicted 2D bounding boxes and 6DOF pose. Because the comparison models differ from the proposed keypoints regression network only in the spacecraft detection sub-net, we only need to evaluate the detection performance to verify the effectiveness of our approach. All the models were implemented in the PyTorch framework. The backbones of the Feature Pyramid Network (FPN) include ResNet-50 [65], ResNet-101 [65], and ResNeXt-101 [66]. Table 3 reports the performance of these state-of-the-art works and the proposed model (100 training epochs). The experimental results demonstrate that the proposed network achieves performance competitive with other popular object detection frameworks. The following metrics were used in Table 3: average precision (AP), average recall (AR), and number of parameters (Params). If the IoU threshold is $\gamma$, the predicted bounding box is regarded as a true positive (TP) only if the IoU between the predicted bounding box and its ground truth is greater than $\gamma$; otherwise, it is a false negative (FN). The precision, $P$, and recall, $R$, at IoU threshold $\gamma$ are defined as

$$P(\gamma) = \frac{TP}{TP + FP}, \qquad R(\gamma) = \frac{TP}{TP + FN}$$

where FP denotes false positives, TP + FP is the total number of bounding boxes output by the detection sub-net, and TP + FN is the total number of ground truth bounding boxes. Precision is the percentage of detected spacecraft instances that are relevant, and recall is the percentage of all relevant spacecraft instances that are correctly retrieved by the detection sub-net.
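The precision and recall at a single IoU threshold can be illustrated with a short sketch. It is not the evaluation code behind Table 3: it assumes boxes are given as (x_min, y_min, x_max, y_max) tuples, that each image has exactly one ground truth box and at most one predicted box (None when nothing is detected), and the function names and toy boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, gamma):
    """Precision and recall at IoU threshold gamma.

    preds[i] is the predicted box for image i, or None if nothing was detected;
    truths[i] is the single ground truth box of image i.
    """
    tp = sum(1 for p, g in zip(preds, truths)
             if p is not None and iou(p, g) > gamma)
    n_detected = sum(1 for p in preds if p is not None)  # TP + FP
    n_truth = len(truths)                                # TP + FN
    precision = tp / n_detected if n_detected else 0.0
    recall = tp / n_truth if n_truth else 0.0
    return precision, recall


# Toy boxes (made up): one exact hit, one poor overlap, one missed detection.
truths = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (5, 5, 15, 15), None]
p, r = precision_recall(preds, truths, gamma=0.5)
```

In this toy case only the first prediction exceeds the threshold, so one of the two predicted boxes and one of the three ground truth boxes are counted as true positives.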
Since each image in the SPEED dataset contains only one object and therefore only one ground truth bounding box, TP + FN equals the total number of images in the test set. AP is defined over multiple values of $\gamma$:

$$AP = \frac{1}{|\Upsilon|} \sum_{\gamma \in \Upsilon} P(\gamma)$$

where $\Upsilon = [0.5, 0.55, 0.60, \ldots, 0.95]$ is the set of IoU thresholds, and $|\Upsilon|$ denotes the length of $\Upsilon$.
In this article, AR is defined as the average of the recall $R$ over the different IoU thresholds:

$$AR = \frac{1}{|\Upsilon|} \sum_{\gamma \in \Upsilon} R(\gamma)$$

In our experiments, the AP at $\gamma = 0.5$ ($AP_{.50}$) and at $\gamma = 0.75$ ($AP_{.75}$) were also calculated and reported.
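Averaging over the IoU thresholds can be sketched as follows. The sketch assumes that AP and AR are plain means of the precision and recall over the thresholds 0.50 to 0.95 in steps of 0.05, which is how the definitions above read; it is not the interpolated area-under-curve AP used by some benchmarks. The function name and the IoU values are hypothetical.

```python
def average_precision_recall(ious, n_images, thresholds=None):
    """AP and AR as means of precision and recall over IoU thresholds.

    ious holds the IoU between each predicted box and its ground truth
    (one entry per image in which a box was predicted);
    n_images is the number of test images, i.e. TP + FN.
    """
    if thresholds is None:
        thresholds = [0.50 + 0.05 * k for k in range(10)]  # 0.50 ... 0.95
    precisions, recalls = [], []
    for gamma in thresholds:
        tp = sum(1 for v in ious if v > gamma)
        precisions.append(tp / len(ious) if ious else 0.0)
        recalls.append(tp / n_images if n_images else 0.0)
    ap = sum(precisions) / len(thresholds)
    ar = sum(recalls) / len(thresholds)
    return ap, ar


# Toy IoUs (made up): four detections over five test images.
toy_ious = [0.92, 0.78, 0.61, 0.40]
ap, ar = average_precision_recall(toy_ious, n_images=5)
ap50, _ = average_precision_recall(toy_ious, n_images=5, thresholds=[0.50])
ap75, _ = average_precision_recall(toy_ious, n_images=5, thresholds=[0.75])
```

Passing a single threshold gives the fixed-threshold variants such as the AP at 0.5 and at 0.75.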
In addition, according to the size of the object in the test image, we defined six further indicators for AP and AR: $AP_S$, $AP_M$, $AP_L$, $AR_S$, $AR_M$, and $AR_L$. Based on the area of the ground truth 2D bounding box in each image, we divided the test images into small size (area $< 38^2$ pixels), medium size ($38^2 <$ area $< 101^2$ pixels) and large size (area $> 101^2$ pixels). $AP_S$, $AR_S$, $AP_M$, $AR_M$, $AP_L$, and $AR_L$ represent the AP and AR for small, medium, and large instances, respectively.
The lightweight design of the network was evaluated as well. Fig.17 illustrates the number of parameters of different backbones and the inference time of the YOLO series frameworks measured on an NVIDIA Titan X. Compared with the other frameworks (YOLOv2, YOLOv3, and tiny-YOLOv3, described before), the proposed network, with ∼0.89 million learnable weights in total, takes only ∼20.0 milliseconds to process a 1920 × 1200 SPEED image, i.e., about 50 frames per second (fps). Therefore, although its detection and pose estimation performance is slightly below that of other state-of-the-art frameworks, the proposed spacecraft detection sub-net with channel attention and box reliability judgment, together with the keypoints regression sub-net with keypoint existence judgment, is a more suitable solution for space object pose estimation in high-speed and low-cost scenarios.
Although the proposed model correctly estimates the pose of the object in most images, there are still failure cases, which are mainly caused by failures of spacecraft detection. Two such cases are shown in Fig.18. The object in Fig.18 (a) is too small and the lighting condition is poor. In Fig.18 (b), the gray levels of the earth background and the object are hard to distinguish. In both cases, the features extracted by the spacecraft detection sub-net are not sufficient to describe the object, so the detection result is wrong. In future work, we will therefore explore how to improve the accuracy of spacecraft detection in these scenarios.
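For reference, the size-based split behind the small, medium and large AP and AR indicators described earlier in this subsection can be sketched as below. The thresholds $38^2$ and $101^2$ come from the text; the box format (x_min, y_min, x_max, y_max) in pixels, the function names, and the choice to count an area exactly equal to a threshold as medium are assumptions made for illustration.

```python
def size_category(box, small_side=38, large_side=101):
    """Label a ground truth (x_min, y_min, x_max, y_max) box by its pixel area."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < small_side ** 2:
        return "small"
    if area > large_side ** 2:
        return "large"
    return "medium"  # includes the boundary values, an assumed convention


def split_by_size(truth_boxes):
    """Map each size label to the indices of the images that fall in it."""
    groups = {"small": [], "medium": [], "large": []}
    for idx, box in enumerate(truth_boxes):
        groups[size_category(box)].append(idx)
    return groups


# Toy ground truth boxes (made up, in pixels).
toy_boxes = [(0, 0, 30, 30), (0, 0, 50, 60), (0, 0, 200, 150)]
groups = split_by_size(toy_boxes)
```

The per-size indicators are then obtained by evaluating AP and AR separately on the image indices of each group.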

V. CONCLUSION
In this article, we propose a novel lightweight tiny-YOLOv3-based framework to estimate the 6DOF pose of a known spacecraft from a single space image in real time. The box reliability judgment model and the keypoints existence judgment model are proposed to improve the accuracy of spacecraft detection and keypoints regression, and to accelerate network convergence. The 6DOF pose of the spacecraft is estimated from the 2D-3D correspondences produced by the proposed keypoints regression network, using PnP with RANSAC. For pose refinement, we use the log-cosh loss and Levenberg-Marquardt (LM) optimization to suppress wrong and inaccurate predictions. Experimental results show that the mean rotational error is 0.6812°, and the mean translation error is 0.0320 m. The proposed approach achieves competitive detection and pose estimation performance on the SPEED dataset while being extremely lightweight (∼0.89 million learnable weights in total), which makes it low-cost and suitable for real-time use. Future work should consider how to improve the spacecraft detection accuracy against the earth background and under extremely poor lighting conditions. We will also study how to further reduce the size of the network model while maintaining high keypoints regression accuracy, so that the model is more applicable to on-orbit processing.