2D-Key-Points-Localization-Driven 3D Aircraft Pose Estimation

In this paper, we are interested in inferring the 3D pose of an aircraft by leveraging 2D key-points localization. Monocular-vision-based pose estimation for aircraft can be widely utilized in airspace tasks such as flight control systems, air traffic management, autonomous navigation and air defense systems. Nonetheless, prior methods that use direct regression or classification cannot meet the high-precision requirements of aircraft pose estimation, while other approaches rely on PnP algorithms that need additional prior knowledge such as a template 3D model or depth. These methods do not take full advantage of the correlation between 2D key-points and 3D pose. In this paper, we present a multi-branch convolutional neural network, named the AirPose network, to address 3D pose estimation based on 2D key-points information. In addition, a novel feature fusion method is explored to enable the orientation estimation branch to adequately exploit key-points information. Our feature fusion method significantly decreases the 3D pose estimation error and also avoids the involvement of RANSAC-based PnP algorithms. To address the lack of a dedicated aircraft 3D pose dataset for training and testing, we build a visual simulation platform on Unreal Engine 4 that applies domain randomization (DR), named the AKO platform, which generates aircraft images automatically labeled with 3D orientation and key-points location. The resulting dataset is called the AKO dataset. We carry out a series of ablation experiments to evaluate our framework for aircraft detection, key-points localization and orientation estimation on the AKO dataset. Experiments show that the proposed AirPose network, leveraging the AKO dataset, achieves convincing results for each of the tasks.


I. INTRODUCTION
3D pose estimation of aircraft is a challenging problem facilitated by the well-developed aircraft detection algorithms of recent years [1]-[5]. As a higher-level task built on aircraft detection, 3D aircraft pose estimation can be widely utilized in many airspace tasks, such as vision-based flight control systems [6], [7], air traffic management [8], autonomous navigation and air defense systems. Compared with infrared sensors and radar systems, a monocular visible-light camera can capture images with more detail and higher resolution. With the tremendous development of deep learning, recent methods regress or classify [10] the 3D pose directly from images in an end-to-end manner. Some approaches utilize key-points locations as an intermediate representation to improve performance by smoothing the model training process, followed by a Random Sample Consensus (RANSAC) based Perspective-n-Point (PnP) algorithm, so the overall pipeline is not an end-to-end network. Besides, the demand for 3D object models restricts such networks to specific objects, which weakens their general applicability.
In this paper, we propose a 3D aircraft orientation estimation pipeline that takes the output of a key-points localization network as prior knowledge. The network fuses the key-points localization information, as a geometry feature, with the extracted color feature to provide more robust aircraft pose information. This enables us to exploit the key-points feature while avoiding RANSAC-based PnP algorithms and the need for 3D models of the objects.
Another problem in applying deep-learning-based methods to uncommon situations is the difficulty of data collection. Acquiring abundant high-quality images is extremely laborious in the aircraft pose estimation context. To address this problem, we build the AKO platform based on Unreal Engine 4 (UE4) to generate aircraft images automatically labeled with 3D orientation and key-points location. We use the AKO platform to construct a dataset, named the AKO dataset, containing 15000 synthetic images for training and testing. The synthetic images are generated by merging the aircraft images with background images or constructed scenes. We adopt the domain randomization (DR) technique [12] in the AKO platform to strengthen the general applicability of the network.
Through experiments and an ablation study, each of our three network branches shows competitive performance on the AKO dataset in comparison with the state-of-the-art algorithm [13]. We conclude that our simple but effective feature fusion method greatly improves accuracy over directly inferring the 3D orientation of the object using only feature maps extracted from the original image. Compared with methods in previous works such as [33], [34], [39], which need a specific 3D model of the object for pose estimation, our network shows convincing generalization ability across different aircraft models.
Contributions: In the light of previous work, the contributions of our work are summarized as follows:
i. We propose a novel aircraft-oriented network, named the AirPose network, to address the issue of aircraft 3D pose estimation. To the best of our knowledge, this is the first work to combine key-points localization and 3D pose estimation in a single end-to-end architecture.
ii. We explore a feature fusion method that effectively fuses the key-points geometry feature with the original color feature, which significantly improves the 3D pose estimation accuracy.
iii. We construct an image data generation platform for 3D aircraft pose estimation that applies domain randomization, which enables us to build an image dataset, i.e. the AKO dataset, at a low cost. We make the AKO dataset publicly available at https://www.kaggle.com/portguss/ako-dataset.

II. RELATED WORK
A. OBJECT DETECTION AND 2D KEY-POINTS LOCALIZATION
Object detection has been researched for many years. Recent approaches, such as R-CNN [14], Fast R-CNN [15], Faster R-CNN [16] and YOLO [17], show strong performance on the detection task. Key-points localization has also attracted considerable study in recent years [9], [18]-[23], especially for human body pose estimation, since Toshev and Szegedy used deep learning to directly regress the 2D coordinates of key-points with a multi-stage network architecture [18]. The fully convolutional network (FCN) proposed by Long et al. enables dense per-pixel predictions, which greatly improves key-points localization accuracy [19]. Based on the FCN, Newell et al. proposed an hourglass architecture in which features are processed across all scales to improve performance [21]. He et al. combined Faster R-CNN and the FCN to propose the Mask R-CNN architecture for object detection and instance segmentation, which can be easily generalized to the key-points localization task [9]. Unlike the human body, the aircraft body structure is rigid. This rigidity makes aircraft key-points relatively easier to detect.

B. 3D AIRCRAFT POSE ESTIMATION
Some previous aircraft pose estimation works focus on handcrafted feature selection and aircraft geometric structure, such as line feature detection [3]-[5], [24], [25] and skeleton extraction algorithms [26]-[28]. These methods have low computational complexity, but also limitations. When certain parts of the object are self-occluded, geometric features such as line features are not detectable, which makes these methods vulnerable to occlusion. Moreover, these algorithms require at least two sensors to yield 3D pose information [3], [4]. Recently, driven by the effectiveness of convolutional networks [19], [29], [30], many approaches to 3D pose estimation have been based on deep learning [31]. These algorithms can be divided into methods that use a 3D model during inference [23], [32]-[35] and methods without 3D model matching that directly yield 3D pose information from 2D images [36]-[40]. Mahendran  semantic key-points localization step to improve the orientation estimation performance [23], [42].
Broadly, these methods can yield accurate orientation estimates, but they impose strict prerequisites, such as depth information [10], [33], [43] or a precise model of the target to be detected [34], [35]. In aircraft pose estimation, depth information is hard to collect because of the long distance between the sensor and the target, and a model of the object is not available if the object is non-cooperative.

C. SYNTHETIC IMAGE DATASET
One of the most significant problems in applying deep learning to monocular 3D pose estimation is the lack of image datasets with accurate 3D pose annotations. Recently, researchers have started using synthetic image datasets to train deep networks for object detection [44], [45], key-points localization [46], semantic segmentation [47] and 3D pose estimation [2], [13], [41], [48]. Su [2], [13] for flying machine pose estimation like ours, however, both their datasets are based on very few specific models with limited general applicability.

III. AIRPOSE NETWORK
A. AIRCRAFT DETECTION
Our work introduces the AirPose network, with three branches and a feature fusion module, to detect the aircraft, locate the key-points and estimate the orientation in parallel. Fig. 2 shows the overall 3D pose estimation pipeline. The input of the network is an RGB image $X$ of a single aircraft, with width $w$ and height $h$. A feature extractor based on ResNet-50, $f(*)$, with pre-trained weights is shared by the three branches as the network backbone:

$$\hat{X} = f(X),$$

where $\hat{X}$ refers to the extracted feature maps. Subsequently, $\hat{X}$ is sent to the Region Proposal Network (RPN), which outputs a set of aircraft object proposals, each with an objectness score [16]. After the RPN, the features of the proposals are resized to a fixed size of 7 × 7 × 256 by applying RoIAlign [9]. We denote the fixed-size feature map corresponding to the $i$-th proposal region as $M_i$. The first branch, $l(*)$, outputs the bounding box of the aircraft object as follows:

$$v = l(M_i),$$

where $v$ is the prediction of the object location. In this branch, the feature $M_i$ is sent to fully connected layers to output both the softmax probability of aircraft and the bounding-box regression offsets. The loss function of the first branch can be written as: We denote the fixed-size feature of the detected object as $M$. We utilize $M$ to obtain the object key-points localization and the 3D orientation estimate in the next two branches.

VOLUME 8, 2020

B. KEY-POINTS LOCALIZATION
The second branch, $g(*)$, is used to estimate the $P$ key-point locations $L_p$, where $p \in \{1, \dots, P\}$, for all $P$ parts: where $Y_p \in Z \subset \mathbb{R}^2$ and $b_p$ is the relative likelihood, predicted by the network, that the $p$-th part is at each location $z = (x_z, y_z)$ of $Z$. We merge all the $b_p(Y_p = z)$ for the $P$ parts to generate the corresponding belief maps: In this work, we select eleven representative semantic key-points closely related to the aircraft structure (i.e. nose×1, tail×1, wingtip×2, wing root×2, horizontal tail tip×2, vertical fin tip×1 and engine×2). To yield precise aircraft part locations, our key-points localization network adopts the stacked hourglass architecture [21], based on successive steps of pooling layers and upsampling layers. Different from the original hourglass network, our architecture starts from the upsampling stage. The resulting confidence maps are refined by the cascaded hourglass-like convolutional network. Every hourglass stage can be divided into two components. The first is an encoder stage, in which the convolutional layers and max pooling layers repeatedly halve the resolution of the feature maps. After the resolution reaches its lowest value, shown as the smallest layer of the hourglass model in Fig. 2, the second stage of upsampling begins. Instead of using transposed convolutional layers, this network takes bilinear interpolation as a simplified approach to repeatedly increase the feature resolution by a factor of two until reaching the output resolution. The information of every upsampling layer and its corresponding same-scale downsampling layer is linked together to preserve the representational capacity of the features. After the final upsampling layer, the network produces its prediction in the form of confidence heatmaps. The ground-truth heatmap of each part is generated by applying a 2D Gaussian distribution centered at the labelled position $L_p = (x_p, y_p)$.
The ground truth associated with the $p$-th part can be written as: Then an L2 loss function is applied to train the hourglass network, comparing the prediction $\hat{b}_p$ with the ground truth $b_p$: The loss function is minimized during intermediate supervision. After the supervision layer, another structurally identical hourglass stage begins, which refines the prediction produced by the first module. The refined prediction $\hat{L}_p$ can then be inferred from the maximum response of the final output heatmaps $\hat{b}_p$ as follows:

$$\hat{L}_p = \arg\max_{z \in Z} \hat{b}_p(Y_p = z),$$

where $\{\hat{L}_0, \hat{L}_1, \hat{L}_2, \dots, \hat{L}_{P-1}\}$ are the locations of the $P$ individual parts. The results analysis and ablation study of the key-points localization branch are shown in Section V. In experiments, we find that the output heatmaps of the key-points localization branch contain a wealth of aircraft structure information, which is significantly beneficial for the orientation estimation branch.
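The heatmap supervision described above can be illustrated with a minimal sketch. The exact Gaussian normalisation and loss weighting are not stated in the text, so the unnormalised Gaussian with a `2 * sigma**2` denominator, the default `sigma=2.0`, the summed squared error and all function names below are assumptions made for illustration only.

```python
import math

def gaussian_heatmap(width, height, center, sigma=2.0):
    """Ground-truth heatmap: unnormalised 2D Gaussian peaked at center = (x_p, y_p).

    The sigma value and the 2*sigma^2 denominator are assumptions.
    """
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def l2_loss(pred, target):
    """Sum of squared per-pixel differences between two heatmaps."""
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row))

def argmax_location(heatmap):
    """Decode a key-point as the (x, y) position of the maximum response."""
    best_value, best_xy = -math.inf, (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy

# Toy example: a labelled part at (20, 31) and a slightly shifted "prediction".
gt = gaussian_heatmap(64, 48, center=(20, 31))
pred = gaussian_heatmap(64, 48, center=(22, 30))
loss = l2_loss(pred, gt)
decoded = argmax_location(pred)
```

In a real pipeline one such heatmap would be produced per part, and the loss would be accumulated over all $P$ parts and over every supervised hourglass stage.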

C. ORIENTATION ESTIMATION
The third branch, h( * ), is used to estimate the relative orientation of the target aircraft. As shown in Fig.3, the orientation can be represented by the rotation between the aircraft body coordinate frame, A, and the camera coordinate frame, C.
To allow smooth interpolation and avoid the gimbal lock problem, we adopt the quaternion $q_{AC}$ to represent this rotation; note that $\hat{q}_{AC}$ is written as $\hat{q}$. Before directly using the extracted feature $M$ to produce the quaternion estimate, we analyse the correlation between the key-points localization error and the orientation error. As shown in Fig. 4, the orientation accuracy is closely related to the key-points localization accuracy. Thus, this paper proposes a feature fusion method, shown as the green block in Fig. 2, that fuses the key-points location information $\hat{b}_p[x_z, y_z]$ with the original feature maps $M$ to improve the pose estimation accuracy.

FIGURE 4. Relativeness analysis: we discretize both the normalized key-points localization error and the orientation error into ten subsets, from group 1 to group 10. For a fixed key-points localization error, the distribution of the orientation error can be seen as approximately normal. As the key-points localization error grows, the peak of the orientation error distribution shifts toward a higher mean error.
The orientation estimation branch, $h(*)$, therefore produces the quaternion as follows: Firstly, we discretize the SO(3) space by uniformly sampling 32 bins along each orientation dimension. Then each of the 32 × 32 × 32 Euler angle triples $(\phi_i, \theta_i, \psi_i)$ is converted to the corresponding quaternion $q_i$, where $\phi_i$, $\theta_i$ and $\psi_i$ are the yaw, pitch and roll of the $i$-th bin. Our goal is to produce an estimate $\hat{q}$ as close as possible to the ground truth $q_{gt}$.
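As a rough illustration of this discretization, the sketch below builds the 32 × 32 × 32 grid of Euler angles and converts each triple to a unit quaternion. The ZYX (yaw-pitch-roll) rotation order, the angle ranges, the use of bin centers and the `(w, x, y, z)` component order are all assumptions, since the text does not specify them; the function names are hypothetical.

```python
import math

def euler_to_quaternion(yaw, pitch, roll):
    """Convert ZYX (yaw-pitch-roll) Euler angles in radians to (w, x, y, z)."""
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    return (cr * cp * cy + sr * sp * sy,
            sr * cp * cy - cr * sp * sy,
            cr * sp * cy + sr * cp * sy,
            cr * cp * sy - sr * sp * cy)

def bin_centers(low, high, n):
    """Centers of n equal-width bins covering [low, high)."""
    step = (high - low) / n
    return [low + (i + 0.5) * step for i in range(n)]

def build_quaternion_bins(n=32):
    """One quaternion per (yaw, pitch, roll) bin; the angle ranges are assumed."""
    yaws = bin_centers(-math.pi, math.pi, n)
    pitches = bin_centers(-math.pi / 2, math.pi / 2, n)
    rolls = bin_centers(-math.pi, math.pi, n)
    return [euler_to_quaternion(yaw, pitch, roll)
            for yaw in yaws for pitch in pitches for roll in rolls]

bins = build_quaternion_bins(32)
num_bins = len(bins)  # 32 * 32 * 32 candidate orientations
```

The classifier then only has to score these fixed candidates, which is what makes the soft-classification formulation possible.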
To yield precise orientation estimates for aircraft, we propose the feature fusion module shown in Fig. 2, which uses a neural network to take into consideration both the original image information and the located key-points information simultaneously.
After the key-points localization branch outputs the confidence maps of size w × h × P, we fuse the confidence maps with the original image by weighted averaging. Then we extract the features of the heatmaps and of the fused image using the RoIAlign layer from [9] and convolutional layers, which resize the features to a fixed size of 7 × 7 × 128.
The features from the three sources are then stacked together into a final feature of size 7 × 7 × 512, which provides richer aircraft pose information.
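The fusion step can be sketched on toy data as follows. The text does not give the averaging weight or say how the $P$ confidence maps are combined before being averaged with the image, so the per-pixel maximum in `merge_heatmaps`, the default `weight=0.5` and the single-channel image are assumptions; only the channel depths (256, 128, 128, 512) come from the text.

```python
def merge_heatmaps(heatmaps):
    """Collapse P confidence maps into one map by per-pixel maximum (assumed)."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    return [[max(h[y][x] for h in heatmaps) for x in range(width)]
            for y in range(height)]

def fuse_with_image(image, merged, weight=0.5):
    """Weighted average of a single-channel image in [0, 1] and the merged map."""
    return [[(1.0 - weight) * pixel + weight * belief
             for pixel, belief in zip(image_row, merged_row)]
            for image_row, merged_row in zip(image, merged)]

def stacked_depth(color_depth=256, keypoint_depth=128, fused_depth=128):
    """Channel depth after stacking the three 7 x 7 feature sources."""
    return color_depth + keypoint_depth + fused_depth

# Toy 2 x 2 example with two key-point maps.
image = [[0.2, 0.4], [0.6, 0.8]]
heatmaps = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 1.0]]]
merged = merge_heatmaps(heatmaps)
fused = fuse_with_image(image, merged, weight=0.5)
depth = stacked_depth()
```

The depth check simply confirms that the color feature (256 channels) plus the key-points feature and the fused-image feature (128 channels each) give the stated 512 channels.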
Instead of directly regressing the relative attitude $\hat{q}$ from the stacked features [39] or performing hard viewpoint classification [41], this paper addresses pose estimation in a soft-classification manner, which enables the network to output more accurate results. Unlike the one-hot coding used in other classification tasks, we introduce a soft-classification coding, which can be written as follows: where the index set includes the indices of the $K$ quaternions nearest to the ground-truth quaternion, $w_i$ is the confidence value assigned to the $i$-th bin, $\sigma$ is a parameter that controls the Gaussian width and $\alpha_i$ is the angular distance between $q_i$ and $q_{gt}$: Then the total orientation estimation loss function, $L_3$, can be written as follows: where $L_{reg}$ represents the L2 regularization loss, which prevents overfitting by penalizing large weights. We train the network using $L_3$ to output the estimated weights $\hat{w}_j$; the final estimate $\hat{q}$ can then be inferred by minimizing the weighted least squares as follows:
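A minimal sketch of the soft-classification coding and the final weighted estimate is given below. Several pieces are assumptions rather than the paper's formulas: the angular distance is taken as the standard `2 * acos(|q1 . q2|)`, the bin weights as a normalised Gaussian of that distance, and the weighted least-squares estimate as the dominant eigenvector of the weighted outer-product matrix (a common way to average quaternions), computed here by power iteration. All names and default values are hypothetical.

```python
import math

def angular_distance(q1, q2):
    """Rotation angle in radians between unit quaternions; abs() treats q and -q alike."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))

def soft_labels(bins, q_gt, k=2, sigma=0.2):
    """Gaussian weights on the K bins nearest to q_gt, zero elsewhere, summing to 1."""
    dists = [angular_distance(q, q_gt) for q in bins]
    nearest = sorted(range(len(bins)), key=lambda i: dists[i])[:k]
    raw = {i: math.exp(-dists[i] ** 2 / (2.0 * sigma ** 2)) for i in nearest}
    total = sum(raw.values())
    weights = [0.0] * len(bins)
    for i, value in raw.items():
        weights[i] = value / total
    return weights

def weighted_average_quaternion(bins, weights, iterations=200):
    """Dominant eigenvector of sum_i w_i q_i q_i^T, found by power iteration."""
    m = [[sum(w * q[r] * q[c] for q, w in zip(bins, weights)) for c in range(4)]
         for r in range(4)]
    start = max(range(len(bins)), key=lambda i: weights[i])
    v = list(bins[start])  # start from the highest-weighted bin
    for _ in range(iterations):
        v = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return tuple(v)

def z_rotation(degrees):
    """Unit quaternion (w, x, y, z) for a rotation about the z axis."""
    half = math.radians(degrees) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

# Toy example: four bins about one axis, ground truth between two of them.
bins = [z_rotation(d) for d in (0, 10, 20, 90)]
q_gt = z_rotation(12)
weights = soft_labels(bins, q_gt, k=2, sigma=0.2)
q_hat = weighted_average_quaternion(bins, weights)
error_deg = math.degrees(angular_distance(q_hat, q_gt))
```

In this toy case the recovered orientation falls between the two nearest bins, so the error is smaller than the bin spacing, which is the motivation for preferring soft labels to a hard one-hot class.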

IV. IMAGE DATA GENERATION
Our neural network pipeline contains millions of parameters to train, which necessitates a large annotated image dataset. In the aircraft pose estimation context, real camera images, and especially their relative orientation labels, are extremely hard to obtain from non-cooperative aircraft using a monocular camera. Therefore, we build a platform on UE4 to generate an aircraft 3D pose dataset, named the AKO dataset, containing 15000 synthetic images for training and testing. The synthetic images are rendered in UE4 by merging the 3D aircraft model with background images or constructed scenes. The images are automatically labeled with the object bounding box, key-points locations and orientation information. We adopt the domain randomization (DR) technique [12] in the AKO dataset to strengthen the general applicability of the network. To better learn the general aircraft structure, we apply random structure deformation to twelve different types of aircraft model as data augmentation. The dataset includes 12 types of aircraft models; in addition, DR techniques are adopted from [49] to better learn knowledge of the aircraft structure. The AKO dataset makes training and testing the AirPose network a feasible task.
We apply DR techniques in the following aspects:
• Camera Parameters: focal distance, aperture size;
• Light Conditions: location and intensity of the sunlight, number of point light sources (from 1 to 8);
• Camera Placement: location, distance and angle of the camera with respect to the scene; note that the camera location depends on the specific scene (e.g., the camera is more likely to be placed under the object when the background is sky);
• Background Images: two sources, background photographs and rendered scenes;
• Image Noise: random Gaussian noise and random Gaussian edge blurring applied to the object;
• Model Augmentation: random texture and painting on the model surface, and model stretch (ranging from 0% to 10%).
DR makes our generated images non-photorealistic. However, this non-photorealistic manner does not degrade our model precision after fine-tuning on real images. On the contrary, our images include more variation and the generation process is far faster thanks to the DR technique. Note that our data generation pipeline outputs and resizes the images to a resolution of 720 × 480.
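The randomized aspects listed above could be drawn per image roughly as in the sketch below. Only the point-light count (1 to 8), the model stretch (0% to 10%), the two background sources and the 720 × 480 resolution come from the text; every other range, field name and the `sample_scene` function itself are hypothetical placeholders, and the actual AKO platform runs inside UE4 rather than Python.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneConfig:
    focal_length_mm: float      # camera parameter (range assumed)
    aperture_f_number: float    # camera parameter (range assumed)
    sun_intensity: float        # light condition (range assumed)
    num_point_lights: int       # from the text: 1 to 8
    camera_distance_m: float    # camera placement (range assumed)
    background_source: str      # from the text: photograph or rendered scene
    noise_sigma: float          # Gaussian image noise level (range assumed)
    model_stretch: float        # from the text: 0% to 10%
    resolution: tuple           # from the text: 720 x 480

def sample_scene(rng):
    """Draw one randomized scene configuration."""
    return SceneConfig(
        focal_length_mm=rng.uniform(24.0, 200.0),
        aperture_f_number=rng.uniform(1.8, 16.0),
        sun_intensity=rng.uniform(0.2, 1.0),
        num_point_lights=rng.randint(1, 8),  # randint is inclusive at both ends
        camera_distance_m=rng.uniform(50.0, 500.0),
        background_source=rng.choice(["photograph", "rendered_scene"]),
        noise_sigma=rng.uniform(0.0, 0.05),
        model_stretch=rng.uniform(0.0, 0.10),
        resolution=(720, 480),
    )

rng = random.Random(0)  # fixed seed so the sampled set is reproducible
configs = [sample_scene(rng) for _ in range(1000)]
```

Removing one field from the sampler at a time, by fixing it to a constant, mirrors the kind of one-parameter-out ablation used to study the effect of each randomized aspect.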

V. EXPERIMENTS
A. DATA AUGMENTATION
Our approach is evaluated on the AKO dataset. Unlike previous works, we apply data augmentation in a relatively prudent way, since classic augmentation methods such as rotation and cropping change the effective camera parameters and invalidate the ground-truth labels. Instead, to compensate, we generate more images through domain randomization and apply only processes, such as adding noise, that preserve the image resolution and spatial features.

B. IMPLEMENTATION DETAILS
We implement our pipeline on a single NVIDIA RTX 2070S GPU with PyTorch 1.0. During training, we use Stochastic Gradient Descent (SGD) with a momentum of 0.9, a mini-batch size of 4 images from the AKO dataset and a weight decay of 0.0001. The learning rate is set to 0.001 at the beginning and is divided by ten after 5 epochs and again after 10 epochs. The feature extractor backbone is a ResNet of depth 50.
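The step schedule described above can be written down in a few lines. This is a framework-free sketch: the zero-based epoch indexing and the function and dictionary names are assumptions, while the numeric values are those given in the text.

```python
# Hyperparameters as stated in the text.
TRAINING_CONFIG = {
    "momentum": 0.9,
    "batch_size": 4,
    "weight_decay": 0.0001,
    "base_lr": 0.001,
    "milestones": (5, 10),  # epochs after which the rate is divided by ten
    "gamma": 0.1,
}

def learning_rate(epoch, base_lr=0.001, milestones=(5, 10), gamma=0.1):
    """Step schedule: multiply the rate by gamma for every milestone already reached.

    Epochs are assumed to be zero-based, so epochs 0-4 use the base rate.
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops

schedule = [learning_rate(epoch) for epoch in range(12)]
```

In PyTorch, the same behaviour is usually obtained by pairing `torch.optim.SGD` with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10], gamma=0.1)`.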
Unless otherwise specified, the synthetic dataset is generated with full domain randomization, the model is trained on synthetic images and then fine-tuned on real images, and the test set contains only real images. The depth of the RoI feature is 256, and the depth of the fused key-points feature is 128. The key-points network includes 4 stages to refine the results, and the SO(3) space is discretized into 32 bins for each dimension. The feature fusion process is applied by default. All training and testing are based on our AKO dataset.

C. PERFORMANCE METRICS
To measure the performance of the aircraft detection branch, we use the Intersection-over-Union (IoU) metric, i.e., the area of the intersection of the predicted and ground-truth bounding boxes divided by the area of their union. For key-points localization, we use a modified Percentage of Correct Keypoints (PCK) metric, named aPCK, which calculates the percentage of key-points on the AKO dataset whose predicted locations are no further than a normalized distance from the ground truth. This normalized distance, associated with the airframe size, is calculated by: where $l_x$, $l_y$, $l_z$ are the distances on the 2D image from nose to tail, from left wing-tip to right wing-tip and from fin-tip to the midpoint of the horizontal stabilizer, respectively. We set $k = 0.05$, $\beta = 2.5$ and $\alpha = 1.2$. The angular distance between the estimated quaternion and the ground-truth quaternion, used to evaluate our orientation estimation branch, can be calculated as follows:
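The two image-plane metrics can be sketched as follows. The IoU function follows the standard definition for axis-aligned boxes. For the key-points metric, the normalization formula involving $k$, $\alpha$ and $\beta$ is not reproduced here, so the sketch takes an already-computed pixel threshold as input; the `(x_min, y_min, x_max, y_max)` box layout and the function names are assumptions.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def pck(predicted, ground_truth, threshold):
    """Fraction of key-points predicted within `threshold` pixels of the ground truth."""
    hits = sum(1 for (px, py), (gx, gy) in zip(predicted, ground_truth)
               if math.hypot(px - gx, py - gy) <= threshold)
    return hits / len(ground_truth)

# Toy example: two overlapping boxes and three key-points.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
score = pck([(10, 10), (20, 20), (40, 40)],
            [(11, 10), (20, 26), (40, 41)],
            threshold=2.0)
```

Passing a threshold derived from the airframe size, rather than a fixed pixel count, is what would turn this plain PCK into the size-normalized aPCK variant described in the text.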

D. RESULTS
First, Table 1 shows the overall results of our three-branch pipeline, where $M_s$ is the model trained only on synthetic images and $M_{s+r}$ is $M_s$ fine-tuned on real images. $T_s$ and $T_r$ are test sets of synthetic images and real images, respectively. We evaluate $M_s$ and $M_{s+r}$ on both $T_s$ and $T_r$. The model without fine-tuning on real images, $M_s$, yields high-precision results on $T_s$, but its accuracy drops sharply on the real-image test set $T_r$. We alleviate this overfitting problem by fine-tuning, and $M_{s+r}$ achieves a significant improvement compared with $M_s$. To compare our orientation estimation with the state-of-the-art method, we implement and slightly modify the SPN proposed in [13]. The results show that the orientation error of our AirPose network is significantly smaller than that of SPN, which we attribute to our deeper network. Some examples of our results are shown in Fig. 5.
Table 2 shows how the feature fusion module affects 3D pose estimation by switching among combinations of the three feature sources. The results show that our feature fusion significantly improves 3D pose estimation, by up to 1.3° compared with using only the color feature. Adding the key-points feature or the fused image feature alone also reduces the mean $E_{ori}$, by 0.9° and 0.6° respectively.
Table 3 shows how increasing the feature depth improves the performance of the overall pipeline. Performance changes significantly from a depth of 32 to 256 for all three tasks. From 256 to 512, however, the accuracy gain slows down relative to the increase in parameters.
The impact of our multi-stage architecture on key-points localization is shown in Table 4. We compare the aPCK of each key-point for architectures with 1 to 8 stages.
Parts with distinct edges, such as the nose, wing tip (WT), horizontal tail tip (HTT), vertical fin tip (VFT) and tail, are easier to localize and yield more accurate results than parts with less texture and less distinct margins. As the number of stages increases, the results become more accurate. Note that all stages share exactly the same structure. The effect of the multi-stage architecture is notable from 1 to 2 stages and from 2 to 4 stages, with accuracies of 76.9%, 83.1% and 88.2% for 1, 2 and 4 stages. The improvement from 4 to 8 stages is modest: from 88.2% to 88.8%. Key-points localization performance with respect to the normalized distance $k$ for each part is shown in Fig. 6.
We also conduct an ablation study on the effect of the domain randomization applied in our AKO dataset by removing one of the domain randomization parameters at a time. As shown in Fig. 7, the key-points localization accuracy and the orientation estimation accuracy change with a coherent trend. The absence of randomized light hurts the performance most: without light randomization, the aPCK drops to 76.1% and the attitude error increases to 13.8°. Different from [49], our results show one unexpected point: the absence of random aircraft surface texture does not reduce the accuracy as much as in [49]. This can be explained by the distinct structure of the objects and the lower-complexity backgrounds, which reduce the need for random texture.
The aircraft models in the test set can be divided into known models and unknown models. The unknown models are the aircraft models that have no intersection with the model set of the training data, and the known models are those that also appear in it. Table 5 shows that the results are not sensitive to prior knowledge of the aircraft model. Experiments on both known and unknown models achieve high accuracy at almost the same level, which shows the convincing generalization ability of our network across different aircraft models without requiring a specific 3D model.

VI. CONCLUSION
This paper proposes the AirPose network for aircraft detection, 2D key-points localization and 3D orientation estimation, together with our AKO dataset. We show how our feature fusion method and domain randomization benefit the overall performance. As future work, further research is required in the following directions. First, we will evaluate and improve the computational runtime and memory usage of our AirPose network so that it can be embedded in hardware. Second, we will apply our method to video sequences. Lastly, we envision 3D pose estimation using an external monocular sensor as a promising direction in the field of airspace situational awareness.