Pose Detection of Aerial Image Object Based on Constrained Neural Network

Constraints often exist in the high-dimensional output of object detection, such as the direction vector {cos θ, sin θ} of a two-dimensional object and the attitude quaternion of a three-dimensional object. The range of each output component of a traditional neural network is unconstrained, which makes it difficult to meet the needs of practical problems. To solve this problem, this paper designs a transformation network layer according to high-dimensional space transformation theory and constructs a constrained neural network model to detect the pose of objects from a single aerial image. Firstly, in the yolov3 network structure, three scale-transformation network layers are added, corresponding to the three field scales, to implement the constrained unit quaternion field. Secondly, according to the characteristics of quaternions, we propose a special loss function. A new constrained neural network, the quaternion field pose network (qfiled-PoseNet) model, is then constructed, which predicts the probability field of the object and the corresponding unit quaternion field. Next, the object's probability field determines the 2D bounding box of the object, and the unit quaternion field determines the 3D rotation R. Finally, the rotation matrix R and the 2D bounding box of the object are combined to calculate the 3D translation T. We evaluated our method on the DOTA1.5 data set and the HRSC2016 data set. The experimental results show that our method can detect the pose of the object well.


I. INTRODUCTION
Pose detection and recognition of objects is a technique for identifying objects of interest in complex environment images and obtaining the 3D relative pose between the object and the camera. It is one of the core technologies of next-generation intelligent robots, battlefield object reconnaissance, autonomous navigation, and satellite image interpretation, and its breakthrough will greatly promote the development of related research fields [1].
Object pose detection includes 3D translation and 3D rotation. According to the current work of most researchers, pose detection methods are mainly divided into three types: the keypoint matching method, the RGB-D method, and the deep learning method. Among them, the keypoint matching method mainly establishes 2D-3D correspondences on local features [2] and then uses the PnP algorithm to estimate the object pose parameters. However, while this kind of method can achieve high accuracy for objects with rich texture, it struggles with textureless objects [3]. The RGB-D method has also developed greatly with the emergence of depth cameras. Hinterstoisser et al. [4] proposed surface template matching of 3D point clouds, and later proposed point pair features and several variants [5] to improve robustness to complex backgrounds and noise. However, RGB-D methods require high hardware costs.
In recent years, deep learning has become the mainstream approach for solving the pose of objects, including camera pose and object pose. References [6] and [7] both regress camera pose by training CNNs (Convolutional Neural Networks). In reference [6], the authors use a quaternion for 3D rotation but omit the spherical constraint, resulting in an unreasonable loss function. Reference [7] addresses the problem by developing a more principled loss function. In references [8] and [9], the authors use CNNs to regress 3D rotation; their works focus on 3D rotation estimation, while 3D translation is not included. In reference [10], the SSD (Single Shot multi-box Detector) [11] algorithm was improved for object pose detection. The authors decompose pose detection into two stages, view-angle classification and in-plane rotation classification; however, the classification accuracy of each stage affects the total accuracy of object pose detection. The BB8 algorithm [12] first uses a segmentation network to locate the object, then uses another CNN to predict the 2D projections of the corners of the object's 3D bounding box, and finally uses the PnP algorithm to estimate the pose parameters of the object. This algorithm achieves high precision but takes a long time. Similarly, reference [13] improves the YOLO (You Only Look Once) object detection framework [14] to predict the 2D projections of the 3D bounding box, then uses the PnP algorithm to estimate the pose parameters. References [12] and [13] both regress too many 2D projection points, which increases the difficulty of model learning and reduces the learning speed. Although neural-network-based object pose detection methods have achieved some research results, several problems remain.
Each output element of a traditional neural network is independent, and the original output range of a single neuron is (−∞, +∞). At present, few studies consider the problem of multiple implicit constraints on neural network outputs. For example, the output neurons o1, ..., on of a neural network computing layer may need to satisfy a multivariate implicit constraint φ(o1, ..., on) = 0. This situation is very common in practical problems: there are unit-orthogonality constraints among the 9 elements of the 3D rotation matrix R; the quaternion describing an object's 3D rotation is constrained to have the sum of squares of its components equal to 1; and the vector (cos θ, sin θ) describing the ground-projection direction of an aerial object is likewise constrained to have a sum of squares equal to 1. According to Lie algebra theory, n implicit constraints are equivalent to establishing an (N−n)-dimensional manifold in the N-dimensional neuron output space. Therefore, it is possible to use the mature methods and conclusions of Lie algebra to study how to solve the constraint problem in neural networks and establish the most effective pose feature expression.
According to Lie algebra theory [15], this paper designs a transformation network layer and constructs a constrained neural network model to detect the pose of objects in aerial images. Firstly, based on the yolov3 model, a transformation network layer is added to each of the three output branches of yolov3 to predict the quaternion. Secondly, according to the characteristics of quaternions, we propose a special loss function and construct the qfiled-PoseNet model. The model can predict the object's probability field and the unit quaternion field respectively. Next, the object's probability field determines the 2D bounding box of the object, and the unit quaternion field determines the 3D rotation. Finally, the rotation matrix R and the 2D bounding box of the object are combined to calculate the 3D translation T. We verified the feasibility of the method on the DOTA1.5 data set and the HRSC2016 data set, and the experimental results show that our method achieves satisfactory results. Our work has the following advantages and contributions:
1) There are no constraints between the output components of a traditional neural network. This paper innovatively proposes a network that can satisfy constraints among its output components.
2) A transformation network layer is designed to construct a constrained neural network model, which can output the unit quaternion field used to build the rotation matrix R.
3) For networks whose output is required to be a unit vector, a special loss function is proposed to solve the problem of 3D rotation regression.
The remainder of the paper is organized as follows. After an overview of related work, we introduce the proposed object pose detection method. We then present the experimental results, followed by the conclusion.

II. PROPOSED ALGORITHM

A. QFILED-POSENET MODEL
A quaternion is a compact and effective vector describing 3D rotation [16]. It was discovered by the mathematician Hamilton and has been widely used in the field of computer vision. We use q = {q0, q1, q2, q3} to represent the unit quaternion describing the 3D rotation of the object. The conversion relationship between the unit quaternion and the rotation matrix R can be expressed as [17] in equation (1), shown at the bottom of the next page.
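The quaternion-to-rotation-matrix conversion of equation (1) is not reproduced in this extract, but it is the standard Hamilton-convention formula. As a minimal sketch under that assumption:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion q = (q0, q1, q2, q3) into a 3x3 rotation
    matrix using the standard Hamilton-convention formula (assumed to match
    the paper's equation (1)); q must already be normalized."""
    q0, q1, q2, q3 = q
    return np.array([
        [1 - 2*(q2**2 + q3**2), 2*(q1*q2 - q0*q3),     2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),     1 - 2*(q1**2 + q3**2), 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),     2*(q2*q3 + q0*q1),     1 - 2*(q1**2 + q2**2)],
    ])
```

For the identity quaternion (1, 0, 0, 0) this yields the identity matrix, and a rotation of θ about the z-axis corresponds to q = (cos(θ/2), 0, 0, sin(θ/2)).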
Since the line of sight to a remote sensing object is essentially vertically downward, the rotation matrix R of the camera is defined as in equation (2). We first label the coordinates of four points (1-4) of an object; θ is the angle of the unit vector from the center point of the object to the midpoint of points 1 and 2, as shown in FIGURE 1. We calculate sin θ and cos θ, and then compute the attitude quaternion q according to the attitude quaternion calculation formula (3).
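Formula (3) is not reproduced in this extract. Under one plausible reading of the labeling procedure above (nadir view, so the rotation is purely about the camera's z-axis), the step can be sketched as follows; the function name and the z-axis assumption are ours, not the paper's:

```python
import numpy as np

def heading_quaternion(center, p1, p2):
    """Sketch of converting labeled keypoints to an attitude quaternion.

    Assumption: the heading angle theta points from the object center to the
    midpoint of labeled points 1 and 2, and the rotation is purely in-plane
    (about the camera z-axis), giving q = (cos(t/2), 0, 0, sin(t/2)).
    """
    mid = (np.asarray(p1, float) + np.asarray(p2, float)) / 2.0
    d = mid - np.asarray(center, float)
    d = d / np.linalg.norm(d)                  # unit vector (cos t, sin t)
    theta = np.arctan2(d[1], d[0])
    return np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
```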
During training, the ground-truth data comes from the four key points of the object on the aerial image; the coordinates of the four points of the object in the data set are converted into a quaternion as follows. During prediction, the R matrix is easily obtained from the predicted q.

The value range of a convolution kernel is unlimited, so the neighborhood-weighted output of a convolutional network also has the unlimited range (−∞, +∞). The convolution kernel components of different output channels are completely independent, so the output components of different channels are also independent. Therefore, the prediction components of a neural network have difficulty satisfying mutual constraints between components. For example, the quaternion describing the object's attitude is constrained to have a sum of squares equal to 1, but a traditional neural network has difficulty meeting this condition. Therefore, for the learning problem of object attitude data, this paper proposes to design a transformation function and add a network transformation layer, which converts the quaternion output by the neural network into a unit quaternion.
We use Q = {Q0, Q1, Q2, Q3} to represent the quaternion output when the neural network propagates forward. Each output element of the traditional neural network is independent, so Qi ∈ (−∞, +∞), while the quaternion representing the rotation is constrained to have a sum of squares equal to 1.
According to the chain rule of differentiation, (8) is obtained from equation (7). Therefore, solving for the transformation function becomes solving the system of differential equations (8).
Therefore, like [18], we design the transformation function of equation (9), which is exactly a set of solutions of the differential equations (8) and satisfies the constraint condition of equation (6). The forward propagation formula of the network conversion layer then follows, where i = 0, 1, 2, 3. In this way, a unitization constraint layer is added to regress the rotation. In [18], the authors proved that the normalization layer helps improve prediction accuracy.
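A minimal sketch of the unitization constraint layer, assuming the transformation of equation (9) is the usual normalization q_i = Q_i/||Q|| (the equation itself is not reproduced in this extract). The backward pass applies the chain rule through the normalization; its exact printed form, equations (11)-(12), is likewise not shown here:

```python
import numpy as np

def unitize_forward(Q):
    """Forward pass of the unitization constraint layer: q_i = Q_i / ||Q||."""
    return Q / np.linalg.norm(Q)

def unitize_backward(Q, grad_q):
    """Backward pass through q = Q/||Q|| (chain rule; a sketch intended to be
    consistent with the paper's gradient-transfer formulas).

    grad_q is dE/dq; the returned vector is dE/dQ.
    """
    n = np.linalg.norm(Q)
    q = Q / n
    # The Jacobian of q w.r.t. Q is (I - q q^T) / ||Q||.
    return (grad_q - q * np.dot(q, grad_q)) / n
```

Note that the returned gradient is always orthogonal to Q: moving along Q itself does not change the normalized output.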
Let the quaternion of the rotation obtained through the output of the unitization constraint layer be expressed as q = {q0, q1, q2, q3}, satisfying the condition φ(q) = 0. The schematic diagram of the unitization constraint layer is shown in FIGURE 2.
The constraint proposed in this paper applies to every output component; it is independent of the ground truth and must be strictly satisfied, i.e., φ(q) = 0. This is a prerequisite for training the network model.
Quaternions are hemispherically distributed in 4D space, so q and −q represent the same rotation. However, in the output space of the neural network, the distance between q and −q is the largest possible. To solve this problem, the dot-product method is used to define the loss function E_q, expressed as equation (11), where q̂ = {q̂0, q̂1, q̂2, q̂3} is the true value of the quaternion. After the loss function E_q is defined, the back-propagation formula is expressed as equation (12) [19]. We use formulas (11) and (12) to realize the gradient transfer of the unitization operator in back propagation.
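The exact form of E_q (equation (11)) is not reproduced in this extract; a common sign-invariant dot-product formulation, offered here only as a hedged sketch, is E = 1 − |q · q̂|, which assigns zero loss to both q̂ and −q̂:

```python
import numpy as np

def quat_dot_loss(q_pred, q_true):
    """Sketch of a dot-product quaternion loss.

    A plain L2 loss penalizes a perfect prediction with flipped sign, since
    q and -q encode the same rotation. Taking the absolute dot product makes
    the loss invariant to that sign ambiguity. This is one common form, not
    necessarily the paper's exact equation (11).
    """
    return 1.0 - abs(float(np.dot(q_pred, q_true)))
```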
In order to regress the 3D rotation of multiple objects in the scene, it is necessary to expand the scalar values into fields. We generalize the four input scalar values {Q0, Q1, Q2, Q3} into four input feature maps, and the four output scalar values {q0, q1, q2, q3} into four output feature maps, so as to construct a quaternion output model supporting fully convolutional operation. The structure is shown in FIGURE 3.
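Generalizing the scalar unitization layer to a field simply means applying the same normalization at every spatial location of the four feature maps. A NumPy sketch (the function name and the small epsilon guard are ours):

```python
import numpy as np

def unitize_field(Q_maps, eps=1e-8):
    """Pixel-wise unitization of a quaternion field.

    Q_maps: array of shape (4, H, W) holding the raw channels Q0..Q3.
    Returns maps of the same shape in which each pixel's 4-vector has unit
    norm, i.e. the scalar unitization layer applied at every location.
    eps avoids division by zero on all-zero pixels.
    """
    norm = np.sqrt((Q_maps ** 2).sum(axis=0, keepdims=True)) + eps
    return Q_maps / norm
```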
The network transformation layer defined above for regressing the unit quaternion field is added to the three different-scale output branches of yolov3 to construct the qfiled-PoseNet model. Yolov3 is a 106-layer structure: it outputs three groups of tensors at layers 82, 94, and 106, obtaining three object sets at different scales. We extend it by 24 layers to form a 130-layer network structure. At layers 75, 87, and 99, the network predicts 2D horizontal boxes of objects at different scales. Then, in darknet, cross-connections and skips are realized through the route layer. Layers 114, 122, and 130 output the unit quaternion fields corresponding to large-, medium-, and small-scale objects. This ensures that each output object's 2D bounding box corresponds to a quaternion field; the network structure changes from the original 3-branch structure to a 3 × 2 branch structure, as shown in FIGURE 4. The total loss function of the qfiled-PoseNet model can be expressed as follows, where loss_yolov3 is the loss function of the yolov3 model, M and H represent the size of the quaternion-field tensor, and 1_i^obj denotes whether an object appears in cell i.

The difficulty of neural-network-based object detection is that neural networks are better suited to learning mutually independent dense base spatial data than to learning constrained sparse manifold data. Therefore, we summarize the general method of learning constrained sparse spatial data as follows:
(1) In the training stage, the sparse data R in high-dimensional space is transformed and compressed into dense spatial data Q.
(2) In the inference stage, the unconstrained output Q of the neural network is transformed back into the original data R through the inverse of the training-stage transformation.

B. CALCULATE AERIAL IMAGE OBJECT TRANSLATION
We calculate the translation T of the object according to the Bounding Box Equation technique proposed in reference [19]. The matrix equation is given in (16), shown at the bottom of the next page, where c is the principal point of the image, f is the focal length, and [xL, yT, xR, yB] is the extent of the object's rectangle in the image.
The 3D translation T is then obtained by the least-squares method.

III. EXPERIMENT
To speed up the experiments, we used two computers simultaneously. Their configurations are: an i7-8700 CPU @ 3.19 GHz, 32 GB memory, a 1080 Ti graphics card, and Windows 10 for the HRSC2016 data set; and an i7-8700 CPU @ 3.19 GHz, 32 GB memory, a 2080 Ti graphics card, and Windows 10 for the DOTA1.5 data set. The development environment is VS2013 + OpenCV 2.4.10. The DOTA1.5 data set (https://captain-whu.github.io/DOAI2019) and the HRSC2016 data set (https://sites.google.com/site/hrsc2016/) are two typical remote sensing data sets, and we evaluate the proposed method on both. To quantitatively verify the effectiveness of the experimental results, the following indicators are used to evaluate object detection and recognition:
(1) mAP.
(2) IoU (intersection over union), the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box, commonly used as the 2D box accuracy index of the object.
(3) eRE, the rotation error, with cos(eRE) = (Tr(R·R̂ᵀ) − 1)/2, where R is the predicted rotation matrix and R̂ is the true rotation matrix.
When calculating the mAP, we require the 2D box accuracy index IoU > 0.5 and cos(eRE) > 0.95. This ensures that a detection is counted as correct only when both the horizontal box and the 3D rotation are correct.
(4) f1, the harmonic mean of precision and recall:
f1 = 2 · precision · recall / (precision + recall). (21)
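The IoU and f1 indicators above are straightforward to compute; a minimal sketch (box layout [xL, yT, xR, yB] as used earlier in the paper):

```python
def iou(a, b):
    """Axis-aligned IoU of two boxes given as [xL, yT, xR, yB]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def f1(precision, recall):
    """Harmonic mean of precision and recall, as in equation (21)."""
    return 2 * precision * recall / (precision + recall)
```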

A. ABLATION STUDY
To verify the effectiveness of the proposed constraint layer, network models with and without the constraint layer are tested on the DOTA1.5 data set. A variety of performance curves are obtained, including average loss, the recognition accuracy index mAP, the recognition rate index f1 at a 0.25 threshold, recall, precision, IoU, and the orientation accuracy index avg_dot. The orientation accuracy index is the dot product of the predicted vector and the ground-truth vector, used to judge the accuracy of orientation prediction. As shown in FIGURE 5, the horizontal axis represents the number of epochs and the vertical axis the corresponding value; green represents the model with the constraint layer and blue the model without it. The variable avg_dot in the 7th subgraph represents (Tr(R·R̂⁻¹) − 1)/2, the cosine of the angle between the predicted attitude and the true attitude. When IoU > 0.5, the avg_dot of the constrained network is closer to 1, indicating that its predicted attitude is closer to the true value. Because the attitude estimation is more accurate, the performance indexes of the model with the constraint layer are better than those of the model without it. After adding the constraint layer, the output of the model naturally satisfies the constraint conditions, and the network can learn in the constrained domain space, which greatly improves the speed and accuracy of learning.

At present, few scholars have studied object pose detection from aerial images, so this paper compares against methods that also use DOTA1.5 as experimental data. The comparison results are shown in TABLE 2. The mAP of the proposed method is higher than that of the comparison algorithms, especially for Plane, whose AP is as high as 97.
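The avg_dot quantity described above is the standard geodesic rotation metric (for a rotation matrix, R̂⁻¹ = R̂ᵀ); a minimal sketch:

```python
import numpy as np

def rotation_cosine(R_pred, R_true):
    """Cosine of the angle between two attitudes:
    cos(eRE) = (trace(R_pred @ R_true.T) - 1) / 2.
    For rotation matrices the transpose equals the inverse, so this matches
    the avg_dot expression (Tr(R * R_true^-1) - 1)/2; 1 means identical
    attitudes, and values near 1 mean small rotation error.
    """
    return (np.trace(R_pred @ R_true.T) - 1.0) / 2.0
```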
However, the AP of the CC (Container Crane) class is low. This is because the number of CC samples is too small, which lowers the AP of this class and reduces the overall mAP. FIGURE 6 shows the PR curve of each type of object for the DOTA1.5 experiments; recall is the abscissa, precision is the ordinate, and the ratio of the area enclosed by the PR curve and the axes to the total area is the AP of that class, which can be obtained statistically during the experiment. The mean of the AP values of the 16 object classes is the final position of the green line in the second subgraph of FIGURE 5. As can be seen from TABLE 1, this value is about 71.66. According to FIGURE 6, the AP of Plane is the highest and that of CC is the lowest; again, this is because the number of CC samples is too small, and the results correspond to TABLE 2.
Based on the attitude quaternion of the object, the three-dimensional position and attitude of the object relative to the camera can be further obtained using the method of reference [19]. Experimental results on the DOTA1.5 data set are shown in FIGURE 7.
In the figure, the red arrow is the x-axis of the object's body frame; the green arrow is the y-axis, that is, the tail direction of the object; the blue arrow is the z-axis, pointing into the ground. We only learned the orientations of planes, ships, large vehicles, and small vehicles. Other objects, such as storage tanks and swimming pools, have no orientation; for them we do not learn in the three new branches, but only learn the horizontal box in the first three branches. The proposed algorithm can show the exact direction of an object, whereas the output of a traditional oriented box leaves four possible directions and is therefore ambiguous.

In FIGURE 7, the main objects of the first three pictures in the first row are cars, and those of the last three pictures are planes. Both cars and planes have orientations, so the proposed algorithm can not only detect the position of the object but also its direction; the orientation of a car or plane is opposite to the green arrow in the figure. The comparison algorithms only use the rotated box to find the object and do not fully determine its orientation. The main objects of the first picture in the second row are a ship and a harbor. The proposed algorithm accurately detects the position and direction of the ship; the harbor is also detected, but because this object has no orientation, the algorithm only detects its position. The last three pictures show harbors and swimming pools. The first three pictures in the third row show bridges, which are difficult to interpret manually but can be identified by the proposed algorithm. The last three show storage tanks.
The pose detection of remote sensing objects in complex environments (different scales, different illumination, blur, weak and small) is very challenging. For many small objects, it is difficult for human eyes to identify the direction.

C. HRSC2016 DATA SET TEST RESULTS
We experimented with the HRSC2016 data set. The ship detection experiment was carried out on 350 images of the data set; the accuracy of each index and the recognition speed are shown in TABLE 3. To verify the effectiveness of the proposed algorithm, it is compared with other algorithms [25]-[29] that also experiment on this data set. In the figure, recall is the horizontal axis and precision is the vertical axis, and the ratio of the area enclosed by the red curve and the axes to the total area is the AP of the ship class. Since the HRSC2016 data set has only one class of objects, the AP of the ship class is the mAP of the experimental results. It can be seen from the figure that very few objects are missed and the detection accuracy is high, corresponding to the mAP value in TABLE 3.
Experimental results on the HRSC2016 data set are shown in FIGURE 9. The objects in the HRSC2016 data set are ships, which have a direction. In the figure, the red arrow is the x-axis of the object's body frame; the green arrow is the y-axis, that is, the heading direction of the ship; the blue arrow is the z-axis, pointing toward the sky. Therefore, the direction of the ship is the direction indicated by the green arrow. All comparison algorithms only use a rotated box to detect the object, rather than determining the object's direction in a physical sense. The experimental results show that the proposed algorithm can not only detect the ship but also determine its direction.

IV. CONCLUSION
This paper designed a network transformation layer and constructed a new constrained network model based on the yolov3 structure, which we call the qfiled-PoseNet model. The model predicts the probability field of the object and the unit quaternion field respectively: the probability field determines the 2D bounding box of the object, and the unit quaternion field determines the 3D rotation R. Combining the rotation matrix R and the 2D bounding box of the object then yields the 3D translation T. This paper implements a fast algorithm to infer the 3D pose and position of an object relative to the camera from an aerial image. The proposed algorithm can detect the pose of an object in an aerial image, that is, while detecting the position of the object, it also obtains its actual orientation.
JIN LIU received the Ph.D. degree in pattern recognition and intelligent systems from the Huazhong University of Science and Technology, in 2005. He is currently an Associate Professor with Wuhan University. His research interests include pattern recognition, computer vision, and artificial intelligence.
YONGJIAN GAO is currently pursuing the master's degree with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. His research interests include object detection, machine learning, and computer vision.