Vision-Based Pose Optimization Using Learned Metrics

Camera pose optimization is the basis of geometric vision tasks such as 3D reconstruction, structure from motion, and visual odometry. A pose optimization method based on a learned metric is proposed to improve the convexity of the optimization. A neural network was designed and trained on collected datasets. The network takes pairwise patches as input and outputs the Euclidean distance between their centers. This distance is used in the residual calculation of Gauss-Newton optimization, and the Jacobian corresponding to this distance can be solved analytically. Simulations verified the convergence and generalization of the designed network. The accuracy and robustness of the proposed pose optimization were also verified by comparison with intensity- and feature-based optimizations.


I. INTRODUCTION
Vision-based pose optimization recovers the camera-to-world translation and orientation from an image and is a key technology in geometric vision. It is the process of predicting the transformation of an object from a user-defined reference pose, given an image. It arises in computer vision and robotics, where the pose or transformation of an object can be used for alignment with Computer-Aided Design (CAD) models, or for identification, grasping, or manipulation of the object. It is used extensively in visual odometry (e.g., [1]), wide-baseline stereo (e.g., [2]), three-dimensional reconstruction (e.g., [3]), and structure from motion (e.g., [4]). The crucial components of pose optimization are the objective design and the corresponding Jacobian calculation. The similarity metric between images participates in the loss and gradient calculations, which determines the optimization convergence.
The camera pose is usually estimated by a feature- or intensity-based method. The feature-based method accomplishes pose optimization by minimizing the alignment error between corresponding features. Feature extraction and correspondence establishment are generally conducted by hand-engineered methods such as SIFT [5], SURF [6], and ORB [7]. However, in extreme environments, textureless or gradient-texture scenes make features difficult to extract, and repeated textures easily cause mismatches. The camera pose must then be estimated without feature extraction and matching [8].
By contrast, the intensity-based method optimizes the relative pose by minimizing photometric errors [9]. It directly measures the similarity of pixel patches from the target and reference images via similarity metrics such as the sum of squared differences (SSD) and the sum of absolute differences (SAD) [10]. The intensity-based method avoids feature extraction and matching; it works as long as the pixel intensities in the image vary.
However, using a similarity metric such as SSD to optimize the relative pose depends on the following correlation: the relative pose gradually approaches the true value as the similarity score increases. This correlation determines the convexity of the optimization model. Unfortunately, this positive correlation is weak, especially when the current estimate is far from the true value. Therefore, intensity-based odometry such as LSD [9] and DSO [11] requires good initial values, and large sampling is also needed to maintain this positive correlation and ensure the convergence of pose optimization. In addition, the intensity-based method relies on luminosity invariance, which is susceptible to external illumination. Therefore, a good metric must be designed to enhance the convexity of the optimization model and its robustness with respect to factors such as illumination and initial values when feature correspondences are unknown.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

This study intends to design a metric that can be calculated directly from pairwise patches, one extracted from the reference frame and the other being the projected patch in the target frame. This metric does not need feature correspondences but maintains the convexity of the optimization model and its robustness to the initial pose value. To achieve this goal, we constructed a neural network with the patches in the reference and target frames as input and the Euclidean distance (δu, δv) in pixel coordinates as output. After training, the network can learn the Euclidean distance between the patch centers. The learned metric was then applied to pose optimization to improve its convergence and robustness.
The work carried out around this topic is summarized as follows: • A regression network and an improved residual neural network [12] were designed, and datasets for network training were collected.
• The learned metric was applied to pose optimization, and the corresponding Jacobian of Gauss-Newton was derived.
• The performance verification of the pose optimization method based on learned metrics was completed by comparison experiments.
• The key affecting factors of the accuracy and robustness of pose estimation were analyzed.
Judging the approximate movement between patches can be done manually but is difficult with conventional models; we therefore apply machine learning to the part of the problem where the mapping is stable. The remainder of the paper is organized as follows. Section II presents related works. Section III gives an overview of the proposed method: Section III-A presents the dataset collection and network training for the metric, and Section III-B designs the pose optimization method based on the learned metric. Section IV verifies the performance of the network and of the optimization method based on the learned metric. Sections V and VI contain the discussion and conclusion of this work.

II. RELATED WORKS
Related works are introduced in terms of metric improvement in pose optimization, metric learning applications, and pose estimation with deep learning.

A. METRIC IMPROVEMENT IN POSE OPTIMIZATION
Intensity-based pose optimization methods are mostly based on the similarity metrics mentioned in [10]. A few works have attempted to improve optimization performance by changing the similarity metric. Reference [13] applied an artificially designed metric to pose optimization: it proposes a self-similarity weighted graph-based implementation of α-mutual information (α-MI) for nonrigid image registration. Reference [14] also completed the nonlinear optimization of 3D pose based on a mutual information feature. Such artificially designed metrics improve the convergence and robustness of pose optimization to a certain extent. However, the positive correlation between learning-based metrics and pose parameters can be stronger than that of mutual information. Neural networks have a strong ability to fit mappings, which provides new ideas for metric design.
The learning-based metric was initially applied to medical image registration. For example, [15] used higher-level corresponding structures, such as the same organ or lesion appearing in two images, to learn the similarity metric. Reference [16] directly estimated the displacement vector field from a pair of input images, which is better than explicitly defining a dissimilarity metric. However, medical images are treated as 2D, so pose estimation is based on homography, and spatial transformation and dataset collection are much easier than in our work. Reference [17] proposes a method for learning suitable convolutional representations for camera pose retrieval based on nearest-neighbor matching and continuous metric learning-based feature descriptors. Similar to our learned metric, these descriptors are sensitive to pose errors; however, [17] directly uses these features to regress the pose, whereas we optimize the pose based on the learned metric.

B. METRIC LEARNING APPLICATION
The metric learning problem is concerned with learning a distance function tuned to a particular task; it is useful in conjunction with nearest-neighbor methods and other techniques that rely on distances or similarities [18]. It has gradually been applied to computer vision, e.g., image classification [19], RGB or depth image matching [20], and image search and understanding [21]. Typical applications are listed as follows: • MatchNet [22] achieved patch matching through metric learning, which effectively improves matching accuracy.
• [23] used learned metrics for similarity search by learning a Mahalanobis distance function that captures the images' underlying relationships well.
• [24] used a multi-view deep metric learning architecture to recognize volumetric image stacks.
• [25] trained a convolutional neural network to predict how well two image patches match and used it to compute the stereo matching cost.
Some metrics are difficult to describe through modeling, but neural networks can describe them effectively for use in related calculations, thereby improving the performance of visual computing through metric learning. Like [22] and [25], our study takes patches as the input; the output metric participates in the calculation with large patch sampling to ensure the robustness of pose optimization.

C. POSE ESTIMATION WITH DEEP LEARNING
Deep learning is applied to pose estimation in different ways. Early works [26]–[28] used images as input to neural networks that output absolute poses corresponding to images of fixed scenes. Some works [29] output relative poses learned from adjacent images. The learning methods are divided into supervised learning with pose labels [30], [31] and unsupervised learning without pose labels [32]–[34]. Most unsupervised methods still use stereo images [33], depth images [35], optical flow [36], etc., as supervisory information. The above works have great theoretical value; however, regardless of the supervision method, the camera pose is output directly by the network, and the truncation error of the regression output cannot be overcome, so the accuracy of pose estimation is difficult to guarantee. In addition, deep learning has also been used in image depth estimation [37], feature extraction [38], data association [39], hyperparameter selection [40], etc., to improve the performance of pose estimation. Our research applies deep learning to compute optimization metrics to improve the accuracy and robustness of pose estimation.

III. ALGORITHM OVERVIEW
Pose optimization between two frames is achieved by ensuring that the projection points in the target frame and the points extracted in the reference frame originate from the same 3D point in the real scene. The optimization usually minimizes alignment errors

$$\min_{T} \sum_i \left\| u'_i - \pi\!\left(T\,\pi^{-1}(u_i, d_{u_i})\right) \right\|^2 \tag{1}$$

or a photometric error

$$\min_{T} \sum_i \left\| I_t\!\left(\pi\!\left(T\,\pi^{-1}(u_i, d_{u_i})\right)\right) - I_r(u_i) \right\|^2, \tag{2}$$

where T ∈ SE(3) is the relative pose, d_{u_i} is the depth of u_i in the reference frame, and u'_i is the pixel point in the target frame that matches u_i; π and π⁻¹ are the projection and inverse projection, respectively. The calculation of u'_i in (1) requires feature extraction and matching, and mismatching can cause u'_i to deviate and undermine the optimization convergence. The gradient calculated from the photometric error in (2) also struggles to guarantee convergence and robustness with respect to initial values and illumination. As shown in Figure 1, the neural network is therefore trained to take the patches as input and output the alignment error δu, which serves as the Gauss-Newton optimization residual (the blue r in Figure 1):

$$r_i = \mathrm{Net}\!\left(Pa_r(u_i),\, Pa_t(\tilde{u}_i)\right) \approx \delta u, \tag{3}$$

where the weak error between r_i and δu is caused by the network learning error. The paired patches are obtained from u_i in the reference frame and ũ_i = π(T·π⁻¹(u_i, d_{u_i})) in the target frame and are input to the network through channel stacking. The Jacobian is calculated directly from the gradient w.r.t. the alignment error (the red J in Figure 1), which is related only to ũ_i. In this way, the metric for pose optimization is output from the patches through the network, and u'_i does not have to be calculated. The Jacobian calculated from the alignment error avoids the unreliability caused by illumination and the initial-value dependence caused by the photometric error. As a result, the pose optimization can converge from a wider range of initial values.
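The projection π, inverse projection π⁻¹, and warp ũ = π(T·π⁻¹(u, d)) used throughout can be sketched as follows (a minimal NumPy illustration; the pinhole intrinsics f_x, f_y, c_x, c_y and the 4 × 4 homogeneous form of T are assumptions of this sketch, not the paper's code):

```python
import numpy as np

def project(P, fx, fy, cx, cy):
    """pi: 3D camera-frame point -> pixel coordinates."""
    X, Y, Z = P
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def backproject(u, d, fx, fy, cx, cy):
    """pi^{-1}: pixel u with depth d -> 3D camera-frame point."""
    return np.array([(u[0] - cx) * d / fx, (u[1] - cy) * d / fy, d])

def warp(u, d, T, fx, fy, cx, cy):
    """u_tilde = pi(T * pi^{-1}(u, d)): project a reference pixel into the target frame."""
    P = backproject(u, d, fx, fy, cx, cy)
    P_t = T[:3, :3] @ P + T[:3, 3]
    return project(P_t, fx, fy, cx, cy)
```

With the identity transform, a pixel warps onto itself, which is a convenient sanity check for the sign conventions.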

A. DATASET COLLECTION AND NETWORK TRAINING
This learning task is proposed here for the first time, and no ready-made datasets or network structures are available for training.

1) DATASET

As shown in Figure 3, the training data require each pair of patches to correspond to a label, which is the Euclidean distance between the patch centers in the pixel coordinate system. The patch dataset is collected from the CoRBS dataset [41], which includes RGB and depth images captured by a Kinect v2 together with the camera pose for each frame; the depth images in CoRBS are more accurate than those captured by the Kinect v1 [42], [43]. On the basis of CoRBS, feature points u_i are extracted in the reference frame and projected into the target frame:

$$\tilde{u}_i = \pi\!\left(\hat{T}\,\tilde{T}\,\pi^{-1}(u_i, d_{u_i})\right), \tag{4}$$

where T̃ is the true relative pose and T̂ a pose interference. Thus, we obtain ũ_i.

The label is

$$\delta u_i = \pi\!\left(\hat{T}\,\tilde{T}\,\pi^{-1}(u_i, d_{u_i})\right) - u'_i = \tilde{u}_i - u'_i, \tag{5}$$

where u'_i is the matching point of u_i in the target frame, d_{u_i} is the depth corresponding to u_i (obtained from the depth image), and T̂ ∈ SE(3) is the pose interference used to generate data with different labels.
The input items of the dataset are pairwise patches centered on ũ_i and u_i. The orientations of the patches corresponding to ũ_i and u_i differ due to the rotation between frames.
The patch orientation is aligned by

$$o_t = R_{[1:2,1:2]}\, o_r, \tag{7}$$

where o is a 2 × 1 vector representing the patch orientation in the image, R is the rotational part of T̂·T̃, and R_{[1:2,1:2]} is the 2 × 2 block in the upper-left corner of R.

FIGURE 2. Network structures of the regression (a) and improved ResNet-18 (b) networks. '/2' means stride = 2. The parameter counts of the two networks are roughly equal. In (b), the red blocks are the modified layers, and the output of the second large block is input directly to the avgpool layer.

2) NETWORKS
For this special learning task, we designed a regression neural network and an improved ResNet-18 [12]. As shown in Figure 2, the networks take a 2 × 32 × 32 stacked patch pair as input and output a 2 × 1 vector, which is the optimization metric. The two patches are stacked along the channel dimension. The regression network extracts features through convolution layers, with ReLU as the activation function and BatchNorm to maintain generalization; the 2 × 1 vector is output through a fully connected layer.
ResNet-18 has four large blocks, but we use only two of them because the input patch is small. The input channel count of the first convolution layer is changed to 2, and the output dimension of the last FC layer is changed to 2. The original layers of ResNet-18 load pretrained parameters (PyTorch provides network parameters trained on ImageNet [44]), and the modified layers (the red layers in Figure 2) are re-initialized (xavier_uniform for weights and a constant for biases). The entire network is then retrained from these initial parameters.

B. POSE OPTIMIZATION
When solving for the pose, the pixel depth in the reference frame can be obtained from the depth image, making this a 3D/2D optimization. The optimization is based on the Lie algebra se(3), and the optimization parameter is ξ = (ω, v), where ω is the angular velocity and v the linear velocity. We verify the optimization performance of the learned metric in two settings: optimizing the pose only, and optimizing the pose and the map points simultaneously. A typical feature-based method using ORB [7] and an intensity-based method using SSD are designed as baselines for comparison.

1) FEATURE-BASED METHOD
The optimization model of the feature-based method is

$$\min_{T} \sum_i \left\| u'_i - \pi\!\left(T\,\pi^{-1}(u_i, d_{u_i})\right) \right\|^2, \tag{8}$$

where K denotes the intrinsic camera parameters and π is determined by K. The Jacobian corresponding to the residual of this objective is

$$J = \frac{\partial \tilde{u}}{\partial \delta\xi} = \begin{bmatrix} -\dfrac{f_x XY}{Z^2} & f_x + \dfrac{f_x X^2}{Z^2} & -\dfrac{f_x Y}{Z} & \dfrac{f_x}{Z} & 0 & -\dfrac{f_x X}{Z^2} \\ -f_y - \dfrac{f_y Y^2}{Z^2} & \dfrac{f_y XY}{Z^2} & \dfrac{f_y X}{Z} & 0 & \dfrac{f_y}{Z} & -\dfrac{f_y Y}{Z^2} \end{bmatrix}, \tag{9}$$

where f_x, f_y are the focal lengths in K and (X, Y, Z)ᵀ = T·π⁻¹(u_i, d_{u_i}).
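The 2 × 6 projection Jacobian used here can be sketched numerically as follows (NumPy; the (ω, v) column ordering follows the paper's parameterization ξ = (ω, v), and the left-perturbation convention is an assumption of this sketch):

```python
import numpy as np

def skew(P):
    """Skew-symmetric matrix of a 3-vector."""
    X, Y, Z = P
    return np.array([[0.0, -Z, Y], [Z, 0.0, -X], [-Y, X, 0.0]])

def proj_jacobian(P, fx, fy):
    """2x6 Jacobian d(u_tilde)/d(delta_xi) at camera-frame point
    P = T * pi^{-1}(u, d), columns ordered as xi = (omega, v)."""
    X, Y, Z = P
    Jt = np.array([[fx / Z, 0.0, -fx * X / Z**2],
                   [0.0, fy / Z, -fy * Y / Z**2]])  # translation block
    Jw = Jt @ (-skew(P))                            # rotation block (left perturbation)
    return np.hstack([Jw, Jt])
```

The translation block can be checked against finite differences of the projection, which is a quick way to catch sign errors.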

2) INTENSITY-BASED METHOD
The optimization model of the intensity-based method is

$$\min_{T} \sum_i \left\| e_i \right\|^2, \qquad e_i(j) = I_t\!\left(\tilde{u}_i(j)\right) - I_r\!\left(u_i(j)\right), \tag{10}$$

where j indexes the j-th pixel of the 32 × 32 patch, I_r and I_t denote the gray values of the reference and target images, respectively, and e_i is a 1024 × 1 vector. The orientation of the patch in the target image is rotated in accordance with (7). The Jacobian corresponding to the residual of this objective is

$$J = \frac{\partial I_t}{\partial \tilde{u}} \frac{\partial \tilde{u}}{\partial \delta\xi}, \tag{11}$$

where ∂I_t/∂ũ is the gray gradient of the patch, a 1024 × 2 matrix; I_r(u) is the gray of the patch in the reference image, a 1024 × 1 vector, and likewise for I_t(ũ); and ∂ũ/∂δξ is the same as in (9).

3) BASED ON LEARNED METRIC
The optimization model of the patch-based method using the learned metric is, as shown in Figure 1,

$$\min_{T} \sum_i \left\| \mathrm{Net}\!\left(Pa_r(u_i),\, Pa_t(\tilde{u}_i)\right) \right\|^2, \tag{12}$$

where Pa_i(j) is the patch around pixel j in image i and Net is the trained network. The residual r = Net(Pa_r(u_i), Pa_t(ũ_i)) is a 2 × 1 vector, and the Jacobian J is the same as in (9). We derive the Jacobian directly from the alignment error (u' − ũ) because the alignment error is approximately equal to the network output. The model is optimized using Gauss-Newton:

$$H\,\delta\xi = b, \qquad H = \sum_i J_i^{\top} W J_i, \qquad b = -\sum_i J_i^{\top} W r_i,$$

where W ∈ R^{2×2} is the diagonal matrix containing the weights, H ∈ R^{6×6} is the approximate Hessian matrix, and b ∈ R^{6} is the gradient vector. As shown in Figure 1, the calculation of J and r does not require the matching point u' of u; thus, this method does not require extracting or matching features.
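The Gauss-Newton accumulation described above (H = Σ JᵀWJ, b = −Σ JᵀWr) can be sketched as follows (NumPy; the identity weights used in the check are an assumption of this illustration):

```python
import numpy as np

def gauss_newton_step(residuals, jacobians, weights):
    """One Gauss-Newton step for a 6-DoF pose: accumulate H = sum J^T W J and
    b = -sum J^T W r over all samples, then solve H * delta_xi = b."""
    H = np.zeros((6, 6))
    b = np.zeros(6)
    for r, J, W in zip(residuals, jacobians, weights):
        H += J.T @ W @ J
        b += -J.T @ W @ r
    return np.linalg.solve(H, b)
```

On a synthetic linear problem where each residual is r_i = −J_i·δξ*, a single step recovers δξ* exactly, which is the expected behavior of Gauss-Newton on a locally linear model.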

C. POSE OPTIMIZATION WITH MAPPOINTS
If the map points need to be optimized, then the optimization parameters also include P_i, the coordinates of the i-th map point. The Jacobian corresponding to P_i is

$$\frac{\partial \tilde{u}}{\partial P_i} = \begin{bmatrix} \dfrac{f_x}{Z} & 0 & -\dfrac{f_x X}{Z^2} \\ 0 & \dfrac{f_y}{Z} & -\dfrac{f_y Y}{Z^2} \end{bmatrix} R,$$

where R is the rotational part of the relative pose and (X, Y, Z)ᵀ is the map point transformed into the target camera frame. When P_i is involved in the optimization, interference is applied to the pixel depth. In this way, the pose optimization performance based on the learned metric can be verified when the pose and the map points are optimized simultaneously.

IV. EXPERIMENTS
The network is trained with PyTorch in Python. The pose optimization uses the Gauss-Newton solver in the g2o [45] library. The trained network is called from C++ via the torch.jit module. The paired images of the CoRBS dataset are randomly divided into a training set and a test set: the patch dataset used for network training is collected from the images of the training set, and the pose accuracy is verified on the test set. A series of experiments is conducted to evaluate the learned metric: Section IV-A demonstrates the dataset and verifies the performance of the network; Section IV-B verifies the accuracy and robustness of the learned metric-based method compared with the feature- and intensity-based methods; Section IV-C analyzes the key parameters affecting the pose optimization performance.
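The torch.jit export that lets C++ load the trained network can be sketched as follows (the stand-in network and the file name metric_net.pt are placeholders for illustration, not the paper's model):

```python
import torch

# stand-in for the trained metric network (placeholder, untrained)
net = torch.nn.Sequential(
    torch.nn.Conv2d(2, 8, 3),
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 30 * 30, 2),
)
net.eval()

# trace with an example 2x32x32 stacked patch pair and save for libtorch
example = torch.randn(1, 2, 32, 32)
traced = torch.jit.trace(net, example)
traced.save("metric_net.pt")
# in C++: auto module = torch::jit::load("metric_net.pt");
```

The saved module can then be reloaded (from Python or C++) and evaluated on batches of patch pairs.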

A. DATASET AND NETWORK
As shown in Figure 3, each reference patch corresponds to different projection patches depending on T̂. The second and third rows are the corresponding labels of the pairwise patches, that is, the Euclidean distances between the two patch centers. The values of a in (6) are set to 0.02, 0.06, 0.1, and 0.2 to ensure a wide distribution of the dataset. The dataset is divided into training and test sets at a ratio of 10 to 1.
In an ideal situation, if the pose and scene depth (T̃ and d_{u_i} in (4)) in the RGB-D dataset are accurate, the projection patch ũ*_i of the reference patch u_i under the real pose transformation T̃ should be equal to the matching point u'_i of u_i, that is, ũ*_i = u'_i. The difference between ũ*_i and u'_i can therefore characterize the accuracy of the dataset. We compared the accuracy of the patch pairs collected from the TUM RGB-D dataset (Kinect v1) and the CoRBS dataset (Kinect v2). As shown in Figure 4, the patch dataset is more accurate when built from CoRBS, which was collected with a Kinect v2.
We use this dataset to train the networks in Figure 2. The network training uses PyTorch with a learning rate of 0.01, a momentum of 0.9, and SGD as the optimization method. The training and test results are shown in Figure 5. The improved ResNet is significantly better than the ordinary regression network in convergence and generalization: ResNet has strong feature extraction and expression capabilities and is well suited to this task.
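A minimal sketch of the training step with the stated hyperparameters (SGD, learning rate 0.01, momentum 0.9) follows; the MSE loss and the stand-in linear network are assumptions of this illustration:

```python
import torch

def train_step(net, patches, labels, opt, loss_fn=torch.nn.MSELoss()):
    """One SGD step on a batch of stacked patch pairs and (du, dv) labels."""
    opt.zero_grad()
    loss = loss_fn(net(patches), labels)
    loss.backward()
    opt.step()
    return loss.item()

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(2 * 32 * 32, 2))  # stand-in
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)  # as stated in the paper
```

Iterating the step over batches drawn from the patch dataset drives the loss down toward the label noise floor.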
The statistical results of the improved ResNet output are shown in Figure 6. The horizontal and vertical coordinates in the figure are the network label and output, respectively. The scattered points are averages over 1000 sets of data. Most of the data are ideally distributed on the line y = x. When the projection error is large, the absolute value of the network output is slightly smaller than the true label value, because the network fits mostly data with small alignment errors.

B. POSE OPTIMIZATION ACCURACY
The pose optimization is performed on the images in the test set of CoRBS. The relative pose between frames is optimized by the feature-based method, the intensity-based method, and our method using the learned metric. The optimization model is divided into two types: 1) pose only: the pixel depth is known and considered accurate, and only the pose is optimized; 2) with point: the coordinates of the map points have errors and are optimized together with the pose. The position and attitude accuracy are represented by the relative pose error (RPE) and the rotation angle (RA) of the relative pose, respectively:

$$\mathrm{RPE} = \left\| \mathrm{trans}\!\left(T^{*-1} T\right) \right\|, \qquad \mathrm{RA} = \left\| \mathrm{rotat}\!\left(T^{*-1} T\right) \right\|,$$

where T is the optimized pose, T* is the true value, and trans and rotat refer to the translational and rotational components of the relative transform, respectively. We set several different initial value distributions to verify the convergence of the optimization algorithm under various initial values.
The initial poses are generated according to

$$T_0 = \Delta T \cdot T^{*}, \tag{15}$$

where T_0 is the initial pose, T* is the true value, and ΔT is a random interference whose scale is set by a; the values of a are set to 0.02, 0.06, 0.1, and 0.2 to represent four initial value environments. Under the four initial conditions of the two optimization modes, we verify the optimization results of the three methods, obtaining the position and angle errors. A total of eight simulations are conducted, and each simulation generates 900 sets of random initial poses for the sequence images in accordance with (15). As a result, 900 sets of optimization results are obtained for each of the three methods. The statistical results (position and attitude errors) of the eight groups of simulations are shown in Table 1. The 900 sets of data for each simulation are sorted by initial value and divided into nine groups, and the mean and standard deviation of the 100 data points in each group are calculated. The result of each simulation is shown in Figure 7.
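The RPE and RA error measures can be computed as follows (a common realization in NumPy; the exact matrix convention is an assumption consistent with the definitions in the text):

```python
import numpy as np

def rpe_and_ra(T, T_star):
    """Relative pose error (translation norm) and rotation angle of the
    relative transform between optimized pose T and ground truth T_star (4x4)."""
    E = np.linalg.inv(T_star) @ T          # relative transform
    rpe = np.linalg.norm(E[:3, 3])         # translational component
    cos_a = np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return rpe, np.arccos(cos_a)           # rotation angle in radians
```

A pure translation yields its norm as RPE with zero RA, and a pure rotation yields its angle as RA, which makes the two measures easy to validate independently.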
Table 1 shows the optimization accuracy of the position and attitude for the two modes (pose only and with point) under different initial values. The pose accuracy of all three methods decreases as the initial environment becomes increasingly harsh. However, our method has weak initial-value dependence and maintains high precision under harsh initial conditions. The learned metric method achieves the highest precision in 13 of the 16 groups of simulation results.

Figure 7 compares the three methods in terms of position and attitude accuracy under the four initial conditions of the two optimization modes. Each row of the graph presents the optimization results under one initial condition; the first two columns are the position and attitude results in pose only mode, and the last two columns are the results in with point mode.

For pose only mode, when the initial value condition is good (a = 0.02), the traditional methods converge effectively, and the three methods reach nearly the same precision. The true poses of the dataset also contain small errors, so the algorithm error cannot approach zero indefinitely. In this case, the error between the network output metric and the real label makes our method slightly inferior to the other two methods. As the initial condition becomes increasingly harsh, the method based on the learned metric maintains strong robustness and good convergence, and its optimization accuracy is much higher than those of the other two methods.

When the map points are involved in the optimization, the optimization precision decreases considerably compared with pose only mode because of the noise added to the map point positions. For the intensity-based method using SSD, the accuracy reduction is the most evident, sometimes even worse than the initial value, because of the strong initial-value dependence of the SSD metric.
Comparing a) and b) with c) and d) shows that including the map points in the optimization further improves the accuracy when the initial value is close to the true value, even though the map points are noisy. Our method and the feature-based method converge better under this condition, and the small errors in the depth image are corrected by the optimization.
In short, our learned metric performs better than the computed metric SSD. The learned metric-based optimization is also better than the SSD-based direct method and even better than the feature-based method, which requires feature extraction and matching.
We draw a gradient verification graph to further compare the optimization performance of the algorithms. This graph takes the deviation between the current and true values as the horizontal axis and the optimization parameter increment calculated by the algorithm as the vertical axis. As shown in Figure 8, ideally the increment calculated by the algorithm is equal in magnitude and opposite in sign to the deviation between the current and true values, in which case the algorithm converges to the true value in the most efficient way. Thus, the best distribution of points in the figure lies along y = −x. The learned metric-based optimization method is close to this distribution by virtue of its reasonable gradient calculation and robustness. The results of the intensity-based method using SSD scatter around (0, 0), and the algorithm is likely to converge only when the estimated value is close to the true value; this indicates the initial-value dependence of the SSD metric. Part of the results of the ORB feature-based method are distributed along y = −x, but the poor robustness caused by mismatching leads to partially unsatisfactory gradient calculations. The gradient calculation along z is particularly poor due to the motion pattern of the sequence images.

C. PARAMETER ANALYSIS
To fully analyze the learned metric and the network performance, the pose accuracy with different numbers of features is calculated to verify the impact of mismatches on the optimization system, and different labels are used to train the network to verify whether the network has learned the inherent law relating the patches.

1) FEATURE POINT NUMBER
We calculate the accuracy of the three optimization methods under different feature numbers; the intensity-based method only uses the positions of the features. As shown in Figure 9, the accuracy of the SSD- and learned metric-based methods increases as the number of features increases, because a large number of features enhances their robustness. However, the accuracy of the feature-based method decreases, because additional features introduce more mismatches.

The best output of the designed network should be equal to the label. In that case, the optimization objective of the learned metric-based method is the same as that of the feature-based method; from this perspective, the upper bound of our method's precision is the accuracy of the feature-based method. In practice, however, the accuracy and robustness of our method are better because it does not involve matching and therefore carries no risk of mismatch.

2) NETWORK TRAINING LABEL
When making the dataset, the label is calculated in accordance with (5). When T̂ = I, the projected point ũ_i equals ũ*_i, which should coincide with the matching point u'_i; however, a slight error exists between the two due to errors in the true pose and depth maps of the dataset. If ũ*_i is used instead of u'_i, the label corresponds to the true pose in the dataset. We therefore also define this variant, Label*, to train the network.
The networks trained with these two labels are each used for pose optimization. The optimization results are shown in Figure 10. The optimization results using the network trained with Label show higher precision and stronger robustness. When results with RPE greater than 0.1 or RA greater than 0.02 are eliminated, the average accuracy of Label* is higher. The difference in optimization results is caused by the weak error between u'_i and ũ*_i. This comparison shows that the mapping from patches to Label* is unstable, so the network training does not generalize well; as a result, only data close to the training samples yield good optimization results.
Training with Label in (5) allows the network to learn the inherent mapping from patches to their center differences in addition to fitting the data. This mapping is stable, which indicates that we have applied the neural network to a reasonable part of the problem.

3) JACOBIAN CALCULATION
For the objective function of the learned metric-based method in (12), if the corresponding Jacobian is calculated directly, it should be

$$J = \frac{\partial \mathrm{Net}}{\partial Pa_t(\tilde{u})} \frac{\partial I_t}{\partial \tilde{u}} \frac{\partial \tilde{u}}{\partial \delta\xi}, \tag{17}$$

where ∂Net/∂Pa_t(ũ) is a 2 × 1024 derivative matrix of the network output with respect to the input patch, obtained by network backpropagation. We also tried optimization based on this Jacobian, but the calculation of ∂Net/∂Pa_t(ũ) is slow and the optimization accuracy is low. This result may be caused by a deviation of ∂I_t/∂ũ or by outliers in ∂Net/∂Pa_t(ũ).
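The term ∂Net/∂Pa_t(ũ) can be obtained by backpropagation as described; a minimal PyTorch sketch follows (the tiny stand-in network and the tensor layout are assumptions of this illustration):

```python
import torch

def network_patch_jacobian(net, pa_r, pa_t):
    """2 x 1024 Jacobian of the network output w.r.t. the target patch pa_t.
    pa_r, pa_t: 1x32x32 tensors (reference and projected target patches)."""
    # stack the pair along the channel dimension and add a batch dimension
    f = lambda p: net(torch.cat([pa_r, p], dim=0).unsqueeze(0)).squeeze(0)
    J = torch.autograd.functional.jacobian(f, pa_t)  # shape (2, 1, 32, 32)
    return J.reshape(2, -1)
```

For a purely linear stand-in network, this Jacobian reduces exactly to the weight columns that multiply the target patch, which provides a direct correctness check.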

V. DISCUSSION
We discuss the relationships of our research with visual odometry and with other learning-based pose estimation.

A. RELATIONSHIPS WITH VISUAL ODOMETRY
A high-precision visual odometry requires various auxiliary strategies, such as front-end pose initialization, image pyramids, sliding windows, and local map optimization. We did not use any auxiliary strategies in the pose optimization, so as to clearly verify the advantages of the learned metric over traditional metrics. To design a visual odometry based on a learned metric with the above-mentioned auxiliary strategies, we need to do the following: • For a specific pose initialization method, the difference between the initialized pose and the true value should conform to a certain distribution, and the dataset should be collected on the basis of this distribution to train the network so that the learned metric can adapt to the motion mode.
• The patches should be input into the network in batches to exploit the efficiency of distributed neural network computing. Adjusting the size and number of batches, together with the method in [46], can balance the speed and accuracy of the visual odometry.
• The learned metric can also be applied to mature SLAM systems, such as ORB and DSO, to improve their performance.

B. RELATIONSHIPS WITH OTHER LEARNING-BASED POSE ESTIMATION
Some works solve the relative pose between frames by supervised learning [31], unsupervised learning [32], or network-based optimization [47]. These works take entire images as the network input and directly output the pose between the images. Our study addresses a more fundamental level of image processing, and the research results are applicable to both traditional geometric and learning-based pose estimation methods.

VI. CONCLUSION
Pose optimization is the basis of geometric vision tasks such as 3D reconstruction, SFM, and visual odometry. A pose optimization method based on a learned metric is proposed. The process of optimization is uncertain, the distribution of patch samples is difficult to predict, and the accuracy of the network output affects the optimization performance; these difficulties require the network to have strong generalization and learning ability. The optimization performance also depends on a reasonable application of the learned metric and the corresponding Jacobian design. The main contributions are summarized as follows: • The neural network was designed and trained on the collected patch datasets, and its convergence and generalization were verified.
• The optimization method based on the learned metric was designed using a reasonable Jacobian, and the high precision and robustness were verified by simulation through comparison with intensity-and feature-based methods.
• The affecting factors of the accuracy and robustness of pose optimization were analyzed to reveal the reason for the good performance of the proposed method, which provides a valuable reference for further development in this research field.
The target patch input to the network is obtained by projection, so the entire pose estimation process requires no feature extraction or matching; thus, the method can be applied in a wide range of scenarios. The improvement in optimization results brought by the learned metric is verified: the learned metric increases the optimization convexity and the robustness with respect to the initial value, and the absence of matching avoids the damage that mismatches cause to the system. Despite the slight error between the network output and the true value δu, the residuals and Jacobians calculated from a large number of patch samples almost erase the effect of this error. This way of 'learning + model computing' provides a good reference for the solution of geometric vision problems.
Different patch sizes and patterns can also be used to learn the metrics to find the most appropriate results. Apart from the difference between patch centers, additional metrics can be learned from patches to guide the optimization. The metric can also be learned in a targeted manner according to different application modes to enhance its suitability.

WEILIN GUO received the B.S. degree from the High-Tech Institute of Xi'an, Xi'an, China, in 2014, where he is currently pursuing the Ph.D. degree. His research interests include aircraft design, integrated navigation, and artificial neural networks.