Fast SIFT Feature Matching Algorithm Based on Geometric Transformation

Structure from Motion (SfM) is a series of methods for reconstructing scene structure (i


I. INTRODUCTION
Motion recovery structure(Structure from Motion, SfM)is a series of methods to reconstruct scene structure (i.e. threedimensional space points) and camera motion (i.e. camera pose) from image centralization. The typical methods are incremental reconstruction and global reconstruction. Firstly, for these two reconstruction methods, feature point matching is the most time-consuming stage. Especially when there is no prior information of matching order between images in the image set (i.e. the disordered image set), all image pairs need to be matched. Secondly, compared with global reconstruction methods, incremental reconstruction methods have the advantages of high reconstruction accuracy and robustness to external points, while its disadvantages are large time complexity caused by the selection of initial image pairs and unable to stop the loop. Compared with incremental reconstruction methods, global reconstruction methods have the advantages of fast reconstruction speed, accurate closedloop, and its disadvantages are low reconstruction accuracy and robustness to external points. Fast, closed-loop and high precision reconstruction method is particularly important. Based on the hybrid formula proposed in 2017, this paper proposes a method aimed at improving the performance of The associate editor coordinating the review of this manuscript and approving it for publication was Zhaoqing Pan . feature point matching stage: a SIFT matching algorithm GeoMatch(Geometric structure and SIFT-based Matching algorithm)based on scene geometric structure constraints and numerical statistical features of feature invariant scale transformation(Scale Invariant Feature Transform, SIFT) is proposed. Experiments show that GeoMatch outperforms both traditional tree-based and hash-based matching methods in terms of time and accuracy.

II. RELEVANT WORK
The known Structure from Motion (SfM) technology can be roughly divided into two parts: one is the description and matching of feature points, the other is the calculation of motion and scene structure of two cameras.

A. DESCRIPTION AND MATCHING OF FEATURE POINTES
Feature point matching is essentially the Nearest Neighbor retrieval problem (NN). The nearest neighbor search has the following definition [1]: There is a non-empty point set T D in D-dimensional space, if |T D | >1, then for ∀q∈T D , there is NN (q) = argmin is a vector distance metric function [2], and Nearest Neighbor (NN) is the nearest neighbor function notation. VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Generally, feature point matching methods can be divided into two categories: standard matching technologies and approximate matching technologies. The standard matching technologies of feature points include linear search and tree-based search. The basic idea of the linear search is to select the nearest vector by comparing the distances of all vectors in the set T D for any query vector Q. The advantage of this method is that it is very simple to implement, but the disadvantage is that it is very time-consuming [3]. The query time complexity for a single feature point is O(|T D |). Because the time complexity of the linear search is too high, document [1] proposed Kd-tree for nearest neighbor search, and the time complexity of single record query can reach O(log|T D |). The literature [4] proposed a tree K-means method for Nearest Neighbor retrieval. This method first classifies data sets into K classes by K-means method and then divides the K classes recursively. Although the original Kd-tree search has been greatly improved compared with the linear search and works well when the data dimension is low, the time performance is not ideal with the increase of dimension in actual use [5]. In many cases, the time complexity is even worse than the linear search [6]. Document [7] proposed a tree-like K-means similar to document [4], which only uses data points instead of clustering mean in clustering center.
In [8], a new data structure spill-tree is proposed, which allows data to appear repeatedly in the node's sub-nodes. However, experiments show that the temporal and spatial performance of multiple random Kd-trees is better than spill-tree.
In [9], the author proposed a fast method of searching the nearest neighbor in large-scale data.
The data structure used in this method is similar to that in document [4] but only a single leaf node can be accessed.
When the scene is very large and complex, which leads to a large number of descriptors, how to store a large number of descriptors for fast retrieval is a very significant problem. One is quantization [10], [11] and dimensionality reduction [3], [12], the other is hash-based method [13]. The hash-based method not only expresses the original data more compactly, but also calculates the distance on Hemingway metric space more efficiently than Lp norm.

B. MOTION RECOVERY STRUCTURE
At present, the main method of monocular sparse reconstruction is the motion recovery structure (SfM). According to the different methods of camera initial motion estimation, SfM can be divided into Incremental Reconstruction [14]- [16], Global Reconstruction [17]- [19] and Hybrid Reconstruction [20].
The main method of Incremental Reconstruction is to estimate camera motion and structure by adding images to a reconstructed system step by step. This estimation will iterate over and over again. With the increase of images, the more parameters need to be estimated and optimized, and the iteration will be slower and slower. Besides, the cumulative errors will lead to Scene Drift [21]. According to different process of adding images, Incremental Reconstruction can be divided into two kinds: one is to select several images as the initial image group for reconstruction, and then add new images for iteration; the other is to cluster the images into several groups, reconstruct these groups separately, and then incrementally merge these groups. The main processes of Incremental Reconstruction are internal parameter extraction, feature point extraction, feature point matching, feature point mismatching filtering, Fundamental Matrix, Essential Matrix, Triangulation, and Bundle Adjustment.
There are many motion estimation methods for Global Reconstruction. Documents [18], [19] uses a matrix decomposition method based on rank theory to obtain camera motion estimation and point depth estimation, but this kind of method has the disadvantage of tracking corresponding points on all frames. Literature [17] uses a linear fitting method to estimate camera rotation before camera translation, but the accuracy of this method is not high.
Compared with Incremental Reconstruction, Global Reconstruction reduces the number of iteration optimization, and the error will be globally or evenly dispersed in each camera without accumulating. However, because of the sensitivity to external points, the reconstruction accuracy is not as high as Incremental Reconstruction.

III. FAST SIFT FEATURE MATCHING ALGORITHM BASED ON GEOMETRIC TRANSFORMATION
In this paper, a fast SIFT feature matching algorithm (Geometry transformation-based Feature Matching, GeoMatch) based on geometric transformation is proposed. The algorithm is divided into two stages, the first stage is initialization, and the second stage is fast matching based on geometric transformation. The initialization stage is divided into two steps: the first step is to match QSearch with global descriptors from rough to fine, and the second step is to reduce the dimension of SIFT descriptors. A more detailed algorithm flow of GeoMatch is shown in Table 1. Next, we will describe in detail the initialization phase and fast matching based on geometric transformation.

A. LINEAR SEARCH BASED ON CROSS QUADTREE
Two images with the same content should satisfy the polar geometric relationship. The polar geometric relationship between two-dimensional image points is described by the basic matrix F: In (1), x 1 and x 2 denote a matching pair of image pixels, F = K T 2 EK 1 is the basic matrix, where K 1 and K 2 are the internal parameter matrices corresponding to two images respectively, E is the essential matrix, describing the corresponding relationship between three-dimensional points, and T represents the transposition. This formula shows that if the image points x 1 and the basic matrix F are already in existence, a polar line x T Fx 1 = 0 can be determined in image 2. Solving the nearest neighbor is converted to solving the equation x T Fx 1 = 0. Since the solution obtained is not necessarily the desired feature point, it is necessary to search along the epipolar line to find the most matching feature point of the descriptor. In this way, the two-dimensional search problem is transformed into a one-dimensional search problem. However, if the scene described by an image is on a plane or an approximate plane, the projective mapping matrix (i.e. homography Matrix) can be used to describe the geometric relationship between points more accurately. Because the homography matrix is needed in feature matching, the derivation of the homography matrix is briefly introduced below.
Suppose the plane equation is n T P + d = 0, n T is a plane normal vector, D is a plane intercept and P is a 3D point. We can get the following equation.
Suppose there are two cameras: Camera 1 and Camera 2, whose projection transformation parameters are (2), the following deductions are made: When calculating the homography matrix, we need to take an image as a reference image, and then take image 1 as a reference. There are K 1 = E, t 1 = 0, substitution (3), there are: In formula (4), is the corresponding matrix. In feature matching, because there is no prior knowledge of whether the feature points are coplanar or not, the matrix with small re-projection error between H and F is generally used as the transformation matrix in actual matching operations.
When using H and F to match, H and F need to be calculated beforehand. The calculation of H requires 4 pairs of matching points, and the calculation of F generally requires 8 pairs of matching points. Therefore, a certain number of matching points (20 to 30 pairs) needs to be calculated first. To get the initial matching point pairs as simple as possible, this paper uses a linear search. But the efficiency of the linear search is too low, so a linear search QSearch algorithm based on cross-Quadtree is proposed to speed up the search process.
The so-called cross-Quadtree means that the different areas divided by Quadtree are not completely independent, which have certain cross-areas as shown in Figure 1. Assuming a plane as shown by dotted lines, the plane is evenly divided by dotted lines and the upper left area is marked by black dotted lines. However, to increase matching accuracy (explained below), the actual coverage of the upper left area is marked by the blackened solid line. It can be seen that the four larger areas intersect with each other. In order to match feature point descriptors, QSearch partitions the image plane several times and then calculates a global descriptor in each partitioned sub-region. All images are partitioned and the descriptor is computed in this way. For the two images that need to be matched, a coarse-tofine search strategy is adopted. Firstly, the global descriptor is matched to get the approximate matching region, and then the feature point descriptor is matched in the matching region. As shown in Figure 1, firstly, the matching of global descriptors is performed from four largest regions to obtain the most matched region 1 (the region marked by dots and lines); secondly, the most matched sub-region 2 (marked by dots and lines) is obtained from the four sub-regions of region 1; finally, the most matched sub-region 3 is obtained from the four sub-regions of region 2. After region 3 is obtained, feature point descriptors are matched in the region.
If there is no coverage between regions, the features appearing at the boundary of one image region are likely to be transferred to adjacent regions in the next image. In order to solve this problem, when dividing space into sub-regions, sub-regions will overlap with each other partially.

B. REDUCED DIMENSION MATCHING OF SIFT FEATURE DESCRIPTORS
In this section, statistics of SIFT numerical frequency distribution and cumulative distribution are made for four scenarios in the Oxford data set. The statistical results are shown in Figure 2 and Figure 3.  As can be seen from Figures 2 and 3, the numerical frequency characteristics of SIFT features of different images are highly consistent. Firstly, from the frequency distribution, we can see that the frequency of numerical value 0-10 is above 2%, and that of 10-255 is below 2%. Secondly, it can be seen from the cumulative distribution that the probability of the value appearing in the interval [0,100] is about 90%. This shows that if a numerical value x is sampled from several SIFT features of a natural image, the probability of X appearing in the interval [0,100] is 90%. Finally, it can be seen from the frequency distribution chart that the larger value [101,255] will still appear with a lower probability (about 10%).
In summary, two statistical features of SIFT can be obtained: one is that the probability of arbitrarily sampled values from several SIFTs falling into the interval [0,100] is 90%, and the other is that the probability of arbitrarily sampled values from several SIFTs falling into the interval [101, 255] is 10%.
Based on the above two features, it can be further deduced that when matching features, 128 dimensions of one feature can be matched from large to small, and only a few of the small dimensions need to be matched. It's not necessary to match all the dimensions, to achieve the goal of dimensionality reduction.

IV. COMPLEXITY ANALYSIS A. COMPLEXITY OF INITIALIZATION PHASE
The complexity of initialization depends on the implementation strategy. Specifically, it depends on the selection and calculation of global descriptors and the height of Quadtree. Assuming that the image size is w * h, the height of Quadtree is H, and the global descriptor is a gray histogram which takes O w * h 4 H to compute and requires O (1) space. If the bottom-up global descriptor method is adopted, the time complexity of computing all global descriptors is as follows

B. MATCHING COMPLEXITY BASED ON GEOMETRIC TRANSFORM
The complexity of this stage depends on the transformations used.
Assuming that the average number of vectors in the queried area is M, the average number of vectors in the queried area is N (finding the matching vectors of N vectors among M vectors), the dimension of the query is d, and the dimension of the feature is D.
Firstly, for each feature, it is necessary to select the dimension before selecting the eigenvalue from the 128-dimensional eigenvector, which depends on the specific algorithm. If relying on sorting algorithm, the heap sorting with the best average time complexity can be used to solve the problem with big D. The time complexity is O (128 * logD), because the constant 128 is larger than D, so the constant is not ignored here. The experiment uses a faster nth_element method with the average time complexity of O(128).
Secondly, each feature needs to search the nearest neighbors linearly on the selected d dimensions. Since the d dimensions used for each query may be different, Kd-tree acceleration is not appropriate here. The time complexity of the linear search is O(M * d).
Therefore, the optimal average time complexity of a query is O (128 + M * d).
Also, to query efficiently, the algorithm uses O (M * D) auxiliary space to save the transpose of the feature matrix in the database to index the value of one dimension of all features.

V. EXPERIMENTAL DESIGN
In this experiment design, SIFT features are extracted from two data sets: Oxford data set [22] and Tsinghua data set [23].
In order to evaluate the speed and accuracy of the proposed algorithm GeoMatch, two typical matching strategies are selected: linear search BruteForce and random Kd-tree matching. Besides, CasHash (Cascade Hash) proposed by literature [3] is added as a comparison according to the results of literature retrieval in the last three years.
The experimental environment is desktop PC, Windows 10 64 bit operating system, i5-4590 CPU, 8GB memory, C++ language. To be fair, the experiment shuts down all multithreaded acceleration, using only one thread. Because CasHash has no available CPU version on the public website, this experiment has compiled a single-threaded CPU version of CasHash.

A. EXPERIMENTS ON OXFORDS DATA SET
This experiment uses the standard Oxford data set [22] to evaluate the acceleration ratio [3] and accuracy [24] of four algorithms in SIFT feature matching.
The acceleration ratio of Kd-tree, CasHash, and GeoMatch to linear search BruteForce was measured. Accelerate Rate is defined as AR = t (BruteForce) / T (current algorithm). GeoMatch algorithm has several main parameters: the global descriptor used, the height of Quadtree and the dimension used for matching. In this experiment, the global descriptor uses the gray histogram, the height of Quadtree is 4 and the dimension setting standard is that GeoMatch algorithm has the same recall rate as CasHash. The experimental results are shown in Table 2. As can be seen from Table 2, the proposed algorithm has the best time performance under the same recall rate when matching SIFT features.
The evaluation criteria used in the algorithm are from the literature [3]. The accuracy and recall rates of the algorithm are tested under four scenarios (far-near rotation, ambiguity, JPEG compression, viewpoint change). According to the convention, the accuracy rate needs to be converted to the error rate, i.e. 1-Precision. The experimental results are shown in Fig. 4.
The results show that the proposed algorithm GeoMatch is equal to Kd-tree on the whole but significantly better than CasHash in matching accuracy and recall rate experiments for SIFT features.

B. EXPERIMENTS ON TSINGHUA DATA SET
The purpose of this experiment is to test the impact of different matching algorithms on the final dense reconstruction, mainly focus on the accuracy and completeness of the model [24]. Assuming that the reconstructed model is R, the real model is G, the set of points corresponding from VOLUME 8, 2020 R to G is RG, and the set of points corresponding from G to R is GR (usually, RG and RG are not the same), RG should remove the case that points in multiple R correspond to points in one G. Accuracy is a dimensionless distance, which depends on how the model scales up and down. If R and G are aligned according to the actual size, then R and G are dimensionless. If R and G are relatively aligned, then there is dimensionless distance. It is defined as follows: If the set RG is not empty, the elements in RG are sorted from small to large according to the distance between corresponding points. With proportion X∈[0, 1], the distance between the corresponding points represented by the |RG| * X elements in RG. Among the formula, | · | denotes the number of elements in a set, which is also called the base of a set. And · is a downward integer operation. The concept of completeness is similar to accuracy, just by exchanging the contents of R and G.
The data set used in the experiment is the Tsinghua School and Tsinghua Life Sciences Building in document [22], the Campus data set with 1040 images [29], and Trafalgar, with 4591 images [30]. There are 193 and 102 images respectively in Tsinghua School and Tsinghua Life Sciences Building data set. Meanwhile, the data set gives the real value data of buildings obtained by Riegl-LMS-Z420i laser scanner. The main body of the program used in the experiment is Bunlder [26] andi PMVS2 [25], which only need to modify the key points of matching module. Because the coordinate scale and spatial orientation used between R and    G are inconsistent, the model needs to be scaled and aligned before calculating accuracy and completeness. In this experiment, point fixing in MeshLab [27] and ICP (Iterative Closest Points) methods are used for rotation and translation and scaling is used to align the scale of R with that of G.
The main variable of GeoMatch is the number of dimensions used. The main variable of CasHash is the number of bits encoded by hash bucket. The best of the number of bits is 8 according to literature [3]. The experimental results are shown in Tables 3, 4, 5, 6, 7. As can be seen from Tables 3, 6, 7, the algorithm proposed in this chapter has the best performance in terms of time complexity and acceleration ratio at the appropriate sacrifice of the number of point clouds. As can be seen from Tables 4 and 5, the algorithm presented in this chapter is the best in most sensitivity tests. VOLUME 8, 2020 In Fig. 5, the left one is a real building and the right one is a reconstruction building. Intuitively speaking, the matching algorithm proposed in this paper can perform feature points matching well to complete the reconstruction task.

VI. TOTAL CONCLUSION
A fast matching algorithm GeoMatch for SIFT features is proposed based on the contrapole geometry and SIFT numerical distribution statistics of the scene. Experiments show that GeoMatch can greatly improve the matching speed of SIFT descriptors at the expense of a small number of threedimensional points. At the same time, it can get a good guarantee of accuracy and integrity.