An Accelerated and Robust Partial Registration Algorithm for Point Clouds

Partial registration for point clouds plays an important role in various fields such as 3D mapping reconstruction, remote sensing, unmanned driving, and cultural heritage protection. Unfortunately, partial registration is challenging due to difficulties such as the low overlap ratio between two point clouds and the perturbation in orderless and sparse 3D point clouds. Thus, a variety of 3D shape context descriptors have been introduced for finding the optimal matching. However, extracting geometric features and descriptors is time-consuming and easily degraded by noise. To overcome these problems, we introduce a parallel coarse-to-fine partial registration method. Our contributions can be summarized as follows. Firstly, a robust coarse trimmed method is proposed to estimate the coarse overlap area and the initial transformation via fast bilateral denoising and parallel point feature histogram (PPFH) descriptor aligning. Secondly, an accelerated fine registration procedure is conducted by a parallel trimmed iterative closest point (PTrICP) method. Moreover, most parts of our coarse-to-fine workflow are accelerated under the Graphics Processing Unit (GPU) parallel execution mode for efficiency. Thirdly, we extend our method from rigid registration to isotropic scaling registration, which improves its applicability. Experiments have demonstrated that our method is feasible and robust in various situations, including low overlap ratio, outliers, noise and scaling.


I. INTRODUCTION
Point cloud data processing has been a hot topic in recent years [1], [2]. Point cloud registration, in particular, is a fundamental problem in computer vision and remote sensing [3]-[10]. It is the basis of various tasks, including surface alignment [11], [12], pose estimation [13], [14], 3D reconstruction [15]-[17], object recognition [18], and Simultaneous Localization And Mapping (SLAM) [19]. The aim of point cloud registration is to obtain an optimal matching that transforms point sets from various views into one global coordinate system.
Iterative Closest Point (ICP), a widely used method, casts the point cloud registration as an iterative transformation process [20]. The original ICP algorithm took every point-to-point distance into consideration for estimating rigid transformations, which was suitable for aligning two closely positioned and similarly shaped point clouds. Chen and Medioni gave another version of the ICP algorithm, which utilized a nonlinear process based on point-to-plane distances in order to accelerate the convergence [21]. Besides, the generalized ICP (GICP) [22] placed the registration model in a probabilistic framework, and used nonlinear optimization instead of closed-form solutions in the transformation estimation. As a result, this plane-to-plane ICP method is more stable and robust to measurement noise. In general, as mentioned in [23], ICP and its variants perform well in ideal cases [24]-[26].
Unfortunately, noise is inevitable due to acquisition instruments, lighting, reflection, outliers or occlusion in the scene. Thus, point cloud processing usually includes the noise smoothing and trimming procedures [27], [28]. In general, the noise in point clouds can be classified as perturbation noise, missing points, and outliers. These noise types bring the uncertainty for registration, and also contribute to the low overlap ratio of the point clouds.
To reduce noise, some widely used filtering algorithms have been proposed for point clouds, including statistics-based filtering techniques [28]-[31]. These techniques adjust the location of each point via different projection strategies, and determine the filtered position using similarity measures between a point and its neighbors. However, such noise smoothing algorithms cannot solve the low overlap ratio problem caused by missing points and outliers.
Thus, partial registration, i.e., the registration of point clouds that overlap only partially, is challenging due to occlusion, missing points, outliers and measurement noise. The state-of-the-art partial registration methods are mainly based on randomized alignment or overlap ratio estimation [7]. Widely used randomized alignment based partial registration methods include the 4-point congruent sets (4PCS) and Super4PCS [32], [33]. These methods use the invariance of the ratios between the lines formed by four coplanar points, and avoid the calculation of complex geometric features. Hence, 4PCS based algorithms are relatively robust for partial registration. On the other hand, the trimmed ICP method was introduced via overlap ratio estimation [34]-[37]; it trims outliers and makes the algorithm suitable for relatively low overlap ratios (under 50%). However, an initial transformation and an additional coarse registration procedure are still necessary when the overlap ratio is low or the noise level is high.
More specifically, a coarse registration algorithm can estimate a rough transformation under harsh conditions, thus providing a good initial transformation for the fine registration. Besides the aforementioned randomized aligning methods, principal component analysis (PCA) [38] is often used to obtain the principal axes of two point sets. These global matching methods then align the centers and principal axes for the coarse transformation, but they fail when the two point clouds have a low overlap ratio. Furthermore, some coarse registration methods operate across different scanning resolutions [39]. Han et al. proposed a hierarchical searching scheme for multi-resolution data to improve the robustness with respect to local minima [40]. Recently, a coarse-to-fine GICP algorithm combined the plane-to-plane and point-to-point trimmed ICP, balancing stability and accuracy by changing the neighborhood search range from wide-base to narrow-base in each iteration [37]. However, this adaptive strategy still fails under a low overlap ratio.
This article introduces a coarse-to-fine trimmed strategy to solve the low overlap problem. In the coarse trimmed stage, we trim the majority of non-informative points and estimate the initial overlap area via parallel point feature histogram (PFH) descriptor matching, in which a neighborhood based bilateral filter smoothes noise, since the PFH feature estimation is sensitive to noise [41], [42]. Furthermore, a fine registration procedure is conducted by a parallel modified trimmed ICP based on the initial overlap ratio and the corresponding transformation. Finally, most parts of our coarse-to-fine workflow are accelerated under the GPU parallel execution mode to improve efficiency.
This article is organized as follows. In section II we introduce the preliminaries of our coarse trimmed stage, such as the classical PFH descriptor estimation and the bilateral filter. Section III describes the details of our coarse-to-fine registration method, including the GPU parallel implementation. Section IV shows the experiments and analysis. Section V concludes the paper.

II. PRELIMINARY
As mentioned in the Introduction, neither the global matching nor the hierarchical searching schemes are suitable for partial registration, since these coarse matching methods are sensitive to the low overlap problem [32], [33], [38]-[40]. Thus, as a first step, some geometric descriptor based matching methods are introduced for the coarse partial registration. As we know, the choice of descriptor is the key to feature alignment. A good feature descriptor should be robust to noise and invariant to the position and orientation of the point cloud [43]. For example, Rusu et al. designed the PFH [44] and the Fast PFH [45], based on the spatial relationships of a point with its nearest neighbors.
In this section, we give some preparation for our new coarse trim strategy. The traditional PFH is presented in section II-A, and the bilateral filter preprocessing for reducing the noise is introduced in section II-B.

A. POINT FEATURE HISTOGRAM
The PFH descriptor is computed as a histogram of geometric relations between all point pairs in the neighborhood. Firstly, PFH characterizes the relationship between the k-neighborhood points p_s and p_t via their estimated normal vectors n_s and n_t. Using the Darboux frame u = n_s, v = u × (p_t − p_s)/‖p_t − p_s‖, w = u × v, the angular features are

α = v · n_t,  φ = u · (p_t − p_s)/‖p_t − p_s‖,  θ = arctan(w · n_t, u · n_t),

where α, φ and θ constitute a triplet (α, φ, θ) in the k-neighborhood (Figure 1). All triplets are then binned into a histogram, as shown in Figure 2. The binning process separates each feature value range into several subdivisions (125 subdivisions in total in our algorithm), and counts the number of occurrences in each subdivision.
Moreover, PFH descriptor estimation has a high complexity of O(mk²), where m is the number of points involved in the PFH computation, and k is the number of nearest neighbors of each point. In this article, parallel implementation methods are proposed to reduce the computation time.
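As a concrete illustration, the triplet computation for one point pair can be sketched sequentially as follows. This is a minimal sketch following the standard Darboux-frame construction of PFH (u = n_s, v = u × d̂, w = u × v, with d̂ the normalized direction from p_s to p_t); the helper names (`Vec3`, `pfh_triplet`) are ours, not the article's.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static Vec3 normalize(const Vec3& a) {
    double n = std::sqrt(dot(a, a));
    return {a[0]/n, a[1]/n, a[2]/n};
}

// Compute the PFH angular triplet (alpha, phi, theta) for a source/target
// point pair (ps, pt) with unit normals (ns, nt), using the Darboux frame
// u = ns, v = u x d_hat, w = u x v, where d_hat is the unit direction ps -> pt.
std::array<double, 3> pfh_triplet(const Vec3& ps, const Vec3& ns,
                                  const Vec3& pt, const Vec3& nt) {
    Vec3 d = normalize(sub(pt, ps));
    Vec3 u = ns;
    Vec3 v = normalize(cross(u, d));
    Vec3 w = cross(u, v);
    double alpha = dot(v, nt);                          // in [-1, 1]
    double phi   = dot(u, d);                           // in [-1, 1]
    double theta = std::atan2(dot(w, nt), dot(u, nt));  // in [-pi, pi]
    return {alpha, phi, theta};
}
```

In the parallel implementation of section III, one such triplet is evaluated per child GPU thread for every point pair in a neighborhood.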

B. BILATERAL FILTER
The bilateral filter is relatively efficient for denoising [28], although it is time-consuming for large point clouds; an accelerated version using OpenMP [46] was proposed in [42].
The bilateral filter [41] applies an offset correction to each point along its normal vector. Assume that p is the candidate point for denoising, and n_p is the corresponding unit normal vector. Let p_i be the neighbors of p for i = 1, · · · , N₁, and let ω_s(x) = e^{−x²/2σ_s²}, ω_c(x) = e^{−x²/2σ_c²} be two Gaussians with variances σ_s and σ_c, respectively. Then the offset correction is

δ_p = (1/W) Σ_{i=1}^{N₁} ω_s(‖p_i − p‖) ω_c(⟨n_p, p_i − p⟩) ⟨n_p, p_i − p⟩,

where W = Σ_{i=1}^{N₁} ω_s(‖p_i − p‖) ω_c(⟨n_p, p_i − p⟩) is a normalization factor. The new position of p is p̂ = p + δ_p n_p. This is equivalent to projecting the point onto a weighted regression plane with the neighboring weights ω_s(‖p_i − p‖) ω_c(⟨n_p, p_i − p⟩). The offset correction ensures that, when denoising a point near an edge, only points lying on the same facet contribute. The bilateral filter for a given point p ∈ P is summarized in Algorithm 1.
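A sequential sketch of this offset correction for one point is given below, assuming a unit normal and a precomputed neighbor list; the function name `bilateral_offset` is ours.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Point = std::array<double, 3>;

// One bilateral-filter step for a point p with unit normal np: each neighbor
// contributes its projection onto np, weighted by a spatial Gaussian (sigma_s)
// on the distance and a signal Gaussian (sigma_c) on the projected offset.
// Returns the corrected position p + delta_p * np.
Point bilateral_offset(const Point& p, const Point& np,
                       const std::vector<Point>& neighbors,
                       double sigma_s, double sigma_c) {
    double sum = 0.0, normalizer = 0.0;
    for (const Point& pi : neighbors) {
        Point d = {pi[0]-p[0], pi[1]-p[1], pi[2]-p[2]};
        double dist = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        double h = np[0]*d[0] + np[1]*d[1] + np[2]*d[2];   // <np, pi - p>
        double w = std::exp(-dist*dist / (2*sigma_s*sigma_s))
                 * std::exp(-h*h / (2*sigma_c*sigma_c));
        sum += w * h;
        normalizer += w;
    }
    double delta = (normalizer > 0.0) ? sum / normalizer : 0.0;
    return {p[0] + delta*np[0], p[1] + delta*np[1], p[2] + delta*np[2]};
}
```

Note that a neighborhood symmetric about the tangent plane yields a zero offset, so flat regions are preserved while normal-direction noise is removed.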

III. THE PROPOSED METHOD
Some notations used in our algorithm are listed here. The source point cloud (red) is P = {p_m, m = 1, . . . , M} and the target point cloud (blue) is Q = {q_l, l = 1, . . . , L}. The goal is to find the best transformation T̂ from P to Q. We compose T̂ = T* · T₀, where T₀ represents the transformation of the coarse registration, and T* represents the best transformation of the fine registration. The flow chart is given in Figure 3.
The coarse trimmed block (the yellow block) gives a good initial transformation and coarse overlap estimation, and is used to smooth the noise and trim most of the outliers. On the other hand, the fine registration algorithm (the brown block) computes the fine transformation T*. To make our acceleration strategies clear, we give some parallel computation procedure details in section III-C. Firstly, we give a detailed description of our coarse trimmed strategy.

Algorithm 1 Bilateral Filter bilateral(p, N₁, σ_s, σ_c)
1: Find the neighborhood S_{N₁}(p) of p
2: Compute the unit normal vector n_p to the regression plane from S_{N₁}(p)
3: sum = 0, normalizer = 0
4: for p_i ∈ S_{N₁}(p) do
5:   h = ⟨n_p, p_i − p⟩
6:   w = ω_s(‖p_i − p‖) · ω_c(h)
7:   sum += w · h
8:   normalizer += w
9: end for
10: p̂ ← p + (sum / normalizer) · n_p

A. COARSE TRIMMED STRATEGY FOR PARTIAL REGISTRATION
The coarse trimmed strategy, the key step of the coarse-to-fine partial registration, is used to estimate the coarse overlap area and the initial transformation. In particular, a parallel bilateral denoising method (section III-A1) and parallel point feature histogram (PPFH) aligning between key point pairs are used to eliminate the majority of non-overlap information.

1) PARALLEL BILATERAL FILTER DENOISING
The key point and PFH descriptor extraction are easily perturbed by noise, since they heavily depend on the point positions and the corresponding normal vectors. Therefore, we apply a bilateral filter to denoise the noisy point cloud, following the parallel implementation of the point cloud bilateral filter in [42]. The procedure is summarized in Algorithm 2, and the bilateral function itself is given in section II-B.

Algorithm 2 Parallelization of the Bilateral Filter
Require: Input point cloud P stored in an octree ψ, a search range N₁, and two Gaussian variances σ_s, σ_c.
1: for each octree child node C in parallel do
2:   for p ∈ C do
3:     p̂ ← bilateral(p, N₁, σ_s, σ_c)  {see Algorithm 1}
4:     Update the octree node Ĉ where p̂ is located
5:   end for
6: end for

2) PARALLEL KEY POINTS SELECTION
To save the computation cost of extracting key points from the original data, we first define the local curvature descriptor γ(p) over the N₂-neighborhood of each point p [47], where N₂ is the neighborhood search range, and i indexes the neighborhood search results of point p.
We keep a point only if it has the smallest γ value among its neighbors, i.e., we keep the edge or corner points of the local region. The key point sets are marked as P̄ = {p̄ : p̄ ∈ P} and Q̄ = {q̄ : q̄ ∈ Q}. Figure 4 illustrates the key point selection for the point clouds P and Q.
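The selection rule (keep a point only when its γ is smallest within its neighborhood) can be sketched over a precomputed γ array and neighbor index lists, as below; the function name and data layout are ours, not the article's.

```cpp
#include <cstddef>
#include <vector>

// Keep point i as a key point only if its curvature descriptor gamma[i] is
// strictly the smallest within its neighborhood. `neighbors[i]` holds the
// indices returned by the N2-range search for point i (may include i itself).
std::vector<std::size_t> select_key_points(
        const std::vector<double>& gamma,
        const std::vector<std::vector<std::size_t>>& neighbors) {
    std::vector<std::size_t> keys;
    for (std::size_t i = 0; i < gamma.size(); ++i) {
        bool is_min = true;
        for (std::size_t j : neighbors[i]) {
            if (j != i && gamma[j] <= gamma[i]) { is_min = false; break; }
        }
        if (is_min) keys.push_back(i);
    }
    return keys;
}
```

In the GPU version (section III-C5), the outer loop maps to the parent thread block and the inner comparison to a child thread block.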

3) PARALLEL PFH DESCRIPTOR COMPUTATION FOR KEY POINTS
We propose a parallel implementation of PFH for the key point sets extracted from P and Q (Figure 5). The PPFH computation mainly consists of parallel kd-tree searching and parallel normal vector estimation. Please refer to section III-C for the parallel implementation details.

4) COARSE ALIGNING
Here, with the PFH features, we search for the most similar corresponding key points via a parallel brute-force search. The resulting set of matched key point pairs yields the coarse overlap area Q₀ and the initial transformation T₀ used in the fine registration.

B. PARALLEL TRIMMED ICP FOR FINE PARTIAL REGISTRATION
With the coarse overlap area Q₀ and the initial rigid transformation T₀, we conduct the fine registration based on the parallel trimmed ICP method between Q₀ and the source point cloud P. Denoting the transformation in the current η-th iteration by T_η = {R_η, t_η}, with rotation R_η and translation t_η, the parallel trimmed ICP registration proceeds as follows:

1) UPDATE CORRESPONDENCES
For each point p_i in the source partial point cloud, find its correspondence q_{c_i^η} via a parallel implementation of the nearest point searching, where c_i^η is the index of the point in Q that matches point p_i in the η-th iteration.

2) UPDATE OVERLAP RATIO
Sort the squared correspondence distances in ascending order in parallel; then calculate the overlap ratio r_η in parallel, where λ_η = λ_{η−1} − δ is the parameter of the trimmed strategy, and the positive constants λ₀, δ are defined in advance. Besides, we pick up the first M_r = M × r_η corresponding point pairs in parallel, and define the error as the Trimmed Squared Distance (TSD) over these pairs.

3) UPDATE TRANSFORMATION
Update the rigid transformation T_η from the first M_r corresponding point pairs. Repeat the above steps until the stopping criteria are satisfied. In our algorithm, the iteration stops when any of the following three conditions holds: the maximum iteration number is reached, the TSD error falls below a threshold, or the change of the TSD error between two consecutive iterations falls below a threshold.

C. PARALLEL IMPLEMENTATION
Point cloud registration, especially for high-resolution data, usually suffers from the massive size problem, leading to long execution time. Some parallel implementation methods have been proposed to deal with this problem. In 2007, Nüchter proposed the parallel implementation of classical ICP with OpenMP and utilized this method in GraphSLAM [48]. In this decade, the GPU parallel mode began to be applied on ICP and EM-ICP [49], [50].
With the GPU development, the new generation GPU cores could handle massive point size in parallel mode. Based on this, we give the parallel execution introduction of our trimmed ICP method in Figure 7, which illustrates that most parts of the algorithm are implemented in the parallel mode, except for the simple computation such as the update of transformation matrix. The parallel implementation of bilateral filter is described in section III-A1. The fine registration could be divided into three main parts, among which the correspondence and overlap ratio update are implemented in CUDA Thrust [51].
Compared with the nearest neighbor search (NNS) based parallel ICP methods [49], [50], our contributions in parallel registration lie in: 1) we provide GPU parallelization for all modules (not limited to NNS) of the point cloud registration; 2) we implement the parallel trimmed ICP algorithm, which improves the robustness by adding the parallel overlap ratio estimation in each iteration; and 3) we use the high-level Thrust library (https://thrust.github.io/) in our parallel implementation, which strengthens the reliability of our algorithm.
Most of the parallel coarse-to-fine registration implementation is based on the parallel kd-tree search, which consists of two parts: tree construction in section III-C1 and the nearest point search in section III-C2.
We first elaborate the parallel kd-tree search implementation.

1) PARALLEL TREE CONSTRUCTION
Given L points for tree construction, and restricting each leaf node to contain at most one point, the complexity of sequential kd-tree construction is O(kL log₂ L) [52], where k is the dimension of the kd-tree. If each leaf node contains at most κ points (1 < κ ≪ L), there are approximately L/κ leaf nodes and the tree depth is log₂(L/κ). Therefore, the number of nodes is 1 + 2 + 4 + · · · + L/κ = 2L/κ − 1, and there are L/κ − 1 node splitting operations in serial processing.
In contrast, in the parallel mode, every GPU thread manipulates one parent node and its corresponding left and right child nodes, i.e., all node splitting procedures on the same level of the kd-tree are executed in parallel. Hence the parallel processing needs log₂(L/κ) splitting rounds. As a result, the parallel tree construction is about (L/κ − 1)/log₂(L/κ) times faster than the sequential implementation.
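The speedup ratio above can be made concrete with a small helper; the function name is ours, and the formula assumes a balanced tree with L/κ leaves.

```cpp
#include <cmath>

// Node-splitting work for a balanced kd-tree over L points with at most
// kappa points per leaf: L/kappa - 1 splits when done serially, versus
// log2(L/kappa) rounds when all splits on one level run in parallel.
// Returns the resulting serial/parallel speedup ratio.
double kdtree_split_speedup(double L, double kappa) {
    double leaves = L / kappa;
    return (leaves - 1.0) / std::log2(leaves);
}
```

For example, with L = 1024 points and κ = 8 points per leaf, there are 128 leaves, so 127 serial splits versus 7 parallel rounds, a factor of roughly 18.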

3) DYNAMIC PARALLELISM
Dynamic parallelism, as a programming structure template, is extensively used in our parallel normal vector estimation, key points selection, PFH descriptors estimation, and PFH descriptor alignment for key point pairs. In Figure 8, each thread in our parent GPU thread block manipulates some specific data, and each parent thread could create another GPU thread block, marked as 'child', which also executes under the parallel mode.

4) NORMAL VECTOR ESTIMATION
The normal vector estimation, based on the NNS results in section III-C2, is implemented via the dynamic parallelism mode. Each thread in the parent thread block manipulates a point p in the point cloud, and each thread in the child thread block computes one outer product p_{3×1} · q_{1×3}, where q is one of the nearest points of p. The parallelism of the child thread block reduces the complexity from O(N₀) to O(1), where N₀ is the neighborhood search range. The outer products are summed by the parent thread in parallel, and the complexity of the parallel reduction is O(log₂ N₀) [53]. Hence we get the 3 × 3 covariance matrix X_cov. Finally, the parent thread applies the eigenvalue decomposition to X_cov to get the unit normal vector of each point.

5) KEY POINT SELECTION
As mentioned in section III-A2, the key point selection is based on the normal vector estimation and the neighborhood search. We utilize the dynamic parallelism mode for the key point selection. Given M points for key point selection, each thread in the parent thread block manipulates one point, and schedules a corresponding child thread block for computing γ(p) in parallel. Assuming the nearest point range is N₂, the child thread block reduces the complexity of computing γ(p) from O(N₂) to O(log₂ N₂), and the entire dynamic parallelism reduces the complexity from O(MN₂) to O(log₂ N₂).

6) PARALLEL PFH DESCRIPTOR COMPUTATION
We propose a parallel implementation of PFH descriptor based on parallel kd-tree searching, normal vector estimation and key point selection.
In terms of constructing the histogram of features, we also utilize the dynamic parallelism mode. In the parent GPU thread block, each thread manipulates the 3D position coordinates and the corresponding normal vector of one key point, which means that the thread number of the parent thread block equals the number of key points. In the child thread block, each thread manipulates the nearest-neighbor indices of the key point inherited from its parent thread. Then each child thread computes the triplet (α, φ, θ) between the key point and one of its neighborhood points. This parallel mode reduces the complexity from O(L̄N₃²) to O(1), where L̄ is the size of the key point set, and N₃ is the nearest search range. The final statistical histogram of the point features is accumulated with the CUDA command atomicAdd, which serializes conflicting updates to avoid multi-thread memory access conflicts.
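The binning itself can be sketched sequentially as below. We assume a 5×5×5 quantization (5 subdivisions per feature, 5³ = 125 bins) with α, φ ∈ [−1, 1] and θ ∈ [−π, π]; these ranges follow the triplet definitions in section II-A, and the function name is ours. On the GPU, each `++hist[idx]` would be an atomicAdd.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Bin PFH triplets (alpha, phi, theta) into a 125-bin histogram: each of the
// three features is quantized into 5 subdivisions of its value range
// (alpha, phi in [-1, 1]; theta in [-pi, pi]), giving 5^3 = 125 bins.
std::array<int, 125> pfh_histogram(
        const std::vector<std::array<double, 3>>& triplets) {
    std::array<int, 125> hist{};
    auto bin5 = [](double v, double lo, double hi) {
        int b = static_cast<int>(5.0 * (v - lo) / (hi - lo));
        return b < 0 ? 0 : (b > 4 ? 4 : b);   // clamp boundary values
    };
    const double pi = std::acos(-1.0);
    for (const auto& t : triplets) {
        int idx = bin5(t[0], -1.0, 1.0) * 25
                + bin5(t[1], -1.0, 1.0) * 5
                + bin5(t[2], -pi, pi);
        ++hist[idx];
    }
    return hist;
}
```

Each key point's 125-bin histogram is then used as its descriptor in the aligning step.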

7) KEY POINT PFH DESCRIPTOR ALIGNING
In the key point pair feature correspondence search, each key point feature is a 125-dimensional vector. The kd-tree search method, however, is not suitable for high-dimensional feature correspondence search [54]. We therefore utilize the parallel brute-force search method: each thread in the parent GPU thread block manipulates one 125-dimensional feature vector of a key point; each parent thread creates another 'child' GPU thread block, which searches for the most similar feature in the other point cloud in parallel. Given M key points, the complexity of the parallel PFH descriptor aligning is thus reduced from O(125 · M) to O(1) per query. After successfully finding the aligned key point pairs, our algorithm obtains the initial overlap ratio r₀ and the initial transformation T₀.
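The per-query scan can be sketched sequentially as below (the GPU version parallelizes this scan inside a child thread block); the function name is ours, and squared L2 distance is assumed as the similarity measure.

```cpp
#include <cstddef>
#include <vector>

// For one query descriptor (e.g. a 125-dimensional PFH histogram), scan
// every descriptor of the other key point set and return the index of the
// one with the smallest squared L2 distance (brute-force nearest neighbor).
std::size_t nearest_descriptor(const std::vector<double>& query,
                               const std::vector<std::vector<double>>& db) {
    std::size_t best = 0;
    double best_d = 1e300;
    for (std::size_t i = 0; i < db.size(); ++i) {
        double d = 0.0;
        for (std::size_t k = 0; k < query.size(); ++k) {
            double diff = query[k] - db[i][k];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```

Brute force is preferred over a kd-tree here because tree-based pruning degrades badly in 125 dimensions.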

IV. EXPERIMENTAL RESULTS AND ANALYSIS
Our method is evaluated in terms of accuracy and runtime. The sequential program is written in C++ with the Point Cloud Library (PCL) [55], and the parallel program is written in CUDA C/C++ [56]. All experiments are run on an Intel Core™ i7-8750H CPU @ 2.20GHz and an NVIDIA GeForce GTX 1060 GPU.
The data used in the experiments are Bunny, Dragon and Happy Buddha (Figure 9) from the Stanford 3D Scanning Repository (http://graphics.stanford.edu/data/3Dscanrep/). We also test our algorithm on the Ship and Volcano data.

A. EXPERIMENT DESIGN
Given a source 3D point cloud P and a target point cloud Q, we conduct a predetermined random transformation T on P.
Here the transformation includes rotation, translation and scaling. As a result, a data pair (P, Q) is obtained with the ground truth of transformation T .
The goal of the algorithm is to find the best transformation T̂ from P to Q. We compare T̂ = T* · T₀ with the ground truth under various conditions such as missing points, noise, outliers and isotropic scaling. The better the registration performs, the closer the estimated transformation is to the ground truth T. We show both the visual and the numerical results. All quantitative results report the average accuracy and runtime over 15 different sample pairs.
We measure the error by the average of point-to-point distances, marked as TSD, on the overlap area of P and Q. Meanwhile, the stopping criteria of the trimmed ICP, AGICP and our fine registration method are: 1) the maximum iteration number τ = 200; 2) the TSD error ε_η < e^{−10}; and 3) |ε_η − ε_{η−1}| < e^{−10}, respectively.

B. MISSING POINTS
Figure 10 shows the results of the missing points case, compared with Super4PCS, the original trimmed ICP and AGICP. The PPFH features of the key points match well on the overlap area. Figure 11 shows the zoom-in results of the coarse and fine registrations, respectively. Table 1 displays the detailed numerical results, indicating that our PPFH based coarse-to-fine registration method outperforms the original trimmed ICP, AGICP and Super4PCS in terms of average registration accuracy and runtime. Note that even our coarse registration alone outperforms the trimmed ICP, AGICP and Super4PCS in terms of accuracy. Our full coarse-to-fine method also outperforms our own coarse trimmed module, as shown in the table, which indicates that the fine trimmed module further improves the accuracy. Our method sets the neighborhood range N₀ = 8 for the point cloud normal vector estimation, and N₃ = 32 for the PFH descriptor computation in both P and Q.

Table 1 also shows the runtime of the coarse and fine registrations in the parallel implementation. Owing to the accurate coarse transformation, the fine registration needs fewer iterations than directly using trimmed ICP or AGICP, which verifies again the importance of the coarse registration. Besides, the fine registration utilizes the GPU parallel mode to reduce the runtime in each iteration, yielding significant improvements in both accuracy and runtime. In particular, we downsample the ship data for AGICP, since this algorithm requires a large amount of memory and cannot be executed on the full size data.

C. OUTLIERS
To compare our method under different outlier levels, we add an additional 10% to 40% outlier points to the missing-points data. Then, the actual overlap ratio becomes 19.2% to 24.9%. Figure 12 illustrates the comparison under different outlier levels, and the numerical results are given in Table 2.
Compared with the other three methods, our PPFH based coarse-to-fine algorithm achieves much higher registration accuracy, and owing to the accurate coarse estimation, the fine registration converges faster. To match the PPFH features of the key points correctly, we need to reduce N₃ in the KNN search to avoid capturing more random outlier information. Hence, the search region is about 3/4 of that in the raw data experiment.

D. PERTURBATION NOISE
Furthermore, we verify our algorithm under perturbation noise. The results are shown in Figure 13 and Table 3, where we compare the registration performance of seven different methods.

FIGURE 12. Comparison results under different levels of outliers: the first row shows 10% outliers of the point cloud size, the second 20%, and the last 40%.

In Figure 13, the first row is the data contaminated by noise N(0, 0.0003), and the second row by N(0, 0.0007). Super4PCS, trimmed ICP, AGICP and our coarse registration without denoising suffer significantly from the noisy and missing data. Table 3 shows the registration accuracy and runtime on the different noise levels. Comparing the coarse registrations with and without denoising, the bilateral filter efficiently reduces the influence of noise on the PPFH features. Using trimmed ICP directly suffers from the low overlap ratio problem, leading to more iteration steps or wrong transformation results.
Although the runtime slightly increases with the bilateral filter in our method, the registration accuracy is much better than not using it. This outcome benefits from not only the parallel execution mode but also the denoising procedure, which reduces the iteration steps in the modified trimmed ICP.

E. SCALING
In this part, we extend our coarse-to-fine trimmed partial registration method to the isotropic scaling case. To register successfully under scaling, we also need to estimate the scale s₀ in the coarse trimmed module. Then, we conduct the trimmed scale ICP [36] in the GPU parallel mode: T̂ = arg min_{s,R,t} ‖s · R · P + t − Q‖².
The details are summarized in Algorithm 4.

Algorithm 4 Partial Registration for Isotropic Scaling
Require: Point clouds P and Q.
1: Estimate the correspondences and the coarse overlap area as in section III-A4.
2: Estimate the coarse transformation {s₀, R₀, t₀} by SVD.
3: for each iteration η of the fine registration do
4:   Find the nearest points of P in Q in parallel mode.
5:   Update the overlap ratio via (6).
6:   Compute the barycenters p̄ and q̄ of the overlap area.
7:   Update the scale, rotation and translation: estimate the scale s_η via the division of the singular values of SVD(P · Pᵀ) and SVD(Q · Qᵀ); compute the rotation R_η via Algorithm 3 and the translation t_η = q̄ − s_η · R_η · p̄.
8:   Repeat until the stopping criteria are satisfied.
9: end for

In order to reduce the possibility of scale degeneration in the non-rigid transformation, we set upper and lower scale constraints [s_L, s_U] for each iteration.
To verify the proposed method under scaling for real data, we conduct different methods on the Ship and Volcano data. Super4PCS is not suitable for scaling situation due to its reliance on point distance. Experimental results are shown in Figures 14 and 16. Figure 14 shows that directly conducting the trimmed ICP easily fails for the low overlap ratio and the scale 2:1. In contrast, PPFH based coarse registration provides a reasonable coarse transformation including rotation, translation and scaling. This can be taken as a good initial transformation for the fine registration.
Notice that the PPFH based coarse registration gives an accurate transformation estimation, as shown in Table 4, which is better than directly conducting the trimmed ICP and AGICP. Besides, with the help of coarse registration, the fine registration converges faster than directly conducting the trimmed ICP and AGICP. Figure 16 provides the comparison for partial registration with outliers (10%) under the scale 2:1. Obviously, PPFH based coarse registration gives a reasonable coarse transformation. Figure 17 shows the zoom-in result of our fine registration algorithm. In particular, we only downsample the ship data for AGICP, since this algorithm requires a large amount of memory and cannot be executed for full size data. Table 5 further verifies that the PPFH based coarse-to-fine registration algorithm is also suitable for data with outliers (additional 10%) under the scale 2:1.

F. RUNTIME ANALYSIS OF THE PARALLEL IMPLEMENTATION
We compare the runtime of normal vector estimation, PFH computation and tree construction in parallel and sequential implementations, respectively. The results are shown in Figure 18 and Table 6.
In Figure 18(a), the blue and orange bars represent the sequential runtime of the normal vector estimation and the PFH estimation, respectively. Similarly, the purple bar represents the parallel PFH computation time, and the faint yellow 'PNormal' bar represents the parallel normal vector estimation. Notice that the key point size significantly influences the runtime of the sequentially implemented PFH and normal vector estimation, but influences the parallel execution much less. Figure 18(b) compares the tree construction time: the sequential tree construction time increases with the point size, while the parallel one is very low and remains almost constant. Table 6 breaks down the different stages of the coarse-to-fine registration for the Bunny, Buddha and Dragon data. Although the PFH computation still occupies the main part of the registration time, the entire runtime decreases compared with the sequential mode.

V. CONCLUSION
We have proposed an accurate and robust coarse-to-fine partial registration algorithm for two point clouds with a low overlap ratio. In methodology, the algorithm utilizes noise-robust PPFH matching to trim the outliers coarsely, estimates the overlap area and computes the coarse transformation in the coarse trimmed module. Then, the parallel trimmed ICP algorithm can be conducted successfully based on the coarse registration results. Moreover, this framework can be extended to isotropic scaling registration. In implementation, most parts of the procedure are accelerated by CUDA and OpenMP.