Point Cloud Registration with Object-Centric Alignment

Point cloud registration is a core task in 3D perception, which aims to align two point clouds. The registration of point clouds with low overlap is a harder challenge, where previous methods tend to fail. Recent deep learning-based approaches attempt to overcome this issue by learning to find overlapping regions in the whole scene. However, they still lack robustness and accuracy, and thus might not be suitable for real-world applications. Therefore, we present a novel registration pipeline that focuses on object-level alignment to provide a robust and accurate alignment of point clouds. By extracting and completing the missing points of the object of interest, a rough alignment can be achieved even for point clouds with low overlap captured from widely apart viewpoints. We provide a quantitative and qualitative evaluation on synthetic and real-world data captured with a Kinect v2. The proposed approach outperforms the current state-of-the-art methods by more than 29% w.r.t. the registration recall on the introduced synthetic dataset. We show that the overall performance and robustness increase due to the object-level alignment, while the baselines perform poorly as they take the entire scene into account.


I. INTRODUCTION
Multi-perception sensor setups with 3D depth sensors [1] and LIDARs [2] are becoming more and more prevalent in manufacturing. However, such multi-sensor systems require accurate and robust extrinsic calibration in order to be usable. An increasing degree of automation in industrial manufacturing processes also raises a requirement for automated (re-)calibration, to keep the growing complexity in set-up and maintenance manageable.
While being a general computer vision problem, point cloud registration also provides a potential solution for extrinsic calibration of 3D sensors. The goal is to find the relative transformation between a point cloud pair with respect to a reference frame. Previous research was mainly based on traditional registration methods [4]-[6], which were combined, in most research works, with specific calibration objects [7]-[10] or markers [11]-[14]. Even though the target-based methods offer reliable and precise calibration, they are performed manually and require expert knowledge, which is not satisfactory for highly automated industrial processes. Later work [15] showed that automated and target-less calibration is possible but still relies on an approximate initial guess of the sensor placement. Moreover, traditional registration methods suffer from instability and lack robustness if the input point clouds are captured from widely apart viewpoints and their overlap is relatively low.
Consequently, these methods represent a bottleneck for reaching higher levels of autonomy in industrial processes where vision systems are vital. In such cases, learning-based techniques are used to overcome these issues. The tremendous success of deep learning for various 3D perception tasks [16]-[19] has resulted in the use of deep learning for point cloud registration as well. This can be seen in a number of approaches [3], [20]-[29] that appeared in recent years. Although learning-based approaches try to mitigate the problems of previous registration methods, they require large amounts of data, lack generalization and accuracy, and tend to fail when the test data distribution differs from the training data distribution.
Therefore, instead of learning low-level features on the entire point cloud, we could simply focus on an object of interest within the scene, as shown in figure 1. Moreover, a valid assumption for most relevant 3D computer vision applications is that there will always be at least one unique object, i.e. an object of interest, in the scene. Examples are vehicles in automated driving use cases, robot manipulators in industrial work cells, or furniture in domestic indoor scenes. Thus, by focusing on an object-centric alignment, we can overcome the problems of point clouds captured from different viewpoints, point clouds with low overlap, and the need for learning correspondences on the entire scene. Moreover, by applying this simple yet effective hypothesis, we can use any off-the-shelf methods and easily integrate them into our pipeline to adapt to any given requirements.
Hence, we propose a simple and modular registration pipeline for point cloud data to mitigate the limitations mentioned above. First, the object of interest is extracted from the input point cloud pair. The extracted points partially represent the object of interest, due to the self-occlusion of 3D sensors. Therefore, the next step in our registration pipeline predicts the missing points, which highly increases the similarity between the extracted point clouds. We leverage this similarity and perform a rough alignment of the completed point clouds of the object of interest. This provides a relatively good transformation estimation. Finally, we refine the alignment on the entire captured scene by using the roughly estimated transformation parameters as an initial guess.
Our main contributions can be summarized as follows:
• A novel registration pipeline based on object-level alignment.
• The object extraction and completion modules that enable accurate and robust registration even for point clouds with low overlap.
• Extensive experiments on a new synthetic dataset containing point clouds with low overlap captured from widely apart viewpoints, and qualitative evaluation on real point cloud data.

II. RELATED WORK
In this section, we provide an analysis of relevant related research. Furthermore, we extract the limitations for each subclass of the family of point cloud registration methods.

A. TRADITIONAL POINT CLOUD REGISTRATION
Point-based Registration Methods. The best-known traditional optimization-based point cloud registration method is Iterative Closest Point (ICP), which was introduced by [30] and [31]. The core idea behind ICP is to iteratively search for correspondences and estimate the transformation between them, thus finding the optimal transformation between source and target point cloud. The main drawback of this method is that it heavily relies on a good initial pose estimate; a bad estimate can lead to convergence to a local minimum. To overcome this, [32], [33] and [34] use branch-and-bound to search for a globally optimal solution. Although these approaches may be effective with bad initial estimates, they still lack robustness in the case of point cloud pairs with low overlap. Additionally, global registration methods come at a high computational cost, which makes them unusable for real-time applications. Finally, [35] introduces the estimation of the velocity of the rangefinder into the ICP algorithm to compensate for any kind of distortion caused by the movement of the sensor.
Handcrafted Descriptors. Contrary to optimization-based registration techniques, handcrafted descriptor-based approaches [36]-[39] try to extract relevant features from point cloud pairs, and thus find correspondences between them. Their advantage over most of the optimization-based methods is that handcrafted descriptors don't require an initial guess. However, these methods are sensitive to noise and occlusions, which can result in wrong correspondences. Moreover, handcrafted feature extraction methods underperform when dealing with point cloud pairs with low overlap, because there might be few, or even no, matching correspondences in the two input point clouds.

B. LEARNING-BASED POINT CLOUD REGISTRATION
Feature Learning. The rapid advancement of data-driven deep learning approaches enabled the usage of these techniques for point cloud registration. Unlike the handcrafted feature extractors, feature learning approaches train deep neural networks on large training data sets for finding correspondences. 3DMatch [20] is one of the first feature learning point cloud registration approaches. It leverages volumetric data representation and 3D Convolutional Neural Networks (CNN) to learn 3D local descriptors for finding correspondences. The authors of [20] introduced the well-known real-world data registration benchmark under the same name as the method. In order to jointly capture local and global features, Deep Closest Point (DCP) [21] employs a Dynamic Graph Convolutional Neural Network (DGCNN) [40] and leverages a Transformer [41] to learn contextual information. Finally, a Singular Value Decomposition (SVD) module produces the transformation matrix. A comprehensive survey of data-driven feature learning methods can be found in [42], including works up to 2021. More recent methods [22], [3] and [24] try to overcome the problem of registration of point clouds with low overlap. The approach in [22] enhances the quality of the correspondences, in a regime with low overlap, by using a graph-based self- and cross-attention network. PREDATOR [3] introduces a novel overlap-attention block that aims to focus more on the overlapping parts of point cloud pairs. [24] proposes to solve the registration of partially overlapping point clouds by learning overlapping masks to register those regions.
However, the main limitations of the above-mentioned methods are: 1) they need an immense amount of training data, 2) if there is a relatively large gap between the training data and new scenes, then these methods suffer from a significant performance drop, and 3) these methods still fail to accurately register point cloud pairs with extremely low overlap. In contrast, our method focuses on finding correspondences on an object level. This object-centric strategy helps to address the aforementioned drawbacks. Moreover, our modular pipeline makes use of state-of-the-art methods and thus leverages their strengths.

III. PROPOSED METHOD
Our novel point cloud registration method focuses on finding an object of interest in the input point cloud pairs for accurate and precise transformation estimation. Additionally, we make the assumption that a unique object of interest, specific to a particular use case, exists within any given scene. The proposed point cloud registration pipeline is modular, and thus enables the easy plug-and-play exchange of each module with other off-the-shelf methods or network models. For the so-called initial guess, the estimated transformation parameters from the previous step are used.
The following subsections explain in detail the novel point cloud registration pipeline.

A. PROBLEM STATEMENT
Let us consider two input point clouds, a source $P = \{p_i \in \mathbb{R}^3 \mid i = 1, \dots, N\}$ and a target $Q = \{q_j \in \mathbb{R}^3 \mid j = 1, \dots, M\}$, where $M = N$ is possible but not required. Assume that the source and target point cloud have $L$ point matches, where $0 < L < N$. The task of point cloud registration is to estimate the rigid transformation matrix $T_{P}^{Q}$, which consists of a rotation matrix $R$ and a translation vector $t$, where $R \in SO(3)$ and $t \in \mathbb{R}^3$, by minimizing the least squares error:
$$\min_{R,\,t} \sum_{l=1}^{L} \left\| R\,p_l + t - q_l \right\|_2^2, \tag{1}$$
where $(p_l, q_l)$ denotes the $l$-th pair of matched points. The well-known ICP method tries to iteratively solve Eq. 1 by alternating between finding the right point matches, i.e. correspondences, and the optimal transformation matrix. Unfortunately, this approach is very sensitive to local optima and it fails to converge if the initial guess is poor. Therefore, we aim to provide a relatively well-aligned initial guess by focusing first on the object of interest in both the source and target point cloud.
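The alternation described above can be illustrated with a minimal point-to-point ICP sketch in NumPy. It is not the implementation used in this work: the function names `best_rigid_transform` and `icp` are hypothetical, the correspondence search is a brute-force nearest-neighbour lookup, and the transformation step uses the standard SVD-based closed-form solution of Eq. 1 for fixed matches.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Closed-form least-squares R, t for matched rows (src[l] <-> dst[l])."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred, matched point sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


def icp(source, target, R_init=None, t_init=None, iterations=30):
    """Alternate between nearest-neighbour matching and the closed-form update."""
    R = np.eye(3) if R_init is None else R_init
    t = np.zeros(3) if t_init is None else t_init
    for _ in range(iterations):
        moved = source @ R.T + t
        # Brute-force nearest target point for every transformed source point.
        d2 = ((moved[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
        matches = target[d2.argmin(axis=1)]
        # Re-estimate the full transformation from the original source points.
        R, t = best_rigid_transform(source, matches)
    return R, t
```

With a small initial misalignment the matches found in the first iteration are mostly correct and the loop converges quickly; with a poor initial guess the same loop can settle in a local optimum, which is the failure mode the object-level initial guess is meant to avoid.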

B. OBJECT OF INTEREST EXTRACTION
As already mentioned, our approach first finds corresponding points on an object level instead of searching for correspondences or features in the entire input point clouds, as is done by previous methods. Thus, the first step is to extract the object of interest point clouds $P_S = \{p_j \in \mathbb{R}^3 \mid j = 1, \dots, J\}$ and $Q_S = \{q_k \in \mathbb{R}^3 \mid k = 1, \dots, K\}$ from the source $P$ and target point cloud $Q$, where $P_S \subset P$, $Q_S \subset Q$, and $J = K$ is possible but not required. It can be described with:
$$P_S = f_e(P), \qquad Q_S = f_e(Q), \tag{2}$$
where $f_e$ is a function that extracts the points of the object of interest from the input point cloud data. Since our approach is modular, the function for extracting the object of interest points can be implemented by any method which is able to distinguish the object of interest from the background. For example, a 3D object detection module, trained to detect the object of interest, can be used for this task.
In our experiments, we use the DGCNN [40] semantic segmentation network as $f_e$ from equation 2 for extracting the object of interest points from the background. DGCNN is a lightweight graph-based network architecture leveraging edge convolution operations. The input point cloud is converted to a graph-based structure by using k nearest neighbours. Furthermore, the edge convolution operation introduced in [40] combines global information with the local neighbourhood information. For more details, please refer to the original work.
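As a rough illustration of the two ingredients mentioned above, the following sketch builds a k-nearest-neighbour index (the graph structure a DGCNN-style network operates on) and applies $f_e$ as a simple mask over per-point class predictions. The function names, the brute-force neighbour search, and the convention that class 1 marks the object of interest are assumptions made for this example; the segmentation network itself is not reproduced here.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbours of every point (the point itself excluded)."""
    points = np.asarray(points, dtype=float)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]


def extract_object(points, labels, object_class=1):
    """f_e as a mask: keep only points predicted as the object of interest."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return points[labels == object_class]
```

In a full pipeline the `labels` array would come from the trained segmentation network; any other module that yields a per-point object mask could be dropped in at the same place.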

C. POINT CLOUD COMPLETION
The extracted points $P_S$ and $Q_S$ of the same object of interest from both input point clouds represent the object of interest only partially. This is due to self-occlusion, since the 3D sensor can only capture one side of an object. Furthermore, we can assume that the two extracted point clouds only partially overlap, which is caused by the different viewpoints from which the input point clouds were captured. The larger the translational and rotational difference between the two input point clouds, the smaller the expected overlap between them, and thus the harder the estimation problem. To tackle these issues, we propose to predict the missing points of the extracted object of interest point clouds with:
$$P_C = f_c(P_S), \qquad Q_C = f_c(Q_S), \tag{3}$$
where $f_c$ is a function that predicts the missing points of $P_S$ and $Q_S$. The complete point cloud representations of the extracted objects of interest are denoted with $P_C = \{p_u \in \mathbb{R}^3 \mid u = 1, \dots, U\}$ and $Q_C = \{q_v \in \mathbb{R}^3 \mid v = 1, \dots, V\}$, where $U = V$ is possible but not required, and $P_S \subset P_C$ and $Q_S \subset Q_C$. The aim of completing the two extracted point clouds is to get two sets of points that are similar w.r.t. their geometrical shape. We use this similarity between the two completed point clouds of the object of interest to roughly align them, as described in more detail in the following section.
To infer missing points from partial input point clouds, we use the PoinTr [43] network as $f_c$ from equation 3. PoinTr is a transformer-based network architecture for the task of point cloud completion. To process the incomplete point cloud, a lightweight DGCNN model is employed. To reduce the computational cost, the input point cloud is hierarchically downsampled using farthest point sampling (FPS). More details can be found in the original research article.
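Farthest point sampling, mentioned above as the downsampling step, can be sketched in a few lines of NumPy. This is a generic single-level version written for illustration (the function name and the `start_index` parameter are assumptions), not the hierarchical variant inside PoinTr.

```python
import numpy as np


def farthest_point_sampling(points, n_samples, start_index=0):
    """Greedily pick points that are as far as possible from those already chosen."""
    points = np.asarray(points, dtype=float)
    selected = [start_index]
    # Squared distance from every point to its closest already-selected point.
    min_d2 = ((points - points[start_index]) ** 2).sum(axis=1)
    for _ in range(n_samples - 1):
        next_index = int(min_d2.argmax())
        selected.append(next_index)
        d2 = ((points - points[next_index]) ** 2).sum(axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return np.array(selected)
```

Compared with random subsampling, this greedy choice keeps the coverage of the shape roughly uniform, which matters when the partial object scan is already sparse in some regions.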

D. TRANSFORMATION PARAMETER ESTIMATION
As mentioned in Sec. III-A, the ICP algorithm is prone to errors if the initial alignment is inaccurate, which leads to a bad transformation matrix estimation and hence a bad registration. We solve this issue by providing a relatively good initial alignment of the input point clouds, leveraging the similarity of the completed point clouds of the object of interest. We apply PCA [44], following [45] and [46], and find the covariance matrices, $C_{P_C} \in \mathbb{R}^{3 \times 3}$ and $C_{Q_C} \in \mathbb{R}^{3 \times 3}$, of both completed point clouds:
$$C_{P_C} = \frac{1}{U} \sum_{u=1}^{U} (p_u - \bar{p}_C)(p_u - \bar{p}_C)^\top, \qquad C_{Q_C} = \frac{1}{V} \sum_{v=1}^{V} (q_v - \bar{q}_C)(q_v - \bar{q}_C)^\top,$$
where the centroids of the completed point clouds, $\bar{p}_C \in \mathbb{R}^3$ and $\bar{q}_C \in \mathbb{R}^3$, are calculated with:
$$\bar{p}_C = \frac{1}{U} \sum_{u=1}^{U} p_u, \qquad \bar{q}_C = \frac{1}{V} \sum_{v=1}^{V} q_v.$$
The point cloud reference system for each completed object of interest point cloud is defined by its principal components, i.e. the eigenvectors of the previously calculated covariance matrix, with the centroid as its origin. By aligning the reference frames of the two completed point clouds, retrieved from the source and target input points, and applying ICP for further refinement, we obtain an object-level alignment $T_{P_C,\mathrm{rough}}^{Q_C}$. Finally, we use the object-level alignment as the initial guess for the minimization problem of equation 1 and solve it using ICP on the entire input point clouds to find $T_{P,\mathrm{fine}}^{Q}$. Since we can provide an initial guess which is already close to the optimal solution, the ICP algorithm converges and finds the optimal solution without getting stuck in local optima.
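The reference-frame alignment described above can be sketched as follows. With the eigenvector matrices $E_P$ and $E_Q$ of the two covariance matrices as columns, a candidate rotation is $R = E_Q E_P^\top$ and the translation is $t = \bar{q}_C - R\,\bar{p}_C$. Eigenvectors are only defined up to sign, so this sketch tries the four sign combinations that keep $R$ a proper rotation and keeps the one with the smallest mean nearest-neighbour distance; that disambiguation step and all function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def principal_frame(points):
    """Centroid and right-handed principal axes (columns, largest variance first)."""
    centroid = points.mean(axis=0)
    centred = points - centroid
    cov = centred.T @ centred / len(points)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    axes = eigvecs[:, ::-1].copy()
    if np.linalg.det(axes) < 0:
        axes[:, 2] *= -1.0  # enforce a right-handed frame
    return centroid, axes


def mean_nn_distance(a, b):
    """Mean distance from each point in a to its nearest point in b (brute force)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return float(np.sqrt(d2.min(axis=1)).mean())


def pca_rough_alignment(source, target):
    """Align the principal frame of source to that of target."""
    c_s, axes_s = principal_frame(source)
    c_t, axes_t = principal_frame(target)
    best = None
    # Sign flips with determinant +1, so every candidate R stays in SO(3).
    for flips in ([1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]):
        R = axes_t @ np.diag(np.array(flips, dtype=float)) @ axes_s.T
        t = c_t - R @ c_s
        err = mean_nn_distance(source @ R.T + t, target)
        if best is None or err < best[0]:
            best = (err, R, t)
    return best[1], best[2]
```

On real completed point clouds the two shapes are only similar, not identical, so the result is a rough alignment that still needs the ICP refinement described in the text. The sketch also assumes that the three principal variances are clearly distinct; for near-symmetric objects the axes become ambiguous.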

IV. EXPERIMENTS
In this section, we give a detailed description of the used dataset and give an overview of the implementation details and the used evaluation metrics. Then we provide an ablation study showing the contribution of our object-centric strategy. Finally, we compare the quantitative and qualitative results with the baseline methods.
VOLUME 4, 2016. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
Kuka LBR iiwa collaborative robot mounted on a worktable inside our workcell dedicated for research purposes. We use multiple Kinect v2 depth sensors mounted in the corners of the workcell to get full coverage. The extrinsic calibration, i.e. point cloud registration, of the 3D vision system is performed with our proposed method, because it provides robust and accurate alignment due to the object-centric approach. Additionally, for the indoor real-world scenario, we assume that the movement of the object of interest, in our case the robot arm, cannot cause any significant positional discrepancy in the depth sensors. This assumption is valid because the time synchronization of multiple connected depth sensors is usually orders of magnitude faster than the movement of any object within a scene.

A. DATASET
We assume that for most 3D computer vision applications that require point cloud registration as a necessary preprocessing step, a unique object of interest will be present within the captured scene. Hence, let us consider the scenario of an indoor industrial robotics workcell inside a manufacturing plant. We can assume that a robot manipulator will be present in all the captured scenes since it represents the main element for the operation of a robot workcell. Thus, we can consider the robot manipulator as our unique object of interest in any given robotic workcell use case. To the best of our knowledge, there are no open-source datasets that satisfy our task description. Therefore, we introduce a new synthetic dataset, containing dense point clouds of an industrial workcell with a Kuka LBR iiwa inside it. This dataset is generated using Blender [47] by realistically recreating our real-world lab robotic workcell, as can be seen in figure 4 and its real-world 3D scan taken with a Kinect v2 in figure 3, and contains 2750 scan pairs with randomly sampled robot joint states for each scan. Additionally, the scans were taken from random poses within the workcell with the condition that the robot arm is inside the field of view. Our synthetic dataset can be used to train for semantic segmentation, point cloud completion, and point cloud registration tasks. Therefore, we give the ground truth point-wise labels containing either the background class or the robot arm class. In addition to that, we provide the complete robot arm point cloud for each scan as ground truth data, in order to be able to train a point cloud completion network. The split into subsets for training, validation, and testing can be seen in table 1. We follow [3] to calculate the overlap ratio between the point cloud pairs, which is reported in table 1 as well. The overlap ratio tells us how many points of the perfectly aligned source and target point cloud lie within a threshold distance. The lower the overlap ratio between two input point clouds is, the fewer potential correspondences exist, thus making the registration problem harder. The mean overlap ratio of the point cloud pairs from our introduced synthetic dataset is relatively low at around 30%. If we compare our dataset with the well-known 3DMatch dataset [20], where only scan pairs with an overlap > 30% are considered, we can see that our dataset represents a harder challenge for registration. Specifically, point cloud pairs with an overlapping region < 30% are considered to be low overlapping [3] and thus various methods show a rapid decrease in performance.

B. IMPLEMENTATION DETAILS
Our method is implemented in the programming language Python using the well-known machine learning framework PyTorch [48] and Open3D [49] for 3D data processing and visualization. As mentioned previously, we used Blender [47] to generate our synthetic dataset. For the object of interest extraction module, we use DGCNN [40], trained on our dataset by following the authors' recommendations regarding the hyperparameters. We report in table 2 the performance of DGCNN [40] on our synthetic dataset based on the Intersection over Union (IoU) metric. The point cloud completion is obtained by using the PoinTr network [43], again following the authors' hyperparameter recommendations. We trained PoinTr on our dataset, with the addition of generating the corresponding ground truth, i.e. the complete robot arm point cloud representation for each scan, with Blender. Table 3 shows the performance of PoinTr [43] on our introduced dataset. We set the threshold for ICP in the rough and fine alignment steps to 0.01 and 0.1, respectively.
The experiments were conducted on our workstation PC with an AMD Ryzen Threadripper 2950X (16-Core) and an NVIDIA GeForce RTX 3090 GPU. For the traditional baselines, we use the implementation provided in the Open3D library [49], while for the feature learning-based method we use its publicly available open-source implementation. For fairness, we trained the feature learning-based method, PREDATOR [3], on our introduced synthetic dataset by using the recommended hyperparameter settings.
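The overlap ratio described in Sec. IV-A can be sketched as below. This is one common reading of the definition (the fraction of source points that, after applying the ground truth transformation, have a target point within a threshold distance); the function name, the brute-force search, and the threshold value used in the test are assumptions made for illustration, and the exact protocol of [3] may differ in detail.

```python
import numpy as np


def overlap_ratio(source, target, R_gt, t_gt, threshold):
    """Fraction of ground-truth-aligned source points with a target point nearby."""
    aligned = np.asarray(source, dtype=float) @ np.asarray(R_gt).T + np.asarray(t_gt)
    target = np.asarray(target, dtype=float)
    # Brute-force nearest-neighbour distance from every aligned source point.
    d2 = ((aligned[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    nearest = np.sqrt(d2.min(axis=1))
    return float((nearest < threshold).mean())
```

For dense scans a spatial index such as a k-d tree would replace the quadratic distance matrix, but the quantity being computed stays the same.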

C. EVALUATION METRICS
We follow [3], [50], [22], and evaluate the point cloud registration performance w.r.t. the relative translation error (RTE) and the relative rotation error (RRE), calculated by:
$$\mathrm{RTE} = \left\| \hat{t} - t_{gt} \right\|_2, \qquad \mathrm{RRE} = \arccos\left( \frac{\operatorname{trace}(\hat{R}^\top R_{gt}) - 1}{2} \right),$$
where $\hat{R}$ and $\hat{t}$ are the estimated, and $R_{gt}$ and $t_{gt}$ the ground truth transformation parameters. Based on RTE and RRE, we calculate the mean translation error (MTE) and mean rotation error (MRE), in order to evaluate the performance of the compared methods. Additionally, we also calculate the registration recall rate (RR), which gives a quantitative measure of the registration success ratio. A registration is considered successful if the relative translation and rotation errors are below certain thresholds. For our use case, we consider the following thresholds: RTE < 0.05 m and RRE < 5°.
FIGURE 5: Qualitative comparisons on our synthetic dataset. The traditional methods fail because of the large viewpoint difference of the point clouds, and due to the low overlap. Even though our method and the learning-based baseline appear to be visually identical at the scene level, it is clear that our method performs better for the points measured on the shelf, which can be seen in the zoomed-in region. Our approach has a better registration quality because we use an object of interest as a fulcrum point for rough alignment. This rough alignment serves as a good starting point for the fine alignment step at the scene level and makes the search for correct correspondences easier.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Citation information: DOI 10.1109/ACCESS.2022.3191352

D. ABLATION STUDY
We ablate our proposed registration pipeline to demonstrate the robustness of our object-centric alignment strategy, as shown in table 4. The ablation study is conducted on the introduced synthetic dataset. First, we evaluate the performance of only using the object extraction module (model A) and applying ICP on the partial point clouds. By adding the object completion module (model B) and applying ICP on its output, we can observe how the performance improves. By using the object extraction, object completion, and object-level alignment modules (model C), we see a clear increase in performance, but the overall registration recall still remains relatively low for the selected thresholds. Model D represents all the modules except the object completion module. Here, we intend to emphasize the importance of the object completion module. Additionally, model D simulates a scenario where the registration pipeline fails to correctly complete the object of interest, which is caused by either a poor point-wise extraction by the object extraction module or a poor reconstruction of the extracted points by the object completion module. However, if we compare model D with the entire pipeline (model Full), one can see the importance of the object completion module, which adds to the overall robustness of the registration pipeline by improving the registration recall by more than 66%. The robustness increases because the object completion module significantly reduces the discrepancy between the geometrical shapes of the two extracted object-of-interest point clouds. Moreover, this step enables a rough alignment which serves as the initial guess for the fine alignment step.
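The metrics behind these numbers (RTE, RRE, and the registration recall from Sec. IV-C) can be computed along the following lines. The function names are hypothetical, the default thresholds are the ones stated in Sec. IV-C (0.05 m and 5°), and the rotation error uses the standard trace-based angle between two rotation matrices, which is assumed here to match the definition followed from [3].

```python
import numpy as np


def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground truth translation."""
    return float(np.linalg.norm(np.asarray(t_est, dtype=float) - np.asarray(t_gt, dtype=float)))


def relative_rotation_error_deg(R_est, R_gt):
    """Angle (in degrees) of the residual rotation between estimate and ground truth."""
    cos_angle = (np.trace(np.asarray(R_est).T @ np.asarray(R_gt)) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def registration_recall(rte, rre_deg, rte_max=0.05, rre_max_deg=5.0):
    """Share of pairs whose translation and rotation errors are both below threshold."""
    rte, rre_deg = np.asarray(rte, dtype=float), np.asarray(rre_deg, dtype=float)
    return float(((rte < rte_max) & (rre_deg < rre_max_deg)).mean())
```

Sweeping `rte_max` and `rre_max_deg` over a range of values yields recall-versus-threshold curves, which show how sensitive each method is to the chosen success criterion.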

E. QUANTITATIVE RESULTS
Traditional methods are not robust against point cloud pairs with low overlap, because they cannot find enough relevant correspondences, which results in poor performance. Only the method where RANSAC is used together with ICP manages to register a few point cloud pairs very accurately, but due to a very low registration recall, this approach remains unusable for real-world applications. On the other hand, PREDATOR manages to generalize well over the test set, showing the robustness of a learning-based method specifically designed for point cloud pairs with low overlap. However, our proposed method performs similarly or more accurately w.r.t. the MTE and MRE, and is more robust in terms of registration recall compared to the baseline methods, as shown in table 5. Our method achieves this robustness because it focuses first on finding an object of interest and predicts the missing points to generate a similar shape. High accuracy, in turn, is obtained with the combination of the object-level and scene-level alignment. The alignment on the entire scene begins with a good initial guess, retrieved from the object-level alignment, which leads to convergence to an optimal solution in most cases. Furthermore, we evaluate the registration recall with different translation and rotation error thresholds, as shown in figure 6. Comparing the results, our method shows superiority for lower threshold values, which indicates that it is highly reliable for applications with a strict requirement on accuracy, such as industrial robotic workcell use cases. The better performance of our method can be attributed to the effective strategy of our novel registration pipeline. Instead of finding low-level correspondences in the entire scene, we first focus on roughly aligning an object of interest to provide a good starting point for the scene-level fine alignment.
FIGURE 6: Our approach outperforms the baselines in the low threshold region by a large margin, because of the object-centric alignment strategy followed by a scene-level fine alignment step.

F. QUALITATIVE RESULTS
The qualitative comparison of the baseline methods and the introduced registration approach on the synthetic test data is shown in figure 5. As expected, the traditional methods suffer from instability and fail to achieve a satisfactory alignment, because of the widely apart viewpoints the point clouds were captured from, and due to the low overlap. On the other hand, PREDATOR manages to handle such input pairs and successfully registers them. However, by comparing the highlighted part of PREDATOR and our method, it is clearly visible that the learning-based baseline falls short of accurately aligning the input point cloud pair. The reason for this might be that the learning-based method requires a larger amount of training data in order to learn more fine-grained correspondences.
Finally, a qualitative comparison on real-world point cloud data is displayed in figure 3. The scenes were captured with two Kinect v2 sensors which were mounted in the corners of our workcell, as shown in figure 4. Again, the traditional baselines fail, for the same reasons, to successfully register the point clouds with low overlap. Surprisingly, the learning-based method, PREDATOR, also fails to align the real point cloud pair. This is very likely due to the difference in data distribution between the training set and the real point cloud data. Only our approach successfully registers the real input point clouds, which can be attributed again to the effective design choice of the proposed registration pipeline, which first puts the focus on the object level instead of on the entire scene.

V. DISCUSSION AND FUTURE WORK
The design of our point cloud registration pipeline enables two properties: 1) scalability and 2) simplicity. Each module within our point cloud registration pipeline can be exchanged with any other off-the-shelf method and adapted accordingly. Therefore, our proposed method can be easily extended for other applications where an accurate and robust registration of challenging point cloud pairs is required, e.g. automated driving, 3D indoor mapping, multi-agent SLAM, and others. Moreover, our approach opens up a number of directions for further research. It would be interesting to see how our method could be used for cross-source point cloud registration, where different densities of the input point clouds present a difficult challenge for current methods. In addition, publicly available datasets, such as 3DMatch [20], lack the ground truth information about the completed point cloud for potential objects of interest within the scene. We believe that our synthetic dataset fills this gap and will help further research in this particular direction. Overall, the conducted experiments on synthetic and on real-world data showcase the robustness and accuracy of our object-centric alignment strategy.

VI. CONCLUSION
In this work, we introduced a simple and modular approach for robust and accurate registration of point clouds with low overlap. The main idea behind this novel registration pipeline was to focus on an object of interest in the input point cloud pair and use it as a fulcrum point. Inferring the missing points of the object of interest created a geometrically similar shape of it in both input point clouds, which then helped to roughly align them. This rough alignment provided a good and robust initial guess for the scene-level fine alignment, and thus supported convergence to an optimal solution. Moreover, we showed that the introduced approach outperforms other baselines on our synthetic dataset, and our method proved to be robust even on noisy real-world data while the compared baselines failed.