Learning to Optimally Segment Point Clouds

We focus on the problem of class-agnostic instance segmentation of LiDAR point clouds. We propose an approach that combines graph-theoretic search with data-driven learning: it searches over a set of candidate segmentations and returns one where individual segments score well according to a data-driven point-based model of "objectness". We prove that if we score a segmentation by the worst objectness among its individual segments, there is an efficient algorithm that finds the optimal worst-case segmentation among an exponentially large number of candidate segmentations. We also present an efficient algorithm for the average case. For evaluation, we repurpose KITTI 3D detection as a segmentation benchmark and empirically demonstrate that our algorithms significantly outperform past bottom-up segmentation approaches and top-down object-based algorithms on segmenting point clouds.


I. INTRODUCTION
Perception for autonomous robots presents a collection of compelling challenges for computer vision. We focus on the application of autonomous vehicles. This domain has three notable properties that tend not to surface in traditional vision applications: (1) 3D sensing in the form of LiDAR technology, which exhibits different properties than traditional 3D vision captured through stereo or structured light. Despite significant work in this area, the right representation for such sparse 3D signals still remains an open question.
(2) Contemporary approaches to object detection and scene understanding tend to be closed-world, where the task is predicting 1-of-N possible labels. But autonomous systems require the ability to recognize all possible obstacles and movers; e.g., a piece of road debris must be avoided regardless of what name it has. Such understanding is crucial from a safety perspective. Historically, this has been formulated as a perceptual grouping or bottom-up segmentation task, which is typically addressed with different approaches.
(3) Finally, practical autonomous robotics makes heavy use of perceptual priors in the form of geometric maps and assumptions on LiDAR geometry. Indeed, prior maps were a crucial component among finishing entries in the DARPA Urban Grand Challenge [1,2].
Motivation: In this work, we focus on the problem of class-agnostic instance segmentation of LiDAR point clouds (Figure 1) in an open-world setting. We carefully mix graph-theoretic algorithms with data-driven learning. Data-driven learning has made an undeniable impact on computer vision, but it is difficult to make guarantees about its performance when processing out-of-sample data from an open world. Geometric graph-based approaches for segmentation tend not to require training and so are less likely to overfit, but also tend to be brittle.
* indicates that two authors contributed equally.
Approach: Our approach searches over an exponentially large space of candidate segmentations and returns one where individual segments score well according to a data-driven point-based model of "objectness" [3]. We demonstrate that one can repurpose existing closed-world point networks [4] for bottom-up perceptual grouping tasks that generalize to objects rarely seen during training.
Optimality: We prove that our approach produces optimal segmentations according to a specific definition. First, we restrict the search to a subset of segmentations that are consistent with a hierarchical grouping of a point cloud sweep. Such hierarchical groups can be readily produced with agglomerative clustering [5], HDBSCAN [6], or hierarchical graph-based algorithms [7].
Naive methods for producing a segmentation might apply a global threshold over the whole hierarchy. It turns out that one can produce an exponentially large set of segmentations by applying different thresholds at different branches. We introduce efficient algorithms that search over this space of tree-consistent segmentations (Figure 2) and return the one that maximizes a global segmentation score that is computed by aggregating local objectness scores of individual segments.
Evaluation: We demonstrate empirical results on KITTI, a benchmark originally designed for closed-world object detection. Following past work, we repurpose it for open-world 3D segmentation [8]. We compare to existing bottom-up approaches [9] and state-of-the-art LiDAR-based object detectors after converting their output 3D bounding boxes to a point cloud segmentation. We demonstrate that our approaches outperform both baselines on less common classes.

II. RELATED WORK
Robust 3D object detection is crucial for downstream applications such as semantic understanding [10] and tracking [11]. In contrast to monocular 3D detection [12], we focus on LiDAR-based solutions in this paper.
LiDAR segmentation: Classic LiDAR segmentation algorithms use bottom-up grouping such as flood-filling [13], connected components [14], or density-based clustering [6]. Bottom-up strategies can also be applied on LiDAR sequences, allowing for motion as an additional cue [15]-[17]. Oftentimes such approaches are tuned for particular object categories such as cars. Our work differs in its use of static, single-frame cues that are not object-specific.
LiDAR object detection: There is an ever-increasing literature on data-driven object detection with LiDAR point clouds. Early approaches include fusion-based models that combine LiDAR and imagery [18], tracking-based detectors [19] and voxel-based classifiers [20]-[22]. We have seen approaches built upon raw point clouds such as PointRCNN [23]. Our approach is most related to Frustum PointNet [24] in the way we use a pooled point cloud representation [4]. Our work differs in that we do not make use of camera input and, most notably, focus on all possible objects in an open world. Specifically, we compare to [18,21,22,25] as a representative sample of the literature.
Perceptual grouping: Our graph-based approach is inspired by a long line of classic work on graph-theoretic perceptual grouping, dating back to normalized cuts [23], graph cuts [26], and spanning-tree approaches [27]. Such methods are typically used with hand-designed features, while we make use of data-driven techniques for learning a shape-based segment classifier.
Image segmentation: The idea of searching for an optimal image segmentation given a hierarchical image segmentation tree has been explored. [28] formulates neuron segmentation on electron microscopy images as a maximum a posteriori (MAP) labeling task on a tree-structured graph. It can be made equivalent to our search under certain conditions. [29] tackles the problem of class-agnostic instance segmentation in image space by exploiting visual appearance and motion. We discuss more in Section III and IV-B.

III. APPROACH
For 3D object point segmentation, the input is a 3D point cloud, which contains an unknown number of objects. The goal is to produce a point segmentation, in which every segment contains points from one and only one object.

Fig. 2: On the left, we visualize a set with 6 points. According to the Bell number, one will find 203 unique segmentations (partitions). Most of these are arbitrary and do not respect local geometry, e.g. {{1, 2, 5}, {3, 4, 6}}. On the right, we implement geometric constraints with a tree formed by hierarchical grouping. Every vertex cut of this tree is automatically a segmentation that respects the local geometry encoded by the tree.

Segmentation: A global segmentation $P_X$ is a partition of a set of $N$ points $X$ into subsets, $P_X = \{C_i\}_{i=1}^{M}$, where $M$ denotes the number of segments and $C_i \subset X$. We refer to each $C_i$ as a local segment. Importantly, every point exists in one and only one segment, meaning $\cup_{i=1}^{M} C_i = X$ and $\forall i \neq j,\ C_i \cap C_j = \emptyset$.

Tree-consistent segmentations: Let us use $S_X$ to denote the set of all possible global segmentations on $X$, i.e. all possible $P_X$. Without constraints, the size of $S_X$ is given by the Bell number and grows at least exponentially in $N$. In practice, we can reduce the number of candidates by enforcing geometric constraints. In this work, we implement the constraints by grouping all points hierarchically into a tree structure $T_X$. We will discuss how to build such a tree structure based on local geometric cues in Section III-D. For now let us assume the tree is given.
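To make the size of the unconstrained search space concrete, the following minimal sketch computes Bell numbers with the Bell triangle. The function name `bell_number` is illustrative and not part of the paper.

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n points (Bell triangle recurrence)."""
    row = [1]  # row[0] holds the Bell number of the current set size
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# Six points, as in the left panel of Fig. 2.
partitions_of_six = bell_number(6)
```

For six points this gives 203 partitions, matching the count quoted in the caption of Fig. 2.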
Once we specify the tree, we can focus on a strictly smaller set of segmentations that respect local geometry. We denote this set as $S_{X,T}$ and call its members tree-consistent segmentations. As a reference, the size of $S_{X,T}$ is still exponential in $N$ when $T_X$ is a balanced binary tree. We further illustrate the relationship between $S_X$ and $S_{X,T}$ with an example in Figure 2. Any tree-consistent segmentation from $S_{X,T}$ corresponds to a vertex cut set of the tree $T$, i.e. a set of tree nodes that satisfies the following constraints: (1) for each node in the vertex cut, its ancestor and itself cannot both be in the cut and (2) each leaf node must have itself or its ancestor in the cut. This relationship allows us to design efficient tree searching algorithms, as we will see later.
Segment score: Before we discuss how to score a global segmentation, we first introduce how to score a local segment. Given a local segment $C \subset X$, we define a function $f(C; \theta) : C \to [0, 1]$ that predicts a given segment's "objectness", where $\theta$ represents the parameters. One can implement such a function with a PointNet++, where $\theta$ would represent the weights of the PointNet++. We will discuss how to learn this function in Section III-C. For now let us assume it is given.
Segmentation score: We now introduce how to score a global segmentation. Given a global segmentation $P_X = \{C_i\}_{i=1}^{M}$, we define its score $F(P_X; \theta) : P_X \to [0, 1]$ by aggregating the local objectness of its individual segments. Specifically, we introduce worst-case segmentation and average-case segmentation. Note that our objective can be made equivalent to [28] if we score a segmentation as the sum of its local segment scores. As we see in Section IV-B, this objective produces much larger over-segmentation error.
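The vertex-cut view gives a simple recurrence for the number of tree-consistent segmentations: a node either forms one segment by itself, or its children are cut independently. The sketch below counts them. The dictionary-based tree encoding and the helper `balanced_binary_tree` are assumptions made for illustration.

```python
from math import prod


def count_tree_consistent(children: dict, node) -> int:
    """Count vertex cuts of the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return 1  # a leaf can only be kept as a single segment
    # Either keep `node` whole (1 option) or combine any cuts of its children.
    return 1 + prod(count_tree_consistent(children, k) for k in kids)


def balanced_binary_tree(depth: int) -> dict:
    """Hypothetical helper: heap-indexed balanced binary tree with 2**depth leaves."""
    n_internal = 2 ** depth - 1
    return {i: [2 * i + 1, 2 * i + 2] for i in range(n_internal)}


counts = [count_tree_consistent(balanced_binary_tree(d), 0) for d in range(5)]
```

For a balanced binary tree the count obeys a(d+1) = a(d)^2 + 1, so it roughly squares each time the number of leaves doubles. This is the exponential growth in the number of leaves mentioned above.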

A. Worst-case segmentation
Worst-case segmentation scores a global segmentation as the worst objectness among its local segments:
$$F_{\min}(P_X; \theta) = \min_{i \in \{1, \dots, M\}} f(C_i; \theta),$$
where $P_X \in S_{X,T}$, $P_X = \{C_i\}_{i=1}^{M}$, and $C_i \subset X$. We define $P^*_{X,\min}$ as the optimal worst-case segmentation if
$$P^*_{X,\min} = \operatorname{argmax}_{P_X \in S_{X,T}} F_{\min}(P_X; \theta).$$
It turns out that the problem of finding the optimal worst-case segmentation has optimal substructure (Theorem 1), allowing us to find the global optimum efficiently with dynamic programming (Algorithm 1).
We briefly describe how the algorithm works. Given a set of points $X$ and a tree $T_X$, OPTMINSEG($X$, $T_X$) (Algorithm 1) produces an optimal worst-case segmentation $P^*_{X,\min}$ with score $F^*_{\min}(P^*_{X,\min}; \theta)$. For simplicity, we refer to a node in the tree by the set of points it is associated with. The algorithm starts from the root node $X$ and chooses between a coarse segmentation ($\{X\}$) and a fine one. The fine segmentation is the union of the optimal worst-case segmentations of all of $X$'s children, which can be computed recursively. The algorithm first traverses down to the leaf nodes, which represent the finest segmentation. Then it makes its way up, during which it finalizes the optimal segmentation for each intermediate node by making local coarse vs. fine decisions. Eventually, it returns to the root node and produces an optimal worst-case global segmentation.
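Lemma 1: Given pairs of non-empty sets that contain real numbers, $\{(A_i, B_i)\}_{i=1}^{n}$, if $\min A_i \geq \min B_i$ for every $i$, then $\min \cup_i A_i \geq \min \cup_i B_i$.

Theorem 1: Given $C$ and $T_C$, Algorithm 1 finds the optimal segmentation $P^*_{C,\min} = \operatorname{argmax}_{P_C \in S_{C,T}} F_{\min}(P_C; \theta)$.

Proof: Proof by structural induction.
Base: When $N_C = \emptyset$, meaning $C$ corresponds to a leaf node in $T_C$, the algorithm returns $\{C\}$, which is the only segmentation in $S_{C,T}$ and obviously is optimal.
Induction: When $N_C \neq \emptyset$, we need to show that the algorithm will produce the optimal segmentation, i.e. $P^*_C$ and $F^*_C$, if it has access to the optimal segmentation for each of $C$'s children $C_i$, i.e. $P^*_{C_i}$ and $F^*_{C_i}$ (optimal substructure). Let $P_C$ be the segmentation that the algorithm produces for $C$ and let $F_C$ be its score. If $P_C$ were not optimal, there must exist a different segmentation $P'_C$ with score $F'_C$, s.t. $P'_C \neq P_C$ and $F'_C > F_C$. Moreover, $P'_C$ is either the trivial segmentation, i.e. $P'_C = \{C\}$, or the union of segmentations over each of $C$'s children nodes, i.e. $P'_C = \cup_i P'_{C_i}$. The trivial case is impossible, because the algorithm by design returns a score $F_C \geq f(C; \theta)$. Thus, $P'_C$ has to be the union of segmentations over each of $C$'s children nodes. According to the inductive hypothesis, the algorithm has the optimal segmentation over each of $C$'s children nodes, meaning
$$\forall i, \quad \min_{z \in P^*_{C_i}} f(z; \theta) \geq \min_{z \in P'_{C_i}} f(z; \theta).$$
Here, $z$ represents an arbitrary local segment from a segmentation over $C_i$. By applying Lemma 1, we have
$$\min_{z \in \cup_i P^*_{C_i}} f(z; \theta) \geq \min_{z \in \cup_i P'_{C_i}} f(z; \theta). \quad (5)$$
On one hand, $F'_C = \min_{z \in \cup_i P'_{C_i}} f(z; \theta)$. On the other hand, the algorithm by design chooses the higher scoring one between $P_C = \{C\}$ with a score of $F_C = f(C; \theta)$ and $P_C = \cup_i P^*_{C_i}$ with a score of $F_C = \min_{z \in \cup_i P^*_{C_i}} f(z; \theta)$, so $F_C \geq \min_{z \in \cup_i P^*_{C_i}} f(z; \theta)$. With these and (5), we conclude $F_C \geq F'_C$, which contradicts the assumption $F'_C > F_C$.

Algorithm 1 Optimal worst-case segmentation
 1: function OPTMINSEG($C$, $T_C$)    (returns a segmentation $P_C$ with a score of $F_C$)
 2:     $N_C$ ← set of $C$'s children nodes in $T_C$
 3:     $P_C$ ← $\{C\}$
 4:     $F_C$ ← $f(C; \theta)$
 5:     if $N_C \neq \emptyset$ then
 6:         for $C_i$ in $N_C$ do
 7:             $P_{C_i}$, $F_{C_i}$ ← OPTMINSEG($C_i$, $T_{C_i}$)
 8:             if $F_{C_i} < F_C$ then
 9:                 return $P_C$, $F_C$
10:         $P_C$ ← $\cup_i P_{C_i}$
11:         $F_C$ ← $\min_i F_{C_i}$
12:     return $P_C$, $F_C$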
Generality: Our analysis makes no assumptions about the objectness function $f(C; \theta)$ except that it cannot be affected by the partitioning of other segments. In particular, this would allow objectness to depend on the contextual arrangement of surrounding points outside $C$, e.g., $f(C, X; \theta)$.
Efficiency: Given points $X$ and a tree $T_X$ with $N$ leaf nodes, Algorithm 1 is guaranteed to return the optimal worst-case segmentation after visiting every node in the tree. In practice, it might not visit all nodes. Instead, it skips the remaining sub-trees whenever one sub-tree exhibits a lower score than the coarse segmentation (line 9 in Algorithm 1). The algorithm's complexity is linear in $N$ despite the fact that the search space is exponential in $N$.
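The recursion and the early exit described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tree is assumed to be a dictionary from a node to its children, a node is identified with the frozen set of point indices it contains, and `scores` is a hypothetical lookup table standing in for the learned objectness $f(C; \theta)$.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Segment = FrozenSet[int]  # a segment is the set of point indices it contains


def opt_min_seg(node: Segment,
                children: Dict[Segment, List[Segment]],
                f: Callable[[Segment], float]) -> Tuple[List[Segment], float]:
    """Tree-consistent segmentation of `node` that maximizes the minimum objectness."""
    coarse_score = f(node)
    kids = children.get(node, [])
    if not kids:  # leaf: {node} is the only candidate
        return [node], coarse_score
    fine_seg: List[Segment] = []
    fine_score = float("inf")
    for kid in kids:
        kid_seg, kid_score = opt_min_seg(kid, children, f)
        if kid_score < coarse_score:
            # The fine option can no longer beat the coarse one: skip remaining sub-trees.
            return [node], coarse_score
        fine_seg.extend(kid_seg)
        fine_score = min(fine_score, kid_score)
    return fine_seg, fine_score  # reached only when fine_score >= coarse_score


# Toy hierarchy with hypothetical objectness scores.
root, a, b = frozenset({0, 1, 2, 3}), frozenset({0, 1}), frozenset({2, 3})
p0, p1 = frozenset({0}), frozenset({1})
children = {root: [a, b], a: [p0, p1]}
scores = {root: 0.3, a: 0.5, b: 0.7, p0: 0.9, p1: 0.8}
best_seg, best_score = opt_min_seg(root, children, scores.__getitem__)
```

In this toy example the search splits the root and node `a` but keeps `b` whole, giving three segments whose worst objectness is 0.7. Ties between the coarse and fine option are resolved in favour of the fine one here, which is an arbitrary choice.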

B. Average-case segmentation
Average-case segmentation scores a global segmentation as the average objectness among its local segments:
$$F_{\mathrm{avg}}(P_X; \theta) = \frac{1}{M} \sum_{i=1}^{M} f(C_i; \theta),$$
where $P_X \in S_{X,T}$, $P_X = \{C_1, \dots, C_M\}$, and $C_i \subset X$. We define $P^*_{X,\mathrm{avg}}$ as an optimal average-case segmentation if
$$P^*_{X,\mathrm{avg}} = \operatorname{argmax}_{P_X \in S_{X,T}} F_{\mathrm{avg}}(P_X; \theta).$$
It turns out that, unlike worst-case segmentation, the problem of finding the optimal average-case segmentation does not have optimal substructure, meaning a locally optimal partitioning might no longer be optimal when considering the global partitioning. Formally speaking, Lemma 1 no longer holds once min is changed to avg.
Despite the lack of optimal substructure, we apply a similar greedy search algorithm. The main difference is how we aggregate local scores. Though greedily averaging local scores might lead to myopic decisions in certain situations (Figure 3), it performs quite well in practice (Section IV).
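The text does not spell out the greedy average-case procedure, so the sketch below is one plausible reading: at each node, keep whichever of the coarse segment and the concatenation of the children's greedy results has the higher mean objectness. It also enumerates all tree-consistent segmentations of a tiny tree to show a myopic decision. The tree encoding, node names, and scores are hypothetical.

```python
from itertools import product


def mean_score(seg, f):
    """Average objectness over the segments of one segmentation."""
    return sum(f(s) for s in seg) / len(seg)


def greedy_avg_seg(node, children, f):
    """Bottom-up greedy choice between the coarse segment and the children's results."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    fine = [seg for kid in kids for seg in greedy_avg_seg(kid, children, f)]
    return fine if mean_score(fine, f) > f(node) else [node]


def all_tree_consistent(node, children):
    """Enumerate every tree-consistent segmentation (exponential, tiny trees only)."""
    kids = children.get(node, [])
    result = [[node]]
    if kids:
        for combo in product(*(all_tree_consistent(k, children) for k in kids)):
            result.append([seg for part in combo for seg in part])
    return result


children = {"R": ["A", "B"], "A": ["a1", "a2"]}
scores = {"R": 0.6, "A": 0.5, "a1": 0.9, "a2": 0.2, "B": 1.0}
f = scores.__getitem__

greedy = greedy_avg_seg("R", children, f)
best = max(all_tree_consistent("R", children), key=lambda seg: mean_score(seg, f))
```

Here the greedy search splits `A` because its two children average 0.55 against 0.5 for `A` itself, and ends with a mean of about 0.7, whereas keeping `A` whole next to `B` would have averaged 0.75. The locally better choice is therefore globally worse, which is the kind of myopic decision mentioned above.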

C. Learning the objectness function
We have discussed segmentation algorithms under the assumption that we already have access to an objectness function $f(C; \theta)$, which predicts an objectness score for a given point cloud. We now introduce how to learn this function. Although there has been a line of work that focuses on learning better representations, including Kd-networks [30], PointCNN [31], EdgeConv [32], and PointConv [33], to name a few, we choose a simple PointNet++ to parameterize the objectness function as a proof of concept. Below, we describe how to train a PointNet++ as a regressor that predicts an objectness score.
Ground truth objectness: First, we must define the regression target, i.e. the ground truth objectness, of a given segment $C$. Suppose we have a ground truth segmentation $P^{gt} = \{C^{gt}_1, \dots, C^{gt}_L\}$, where $L$ is the number of ground truth segments. We can define $C$'s target objectness as the largest point IoU between itself and any ground truth segment:
$$\mathrm{Objectness}(C, P^{gt}) = \max_{l=1,\dots,L} \frac{|C \cap C^{gt}_l|}{|C \cup C^{gt}_l|}. \quad (8)$$
Such a definition of objectness is only reasonable if points are uniformly distributed in space. In practice, 3D sensors (e.g. LiDAR) tend to produce denser points near the sensor. As a consequence, the objectness will be heavily influenced by the partitioning of points closer to the sensor. For example, imagine two objects are merged into one segment. Suppose one object has $n_1$ points and the other has $n_2$. If we use vanilla IoU as objectness, this segment would score $\frac{\max(n_1, n_2)}{n_1 + n_2}$. When $n_1 \gg n_2$, the score could be very close to 1 even though the segment clearly introduces an under-segmentation error. To compensate for this bias towards nearby objects, we propose a simple modification to IoU as in (9):
$$\mathrm{Objectness}(C, P^{gt}) = \max_{l=1,\dots,L} \frac{\sum_{x \in C \cap C^{gt}_l} x^T x}{\sum_{x \in C \cup C^{gt}_l} x^T x}, \quad (9)$$
where $x^T x$ represents a point $x$'s squared distance to the sensor origin. (8) is a special case, where $x^T x$ is replaced with 1.

Implementation: We train a PointNet++ with multi-scale grouping (MSG) [4] for learning the objectness function. Starting from the off-the-shelf architecture, we replace the classifier with a regressor that produces a real value given an input point cloud. We apply a sigmoid function to map the regression output to the range [0, 1]. Finally, we compute the mean-squared error between the predicted and ground truth objectness and perform backpropagation. In terms of preprocessing, we follow [24] to make sure the input cloud is centered at the origin and rotated based on the viewpoint. To facilitate batch processing, we follow the standard practice for PointNet++ and re-sample each segment to 1024 points.
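As an illustration of the regression target, the sketch below computes both the vanilla and the distance-weighted point IoU for a segment that merges a dense nearby object with a sparse distant one. The function name, the index-based segment encoding, and the toy coordinates are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]


def objectness(segment: Set[int], gt_segments: List[Set[int]],
               points: Dict[int, Point], weighted: bool = True) -> float:
    """Largest point IoU between `segment` and any ground-truth segment.

    With weighted=True each point counts with its squared distance to the
    sensor origin; with weighted=False every point counts as 1 (vanilla IoU).
    """
    def weight(idx: int) -> float:
        x, y, z = points[idx]
        return x * x + y * y + z * z if weighted else 1.0

    best = 0.0
    for gt in gt_segments:
        inter = sum(weight(i) for i in segment & gt)
        union = sum(weight(i) for i in segment | gt)
        if union > 0:
            best = max(best, inter / union)
    return best


# Toy scene: 90 points on an object 2 m away, 10 points on an object 6 m away.
points = {i: (2.0, 0.0, 0.0) for i in range(90)}
points.update({i: (6.0, 0.0, 0.0) for i in range(90, 100)})
near, far = set(range(90)), set(range(90, 100))
merged = near | far  # an under-segmentation that groups both objects

vanilla = objectness(merged, [near, far], points, weighted=False)
weighted = objectness(merged, [near, far], points, weighted=True)
```

The merged segment scores 0.9 under vanilla IoU because the nearby object dominates the point count, but only 0.5 once points are weighted by squared range, so the under-segmentation is penalized more clearly.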

D. Building tree hierarchies
We have discussed segmentation algorithms under the assumption that we have access to a tree hierarchy. Now we introduce how to build such a tree hierarchy given a set of points $X$. One natural approach is agglomerative clustering. After we define a metric (i.e. the pairwise distance between two points) and a linkage criterion (i.e. the pairwise distance between two sets of points), we can start from $\{\{x_1\}, \dots, \{x_N\}\}$ and keep merging the closest pair of point sets by taking their union, until all points are merged into one set. Such an approach produces a tree in a bottom-up fashion.
This approach tends to create tree hierarchies with very fine granularity, e.g. one node may differ from another by only one point. As we have mentioned, our segmentation algorithms need to evaluate the objectness of every node in the tree. From an efficiency point of view, we would like to build a coarser tree whose leaf nodes are segments rather than individual points. Moreover, adjacent nodes should differ from each other much more.
Implementation: We build tree hierarchies by applying Euclidean Clustering [9] recursively in a top-down fashion with a list of decreasing ε. Since Euclidean Clustering finds connected components w.r.t. a distance threshold ε, we start with the largest ε, which defines the coarsest connected components. Then, we apply Euclidean Clustering with a smaller ε within each connected component. This produces a multiple-tree top-down hierarchy. In our experiments, we use ε ∈ {2m, 1m, 0.5m, 0.25m} to build tree hierarchies for both training and testing. During training, we extract segments out of tree hierarchies built with the same parameters to form our training set for learning the objectness function. During testing, we apply the same learned objectness function in both worst-case segmentation and average-case segmentation.
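The top-down construction can be sketched as follows. This is a naive quadratic-time illustration under several assumptions that the text does not specify: two points are linked when their distance is strictly below ε, a node is a dictionary with `points` and `children` keys, and a child that is identical to its parent is collapsed because it adds no new candidate segment.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def euclidean_clusters(ids: List[int], pts: Sequence[Point], eps: float) -> List[List[int]]:
    """Connected components of the graph that links points closer than eps."""
    unvisited, clusters = set(ids), []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            cur = frontier.pop()
            near = [j for j in unvisited if dist(pts[cur], pts[j]) < eps]
            unvisited.difference_update(near)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)  # deterministic order


def build_forest(ids: List[int], pts: Sequence[Point], eps_list: List[float]) -> list:
    """Split `ids` at eps_list[0]; refine each component with the remaining thresholds."""
    if not eps_list:
        return []
    forest = []
    for comp in euclidean_clusters(ids, pts, eps_list[0]):
        kids = build_forest(comp, pts, eps_list[1:])
        if len(kids) == 1:  # lone child equals its parent: collapse the level
            kids = kids[0]["children"]
        forest.append({"points": comp, "children": kids})
    return forest


pts = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.9, 0.0, 0.0), (5.0, 0.0, 0.0)]
forest = build_forest(list(range(len(pts))), pts, [2.0, 1.0, 0.5, 0.25])
```

With the thresholds from the text, the four toy points yield two trees: one whose root {0, 1, 2} splits into {0, 1} and {2} at ε = 0.5 m, and one consisting of the isolated point {3}. A practical implementation would replace the quadratic neighbour search with a spatial index.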

IV. EXPERIMENTS
For evaluation, we repurpose the KITTI object detection benchmark for point cloud segmentation following the setup in [8]. In our case, 3D objects do not physically overlap with one another. Therefore, we use ground truth 3D bounding boxes to produce ground truth segmentation. To do so, we first remove all points outside ground truth 3D bounding boxes (Figure 1). Then we treat points within one ground truth 3D bounding box as the ground truth segment for the object. On KITTI, there exist ground truth 3D bounding boxes that overlap with each other. We ignore such segments during evaluation, since it is not clear how to define the ground-truth for the points in such bounding boxes [8]. We follow [34] for splitting data into training and validation.
Evaluation protocol: We follow the evaluation metrics introduced by Held et al. [8], which consist of two errors, under-segmentation error and over-segmentation error. Given a ground truth segmentation $P^{gt} = \{C^{gt}_1, \dots, C^{gt}_L\}$, we compute the under-segmentation error $U$ and the over-segmentation error $O$ of an output segmentation $P = \{C_1, \dots, C_M\}$ as:
$$U = \frac{1}{L} \sum_{l=1}^{L} 1\left(\frac{|C_{i^*} \cap C^{gt}_l|}{|C_{i^*}|} < \tau_U\right), \qquad O = \frac{1}{L} \sum_{l=1}^{L} 1\left(\frac{|C_{i^*} \cap C^{gt}_l|}{|C^{gt}_l|} < \tau_O\right),$$
with
$$i^* = \operatorname{argmax}_{i \in \{1, \dots, M\}} |C_i \cap C^{gt}_l|,$$
where $1(\cdot)$ is an indicator function and $\tau_U$, $\tau_O$ are both constant thresholds. We set $\tau_U = 2/3$ and $\tau_O = 1$ following [8]. We ignore ground truth objects with overlapping bounding boxes (529/20870 ≈ 2.5%) and those with 0 points (238/20870 ≈ 1.1%) inside their 3D boxes. Other than these, we compute segmentation errors over all objects from all classes and also provide errors focusing on objects within 15m. We also adopt a slightly modified evaluation: instead of skipping objects with overlapping boxes entirely, we only ignore their overlapped regions.
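A minimal sketch of how the two error rates could be computed. It assumes that each ground truth object is matched to the predicted segment with the largest point overlap, that under-segmentation is flagged when the overlap covers less than $\tau_U$ of the matched predicted segment, and that over-segmentation is flagged when it covers less than $\tau_O$ of the ground truth segment. These conventions are assumptions about the definitions in [8], and the function name is illustrative.

```python
from typing import List, Set, Tuple


def segmentation_errors(pred: List[Set[int]], gt: List[Set[int]],
                        tau_u: float = 2 / 3, tau_o: float = 1.0) -> Tuple[float, float]:
    """Return (under-segmentation error, over-segmentation error) over ground-truth objects."""
    under = over = 0
    for g in gt:
        # Best-matching predicted segment for this ground-truth object.
        best = max(pred, key=lambda c: len(c & g))
        inter = len(best & g)
        under += inter / len(best) < tau_u  # matched segment also covers other objects
        over += inter / len(g) < tau_o      # object is split across several segments
    return under / len(gt), over / len(gt)


gt = [{0, 1, 2}, {3, 4}]
merged_errors = segmentation_errors([{0, 1, 2, 3, 4}], gt)
split_errors = segmentation_errors([{0, 1}, {2}, {3, 4}], gt)
```

Merging both objects into one segment gives an under-segmentation error of 1.0 and no over-segmentation, while splitting the first object gives an over-segmentation error of 0.5 and no under-segmentation. With $\tau_O = 1$, any ground truth object that is not fully covered by its matched segment counts as over-segmented.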

A. Baselines
Euclidean clustering: We use Euclidean clustering with 4 different distance thresholds {2m, 1m, 0.5m, 0.25m} to build trees of segments, which define the space of possible segmentations for our approach. Therefore, it makes sense to include all 4 of them as baselines and see if our approach indeed finds a better solution.
State-of-the-art 3D detectors: We compare our approach to AVOD [18], PointPillars [21], PointRCNN [25], and SECOND [22]. We follow the off-the-shelf training and testing settings as closely as possible. For AVOD, we re-train a LiDAR-only car detector and a LiDAR-only people (pedestrian and cyclist) detector following the official implementation. For PointPillars, we re-train a detector that simultaneously detects cars and people (pedestrian and cyclist) following an author-endorsed implementation. For PointRCNN, we evaluate the official pre-trained car model, as no models for other classes are available within its official implementation. For SECOND, since it is our best performing baseline, besides re-training the off-the-shelf model, we also explore various ways to improve its performance. By design, these detectors output class-specific bounding box detections. To produce class-agnostic segmentations, we ignore the class label and follow a greedy procedure: we start with the highest scoring bounding box and group all points within the box as one segment. We then remove those points and move on to the next highest scoring detection. We repeat until we exhaust either the detections or the 3D points. In the end, some points might still not be assigned to a segment. A simple fix is grouping the leftover points as a new segment. We discuss a much better alternative approach below.
Detector++: A better approach to handling missed detections is to fall back to clustering. Specifically, we apply Euclidean Clustering (EC) with a fixed ε on all leftover points, producing a set of leftover segments. For each leftover segment, we check if it can be merged into an existing detection segment, using the criterion of whether the smallest pairwise distance between the two segments is smaller than the threshold ε. If so, we merge the leftover segment into the detection segment. We refer to such baselines as Detector++ (e.g. AVOD++ etc.).
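The greedy conversion from scored boxes to segments can be sketched as below. For brevity the sketch assumes axis-aligned boxes given as (score, min corner, max corner), whereas real detectors output oriented 3D boxes. The function name and data layout are illustrative.

```python
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Box = Tuple[float, Point, Point]  # (score, min corner, max corner), axis-aligned


def detections_to_segments(points: Sequence[Point],
                           boxes: List[Box]) -> Tuple[List[Set[int]], Set[int]]:
    """Assign points to boxes in decreasing score order; return segments and leftovers."""
    remaining = set(range(len(points)))
    segments: List[Set[int]] = []
    for _, lo, hi in sorted(boxes, key=lambda b: b[0], reverse=True):
        if not remaining:
            break  # every point is already assigned
        inside = {i for i in remaining
                  if all(l <= c <= h for c, l, h in zip(points[i], lo, hi))}
        if inside:
            segments.append(inside)
            remaining -= inside  # a point belongs to at most one segment
    return segments, remaining


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
boxes = [(0.6, (-1.0, -1.0, -1.0), (6.0, 1.0, 1.0)),
         (0.9, (-1.0, -1.0, -1.0), (2.0, 1.0, 1.0))]
segments, leftover = detections_to_segments(points, boxes)
```

The higher scoring box claims points 0 and 1 first, so the larger, lower scoring box only receives point 2, and point 3 is left unassigned.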
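A minimal sketch of the merging step, assuming the leftover points have already been clustered. The text does not say what happens when a leftover segment is within ε of several detection segments; this sketch merges it into the closest one, which is an assumption, as are the function names.

```python
from math import dist
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]


def min_gap(points: Sequence[Point], a: Set[int], b: Set[int]) -> float:
    """Smallest pairwise distance between two segments (naive O(|a||b|))."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def merge_leftovers(points: Sequence[Point], det_segments: List[Set[int]],
                    leftover_segments: List[Set[int]], eps: float) -> List[Set[int]]:
    """Merge each leftover segment into the nearest detection segment if closer than eps."""
    merged = [set(s) for s in det_segments]
    standalone: List[Set[int]] = []
    for left in leftover_segments:
        # Gaps are measured against the original detection segments.
        gaps = [min_gap(points, left, det) for det in det_segments]
        k = min(range(len(gaps)), key=gaps.__getitem__) if gaps else None
        if k is not None and gaps[k] < eps:
            merged[k] |= left
        else:
            standalone.append(set(left))
    return merged + standalone


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.3, 0.0, 0.0), (9.0, 0.0, 0.0)]
result = merge_leftovers(points, [{0, 1}], [{2}, {3}], eps=0.5)
```

Point 2 lies 0.3 m from the detection segment and is absorbed into it, while point 3 is too far away and remains a segment of its own.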

B. Results
We first present qualitative examples of our approach segmenting rare objects on KITTI Val, as shown in Figure 4. For quantitative evaluation, we present both per-class and overall segmentation errors in Table I.
Ours(min) vs. Ours(avg): We label the optimal worst-case segmentation as Ours(min) and the average-case segmentation as Ours(avg). Ours(avg) consistently outperforms Ours(min) in terms of the total error. Ours(min) produces a much lower over-segmentation error but a much higher under-segmentation error, suggesting it makes more mistakes of grouping different objects into one segment and fewer mistakes of splitting points from one single object into multiple segments. This behavior might be due to the risk-averse objective of optimal worst-case segmentation. However, the current evaluation does not emphasize worst-case performance; instead, it measures the average performance over all objects. We observe that if we evaluate the worst-case objectness (Section IV-C), Ours(min) does outperform both Ours(avg) and AVOD++.
Ours vs. Euclidean Clustering: We label Euclidean Clustering as "EC(ε)", where ε represents the distance threshold (in meters). Altogether, they define a segment hierarchy. We construct a pool of segments that contains every node (segment) in the hierarchy and call this "EC(all)*". This serves as an unreachable upper bound, since segments from such a pool overlap with each other, which violates the disjointness constraint of a valid partition. Nonetheless, it shows that the gap between our proposed method and the upper bound is relatively small (3-4%), suggesting plenty of room left for improvement in creating better hierarchies.
Detector++ vs. Detector: We focus on AVOD to demonstrate the improvement of Detector++ over Detector. AVOD produces much larger over-segmentation errors, likely due to imprecisely localized 3D bounding boxes. For example, when a 3D bounding box is predicted smaller than it should be, the resultant segment might miss points on the edge, leading to over-segmentation. AVOD++ is designed to fix this issue and dramatically improves the over-segmentation error. The under-segmentation error also improves significantly from AVOD to AVOD++, likely due to successfully segmenting objects that are completely missed by detections.
Ours vs. Detector++: SECOND++ performs the best among all Detector++ baselines and also achieves the lowest overall total error among all methods. However, if we break down total segmentation errors on a per-class basis, our approaches perform much better than SECOND++. This difference is due to a skewed data distribution. For example, 68% of objects are labeled as car while only 3% are labeled as misc. SECOND++ performs better on common classes such as car and ours performs better on rare ones such as misc.
Runtime analysis: Our algorithm requires running PointNet++ on every candidate segment in order to compute its objectness. In practice, one frame from KITTI Val, which contains 68 (σ = 42) segments on average, takes about 0.19s (σ = 0.06s) to process on a single GTX 1080.

C. Additional evaluation protocols
Class-agnostic instance segmentation: The evaluation protocol we adopt comes from the robotics community [15]. It differs from the standard evaluation in computer vision, i.e. per-voxel instance segmentation in ScanNet [35]. One key difference is that 3D instance segmentation does not require the output segmentation to be a valid partition. Instead, it treats the task as retrieval and evaluates the tradeoff between precision and recall. Here we take an approach similar to ScanNet's, but modify the evaluation protocol to be class-agnostic and per-point instead of per-voxel.
As we can see in Table II, the observations are consistent with what we see in Table I: SECOND++(8) with both modifications outperforms our segmentation approach on common classes such as car, but falls short on rarer classes (such as person sitting and tram) by a large margin. Overall, the best SECOND approach outperforms the best variant of our approach by 1.6% in mAP.
How objectness generalizes: To evaluate how well our learned objectness model generalizes, we apply it to ground truth segments from the validation set. In Figure 5, we plot the average objectness score for each class and the standard deviation. We also show the percentage of objects for each class within the training set. As the amount of training data decreases dramatically, the average score tends to drop slightly and the variance tends to rise slightly.
Worst-case evaluation: In Table I and II, we see Ours(avg) outperforms Ours(min) even though the latter is provably optimal. We have briefly discussed the reason: current protocols do not evaluate worst-case performance. Here, we score the worst IoU between a set of local segments and the ground truth.

D. Additional diagnostics
Sensitivity analysis: Our objectness function is learned on segments from an EC hierarchy generated with 4 distance thresholds {2m, 1m, 0.5m, 0.25m}. To analyze how robust our algorithm is to changes in hyper-parameters, we test the learned objectness function on different hierarchies. In Table I and II, we find that having a deeper hierarchy significantly reduces segmentation errors. Compared to hard-thresholded segmentation errors, there are only slight changes in multi-threshold instance segmentation mAP.
Weighted vs. vanilla IoU: Here, we empirically compare weighted IoU and vanilla IoU in terms of defining the training target for our objectness model. As we see in Table III, for both worst-case and average-case segmentation, the objectness model trained with weighted IoU performs slightly better than the one trained with vanilla IoU. Note that "Ours(min) -vanilla" and "Ours(avg) -vanilla" share the exact same underlying objectness model.

CONCLUSION
We present an approach for class-agnostic point cloud segmentation. The approach efficiently searches over an exponentially large space of candidate segmentations and returns one where individual segments score well according to a data-driven point-based model of "objectness". We prove that our algorithm is guaranteed to achieve optimality according to a specific definition. On KITTI, we demonstrate that our approach significantly outperforms past bottom-up approaches and top-down object-based algorithms for segmenting point clouds.
Acknowledgements: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.

APPENDIX
Slides: Please find a slide deck (here) that illustrates the main ideas in this paper.
Additional visualization: Please find videos (1, 2, 3) that show advantages and limitations of our approach.
Additional evaluation: In Table IV, we show segmentation errors under a slightly modified evaluation: instead of skipping overlapping objects entirely, we only ignore the points that fall into the overlapping region.