A Supervoxel Segmentation Method With Adaptive Centroid Initialization for Point Clouds

Supervoxels find applications as a pre-processing step in many image processing problems due to their ability to provide a regional representation of points by grouping them into a set of clusters. Besides reducing the overall computational time of subsequent algorithms, the desirable properties of supervoxels are adherence to object boundaries and compactness. Existing supervoxel segmentation methods define the size of a supervoxel based on a user-input resolution value. A fixed resolution results in poor performance on point clouds with non-uniform density, while other methods, in their quest for better boundary adherence, produce supervoxels with irregular shapes and elongated boundaries. In this article, we propose a new supervoxel segmentation method, based on the k-means algorithm, with dynamic cluster seed initialization to ensure a uniform distribution of cluster seeds in point clouds with variable densities. We also propose a new cluster seed initialization strategy, based on histogram binning of surface normals, for better boundary adherence. Our algorithm is parameter-free and gives equal importance to the color, spatial location and orientation of the points, resulting in compact supervoxels with tight boundaries. We test the efficacy of our algorithm on a publicly available point cloud dataset consisting of 1449 pairs of indoor RGB-D images.

dynamic cluster seed initialization introduces additional computational overhead, the gained accuracy in terms of boundary adherence and compactness can be beneficial for algorithms that are not seriously restricted in time. We previously introduced this method in [10], where we used it to cluster a set of 3D points with surface normals representing camera poses on the surface of a vehicle's 3D model. It was introduced as a pre-processing step to reduce the input complexity of an optimal camera placement (OCP) problem for vehicle surround vision. The method showed promising results for the OCP problem by significantly reducing the overall computational time (up to 160 times). However, the supervoxel method itself was not analyzed, as it was used only as a pre-processing step. In this article, we present a detailed analysis of the method on colored point clouds to compare its efficacy against state-of-the-art supervoxel segmentation methods. The rest of the document is organized as follows: Section II details relevant literature, Section III details our proposed clustering method and the results are discussed in Section IV.

II. BACKGROUND WORK

Superpixels are 2D versions of supervoxels. While superpixels are extensively studied in the field of image processing, [8], [11], [12], [13], [14], supervoxels have not been studied enough despite the demand created by recent advances in 3D image analysis. In the beginning, video sequences or stacks of 2D images collected over time were considered as 3D images. Therefore, the first 3D extensions of superpixel methods were tailored to deal with stacks of images, with time being the third dimension.
[8], [14], [15] are some of the first supervoxel methods extended to video sequences. Moore et al. [14] produced an over-segmentation of videos by iteratively partitioning pixels into clusters through horizontal and vertical cuts in a 3D grid. Achanta et al. [8] proposed an efficient and widely successful approach based on the k-means algorithm. In their method, they distribute cluster seeds uniformly across a 2D or 3D grid, search a local neighbourhood around each cluster seed and assign points to the closest cluster center based on a distance metric that relates pixels to a cluster center using position and color information. The primary idea behind the clustering method we propose here is based on this method.
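The k-means-style distance of [8] can be sketched as follows. This is a minimal illustration, not the authors' implementation: `S` is the seed grid interval and `m` is the compactness constant from the SLIC convention; the function and parameter names are hypothetical.

```python
import numpy as np

def slic_distance(p_xyz, p_rgb, c_xyz, c_rgb, S, m=10.0):
    """SLIC-style distance of [8]: the spatial term is normalized by the
    grid interval S and weighted by the compactness constant m."""
    d_s = np.linalg.norm(p_xyz - c_xyz)   # spatial (position) distance
    d_c = np.linalg.norm(p_rgb - c_rgb)   # color distance
    return np.sqrt(d_c**2 + (d_s / S)**2 * m**2)
```

Larger `m` pushes the metric toward spatial proximity, producing more compact clusters; smaller `m` favors color similarity and tighter boundary adherence.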
The goal is to group them into K disjoint subsets S = {S_1, . . . , S_K}, where each subset S_k represents a supervoxel with the label k. At the end of the clustering process, every point in P is expected to be assigned a label k ∈ {0, . . . , K}, depending on which supervoxel the point belongs to. Each supervoxel is represented by its centroid, C_k, a 9D vector calculated as the mean of all points assigned to it. Points are assigned to a supervoxel based on a similarity metric, D, calculated as the Euclidean distance between a point p_i and a cluster center C_k. Supervoxel segmentation algorithms have individual strategies to tackle outlying points. At the end of all iterations of our algorithm, we assign the label k = 0 to outlier points to mark them as unlabeled. The algorithm requires one input parameter, i.e., the number of supervoxels, K. Our proposed algorithm differs from VCCS, [21], in three aspects: (1) for cluster center initialization, instead of uniformly sampling the point cloud, we exploit the surface geometry to identify important regions in the point cloud, (2) instead of projecting the points into the Lab color space, we propose a method to use the similarity metric directly in the RGB color space, and (3) we add or remove supervoxels dynamically to ensure that the entire point cloud is covered by the over-segmentation. The following sub-sections detail the individual steps of the algorithm, i.e., cluster center initialization in Section III-A and the assignment and update steps in Section III-B.
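A single assignment pass of this scheme can be sketched as below. This is a hedged illustration, not the article's implementation: points and centers are assumed to be 9D vectors whose first three components are the spatial position, `distance` stands in for the metric D defined in Section III-B, and all names are hypothetical.

```python
import numpy as np

def assign_labels(points, centers, distance, search_radius):
    """One assignment pass: every point takes the label of the closest
    center whose spatial search window reaches it; points reached by no
    center keep the unassigned/outlier marker."""
    labels = np.full(len(points), -1, dtype=int)   # -1 = unassigned
    best = np.full(len(points), np.inf)
    for k, c in enumerate(centers, start=1):       # supervoxel labels 1..K
        for i, p in enumerate(points):
            # restrict the comparison to a local spatial neighbourhood
            if np.linalg.norm(p[:3] - c[:3]) > search_radius:
                continue
            d = distance(p, c)
            if d < best[i]:
                best[i] = d
                labels[i] = k
    return labels
```

A practical implementation would replace the inner loop with a spatial index (e.g., a k-d tree) so that each center only visits its local neighbourhood.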

A. CLUSTER CENTER INITIALIZATION
We propose a novel approach to select initial cluster seeds based on the points' orientation while ensuring that they are not initialized close to one another. Like the strategy used in [8], we assume that supervoxels are regular in shape and estimate the side length of each supervoxel as S = (N/K)^(1/3). Two cluster centers placed close to one another will have a significant overlap in their search spaces, resulting in competition for points between the neighbouring clusters.

For indoor scenes taken by one still camera (like the data used here), the surface normals lie within only one hemisphere, as the normals point towards the camera. Therefore, for better binning accuracy in indoor point clouds, we create b_n = 2K bins. After creating the histogram of surface normals, all the bin centers without any assigned normals are deleted and one voxel from each of the remaining bins (say we have b_n bins with at least one assigned normal) is selected as a cluster seed. The remaining K − b_n cluster seeds are initialized at equal intervals in the remaining voxels across all bins. Through this strategy, we give more importance to the geometry of the scene than to the spatial distribution of points. Identifying important regions through binning of normals allows for greater representation of small distinct objects, while at the same time, there is a higher chance that large objects (e.g., walls) get multiple cluster seeds.
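The histogram binning of surface normals can be sketched as follows, assuming the bin centers are placed on the unit sphere via a Fibonacci lattice (the "Fibonacci binning" named later in the article) and the normals are unit vectors; the function names are illustrative, not the article's code.

```python
import numpy as np

def fibonacci_sphere(n):
    """n approximately uniform directions on the unit sphere (Fibonacci
    lattice), used here as histogram bin centers for surface normals."""
    i = np.arange(n)
    golden = np.pi * (3.0 - np.sqrt(5.0))      # golden angle
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(golden * i), r * np.sin(golden * i), z], axis=1)

def bin_normals(normals, n_bins):
    """Assign each unit normal to the nearest bin center (largest dot
    product) and return the per-point bin ids and the non-empty bins."""
    centers = fibonacci_sphere(n_bins)
    ids = np.argmax(normals @ centers.T, axis=1)
    return ids, np.unique(ids)
```

One seed can then be drawn from each non-empty bin, with the remaining K − b_n seeds spread at equal intervals across the bins, as described above.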

B. ASSIGNMENT AND UPDATE STEPS
In the assignment step, the points p_i are assigned to the cluster center C_k based on a distance metric D. D is computed as a combination of the Euclidean distances between the position, color, and normal vectors of a point p_i and a cluster center C_k: D = sqrt(D_s^2 + D_c^2 + D_n^2), where D_s is the distance between positions, D_c is the distance between colors, and D_n^2 is the distance between normal vectors. In [29], the authors propose a novel low-cost approximation for calculating the distance between two colors directly in RGB space. They cite subjective experiments to claim that their proposed formulation overcomes limitations of the LUV color space. Moreover, as general datasets have color information given in RGB space, computations to convert from RGB space to LUV space can be avoided through this non-linear distance metric. We propose to calculate the color distance between a point and a cluster center as D_c = sqrt(f_r (r_i − r_k)^2 + f_g (g_i − g_k)^2 + f_b (b_i − b_k)^2), where f_r = 2 + r_m/256, f_g = 4, and f_b = 2 + (255 − r_m)/256 are the weights for the respective colors, and r_m = (r_i + r_k)/2 is the mean of the red values. The core of the assignment and update loop is:

    for each cluster center C_k:
        for each point p_i in the search space of C_k:
            D = distance(p_i, C_k);
            if D < D_i then D_i = D; labels_i = k; end
        end
    end
    outliers = collect points with label == −1;
    N_ol(t + 1) = size(outliers);
    remove all centers with size(C_k) < 0.1 × c_size;
    re-estimate C_k as mean of all p_i with labels_i == k;
    initialize c_add clusters and append to C_k;
    K = K after adding and/or removing clusters;
    t = t + 1;

IV. RESULTS

We compare our method against three state-of-the-art methods: (1) the original voxel cloud connectivity segmentation (vccs) method that works on voxelated point clouds, [21], (2) a supervoxel segmentation method framed as a subset selection problem (ssp), [22], and (3) a k-nearest-neighbours version of the vccs method (vccs-knn) that works directly on the point clouds without voxelation, provided by the authors in [22]. The vccs method is available as part of the Point Cloud Library (PCL), [30], and it was tested using the default parameter settings.
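The RGB color distance above (the "redmean" approximation of [29]) can be computed directly from the weights f_r, f_g, f_b defined in the text; the following is a small illustrative sketch, not the article's implementation.

```python
import numpy as np

def redmean_distance(c1, c2):
    """Low-cost RGB distance of [29]: the channel weights f_r, f_g, f_b
    depend on the mean red component r_m of the two colors."""
    r1, g1, b1 = c1
    r2, g2, b2 = c2
    r_m = (r1 + r2) / 2.0
    f_r = 2.0 + r_m / 256.0
    f_g = 4.0
    f_b = 2.0 + (255.0 - r_m) / 256.0
    return np.sqrt(f_r * (r1 - r2)**2 + f_g * (g1 - g2)**2 + f_b * (b1 - b2)**2)
```

Because the weights depend only on the mean red component, the metric stays symmetric in its two arguments while approximating perceptual non-uniformity without a color-space conversion.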
Voxel resolution for the VCCS method was set at 0.1 m for all experiments. The ssp and vccs-knn methods were tested using their openly available source code with the parameters for both methods set to the values proposed in their article, [22]. As the methods were originally evaluated on the same dataset, the default parameters can be assumed to produce their best results. While the vccs, ssp and vccs-knn methods implicitly compute the surface normals for the point clouds, for our method they were computed using the standard nearest-neighbours-based method provided by PCL with the number of neighbours equal to 30. Our method requires only one input parameter, i.e., the number of clusters K. All the experiments are evaluated using the metrics described below.

The compactness measure is based on the isoperimetric quotient of each segment. For a segment with perimeter L_P, the radius of a circle with the same perimeter is given as r = L_P/(2π). If A_S is the area of the circle with radius r, then the isoperimetric quotient of a segment P with area A_P is given as Q_P = A_P/A_S = 4π A_P / L_P^2. Therefore, if I is the set of all segments in a segmented image, then the compactness measure is given as C = Σ_{P∈I} Q_P |P| / N, where |P| is the size of the segment and N is the total number of pixels in the image (or points in the point cloud). C = 1 implies that the estimated segments are perfect circles, whereas C = 0 implies that the segments have highly irregular and non-convex shapes. Lastly, we also compare the algorithms based on the contour density (CD) metric, [11]. CD measures the fraction of boundary pixels in the segmentation image. Given a set of boundary pixels, B, of an estimated segmentation, the contour density is defined as CD = |B|/(2N), where N is the total number of pixels. The fraction is divided by two because the computation of segment boundaries produces edges that are two pixels wide. The contour density metric also indicates the regularity of the boundaries, as higher values of CD mean that there are more boundary pixels for the same number of supervoxels. Higher values of CD indicate that the object boundaries are irregular and elongated.
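The two metrics above can be expressed compactly; the following is a minimal sketch that assumes segment areas, perimeters and sizes have already been measured, with hypothetical function names.

```python
import numpy as np

def isoperimetric_quotient(area, perimeter):
    """Q = A_P / A_S, where A_S is the area of a circle with the same
    perimeter (r = L_P / (2*pi)); Q equals 1 for a perfect circle."""
    r = perimeter / (2.0 * np.pi)
    return area / (np.pi * r * r)      # equivalent to 4*pi*area / perimeter**2

def compactness(segments, n_total):
    """C = sum over segments of Q_P * |P| / N, with segments given as
    (area, perimeter, size) tuples and n_total the pixel/point count N."""
    return sum(isoperimetric_quotient(a, p) * s for a, p, s in segments) / n_total

def contour_density(n_boundary, n_total):
    """CD = |B| / (2N); boundaries are two pixels wide, hence the 2."""
    return n_boundary / (2.0 * n_total)
```

Weighting each quotient by the segment size |P| makes C a per-pixel (or per-point) average, so large irregular segments penalize the score more than small ones.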

All the above-mentioned algorithms were tested on the NYU V2 Depth dataset and compared using the above-mentioned metrics. While supervoxel resolution as a measure has geometrical significance, we believe that the number of clusters is easier to interpret for a general user. The complexities of all algorithms are given in terms of the input size. As a result, a user has better control over estimating the complexity of subsequent algorithms applied to the segmented point cloud when there is direct control over the number of supervoxels produced in a point cloud over-segmentation. Moreover, it is important to note that each algorithm produces a different number of supervoxels for a point cloud at any given supervoxel resolution. In all cases, the ssp method produces the highest number of supervoxels among all the algorithms for any given resolution. It is well known that as the number of supervoxels increases, the performance in terms of the metrics also increases. For this reason, when one method produces more supervoxels than another for the same segmentation, comparison by supervoxel resolution becomes unfair to methods that produce fewer supervoxels. Therefore, we choose to compare results based on the number of supervoxels produced.

We tested the vccs, ssp and vccs-knn algorithms at supervoxel resolutions (in meters) of 0.10, 0.11, 0.12, 0.13, 0.14, 0.16, 0.18, 0.20, 0.25 and 0.35. To keep the number of supervoxels produced by our method in a similar range as the other methods, we set the output number from the vccs method as the input number of clusters for our method. Although the number of clusters produced by our method is dynamic, it is usually within a range of ±100 clusters from the user-chosen number K.
To maintain uniformity in comparison, for each experiment we round the output number of clusters to the nearest 100 and order the results according to these multiples. We believe that this strategy allows for a fair comparison.

The ssp method performs worst in this metric as it produces supervoxels with highly irregular boundaries. This quality of the ssp method is also reflected in the CD metric (see Fig. 4), where the ssp method performs worst out of all the methods. Large values of CD for the ssp method, i.e., a large number of boundary pixels for a given number of supervoxels, reflect the irregularly shaped supervoxels produced by it.

It is to be noted that a higher number of boundary pixels may result in artificially inflated values of R, as there is a greater chance that a segmentation boundary lies close to a given ground truth boundary. Therefore, it is possible that the ssp method may not produce results that are visually as appealing as their quantitative analysis values suggest. The same can be verified from Fig. 5, where it can be seen that although the supervoxels produced by the ssp method agree with object boundaries to an extent, the resulting supervoxels are irregular in shape. Irregularly shaped supervoxels are undesirable for subsequent applications as they introduce spatial discontinuity within the segmentation. With the lowest CD values of all methods, the vccs method can be expected to produce the most compact supervoxels. Although it produces more regularly shaped supervoxels than the ssp and vccs-knn methods, our method outperforms it because the vccs method fails on noisy point clouds or on point clouds with low spatial density (or high variation in depth). The same can be verified visually from the second row of Fig. 5(b), where there exist empty regions (seen in white) and small isolated supervoxels in the segmentation. While missing regions or small and isolated supervoxels do not contribute to the CD, they impose a serious penalty on the compactness metric, thereby resulting in lower values of C for the vccs method.

Fig. 5 shows visualizations of segmentations from the four methods on some example point clouds. In general, our method produces visually appealing segmentations with compact and regularly shaped supervoxels that adhere to object boundaries well. An exception where our method fails to produce compact supervoxels can be seen in the second row of Fig. 5.
The mesh doors in the vicinity of the viewpoint act as noise in that region of the point cloud, as they appear as scattered points at a different depth than the background, resulting in irregular supervoxels. The superior compactness of the supervoxels produced by our method is visible in the fourth row of Fig. 5. The highlighted region consists of planar regions with very few objects. Yet, all methods except ours produced supervoxels with arbitrary shapes in that region. The shown segmentations are for 500, 700, 2000 and 1100 supervoxels for the first to fourth rows, respectively. The result for the vccs-knn method in the third row of the figure consists of only 800 supervoxels, as the method produced only that many at a resolution of 0.1 m, while the rest of the methods produced about 2000 supervoxels. Finally, it can be said that this figure presents an accurate visual reiteration of the results shown as part of the quantitative analysis. Visualizations show that the vccs method fails in the case of point clouds with complex scenes and point clouds with low spatial density.

Our method has a higher run time when compared against the other methods. This is expected, as our algorithm involves the assignment step for all the points in every iteration. The step of collecting unassigned points and initializing and/or removing cluster centers adds additional computational overhead. However, our proposed initialization procedure based on Fibonacci binning produces accurate initial seeds; the algorithm typically converged in 6-8 iterations in all cases.
While the overall complexity of our algorithm is higher than that of the other methods, the gained accuracy and quality of segmentation may be beneficial to applications that do not have serious limitations on computational time. A possible optimization is to track the change in cluster seeds based on the L1-norm between cluster centers at the current and previous iterations and stop the assignment step when the norm becomes lower than a threshold.
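The L1-norm stopping criterion mentioned above amounts to a one-line check; the sketch below assumes the cluster centers are stored as a fixed-shape array across iterations, and the threshold value is application-dependent.

```python
import numpy as np

def has_converged(prev_centers, curr_centers, threshold):
    """Stopping criterion: the L1 norm of the change in cluster centers
    between consecutive iterations falls below a threshold."""
    return float(np.abs(curr_centers - prev_centers).sum()) < threshold
```

Since the number of clusters in the proposed method is dynamic, centers added or removed between iterations would need to be excluded from (or matched before) this comparison.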