Feature-Based Sampling: A Fast and Robust Sampling Method for Tasks Using 3D Point Clouds

Point cloud data sets are widely used by machines to sense the real world because sensors such as LIDAR are readily available in many applications, including autonomous cars and drones. PointNet and PointNet++ are widely used point-wise embedding methods for interpreting point clouds. However, even for recent models based on PointNet, real-time inference remains challenging. One path to faster inference is sampling, a method that reduces the number of points computed in the next module. Furthest Point Sampling (FPS) is widely used, but it is slow and often fails to select the critical points. In this paper, we introduce Feature-Based Sampling (FBS), a novel sampling method that applies the attention technique. The results show a significant speedup in both training and inference time while accuracy remains similar to previous methods. Further experiments demonstrate that the proposed method is better suited to preserving critical points and discarding unimportant ones.


I. INTRODUCTION
In real-world applications, it is essential to receive and use 3D data as input to interpret the real world. For example, Simultaneous Localization and Mapping (SLAM) is used in robot and autonomous vehicle applications to locate the current position, and it is necessary to process 3D geometric information in the form of point clouds from sensors such as LIDAR (light detection and ranging). Point cloud data is both unordered and unstructured: N points represent an object or surrounding terrain, and each point consists of coordinates and features. To process point cloud data, a method that guarantees permutation invariance and fast inference is required.
There are many point cloud processing methods that use deep learning techniques: multi-view-based methods that project point cloud data in multiple directions and turn it into 2D images so that traditional 2D models can be applied [1]-[3], and volumetric-based methods that use 3D convolution. Among point-based methods, a scheme using a shared-MLP model, called PointNet [10], was introduced that can process both 2D data and point cloud data. PointNet became the basis of many point-wise MLP methods.
An improved, hierarchically structured version of PointNet, called PointNet++ [11], was then proposed. The PointNet++ architecture uses a sampling and grouping module together with enhanced 2D convolution; furthermore, the hierarchical structure benefits the receptive field. PointNet++ uses a sampling method called Furthest Point Sampling (FPS) that reduces the size of the original image while attempting to preserve points that are crucial for recognition tasks. The FPS method incurs a substantial computational cost because the cost increases quadratically with the number of points in the image. Thus, as complex images consist of more points in a point cloud, the original speed advantage diminishes. Also, in general, the FPS scheme fails to exclude outliers in noisy and complex images, which can be critical to a model's performance. Many recent approaches that enhance point-wise methods by adding techniques for robustness [12], [16] still use FPS as the sampling method, resulting in slow training and inference times. There are other sampling methods: Gumbel Subset Sampling [17], which uses high-dimensional spatial analysis by mixing features extracted from each point in the point cloud; the adaptive sampling of PointASNL [12] and ASHF-Net [18], which shifts the neighbor points toward the centroid point and then uses FPS; the Self-Organizing Map of SO-Net [19], which finds points that express objects well using unsupervised competitive learning; and the random sampling of RandLA-Net [20], which has a tremendous advantage in speed.
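To make the quadratic cost concrete, the iterative FPS procedure can be sketched in a few lines of NumPy. This is an illustrative re-implementation (the function name is ours, not code from the works cited above): each of the m selection rounds scans all N points, so the cost grows as O(N·m), approaching O(N²) when m scales with N.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Select m indices from an (N, 3) array, greedily picking the point
    farthest from everything selected so far (O(N * m) distance checks)."""
    n = points.shape[0]
    selected = np.zeros(m, dtype=np.int64)  # first pick is point 0 (arbitrary)
    dist = np.full(n, np.inf)               # squared distance to nearest pick
    for i in range(1, m):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))  # farthest remaining point
    return selected
```

Note that the distance update at each round touches every point, which is why the cost cannot drop below linear per selected point.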
In this paper, we propose a novel sampling method called feature-based sampling (FBS), inspired by Importance Sampling [21], that significantly reduces computational cost while preserving accuracy for recognition tasks. The targeted accuracy is that of PointNet++ and PointASNL, which currently show the best accuracy when using 1k points. FBS consists of two steps, Positional Encoding and Self-Attention Scoring, as shown in Fig. 1, and is applied to all input points to reduce the number of points for the next step of a given task. The Positional Encoding phase calculates relationship information for each point using its surrounding points and the query ball point method [11]. In the Self-Attention Scoring phase, important features are highlighted using a dot product calculation [22] of the weights of the features.
Our proposed sampling method, FBS, is applied to the classification task [23] and the part segmentation task [24]. In the classification task, our method shows up to three times faster inference than the existing model PointNet++ and about eight times faster inference than PointASNL [12], with a small accuracy improvement. In the part segmentation task, accuracy decreases by less than 1% while inference time improves by 25%. An additional experiment applying random noise to each dataset shows that PointNet++ using our FBS method successfully excludes the noise data.
To summarize, our contributions include:
• A novel feature-based sampling method, based on the concept of importance sampling, that reduces the computational cost of the sampling process used in many tasks on point cloud data.
• Comparable accuracy for classification and part segmentation tasks while reducing the computational cost.
• Better preservation of important data points while discarding outlier points.
The rest of this paper is organized as follows. In Section II, previous research is reviewed and the differences are discussed. The novel sampling method is proposed in Section III. In Section IV, the robustness, accuracy, and speed of FBS are compared with previous research. Finally, conclusions are summarized in Section V.

II. RELATED WORK
PointNet [10] is a pioneering work that directly uses unordered and unstructured point cloud data. PointNet uses point-wise multi-layer perceptrons (MLPs) followed by a max-pooling operation. An improved hierarchical version of PointNet, PointNet++ [11], collects information sequentially through hierarchical structures, calculates the relationships between points, and uses a grouping scheme with the furthest point sampling method. PointASNL [12] improves the efficiency of feature extraction by applying the attention technique and also achieves robustness through Adaptive Sampling. In addition, PointASNL proposed a local-nonlocal (L-NL) module for capturing the neighbor and long-range dependencies of the sampled points. PointWeb introduced Adaptive Feature Adjustment to efficiently select features for local neighborhoods from the given points [16]. PointNet++, as well as PointASNL and PointWeb, which improve upon it, all use Furthest Point Sampling as the sampling method. Our Feature-Based Sampling method incurs a lower computational cost while keeping accuracy comparable and shows better performance for outlier removal.

III. FEATURE-BASED SAMPLING

A. OVERVIEW
This section proposes Feature-Based Sampling (FBS), which consists of two modules: Positional Encoding (PE), described in subsection III-B, and Attention Scoring (AS), described in subsection III-C. The general flow of FBS is illustrated in Fig. 2(b).

B. POSITIONAL ENCODING
Positional Encoding (PE) extracts relevant information about a given point from its surrounding points to determine the importance of that point. PE encodes the surrounding points P_k for a given point in the point cloud P, constructing new information using the existing features F (e.g., raw RGB or learned features). Encoding based on x-y-z coordinates creates the geometric information of the center point relative to its surrounding points. PE is divided into Grouping Neighbour Points, Feature Extraction, and Feature Augmentation steps.

1) Grouping Neighbour Points
For each i-th point, K neighborhood points are determined. The proposed method utilizes the query ball scheme, which only collects information from neighbors within a certain distance (inside a certain radius).
Q denotes the query ball point algorithm, P the x-y-z coordinates of the points, and F the features of the points. index_i is the index set for the i-th point, determined using the query ball point scheme. p_i denotes the center point (centroid) in P.
For each centroid p_i (dashed line in Fig. 2), the index set index_i is applied to P and F. The points surrounding the centroid are expressed as p_i^K (yellow-green (N, k) in Fig. 2), and the features corresponding to the surrounding points as f_i^K (orange (N, k) in Fig. 2).
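The grouping step can be sketched as follows. This is a minimal brute-force illustration in NumPy; the function name is ours, and the convention of repeating the first hit to pad short groups (as in PointNet++'s ball query) is an assumed detail.

```python
import numpy as np

def query_ball_group(points, features, radius, k):
    """For every point, treated as a centroid, gather up to k neighbours
    inside `radius`; groups short of k repeat their first hit so the
    output keeps a fixed (N, k, ...) shape, as in PointNet++'s ball query."""
    n = points.shape[0]
    sq_r = radius * radius
    groups = np.zeros((n, k), dtype=np.int64)
    for i in range(n):
        d2 = np.sum((points - points[i]) ** 2, axis=1)
        idx = np.flatnonzero(d2 <= sq_r)[:k]  # keep at most k hits
        groups[i] = np.pad(idx, (0, k - len(idx)), mode='edge')
    return points[groups], features[groups]   # (n, k, 3) and (n, k, c)
```

Because the centroid itself is always within the radius, every group is non-empty, so the padding is well defined.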

2) Feature Extraction
For each centroid p_i (yellow (1, 3) in Fig. 2), geometric information is extracted using the set of K points close to the centroid; both p_i and p_i^k consist of x-y-z coordinates, and the results are derived through concatenation, marked as ⊕. Then, using a multi-layer perceptron (MLP) module, the features e_i^k (green in Fig. 2) are determined. This method is similar to the one used in RandLA-Net [20]; the difference is that the query ball point scheme is used here. The resulting features are used to calculate the score in subsection III-C.
VOLUME 4, 2016
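The geometric encoding and shared MLP can be sketched as below. This is an illustrative approximation in the spirit of the RandLA-Net-style encoding mentioned above; the exact set of concatenated terms and the MLP width are assumptions, not the paper's specification.

```python
import numpy as np

def relative_pos_encoding(centroids, neighbours):
    """Concatenate centroid, neighbour, their offset, and the Euclidean
    distance into one geometric vector per neighbour (a RandLA-Net-style
    encoding); output shape is (n, k, 10)."""
    n, k, _ = neighbours.shape
    c = np.repeat(centroids[:, None, :], k, axis=1)        # (n, k, 3)
    offset = c - neighbours                                # (n, k, 3)
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)  # (n, k, 1)
    return np.concatenate([c, neighbours, offset, dist], axis=-1)

def shared_mlp(x, w, b):
    """One shared-MLP layer with ReLU: the same (w, b) is applied to
    every neighbour of every point, yielding the features e_i^k."""
    return np.maximum(x @ w + b, 0.0)
```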

3) Feature Augmentation
The feature is augmented using e_i^k determined from equation (2). That is, the feature f_i^k of p_i is concatenated with e_i^k. Then, the total dimension of e_i^k ⊕ f_i^k is expanded to C, the number of channels. There is a trade-off between the number of channels and accuracy; thus, C is determined empirically.
The encoded information F_i^c is created (orange and green in Fig. 2) by concatenating e_i^k with the previously obtained f_i^k.

C. SELF-ATTENTION SCORING
The self-attention method [22] is used to determine the importance of the features generated in the previous positional encoding phase. The dot product of equations (2) and (3) is calculated to increase the difference in importance. Finally, the sum of the calculated importance is used for sampling (selecting or discarding points). Softmax is first applied to the given set of local features, and then a shared MLP is used. The result s_i^c is shown in equation (4) (blue in Fig. 2).
Applying softmax generates a weight that can emphasize the quality of each feature; that is, this weight identifies the essential features and the adverse ones. Using this information, the importance of features can be distinguished and the ranks of the points determined. This makes the method suitable for removing outliers: noise is minimized, objects are expressed more clearly, and important points can be selected in the sampling phase.
The final score, Score_i (purple in Fig. 2), of a particular point is determined by the dot product of equations (3) and (4), summing over all C features as shown in equation (5). Summation is used instead of typical pooling techniques such as max/min/avg pooling to prevent information loss.
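A hedged sketch of the feature-wise scoring described above: softmax over the C channels yields per-feature weights, a shared linear layer transforms the features, and the per-point score is the channel-wise sum of their product. Since equations (2)-(5) are not reproduced here, the precise inputs to each operation are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_score(features, w, b):
    """Feature-wise attention scoring on an (n, C) feature matrix:
    softmax over the C channels gives per-feature weights, a shared
    linear layer transforms the features, and the per-point score is
    the sum over channels of their elementwise product."""
    weights = softmax(features, axis=-1)           # (n, C) channel weights
    transformed = features @ w + b                 # (n, C) shared MLP output
    return np.sum(weights * transformed, axis=-1)  # (n,) one score per point
```

Summing over the C channels rather than max-pooling keeps every feature's contribution in the final score, matching the information-loss argument above.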

D. FEATURE-BASED SAMPLING
1) Importance Sampling
Feature-based sampling (FBS) discovers neighboring points for each centroid, creates a new feature, assigns an attention weight, and computes a sampling score for each point. This process determines how critical each point is for a given point cloud image. Therefore, our sampling method can be referred to as importance sampling.
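Once every point carries a score, the selection step itself reduces to a top-k; a minimal sketch (the function name is ours):

```python
import numpy as np

def feature_based_sampling(points, scores, m):
    """Keep the m highest-scoring points: this is the selection step
    that replaces FPS once every point has an importance score."""
    keep = np.argsort(scores)[::-1][:m]  # indices of the top-m scores
    return keep, points[keep]
```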

2) Computational Cost
Feature-Based Sampling consists of a sequence of Positional Encoding and Attention Scoring modules. Because all modules are connected linearly, the module with the highest time complexity dominates the overall time complexity. In Positional Encoding, Grouping Neighbor Points is worst case O(N^2) when using the query ball point method, where N is the total number of points. Feature Extraction and Feature Augmentation are both O(N), since they apply concatenation and MLP operations N times. Attention Scoring consists of N dot products and a summation, so its time complexity is O(N). Thus, the time complexity of FBS is dominated by the query ball scheme, which is worst case O(N^2). Compared to the K-nearest-neighbor computation, whose time complexity is always O(N^2), the query ball point method stops when K points are found within the desired radius. The worst case occurs when the query ball method cannot find K points within the radius, so distances to all points are calculated. The best-case time complexity is O(N · K), where the query ball scheme searches only K times and finds exactly K points within the radius for each of the N points.
Usually, the attention technique is applied in a point-wise form. FBS instead uses a feature-wise form, and therefore its computational cost is lower than that of other techniques that use attention.

IV. EXPERIMENTS

A. EXPERIMENTAL SETUP
In this paper, ModelNet40 [23] for the classification task and ShapeNet [24] for the part segmentation task are used to evaluate the proposed method. Python and PyTorch are used for the implementation. The machine used for all experiments was a Threadripper 3990X 64-Core (128-thread) with 128 GB RAM and an NVIDIA RTX 3090 24 GB GPU. PointNet++ and PointASNL were implemented and executed for an accurate comparison with the proposed feature-based sampling (FBS) method. PointNet++ is used as the backbone, and the only difference is the sampling part (FBS). Negative Log-Likelihood (NLL) [25] is adopted as the loss function.
ModelNet40 has 9843 training samples and 2468 test samples in 40 different classes. For the training phase, images of 1024 points are used as input. The augmentation strategy from PointASNL [12] is used, which includes (among other augmentations) 20% random point dropout; a random scale was also applied in the data loader. The same hyperparameters and settings were used, except for the sampling module, for a fair comparison. The Adam optimizer is used with an initial learning rate of 0.001 and a decay rate of 0.0001. The batch size was set to 16, and the methods were trained for 200 epochs. When determining the overall accuracy on all test data, the experiments applied 10-vote evaluation. Overall Accuracy (OA) is used as the metric.
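The optimizer settings above can be written as a small PyTorch configuration fragment (illustrative only; the helper name is ours, and the actual training script may differ):

```python
import torch

# Hyperparameters from the ModelNet40 setup described above.
BATCH_SIZE = 16
EPOCHS = 200
NUM_POINTS = 1024

def make_optimizer(model):
    # Adam with the initial learning rate and decay rate used in the paper.
    return torch.optim.Adam(
        model.parameters(),
        lr=0.001,
        weight_decay=0.0001,
    )
```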
ShapeNet is a synthetic dataset of 16,881 shapes divided into 16 classes and 50 subdivided parts. Unlike classification, part segmentation is a more difficult task, because an object has several partial classifications. All parameters are set to those used in PointNet++. The batch size was set to 16, and the model was trained for 251 epochs. The learning rate was 0.001 and the weight decay rate was 0.001. For the base model, the number of points was 2048, a learning-rate decay of 0.5 was applied every 20 steps, and Adam was used as the optimizer. From the dataset, up to 2048 points are randomly selected for training, and the class is provided through one-hot encoding at the end. Mean intersection over union (mIoU) is used as the metric.

B. CLASSIFICATION
As shown in Table 1, the overall accuracy of the two best methods using 1k points, PointNet++ and PointASNL, is 92.9%. Compared to these methods, the overall accuracy of the proposed method using FBS is slightly higher at 92.96%. All methods are run 100 times and the average is reported. SO-Net is included in Table 1 as a reference for accuracy; SO-Net uses many more points to solve the classification problem, so a direct comparison is difficult, but as the table shows, its accuracy is comparable. Training and inference time are measured for different numbers of input points (e.g., 512, 1024, and 2048). This experiment compares performance as the number of points increases, so the results can serve as a guide for similar tasks or applications that use those point counts. The performance counter (perf_counter()) from the Python time package was used to profile computational time. Training time was measured per batch, while inference time was measured as the end-to-end time of the method. All three methods were run 100 times, and the average time was used for comparison.
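The timing protocol described above can be sketched as follows (the helper name is ours):

```python
import time

def time_inference(fn, runs=100):
    """Average wall-clock time of `fn` over `runs` executions,
    measured end to end with time.perf_counter()."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs
```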
When compared to PointNet++, the proposed method completed the task faster for 512, 1024, and 2048 input points, as shown in Table 3. For both 512 and 1024 points, our method reduced training time by around 40%; time was reduced by about 23% for the 2048-point case. The most time-consuming part is the sampling process, and the proposed method effectively reduces the time to sample while also determining which points are important. Furthest Point Sampling, used by previous research, approximates the importance of points using distance and thus requires no learning. In contrast, FBS has learnable parameters because it calculates and learns the relationships between points. Even though FBS has more learnable parameters than FPS, our method is still faster, for the reasons discussed in Section III. Inference is about three times faster for the 512- and 1024-point cases; for the 2048-point case, inference time was reduced by about 21%.
Robustness. The robustness experiment on the missing-data ratio was conducted for three sampling methods: Random, FPS, and FBS. A dataset of 1024 points was used; out of the 1024 points, 512, 128, 64, and 32 points were randomly picked for the robustness test (i.e., corruption ratios of 50%, 87.5%, 93.8%, and 96.9%).

[Table 2: per-class part segmentation results on ShapeNet, listing mIoU for each method over the 16 categories (aero, bag, cap, car, chair, earphone, guitar, knife, lamp, laptop, motor, mug, pistol, rocket, skateboard, table) together with the number of shapes per category.]
Although the results are similar, FBS always showed better results than FPS, and random sampling generally showed worse results than the other two.
PointASNL. When compared to PointASNL, which uses Furthest Point Sampling, our method showed a significant time reduction. Because the original experimental environment of PointASNL was TensorFlow, it had to be reimplemented in PyTorch for a fair comparison. Table 4 shows this comparison.

C. OUTLIER REMOVAL
Experiments were executed to compare outlier removal between FPS and FBS. For each point cloud image, noise ((x, y, z) ± random) was randomly injected. The experiment was carried out using the visually well-represented Airplane and Chair point cloud images (an average of 8.4 noise points). A total of 40 objects were selected for the experiment, 20 for each class. After sampling from 1024-point images, FPS retained an average of 7.1 outlier points when sampling 512 points and 5.6 when sampling 128 points. The FBS method retained an average of 5.1 outlier points for 512 samples and 2.6 for 128 samples. The reason is that Furthest Point Sampling selects points that are furthest apart by calculating distances and has no information about the importance of the points; since outliers are most likely to lie far from other points, they are more prone to be selected for the next stage of the classification process. The adaptive sampling used in PointASNL combines an adaptive shift with FPS; however, because it still uses FPS, the same problem arises. The example shape shown in Figure 5 may seem like a degraded image to human perception, but it has sufficient information for a learning network to complete the classification task.
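The noise-injection setup can be sketched as follows; the offset magnitude `scale` is an assumed value, since the text specifies only that random (x, y, z) offsets were added:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(points, n_noise, scale=0.5):
    """Append n_noise outliers, each an existing point plus a random
    (x, y, z) offset; `scale` is an assumed offset magnitude."""
    base = points[rng.integers(0, len(points), n_noise)]
    noisy = base + rng.uniform(-scale, scale, size=(n_noise, 3))
    return np.concatenate([points, noisy], axis=0)
```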

D. TENSORFLOW CUDA KERNEL
Experiments were conducted for FPS in two different settings: the TensorFlow CUDA kernel implementation and the PyTorch implementation. The reason is that the TensorFlow version can show a time discrepancy, because it creates and uses C++ code that runs the FPS algorithm directly as a CUDA kernel. The machine used for this experiment was an Intel Xeon 1.6 GHz 6-Core (x2) (12-thread) with 98 GB RAM and a Titan XP (12 GB) GPU. PyTorch uses only the GPU memory needed for the sampling process because it computes the sampling during the learning phase; therefore, the number of points affects the inference time. In the TensorFlow implementation, the maximum GPU memory is used because the sampling operation is performed during the preprocessing phase. Results show that both implementations are similar in time, with the PyTorch implementation being faster. Thus, comparing FBS against FPS using the PyTorch implementation is valid.

E. PART SEGMENTATION
The part segmentation experiment resulted in our method achieving an mIoU of 84.3%, as shown in Table 2; the difference is less than 1%. Training and inference time are reduced, but by smaller amounts than in the classification problem. Because the part segmentation task is more complex than classification, additional modules are included to better identify the parts from the part segmentation data. Therefore, the sampling portion of the overall model is smaller, and the effect of a better sampling method is reduced accordingly. As shown in Table 6, for both 512 and 1024 points, our method reduced training time by 35%, and by 25% for 2048 points. This reflects that the tensor volume increases with the number of points. Inference time is reduced by approximately 60% for both 512 and 1024 points and by around 20% for 2048 points; inference time also increases with the number of points.

F. IMPACT OF FEATURE-BASED SAMPLING
We showed that the proposed method achieves better throughput than previous approaches. Because condensed 3D sensing data, including LIDAR output, is highly integrated, previous approaches perform sampling to extract representative values and downsize the data. Conducting appropriate sampling at this stage dramatically affects both the overall training and inference process. Until recently, most models used FPS [11] or Adaptive Sampling [12] to process point cloud data. We showed that KNN is an operation with too much overhead, and that random sampling is not appropriate because it can eliminate important data points, as shown in the outlier removal experiment. This paper presents evidence that optimizing the sampling technique is crucial for real-world point cloud data processing: engineers need to keep data processing under 10 ms for real-time use. Therefore, the approach presented in this paper points out that both the sampling part and the model itself must be optimized to increase throughput.

V. CONCLUSION
This paper proposed Feature-Based Sampling (FBS), a sampling method based on the importance sampling idea. FBS is a fast and robust sampling method that uses Positional Encoding and Attention Scoring. Results show that FBS significantly reduces training and inference time compared to the PointASNL and PointNet++ methods while achieving comparable accuracy for the classification task (minimal increase) and the part segmentation task (slight decrease). The outlier removal experiment shows that FBS is better suited to discarding outliers (i.e., noise in images) than the Furthest Point Sampling method used by PointNet++ and PointASNL. The next step is to apply our sampling method to various other tasks.