Projection-Based Point Convolution for Efficient Point Cloud Segmentation

Understanding point clouds has recently gained great interest following the development of 3D scanning devices and the accumulation of large-scale 3D data. Most point cloud processing algorithms can be classified as either point-based or voxel-based methods, both of which have severe limitations in processing time, memory, or both. To overcome these limitations, we propose Projection-based Point Convolution (PPConv), a point convolutional module that uses 2D convolutions and multi-layer perceptrons (MLPs) as its components. In PPConv, point features are processed through two branches: a point branch and a projection branch. The point branch consists of MLPs, while the projection branch transforms point features into a 2D feature map and then applies 2D convolutions. As PPConv does not use point-based or voxel-based convolutions, it processes point clouds quickly. When combined with a learnable projection and an effective feature fusion strategy, PPConv achieves superior efficiency compared to state-of-the-art methods, even with a simple architecture based on PointNet++. We demonstrate the efficiency of PPConv in terms of the trade-off between inference time and segmentation performance. The experimental results on S3DIS and ShapeNetPart show that PPConv is the most efficient method among those compared. The code is available at github.com/pahn04/PPConv.


I. INTRODUCTION
Recent developments in 3D scanning devices have brought a great amount of 3D data into machine vision applications, such as robotics, autonomous vehicles, and VR/AR. Point clouds are a popular type of data with 3D geometry, and thus a reliable autonomous point cloud processing system is highly beneficial for better 3D perception. Although computer vision algorithms have achieved remarkable improvements in image recognition, 3-dimensional data is inherently different from images, which have only 2 spatial dimensions. Thus, researchers have studied novel methods to deal with 3D data effectively.
As convolutional neural networks (CNNs) have proved to be effective in image recognition, researchers have attempted to extend the use of CNNs to 3D data processing. Several voxel-based methods [9], [27], [40] applied CNNs to 3D data, but this type of approach usually suffered from the computational complexity of 3D convolutions. Higher voxel resolution led to a rapid increase in memory requirements and computation time, whereas lower resolution caused larger quantization errors. Alternatively, point-based methods [18], [29], [30], [42], [46] could alleviate the excessive memory requirements by allocating memory that is linearly proportional to the number of points. However, during local information aggregation, frequent neighbor search and dynamic kernel computation increased latency due to irregular memory access and additional alignment computation, as pointed out in [24].
In this study, we aim to design a point convolutional module that does not have the limitations of 3D convolutions and point-based convolutions. Therefore, we choose 2D convolutional blocks as the fundamental operator. Even though projection-based methods for 3D data have been suggested before [17], [32], [34], [47], they showed degraded performance due to the nature of 2D convolutions, which distort the 3D geometry and thus are not as expressive as 3D operators on 3D data. We intend to tackle a few problems that projection-based methods usually face. First, an $N \times C$ point cloud needs to be transformed into an $H \times W \times C$ feature map to be processed with 2D convolutions. In naive projection methods, e.g., removing one of the coordinates, information loss is inevitable due to the dimension reduction. Next, when projected features are processed through separate CNNs and merged only at the final stage, intermediate features are not aware of context information from other projections, which limits the advantage of multi-view projection.
To this end, we propose Projection-based Point Convolution (PPConv), a point convolutional module for point cloud processing with 2D convolutions. In PPConv, two separate branches process point features: the point branch applies MLPs to point features and the projection branch uses 2D convolutions. In the projection branch, point clouds are projected through a PointNet-based learnable projection, which minimizes the information loss during the dimension reduction. Furthermore, features from each branch are fused inside the convolutional module, so that the output of PPConv is always the aggregated information of all the branch features. For the feature fusion, we propose Importance-Weighted Fusion and Context-Aware Fusion modules that enable effective fusion of multiple features. By integrating all these modules, PPConv maximizes the effectiveness of 2D convolutions in point cloud processing.
We demonstrate the performance and computational efficiency of PPConv by comparison with several state-of-the-art methods. To strictly measure the effect of convolutional modules, we use a PointNet++ [30] backbone with the architecture hyperparameters used in [24], and only replace the convolutional modules with PPConv. The experimental results on S3DIS [1] indoor scene segmentation demonstrate that PPConv shows superior efficiency among the compared methods, in terms of the trade-off between inference time and segmentation performance. Furthermore, PPConv also achieves performance comparable to the state-of-the-art on ShapeNetPart [48].
Contributions of this paper are as follows:
• We propose a 2D convolution-based point convolutional module, which can be used as a building block of a point-based network for point cloud processing.
• We propose novel feature fusion modules which can effectively fuse point features generated by different types of convolutional operations.
• We compare the inference time of state-of-the-art methods and some representative studies on the same hardware environment, using their publicly available implementations, and show that PPConv is one of the most efficient modules in terms of the trade-off between inference time and performance.

II. RELATED WORKS
Voxel-based methods use voxel representations for 3D data processing. VoxNet [27] used 3D CNNs on 3D data, such as raw LiDAR point clouds, RGB-D data, and CAD models, after transforming them into voxel representations. To address the computational burden of 3D convolutions, an octree-based method [40] and sparse 3D convolution [9] were proposed, both of which reduced redundant computations in unoccupied space. Another study [4] proposed an optimized sparse convolution for processing point clouds with varying density.
Several other studies have attempted to combine the advantages of voxel-based and point-based methods. In PVCNN [24], a point-wise MLP was used along with 3D convolutions on coarsely voxelized data. The MLP extracted the individual point features, while the 3D convolutions aggregated local information from adjacent voxels. These two features were added to produce the output of a PVConv module. The MLP alleviated the requirement of high voxel resolution, because MLPs could focus on fine-grained features while voxel features captured larger contextual features. Lower voxel resolution led to a significant improvement in computational efficiency, which is the main contribution of PVCNN. Our research is strongly motivated by PVCNN and is mainly compared with it. Following PVCNN, Sparse Point-Voxel Convolution [35] was suggested to process sparse point clouds with PVConv, and FusionNet [50] was proposed to extract local features with an improved voxel feature aggregation module.

III. MOTIVATION
Most recent point cloud processing networks use either voxel-based 3D convolutions or point-based convolutional operations, as summarized in Section II. However, each group of methods has critical limitations, as pointed out in [24]. In voxel-based methods, memory usage and computational cost grow cubically with respect to the voxel resolution, which is a severe scalability problem. To alleviate this problem, some recent studies [4], [9], [35], [50] proposed to use sparse 3D convolution, which helps reduce redundant computations and thus enables the use of small voxels in large scenes. In this paper, we take a different strategy to reduce computation: we do not use 3D convolutions at all, but instead process point clouds with only 1D and 2D convolutions.
Point-based methods [18], [37], [46] suffer from increased processing time caused by the additional computations needed for nearest neighbor search or kernel weight interpolation. In point clouds, neighboring points are not stored contiguously in memory: local point grouping operations require either radius search or k-nearest neighbor search. Another problem is that the input point locations vary for every sample, so the kernel points must be aligned with them before computing features. Based on these observations, we aim to use grid-based convolutions as the core operation to avoid inefficient computations. Thus, our design goal can be summarized as processing 3D point clouds with 1D and 2D operations on a grid structure.

IV. PROJECTION-BASED POINT CONVOLUTION
In this section, we introduce the Projection-based Point Convolution (PPConv) module. The overall framework is depicted in Fig. 1. As a point-based convolutional method, the PPConv module takes a point cloud $P \in \mathbb{R}^{N \times (3+C_{in})}$ as input and outputs a transformed point cloud in $\mathbb{R}^{N \times (3+C_{out})}$. Each point $p_i$ contains its point coordinates $c_i$ and point features $f_i$, i.e., $p_i = (c_i, f_i)$. In the first layer of the network, the input may be only the 3D coordinates of each point, or may have extra features such as RGB values or normalized coordinates ($C_{in} = 6$ when RGB values and normalized 3D coordinates are used). In the other layers, each point has its coordinates $c_i$ and features $f_i$, which are the output of the previous layer ($C_{in}^{l} = C_{out}^{l-1}$, where $l$ is the layer index). In PPConv, $P$ is processed through two separate pathways: the point branch and the projection branch. In the point branch, an MLP transforms each point feature individually. In the projection branch, the point cloud is projected onto a 2D plane and subsequently processed using 2D convolutions. Then, the feature map is backprojected to the point space and combined with the other features through a fusion module. Each branch is further explained in the following subsections.

A. POINT BRANCH
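In the point branch, point-wise feature transformation is performed through a single-layer MLP, followed by batch normalization and ReLU:

$$f_i^{point} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{MLP}(f_i))\big), \tag{1}$$

where $f_i^{point} \in \mathbb{R}^{C_{point}}$ is the point-branch feature of point $p_i$. For the experiments presented in this paper, we use $C_{point} = C_{out}/2$, where $C_{out}$ is the output channel size of the PPConv module. As the MLP transforms individual point features, this branch is able to extract features that differ from point to point.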
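As an illustration of the point branch, the following minimal NumPy sketch applies a shared linear layer, a normalization step, and ReLU to every point. The function name `point_branch`, the random weights, and the omission of the learned scale and shift of batch normalization are assumptions made for brevity; this is not the released implementation.

```python
import numpy as np

def point_branch(features, weight, bias, eps=1e-5):
    """Single-layer MLP -> batch normalization -> ReLU, applied per point.

    features: (N, C_in) point features; weight: (C_in, C_point); bias: (C_point,).
    """
    x = features @ weight + bias              # shared linear layer over all points
    mean = x.mean(axis=0, keepdims=True)      # statistics over the N points
    var = x.var(axis=0, keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)       # normalization (no learned scale/shift here)
    return np.maximum(x, 0.0)                 # ReLU

rng = np.random.default_rng(0)
n_points, c_in, c_out = 1024, 6, 64
c_point = c_out // 2                          # C_point = C_out / 2, as stated in the text
feats = rng.normal(size=(n_points, c_in))
out = point_branch(feats, rng.normal(size=(c_in, c_point)), np.zeros(c_point))
print(out.shape)  # (1024, 32)
```

Because every row is transformed independently by the same weights, no neighbor search or grouping is involved in this branch.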

B. PROJECTION BRANCH
As described in Fig. 1, the projection-based feature extraction can be divided into three steps: projection, a 2D convolutional module, and backprojection. Projection aggregates information along the projection axis, and then the 2D convolutional module aggregates information along the other two axes. In this subsection, we provide the fundamental framework; each module can be selected among several candidates depending on the characteristics or complexity of the target data.

1) Projection
Projection of a point cloud onto a 2D plane can be performed using several different methods. The projection plane is divided into a grid, and the key problem is how the points inside each cell are aggregated into a single feature vector. The simplest way is to average the features of the points in the same cell. To further reflect the detailed point locations, bilinear interpolation can be applied to each point feature. During bilinear interpolation, a point feature affects the four nearest grid cells. This method is useful in cases where most grid cells include only a few points.
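The averaging variant described above can be sketched with a scatter-add over grid cells. This sketch assumes that coordinates are already normalized to [0, 1) and that the grid is square; the function name `mean_projection` and the flat cell indexing are illustrative choices, not taken from the paper.

```python
import numpy as np

def mean_projection(coords, features, resolution, axis=2):
    """Average the features of all points that fall into the same grid cell.

    coords: (N, 3) coordinates normalized to [0, 1); features: (N, C).
    Returns the (H, W, C) feature map and the flat cell index of every point.
    """
    plane_axes = [a for a in range(3) if a != axis]       # drop the projection axis
    cell = np.floor(coords[:, plane_axes] * resolution).astype(int)
    cell = np.clip(cell, 0, resolution - 1)
    flat = cell[:, 0] * resolution + cell[:, 1]           # point-to-grid index
    sums = np.zeros((resolution * resolution, features.shape[1]))
    counts = np.zeros(resolution * resolution)
    np.add.at(sums, flat, features)                       # unbuffered scatter-add
    np.add.at(counts, flat, 1.0)
    mean = sums / np.maximum(counts, 1.0)[:, None]        # empty cells stay zero
    return mean.reshape(resolution, resolution, -1), flat

coords = np.array([[0.1, 0.1, 0.5], [0.15, 0.2, 0.9], [0.8, 0.8, 0.1]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
fmap, flat = mean_projection(coords, feats, resolution=4)
print(fmap[0, 0], fmap[3, 3])  # [2. 0.] [0. 5.]
```

The first two points share cell (0, 0) under z-axis projection, so their features are averaged, while the third point occupies cell (3, 3) alone.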
If more points belong to a grid cell, more effective methods are required to avoid information loss during projection. In some previous studies [15], [54], a mini-PointNet structure is employed to aggregate multiple point features inside a pillar or a voxel. Following these works, we use PointNet-based projection to aggregate the information of multiple points in a grid cell. Since PointNet aggregates the information of an arbitrary number of points into a single feature vector, it can produce a fixed-length feature for each grid cell. The PointNet-based projection consists of MLPs and max-pooling in each grid cell:

$$F^{proj}(i, j) = \max_{k \in K(i, j)} \mathrm{MLP}(f_k), \qquad K(i, j) = \{k \mid \pi(k) = (i, j)\}, \tag{2}$$

where $\pi(k) = (i, j)$ is the point-to-grid index mapping function, which indicates that the point $p_k$ falls into the $(i, j)$-th grid cell, and $K(i, j)$ denotes the index set of all the points that fall into the grid cell $(i, j)$. As in [15], the input features are augmented with the coordinates relative to the arithmetic mean of the coordinates of the points inside the grid cell ($x_c$, $y_c$, $z_c$) and the location relative to the grid center ($x_p$, $y_p$ in the case of z-axis projection). After projection, a 2D feature map is constructed by stacking the feature vectors of all grid cells. The feature vector of an empty cell with no points is filled with zeros.
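A minimal NumPy sketch of the PointNet-style aggregation may help make the per-cell max-pooling concrete. It assumes a single linear layer with ReLU as the shared MLP and takes precomputed flat cell indices as input; the feature augmentation with relative coordinates is omitted, and the function name `pointnet_projection` is hypothetical.

```python
import numpy as np

def pointnet_projection(features, flat_index, weight, num_cells):
    """Shared MLP on every point, then channel-wise max-pooling inside each cell.

    features: (N, C_in) point features; flat_index: (N,) cell index of each point;
    weight: (C_in, C_proj). Cells without points are filled with zeros.
    """
    hidden = np.maximum(features @ weight, 0.0)           # single-layer MLP with ReLU
    pooled = np.full((num_cells, weight.shape[1]), -np.inf)
    np.maximum.at(pooled, flat_index, hidden)             # max over the points of each cell
    pooled[np.isinf(pooled)] = 0.0                        # empty cells -> zero vector
    return pooled

feats = np.array([[1.0, -1.0], [2.0, 0.0], [0.0, 3.0]])
cells = np.array([0, 0, 2])                               # two points share cell 0
pooled = pointnet_projection(feats, cells, np.eye(2), num_cells=4)
print(pooled)
```

Because max-pooling is permutation invariant and independent of the number of points in a cell, each cell yields a fixed-length vector regardless of its occupancy.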

2) 2D Convolutional module
After projection, 2D convolutional modules are used to transform the projected feature map. Since a 2D feature map has been produced through projection, any technique used in conjunction with 2D convolution can be seamlessly integrated into PPConv. In this paper, we use a basic residual block [10] with 2 convolution layers, each followed by batch normalization [13] and leaky ReLU. The Squeeze-and-Excitation (SE) module [11] is incorporated to further improve training.
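To illustrate how such a block operates on a projected feature map, the sketch below combines two 3×3 convolutions, leaky ReLU, SE gating, and a skip connection in plain NumPy. The exact layer ordering, the leaky ReLU slope of 0.1, the SE reduction size, and the omission of batch normalization are all assumptions of this sketch; the paper does not specify them here.

```python
import numpy as np

def conv3x3(x, kernel):
    """Zero-padded 3x3 convolution. x: (H, W, C_in); kernel: (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w] @ kernel[dy, dx]
    return out

def leaky_relu(x, slope=0.1):                 # slope value is an assumption
    return np.where(x > 0, x, slope * x)

def squeeze_excite(x, w_down, w_up):
    """Channel gating: global average pool -> FC -> ReLU -> FC -> sigmoid."""
    squeezed = x.mean(axis=(0, 1))                            # (C,)
    logits = np.maximum(squeezed @ w_down, 0.0) @ w_up
    gate = 1.0 / (1.0 + np.exp(-logits))
    return x * gate                                           # broadcast over H and W

def residual_se_block(x, k1, k2, w_down, w_up):
    """Two 3x3 convolutions, SE gating, and a skip connection (no batch norm)."""
    y = leaky_relu(conv3x3(x, k1))
    y = conv3x3(y, k2)
    y = squeeze_excite(y, w_down, w_up)
    return leaky_relu(x + y)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(64, 64, 8))                           # H x W x C_proj
k1 = rng.normal(scale=0.1, size=(3, 3, 8, 8))
k2 = rng.normal(scale=0.1, size=(3, 3, 8, 8))
out = residual_se_block(fmap, k1, k2, rng.normal(size=(8, 2)), rng.normal(size=(2, 8)))
print(out.shape)  # (64, 64, 8)
```

All operations run on a dense, regular grid, which is the property that lets the projection branch avoid neighbor search.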

3) Backprojection
After being transformed by 2D convolutions, feature maps are backprojected to the original point locations. Backprojection is performed by calculating each point feature based on the features of the 2D grid cell that the point was projected to. As shown in Fig. 1, backprojection transforms a 2D feature map ($H \times W \times C_{conv}$) into point features ($N \times C_{conv}$); thus, a mapping function between grid cell indices and point indices is needed. A natural choice is to reuse the mapping function used in projection: $\pi(\cdot)$ in (2). Based on this mapping function, backprojection can be performed through several possible methods, the simplest of which is nearest neighbor assignment. In nearest neighbor assignment, each grid cell feature (with channel size $C_{conv}$) of the 2D feature map is assigned to every point that corresponds to that cell. However, this forces the points that share a grid cell to have exactly the same backprojected features. To efficiently reflect the contribution of each grid cell feature to the individual point, the distance between a point and the center of the corresponding grid cell is used to calculate the weights. Here, $\pi(k) = (i, j)$ is the point-to-grid index mapping function as in (2), $a_1$ and $a_2$ are the axis indices other than the projection axis, and $m(i, j)$ is the center of grid cell $(i, j)$. $c_k$ denotes the coordinates of point $p_k$; thus $c_k[a_1, a_2]$ is the location of $p_k$ on the projection plane (e.g., $(x_k, y_k)$ in the case of z-axis projection).
During the experiments, we further investigated various backprojection modules. However, the effect of the fusion module was much more significant, and the different backprojection methods did not show noticeable differences. Thus, we use the simple module presented in (4) and focus on the fusion module.
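The two backprojection options discussed in this subsection can be sketched as a gather from the feature map, optionally scaled by a distance-dependent weight. The weight used below, $1 / (1 + d / d_{max})$, is a hypothetical choice for illustration only; the paper's exact weighting formula is not reproduced here. Coordinates on the projection plane are assumed to be normalized to [0, 1).

```python
import numpy as np

def backproject(feature_map, plane_xy, resolution, weighted=True):
    """Gather the feature of the cell each point was projected to.

    feature_map: (H, W, C_conv); plane_xy: (N, 2) locations on the projection
    plane, normalized to [0, 1). Returns (N, C_conv) point features.
    """
    cell = np.clip(np.floor(plane_xy * resolution).astype(int), 0, resolution - 1)
    gathered = feature_map[cell[:, 0], cell[:, 1]]        # nearest-neighbor assignment
    if not weighted:
        return gathered
    center = (cell + 0.5) / resolution                    # cell center on the plane
    dist = np.linalg.norm(plane_xy - center, axis=1)
    max_dist = np.sqrt(2.0) * 0.5 / resolution            # center-to-corner distance
    weight = 1.0 / (1.0 + dist / max_dist)                # hypothetical weighting
    return gathered * weight[:, None]

fmap = np.array([[[1.0], [2.0]], [[3.0], [4.0]]])         # 2 x 2 grid, one channel
xy = np.array([[0.25, 0.25], [0.75, 0.25], [0.9, 0.9]])
nearest = backproject(fmap, xy, resolution=2, weighted=False)
scaled = backproject(fmap, xy, resolution=2, weighted=True)
print(nearest.ravel(), scaled.ravel())
```

With nearest neighbor assignment, all points in a cell receive identical features; the weighted variant lets points at different offsets from the cell center receive different features.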

C. FEATURE FUSION
Once each branch has processed features through different operations, the features need to be effectively merged to leverage the advantages of the different branches. In PPConv, the basic feature aggregation is performed by concatenating two features, one from the point branch and the other from the projection branches, and then applying a single-layer MLP. In the case of PPConv with multiple projection branches, the backprojected features are added before being combined with the point branch features. Throughout the experiments, we use $C_{proj} = C_{conv} = C_{out}/2$ so that the concatenated feature has a channel dimension of $C_{out}$. Here, we investigate the fusion of $n_p + 1$ features, where $n_p$ is the number of projection branches and $+1$ accounts for the point branch. The first method is Importance-Weighted Fusion (IWF), which is depicted in Fig. 2. The features extracted from each branch are transformed through a single-layer MLP and a sigmoid function to produce an $N \times (n_p + 1)$ matrix. Then, these matrices are summed to form one $N \times (n_p + 1)$ matrix, and softmax is applied to each row of this matrix. Each row, which corresponds to one of the $N$ points, represents the fusion weights for each of the $n_p + 1$ features of that point. Finally, each feature is multiplied by its corresponding weight and the results are summed to compute the output of the PPConv module. This approach enables the network to learn how to fuse features by computing weights depending on the importance of each feature.
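The IWF steps described above can be sketched in NumPy as follows. The sketch assumes that all $n_p + 1$ branch features share the same channel dimension and that each scoring MLP is a single bias-free linear layer; the function and argument names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def importance_weighted_fusion(branch_feats, score_weights):
    """branch_feats: list of n_p + 1 arrays, each (N, C).
    score_weights: list of n_p + 1 arrays, each (C, n_p + 1).
    Returns the fused (N, C) features and the (N, n_p + 1) fusion weights.
    """
    # Every branch votes on the weights of all branches; the votes are summed.
    scores = sum(sigmoid(f @ w) for f, w in zip(branch_feats, score_weights))
    alpha = softmax(scores, axis=1)                       # each row sums to 1
    stacked = np.stack(branch_feats, axis=1)              # (N, n_p + 1, C)
    return (alpha[:, :, None] * stacked).sum(axis=1), alpha

rng = np.random.default_rng(0)
n, c, n_branches = 5, 8, 4                                # 3 projections + 1 point branch
feats = [rng.normal(size=(n, c)) for _ in range(n_branches)]
weights = [rng.normal(size=(c, n_branches)) for _ in range(n_branches)]
fused, alpha = importance_weighted_fusion(feats, weights)
print(fused.shape, alpha.shape)  # (5, 8) (5, 4)
```

Since every branch contributes a full row of $n_p + 1$ scores, each feature directly influences the weights assigned to all the other features.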
The second feature fusion strategy is Context-Aware Fusion (CAF), described in Fig. 3. In this module, the point coordinates ($N \times 3$) are fed into a mini-PointNet structure to produce an $N \times (n_p + 1)$ matrix. Then, the weights, after applying softmax, are used in the same way as in the IWF module. The CAF module uses the geometry of the input point cloud to produce the weight matrix. Intuitively, CAF can be considered to decide which features are more informative for each point based on the shape of the current input point cloud. Furthermore, CAF includes a max-pooling operation over the point dimension, which allows the attention weights to reflect the global context.
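A minimal sketch of how CAF weights could be computed from coordinates alone is given below. The specific mini-PointNet layout (one per-point layer, a global max-pool, concatenation, and one output layer) and all names are assumptions for illustration; the text only states that a mini-PointNet with max-pooling over the point dimension is used.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def context_aware_weights(coords, w_local, w_out):
    """coords: (N, 3); w_local: (3, H); w_out: (2 * H, n_p + 1).
    Returns (N, n_p + 1) fusion weights whose rows sum to 1.
    """
    local = np.maximum(coords @ w_local, 0.0)             # per-point features (N, H)
    global_ctx = local.max(axis=0, keepdims=True)         # max-pool over points (1, H)
    joint = np.concatenate(
        [local, np.broadcast_to(global_ctx, local.shape)], axis=1)
    return softmax(joint @ w_out, axis=1)

def fuse(branch_feats, alpha):
    """Weighted sum of n_p + 1 branch features, each of shape (N, C)."""
    stacked = np.stack(branch_feats, axis=1)              # (N, n_p + 1, C)
    return (alpha[:, :, None] * stacked).sum(axis=1)

rng = np.random.default_rng(0)
n, c, n_branches, hidden = 6, 8, 4, 16
coords = rng.uniform(size=(n, 3))
alpha = context_aware_weights(
    coords, rng.normal(size=(3, hidden)), rng.normal(size=(2 * hidden, n_branches)))
fused = fuse([rng.normal(size=(n, c)) for _ in range(n_branches)], alpha)
print(alpha.shape, fused.shape)  # (6, 4) (6, 8)
```

Note that the weights depend only on the geometry, not on the branch features being fused, which is the distinguishing property of CAF described in the text.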

D. DISCUSSION
In this subsection, we explain in detail how PPConv differs from other methods. Primarily, following our design goal explained in Section III, PPConv utilizes 2D convolutions to efficiently extract local features, unlike voxel-based or point-based methods. The projection-based point convolutional networks proposed in some studies [19], [36] require neighbor search and local point grouping, while in PPConv, these are efficiently carried out using the grid cell index. Compared with projection-based 2D CNN models [15], [41], [53], PPConv fuses multi-view projection features and point-wise features in every convolutional module, while existing studies focused on aggregating the final features of CNNs.
In PPConv, two feature fusion modules are proposed to aggregate features from different branches. IWF resembles the attentive fusion strategies presented in recent point cloud segmentation studies [3], [31]. These previous works proposed to summarize each $N \times C$ feature into an $N \times 1$ vector using MLPs and then concatenate these vectors before normalizing them with softmax to calculate attention weights. In IWF, an $N \times (n_p + 1)$ matrix is calculated from each $N \times C$ feature, thus enabling every feature to directly affect the weights for all the other features. Additionally, the CAF module takes the point cloud geometry into account while calculating the weights, which is a novel approach to attention-based feature fusion, since other methods generally use the same features both for computing the attention weights and for the fusion based on those weights.
Recently, feature aggregation based on self-attention has been popularized in computer vision tasks, on both images [5], [38] and point clouds [52]. These attention mechanisms, called transformers, are effective in various applications but are known to consume significant computational resources, as they need to calculate pairwise relations of the features. Compared to transformers, the proposed feature fusion modules are much simpler and thus add only a small runtime overhead. Furthermore, the intermediate features in the fusion modules have a smaller channel dimension than the features of each branch, so the additional memory requirement is also small. Therefore, through the proposed feature fusion modules, we can achieve efficient 3D feature extraction using 2D convolutions.

V. NETWORK ARCHITECTURE AND LOSS
In order to demonstrate the effect of the PPConv module, we follow the PointNet++ [30] architecture as used in PVCNN [24]. In PointNet++, the Set Abstraction (SA) module aggregates the local geometry of each sampled point through a local PointNet. Feature Propagation (FP) modules use MLPs to aggregate multi-level features, which consist of interpolated higher-level features and previously extracted lower-level features from the corresponding SA module. We refer the readers to the PointNet++ paper [30] for further details.
We follow the layer configurations of the PVCNN++ architectures to isolate the effectiveness of the convolutional module. For PPConv, we use a grid resolution of 64 for the first layer, which is determined through an ablation study (Table 11). Hereafter, the PointNet++ architecture equipped with PPConv is denoted as PPCNN++. In Section VI, PPCNN++ architectures with different configurations are indicated by PPCNN++(config). For example, the PPCNN++ model with 3 projection branches (x-, y-, and z-axis projection) and the Importance-Weighted Fusion (IWF) module is denoted as PPCNN++(x,y,z,IWF).
Following recent state-of-the-art methods [4], [31], [37], we use the cross-entropy loss for each point. As the semantic segmentation task requires a class prediction for every input point, we simply apply the point-wise classification loss to each point and average those values to obtain the loss value for the input point cloud.
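The point-wise loss described above can be written as a short NumPy sketch. It assumes raw class scores (logits) per point and integer labels; the function name is illustrative, and class weighting or label smoothing, if any, is not modeled.

```python
import numpy as np

def pointwise_cross_entropy(logits, labels):
    """Mean cross-entropy over points.

    logits: (N, num_classes) raw scores; labels: (N,) integer class ids.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_point = -log_probs[np.arange(len(labels)), labels]
    return per_point.mean()                               # average over all points

# With uniform scores over the 13 S3DIS classes, the loss equals ln(13).
logits = np.zeros((4, 13))
labels = np.array([0, 5, 12, 3])
loss = pointwise_cross_entropy(logits, labels)
print(loss, np.log(13))
```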

VI. EXPERIMENT

A. SETTINGS
In this section, we present the experiments to demonstrate the efficiency of the PPConv module. We performed experiments on two datasets of different scales: indoor scene segmentation on S3DIS [1] and object part segmentation on ShapeNetPart [48]. Both datasets are aligned; each axis (x, y, or z) of the 3D coordinates has a consistent direction throughout the dataset. Thus, we only use the x-, y-, and z-axis projections in this paper.
We report experimental results along with the inference time of each model. For the segmentation performance, we report the best mIoU of each model for fair comparison with other works. The performance of other methods was taken from the corresponding published papers, and the inference time was measured in a fixed hardware environment using the code provided by the authors. Because the runtime reported in each paper may vary depending on the hardware environment, we only report runtimes measured in our setting: inference on an NVIDIA RTX 2080 Ti GPU following warm-up and synchronization of the GPU.
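For reference, a common way to compute mIoU from per-point predictions is sketched below. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch; evaluation scripts differ in how they treat such classes, and this is not claimed to match the protocol used in the paper.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for per-point labels."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                         # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2])
target = np.array([0, 1, 1, 1, 2])
miou = mean_iou(pred, target, num_classes=3)
print(round(miou, 4))  # 0.7222
```

In this toy example the per-class IoUs are 1/2, 2/3, and 1, whose mean is 13/18.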

B. INDOOR SCENE SEGMENTATION

1) Dataset
For indoor scene segmentation, we used the Stanford Large-Scale 3D Indoor Spaces (S3DIS) [1] dataset. In S3DIS, point clouds cover 271 rooms in 6 areas, and area 5 is often used for testing as it contains no region that overlaps with the other areas. Following previous works, we trained the models on areas 1 to 4 and 6, and then tested on area 5. Each point in the dataset is assigned one of 13 categories, including large objects (e.g., ceiling and floor) and small objects (e.g., beam and chair). Following the majority of recent studies, we used 6 extra feature channels as the input, which are the RGB values of each point and the 3D coordinates normalized with respect to the room size.
The data pre-processing and evaluation methods for the S3DIS experiments were adopted from PVCNN [24] and FPConv [19]. PVCNN used 8,192 points for each 1.5m × 1.5m block as input, while FPConv used 14,564 points in a 2.0m × 2.0m block. Furthermore, in PVCNN the blocks are divided in the pre-processing step and the same blocks are used for the whole training procedure, while in FPConv the block location is randomly sampled on the fly. We evaluate PPCNN++ with these two data pre-processing methods. In this paper, we denote PPCNN++ models trained under PVCNN's experimental protocol as "PPCNN++ (PV)", and those trained under FPConv's protocol as "PPCNN++ (FP)".

2) Architecture and Hyperparameters
As explained in Section V, we used the same network architecture as PVCNN++, except for the grid resolution, which we set to 64 × 64 for the first layer. The layer configuration is shown in Table 1. In the table, the numbers in …

During testing, we adjusted the number of input points while measuring inference time for fair comparison. For the majority of methods, the number of input points can be controlled using the batch size, while other methods, such as MinkowskiNet [4] and KPConv [37], take input with a varying number of points. Thus, we set the batch size to the minimum value such that the total number of input points exceeds the average number of input points of MinkowskiNet (51k). This results in an inference batch size of 7 for the model that takes 8,192 points per sample, and 4 for the model with 14,564 points per sample. For the methods with a different number of input points, the batch size was adjusted so that the total number of input points does not exceed 7×8192. Therefore, we ensure that the inference time of PPCNN++ was measured using a number of input points that is comparable to or larger than that of all the other methods.
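The batch sizes quoted above follow from a simple calculation, sketched here. The target of 51,000 points is the rounded "51k" figure from the text, so it is an approximation of the actual average; the helper name is illustrative.

```python
import math

def min_batch_size(points_per_sample, target_points):
    """Smallest batch size whose total point count exceeds target_points."""
    return math.floor(target_points / points_per_sample) + 1

target = 51_000                               # approximate average input size of MinkowskiNet
for points in (8192, 14564):
    batch = min_batch_size(points, target)
    print(points, batch, batch * points)
# 8192 7 57344
# 14564 4 58256
```

The resulting totals, 57,344 and 58,256 points, are both above the 51k reference.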

3) Results
The indoor scene segmentation results are shown in Table 2 and visually presented in Fig. 4. In the graph, we provide the number of input points for each method in the legend.
The graph shows that PPCNN++ achieves better efficiency than all the other compared methods. Some state-of-the-art methods, such as BAAF-Net [31], MinkowskiNet [4], and KPConv [37], show better performance than PPCNN++, but with significantly slower inference. For example, BAAF-Net, which is the fastest among those models, takes 364ms to process 40k points, while the PPCNN++ model with the best mIoU takes 221ms to process 58k points. When compared to RandLA-Net [12], a point-based model known to be extremely fast, PPCNN++ with the FPConv protocol (red square) shows a better efficiency trade-off. Overall, the experimental results clearly show that PPCNN++ can perform point cloud segmentation with a small accuracy gap from the state-of-the-art models, but with a remarkable advantage in inference speed. Additionally, 6-fold cross validation results are shown in Table 3, class-wise mIoUs are shown in Table 4, and visualized examples of the predictions of PPCNN++ are shown in Fig. 5. We present the results reported in each paper; thus, only those papers that published 6-fold cross validation results (Table 3) or class-wise mIoUs (Table 4) are shown.
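The BAAF-Net comparison above can be expressed as throughput in points per millisecond. The point counts are the rounded "40k" and "58k" figures from the text, so the resulting ratio is approximate.

```python
def points_per_ms(num_points, latency_ms):
    """Throughput in points processed per millisecond."""
    return num_points / latency_ms

baaf = points_per_ms(40_000, 364)             # BAAF-Net
ppcnn = points_per_ms(58_000, 221)            # PPCNN++ model with the best mIoU
print(round(baaf, 1), round(ppcnn, 1), round(ppcnn / baaf, 2))
# 109.9 262.4 2.39
```

Under these rounded figures, PPCNN++ processes roughly 2.4 times as many points per unit time.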

C. OBJECT PART SEGMENTATION

1) Dataset
In this subsection, we report the object part segmentation performance of PPCNN++ on the ShapeNetPart dataset [48]. Following previous works [18], [24], we evaluated the models using mIoU averaged over …

2) Architecture and Hyperparameters
We train the PPCNN++ model with a grid resolution of 64 × 64 for the first layer of the network. Each sample of the ShapeNetPart dataset is a single object, which means that the geometry of the point cloud is simpler than that of S3DIS samples. Thus, we used a smaller number of layers in PPCNN++, as detailed in Table 5.

3) Results
Table 6 presents a comparison of PPCNN++ with other state-of-the-art methods. The efficiency comparison using runtime and performance can be better analyzed on a relatively larger dataset, i.e., one with a larger number of input points. Since we already compared runtime on the S3DIS dataset, we only report the segmentation performance in this section. The results show that PPCNN++ exhibits performance comparable to the state-of-the-art methods.
In Table 7, we present the mean IoU of each class compared with recent methods. The PPCNN++ model outperforms all the other methods in two classes, and shows an instance mIoU ("mIoU" column) comparable to state-of-the-art point-based methods. In this table, we only include methods that reported class-wise mIoU in the published paper.

VII. ABLATION STUDY
In this section, we analyze the components of PPConv by providing experimental evidence of the effect of each module. The analysis was performed on the S3DIS dataset using the experimental protocol of PVCNN [24]. For every experiment in the ablation studies, we trained the model 5 times with different random seeds, and report the averaged performance.
First, we demonstrate the effect of each branch of the PPConv module: the projection branch and the point branch. We trained the model without each branch and report the test performances in Table 8. For the model without the point branch, feature fusion was performed by adding 3 projection branch features and …

Next, we evaluated PPCNN++ models with different numbers of projection branches. We trained 3 different models, with 1 (z), 2 (x,z), and 3 (x,y,z) projection branches, respectively. Every model included the point branch in this experiment. The results in Table 9 show that performance increases as projection branches are added. On the other hand, fewer projection branches result in faster runtime, which offers a choice over the trade-off between runtime and performance.

We also evaluated different projection methods to demonstrate their effect on the segmentation performance. PointNet-based projection was mainly used throughout the experiments in the paper. We evaluated simpler projection methods, averaging features and bilinear interpolation, which were explained in Section IV-B1. The results in Table 10 show that PointNet-based projection is clearly better than the other methods, justifying the use of PointNet-based projection for segmentation.

For the grid resolution during projection, we tested a few choices in order to determine the optimal value. We evaluated four different resolutions: 32×32, 48×48, 64×64, and 96×96. From the experimental results shown in Table 11, we observed that 64×64 shows the best performance and chose this value as the default experimental setting in this paper.

Then, we present experimental results for variations of the 2D convolutional module. As 2D convolutional modules deal with projected feature maps, which have the same shape as general image feature maps, any technique that can be combined with 2D convolutions can be used. We test two of the most popular modules: the residual connection and the Squeeze-and-Excitation [11] module. We trained three different models: without either of the two components, with only the residual connection, and with both modules. The segmentation performance, as shown in Table 12, increases as each module is added to the 2D convolutional module.

Finally, we trained models with different fusion strategies to demonstrate the effectiveness of the proposed fusion modules. The default fusion adds 3 projection branch features and then concatenates the result with the point branch feature. Then, a single-layer MLP is applied to produce the output feature with the pre-defined channel dimension. We proposed 2 different fusion modules in Section IV-C, and provide the experimental results in Table 13. The results show that both fusion modules improve performance over the default method, while runtime also increases due to the additional MLPs and matrix computations. The Context-Aware Fusion module shows a noticeably larger improvement, while its increase in runtime is comparable to that of Importance-Weighted Fusion.

VIII. CONCLUSION
We proposed PPConv, an efficient local aggregation module for 3D point cloud processing. The experimental observations demonstrated that efficient 2D convolutional modules can facilitate faster computation while keeping performance not far behind that of 3D convolutional modules. PPConv shows remarkable efficiency in terms of the trade-off between inference time and segmentation performance, providing an efficient alternative to point-based convolutional models. PPConv can handle both object part segmentation and scene segmentation under appropriate network architectures, and thus can be used as a generic building block for 3D point cloud processing networks.

FIGURE 1: Description of the PPConv module. The input point cloud is processed through the projection branch and the point branch, and the resulting features are then fused together to produce the final output point-wise features.

FIGURE 5: Visualized predictions by PPCNN++ (middle) on S3DIS, presented along with the corresponding input point cloud (left) and the ground truth (right).

, backprojection transforms 2D feature map (H × W × C conv ) into point features (N × C conv ); thus a mapping function between grid cell indices H × W and point indices N is needed.A natural choice would be reusing the mapping function used in projection: π(•) in (

TABLE 1: Network layer configurations of PPCNN++ for S3DIS experiments

TABLE 4: Mean IoU for each class for S3DIS (Area-5) (Bold: the best result for each class, Underlined: the second best result)

TABLE 7: Mean IoU for each class for ShapeNetPart

TABLE 8: Ablation study on each branch of PPConv module

TABLE 9: Ablation study on the number of projection branches

TABLE 10: Ablation study on projection methods

TABLE 11: Ablation study on the grid resolution

TABLE 12: Ablation study on the 2D convolutional module

TABLE 13: Ablation study on feature fusion methods