A Geometry Feature Aggregation Method for Point Cloud Classification and Segmentation

Although convolutional neural networks (CNNs) and graph neural networks (GNNs) currently perform well in 3D point cloud classification and segmentation, aggregating the local information of point clouds and improving robustness to geometric transformations remain challenging problems. To tackle these problems, we propose the Geometry Feature Aggregation Network (GFA-Net), which effectively learns the context information of each point to aggregate local information and thereby enhances robustness to rotation and translation. Compared with the currently popular GNN methods that convolve over nearby points in Euclidean space, GFA-Net can better aggregate the geometric features around each point. GFA-Net uses Laplacian eigenmaps to reduce dimensionality, aggregates the nearest-neighbor features in the reduced space, and fuses them with the nearest-neighbor features from Euclidean space, so as to better obtain the geometric features of each point. Points are then grouped by geometric features, so that the neighborhoods are insensitive to geometric transformations such as rotation and translation. This allows GFA-Net to better capture holistic geometric features, such as symmetry. In addition, we use an attention mechanism instead of pooling, so that important neighborhood information can be learned automatically and information loss can be reduced. We conduct extensive experiments on the public datasets ModelNet40 and ShapeNet Part. The experimental results show that GFA-Net achieves performance very close to that of current state-of-the-art methods while being more robust.


I. INTRODUCTION
With the development of computer vision and artificial intelligence, point cloud classification and segmentation has become a challenging and important problem in 3D vision, because fields such as robot control [1], autonomous driving, remote sensing [2] [3], and the reconstruction of 3-D buildings [4] need to interact with real scenes. In recent years, tremendous progress in the automatic classification of airborne laser scanning (ALS) point clouds has been achieved in the remote sensing and photogrammetry community [5].
Deep learning and convolutional neural networks (CNNs) have been highly successful on 2D images [6] [7]. A 3D point cloud, however, is unordered, that is, changing the order of the points does not change its meaning, whereas deep neural networks require input data with a regular structure. Moreover, a 3D point cloud is irregular, so it is not easy to apply deep learning to it. In this paper, we consider point cloud classification and segmentation. The traditional approach to such problems is to extract the geometric features of point clouds with hand-crafted features [8] [9]. In recent years, deep learning has made great progress in LiDAR point cloud analysis, and many deep learning-based point cloud classification and segmentation methods have been proposed [10] [11]. One family of deep learning methods converts 3D point cloud data into a regular voxel form [12] [13] and then uses a 3D CNN for feature learning. However, this conversion incurs high computing and memory costs, which increases network overhead and hinders its use in large-scale point cloud scenarios. Deep neural networks designed to process the raw point cloud data directly were therefore proposed. PointNet [14] is the pioneering network that directly processes raw point cloud data. PointNet [14] learns an encoding of each point, aggregates the per-point features into a global feature, and uses the MaxPooling function to achieve permutation invariance over the points, but this loses the local information of each point. To address the failure of PointNet [14] to extract local information, PointNet++ [15] was proposed. PointNet++ [15] adopts a hierarchical sampling strategy and uses neighborhood balls to obtain subregions. PointNet [14] is then used to extract local features at different scales.
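As a minimal illustration of why MaxPooling yields permutation invariance, the following Python sketch pools hypothetical per-point features channel by channel and checks that reordering the points leaves the result unchanged. The feature values and the helper name `global_max_pool` are illustrative assumptions and are not taken from PointNet [14].

```python
def global_max_pool(point_features):
    """Channel-wise maximum over a list of per-point feature vectors."""
    return [max(channel) for channel in zip(*point_features)]


# Hypothetical per-point features: 4 points, 3 channels each.
features = [
    [0.1, 0.7, 0.3],
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.1],
    [0.3, 0.6, 0.6],
]

# The same points listed in a different order.
reordered = list(reversed(features))

pooled = global_max_pool(features)
pooled_reordered = global_max_pool(reordered)
```

Both calls return the same global feature, but the pooled vector no longer records which point contributed each maximum, which is the loss of local information discussed above.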
To enhance the ability of each point to learn local features, PointWeb [16] computes the influence among all points in a local region by considering their mutual interactions. To this end, PointWeb [16] designs an Adaptive Feature Adjustment (AFA) module, which connects all points in the region and thus forms a locally fully connected network. Recent work has focused on designing dedicated convolution operations for irregular 3D point clouds. ConvPoint [17], proposed by Boulch in 2020, builds on the concept of convolution: a multi-layer perceptron (MLP) learns continuous weights for the points adjacent to each point, and the density of each point is then computed to handle the uneven sampling of point clouds. RS-CNN [18], a convolutional network based on geometric relations, proposes a novel convolution operator, RS-Conv, that learns from relations. RS-Conv extracts topological constraints from the point cloud, so the weights of the convolutional network are also constrained by the topological relations during learning. This method extends convolutional networks from ordered data to unorganized data, thus greatly improving shape awareness and robustness. Similarly, to aggregate information in Euclidean space, DGCNN [19] proposes a new convolution method, EdgeConv. EdgeConv uses the K-Nearest Neighbors (KNN) algorithm to construct a directed graph of the point cloud. The adjacent points are then recomputed on this directed graph, and the new graph structure is passed to the next layer for processing, which yields a dynamic graph structure. However, in Euclidean space, DGCNN [19] considers only nearby neighborhood points and fails to consider distant points that may have the same geometric structure.
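The KNN graph construction described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the search is brute force, each point is excluded from its own neighbor list, the edge feature is the concatenation of the center point with the neighbor offset, and the learned MLP that EdgeConv applies to each edge is omitted. The helper names and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_indices(points, k):
    """For every point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: squared_distance(p, points[j]),
        )
        neighbors.append(order[:k])
    return neighbors


def edge_features(points, neighbors):
    """Build the feature [x_i, x_j - x_i] for every directed edge i -> j."""
    feats = {}
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            offset = [b - a for a, b in zip(points[i], points[j])]
            feats[(i, j)] = list(points[i]) + offset
    return feats


# Hypothetical point cloud: three nearby points and one distant point.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
neighbors = knn_indices(points, k=2)
edges = edge_features(points, neighbors)
```

In this toy example the distant point at index 3 is never selected as a neighbor of the other three points, which mirrors the limitation noted above: a purely Euclidean neighborhood cannot reach far-away points, even if they share the same geometric structure.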
Point2Node [20] proposes a learning network with a high-dimensional node graph model, which fuses self, local, and non-local correlation information between nodes of a 3D point cloud, and designs a gate mechanism that adaptively aggregates feature information at the channel level.
The above methods consider only the local features of each point extracted in Euclidean space. However, many objects have symmetric geometry, in which points with the same geometric structure lie far from each other. If only the local structure of adjacent points is considered, the features of points with the same geometric structure cannot be shared, and even the overall geometric structure may be lost. To capture points with similar geometric structure and share information among them, point features cannot be extracted from Euclidean space alone.
Inspired by the above work, we propose the Geometry Feature Aggregation Network (GFA-Net). This network gathers the feature information of points not only in Euclidean space, but also in the eigenspace obtained by dimensionality reduction with the Laplacian matrix. The Laplacian matrix is computed from the fully connected graph of the point cloud. The geometric features of each point can then be learned through Laplacian eigenmaps, so that points with similar geometric characteristics can exchange information. This compensates for the inability of distant points with similar characteristics to exchange information. We fuse the feature map computed in Euclidean space with the feature map obtained in the Laplacian eigenmap space. This step takes into account not only nearby points but also points with similar geometric structures. We also improve the pooling operation. Instead of simple pooling over the local points, an attention mechanism computes the weight of the local information carried by each point, that is, its degree of influence on the central point, and this weight is used in the fusion. Unlike other methods, which mostly process point cloud features in Euclidean space, our method processes point features in the Laplacian feature space and makes full use of the information of all points to extract local information. Because we need to consider the relationships among all points, the computation becomes more complex, and learning can sometimes slow down because of point redundancy.
We conduct extensive experiments to explore the effectiveness of GFA-Net for point cloud classification and segmentation. The experiments demonstrate that our method achieves performance on par with current state-of-the-art methods on both ModelNet40 [36] and ShapeNet [37].
Overall, the three key contributions of this paper can be summarized as follows:
• We propose a novel Geometry Feature Aggregation (GFA) module, which effectively extracts the geometric features between points and integrates them with the features obtained from Euclidean space, so as to better extract local and global geometric features.
• We demonstrate that the neighborhood features obtained from the feature space of Laplacian dimensionality reduction are invariant to rotation and translation.
• Our GFA-Net achieves state-of-the-art performance on the ModelNet40 and ShapeNet datasets.

II. RELATED WORK
A. Methods based on multiple views and voxels
MVCNN [21] uses two-dimensional rendered images of 3D objects from multiple views as training data and recognizes 3D objects with a CNN. Later, Qi et al. [22] proposed a voxel-based CNN method for point cloud recognition, which rasterizes the point cloud data into a voxel structure, processes it with 3D convolutions, and introduces auxiliary training tasks to reduce overfitting. The auxiliary training task is to predict the object class from partial subvolumes; only the partial subvolumes need to be collected, without any additional manipulation. The higher the resolution of the data, the better this method performs, so its benefit is less apparent on lower-resolution data. Other approaches regularize the point cloud data with tree structures. For example, OctNet [23] exploits the sparsity of the input data, partitions the space hierarchically with a set of unbalanced octrees, and modifies the convolution operation to suit the hybrid grid-octree data structure. Kd-Net [24], proposed by Klokov et al., uses a kd-tree to divide the point cloud space and uses its hierarchical structure to form features at different levels.

B. Point-based Networks
In addition to PointNet [14] and PointNet++ [15], much recent work extracts the local features of each point by designing complex network modules. RSNet [25] proposes a slice pooling layer, which maps the features of unordered point cloud data along the x, y, and z directions to sequences of ordered feature vectors and then updates the features with a bidirectional RNN, so as to extract local correlation features at a relatively small computational cost. SO-Net [26] models the spatial distribution of a point cloud by constructing a self-organizing map (SOM). Based on the SOM, SO-Net extracts hierarchical features from individual points and SOM nodes, and finally represents the input point cloud with a single feature vector. Other methods design dedicated convolutions for point cloud data [27] [28]. Pointwise [29] proposes a new point-wise convolution for 3D point cloud data. Pointwise convolution [29] is very similar to ordinary convolution; the difference is that, although point clouds are irregular, it still convolves with a fixed-size kernel. In each cell of the convolution kernel, the features of the points falling into that cell are averaged and multiplied by the weight of the cell, and the results of all cells are summed to give the output of the layer. There are also some very clever convolution methods. FPConv [30] is a flattening convolution method that maps each point and its neighborhood onto a plane with an attention mechanism and then applies a sliding convolution, similar to that used for 2D images, to the flattened point cloud data. PointCNN [31] uses the X-Conv operator to re-encode and weight the input points and features, bringing the point cloud data into a canonical order, and then processes the reordered points with conventional convolution.

C. Other Methods
Most studies focus on neighboring points in Euclidean space, whereas GS-Net [32] observes that the features of distant points with similar geometry cannot be obtained in Euclidean space alone, and therefore designs the Geometry Similarity Connection (GSC) module. The GSC module obtains the features of distant points with similar geometry in the feature space and fuses them with the features of neighboring points obtained in Euclidean space to form local feature descriptors. Recently, there has also been work applying transformers to 3D point clouds [33] [34]. PCT [34] is a transformer-based point cloud learning framework that handles the unordered nature of point cloud data through the inherent permutation invariance of the transformer and carries out feature learning through the attention mechanism.

III. METHOD
In this section, we introduce GFA-Net, a network for point cloud classification and segmentation. GFA-Net has a three-layer structure, and each layer consists of a GFA module and an attention mechanism. In each layer, the geometric feature aggregation module enriches the local geometric information, and the attention mechanism then learns to select among the extracted geometric features of the surrounding points. The output features of the attention mechanism are the input of the next GFA module. Finally, the overall feature descriptor is obtained and sent to the corresponding classification or segmentation network to perform the task.

A. GFA
Aggregating the information of nearby points in Euclidean space loses the information of distant points with similar geometric features, while extracting the geometric features of distant points in the feature space is unstable and can sometimes produce chaotic features. We therefore construct the fully connected graph of the points and compute its Laplacian matrix. The Laplacian eigenspace is used to obtain the geometric features of distant points, which are then integrated with the features of neighboring points in Euclidean space to better collect local features. As shown in Figure 1, the GFA module can identify not only adjacent points in Euclidean space but also distant points with similar geometric features, such as symmetric points. The Laplacian matrix is defined from the adjacency matrix.
Among the different variants of Laplacian matrices, we define the combinatorial graph Laplacian used in [30] as $L = D - A$, where $A$ is the adjacency matrix and $D$ is the degree matrix, a diagonal matrix with $D_{i,i} = \sum_{j=1}^{n} A_{i,j}$. We define the normalized graph Laplacian matrix as [...] where $h$ is for concatenating features.
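To make the construction concrete, the following Python sketch builds a fully connected weighted graph over a small point set, forms the combinatorial Laplacian as the degree matrix minus the adjacency matrix, and checks numerically that the result is unchanged when the points are rotated and translated. It is an illustration under stated assumptions rather than the exact construction used in GFA-Net: the Gaussian kernel for the edge weights, the parameter name `sigma`, the rotation about the z-axis, and the sample points are all hypothetical choices, and the eigen-decomposition step of the Laplacian eigenmap is omitted.

```python
import math


def pairwise_weights(points, sigma):
    """Fully connected graph weights from a Gaussian kernel (an assumed choice)."""
    n = len(points)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                adjacency[i][j] = math.exp(-d2 / sigma)
    return adjacency


def combinatorial_laplacian(adjacency):
    """Degree matrix minus adjacency matrix; degrees are the row sums."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def rotate_z_and_translate(points, angle, shift):
    """Rigid motion: rotation about the z-axis followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
moved = rotate_z_and_translate(points, angle=0.7, shift=(2.0, -1.0, 0.5))

L_original = combinatorial_laplacian(pairwise_weights(points, sigma=20.0))
L_moved = combinatorial_laplacian(pairwise_weights(moved, sigma=20.0))

# Largest entry-wise difference between the two Laplacians.
max_difference = max(
    abs(a - b)
    for row_a, row_b in zip(L_original, L_moved)
    for a, b in zip(row_a, row_b)
)
```

Because the edge weights in this sketch depend only on pairwise distances, which rigid motions preserve, the two Laplacians agree up to floating-point error, and so would any embedding computed from them.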
In order to aggregate local geometric features and global geometric features, we choose the following structure for the aggregation function $h(\cdot, \cdot, \cdot)$:

B. Attention aggregation mechanism
MaxPooling or AvgPooling is usually applied directly to integrate the features of all neighborhood points. However, the pooling operation treats every neighborhood point as equally important, so much information is lost and points carrying more geometric features are under-represented. To handle this problem, an attention mechanism is used in place of pooling. In this way, points carrying more information are given greater weight, so that they play a greater role in local feature aggregation.
Computing attention scores: Given a set of local features $F = \{f_1, \ldots, f_K\}$, a shared function $g(\cdot)$, consisting of a shared MLP followed by softmax, is used to learn a unique attention score for each feature. The shared function $g(\cdot)$ is formally defined as $s_k = g(f_k, W)$, where $W$ is the learnable weights of a shared MLP.
Finally, the weighted sum operation is carried out over the features of all selected neighborhood points: $\tilde{f} = \sum_{k=1}^{K} s_k \cdot f_k$, where $s_k$ is the attention score of the neighborhood feature $f_k$.

Figure 3: Structure of the Geometry Feature Aggregation Network for 3D point cloud classification and segmentation. The network takes N points as input, each with only its three-dimensional coordinates as features. In each layer, the geometric feature aggregation module enriches the local geometric information, and the attention mechanism then learns to select among the extracted geometric features of the surrounding points. The output of each layer is the input of the next layer, and the outputs of the three layers are finally integrated into a global feature descriptor. For the segmentation model, the feature descriptor of each point is concatenated with the global feature descriptor, and the per-point classification scores give the semantic labels. For the classification model, the global feature descriptor is pooled into a one-dimensional vector, which is fed into a fully connected network to obtain the classification score.
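The attention aggregation described in this subsection can be sketched as follows. The sketch is a simplification under stated assumptions: a single linear layer without bias stands in for the shared MLP, each neighbor receives one scalar score rather than a per-channel score, and the names `attention_pool` and `mlp_weights` as well as the feature values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(neighbor_features, mlp_weights):
    """Score each neighbor feature, then return the score-weighted sum."""
    # One linear layer stands in for the shared MLP (an assumption).
    logits = [
        sum(w * x for w, x in zip(mlp_weights, f)) for f in neighbor_features
    ]
    scores = softmax(logits)
    channels = len(neighbor_features[0])
    pooled = [
        sum(s * f[c] for s, f in zip(scores, neighbor_features))
        for c in range(channels)
    ]
    return pooled, scores


# Hypothetical features of K = 3 neighbors with 2 channels each.
neighbor_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Zero weights give uniform scores, so the result equals average pooling.
pooled, scores = attention_pool(neighbor_features, mlp_weights=[0.0, 0.0])

# Non-zero weights favor neighbors with a large first channel.
_, skewed_scores = attention_pool(neighbor_features, mlp_weights=[2.0, 0.0])
```

With all weights at zero the scores are uniform and the weighted sum reduces to average pooling, so in this simplified form average pooling is a special case that the attention weights are free to depart from during training.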

C. Network Structure
GFA-Net, which is composed of GFA modules and attention mechanisms for point cloud classification and segmentation, is shown in Figure 3. The model is a three-layer network. The GFA module at each layer effectively collects local geometric feature information for each point, and the attention mechanism then learns to select among the extracted geometric features of the surrounding points, so that neighboring points carrying more geometric features play a greater role.

Classification model
With n points as input, the local geometric features of each point are collected by three layers of GFA + attention. MaxPooling is then applied over the information of all points to form a global descriptor. Finally, the global descriptor is fed into the classification network to obtain the classification score.

Segmentation model
The 1D global descriptor is copied to each point through a repeat operation and concatenated with the outputs of GFA + attention at each layer (used as local descriptors); the classification score for each point is then obtained through an MLP.

IV. EXPERIMENTS
A. Datasets
We evaluated the proposed approach on the public datasets ModelNet40 [36] and ShapeNet [37]. ModelNet40 [36] was used for the point cloud classification task. It contains 12,311 meshed CAD models from 40 categories, of which 9,843 are used for training and 2,468 for testing. The objects in this dataset are complete, without any occlusion or background. ShapeNet [37] was used for the point cloud segmentation task. It contains 16,881 3D shapes from 16 object categories, annotated with 50 parts in total. Each object has 2 to 6 parts; the training set has 12,137 samples and the test set has 2,874 samples.

Experiment setting
We followed the experimental protocol of models such as PointNet [14]. We uniformly sampled 1024 points from the mesh surface and used the 3D coordinates of each sampled point as the input data. Three layers of GFA + attention extract the local geometric features, where the output of each GFA module, weighted by the learned attention mechanism, is the input to the next GFA module. For the hyperparameters, we chose a neighborhood size of $K = 25$ and set the hyperparameter used to compute the weights of the Laplacian matrix to 20. The feature dimensions output by the three GFA + attention layers were 64, 128, and 256, respectively; these were concatenated with the 3D coordinate features of the initial input point cloud to obtain the complete local features, MaxPooling was then used to obtain the final global features, and finally three fully connected layers (512, 256, and the number of categories) were used to classify the global features and obtain the classification score. We used Dropout and L2 regularization to prevent overfitting and LeakyReLU as the activation function. The network was built with the PyTorch framework and trained for 200 epochs with a batch size of 8 on an NVIDIA Tesla V100 16GB GPU. We used the SGD optimizer with an initial learning rate of 0.01, decayed by a factor of 0.75 every 40 epochs.
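For reference, the learning rate schedule stated above can be written as a short Python function. This sketch assumes that the decay is multiplicative, that is, the learning rate is multiplied by 0.75 once every 40 epochs; the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(epoch, initial_lr=0.01, decay=0.75, step=40):
    """Learning rate at a given epoch under multiplicative step decay."""
    return initial_lr * decay ** (epoch // step)


# Learning rate at selected epochs of a 200-epoch run (epochs 0 to 199).
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 39, 40, 80, 199)}
```

Under this reading the learning rate stays at 0.01 for the first 40 epochs, drops to 0.0075 at epoch 40, and reaches roughly 0.0032 in the last 40 epochs.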

Experiment Analysis
We measured the mean classification accuracy (MA) and overall accuracy (OA) on ModelNet40. The results are shown in Table 1. Our proposed method achieves very competitive results, with an OA of 93.2%, 0.5% higher than DGCNN [19]. This shows that GFA-Net is competitive with current mainstream point cloud recognition methods on the classification task.
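The difference between the two metrics can be illustrated with a small Python sketch. It assumes that OA is the fraction of correctly classified samples and that MA is the average of the per-class accuracies; the labels and predictions are hypothetical and unrelated to the results in Table 1.

```python
from collections import defaultdict


def overall_accuracy(labels, predictions):
    """Fraction of samples whose prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def mean_class_accuracy(labels, predictions):
    """Average of the per-class accuracies, each class weighted equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical imbalanced test set: four chairs and one lamp.
labels = ["chair", "chair", "chair", "chair", "lamp"]
predictions = ["chair", "chair", "chair", "chair", "chair"]

oa = overall_accuracy(labels, predictions)
ma = mean_class_accuracy(labels, predictions)
```

In this example OA is 0.8 while MA is 0.5, because the single misclassified lamp counts for half of the class average; reporting both metrics therefore exposes weaknesses on rare categories.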
In addition, we ran comparative tests with different neighborhood numbers $K$ and different values of the Laplacian weight hyperparameter to study their influence on the results. As shown in Table 2, the performance is best when the neighborhood number $K$ is 25. When $K$ is less than 25, too few local points are extracted. When $K$ is greater than 25, there may be too many neighborhood points, leading to redundant local features. We then tested the hyperparameter with $K$ fixed and found that the performance is clearly better at a value of 20 than at other values.

Ablation Study
We performed an ablation analysis of the components on the ModelNet40 classification task. All experiments in the ablation analysis used 1024 points, a neighborhood size of $K = 25$, and a Laplacian weight hyperparameter of 20. The choice of features affects the local geometric information and the relationships between points, so how features are selected is important. To find the most appropriate features, we tried four settings, as shown in Table 3. Using only the three-dimensional coordinates in Euclidean space already gives an accuracy above 92.0%, fusing the Euclidean space with the Laplacian eigenmap space reaches 92.4%, and additionally including the original three-dimensional features reaches 93.2%. In summary, the ablation analysis shows that dimensionality reduction with the Laplacian eigenspace is effective.

Complexity Analysis
We evaluated the complexity of the model in terms of model size and running time in Table 4. The forward time (ms) was measured on an NVIDIA Tesla V100 16GB GPU with a batch size of 8. All models were run in the same hardware environment and implemented in PyTorch. The results show that our model size is second only to that of DGCNN [19], but its forward time is relatively long because the Laplacian eigenmap needs to be constructed.

Experiment setting
The segmentation experiment was carried out on ShapeNet [37]. GFA + attention was used to extract the local geometric features in each layer, the 1D global descriptor was then concatenated with the respective outputs of the three layers, and finally the classification output of each point was computed by a shared MLP (256, 128). We also used Dropout and L2 regularization to prevent overfitting and LeakyReLU as the activation function. The same training settings as in the classification task were used, and training was carried out on the NVIDIA Tesla V100 16GB GPU. The same evaluation scheme as DGCNN [19] was adopted. The IoU of a shape was calculated by averaging the IoUs of the different parts that appeared in the shape, and the IoU of a category was obtained by averaging the IoUs of all shapes that belonged to that category. Finally, the mean IoU (mIoU) was calculated by averaging the IoUs of all test shapes.
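The evaluation scheme above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a part counts as appearing in a shape when it occurs in the ground-truth labels of that shape, and the category names, labels, and helper names are hypothetical.

```python
def shape_iou(gt, pred):
    """Average IoU over the part labels that occur in the ground truth."""
    ious = []
    for part in sorted(set(gt)):
        inter = sum(1 for g, p in zip(gt, pred) if g == part and p == part)
        union = sum(1 for g, p in zip(gt, pred) if g == part or p == part)
        ious.append(inter / union)  # union > 0 because the part occurs in gt
    return sum(ious) / len(ious)


def summarize(shapes):
    """Return per-category IoU and the mean IoU over all shapes."""
    per_shape = [(cat, shape_iou(gt, pred)) for cat, gt, pred in shapes]
    by_category = {}
    for cat, iou in per_shape:
        by_category.setdefault(cat, []).append(iou)
    category_iou = {cat: sum(v) / len(v) for cat, v in by_category.items()}
    mean_iou = sum(iou for _, iou in per_shape) / len(per_shape)
    return category_iou, mean_iou


# Hypothetical test shapes: (category, ground-truth labels, predicted labels).
shapes = [
    ("cap", [0, 0, 1, 1], [0, 0, 1, 1]),
    ("cap", [0, 0, 1, 1], [0, 1, 1, 1]),
    ("mug", [2, 2, 3, 3], [2, 2, 2, 2]),
]

category_iou, mean_iou = summarize(shapes)
```

Note that the mIoU averages over shapes rather than over categories, so categories with many test shapes contribute more to it than categories with few.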

Experiment Analysis
We compared our results with those of PointNet [14], PointNet++ [15], PointCNN [31], DGCNN [19], and KD-Net [24]; the results are shown in Table 5. Our method reaches state-of-the-art performance on the segmentation of some objects. However, in categories with few training samples, such as Cap, Bag, and Rocket, it may be inferior to other methods. The main reason is that, with so few samples, our method may not learn enough point features, which leads to segmentation errors. To show the performance of our method more intuitively, Figure 4 gives a direct comparison between our method, DGCNN [19], and the ground truth.

V. CONCLUSION
We propose a novel network structure, GFA-Net, for point cloud feature aggregation. It consists of GFA modules and attention mechanisms. GFA-Net gathers the information of points with the same geometric features, which compensates for the limitation that only the features of nearby points can be considered in Euclidean space, and thus improves robustness to rotation and translation of point clouds.
Experiments show that GFA-Net achieves state-of-the-art performance, collects local geometric features more effectively, and is strongly robust. In future work, we hope to apply our method to large point cloud data with richer feature information.