A Category-Contrastive Guided-Graph Convolutional Network Approach for the Semantic Segmentation of Point Clouds

The semantic segmentation of light detection and ranging (LiDAR) point clouds plays an important role in 3-D scene intelligent perception and semantic modeling. The unstructured, sparse, and uneven characteristics of point clouds pose great challenges to the representation of local geometric shapes, which degrades semantic segmentation performance. To address these challenges, this article proposes a category-contrastive-guided graph convolutional network (CGGC-Net) for the semantic segmentation of LiDAR point clouds. First, a detailed geometric structure of the raw point clouds is encoded to represent the inherent geometric pattern within the local neighborhood. At the same time, the geometric structure information is transmitted across multiple layers, so that geometric structure encodings with different receptive fields and richer neighborhood spatial structure can be aggregated. Following this, the graph convolutional neural network uses the edge convolution layer to adaptively describe the semantic correlation between the query point and its neighboring points, and combines the attention mechanism to gather the surrounding feature information into the query point. The graph convolution and the attention mechanism are then iteratively stacked to aggregate and fuse spatial context semantic information and generate a highly discriminative semantic feature representation. Finally, the parameters of the model are learned through a multitask optimization strategy guided by category-aware contrastive loss and cross-entropy loss. Experiments are conducted on the public SemanticKITTI dataset and the Stanford large-scale 3-D Indoor Spaces (S3DIS) dataset to demonstrate the effectiveness and reliability of the proposed CGGC-Net from both quantitative and qualitative perspectives. The results indicate its capability of automatically classifying LiDAR point clouds, with a mean intersection-over-union of 58.4% on SemanticKITTI. Moreover, multiple comparative experiments also demonstrate the superior performance of the proposed method relative to the compared state-of-the-art methods.


I. INTRODUCTION
Light detection and ranging (LiDAR) point clouds have attracted increasing interest in numerous applications, especially autonomous driving [1], [2], virtual reality, and robotics [3], [4], due to their superior ability to preserve the spatial detail of objects and scenes [5], [6], [7]. In these applications, fine-grained classification, which assigns a semantic label to each point that belongs to the objects of interest, is a fundamental and important task. This detailed semantic information plays an important role in downstream tasks, such as place recognition [8], instance segmentation [9], and scene reconstruction [10]. Therefore, the automatic fine-grained classification of LiDAR point clouds has been an active topic.
To date, many methods have been developed for the semantic segmentation of 3-D point clouds. Traditionally, machine learning-based approaches (e.g., support vector machines [11] and random forests [12]), where hand-crafted features are designed for representing the geometric structure information of point clouds, have been adopted for the semantic segmentation of point clouds. Although these approaches have shown the capability of automatically classifying point clouds, their performance is limited by the descriptive ability of the designed hand-crafted features and the reliability of the selected classifier.
More recently, deep learning techniques have demonstrated excellent abilities in various computer vision and natural language processing fields and are increasingly popular in scene understanding tasks, such as classification, object detection and instance segmentation, based on point clouds. Due to the discrete and disordered data characteristics of point clouds, it is challenging to directly implement classic convolutional neural networks on raw point clouds. Some solutions that transform raw point clouds into regular representations, such as projected images and structured volumetric grids [13], [14], have been presented, which then serve as the input of classic convolutional neural networks for semantic segmentation. Although these solutions enhance the descriptiveness and discriminativeness of feature representation to some extent, the local geometric structures and fine-grained semantic contexts are difficult to preserve due to the regular transformation.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Although many deep learning-based methods have been presented for the semantic segmentation of point clouds in recent years, this task remains remarkably challenging due to the following aspects. First, the efficient aggregation of rich semantic information at various scales remains difficult due to unstructured data characteristics. Although graph convolution-based methods have been explored for the semantic segmentation of point clouds in recent years [17], [18], [31], capturing local geometric patterns and aggregating spatial context is still necessary, especially for dynamic and large scenarios [27], [32], [33]. Second, although semantic supervised labels effectively improve the descriptiveness of feature representation via an end-to-end deep learning architecture, most approaches update the model parameters toward convergence only by comparing their predictions with the associated semantic labels. Few studies have focused on explicitly using semantic supervised information to guide the process of generating high-level semantic feature representations of point clouds.
To address the aforementioned issues, this article develops a category-contrastive guided graph convolutional network (CGGC-Net) method for the semantic segmentation of LiDAR point clouds. Moreover, both quantitative and qualitative analyses are conducted on the public SemanticKITTI dataset benchmark [34] and S3DIS dataset [35] to evaluate its robustness and reliability. Our contributions in this article are as follows.
1) The spatial context information from both the detailed geometric structure and the semantic features is locally aggregated for each point in parallel, using a graph convolution module and an attention mechanism, as the receptive field progressively increases, to generate a highly discriminative semantic feature representation.
2) A category-contrastive loss is designed to guide the learning of the semantic feature representation, keeping the semantic features of the same class close while pushing those of different classes far apart. Moreover, combined with the cross-entropy loss, a multitask optimization strategy jointly exploits the discrepancies among different categories to produce a highly discriminative and descriptive semantic representation.
The rest of this article is organized as follows. The related works are briefly reviewed in Section II. Section III describes the developed LiDAR point cloud classification framework in detail. Section IV presents the experimental results and analyzes the developed method both quantitatively and qualitatively. Finally, Section V concludes this article.

II. RELATED WORK
Semantic segmentation, especially for large-scale scenes, has been an active topic. However, accurate semantic classification in large scenes is challenging due to complex elements, the variety of scene classes, occlusions, and noise. In recent years, deep learning techniques have become increasingly popular since they can produce promising interpretation results. In this section, we review the relevant literature, which can be generally divided into four groups: projected image methods, voxel-based methods, point-based methods, and graph-based methods.

A. Projected Image Methods
Due to the great success of 2-D image semantic segmentation [36], [37], unordered and unstructured 3-D point clouds were initially projected onto regular and structured 2-D images. Subsequently, numerous mature deep neural networks for 2-D image semantic segmentation could be used for pixelwise labeling. 2-D multiview synthetic images have been generated from point clouds [38], [39], [40], [41], and multistream convolutions on different views have been applied for labeling the semantic information, which is then reprojected to each point. In addition, spherical projections, such as SPLATNet [42] and SqueezeSeg [43], have been used to alleviate geometric information loss during preprocessing. To account for the uneven distribution of point clouds in grid cells, polar bird's-eye-view representations, such as PolarNet [44], have been defined through the polar coordinate system. Although mature 2-D semantic segmentation methods can be implemented on regular and structured projected images, geometric information is inevitably lost during the projection transformation, which can limit the classification quality.

B. Voxel-Based Methods
Voxel-based methods, such as VoxNet [13] and OctNet [45], convert discrete 3-D point clouds into volumetric occupancy grids whose feature maps can be generated through a 3-D convolutional neural network (3-D CNN). As the resolution of voxelization increases, richer geometric information can be retained, but at the cost of excessive memory consumption and heavy computation. To ease this issue, sparse convolution has been designed and implemented on flexible unbalanced octrees that adaptively partition 3-D point clouds based on sparsity [45], [46], [47]. In fact, voxel-based methods extend the success of 2-D convolution into 3-D space, which can effectively deal with unstructured point clouds. Similar to projected image-based methods, voxel-based methods inevitably lead to information loss, although they retain 3-D information to some extent.

C. Point-Based Methods
Point-based methods carry out the convolution operation directly on unordered and irregular point clouds. PointNet [15] was a pioneering model that operated directly on unordered and irregular point clouds. Although it enhanced the feature representation capability by directly using raw point clouds, PointNet ignored the spatial context because it did not take neighborhood information into consideration. Subsequently, PointNet++ [16] applied a ball-query module to extract and aggregate local features using a hierarchical structure. Nevertheless, PointNet++ still lost the relationships between points within a ball-query set. To this end, other works [24], [25] concentrated on how to carry out an effective and efficient convolutional operator directly on raw point clouds via graph convolutional networks and attention mechanisms [22], [27], which augment and fuse feature maps of multiple resolutions for large-scale semantic segmentation.

D. Graph-Based Methods
Point clouds inherently lack topological information, so designing a model to recover topology can enrich the representation power of point clouds. To better exploit semantic relevance between neighbors, numerous studies have focused on relationship modeling via graph structures or attention mechanisms, where semantic context could then be extracted and aggregated into the corresponding center points. Graph neural networks (GNNs) were first proposed by Scarselli et al. [48], and have been widely used in recent years in different fields, including semantic understanding [49], medical neuroimaging [50], and social networks [51], to describe the local and global contexts within unstructured data. For instance, the superpoint graph (SPG) [52] was constructed to realize semantic segmentation in a large scene. However, this required extra preprocessing for segmenting the superpoints, and the labeling quality in large scenarios was unsatisfactory. Wang et al. [18] proposed the graph attention convolutional network, where adaptive weights were assigned to different neighbors through a self-attention mechanism, and then the local spatial context could be aggregated using adaptive pooling to automatically classify point cloud data. Wang et al. [31] developed a dynamic graph convolutional neural network that adopted edge convolution to extract and dynamically update local semantic features through the characteristic relationship between center points and neighbors. Liu et al. [32] explored the graph convolutional network to preserve rich geometric details and capture long spatial dependencies for enhancing the network feature representation.
To summarize, inspired by [22], [27], [32], our work uses a graph convolutional neural network as the baseline and is dedicated to semantic segmentation directly on raw LiDAR point clouds. Unlike previous GNNs that focused on updating semantic features and neglected the detailed geometric structure information [18], [22], [31], [32], our work encodes detailed geometric structure information into semantic features and aggregates the long-range spatial context as the receptive fields are expanded and stacked. In addition, our work further extends the applications of contrastive learning on a small number of objects such as 2-D images [53], [54] to the semantic segmentation of massive 3-D point clouds and verifies the effectiveness of the category-aware optimization strategy in the point cloud domain.

III. METHODOLOGY
To capture local geometric patterns and aggregate spatial context effectively and efficiently, this article follows the encoder-decoder architecture [16], [20], [27], [37] and develops a CGGC-Net method for the semantic segmentation of LiDAR point clouds, which has an encoder network and a corresponding decoder network, followed by a final point-wise classification layer. Fig. 1 illustrates the pipeline of the proposed CGGC-Net for the semantic segmentation of point clouds.
The encoder network consists of four encoder layers; the detailed geometric structure and the semantic features are locally aggregated for each point in parallel using the graph convolution module and the attention mechanism as the receptive field progressively increases through iterative stacking. Moreover, the detailed geometric structure information is transmitted across multiple encoder layers to effectively preserve complex local geometric patterns. Consequently, multiple encoder layers are progressively stacked for the aggregation and fusion of spatial context information to generate a highly discriminative semantic feature representation. Each encoder layer has a corresponding decoder layer, and thus the decoder network also has four layers. The encoder layer and its corresponding decoder layer are connected through skip connections, which combine deep, semantic, coarse-grained feature maps from the decoder layer with shallow, low-level, fine-grained feature maps from the encoder layer. Finally, the decoder output is fed into a classification layer, consisting of three fully connected layers and a multiclass softmax classifier, to produce class probabilities for each point independently. The parameters of the model are learned through a multitask optimization strategy guided by category-aware contrastive loss and cross-entropy loss. As a result, the raw point clouds are interpreted to obtain the final pointwise labeling results. More details of the proposed CGGC-Net are given below.

A. Detailed Geometric Structure Encoding
In this section, a detailed geometric structure encoding module is designed to describe inherent spatial relations within the local neighborhood and preserve complex local geometric patterns as much as possible, which enhances the expression and refinement of the subsequent semantic features.
1) Local Geometric Structure Descriptor: X-Y-Z coordinate information alone cannot directly describe complex local geometric patterns, since the relative spatial relations between the query point and its neighbors are left unexplored. Thus, inspired by a previous work [32], we use a local geometric structure descriptor, computed directly on the 3-D coordinates, to explicitly represent the geometric structure within the local neighborhood. A tensor $P = [p_1, p_2, \cdots, p_N]^T$ is defined to represent a point cloud, where $p_i$ denotes the $i$th point. For the query point $p_i$, its neighboring points $\{p_i^1, \ldots, p_i^K\}$ are gathered using a simple K-nearest neighbors (KNN) algorithm based on Euclidean distances, where $K$ denotes the number of neighbors. Following this, we adopt (1) to describe the geometric structure of the neighborhood

$$r_i^k = \mathrm{Concat}\left[p_i,\; p_i^k,\; \lVert p_i - p_i^k \rVert,\; (p_i - p_i^k)\right] \tag{1}$$

where $p_i$ denotes each center point, $p_i^k$ denotes the $k$th of the $K$ neighboring points of the query point, $\lVert \cdot \rVert$ represents the Euclidean distance between the query point and its neighboring point, $(p_i - p_i^k)$ reflects the 3-D coordinate difference between $p_i$ and $p_i^k$, and Concat[·] denotes the concatenation operation. As a consequence, $r \in \mathbb{R}^{N \times K \times 10}$ is encoded from the (redundant) 3-D coordinate information as the representation of the spatial relationships between the query point and its neighboring points.
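The descriptor described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names `knn_indices` and `local_descriptor` are hypothetical, the query point is assumed to be excluded from its own neighbor set, and the ordering of the ten values (query coordinates, neighbor coordinates, distance, coordinate difference) is an illustrative choice.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] by Euclidean distance.
    Assumption: the query point itself is not counted as a neighbor."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def local_descriptor(points, i, k):
    """Return K rows of 10 values each:
    [p_i (3), p_i^k (3), ||p_i - p_i^k|| (1), p_i - p_i^k (3)]."""
    p = points[i]
    rows = []
    for j in knn_indices(points, i, k):
        q = points[j]
        diff = [a - b for a, b in zip(p, q)]
        rows.append(list(p) + list(q) + [math.dist(p, q)] + diff)
    return rows

# Tiny example: four points, descriptor of point 0 with K = 2.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
r0 = local_descriptor(points, 0, 2)
```

For a cloud of N points, stacking the result for every query point gives the N × K × 10 layout mentioned in the text. A brute-force neighbor search is used here only for readability; a spatial index would normally replace it for large scenes.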
To efficiently aggregate the neighboring relations, we use attentive pooling [27], which adaptively allocates a unique attention score to different neighbors, to automatically learn and select the salient geometric structures, as defined by $g = \mathrm{AttentivePool}(r)$, where AttentivePool(·) denotes the attentive pooling function consisting of a shared MLP followed by softmax. To summarize, given the input point cloud $P$, an informative feature vector $g \in \mathbb{R}^{N \times 10}$ is generated in the first layer to effectively describe complex local geometric patterns.
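A minimal sketch of attentive pooling, under stated assumptions, may help make the "shared MLP followed by softmax" description concrete. Here the shared MLP is reduced to a single bias-free linear map `weight` (a hypothetical stand-in), scores are normalized per channel over the K neighbors, and the pooled feature is the score-weighted sum of the neighbor features.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(neighbor_feats, weight):
    """neighbor_feats: K x D list of lists; weight: D x D list of lists.
    Assumption: one shared linear layer (no bias) produces the scores."""
    k = len(neighbor_feats)
    d = len(neighbor_feats[0])
    # scores[n][c] = sum_j neighbor_feats[n][j] * weight[j][c]
    scores = [[sum(f[j] * weight[j][c] for j in range(d)) for c in range(d)]
              for f in neighbor_feats]
    pooled = []
    for c in range(d):
        attn = softmax([scores[n][c] for n in range(k)])  # over the K neighbors
        pooled.append(sum(attn[n] * neighbor_feats[n][c] for n in range(k)))
    return pooled

feats = [[1.0, 2.0], [3.0, 6.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
sharp = [[50.0, 0.0], [0.0, 50.0]]
uniform = attentive_pool(feats, zeros)   # equal scores: reduces to the mean
peaked = attentive_pool(feats, sharp)    # large scores: approaches the max
```

With all-zero weights the pooling degenerates to mean pooling, and with sharply peaked scores it approaches max pooling, which is why learned scores can select salient structures between these two extremes.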
2) Local Geometric Structure Transmission: To capture complex local structure patterns, a series of downsampling operations is performed to alleviate the limited size of the receptive field. As the encoder deepens through downsampling operations, the receptive field of each point increases. In this way, richer local structures are progressively aggregated thanks to the wider context information available for each point. However, fine-grained spatial relationships may inevitably be lost in this process. Therefore, the geometric structure feature $g$ is transmitted across multiple encoder layers to effectively preserve complex local geometric patterns, so that it is efficiently augmented and enriched with a combination of different receptive fields, which provides a fundamental spatial basis for mining the semantic correlation between neighboring points of discrete 3-D point clouds. Finally, the detailed local geometric structure encoding in the $t$th layer is obtained from the transmitted geometric structure features using the down-sampling operation DS and the concatenation operation Concat[·], with $1 \le t \le 4$ in this article. Fig. 2 illustrates the differences between local structure patterns under different receptive fields.

B. Geometric and Semantic Aggregation Graph Convolution Module (GSAGCM)
In this section, we design a graph convolutional neural network module to produce new semantic features by aggregating the neighboring semantic information, which takes the detailed local geometric structure encoding $g$ and the semantic features $F$ as inputs. Initially, the semantic features are embedded from 3-D coordinates using a simple MLP operation. In the GSAGCM, the propagated edge convolution (PEConv) is used to extract the semantic feature relationship between the query point and its neighbors, and the neighboring feature information is aggregated into the query point through attentive pooling. Finally, detailed local geometric structures and semantic features are fused by stacking several layers with residual connections to update the semantic feature of each point. As in the original RandLA-Net [27], we adopt attentive pooling to aggregate the encoded local spatial information into the associated query point. Unlike RandLA-Net, however, after multiple transmissions, we use the further augmented local geometric structure to guide the expression and refinement of semantic features with the help of graph convolution rather than a single MLP layer.
1) Propagated Edge Convolution for Feature Aggregation: Within the local neighborhood, PEConv and attentive pooling are used to extract and transmit neighborhood information. This consists of three parts: graph model construction; edge feature representation; and edge feature aggregation. As a result, a new semantic feature per point is produced, which is either aggregated further by the next PEConv and attentive pooling step (see Fig. 3) or serves as the input of the subsequent encoder layer together with the detailed local geometric structure encoding (see Fig. 1).
1) Graph Model Construction: Unlike 2-D raster images, 3-D point clouds are discrete and disordered, and there is no explicit topological relationship between points. However, points that are adjacent to each other in Euclidean space usually interact. In addition, for a specific point, the geometric structure formed by its neighboring points is the foundation of semantic mining. As mentioned above, we obtain the indices of the $K$ nearest points of each point by KNN, and establish directed edges between the query points and their neighbors.
2) Edge Feature Representation: Many graph-based networks stack both global and local information as their edge representations. Our representation differs in that the local geometric structure is included, in addition to the semantic feature, when building the directed edges. Considering that the global information has been embodied in $g$, we eventually use the difference $e$ between the stacked features of the query point and those of its neighbors, where $G_i^j$ represents the geometric and semantic stacked feature of the $j$th point in the corresponding neighborhood of the $i$th point.
Ultimately, we extract the edge attribute features from $e$ by means of a three-layer successively stacked MLP $h_{\Theta_1}$, where $h_{\Theta_1}$ denotes feature learning of $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$, $d$ is the feature dimension, and $\Theta_1$ denotes the learnable weights of the multiple groups.
3) Edge Feature Aggregation: To aggregate edge attribute features into the query point while avoiding the loss of important edge information, we introduce a self-attention mechanism to adaptively learn a unique score for each edge attribute and maximize the characterization of the edges. The aggregated feature of the query point is then computed as the attention-weighted combination of its edge attribute features, where the attention scores are produced with $\Theta_2$, which is also a group of learnable weights.
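The three parts above (graph construction, edge feature representation, edge feature aggregation) can be sketched as follows. This is an illustrative simplification rather than the article's PEConv: the sign of the difference (query minus neighbor), the single ReLU linear layer standing in for the three-layer MLP, the scalar dot-product score per edge, and all function names are assumptions made for the example.

```python
import math

def edge_features(G, neighbors, i):
    """Edge features of query point i: difference between the stacked
    (geometric + semantic) feature of i and that of each neighbor j.
    Assumption: query minus neighbor."""
    return [[a - b for a, b in zip(G[i], G[j])] for j in neighbors[i]]

def relu_linear(x, W):
    """One shared linear layer followed by ReLU (stand-in for the MLP)."""
    return [max(0.0, sum(xj * W[j][c] for j, xj in enumerate(x)))
            for c in range(len(W[0]))]

def aggregate(edges, score_w):
    """Score each edge with a dot product, apply softmax over the edges,
    and return the score-weighted sum of the edge features."""
    scores = [sum(e_c * w_c for e_c, w_c in zip(e, score_w)) for e in edges]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(edges[0])
    return [sum(exps[n] / z * edges[n][c] for n in range(len(edges)))
            for c in range(dim)]

# Stacked features of three points and the directed KNN edges of point 0.
G = [[1.0, 0.0], [0.0, 0.0], [3.0, 2.0]]
neighbors = {0: [1, 2]}
identity = [[1.0, 0.0], [0.0, 1.0]]

edges = edge_features(G, neighbors, 0)
hidden = [relu_linear(e, identity) for e in edges]
new_feature = aggregate(hidden, [0.0, 0.0])
```

Because the scores come from learnable weights, training can emphasize informative edges instead of averaging all of them; the zero score vector used here simply makes the example easy to check by hand.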

2) Residual Connected and Dilate Stacked Module:
To expand the receptive fields, many existing works [55], [56] optimized the K-nearest searching strategy, which required searching for more neighboring points in different receptive regions and then regularly selecting a fixed number of neighboring points. These approaches incur additional memory costs for searching more nearest points. In this section, we instead stack multiple propagated edge convolutional layers to increase the receptive field by means of feature propagation. Moreover, to address the problems of gradient vanishing and model degradation in deep neural networks, a PEConv is used as our residual connection rather than an MLP [57], [58]. Fig. 4 illustrates the increasing size of the receptive field when stacking. When the PEConv is first performed on the input $G$, the receptive field of each point is the corresponding number of neighbors $K$. In the second layer, although the number of neighbors remains constant, the actual receptive field becomes $K^2$, since the semantic feature of each neighbor already contains information from its own neighborhood in the previous layer. As a result, the size of the receptive field is repeatedly expanded through feature aggregation within the local neighborhood. In this article, we ultimately stack two layers.
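The growth of the receptive field through stacking can be checked on a toy graph. The sketch below is an assumption-laden illustration (1-D points, brute-force KNN, hypothetical function names) that counts which points can influence a query point after one and two stacked aggregation layers.

```python
def knn_graph_1d(xs, k):
    """Directed KNN graph on 1-D coordinates: node -> its k nearest other nodes.
    Ties are broken by the smaller index."""
    graph = {}
    for i, x in enumerate(xs):
        others = sorted((j for j in range(len(xs)) if j != i),
                        key=lambda j: (abs(xs[j] - x), j))
        graph[i] = others[:k]
    return graph

def receptive_field(graph, node, layers):
    """Points whose features can reach `node` after `layers` stacked aggregations."""
    field = {node}
    for _ in range(layers):
        field |= {j for i in field for j in graph[i]}
    return field

xs = [float(i) for i in range(21)]
K = 2
graph = knn_graph_1d(xs, K)
one_layer = receptive_field(graph, 10, 1)
two_layers = receptive_field(graph, 10, 2)
```

In this example the second layer reaches points 8 and 12 without any additional neighbor search. Since neighborhoods overlap, the number of distinct points reached in exactly two hops is at most $K^2$, so $K^2$ is best read as an upper bound on the growth.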

C. Category-Contrastive and Cross-Entropy Guided Optimization Strategy
The parameters of the whole network are learned from massive high-quality labeled training data so that the network maps a set of inputs to a set of outputs. Generally, learning is cast as an optimization problem, which navigates the space of possible parameter sets of the network to produce satisfactory predictions. Here, we present a category-contrastive and cross-entropy guided optimization strategy to search for a candidate solution with optimal values.
For the multiclass classification of point clouds, the most typical and effective method is to use the cross-entropy loss between the predictions and the ground truth, which can be calculated as follows:

$$L_{cro} = -\sum_{i=1}^{V} \hat{y}_i \log(y_i)$$

where $y_i$ denotes the predictions, $\hat{y}_i$ denotes the ground truths, and $V$ represents the number of categories. Consequently, the predicted probability distribution gradually approaches the true probability distribution by minimizing $L_{cro}$ through gradient back-propagation.
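As a small worked example of the cross-entropy loss for a single point, the following sketch evaluates it for a confident correct prediction and a confident wrong one. The function name and the clipping constant `eps` are assumptions added for numerical safety.

```python
import math

def cross_entropy(pred_probs, one_hot_label, eps=1e-12):
    """Cross-entropy between predicted class probabilities and a one-hot label.
    eps guards against log(0); it is an implementation convenience."""
    return -sum(t * math.log(max(p, eps))
                for p, t in zip(pred_probs, one_hot_label))

label = [1, 0, 0]                                # ground truth: class 0
good = cross_entropy([0.7, 0.2, 0.1], label)     # most mass on the true class
bad = cross_entropy([0.1, 0.2, 0.7], label)      # most mass on a wrong class
```

With a one-hot label only the term of the true class survives, so the loss reduces to the negative log-probability assigned to that class and grows as this probability shrinks.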
However, the cross-entropy loss ignores the relations between the categories themselves. In fact, class separation in the latent feature space is also a desirable characteristic for discriminating among different categories. Theoretically, feature vectors of the same category should remain close in the latent feature space, while those of different categories should be far apart. Therefore, based on the output feature $\mu_v$ generated by the encoder-decoder architecture for a point belonging to class $v$, we design a category-guided contrastive loss that is devoted to depicting the category-specific distance from the centroid representation $\delta_i$ of each class. The loss is expressed in terms of a distance function $D(\cdot)$, which can be the Euclidean, cosine, or any other distance function, and a margin $\Delta$, which represents the maximum distance within the same class and the minimum distance between different classes.
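One common margin-based form of such a centroid contrastive loss is sketched below. It is offered only as an assumption for illustration and is not necessarily the exact loss used in the article: a feature is penalized when it lies farther than the margin from its own class centroid, and when it lies closer than the margin to the centroid of any other class. The Euclidean distance and the function name are likewise illustrative choices.

```python
import math

def category_contrastive_loss(feature, label, centroids, margin):
    """Hinge-style centroid loss (illustrative form, not the article's formula).
    centroids: dict mapping class id -> centroid vector."""
    loss = 0.0
    for cls, centroid in centroids.items():
        d = math.dist(feature, centroid)
        if cls == label:
            loss += max(0.0, d - margin)   # pull: too far from own centroid
        else:
            loss += max(0.0, margin - d)   # push: too close to another centroid
    return loss

centroids = {0: (0.0, 0.0), 1: (4.0, 0.0)}
well_placed = category_contrastive_loss((0.5, 0.0), 0, centroids, margin=1.5)
misplaced = category_contrastive_loss((3.0, 0.0), 0, centroids, margin=1.5)
```

A feature near its own centroid and far from the others contributes no loss, while a feature that drifts toward another class is penalized by both terms, which is the behavior the margin is meant to enforce.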
In the specific implementation, we define a fixed-size tensor $\beta_i \in \mathbb{R}^{S \times D}$ per category $i$ for storing the corresponding features, where $D$ is the dimension of $\mu_v$ and $S$ represents the maximum number of stored features. In addition, we randomly select $N$ points for updating the centroid representation of each category, which strikes a balance between effectiveness and efficiency. The new centroid representation $\delta_i^{new}$ is calculated from $\beta_i$ every $I_p$ iterations. To avoid rapid fluctuation, we set a momentum $m$ between $\delta_i$ and $\delta_i^{new}$ so that the centroid in the feature space evolves steadily in an end-to-end manner. Finally, the total loss function combines the two terms, where $\lambda$ is a weight balancing the category-guided contrastive loss $L_{cont}$ and the cross-entropy loss $L_{cro}$. As a result, the differences between the prediction and the ground truth are measured, and the parameters of the model are updated using the stochastic gradient descent algorithm so that the next evaluation reduces these differences, which moves the parameters of the model toward convergence.
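The bookkeeping described above (a bounded feature store per category, a periodic centroid refresh, and a momentum update) can be sketched as follows. The class name `CentroidBank`, the convention that the momentum `m` weights the old centroid, and the additive form of `total_loss` are assumptions for illustration, since the article's exact formulas are not reproduced here.

```python
from collections import deque

class CentroidBank:
    """Keeps the latest `capacity` features per class and a smoothed centroid."""

    def __init__(self, num_classes, dim, capacity, momentum):
        self.banks = [deque(maxlen=capacity) for _ in range(num_classes)]
        self.centroids = [[0.0] * dim for _ in range(num_classes)]
        self.momentum = momentum

    def store(self, cls, feature):
        # The oldest feature is dropped automatically once the bank is full.
        self.banks[cls].append(list(feature))

    def refresh(self):
        """Meant to be called every I_p iterations by the training loop."""
        m = self.momentum
        for cls, bank in enumerate(self.banks):
            if not bank:
                continue  # no stored features yet: keep the old centroid
            dim = len(self.centroids[cls])
            new = [sum(f[c] for f in bank) / len(bank) for c in range(dim)]
            # Assumed momentum convention: m weights the previous centroid.
            self.centroids[cls] = [m * old + (1 - m) * n
                                   for old, n in zip(self.centroids[cls], new)]

def total_loss(l_cro, l_cont, lam):
    """Assumed combination: cross-entropy plus lambda-weighted contrastive term."""
    return l_cro + lam * l_cont

bank = CentroidBank(num_classes=2, dim=2, capacity=2, momentum=0.9)
for feat in ([1.0, 1.0], [3.0, 3.0], [5.0, 5.0]):
    bank.store(0, feat)
bank.refresh()
```

With a capacity of 2, only the two most recent features of class 0 remain, so the refreshed centroid moves one tenth of the way from the origin toward their mean (4, 4). A large momentum therefore damps fluctuations caused by any single batch of features.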

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. Experimental Dataset and Evaluation Metrics
To verify the effectiveness and reliability of the proposed approach, we select two well-known public benchmark datasets: the SemanticKITTI dataset [34] and the Stanford large-scale 3-D Indoor Spaces (S3DIS) dataset [35]. In the SemanticKITTI dataset, raw point clouds are manually classified into 19 categories as ground truths, and the 3-D point cloud data only provide X-Y-Z coordinates and intensity information, without RGB information. The S3DIS dataset is divided into six large-scale indoor areas, containing more than 215 million labeled points, and its raw points are manually annotated into 13 categories. Each point has 3-D coordinates and RGB information. Following previous work [27], [32], we adopt six-fold cross-validation for the evaluation on S3DIS. To measure the classification quality, we conduct the quantitative evaluation using the intersection over union (IoU) per class, mean IoU (mIoU), overall accuracy (OA), mean accuracy over classes (mAcc), and Kappa, as defined in (13)-(17). IoU is a per-class measure that penalizes false positives in addition to missed points, and the mIoU is the average of the IoU over all classes. OA denotes the sum of the true positives and true negatives divided by the total number of queried individuals, which reflects the proportion of samples correctly identified by the classifier among all samples. mAcc denotes the average of the per-class accuracies.

$$\mathrm{IoU}_i = \frac{TP_i}{GT_i + FP_i} \tag{13}$$

$$\mathrm{mIoU} = \frac{1}{V}\sum_{i=1}^{V} \mathrm{IoU}_i \tag{14}$$

$$\mathrm{OA} = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

$$\mathrm{mAcc} = \frac{1}{V}\sum_{i=1}^{V} \frac{TP_i}{GT_i} \tag{16}$$

$$\mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e} \tag{17}$$

where TP denotes the number of positives that are correctly classified as positives, TN denotes the number of negatives that are correctly classified as negatives, FN denotes the number of positives that are incorrectly classified as negatives, and FP denotes the number of negatives that are incorrectly classified as positives. $TP_i$, $GT_i$, and $FP_i$ denote the number of positives that are correctly classified as positives, the number of ground-truth points, and the number of negatives that are incorrectly classified as positives in class $i$, respectively, and $V$ is the number of categories. $p_o$ is the overall accuracy, and $p_e$ can be denoted as

$$p_e = \frac{\sum_{i=1}^{V} a_{i+}\, a_{+i}}{N^2}$$

where $a$ represents the confusion matrix, $a_{i+}$ and $a_{+i}$ denote the sums of its $i$th row and $i$th column, and $N$ is the number of samples.
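The evaluation metrics can all be derived from a confusion matrix, as the following sketch shows under the usual definitions. The function name and the row/column convention (rows are ground-truth classes, columns are predicted classes) are assumptions, and every class is assumed to appear at least once so that no denominator is zero.

```python
def segmentation_metrics(conf):
    """conf[t][p]: number of points of true class t predicted as class p."""
    v = len(conf)
    n = sum(sum(row) for row in conf)
    tp = [conf[i][i] for i in range(v)]
    gt = [sum(conf[i]) for i in range(v)]                          # TP_i + FN_i
    pred = [sum(conf[t][i] for t in range(v)) for i in range(v)]   # TP_i + FP_i
    iou = [tp[i] / (gt[i] + pred[i] - tp[i]) for i in range(v)]    # TP/(GT + FP)
    oa = sum(tp) / n
    macc = sum(tp[i] / gt[i] for i in range(v)) / v
    pe = sum(gt[i] * pred[i] for i in range(v)) / (n * n)          # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return {"iou": iou, "miou": sum(iou) / v, "oa": oa,
            "macc": macc, "kappa": kappa}

# Two-class toy example with 20 points.
conf = [[8, 2],
        [1, 9]]
metrics = segmentation_metrics(conf)
```

For this toy matrix the per-class IoU values are 8/11 and 9/12, the overall accuracy is 0.85, and Kappa is 0.7. Kappa is lower than the overall accuracy because it discounts the agreement expected by chance.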

B. Implementation Details
The experiments are implemented with the PyTorch deep learning framework [63] on Ubuntu 18.04. We train for 100 epochs on a GeForce RTX 3080 GPU (with 12 GB of memory) with a batch size of 6. We use the Adam optimizer with a weight decay of 0.00001. The initial learning rate is set to 0.004, and we adopt an exponential scheduler with gamma = 0.95 to decay the learning rate. Moreover, to prevent overfitting, dropout with p = 0.5 is added after the fully connected layer.

1) Semantic Segmentation on the SemanticKITTI Dataset:
Table I gives the quantitative comparisons with different existing models on the SemanticKITTI dataset. It illustrates that our proposed CGGC-Net surpasses the other approaches by a large margin, with an mIoU of 58.4%. In detail, the CGGC-Net demonstrates a remarkable advantage in classifying small instances such as person, bicycle, motorcycle, and bicyclist, achieving 58.8%, 35.2%, 40.8%, and 57.6%, respectively.
In addition, some qualitative results are visualized in Fig. 5, where the first and third rows represent the ground truth and the second and fourth rows represent our prediction. We can observe that our CGGC-Net is able to classify most objects and still performs well in regions that are incomplete due to occlusions or defects. This can be attributed to the geometric structure encoding, which captures inherent geometric spatial relations within neighborhoods to provide more geometric information for the GSAGCM. Therefore, we conclude that our CGGC-Net is capable of capturing and exploiting both the local geometric and semantic information of small local regions as well as incomplete regions.
Moreover, the visualization of the confusion matrix is also provided in Fig. 6. Kappa reaches 0.847, demonstrating that our proposed CGGC-Net is an excellent classifier for semantic segmentation of large-scale outdoor scene point clouds.

2) Semantic Segmentation on the S3DIS Dataset:
To further evaluate the effectiveness of the proposed network in a large-scale indoor scenario, experiments are reported on the S3DIS dataset. In our implementation, the six-fold cross-validation strategy is applied, where, in each fold, five areas are used as the training set and the remaining area is used for evaluation. Table II gives the comparative quantitative results with different existing models on the S3DIS dataset. It shows that the OA and mIoU reach 88.5% and 70.2%, respectively. In particular, our method achieves the highest accuracy for the floor, beam, window, and sofa categories. It is worth noting that our proposed CGGC-Net is superior to CAN [32], even though the latter can capture long-range dependencies to enhance the representation of point clouds.
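The six-fold protocol described above amounts to holding out each area once, which can be written as a short sketch. The function name and the area labels are illustrative assumptions.

```python
def six_fold_splits(areas):
    """Return one (training areas, held-out area) pair per area."""
    return [([a for a in areas if a != held_out], held_out)
            for held_out in areas]

areas = ["Area_1", "Area_2", "Area_3", "Area_4", "Area_5", "Area_6"]
folds = six_fold_splits(areas)
for train_areas, test_area in folds:
    print(test_area, "<-", train_areas)
```

Each area serves exactly once as the evaluation set, so every labeled point contributes to the reported metrics while never being evaluated by a model that was trained on it.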
Moreover, the detailed semantic segmentation results of the six areas are reported in Table III, and the associated visualization of the confusion matrix is shown in Fig. 8. These metrics indicate that our CGGC-Net is well suited to classifying large-scale indoor scene point clouds. Fig. 7 shows selected examples on the S3DIS dataset. We can observe that our CGGC-Net performs well in all categories, especially wall, beam, door, and chair. Owing to the multilayer transmission of the local geometric structure and the GSAGCM, the network can capture geometric and semantic relations over long distances. As a result, there are few mistakes at the boundaries of objects.

D. Sensitivity Analysis of the Number of Neighbors
The number of nearest neighbors directly determines the description of the local geometric structure as well as the extraction of semantic features in the GSAGCM. Thus, a series of comparative experiments is conducted to examine the influence of the parameter K, which is set to 8, 12, 16, 20, and 24. Fig. 9 shows the sensitivity of the classification quality to the neighborhood size. When K is set to 8, CGGC-Net cannot effectively extract the geometric and semantic features because of the limited neighborhood information. As the neighborhood grows, the classification quality progressively improves, with the mIoU varying by over 5%. However, when K reaches 24, a small degradation appears, possibly due to noise and the adhesion of adjacent objects. Fig. 10 gives a detailed per-category comparison for different numbers of neighbors. We can observe that K has a more prominent impact on small-scale instances, such as bicycles, trucks, and other-vehicles, whereas large-scale instances such as buildings, roads, and vegetation are only slightly influenced. Considering the classification performance and computational cost, we set K to 16 in our work.
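To make the role of K concrete, the neighborhood of a query point can be built by a K-nearest-neighbor search. The following brute-force sketch is for illustration only: excluding the query point from its own neighborhood is an assumption, and a practical pipeline would typically rely on a spatial index rather than a linear scan.

```python
import heapq
import math


def k_nearest_neighbors(points, query_index, k):
    """Indices of the k points closest to points[query_index], nearest first.

    Assumption: the query point itself is excluded from its neighborhood.
    """
    query = points[query_index]
    candidates = (
        (math.dist(query, point), index)
        for index, point in enumerate(points)
        if index != query_index
    )
    return [index for _, index in heapq.nsmallest(k, candidates)]


# Toy point cloud (hypothetical coordinates).
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
print(k_nearest_neighbors(cloud, query_index=0, k=2))
```

A larger K widens the neighborhood, which is consistent with the observation above that an overly large neighborhood may start to include points of adjacent objects.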

E. Sensitivity Analysis of the Length of the Stored Tensor β_i
The parameter S controls the length of each β_i, determining the number of the most recent features stored during iteration. Unlike in contrastive learning applied to 2-D images, a higher value of S is recommended, as β_i is renewed rapidly given the massive number of points. However, as shown in Table IV, the mIoU tends to decrease when S exceeds 200. Therefore, S is set to 200 in our implementation.
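The behavior of a fixed-length store of recent features can be sketched with one first-in, first-out queue per category. The class name `CategoryMemoryBank` and the choice of the mean of the stored features as the centroid are assumptions made for illustration; the sketch is not the exact update rule of Section III.

```python
from collections import deque


class CategoryMemoryBank:
    """Keeps the S most recent feature vectors of each category (hypothetical helper)."""

    def __init__(self, num_classes, length):
        # deque(maxlen=S) silently discards the oldest entry once S entries are stored.
        self.banks = [deque(maxlen=length) for _ in range(num_classes)]

    def update(self, features, labels):
        """Append each feature vector to the queue of its category."""
        for feature, label in zip(features, labels):
            self.banks[label].append(list(feature))

    def centroid(self, category):
        """Mean of the stored features (assumed centroid); None if nothing is stored yet."""
        bank = self.banks[category]
        if not bank:
            return None
        dim = len(bank[0])
        return [sum(feature[d] for feature in bank) / len(bank) for d in range(dim)]


bank = CategoryMemoryBank(num_classes=2, length=3)
bank.update([[1.0], [2.0], [3.0], [4.0], [5.0]], [0, 0, 0, 0, 0])
print(len(bank.banks[0]), bank.centroid(0))
```

With a small S the store reflects only the latest iterations, whereas a large S retains features produced by older network weights, which is one plausible reading of the trend in Table IV.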

F. Sensitivity Analysis of the Margin in Contrastive Loss
The margin parameter Δ is a criterion for the similarity measure based on category-specific distances. It defines the maximum and minimum distance between input features and the centroid representation of the same class in the feature space. The results are given in Table V. It is worth noting that, although in theory the classification results may improve as the separation between categories increases, the mIoU peaks when Δ is set to 1.5.
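The effect of a margin can be illustrated with a generic margin-based contrastive term. The sketch below is a textbook-style formulation chosen for illustration and is not claimed to be the exact L_cont of this article: a feature is pulled toward the centroid of its own category and pushed away from any other centroid that lies closer than Δ. The function name and the toy centroids are assumptions.

```python
import math


def margin_contrastive_loss(feature, label, centroids, margin):
    """Generic margin-based contrastive term for one feature vector (illustrative only)."""
    # Pull term: squared distance to the centroid of the feature's own category.
    pull = math.dist(feature, centroids[label]) ** 2
    # Push term: penalize other centroids only while they are closer than the margin.
    push = 0.0
    for category, centroid in enumerate(centroids):
        if category == label:
            continue
        push += max(0.0, margin - math.dist(feature, centroid)) ** 2
    return pull + push


centroids = [[0.0, 0.0], [3.0, 0.0]]  # hypothetical category centroids
print(margin_contrastive_loss([2.0, 0.0], label=0, centroids=centroids, margin=1.5))
```

In this form, increasing Δ enlarges the region in which features of other categories are penalized, so an overly large margin can dominate the pull term; this is one possible explanation for a peak at an intermediate value.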

G. Sensitivity Analysis of the Number of Selected Points in Each Iteration
As explained in Section III.C, it is infeasible to update β_i using all the points of each iteration (almost 3 × 10^5) because of the large amount of data. Hence, we randomly select a fixed number of points, N. Here, we vary N, and the experimental results are given in Table VI. Based on these results, N is set to 5000 in our CGGC-Net.
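The random selection can be sketched as follows. Sampling without replacement and the fallback when fewer than N points are available are assumptions made for illustration; the helper name `sample_point_indices` is hypothetical.

```python
import random


def sample_point_indices(num_points, n, seed=None):
    """Return the indices of n randomly chosen points (without replacement, assumed)."""
    if n >= num_points:
        # Fewer points than requested: keep all of them.
        return list(range(num_points))
    rng = random.Random(seed)
    return rng.sample(range(num_points), n)


# Roughly 3 x 10^5 points per iteration, of which 5000 are kept.
selected = sample_point_indices(300_000, 5000, seed=0)
print(len(selected))
```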

H. Ablation Studies
In this section, extensive ablation experiments are carried out to further demonstrate the effectiveness of our CGGC-Net.
1) Ablation Study of Detailed Geometric Structure Encoding: In the detailed geometric structure encoding module, a local geometric structure descriptor is employed to describe the inherent spatial relations within the neighborhood, and the local geometric structure transmission is designed to further augment the geometric information with receptive fields of different sizes. Comparative results for different settings are given in Table VII. Simplifying the geometric structure causes a lack of local geometric information in the neighborhood, resulting in a decrease of 1.4% in mIoU. In addition, we observe that the different forms of transmission play a prominent role in enriching the local geometric structure. Specifically, transmitting the geometric structure feature across single and multiple layers increases the mIoU by 2.2% and 3.8%, respectively.
2) Ablation Study of GSAGCM: Based on the PEConv operation, the GSAGCM utilizes both geometric and semantic information to extract the semantic relations between adjacent points, thereby aggregating semantic contexts. To measure the performance of the GSAGCM under comparable settings, we conduct the series of experiments given in Table VIII. The most pronounced impact is brought by the GSAGCM itself, leading to increases of 12.0% and 2.4% in mIoU and OA, respectively, which demonstrates that geometric and semantic information is critical for the semantic segmentation of point clouds. In addition, the number of stacked PEConvs is of great significance, with the adopted setting achieving the highest mIoU (approximately 59.2%). Using a single layer could prevent information propagation from a broader perspective, resulting in a decrease of 5.3% in mIoU. Furthermore, when the edge attribute feature adopts h_Θ(e_ji, G_i), there is a decline of 6.4% and 1.5% in mIoU and OA, respectively. Moreover, different forms of residual connections in the GSAGCM affect the model to some extent: removing the residual connections and using an MLP as the shortcut reduce the mIoU by 6.4% and 10.7%, respectively. Finally, although the max pooling operation retains the most prominent features within the neighborhood, some useful information is inevitably lost compared with attentive pooling, contributing to a decrease of 1.8% in mIoU.
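The difference between the two pooling operations compared above can be sketched on a single neighborhood. In this illustration, the attention scores are passed in directly, whereas in a network they would be produced by a learned function; the softmax-over-neighbors form and all names are assumptions and do not reproduce the GSAGCM itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a sequence of scores."""
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(neighbor_features):
    """Keep, per channel, only the largest response among the neighbors."""
    return [max(channel) for channel in zip(*neighbor_features)]


def attentive_pool(neighbor_features, scores):
    """Per channel, a softmax-weighted sum over the neighbors (scores share the feature shape)."""
    pooled = []
    for channel, channel_scores in zip(zip(*neighbor_features), zip(*scores)):
        weights = softmax(channel_scores)
        pooled.append(sum(w * f for w, f in zip(weights, channel)))
    return pooled


features = [[1.0, 4.0], [3.0, 2.0]]  # two neighbors, two channels (toy values)
uniform_scores = [[0.0, 0.0], [0.0, 0.0]]
print(max_pool(features), attentive_pool(features, uniform_scores))
```

Max pooling discards every neighbor except the strongest one in each channel, whereas attentive pooling lets all neighbors contribute in proportion to their weights.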
3) Ablation Study of the Category-Contrastive and Cross-Entropy Guided Optimization Strategy: The category-contrastive and cross-entropy guided optimization strategy adopts an additional weighted contrastive loss L_cont, which induces the high-dimensional semantic features to be more discriminative. The results of introducing the contrastive loss on different datasets are given in Table IX. On SemanticKITTI, the additional contrastive loss brings improvements of 1.2% and 0.4% in mIoU and OA, respectively. On the S3DIS dataset, it likewise yields an increase ranging from 0.7% to 2.0% in mIoU. We conclude that the utilization of inter-category information in contrastive learning is of great significance in both large-scale indoor and outdoor scenarios.
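The combination of the two objectives can be sketched as a weighted sum of a cross-entropy term and a contrastive term. The weight value used below and the function names are assumptions made for illustration; the contrastive term is passed in as a precomputed number.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one point, computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(logits, label, contrastive_term, weight):
    """Cross-entropy plus a weighted contrastive term (the weight is a hypothetical hyperparameter)."""
    return cross_entropy(logits, label) + weight * contrastive_term


print(multitask_loss([0.0, 0.0], label=0, contrastive_term=2.0, weight=0.5))
```

Setting the weight to zero recovers the cross-entropy-only baseline of the ablation.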

I. Visualization in Latent Feature Space of Contrastive Learning
In this section, to illustrate the clustering effect of contrastive learning more explicitly, the latent features obtained with and without the contrastive loss are visualized using t-distributed stochastic neighbor embedding (t-SNE). As shown in Fig. 11, after applying the contrastive loss, points belonging to the same category tend to gather together in the latent feature space, while those belonging to different categories are pushed apart. In brief, we conclude that the feature representations become more discriminative in the process of minimizing the category-specific distances, which benefits the determination of semantics and enhances the accuracy of the multitask classification results.

V. CONCLUSION
With the rapid development of 3-D scanners, the semantic segmentation of LiDAR point clouds has become a foundation for spatial intelligent perception and a trending topic in recent years. In this article, we develop a category-contrastive-guided graph convolutional network for the semantic segmentation of LiDAR point clouds. First, the detailed local geometric structures are designed to extract the inherent geometric information and combine it across different receptive fields. Then, a GSAGCM utilizes multiple stacked PEConvs and attentive pooling to extract and transmit the semantic relationship information of neighboring points, aggregating new semantic features for each point in parallel. Finally, by introducing the contrastive loss, the semantic features generated by the preceding encoder-decoder architecture become more discriminative, benefiting the transformation to pointwise classification scores in the subsequent classification layer. Experiments on the SemanticKITTI and S3DIS datasets show that CGGC-Net performs well in both large-scale outdoor and indoor scenarios and is capable of classifying small and even incomplete instances.
Nevertheless, the fully supervised semantic segmentation of large-scale point clouds requires time-consuming and laborious dense annotation. Therefore, in the future, we will explore weakly supervised point cloud semantic segmentation.