Learning from Mistakes: Self-Regularizing Hierarchical Representations in Point Cloud Semantic Segmentation

Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.


I. INTRODUCTION
Semantic scene understanding is a challenging computer vision problem that finds application in various fields including autonomous driving, robot sensing, and virtual reality.
Specifically, semantic segmentation is the most fine-grained scene understanding task that provides point-wise labeling on image pixels or 3D points. Since the refinement of classification accuracy can bring immeasurable benefits on different tasks (from navigation control to action planning), recent research has focused on improving different deep learning models for heterogeneous types of sensed data (e.g., RGB images, depth maps, and point clouds). Such improvements can be achieved in different ways. One possibility involves the adoption of enhanced processing architectures [38] that better parameterize the input data with respect to the segmentation task. In this case, it is possible to improve the results appreciably at the cost of higher computational loads and memory requirements. Current state-of-the-art solutions are typically built on top of autoencoder architectures or fully-convolutional models [38], and their inner structure strongly depends on the task and the properties of the processed data, without relying on specific class-conditional constraints or abstractions.
Fig. 1: We identify semantic macro communities (e.g., vehicles) of micro classes (e.g., car and truck) by automatically analyzing the accuracy results of any semantic segmentation model. We regularize model training with 2 components. Top: a macro-aware fairness (F) score on the micro classes promotes homogeneous scores within each macro cluster. Bottom: class-conditional latent features-to-prototype alignment at 2 levels (micro and macro) improves class-wise feature discrimination.
A possible alternative consists in designing more accurate learning paradigms that are able to self-optimize the final performance even if architecture and datasets remain the same (without any additional intervention or information by the designer). To this end, recent works [23], [34], [82] have focused on learning paradigms that maximize the semantic segmentation performance without increasing the size or complexity of the network. Such approaches exploit either body-edge features [34], self-supervised depth estimation [23], or pseudo labels [82], and do not perform any adaptive self-regularization estimated from a preliminary class-conditional accuracy. The main advantage of learning-based enhancement is that the training process structures the data in a self-supervised manner, so that the memory and computational requirements remain the same at inference time while the final accuracy improves. This proves to be extremely desirable in many application scenarios (e.g., embedded deep learning schemes with real-time constraints) where low latency and a limited network size cannot be traded for improved performance.
Following this trend, we propose LEAK (LEArning from mistaKes), a novel coarse-to-fine learning strategy that automatically optimizes the performance of a semantic segmentation network by shaping class subspaces according to classification errors and class imbalance. Remarkably, our approach does not require additional supervision by human operators, making the full training process self-organized. The core idea relies on dividing the feature space into micro and macro regions (where a macro space is an aggregation of micro class subspaces fused according to their mutual misclassification probability) and balancing them according to their representativeness.
First, we use a pre-trained standard segmentation model to derive a confusion matrix over the (micro) classes contained in our dataset. Then, the optimization routine identifies macro classes that include visually similar micro classes by means of spectral clustering [1] on the confusion matrix. Empirically, we verified that macro classes include similar semantic content, as expected.
Such micro-macro partitioning is used to derive different regularization terms. A fairness-enforcing constraint is included in the training loss to make classification errors uniformly distributed regardless of the sample frequency or accuracy per class (upper part of Fig. 1), and hierarchy-aware class-conditional regularization constraints are introduced to embed feature vectors of the same class tightly around their micro and macro prototypical representations (lower part of Fig. 1).
We tested LEAK on different point cloud semantic segmentation networks (RandLA-Net [24], Cylinder3D [81], RangeNet++ [43]) and datasets, including sequential LiDAR data (SemanticKITTI [4]) and static datasets (Semantic3D [20], S3DIS [3]) acquired with laser scanners and other technologies. The proposed framework can be seamlessly adapted to different scenarios and is agnostic to the architecture and dataset. Its generality was also verified by adapting it to a standard semantic image segmentation dataset, i.e., Pascal-VOC2012 [16], using DeepLabV3 [10]. Some recent approaches [42], [47], [63], [80] have highlighted the opportunities of a feature-level clustering using class prototypes to characterize a generic feature for each class; however, experimental results show that hierarchy-awareness can significantly boost semantic recognition.
We can summarize the main contributions of this work as follows. (1) We propose a general framework for semantic segmentation that is adaptable to different experimental scenarios and contexts. (2) We identify a semantically-consistent partition of classes in macro categories by inspecting the confusion scores of a standard model via spectral clustering. Previous coarse-to-fine approaches [40], [54], [56], [57] usually need human supervision in defining the hierarchical split. LEAK adopts a fully automated procedure to extract and concretely encode such information, constraining the network with output-level fairness and feature-level regularization. (3) We devise a hierarchy-aware fairness constraint to balance classification scores regardless of the frequency or accuracy of each class. Output-level fairness has been widely investigated in resource allocation [27]; however, no prior work includes fairness measures in loss definitions for training deep architectures. (4) We compute a class-conditional hierarchical prototype structure that enforces an alignment of the generated feature vectors around their prototypical representation. Some recent approaches [42], [47], [63], [80] have highlighted the opportunities of a feature-level clustering to characterize a generic feature for each class. However, experimental results show that hierarchy awareness can significantly boost semantic recognition. (5) We benchmark our approach on different standard point cloud and RGB semantic segmentation datasets, outperforming state-of-the-art architectures.

II. RELATED WORK
Point Cloud Semantic Segmentation (PCSS) has been tackled using different methods and architectures [8]. A first set of approaches relies on discretization methods that transform point clouds into discrete data structures. These structures can be dense, like voxels [60] or octrees [51], or sparse, like permutohedral lattices [52], and can be treated as three-dimensional images where convolutions can be applied. Another category of methods consists in projecting the point cloud on a bi-dimensional structure to infer predictions and map it back in a later stage. The projection methods are based either on multi-view [58], spherical [43] or cylindrical [74] projections. Deep learning architectures used in these cases are usually well-established convolutional neural networks (CNN) pretrained on image datasets. Compared with discretization-based models, these methods are able to improve the performance for different tasks by taking multiple views of the object or scene of interest. In addition, they are efficient in terms of computational complexity. Finally, point-based methods avoid limitations posed by both projection- and discretization-based methods, e.g., loss of structural information, via direct processing of the raw point cloud data. Among these methods we can distinguish point-wise MLP approaches [24], [48], [49], point-convolutions [35], RNN-based [26], [71] and graph-based methods [32], [67].
However, the most recent approaches rely on 4D convolutions [17] or transformers [19], [31] to accomplish the task, and they need substantial computational power and storage capacity. More lightweight approaches consist in a mixture of methods; for example, some architectures provide voxel-wise predictions refined with point-wise labels [81]. These approaches have been recently exploited with Knowledge Distillation methods that combine the two data representations in order to improve performance [22].
In [55] the idea that there exists an embedding in which points cluster around a single prototype representation for each class is first formulated, and then features are assigned to the class of the closest prototype (nearest neighbor). Since then, prototypes have been employed in many ways. Some works employ prototypical contrastive learning by clustering features of the same class tightly around their prototype, while spacing apart features of different classes [33]. Matching of prototypes improved generalization across domains [36], [45] and reduced forgetting when distilling knowledge from a support set of prototypes [9], [41].
Multiple class-conditional prototypical representations have been employed in [2], [70], [79] to better capture the complex statistical distribution of the extracted features. However, to the best of our knowledge, no prior work investigates the interaction of coarse- and fine-level prototypes to leverage standard supervised model training. For point clouds, prototypes have been used to support few-shot PCSS by either composing fine-level prototypes [33] or building an attention mechanism from multiple prototypes [79].
Hierarchical composition of semantic representations has been explored in a few previous works for part-based regularization [33], [70] and coarse-to-fine approaches where coarse-level classes are refined into finer categories [40], [56]. However, they all require an explicit micro-to-macro assignment.

III. METHODOLOGY
Given an input point cloud, i.e., a set of $N$ 3D points $X = \{x_1, x_2, \ldots, x_N\}$, and a set of candidate semantic labels $Y = \{y_1, y_2, \ldots, y_N\}$, the objective of semantic segmentation is to associate each input point $x_i$ with a semantic label $y_i$.
Such input points can be treated in several ways: (1) they can be discretized as voxels, (2) they can be projected and treated as 2D images, or (3) they can be processed directly as they are. We devise experiments on each of the three methodologies, addressing the main focus on the direct processing method. In the following paragraphs we provide a detailed explanation of each component, i.e., clustering on the standard model mistakes, class balancing through fairness enforcement, and feature-prototype alignment in the latent space. Note that the first two components are totally independent of the model, while the latter weakly depends on how the construction of prototypical features occurs within the architecture, which in turn depends on the processing method used for point clouds.
An overall scheme of our approach is shown in Fig. 2. Spectral clustering is applied on the confusion matrix inferred from the standard pre-trained model. The hierarchical partition obtained over the set of classes is used within the fairness objective at the output and to build prototypes at both micro and macro levels. This way the model learns from mistakes by adopting a semantic-driven self-regularization approach, and obtains an overall improvement against the standard solution.

A. Learning from Mistakes
The first building block of LEAK, and the core of its self-regularization strategy, is learning from mutual semantic misclassifications. Generally, standard supervised approaches train models from scratch relying on an annotated training set; coarse-to-fine hierarchical approaches exploit additional information, grouping ground truth labels a priori into several macro categories [56] generated via a human annotation activity. Conversely, LEAK performs a posteriori unsupervised clustering of classes, independently of the specific dataset and architecture. Indeed, the class partition is derived from the misclassifications produced by the standard segmentation method. Such errors provide meaningful feedback about the feature space organization and highlight the classes that can be easily confused. This, in turn, permits optimizing their hidden representations, enhancing the separation between the corresponding semantic regions.
A pre-trained standard segmentation model is employed to infer predictions on the validation set, computing the confusion matrix over classes. This matrix $A$ is considered as an adjacency matrix associated with a complete graph $G$, where the different classes are assigned to nodes and the conditional error probabilities are the edge weights. We identify the nodes of $G$ with $\{c_i\}$, $i \in [0, m)$, where $m$ is the total number of classes, and the edges with $\{d_{i,j}\}$, $i, j \in [0, m)$, where $i$ and $j$ denote the ground truth and predicted class index, respectively. The edge $d_{i,j}$ is associated with the probability of classifying ground truth class $c_i$ into predicted class $c_j$. We use this representation to draw the subdivision in communities with a clustering algorithm, identifying $M$ clusters, i.e., the macro classes. Specifically, this partitioning is performed via spectral clustering, which is commonly used to identify communities of nodes in a graph based on the edges connecting them. The adjacency matrix $A$ is provided as input and gives a quantitative assessment of the relative similarity of each pair of nodes (here, classes). The algorithm follows an iterative procedure that exploits the eigenvalues of the similarity matrix to conduct dimensionality reduction and progressively subdivides the network into two clusters until the optimal number of communities is reached. This number is estimated a priori thanks to the graph conductance measure [28], where the optimal number of clusters corresponds to the number of local minima. The communities found at this step represent the macro-grouping of classes. The left side of Fig. 2 shows a visual representation of the effects brought by the spectral clustering algorithm on the graph and the confusion matrix. Each color corresponds to a set of nodes (i.e., micro classes) that belong to the same cluster (i.e., macro class). The tree structure of the left side in Fig. 1 is derived bottom-up with this approach and shows the classes' hierarchical organization.
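The graph construction and a single bisection step can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that are not stated in the text: the affinity between two classes is taken as the mean of the two conditional error probabilities (to obtain a symmetric matrix), the second eigenvector is found by power iteration, and the function names `error_affinity`, `spectral_bisect`, and `conductance` are hypothetical. A full procedure would repeat the bisection and use conductance to choose the number of clusters.

```python
import math

def error_affinity(confusion):
    """Row-normalize counts to P(pred=j | gt=i), then symmetrize off-diagonal entries."""
    m = len(confusion)
    prob = []
    for row in confusion:
        total = sum(row)
        prob.append([v / total if total else 0.0 for v in row])
    # Assumption: affinity of classes i and j is the mean of both error directions.
    return [[0.0 if i == j else 0.5 * (prob[i][j] + prob[j][i]) for j in range(m)]
            for i in range(m)]

def spectral_bisect(W, iters=500):
    """Split nodes by the sign of the second eigenvector of D^-1/2 W D^-1/2.

    Assumes every node has non-zero degree (each class is confused at least once).
    """
    m = len(W)
    s = [math.sqrt(sum(row)) for row in W]      # leading eigenvector is sqrt(degree)
    norm_s = math.sqrt(sum(x * x for x in s))
    u = [x / norm_s for x in s]
    v = [float(i + 1) for i in range(m)]        # deterministic start vector
    for _ in range(iters):
        # Deflate the trivial eigenvector, then apply (I + D^-1/2 W D^-1/2) / 2.
        dot = sum(a * b for a, b in zip(v, u))
        v = [a - dot * b for a, b in zip(v, u)]
        nv = [0.5 * (v[i] + sum(W[i][j] * v[j] / (s[i] * s[j]) for j in range(m)))
              for i in range(m)]
        norm = math.sqrt(sum(x * x for x in nv))
        v = [x / norm for x in nv]
    return [0 if x >= 0 else 1 for x in v]

def conductance(W, labels):
    """Cut weight between the two groups divided by the smaller group volume."""
    m = len(W)
    cut = sum(W[i][j] for i in range(m) for j in range(m)
              if labels[i] == 0 and labels[j] == 1)
    vol0 = sum(sum(W[i]) for i in range(m) if labels[i] == 0)
    vol1 = sum(sum(W[i]) for i in range(m) if labels[i] == 1)
    return cut / min(vol0, vol1)
```

On a toy confusion matrix in which car/truck and road/sidewalk are mutually confused, the bisection is expected to separate the two pairs, and the resulting cut has low conductance.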

B. Feature-Prototype Alignment
Prototypes (i.e., class centroids) are non-learnable vectors in the feature space that are representative of each semantic category that appears in the dataset [44], [72], [80]. During training, the features extracted by the encoder contribute to forming the latent prototypical representation for both micro and macro classes. Class prototypes $\Gamma_c$ ideally represent the features' objective for the respective class at each training step.
Their computation occurs in place with a running average updated at each training step with supervision. At training step $t$ with batch $\mathcal{B}$ of $B$ total samples, the prototypes are updated for a generic class $c$ as:
$$\Gamma_c[t] = \frac{1}{k_c[t] + n_c}\left(k_c[t]\,\Gamma_c[t-1] + \sum_{\phi_c \in \mathcal{B}} \phi_c\right) \qquad (1)$$
where $\phi_c$ is a feature vector in the current batch $\mathcal{B}$ corresponding to class $c$, $k_c[t]$ is the number of feature vectors corresponding to class $c$ met in all previous batches, and $n_c$ is the number of feature vectors corresponding to class $c$ in the current batch $\mathcal{B}$. Therefore, $k_c[t+1] = k_c[t] + n_c$. The correspondence of each feature vector $\phi$ with class $c$ is based on the idea that a general encoder network preserves local structures from the input space, whether it is composed of convolutional or MLP layers. Therefore, the ground truth labels of each point cloud are tracked throughout the encoder to reach the latent space and provide semantic labels for it. Then, features with the same semantic class $c$ are aggregated to contribute to the construction of prototype $\Gamma_c$.
Class prototypes are initialized at the start of training. We use the $\ell_1$ norm $\|\cdot\|_1$ as metric distance. We report in Suppl. Mat. an ablation on the loss function used. The micro-level loss function $L_{P_m}$ (Eq. (2)) penalizes the $\ell_1$ distance between each feature vector and the prototype of its class. The integration of prototypical representations in the objective function promotes a self-driven progressive regularization of the latent space, forcing an alignment of new incoming features to their prototypes (lower part of Fig. 1). We compute prototypes at both micro and macro levels.
Macro prototypes are obtained following Eq. (1), but considering the macro class $C = f(c)$ instead of the micro class $c$, with $f(\cdot)$ being the micro-to-macro mapping function identified by our spectral clustering algorithm. The respective loss function $L_{P_M}$ is drawn as in Eq. (2), considering $M$ macro classes indexed by $C$.
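The running-average prototypes and the alignment penalty can be sketched in a few lines of pure Python. Several choices here are assumptions made for illustration and are not taken from the text: prototypes start as zero vectors, the loss is averaged first within each class and then over the classes present in the batch, and the names `PrototypeBank` and `to_macro` are hypothetical.

```python
from collections import defaultdict

class PrototypeBank:
    """Non-learnable class prototypes kept as running means of labelled features."""

    def __init__(self, dim):
        self.dim = dim
        # Assumption: prototypes start at zero; with a zero count the first
        # update simply yields the batch mean.
        self.proto = defaultdict(lambda: [0.0] * dim)
        self.count = defaultdict(int)  # features seen so far, per class

    def update(self, features, labels):
        """Fold the features of the current batch into the per-class running means."""
        sums = defaultdict(lambda: [0.0] * self.dim)
        n = defaultdict(int)
        for f, c in zip(features, labels):
            sums[c] = [a + b for a, b in zip(sums[c], f)]
            n[c] += 1
        for c in n:
            k = self.count[c]
            self.proto[c] = [(k * p + s) / (k + n[c])
                             for p, s in zip(self.proto[c], sums[c])]
            self.count[c] = k + n[c]

    def alignment_loss(self, features, labels):
        """Mean over batch classes of the mean l1 distance to the class prototype."""
        per_class = defaultdict(list)
        for f, c in zip(features, labels):
            per_class[c].append(sum(abs(a - b) for a, b in zip(f, self.proto[c])))
        return sum(sum(d) / len(d) for d in per_class.values()) / len(per_class)

def to_macro(labels, micro_to_macro):
    """Apply the micro-to-macro mapping f to a list of micro labels."""
    return [micro_to_macro[c] for c in labels]
```

Under these assumptions, micro and macro prototypes use the same class: one bank is fed the micro labels, and a second bank is fed `to_macro(labels, f)`.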
Including both micro and macro prototypes, we simultaneously increase coarse and fine expressiveness (division in well-separated clusters) for the latent representations, promoting a meaningful hierarchical organization of feature samples. The addition of feature-level regularization has been shown to be beneficial also for different tasks such as image semantic segmentation [69].
Note that features are latent representations of input points, but they usually have lower resolution and different shapes with respect to the input tensors. To assign feature labels we must account for the specific encoder structure and subsampling method. Therefore, feature labels are evaluated by propagating input labels through the encoder. The three selected architectures represent exemplar networks of the three most common methods to process point clouds: RandLA-Net [24] is a point-based method built of MLP layers that sub-samples points using random sampling; Cylinder3D [81] partitions the 3D space, discretizing point clouds with cylindrical voxel-like structures; RangeNet++ [43] projects point clouds on 2D surfaces to process them like images.
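Two simple ways of carrying labels down to a lower-resolution feature set are sketched below. Both are assumptions made for illustration rather than the procedure of any specific architecture: under random sub-sampling the retained points keep their own labels, and under voxelization each voxel takes the majority label of its points. The function names are hypothetical.

```python
import random
from collections import Counter

def random_subsample_labels(labels, keep, seed=0):
    """Keep `keep` randomly chosen points; each retained point keeps its own label."""
    idx = sorted(random.Random(seed).sample(range(len(labels)), keep))
    return idx, [labels[i] for i in idx]

def voxel_majority_labels(voxel_ids, labels):
    """Give each voxel the most frequent label among the points it contains."""
    votes = {}
    for v, y in zip(voxel_ids, labels):
        votes.setdefault(v, Counter())[y] += 1
    return {v: c.most_common(1)[0][0] for v, c in votes.items()}
```

The returned indices can be reused to select the matching feature rows, so that features and labels stay aligned after each sub-sampling stage.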

C. Attentive Fair Weighting
To enforce the regularization effect of feature-prototype alignment, an attentive per-class weighting scheme is introduced. This constraint is derived from the experimental observation that the number of points per class has a significant influence on the classification accuracy for that class. For example, in SemanticKITTI [4] the most frequent classes, e.g., vegetation or road, obtain higher accuracy with respect to the least frequent ones, e.g., person, independently of the underlying architecture. Besides, in many practical applications, the least frequent classes are the most critical ones (e.g., person in an automotive scenario).
We propose a regularization objective derived from Jain's fairness index $F$ [27] to provide a balanced per-class weighting.
In other words, we address a resource allocation problem within each macro class, considering micro classes belonging to the same macro class as the users that are sharing the same resource:
$$F_C = \frac{\left(\sum_{c \in C} \pi_{c,c}\right)^2}{m_C \sum_{c \in C} \pi_{c,c}^2}$$
where $m_C$ is the number of (micro) classes within macro class $C$, $\pi_{c,c}$ represents the $c$-th element in vector $\pi_c$, and $\pi_c$ is the average prediction vector for class $c$, obtained as:
$$\pi_c = \frac{1}{n_c} \sum_{p_c \in \mathcal{B}} p_c$$
where $p_c$ is a generic prediction vector with ground truth class $c$ and $n_c$ is the number of points labeled as $c$ in the current batch $\mathcal{B}$.
A high fairness value denotes a truly balanced resource allocation among the entities, while low values of fairness show an unbalanced share of resources (upper part of Fig. 1).
Therefore, in order to preserve accuracy homogeneity among classes, we design a fairness-based loss function $L_F$ that penalizes low fairness values within each macro class. While prototype alignment forces a semantic target representation for each class at micro and macro levels, the attentive weighting constraint based on fairness aims at providing unbiased output predictions within each macro class. In other words, by imposing such a weighting constraint on predictions of the same macro class, we give the same importance to all the semantically consistent micro classes.
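As an illustration, Jain's fairness index can be computed per macro class from the average true-class scores, as in the sketch below. The index itself follows its standard definition; the loss form (one minus the mean fairness over macro classes) and the helper names are assumptions made for illustration, not the exact objective used in training.

```python
def mean_true_class_prob(probs, labels):
    """Average, per class c, the predicted probability of c over points labelled c."""
    sums, counts = {}, {}
    for p, c in zip(probs, labels):
        sums[c] = sums.get(c, 0.0) + p[c]
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), ranging from 1/n to 1."""
    sq = sum(v * v for v in values)
    return (sum(values) ** 2) / (len(values) * sq) if sq > 0 else 1.0

def macro_fairness(pi_diag, micro_to_macro):
    """Fairness of the true-class scores inside each macro class."""
    groups = {}
    for c, score in pi_diag.items():
        groups.setdefault(micro_to_macro[c], []).append(score)
    return {C: jain_index(v) for C, v in groups.items()}

def fairness_loss(pi_diag, micro_to_macro):
    # Assumption: penalize 1 - fairness, averaged over macro classes.
    fair = macro_fairness(pi_diag, micro_to_macro)
    return 1.0 - sum(fair.values()) / len(fair)
```

A macro class whose micro classes all receive the same average score reaches fairness 1 and contributes nothing to this loss, while a macro class dominated by one micro class is penalized.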

D. Objective Function
The training objective is given by the combination of the base loss function for each architecture ($L_0$) with the additional objective given by the LEAK components. The base loss function depends on the selected architecture. It corresponds to the standard cross-entropy loss with inverse class weighting for RandLA-Net [24] and RangeNet++ [43], to the Lovász-softmax loss [5] for voxel features plus the cross-entropy loss with inverse class weighting for point-feature refinement in Cylinder3D [81], and to the plain cross-entropy loss for DeepLabV3 [10].
The LEAK components are given by micro-level ($L_{P_m}$) and macro-level ($L_{P_M}$) feature-prototype alignment objectives, and a class-wise attentive weighting constraint ($L_F$). The full objective is then computed as a weighted combination of $L_0$, $L_{P_m}$, $L_{P_M}$, and $L_F$, where the balancing hyper-parameters have been tuned using a validation set.
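A minimal sketch of how the four terms could be combined is given below, assuming a plain weighted sum. The weight names `lam_micro`, `lam_macro`, and `lam_fair` and their default values are hypothetical placeholders, not the tuned hyper-parameters.

```python
def leak_objective(base_loss, proto_micro, proto_macro, fairness,
                   lam_micro=1.0, lam_macro=1.0, lam_fair=1.0):
    """Base loss plus the three regularizers, each scaled by its own weight."""
    # Assumption: a simple weighted sum; the weights are placeholders to be tuned.
    return (base_loss
            + lam_micro * proto_micro
            + lam_macro * proto_macro
            + lam_fair * fairness)
```

Setting all three weights to zero recovers the base training loss, which makes it easy to ablate each component in isolation.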

IV. TRAINING PROCEDURE
We experiment on publicly available benchmarks, using three point cloud datasets (2 outdoor and 1 indoor) and one image dataset.
The SemanticKITTI [4] dataset consists of 43552 densely annotated outdoor LiDAR scans. The training split contains 19130 scans, while the validation split contains 4071 scans (which we use for testing, as done by all competing works, since the test labels are not publicly available). The mean Intersection over Union (mIoU) score over 19 categories is used as the standard metric and results are reported on the original validation set.
The Semantic3D [20] dataset consists of outdoor static point clouds, 15 for training and 15 for online testing. Each point cloud has up to $10^8$ points. We only use color and spatial coordinates to train and test our models, following previous work [24]. We evaluate the performance via mIoU and Overall Accuracy (OA) on the 8 classes. The S3DIS [3] dataset consists of indoor scans of medium-sized single rooms, with dense 3D points. We use the standard 6-fold cross-validation in our evaluation. We report mIoU, mean class Accuracy (mAcc), and OA of the 13 classes.
The PascalVOC2012 [16] dataset is used to validate our method on RGB image samples. It contains 10582 images for training and 1449 for online testing. We use the mIoU to evaluate the performance on the 21 different classes.
Our proposed strategy is agnostic to the backbone architecture. To prove it, in our experimental evaluation we use RandLA-Net [24], RangeNet++ [43], Cylinder3D [81] and DeepLabV3 [10] with their original hyper-parameter configuration. We train RandLA-Net optimizing the network weights following [24], with Adam optimizer and the same learning rate policy, momentum, and weight decay. The initial learning rate is set to $10^{-2}$, and decreased with a polynomial decay rule with power 0.95. In each learning step, we train the models for 100 epochs with a batch size of 6 for SemanticKITTI and S3DIS, and 4 for Semantic3D, as in the original model. For Cylinder3D we train the model on SemanticKITTI using the standard configuration for 40 epochs with batch size 2 and learning rate $10^{-3}$. Finally, for the evaluation on DeepLabV3 we follow the original configuration [10] and train the model for 30 epochs with initial learning rate $7 \times 10^{-4}$ and batch size 6. We use an NVIDIA GeForce RTX 3090 GPU with CUDA 10.2 to train all the models.
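For reference, a polynomial decay rule is commonly written as the base rate scaled by $(1 - \text{epoch}/\text{max\_epochs})^{\text{power}}$. The sketch below assumes this common per-epoch form, which may differ from the exact policy of the original implementations; the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr, epoch, max_epochs, power=0.95):
    """Common 'poly' schedule: base_lr * (1 - epoch / max_epochs) ** power."""
    frac = min(max(epoch / max_epochs, 0.0), 1.0)  # clamp progress to [0, 1]
    return base_lr * (1.0 - frac) ** power
```

With a power close to 1, as here, the schedule is nearly linear: the rate starts at the base value and reaches zero at the last epoch.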

V. EXPERIMENTAL RESULTS
We extensively evaluate the performance of our approach with different architectures and datasets, analyzing each component of LEAK separately.
First of all, we show the results on the SemanticKITTI [4] dataset with two architectures, namely RandLA-Net [24] and Cylinder3D [81]. Then, we test LEAK on two different datasets: S3DIS [3] and Semantic3D [20]. The robustness of the method is further studied by applying LEAK in a different domain: we perform image semantic segmentation using DeepLabV3 [10] on the PascalVOC 2012 [16] dataset. Finally, the effect of each component is analyzed separately in an extensive ablation study.
Fig. 5: mIoU curves comparing reference value (blue) to supervised training with the addition of prototype regularization (green), fairness (orange), or both (red). Curves smoothed via running average with window size 12. LEAK provides higher mIoU and at the same time faster convergence speed.
SemanticKITTI. We start our analyses on the public SemanticKITTI [4] benchmark, reporting the results in Tab. I, where our approach is compared against several well-established architectures. We took two among the most successful architectures and we applied LEAK during their model training. The introduction of LEAK components improved the mIoU by 1.3% on RandLA-Net [24] and by 0.7% on Cylinder3D [81]. The performance drop in the retrained baselines is due to the use of a more recent version of the CUDA library and other packages. More information about specific packages and versions is provided in the code repository.
However, training with LEAK shows considerable improvements that outperform the baseline solution for both RandLA-Net and Cylinder3D. Note that in RandLA-Net the prototypical representations are built based on the point-wise features, while in Cylinder3D on the voxel-wise ones.
Looking at the per-class scores, it emerges how LEAK can provide more balanced results across the classes, thanks in particular to the fairness constraint. The automatic hierarchical grouping for SemanticKITTI classes is shown in Fig. 3, where we can appreciate that each macro class contains semantically similar micro classes. It is also possible to notice that the per-class IoU increases on the least populated classes (with the exception of truck in RangeNet21). In this case, fairness weighting combined with a higher misclassification probability depending on misleading object shapes after projection (RangeNet is based on a set of convolutional layers applied on images after spherical projections) induces a higher error probability on the points along the border of the object. This issue does not arise when applying LEAK to the other two architectures (RandLA-Net and Cylinder3D), which process points in 3D coordinates.
Fig. 4 shows some qualitative results that compare the predictions of RandLA-Net trained with or without our LEAK. The most relevant improvements are highlighted with red circles. We observe that prediction errors of the original model often involve semantically and geometrically similar classes: sidewalk is misclassified as road in the first row, building is misclassified as fence in the second row. Our approach, instead, can correctly label these objects, thanks to the increased latent space self-regularization and fairness of accuracy across the same macro class. Indeed, road and sidewalk belong to the same a posteriori macro class, similarly to fence and building. Further qualitative results are provided in Suppl. Mat.
Moreover, to verify the robustness of LEAK, we perform additional experiments on two other datasets, with different properties and acquired with different methodologies with respect to SemanticKITTI. We select the S3DIS [3] and Semantic3D [20] datasets, employing RandLA-Net. The strong generalization ability of our approach emerges clearly, despite the huge gap in the nature of the considered datasets. Indeed, the three point cloud datasets differ greatly in terms of the number of scenes, the number of points per scene, the acquisition methodologies, the environment (indoor/outdoor), and the number of classes. SemanticKITTI scenes are organized in sequences and distributed according to sparse concentric regions. Instead, static datasets present denser and regularly distributed point regions. Qualitative results are provided in Suppl. Mat.
TABLE IV: Ablation study on SemanticKITTI [4] dataset with RandLA-Net [24]. *: measure of the prediction coherence within the macro classes.

S3DIS. Results on S3DIS are shown in Tab. III. RandLA-Net with LEAK achieves remarkable results and, in particular, outperforms the baseline training scheme by 0.5% mIoU. The improvement is also visible in the other metrics, with an increase of 0.5% and 0.8% for the OA and mAcc, respectively.
Semantic3D. The per-class results in terms of mIoU and OA achieved on Semantic3D are reported in Tab. II. RandLA-Net with LEAK outperforms all the reported approaches, boosting the baseline by 1.0% mIoU.

VI. ABLATION STUDIES
We report an extensive ablation study to fully validate our approach. First, we show task generalizability, reporting the results on VOC2012 [16]. Then, we devise separate studies on each component of our LEAK.
A. Task Generalization
VOC2012. We show the robustness of LEAK on segmentation of RGB images rather than point clouds. To perform image semantic segmentation we consider the VOC2012 [16] dataset and we use DeepLabV3 [10], with ResNet-101 [21] as backbone (weights pre-trained on ImageNet [14]). Also in this case, LEAK brings an improvement compared to the standard method. In terms of mIoU, we obtain a value of 66.4% using the standard approach and 66.7% when LEAK is introduced. Such a result further supports the validity of our method, showing that it is agnostic not only to the backbone architecture and the dataset but even, more generally, to the task.
In addition, we report in Tab. V further experiments on recent architectures for Point Cloud Semantic Segmentation, showing that LEAK improves baseline performance on top of every tested segmentation architecture, even when transformer-based architectures are employed.

B. LEAK Components
A first set of analyses was carried out on the SemanticKITTI dataset with the RandLA-Net architecture. Tab. IV highlights the contribution of each component of LEAK on the final model scores. We observe that both the fairness and the feature-prototype alignment constraints significantly increase the final mIoU over the original model, by 1% and 0.5%, respectively. The combination of such regularization terms produces non-overlapping and mutual benefits, allowing for an increment of 1.3% in the final mIoU. Experimental results show that a fundamental requirement for the success of our approach is the a posteriori hierarchical clustering of the micro classes; indeed, employing a semantic-unaware random clustering of micro classes leads to a small mIoU of 51.3%, which is even lower than the original approach (−2.1%).
Also, we report two additional metrics to show the effects of each component: the frequency weighted Intersection over Union (fwIoU), which weights the importance of each class according to its appearance frequency, and the hierarchical Intersection over Union (hIoU), which we introduce and define as the IoU computed on $q = f(p)$, i.e., after each prediction $p$ is mapped to the respective macro class with $f(\cdot)$. This metric is computed in such a way that every predicted micro class should be equal to the target macro class that contains it. Its purpose is to underline the weight of the hierarchical self-induced assessment over the original model. In fact, the disentangling effect of fairness and prototypes leads to an improvement of 6.9% and 7.3%, respectively, in terms of hIoU, which is slightly reduced in the final model (+5.2%). We can appreciate improved results also in the fwIoU (+0.7%), where the main contribution is given by the class-balancing effect of fairness. Fig. 5 compares the temporal evolution of the mIoU score over different training epochs for each ablation model. We observe that the introduction of feature-prototype alignment delays the increase in the mIoU compared to the original baseline model (green curve), but leads to an overall improvement of 0.5% mIoU against the original model in the long term. This delay is partially generated by the dominance of the new loss terms over the cross-entropy loss. In addition, this performance degradation suggests that initial prototypes are quite unreliable, since they are computed on just a few feature samples. As training proceeds, the prototypical features become progressively better defined and separated, leading to an increment in the overall accuracy.
Fig. 6: Confusion matrices (Pre-LEAK and Post-LEAK) of the fine-grained predictions grouped according to the macro classes identified a posteriori pre- and post-LEAK on RandLA-Net [24] with SemanticKITTI [4].
Tab. VI reports further ablation results. We report tests using other cross-entropy weighting schemes (which in general help class balancing), and tests using a standard a priori class grouping, as in [4]. We extended the ablation on micro-macro grouping to both RandLA-Net and Cylinder3D, outperforming the standard coarse-to-fine strategy and some other state-of-the-art weighting schemes, including that of SalsaNext [12] (i.e., square root of inverse class frequency) and inverse class frequency weighting on Cylinder3D. Our accuracy gains are robust across setups, improving training with no architectural changes. Fig. 6 reports the confusion matrices obtained with the standard method (Pre-LEAK) and at the end of the training with LEAK (Post-LEAK).
The overall error percentages are reduced and fall mostly within the macro categories derived with a posteriori clustering. In particular, we can notice that LEAK brings a great reduction in the confusion between vehicles and environment, from 21% of misclassified samples to 2%. We can also see an improvement in the confusion between people and vehicles, from 18% to 8%, with the overall results on the diagonal either improving or remaining the same.
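The three scores used in this ablation (mIoU, fwIoU, hIoU) can all be derived from a single confusion matrix. The following is a minimal sketch rather than the authors' implementation. It assumes that rows index targets and columns index predictions, and that hIoU is the mean IoU obtained after both targets and predictions are mapped to macro classes through f(·). The function names and the `micro_to_macro` lookup array are hypothetical.

```python
import numpy as np

def iou_per_class(conf):
    """Per-class IoU from a confusion matrix (rows: target, cols: prediction)."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    # Classes that never appear in targets or predictions get IoU 0.
    return np.divide(tp, union, out=np.zeros_like(tp), where=union > 0)

def miou(conf):
    """Unweighted mean of the per-class IoU."""
    return float(iou_per_class(conf).mean())

def fwiou(conf):
    """IoU weighted by the target frequency of each class."""
    conf = np.asarray(conf, dtype=np.float64)
    freq = conf.sum(axis=1) / conf.sum()
    return float((freq * iou_per_class(conf)).sum())

def to_macro(conf, micro_to_macro):
    """Collapse a micro-level confusion matrix to macro level (q = f(p))."""
    conf = np.asarray(conf, dtype=np.float64)
    f = np.asarray(micro_to_macro)
    n_macro = int(f.max()) + 1
    macro = np.zeros((n_macro, n_macro))
    # Accumulate every micro cell into the cell of its macro (target, prediction) pair.
    np.add.at(macro, (f[:, None], f[None, :]), conf)
    return macro

def hiou(conf, micro_to_macro):
    """Assumed hIoU: mean IoU computed on the macro-level confusion matrix."""
    return miou(to_macro(conf, micro_to_macro))
```

With this formulation, confusions between two micro classes of the same macro group do not penalize hIoU, so a model whose errors stay inside the macro clusters scores a higher hIoU than mIoU.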

C. Feature-prototype alignment
To measure the impact of the feature-prototype alignment module, we compute some metrics to compare LEAK against the original model, as shown in Tab. VII. First, we define the inter-proto angle Θ_Γ as the angle between prototypes in the n-dimensional space, i.e.:

D. Attentive fair weighting
The effect of attentive fair weighting is instead to boost the performance from the very beginning of the training (green curve in Fig. 5), obtaining faster convergence and overall higher precision. The attentive fair weighting curve in Fig. 5 reaches 50% mIoU in only 10 epochs, i.e., 91.4% of the final score (54.7%). We can explain this rapid increase in the mIoU with the re-weighting, computed at each training step, which forces an equalization among different classes. Fair weights are computed class-wise and show beneficial effects from the beginning, balancing the unfair weighting of the cross-entropy, which privileges the most frequent classes. In the final LEAK method, where all the regularization terms are combined, the improvement due to fair weighting is slightly attenuated by feature-prototype alignment in the first 15 epochs, but the combination performs best in the long run, improving the mIoU by 1.3% with respect to the standard configuration. In the lower part of Tab. VIII we can appreciate that per-class accuracies are more balanced when fairness is enabled (see Suppl. Mat.). The reported metrics are computed on the per-class results of Tab. I and confirm that fair weighting balances accuracy over classes. The mean squared error (MSE) is computed with respect to the average mIoU value and is minimized by the introduction of fairness; the same observation holds for the standard deviation (σ), whereas entropy is maximized.
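The class-balancing statistics mentioned above (σ, MSE, entropy) can be computed directly from a vector of per-class IoU values. The sketch below is illustrative only. It assumes the population standard deviation, the MSE taken with respect to the mean IoU, and an entropy computed on the IoU values normalized to sum to one. The exact normalization used for Tab. VIII is not stated in this section, and `balance_metrics` is a hypothetical name.

```python
import numpy as np

def balance_metrics(per_class_iou):
    """Summarize how evenly the IoU is spread across classes."""
    iou = np.asarray(per_class_iou, dtype=np.float64)
    mean = float(iou.mean())
    sigma = float(iou.std())                      # population standard deviation
    mse = float(((iou - mean) ** 2).mean())       # MSE w.r.t. the mean IoU
    # Assumption: entropy of the IoU values normalized to a distribution.
    p = iou / iou.sum()
    p = p[p > 0]                                  # skip zero entries (0 * log 0 = 0)
    entropy = float(-(p * np.log(p)).sum())
    return {"mIoU": mean, "sigma": sigma, "MSE": mse, "entropy": entropy}
```

Under these assumptions the MSE equals σ squared, and the entropy reaches its maximum, log of the number of classes, when all classes have the same IoU. A more balanced model therefore shows lower σ and MSE and higher entropy at equal mIoU.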

E. Performance
Equipping models with LEAK improves their segmentation performance with only a slight increase in RAM occupancy. For example, Cylinder3D [81] with

VII. CONCLUSION
In this paper, we showed that accuracy, convergence time, and homogeneity can be improved for standard point cloud semantic segmentation methods by automatically analyzing the classification mistakes of the original models. Thanks to these errors, we identify macro classes, hierarchically grouping sets of mutually misclassified micro classes. Experimental evidence showed a posteriori that such groups are consistent in terms of semantic characterization or classification difficulty (e.g., due to the sample frequency or sparsity). This information is exploited in model learning to regularize the feature space in a two-fold manner. First, we cluster features of the same class (both macro and micro) tightly around their prototypical representations. Second, we constrain the class-wise accuracy score to be equally distributed across the micro classes contained within the same macro cluster. The proposed method boosted network performance while reducing the convergence time. Also, our solution proved to be agnostic to the backbone architecture and dataset, and to generalize readily, as it was adapted to 4 different datasets, 3 architectures, and 2 types of input data (i.e., point clouds and RGB images).

Fig. 2: Overall pipeline of the proposed approach. First (left side), we analyze the results of a standard supervised learning performed by any off-the-shelf segmentation model, identifying macro communities of similar micro semantic classes. Then (right side), we regularize the learning of the model by clustering features around their prototypical semantic representation at two levels (micro and macro) and by a macro-aware fairness score on the micro classes.

The inter-proto angle Θ_Γ of Eq. (8) is used to measure the relative distancing among prototypes. Then, we compute the Class Center Distance (CCD) as the average distance between the features φ_c of class c and the average feature array φ̄_c of class c [64] (computed at the end of training), i.e.:

$$\mathrm{CCD} = \frac{1}{N} \sum_{c \in [0,m)} \sum_{\phi_c} \lVert \phi_c - \bar{\phi}_c \rVert \qquad (9)$$

The parameter N is the total number of samples in the dataset, while N_c is the total number of samples for class c, i.e., the number of features φ_c in the inner sum. The purpose of CCD is to quantify how tightly the features are clustered around their class center. Similarly, we also define the Prototype Distance (PD) as the average distance of the features of class c from the prototype Γ_c (which is computed by a running average during the training phase), and the Proto-Center Distance (PCD) as the distance between φ̄_c and Γ_c. The metric values are reported in Tab. VII and confirm that feature-prototype alignment tightens the regions associated with each class, as CCD and PD are much lower when feature-prototype alignment is enabled. Additionally, the increased value of Θ_Γ in LEAK with respect to the original training highlights that a progressive spacing is introduced among prototypes, thus refining the class latent feature representations.
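The feature-space diagnostics above (CCD, PD, PCD and the inter-proto angle) can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions, not the authors' code. Distances are Euclidean and averaged over all N samples, PCD is averaged over classes, the inter-proto angle is taken as the mean pairwise angle between distinct prototypes, and every class is assumed to have at least one sample. All function names are hypothetical.

```python
import numpy as np

def class_centers(feats, labels, m):
    """Average feature of each of the m classes, shape (m, n)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(m)])

def mean_distance_to(feats, labels, anchors):
    """Average over all N samples of the distance to the anchor of their class.

    With anchors = class centers this gives CCD; with anchors = prototypes, PD.
    """
    return float(np.linalg.norm(feats - anchors[labels], axis=1).mean())

def proto_center_distance(centers, protos):
    """PCD: distance between class centers and prototypes, averaged over classes."""
    return float(np.linalg.norm(centers - protos, axis=1).mean())

def inter_proto_angle(protos):
    """Mean pairwise angle (in degrees) between distinct prototypes."""
    unit = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    upper = np.triu_indices(len(protos), k=1)  # each unordered pair once
    return float(np.degrees(np.arccos(cos[upper])).mean())
```

Lower CCD and PD indicate tighter class clusters, while a larger inter-proto angle indicates better separated prototypes, which is the trend the text attributes to feature-prototype alignment.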

TABLE I: Per-class IoU on SemanticKITTI [4] dataset. †: model re-trained from the official codebase for a fair comparison. Bold indicates best compared to baseline.

TABLE II: Quantitative results on the Semantic3D [20] dataset with RandLA-Net [24]. †: model re-trained from the official codebase for a fair comparison. Bold indicates best compared to baseline.

TABLE V: Results of recent state-of-the-art approaches in terms of mIoU, for the standard method and when equipped with LEAK, which is shown to boost results on every type of architecture and dataset.

TABLE VI: Ablation study on hierarchical grouping and weighting schemes using SemanticKITTI [4] dataset. *: coarse-to-fine is intended as a two-step training, half of the total epochs on macro classes and half on micro classes, using LEAK class partitioning.

TABLE VIII: Analyses on SemanticKITTI [4] and RandLA-Net [24]. Class-balancing effect measured by standard deviation (σ), MSE and entropy. Columns: mIoU, σ ↓, MSE ↓, entropy ↑.

SemanticKITTI [4] shows an increase in memory occupancy of +28 MB. This variation is almost negligible considering the size of a general point cloud segmentation network (e.g., the Cylinder3D architecture takes around 8 GB). Moreover, inference time is not affected by our method. Training time is slightly longer for LEAK-based methods, but the change is almost negligible. For example, Cylinder3D with SemanticKITTI takes about 1.68 seconds per iteration with LEAK, compared to about 1.61 seconds per iteration without LEAK (roughly 4% more).