Compositional Semantic Mix for Domain Adaptation in Point Cloud Segmentation

Deep-learning models for 3D point cloud semantic segmentation exhibit limited generalization capabilities when trained and tested on data captured with different sensors or in varying environments due to domain shift. Domain adaptation methods can be employed to mitigate this domain shift, for instance, by simulating sensor noise, developing domain-agnostic generators, or training point cloud completion networks. Often, these methods are tailored for range view maps or necessitate multi-modal input. In contrast, domain adaptation in the image domain can be executed through sample mixing, which emphasizes input data manipulation rather than employing distinct adaptation modules. In this study, we introduce compositional semantic mixing for point cloud domain adaptation, representing the first unsupervised domain adaptation technique for point cloud segmentation based on semantic and geometric sample mixing. We present a two-branch symmetric network architecture capable of concurrently processing point clouds from a source domain (e.g. synthetic) and point clouds from a target domain (e.g. real-world). Each branch operates within one domain by integrating selected data fragments from the other domain and utilizing semantic information derived from source labels and target (pseudo) labels. Additionally, our method can leverage a limited number of human point-level annotations (semi-supervised) to further enhance performance. We assess our approach in both synthetic-to-real and real-to-real scenarios using LiDAR datasets and demonstrate that it significantly outperforms state-of-the-art methods in both unsupervised and semi-supervised settings.


INTRODUCTION
LiDAR is currently the most suitable sensor for capturing accurate 3D measurements of an environment for autonomous driving [1] and robotic navigation [2]. Semantic scene understanding is a crucial component for AI-based perception systems [3]. LiDAR measurements can be analyzed in the form of 3D point clouds, with point cloud semantic segmentation used to assign a finite set of semantic labels to the 3D points [4]. To train accurate deep learning models, large-scale datasets with point-level annotations are necessary [5], [6], [7]. This involves a costly and labor-intensive data collection process, as point clouds need to be captured in the real world and manually annotated. An alternative is to use synthetic data, which can be conveniently generated with simulators [8]. However, deep neural networks are known to suffer from domain shift when trained and tested on data from different domains [8]. Although simulators can reproduce the acquisition sensor with high fidelity, further research is still required to address such domain shift [9].
Data augmentation techniques based on the combination of samples and their labels, such as Mixup [10] or CutMix [11], have been proposed to enhance deep network generalization. The underlying concept involves mixing samples to expand the training set and reduce overfitting. These methods were initially applied to image classification tasks and later adapted for domain adaptation and domain generalization in image recognition [12], [13]. Similar ideas have also been successfully extended to 2D semantic segmentation [14], [15]. While Unsupervised Domain Adaptation (UDA) for semantic segmentation in the image domain has been extensively studied [14], [15], [16], [17], [18], less attention has been devoted to developing adaptation techniques for point cloud segmentation. Point cloud UDA can be addressed in the input space [9], [19] with dropout rendering [19] or adversarial networks [9], or in the feature space through feature alignment [20]. A few studies have proposed exploiting sample mixing for point cloud data [21], [22], but they target different applications than UDA for semantic segmentation.

• C. Saltori, N. Sebe and E. Ricci are with the Dept. of Information Engineering and Computer Science, University of Trento, Italy. E-mails: cristiano.saltori@unitn.it, niculae.sebe@unitn.it, e.ricci@unitn.it.
• F. Galasso is with the Dept. of Computer Science, Sapienza University of Rome, Italy.
• G. Fiameni is with NVIDIA AI Technology Center, Italy.
In this paper, we present a novel domain adaptation framework for 3D point cloud segmentation, named CoSMix, which extends the approach presented in [23] to the semi-supervised setting (SSDA). CoSMix is designed to mitigate the domain shift by mixing semantically-informed groups of points (patches) across domains. Specifically, we design a two-branch symmetric deep neural network pipeline that concurrently processes point clouds from a source domain (e.g. synthetic or real) and point clouds from a target domain (e.g. real, or real but captured with a different sensor). Target point clouds can be either unlabeled or partially labeled if one wants to use CoSMix for UDA or SSDA, respectively. Each branch is domain specific, i.e. the source branch is in charge of mixing a source point cloud with selected patches of a target point cloud, and vice versa for the target branch. We formulate mixing as a composition operation, which is similar to the concatenation operation proposed in [21], [22], but unlike them, we leverage the semantic information to mix domains. Patches from the source point cloud are selected based on the semantic labels of their points. Patches from the target point cloud can be selected based on the predicted semantic pseudo-labels in the case of UDA, and based on human annotations in the case of SSDA. We will show that only a handful of manually annotated points are sufficient to significantly improve the domain adaptation performance. When patches are mixed across domains, we apply data augmentation both at the local and the global semantic level to boost the efficacy of the mixing. An additional key difference between our method and [21], [22] is the teacher-student learning scheme that we implement to improve the accuracy of the pseudo-labels. We evaluate CoSMix on large-scale point cloud segmentation benchmarks, featuring both synthetic and real-world data, in several directions such as synthetic to real and real to real. Specifically, we use the following datasets:
SynLiDAR [9], SemanticPOSS [6], SemanticKITTI [5], and nuScenes [7]. Our results show that CoSMix can reduce the domain shift, outperforming state-of-the-art methods in both UDA and SSDA settings. We perform detailed analyses of CoSMix and an ablation study of each component, highlighting its strengths and discussing its limitations.
This paper extends our earlier work [23] in several aspects. We extend the original CoSMix in order to tackle the SSDA setup. The current design allows a user to input a few annotated points to significantly improve the semantic segmentation performance on the target domain. Then, we significantly extend our experimental evaluation and analysis by adding new experiments, new comparisons, and new ablation studies to evaluate this new setup. We extend the related work by thoroughly reviewing additional state-of-the-art approaches, and by summarizing these approaches in a comprehensive table that highlights key contributions and setups. Lastly, the code is available at https://github.com/saltoricristiano/cosmix-uda.

RELATED WORK
Point cloud semantic segmentation. Point cloud semantic segmentation can be performed at point level [37], on range views [38] or on voxelized point clouds [39]. Point-level architectures process the input point cloud without the need for intermediate representation processing. These architectures include PointNet [40], which is based on a series of multilayer perceptrons. PointNet++ [37] improves on PointNet by aggregating global and local point features at multiple scales. RandLA-Net [41] extends PointNet++ [37] by embedding local spatial encoding, random sampling and attentive pooling. KPConv [42] learns weights in the continuous space, and introduces flexible and deformable convolutions for point cloud processing. These methods are computationally inefficient when large-scale point clouds are processed. Computational efficiency can be improved by projecting 3D points on 2D representations [25] or by using 3D quantization approaches [4]. The former includes 2D projection-based approaches that use 2D range maps and exploit standard 2D convolution filters [38] to segment these maps prior to a re-projection in the 3D space. RangeNet++ [25], the SqueezeSeg networks [20], [43], 3D-MiniNet [44] and PolarNet [45] are approaches that belong to this category. Although these approaches are efficient, they tend to lose information when the input data are projected in 2D and re-projected in 3D. The latter includes 3D quantization-based approaches that transform the input point cloud into a 3D discrete representation, and that employ 3D convolutions [39] or 3D sparse convolutions [4], [36] to predict per-point classes. VoxelNet [39] maps input points into a voxel grid and processes it with 3D convolutions. SparseConv [36], [46] and MinkowskiNet [4] improve voxel processing and introduce sparse convolutions to improve efficiency. Cylinder3D [47] further improves voxel processing for LiDAR data by using cylindrical and asymmetrical 3D convolutions. In our work, we use MinkowskiNet [4], which provides a trade-off between accuracy and efficiency.

Domain adaptation for point cloud segmentation.
Unlike domain adaptation for image-based tasks [48], [60], domain adaptation for point cloud segmentation still lacks a unified experimental setup to compare different approaches. We review domain adaptation approaches for point cloud segmentation by grouping them into range-view methods, multi-modal (2D&3D) methods, and 3D-focused methods.
Multi-modal models are designed to process the information captured by multiple input sensors, typically RGB cameras and LiDAR sensors. Domain shift is tackled by enforcing prediction consistency among modalities and domains, and by using target (pseudo) labels.
xMUDA [29] uses cross-modality and cross-domain consistency to learn a domain-agnostic model in the real-to-real UDA setup. Cross-modal consistency exploits source labels and target pseudo-labels to produce consistent multi-modal predictions in both domains. DeepCORAL [64] is used to enforce feature alignment between source and target domains. In [32], xMUDA is extended to the SSDA setting, showing that cross-modal consistency is effective even with semi-supervision.
3D methods can process input point clouds with or without prior voxelization. UDA approaches for 3D segmentation include voxel-based architectures such as SparseConv [46] and MinkowskiNet [4]. Domain shift can be tackled by focusing on the problem of sparsity [9], [35] or by employing mix-up strategies [23]. Complete&Label [35] reduces the sparsity difference between real domains by formulating the domain adaptation problem as a point cloud completion (or densification) problem. A self-supervised completion network is trained to make the sparse input point cloud denser. The pre-processed point clouds can then be used as intermediate domains in order to lower the domain shift. PCT [9] disentangles domain shift between synthetic and real point clouds into appearance and sparsity. Then, PCT learns an appearance translation module and a sparsity translation module. These modules are used for translating source data into the target modality. Translated data are then used together with ST [18] and APE [65] in the UDA and SSDA settings, respectively.
CoSMix [23] is a method that reduces domain shift in point cloud data by introducing a compositional semantic mix-up strategy with a teacher-student learning scheme. The method obtains domain-invariant models/features by creating two new intermediate domains of composite point clouds: a mixed source and a mixed target. In the mixed target, source instances pull the target domain closer to the source domain, preventing overfitting from noisy pseudo-labels. In the mixed source, target instances (pseudo-labels) bring the target modality into the source domain, pulling the source domain closer to the target domain. The teacher-student learning scheme enables the iterative improvement of pseudo-labels, progressively reducing the domain gap. In this work, we extend CoSMix [23] to the SSDA setting by allowing target labels to be mixed in the source and target (unlabeled) point clouds while improving adaptation. We also show how a small amount of target supervision can significantly improve the adaptation performance.

Preliminaries and definitions
CoSMix implements a teacher-student learning scheme that exploits the supervision from the source domain, the self-supervision from the target domain and, if available, the supervision from a few labeled target samples to improve the semantic segmentation on the target domain. Our method is trained on two different mixed point cloud sets. The first is the composition of the source point cloud with pseudo-labeled portions of points, or patches, of the unlabeled target point cloud. Target patches bring the target modality into the source domain, making the altered source domain more similar to the target domain. The second is the composition of the unlabeled target point cloud with randomly selected patches of the source point cloud. Source patches make the altered target domain more similar to the source domain, preventing overfitting from noisy pseudo-labels. If available, labeled points of the target point clouds can also be used in both mixed point cloud sets. This target supervision can further reduce domain shift. The teacher-student learning scheme iteratively improves pseudo-labels, progressively reducing the domain gap. Fig. 1 illustrates the block diagram of CoSMix.
Let S = {(X^s, Y^s)} be the source dataset, composed of N_s = |S| labeled point clouds, where X^s is a point cloud, Y^s is its point-level labels, and |·| is the cardinality of a set. Labels take values from a set of semantic classes C = {c}, where c is a semantic class. Let T_U = {X^t_U} be the unlabeled target dataset, and let T_L = {(X^t_L, Y^t_L)} be the (possibly empty) labeled target set used in the SSDA setting. On the upper branch, the source point cloud X^s is mixed with selected patches of the target point cloud X^t_U and, when available, with selected patches of the supervised point cloud X^t_L. The unlabeled target patches from X^t_U are subsets of points that correspond to the most confident pseudo-labels Ŷ^t_U that the teacher network produces during training. The supervised target patches are subsets of points that are randomly selected based on the class frequency distribution in the source training set. On the lower branch, the target point cloud X^t_U is mixed with the selected patches of the source point cloud X^s and with the selected patches of X^t_L, if available. The source patches are subsets of points that are randomly selected based on their class frequency distribution in the training set.
We define the branch that mixes target point cloud patches into the source point cloud as t → s, and the branch that does the vice versa as s → t. Let X^{t→s} be the mixed point cloud obtained from the upper branch, and X^{s→t} be the mixed point cloud obtained from the lower branch. Lastly, let Φ_θ and Φ_θ′ be the student and teacher deep networks with learnable parameters θ and θ′, respectively.

Semantic selection
To train the student network with balanced data, we perform a selection of reliable and informative point cloud patches prior to mixing points and labels across domains. To select patches from the source point cloud, we use the class frequency distribution obtained by counting the number of points of each semantic class within S. Unlike DSP [15], which selects long-tail classes in advance, we exploit the source distribution and the available semantic classes to dynamically sample classes at each iteration.
Let P^s_Y be the class frequency distribution of S. We define a function f that, at each iteration, randomly selects a subset of the classes present in the labels Y^s. f performs a weighted random sampling of α classes from the input point cloud by using 1 − P^s_Y as the weight for each class, where α is a hyperparameter that regulates the ratio of selected classes for each point cloud. The output of f is the set of point-level labels belonging to the sampled classes, i.e. Ỹ^s ⊂ Y^s. The likelihood that f selects a class c is inversely proportional to its class frequency in S. Formally, we have

Ỹ^s = f(Y^s).

Example: with α = 0.5, the algorithm selects a number of patches corresponding to 50% of the available classes, with long-tailed classes selected with higher likelihood. Let X̃^s be the set of points that correspond to Ỹ^s, and let X̃^s_c ⊂ X̃^s be a patch (set of points) that belongs to class c ∈ C. To select patches from the target point clouds, we apply the same set of operations, but using the pseudo-labels produced by the teacher network, selected based on their prediction confidence. Specifically, we define a function g that selects reliable pseudo-labels based on their confidence value. The selected pseudo-labels are defined as

Ỹ^t_U = g(Φ_θ′(X^t_U), ζ),

where Φ_θ′ is the teacher network, ζ is the confidence threshold used by the function g, and Ỹ^t_U ⊂ Ŷ^t_U. Let X̃^t_U be the set of points that correspond to Ỹ^t_U. In the case of target supervision, we apply f to the target labels Y^t_L and randomly select target patches as

Ỹ^t_L = f(Y^t_L),

where µ is a hyperparameter that regulates the ratio of selected classes for each point cloud, analogously to α.
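To make the selection concrete, the following sketch (an illustration with assumed array shapes and hypothetical helper names, not the authors' implementation) shows a possible form of f, the weighted class sampling with weights 1 − P^s_Y, and of g, the confidence-based pseudo-label filtering:

```python
import numpy as np

def select_source_classes(labels, class_freq, alpha=0.5, rng=None):
    # Sketch of f: sample a ratio alpha of the classes present in
    # `labels`, weighting class c by (1 - class_freq[c]) so that
    # long-tail classes are picked with higher likelihood.
    rng = rng or np.random.default_rng()
    present = np.unique(labels)
    weights = np.array([1.0 - class_freq[c] for c in present])
    weights = weights / weights.sum()
    n_pick = max(1, int(round(alpha * len(present))))
    picked = rng.choice(present, size=n_pick, replace=False, p=weights)
    # Point-level mask of the selected classes (the patches X~s_c).
    return picked, np.isin(labels, picked)

def select_confident_pseudo_labels(probs, zeta=0.85):
    # Sketch of g: keep only pseudo-labels whose teacher confidence
    # (max softmax probability) exceeds the threshold zeta.
    conf = probs.max(axis=1)
    keep = conf > zeta
    return probs.argmax(axis=1)[keep], keep
```

In an actual pipeline these masks would index into the (N, 3) coordinate array of the point cloud to extract the patch points.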

Compositional mix
The goal of our compositional mixing module is to create mixed point clouds based on the selected semantic patches. The compositional mix involves three consecutive operations: local random augmentation, where patches are augmented randomly and independently from each other; concatenation, where the augmented patches are concatenated to the point cloud of the other domain to create the mixed point cloud; and global random augmentation, where the mixed point cloud is randomly augmented. This module is applied twice, once for the t → s branch (top of Fig. 1), where target patches are mixed within the source point cloud, and once for the s → t branch (bottom of Fig. 1), where source patches are mixed within the target point cloud. Unlike Mix3D [21], our mixing strategy embeds data augmentation at both the local and the global level.

Fig. 1 (caption): The SSDA setting also uses the middle branch in addition to those used in UDA (gray line). In the top branch, the input source point cloud X^s is mixed with the unsupervised target point cloud X^t_U, obtaining X^{t→s}. In the bottom branch, the input target point cloud X^t_U is mixed with the source point cloud X^s, obtaining X^{s→t}. In the SSDA setting, the labeled target data X^t_L are mixed with the source point cloud X^s and with the unsupervised target point cloud X^t_U. A teacher-student learning architecture is used in both the UDA and SSDA settings to improve pseudo-label accuracy while adapting to the target domain. This is achieved by updating the teacher network through Exponential Moving Average (EMA). Semantic Selection (f and g) selects subsets of points (patches) to be mixed based on the source labels Y^s, the target labels Y^t_L, and the target pseudo-labels Ŷ^t_U. Compositional Mix applies local (h) and global (r) augmentations and mixes the selected patches among domains.
Let δ be the indicator function that we define as

δ(T_L) = 1 if T_L ≠ ∅, and δ(T_L) = 0 otherwise,

which indicates whether the supervised target set T_L is empty or not. This can be interpreted as the user's desire or need to use additional target supervision.
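In code the indicator is trivial; a minimal sketch (with T_L represented as a plain Python collection) might read:

```python
def delta(T_L):
    # 1 if the labeled target set T_L is non-empty (SSDA mode),
    # 0 otherwise (pure UDA mode).
    return 1 if len(T_L) > 0 else 0
```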
In the s → t branch, we apply the local random augmentation h to all the points X̃^s_c ⊂ X̃^s, repeating this operation for all the selected classes c. Note that h is a local and random augmentation that produces a different result each time it is applied to a set of points. We define the result of this operation as

X̄^s = ∪_c h(X̃^s_c).

If δ(T_L) = 1, we can apply h also to X̃^t_L and obtain X̄^t_L. Then, we concatenate the locally augmented patches with the target point cloud X^t_U and apply the global random augmentation, such that

X^{s→t} = r(X^t_U ⊕ X̄^s ⊕ δ(T_L)·X̄^t_L).

Their respective labels are concatenated accordingly as

Y^{s→t} = Ŷ^t_U ⊕ Ỹ^s ⊕ δ(T_L)·Ỹ^t_L,

where ⊕ denotes concatenation and r is the global augmentation function.
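The three-step composition can be sketched as follows (assumed shapes: (N, 3) point arrays and (N,) label arrays; `local_aug` and `global_aug` stand in for h and r and are passed as callables — this is an illustration, not the authors' code):

```python
import numpy as np

def compositional_mix(target_pts, target_lbl, patches, patch_lbls,
                      local_aug, global_aug):
    # Each selected patch is augmented locally (h), concatenated to the
    # other-domain point cloud, and the mixed cloud is augmented
    # globally (r); labels are concatenated accordingly.
    mixed_pts, mixed_lbl = [target_pts], [target_lbl]
    for pts, lbl in zip(patches, patch_lbls):
        mixed_pts.append(local_aug(pts))   # h: independent per patch
        mixed_lbl.append(lbl)
    X = global_aug(np.concatenate(mixed_pts))  # r: once on the mix
    Y = np.concatenate(mixed_lbl)
    return X, Y
```

The t → s direction follows by swapping the roles of the two domains and using pseudo-labels in place of target labels.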
The same operations are also performed in the t → s branch by mixing target patches within the source point cloud. Instead of using source labels, we use the teacher network to generate pseudo-labels from the target data, and we additionally use target supervision if δ(T_L) = 1. We then concatenate them with the labels of the source data, which results in X^{t→s} and Y^{t→s}. Note that T_L may be used without the compositional mix and without the double-branch mixing. We implement h and r by using typical augmentation strategies for point clouds [4], i.e. random rotation, scaling, and translation. We report additional information in Sec. 4.2.

Network update
We leverage the teacher-student learning scheme to facilitate the transfer of the knowledge acquired during training with mixed domains. We use the teacher network Φ_θ′ to produce target pseudo-labels Ŷ^t_U for the student network Φ_θ, and train Φ_θ to segment target point clouds by using the mixed point clouds X^{s→t} and X^{t→s} with their mixed labels and pseudo-labels (Sec. 3.3).
At each batch iteration, we update the student parameters θ to minimize a total objective loss L_tot defined as

L_tot = L_{s→t} + L_{t→s},

where L_{s→t} and L_{t→s} are the s → t and t → s branch losses, respectively. Given X^{s→t} and Y^{s→t}, we define the segmentation loss for the s → t branch as

L_{s→t} = L_seg(Φ_θ(X^{s→t}), Y^{s→t}),

the objective of which is to minimize the segmentation error over X^{s→t}, thus learning to segment source patches in the target domain. Similarly, given X^{t→s} and Y^{t→s}, we define the segmentation loss for the t → s branch as

L_{t→s} = L_seg(Φ_θ(X^{t→s}), Y^{t→s}),

whose objective is to minimize the segmentation error over X^{t→s}, where target patches are composed with source data. We implement L_seg as the Dice segmentation loss [66], which we found effective for the segmentation of large-scale point clouds as it copes well with long-tail classes. Lastly, we update the teacher parameters θ′ every γ iterations following the exponential moving average (EMA) [67] approach

θ′_i = β·θ′_{i−1} + (1 − β)·θ_i,

where i indicates the training iteration and β is a smoothing coefficient hyperparameter.
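The two ingredients above — a Dice-style segmentation loss and the EMA teacher update — can be sketched as follows (numpy arrays stand in for network outputs and parameters; an illustrative reduction, not the training code):

```python
import numpy as np

def dice_loss(probs, onehot, eps=1e-6):
    # Minimal per-class Dice loss sketch; `probs` and `onehot` are
    # (N, C) arrays of predicted probabilities and one-hot labels.
    inter = (probs * onehot).sum(axis=0)
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

def ema_update(teacher, student, beta=0.99):
    # EMA teacher update, theta'_i = beta*theta'_{i-1} + (1-beta)*theta_i,
    # applied parameter-wise.
    return [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
```

With a high β the teacher changes slowly, which is what makes its pseudo-labels more stable than the student's raw predictions.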

EXPERIMENTS
We evaluate our method in both synthetic-to-real and real-to-real UDA and SSDA settings. We use SynLiDAR [9] as the synthetic dataset, and SemanticKITTI [5], [26], [30], SemanticPOSS [6], and nuScenes [7] as real-world datasets. We compare CoSMix with five state-of-the-art UDA methods: two general-purpose adaptation methods (ADDA [49], Ent-Min [50]), one image segmentation method (ST [18]), and two point cloud segmentation methods (PCT [9], ST-PCT [9]). Then, we compare CoSMix with five state-of-the-art SSDA methods: three general-purpose adaptation methods (MMD [49], MME [68], APE [65]), and two point cloud segmentation methods (PCT [9], APE-PCT [9]). We use CoSMix-UDA and CoSMix-SSDA to indicate the versions of CoSMix for UDA and SSDA, respectively, and CoSMix to refer to our method in general. PCT, ST-PCT and APE-PCT are the only three state-of-the-art methods developed for 360° LiDAR point clouds and have only been applied in synthetic-to-real UDA and SSDA settings. We re-implemented the comparison methods and adapted them to the same backbone network as that of CoSMix, denoting them with ⋆, e.g. Source⋆, Target⋆ and Fine-tuned⋆. Moreover, we extended EntMin [50] and ST [18] to the SSDA setting, and refer to them as EntMin-SSDA⋆ and ST-SSDA⋆. For completeness, we also include the results of these methods as they are reported in [9].

Datasets and metrics
SynLiDAR [9] is a large-scale synthetic dataset created with the Unreal Engine [69]. It is composed of 198,396 annotated point clouds with 32 semantic classes. We use 19,840 point clouds for training and 1,976 point clouds for validation [9]. SemanticPOSS [6] is composed of 2,988 annotated real-world point clouds with 14 semantic classes. We use sequence 03 for validation and the remaining sequences for training [6]. For the SSDA settings, we follow [9] and use point cloud 172 of sequence 02 as the semi-supervised target set. SemanticKITTI [5] is a large-scale segmentation dataset consisting of LiDAR acquisitions of the popular KITTI dataset [26], [30]. It is composed of 43,552 annotated real-world point clouds with more than 19 semantic classes. We use sequence 08 for validation and the remaining sequences for training [5]. For the SSDA settings, we follow [9] and use point cloud 848 from sequence 06 and point cloud 940 from sequence 02 as the semi-supervised target set. nuScenes [7] is a large-scale segmentation dataset. It is composed of 850 real-world sequences (700 for training and 150 for validation), for a total of 34,000 annotated point clouds with 32 semantic classes. We use the official training and validation splits in all our experiments. For the SSDA settings, we follow the same selection protocol used in [9] and use the point cloud with token n015-2018-07-24-11-13-19+0800 LIDAR TOP 1532402013197655 as the semi-supervised target set. We make source and target labels compatible across our datasets, i.e. SynLiDAR → SemanticPOSS, SynLiDAR → SemanticKITTI and SemanticKITTI → nuScenes. In SynLiDAR → SemanticPOSS and SynLiDAR → SemanticKITTI, we follow [9] and map labels into 14 and 19 segmentation classes, respectively. In SemanticKITTI → nuScenes, we map source and target labels into 7 common segmentation classes as in [70].
We evaluate the semantic segmentation performance before and after domain adaptation [9] by using the Intersection over Union (IoU) [71] for each segmentation class, and report the per-class IoU. We average the IoU over all the segmented classes and report the mean Intersection over Union (mIoU).
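The metric can be computed directly from label arrays; a minimal sketch (skipping classes absent from both prediction and ground truth, an assumed convention here) is:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    # IoU_c = TP_c / (TP_c + FP_c + FN_c) per class; the mean over the
    # evaluated classes is the mIoU.
    ious = {}
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both arrays
            ious[c] = inter / union
    miou = sum(ious.values()) / len(ious)
    return ious, miou
```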

Implementation details
We implemented CoSMix in PyTorch and ran our experiments on 4×NVIDIA A100 (40GB SXM4) GPUs. We use MinkowskiNet as our point cloud segmentation network [4]; in particular, we use MinkUNet32 as in [9]. We pre-train our network on the source domain with the Dice loss [66], starting from randomly initialized weights. In SSDA, we start from the pre-trained source model and fine-tune it on both source and labeled target data for two additional epochs; this fine-tuned model is used as the pre-trained model in the semi-supervised settings. In UDA, we initialize student and teacher networks with the parameters obtained after pre-training. The pre-training and adaptation stages share the same hyperparameters. In both the pre-training and adaptation steps, we use Stochastic Gradient Descent with a learning rate of 0.001.
We set the value of α by examining the long-tailed classes present in the source domain during the adaptation process. Similarly, we set the parameter µ to the same value. We assign the values of α and µ based on our prior experience, rather than optimizing these parameters through a systematic search. In the target semantic selection function g, we set the value of ζ based on a qualitative assessment of a few target frames, with the aim of producing spatially compact predictions. This approach yields approximately 80% of pseudo-labeled points per scene.
On SynLiDAR → SemanticPOSS, we use a batch size of 12 and perform adaptation for 10 epochs. We set source and supervised target semantic selection (f) with α = 0.5 and µ = 0.5, while we set target semantic selection (g) with a confidence threshold ζ = 0.85. On SynLiDAR → SemanticKITTI, we use a batch size of 16, adapting for 3 epochs. During source and supervised target semantic selection (f) we set α = 0.5 and µ = 0.5, while in target semantic selection (g) we use a confidence threshold of ζ = 0.90. We use these same hyperparameters also on SemanticKITTI → nuScenes and SynLiDAR → nuScenes.
Our local augmentations h and global augmentations r are based on data augmentation strategies that are typical in the LiDAR segmentation literature [4]. h involves rigid rotation around the z-axis, scaling along all the axes and random point downsampling. We remove xy rotation to produce coplanar and concentric mixed point clouds, and to preserve point ranges. For the same reason, we remove rigid translations. We bound rotations between [−π/2, π/2] and scaling between [0.95, 1.05], and perform random downsampling of 50% of the patch points. r involves rigid rotation, translation and scaling along all three axes. We set the parameters of r the same as those used in [4]. During the network update step (Sec. 3.4), we update the teacher parameters θ′ with β = 0.99. On SynLiDAR → SemanticPOSS, we set γ = 1 and do not perform parameter tuning. On SynLiDAR → SemanticKITTI, we increase γ to γ = 500 to obtain a stable teacher behavior, i.e. stable source performance, high average confidence of pseudo-labels, and ∼80% of pseudo-labeled points. We use these same hyperparameters also on SemanticKITTI → nuScenes and SynLiDAR → nuScenes.

On average, MMD⋆ is the best performing method among the comparison methods with 47.0 mIoU. CoSMix-SSDA achieves 48.9 mIoU, outperforming all the comparison methods and further improving over CoSMix-UDA.
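For reference, the local augmentation h with the parameters above (z-rotation in [−π/2, π/2], scaling in [0.95, 1.05], 50% downsampling) could be sketched as follows, assuming an (N, 3) array layout; this is an illustration, not the authors' implementation:

```python
import numpy as np

def local_augment(points, rng=None):
    # Sketch of h: rigid rotation around the z-axis bounded in
    # [-pi/2, pi/2], uniform scaling in [0.95, 1.05], and random
    # downsampling to 50% of the patch points.
    rng = rng or np.random.default_rng()
    theta = rng.uniform(-np.pi / 2, np.pi / 2)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.95, 1.05)
    out = (points @ rot_z.T) * scale
    keep = rng.choice(len(out), size=len(out) // 2, replace=False)
    return out[keep]
```

Restricting rotation to the z-axis keeps the mixed clouds coplanar and concentric, matching the rationale given above.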

Domain adaptation between different sensors
We study the domain adaptation performance of CoSMix when the source point cloud is synthetically generated with a certain sensor and the target point cloud is captured in the real world with a different sensor. We use SynLiDAR as the source domain and nuScenes as the target domain. The semantic classes are mapped into 11 common segmentation classes: car, bicycle, motorcycle, truck, bus, pedestrian, road, sidewalk, building, vegetation and terrain. This case exhibits a rather strong domain shift, as the SynLiDAR point clouds are dense and nearly noise-free, while the nuScenes point clouds are sparser and noisier. We follow the same implementation details as SemanticKITTI → nuScenes, except that we change the target domain. Tab. 8 reports the domain adaptation results on SynLiDAR → nuScenes in both the UDA and SSDA settings. The Source⋆ and Target⋆ models achieve 23.7 mIoU and 47.7 mIoU, respectively. Fine-tuned⋆ improves the performance to 26.4 mIoU. CoSMix-UDA improves over Source⋆ by achieving 27.3 mIoU. Despite the lack of target supervision, we also outperform the Fine-tuned⋆ baseline.
CoSMix-SSDA further improves the results by achieving 27.6 mIoU when we introduce limited target supervision.We observed a lower improvement of CoSMix compared to the other adaptation directions, which we attribute to the different simulated sensor and the large density difference between SynLiDAR and nuScenes scans.

Qualitative results
Fig. 2 shows some domain adaptation results on SynLiDAR → SemanticPOSS. Predictions of Source⋆ are often incorrect. CoSMix-UDA improves the segmentation results with more homogeneous regions and correctly assigned classes, and CoSMix-SSDA further improves the segmentation quality. Fig. 3 shows the results on SynLiDAR → SemanticKITTI, which follow the same pattern as in Fig. 2. Some classes (e.g. car, vegetation, pole) greatly improve when CoSMix-SSDA is used. An evident increase in performance can be observed from Source⋆ to CoSMix-UDA to CoSMix-SSDA in both the studied directions. Despite the limited amount of target supervision used in CoSMix-SSDA, these experiments show evidence of the benefits of our SSDA method.

ABLATION STUDY
We investigate the performance of CoSMix in both its UDA and SSDA variants by using the SynLiDAR → SemanticPOSS setup. The first three experiments are designed to study CoSMix in the UDA setting. In Sec. 5.1, we analyze CoSMix-UDA components. In Sec. 5.2, we compare our mixing approach with three recent point cloud mixing strategies, namely Mix3D [21], PointCutMix [72] and PolarMix [73]. In Sec. 5.3, we investigate the robustness of CoSMix to noisy pseudo-labels by changing the confidence threshold ζ and with different pre-trained models. In the last experiment (Sec. 5.4), we analyze CoSMix in the SSDA setting, comparing our semi-supervised mixing approach with three variations of our approach.

Method components
We analyze CoSMix by organizing its components into three groups: mixing strategies (mix), augmentations (augs) and other components (others). In the mix group, we assess the contribution of the t → s and s → t mixing branches.

Point Cloud Mix
We compare CoSMix with Mix3D [21], PointCutMix [72] and PolarMix [73] to show the effectiveness of the different mixing designs. To the best of our knowledge, Mix3D and PolarMix are the only mixup strategies designed for 3D semantic segmentation, while PointCutMix and PolarMix are the only strategies for mixing portions of different point clouds. We implement Mix3D and PointCutMix based on the authors' descriptions: we concatenate point clouds (random crops for PointCutMix) of the two domains, i.e., X s and X t , as well as their labels and pseudo-labels, i.e., Y s and Ŷt , respectively. PolarMix [73] uses the same experimental settings and backbone as ours; therefore, we consider the results reported in their manuscript. We refer to these mixing strategies as Mix3D ⋆ , PointCutMix ⋆ and PolarMix † . CoSMix double is our two-branch network with sample mixing. For a fair comparison, we deactivate the weighted sampling and the mean teacher update. We keep local and global augmentations activated.
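The concatenation-style mixing used for the Mix3D ⋆ and PointCutMix ⋆ baselines can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the function name and array layout are our own assumptions:

```python
import numpy as np

def concat_mix(points_s, labels_s, points_t, pseudo_t):
    """Naive concatenation mixing (Mix3D-style baseline).

    The two scans are merged into a single point cloud, and the source
    labels Y_s are concatenated with the target pseudo-labels Y_hat_t.
    points_*: (N, 3) coordinates; labels_*: (N,) integer class indices.
    """
    mixed_points = np.concatenate([points_s, points_t], axis=0)
    mixed_labels = np.concatenate([labels_s, pseudo_t], axis=0)
    return mixed_points, mixed_labels
```

For the PointCutMix ⋆ variant, a random crop of one scan would be concatenated instead of the full scan; the rest of the pipeline is unchanged.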
Fig. 5a shows that Mix3D ⋆ outperforms the Source ⋆ model, achieving 28.5 mIoU, followed by PolarMix † which achieves 30.4 mIoU. PointCutMix ⋆ reaches 31.6 mIoU, outperforming the previous strategies. When we use the t → s branch alone, we can achieve 32.9 mIoU, and when we use the s → t branch alone, CoSMix can further improve the results, achieving 34.8 mIoU. This shows that the supervision from source to target is effective for adaptation on the target domain. When we use the contribution of both branches simultaneously, CoSMix achieves the best result with 38.9 mIoU.

TABLE 8: Adaptation results on SynLiDAR → nuScenes. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source synthetic dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).

Robustness to noisy pseudo-labels
We investigate the robustness of CoSMix to increasingly noisier pseudo-labels. Firstly, we study the effect of different confidence thresholds ζ. Secondly, we evaluate different versions of the pre-trained models that we use for generating pseudo-labels.
Confidence threshold. We study the importance of setting the correct confidence threshold ζ for pseudo-label distillation in g (Sec. 3.2). We repeat the experiments with confidence thresholds from 0.65 to 0.95 and report the obtained adaptation performance in Fig. 5b. CoSMix is robust to noisy pseudo-labels, reaching 40.2 mIoU with the low threshold of 0.65. The best adaptation performance of 40.4 mIoU is achieved with a confidence threshold of 0.85. With a high confidence threshold of 0.95, performance drops to 39.2 mIoU. In this configuration, too few pseudo-labels are selected to provide an effective contribution to the adaptation.

Model pre-training. We quantify the robustness of CoSMix and ST ⋆ [18] to pseudo-labels generated with different models pre-trained on SynLiDAR and tested on SemanticPOSS. In this experiment, we only use ST ⋆ as it is the sole method among those we benchmarked that is based on pseudo-labels. We denote the pre-trained model as P ⋆ . Fig. 4 displays its performance at different epochs: (a) 1, (b) 2, (c) 4, and (d) 9. Unlike CoSMix, ST ⋆ proves sensitive to pseudo-labels, as it underperforms P ⋆ in three out of the four cases. A plausible explanation is that ST ⋆ refines the pre-trained model using filtered pseudo-labels during adaptation, making it dependent on their quality. This dependency may cause ST ⋆ to drift during the adaptation process, thus impacting performance. In contrast, CoSMix blends source and target (pseudo) labels, producing two intermediate domains with mixed labels. In the mixed point clouds, pseudo-labels are integrated with (noise-free) source labels (t → s) or noise-free selections (s → t), thus mitigating the negative effects of noisy and imprecise regions. Furthermore, our use of a teacher-based approach allows us to rely on progressively more precise pseudo-labels, thereby minimizing undesirable drift effects.
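The pseudo-label distillation step g discussed above amounts to a per-point confidence filter. A minimal NumPy sketch, assuming softmax scores from the teacher network (the function name and the ignore-index convention are our own illustrations):

```python
import numpy as np

def distill_pseudo_labels(probs, zeta=0.85, ignore_index=-1):
    """Pseudo-label distillation: keep predictions with confidence >= zeta.

    probs: (N, C) per-point softmax scores from the teacher network.
    zeta:  confidence threshold; 0.85 gave the best trade-off in our study.
    Points below the threshold are marked with `ignore_index` so they are
    excluded from the mixing and from the loss.
    """
    conf = probs.max(axis=1)            # per-point top-class confidence
    labels = probs.argmax(axis=1)       # per-point predicted class
    labels[conf < zeta] = ignore_index  # filter out uncertain points
    return labels
```

Raising zeta trades pseudo-label correctness for coverage: at 0.95 too few points survive the filter to drive adaptation, consistent with the drop observed in Fig. 5b.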

Mixing target supervision
We compare CoSMix-SSDA to three alternative mixing strategies: naive, sup → s and sup → t. In Fig. 5c, we name these strategies CoSMix-SSDA (a-c). In version CoSMix-SSDA (a), we apply CoSMix-UDA without mixing T L in X s→t and X t→s . The Dice segmentation loss is applied separately on T L and averaged with our total objective loss in Eq. 9. In the single branch mixing with source point clouds (sup → s) and with target point clouds (sup → t), versions (b-c), we apply only the upper or lower branch of CoSMix-SSDA, respectively. Full is our proposed double-branch CoSMix-SSDA. CoSMix-SSDA (a) reaches 33.7 mIoU, which shows that traditional training using the labeled target points as-is leads to inferior performance compared to our SSDA approach. Both single branch mixing strategies achieve better performance, with 38.9 mIoU and 40.5 mIoU for sup → s and sup → t, respectively. Version (b) shows that the mixed target modality with noise-free annotations helps in reducing the domain shift. Version (c) suggests that the addition of target noise-free labels helps in achieving higher performance. However, neither single branch approach is sufficient to outperform the Full mixing strategy.
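The Dice segmentation loss applied to the labeled target points T L can be sketched as a standard soft Dice objective. This is a minimal NumPy illustration of the general form, not necessarily the exact formulation used in Eq. 9:

```python
import numpy as np

def soft_dice_loss(probs, labels, num_classes, eps=1e-6):
    """Soft Dice loss over a labeled point set.

    probs:  (N, C) per-point class probabilities.
    labels: (N,)   integer ground-truth class labels.
    Returns 1 minus the Dice coefficient, averaged over the classes
    that actually appear in `labels`.
    """
    onehot = np.eye(num_classes)[labels]            # (N, C) one-hot targets
    inter = (probs * onehot).sum(axis=0)            # per-class intersection
    denom = probs.sum(axis=0) + onehot.sum(axis=0)  # per-class denominator
    dice = (2.0 * inter + eps) / (denom + eps)
    present = onehot.sum(axis=0) > 0                # ignore absent classes
    return float(1.0 - dice[present].mean())
```

Because Dice normalizes per class, it is less dominated by frequent classes (e.g. road) than a plain cross-entropy over the few labeled target points.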

CONCLUSIONS
We introduced the first method for domain adaptation in 3D semantic segmentation featuring a novel 3D point cloud mixing strategy that harnesses both semantic and structural information simultaneously. We developed two variations of our approach: one for unsupervised adaptation (CoSMix-UDA) and another for semi-supervised adaptation (CoSMix-SSDA). We performed comprehensive evaluations in both synthetic-to-real and real-to-real contexts within the UDA and SSDA settings, utilizing large-scale, publicly available LiDAR datasets. Experimental results demonstrated that our approach significantly surpasses current state-of-the-art methods in both contexts. Moreover, detailed analyses underscored the significance of each component within CoSMix, confirming that our mixing strategy effectively addresses the issue of domain shift in 3D LiDAR segmentation. A primary limitation of CoSMix is its reliance on pseudo-labels, making the quality of the initial warm-up model on the source domain crucial to the adaptation performance on the target domain. An alternative approach could involve self-supervised learning in lieu of using source data. Future avenues for research might encompass the incorporation of self-supervised learning tasks, domain generalization, extending CoSMix to source-free adaptation tasks, and its application to 3D object detection.

Fig. 1 :
Fig. 1: Block diagram of CoSMix detailing the UDA and SSDA settings. The UDA setting uses the top and bottom branches (red line). The SSDA setting also uses the middle branch in addition to those used in UDA (gray line). In the top branch, the input source point cloud X s is mixed with the unsupervised target point cloud X t U , obtaining X t→s . In the bottom branch, the input target point cloud X t U is mixed with the source point cloud X s , obtaining X s→t . In the SSDA setting, the labeled target data X t L are mixed with the source point cloud X s and with the unsupervised target point cloud X t U . A teacher-student learning architecture is used in both the UDA and SSDA settings to improve pseudo-label accuracy while adapting over the target domain. This is achieved by updating the teacher network through Exponential Moving Average (EMA). Semantic Selection (f and g) selects subsets of points (patches) to be mixed based on the source labels Y s , the target labels Y t L , and the target pseudo-labels Ŷt U . Compositional Mix applies local h and global r augmentations and mixes the selected patches among domains.
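The EMA update of the teacher network mentioned in the caption can be sketched as follows. This is a minimal illustration over a dictionary of parameter arrays; the decay value shown is a common choice, not necessarily the one used in the paper:

```python
import numpy as np

def ema_update(teacher, student, beta=0.999):
    """Exponential Moving Average update of the teacher parameters.

    teacher, student: dicts mapping parameter names to arrays.
    beta: EMA decay (the mean-teacher update rate, denoted β in the text);
          the teacher moves slowly toward the student.
    """
    for name, s_param in student.items():
        teacher[name] = beta * teacher[name] + (1.0 - beta) * s_param
    return teacher
```

Because the teacher is a slow-moving average of the student, its pseudo-labels become progressively more stable during adaptation, which is what mitigates drift in the self-training loop.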

Fig. 2 :
Fig. 2: Results on SynLiDAR → SemanticPOSS. Source ⋆ predictions are often wrong and mingled within the same region. After adaptation, CoSMix-UDA and CoSMix-SSDA improve segmentation with homogeneous predictions and correctly assigned classes. The red circles highlight regions with interesting results.
See Tab. 9 for the definition of these different versions. When the t → s branch is used, CoSMix (a) achieves an initial 31.6 mIoU, showing that the t → s branch provides a significant adaptation contribution over Source ⋆ . When we also use the s → t branch and the mean teacher β, CoSMix (b-d) further improves performance, achieving 35.4 mIoU. By introducing local and global augmentations in CoSMix (e-h), we can improve performance up to 39.1 mIoU. The best performance of 40.4 mIoU is achieved with CoSMix Full, where all the components are activated.

Fig. 3 :
Fig. 3: Results on SynLiDAR → SemanticKITTI. Source ⋆ predictions are often wrong and mingled within the same region. After adaptation, CoSMix-UDA and CoSMix-SSDA improve segmentation with homogeneous predictions and correctly assigned classes. The red circles highlight regions with interesting results.

Fig. 4 :
Fig. 4: Adaptation results on SynLiDAR → SemanticPOSS with different pre-trained models. We compare the adaptation results of CoSMix (Ours) with ST ⋆ starting from different initialization points (P ⋆ ), indicated with (a-d).

Fig. 5 :
Fig. 5: a) Comparison of the adaptation performance with different point cloud mix-up strategies. Compared to the recent mixing strategies Mix3D [21], PointCutMix [72] and PolarMix [73], our mixing strategy and its variations achieve superior performance. b) Comparison of the adaptation performance over confidence threshold values. Adaptation results show that ζ should be set so as to achieve a trade-off between pseudo-label correctness and object completeness. c) Comparison of the SSDA performance with different mixing strategies: optimization without mix (naive), single branch mixing with source point clouds (sup → s), single branch mixing with unsupervised target point clouds (sup → t). Each variation is named with a different version (a-c). In all the experiments, the Source ⋆ and Target ⋆ performance is the lower and upper bound, respectively.

TABLE 1 :
Overview of existing methods for unsupervised (UDA) and semi-supervised (SSDA) adaptation in point cloud segmentation. For each approach, we report the sensor setup (Setup), the architecture (Input data type and Model), and the source and target datasets. Then, we classify the adaptation strategy as mixup-based, adversarial-learning-based, alignment-based, generative-based, self-training-based or auxiliary-task-based. Furthermore, we report whether the implementation (Code) is publicly available.

On average, we achieve 40.4 mIoU, surpassing ST-PCT by +10.8 mIoU and improving over Source ⋆ by +18.8 mIoU. CoSMix-UDA also improves on difficult classes such as person, traffic-sign, cone, and bike, whose performance is rather low before domain adaptation. ST ⋆ and EntMin ⋆ improve over Source ⋆ . ST ⋆ improves over ST, while EntMin ⋆ achieves lower performance.

Tab. 3 reports the results of SynLiDAR → SemanticKITTI. SemanticKITTI is challenging as the validation sequence includes a wide range of different scenarios with a large number of semantic classes. CoSMix-UDA improves all the classes when compared to Source ⋆ , except for traffic-cone. We believe this is due to the noise introduced by the pseudo-labels on this class and on related classes such as road. CoSMix-UDA improves on 10 out of 19 classes, with a large margin in the classes car, motorcycle, truck, person, road, parking and sidewalk. On average, we achieve state-of-the-art performance with 32.2 mIoU, outperforming ST-PCT by +3.3 mIoU and improving over Source ⋆ by about +8.4 mIoU.

Source ⋆ and Target ⋆ models are the lower and upper bound of the UDA settings. The Fine-tuned ⋆ model is obtained by fine-tuning Source ⋆ with the semi-supervised target samples. It represents the highest attainable bound without any adaptation approach. Fine-tuned ⋆ always outperforms Fine-tuned from [9]. Similarly, the discrepancy between MMD ⋆ and MME ⋆ and the results reported in [9] may be due to a different parameter choice. In SynLiDAR → SemanticPOSS (Tab. 5), CoSMix-SSDA outperforms all the comparison methods on all the classes, except on plants, fence and bike, where MME and MME ⋆ achieve better results. On average, we reach 41.0 mIoU, outperforming APE-PCT by +9.8 mIoU and improving over Source ⋆ by +19.4 mIoU and over Fine-tuned ⋆ by +15.5 mIoU.
The Target ⋆ model is the upper bound of each scenario, with 44.7 mIoU on SynLiDAR → SemanticPOSS and 44.0 mIoU on SynLiDAR → SemanticKITTI. Note that Source ⋆ models always outperform Source. This may be due to a better parameter choice that leads to an improved generalization ability. In SynLiDAR → SemanticPOSS (Tab. 2), CoSMix-UDA outperforms the other methods on all the classes, except on pole where ST achieves a better result.

Real-to-real. Tab. 4 reports the results on SemanticKITTI → nuScenes in the UDA setting. SemanticKITTI → nuScenes is a more challenging direction as source and target sensors are different, and nuScenes has rather sparse point clouds. Source ⋆ and Target ⋆ models achieve 40.1 mIoU and 62.0 mIoU, respectively. CoSMix-UDA outperforms the compared methods on 4 out of 7 classes, with the largest margin on the class person. On average, EntMin ⋆ and ST ⋆ achieve 43.4 mIoU and 43.6 mIoU, showing a limited improvement over Source ⋆ . CoSMix-UDA achieves the best result of 46.2 mIoU, outperforming all the compared methods.

4.4 Quantitative comparison for SSDA
Synthetic-to-real. Tabs. 5 & 6 report the results in the SSDA setting on SynLiDAR → SemanticPOSS and SynLiDAR → SemanticKITTI, respectively. On SynLiDAR → SemanticKITTI (Tab. 6), the largest gains are on cyclist, road, parking, and pole. On average, CoSMix-SSDA achieves 34.3 mIoU, outperforming the best baseline APE-PCT by +7.3 mIoU and improving over Source ⋆ by +10.5 mIoU and over Fine-tuned ⋆ by +9.8 mIoU. Compared to our UDA pipeline, CoSMix-SSDA improves by +2.1 mIoU, showing that the additional target supervision is beneficial for further reducing the domain gap.

TABLE 2 :
Unsupervised adaptation results on SynLiDAR → SemanticPOSS. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source synthetic dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).

TABLE 3 :
Unsupervised adaptation results on SynLiDAR → SemanticKITTI. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source synthetic dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).

TABLE 4 :
Unsupervised adaptation results on SemanticKITTI → nuScenes. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source real dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).
Real-to-real. Tab. 7 reports the results on SemanticKITTI → nuScenes in the SSDA setting. Source ⋆ and Target ⋆ models achieve 40.1 mIoU and 62.0 mIoU, respectively. The Fine-tuned ⋆ model improves over Source ⋆ and achieves 43.5 mIoU. CoSMix-SSDA achieves the best results on 3 out of 7 classes, with the largest margin on the class pedestrian.

TABLE 5 :
Semi-supervised adaptation results on SynLiDAR → SemanticPOSS. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source synthetic dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).

TABLE 6 :
Semi-supervised adaptation results on SynLiDAR → SemanticKITTI. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source synthetic dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).

TABLE 7 :
Semi-supervised adaptation results on SemanticKITTI → nuScenes. We denote our reproduced baselines and results with ⋆ , e.g., Source ⋆ . Source ⋆ and Target ⋆ correspond to the model trained on the source real dataset (lower bound) and on the target real dataset (upper bound), respectively. Results are reported in terms of mean Intersection over the Union (mIoU).

When the t → s branch is active, the pseudo-label filtering g is also utilized, while when f is not active, a fraction α = 0.5 of the source classes is selected randomly. With different combinations of components, we obtain different versions of CoSMix, which we name CoSMix (a-h). The complete version of our method is named Full, where all the components are activated. The Source ⋆ performance is also added as a reference for the lower bound.
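The class selection step described above (a fraction α of the classes in a scan, either weighted by f or drawn uniformly when f is disabled) can be sketched as follows. This is a minimal NumPy illustration with our own function name and a fixed seed for reproducibility:

```python
import numpy as np

def select_classes(labels, alpha=0.5, weights=None, rng=None):
    """Semantic selection: pick a fraction `alpha` of the classes in a scan.

    labels:  (N,) per-point class labels (ground truth or pseudo-labels).
    weights: optional per-class sampling weights (e.g. inverse frequency,
             as in the weighted selection f); when None, classes are drawn
             uniformly at random, mirroring the ablation with f disabled.
    Returns a boolean mask over points whose class was selected.
    """
    rng = rng or np.random.default_rng(0)
    classes = np.unique(labels)
    k = max(1, int(round(alpha * len(classes))))
    if weights is not None:
        p = np.asarray([weights[c] for c in classes], dtype=float)
        p /= p.sum()                       # normalize to a distribution
    else:
        p = None                           # uniform sampling
    chosen = rng.choice(classes, size=k, replace=False, p=p)
    return np.isin(labels, chosen)
```

The returned mask identifies the patches to extract from one domain and paste into the other during compositional mixing.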

TABLE 9 :
Ablation study of the CoSMix components: mixing strategy (t → s and s → t), compositional mix augmentations (local h and global r), mean teacher update (β) and weighted class selection in semantic selection (f). Each combination is named with a different version (a-h). Source ⋆ performance is added as the lower bound and highlighted in gray to facilitate reading.