DAPS3D: Domain Adaptive Projective Segmentation of 3D LiDAR Point Clouds

LiDARs are one of the key sources of reliable environmental ranging information for autonomous vehicles. However, segmentation of 3D scene elements (roads, buildings, people, cars, etc.) based on LiDAR point clouds has limitations. On the one hand, point- and voxel-based segmentation neural networks do not offer sufficiently high speed. On the other hand, modern labeled datasets primarily consist of street scenes recorded for driverless cars and contain little data for mobile delivery robots or cleaners that must work in parks and yards with heavy pedestrian traffic. This article aims to overcome these limitations. We propose a novel approach called DAPS3D to train deep neural networks for 3D semantic segmentation. It is based on a spherical projection of a point cloud and LiDAR-specific masks, enabling the model to adapt to different types of LiDAR. First, we introduce various high-speed multi-scale spherical projection segmentation models, including convolutional, recurrent, and transformer architectures. Among them, the SalsaNextRecLSTM architecture with recurrent blocks shows the best results; in particular, it achieves 83.5% mIoU on the SemanticKITTI dataset with joint categories. Second, we propose several original augmentations for spherical projections of LiDAR data, including FoV, flip, and rotation augmentation, as well as a special T-Zone cutout. These augmentations increase the model's invariance to changes in the data domain. Finally, we introduce a new method to generate synthetic datasets for domain adaptation problems. We have developed two new datasets for validating outdoor 3D scene segmentation algorithms: the DAPS-1 dataset, which is based on the augmentation of a reconstructed 3D semantic map, and the DAPS-2 LiDAR dataset, collected by the on-board sensors of a cleaning robot in a park area. Particular attention is given to the performance of the developed models, demonstrating their ability to function in real time. The code and datasets used in this study are publicly available at: github.com/subake/DAPS3D.


I. INTRODUCTION
Self-driving cars have evolved rapidly over the last decade due to the active development of deep learning. This raises the question of the safety and reliability of the navigation system. A key challenge in the navigation pipeline is semantic segmentation.
Semantic segmentation of 3D point clouds is one of the essential building blocks for autonomous driving systems [1], [2], [3], [4]. It leverages the precise and dense range data from LiDARs to better understand the driving scene and helps to distinguish vulnerable road agents.
Several main groups of methods are popular for solving the point cloud segmentation problem. All of them are learnable and based on different deep neural network architectures [5], [6], [7], [8], [9]. One of their key differences is the 3D scene representation: it can be based on points [10], [11], voxels [6], [12], a point cloud projection onto a sphere [3], [7], or various combinations of these approaches [5], [8]. A separate research area is the fusion of various sensor sources (including onboard cameras) to improve the quality of segmentation of LiDAR point clouds [13], [14].
Currently, state-of-the-art models can solve the segmentation problem on public datasets with high quality [3], [6], [9], [15], [16]. However, the metrics of the models can vary drastically in different environments.
The group of projective neural network methods [3], [7], [17] has the highest performance and real-time potential. At the same time, like all image-based approaches, they are very sensitive to changes in the data domain.
In this article, we explore segmentation of LiDAR point clouds in a data domain where a mobile ground robot with a special sensor setup has to move through park areas with heavy pedestrian traffic. Existing datasets and neural network models are not well suited for high-precision and fast 3D segmentation in this case.
To overcome this problem, we propose a novel approach called DAPS3D for domain adaptive projective segmentation of 3D LiDAR point clouds. It includes new augmentation techniques for 3D LiDAR point clouds that make the final model robust to domain shift:
• The augmentation sets we investigate, including FoV, flip, noise, dropout, and rotation augmentation, as well as a special T-Zone cutout, have not previously been considered in public sources.
• The labeling enhancing process proposed by us made it possible to augment and improve the labeling quality of the RELLIS-3D open dataset [18] from a domain close to ours.
• The proposed original approach to building a semi-synthetic dataset by adding 3D human models to the existing popular SemanticKITTI dataset [15] has been demonstrated in the creation of the DAPS-1 dataset.
Another important part of our DAPS3D approach is the development and study of novel neural network models for projective point cloud semantic segmentation:
• We propose the SalsaNetRec, SalsaNetRecLSTM, and SalsaNextRecLSTM deep neural network architectures. Their difference from the existing SalsaNet [2] and SalsaNext [7] projective baselines is the use of recurrent blocks in the model. This allows us to extract more useful information from point cloud sequences. For the SalsaNextRecLSTM model, it leads to a significant improvement in segmentation quality.
• We make a first attempt to adapt fast 2D segmentation methods based on the convolutional DDRNet [19] and the SegFormer transformer [20] for projective 3D point cloud segmentation.
An additional contribution is that we have developed and presented two new datasets to validate the quality of the proposed DAPS3D approach: the DAPS-1 dataset, based on the original augmentation of a reconstructed 3D semantic map, and the DAPS-2 LiDAR dataset, collected by the onboard sensors of a cleaning robot in a unique park area.
We have demonstrated that the proposed segmentation models from DAPS3D have real-time performance and competitive quality for application in computer vision systems of autonomous robots and vehicles.

II. RELATED WORK
Over the last five years, significant progress in semantic segmentation has been achieved. The main differences between state-of-the-art models lie in the neural network architecture, sensor setup, and data representation. In this section, we review the current state of the semantic segmentation task, paying close attention to data representation and domain stability.

A. POINT- AND VOXEL-BASED POINT CLOUD SEGMENTATION
The first group of models operates on raw point clouds [11], [16], [21]. Models that work with the point-wise representation use 3D convolutional blocks that aggregate information about the geometry and relative positions of the points. The downsides of such approaches are their time complexity, sensitivity to the sensor, and the irregularity of the data. All of this makes them impractical for real-time computation in outdoor environments. However, in offline problem statements, such models often outperform others. There are several noteworthy papers on point-wise models. For example, the authors of S3Net [10] have shown that point-wise models with sparse convolutions can achieve state-of-the-art performance. The authors of [22] showed that such models can also achieve high inference speed using random sampling.
A logical continuation of point-wise methods is voxel-wise methods. They split the space into a grid and aggregate information about every cell. The size of the voxels brings a trade-off between speed and accuracy. This phenomenon was shown and researched in the voxel model papers [5], [12], and [8]. Such a representation inherits the qualities of the point-wise representation, but changing the voxel size gives the models flexibility in speed and domain robustness. Its advantages have been explored in SPVNAS [5], achieving both real-time inference speed and competitive accuracy. Another way to look at voxels was proposed in [6], which adapted the idea of voxels to LiDAR semantic segmentation by changing the shape of the grid cells from cubes to cylinder sectors. One of the recent papers [23] presents a multi-task model that bridges the performance gap between multi-task and multiple single-task networks: a global context pooling module helps extract global context features from LiDAR scans, and a separate head for each task allows segmentation results to be refined based on detection. To deal with sparse distant points, SphereFormer [9] introduced radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It helps to overcome the disconnection issue and enlarges the receptive field of the model.

B. PROJECTIVE POINT CLOUD SEGMENTATION
Another way to use 3D points is to project them onto a sphere. This projection is natural given the LiDAR mechanism, which fires laser beams at fixed yaw and pitch angle steps. This type of projection was used in FIDNet [30] and SalsaNext [7] and showed outstanding performance in terms of the speed/quality trade-off. On the one hand, this representation allows the use of 2D convolutional layers and image augmentations, maintaining high speed and reusing knowledge from computer vision. On the other hand, the overall geometry can be lost, since 2D convolutions can find only two-dimensional patterns. Moreover, models that use the projection as the source of data suffer from domain instability, because the projection operation is sensitive not only to LiDAR placement and configuration but also to robot geometry. An approach that pushes this direction to the limit was presented in [17]. This model uses lightweight harmonic dense convolutions and an improved global context module to achieve high accuracy without sacrificing inference time.

C. FUSION MODELS
Some models additionally use data from other sensors for 3D segmentation [1], [31], [32], [33], [34]. There are many reliable and fast ways to perform semantic segmentation of onboard camera images [4], [35]. Complementing the reliable, accurate, but sparse spatial information obtained from LiDAR with camera images that carry rich information about color and texture expands the view of the surrounding space.
To benefit from LiDAR and camera data, early work has fused information from two sensors to improve the robustness and accuracy of the 3D segmentation algorithm [1], [31], [32], [33]. RGBAL [1] transforms RGB images to a polar-grid representation and outlines early and middle model fusion. PointPainting [33] obtains the segmentation results of images and projects them to the LiDAR space by spherical projection [36], which are subsequently concatenated with the original point cloud features. Zhuang et al. [13] proposed a model with two parallel streams for information from cameras and LiDAR. After the LiDAR point clouds are projected onto the image plane, the data is passed through the network, where the features obtained from the images are mixed with the sparse LiDAR features. Proposed perception-aware losses estimate the vast perceptual difference between LiDAR and camera modalities and boost the fusion of different perceptual information [13].
It is also worth mentioning 2DPASS [14], which uses camera images only at the training stage. This approach enriches representation learning without imposing any restrictions on the input data during validation.

D. DOMAIN ADAPTATION IN 3D SEGMENTATION
The domain adaptation problem among different LiDARs is an essential task: it helps to use collected and synthesized data in real-world situations without additional expenses. Currently, there are several approaches to this issue. The models in [37] and [38] rely on an adversarial loss that helps them learn domain-agnostic embeddings. This method can be used for cross-modal tasks, but it limits the variety of available architectures. Another way to handle the problem is a discrepancy loss, which directly measures the distance between representations across domains. This approach was used in [39]; trained this way, the model's embeddings are forced to be similar across domains. However, if the domain gap is large enough, such models cannot keep up. The last approach is based on synthetic and semi-synthetic data [36]. The drawback of these methods is the need to solve another domain adaptation task or to ensure the generation pipeline produces data from the same domain. However, if generated properly, such data can be used to train any semantic segmentation model with little to no effort and without a drastic drop in quality.
Another way to address the gap between domains is to adjust the ground truth data, and several recent papers explore this option. The authors of [40] propose choosing rectangular areas around selected points from the domains and swapping them. Another approach to this issue, from [41], introduces an encoder-decoder network for adapting the patterns of objects as a whole.

E. DATASETS FOR POINT CLOUD SEGMENTATION
There is a limited number of datasets for solving the problem of semantic segmentation of 360-degree LiDAR data in an urban environment. The key ones are given in Table 1, where we also provide statistics on the proportion of points belonging to the five joint categories among the total number of points in the LiDAR scans. Their mapping is shown in Table 2. These are the most characteristic categories that can be further used when navigating and planning the movement of a robot or an unmanned vehicle.
One of the most popular datasets for evaluating segmentation methods is the large-scale SemanticKITTI dataset [15]. The similar KITTI-360 dataset [27] provides point-wise instance and semantic segmentation for 3D point clouds. The dataset consists of aligned 360-degree 2D images and 3D point clouds. It can be used for benchmarking models on a wide variety of tasks, including semantic scene understanding, novel view appearance synthesis, and many others. There are some other similar outdoor datasets collected from autonomous cars: SemanticPOSS [42] and Pandaset [43]. An important disadvantage of all these datasets is the small number of people in the near zone of the ego-vehicle, at a distance of up to 10-15 meters. The problem with the well-labeled urban Toronto-3D [25] and Paris-Lille-3D [24] datasets is that they do not contain separate LiDAR scans, but rather already combined semantic maps, which limits their use for the real-time segmentation task.
Recently, the two large-scale open datasets nuScenes [28] and Waymo [29] released 3D semantic segmentation extensions. Both of them share the same label schema as the corresponding detection datasets. Their distinguishing features are the presence of only urban street scenes and the absence of park areas.
The relevant off-road datasets are SemanticUSL [26] and its updated version RELLIS-3D [18]. They consist of fully annotated LiDAR scans from off-road environments, with many scenes containing lawns, dirt roads, and various vegetation. At the same time, these datasets contain a relatively small number of labeled people and vehicles.
The limitations of existing datasets, namely the scarcity of people in the near zone and of recordings in park areas, have motivated us to collect and make publicly available our two datasets, DAPS-1 and DAPS-2. A more detailed description is given in Subsection III-D.

III. METHODOLOGY

A. TASK STATEMENT
In the article, we are addressing the domain adaptation problem for 3D semantic segmentation tasks using only one LiDAR sensor.
More formally, the proposed approach, called DAPS3D, takes a raw (N × 3) point cloud P from LiDAR and returns a label vector C_P of length N with a semantic class for every point. A feature of the proposed approach is the study of neural network models M with parameters θ that use the projection π of the LiDAR point cloud onto a sphere. This projection makes it possible to use fast state-of-the-art neural networks designed for processing 2D images to recognize 3D data. Details of the various proposed model architectures M are given in Subsection III-B.
To ensure the domain adaptation of the model, we proposed original sets of augmentations A for the spherical projection, which are discussed in detail in Subsection III-C.
Thus, the segmentation of a LiDAR point cloud can be described by the formula

$$C_P = M_\theta(A(\pi(P))).$$

Learning the optimal parameters θ of the model M is defined as the task of minimizing the loss function L for 2D image segmentation with known ground truth masks C_P^{gt} from the dataset:

$$\theta^* = \arg\min_\theta L\left(M_\theta(A(\pi(P))),\; C_P^{gt}\right).$$

The proposed approach also includes a method to generate synthetic datasets for the domain adaptation problem, which is presented in Subsection III-D. For comparison among different datasets, we decided to aggregate all classes into four ground truth labels: vehicles, humans, surfaces, and static obstacles.
An important aspect of this work is the study of the quality metric of different approaches on different data domains (see Subsection IV-C). Such a metric is Intersection over Union (IoU), generally recognized in the semantic segmentation tasks [7], [8], [20].
The developed model should also provide real-time performance. By real-time, we mean a latency of the algorithm that is less than the period at which sensor data arrives. In our case, this value is 100 ms (sufficient for timely recognition of a three-dimensional scene by a slowly moving mobile ground robot).

B. PROJECTIVE NEURAL NETWORK MODEL

1) DATA REPRESENTATION
In our research, we use spherical projection as the data representation for our models. The main benefits of this representation are real-time inference speed, access to augmentation methods and model architectures from 2D computer vision, and state-of-the-art quality. The projection process creates a naturally dense and compact representation with minimal information loss. To obtain such a Range View (RV) image, each 3D point (x, y, z) from the raw LiDAR point cloud is associated with an image point (u, v) via the projection π:

$$u = \frac{1}{2}\left[1 - \arctan(y, x)\,\pi^{-1}\right] w, \qquad v = \left[1 - \left(\arcsin(z\,r^{-1}) + f_{up}\right) f^{-1}\right] h,$$

where h and w represent the height and width of the spherical projection, r denotes the range of each point, $r = \sqrt{x^2 + y^2 + z^2}$, and f defines the sensor vertical field of view in the range [f_down, f_up]. This formula, with correct LiDAR parameters, guarantees that every point from the 3D point cloud will have a corresponding pixel with the (x, y, z) vector stored in it. However, if we widen the vertical field of view, points will overlap, which leads to a dense range image but with some information loss. The same artifact occurs when we choose h and w smaller than the grid parameters of the LiDAR. Importantly, even though we lose some information, the range image remains dense. We exploit this fact during training by using a wider FoV and smaller projection sizes to match our target domain.
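A minimal NumPy sketch of this projection (anchoring the vertical coordinate on f_down follows the common open-source implementation of this formula; h, w, and the FoV defaults are illustrative):

```python
import numpy as np

def spherical_projection(points, h=64, w=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an (h, w, 4) range-view image
    storing (x, y, z, r) per pixel, following the projection pi above."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = abs(fov_up) + abs(fov_down)             # total vertical FoV f

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.maximum(np.sqrt(x ** 2 + y ** 2 + z ** 2), 1e-8)  # range of each point

    # Horizontal pixel from azimuth, vertical pixel from elevation.
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * w
    v = (1.0 - (np.arcsin(z / r) + abs(fov_down)) / fov) * h

    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    # Write points in order of decreasing range so closer points win overlaps.
    order = np.argsort(r)[::-1]
    image = np.zeros((h, w, 4), dtype=np.float32)
    image[v[order], u[order]] = np.stack([x[order], y[order], z[order], r[order]], axis=1)
    return image
```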

2) BASELINE MODELS
As baselines, SalsaNet [2] and SalsaNext [7] were picked due to their performance/speed trade-off. These models are fast and convertible to TensorRT, which is crucial for our primary task. In our work, we use modifications of these models that take only the (x, y, z, r) vector instead of (x, y, z, r, i), where r is the range and i is the LiDAR-dependent intensity value [7]. The reason is that in the domain adaptation scenario, we do not want to rely on additional data that can vary from dataset to dataset.

3) RECURRENT MODELS
The original segmentation neural networks take a range-view image obtained after spherical projection. This type of data representation allows the use of 2D convolutional recurrent blocks. In our DAPS3D approach, we propose the SalsaNetRec and SalsaNextRec models (Figure 1) with two different types of recurrent blocks for different purposes (Figure 2).
The models we have named SalsaNetRecLSTM and SalsaNextRecLSTM contain block 1 shown in Figure 2. This is the LSTM 2D block described in [46]. Its main goal is to carry the idea of memory from the original LSTM block over to sequences of image data. The recurrent block 2 (Figure 2) is used directly in the architecture called SalsaNetRec. It was proposed in [47] and aims to leverage the recurrent architecture to enhance feature representations for segmentation without changing the number of parameters.
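A minimal sketch of a 2D convolutional LSTM cell of the kind used in these blocks (the exact block follows [46]; the gate layout below is the common ConvLSTM formulation, not the authors' verbatim code):

```python
import torch
import torch.nn as nn

class ConvLSTM2DCell(nn.Module):
    """A minimal 2D convolutional LSTM cell over feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state                              # hidden and cell feature maps
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Example: process a short sequence of range-view feature maps (T x B x C x H x W).
cell = ConvLSTM2DCell(in_channels=32, hidden_channels=32)
h = c = torch.zeros(1, 32, 16, 128)
for frame in torch.randn(5, 1, 32, 16, 128):
    out, (h, c) = cell(frame, (h, c))
```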

4) ATTENTION-AND TRANSFORMER-BASED MODELS
We have also considered adapting 2D segmentation models to the 3D segmentation task. The DDRNet [19] and SegFormer [20] models were chosen for their quality/speed trade-off. Due to the specifics of spherical projections of LiDAR point clouds, we adjusted the network architectures to narrow, elongated images by changing the size of the convolution kernels and their strides. The difference lies in the increased resolution of feature maps along the vertical axis, as shown in Figure 3.
For SegFormer, we changed the kernel size and stride of the first transformer block, which increased the size of the subsequent feature maps. The SegFormer architecture with feature map shapes is shown in Figure 4.
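A hypothetical sketch of such an adjustment for a SegFormer-style first patch embedding; the kernel and stride values below are illustrative, chosen only to show how vertical resolution can be preserved for narrow range images:

```python
import torch.nn as nn

class RangeViewPatchEmbed(nn.Module):
    """Overlapping patch embedding adapted to narrow, elongated range images:
    no downsampling along the height axis, standard downsampling along width."""

    def __init__(self, in_chans=4, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=(3, 7),   # smaller kernel vertically
                              stride=(1, 4),        # keep full vertical resolution
                              padding=(1, 3))
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                            # (B, C, H, W/4)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)            # (B, H*W/4, C) token sequence
        return self.norm(x), h, w
```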
We have also tested the effects of different attention mechanisms. We picked the Object Context module [44] and the more recent Deformable Attention module [45] and inserted them into the DDRNet architecture, as shown in Figure 3. The Object Context module was chosen for its ability to improve the quality of fast image semantic segmentation models, as shown, for example, in [35]. The Deformable Attention module is well suited for effectively expanding the receptive field of neural networks [45]. The DAPPM (Deep Aggregation Pyramid Pooling Module) block shown in Figure 3 has an architecture similar to the original model [19] and takes into account information from multi-scale feature maps when segmenting the projection of a point cloud.

5) EVALUATION METRICS
The common Intersection over Union (IoU) is used as a basic quality metric to evaluate the results of the proposed approaches. Due to practical usefulness and simplicity when testing on different datasets, the semantic categories in them were previously grouped into four main classes: ''vehicle'' (including cars, trucks, buses, bicycles, motorcycles, and other vehicles), ''human'' (including pedestrians, bicyclists, and motorcyclists), ''surface'' (including road, parking, road markings, sidewalks, terrain, i.e. surfaces where the ground mobile robot can pass), and ''obstacle (static)'' (including buildings, fences, poles, vegetation, etc.).
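For reference, the per-class IoU and its mean are computed in the standard way:

$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}, \qquad \mathrm{mIoU} = \frac{1}{|C|} \sum_{c \in C} \mathrm{IoU}_c,$$

where $TP_c$, $FP_c$, and $FN_c$ are the point-wise true positives, false positives, and false negatives for class $c$, accumulated over the whole evaluation set.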

C. AUGMENTATIONS
In this work, we have developed geometry-preserving augmentations for 3D point clouds and their projections to improve the quality of the projective semantic segmentation model (see Figure 5). The main goal was to introduce a variety of data to the model during the training process to achieve better domain stability. Every augmentation aims to cover a specific problem that causes artifacts during the inference.

1) FoV AUGMENTATION
One of the major problems in utilizing pre-trained models is the difference in the field of view, since projective models are sensitive to this LiDAR parameter. To overcome this problem, we propose a Field of View augmentation that varies the vertical field of view used when building the spherical projection, so that the model learns to handle projections produced by LiDARs with different FoV parameters.
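Assuming the spherical_projection helper sketched in Subsection III-B1, this augmentation can be expressed as a re-projection with perturbed FoV bounds (the perturbation range and base FoV values here are illustrative):

```python
import numpy as np

def fov_augmentation(points, rng=np.random):
    """Re-project the same cloud with randomly perturbed vertical FoV bounds."""
    d_up = rng.uniform(-2.0, 2.0)     # degrees, illustrative range
    d_down = rng.uniform(-2.0, 2.0)
    return spherical_projection(points,
                                fov_up_deg=3.0 + d_up,
                                fov_down_deg=-25.0 + d_down)
```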

2) HEIGHT AUGMENTATION
Another significant challenge is the variation in LiDAR height, caused by differences in LiDAR setups. It drastically changes the point of view on objects, making them unrecognizable to the model. To narrow this gap, a height augmentation was developed: the whole point cloud is shifted vertically by the same value. To keep the scene realistic, the algorithm first finds the ground height and then shifts all points by a random value.
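A minimal sketch, assuming the ground height can be approximated by a low percentile of the z coordinates (the authors' ground search may be more elaborate, and the shift range is illustrative):

```python
import numpy as np

def height_augmentation(points, max_shift=0.3, rng=np.random):
    """Shift the whole cloud vertically to simulate a different LiDAR mount height."""
    ground_z = np.percentile(points[:, 2], 1.0)   # crude ground-height estimate
    shift = rng.uniform(-max_shift, max_shift)
    shifted = points.copy()
    shifted[:, 2] += shift
    # Return the new ground height as well, e.g. for placing inserted objects later.
    return shifted, ground_z + shift
```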

3) FLIP AND ROTATION AUGMENTATION
We apply flip and rotation augmentation to the projection to overcome the bias of the LiDAR's rotation angle. These deformations can be interpreted as rotating and flipping the LiDAR sensor and can help in non-road environments. They are also natural to use because both flips and rotations preserve the geometry.
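In range-view space these operations reduce to a horizontal mirror and a circular shift along the width axis; a sketch, assuming an (h, w, 4) image with (x, y, z, r) channels as in the projection sketch above:

```python
import numpy as np

def rotate_projection(rv_image, rng=np.random):
    """Random yaw rotation of the LiDAR == circular shift along the width axis.
    Stored (x, y) channels are rotated to match the shifted azimuth."""
    h, w = rv_image.shape[:2]
    shift = rng.randint(w)
    angle = 2.0 * np.pi * shift / w
    out = np.roll(rv_image, shift, axis=1)
    x, y = out[..., 0].copy(), out[..., 1].copy()
    out[..., 0] = x * np.cos(angle) + y * np.sin(angle)
    out[..., 1] = y * np.cos(angle) - x * np.sin(angle)
    return out

def flip_projection(rv_image):
    """Horizontal mirror of the range view == reflecting the scene (y -> -y)."""
    out = rv_image[:, ::-1].copy()
    out[..., 1] *= -1.0               # keep the stored y coordinate consistent
    return out
```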

4) MASK NOISING AND DROPOUT AUGMENTATION
Mask noising and dropout are further regularization techniques borrowed from computer vision. They are applied after the projection process and can be interpreted as occlusion and sensor noise in parts of the image. This method helps prevent overfitting on commonly shaped objects and lets the model encounter the ''zero'' patches that may be created during field-of-view augmentation.
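A sketch with illustrative patch sizes and noise level (corresponding label pixels would be set to the empty class in the same way):

```python
import numpy as np

def mask_noise_and_dropout(rv_image, max_patches=4, patch_hw=(8, 64),
                           noise_std=0.01, rng=np.random):
    """Add mild Gaussian noise and zero out random rectangles, imitating
    sensor noise and occlusion / dropout artifacts."""
    h, w = rv_image.shape[:2]
    out = rv_image + rng.normal(0.0, noise_std, rv_image.shape)
    for _ in range(rng.randint(1, max_patches + 1)):
        ph = rng.randint(1, patch_hw[0])
        pw = rng.randint(1, patch_hw[1])
        top = rng.randint(0, h - ph)
        left = rng.randint(0, w - pw)
        out[top:top + ph, left:left + pw] = 0.0   # "zero" patch
    return out
```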

5) HUMAN AUGMENTATION
To address differences in class balance between the two datasets, we developed a human augmentation. During the point cloud augmentation phase, random samples from a library are added to the scene. The library of samples was collected from the nuScenes dataset via its API. Specifically, we extracted the bounding boxes of people with corresponding instance IDs based on nuScenes labels. To ensure high-quality samples, we filtered out noise by retaining only samples with high point density. Similar to the height augmentation, we calculated the ground height and placed the humans at the corresponding height to keep the scene realistic.
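A simplified sketch of the placement step, with an illustrative placement radius; human_sample is an (M, 3) point sample from the library:

```python
import numpy as np

def insert_human(points, human_sample, ground_z, rng=np.random):
    """Place a human point-cloud sample at a random position on the ground."""
    sample = human_sample.copy()
    sample[:, 2] += ground_z - sample[:, 2].min()   # put the feet on the ground
    angle = rng.uniform(0.0, 2.0 * np.pi)
    radius = rng.uniform(2.0, 15.0)                 # illustrative near-zone range
    sample[:, 0] += radius * np.cos(angle)
    sample[:, 1] += radius * np.sin(angle)
    return np.vstack([points, sample])
```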

6) T-ZONE AUGMENTATION
In real-life scenarios, the configuration of the robot used in experiments can differ significantly from the configuration of the robot used for dataset recording. Some points of the point cloud can fall on the body of the robot or be occluded by various sensors. To simulate the presence of our robot's body, we employ the T-Zone augmentation. It cuts out from the projection of the LiDAR point cloud those points that in real life would fall on the robot's body, as shown in Figure 5.
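A sketch, assuming the robot body is precomputed as a binary range-view mask from the robot's 3D model:

```python
import numpy as np

def t_zone_cutout(rv_image, labels, body_mask):
    """Zero out pixels covered by the robot body.
    body_mask: (h, w) boolean array, True where the body would occlude the view."""
    out = rv_image.copy()
    out[body_mask] = 0.0
    lab = labels.copy()
    lab[body_mask] = 0        # map occluded pixels to the empty/ignore class
    return out, lab
```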

D. DEVELOPMENT OF DATASETS

1) MOTIVATION
In our research, we explored the possibility of adopting a SalsaNext-like [7] solution for small autonomous robots that use a different LiDAR sensor configuration. However, we encountered a challenge due to a data domain shift. The original SalsaNext paper [7] and the available open-source research datasets mainly focus on self-driving cars, which did not match our specific LiDAR setup. Our LiDAR sensors have a different beam angle and a different number of beams compared to the dataset SalsaNext was designed for. Consequently, the existing pre-trained models designed for projective segmentation methods were unsuitable for our target data domain. Addressing this issue was neither straightforward nor cost-effective. Hence, we designed a highly efficient alternative approach.
It leverages the SemanticKITTI dataset [15] as a foundation. To simulate the LiDAR data captured by a small robot with a different LiDAR configuration, we created a synthetic dataset called DAPS-1. This synthetic dataset mirrors SemanticKITTI but emulates the data as if it were obtained from a small robot with our desired LiDAR configuration. The creation of DAPS-1 involved two distinct steps:
• Mesh Map Generation: first, we constructed a detailed mesh map of the scene. This involved representing the 3D structure of the environment using interconnected triangles. By creating an accurate mesh map, we were able to effectively model the scene and its objects.
• Virtual LiDAR Sensor Simulation: once the mesh map was in place, we simulated a virtual LiDAR sensor that precisely matched the LiDAR configuration of our small robot. This virtual LiDAR sensor was used to generate data that closely resembles the LiDAR readings captured by our desired setup.
By creating DAPS-1, we successfully prepared a synthetic dataset that emulates the LiDAR data collected by our small robot with a different LiDAR configuration. This synthetic dataset enabled us to pre-train the SalsaNext architecture and reduce the size of the subsequent DAPS-2 dataset required for fine-tuning.

2) SemanticKITTI
First of all, we used the popular SemanticKITTI dataset [15]. We use the same training, validation, and test split as described in the original SalsaNext paper [7]. Specifically, we used sequence 08 as the validation set, and the remaining sequences from 00 to 10 were chosen as the training set. The dataset has 22 classes, which we mapped into our five cross-domain classes for unified evaluation, as shown in Table 2. This dataset was selected due to its size and variety of road scenes.
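An illustrative fragment of such a mapping for SemanticKITTI (raw label IDs follow the public semantic-kitti-api configuration; the authoritative grouping is the one in Table 2):

```python
# Cross-domain classes: 0 unlabeled, 1 vehicle, 2 human, 3 surface, 4 obstacle.
KITTI_TO_JOINT = {
    0: 0,                                        # unlabeled
    10: 1, 11: 1, 15: 1, 18: 1, 20: 1,           # car, bicycle, motorcycle, truck, other-vehicle
    30: 2, 31: 2, 32: 2,                         # person, bicyclist, motorcyclist
    40: 3, 44: 3, 48: 3, 49: 3, 72: 3,           # road, parking, sidewalk, other-ground, terrain
    50: 4, 51: 4, 70: 4, 71: 4, 80: 4, 81: 4,    # building, fence, vegetation, trunk, pole, sign
}
```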

3) EXTENDED RELLIS-3D
The main reason for selecting RELLIS-3D dataset [18] was its size and unique off-road environment. We split the dataset into training and validation sets, taking the first three sequences as our training data.
During the training process on RELLIS-3D, we found that some of the points belonging to the ''human'' class in the close zone were not labeled (Figure 6). This problem had a significant impact on model evaluation, because such points map to the empty class, which does not affect training, even though the underlying point information is correct. The model, however, began to accurately label such points after 80 epochs of training on the SemanticKITTI and RELLIS-3D datasets.
We then used this trained model to refine the labels. We segmented the original point cloud and assigned the ''human'' class to the points that the model attributed to ''human'' and that were labeled as the empty class in the original annotation. Our proposed labeling enhancement process is shown in Figure 7.

4) DAPS-1
The natural approach to overcoming the domain gap (see Figure 8) is to capture or synthesize realistic data from a similar domain. This idea was implemented in our semi-synthetic DAPS-1 dataset (see Figure 9), which was generated based on SemanticKITTI [15]. Its main advantage is that the process can be reproduced for any setup given only a 3D model of the robot and the LiDAR parameters. The pipeline consists of five steps. First, the data is converted into a 3D semantic mesh using the Kimera-Semantics approach [48], as shown in Figure 10. Then the layers of the map are separated in the open-source program MeshLab [49]. It is important to note that Kimera-Semantics works only with static objects, which is why we manually added human models, generated in the MakeHuman [50] program, along the track. The next step is to simulate the semantic layers with the desired LiDAR configuration in Gazebo [51] using an open-source plugin. Finally, we reproduce the original trajectories of the robot and capture the data into rosbag archives.

5) DAPS-2
This dataset was recorded during a real field trip of the cleaning robot (see Figure 11) to the territory of the VDNH Park in Moscow in the summer of 2021. The robot carried a configuration of three LiDARs (a central one and two side ones). DAPS-2 contains several robot scenes in different parts of the park with different pedestrian densities, with all points coming from the main central LiDAR.

6) CATEGORY MAPPING
In our work, we combined the main semantic categories for 3D scene elements: ''vehicle'', ''human'', ''surface'', and ''obstacle (static)''. We did not include the ''unlabeled'' category in our models. Table 2 shows the category mapping between our five cross-domain classes and the labels from the different datasets.

IV. EXPERIMENTS

A. MODEL TRAINING
We implemented our models in PyTorch and performed training and evaluation on an NVIDIA GeForce RTX 2080 Ti GPU. The SalsaNet- and SalsaNext-based models were trained from scratch with randomly initialized weights. During training, we applied various data augmentation configurations, which are specified in Table 3. We do not crop projected point clouds and set the batch size to 8 on all datasets. We trained the models using an SGD optimizer for 40 epochs, the first of which was a warm-up epoch. The momentum for SGD was set to 0.9 and the learning rate to an initial value of 0.05, with learning rate decay by a factor of 0.99.
For SegFormer, we utilized the MiT-B1 encoder pre-trained on the ImageNet-1K dataset and randomly initialized the decoder. We set the batch size to 6 and the learning rate to an initial value of 0.001. All other settings remained the same as in the configuration for the SalsaNet-based models.
The DDRNet-based models share the same configuration as the SalsaNet-based models, except for the number of epochs and the learning rate scheduler. We trained these models using the SGD optimizer for 140 epochs, the first of which was a warm-up epoch. The momentum for SGD was set to 0.9 and the learning rate to an initial value of 0.05, with learning rate decay by a factor of 0.99 for the first 40 epochs and a factor of 0.96 for the rest.
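This schedule can be sketched as follows (the warm-up multiplier for the first epoch is our assumption; the paper only states that the first epoch is a warm-up):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 5, kernel_size=3, padding=1)  # stand-in for the segmentation net
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

def lr_factor(epoch):
    if epoch == 0:
        return 0.1                                 # assumed warm-up multiplier
    if epoch <= 40:
        return 0.99 ** epoch                       # base decay, all SalsaNet-based models
    return 0.99 ** 40 * 0.96 ** (epoch - 40)       # DDRNet variant after epoch 40

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(140):
    # ... train one epoch ...
    scheduler.step()
```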

B. INFLUENCE OF DATA AUGMENTATION
The results of the models on the SemanticKITTI + RELLIS-3D dataset are reported in Table 4. The parameters of the augmentations are shown in Table 3. Our proposed Augmentation set 5 improves the quality of the baseline model by 4.3% in mIoU and by 8.5% in the human class, which is both underrepresented and one of the most important classes during inference. This indicates that we can effectively regularize our model with augmentations during the training phase to fix the class balance problem without any drop in quality. From the table, it can be observed that the second augmentation set is the best on the dynamic classes, while the fourth set is the best on the static classes. The reason is that human augmentation brings a lot of new information about dynamic objects, even though the data is noisy, because the dataset used to create the sample library had drastically different LiDAR parameters. The results also demonstrate that we can modify and train sequential models within our framework. The recurrent modifications of SalsaNet, trained in the same pipeline, improve the quality of the semantic segmentation. The effectiveness of the recurrent blocks is also evident, as they achieve quality similar to the best augmentation sets while using only the most neutral augmentations.

C. EXPERIMENTS ON DIFFERENT DATA DOMAINS
1) EXPERIMENTS ON SEMANTICKITTI
Table 5 shows the segmentation results for the four joint classes ''vehicle'', ''human'', ''surface'', and ''obstacle'' on LiDAR data from SemanticKITTI [15]. All models were trained using Augmentation set 5. SalsaNextRecLSTM shows the best overall metrics and demonstrates a significant improvement over the basic SalsaNext model.

2) EXPERIMENTS ON SEMANTICKITTI + RELLIS-3D
Table 6 presents the segmentation results on LiDAR data from SemanticKITTI [15] extended with RELLIS-3D [18]. The extension was performed by grouping the categories from both datasets into four categories. All models are trained with a configuration including Augmentation set 5 and the T-Zone cutout. SalsaNextRecLSTM demonstrates the best result on SemanticKITTI extended with RELLIS-3D. The SalsaNet and SalsaNext approaches demonstrate the second-best results after it. DDRNetDA is inferior in ''human'' segmentation compared to DDRNetOC and the basic DDRNet. Figure 12 displays the visualized results of SalsaNet-based models on RELLIS-3D. Both recurrent models and SalsaNext outperform SalsaNet in segmenting ''human'' legs. However, the DDRNet models have artifacts in the ''human'' category and detect fewer instances of ''surface''.

3) EXPERIMENTS ON DAPS-1
Segmentation results on the DAPS-1 synthetic dataset are shown in Table 7. All models are trained with Augmentation set 5 and the T-Zone cutout. Models with recurrent blocks are inferior in segmentation quality to SegFormer by 0.5-1%. The DDRNet and SegFormer models notably struggle in the ''human'' category because, due to the low resolution, they cannot distinguish a ''human'' from its surroundings. Figure 13 showcases the visualized results of the neural network models on the DAPS-1 dataset. All models encounter challenges segmenting the distant ''vehicle'', and the result is noisy due to the presence of a nearby ''human''. The SalsaNetRec model mistakenly recognized the ''surface'' to the left, and SalsaNet labeled parts of the first ''vehicle'' as ''obstacle''. The proposed SalsaNextRecLSTM model turned out to be the best method, surpassing the others both in metrics and in the visualized results.

4) EXPERIMENTS ON DAPS-2
A comparison of the quality metrics of semantic segmentation on the DAPS-2 park-area dataset is shown in Table 8. All models are trained with Augmentation set 5. Models trained on SemanticKITTI + RELLIS-3D and DAPS-1 additionally use the T-Zone augmentation. Models trained only on SemanticKITTI show low quality metrics due to the different data domain and the small number of green park zones in the training set. SalsaNextRecLSTM significantly outperformed the other models on SemanticKITTI + RELLIS-3D, DAPS-1, and DAPS-2. It should be noted that the LSTM-based recurrent block gives an increase in quality for both the SalsaNet and SalsaNext baselines.
The IoU for the ''surface'' category of SegFormer trained on SemanticKITTI + RELLIS-3D is an outlier, which might be caused by a large number of ''human'' instances. Models with recurrent blocks trained on DAPS-1 outperform the basic SalsaNet by 3.4-4.9% mIoU. Figure 14 shows the visualized results of various models on DAPS-2. All listed models are trained on DAPS-1 with Augmentation set 5 and the T-Zone cutout. SalsaNextRecLSTM outperforms other models in the ''human'' and ''obstacle'' classes and shows one of the best results for the ''surface'' category. The DDRNetOC result is less noisy compared to other models, and it did not label the bench as a ''vehicle''. It is worth noting that, due to the low resolution, the DDRNet and SegFormer models often mark objects around a ''human'' with an inappropriate class.
The importance of domain adaptation is demonstrated in Figure 15. The creation of the DAPS-1 synthetic dataset with a real 3D model of the robot and the LiDAR parameters is crucial for real-life applications. All listed models demonstrate significantly better results, especially in the ''human'' category. Domain adaptation boosted the quality metric of SalsaNetRecLSTM by more than 43% and that of DDRNetOC by 23%.
All our models have a small number of trainable parameters, except SalsaNetRecLSTM, and provide real-time point cloud segmentation on an RTX 2080 Ti GPU, as shown in Table 9.

V. CONCLUSION
In our article, we extensively explored the projective segmentation of three-dimensional LiDAR point clouds and investigated the domain adaptive approach called DAPS3D.
We observed that existing open datasets lack scenes captured from mobile robot LiDARs in parks or on roads with a significant number of people nearby. To address this limitation, we proposed an approach to augment existing datasets by including ''human'' objects in the near zone of the vehicle. We created the semi-synthetic DAPS-1 dataset using this approach. Additionally, we collected the DAPS-2 dataset in a park area using a real robot. These datasets showcased the potential of domain transfer when training different state-of-the-art deep neural networks on other open datasets.
The paper introduces enhancements to existing basic neural network methods for projective segmentation, focusing on the fast SalsaNet model with the incorporation of various types of recurrent blocks. The SalsaNextRecLSTM model, which we developed, achieved the best performance when evaluated on the proposed DAPS-1 and DAPS-2 datasets. For this model, the integration of recurrent blocks provided a significant increase in the mIoU metric: by 1.4% on the SemanticKITTI dataset, by 4.6% on DAPS-1, and by 6.9-11.6% on DAPS-2 compared to the basic state-of-the-art projective model SalsaNext.
We also explored the application of convolutional models for 3D segmentation using the real-time DDRNet architecture with different attention modules, and the SegFormer transformer method, which we adapted for LiDAR spherical projections. However, they did not surpass the SalsaNet- and SalsaNext-based approaches in terms of quality and performance.
We proposed and demonstrated the effectiveness of various augmentation techniques. Among them, the combination of field of view, flip and rotate augmentations, along with the T-Zone cutout approach, yielded the best results in training neural network models on open datasets. These augmentations ensured the stability of projective neural network segmentation methods when encountering variations in the data domain, including the use of different LiDAR devices.
In our experiments evaluating the proposed and investigated approaches, we found that PyTorch implementations of the various methods exhibited latencies ranging from 14.4 to 30.2 ms when tested on the widely used RTX 2080 Ti GPU. This performance highlights the potential for model acceleration or for their direct use in the onboard perception systems of autonomous vehicles.
In our future work, we aim to significantly increase the volume of the DAPS-2 dataset with other park zones. It is promising to study the approaches of spatiotemporal feature aggregation [52] on data sequences along with recurrent blocks as part of models for point cloud projective segmentation. We also plan to explore the adaptation of the proposed neural network models for use on edge devices.
ALEKSANDR KHORIN was born in Tula, Russia, in 1999. He received the B.A.Sc. degree in mathematics and computer science from the Moscow Institute of Physics and Technology, Moscow, Russia, in 2021.
Since 2021, he has been a Research Engineer with the Intelligent Transport Laboratory, Moscow Institute of Physics and Technology. His research interests include computer vision, optical character recognition, and deep learning. Mr. Khorin was a bronze medalist of the All-Russian Student Olympiad ''Yandex'' in ''Mathematical Modeling'' in 2021 and a recipient of the grant of the President of the Russian Federation for outstanding abilities in educational and scientific activities from 2021 to 2022.