SEE-CSOM: Sharp-Edged and Efficient Continuous Semantic Occupancy Mapping for Mobile Robots

Generating an accurate and continuous semantic occupancy map is a key component of autonomous robotics. Most existing continuous semantic occupancy mapping methods neglect the potential differences between voxels and therefore reconstruct an overinflated map. Moreover, these methods have high computational complexity due to their fixed and large query range. To address the challenges of overinflation and inefficiency, this article proposes a novel sharp-edged and efficient continuous semantic occupancy mapping algorithm (SEE-CSOM). The main contribution of this work is the design of the Redundant Voxel Filter Model (RVFM) and the Adaptive Kernel Length Model (AKLM) to improve the performance of the map. RVFM applies context entropy to filter out the redundant voxels with a low degree of confidence, so that the representation of objects has accurate boundaries with sharp edges. AKLM adaptively adjusts the kernel length with class entropy, which reduces the amount of data used for training. Then, the multientropy kernel inference function is formulated to integrate the two models and generate the continuous semantic occupancy map. The algorithm has been verified on indoor and outdoor public datasets and implemented on a real robot platform, validating the significant improvement in accuracy and efficiency.


I. INTRODUCTION
The essence of robot mapping is to employ sparse noisy sensor observations to construct a dense accurate representation, which is regarded as a fundamental problem in robotics [1]. As robots are required to perform more intelligent tasks, incorporating semantic information can further help them distinguish object categories and allow a higher level of environmental representation [2]. Currently, the most widely used mapping technique is the occupancy grid map [3]. Most grid mapping methods are noncontinuous, assuming that voxels are statistically independent, which contradicts the fact that real-world object surfaces are usually smooth. Recent success in Bayesian kernel inference has boosted the development of continuous mapping, such as BGKOctoMap [4], BGKOctoMap-L [5], and S-BKI [6]. They incorporate local spatial correlations into the mapping model, which can infer the continuous surface from sparse sensor data. However, the potential differences between voxels are not fully exploited and all voxels are treated equally, so the voxels next to the object are misclassified as occupied, resulting in overinflated objects. Such an overinflated map is not suitable for robot navigation tasks, because traversable free space might be falsely blocked. Therefore, the main objective of this article is to design a novel continuous semantic occupancy mapping algorithm that can mitigate overinflation while improving efficiency.
The first challenge is to infer the voxels that are worth filling, so as to mitigate the overinflation phenomenon of existing continuous mapping methods [4], [5], [6]. As shown in Fig. 1, the discrete map only recovers the area hit by the sensor observations, leaving many loopholes. These mapped voxels are defined as observed voxels in this article. Other unknown voxels can be divided into two types: the voxels located in the loopholes due to the lack of observations are called inactive voxels (yellow masks in Fig. 1), while those in the free space outside the obstacle surface are called redundant voxels (red masks in Fig. 1). Continuous mapping is expected to fill in only the inactive voxels, generating a representation similar to the ground truth map. However, the state-of-the-art continuous mapping method S-BKI [6] does not distinguish between unknown voxels and builds an overinflated map by falsely filling in redundant voxels (see Fig. 1). To accurately reconstruct the scene, the redundant voxel filter model is proposed to filter out redundant voxels by measuring context entropy, which aims to increase the confidence level in the inference process. Our map restores smooth surfaces for the objects while generating precise boundaries with sharp edges.
The second challenge is to reduce the computational complexity of continuous semantic occupancy mapping, thereby improving efficiency. Current continuous approaches [8], [9], [10] adopt a fixed kernel length, which is $n$ times the voxel size. This operation increases the computational complexity by a factor of $n^3$ compared to discrete approaches. To reduce the time cost, the Adaptive Kernel Length Model is proposed to adjust the kernel length adaptively by introducing class entropy, which measures the overall uncertainty of a voxel. A large class entropy usually indicates that the voxel is located at the junction of objects or contains noisy observations, so a large query range is needed to improve accuracy. When the class entropy is small, a small kernel length is sufficient for accuracy.
In summary, overinflation and inefficiency are two challenging problems of continuous semantic occupancy mapping. This article proposes a novel sharp-edged and efficient continuous semantic occupancy mapping algorithm (SEE-CSOM) by extending [11]. The overall continuous semantic occupancy mapping problem is mathematically formulated and its probabilistic model is derived. The main contributions of this work are listed as follows.
1) Redundant voxel filter model (RVFM) is proposed to filter out redundant voxels distinguished by context entropy, which moderates the overinflation phenomenon.
2) Adaptive kernel length model (AKLM) is proposed to assign an appropriate kernel length to each voxel by class entropy, which improves the mapping efficiency.
3) The proposed algorithm has been verified on indoor and outdoor public datasets and a real robot. Qualitative and quantitative results show the superiority of the algorithm in improving accuracy and efficiency.
The rest of this article is organized as follows. Section II reviews the related works. Section III introduces the SEE-CSOM algorithm. Section IV shows the experimental results. Section V concludes this article.

II. RELATED WORKS
In this section, existing algorithms related to continuous semantic occupancy mapping are introduced, including semantic mapping and continuous mapping.

A. Semantic Mapping
With the rapid development of deep learning, semantic mapping has attracted increasing attention. Early semantic mapping methods directly use semantic images for mapping. The authors in [12] use the Bayesian framework to filter probabilistic segmentation from multiple views in a voxel-based 3-D map. In [13], the street-level image label estimates are aggregated to annotate the 3-D volume. These methods are the pioneers of semantic mapping, but lack further optimization. To optimize incorrect voxel labels, CRF has become a research hotspot [14], which can simulate the long-distance relationships in a region, such as grids corresponding to 2-D superpixels [15] or grids within supervoxels [16]. In [17], a novel high-order CRF model is applied to optimize 3-D grid labels. Recently, a hierarchical framework for collaborative probabilistic semantic mapping is proposed in [18]. Besides, relative location among robots can be estimated by matching semantic maps [19], [20].
Various methods mentioned above promoted the development of semantic mapping. However, these methods assume that the voxels are independent and do not reconstruct the continuous surface of the objects.

B. Continuous Mapping
To construct a smoother occupancy map, many methods have attempted to relax the assumption that the voxels are independent, such as GPmap [8], Hilbert map [21], etc. GPmap [8] introduced a dependence relationship between points as the nonparametric Bayesian inference process, which has also been extended to semantic mapping [10]. However, $O(n^3)$ computational complexity has limited its application to large-scale online mapping [9]. Hilbert map [21] makes use of fast kernel approximations to enable faster training in $O(n)$ time. In addition, a real-time incremental 3-D Hilbert map has been proven to be feasible [22]. Recently, Bayesian kernel inference with $O(\log n)$ computational complexity has begun to gain attention. BGKOctoMap [4] innovatively applies the sparse kernel and Bayesian nonparametric inference data structure to improve efficiency. Similar work is carried out in BGKOctoMap-L [5]. More recently, S-BKI [6] extends [5] to 3-D semantic mapping, which enriches the map information.
In summary, the above methods perform continuous mapping without considering voxel potential differences, which results in overinflated maps. These algorithms also have a large computational cost compared to discrete mapping methods. These have been the reasons for limiting the generalization of continuous mapping.

III. SHARP-EDGED AND EFFICIENT CONTINUOUS SEMANTIC OCCUPANCY MAPPING
This section describes and formulates the SEE-CSOM algorithm, divided into four sections: Algorithm framework and problem definition, redundant voxel filter model, adaptive kernel length model, and multientropy kernel inference.

A. Algorithm Framework and Problem Definition
The framework of the SEE-CSOM algorithm, depicted in Fig. 2, consists of three main modules. In the redundant voxel filter model (RVFM), redundant voxels are distinguished from inactive and observed voxels by a filtering factor. In the adaptive kernel length model (AKLM), class entropy composed of two subentropies is introduced to adjust the kernel length, which determines the range of local spatial associations. Finally, a continuous semantic map is estimated from sensor observations through multientropy kernel inference, which combines the information conveyed by RVFM and AKLM.
Considering a robot operating in a completely unknown environment and attempting to reconstruct its surroundings, the problem can be defined as follows.
Problem Definition: Given a robot $r$ with camera observations $I_{1:t}$, 3-D LiDAR observations $L_{1:t}$, and robot trajectory $O_{1:t}$, the objective is to estimate the continuous semantic occupancy map $M_t$
$$p(M_t \mid I_{1:t}, L_{1:t}, O_{1:t}). \tag{1}$$
The solution of the problem corresponds to the maximum a posteriori (MAP) estimation of (1). For the input, there are $I_t \in \mathbb{R}^2$, $L_t \in \mathbb{R}^3$, and $O_t \in SE(3)$. For the output, each voxel $v_j$ of the dense semantic map $M_t$ holds a tuple $\lambda_j = (\lambda_j^1, \lambda_j^2, \cdots, \lambda_j^K)$ to store probabilistic semantic labels, where $K$ is the total number of semantic classes and $\sum_{k=1}^{K} \lambda_j^k = 1$.
At time $t$, the RGB image $I_t$ is fed into the segmentation network [23]. For each pixel, the output is a one-hot encoded measurement tuple $c_i = (c_i^1, c_i^2, \cdots, c_i^K)$. Due to differences in the sensors' fields of perception, only 3-D LiDAR points within the camera perception area are collected. The semantic labels can be transmitted from pixels to LiDAR points by projection [24], where the parameters are calibrated by [25]. Therefore, (1) can be rewritten as (2) in terms of the semantic point clouds $L^s_{1:t}$. The semantic point cloud consists of a series of semantic points $p_i$ with coordinates $(p_i^x, p_i^y, p_i^z)$, each associated with a semantic label $c_i$. Alternatively, the problem can be refined to estimating the map given the semantic points and labels, as in (3).
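The label transfer from pixels to LiDAR points can be sketched as follows. This is only a minimal illustration under a pinhole camera assumption. The function name `label_points`, the intrinsic matrix `K_cam`, and the LiDAR-to-camera transform `T_cam_lidar` are hypothetical names introduced here, not taken from the article, whose projection follows [24] and whose calibration follows [25]. For brevity the sketch returns integer class ids rather than one-hot tuples.

```python
import numpy as np

def label_points(points_lidar, label_image, K_cam, T_cam_lidar):
    """Attach a semantic class id to every LiDAR point that lands in the image.

    points_lidar : (N, 3) points in the LiDAR frame
    label_image  : (H, W) integer class ids from a segmentation network
    K_cam        : (3, 3) pinhole intrinsics (assumed model)
    T_cam_lidar  : (4, 4) LiDAR-to-camera extrinsics (assumed known)
    Returns the (M, 3) points inside the camera view and their (M,) labels.
    """
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_lidar @ homog.T).T[:, :3]

    # Keep only points in front of the camera (positive depth).
    in_front = pts_cam[:, 2] > 1e-6
    uvw = (K_cam @ pts_cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard points that project outside the image bounds.
    h, w = label_image.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    kept = points_lidar[in_front][inside]
    labels = label_image[v[inside], u[inside]]
    return kept, labels
```

Points behind the camera or outside the image are dropped, which mirrors the statement that only LiDAR points within the camera perception area are collected.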

B. Redundant Voxel Filter Model
As stated before, previous continuous mapping methods [5], [6] cannot clearly distinguish voxels, resulting in overfitting of the final continuous map. The redundant voxel filter model is designed to address this problem, distinguishing different types of voxels and filtering out the redundant voxels. Fig. 3 illustrates the significance of RVFM in a 2-D example.
As shown in Fig. 4(a), in order to improve semantic accuracy and inference efficiency, a block is introduced as an intermediate unit. A graph model is extracted from the extended block [see Fig. 4(c)], with the blocks in the extended block as nodes and the connections between the current block $b_J$ and the surrounding blocks $\{b_x\}$ as edges. This graph is called Dandelion-CRF because of its highly recognizable structure. It allows the style of the extended block to be modified arbitrarily to suit the map application and sensor resolution.
Given the observation $D$, the context entropy $E_{con}$ is described as the conditional probability of the central node $b_J$, $E_{con} = P(b_J \sim 1 \mid D)$, where $b_J \sim 1$ indicates that $b_J$ should be filled to enhance the continuity of the map. It is important to note that filling does not set the state to occupied, but instead uses the spatial association to populate the current observation. There are two kinds of cliques in Dandelion-CRF: one is a single node $\{b_k\}$ and the other is a pair of adjacent nodes $\{b_k, b_l\}$, where $k$ and $l$ are index variables. By selecting the exponential potential function and introducing the feature functions, the conditional probability is defined in terms of the partition function $Z(D)$ for normalization, the status feature function $\psi(b_k)$, and the transition feature function $\psi(b_k, b_l)$; the two feature functions describe the influence of the observation sequence and of adjacent nodes, respectively. In the formulation, $\psi(b_k)$ takes different values according to whether the block is observed, and $\psi(b_k, b_l)$ takes the radial basis function (RBF) of the Euclidean distance between blocks. In (7) and (8), $\omega_1$ and $\omega_2$ are hyperparameters that control the amount of information transmitted, and $s$ is the resolution of the block. $E_{con}$ reveals potential differences between voxels that can be used as indicators of differentiation.
Taking $T_{con}$ as the entropy threshold, the voxels contained in a block with context entropy less than $T_{con}$ are redundant; they are often located in the gap between two objects or outside the boundaries of objects. RVFM filters out these redundant voxels during continuous inference, while preserving observed and inactive voxels to estimate a more accurate map. This operation is realized by the filtering factor $f_J$ (9), which is transmitted to the multientropy kernel inference module.
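The thresholding step can be sketched as below. The sketch assumes that the context entropy of a block has already been computed, and it assumes a binary filtering factor (1 keeps the block's voxels, 0 filters them out); the exact expression used in the article for $f_J$ is not reproduced in this text, so the binary form and the function name `filtering_factor` are illustrative assumptions. The default threshold of 0.1 is the value reported as the best candidate in the parameter study.

```python
def filtering_factor(context_entropy, t_con=0.1):
    """Return 1 to keep a block's voxels during inference, 0 to filter them.

    A block whose context entropy is strictly below the threshold is treated
    as redundant, following the rule stated in the text. The binary output
    is an assumption made for illustration.
    """
    return 0 if context_entropy < t_con else 1


# Hypothetical per-block context entropies, keyed by block index.
block_entropy = {"b0": 0.02, "b1": 0.10, "b2": 0.75}
keep = {name: filtering_factor(e) for name, e in block_entropy.items()}
```

With these illustrative values, only `b0` falls below the threshold and is filtered.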

C. Adaptive Kernel Length Model
The kernel length is the key to mapping efficiency, because it determines the query range. The adaptive kernel length model is designed to assign appropriate kernel lengths to voxels. Continuing in units of block, voxels in the same block are assigned with the same kernel length.
Class entropy $E_{cla}$ is introduced to measure the overall uncertainty of the voxel. It contains two subentropies: the probability entropy $E_p$ and the semantic entropy $E_s$. On the one hand, the probability entropy $E_p$ reflects the proportion of point counts and is defined in (10), where $n_{max}$ is the number of semantic points belonging to the most frequent semantic class, and $n_{all}$ is the total number of semantic points in block $b_J$. On the other hand, the semantic entropy $E_s$, defined in (11), describes the diversity of semantic labels in block $b_J$, where $k$ ($k < K$) indicates the number of semantic labels contained in block $b_J$. It is worth pointing out that $n_{max}/n_{all}$ is bounded in $[1/k, 1]$. Converting this relationship to the subentropies yields the implicit constraint (12) between them.
Class entropy $E_{cla}$ is defined in (13) to combine the two subentropies. Probability entropy $E_p$ is dominant because it integrates part of the information of semantic entropy $E_s$. Moreover, the weight of semantic entropy should be inversely proportional to the total number of semantic classes $K$. Coupled with the constraints of (12), the visualization of class entropy $E_{cla}$ is illustrated in Fig. 5. When there are no observations in the block, it has the largest class entropy. This also occurs when points with any labels fall into the block evenly. Substituting (10) and (11) into (13), class entropy $E^J_{cla}$ is written as (14). Larger class entropy means higher uncertainty, requiring a larger query range to ensure map accuracy. Therefore, for voxel $v_j$ in block $b_J$, the kernel length is adjusted within the bounds $L_{min}$ and $L_{max}$ according to (15).
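The idea of mapping block uncertainty to a kernel length can be illustrated with the sketch below. It is not the article's formulation: the expressions (10) to (15) are not reproduced in this text, so a normalized Shannon entropy is swapped in as a stand-in for $E_{cla}$, and a linear interpolation between $L_{min}$ and $L_{max}$ is assumed for the kernel length. The stand-in was chosen because it shares the two stated properties (largest value for an empty block and for evenly mixed labels). All function names are hypothetical.

```python
import math
from collections import Counter

def block_statistics(labels):
    """Return (n_max / n_all, k) for a non-empty list of point labels."""
    counts = Counter(labels)
    n_all = sum(counts.values())
    n_max = max(counts.values())
    return n_max / n_all, len(counts)

def standin_class_entropy(labels, num_classes):
    """Normalized Shannon entropy in [0, 1]; a stand-in, not (13) or (14).

    num_classes is the total number of semantic classes K (must exceed 1).
    """
    if not labels:
        return 1.0  # no observations: largest uncertainty, as in the text
    counts = Counter(labels)
    n_all = len(labels)
    h = -sum((c / n_all) * math.log(c / n_all) for c in counts.values())
    return h / math.log(num_classes)

def kernel_length(entropy, l_min, l_max):
    """Assumed linear map from an entropy in [0, 1] to [l_min, l_max]."""
    e = min(max(entropy, 0.0), 1.0)
    return l_min + e * (l_max - l_min)
```

A block containing a single label receives the smallest kernel length, while an empty block or one with evenly mixed labels receives the largest, which matches the qualitative behavior described above.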

D. Multientropy Kernel Inference
The efficacy of the RVFM and AKLM is exerted through multientropy kernel inference, which essentially converts sensor observations into updated maps. Different from the classical voxel probability update model [3], the multientropy kernel inference model is derived based on the counting sensor model [26]. According to Bayes' rule, (3) can be decomposed as in (16). For incremental Bayesian inference, the likelihood is modeled as a categorical distribution $\mathrm{Cat}(\lambda_j^1, \lambda_j^2, \ldots, \lambda_j^K)$.
To break the independence of voxels, an extended likelihood is introduced with a kernel function $k$ that operates on 3-D space, $k: \mathcal{X} \times \mathcal{X} \to [0, 1]$. Then, (16) is rewritten as (17). After simplification, the relationship between the two Dirichlet distribution parameters $\sigma_0$ and $\sigma_j$ can be obtained as
$$\sigma_j = \sigma_0 + \sum_i k(v_j, p_i)\, c_i. \tag{18}$$
Because $\sigma_0$ usually takes a tiny value due to the lack of prior knowledge, $\sigma_j$ is the weighted count of the semantic points, called the semantic count tuple of voxel $v_j$. It is stored in the voxel and updated when a new semantic point cloud is inserted.
Equation (18) indicates that the semantic count tuple $\sigma_j$ counts not only the semantic points that fall into the current voxel $v_j$ but also the adjacent semantic points, with the kernel function as the weight. The choice of kernel function therefore has a pivotal influence on the quality and efficiency of semantic mapping. In order to reduce the computational complexity, the sparse kernel function $k_0(v_j, p_i)$ [27] is chosen as a template:
$$k_0(v_j, p_i) = \varepsilon_0 \left[ \frac{1}{3}\left(2 + \cos\frac{2\pi d}{L}\right)\left(1 - \frac{d}{L}\right) + \frac{1}{2\pi}\sin\frac{2\pi d}{L} \right] \mathbb{I}(d < L), \tag{19}$$
where $\mathbb{I}$ represents the indicator function, $d = \lVert v_j - p_i \rVert$, $L$ is the kernel length, and $\varepsilon_0$ is the scale factor.
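As an illustration, the sparse kernel of [27] can be evaluated as in the sketch below, using its commonly published form. The function name `sparse_kernel` and its argument names are chosen here for illustration, and the distance `d` is assumed to be a non-negative Euclidean distance between a voxel center and a semantic point.

```python
import math

def sparse_kernel(d, length, scale=1.0):
    """Compactly supported kernel: positive for d < length, zero otherwise.

    d      : non-negative distance between voxel center and point
    length : kernel length L (the query range)
    scale  : scale factor (epsilon_0 in the text)
    """
    if d >= length:
        return 0.0  # indicator function: no contribution outside the range
    r = d / length
    return scale * (
        (2.0 + math.cos(2.0 * math.pi * r)) * (1.0 - r) / 3.0
        + math.sin(2.0 * math.pi * r) / (2.0 * math.pi)
    )
```

Because the kernel vanishes beyond `length`, only points within the query range need to be visited, which is why the kernel length directly controls the cost of each voxel update.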
Incorporating the proposed RVFM (9) and AKLM (15) into (19), the multientropy kernel function $k_e(v_j, p_i)$ is derived as (20). By integrating the two models, $k_e(v_j, p_i)$ remedies the defects of indistinguishable redundant voxels and fixed kernel length in $k_0(v_j, p_i)$, thus greatly improving the mapping performance. Inserting the obtained semantic point clouds $L^s_{1:t}$, the semantic count tuple $\sigma_j$ of each voxel $v_j$ in map $M_t$ is given by (21), where $\sigma_j^k$ represents the count value of the 3-D points with semantic label $k$ in voxel $v_j$. Therefore, the probabilistic semantic label of the voxel $v_j$ is the closed-form expected value of the posterior Dirichlet:
$$\lambda_j^k = \frac{\sigma_j^k}{\sum_{k'=1}^{K} \sigma_j^{k'}}. \tag{22}$$
In summary, the mapping problem defined in (3) has been transformed into a probabilistic solution as (22). The map is updated by incrementally calculating (21) and (22) as new sensor observations are obtained.
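The incremental update of one voxel can be sketched as follows. The sketch assumes a generic kernel callable `kernel(d, length)` supplied by the caller; it does not implement the multientropy kernel $k_e$, whose expression is not reproduced in this text. The function names `update_counts` and `expected_labels` are hypothetical. The second function uses the standard mean of a Dirichlet distribution, which is the count of each class divided by the sum of all counts.

```python
import numpy as np

def update_counts(sigma, voxel_center, points, one_hot_labels, kernel, length):
    """Add kernel-weighted label counts of new points to a voxel's count tuple.

    sigma          : (K,) current semantic count tuple of the voxel
    voxel_center   : (3,) voxel center coordinates
    points         : (N, 3) semantic point coordinates
    one_hot_labels : (N, K) one-hot semantic labels of the points
    kernel         : callable kernel(d, length) -> weight (caller supplied)
    length         : kernel length used for this voxel
    """
    d = np.linalg.norm(points - voxel_center, axis=1)
    weights = np.array([kernel(di, length) for di in d])
    return sigma + weights @ one_hot_labels

def expected_labels(sigma):
    """Mean of a Dirichlet distribution with parameters sigma."""
    return sigma / sigma.sum()
```

Starting from a small prior count, a voxel that receives one nearby point of a given class ends up with almost all of its probability mass on that class, while points outside the kernel length leave the counts unchanged.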

IV. EXPERIMENTAL RESULTS
In this section, the performance of the proposed SEE-CSOM algorithm is validated through experiments performed on multiple public datasets and a real robot platform.
Implementation Details: All experiments are conducted on AMD R7-5800H CPU @3.20 GHz, 16 GB RAM. Our code written in C++ has been made public, which is based on Robot Operating System, Point Cloud Library, and Semantic Bayesian Kernel Inference Library.
Comparison Baseline: S-CSM, S-BKI [6], OctoMap [3], GPOctoMap [9], BGKOctoMap-L [5], and AKIMap [28] are selected as baselines for occupancy accuracy and efficiency comparison, while S-CSM and S-BKI [6] are set as the baselines for semantic accuracy comparison. For these algorithms, the common hyperparameters follow the settings of S-BKI [6], while the unique ones are kept original or empirically adjusted to be appropriate.
Evaluation Metric: Accuracy is measured by occupancy AUC, voxel-IoU, w-average, and accurateness. Voxel-IoU extends the pixel-IoU from 2-D to 3-D, which is defined as TP/(TP+FP+FN). W-average is the weighted average of voxel-IoU. Accurateness is defined as the proportion of correctly classified voxels. Efficiency is measured in seconds.
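The voxel-level metrics can be computed as in the sketch below, which assumes that the predicted and ground-truth maps have been flattened into aligned arrays of per-voxel class ids. The article does not state the weights of the w-average, so weighting each class by its number of ground-truth voxels is an assumption made here; the function names are hypothetical.

```python
import numpy as np

def voxel_iou(pred, truth, cls):
    """Voxel-IoU of one class: TP / (TP + FP + FN)."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    tp = np.sum((pred == cls) & (truth == cls))
    fp = np.sum((pred == cls) & (truth != cls))
    fn = np.sum((pred != cls) & (truth == cls))
    denom = tp + fp + fn
    return float(tp / denom) if denom else float("nan")

def weighted_average_iou(pred, truth, classes):
    """IoU averaged over classes, weighted by ground-truth voxel count (assumed)."""
    truth = np.asarray(truth)
    weights = np.array([np.sum(truth == c) for c in classes], dtype=float)
    ious = np.array([voxel_iou(pred, truth, c) for c in classes])
    mask = weights > 0  # skip classes absent from the ground truth
    return float(np.sum(weights[mask] * ious[mask]) / np.sum(weights[mask]))

def accurateness(pred, truth):
    """Proportion of voxels whose predicted class equals the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))
```

An overinflated map raises FP for occupied classes, which lowers both voxel-IoU and accurateness even when every observed surface is labeled correctly.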

A. Occupancy Evaluation
The structured toy dataset is collected from a closed space of 10.0 m × 7.0 m × 2.0 m in Gazebo, which is also used in many previous works [4], [5], [6]. Fig. 6 includes the reconstruction results for ours and all of the comparison baselines. Our map has a compact underlying representation even though the sensor data are sufficiently sparse, which is the closest to the environment model, while other methods suffer from underfitting or overfitting to a certain degree.

B. Stanford Indoor Dataset
Stanford 2-D-3-D Semantics Dataset [7] is a large indoor spatial dataset. It provides indoor data at multiple modalities, including annotated 3-D point clouds. A conference room, a lounge, an office, and a WC are selected as evaluation scenes, covering various indoor environments with different structures. The map resolution is set to 0.05 m for all algorithms.
Taking the lounge as an example, the mapping results are shown in Fig. 8. As can be seen, S-CSM generates a discrete semantic map by only predicting observed voxels. Objects in the S-BKI map are very thick, and a zoom-in view shows that the entire chair has been distorted and glued to the floor due to the blind filling of redundant voxels. In contrast, our proposed SEE-CSOM successfully filters out redundant voxels while filling in the inactive voxels, which builds a semantic map that visually has the most similar features to the ground truth.
The quantitative evaluation results of the mapping accuracy are summarized in Table I. SEE-CSOM has achieved significant advantages, consistent with the visual results. S-BKI has a higher IoU than S-CSM, but has the lowest accuracy due to the overfilling of a large number of redundant voxels.

C. SemanticKITTI Outdoor Dataset
The SemanticKITTI dataset [29] is a large outdoor semantic point cloud dataset collected from the real world. The semantic labels of the point clouds inserted into the map are obtained by RangeNet++ [30]. Sequences 02, 04, 06, and 08 are used for evaluation. Taking Sequence 04 as an example, the comparison of the mapping results is shown in Fig. 9. The enlarged pictures present part of the ground. Due to inaccurate network segmentation, all the generated maps have some random noise. There are many loopholes and messy semantic labels in the S-CSM map, because S-CSM does not consider the spatial correlation. S-BKI can remove some noise and fill in the loopholes by smoothing, but it is still not comparable with ours. SEE-CSOM removes almost all noise by applying RVFM and AKLM, which is more in line with the ground truth.
The quantitative results are summarized in Table III. SEE-CSOM has the best performance. Moreover, the accuracy of S-BKI is greatly improved and even surpasses that of S-CSM. The reason is that continuous mapping is more suitable for cluttered outdoor scenes with many unknown or ambiguous objects. Fig. 10 shows the confusion matrices. The diagonal (TP) of our confusion matrix has the darkest color, with the largest precision and recall for each class. The highest F1 score also confirms the best spatial semantic classification effect of the proposed algorithm.
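Per-class precision, recall, and F1 can be derived from a confusion matrix as sketched below. The row and column convention (rows are ground-truth classes, columns are predicted classes) is an assumption of this sketch, since the layout of the matrices in Fig. 10 is not described in the text; the function name `per_class_scores` is hypothetical.

```python
import numpy as np

def per_class_scores(confusion):
    """Return per-class (precision, recall, f1) arrays.

    confusion[i, j] is the number of voxels whose true class is i and whose
    predicted class is j (assumed convention).
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)
    pred_total = cm.sum(axis=0)  # column sums: voxels predicted as each class
    true_total = cm.sum(axis=1)  # row sums: voxels truly of each class
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2.0 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1
```

Classes with no predictions or no ground-truth voxels receive a score of zero rather than raising a division error.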

D. Efficiency Evaluation
To evaluate efficiency, the average runtime of both semantic and geometric mapping methods is reported in Table IV. For a fair comparison, all experiments use the same environment configuration on both hardware and software. In general, semantic mapping methods cost more time than geometric mapping methods. This is because the semantic map includes multiple classes of object labels, thus increasing the complexity of estimation and update. By adaptively assigning the kernel length, SEE-CSOM has the highest efficiency in semantic mapping.

E. Impact of Parameters
The sensitivity of SEE-CSOM to two important parameters is studied: block depth D b and context entropy threshold T con . The experiments are conducted on the conference room dataset.
In Fig. 11, w-average IoU and accurateness are utilized as evaluation indicators. The best mapping effect is achieved when the block depth D b is set to 2. When it is smaller, the surrounding observation data are considered insufficient, and when it is larger, the details of the map will be ignored, both of which are not conducive to the accurate reconstruction of the scene. 0.1 is the best candidate for context entropy threshold T con , which filters out voxels that have no interior observations and have exterior observations in few directions.

F. Validation in the Real World
To verify the practicality in real applications, SEE-CSOM is tested by deploying a mobile robot. The robot is equipped with a 3-D Velodyne LiDAR and a visual camera, where the sensors have been accurately calibrated with [25]. The Cityscapes dataset [31] is used to train the semantic segmentation model.
The robot is teleoperated to traverse the campus to record raw sensor data, from which semantic point clouds and robot trajectories are generated. To verify the performance of the algorithms on sparse data, each scan of the point cloud is downsampled to a resolution of 0.2 m, the maximum perception range is 15 m, and the map resolution is set to 0.1 m. Using the exact same input and real-time playback, the qualitative results of the three semantic mapping algorithms are shown in Fig. 12. A top view of the SEE-CSOM map and an image of the robot are shown at the top. At the bottom is a comparison of the maps constructed by the three algorithms. By visually checking the generated map, it is found that the SEE-CSOM map strikes a balance between the sparse S-CSM map and the overinflated S-BKI map, filling the mapping loopholes without overfitting.
In addition, the quantitative results are reported in Table V. Since the ground-truth semantic labels are not available, the ground-truth geometric map is constructed offline using dense denoised point clouds. As indicated in the table, SEE-CSOM outperforms other algorithms on all metrics. S-CSM has the smallest occupied IoU due to its conservative estimation of occupancy, while S-BKI, in contrast, is overinflated. In terms of runtime, SEE-CSOM achieves the best computational efficiency. Based on the real-world validation, SEE-CSOM demonstrates its accuracy and efficiency in practical applications.

V. CONCLUSION
This article established a sharp-edged and efficient continuous semantic occupancy mapping algorithm. More specifically, the proposed redundant voxel filter model filtered out redundant voxels, so the representation of objects had accurate boundaries with sharp edges in our map. In addition, the proposed adaptive kernel length model adjusted kernel length adaptively, which greatly reduced the computational complexity. The multientropy kernel function integrated the two models to jointly reconstruct a dense accurate representation from sparse noisy sensor observations. The results demonstrated that the proposed algorithm achieved high accuracy and efficiency. In the future, the plan is to consider semantic consistency when filtering redundant voxels, so that a more accurate and smoother semantic map can be reconstructed.

Yi Yang received the Ph.D. degree in automation from the Beijing Institute of Technology, Beijing, China, in 2010. He is currently a Professor with the School of Automation, Beijing Institute of Technology. His research interests include autonomous vehicles, bioinspired robots, intelligent navigation, semantic mapping, scene understanding, motion planning and control. He is the author or coauthor of more than 50 conference and journal papers in the area of unmanned ground vehicles.
He is a Professor with the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. His research interests include robotics, control engineering, and fault diagnosis. He is a Fellow of The Academy of Engineering Singapore.
Yufeng Yue (Member, IEEE) received the B.Eng. degree in automation from the Beijing Institute of Technology, Beijing, China, in 2014, and the Ph.D. degree in robotics from Nanyang Technological University, Singapore, in 2019. He is currently a Professor with the School of Automation, Beijing Institute of Technology. He has published a book in Springer, and over 40 journal/conference papers, including TMECH, TIE, TCST, TMM, ICRA and IROS. His research interests include perception, mapping and navigation for collaborative robots.