Dynamic Semantic Occupancy Mapping Using 3D Scene Flow and Closed-Form Bayesian Inference

This paper reports on a dynamic semantic mapping framework that incorporates 3D scene flow measurements into a closed-form Bayesian inference model. The presence of dynamic objects in the environment can cause artifacts and traces in current mapping algorithms, leading to an inconsistent map posterior. We leverage state-of-the-art deep learning for semantic segmentation and 3D scene flow estimation to provide measurements for map inference. We develop a Bayesian model that propagates the scene with flow and infers a 3D continuous (i.e., queryable at arbitrary resolution) semantic occupancy map that outperforms its static counterpart. Extensive experiments using publicly available data sets show that the proposed framework consistently improves over its predecessors and over the input measurements from deep neural networks.


I. INTRODUCTION
Mapping, localization and navigation are among the key capabilities of autonomous systems. For robots to navigate safely in complex and evolving environments, mapping can act as a unified framework that addresses multiple perception sub-tasks required for a higher-level scene understanding, such as occupancy/traversability estimation, object detection and tracking. While some research streams employ end-to-end deep neural networks for mapless navigation via imitation [1], [2], reinforcement [3], [4] or self-supervised learning [5], maps are still widely used for explicit reliability, interpretability, and predictability. In this work, we focus on the map inference problem instead of Simultaneous Localization and Mapping (SLAM), and aim at improving the inference performance in dynamic environments.
Map inference can aid robots in reasoning about areas that are currently occluded but previously observed (occlusion-awareness), or inferring the geometry and semantics of an unseen landmark near those that were previously observed (smoothing). In complex environments (e.g., driving scenarios), robots can recognize stationary cars and people, while consistently tracking moving vehicles and pedestrians.
Semantic mapping complements geometric modelling of a robot's surroundings with semantic concepts, i.e., an understanding of what the environment means to the robot. With semantic mapping, these semantic concepts manifest as a representation of the environment, thus lending robots more resources for task planning and execution. The emergence of semantic mapping can be attributed to (i) the limitations of purely geometric maps, and (ii) the advancements in deep neural networks that allow semantic interpretation of raw sensory data [6].
In early semantic mapping works, semantics and geometry are modeled independently, where semantic labels are added on top of an existing geometric representation, such as a point cloud model [7], surfel-based map [8], or voxel-based map [9]. As this field has progressed, semantics and geometry have been modeled jointly and inferred in a unified framework [10], [11]. Gan et al. [11] proposed a unified semantic mapping framework for closed-form Bayesian inference of the semantic map posterior. However, its underlying static world assumption limits its applications in real-world dynamic environments. In scenarios with dynamic objects, static mapping may provide less detailed or even inconsistent reconstruction due to obscured views. The novelty of this work is thus to propose a unified closed-form Bayesian inference framework which extends semantic mapping to dynamic environments.
To this end, we develop a scalable dynamic semantic mapping framework that combines motion and semantic information through closed-form Bayesian inference in a single pipeline, as shown in Fig. 1. Spatio-temporal motion data and semantic labels are aggregated over past frames and neighbouring voxels. The aggregated motion is then used in a proposed Bayesian model to propagate the current scene and its semantic labels.
In particular, this work has the following contributions.
1) We propose a kernel method for scene flow aggregation and an efficient auto-regressive Bayesian model for scene propagation.
2) We extend the Bayesian Kernel Inference (BKI) semantic mapping framework [11] to dynamic scenes by incorporating motion information.
3) We introduce an evaluation methodology for dynamic semantic mapping using single- and multi-view data.
4) The open-source software is publicly available at: https://github.com/UMich-CURLY/BKIDynamicSemanticMapping

The remaining sections are organized as follows. A comprehensive literature review is presented in Section II. Section III presents the problem setup and preliminaries. The methodology is discussed in Section IV. Section V presents quantitative evaluation methods for dynamic mapping. Results and discussion are given in Section VI. Finally, Section VII concludes the paper and provides ideas for future work.

II. RELATED WORK
In this section, we review works on semantic and dynamic mapping. While our work is focused specifically on mapping, the improved mapping algorithm can lead to improvements in downstream tasks such as localization when integrated in a SLAM system. Therefore, we provide background and perform comparisons of both mapping and SLAM systems to highlight the differences in how the environment is perceived, and subsequently represented. A taxonomy of the state-of-the-art dynamic mapping works is given in Table 1, based on the presence (✓) or absence (×) of semantics and scene dynamics usage, the type of sensors they operate on, and the type of flow measurements incorporated.

A. SEMANTIC MAPPING
Semantics are important to robot perception for better scene understanding and interaction [6]. Whereas many semantic mapping works have explored learning-based local mapping [12]- [17], our work is a mathematically-derived 3D global mapping algorithm. Additionally, our method builds upon existing deep learning research by directly taking the output from neural networks as input, instead of attempting to embed all information within a latent space. With explicit intermediate steps, our model is diagnosable and reliable, as failures at each stage of the pipeline may be identified and observed. For this reason, we consider works that incorporate semantic measurements into maps given poses and estimated semantic labels.
Another line of research concerns continuous semantic mapping with uncertainty [30]- [33], which allows one to query maps at arbitrary resolutions. Kernel methods such as Gaussian Processes (GPs) are well-established for predicting a continuous non-parametric function to represent the semantic map [34]- [36]. BKI is an efficient approximation of GPs which yields fast computation and accurate inference for semantic mapping [11]. This work extends [11] to dynamic scenes.

B. MAPPING IN DYNAMIC ENVIRONMENTS
Dynamic objects can break the assumption of scene rigidity in most mapping algorithms and cause failure. When combined with localization in a SLAM pipeline, artifacts left by dynamic objects can introduce errors for downstream tasks such as pose estimation and loop closure. Thus, some SLAM systems treat dynamic objects in a scene as spurious data or outliers, excluding them entirely from pose estimation and mapping to achieve better accuracy and robustness [37]- [39]. However, discarding dynamic objects ultimately relies upon the ability to reject dynamic objects and decreases the level of scene understanding embedded within the map.
In this section, we provide background on rejection-based approaches as well as mapping algorithms which jointly model the static world and dynamic objects. Discarding dynamic objects may be performed through probabilistic outlier rejection [42], moving consistency check [41], feature-based filtering [37], measurement-map semantic inconsistency check [54], culling out with object-camera relative poses [43], semantic and geometric information coupling [40], or residual motion likelihood calculation [38], [44]. These methods can partially reduce the localization error brought by dynamic objects, but still have limitations. For instance, discarding information based on semantic labels completely depends on the prediction accuracy. Moreover, the discarded motion information, if modeled correctly, could be further leveraged to predict the scene dynamics.
There are two primary approaches to integrating motion within maps. In the first, scene dynamics are incorporated into a single reconstruction volume. This could be done by maintaining an object point cloud with a moving probability [25], calculating a dynamics factor for classes that could be mis-classified as "dynamic" (e.g., parked cars) and incorporating those into pose estimation [45], propagating feature points by sampling from scene flow measurements [53], or fusing semantic features by recurrent observation average pooling in an OctoMap cell [52]. Instead of analyzing the motion properties of a map cell or feature from a single scan, we combine spatio-temporal motion data over multiple scans and neighbouring voxels.
The second category is characterized by its underlying object-oriented map representation. These approaches track local objects using Iterative Closest Point (ICP) and semantic segmentation-aided fusion [21], [24], [46], or by clustering on the basis of motion estimation [47]. Object tracking is also done via sparse scene flow estimation [48], frame-to-model data association [55] or Signed Distance Function (SDF)-based data association [49]. In this area, deep learning-based instance segmentation is often the bottleneck for computational efficiency [25], [37], [41], [44]. Although we model the scene as a unified reconstruction volume as in the former approaches, we can incorporate flow during environmental perception as in the latter approaches.

III. SEMANTIC BAYESIAN KERNEL INFERENCE
Semantic Bayesian Kernel Inference (Semantic-BKI) [11] is a probabilistic method for 3D semantic mapping with quantifiable uncertainty. The Semantic-BKI framework assumes that the j-th map cell (voxel in 3D, indexed by j ∈ Z^+) with semantic probability θ_j = (θ_j^1, ..., θ_j^K), where θ_j^k is the probability of the j-th cell belonging to the k-th category, has the Categorical likelihood

p(y_i | θ_j) = ∏_{k=1}^K (θ_j^k)^{[y_i = k]}.

Here, y_i ∈ {1, ..., K} is the semantic measurement at position x_i ∈ R^3 in or around the j-th cell, and [y_i = k] evaluates to 1 if y_i = k, and 0 otherwise. The semantic measurement y_i is usually the semantic label output by a neural network. Given training data with N measurement points D = {(x_i, y_i)}_{i=1}^N, semantic mapping seeks the posterior distribution p(θ_j | D) for each map cell j.
For a closed-form solution, BKI semantic mapping adopts a conjugate prior over θ_j, given by a Dirichlet distribution Dir(K, α_0), α_0 = (α_0^1, ..., α_0^K), where K ≥ 2 is the number of categories, and α_0^k ∈ R^+ are concentration parameters for each category. Applying Bayes' rule and Bayesian kernel inference, the posterior is another Dirichlet distribution, given by Dir(K, α_j), α_j = (α_j^1, ..., α_j^K), with

α_j^k = α_0^k + Σ_{i=1}^N K_s(x_j, x_i) [y_i = k],   (1)

where K_s : R^3 × R^3 → [0, 1] is a spatial kernel function defined on 3D Euclidean space to capture the spatial correlation of two 3D positions, and α_j^k is the k-th concentration parameter of the query voxel j centered at x_j ∈ R^3.
Given α_j, the maximum a posteriori (MAP) estimate of θ_j can then be computed in closed form:

θ̂_j^k = (α_j^k − 1) / (Σ_{k'=1}^K α_j^{k'} − K).   (2)

In BKI semantic mapping, the prior distribution of the map at time stamp t is directly set to be the posterior at time stamp t − 1 (assuming that the environment does not change between two time stamps), i.e., p(θ_{j,t} | D_{1:t−1}) = p(θ_{j,t−1} | D_{1:t−1}), to allow recursive Bayesian updates using sequential training data:

α_{j,t}^k = α_{j,t−1}^k + Σ_{i=1}^{N_t} K_s(x_j, x_i) [y_i = k].   (3)

However, this assumption is easily violated by moving objects in the environment or environmental changes. As such, we model the transition from p(θ_{j,t−1} | D_{1:t−1}) to p(θ_{j,t} | D_{1:t−1}) using spatial and temporal information.
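As a concrete sketch of the update in (1) and the MAP estimate in (2), the snippet below accumulates kernel-weighted class counts for one voxel and takes the Dirichlet MAP. All names are illustrative, and the Gaussian kernel is only a placeholder for the sparse kernel the paper actually uses (Section VI-A):

```python
import numpy as np

def spatial_kernel(d, l=0.3):
    # Placeholder for K_s: any kernel mapping distance to [0, 1] with k(0) = 1
    # works for this sketch; the paper uses a compactly-supported sparse kernel.
    return np.exp(-(d / l) ** 2)

def bki_update(alpha0, x_query, X, y, n_classes):
    """Closed-form concentration update, eq. (1):
    alpha_j^k = alpha_0^k + sum_i K_s(x_j, x_i) [y_i = k]."""
    alpha = np.array(alpha0, dtype=float)
    w = spatial_kernel(np.linalg.norm(X - x_query, axis=1))
    for k in range(n_classes):
        alpha[k] += w[y == k].sum()   # kernel-weighted count of class-k points
    return alpha

def dirichlet_map(alpha):
    """MAP estimate of theta_j, eq. (2): (alpha_j^k - 1) / (sum_k alpha_j^k - K)."""
    alpha = np.asarray(alpha, dtype=float)
    return (alpha - 1.0) / (alpha.sum() - alpha.size)
```

Because the update only touches the concentration vector, recursive fusion of sequential scans reduces to repeatedly calling `bki_update` with the previous `alpha` as the prior.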

IV. METHOD: DYNAMIC-BKI
In this section, we introduce a method to extend Semantic-BKI to dynamic environments. We first formulate an auto-regressive temporal transition model which propagates the map posterior according to the scene dynamics. Next, we show how we aggregate motion information from the training data for incorporation into the map voxels. Finally, we consolidate and summarize the algorithm for dynamic semantic mapping.

A. TEMPORAL TRANSITION MODEL
When dynamic objects move in and out of a voxel j, the samples observed in it across time are not independently and identically distributed (i.i.d.). Samples drawn from the map posterior at different time stamps will come from independent but not identically distributed Dirichlet distributions.
In Figure 2, we illustrate the motion of an object and a corresponding visualization of the Dirichlet probability density function (PDF) over the 2-simplex when there are just three classes: "robot", "free space", and "other". The static world assumption in (3) relies solely on the frequency of observations in a voxel. This property makes the Dirichlet distribution ignore scene dynamics and become overconfident about classes that contribute more observations over all time stamps rather than the current time stamp. Therefore, for correct classification, the hyperparameters of the Dirichlet distribution under a static world assumption have to evolve with the scene dynamics.
We first introduce the notation used in our formulation. Let the set of all classes be P, the set of moving classes be Q (q ∈ Q), and the free voxel category be denoted as "free." Additionally, let the set of all classes excluding a class r be P \ r. We define a voxel flow vector v_j = (v_j^1, ..., v_j^K) for each voxel j, which is the motion information captured per semantic category within the voxel. Details on computing the voxel flow are presented in Section IV-D.
We propose a time-series model to account for temporal discrepancies in the Dirichlet distribution caused by moving objects. To forecast α_{j,t} of voxel j when a moving object passes through at time stamp t − 1, we apply an auto-regressive (AR) model that leverages the 3D motion information captured from the environment and applies it to the map prior. The class-wise AR model is:

α_{j,t}^k = e^{−(v_{j,t−1}^k)^2} α_{j,t−1}^k,   (4)

where e^{−(v_{j,t−1}^k)^2} is the AR model's parameter, α_{j,t−1}^k is the prior concentration parameter for class k, and v_{j,t−1}^k is the voxel flow at time stamp t − 1 which influences the hyperparameter α_{j,t}^k for semantic class k. We depict a graphical model for this temporal transition model in Figure 3.
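The transition in (4) amounts to an element-wise exponential decay of the concentration vector; a minimal sketch (names illustrative):

```python
import numpy as np

def ar_transition(alpha_prev, voxel_flow):
    """Propagate Dirichlet concentrations through the AR model in (4):
    alpha_{j,t}^k = exp(-(v_{j,t-1}^k)^2) * alpha_{j,t-1}^k.
    Zero flow keeps the prior intact; large flow drives that
    class's concentration toward zero."""
    v = np.asarray(voxel_flow, dtype=float)
    return np.exp(-v ** 2) * np.asarray(alpha_prev, dtype=float)
```

Note that the decay factor lies in (0, 1], so the model can only redistribute confidence away from classes with observed motion; new evidence is added later by the BKI update step.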
The concept behind the transition model is to redistribute the probability mass of the concentration parameters when there is motion observed in the environment. Therefore, we warp the concentration parameters according to the effect that the motion of a dynamic object has on 1) its corresponding class and 2) other classes. Keeping these two factors in mind, we introduce two modules to predict the concentration parameters α_{j,t} for voxel j at time stamp t.

B. BACKWARD OR EXIT CORRECTION (BACC)
When a moving object of category q detected in voxel j at time stamp t − 1 is in motion and could exit at time stamp t, we want to decay its influence on the concentration parameters in j for category q at the upcoming t. As a result, we reduce the influence of the prior parameter α_{j,t−1}^q of q on the concentration parameter of the next time stamp, α_{j,t}^q. To do so, we only need to calculate the voxel flow associated with that moving category q, i.e., v_{j,t−1}^q. The map prior (observations) of other classes, α_{j,t−1}^{k∈P\q}, is not required, as each semantic category is updated independently.

C. FORWARD OR ENTRY CORRECTION (FORC)
When a voxel j that was "free" at time stamp t − 1 has a moving object of category q ∈ Q entering it at time stamp t, we want to make sure that the future presence of the object can be represented in j. As a result, we need to reduce the effect of α_{j,t−1}^{free} on α_{j,t}^{free}, and for this we need to calculate the voxel flow associated with the category "free", i.e., v_{j,t−1}^{free}. Intuitively, we do this with the motion information of all moving objects in the vicinity of voxel j that could enter voxel j. Additionally, we could use the same voxel flow v_{j,t−1}^{free} to decay the concentration parameters of all the static classes P \ Q in order to make α_{j,t}^q the highest after entry.

D. VOXEL FLOW CALCULATION FROM POINT CLOUDS
Given a voxel j centered at x_j, we wish to get a low-level understanding of how an object is moving in or out of it to model v_{j,t−1}^k in (4). Scene flow provides us with the underlying 3D motion field of the points in the scene. Given two incoming point clouds X_{t−1} and X_t, recorded at time stamps t − 1 and t, respectively, we require a translational motion vector u_i ∈ R^3 that conveys how much a point x_i ∈ X_{t−1} has displaced to its new location x_i + u_i ∈ X_t. In practice, this translational motion vector can be obtained from the "scene flow" associated with each point in the point cloud [56]- [58].
To capture the voxel flow v_j = (v_j^1, ..., v_j^K) pertaining to any semantic category for voxel j, we aggregate the flow from training points around the voxel centroid x_j. Thus, given the training points, a kernel K_v is used to weight the influence of each point x_i on v_j^q so that the closer a dynamic object of class q ∈ Q is to the voxel center, the more influence it has. Mathematically, this becomes a kernel density estimation problem and the per-class voxel flow is calculated as:

v_j^q = Σ_{i=1}^N K_v(x_j, x_i) ‖u_i‖ [y_i = q],   (5)

where ‖·‖ takes the Euclidean norm. We take the vector norm consistent with v_j^q's usage in the exponential AR model in (4). Therefore, our objective is to get a quantitative estimate of how much motion there is around the voxel, rather than to capture the direction of motion.

Algorithm 1 Per-Class Voxel Flow Estimation (outline)
  procedure VoxelFlow(x_j, X_t, Y_t, U_t)
    for each training point x_i near x_j do
      for q ∈ Q do                                ▷ for all dynamic classes
        v_{j,t}^q ← v_{j,t}^q + w_v ‖u_i‖ [y_i = q]   ▷ apply (5)
      end for
    end for
    for q ∈ Q do
      apply a custom filter f(·, ·) with past flow
    end for
    for k ∈ P \ Q do
      copy over free velocity to static classes
    end for
    return v_{j,t}
  end procedure
In Section IV-C, we introduced FORC for the free and other static classes. As explained previously, we calculate their per-class voxel flow together by considering the dynamic objects of all categories moving in voxel j:

v_j^{free} = Σ_{i=1}^N K_v^{free}(x_j, x_i) ‖u_i‖ [y_i ∈ Q],   (6)

where Q is the set of dynamic classes and K_v^{free} is a special case of K_v that weights the influence of a dynamic training point x_i on v_j^{free}. Specific details about K_v and K_v^{free} will be discussed in Section VI-A. In Algorithm 1, we summarize how the per-class voxel flow for a query voxel j is estimated using the positional X_t, semantic Y_t, and egomotion-compensated U_t information of each point x_i in a point cloud. For BACC, we aggregate the flows of the training points encountered around x_j in line 1.4. For FORC, we aggregate the flows of the training points while weighing the ones in neighbouring voxels more (than in BACC) in line 1.5. Whereas both equations have a similar form, BACC and FORC have separate kernels. Additionally, while FORC considers all neighboring dynamic points when updating v_{j,t}^{free}, BACC only computes v_{j,t}^q from dynamic points with matching semantic label q. After calculating v_{j,t}^k for any class k, in lines 1.12 and 1.14, we post-process v_{j,t}^k with a filter f : R × R → R to aggregate information from the voxel flow at the previous time step, v_{j,t−1}^q, in the final calculation of v_{j,t}^q.
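A compact sketch of the aggregation in (5)-(6) and the post-processing in Algorithm 1 might look as follows. The kernels, the moving-average weight `beta`, and all names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def voxel_flow(x_j, X, Y, U, dynamic_classes, free_class, n_classes,
               kernel, kernel_free, v_prev=None, beta=0.5):
    """Per-class voxel flow for the voxel centered at x_j (sketch of Algorithm 1)."""
    d = np.linalg.norm(X - x_j, axis=1)          # distances to the voxel centroid
    v = np.zeros(n_classes)
    # BACC: each dynamic class q aggregates only its own points, eq. (5).
    for q in dynamic_classes:
        mask = (Y == q)
        v[q] = np.sum(kernel(d[mask]) * np.linalg.norm(U[mask], axis=1))
    # FORC: the "free" class aggregates all nearby dynamic points, eq. (6).
    dyn = np.isin(Y, list(dynamic_classes))
    v[free_class] = np.sum(kernel_free(d[dyn]) * np.linalg.norm(U[dyn], axis=1))
    # Post-filter: moving average with the previous time step's flow.
    if v_prev is not None:
        v = beta * v + (1.0 - beta) * np.asarray(v_prev, dtype=float)
    # Static classes share the free-space flow.
    for k in range(n_classes):
        if k not in dynamic_classes and k != free_class:
            v[k] = v[free_class]
    return v
```

In practice the point set `X` would already be restricted to the query voxel and its 6 neighbours, and `kernel_free` would weight neighbouring-voxel points more heavily than `kernel`, as the text describes.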
Algorithm 2 Dynamic Semantic Mapping (outline)
  procedure DynamicBKI(α_{j,t−1}, v_{j,t−1}, X_t, Y_t)
    α_{j,t} ← e^{−(v_{j,t−1})^2} α_{j,t−1}          ▷ prediction: temporal transition (4)
    α_{j,t} ← α_{j,t} + Σ_i K_s(x_j, x_i)[y_i = k]   ▷ update: BKI with (X_t, Y_t)
    return α_{j,t}
  end procedure

E. MAP POSTERIOR UPDATE FOR SCENE PROPAGATION
Section IV-A describes how we account for the change in concentration parameters of the Dirichlet distribution caused by the motion of objects. Using this model and following a Bayesian approach, α_{j,t}^k in (4) can be substituted as the prior in (1), i.e.,

α_{j,t}^k = e^{−(v_{j,t−1}^k)^2} α_{j,t−1}^k + Σ_{i=1}^{N_t} K_s(x_j, x_i) [y_i = k].

Algorithm 2 consists of prediction and update steps as in recursive Bayes filtering. For the prediction step in line 2.4, we apply the temporal transition model with the query point's flow estimate v_{j,t−1}. The prediction step with BACC enables removal of traces left by moving objects "exiting" voxels. For example, if a car was in motion at time stamp t − 1 and moving out of a voxel j, calculating v_{j,t−1}^{car} ensures that the map maintains confidence about a static class such as "road" (α_{j,t}^{road}) and decreases confidence about the car class (α_{j,t}^{car}). Additionally, the FORC in our algorithm facilitates the "entry" of dynamic objects into previously encountered areas in the map by reducing overconfidence in "static" and "free" classes. With the update step in line 2.7, incoming spatial and semantic training data (X_t, Y_t) is incorporated.
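One predict-update cycle for a single voxel, combining the transition in (4) with the BKI update in (1), can be sketched as follows (illustrative names; the kernel argument stands in for whichever spatial kernel is configured):

```python
import numpy as np

def dynamic_bki_step(alpha_prev, v_prev, x_j, X_t, Y_t, kernel):
    """One recursive Bayes-filter cycle for a voxel (sketch of Algorithm 2)."""
    # Prediction: temporal transition with the previous step's voxel flow, eq. (4).
    alpha = np.exp(-np.asarray(v_prev, float) ** 2) * np.asarray(alpha_prev, float)
    # Update: closed-form BKI with the new scan (X_t, Y_t), eq. (1).
    w = kernel(np.linalg.norm(X_t - x_j, axis=1))
    for k in range(alpha.size):
        alpha[k] += w[Y_t == k].sum()
    return alpha
```

A class whose flow was large at t − 1 (e.g., a car leaving the voxel) has its concentration decayed in the prediction step, while classes re-observed in the new scan regain mass in the update step.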

V. QUANTITATIVE EVALUATION FOR DYNAMIC MAPPING
Typically, dynamic and semantic mapping methods that operate on stereo images re-project the map onto the image plane, and evaluation is done based on pixel-wise semantic segmentation of the image [48], [53]. If one were to use the same quantitative metric to evaluate the entire scene's geometric-semantic reconstruction accuracy, it would fail to capture the "complete" scene, i.e., how well the map can represent portions of the environment that are not reflected on the evaluation image, e.g., free cells in the map. This problem is significant for evaluating dynamic maps as free space in the environment might be mis-classified as occupied due to artifacts. Therefore, we propose a querying framework for dynamic semantic occupancy mapping that considers both the "complete" scene and scene dynamics for the map evaluation.
Let M be the map we are building to represent an environment. Let the corresponding ground truth model of the environment be denoted by G. The ground truth model could be sensor data post-processed with correct labels or a map representation, e.g., a semantically-labelled point cloud, a set of RGB-D images, a heightmap, etc. Let us assume that the rays from our current scan intersect voxels that are "observed" by the robot (marked as "Visible" in Figure 4). We model this "Visible" portion of G with M_v, i.e., the voxels in the map currently being "observed" by the robot. The portion of the environment that is not seen in the current scan could then be either previously explored or still unexplored. If some portion of the environment is "Unexplored" as in Figure 4, there would be no voxel created in M for that portion. Otherwise, if the voxel was previously explored and is not visible to the robot in the current scan, we consider it "Occluded" as shown in Figure 4. We denote the voxels in M that represent the occluded portion of the environment as M_o. Our map M thus comprises voxels that are visible in the current scan (M_v) and voxels that are not (M_o).
To perform a quantitative comparison of the semantic scene representation between our model M and any ground truth G, we build upon two different map query frameworks introduced in [59]: accuracy and completeness. In both metrics, we assess the intersection between M and G. However, in map accuracy, we evaluate each element in M_v against G, and in map completeness each element of G against M_v ∪ M_o.

A. MAP ACCURACY
The accuracy of the map, as the name suggests, quantifies how correct the visible metric-semantic representation M is when compared with the true value G of the environment. As this work pertains to semantic occupancy mapping, "correctness" is specifically the semantic classification accuracy.
For each element (in our case, voxel) m ∈ M_v, we generate the corresponding ground truth g_m ∈ G that is the "closest semantic neighbor" to m. In practice, g_m will be the nearest element in G to m in metric space and also the most representative of the semantic category m could belong to. For example, if G is a point cloud, there may be many points residing within m. The most representative semantic category is then the majority semantic label of the points within voxel m.
The semantic prediction for voxel m can then be evaluated against that of g_m. Details about how g_m can be generated from sensor data for single- and multi-view data sets are discussed in Section VI-C1. If we are comparing the map accuracies of mapping methods with different map representations (e.g., a uniform resolution voxel map versus a point cloud map with a non-uniform point distribution), however, the query elements m ∈ M_v could cover metric space differently. Therefore, map accuracy is more suited for comparing maps under the same representation (e.g., both are uniform resolution voxel maps). Consequently, we compare Semantic-BKI (S-BKI) and Dynamic-BKI (D-BKI) using the same map representation, and compute the precision, recall, and Jaccard scores across all classes.
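For instance, the majority-vote ground truth g_m for a voxel can be sketched as follows (hypothetical helper, assuming an axis-aligned cubic voxel and a point-cloud G):

```python
import numpy as np

def majority_label(points, labels, voxel_center, resolution):
    """Ground-truth label g_m for voxel m: majority vote over the labels of
    ground-truth points falling inside the voxel; None if the voxel is empty."""
    half = resolution / 2.0
    inside = np.all(np.abs(points - voxel_center) <= half, axis=1)
    if not inside.any():
        return None   # no ground truth available for this voxel
    vals, counts = np.unique(labels[inside], return_counts=True)
    return int(vals[np.argmax(counts)])
```

The voxel's predicted class (the MAP class of its Dirichlet posterior) is then scored against this majority label.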

B. MAP COMPLETENESS
Completeness of the map pertains to how much of the environment, including both visible and occluded regions M_v ∪ M_o, is reconstructed correctly. For instance, if we want to evaluate a portion of the environment that was previously observed but is not currently visible, "completeness" can be used to ascertain whether the map is able to represent the environment correctly. This is because we sample the environment to query the map, rather than the other way around. If we use multiple views of the environment to obtain ground truth G for portions of the environment that are currently occluded in the map M_o, it is possible to include M_o in completeness. Another advantage of using map completeness as a metric is that we can compare the map inference performance across methods with different map representations.
We pick an element g ∈ G and find its "closest semantic neighbor" m_g ∈ M. Again, m_g would be the nearest element to g in metric space and the most representative semantic category that g could belong to. In practice, we seek the voxel in which the element g falls. If the metric distance between g and m_g is greater than a certain margin, we consider that g is currently unexplored by the robot, and exclude these pairs from the evaluation. This kind of space is shown in the rightmost column in Figure 4.
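The margin-based query can be sketched as follows (illustrative helper; a brute-force nearest-neighbor search stands in for whatever spatial index an implementation would use):

```python
import numpy as np

def completeness_query(g, voxel_centers, voxel_labels, margin):
    """Find the closest-voxel prediction m_g for a ground-truth element g;
    return None if g lies farther than `margin` from every voxel, i.e., the
    space around g is still unexplored and the pair is excluded."""
    d = np.linalg.norm(voxel_centers - g, axis=1)
    i = int(np.argmin(d))
    if d[i] > margin:
        return None
    return voxel_labels[i]
```

A natural choice for `margin` is on the order of the voxel resolution, so that only ground-truth elements actually covered by the map are scored.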
In addition, to keep the evaluation relevant to dynamic semantic mapping, we treat static and dynamic objects differently when calculating completeness.
1) If g is static. As g is static, g could not have moved irrespective of whether the robot has observed it. As a result, we evaluate semantic accuracy for any nearest neighbor m_g ∈ M.
2) If g is dynamic. As only dynamic objects in M_v are currently seen by the robot, we evaluate semantic accuracy for all m_g ∈ M_v. We do not evaluate on M_o, as these voxels in the map are occluded and could have a different state from the last time they were observed by the robot.

C. AUXILIARY TASK: SEMANTIC SEGMENTATION
Semantic segmentation of a point cloud is another task that can be performed with an existing map model M, and we can consider the map's performance on this auxiliary task for additional evaluation. With semantic mapping, we can inherently fuse multi-frame measurements and improve semantic classification accuracy as in S-BKI [11]. The querying method for a single scan is simple: we pick each point in D_t already inserted into the map M at time stamp t and check which voxel it falls inside. D_t contains Y_t, which are the semantic label predictions corresponding to X_t. As discussed in Section III, these are typically obtained from a neural network. The semantic category of the voxel becomes the prediction from our model M. Both of these can then be compared with the ground-truth, semantically-annotated point cloud that is typically provided in data sets such as SemanticKITTI [60]. This comparison can show whether semantic segmentation predictions can be improved through smoothing [11].
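The point-to-voxel lookup can be sketched as follows, assuming a hypothetical map stored as a dictionary keyed by integer voxel indices (a simplification of whatever spatial data structure the real implementation uses):

```python
import numpy as np

def query_segmentation(points, map_labels, resolution):
    """Predict a semantic label for each scan point from the voxel it falls
    inside; -1 marks points whose voxel does not exist in the map."""
    idx = np.floor(np.asarray(points) / resolution).astype(int)
    return [map_labels.get(tuple(i), -1) for i in idx]
```

Comparing these per-point predictions against the annotated point cloud yields the segmentation scores used in the auxiliary evaluation.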

VI. RESULTS AND DISCUSSION
In this section, we first describe our experimental setup and the data sets used for evaluation. Then, we demonstrate the performance of the proposed mapping system D-BKI with qualitative results on synthetic and real data sets. Finally, quantitative results on semantic scene understanding sub-tasks using single- and multi-view data sets are presented.

A. EXPERIMENTAL SETUP
We first describe our (i) system design choices, then elaborate on (ii) the data sets used, and lastly discuss (iii) flow estimation for each point cloud data set.

1) System Design Choices
In the proposed mapping framework, every query voxel has 6 neighbours (one on each facet). For computational efficiency, only the training points within the 6 neighbouring voxels are used in the calculation of both (5) and (6). We choose a sparse kernel [61]:

K_v(x, x') = σ_1 [ (2 + cos(2π d / l_1)) / 3 · (1 − d / l_1) + sin(2π d / l_1) / (2π) ]  if d < l_1, and 0 otherwise,

where d = ‖x − x'‖, l_1 > 0 is the length scale, and σ_1 is the kernel scale parameter. Typically, the kernel length scale in K_v and K_v^{free} is chosen with respect to the resolution of the map being built, as it controls how much influence a point in a neighbouring voxel has. In our experiments, l_1 is greater than the map resolution, and σ_1 can be set once in the beginning according to the size of the point set and the free-space sampling rate. Lastly, in Algorithm 1, we implement f(·, ·) as a moving average filter in line 1.12.
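A sketch of this sparse kernel in code, assuming illustrative default values for the hyperparameters; note the compact support (exactly zero for d ≥ l_1), which is what makes restricting computation to neighbouring voxels exact rather than approximate:

```python
import numpy as np

def sparse_kernel(d, l=0.3, sigma=1.0):
    """Sparse kernel of [61]: k(0) = sigma, smoothly decays to 0 at d = l,
    and is exactly zero beyond the length scale l."""
    d = np.asarray(d, dtype=float)
    r = d / l
    k = sigma * ((2.0 + np.cos(2.0 * np.pi * r)) / 3.0 * (1.0 - r)
                 + np.sin(2.0 * np.pi * r) / (2.0 * np.pi))
    return np.where(d < l, k, 0.0)
```

Choosing l_1 slightly larger than the voxel resolution, as in the experiments, lets points in the 6 neighbouring voxels contribute with smoothly diminishing weight.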

2) Data Sets and Benchmarks
We use point cloud-based data sets that contain only positional information. However, our method is amenable to any point cloud data with intensity, colour or other fields. Additionally, it can also be applied to depth camera data, for which larger data sets for training and evaluation exist.

Gazebo Simulation Environment
To create an indoor synthetic data set, a Gazebo simulation environment was set up with multiple Turtlebots exploring a house. We mounted one robot (the ego-robot) with an omni-directional block laser scanner for data collection in the form of point clouds with positional information only. To simulate dynamic objects in the environment, we have three other Turtlebots exploring the same house. Using a reactive planner, the robots avoid each other and obstacles in the environment. The collected data is processed using the Point Cloud Library (PCL) [62] and annotated based on height into three semantic classes: floor, robot, and miscellaneous objects including walls and cabinets. The scene flow for each scan is computed with FlowNet3D [56].

SemanticKITTI Data Set
The SemanticKITTI data set [60] is a large-scale real driving data set based on the KITTI Vision Benchmark [63] where semantically-annotated LiDAR scans and camera poses are provided for all sequences. Camera poses are estimated with SuMa [64], and semantic annotations for each LiDAR scan are generated by RangeNet++ [65]. Additional labels are provided to distinguish static objects from dynamic objects, such as person and moving-person. There are 22 sequences, out of which 11 sequences are provided with ground truth labels for training (00-07), validation (08) and testing (09-10). Sequences 11-21 do not come with ground truth semantic labels, but can be evaluated on a public leaderboard using the mean Intersection-over-Union (mIoU) metric. Since the most reliable ground truth model in this single-view data set is the semantically-labeled point cloud, we use it to generate the ground truth model G.

CARLA Data Set
To evaluate the map completeness of the proposed dynamic semantic mapping, a reliable ground truth model of the environment including free space is needed. As real data sets collected using a single-view sensor (such as SemanticKITTI) usually do not have sufficient measurement coverage to fully recover the underlying environment model, we leverage the CARLA simulation environment [66].
We generate a synthetic multi-view scene completion data set sequence from the CARLA [66] simulator. The methodology for its creation is publicly available in [17]. We generate ground-truth environment models by uniformly distributing multiple LiDAR sensors around the ego vehicle, effectively obtaining a 3D Monte Carlo sampling of the world which is i.i.d. with respect to time. The simulation environment also provides ground-truth scene flow (velocity) and semantic labels for each point. Free space observations are obtained by linearly interpolating along all points at a fixed interval of 1.5 meters. Ground truth point clouds with semantic labels are then fused into a semantically annotated ground truth voxel model G with 0.3 meter resolution. The voxel centers in the ground truth model act as query points in the completeness experiments. This approach is similar to the SemanticKITTI [60] scene completion data set; however, it has no traces from dynamic objects and fewer occlusions due to sampling from multiple sensors at the same time.
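The free-space interpolation step can be sketched as follows (illustrative helper; samples are placed strictly between the sensor origin and each measured point, so the occupied endpoint itself is never marked free):

```python
import numpy as np

def free_space_samples(sensor_origin, points, interval=1.5):
    """Sample free-space observations by linear interpolation along each
    sensor ray at a fixed interval (1.5 m in the CARLA setup)."""
    sensor_origin = np.asarray(sensor_origin, dtype=float)
    free = []
    for p in np.asarray(points, dtype=float):
        ray = p - sensor_origin
        dist = float(np.linalg.norm(ray))
        n = int(np.ceil(dist / interval)) - 1   # samples strictly inside the ray
        for s in range(1, n + 1):
            free.append(sensor_origin + ray * (s * interval / dist))
    return np.array(free)
```

These free-space points are then voxelized together with the labeled returns to build the ground truth model G.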

3) Flow Estimation
To obtain the corresponding flows U_t of a point cloud X_t, we choose FlowNet3D [56], a state-of-the-art supervised deep learning architecture based on PointNet++ [67] that estimates scene flow between two successive point clouds X_{t−1} and X_t.
Typically, implementations of scene flow estimation train on XYZRGB fields, i.e., the point cloud includes both position and color information. We trained an adapted version of FlowNet3D on the KITTI 2015 Scene Flow data set [68] and the FlyingThings driving data set [69] using only the position information. After obtaining U_t from the network, we perform ego-motion compensation by subtracting the mean flow of the static classes from U_t.
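The ego-motion compensation step can be sketched as below; a minimal numpy sketch, where the set of static class IDs is a hypothetical placeholder for whatever label mapping the segmentation network uses:

```python
import numpy as np

# Hypothetical static class IDs (e.g., road, building, vegetation);
# the real mapping depends on the data set's label definitions.
STATIC_CLASSES = {0, 1, 2}

def compensate_egomotion(flow, labels, static_classes=STATIC_CLASSES):
    """Remove the ego-motion component from an estimated flow field U_t.

    Static points should have zero true flow, so the mean estimated flow
    over static classes is attributed to ego-motion and subtracted from
    every point's flow vector.
    """
    flow = np.asarray(flow, dtype=float)
    static_mask = np.isin(labels, list(static_classes))
    if not static_mask.any():
        return flow  # no static points to estimate ego-motion from
    ego_flow = flow[static_mask].mean(axis=0)
    return flow - ego_flow
```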
As we only need flow for moving objects, our mapping method can be applied whenever flow information for {x_t^q ∈ X_t | ∀q ∈ Q} is available. To demonstrate the performance of the mapping framework independent of flow estimation error, we obtain the ground truth velocity of dynamic objects from the CARLA simulator.

B. QUALITATIVE RESULTS
The goal of this section is to (i) qualitatively demonstrate the improvements of the temporal transition model through ablation studies, and (ii) compare real-time map construction by semantic and dynamic BKI on the data sets described in Section VI-A2.

1) Ablation Studies
We perform two ablation studies to demonstrate the function and efficacy of each component of the method. These studies qualitatively show how the global map inference performs without backward (BACC) or forward (FORC) correction. Note that our global map is colored with a gray floor, mustard walls, and red Turtlebots. We annotate the ego-robot building the map with a white box around it. Holes in the floor are typically spaces the sensor has not scanned yet. We only tune the parameters (in Table 2) for K_v; l_s and σ_s are, respectively, the spatial kernel length scale and scale parameters from [11].
Without BACC: To conduct this study, we remove scene flow aggregation for all dynamic classes by setting v_{j,t−1}^q = 0 for all q ∈ Q and observe the map as it is being built. Results are shown in the top row of Fig. 5. In the simulation snapshot, we highlight the 3 Turtlebots in the environment that are in motion. Without BACC, trails are visible behind each Turtlebot because their motion is not considered during map-building. With BACC, no trails are left behind and each robot consistently keeps the same size due to the incorporation of the temporal transition model.
Without FORC: Results for this experiment are shown in the bottom row of Fig. 5. The simulation snapshot shows 2 moving Turtlebots in the environment. Without FORC, the motion of these Turtlebots around free voxels is not considered when computing v_{j,t−1}^{free} in (6). As a result, the prediction step for α_{j,t}^{free} in 2.4 becomes obsolete. If α_{j,t}^{free} > α_{j,t}^{robot}, then voxel j will be (incorrectly) classified as a free cell. Note that the other two robots do not get incorporated into the map, as α_{j,t}^{free} > α_{j,t}^{robot} for their voxels. With FORC, the map successfully represents the two Turtlebots.
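For intuition, the classification rule used in this ablation (a voxel reads as free whenever α^free exceeds the competing concentration parameters) amounts to an argmax over the voxel's Dirichlet concentrations. The helper below is an illustrative sketch, not the paper's implementation; the class names are placeholders:

```python
import numpy as np

def classify_voxel(alpha, class_names):
    """Label a voxel from its Dirichlet concentration parameters
    alpha_{j,t} = (alpha^1, ..., alpha^K).

    The expected posterior probability of class k is alpha_k / sum(alpha),
    so the voxel label is the argmax: e.g., alpha^free > alpha^robot
    classifies the voxel as free.
    """
    alpha = np.asarray(alpha, dtype=float)
    probs = alpha / alpha.sum()  # expected Dirichlet class probabilities
    return class_names[int(np.argmax(probs))], probs
```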

2) SemanticKITTI Data Set
We include images from sequences 01 and 04 of the SemanticKITTI data set to highlight the differences between static (S-BKI) and dynamic (D-BKI) mapping, as these are not easily captured in the semantic segmentation competition. The results are shown in Fig. 6, where S-BKI either completely discards dynamic objects over time or leaves them in the map, depending on the parameter choice. In contrast, D-BKI is able to accurately represent the moving objects without leaving long trails. Some of the parameters used to run the experiments are specified in Table 3.

3) CARLA Data Set
We show qualitative results on five different scenarios described in Fig. 7 and in the appendix, comparing our approach with various baselines in Figs. 8 and 9.

C. QUANTITATIVE EVALUATION
1) Semantic Mapping
For this sub-task, we compare the estimated map against the ground truth world models G of the SemanticKITTI and CARLA simulator data sets (described in Section V). We conduct our experiments with two querying and evaluation metrics: map completeness and map accuracy. Map accuracy is measured at the intersection of the visible estimated map M_v with G, and is evaluated at each voxel in the estimated map. Map completeness includes both visible (M_v) and occluded voxels (M_o), and is evaluated at each ground truth element. Note that there are some considerations about ground truth world model generation, which we discuss in detail next.
We compare our algorithm, D-BKI, against the static semantic mapping baseline S-BKI [11], and a scene-propagation-based dynamic semantic mapping algorithm by Kochanov et al. [53]. Although [53] presents results on building voxel maps from stereo images, the approach is general: it performs semantic segmentation and scene flow estimation and later incorporates them into the mapping pipeline. We re-implemented their approach and tuned it to generate results on LiDAR point clouds. As their approach updates semantic and occupancy probabilities separately, we performed free space sampling to provide this extra information.
Single-view data set: For a single-view data set, we select the well-known SemanticKITTI data set [60]. Semantic labels Y_t for training are obtained from the Cylinder3D-multiscan model [70], and the ground truth labels in [60] are used to generate a ground truth world model G. As evaluation data for multi-scan dynamic semantic mapping with free space labels is not available, we generate it ourselves by keeping the semantically-labeled point cloud intact but adding free space samples to it. For evaluating both map accuracy and map completeness, we create a point set D_t^{free} containing only free space labels by sampling free space every 1.5 m from the sensor origin to each point in the point cloud. D_t^{free} is then downsampled by a voxel-grid filter and added to the ground truth semantically labeled point cloud D_t to generate G.
For map accuracy, to compute the "closest semantic neighbor" g_m discussed in Section V, we consider the semantics of all points in G that fall within each visible voxel m ∈ M_v. The semantic category with the most points in m is chosen as the ground truth semantic category of g_m. Occupied space samples (D_t) are given priority over free space samples, i.e., g_m is considered "free" only if m exclusively contains free space samples.
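This majority vote with occupied-over-free priority can be sketched minimally as follows; the helper name and label encoding are our own, not the paper's:

```python
from collections import Counter

FREE = "free"  # assumed free-space label encoding

def closest_semantic_neighbor(point_labels):
    """Ground-truth label g_m for a visible voxel m.

    `point_labels` holds the semantic labels of all ground-truth points
    that fall inside the voxel. Occupied labels take priority over free
    space: the voxel is 'free' only if it contains free samples
    exclusively; otherwise the majority occupied label wins.
    """
    occupied = [l for l in point_labels if l != FREE]
    if not occupied:
        return FREE
    return Counter(occupied).most_common(1)[0][0]
```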
We show results of this approach on the entire SemanticKITTI data set in Table 4 and demonstrate how D-BKI improves map accuracy over S-BKI. The average IoU over each scan is computed for each sequence and aggregated per class. The IoUs of a particular class are highlighted if there is a >0.01 difference between the methods. One can see that S-BKI and D-BKI perform similarly for the 20 static classes. However, for all 6 dynamic classes, D-BKI consistently outperforms S-BKI by a significant margin. This result also shows the importance of considering free space when evaluating dynamic maps, as the artifacts introduced by S-BKI in Fig. 6a and Fig. 6b would not be evaluated if we restricted our evaluation methodology to occupied space only.
For map completeness, the "closest semantic neighbor" m_g for each g in the ground truth G is the voxel in M that g falls within. To compare with other dynamic semantic mapping methods, we picked four sequences representative of the challenges faced while driving in dynamic environments: highways (sequence 01) at high speed, countryside (sequence 03) and cities (sequence 06) at normal speed, and residential city areas (sequence 10) at slow speed. On this subset, Table 4 shows how D-BKI and Kochanov et al. [53] perform on the map completeness metric. We average the Jaccard scores (i.e., mean IoU) across each scan in these four sequences for both methods at the same resolution of 0.3 m. D-BKI performs better than or similarly to Kochanov et al. [53] on dynamic classes. As the map resolution is low, D-BKI's mIoU drops slightly for pedestrians (smaller objects) but remains high for larger dynamic objects. The mIoU is significantly higher than [53] on static classes.
Multi-view data set: For the CARLA data set, we picked five scenarios often encountered in a dynamic urban environment: (i) a static car in the presence of moving pedestrians, (ii) a car having to stop for a jaywalking pedestrian, (iii) a car driving at high and (iv) low speeds in dense traffic, and lastly (v) a car driving in light traffic. The data is acquired over an 1800-scan sequence; each of these scenarios is 100 scans long but presents different challenges. The semantic segmentation labels input to the map are obtained from the simulator.
We compare the map accuracy of S-BKI and D-BKI in Table 5. Note that for accuracy, each visible voxel (m ∈ M_v) is queried against the ground truth model G. Since S-BKI and D-BKI share the same map representation, the maps share the same origin and have overlapping voxels.
Precision indicates how many predictions made by the map match the ground truth, and is calculated per semantic class k as the number of voxels m correctly labeled k divided by the total number of voxels m labeled k. Therefore, precision will be lower for dynamic classes if residual traces are not removed during map propagation. For example, the trails seen in Fig. 6 for S-BKI lead to low precision for dynamic classes. This pattern can be seen in Table 5, where D-BKI has improved precision on the vehicle and pedestrian classes.
Recall is another useful metric for evaluating maps in dynamic environments. In contrast to precision, recall is calculated as the proportion of ground truth measurements g_m with semantic label k that were correctly identified. The difference in recall between static and dynamic mapping is most evident in the free class. If traces from dynamic objects are not removed, free space voxels will be incorrectly labeled occupied, and thus recall for the free category will be lower. This is also evident in Table 5.

TABLE 4 :
Quantitative results for Dynamic-BKI using two map evaluation methods on the SemanticKITTI data set [60] for 26 semantic classes. Comparisons are made with Semantic-BKI [11] for Map Accuracy and Kochanov et al. [53] for Map Completeness. The performance metric mean IoU (mIoU) is used to quantify the collected data.

To evaluate how much of the ground truth G is modeled correctly by the maps, we report map completeness using both the visible and occluded portions of the environment (shown in Fig. 4) that correspond to M_v and M_o in the map. Table 6 showcases our map performance in comparison to the scene-propagation-based dynamic semantic mapping by Kochanov et al. [53]. Experiments were conducted with a map resolution of 0.1 m against a higher resolution ground truth voxel map. As our mapping method inputs the Velodyne point cloud without free space sampling, our precision in Table 6 is slightly lower due to smoothing at the boundaries of occupied space. Our results are still comparable despite using less information. This is especially evident for the class "traffic sign," as it is a smaller object. Recall for D-BKI is higher for all occupied semantic classes in the CARLA data set. We also evaluate the Jaccard score of the occluded portions of the map and find that D-BKI performs better than [53] in all semantic categories. This experiment shows that D-BKI retains the occluded parts of the map much better over a period of time.
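The per-class precision and recall used in these comparisons can be sketched as below; an illustrative numpy sketch over paired per-voxel labels, not the actual evaluation code:

```python
import numpy as np

def precision_recall(pred, gt, k):
    """Per-class precision and recall over paired voxel labels.

    precision_k = voxels correctly labeled k / all voxels labeled k
    recall_k    = voxels correctly labeled k / all ground-truth k voxels
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    tp = np.sum((pred == k) & (gt == k))      # true positives for class k
    pred_k = np.sum(pred == k)                # all predictions of class k
    gt_k = np.sum(gt == k)                    # all ground-truth k voxels
    precision = tp / pred_k if pred_k else 0.0
    recall = tp / gt_k if gt_k else 0.0
    return precision, recall
```

As the text notes, residual traces inflate the number of (wrong) occupied predictions, which lowers precision for dynamic classes and recall for the free class.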

2) Auxiliary Task: Semantic Segmentation
For the sub-task of semantic segmentation, we evaluate our results quantitatively on the SemanticKITTI benchmark. The ground truth semantically-annotated point clouds are available in the data set.
Semantic observations Y_t for training are obtained from the Cylinder3D-multiscan model [70], and the data set is divided into training (sequences 00-10) and testing (sequences 11-21). For each point in a point cloud (X_t), we compute the per-class mean IoU using the Jaccard index against the ground truth labels provided in the SemanticKITTI data set. Table 3 details the parameters used to run the experiments at a map resolution of 0.1 m. Our results in Table 7 show that D-BKI mapping improves upon Cylinder3D in nearly every category. The results also show that spatiotemporal smoothing is beneficial for segmentation accuracy, a valuable insight for future research in this area.
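The per-class IoU (Jaccard index) and its mean can be sketched as follows; a minimal numpy sketch assuming integer class labels, where classes absent from both prediction and ground truth are excluded from the mean:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Jaccard index IoU_k = TP / (TP + FP + FN) per class.

    Classes absent from both prediction and ground truth are returned
    as NaN so they can be excluded from the mean.
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = np.full(num_classes, np.nan)
    for k in range(num_classes):
        inter = np.sum((pred == k) & (gt == k))  # TP
        union = np.sum((pred == k) | (gt == k))  # TP + FP + FN
        if union:
            ious[k] = inter / union
    return ious

def mean_iou(pred, gt, num_classes):
    """Mean IoU over the classes that actually appear."""
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))
```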

D. DISCUSSION
We showed that a simple auto-regressive transition model enables dynamic scene propagation and rectifies the pitfalls of the static-world assumption in the Semantic-BKI mapping algorithm, either by reducing traces in the map or by preventing overconfidence in free space. The metrics proposed for quantitatively evaluating dynamic mapping provide more perspective on the appearance of the global metric-semantic map rather than the local view. This can be helpful for checking whether unoccupied space is erroneously classified as occupied, or vice versa. The work can be applied to any sensor data that can be represented as an XYZ point cloud. Given that acquiring scene flow for a full point cloud is more challenging than acquiring it for camera data, we anticipate that the performance is transferable to other 3D sensors.
Building the map at finer resolutions achieves significantly better performance, but at the expense of higher computational burden and memory usage. A garbage collection process could be useful when a sequence runs too long for a dynamic-mapping application. Adaptive kernel lengths according to object size (e.g., for vehicular and human classes) may improve results. Future work includes investigating methods to compress and streamline data acquisition, demonstrating results on data sets from unstructured environments, and investigating memory-based alternatives to the autoregressive model.

VII. CONCLUSION
We developed a dynamic mapping algorithm based on Bayesian kernel inference that models the motion of dynamic objects using scene flow.Our map may be built from

FIGURE 1 :
FIGURE 1: Dynamic Semantic Mapping Pipeline. Raw point clouds are inputs to scene flow and semantic segmentation neural networks, which compute the input to the mapping algorithm, D_t = {X_t, Y_t, V_t}. The dynamic map updates voxels parameterized by θ using scene flow aggregation and Bayesian inference. The dynamic map is capable of updating cells containing dynamic objects without leaving any residual traces.

FIGURE 2 :
FIGURE 2: We illustrate the observation of a moving object through the middle voxel in this map and display how that voxel's semantics differ at every time step. For every time step, we plot the posterior Dirichlet probability density function (PDF) of the voxel on a 2-simplex. The shift in the rainbow gradient demonstrates how the belief about a semantic category (robot, free or other) can evolve from t=0 to t=2 to classify the voxel correctly at that time. This shift can be influenced by changing the concentration parameters (hyperparameters) of the Dirichlet posterior: α_{j,0} → α_{j,1} → α_{j,2}.

FIGURE 3 :
FIGURE 3: A graphical model for hyperparameter propagation. For each voxel j updated at time t and for each class k, the hyperparameter α_t is a deterministic function of the flow at the previously observed time stamp, v_{t−1}, and the prior α_{t−1}.

FIGURE 4 :
FIGURE 4: An illustration of the ground truth model as viewed from the robot's perspective. Rays pass through the areas observed by the robot. Areas unseen by the robot in the current scan could have been previously observed but are currently occluded. Alternatively, there could also be areas unexplored by the robot but present in the ground-truth model created by the multi-view data set.

FIGURE 5 :
FIGURE 5: Ablation Studies with Gazebo Simulation. The images in the top row demonstrate the functionality of BACC, and the bottom row that of FORC. Gazebo Simulation Snapshot: In this column, we show the top view of the Gazebo simulation. The ego robot is demarcated within a white square, while the other moving Turtlebots are highlighted and marked with their orientation. Without: The images in the middle column show the global map built without the specific modules. Without BACC, traces are left in the map where Turtlebots were present at previous time steps. Without FORC, the map fails to represent the other two Turtlebots. With: The right-most column shows the global maps constructed with our approach. Minimal traces are left in the map, and the Turtlebots are in their correct locations.

FIGURE 6 :
FIGURE 6: Qualitative results on Sequences 01 (top) and 04 (bottom) of SemanticKITTI. Four images are shown for both frames: (Top Left:) the right stereo image corresponding to one of the scans. Note that the image is included for validation only; we strictly perform the map update from LiDAR when generating results. (Top Right:) S-BKI mapping without any free space sampling. Trails are left where cars passed. (Bottom Right:) S-BKI with free space sampling. After a few scans, the map becomes overconfident about the presence of free cells and fails to incorporate dynamic objects. (Bottom Left:) D-BKI with free space sampling. The cars are tracked with minimal traces.

FIGURE 7 :
FIGURE 7: Qualitative results for S-BKI (top left), Kochanov et al. (top right) and D-BKI (bottom right) on the CARLA data set. This scene (bottom left) showcases a static ego-car (in dark black) parked on the road while pedestrians walk around within its sensing range. In the map images, the red blobs are pedestrians walking on the street and the black rectangles are other cars (whether parked or moving). One can see that D-BKI "fills" in the shape of the car more than Kochanov et al. (as in the car on the top left) and does not leave trails of the walking pedestrians as S-BKI does. Kochanov et al. yields a sparser representation of pedestrians than D-BKI, likely due to their unique shape.

TABLE 1 :
Comparison of properties of DynamicSemanticBKI with respect to other dynamic SLAM and dynamic mapping systems. Although we compare D-BKI only to other mapping baselines, we elaborate on the properties of both SLAM and mapping systems to highlight the dynamic mapping taxonomy. In the table, C = (mono, stereo, RGB-D) and 3DC = (stereo, RGB-D).

TABLE 2 :
Parameters for Ablation Studies.

TABLE 3 :
Parameters for the SemanticKITTI results. The map resolution and downsampling resolution differ between Map Accuracy and Map Completeness, but are equal to each other.

TABLE 5 :
Quantitative results for Dynamic-BKI and Semantic-BKI using the map evaluation method Map Accuracy on the CARLA data set. The data collected is evaluated with two performance metrics: Precision and Recall.

TABLE 6 :
Quantitative results for Dynamic-BKI and Kochanov et al. [53] using the map evaluation method Map Completeness on the CARLA data set. The data collected for the visible map M_v is evaluated with two performance metrics, Precision and Recall. The data collected for the occluded map M_o is evaluated with mean IoU (mIoU).