Towards Customizable Robotic Disinfection With Structure-Aware Semantic Mapping

In the current COVID-19 pandemic, robots are expected to replace humans in disinfecting public places. Since different regions in the environment carry different risks, in addition to the conventional SLAM capability, the robot also needs to recognize and distinguish different objects in the scene to complete customizable disinfection tasks. In this paper, we propose a LiDAR-based semantic mapping system that can be used for robotic disinfection tasks. By using prior information about the scene structure, our system can extract different levels of semantic information from the raw point cloud, so as to construct not only an occupancy grid map for navigation, but also a hierarchical semantic map that meets the needs of customizable disinfection tasks, including setting the navigation waypoints, disinfection distances and disinfection time. The effectiveness of the proposed system is demonstrated in real-world metro disinfection applications.


I. INTRODUCTION
With the current global pandemic of COVID-19, the disinfection of public places has attracted increasing attention. Since manual disinfection is very time-consuming and labor-intensive, researchers are trying to use autonomous robots to complete disinfection tasks, reducing the burden on the staff involved and avoiding potential infection risks [1], [2].
Many scenarios that need to be disinfected are relatively standard structured scenes, such as metros and buses. Taking the metro car in Fig. 1 (top) as an example, the disinfection robot needs to autonomously navigate from the start to the end of the metro and perform the disinfection task (such as spraying disinfectant or emitting ultraviolet rays) at the same time.
Simultaneous localization and mapping (SLAM) is the core functionality for the robot to complete the disinfection task. Furthermore, since different structures in the scene usually carry different risks, disinfection tasks often need to be customizable, that is, the disinfection time and disinfection distance vary from structure to structure [3], [4].
For example, the seats and armrests in metro cars are the regions most frequently touched by people, so compared to other regions, a longer disinfection time and a shorter disinfection distance are required to ensure a better disinfection result.
The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To be able to fulfill the customizable disinfection requirements for different structures in the environment, unlike normal navigation tasks which only require the robot to know where there are obstacles, disinfection tasks need the robot to understand the structure around it, i.e., to have a semantic level understanding. Therefore, in this paper, we present a LiDAR-based semantic mapping system for robotic disinfection tasks. Our system can use the scene structure as the prior knowledge to segment the point cloud and obtain the semantic information in it, so as to understand the structure around the robot. The semantics as depicted in Fig. 1 (bottom) can be extracted and recorded in the map, and finally used to support the robot to complete customizable disinfection tasks.

II. RELATED WORK
A. LiDAR-BASED SEMANTIC MAPPING
LiDAR-based SLAM systems such as GMapping [5], Hector SLAM [6] and Cartographer [7] are widely applied in practical applications where a robot requires autonomous navigation. Although these frameworks have achieved great success, they are usually designed for general navigation tasks, which only consider the obstacles and build occupancy grid maps of the environment. Therefore, directly applying these frameworks is not sufficient to meet the needs of customizable disinfection tasks.
Compared with traditional SLAM systems, semantic SLAM systems attempt to extract deeper and higher-level information from the raw data and eventually build a semantic map. Two different levels of semantics have been widely studied and applied: the feature-level semantics and the object-level semantics.

1) FEATURE-LEVEL
The feature-level semantics consists of various geometric primitives, like points, lines and planes detected and extracted from the input point cloud. In [8], Chen et al. divided the point cloud into different categories and used such information to improve the data association. Line features [9] and planar surfaces [10] were also extracted from the point cloud for fast and precise localization in autonomous driving tasks.

2) OBJECT-LEVEL
The object-level semantics includes general objects and application-oriented objects. General objects can be dense/sparse point cloud models or solid figures. Salas-Moreno et al. matched the objects in the environment to their dense 3D models and performed the SLAM task at the level of objects [11], while Bowman et al. proposed an algorithm to resolve probabilistic data association of 3D objects in semantic SLAM problems [12].
For application-oriented object models, such as door signs in buildings [13], poles [14], [15] and traffic signs [16] in driving scenes, crops in precision agriculture [17], or trees in forest inventory [18], the performance of the SLAM system can usually be further enhanced by integrating such semantic information, and some task-related requirements can only be fulfilled along with the optimization.
In our proposed semantic mapping system, the raw LiDAR point clouds are segmented into different classes in order to provide semantic-level information for customizable disinfection tasks.

B. STRUCTURE AS PRIOR KNOWLEDGE
To boost the performance of general SLAM systems in specific applications, various kinds of prior knowledge about the environment and the task are often exploited.

1) BUILDING STRUCTURE
Karg et al. introduced the similarity of multi-floor layouts as global constraints to graph-based SLAM, which was able to generate consistent maps for multistory buildings [19]. Boniardi et al. took architectural floor plans as prior knowledge to enhance the robustness of robot localization [20], [21].

2) AERIAL STRUCTURE
Aerial images captured by satellites or unmanned aerial vehicles are another source of information that is commonly used to assist in the SLAM task. With the line segments [22] or edges [23] extracted from the aerial images, the mapping results could be constrained and brought closer to the ground truth. A special application in precision farming by Pretto et al. demonstrated the effectiveness of fusing aerial and ground information for more robust localization and mapping [24].

3) GEOMETRIC CONSTRAINTS
The orthogonality and parallelism in structured environments have also been extensively investigated and considered as the constraints for both data processing and optimization in SLAM problems. Daoust et al. leveraged the parallelism of tunnel walls to remove unimportant LiDAR measurements for localizing a train in challenging underground environments [25]. The orthogonal and parallel planes in both indoor [26] and underground [27] environments were utilized for better mapping performance.
In our framework, considering that the raw 2D LiDAR point cloud is very sparse and less representative, we also use the prior knowledge of the scene structure for point cloud segmentation.

C. APPLICATIONS OF DISINFECTION ROBOTS
Different types of disinfection robots based on UV-C light [28] and hydrogen peroxide vapor [29] have been applied to disinfect public facilities including hospitals, schools, and the metros investigated in this paper. As discussed in [30], the disinfection performance may be affected by various factors. Due to the shadows caused by occlusions, the UV-C light or chemicals may not always reach the surface of objects in the environment. The effectiveness also decreases as the distance increases. In [31], the authors proposed to identify potential contamination areas for adaptive robotic disinfection in built environments. In [32], Conte et al. designed a disinfection map to evaluate the disinfection performance w.r.t. the distance between the robot and the target structures.
FIGURE 2. The proposed semantic mapping framework for customizable disinfection. The scene structure information is provided to the system as the prior knowledge. The point cloud segmentation subsystem (Sec. IV) first builds correspondences between LiDAR point cloud and the scene structure with a structure-aware method, followed by an optimization-based refinement module and an object detection module to further extract feature-level and object-level semantic information in the scene, respectively. The semantic mapping subsystem (Sec. V) builds and updates a hierarchical semantic map with the poses estimated by a LiDAR odometry module. The constructed semantic maps are further used to support the task planning in customizable disinfection applications.
Because some regions of the scene to be disinfected are high-touch and high-risk, the robot needs to spend a longer disinfection time at a shorter distance from them in order to ensure complete disinfection. In our system, we mainly rely on a semantic-level understanding of the scenario to fulfill such high-level and customizable requirements.

III. FRAMEWORK
In this section, the pipeline and structure of the proposed semantic mapping system will be described. As shown in Fig. 2, the system can be divided into three parts:
• The point cloud segmentation subsystem is designed to segment the point cloud by associating them with the structure of the metro car, followed by an optimization based refinement procedure. To further leverage the object-level information, the stanchions are also detected from the point cloud with the prior knowledge from the scene structure.
• The semantic mapping subsystem consists of a LiDAR odometry module and a mapping module. The LiDAR odometry module provides the poses of the robot in the metro car, which can be further used to fuse the extracted semantic points and stanchions and eventually build a hierarchical semantic map.
• The customizable disinfection subsystem finally utilizes the rich semantic information in the scene to plan the robot's behavior, including placing the navigation waypoints, setting different disinfection distances and disinfection time for different regions.
In general, the semantic mapping framework we designed has three characteristics:
• Semantic: Semantic information (e.g., different categories of points including the door, wall, seat, joint, stanchion in the metro car) is extracted from the raw LiDAR point cloud, given the structure of the scene as prior knowledge. Then such semantic information will be merged to construct a hierarchical semantic map of the scene, which is necessary for customizable disinfection tasks, especially for the disinfection distance and time control.
• Adaptive: Our framework can be adapted to various scenarios with different structures. On the one hand, to segment the point cloud, the prior knowledge from scene structure can be configured according to different scenarios. On the other hand, the extracted semantic information used by the task planning process can also be selected according to customizable disinfection tasks.
• Online: Since the prior knowledge of the scene structure can be integrated, our framework is very lightweight and fast enough to run on the robot in real-time. The construction of the scene semantic map can be completed along with the disinfection process.
In the following two sections, we will first describe the details of structure-aware point cloud segmentation, followed by the semantic mapping procedure. For the customizable disinfection, we will demonstrate its implementation and application in the experiment section.

IV. STRUCTURE-AWARE POINT CLOUD SEGMENTATION
Since the 2D LiDAR point cloud is not as dense as the 3D LiDAR point cloud or camera images, it is usually difficult to extract the semantic information with a direct segmentation method. As reviewed in Sec. II-B, the information of scene structure can be considered as constraints in the SLAM system. Similarly, we propose to leverage the prior knowledge of scene structure in the semantic perception process. By associating the point cloud with the scene structure which can be manually annotated, we can obtain the corresponding label for each part of the point cloud.

A. REPRESENTATION OF THE SCENE STRUCTURE
The most accurate information about the scene structure is the floor plan, e.g., the layout of a metro car or bus. However, since floor plans differ widely between scenes and may not be easy to obtain, we propose simplifying the description of the scene structure as much as possible in order to enhance the versatility of the method.
Taking the metro car in Fig. 1 as an example, we can find that the distance from each part of the car to the center axis of the car is stable throughout the scene, although it can vary among different types of cars. Thus, we can model such scenarios with a group of parallel lines, i.e., structure reference lines, as shown in Fig. 2, which can be easily generalized to other similar scenes like buses and trains.
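As a minimal sketch of this representation, the structure reference lines can be stored as a list of labeled lateral offsets from the center axis. The class name `ReferenceLine`, the layout `METRO_CAR`, its offset values and the helper `label_point` are all hypothetical and chosen only for illustration; they are not measurements or code from our system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReferenceLine:
    label: str     # semantic class of the structure, e.g. "seat"
    offset: float  # signed lateral distance to the car's center axis [m]


# Hypothetical layout: the offsets are illustrative, not measured values.
METRO_CAR: List[ReferenceLine] = [
    ReferenceLine("wall", -1.30),
    ReferenceLine("seat", -0.90),
    ReferenceLine("stanchion", -0.45),
    ReferenceLine("stanchion", 0.45),
    ReferenceLine("seat", 0.90),
    ReferenceLine("wall", 1.30),
]


def label_point(y: float, lines: List[ReferenceLine],
                max_dist: float = 0.05) -> Optional[str]:
    """Return the label of the nearest reference line, or None if too far.

    `y` is the lateral coordinate of a point that is already expressed in
    the reference frame (x along the car axis, y across it).
    """
    nearest = min(lines, key=lambda line: abs(line.offset - y))
    return nearest.label if abs(nearest.offset - y) <= max_dist else None
```

Adapting the sketch to another car type or to a bus would only require a different list of offsets and labels.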

B. STRUCTURE-AWARE DATA ASSOCIATION
To associate the point cloud with the scene structure, we consider it as a registration or pose estimation process, that is, estimating the robot's pose in the scene. Then after the successful registration, we can extract the corresponding label for each laser point by the nearest neighbor search (NNS).
In a scene like the metro car, a two-stage grid-based searching method is designed to establish the initial correspondences between the point cloud and the scene structure:
• Rotation Estimation. We first attempt to rotate the laser point cloud with a sampled angle in a predefined angle range. Then the laser points are transformed to the rotated frame of the robot and projected along the rotated heading of the robot. The histogram of the projected laser points is computed along the lateral direction, which is perpendicular to the rotated heading of the robot. Since most of the points fall on the corresponding reference lines, the distribution of the projected points clusters better when the selected rotation angle is closer to the ground-truth orientation.
Here we use the entropy of the distribution to find the best rotation angle:
$$\psi^{*} = \arg\min_{\psi} \; -\sum_{i} p_i(\psi) \log p_i(\psi) \qquad (1)$$
where $\psi \in [-\psi_0, \psi_0]$ is the sampled rotation angle, and $p_i(\psi) \in [0, 1]$ is the corresponding ratio of the projected points located in the i-th lateral bin.
• Position Estimation. Based on the rotated laser point cloud, we further perform a lateral grid search only, since the reference lines cannot provide longitudinal constraints. For each given structure reference line and lateral movement step, we can compute the number of points that fall in a specific distance range of each reference line. The lateral position corresponding to the maximum number of matching points will be considered as the best estimation.
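The two search stages can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our implementation: the function names, the search ranges, the step sizes, the 5 cm bin and tolerance, and the convention that reference lines are `y = c` in the reference frame are all assumed here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) of a laser point in the robot frame


def lateral_entropy(points: Sequence[Point], psi: float,
                    bin_size: float) -> float:
    """Entropy of the lateral histogram after rotating the points by psi."""
    counts: Dict[int, int] = {}
    for x, y in points:
        lateral = math.sin(psi) * x + math.cos(psi) * y
        b = math.floor(lateral / bin_size)
        counts[b] = counts.get(b, 0) + 1
    n = len(points)
    return -sum(c / n * math.log(c / n) for c in counts.values())


def estimate_rotation(points: Sequence[Point],
                      psi_max: float = math.radians(20.0),
                      step: float = math.radians(0.5),
                      bin_size: float = 0.05) -> float:
    """Stage 1: pick the sampled angle with the most peaked histogram."""
    n_steps = int(round(2 * psi_max / step))
    candidates = [-psi_max + k * step for k in range(n_steps + 1)]
    return min(candidates,
               key=lambda psi: lateral_entropy(points, psi, bin_size))


def estimate_lateral_shift(points: Sequence[Point], psi: float,
                           offsets: Sequence[float], search: float = 0.5,
                           step: float = 0.01, tol: float = 0.05) -> float:
    """Stage 2: pick the lateral shift with the most points near a line."""
    rotated_y: List[float] = [math.sin(psi) * x + math.cos(psi) * y
                              for x, y in points]

    def score(shift: float) -> int:
        return sum(1 for y in rotated_y
                   if any(abs(y + shift - c) <= tol for c in offsets))

    n_steps = int(round(2 * search / step))
    candidates = [-search + k * step for k in range(n_steps + 1)]
    return max(candidates, key=score)
```

In this sketch every shift within `tol` of the true value yields the same count, so the second stage is only accurate up to that tolerance, which is acceptable for an initial association that is refined afterwards.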

C. OPTIMIZATION-BASED REFINEMENT
After the initial data association, the point-line correspondences can be computed by a distance threshold. Then we can perform a further refinement by accurately estimating the robot's pose in the reference frame $T^{ref}_{robot}$, which can be obtained by solving the following non-linear least squares problem:
$$T^{ref}_{robot} = \arg\min_{T} \sum_{i} w_i \sum_{j} d\left(l_i, \, T\, p^{robot}_{ij}\right)^2 \qquad (2)$$
where $p^{robot}_{ij}$ is the j-th laser point corresponding to the i-th reference line $l_i$, $d(l, p)$ is the point-to-line distance, and $w_i$ is the weight for the i-th reference line.
Then we can accurately align the point cloud to the scene structure and refine the initial correspondences. Together with the estimated local pose of the robot in the scene, we can further detect objects in the point cloud to obtain the object-level semantic information.
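A weighted least-squares refinement of this kind can be sketched with a small Gauss-Newton loop. The sketch assumes that reference lines are `y = c` in the reference frame and estimates only the heading and the lateral offset, because parallel lines do not constrain the longitudinal position. The function name `refine_pose`, the `Match` tuple layout and the iteration limits are assumptions for illustration, and a general-purpose solver could be used instead.

```python
import math
from typing import Sequence, Tuple

# One correspondence: laser point (x, y) in the robot frame, lateral
# offset c of its matched reference line, and the weight w of that line.
Match = Tuple[float, float, float, float]


def refine_pose(matches: Sequence[Match], theta: float, ty: float,
                iterations: int = 10) -> Tuple[float, float]:
    """Weighted Gauss-Newton on heading `theta` and lateral offset `ty`.

    The signed point-to-line residual for a line y = c is
        r = sin(theta) * x + cos(theta) * y + ty - c.
    """
    for _ in range(iterations):
        h11 = h12 = h22 = g1 = g2 = 0.0
        s, co = math.sin(theta), math.cos(theta)
        for x, y, c, w in matches:
            r = s * x + co * y + ty - c
            j_theta = co * x - s * y      # dr/dtheta; dr/dty is 1
            h11 += w * j_theta * j_theta
            h12 += w * j_theta
            h22 += w
            g1 += w * j_theta * r
            g2 += w * r
        det = h11 * h22 - h12 * h12
        if abs(det) < 1e-12:              # degenerate geometry, stop
            break
        # Solve the 2x2 normal equations H * delta = -g in closed form.
        d_theta = -(h22 * g1 - h12 * g2) / det
        d_ty = -(h11 * g2 - h12 * g1) / det
        theta += d_theta
        ty += d_ty
        if abs(d_theta) + abs(d_ty) < 1e-12:
            break
    return theta, ty
```

Giving the matches of one line a larger `w` pulls the solution toward that line, which is how per-class weights such as those for stanchion points would enter this sketch.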

D. OBJECT DETECTION
As required by the customizable disinfection task, the robot needs to distinguish different objects in the scene, for example, the seats and stanchions in the metro car, so that the disinfection time and distance can be configured accordingly.
Some objects, like the seats, are relatively large and can therefore be easily and accurately identified from the structure-aware segmentation results. However, it is not trivial to detect small objects like the stanchions in the metro car due to their small (the diameter is ∼3 cm) and reflective surfaces. On the one hand, very few valid points may be returned from such an object when it is too far away; on the other hand, the shadow effect may influence the detection when the robot is close to it. As depicted in Fig. 3, in the metro car scenario, we design a two-stage method to extract the stanchion points and remove the noisy ones: line fitting and Euclidean clustering.
To coarsely select the candidate stanchion points, we first transform the laser points to the current reference frame so that the points are aligned to the reference lines. Since the stanchions in a car lie precisely on a straight line, the parameters of this line can be estimated by a random sample consensus (RANSAC) based line fitting method. Most of the noisy points caused by the shadow effect are filtered out through a distance threshold to the fitted line.
To further distinguish between different stanchion instances, the stanchion points are examined from near to far along the fitted line. If a point is far enough from the previous point, a new stanchion is added to the buffer. If it is close enough to the previous point, it is added to the current stanchion. Otherwise, it is considered an outlier.
After all stanchion points are detected and clustered, the position of each stanchion is computed by averaging the position of its points. Finally, the detected stanchions are transformed to the map with the robot's odometry and fused with previously recorded ones or inserted as new ones.
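The two-stage detection can be sketched as follows. All thresholds (`tol`, `join_gap`, `new_gap`, `min_points`), the fixed random seed and the function names are assumptions made for illustration, and the minimum cluster size is an extra filter added in this sketch rather than a step described above.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float]  # a*x + b*y + c = 0 with a^2 + b^2 = 1


def ransac_line(points: Sequence[Point], iters: int = 200,
                tol: float = 0.03, seed: int = 0) -> Line:
    """Stage 1: fit a 2D line by random sample consensus."""
    rng = random.Random(seed)
    best, best_count = (0.0, 1.0, 0.0), -1
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(list(points), 2)
        a, b = y1 - y2, x2 - x1          # normal of the line through both
        norm = math.hypot(a, b)
        if norm < 1e-9:
            continue
        a, b = a / norm, b / norm
        c = -(a * x1 + b * y1)
        count = sum(1 for x, y in points if abs(a * x + b * y + c) <= tol)
        if count > best_count:
            best, best_count = (a, b, c), count
    return best


def cluster_stanchions(points: Sequence[Point], line: Line,
                       tol: float = 0.03, join_gap: float = 0.05,
                       new_gap: float = 0.5,
                       min_points: int = 2) -> List[Point]:
    """Stage 2: split inliers into stanchions along the fitted line."""
    a, b, c = line
    dx, dy = (b, -a) if b >= 0 else (-b, a)   # direction along the line
    inliers = sorted((dx * x + dy * y, (x, y)) for x, y in points
                     if abs(a * x + b * y + c) <= tol)
    clusters: List[List[Point]] = []
    prev_s = None
    for s, p in inliers:
        if prev_s is None or s - prev_s >= new_gap:
            clusters.append([p])              # far enough: new stanchion
            prev_s = s
        elif s - prev_s <= join_gap:
            clusters[-1].append(p)            # near enough: same stanchion
            prev_s = s
        # otherwise the gap is ambiguous and the point is dropped
    return [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            for cl in clusters if len(cl) >= min_points]
```

The returned centroids are expressed in the same frame as the input points; transforming them to the map with the robot's odometry is left out of the sketch.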

V. SEMANTIC MAPPING
With the segmented point cloud and the detected objects in the robot's frame, we can use a LiDAR-based odometry system (e.g., Cartographer) to provide relative motion estimation in the global map. Then with the detected hierarchical semantics (i.e., labeled points, reference lines and objects like stanchions), we can build a hierarchical map of the metro. The semantics contained in this map can be used not only for navigation, but also for customizable and high-level disinfection tasks. As shown in Fig. 1, our hierarchical semantic map consists of four types of semantics: occupancy grids, semantic points, reference lines and stanchions.

1) OCCUPANCY GRID MAP
The occupancy grid map is the standard mapping result of common LiDAR-based SLAM systems. It can be obtained by probabilistically fusing the raw laser point cloud with the robot's pose. Although it is useful for path planning and localization in general navigation tasks, it is not sufficient for customizable disinfection tasks, since only obstacle and free-space information is recorded.

2) SEMANTIC POINT CLOUD
By leveraging the prior knowledge of scene structure, the raw laser point cloud can be classified into different categories to represent different parts of the scene. With the robot's poses estimated by the LiDAR odometry module, the segmented point clouds are accumulated to generate the global semantic point cloud map.

3) REFERENCE LINE MAP
The reference lines form a representation of the structure of the scene. With the current segmented point cloud, on the one hand, the local reference frame in which the reference lines are defined is updated so that the reference lines align with the points in a sliding window, using an optimization similar to Eq. (2); on the other hand, previously added points are refined and only the inliers within a distance threshold of the reference lines are retained, which effectively removes incorrectly classified points.

4) STANCHION MAP
The stanchion map is built with the detected stanchions in each frame of laser point cloud. Each stanchion object has the properties illustrated in Table 1. When a stanchion is observed again, its centroid is updated from both the newly detected points and the existing points, where $p^{robot}_{m}$ is the m-th newly detected point corresponding to the i-th stanchion, and $p^{map}_{n}$ is the n-th existing point. The weight factor w is used to balance the influence of the new data. After the update of the centroid, the existing and newly added points will be refined with the stanchion's radius. Also, the maximum number of points is restricted.
The most essential part of stanchion mapping is the stanchion state management as shown in Fig. 4. The stanchions successfully detected in the first frame are set as Initial. A newly detected stanchion will be set as Candidate if there are not enough valid points. If the number of valid points is over a threshold, the stanchion will be considered as Accepted. The standard deviation of the centroid is monitored to change the stanchion's state from Accepted to Fixed. If the centroid of the stanchion is stable enough in a sliding window, its position will be fixed, thus the mapping process will become more robust to noisy stanchion points caused by the shadow effect. If the stanchion is invisible to the robot and its state is Candidate, it will be considered as an outlier and marked as Removed.
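The state transitions described above can be sketched as a small state machine. The thresholds `MIN_POINTS`, `WINDOW` and `MAX_STD`, the class layout and the function name are hypothetical. Since the text does not spell out how an Initial stanchion evolves, this sketch simply leaves that state unchanged, which is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import pstdev
from typing import List, Tuple


class State(Enum):
    INITIAL = "Initial"
    CANDIDATE = "Candidate"
    ACCEPTED = "Accepted"
    FIXED = "Fixed"
    REMOVED = "Removed"


MIN_POINTS = 10   # hypothetical threshold on the number of valid points
WINDOW = 5        # hypothetical sliding-window length [frames]
MAX_STD = 0.01    # hypothetical stability threshold on the centroid [m]


@dataclass
class Stanchion:
    state: State
    num_points: int = 0
    # Centroid estimates from the most recent frames, oldest first.
    centroids: List[Tuple[float, float]] = field(default_factory=list)


def update_state(s: Stanchion, visible: bool) -> State:
    """Apply one step of the state management to stanchion `s`."""
    if s.state is State.CANDIDATE:
        if not visible:
            s.state = State.REMOVED          # unconfirmed and lost: outlier
        elif s.num_points > MIN_POINTS:
            s.state = State.ACCEPTED         # enough valid points
    elif s.state is State.ACCEPTED and len(s.centroids) >= WINDOW:
        recent = s.centroids[-WINDOW:]
        std_x = pstdev([p[0] for p in recent])
        std_y = pstdev([p[1] for p in recent])
        if max(std_x, std_y) <= MAX_STD:
            s.state = State.FIXED            # centroid is stable: freeze it
    return s.state
```

Once a stanchion is Fixed, later noisy detections no longer change its state in this sketch, which mirrors the robustness argument made above.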

VI. EXPERIMENTS
A. DATA COLLECTION & EVALUATION METRICS
The dataset used for the evaluation of our framework was collected on a Metro-Cammell train (M-train) of the Hong Kong Mass Transit Railway (MTR), which is the oldest and most widely used train type in Hong Kong. Two sequences of laser scan data were recorded with a SICK TiM561 LiDAR sensor working at a scanning frequency of 15 Hz. In total, there are four different types of cars in the data sequences. Each of the cars is about 20 meters long. The summarized information of the dataset is shown in Table 2.
The evaluation metrics for the proposed semantic mapping system are mainly based on measurements of the metro car structure, such as the distances between stanchions and the length of the cars.

B. DATA ASSOCIATION & REFINEMENT
We first evaluate the performance of the data association and refinement. Since the data association is directly determined by the pose estimation result in local reference frame (i.e., the frame in which the reference lines are defined), we can plot the local localization results along the robot's movement. As the main advantage of point cloud segmentation is to assign different weights for the points with different labels, we have specially compared the localization results without or with stanchion points with different weights in the optimization for refinement, as shown in Fig. 5.
To determine the success of data association, only those refined laser points that fall within 5 cm of the reference lines are considered valid. The minimum number of valid points for a successful refinement is 50. In Fig. 5, the short-cuts in the gray regions indicate refinement failures, which reflect the robustness of the different configurations of the data association and refinement process.
Adding stanchion points to the optimization effectively improves the robustness of the data association process. As shown in Fig. 5(a), a larger weight (w = 3) results in better and more stable performance than a smaller weight (w = 1). This is mainly because the stanchions are good landmarks in the metro: they are evenly distributed and keep a proper distance from each other in the metro cars.
However, from the figure we can also find that the weight for stanchion points should not be set too large (w = 5), otherwise it will decrease the stability of the pose estimation. The cause of this situation may be the possible noisy measurements of the stanchions, since they are too thin to be accurately measured by the LiDAR sensor. Moreover, the situation shown in Fig. 6 can also cause the failure of localization, which can be identified in the data association process.

C. SEMANTIC MAPPING
To compare the performance of different popular LiDAR mapping frameworks in our scene and their influences on our semantic mapping results, we first evaluate the conventional occupancy grid maps generated by different frameworks. Then the semantic maps generated by our proposed framework with different LiDAR odometry modules are compared.

1) EVALUATION OF THE OCCUPANCY GRID MAPS
To evaluate the quality of the occupancy grid maps constructed in our testing scenarios, GMapping, Hector Mapping and Cartographer are adopted for comparison. Since GMapping additionally needs the odometry of the robot as input, the laser_scan_matcher ROS package was used to provide it.
In Fig. 7, the mapping results are listed for qualitative comparison. It should be noted that we have chosen the map of Cartographer as the reference, and the other two maps are manually aligned to it at the first stanchion in the map. Vertical reference lines are added to the figure for better comparison. As shown by the red arrows, the map of Hector Mapping has the largest difference, while the map of GMapping exhibits an almost constant bias. These errors are mainly caused by inaccurate scan matching, for example, in the situation shown in Fig. 6. The map of Cartographer achieves the best performance, with no obvious drift or error.

2) EVALUATION OF THE SEMANTIC MAPS
To quantitatively evaluate our semantic maps, such as the one shown in Fig. 8, we focus on the positions of stanchions in the map, since the distance between two adjacent stanchions can be conveniently and accurately measured.
The distances between adjacent stanchions estimated with different LiDAR odometry modules are shown in Fig. 9. There are three clusters corresponding to 1.00 m, 1.75 m and ∼2.25 m, respectively. The third cluster is related to the stanchions at the joints of cars, so the distances may be slightly different. As shown in the figure, the distances estimated with Cartographer are the most accurate among the three methods. It should be noted that the first two points in the second cluster are related to the first and last stanchions in the data sequence; therefore, their positions are not as accurate as those of the other stanchions due to the lack of measurements.
Table 3 lists the root mean square error (RMSE) of the estimated distances between adjacent stanchions as well as the length of the metro car (which is defined as the distance between the first and the last stanchion in a car). The results consistently demonstrate that Cartographer outperforms the other two frameworks in our scene.
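The RMSE of the adjacent-stanchion distances can be computed as in the short sketch below. The function names are hypothetical, and any stanchion positions passed to it in an example are made-up values for illustration, not data from Table 3.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def adjacent_distances(stanchions: Sequence[Point]) -> List[float]:
    """Distances between consecutive stanchions, in map order."""
    return [math.dist(p, q) for p, q in zip(stanchions, stanchions[1:])]


def rmse(estimates: Sequence[float], references: Sequence[float]) -> float:
    """Root mean square error between estimated and measured values."""
    if not estimates or len(estimates) != len(references):
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((e - r) ** 2 for e, r in zip(estimates, references))
    return math.sqrt(squared / len(estimates))
```

For instance, hypothetical map positions `[(0.0, 0.45), (1.02, 0.45), (2.75, 0.45)]` give estimated distances of 1.02 m and 1.73 m, which would be compared against the measured spacings of 1.00 m and 1.75 m.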

D. RUNTIME ANALYSIS
The runtimes for the main components in our framework are shown in Fig. 10, which were tested on a laptop with an Intel Core i7-8750H CPU @ 2.20 GHz. The average runtimes of the data association and refinement are 0.493 ms and 1.102 ms, respectively, i.e., about 1.6 ms per scan in total, which makes the average rate of the point cloud segmentation subsystem over 600 Hz. The stanchion detection process is very efficient because it leverages the prior knowledge of the structure, taking less than 0.3 ms on average.
It should be noted that the runtime of updating the map spikes when the local point cloud and reference lines are refined, or when the local map is merged with the global map before the robot enters a new car.

E. CUSTOMIZABLE DISINFECTION TASK
To support the customizable disinfection task, the robot needs to understand the structure of the metro car around it. With the segmented point cloud in our map, the structure label can be decided by accumulating the point cloud in a local moving window around the robot. Fig. 11 shows the extracted structure labels on the left and right side of the robot when it is moving in the metro car.
Based on the semantic level understanding of the scene, the navigation waypoints, the disinfection distance and the disinfection time can be determined according to different structures of the metro car in the customizable disinfection task, as shown in Fig. 12:
• The navigation waypoints can be set according to the positions of stanchions in the map. For example, in Fig. 12(a), the waypoints are placed between the seats and the stanchions and aligned with the position of the stanchions. Additional waypoints are also inserted between adjacent stanchions that are 1.75 m apart. Alternatively, the seats could be chosen as the reference for setting the waypoints.
• The effective disinfection distance for different structures such as the seats and stanchions can be configured accordingly. Taking Fig. 12(b) as an example, to ensure the effectiveness of disinfection, the robot must enter a specific range of the object to be disinfected. These ranges will affect the robot's local path planning together with the navigation waypoints.
• The disinfection time can be adjusted by setting the maximum velocity of the robot in different regions. As shown in Fig. 12(c), when passing by the stanchions or seats, the robot should slow down or even stop for a while to enhance the disinfection results in these high-risk regions. For other regions, the robot can move faster to reduce the total time consumption of the disinfection task.
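A minimal sketch of how these three settings could be derived from the semantic map is given below. Every number in `RULES` and `DEFAULT_RULE` is hypothetical, as are the names `DisinfectionRule`, `plan_waypoints` and `speed_limit`. Placing the extra waypoint at the midpoint of a 1.75 m gap is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class DisinfectionRule:
    max_distance: float  # robot must come within this range [m]
    max_speed: float     # speed limit while passing the structure [m/s]


# Hypothetical per-structure settings, for illustration only.
RULES: Dict[str, DisinfectionRule] = {
    "seat": DisinfectionRule(max_distance=0.5, max_speed=0.2),
    "stanchion": DisinfectionRule(max_distance=0.3, max_speed=0.1),
    "door": DisinfectionRule(max_distance=1.0, max_speed=0.5),
}
DEFAULT_RULE = DisinfectionRule(max_distance=1.5, max_speed=0.8)


def plan_waypoints(stanchion_x: Sequence[float], lane_y: float,
                   long_gap: float = 1.75,
                   tol: float = 0.1) -> List[Tuple[float, float]]:
    """One waypoint beside each stanchion, plus one in every long gap."""
    xs = sorted(stanchion_x)
    waypoints: List[Tuple[float, float]] = []
    for i, x in enumerate(xs):
        waypoints.append((x, lane_y))
        if i + 1 < len(xs) and abs(xs[i + 1] - x - long_gap) <= tol:
            waypoints.append(((x + xs[i + 1]) / 2, lane_y))
    return waypoints


def speed_limit(left_label: str, right_label: str) -> float:
    """Use the slower limit of the structures on both sides of the robot."""
    left = RULES.get(left_label, DEFAULT_RULE)
    right = RULES.get(right_label, DEFAULT_RULE)
    return min(left.max_speed, right.max_speed)
```

Here `lane_y` stands for the lateral position of the waypoint lane between the seats and the stanchions, and the labels passed to `speed_limit` would come from the structure labels extracted on the left and right side of the robot.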

VII. CONCLUSION
In this paper, we presented a semantic mapping framework for customizable disinfection tasks which can be used in structured scenarios. The structure information about the scene is used as prior knowledge in the point cloud segmentation. A hierarchical semantic map, including the segmented point cloud, structure reference lines and objects, can be built to provide a semantic level understanding of the scene for the robot. By leveraging such semantic information, our framework can be effectively deployed in customizable disinfection applications in the metro and runs efficiently in real time. In future work, we will try to deploy our system in other similar scenarios, such as buses and trains, and further explore the potential usage of visual sensors to capture more semantic information from the 3D structure, which can be an alternative to our LiDAR-based perception system.