Introduction
Robots require a deep understanding of their situation for autonomous and intelligent operation [1]. Works like [2], [3], [4], [5] generate 3D scene graphs that model the environment with high-level semantic abstractions (such as chairs, tables, or walls) and their relationships (such as a set of walls forming a room or a corridor). While providing a rich understanding of the scene, they typically rely on separate SLAM methods, such as [6], [7], [8], that first estimate the robot's pose and its map using metric/semantic representations, without exploiting this hierarchical high-level information of the environment. Methods like [5] do optimize the full 3D scene graph, but only after detection of appropriate loop closures. Thus, in general, 3D scene graphs are not tightly and continuously optimized in a factor graph.
Our previous work S-Graphs [9] proposed, for the first time, a tight coupling of geometric LiDAR SLAM with 3D scene graphs in a single optimizable factor graph, demonstrating state-of-the-art results. However, it came with multiple limitations that we overcome in this work with our new S-Graphs+, which features an updated front-end and back-end relying on 3D LiDAR measurements.
Our new front-end (Section IV) contributes over S-Graphs with (1) a novel room segmentation algorithm using free-space clusters and wall planes, providing higher detection recall and removing most heuristics of the S-Graphs counterpart; (2) an additional floor segmentation algorithm extracting the floor centers using all the currently extracted wall planes.
The new back-end (Section V) consists of an improved real-time optimizable factor graph composed of four layers. A keyframes layer constrains a sub-set of robot poses at specific distance-time intervals. A walls layer constrains the wall plane parameters and is linked to the keyframes using pose-plane constraints. Both layers are analogous to those in S-Graphs. A rooms layer relates detected rooms to their corresponding wall planes, constraining them in a single tightly coupled factor rather than the loosely coupled factors of S-Graphs. A floors layer, not present in S-Graphs, denotes the current floor level in the graph and constrains the rooms at that level. See Fig. 1 for an illustrative example of an S-Graph+ of a real building.
Fig. 1. S-Graph+ built using a legged robot (circled in black) as it navigates a real construction site consisting of four adjacent houses. (a) 3D view of the four-layered hierarchical optimizable graph. The zoomed-in image shows a partial view of the free-space clusters utilized for room segmentation. (b) Top view of the graph.
Our main contributions are, therefore, summarized as:
A novel real-time factor graph organized in four hierarchical layers, including a novel room-to-wall factor.
A real-time extraction of high-level information, specifically novel room and floor segmentation algorithms.
A thorough experimental evaluation in different simulated and real construction/office environments as well as software release for the research community.
Related Works
A. SLAM and Scene Graphs
The literature on LiDAR SLAM is extensive. There are several well-known geometric approaches, such as LOAM [6] and its variants [7], [10], [11], as well as semantic ones, such as LeGO-LOAM [8], SegMap [12], and SUMA++ [13], that provide robust and accurate localization and 3D maps of the environment. While geometric SLAM lacks meaning in its representation of the environment, causing failures in aliased environments and limitations for high-level tasks or human-robot interaction, its semantic SLAM counterparts often lack geometric accuracy and robustness, due to wrong matches between the semantic elements and the limited relational constraints between them.
Scene graphs, on the other hand, model scenes as structured representations, specifically in the form of a graph comprising objects, their attributes, and the inter-relationships among them. This high-level representation has the potential to address several relevant challenges in SLAM, such as map compactness or scene understanding. Focusing on 3D scene graphs for understanding, the pioneering work [2] creates an offline semi-autonomous framework using object detections from RGB images, generating a multi-layered hierarchical representation of the environment and its components, divided mainly into layers of camera, objects, rooms, and building. [14] presents a framework for generating a 3D scene graph from a sequence of images to verify its applicability to visual question answering and task planning. 3D SSG (Semantic Scene Graph) [15] presents a learning method based on PointNet and Graph Convolutional Networks (GCN) to semi-automatically generate graphs for 3D scenes. SceneGraphFusion [4], on the other hand, generates a real-time incremental 3D scene graph using RGB-D sequences, accurately handling partial and missing semantic data. 3D DSG (Dynamic Scene Graph) [3] extends the 3D scene graph concept to environments with static parts and dynamic agents in an offline manner, while Hydra [5] presents research in the direction of real-time 3D scene graph generation as well as its optimization using loop closure constraints. Though promising in terms of scene representation and higher-level understanding, a major drawback of these models is that they do not tightly couple the estimate of the scene graph with the SLAM state in order to simultaneously optimize them. They thus, in general, generate a scene graph and a SLAM graph independently. Our previous work S-Graphs [9] bridged this gap, showcasing the potential of tightly coupling SLAM graphs and scene graphs. However, for several reasons, it was limited to simple structured environments. Our current work S-Graphs+ overcomes these limitations, generating a four-layered hierarchical optimizable graph while simultaneously representing the environment as a 3D scene graph, and is able to provide excellent performance even in complex environments.
B. Room Segmentation
For a robot to understand structured indoor environments, it is necessary to first understand their basic components, such as walls, and their composition into higher-level structures such as rooms. Hence, room identification and segmentation is one of the critical tasks in S-Graphs+. In the literature, different room segmentation techniques are presented over pre-generated maps using 2D LiDARs [16], [17], [18]. Their performance is, however, degraded in the presence of clutter. While [19] presents a room segmentation approach based on pre-generated 2D occupancy maps in cluttered indoor environments, it still lacks real-time capabilities. Methods such as [20], [21], [22] perform segmentation of indoor spaces into meaningful rooms, although they require a pre-generated 3D map of the environment and cannot segment it in real-time. The authors in [5] present a real-time room segmentation approach that classifies different places into rooms but, unlike our approach, they do not utilize the walls in the environment to efficiently represent the rooms. Given the current state of the art in room segmentation, there was a need for a room segmentation algorithm that utilizes wall entities while being capable of running in real-time as the robot explores its environment, so that this high-level information can be simultaneously incorporated into the optimizable S-Graphs+.
Overview
The architecture of S-Graphs+ is illustrated in Fig. 2. Its pipeline can be divided into six modules, and its estimates are referred to four frames: the LiDAR frame
Fig. 2. S-Graphs+ overview. Our inputs are the 3D LiDAR measurements and robot odometry, which are pre-filtered and processed in the front-end to extract wall planes, rooms, floor, and loop closures. Note the four-layered S-Graph+, whose parameters are jointly optimized in the back-end.
We define the global state as:
\begin{align*}
\mathbf {s} &= \left[{}{^{M}}{\mathbf {x}}_{R_{1}}, \ \ldots, \ {}{^{M}}{\mathbf {x}}_{R_{T}}, \ {}{^{M}}{\boldsymbol{\pi }}_{1}, \ \ldots, \ {}{^{M}}{\boldsymbol{\pi }}_{P},\right. \\
&{}{^{M}}{\boldsymbol{\rho }}_{1}, \ \ldots, \ {}{^{M}}{\boldsymbol{\rho }}_{S}, \ {}{^{M}}{\boldsymbol{\kappa }}_{1}, \ \ldots, \ {}{^{M}}{\boldsymbol{\kappa }}_{K}, \ \\
&\left. {}{^{M}}{\boldsymbol{\xi }}_{1}, \ \ldots, \ {}{^{M}}{\boldsymbol{\xi }}_{F}, \ {}{^{M}}{\mathbf {x}}_{O}\right]^\top, \tag{1}
\end{align*}
where ${}^{M}\mathbf{x}_{R_{1}}, \ldots, {}^{M}\mathbf{x}_{R_{T}}$ are the $T$ robot keyframe poses, ${}^{M}\boldsymbol{\pi}_{1}, \ldots, {}^{M}\boldsymbol{\pi}_{P}$ the $P$ wall planes, ${}^{M}\boldsymbol{\rho}_{1}, \ldots, {}^{M}\boldsymbol{\rho}_{S}$ the $S$ four-wall rooms, ${}^{M}\boldsymbol{\kappa}_{1}, \ldots, {}^{M}\boldsymbol{\kappa}_{K}$ the $K$ two-wall rooms, ${}^{M}\boldsymbol{\xi}_{1}, \ldots, {}^{M}\boldsymbol{\xi}_{F}$ the $F$ floors, and ${}^{M}\mathbf{x}_{O}$ the odometry frame, all expressed in the map frame $M$.
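To make the composition of (1) concrete, the following minimal sketch groups the state variables by layer; the container and field names are ours and only illustrative, not the actual implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class GlobalState:
    """Hypothetical container mirroring the state vector s in (1); every quantity is
    expressed in the map frame M. Field names are assumptions made for illustration."""
    keyframe_poses: list = field(default_factory=list)    # T robot poses            ^M x_R
    wall_planes: list = field(default_factory=list)       # P wall planes (n, d)     ^M pi
    four_wall_rooms: list = field(default_factory=list)   # S room centers           ^M rho
    two_wall_rooms: list = field(default_factory=list)    # K room centers           ^M kappa
    floors: list = field(default_factory=list)            # F floor centers          ^M xi
    odom_origin: np.ndarray = field(default_factory=lambda: np.eye(4))  # odometry frame  ^M x_O

state = GlobalState()
state.wall_planes.append((np.array([1.0, 0.0, 0.0]), -2.5))  # e.g. a wall plane at x = 2.5
```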
Front-End
A. Wall Extraction
We use sequential RANSAC to detect and initialize wall planes. In S-Graphs+, we extract the wall planes from the 3D pointcloud snapshot for a newly registered keyframe, as opposed to our previous work [9] which extracted wall planes from a continuous stream of 3D pointcloud measurements. This results in efficient detection and mapping of all the wall planes at each keyframe level. Each wall plane extracted at time
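As a rough illustration of this sequential extraction, the sketch below repeatedly fits a plane with RANSAC on a keyframe's point-cloud snapshot, removes its inliers, and keeps near-vertical planes as wall candidates. Open3D, the helper name extract_wall_planes, and all thresholds are assumptions for this example, not the actual S-Graphs+ code.

```python
import numpy as np
import open3d as o3d

def extract_wall_planes(points_xyz, dist_thresh=0.05, min_inliers=500, max_planes=10):
    """Sequential plane RANSAC: fit a plane, remove its inliers, repeat (illustrative)."""
    cloud = o3d.geometry.PointCloud(
        o3d.utility.Vector3dVector(np.asarray(points_xyz, dtype=np.float64)))
    walls = []
    for _ in range(max_planes):
        if len(cloud.points) < min_inliers:
            break
        plane, inliers = cloud.segment_plane(distance_threshold=dist_thresh,
                                             ransac_n=3, num_iterations=1000)
        if len(inliers) < min_inliers:
            break
        n = np.asarray(plane[:3])
        if abs(n[2]) < 0.1:            # nearly horizontal normal -> vertical, wall-like plane
            walls.append(np.asarray(plane))
        cloud = cloud.select_by_index(inliers, invert=True)  # drop inliers and continue
    return walls
```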
B. Room Segmentation
In this work, we present a novel room segmentation strategy capable of segmenting different room configurations in a structured indoor environment, improving upon the room extraction strategy proposed in [9], which only utilized plane-based heuristics to detect potential room candidates. The proposed room segmentation consists of two steps, Free-Space Clustering and Room Extraction, and its outputs are the parameters of four-wall and two-wall rooms.
Free-Space Clustering: Our free-space clustering algorithm divides the free-space graph of a scene into several clusters that should correspond to the rooms of that scene. Given a set of robot poses and a Euclidean Signed Distance Field (ESDF) representation [23] for these poses, we generate a sparse connected graph
Given the graph
Fig. 3. Free-space clustering and room segmentation, obtained from the estimated wall planes surrounding each cluster. Pink colored squares represent a four-wall room, while yellow and green colored squares represent two-wall rooms in
Room Extraction: Room extraction uses the free-space clusters
Given each sub-category of the wall planes, our room extraction method first checks the
Algorithm 1: Free-Space Clustering.
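As an illustrative sketch of the clustering step, the snippet below splits a sparse free-space graph into connected components, each becoming a room-candidate cluster. The adjacency representation and the minimum-size filter are simplifying assumptions, not the exact Algorithm 1.

```python
from collections import deque

def cluster_free_space(adjacency, min_cluster_size=20):
    """Split a sparse free-space graph into connected components (room candidates).
    `adjacency` maps a free-space node id to the ids of its neighbours."""
    visited, clusters = set(), []
    for start in adjacency:
        if start in visited:
            continue
        component, queue = [], deque([start])
        visited.add(start)
        while queue:                       # breadth-first traversal of one component
            node = queue.popleft()
            component.append(node)
            for nb in adjacency[node]:
                if nb not in visited:
                    visited.add(nb)
                    queue.append(nb)
        if len(component) >= min_cluster_size:   # discard tiny clusters (assumed filter)
            clusters.append(component)
    return clusters
```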
Four-Wall Rooms: For a given cluster
\begin{align*}
w_{x} =& \left[ \vert {{}{^{M}}{d}_{x_{a_{1}}} \vert } \cdot {}{^{M}}{\mathbf {n}}_{x_{a_{1}}} - {\vert {}{^{M}}{d}_{x_{b_{1}}} \vert } \cdot {}{^{M}}{\mathbf {n}}_{x_{b_{1}}} \right] \\
w_{y} =& \left[ \vert {{}{^{M}}{d}_{y_{a_{1}}} \vert } \cdot {}{^{M}}{\mathbf {n}}_{y_{a_{1}}} - {\vert {}{^{M}}{d}_{y_{b_{1}}} \vert } \cdot {}{^{M}}{\mathbf {n}}_{y_{b_{1}}} \right] \tag{2}
\end{align*}
\begin{align*}
{}{^{M}}{\mathbf {n}} = {\begin{cases}-1 \cdot {}{^{M}}{\mathbf {n}} & \text{if}{ {}{^{M}}{d} > 0} \\
{}{^{M}}{\mathbf {n}} & \text{otherwise} \end{cases}} \tag{3}
\end{align*}
\begin{align*}
{}{^{M}}{\mathbf {r}_{x_{i}}} =& \frac{1}{2} \left[ \vert {{}{^{M}}{d_{x_{a_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{x_{a_{1}}}} - {\vert {}{^{M}}{d_{x_{b_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{x_{b_{1}}}} \right] \\
& + \vert {{}{^{M}}{d_{x_{b_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{x_{b_{1}}}} \\
{}{^{M}}{\mathbf {r}_{y_{i}}} =& \frac{1}{2} \left[ \vert {{}{^{M}}{d_{y_{a_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{y_{a_{1}}}} - {\vert {}{^{M}}{d_{y_{b_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{y_{b_{1}}}} \right]\\
& + \vert {{}{^{M}}{d_{y_{b_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{y_{b_{1}}}} \\
{}{^{M}}{\boldsymbol{\rho }_{i}} =& {}{^{M}}{\boldsymbol{r}_{x_{i}}} + {}{^{M}}{\boldsymbol{r}_{y_{i}}} \tag{4}
\end{align*}
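As an illustrative sketch of (2)–(4), the snippet below computes a four-wall room center from the plane parameters of the two opposing wall pairs. The plane convention n·p + d = 0 and the variable names are assumptions made for this example.

```python
import numpy as np

def correct_normal(n, d):
    """Flip the plane normal if d > 0, cf. (3), so that |d|·n is the point of the
    plane n·p + d = 0 closest to the origin (plane convention assumed)."""
    return -np.asarray(n, float) if d > 0 else np.asarray(n, float)

def four_wall_room_center(plane_xa, plane_xb, plane_ya, plane_yb):
    """Room center from two opposing x-wall and y-wall pairs, cf. (2)-(4).
    Each plane is a tuple (n, d) with a unit 2D normal n and offset d."""
    centers = []
    for (na, da), (nb, db) in ((plane_xa, plane_xb), (plane_ya, plane_yb)):
        na, nb = correct_normal(na, da), correct_normal(nb, db)
        r = 0.5 * (abs(da) * na - abs(db) * nb) + abs(db) * nb   # eq. (4), one axis
        centers.append(r)
    return centers[0] + centers[1]                               # rho = r_x + r_y

# Example: axis-aligned walls at x = 0, x = 4, y = 0, y = 6 give a center at (2, 3).
rho = four_wall_room_center(([1, 0], 0.0), ([1, 0], -4.0), ([0, 1], 0.0), ([0, 1], -6.0))
```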
Data association for the room node follows two steps. First, the
Two-Wall Rooms: The room extraction method is sometimes able to find only two walls that surround a free-space cluster
\begin{align*}
{}{^{M}}{\mathbf {r}_{x_{i}}} =& \frac{1}{2} \left[ \vert {{}{^{M}}{d_{x_{a_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{x_{a_{1}}}} - {\vert {}{^{M}}{d_{x_{b_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{x_{b_{1}}}} \right] \\
& + \vert {}{^{M}}{{d_{x_{b_{1}}}} \vert } \cdot {}{^{M}}{\mathbf {n}_{x_{b_{1}}}} \\
{}{^{M}}{\boldsymbol{\kappa }_{x_{i}}} =& {}{^{M}}{\mathbf {r}_{x_{i}}} + \left[ {}{^{M}}{\mathbf {c}_{i}} - [\ {}{^{M}}{\mathbf {c}_{i}} \cdot {}{^{M}}{\hat{\mathbf {r}}_{x_{i}}} ] \ \cdot \hat{{}{^{M}}{\mathbf {r}}_{x_{i}}} \right] \tag{5}
\end{align*}
\begin{align*}
{}{^{M}}{{c}}_{x_{i}} = &\frac{1}{2} \left[ {}{^{M}}{p}_{x_{1}} - {}{^{M}}{p}_{x_{2}} \right] + {}{^{M}}{p}_{x_{2}} \\
{}{^{M}}{{c}_{y_{i}}} =& \frac{1}{2} \left[ {}{^{M}}{p}_{y_{1}} - {}{^{M}}{p}_{y_{2}} \right] + {}{^{M}}{p}_{y_{2}} \\
{}{^{M}}{\mathbf {c}_{i}} =& \left[ {}{^{M}}{c_{x_{i}}}, {}{^{M}}{c_{y_{i}}} \right] \tag{6}
\end{align*}
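A similar sketch for a two-wall room along the x direction, following (5)–(6). Computing the cluster center from the cluster's extreme points and the plane convention n·p + d = 0 are assumptions made for this example.

```python
import numpy as np

def foot_point(n, d):
    """Closest point to the origin on the plane n·p + d = 0 (convention assumed);
    equivalent to flipping n via (3) and scaling by |d|."""
    return -d * np.asarray(n, float)

def two_wall_room_center(plane_xa, plane_xb, cluster_pts):
    """Center of a two-wall room along x, cf. (5)-(6): midpoint between the two x-walls,
    shifted to the free-space cluster center along the direction of the walls."""
    r_x = 0.5 * (foot_point(*plane_xa) + foot_point(*plane_xb))       # cf. (4)/(5), x axis
    pts = np.asarray(cluster_pts, float)
    c = 0.5 * (pts.max(axis=0) - pts.min(axis=0)) + pts.min(axis=0)   # cluster center, cf. (6)
    r_hat = r_x / np.linalg.norm(r_x)    # assumes r_x != 0 (walls not symmetric about origin)
    return r_x + (c - (c @ r_hat) * r_hat)                            # cf. (5)
```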
Data association of two-wall rooms follows a concept similar to that of four-wall rooms. In the case of a two-wall room in the
Detected four-wall and two-wall rooms are optimized along with their corresponding wall planes in the back-end, as explained in Section V.
C. Floor Segmentation
The floor segmentation module extracts the widest wall planes within the floor level currently explored by the robot, which are then used to calculate the center of the current floor level. Our floor segmentation utilizes the information from all mapped walls to create a sub-category of wall planes, as described in the room segmentation (Section IV-B), as
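As a rough sketch of the floor-center idea under this description: among the mapped wall planes of the current floor, keep the widest pair along x and along y and take the midpoint of their closest-to-origin points. The plane representation, the width attribute, and the widest-pair selection below are assumptions, not the exact S-Graphs+ procedure.

```python
import numpy as np

def floor_center(x_planes, y_planes):
    """Illustrative floor-center computation. Planes are dicts
    {'n': unit normal, 'd': offset, 'width': metres}; convention n·p + d = 0 assumed."""
    def center_1d(planes):
        a, b = sorted(planes, key=lambda p: p['width'], reverse=True)[:2]   # two widest walls
        foot = lambda p: -p['d'] * np.asarray(p['n'], float)                # closest point to origin
        return 0.5 * (foot(a) + foot(b))                                    # midpoint between them
    return center_1d(x_planes) + center_1d(y_planes)
```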
Back-End
The back-end is responsible for creating and optimizing the four-layered S-Graphs+, summing the individual cost functions of each layer, which are explained in detail as follows.
Keyframes: This layer creates a factor node
Walls: This layer creates the planar factor nodes for the wall planes extracted by the wall segmentation (Section IV-A). The planar nodes are factored as
Rooms: The rooms layer receives the extracted room candidates and their corresponding wall planes from the room segmentation module (Section IV-B) to create appropriate constraints between them.
Four-Wall Rooms: We propose a novel edge formulation that minimizes in a single cost function the room node (generated from its center) and its four mapped wall planes, as opposed to [9], which comprised individual cost functions for the room and the wall planes. The proposed cost function can be written as:
\begin{align*}
&c_{\boldsymbol{\rho }} \left({}{^{M}}{\boldsymbol{\rho }}, \left[ {}{^{M}}{\boldsymbol{\pi }_{x_{a_{i}}}}, {}{^{M}}{\boldsymbol{\pi }_{x_{b_{i}}}}, {}{^{M}}{\boldsymbol{\pi }_{y_{a_{i}}}}, {}{^{M}}{\boldsymbol{\pi }_{y_{b_{i}}}}\right]\right) \\
& = \sum _{t=1, i=1}^{T, S} \Vert {}{^{M}}{\hat{\boldsymbol{\rho }}_{i}} - {{f({}{^{M}}{\tilde{\boldsymbol{\pi }}_{x_{a_{i}}}}, {}{^{M}}{\tilde{\boldsymbol{\pi }}_{x_{b_{i}}}}, {}{^{M}}{\tilde{\boldsymbol{\pi }}_{y_{a_{i}}}}, {}{^{M}}{\tilde{\boldsymbol{\pi }}_{y_{b_{i}}}})}} \Vert ^{2}_{\boldsymbol{\Lambda }_{\tilde{{\boldsymbol{\rho }}}_{i,t}}}\tag{7}
\end{align*}
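A minimal sketch of the corresponding residual, assuming the plane convention n·p + d = 0 used in the earlier examples; in the back-end this would be attached as a single edge linking the room node to its four plane nodes, weighted by the information matrix Λ.

```python
import numpy as np

def room_four_wall_residual(rho_hat, plane_xa, plane_xb, plane_ya, plane_yb):
    """Residual of the room-to-four-wall factor in (7): the current room-center estimate
    rho_hat minus the center implied by the four current plane estimates, each plane
    given as (n, d)."""
    foot = lambda n, d: -d * np.asarray(n, float)        # closest point on the plane to the origin
    r_x = 0.5 * (foot(*plane_xa) + foot(*plane_xb))      # midpoint between the x walls, cf. (4)
    r_y = 0.5 * (foot(*plane_ya) + foot(*plane_yb))      # midpoint between the y walls
    return np.asarray(rho_hat, float) - (r_x + r_y)      # e = rho_hat - f(planes), cf. (7)
```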
Two-Wall Rooms: We propose a similar improved cost function to minimize room nodes and their two corresponding wall planes as follows:
\begin{align*}
&c_{\boldsymbol{\kappa }} \left({}{^{M}}{\boldsymbol{\kappa }_{i}},\left[{}{^{M}}{\boldsymbol{\pi }_{x_{a_{1}}}}, {}{^{M}}{\boldsymbol{\pi }_{x_{b_{1}}}}, {}{^{M}}{\mathbf{c}_{i}}\right]\right) \\
&= \sum _{t=1,i=1}^{T,K} \Vert {}{^{M}}{\hat{\boldsymbol{\kappa }}_{i}} - f\left({}{^{M}}{\tilde{\boldsymbol{\pi }}_{x_{a_{1}}}}, {}{^{M}}{\tilde{\boldsymbol{\pi }}_{x_{b_{1}}}}, {}{^{M}}{\mathbf{c}_{i}}\right) \Vert ^{2}_{\boldsymbol{\Lambda }_{\tilde{{\boldsymbol{\kappa }}}_{i,t}}}\tag{8}
\end{align*}
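A matching sketch of the two-wall residual, under the same plane-convention assumption and treating the cluster center as a fixed quantity, reusing the construction of (5).

```python
import numpy as np

def room_two_wall_residual(kappa_hat, plane_xa, plane_xb, cluster_center):
    """Residual of the room-to-two-wall factor in (8): the current two-wall room center
    kappa_hat minus the center implied by the two plane estimates and the fixed
    free-space cluster center."""
    foot = lambda n, d: -d * np.asarray(n, float)
    r_x = 0.5 * (foot(*plane_xa) + foot(*plane_xb))       # midpoint between the two walls
    r_hat = r_x / np.linalg.norm(r_x)                     # assumes r_x != 0
    c = np.asarray(cluster_center, float)
    return np.asarray(kappa_hat, float) - (r_x + (c - (c @ r_hat) * r_hat))  # cf. (5), (8)
```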
Floors: The floor node consists of the center of the current floor level calculated from the floor segmentation (Section IV-C). We add a cost function between the floor node and all the mapped four-wall rooms at that floor level as follows:
\begin{align*}
c_{\xi }\left({}{^{M}}{\boldsymbol{\xi}_{\boldsymbol{i}}}, {}{^{M}}{\boldsymbol{\rho }_{i}}\right) = \sum _{t=1,i=1,j=1}^{T,F,S} \Vert {}{^{M}}{\hat{\boldsymbol{\delta }}_{{\xi _{i}},{\rho _{j}}}} - f\left({}{^{M}}{\boldsymbol{\xi}_{\boldsymbol{i}}}, {}{^{M}}{\boldsymbol{\rho }_{j}}\right) \Vert ^{2}_{\boldsymbol{\Lambda }_{\tilde{{\boldsymbol{\xi }}}_{i,t}}} \tag{9}
\end{align*}
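A minimal sketch of this residual, assuming f(ξ, ρ) is the relative position ξ − ρ between the floor center and the room center; this relative-position form is our assumption for the example.

```python
import numpy as np

def floor_room_residual(delta_hat, floor_center, room_center):
    """Residual of the floor-to-room factor in (9): the expected relative position
    delta_hat between floor i and room j minus the one implied by the current estimates."""
    return np.asarray(delta_hat, float) - (np.asarray(floor_center, float)
                                           - np.asarray(room_center, float))
```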
Experimental Results
A. Methodology
S-Graphs+ is built on top of its baseline S-Graphs [9] and is validated over several construction sites and office spaces in both simulated and real-world scenarios, comparing it against several state-of-the-art LiDAR SLAM frameworks and its baseline. We utilize VLP-16 LiDAR data in all the datasets. To validate the presented novelties, we ablate S-Graphs+ into S-Graphs+ w. OR, which combines the older room detection algorithm from S-Graphs with the newly proposed room-to-wall plane factors (Section V), and S-Graphs+ w. OF, which combines the older factors from S-Graphs with the newly proposed room detection algorithm (Section IV-B). We do not ablate the proposed floor layer, as currently the floor level (Section IV-C) is mostly used to add semantic meaning to the map, without significantly improving the accuracy. Furthermore, we compare the room detection of S-Graphs+ against the heuristics-based one in S-Graphs, reporting the precision and recall of the four-wall and two-wall room detections in the real-world scenarios, for which we have the ground truth number of rooms defined in the architectural plans.
In all the experiments, no fine-tuning of the mentioned thresholds was required and the same empirically pre-selected thresholds sufficed for all of them. The ESDF map (Section IV-B) resolution depends on the LiDAR resolution, which in our case is computed as 0.18 m vertically and 0.03 m horizontally, while the map clearing threshold
Simulated Data: We conduct a total of five simulated experiments. CF1 and CF2 are generated from the 3D meshes of two floors of actual architectural plans, while SE1, SE2, and SE3 are performed in additional simulated environments resembling typical indoor environments with different room configurations. We report the ATE against the provided ground truth. Due to the absence of odometry from robot encoders, in all simulated experiments the odometry is estimated only from LiDAR measurements. For a fair validation, S-Graphs+ is run using two different odometry inputs, specifically VGICP [26] and FLOAM [7].
In-House Dataset: In all our in-house data we utilize the robot encoders to estimate the odometry. The first two experiments, C1F1 and C1F2, are performed on two floors of a construction site consisting of a single house. Additionally, C2F0, C2F1, and C2F2 consist of three floors of an ongoing construction site combining four individual houses. C3F1 and C3F2 are two combined houses, while C4F0 is a basement area with different storage rooms. To validate the accuracy of each method in all the real experiments, we report the RMSE of the estimated 3D maps against the actual 3D map generated from the architectural plan, except for experiment C4F0, for which we provide qualitative results due to the absence of a ground truth plan.
TIERS LiDARs dataset: We also validate S-Graphs+ on the public TIERS dataset [27], recorded by a moving platform in a variety of scenarios. Experiments T6 to T8 are done in a single small room in which the platform does several passes at increasing speeds. Experiments T10 and T11 are performed in a larger indoor hallway with longer trajectories. We report the ATE against the provided ground truth. Due to the absence of encoder readings in this dataset, each baseline method uses its own LiDAR-based odometry. As in the simulated datasets, we validate S-Graphs+ with VGICP and FLOAM odometry.
B. Results and Discussion
Simulated Data: Table I showcases the ATE for the simulated experiments. S-Graphs+ w. OR results in an average improvement in accuracy of 5.44% over S-Graphs, while S-Graphs+ w. OF shows an average decrease in accuracy of 3.03% over S-Graphs. However, the full S-Graphs+, with both the new room detector and the newly proposed factors, shows an improved average accuracy of 13.37% over the baseline. It can also be seen in Table I that S-Graphs+ is run using two different odometry methods, VGICP and FLOAM, and that it improves over the respective odometries by 51.88% and 106.5%.
In-House Dataset: Table II presents the point cloud RMSE. As can be observed in the table, S-Graphs+ outperforms the second-best baseline by a margin of 5.93%. S-Graphs+ w. OR and S-Graphs+ w. OF individually outperform their baseline by 2.74% and 4.03%, respectively. For experiment C4F0, Fig. 4 shows a top view of the final maps estimated by S-Graphs+ and three other baselines. Observe the higher degree of accuracy and the cleaner map elements in the S-Graphs+ case, the latter indicating a better alignment across different robot passes. Similarly, observe the precise map generated by S-Graphs+ in Fig. 5 for experiment C2F0 when compared with S-Graphs. Fig. 1 shows the entire four-layered S-Graphs+ for C2F2 along with its map accuracy.
Fig. 6 presents the precision/recall of the room detection in S-Graphs+ and the one in S-Graphs. Note how the precision is slightly higher for S-Graphs+ and, more importantly, the recall is substantially higher. In particular, the difference is notable for scenarios with complex layouts such as C2F1, which also improves the final map accuracy (Table II). The latter is one of the main strengths of S-Graphs+: extracting a higher number of rooms adds a higher number of constraints, leading to more accurate estimates and a better representation.
Fig. 6. Precision and recall for S-Graphs+ (blue) and S-Graphs (red) on six different scenes of our in-house dataset.
Additionally, Table III provides a comprehensive overview of the computation time required by each module within S-Graphs+. Plane segmentation runtime can vary depending on the area of the wall planes, as wall planes with a larger area contain a higher number of points, increasing the computation time. The runtime of room segmentation varies with the number of wall planes mapped around the robot at a given time instant, while floor segmentation runtime varies with the total number of currently mapped wall planes (a higher number of mapped wall planes increases the computation time). The back-end computation time increases with the length of the experiment, as the graph size typically grows with time, but even with sequence lengths of approximately 17 min (C2F2), all the modules of S-Graphs+ are able to maintain real-time performance.
TIERS LiDARs dataset: Table IV presents the ATE for all baseline methods and our S-Graphs+ in the indoor sequences of the public TIERS dataset [27]. On average, S-Graphs+ with FLOAM odometry gives the best results in all the experiments, improving by 43.2% over FLOAM. Individually, S-Graphs+ w. OR and S-Graphs+ w. OF improve the average accuracy by 13.36% and 8.32% over S-Graphs, while S-Graphs+ shows an average accuracy improvement of 14.59% over the second-best baseline. Note that all methods perform similarly for small scenes, but differ as scenes become larger. S-Graphs+ presents significant error reductions for large environments. The strength of our hierarchical representation is particularly evident in scenarios like T11, in which S-Graphs+ utilizing FLOAM odometry improves over the FLOAM accuracy by 165.8% (Table IV and Fig. 7). Observe also in Fig. 7 the good performance of S-Graphs+ in non-Manhattan worlds.
Conclusion
In this work, we present S-Graphs+, a novel four-layered hierarchical factor graph composed of: a keyframes layer constraining a sub-set of robot poses at specific distance-time intervals; a walls layer constraining the wall plane parameters and linking them to the keyframes; a rooms layer relating detected rooms to their corresponding wall planes; and a floors layer denoting the current floor level in the graph and constraining the rooms at that level. To extract this high-level information, we also propose a novel room segmentation algorithm using free-space clusters and wall planes, and a floor segmentation algorithm extracting the floor centers using all the currently extracted wall planes. We demonstrate an average improvement in accuracy of 10.67% against the second-best method on our simulated and real experiments covering different indoor environments. In future work, we plan to exploit the hierarchical structure of the graph for efficient and faster optimization, validate it over buildings with several floors, and enhance the reasoning over the graph to improve the detection of different relationship constraints between its semantic elements.
ACKNOWLEDGMENT
For the purpose of Open Access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.