TSM: Topological Scene Map for Representation in Indoor Environment Understanding

In the field of robotics, it is crucial to obtain a comprehensive semantic understanding of a scene for many applications. Based on the behavioral topological map and scene graph, we propose to employ a semantic map named Topological Scene Map (TSM) for representation in indoor environment understanding. The behavioral topological map we constructed expresses the spatial connection relations and semantically describes the navigation behavior between adjacent topological nodes. The scene graph promotes the TSM to record the objects that appear in the scene and the relations between objects. The addition of spatial and semantic relations makes the expression of the scene more specific, which improves the robot’s abilities of scene understanding and human-robotic interaction. In this article, we design a method for topological map construction and apply a novel approach to generate a scene graph from RGB-D data. The semantic representation of the environment generated in the experiments verifies that the TSM construction framework models the scene efficiently and the TSM is conducive to the realization of human-robotic interaction.


I. INTRODUCTION
In robotics, modeling the environment is a fundamental step before a robot starts its tasks. Environment information is usually stored in the form of maps, such as metric maps, topological maps, and semantic maps. Robots with different tasks require different maps.
For robots, an effective environment model should include the following elements:
• Applicability: The robot is able to perform various tasks with the model, not just a specific task.
• Accuracy: The model should be able to accurately describe the environment and provide the robot with correct information.
• Scalability: The model needs to adapt to the size of the environment, and expand the size of the expressed environment step by step.
• Usability: The model should be easy to use and be able to realize human-robotic interaction.
For humans, dynamic changes in indoor furniture placement, light intensity, and other factors do not seem to change the general understanding of the environment and do not affect navigation. Previous biological research showed that the spatial information stored by biological navigation systems is coarse-grained [1]. These coarse-grained representations are topological descriptions of the environment. The topological map has been widely used in robot navigation [2]-[5], and the recent rapid development of deep learning algorithms provides a new perspective for the application of the topological map [6]-[8].
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.
To act intelligently, robots need to understand the environment in the way humans do, so that they are capable of executing semantic commands. The semantic map is proposed to describe the environment semantically, which assists robots in understanding the environment. The semantic map for the robot contains the semantic information of the space and of the entities in it. These entities have the attributes of some classes, and more knowledge about them is obtained by reasoning with a knowledge base [9]. Recent attention has focused on using the semantic map to represent the environment intelligently. The review in [10] divides the research on semantic maps into indoor and outdoor studies, and into single-scene and large-scene studies. The 3D semantic map construction framework proposed in [11] is a representative work of semantic map research, which expresses the environment with five levels: Metric-Semantic Mesh, Objects and Agents, Places and Structures, Rooms, and Building. Besides, the framework applies a tracking and detection module for dynamic targets, which reduces the effects of dynamic changes [12], [13]. However, this framework has strict hardware requirements, and its map construction process is complicated.
We divide the tasks performed by intelligent robots in an environment into the following stages. (1) Environmental perception. The robot collects information about the environment with sensors. For example, the RGB-D sensor collects images of the environment. (2) Environmental semantic cognition. Semantic information is attached to the objects and spaces detected in the environmental perception stage to provide a basis for human-robotic interaction. (3) Semantic task understanding. Semantic tasks are understood through the semantic representation of the environment and decomposed into executable task sequences, such as navigation and obstacle avoidance. (4) Task execution. Tasks in the executable task sequence are realized with the basic control system.
This article focuses on environmental semantic cognition. We propose a lightweight sparse environment representation, the Topological Scene Map (TSM), to effectively represent the environment. The TSM combines navigation behavioral topological map construction [6], [14] with scene graph generation [15]. The navigation behavioral topological map records the connectivity of space and describes the navigation behavior between adjacent nodes, which allows the robot to generate navigation routes with semantic information. The scene graph [15] is applied to preserve the objects and relations appearing in the environment; it is a compact representation that expresses a complicated environment with less memory and integrates data from different domains, such as text and image. One scene graph of an indoor environment is shown in Figure 1. The specific objective of this article is to construct a robotic semantic TSM for the indoor environment with our framework.
The TSM aims to model the environment semantically, which contributes to the interpretable reasoning that is key to semantic tasks such as question answering (Q&A) and navigation route recommendation. When a robot needs to perform other tasks in the environment, the TSM is loaded as a global static map to construct a new environment representation that will be updated with observations.
In the construction process of the TSM, the robot first needs to complete a full coverage exploration of the environment. During the exploration, the RGB-D sensor records the full coverage exploration video, and the laser sensor is used to build the metric map of the environment. Then, the behavioral topological map is constructed based on the metric map and is employed to divide the exploration video into slice groups, which are used to generate the scene graph for each room. Our contributions are the following:
• We design a process of constructing a topological map that represents the connectivity of the environment, in which the features of a node include its ID, coordinates, and semantic label, and the features of an edge include distance and navigation behavior information.
• We propose a method to split the environment exploration video according to the topological map. The full coverage exploration video needs to be split into slices according to the topological map, and door nodes are used to get timestamps for splitting.
• We generate the scene graph for every room of the indoor environment from the split exploration video. The scene graph records a complex environment with less memory, which facilitates semantic query and reasoning.
• The TSM construction framework combines a topological map for global information with a scene graph for local information to provide an efficient semantic representation of the environment. With the TSM, the robot generates human-friendly navigation command sequences for human-robotic interaction.
The rest of this article is organized as follows. In Section II, related works on the topological map, the scene graph, and the semantic map are presented. In Section III, we introduce the methodology of the TSM construction framework, including the process of navigation behavior topological map construction, the Scene Graph Generation (SGG) model, the method of generating the scene graph from the video, and the method of fusing the scene graph with the topological map. In Section IV, we verify the construction process of the TSM in a simulation environment. In Section V, we conclude this article and point out further directions.
VOLUME 8, 2020

II. RELATED WORK

A. TOPOLOGICAL MAP
The topological map, also referred to as the roadmap, is a sparse representation that describes the topological characteristics of the environment. Topological maps are constructed either in combination with metric maps [16], [17] or without them [2]. The generalized Voronoi diagram [18] and spectral clustering [19] are the two main methods [20] to construct the topological map. In [21], spectral clustering and extended Voronoi diagrams are used to construct the topological map from the metric map: spectral clustering is applied to segment the metric map and obtain the center of each cluster. In [22], a lightweight method for combining the metric and topological maps is proposed. With the combined map, the robot is able to autonomously navigate in a large-scale environment and avoid obstacles. Although the fusion of these two maps achieves obstacle-avoiding navigation, the cost of storing the map is extremely high. A concise process of topological map construction and an effective topological representation are proposed in [5] to resolve this problem. While creating a topological map, it is necessary to determine which topological node each region of the environment belongs to. In [23], a grid middle layer is proposed to rasterize the metric map and distinguish different regions. The topological points that fall into the same region are given the same region label. In [24], the segmentation of 3D scenes is transformed into an integer programming problem. To address the problems of localization and noise in the metric map, a visual topological map is constructed with the use of structured prior knowledge [8], in which each node is represented by a 360-degree panoramic image and the edges represent pose transformations.
Following the way humans model the environment, some new methods of topological map construction and application have been proposed. In [25], a semi-parametric topological memory framework inspired by landmark-based navigation is constructed, which consists of two parts: a non-parametric topological map for memory and a parametric deep network used to retrieve nodes from the topological map with observations. Humans use landmarks to verify their understanding of the environment. In [26], a topological map based on landmarks is constructed, in which the landmarks allow the robot to execute a task reliably despite some interference. For humans, prior knowledge about some types of environment plays an important role when a new environment needs to be represented. In [27], a new storage structure, named Bayesian Relational Memory (BRM), is proposed to store the prior knowledge. With BRM, robots quickly construct the representation of an unknown environment from prior knowledge. Maps in the human mind usually carry semantic descriptions. Following this idea, a topological map with semantic information attached is more practical. In [14], a navigation behavior topological map is constructed, in which the nodes and edges are attached with semantic labels. Inspired by the navigation behavior topological map, we propose a method to construct our topological map.

B. SCENE GRAPH
The scene graph is a sparse representation of semantic information [15], where nodes represent entities in the scene and edges represent spatial or logical relations. The objects appearing in the image are displayed as semantic elements in the scene graph and the relations in the scene graph will contribute to scene understanding and interpretable reasoning. Much of the current literature on the scene graph pays particular attention to generation and application [28].
The data for SGG is not limited to images but also includes text and video [29]-[32]. Generally, the scene graph is generated not from a single image but from related images, which improves the effectiveness of the scene graph. Besides, the scene graph is regarded as a commonsense knowledge graph generated according to the scene [33]. Therefore, SGG is regarded as a bridge between the scene graph and the commonsense graph. There are currently two main kinds of methods for SGG. The first kind is a two-stage approach that detects objects and then recognizes the relations between them [34]. The other applies region proposals to jointly reason about the classes of objects and relations [35]. In [34], the SGG model MOTIFNET is proposed, which divides the scene analysis into three stages: delineating the object regions, labeling the region types, and predicting the relations between the regions. Each stage incorporates global contextual information through a bidirectional Long Short-Term Memory (LSTM), and the output of each stage is the input of the next stage. In [35], the SGG model Factorizable Network is proposed, which introduces subgraphs to reduce the cost of SGG. In the TSM, the Factorizable Network is applied to generate a scene graph from one image.

C. SEMANTIC MAP
Much of the previous research on environment representation has been carried out for navigation tasks. The robot's navigation tasks are generally divided into global navigation and local navigation, and some studies construct a hierarchical map to serve both. In [36], a hybrid metric-topological-semantic map structure, called MTS-map, is established, which allows robots to implement fine metric-based navigation and coarse query-based localization. Although grid-based representation supports most navigation tasks, a large amount of calculation is needed to obtain the optimal path on the grid map of a large scene. To reduce the cost of calculation, a two-layer map is proposed in [37]. In that work, the first layer is a region roadmap representing the connectivity between different regions in the environment, and the second layer is the local roadmap. Each node in the region roadmap is related to a local roadmap. With this map, finding the navigation route costs less.
By understanding the environment in the way humans do, robots are able to implement some human-robotic interaction tasks. For planning and exploration in open and uncertain worlds, a semantic map is constructed with commonsense knowledge [38]. The semantic representation describes the environment information perceived by the robot in detail and deals with uncertainties. In [39], uncertainty is also considered in the process of semantic map construction: multiple semantic maps are constructed with probability, and Conditional Random Fields (CRFs) are used to model the background relations and uncertainties during object recognition. Representing the environment with 3D information preserved helps the scene graph record the details of the environment, but it requires powerful computation capabilities. In [13], a framework for constructing a 3D scene semantic map is proposed. The map constructed by this framework is composed of four layers, which is more in line with human thinking and perception. In [12], the 3D environment is expressed with a scene graph, in which each node represents an object and its attributes, and each edge represents the relation between objects. Based on [12], [13], the MIT SPARK laboratory combines previous semantic mapping work, visual-inertial odometry, deep learning, and other methods to construct a scene graph of a dynamic 3D environment [11]. They propose a more comprehensive 3D semantic SGG framework, which adds detection and tracking modules for dynamic targets, so that some of the impacts of dynamic changes are eliminated. A scene graph constructed from a single image is not specific enough, while a scene graph generated from multiple images may miss some objects or detect some objects repeatedly. We refer to the methods in [11]-[13] to build the scene graph from RGB-D videos.

III. METHOD
In this section, we describe the method of constructing the TSM. The first stage is the environment exploration. The robot carrying laser and RGB-D sensors is placed in an unknown indoor environment to complete the full coverage exploration. During the exploration, the laser sensor collects laser data to generate the metric map, and the RGB-D sensor records the environment exploration video. The second stage is the topological map construction. After completing the exploration of the environment, a navigation behavioral topological map based on the metric map is constructed. The third stage is the scene graph generation. When generating the scene graph, the video of environment exploration is split into slices, and these slices are then classified according to the room they describe. Finally, the TSM of the environment is obtained by attaching the generated scene graph to the room node and combining it with the topological map. Our TSM takes the metric map as the bottom layer, the scene graph as the middle layer, and the topological map as the top layer. The process of TSM construction is shown in Figure 2.
FIGURE 2. The TSM construction framework. The laser data are used to construct the metric map and the topological map, and the RGB-D data are used to construct the scene graph. The door and room nodes of the topological map are applied to split the environment exploration video into slices. The video slices are grouped by the room where they were shot, and then the grouped slices are clipped into image groups. The images in each group are filtered by the video processing modules. Taking IG_1 as an example, all the images in IG_1 are used to generate local scene graphs. All the local scene graphs from the same image group are merged to update a global scene graph. Finally, the metric map, topological map, and scene graph are fused to construct the TSM.

A. TOPOLOGICAL MAP CONSTRUCTION
Generally, topological nodes indicate positions, and edges represent connectivity and distance. Although this kind of topological map meets the needs of most tasks, it is not sufficient for human-robotic interaction tasks. When asked for directions, the robot needs to answer in a way that humans understand, rather than providing a bare node sequence. Therefore, we attach semantic descriptions to the nodes and edges of the common topological map.

1) TOPOLOGICAL MAP DESIGN
When constructing the topological map, we refer to the method adopted by [6]. A node in the topological map represents a location, and an edge represents the navigation behavior and distance. The navigation behavior helps to guide the robot from one location to another, such as walking through the corridor, leaving the room, and other navigation behaviors [14].
We define three behaviors for the navigation behavior topological map: enter the room (ER), leave the room (LR), and cross the room (CR). Although there are only three behaviors in the behavior set, the set is sufficient to achieve semantic navigation in our indoor environments and also allows us to simplify the design of the topological map. The "cross the room (CR)" behavior is defined as moving from one door to another door of the same room. For example, the navigation behavior is expressed as "from door_1, cross the room_1, to door_2". The "leave the room (LR)" behavior is defined as moving from a room to an exit (door) of the room, such as "from room_1, leave the room_1, to door_1". The "enter the room (ER)" behavior is defined as entering a room from one of its exits (doors), such as "from door_1, enter the room_1, to room_1".
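The behavior set above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each node carries a kind of either "door" or "room" and that the behavior follows from the kinds of the two endpoints; the names `Behavior`, `infer_behavior`, and `describe_edge` are illustrative and not part of the original framework.

```python
from enum import Enum


class Behavior(Enum):
    """The three navigation behaviors; the value is the verb used in the phrase."""
    ER = "enter"
    LR = "leave"
    CR = "cross"


def infer_behavior(src_kind: str, dst_kind: str) -> Behavior:
    """Map the kinds of the source and target nodes to a behavior (assumed rule)."""
    if src_kind == "door" and dst_kind == "door":
        return Behavior.CR  # door to door of the same room
    if src_kind == "room" and dst_kind == "door":
        return Behavior.LR  # room to one of its exits
    if src_kind == "door" and dst_kind == "room":
        return Behavior.ER  # exit into the room
    raise ValueError(f"no behavior defined for {src_kind} -> {dst_kind}")


def describe_edge(src: str, src_kind: str, dst: str, dst_kind: str, room: str) -> str:
    """Render one edge as a phrase in the style of the examples in the text."""
    behavior = infer_behavior(src_kind, dst_kind)
    return f"from {src}, {behavior.value} the {room}, to {dst}"
```

For instance, `describe_edge("door_1", "door", "door_2", "door", "room_1")` yields the CR phrase quoted above. A room-to-room edge raises an error, because the behavior set defines no such transition.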
There are two kinds of nodes in the topological map: the room node and the door node. Each room node is related to a specific space, such as a kitchen or a corridor. Each door node is related to an exit of one room. The door nodes are employed to improve navigation routes and help understand navigation behaviors. Without door nodes, every room node along the path is visited when navigating from source to target, which increases the cost. As shown in the left of Figure 3, an order of navigating from room_1 to room_3 is issued. If there are no door nodes, the navigation route includes the room_2 node, as shown by the green line. If there are door nodes, the navigation route includes the door nodes of room_2 but not the room_2 node, as shown by the red line. Comparing the two navigation routes shows that door nodes help generate shorter navigation routes. The cost of each edge is determined by the distance between adjacent nodes, calculated by the A* algorithm [40].
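The effect of door nodes on route length can be reproduced on a toy graph. The sketch below uses Dijkstra's algorithm over two hypothetical edge lists, one with and one without door nodes; all distances are invented for illustration and do not come from the experiments.

```python
import heapq


def shortest_route(edges, source, target):
    """Dijkstra over an undirected weighted graph; returns (cost, node path)."""
    graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical distances in meters. Without door nodes, the route must pass
# through the center of room_2.
EDGES_WITHOUT_DOORS = [
    ("room_1", "room_2", 6.0),
    ("room_2", "room_3", 6.0),
]

# With door nodes, room_2 can be crossed directly from door_1 to door_2.
EDGES_WITH_DOORS = [
    ("room_1", "door_1", 3.0),
    ("door_1", "room_2", 3.0),
    ("door_1", "door_2", 4.0),
    ("room_2", "door_2", 3.0),
    ("door_2", "room_3", 3.0),
]
```

Under these assumed distances, `shortest_route(EDGES_WITH_DOORS, "room_1", "room_3")` skips the room_2 node and is shorter than the route over `EDGES_WITHOUT_DOORS`.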
We take the generation of the door_2 node as an example to illustrate the process of door node generation. The generation process is as follows: (1) Find the region of door_2 in the metric map, as shown in the right of Figure 3. [...] and record the position and label of these points; for example, the second point taken in the direction of B is labeled door_2B_2. (5) Store all points between A and B in a dictionary with door_2 as the key. In the topological map, only the door_2 node is displayed, and the dictionary related to door_2 is used to split the video of environment exploration.

2) TOPOLOGICAL MAP CONSTRUCTION PROCESS
The construction process is given below: (1) Construct the metric map of the environment with the Gmapping algorithm [41]. The metric map is saved in the form of an occupancy grid. (2) According to the previous requirements, select some locations in the metric map as nodes of the topological map, and store the coordinates of each location as node features. There are two types of nodes, namely door nodes and room nodes. (3) Attach a semantic label to each node, such as kitchen-1 or corridor-1. (4) Define the navigation behavior between nodes in the topological map and attach the navigation behavior to each edge. (5) Calculate the distance between nodes with the A* algorithm and store the distance as an edge feature. (6) Store the topological map.
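Step (5) can be sketched as an A* search on the occupancy grid. The sketch below is a minimal illustration under stated assumptions: the grid is a list of rows with 0 for free cells and 1 for occupied cells, motion is 4-connected with unit step cost, and the heuristic is the Manhattan distance. The function name `astar_distance` and the grid layout are hypothetical; multiplying the result by the grid resolution would give a metric edge distance.

```python
import heapq


def astar_distance(grid, start, goal):
    """Length in cells of the shortest 4-connected path, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected unit-cost motion.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            return cost
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry
        row, col = cell
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# Hypothetical 3x4 occupancy grid: a wall with a gap on the right side.
EXAMPLE_GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
```

The distance is measured along the free space rather than as a straight line, so two nodes separated by a wall receive the cost of the detour through the nearest opening.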

B. SCENE GRAPH GENERATION
The scene graph is defined as a directed acyclic graph in which each object node carries a class label $c_i \in C$ and a set of attributes $A_i \subseteq A$. We apply the Factorizable Network [35] to generate the scene graph from each image captured in the environment exploration video.
During inference, the Factorizable Network represents the scene graph as a connection graph based on subgraphs to improve the efficiency of SGG. The subgraphs are generated by clustering and each represents a set of relations with similar features, which simplifies the calculation of SGG. The Spatial-weighted Message Passing (SMP) structure and the Spatial-sensitive Relation Inference (SRI) module of the Factorizable Network preserve the spatial structure for relational reasoning.
The process of SGG with the Factorizable Network is summarized as follows: (1) A Region Proposal Network (RPN) is used to generate object region proposals. (2) A fully connected graph is established over all object region proposals, in which any two objects are connected by two edges in opposite directions that represent the relations between them. The features of these edges are extracted from the union box of the two connected objects. (3) All relations are clustered from the bottom up, so that relations with similar features are clustered together. After a relation class is obtained, all the relations in the same class are represented by the same subgraph node. Thus, a subgraph-based representation of the fully connected graph is obtained, which includes subgraph nodes and object nodes. (4) Feature vectors and 2D feature maps are generated by applying Region of Interest (ROI) pooling to the object features and subgraph features, respectively. (5) Refined object and subgraph features are generated with spatial-weighted message passing. (6) Object classes are recognized from the object features, and relations are recognized from the fusion of object and subgraph features. Object features focus on the details of an object, while subgraph features focus on the relations between objects. A representation that combines the features of these two types of nodes is beneficial for recognizing object and relation classes.
In step (5) of SGG, SMP based on the inner product attention mechanism is applied to combine the object and subgraph features and obtain refined features. A 2D feature map is employed to express subgraph features, so the spatial information is retained. Assume that the object feature vector and the subgraph feature map input to SMP are $o_i$ and $S_k$, respectively. Since the dimensions of object and subgraph features are different, two different methods are needed to exchange information between the object and subgraph nodes.
From subgraph to object nodes: The purpose of this process is to convert the 2D subgraph feature map to the vector space of the object feature and fuse these two types of features. First, $S_k$ is converted into a vector $s_k$ through 2D average pooling. Then, all $s_k$ are aggregated by weights into the weighted sum $\tilde{s}_i$, which is taken over $\mathbb{S}_i$, the set of subgraph nodes connected to object $i$. Here $p_i(S_k)$ indicates the weight for $s_k$, and $FC^{(att\_s)}$ converts the feature vector $s_k$ to the domain of $o_i$. Finally, the refined object feature $\hat{o}_i$ is generated by combining the weighted sum $\tilde{s}_i$ with the object feature $o_i$, where $FC^{(s \to o)}$ represents a fully connected network that converts $\tilde{s}_i$ to the domain of $o_i$.
From object to subgraph nodes: This process aims to compute the weighted sum of object features and map it to the domain of the subgraph feature map, where it is combined with the 2D feature map. When converting the object feature vector to the domain of the subgraph feature map, the position information of the object needs to be considered. The weighted sum $\tilde{O}_k(x, y)$ at location $(x, y)$ after object feature projection is taken over $\mathbb{O}_k$, the set of object nodes connected to subgraph $k$. Here $P_k(o_i)(x, y)$ represents the weight of $o_i$ at location $(x, y)$ of subgraph $k$, and $FC^{(att\_o)}$ converts $o_i$ to the domain of $S_k(x, y)$. Then, the refined subgraph feature $\hat{S}_k$ is obtained by combining $\tilde{O}_k$ and $S_k$, where $Conv^{(o \to s)}$ is a convolution layer that transforms the object features into the subgraph domain.
In step (6) of SGG, the refined object and subgraph features are used to recognize the classes of objects and relations. The object classes are predicted directly from the object features. The classes of relations are predicted from the fusion of the subject and object features and the corresponding subgraph feature.
Each object connected to a subgraph node is related to a region in the subgraph feature map, so the object feature is applied as a convolution kernel to extract the visual cues in the subgraph feature map, which gives the convolution result $S_k^{(i)}$. Then, the relation is predicted with a fully connected layer applied to the fusion of the convolution results $S_k^{(i)}$ and $S_k^{(j)}$ and the subgraph feature map $S_k$.
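The message passing and relation inference steps described in this subsection can be written compactly. The following is a sketch consistent with the symbol definitions in the text, with $\mathbb{S}_i$ the set of subgraph nodes connected to object $i$ and $\mathbb{O}_k$ the set of object nodes connected to subgraph $k$. It assumes the softmax-normalized inner product attention and the residual updates of the Factorizable Network [35]; the placement of the ReLU nonlinearity, the additive form of "combining", and the use of $\otimes$ for convolution are assumptions rather than details stated here.

```latex
\begin{align}
\tilde{s}_i &= \sum_{S_k \in \mathbb{S}_i} p_i(S_k)\, s_k, \\
p_i(S_k) &= \frac{\exp\!\big(o_i \cdot FC^{(att\_s)}(\mathrm{ReLU}(s_k))\big)}
                 {\sum_{S_{k'} \in \mathbb{S}_i} \exp\!\big(o_i \cdot FC^{(att\_s)}(\mathrm{ReLU}(s_{k'}))\big)}, \\
\hat{o}_i &= o_i + FC^{(s \to o)}\big(\mathrm{ReLU}(\tilde{s}_i)\big), \\
\tilde{O}_k(x, y) &= \sum_{o_i \in \mathbb{O}_k} P_k(o_i)(x, y)\, o_i, \\
P_k(o_i)(x, y) &= \frac{\exp\!\big(FC^{(att\_o)}(\mathrm{ReLU}(o_i)) \cdot S_k(x, y)\big)}
                       {\sum_{o_{i'} \in \mathbb{O}_k} \exp\!\big(FC^{(att\_o)}(\mathrm{ReLU}(o_{i'})) \cdot S_k(x, y)\big)}, \\
\hat{S}_k &= S_k + Conv^{(o \to s)}\big(\mathrm{ReLU}(\tilde{O}_k)\big), \\
S_k^{(i)} &= FC\big(\mathrm{ReLU}(o_i)\big) \otimes \mathrm{ReLU}(S_k).
\end{align}
```

In this form, both attention weights are softmax distributions: $p_i(S_k)$ sums to one over the subgraphs connected to object $i$, and $P_k(o_i)(x, y)$ sums to one over the objects connected to subgraph $k$ at every location $(x, y)$, which is what lets the object-to-subgraph direction keep the spatial layout.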

C. VIDEO PROCESSING FOR SGG
The images for SGG come from the video taken during the environment exploration. To obtain all the scene information in the space of a certain topological node, the scene graphs generated from the images taken in the same room need to be merged. In this process, it is necessary to select appropriate images to generate the scene graph and to eliminate duplicate elements when merging multiple scene graphs. The frames extracted from the video are processed by the Region Proposal Network (RPN), which extracts the regions of objects from the image. The SGG module takes the output of the RPN module as input and calculates the probabilities that the object in each proposed region belongs to different classes and the probabilities that the relation between objects belongs to different classes.
There are three modules for processing the video for SGG: Adaptive Blurry Image Rejection (ABIR), Keyframe Group Extraction (KGE), and Spurious Detection Rejection (SDR). We use the terms local scene graph and global scene graph to distinguish a scene graph generated from a single image from one generated from the images describing the same room.

1) ADAPTIVE BLURRY IMAGE REJECTION
For better performance, the object and relation recognition modules need clear input images. However, some blurry images may be collected due to the movement of the camera during image collection. In blurry images, the shape, size, and color of objects may change, which harms object and relation recognition. To eliminate the influence of blurry images, the variance of the Laplacian is used to measure the intensity variations between pixels in an image, with
$$L(x, y) = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2} \qquad (11)$$
where $W$ and $H$ represent the width and height of the image over which the variance is taken, and $L(x, y)$ is the Laplacian operator. However, some low-texture images may be filtered out as blurry images, because the intensities of low-texture images also do not change significantly.
To overcome the problem of low-texture images, the ABIR algorithm is adopted. ABIR evaluates the exponential moving average (EMA) $S_t$ of the Laplacian variances, where $t$ is the time step, $V_t$ is the variance of the Laplacian at $t$, and $\alpha \in [0, 1)$ is a constant smoothing factor that controls the influence of previous observations on the current $S_t$. Because only a few observations are available at the beginning, the initial EMA does not follow the observed values, which introduces a deviation in the final result. To correct this deviation, $S_t$ is bias-corrected to obtain the average value $\hat{S}_t$, which is further processed with a gain $g$ and an offset $b$ to obtain the adaptive threshold.
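The ABIR idea can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not fix: the Laplacian is the 4-neighbor discrete stencil evaluated on interior pixels only, the EMA is $S_t = \alpha S_{t-1} + (1 - \alpha) V_t$ with $S_0 = 0$, the bias correction is $\hat{S}_t = S_t / (1 - \alpha^t)$, and the adaptive threshold is the linear form $g \hat{S}_t + b$. The class name `AdaptiveBlurFilter` and the default parameter values are hypothetical.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbor discrete Laplacian over the interior pixels.

    `image` is a list of rows of gray values; border pixels are skipped.
    """
    height, width = len(image), len(image[0])
    values = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            values.append(
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


class AdaptiveBlurFilter:
    """Rejects frames whose Laplacian variance falls below an adaptive threshold."""

    def __init__(self, alpha=0.9, gain=0.5, offset=0.0):
        self.alpha = alpha    # smoothing factor in [0, 1)
        self.gain = gain      # g, assumed multiplicative gain
        self.offset = offset  # b, assumed additive offset
        self.ema = 0.0        # S_t, starts at zero
        self.step = 0         # t

    def is_blurry(self, variance):
        self.step += 1
        self.ema = self.alpha * self.ema + (1 - self.alpha) * variance
        corrected = self.ema / (1 - self.alpha ** self.step)  # bias-corrected EMA
        threshold = self.gain * corrected + self.offset       # assumed linear form
        return variance < threshold
```

Because the threshold follows the running average of recent frames, a uniformly low-texture sequence lowers its own threshold instead of being rejected wholesale, which is the motivation given above for making the threshold adaptive.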

2) KEYFRAME GROUP EXTRACTION
The process of the KGE module is divided into three steps: (1) Accepting a series of processed images.
(2) Filtering out unnecessary frames by dividing the input image into three parts.
(3) Forming a keyframe group.
Note that the input images are divided into the following three parts:
• Keyframe: The keyframe serves as the first anchor frame, and the coverage of the keyframe group is determined with the keyframe as reference.
• Anchor Frame: Apart from the latest anchor frame, all other anchor frames are inactive. The next anchor frame is determined by the active anchor frame.
• Garbage Frame: Except for the keyframes and anchor frames, all the other frames are garbage frames, which are regarded as redundant frames and discarded.
Specifically, the process of the KGE module is as follows. The module takes the first non-blurry frame as the first keyframe. Each subsequent incoming frame is classified by comparing it with the current keyframe and the active anchor frame. When the overlap between the frame and the active anchor frame is lower than $t_{anchor}$%, the frame is kept as the next anchor frame. When extracting the first anchor frame, the input frame is compared with the keyframe. When a new anchor frame is detected, the current active anchor frame becomes inactive, and the new anchor frame becomes active. If the overlap between an incoming frame and the keyframe is lower than $t_{keyframe}$%, the frame becomes the new keyframe, and the previous keyframe and its anchor frames form a keyframe group.
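The frame classification above can be sketched as a small state machine. The sketch assumes that the keyframe test is evaluated before the anchor test, that the overlap is supplied as a function returning a fraction in [0, 1], and that the thresholds are given as fractions rather than percentages; the function name `extract_keyframe_groups` and the default threshold values are hypothetical.

```python
def extract_keyframe_groups(frames, overlap, t_keyframe=0.3, t_anchor=0.6):
    """Group non-blurry frames into keyframe groups.

    Each group is a list [keyframe, anchor_1, anchor_2, ...]; every other
    frame is treated as a garbage frame and dropped.
    """
    groups = []
    keyframe = None
    anchors = []
    for frame in frames:
        if keyframe is None:
            keyframe = frame  # the first non-blurry frame is the first keyframe
            continue
        if overlap(frame, keyframe) < t_keyframe:
            # Too little in common with the keyframe: close the current group.
            groups.append([keyframe] + anchors)
            keyframe, anchors = frame, []
            continue
        # The keyframe acts as the first anchor until a real anchor exists.
        active = anchors[-1] if anchors else keyframe
        if overlap(frame, active) < t_anchor:
            anchors.append(frame)  # becomes the new active anchor frame
        # Otherwise the frame is redundant (garbage) and is discarded.
    if keyframe is not None:
        groups.append([keyframe] + anchors)
    return groups
```

As a toy usage, frames can be modeled as 1-D camera positions with `overlap = lambda a, b: max(0.0, 1 - abs(a - b) / 10)`. A camera that stands still then produces a single keyframe and no anchors, matching the remark that redundant frames from a stationary camera are discarded.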
To compute the overlap between two frames, one frame is mapped to the coordinate frame of the other. The overlap is computed with an indicator function $\vec{1}_A(\cdot)$ for set $A$ and a projection function $p(i, j)$ that projects the source frame to the target frame. The projection is defined in terms of the camera intrinsic matrix $K$, the relative pose $T_{i,j}$ between frame $i$ and frame $j$, the original point $p$, and its depth $D(p)$. With the application of keyframes and anchor frames, the KGE module effectively removes the redundant information in continuous image sequences. Besides, even if the camera remains still for a long time, the module effectively handles the redundant frames.
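One way to realize such an overlap measure is sketched below. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a relative pose given as a 3x3 rotation matrix and a translation vector that map source-camera coordinates to target-camera coordinates, and an overlap defined as the fraction of valid-depth source pixels whose projection lands inside the target image. These conventions, and the names `project_pixel` and `frame_overlap`, are assumptions made for illustration.

```python
def project_pixel(u, v, depth, intrinsics, rotation, translation):
    """Project a source pixel with known depth into the target image plane."""
    fx, fy, cx, cy = intrinsics
    # Back-project to a 3D point in the source camera frame: D(p) * K^-1 * p.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Apply the relative pose (rotation, then translation).
    xt = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z + translation[0]
    yt = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z + translation[1]
    zt = rotation[2][0] * x + rotation[2][1] * y + rotation[2][2] * z + translation[2]
    if zt <= 0:
        return None  # the point is behind the target camera
    # Re-project with the same intrinsics K.
    return fx * xt / zt + cx, fy * yt / zt + cy


def frame_overlap(depth_map, intrinsics, rotation, translation):
    """Fraction of valid source pixels that fall inside the target image."""
    height, width = len(depth_map), len(depth_map[0])
    inside = total = 0
    for v in range(height):
        for u in range(width):
            depth = depth_map[v][u]
            if depth <= 0:
                continue  # missing depth measurement
            total += 1
            projected = project_pixel(u, v, depth, intrinsics, rotation, translation)
            if projected and 0 <= projected[0] < width and 0 <= projected[1] < height:
                inside += 1
    return inside / total if total else 0.0
```

With an identity pose every pixel maps onto itself, so the overlap is 1. As the camera translates or rotates, more projections leave the image bounds and the overlap decreases, which is the quantity compared against the keyframe and anchor thresholds.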

3) SPURIOUS DETECTION REJECTION
When scene graphs are generated from images clipped from video, the recognition module cannot perform perfectly because the frames captured by the camera are affected by noise. The SDR module aims to eliminate erroneous and repeated detections with prior knowledge. It is applied to three modules: region proposal, object recognition, and relation extraction. First, the SDR module deletes redundant target areas in the region proposal module using non-maximum suppression (NMS) [42]. For the object recognition module, the SDR module deletes irrelevant objects, in particular predefined ones such as roads, sky, buildings, and moving objects.
For the relation extraction module, the SDR module counts the possible relations of all object pairs in all frames of one frame group and keeps the most frequent relations in the scene graph. If multiple relations appear the same number of times for an object pair, they are all added to the graph. The SDR module then employs a relation dictionary, extracted from the statistics of the Visual Genome dataset, as prior knowledge. The relation dictionary includes the statistical data between object pairs and the pixel distance d_pixel, stored in the form of a Gaussian distribution.
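The frequency vote described above can be illustrated with a short sketch. The triple layout `(subject, relation, object)` and the function name `vote_relations` are assumptions for the example; ties are kept, as the text states.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), hypothetical layout


def vote_relations(
    frames: Iterable[Iterable[Triple]],
) -> Dict[Tuple[str, str], List[str]]:
    """Keep, for every object pair, the relation(s) seen most often in a frame group."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for triples in frames:
        for subj, rel, obj in triples:
            counts[(subj, obj)][rel] += 1

    kept: Dict[Tuple[str, str], List[str]] = {}
    for pair, counter in counts.items():
        best = max(counter.values())
        # All relations tied for the highest count are retained.
        kept[pair] = sorted(rel for rel, n in counter.items() if n == best)
    return kept
```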
The relation between an object pair is recognized as follows: (1) the SDR module detects the related object pairs; (2) it searches the relation dictionary for the detected object pairs; (3) it calculates the probability of the detected relation. To calculate the probability, the pixel distance between the objects is combined with the prior statistical information, and relations whose probability is lower than the threshold are filtered out. The probability of a relation, Pr(r | d_pixel), is calculated from Pr_dict(r | d_pixel), the statistical probability given the pixel distance, and φ_{µ,σ²}(k), a Gaussian function. The normalized probability density function is employed to approximate the probability of the distance between points.
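The three steps above can be sketched as follows. The exact probability formula is not reproduced in this section, so the combination used here (dictionary prior multiplied by a Gaussian rescaled to peak at 1) is only one plausible reading and is an assumption, as are the dictionary layout and all function names.

```python
import math
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)
# Hypothetical dictionary entry: (prior probability, mean distance, std of distance)
DictEntry = Tuple[float, float, float]


def relation_probability(prior: float, d_pixel: float, mu: float, sigma: float) -> float:
    """ASSUMED form: prior times a Gaussian of the pixel distance with peak value 1."""
    gauss = math.exp(-((d_pixel - mu) ** 2) / (2.0 * sigma ** 2))
    return prior * gauss


def filter_relations(
    candidates: List[Tuple[Triple, float]],
    dictionary: Dict[Triple, DictEntry],
    threshold: float = 0.5,
) -> List[Triple]:
    """Drop candidate relations whose probability falls below the threshold."""
    kept: List[Triple] = []
    for triple, d_pixel in candidates:
        entry = dictionary.get(triple)
        if entry is None:
            continue  # assumption: relations missing from the dictionary are dropped
        prior, mu, sigma = entry
        if relation_probability(prior, d_pixel, mu, sigma) >= threshold:
            kept.append(triple)
    return kept
```

The default threshold of 0.5 matches the SDR probability threshold reported in the experiments.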

4) LOCAL SCENE GRAPH
The same object has a temporary ID in the local scene graph and a permanent ID in the global scene graph. The recognition module directly provides the semantic labels of the objects and the probability scores of these labels. The top-k labels and their scores are kept for same node detection. In addition, to eliminate the measurement error caused by representing the object position with its center point, the object position is represented as a Gaussian distribution. After the object region is divided into 5 × 5 sub-regions, the center rectangle is cut from the object bounding box given by the region proposal module. Then, the 3D position of each point in the center rectangle is calculated relative to the first keyframe, where i and o denote the indices of the current frame and the first keyframe. We then estimate the mean and variance of the Gaussian distribution N(µ, σ²) of the 3D position. The dimensions x, y, and z are assumed to be independent and identically distributed, and the number of points is retained for later evaluation.
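As a small illustration of the position attribute, the sketch below estimates a per-axis Gaussian from 3D points that are assumed to be already expressed in the frame of the first keyframe. The function name and the use of the population variance are assumptions.

```python
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def estimate_position(points: Sequence[Vec3]) -> Tuple[Vec3, Vec3, int]:
    """Return (mean, variance, point count) with x, y, z treated independently.

    The population variance is used here; this choice is an assumption.
    """
    if not points:
        raise ValueError("at least one 3D point is required")
    axes: List[Tuple[float, ...]] = list(zip(*points))  # [(x...), (y...), (z...)]
    mean = (fmean(axes[0]), fmean(axes[1]), fmean(axes[2]))
    var = (pvariance(axes[0]), pvariance(axes[1]), pvariance(axes[2]))
    # The point count is kept so later observations can be weighted against it.
    return mean, var, len(points)
```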
The color histogram of the object is computed over the pixels of the object region, where (H, S, V) represent the three axes of the color space and N represents the number of pixels. Each axis of the color space is divided into c bins, so the size of the histogram is c^3. Finally, a thumbnail of the object is obtained from the region in the bounding box. The attributes extracted in this step are updated and modified by subsequent modules, which collect object information from multiple frames and then make the final decision.
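A minimal sketch of such a histogram is given below, using only the Python standard library. The input format (8-bit RGB tuples), the bin indexing, and the normalization by the pixel count N are assumptions for the example.

```python
import colorsys
from typing import Iterable, List, Tuple


def hsv_histogram(pixels: Iterable[Tuple[int, int, int]], c: int = 8) -> List[float]:
    """Build a flattened c*c*c HSV histogram from 8-bit RGB pixels.

    Counts are divided by the number of pixels N (an assumed normalization).
    """
    hist = [0.0] * (c ** 3)
    n = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Each channel lies in [0, 1]; clamp so that the value 1.0 falls in the last bin.
        hi = min(int(h * c), c - 1)
        si = min(int(s * c), c - 1)
        vi = min(int(v * c), c - 1)
        hist[(hi * c + si) * c + vi] += 1.0
        n += 1
    if n:
        hist = [count / n for count in hist]
    return hist
```

With c = 8, the value used in the experiments, the histogram has 512 entries.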

5) GLOBAL SCENE GRAPH
The updating and merging of the scene graph will integrate the local scene graph generated from a single image into the global scene graph. Richer scene information will be collected with the changes of camera position and perspective, and the recognition module extracts different features from the same object. To integrate different features and eliminate the repeated extraction, we propose a module for same node detection.
When adding nodes to the global scene graph, same node detection needs to be performed. It employs the following features: object label, 3D position, and color histogram. The similarity between the newly added node and each existing node is calculated for every feature separately, as follows.
For the label, the similarity is denoted s_label. Here o and c represent the original node in the global scene graph and the candidate node in the current frame, respectively. C_o and C_c contain the top-k predicted category labels, and l_o and l_c are the labels with the highest score. The number of common elements in C_o and C_c is multiplied by the score. If there are no common elements, the score depends on the distance between the word vectors of l_o and l_c. In the score calculation, the input is a label in the candidate set C_i, and the score function f_s^i returns the score of that label.
For the position, the similarity is denoted s_position, where µ_c represents the mean of the object position in the candidate set. The similarity of the position information is calculated in the x, y, and z directions, where Z is the z-score of a normal distribution and φ(·) outputs the area under the standard normal distribution. If the difference between the position of the candidate object and the position of the object in the global scene graph is less than σ_o, the position similarity is considered to be 1. Otherwise, the position similarity is inversely proportional to the distance.
For the color, the similarity is denoted s_color and is derived from an intermediate color similarity, which is calculated with the intersection of histograms. h_i and h_j are the two histograms being compared, X, Y, and Z are the axes of the 3-D histogram space, and | · | returns the magnitude of a histogram.
When measuring the distance between histograms, histogram intersection allows the color histograms to be compared efficiently and effectively. The final color similarity is obtained from this intersection. Finally, the above similarities are combined into the total similarity over the feature set F = {label, position, color}. When the similarity between the candidate node and a node in the global scene graph is greater than the threshold, the two nodes are considered to be the same.
The process of merging and updating for global scene graph generation is as follows. The global scene graph is initialized from the first keyframe. Once a frame is input, its local scene graph is generated and then merged with the global scene graph. During the merging process, the nodes of the local scene graph are compared with the nodes of the global scene graph, and nodes that are detected as the same are deleted. Only the nodes that do not yet appear in the global scene graph are added to it. During the updating process, the label with the highest score in the top-k label set C_o ∪ C_c used for same node detection is selected as the latest label of the node. The 3D position of the object is also updated according to the latest observation. The color histogram is combined with the incoming color histograms. The number of points kept in the node becomes the sum of the original and new numbers of points. If the label with the highest score comes from the incoming scene graph, the thumbnail is replaced by the incoming thumbnail.
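The decision rule above can be sketched as follows. The weighted sum is an assumed form of the total similarity, using the weights and threshold reported in the experiments (0.375, 0.375, 0.25 and 0.5). The histogram intersection is normalized by the smaller histogram mass, which is a common convention and also an assumption here, as are all function names.

```python
from typing import Dict, Sequence

# Values reported in the experiments section; the weighted-sum form is assumed.
WEIGHTS: Dict[str, float] = {"label": 0.375, "position": 0.375, "color": 0.25}
THRESHOLD = 0.5


def histogram_intersection(h_i: Sequence[float], h_j: Sequence[float]) -> float:
    """Intersection of two histograms, normalized by the smaller total mass."""
    denom = min(sum(h_i), sum(h_j))
    if denom == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(h_i, h_j)) / denom


def total_similarity(scores: Dict[str, float]) -> float:
    """Combine the per-feature similarities over F = {label, position, color}."""
    return sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)


def is_same_node(scores: Dict[str, float], threshold: float = THRESHOLD) -> bool:
    """Two nodes are treated as the same when the total similarity exceeds the threshold."""
    return total_similarity(scores) > threshold


def merge_labels(c_o: Dict[str, float], c_c: Dict[str, float]) -> str:
    """Pick the highest-scoring label from the union of two top-k label sets."""
    union = dict(c_o)
    for label, score in c_c.items():
        union[label] = max(score, union.get(label, 0.0))
    return max(union, key=lambda label: union[label])
```

For example, a candidate that matches an existing node in label and position but not in color scores 0.75 and is merged, while one that matches only in position stays below the threshold and is added as a new node.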

D. FUSION OF TOPOLOGICAL MAP AND SCENE GRAPH
To obtain the scene graph of each room, the video obtained by the robot during the full coverage exploration needs to be split. During the environment exploration, the robot records a video of the environment and, every n seconds, records its position together with the corresponding timestamp in the video. This position information and the video timestamps are kept for splitting the exploration video.
The topological map is constructed based on the metric map, in which the door nodes describe the connectivity between two rooms. The exploration video is split using the positions and timestamps recorded during environment exploration and the door nodes of the topological map. While splitting the video, the door dictionary, which contains the point group created during door node generation, is used to judge whether the robot passes through the door. After splitting the video, we obtain the image group of each room and then construct the scene graph of the room from that image group. Every room node has its own scene graph, and the fusion of the topological map and the scene graph is thereby obtained.
The process of splitting the video is as follows: (1) Choose a door node and take its door dictionary with position information of the point group.  (4). (6) According to the connectivity between the door node and the room node in the topological map, the split video is divided into slices of the rooms.

IV. EXPERIMENTS
In this section, we construct a topological map in a simulation environment, verify the effectiveness of generating the scene graph from video, and present a TSM instance derived from the simulated indoor environment and the ScanNet dataset. The components of the TSM are illustrated in Figure 4.

A. EXPERIMENTAL SETUP
The TurtleBot3 [43] with the Robot Operating System (ROS) [44] is used in the simulation environment (Gazebo and Rviz) to construct the topological map of the TSM. Figure 5 presents the indoor environment model in Gazebo and Rviz. The Gmapping [41] package is employed to build the metric map of the environment from data collected by a laser sensor. The Factorizable Network is trained on the Visual Genome [45] dataset to obtain an effective SGG model. The Visual Genome dataset connects images with semantic concepts and includes 108,077 images with semantic annotations such as objects, relations, and attributes. For training the model, we mainly need the attributes of the objects in the images, the types of the objects, and the relations between the objects.
To verify the effectiveness of our TSM construction framework in a simulation environment, some videos from the indoor scene video dataset ScanNet [46] are selected as the environment exploration videos. This dataset consists of 1513 sequences collected by RGB-D cameras. The resolution of the frames is 1296 × 968 (color) and 640 × 480 (depth), and the image collection frequency is 30 Hz.

B. TOPOLOGICAL MAP CONSTRUCTION
The construction of the topological map is carried out in a simulation environment. First, an indoor environment model is loaded, as shown at the top of Figure 5. The robot is controlled by the keyboard to explore the environment and quickly complete the full coverage exploration. The metric map generated during the exploration is viewed through Rviz, as shown at the bottom of Figure 5. The metric map is presented at the top of Figure 4. Then, the topological map is constructed based on the metric map. We select the topological nodes on the map and define semantic labels for them. These nodes are divided into room nodes and door nodes. As shown at the bottom of Figure 4, 15 topological nodes are selected for this environment, including 7 door nodes and 8 room nodes. The label we defined for each room is listed in Table 1. Finally, the edges are defined according to the spatial connectivity. To get the cost of each edge, the distance between connected topological nodes is obtained by the A* algorithm, following the calculation rules of the actual navigation route of the robot. Combined with the navigation behavior generation rules, the navigation behavior for each edge is generated. The navigation behavioral topological map is visualized in Figure 6.
When the robot navigates with the behavioral topological map, it is able to generate a sequence of commands for itself to execute and a recommended route for humans. For example, the robot is placed in room_3 and gets a command such as "go to room_8". The robot obtains a topological node sequence for navigation, such as "room_3, door_3, door_5, door_7, room_8". By querying the semantic information of the topological map, the robot generates a recommended route for humans, such as (leave living_room_1, cross corridor_1, cross living_room_2, enter bedroom_2).
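The route example above can be reproduced on a toy graph. The sketch below runs Dijkstra's algorithm over a topological graph and then turns the node sequence into a human-readable route. The edge costs, the extra `door_4` and `kitchen_1` entries, the door-to-room mapping, and all function names are hypothetical; only the node sequence and the room labels in the example come from the text.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def build_graph(edges: List[Tuple[str, str, float]]) -> Graph:
    """Build an undirected weighted graph from (node_a, node_b, cost) tuples."""
    graph: Graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def shortest_path(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra search that returns (total cost, node sequence)."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    raise ValueError(f"no route from {start} to {goal}")


def describe_route(
    path: List[str],
    room_labels: Dict[str, str],
    door_rooms: Dict[str, Tuple[str, str]],
) -> str:
    """Turn [room, door, ..., door, room] into a route description for humans."""
    doors = path[1:-1]
    steps = [f"leave {room_labels[path[0]]}"]
    for d1, d2 in zip(doors, doors[1:]):
        # The room crossed between two doors is the one both doors belong to.
        shared = sorted(set(door_rooms[d1]) & set(door_rooms[d2]))
        steps.append(f"cross {shared[0]}")
    steps.append(f"enter {room_labels[path[-1]]}")
    return ", ".join(steps)


if __name__ == "__main__":
    # Hypothetical costs in meters; the real costs come from A* on the metric map.
    graph = build_graph([
        ("room_3", "door_3", 2.0), ("door_3", "door_5", 3.0),
        ("door_5", "door_7", 4.0), ("door_7", "room_8", 1.5),
        ("door_3", "door_4", 1.0),
    ])
    labels = {"room_3": "living_room_1", "room_8": "bedroom_2"}
    doors = {
        "door_3": ("living_room_1", "corridor_1"),
        "door_4": ("corridor_1", "kitchen_1"),
        "door_5": ("corridor_1", "living_room_2"),
        "door_7": ("living_room_2", "bedroom_2"),
    }
    cost, path = shortest_path(graph, "room_3", "room_8")
    print(path)
    print(describe_route(path, labels, doors))
```

Keeping the door-to-room mapping separate from the graph lets the same node sequence serve both the robot (as navigation targets) and the human-readable description.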

C. SCENE GRAPH GENERATION
The scene graph is generated from video slices classified by room. The SGG model, Factorizable Network, is trained first, and the parameters of the other modules are then adjusted for generating scene graphs from the video.
The Factorizable Network is trained on the Visual Genome dataset. A computer with an Intel Core i7-9750H CPU @ 2.60 GHz × 12 and an RTX 2060 GPU is used for the experiment. Based on the pre-trained Factorizable Network [35], the final SGG model used in the experiment achieves 29.574% Recall@50 and 38.476% Recall@100 on the Visual Genome dataset.
With the trained Factorizable Network, a scene graph is generated for each image from the exploration video. When capturing images from the video, the ABIR module is applied to eliminate the effects of blurred images, with the key parameters α, g, and b set to 0.9, 30, and 25, respectively. After a large number of blurred images are removed, the KGE module extracts the keyframe groups from the remaining images. To reduce the cost of computing overlaps between frames, the source image is mapped to the target image and 1000 points are sampled for the calculation. To eliminate the influence of objects that are uncommon in indoor environments, the SDR module ignores 68 of the 400 objects during recognition. The probability threshold of the SDR module is set to 0.5 to remove highly uncertain relations. When building color histograms, each axis is divided into 8 bins, so the size of the histogram is 512. For same node detection, w_label, w_color, and w_position are set to 0.375, 0.25, and 0.375, respectively, and the same node detection threshold is set to 0.5. Figure 7 shows an instance of generating a scene graph from the keyframes in the captured image group. With the continuous input of frames, the global scene graph becomes more complete. In the scene graph representation, the objects in the environment and the relations between them are clearly displayed.
According to the previously constructed topological map, eight sequences are selected from the ScanNet dataset as the videos obtained in eight different rooms, as shown in Table 1. We conducted comparative experiments on scene0010_00 to verify the effect of the object detection module, the KGE module, and the same node detection module. Table 2 presents the results. The first experiment serves as the reference for comparison. The same node detection threshold, anchor frame threshold, and object detection threshold are adjusted, and the number of anchor frames, the number of nodes, and the total time are compared. Table 2 shows that the larger the same node detection threshold, the fewer nodes are judged to be the same node, and the more nodes are finally obtained. At the same time, the more nodes there are to process, the more time it takes. The number of anchor frames is adjusted by the anchor frame threshold. If the overlap between a frame and the active anchor frame is less than the anchor frame threshold, the frame is judged to be a new active anchor frame. It can be seen that the larger the anchor frame threshold, the larger the number of anchor frames, the number of nodes, and the total time. The object detection threshold adjusts the number of objects detected in the image: the larger the threshold, the fewer objects are detected and the less total time it takes.
The results of SGG from the video indicate that the scene graph describes the environment in the form of JSON with less memory. As the accuracy of the object detector and the SGG module increases, the robot will establish a more accurate environment model, which will improve the robot's intelligence. The ABIR, SDR, and KGE modules effectively reduce the redundancy of the scene graph, which helps to select clearer images from the image group for generating the scene graph and improves the accuracy of object detection. Since there are many blurred images in the image group, there is no quantitative evaluation of the generated scene graph. Our analysis suggests that the performance of the SGG module can be improved in the following respects. First, the object detector should be trained with appropriate images and objects. In our experiment, the object detector is trained on the Visual Genome dataset. The images of this dataset include both indoor and outdoor scenes, and the 400 objects used in training may not always appear in indoor scenes. Thus, we need to set a reasonable object recognition threshold and selectively ignore some objects to eliminate the influence of uncertain objects. Second, clearer images should be captured. The images of each room are captured from the ScanNet video dataset. Although most of the blurred images are eliminated by the ABIR and KGE modules, images extracted from the video are inevitably not clear enough, which also affects the accuracy of the detector.

D. FUSION OF TOPOLOGICAL MAP AND SCENE GRAPH
In the simulation experiment, we select some sequences from the ScanNet dataset as the video of each room. If the video of each room instead has to be split from the exploration video using the topological map, the threshold for judging the video split timestamp must be computed. For instance, the speed of the TurtleBot3 is v = 0.2 m/s, the time interval for recording coordinate information during environment exploration is n = 2 s, and the distance between two adjacent points in the point group of the door node dictionary is b = 0.4 m. The threshold is computed from these values.
With the video slices of each room, the modules for generating a scene graph from a video are employed to get the scene graph of each room. The topological map and the scene graphs are finally integrated into one JSON file in the form of a dictionary. The key of the dictionary is the node ID from the topological map, and the value includes the topological information and the scene graph information. The fusion of the topological map and scene graph in this experiment is shown in Figure 8, in which each room node in the topological map is related to a scene graph.
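The dictionary layout described above can be illustrated with a short sketch. The field names (`topology`, `scene_graph`, `label`, `objects`, and so on) and the function name `fuse_tsm` are hypothetical; the text only fixes that the key is the topological node ID and that the value holds both kinds of information.

```python
import json
from typing import Any, Dict, Optional


def fuse_tsm(
    topological_nodes: Dict[str, Dict[str, Any]],
    scene_graphs: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Optional[Dict[str, Any]]]]:
    """Attach each room's scene graph to its topological node, keyed by node ID."""
    tsm: Dict[str, Dict[str, Optional[Dict[str, Any]]]] = {}
    for node_id, topo in topological_nodes.items():
        tsm[node_id] = {
            "topology": topo,
            # Door nodes have no scene graph; they are stored as null in JSON.
            "scene_graph": scene_graphs.get(node_id),
        }
    return tsm


if __name__ == "__main__":
    nodes = {
        "room_3": {"type": "room", "label": "living_room_1", "edges": ["door_3"]},
        "door_3": {"type": "door", "edges": ["room_3"]},
    }
    graphs = {
        "room_3": {
            "objects": [{"id": 0, "label": "sofa"}, {"id": 1, "label": "pillow"}],
            "relations": [{"subject": 1, "predicate": "on", "object": 0}],
        }
    }
    print(json.dumps(fuse_tsm(nodes, graphs), indent=2))
```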

V. CONCLUSION AND FUTURE WORK
In this article, we propose a scene semantic map construction framework to build TSM. The TSM is a combined representation of the topological map and the scene graph for improving the robot's capability of understanding the environment intelligently. In general, the topological map based on navigation behavior enables the robot to efficiently and quickly generate a global navigation route with a semantic description, while the scene graph preserving objects and relations between objects makes the representation of the scene more specific.
The purpose of the TSM is to record environmental information and assist the robot to realize interpretable reasoning for completing a multitude of human-robotic interaction tasks, such as question and answer.
The simulation experiments verify the effectiveness of the process for topological map construction and of the various modules for generating scene graphs from videos. These experiments suggest that the TSM is capable of modeling the environment with the navigation behavioral topological map and scene graphs, providing semantic navigation routes and describing the details of scenes. However, the framework for constructing the TSM is still immature. The TSM cannot be built in real time, so related applications, such as semantic question answering and semantic search, need to be based on a completed TSM. During the construction of the TSM, dynamic objects are not considered, which disturbs the generation of the global scene graph and even limits the use of the TSM construction framework. Since humans and animals are the main dynamic targets in the environment, we use the SDR module to avoid detecting these objects and thereby reduce the impact of dynamic targets on global SGG.
Future work is needed to expand the application fields of the TSM construction framework, including the detection and tracking of dynamic targets [11], real-time TSM construction, SGG with a knowledge graph [33], and accuracy improvement for SGG [47]. Additionally, future work will address semantic navigation, semantic question answering, and other human-robotic interaction tasks.
YU ZHANG received the B.Eng., M.S., and Ph.D. degrees in automatic control from the National University of Defense Technology (NUDT), Changsha, China, in 2004, 2007, and 2012, respectively. In 2012, he joined the College of Mechatronics and Automation, NUDT. He is currently an Associate Professor with the College of Intelligence Science and Technology, NUDT. He has directed five research projects. He has coauthored two books and has published over 30 papers in refereed international journals and academic conferences proceedings. His research interests include intelligence decisions, mission planning, automation and control engineering, and complex systems. His current research interests include goal recognition-based location and routing planning, multi-agent learning for cross-domain heterogeneous swarms, and graph representation learning-based combinatorial optimization for network analysis.
WEILIN YUAN received the B.Eng. and M.S. degrees in automatic control from the National University of Defense Technology (NUDT), Changsha, China, in 2016 and 2019, respectively, where he is currently pursuing the Ph.D. degree in control science and engineering.
His current research interests include intelligence decision-making and control, adversarial reasoning, and behavior game theory.