Hierarchical Landmark Policy Optimization for Visual Indoor Navigation

In this paper, we study the problem of visual indoor navigation to an object defined by its semantic category. Recent works have shown significant progress both in end-to-end reinforcement learning and in modular systems, but both approaches still need a big step forward before they become robust and practically applicable. To address insufficient exploration of the scenes and make exploration more semantically meaningful, we extend the standard task formulation and provide the agent with easily accessible landmarks in the form of room locations and their types. The availability of landmarks allows the agent to build a hierarchical policy and achieve a success rate of 63% on validation scenes in the photo-realistic Habitat simulator. In this hierarchy, the low level consists of separately trained RL skills, and the high level is a deterministic policy that decides which skill is needed at the moment. We also show that the trained policy can be transferred to a real robot. After a short additional training stage on a reconstructed real scene, the robot achieves up to 79% SPL when navigating to an arbitrary object.


I. INTRODUCTION
Autonomous navigation in semantically rich environments is one of the key components of building intelligent robotic systems.
This article focuses on indoor robots and their ability to fulfill a user's request by moving through a previously unseen environment to a target object. Classically, navigation problems are solved with simultaneous localization and mapping (SLAM) methods [1]. These methods produce an obstacle map, on top of which planners [2], [3] build a collision-free path to the goal.
These classical approaches show convincing performance when provided with multimodal input [4] in environments where the signal is not blocked [5]. On the other hand, with the appearance of fast simulators [6], [7], [8], large photo-realistic 3D datasets [9], [10], [11], and the latest progress in learned methods, training virtual robots in simulation has attracted a lot of interest in recent years. Learning approaches such as Reinforcement Learning (RL) [12] can adapt to almost any task in a simulator. The downside of RL is that it needs to be trained from scratch for each task setup, and these algorithms are hard to deploy in a real navigation scenario [13]. This paper aims to overcome this issue and show that RL can be practically usable in real-world conditions, with sensor noise present, on the challenging navigation task of finding objects.
To achieve this, we first train our agent in the high-framerate Habitat simulator [8] and then smoothly transfer the policy to a real robot (Fig. 1). Habitat is an open-source simulator that consumes the photo-realistic Matterport dataset [10] (among many others) and builds an environment upon it.
The Habitat authors also state that mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence [14] and have started the Habitat Challenge to make progress in this field.
We choose the ObjectNav task formulation from the Habitat Challenge track as our test platform. ObjectNav is defined as the task of navigating to an object of a given semantic type in an unseen environment [15]. During the 2021 challenge, an end-to-end Reinforcement Learning (RL) algorithm became the state-of-the-art solution, showing a success rate of 23% on the test-standard phase [16]. This indicates that the ObjectNav task is far from being solved, yet progress has not stagnated. Further improvement of the metrics is blocked by the unsolved episodes, which take place in extensive areas that are hard to explore without prior information about their semantic structure. We assume that such information is easy for humans to annotate and remains unchanged even in areas where objects move often. Therefore, the agent should have a hierarchical structure to use this spatial knowledge efficiently.
To summarize, our contributions include:
• Task formulation with landmarks. When humans solve an indoor exploration task to find an object, they strongly rely on an understanding of the rooms in a concrete scene and can predict the types of objects inside. To let the agent learn this concept, we give it landmarks (G) in the form of the center coordinates of all rooms and their types (Fig. 2). This information can easily be annotated by a human, as opposed to a full map. Note that the obstacle map and the objects inside the rooms remain unknown to the agent.
• Dividing the policy into a set of skills. As the agent's policy, we distinguish three basic skills: navigation to a point, exploration of the nearby area, and reaching a seen object. The agent was trained in each of these skills separately using a policy optimization approach.
• Hierarchical structure. To efficiently use the given list of landmarks, we built a skill selector module that at each time step navigates the agent toward the most promising landmark zone by selecting one of the presented skills.
• Smooth policy transfer to new real-world scenes.
To make the policy transfer possible and predictable, we first 3D-reconstructed our scene into the simulator. After a small stage of adapting the neural networks to it and making sure the agent navigates safely in it, we conducted a real test and observed agent behavior similar to that in the simulator. Since our robot relies only on an RGB image, depth and semantics were obtained with neural network methods.

II. RELATED WORK
As with most navigation tasks, the ObjectNav task can be solved with SLAM and deterministic planning methods. As a result, the agent builds an occupancy map and a collision-free path to the goal. Although the goal object's coordinates are unknown, methods such as Frontier-Based Exploration (FBE) [17] are often used. A frontier is defined as the boundary between explored free space and unexplored space. Frontier-based exploration essentially samples points on this frontier as goals to explore the space. If, during this exploration, the agent sees an object of the goal type, it navigates to it directly. As one of our baselines (ExploreTillSeen), we followed this strategy and built a system of two RL agents: one explores the area, and the other navigates to the goal object once it is seen. A big breakthrough of learning approaches in navigation tasks was the DDPPO method [18], which uses Proximal Policy Optimization [19] at its core. Without any mapping or planning modules, DDPPO was able to perform 2.5 billion steps in the environment and solve the PointNav task at human-level performance. The authors achieved that by giving the agent inputs directly from an RGB-D camera and a GPS+Compass sensor. To optimize the training process, the authors investigated the most efficient neural network architecture [20], which we also use in our approach.
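To make the frontier notion concrete, below is a minimal sketch of frontier goal sampling on a 2-D occupancy grid. The grid encoding and the nearest-frontier selection rule are our assumptions for illustration, not the exact procedure of [17] or of our ExploreTillSeen baseline.

    import numpy as np

    FREE, OBSTACLE, UNKNOWN = 0, 1, -1  # assumed occupancy-grid encoding

    def frontier_goal(grid, agent_cell):
        """Return the frontier cell nearest to the agent: a free cell adjacent to unknown space."""
        frontiers = []
        h, w = grid.shape
        for y in range(h):
            for x in range(w):
                if grid[y, x] != FREE:
                    continue
                neigh = grid[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
                if (neigh == UNKNOWN).any():      # boundary between explored and unexplored space
                    frontiers.append((y, x))
        if not frontiers:
            return None                           # map fully explored
        dists = [np.hypot(y - agent_cell[0], x - agent_cell[1]) for y, x in frontiers]
        return frontiers[int(np.argmin(dists))]  # the chosen frontier becomes the next exploration goal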
Previous works have shown that, on the ObjectNav task, pure end-to-end RL algorithms that use vanilla visual and recurrent modules perform poorly due to overfitting and sample inefficiency. The authors of the Auxiliary task RL method [16] partially addressed this by adding auxiliary learning tasks and an exploration reward during the training phase.
Another promising approach to the ObjectNav task is to mix analytic and learned components and operate on explicit spatial maps of the environment. One such combination of classical and learned methods, SemExp [21], showed the best result on ObjectNav during the Habitat Challenge 2020. The authors use a deterministic mapping module and divide the policy into a global policy, which outputs a short-term subgoal by planning on the map, and a local policy, which pursues that subgoal. Inspired by this work, we also build a two-level policy, but our low level consists of several independent skills, and the high level switches between them depending on what the agent needs to do at a given time.
The idea of using landmarks as regions of interest in navigation tasks is also described in [22], [23]. The first method, HIGL, also divides the policy into a high level and a low level, where the high level generates a subgoal toward landmarks and the low level reaches it. Landmarks are sampled from previously visited states based on "coverage" and "novelty" criteria. In general, we found this idea suitable, but instead of sampling from previous trajectories (our agent must navigate in previously unseen scenes), we utilize the semantic structure of indoor scenes and define landmarks as the coordinates of room centers. The policy of choosing among landmarks is based on the distance to them and on statistics of object occurrence relative to room types.

III. TASK SETUP
The indoor object navigation task is defined as the task of navigating to an object (specified by a semantic label) in a previously unseen environment [24]. In practice, the agent is initialized at a random pose in an environment and aims to find an instance of an object category I_goal ∈ {c_1, c_2, ..., c_20} (for example, a couch) by navigating to it. This interaction is formally described by a Markov decision process (MDP), defined by the set of states S, the set of actions A (forward, turn left, turn right, and stop), the distribution of initial states p(s_0), the reward function r : S × A → R, the transition probabilities p(s_{t+1} | s_t, a_t), the termination probabilities T(s_t, a_t), and the discount factor γ ∈ [0, 1]. The agent receives a semantic mask of the goal type of object through the semantic segmentation module (I_sem = Φ_semantic(I_RGB, I_goal)).
During evaluation, the agent can only use the input from the RGB-D camera (I_RGBD), the GPS+Compass sensor (I_GPS), and the list of landmarks (G) for navigation. The list of landmarks contains the center coordinates of all rooms and their types; it carries no information about the map, the objects inside, or how to navigate to those rooms.
Evaluation occurs when the agent selects the stop action. As metrics, the Success rate weighted by Path Length (SPL) and the Success rate are used. SPL is computed with respect to the object instance closest to the agent's start location:

$$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \, \frac{l_i}{\max(p_i,\, l_i)},$$

where S_i is the binary success indicator, l_i is the length of the shortest path from the start position to the closest goal instance in episode i, and p_i is the length of the path actually taken by the agent in that episode. Thus, if an agent spawns very close to chair1 but stops at a distant chair2, it will achieve 100% success (because it found a "chair") but a fairly low SPL (because the path to chair2 is much longer than the path to chair1). More specifically, an episode is deemed successful if the agent calls the stop action within 1.0 m Euclidean distance from any instance of the target object category and an oracle can view the object from that stopping position by turning the agent or looking up/down.
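For clarity, a minimal sketch of this metric computed over a set of evaluation episodes (the variable names are ours):

    def spl(successes, shortest_paths, agent_paths):
        """Success weighted by Path Length over a set of episodes.

        successes[i]      -- 1 if episode i ended within 1.0 m of a goal instance, else 0
        shortest_paths[i] -- l_i, geodesic distance from the start to the closest goal instance
        agent_paths[i]    -- p_i, length of the path actually taken by the agent
        """
        total = 0.0
        for s, l, p in zip(successes, shortest_paths, agent_paths):
            total += s * l / max(p, l)  # detours are penalized; max() guards the case p < l
        return total / len(successes)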

IV. METHODS
We propose a landmark-based modular framework (Fig. 3) for navigation to an object goal (I_goal): Hierarchical Landmark Policy Optimization (HLPO). The framework consists of three main modules: the skill selector π_selector, the data preprocessing functions Φ_semantic and Φ_depth, and the skill policies {π_explore, π_reacher, π_pointnav}.
The skill selector module is a deterministic strategy that sets a subtask at each step and selects the skill required to complete it.
The process of managing skills could be treated as a policy and integrated into the global policy module. Typically, HRL methods learn a two-level policy Π. Each level learns a policy π_i : S_i, G_i → A_i, where G_i is the set of possible sub-goals. To learn these policies π_i, a set of MDPs U_0, U_1, in which U_k = (S, G, A, T, R, γ), is used. However, learning multiple levels of policies in parallel is problematic because such learning is inherently unstable. For the object goal task, the sequence of skills can be formulated explicitly.
During the training phase, π_selector analyzes all Matterport scenes to connect object types to room types and builds statistics from them (Fig. 4). Thanks to this, we know, for example, that a sofa is located in the living room with a probability of 70%, and with 10% each in other room types such as the bedroom and the meeting room. During the execution of an episode, the selector works as follows. For each given landmark, at each step we calculate a visit importance coefficient equal to the probability of finding an object of the target type in the given room type, divided by the Euclidean distance to the center of the given room. π_selector then chooses the landmark with the highest importance coefficient and determines which skill should be used to complete the task. If the agent is outside the selected landmark, π_selector chooses π_pointnav to reach it. Once the agent enters the landmark zone, π_selector activates π_explore until the agent leaves the landmark zone. If at any moment Φ_semantic sees an object of the target type, π_selector activates the π_reacher skill to reach it (Algorithm 1).

Algorithm 1: HLPO episode loop
1: while the episode is not finished do
2:   I_skill_type, I_room_coord ← π_selector(G, I_goal, I_GPS)
3:   I_sem ← Φ_semantic(I_RGB, I_goal)
4:   if I_skill_type = PointNav then
5:     a ← π_pointnav(I_RGBD, I_room_coord, a_prev)
6:   end if
7:   if I_skill_type = Explore then
8:     a ← π_explore(I_RGBD, a_prev)
9:   end if
10:  if I_skill_type = GoalReacher then
11:    a ← π_reacher(I_RGBD, I_sem, a_prev)
12:  end if
13:  Execute action a in the environment
14: end while
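A minimal sketch of the selector heuristic described above is given below; the landmark-zone radius and the format of the room-statistics lookup are our assumptions for illustration.

    import math

    def select_skill(landmarks, goal_class, agent_xy, object_room_stats,
                     goal_mask_nonempty, inside_radius=2.0):
        """Deterministic skill selector sketch.

        landmarks          -- list of (room_type, (x, y)) for every room center
        object_room_stats  -- P(goal_class is found in room_type), estimated from train scenes
        goal_mask_nonempty -- True if the semantic module currently sees the goal class
        inside_radius      -- assumed radius (m) that defines being inside a landmark zone
        """
        if goal_mask_nonempty:
            return "GoalReacher", None              # reach the seen goal object

        best, best_score = None, -1.0
        for room_type, (x, y) in landmarks:
            dist = math.hypot(x - agent_xy[0], y - agent_xy[1])
            # visit importance: probability of the goal in this room type over distance
            score = object_room_stats.get((goal_class, room_type), 0.0) / max(dist, 1e-6)
            if score > best_score:
                best, best_score = (x, y), score

        dist_to_best = math.hypot(best[0] - agent_xy[0], best[1] - agent_xy[1])
        if dist_to_best > inside_radius:
            return "PointNav", best                 # navigate to the most promising landmark
        return "Explore", best                      # explore inside the landmark zone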
The data preprocessing module consists of two neural networks that perform semantic segmentation and depth reconstruction. Our algorithm uses the semantic segmentation model to get the mask of the goal object, which serves as input to the policy. We take the SOLOv2 [25] approach for the categories that correspond to the COCO dataset [26] and RedNet [27] for the others. RedNet was trained on images collected from the MP3D dataset and takes an RGB-D image as input. The SOLOv2 model uses the weights pretrained by its authors and relies only on an RGB image. SOLOv2 produced far less noise in our experiments, which is crucial for our algorithm because of the skill separation: if a false-positive detection appears, the policy switcher fires at the wrong time.
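The routing between the two models can be sketched as follows; the wrapper interface (callables returning per-class boolean masks) and the class list are our assumptions and not the original APIs of SOLOv2 or RedNet.

    import numpy as np

    COCO_CLASSES = {"chair", "couch", "bed", "toilet", "sink", "dining table"}  # illustrative subset

    def goal_mask(rgb, depth, goal_class, solo_v2, rednet):
        """Φ_semantic sketch: route the goal class to the better-suited segmentation model."""
        if goal_class in COCO_CLASSES:
            masks = solo_v2(rgb)                      # RGB-only model, pretrained on COCO, less noisy
        else:
            masks = rednet(np.dstack([rgb, depth]))   # RGB-D model, fine-tuned on MP3D categories
        mask = masks.get(goal_class)
        if mask is None:
            return np.zeros(rgb.shape[:2], dtype=bool)  # goal not visible in this frame
        return mask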
Although we use a ground-truth depth sensor for all training and test phases in the simulator, we do not have that option on the Husky robot, so we reconstruct depth from the RGB image [28].
The skill policies module takes the agent's observation as input and outputs an action that pursues the currently needed skill. For our task, we identified three skills: point navigation, exploration, and goal reaching. In our experiments, the PointNav skill executes 50% of the time during an episode, the Explore skill 22%, and the GoalReacher skill 28%.
The PointNav skill was trained to reach a landmark zone (room of interest). The reward is proportional to the decrease in distance to the goal coordinate. As input, it takes an RGB-D image and the polar coordinates of the center of the landmark zone. Its main goal is to perform the 'stop' action when the agent is within 0.1 m of the goal coordinates (Fig. 5).

The Exploration skill determines the task's success more than the others, especially when no landmarks are available. We trained an RL policy that effectively explores the nearby area (Fig. 7), so it fits perfectly for exploring a room completely but has a lower coverage percentage on big scenes. As a reward, we use the increase in the explored area.

The GoalReacher skill was trained to reach the goal-type object once the semantic sensor sees it (Fig. 6). Its main goal is to perform the 'stop' action when the agent is within 1 m of the goal-type object. To avoid overfitting, we generate episodes by initializing the agent at a random place and then asking it to reach the farthest seen object of the goal type. The reward is again proportional to the decrease in distance to the seen goal-type object.
The agent's goal is to find the policy π that maximizes the expected return E_π[R_0 | s_0]. The expectation is taken over the initial state distribution, the policy, and the environment transitions according to the dynamics specified above. The action-value function (Q-function) of a given policy π is defined as Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]. The state-value function (V-function) is defined as V^π(s_t) = E_π[R_t | s_t]. The advantage is defined as A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t) and indicates whether action a_t is better than the average action the policy π takes in state s_t.
In practice, the policy and the value function are represented by two neural networks. The first represents the current policy π, and the value network approximates the current policy's value function, V ≈ V^π.
The policy neural network is a two-head network: one head for the action distribution (actor) and the other for the state-value estimate (critic). The actor stream is a single FC layer that outputs logits for each of the four actions; the action to execute is sampled from a categorical distribution over these logits. The critic stream is an FC layer that outputs the value of the given state.
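A minimal PyTorch sketch of such a two-head network; the feature dimension and the visual backbone producing the features are assumptions, not the exact architecture from [20].

    import torch.nn as nn
    from torch.distributions import Categorical

    class ActorCritic(nn.Module):
        """Two-head policy/value network; the ResNet+LSTM backbone is abstracted away."""

        def __init__(self, feature_dim=512, num_actions=4):
            super().__init__()
            self.actor = nn.Linear(feature_dim, num_actions)  # logits over forward/left/right/stop
            self.critic = nn.Linear(feature_dim, 1)           # state-value estimate

        def forward(self, features):
            dist = Categorical(logits=self.actor(features))   # categorical action distribution
            value = self.critic(features).squeeze(-1)
            return dist, value

    # usage: features come from the recurrent visual backbone (not shown);
    # dist, value = model(features); action = dist.sample()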
To compute the return, we use a generalized advantage estimator (GAE) with γ = 0.99 and τ = 0.95.
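For reference, the generalized advantage estimator with these coefficients takes the standard form (reproduced here, not taken from the paper):

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\tau)} = \sum_{l=0}^{\infty} (\gamma\tau)^{l}\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

with the return target recovered as R_t = Â_t + V(s_t).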
For the policy loss function, we use proximal policy optimization (PPO) [19], which still remains the SOTA solution in RL for a vast range of tasks. In our method, the RL agent is trained with gradient-based optimization of the policy objective. Given a θ-parameterized policy π_θ and a set of trajectories collected with it (commonly referred to as a "rollout"), the agent updates π_θ as follows. Let Â_t = R_t − V̂_t be the estimate of the advantage, where R_t = Σ_{i=t}^{T} γ^{i−t} r_i, V̂_t is the expected value of R_t, and r_t(θ) is the ratio of the probability of the action under the current policy to its probability under the policy used to collect the rollout. The parameters are then updated by maximizing the clipped surrogate objective

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right].$$

As the parallelization method, we utilize decentralized distributed proximal policy optimization (DD-PPO) [18]. The synchronous method differs from the asynchronous one in that every worker collects experience from the environment and simultaneously contributes its updates to the shared model together with all other workers. Workers that take too long to step the simulator are preempted so that the rest do not wait for them. As a general abstraction, the method implements the following: at step k, worker n has a copy of the parameters θ_n^k, calculates the gradient ∂θ_n^k, and updates θ via

$$\theta_n^{k+1} = \mathrm{PU}\!\left(\theta_n^k,\ \mathrm{AR}\!\left(\partial\theta_1^k, \ldots, \partial\theta_N^k\right)\right),$$

where PU is any first-order optimization technique (e.g., gradient descent), and AR performs a reduction (mean) over all copies of a variable and returns the result to all workers.
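A minimal sketch of the clipped surrogate in PyTorch; the clip coefficient value is an assumption.

    import torch

    def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        """Clipped PPO surrogate (to be maximized); tensors are flattened over the rollout."""
        ratio = torch.exp(log_probs_new - log_probs_old)               # r_t(theta)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return torch.min(unclipped, clipped).mean()                    # negate to use as a loss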

V. EXPERIMENTAL SETUP
Setup. To show the performance of our HLPO method in the Habitat simulator, we have singled out two datasets: test and validation.

FIGURE 6. Explored area (m²) vs. steps during the training phase of the Exploration skill. Higher is better.
[Figure: GoalReacher task training curve.]
For the test dataset, we manually picked 100 episodes from 11 scenes that are complex and large enough to show that exploring them without landmark information results in low performance. These episodes have a geodesic distance to the closest goal of no less than 5 m. As for indoor object categories, we used 10 in our experiments: a bed, a cabinet, a chair, a chest of drawers, a counter, a cushion, a sink, a sofa, a table, and a toilet. We ran all experiments three times, with a dispersion of the results of no more than 0.05.
To demonstrate that our method outperforms current techniques, we selected five baselines that do not use landmark information but, during the training phase, had the goal of natively learning the semantic structure of the scenes.
• DDPPO — We trained an end-to-end DDPPO [18] agent whose inputs are the depth + GPS + ground-truth semantic sensors, with a reward proportional to the geodesic distance to the goal object closest to the start. A ResNet50 backbone and an LSTM layer were used.
• SemExp — SemExp [21] is a SOTA modular algorithm that incorporates both RL and planning and showed the best performance at the Habitat Challenge 2020. We use the authors' weights, but because of differences in the goal object classes, we give the agent only the goal class and set the others to zero.
• Planning — A combination of planning modules that updates the obstacle map and explores it at every step. When the semantic module sees the goal object, it is marked on the map, and the agent navigates to it through planning.
• Auxiliary RL — A generic learned policy [16] with auxiliary learning tasks and an exploration reward. It is the SOTA end-to-end algorithm that showed the best performance at the Habitat Challenge 2021.
• ExploreTillSeen — A combination of two RL skills: one explores the area (Explore skill) until the semantic sensor sees the target, and the other follows the seen goal object (GoalReacher skill).
As a result, the approach using only the Explore and GoalReacher skills showed performance similar to the SOTA methods, while HLPO almost doubled it, indicating that our choice of landmark information was right and that the agent could not learn it on its own.
To show that sensor and action noise does not spoil the performance, we added it during the training phase, so it is not a problem during validation either (HLPO (Noise) row in Table 1). The HLPO (Map) row shows that additional performance could be squeezed out if the agent had access to a full map of the scene, but the gain does not justify the difficulty of collecting such a map. For the validation phase, we take 1,126 episodes out of 2,195 from all 11 scenes, filtering out episodes that do not contain at least three goal-type objects on the same floor level as the agent's start (Table 2).
To illustrate the need for landmarks, we give an example of ExploreTillSeen versus HLPO trajectories (Fig. 12). The ExploreTillSeen example shows that its exploration capabilities are already at a high level and we cannot squeeze more out of them, yet its trajectories are far from optimal. We do not want, for example, a real robot that navigates through every room of the apartment each time it searches for a sink. The HLPO trajectories are also not optimal; we could improve them by giving the agent the entire map, but such a map is hard to collect and would need to be recollected after each scene change. Considering all of this, we think landmarks are the right trade-off.
There are several ways to train agents to navigate, from training in real-world scenarios to fully simulated environments. The former is too inefficient, as it needs a lot of resources and can cause a lot of harm while the policy is not yet optimal. The latter is comfortable to work with and can yield outstanding results, but our overall goal is to navigate in real-world scenes. In this case, transferring a policy from a simulated environment to an actual robot can be even more challenging than the training itself: in the simulator, the agent does not suffer from many real-world sensor imperfections.
As an intermediate approach, we use a photo-realistic environment whose scenes are reconstructed with lidars and cameras. We train our agent on as many scenes as we can get from the Matterport dataset. But scene differences across the dataset are large, and 60 scenes are not enough to capture the properties of an arbitrary scene. That is what happened in our case in the laboratory area: long identical corridors turned out to be underrepresented in the dataset, and the agent showed far from optimal behavior inside them (Fig. 9). Since we cannot capture many scenes of this kind, we chose to overfit our agent on our reconstructed scene before the real tests. For indoor navigation, we find this acceptable in cases like ours, because such a calibration is needed once per type of scene, and a user of our system can download the weights suitable for their kind of scene.
To reconstruct the scene, we first looked at how it was done for the MP3D dataset. The dataset creators used the Matterport Pro2 camera (134 megapixels, no lidar) and the proprietary Matterport software, where one uploads photos and they are automatically processed into a final .obj file. That file can be converted to .glb and used by the Habitat simulator with no extra effort. This works fine unless the scene contains small details; the MP3D dataset itself has a lot of holes and texture inconsistencies. We wanted higher quality, especially a more precise depth reconstruction, so that the agent can navigate predictably in narrow rooms. Our solution was to use a professional laser scanner, the Leica RTC360 3D. An additional upside is that we can manually edit our shots, delete people in the background, and manually check the quality of the assembled scene. To texture the final point cloud, we use the RealityCapture program (Fig. 11).

TABLE 3. Success and distance to goal before and after adaptation.

Method                      Success    Distance to goal (m)
HLPO (before adaptation)    0.6        11.8
HLPO (after adaptation)     0.9        0.15

Our reconstructed laboratory scene also allowed us to calibrate the depth network (used in place of the simulator depth during the training phase, Fig. 8) and to set proper clamps, so that the predicted depth closely matches the simulator depth. We then collected episodes and retrained the agent on them for one million steps (Fig. 10). Through this, we took the peculiarities of the scene into account.
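A sketch of the clamping step we mention, assuming a fixed working depth range normalized to [0, 1] as in the simulator; the exact range values are our assumptions.

    import numpy as np

    MIN_DEPTH, MAX_DEPTH = 0.5, 5.0   # assumed working range, in meters

    def clamp_and_normalize(depth_m):
        """Clamp predicted depth to the range seen in simulation and normalize to [0, 1]."""
        depth = np.clip(depth_m, MIN_DEPTH, MAX_DEPTH)
        return (depth - MIN_DEPTH) / (MAX_DEPTH - MIN_DEPTH)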
After these adaptations, the success rate increased from 0.6 to 0.9 (Table 3). In the real test on our Husky robot, we obtained an SPL of 0.79, which indicates that our HLPO method performs equally well in the simulator and in real-world tests.

VI. DISCUSSION AND CONCLUSION
We propose a novel approach to the ObjectGoal navigation task. With the standard task formulation, existing methods are limited: on large scenes, exploration without any information about the scene takes an unreasonably long time. To address this, we propose landmarks in the form of a list of room coordinates and their types.
With our updated task formulation, we have built a novel hierarchical policy that uses skills that can be stacked and reused in various navigation tasks without changes. The success rate of our method doubles relative to the state-of-the-art method: from 20% to 46% with learned semantics, and from 51% to 86% with ground-truth semantics from the simulator. To accomplish the ObjectGoal task, we trained the agent in three skills: PointNav, Exploration, and GoalReacher.
To make the sim-to-real transfer process possible and predictable, we described how to reconstruct a scene into the simulator with decent photo-realistic quality using a professional Leica RTC360 scanner.
In future work, we plan to build a global policy that automatically selects skills for more complex tasks and to move from discrete to continuous actions to control wheel-based robots more effectively.