Guided Reinforcement Learning: A Review and Evaluation for Efficient and Effective Real-World Robotics [Survey]

Recent successes aside, reinforcement learning (RL) still faces significant challenges in its application to the real-world robotics domain. Guiding the learning process with additional knowledge offers a potential solution, thus leveraging the strengths of data- and knowledge-driven approaches. However, this field of research encompasses several disciplines and hence would benefit from a structured overview.


INTRODUCTION
RL is a promising approach for solving decision-making problems in a humanlike fashion through trial-and-error interactions with the environment [121]. In recent years, RL has demonstrated remarkable progress on a variety of challenging tasks, from classic strategy and real-time computer games [14] to the robotics domain [5]. It has been applied to continuous control problems [74], including legged locomotion [64], [113], [119], robot navigation [23], [43], [54], and dexterous manipulation [17], [52], [98]. These success stories built on the data-driven trial-and-error nature of the approach to freely explore the search space.
However, learning control policies in such a way naturally requires many interactions with the environment. This emphasizes the importance of both collecting high-quality samples and exploring the search space in a sample-efficient manner. While directly learning on real robots is appealing, it comes along with substantial challenges, such as high sample cost, partial observability, and safety constraints [28]. Hence, simulators are often adopted as scalable training environments, avoiding safety issues found in the real world. Training robots in simulation is faster, cheaper, and safer, but deploying these policies to a physical robot can fail due to a mismatch between the simulated and real worlds, also known as the reality gap [144].
Combining data- and knowledge-driven approaches in a hybrid fashion can be a potential solution to address these challenges. Von Rueden et al. [126] propose an abstract concept for informed machine learning, where prior knowledge is directly integrated into learning systems. They introduce a taxonomy as a classification framework in this field that considers the knowledge source, its representations, and its integration into the machine learning pipeline. Building on this work, hybrid approaches may also be a promising avenue to explore for RL in the real-world robotics application domain.
Related to robotics, several lines of research have emerged toward more efficient exploration of the search space and effective policy deployment for real-world systems. For instance, dedicated algorithms have been developed that lead to improved sample efficiency [4], [10], [116]. Demonstration data have been used to accelerate RL approaches [17], [68], [131]. Carefully selecting task-specific state representations, reward functions, and action spaces can improve both the time to convergence and performance [82], [86], [119]. RL approaches can also be combined with classical control to learn in state spaces of lower complexity [27], [136]. Finally, integrating knowledge about the learning task structure has been found to improve performance and accelerate convergence [96], [139]. The high variety of approaches resulting from different disciplines impedes a wide-ranging understanding of the state of the art in learning control policies for real-world robotics and highlights the necessity of a structured overview.
Recent surveys provide partial overviews of the field. For example, [142] highlights strategies to improve the sample efficiency in RL in a general manner, while [61] focuses its application on the robotics domain. Von Rueden et al. [127] analyze how machine learning and simulation can be combined in a hybrid modeling approach. Another survey [144] puts special emphasis on sim-to-real transfer methods for robotics. Dulac-Arnold et al. [28] outline unique challenges for real-world RL. Finally, a recent case study [47] provides valuable hands-on insights for successful real-world policy deployment. Our work supplements the preceding by providing a systematic overview of integrating knowledge into the RL pipeline to increase both efficiency and effectiveness for real-world robotics.
In this work, we propose a concept of guided RL that provides an intuitive approach to accelerate the training process and improve performance for real-world robotics settings. We introduce a taxonomy that classifies guided RL approaches and shows how different sources of knowledge can be integrated into the learning pipeline in a practical way. Furthermore, we describe available approaches in this field and quantitatively evaluate their specific impact in terms of efficiency, effectiveness, and sim-to-real transfer within the robotics domain.
The article is structured as follows. In the "Concept of Guided RL" section, we introduce our concept of guided RL and provide a connection to related areas. The "Taxonomy" section presents the taxonomy and its central building blocks on a conceptual level. Based on this taxonomy, we classify a large number of recent research papers in the "Description of Methods" section. The "Evaluation of Approaches" section presents a quantitative evaluation of the most common methods used in guided RL. Finally, we discuss challenges and future directions in the "Discussion of Challenges and Directions" section and conclude in the "Conclusion" section.

CONCEPT OF GUIDED RL
In this section, we present our concept of guided RL with its definition, the overall goal for efficient and effective real-world robotics deployment, and a link to adjacent research areas.

DEFINITION
Guided RL describes the integration of additional knowledge into the learning process to accelerate and improve success for real-world robotics deployment. Figure 1 presents the information flow of guided RL. The additional knowledge can be integrated at different stages of the RL pipeline: the problem representation, the learning strategy, task structuring, and sim-to-real transfer methods. For a detailed discussion of the pipeline, see the "RL Pipeline" section.

EFFICIENT AND EFFECTIVE LEARNING
Accelerating the success of real-world robotics deployment involves learning in both an efficient and an effective manner and forms the central goal of guided RL, as detailed in Figure 2. Based on the metrics frequently used in the literature [32], [43], [52], [82], [106], [111], [113], we adopt the following definitions:
■ Definition 2.1 (efficiency): A training process is considered more efficient if it requires fewer interactions with the environment or less time to converge than the baseline.
■ Definition 2.2 (effectiveness): A training process is considered more effective if the performance of a policy in terms of the total return or success rate is higher compared to the baseline.
■ Definition 2.3 (sim-to-real): A training process is considered sim-to-real if a simulation is adopted for training policies and evaluating methods before real-world deployment.
While efficient and effective policy training as well as real-world robotics deployment form the natural dimensions of guided RL, combining these three dimensions constitutes the key motivation. For an in-depth evaluation of available approaches in this direction (see the "Evaluation of Approaches" section), we finally introduce the following term:
■ Definition 2.4 (guided RL compliance): A training process is considered fully guided RL compliant when improvements are achieved across all three dimensions of efficiency, effectiveness, and sim-to-real.

RELATED AREAS
This study focuses on guided RL, which integrates prior knowledge directly into the learning pipeline to accelerate success for real-world robotics. Hence, this review article is located at the intersections of deep RL, robotics, and simulation. There are several related lines of research, which we do not explicitly consider in the context of this study. Selecting model-based and off-policy algorithms has been found to improve sample efficiency compared to on-policy algorithms [47]. Also, tuned hyperparameters of the learning algorithm tend to improve the overall policy performance [45]. Furthermore, learning several tasks at once, as done in multitask learning, can lead to more efficient training [56]. In the same manner, research efforts in the field of meta learning aim at solving unseen tasks fast and efficiently [34]. However, all the methods we distilled in the field of guided RL are agnostic with respect to the choice of the algorithm (e.g., [32], [41], [74], [89], and [117]) and learning task, such as locomotion, navigation, and manipulation. In particular, we do not aim to define strict distinctions among efficient, effective, and guided RL. Instead, our central motivation is to review existing approaches and distill a structured overview of recent research directions that hopefully strengthens the connection between the RL and robotics communities.

TAXONOMY
In this section, we introduce a taxonomy for guided RL (Figure 3). Based on the concept of informed machine learning [126], we structure the taxonomy according to the knowledge source, methodical representation, and integration into the pipeline. Here, we introduce the central building blocks of the taxonomy on a conceptual level, while an extensive categorization of approaches is presented in the "Description of Methods" section.

KNOWLEDGE SOURCE
Three types of prior knowledge form the basis for most guided RL methods. These knowledge sources can be roughly categorized into scientific knowledge, world knowledge, and expert knowledge. As detailed by [126], the sources range from formalized to intuitive knowledge and are briefly described in the following.

SCIENTIFIC KNOWLEDGE
Scientific knowledge is formalized and has its origin, e.g., in physics, biology, or engineering. This type of knowledge can be validated through experiments and empirical analysis. Scientific knowledge can be used to develop realistic simulators and to integrate findings from biology into the learning process, for instance.

WORLD KNOWLEDGE
World knowledge is either formalized or intuitive and considers facts from, e.g., everyday life. Consequently, this knowledge is held by a large group of people. In the context of guided RL, for instance, world knowledge may be used to design intuitive observation and action spaces and to integrate a natural structure of the learning task.

EXPERT KNOWLEDGE
Expert knowledge is available to a special group of experienced professionals with a strong connection to the robotics and RL domains. Such knowledge is rather informal and typically plays a key role in engineering design decisions. For example, expert knowledge is integrated when formalizing an RL problem and may be used to design an overall learning strategy.

GUIDED RL METHODS
This category is the key component of our taxonomy, as it connects directly to the RL pipeline (see Figure 1) and, hence, robotic applications. Here, we provide a first conceptual overview of these methods, while the "Description of Methods" section provides a detailed description of the most frequent approaches.

STATE REPRESENTATION
State representation describes the observable space for the model, where approaches typically aim to transform or extend the state into more instructive representations.

REWARD DESIGN
Reward design includes techniques to induce knowledge by means of designing appropriate dense reward functions and automatic learning approaches.

ABSTRACT LEARNING
Abstract learning describes the selection of a task-specific action space for a robotics problem that potentially can be hybridized with model-based approaches.

OFFLINE RL
Offline RL focuses on using offline data and tries to efficiently learn policies through RL from recorded training sets.

PARALLEL LEARNING
Parallel learning deals with the parallelization of the algorithmic components while balancing the scalability and robustness of the learning process.

LEARNING FROM DEMONSTRATION
Learning from demonstration leverages example trajectories, both online and offline, and focuses on distilling them into the trained policy.

CURRICULUM LEARNING
Curriculum learning is based on the idea of structuring a complex task by iteratively solving simpler tasks with increased levels of difficulty.

HIERARCHICAL RL
Hierarchical RL exploits the hierarchical structure underlying the learning task to solve different subtasks and deploy high- and low-level policies.

PERFECT SIMULATOR
The perfect simulator aims at building more realistic simulation environments in terms of accurate robot models, physics computation, and environment representation.

DOMAIN RANDOMIZATION
Domain randomization strives to make policies more robust by highly randomizing the simulation in terms of either visual or dynamics properties.

DOMAIN ADAPTATION
Domain adaptation approaches typically condition an adaptation module to transfer observations between the simulated and real worlds and vice versa.

RL PIPELINE
From our extensive literature review, we find that an applied RL pipeline for real-world robotics can be structured according to four components, namely, the problem representation, learning strategy, task structuring, and sim-to-real methods (see Figure 1). Within each of these iterative pipeline steps, additional knowledge can be integrated by means of guided RL methods.

PROBLEM REPRESENTATION
Representing a real-world robotics problem in the formal description underlying RL typically requires a large amount of knowledge. A key challenge is to appropriately select observations, define a reward function, and specify the action space of an agent for a desired learning task. Moreover, choosing suitable training data requires a careful assessment between real-world and synthetic data.

LEARNING STRATEGY
Integrating expert knowledge into the learning strategy can be done by deploying parallel learning architectures for a given problem, casting it as an online or offline learning problem, and utilizing real or synthetic demonstration samples.

TASK STRUCTURING
Depending on the complexity of the real-world robotics problem, further knowledge can be integrated in the sense of meaningfully structuring the learning task. For instance, a complex task could be learned sequentially with increased levels of difficulty as well as by decomposing it into several subtasks.

SIM-TO-REAL METHODS
Finally, additional knowledge can be used to accelerate the success of real-world robotics deployment by reducing the discrepancies between the simulated and real worlds. For example, scientific knowledge can be used to tune the simulation environment, and world knowledge, by means of domain randomization, could increase the robustness of the policy training process.

DESCRIPTION OF METHODS
In this section, we provide a detailed overview of the guided RL approaches we identified in our literature review (see Table 2). We structure our description according to the methods of the introduced taxonomy (see the "Taxonomy" section) since they form the natural connection between knowledge sources and practical applications.

STATE REPRESENTATION
Choosing the right state representation is an important aspect of solving learning tasks since it defines the observable space of an agent. Designing the observation space in a task-specific manner with measurable sensor data can significantly enhance the efficiency of the training process [see Figure 4(a)]. For example, equipping a robotic hand with tactile sensor information has been shown to lead to improved RL agent performance [86]. Some work includes other sensory modalities, such as Church et al. [24] presenting tactile-based RL agents, which learn from tactile information represented as depth images; a zero-shot policy transfer is achieved through a generative model. Ning et al. [97] introduce an autonomous robotic ultrasound imaging system, where the observation is a concatenated latent vector of two conventional autoencoders. Chen et al. [22] examine the feasibility of conveying ambient sound information about 3D scene structures. Xu et al. [138] create a dataset of 15,000 transparent objects and present TransparentNet to estimate depth images in the presence of light refraction and absorption. Finally, Ji et al. [55] propose a state estimator for quadrupedal locomotion to extend the state representation.

Other work combines present sensor modalities. For instance, Miki et al. [87] introduce a state representation combining proprioception and exteroception for the quadrupedal ANYmal, which enables locomotion on various complex terrains.
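To illustrate how such task-specific representations are typically assembled, the following minimal sketch concatenates proprioceptive and exteroceptive measurements into a single observation vector; the dimensions, the clipping range, and the function signature are illustrative assumptions rather than the design of any cited work.

```python
import numpy as np

def build_observation(joint_pos, joint_vel, base_ang_vel, height_scan):
    """Concatenate proprioceptive and exteroceptive sensor data into one
    observation vector, in the spirit of combined state representations
    for legged robots. All inputs are 1D numpy arrays."""
    proprioception = np.concatenate([joint_pos, joint_vel, base_ang_vel])
    exteroception = np.clip(height_scan, -1.0, 1.0)  # bound terrain height samples
    return np.concatenate([proprioception, exteroception]).astype(np.float32)

# Hypothetical dimensions for a quadruped: 12 joints and a 52-point
# terrain height scan sampled around the robot base.
obs = build_observation(
    joint_pos=np.zeros(12), joint_vel=np.zeros(12),
    base_ang_vel=np.zeros(3), height_scan=np.zeros(52),
)
assert obs.shape == (79,)
```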

REWARD DESIGN
Reward design mainly addresses adjusting reward functions in their terms and parameters as well as automatically learning them from given data. Reward design, on the one hand, is an effective method to incorporate expert knowledge into RL. For complex tasks, where off-the-shelf RL algorithms typically fail to converge, designing appropriate dense reward functions can lead to increased sample efficiency and performance [see Figure 4(b)]. Research that incorporates reward shaping includes, e.g., Jestel et al. [54], whose work presents a robust policy for multirobot navigation, which learns emergent behavior in multirobot scenarios, such as swapping, intersections, and constrictions, and possesses the ability to recover from dead ends. Siekmann et al. [119] design a parametric reward function for all common bipedal gaits, such as walking and running, that proves to allow a successful transfer of the policy to the real robot Cassie. Fu et al. [38] propose a bioinspired reward function for locomotion that is based on reducing energy consumption while walking and generates different natural gaits, depending on the command velocity. Eteke et al. [33] present a skill learning framework that learns rewards from very few demonstrations. The rewards are learned by a hidden Markov model from deep perceptual features, which leads to better performance than a sparse reward signal. Chiang et al. [23] use AutoRL [100] to automatically apply a reward shaping technique, navigating a mobile robot in long indoor environments.
Reward learning, on the other hand, is an approach often found in human-robot interaction, where users rate given agent trajectories to learn a reward function, as done by Myers et al. [93], who introduce a multimodal reward learning approach in which users need only to rank a set of given trajectories. Wilde et al. [135] propose a new feedback mode, where users rate trajectories based on a slider bar to get scaled feedback. Cabi et al. [18] propose reward sketching to efficiently gather dense human feedback that can be used to train reward models. Escontrela et al. [31] use an adversarial RL approach [105], adjusting the reward to distill the walking style of a real dog into a robot.
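As a concrete illustration of dense reward shaping, the following sketch combines a velocity-tracking term with an energy penalty, loosely in the spirit of the bioinspired formulation of [38]; the terms and weights are illustrative assumptions, not the exact function of any cited paper.

```python
import numpy as np

def locomotion_reward(base_vel, cmd_vel, joint_torques, joint_vels,
                      w_track=1.0, w_energy=2e-4):
    """Dense shaped reward for locomotion: an exponential velocity-tracking
    bonus in (0, 1] minus a mechanical-power penalty that discourages
    wasteful motions. Weights are illustrative and task dependent."""
    tracking = np.exp(-np.sum((base_vel - cmd_vel) ** 2))
    energy = np.sum(np.abs(joint_torques * joint_vels))  # sum of |tau * qdot|
    return w_track * tracking - w_energy * energy
```

In practice, such terms and weights are tuned iteratively, which is exactly the expert knowledge that reward design injects into the pipeline.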

ABSTRACT LEARNING
Apart from carefully selecting the observation space, a key role in representing the learning problem is the choice of the action space [see Figure 4(c)]. Most approaches that utilize an abstract action space demonstrate improvements in sample efficiency, while some ideas also deploy hybrid learning and model-based approaches. As an introductory read, we refer to Varin et al. [125], who give a comparison of classically used action spaces in various manipulation tasks. In manipulation, some work improves the classically used end-effector space. For instance, Martin-Martin et al. [82] introduce variable impedance control in the end-effector space to simplify exploration and improve robustness to disturbances. Wong et al. [136] introduce Operational Space Control for Adaption and Robustness, a data-driven version of operational space control [59] that is adaptive to changes in the dynamics of a manipulation setting. Bogdanovic et al. [16] propose a policy learning the impedance and desired position in the joint space and compare this approach to torque control and a fixed gain proportional-derivative controller. Duan et al. [27] propose a task space for bipedal locomotion. The policy learns to select set points for the feet, and an inverse dynamics controller transfers these set points to the joint-level control.
Other approaches use action spaces to alleviate the learning problem, such as that of Pertsch et al. [106], [107], who leverage offline datasets to learn latent space representations of sequences of actions (skills) along with prior distributions over these skills. On new downstream tasks, they show that the priors can be used to guide policy learning, enabling agents to sample-efficiently solve long-horizon tasks, such as robotic manipulation tasks. Whitney et al. [134] and Allshire et al. [3] propose latent representations of actions for manipulation that robustly handle the dynamics of manipulation settings. They show improvements in sample efficiency and performance in pixel-based continuous control environments.
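To make the notion of an abstract action space concrete, the following sketch wraps a policy that commands end-effector displacements, with a fixed task-space PD law mapped to joint torques via the Jacobian transpose; the robot-model functions `forward_kinematics` and `jacobian` are assumed interfaces, and the gains are illustrative.

```python
import numpy as np

class TaskSpaceActionWrapper:
    """Abstract action space sketch: the policy outputs normalized
    end-effector offsets, and a fixed low-level controller converts them
    to joint torques (tau = J^T f), shielding the policy from joint-level
    complexity."""

    def __init__(self, robot, kp=50.0, kd=5.0):
        self.robot, self.kp, self.kd = robot, kp, kd

    def torques(self, action, q, qdot):
        x = self.robot.forward_kinematics(q)           # current end-effector pose
        x_des = x + 0.05 * np.clip(action, -1.0, 1.0)  # small commanded offset
        J = self.robot.jacobian(q)
        f = self.kp * (x_des - x) - self.kd * (J @ qdot)  # task-space PD force
        return J.T @ f
```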

OFFLINE RL
Offline RL, also called batch RL, can potentially improve the sample efficiency of other RL approaches, as it is a data-driven paradigm that trains policies from offline data. Due to the novelty of the field, researchers are mainly focused on developing algorithms to produce high-performance policies. Offline datasets are often collected from previous training runs of online RL methods but could also be preprocessed recordings of real-world sensor data [see Figure 5(a)]. For further information about the field of offline RL, we refer to [69]. Offline RL suffers from the so-called extrapolation problem, where the policy produces out-of-distribution actions that are wrongly overestimated by the value function [39]. There are several algorithms that are robust to this problem, such as batch-constrained deep Q-learning [39], Random Ensemble Mixture [1], critic-regularized regression [133], implicit Q-learning [63], and MuZero Unplugged [116]. Another approach to offline RL is to leverage techniques developed in machine learning, such as the work by Chen et al. [20], who utilize the transformer architecture conditioned on the desired reward, past states, and actions to produce future actions that achieve a desired return. Other work addresses the offline data themselves. Yarats et al. [140], e.g., propose using reward-free unsupervised data first and then annotating the reward to learn an RL policy. There are also some efforts to produce large offline datasets. Dasari et al. [26] introduce an open database to learn models for vision-based robotic manipulation that consists of 15 million video frames for seven different robot manipulators.
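One common remedy for the extrapolation problem is to regularize the learned policy toward the actions contained in the dataset. The following PyTorch sketch shows such a behavior-regularized actor loss (in the style of the known TD3+BC technique, which is not among the methods cited here); `actor`, `critic`, and the batch layout are assumed interfaces.

```python
import torch.nn.functional as F

def offline_actor_loss(actor, critic, batch, bc_weight=2.5):
    """Offline actor update: maximize the critic's value estimate while
    penalizing deviation from dataset actions, keeping the policy close
    to the data distribution and mitigating value overestimation."""
    states, data_actions = batch["states"], batch["actions"]
    pi = actor(states)
    q = critic(states, pi)
    lam = bc_weight / q.abs().mean().detach()  # scale the Q term to the BC penalty
    return -(lam * q).mean() + F.mse_loss(pi, data_actions)
```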

PARALLEL LEARNING
Parallel learning deals with utilizing one or more heterogeneous hardware resources in the most efficient way by means of parallelization [see Figure 5(b)]. Furthermore, it addresses how to implement scalability and the necessary robustness to handle different sizes of a learning process. Several parallel learning architectures have been developed throughout the years, such as A3C [88], IMPALA [32], Ape-X [44], D4PG [11], and R2D2 [58]. All the mentioned architectures are robust to parallel deployment and show large improvements in the sample efficiency and performance of the final policy over the baselines. There are also some adjacent optimization paradigms, such as evolutionary strategies [114], that can be scaled to large proportions and are capable of producing policies competitive with RL-based ones. Other work by Mania et al. [80] proposes using a variant of random search, augmented random search, that is also very scalable and derives near-optimal policies.
Other work makes use of hardware accelerators to permit massive parallelization and thus facilitate the training process [73]. In this vein, Makoviychuk et al. [79] present Isaac Gym, a fully GPU-based simulator for RL that is capable of simulating a high number of environments by using a single GPU. Building on this, Rudin et al. [113] use Isaac Gym to train a quadruped to walk in minutes over increasingly complex terrain. Finally, the combination of other optimization techniques with RL seems to be a promising approach. For example, Jaderberg et al. [48] present a two-level optimization evolutionary process targeting a population of RL agents. This framework enables agents, conditioned on pixels only, to play a complex 3D game, matching human performance.
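At its core, parallel learning amounts to stepping many environment copies with a batched policy and aggregating their experience. The following synchronous sketch illustrates this idea (real architectures such as [32] and [44] distribute it across processes or GPUs); a minimal environment interface is assumed in which `reset()` returns an observation and `step()` returns a tuple starting with the next observation and the reward.

```python
import numpy as np

class SyncVectorRollout:
    """Synchronous stand-in for distributed experience collection: N
    environments are stepped in lockstep so the policy runs one batched
    forward pass per time step."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def collect(self, policy, steps):
        obs = np.stack([env.reset() for env in self.envs])
        batch = []
        for _ in range(steps):
            actions = policy(obs)  # batched action selection
            results = [env.step(a) for env, a in zip(self.envs, actions)]
            next_obs = np.stack([r[0] for r in results])
            rewards = np.array([r[1] for r in results])
            batch.append((obs, actions, rewards, next_obs))
            obs = next_obs
        return batch
```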

LEARNING FROM DEMONSTRATION
Although learning from demonstration [115] is a research field on its own, many RL approaches make use of such techniques. As an introduction to the field, we refer to Osa et al. [99] and Billard et al. [15]. Aside from standard techniques, such as behavior cloning [9] (learning from offline data in a supervised way) and Dataset Aggregation [112] (training the policy to mimic an expert in an online fashion), novel approaches are presented by Florence et al. [35], who propose implicit behavior cloning, letting the policy be represented by an energy-based model [67]. Laskey et al. [65] present the Disturbances for Augmenting Robot Trajectories algorithm, which collects demonstrations with injected noise while adjusting the noise level according to the trained policy.
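For reference, the standard behavior cloning baseline [9] reduces to supervised regression on expert state-action pairs, as in the following minimal PyTorch sketch; `policy` is any differentiable module, and `demos` is assumed to yield batched tensors.

```python
import torch
import torch.nn.functional as F

def behavior_cloning(policy, demos, epochs=100, lr=1e-3):
    """Fit a policy to expert demonstrations by minimizing the mean
    squared error between predicted and demonstrated actions."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, expert_actions in demos:
            loss = F.mse_loss(policy(states), expert_actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```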
Other approaches first train a teacher policy on unchanging task setups via, e.g., RL and then distill a policy capable of interpolating among different task setups, such as [7], which chooses neural dynamical policies [8] to represent the teacher and students. Others learn teacher policies on true state information to then derive a student policy conditioned on a reduced or substituted input space, e.g., Chen et al. [21], whose final vision-based policies are able to reorient objects in the shadow hand domain, and Lee et al. [68], who also distill a vision-based policy and test it in their RGB-stacking benchmark. Other work makes use of classical optimization methods to represent the expert and distill their demonstrations into trainable policies. Wang et al. [131] use a hybrid learning process of RL and imitation learning, with the Optimization-based Motion and Grasp planner [130] as the expert.
Other than using a teacher, some work utilizes human demonstrations. Akbulut et al. [2] introduce a new framework called Adaptive Conditional Neural Movement Primitives, combining supervised learning and RL to conserve old skills learned from robot demonstrations while being adaptive to new environments. James and Davison [51] present a coarse-to-fine discrete RL algorithm to solve sparse reward manipulation tasks by using only a small amount of demonstration and exploration data (work extended by [49] and [50]). Celemin et al. [19] include human corrective advice in the action domain through a learning-from-demonstration approach, while an RL algorithm guides the learning process by filtering out human feedback that does not maximize the reward.

CURRICULUM LEARNING
In the context of RL, curriculum learning [13] provides a framework for increasing sample efficiency through task structuring, where the policy for a complex task is learned by solving simpler tasks with gradually increasing levels of difficulty [see Figure 6(a)]. This can reduce the convergence time, on the one hand, and help solve problems that are too difficult to learn from scratch, on the other [96], [120]. Most approaches rely on either expert knowledge to gradually increase the difficulty of a target task or data-driven strategies for automatic curricula generation.
Matiisen et al. [83] introduce a framework for automatic curriculum learning called teacher-student curriculum learning, where a teacher automatically chooses appropriate subtasks based on the student's progress on a complex task. Klink et al. [60] introduce self-paced contextual reinforcement learning, which gives the agent the freedom to control the intermediate task distribution. Florensa et al. [36] present an approach for reverse curriculum generation, where the robot gradually learns to reach more distant goals, starting from goals near the start states. In the same vein, Sharma et al. [118] generate a curriculum of initial states, where the agent learns to reset to generated subgoals based on its performance. Rodriguez and Behnke [111] introduce an approach to learn omnidirectional locomotion for humanoid robots through curriculum learning; their method gradually increases the task difficulty with scheduled target velocities.
Finally, some approaches use curriculum learning to address complex tasks involving multiple goals and multiple robots. For instance, Luo et al. [78] exploit a curriculum that gradually adjusts the precision requirements for multigoal reach experiments and show that it improves performance while converging faster. Eoh et al. [29], on the other hand, employ a curriculum learning approach for challenging multirobot object transportation tasks that gradually increases both the transportation distance and the number of robots involved. Leyendecker et al. [70] propose a combination of reward curriculum and domain randomization to develop a robust sim-to-real transferable policy to execute a manipulation task in an industrial setup.
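A manually designed curriculum can be as simple as widening the task distribution whenever the agent masters the current difficulty level. The following sketch schedules target velocities in the spirit of [111]; the thresholds and increments are illustrative assumptions.

```python
class VelocityCurriculum:
    """Success-gated difficulty schedule: the commanded velocity range
    grows only once the recent success rate exceeds a threshold."""

    def __init__(self, v_max=0.2, v_limit=1.0, step=0.1, threshold=0.8):
        self.v_max, self.v_limit = v_max, v_limit
        self.step, self.threshold = step, threshold

    def update(self, recent_success_rate):
        if recent_success_rate > self.threshold:  # current level mastered
            self.v_max = min(self.v_max + self.step, self.v_limit)
        return self.v_max  # sample training commands from [0, v_max]
```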

HIERARCHICAL RL
Another way to improve training efficiency and effectiveness via task structuring is hierarchical RL [12], [101], which is based on the idea of decomposing complex tasks into a hierarchy of subtasks [see Figure 6(b)]. Typically, these subtasks are addressed by dedicated low-level policies orchestrated by a more general high-level policy, and thus, they potentially can be reused in a sample-efficient manner. One standard formulation of hierarchies is introduced by [122] with the options framework, where high-level policies choose options instead of actions; options are represented by closed-loop low-level policies that output actions for a certain amount of time, enabling temporal abstraction.
Bacon et al. [6] extend this work with an option formulation of the critic. Some work formulates hierarchical algorithms, such as that by Yang et al. [139], who propose a hierarchical deep deterministic policy gradient for continuous robotic control tasks, where compound and basic skills are learned simultaneously by two levels of hierarchy. Nachum et al. [95] present a general and data-efficient hierarchical RL algorithm, called HIerarchical Reinforcement learning with Off-policy correction to learn complex robotic behaviors. Their approach consists of low-level controllers that are supervised with goals generated automatically by high-level controllers.
Others deploy hierarchies in their policies, e.g., Peng et al. [103], who introduce a two-level hierarchical control framework for learning a variety of locomotion skills for a physically simulated bipedal robot. Le et al. [66] introduce a hierarchical guidance framework that also effectively leverages expert feedback. Instead of merely giving a subtask decomposition, a high-level expert is deployed to focus the low-level learner on relevant parts of the state space. Margolis et al. [81] address the problem of dynamic locomotion over discontinuous terrain by using a high-level controller to produce a trajectory based on visual inputs that is then tracked by a low-level controller. Nachum et al. [94] employ a hierarchy to learn low-level goal-reaching skills coordinated by a high-level controller for multiagent object manipulation. Wang et al. [129] apply a hierarchical policy in a cluttered-scene grasping setting that learns an embedding space on expert plans and chooses sampled plans via a critic as well as appropriate options [122] via an option classifier. Finally, Li et al. [71] adopt a hierarchical structure for interactive navigation tasks, where a high-level policy generates subgoals and selects low-level policies returning task phase-specific robot actions.
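The control flow shared by most of these approaches is temporal abstraction: a high-level policy emits a subgoal every k steps, and a goal-conditioned low-level policy acts at every step, as in [95]. The sketch below assumes generic `high_policy`, `low_policy`, and `env` interfaces.

```python
import numpy as np

def hierarchical_rollout(high_policy, low_policy, env, horizon=1000, k=50):
    """Two-level rollout: the high-level policy sets a subgoal every k
    steps; the low-level policy outputs actions conditioned on the
    current observation and the active subgoal."""
    obs = env.reset()
    for t in range(horizon):
        if t % k == 0:  # temporal abstraction: refresh the subgoal
            subgoal = high_policy(obs)
        action = low_policy(np.concatenate([obs, subgoal]))
        obs, reward, done, info = env.step(action)
        if done:
            break
```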

PERFECT SIMULATOR
One intuitive path toward effective real-world deployment is to build a realistic simulator that minimizes the reality gap [see Figure 7(a)]. Simulators that accurately capture real-world physics are appealing since they potentially allow directly transferring trained models in a zero-shot fashion into the real world [144]. Increasing the realism of the simulated environment includes better robot models, physics computation, and environment representations, respectively.
System identification [76] is about building a precise mathematical model of a physical system. In the context of robotics simulation, carefully tuning physical parameters, such as friction, weight, and elasticity, can significantly increase the realism of the simulator. Moreover, machine learning approaches can be applied either offline [57] or, as presented by Yu et al. [141], in an online fashion by predicting the dynamics model parameters in real time. Accurately simulating the complex dynamics of modern robots also imposes high demands on the choice of the physics engine. Erez et al. [30] analyze quantitative measures of simulation performance and speed related to solving the numerical challenges of multibody dynamics present in robotics. Besides choosing an appropriate physics engine, the physics simulator has to support the needs of the robotics use case. As the authors of [25] conclude, for each robotics subdomain, different simulators are preferred depending on the relevance of, e.g., sensors, dynamic contacts, and friction modeling. Muratore et al. [90] apply dynamics randomization and use a newly developed algorithm to switch the parameters of the domain randomization, preventing overfitting to the simulator dynamics. Lowrey et al. [77] leverage real-world robot data to carefully identify robot parameters [62], enabling an RL-trained policy to transfer directly from simulation to reality. Heiden et al. [42] propose a hybrid simulator with learned neural networks that switches between analytical and learned computation of physical effects. Xia et al. [137] present the Gibson environment, capable of realistic visual perception for active agents based on real-world data.
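Conceptually, offline system identification reduces to an optimization over simulator parameters that minimizes the discrepancy between simulated and recorded trajectories, as in the following sketch; `simulate(theta, traj)`, which replays the logged actions of a trajectory under parameters `theta`, is an assumed interface.

```python
import numpy as np
from scipy.optimize import minimize

def identify_parameters(simulate, real_trajectories, theta0):
    """Fit physical simulator parameters (e.g., friction, masses) so
    that simulated rollouts match logged real-world state trajectories."""
    def discrepancy(theta):
        return sum(
            np.mean((simulate(theta, traj) - traj["states"]) ** 2)
            for traj in real_trajectories
        )
    return minimize(discrepancy, theta0, method="Nelder-Mead").x
```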
Finally, an accurate representation of the environment can significantly reduce the reality gap. Ramos et al. [109] present BayesSim, a framework that offers adaptive Bayesian estimates for simulation parameters via simulation-based inference, while Golemo et al. [40] introduce neural-augmented simulation, a method for augmenting robotic simulators with real robot trajectories. Hwangbo et al. [46] present a neuronal net trained on real data for ANYmal to convert policy action into a torque value for the simulation model.

DOMAIN RANDOMIZATION
The idea behind domain randomization is to highly randomize the simulation along a wide range of parameter distributions [see Figure 7(b)]. Instead of carefully modeling the real-world parameters in simulation, the real world simply appears as just another variation of these distributions [144]. Depending on the parameters to be randomized, common approaches deploy the randomization of either visual or dynamics components. For a more thorough survey on randomized simulations, the reader is referred to [92].
Tobin et al. [123] first introduced the idea of randomizing rendering in the simulator to transfer neural networks to reality for the purpose of robotic control. Mehta et al. [84] propose active domain randomization, which learns a parameter sampling strategy to leverage the randomization ranges that are the most informative. OpenAI et al. [98] present automatic domain randomization, which adjusts the domain randomization environment parameters, depending on the policy success, for solving a Rubik's Cube with a real robot hand. Prakash et al. [108] present structured domain randomization, which creates context-aware synthetic data by taking into account the structure of a scene. Instead of randomizing visual components of the simulator, Peng et al. [102] introduce dynamics randomization that includes parameters such as link masses, joint damping, and proportional-derivative gains, respectively.
Muratore et al. [91] introduce neural posterior domain randomization, which adapts the simulator's parameters by using only a few real-world rollouts to match the observed dynamics. Tsai et al. [124] leverage a single human demonstration to identify the simulator's distribution over dynamics parameters and adapt the domain randomization to reduce the sim-to-real gap. Ideas similar to visual and dynamics randomization have been adopted in other works, where perturbations are introduced to obtain more robust agents. For example, Wang et al. [128] consider noisy rewards, while other recent works apply noisy sensor signals [111] and random external forces [113] for effective policy deployment in the real world.
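In its simplest form, dynamics randomization resamples physical parameters at the start of every training episode, as in the sketch below; the setter methods and parameter ranges are illustrative and do not correspond to a specific simulator API.

```python
import numpy as np

def randomize_dynamics(sim, rng):
    """Sample dynamics parameters from hand-chosen ranges so that the
    real world appears as just another variation of the training
    distribution [144]."""
    sim.set_friction(rng.uniform(0.5, 1.25))
    sim.set_link_mass_scale(rng.uniform(0.8, 1.2))
    sim.set_joint_damping_scale(rng.uniform(0.9, 1.1))
    sim.set_pd_gain_scale(rng.uniform(0.9, 1.1))

# Typical use, once per episode:
# rng = np.random.default_rng(0); randomize_dynamics(sim, rng)
```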

DOMAIN ADAPTATION
Domain adaptation techniques aim to minimize the reality gap by training adaptation modules, often represented by autoencoders, capable of projecting one domain into another, e.g., real-world camera images to simulation look-alikes [see Figure 7(c)]. Domain adaptation based on vision has been explored by several researchers. Bousmalis et al. [17] implement the former idea via GraspGAN, where an adaptation module is trained to convert synthetic images taken from simulation into more photorealistic observations. James et al. [52] present randomized-to-canonical adaptation networks, which learn to project synthetic images derived from randomized simulations into the style of the canonical simulation. Rao et al. [110] present RL-CycleGAN, which converts synthetic images into more realistic images. Liu et al. [75] introduce an approach called real-sim-real that adapts the real-world state into a simplified one by a segmentation model. Zhang et al. [143] propose adaptation modules, which are trained independently of the deep RL agent and can be deployed for different scenarios, e.g., indoor and outdoor navigation. Hoeller et al. [43] introduce a navigation policy for ANYmal that can navigate in cluttered environments with static and dynamic obstacles.
Other work investigates using an adaptation module to handle environmental factors. Peng et al. [104] introduce a framework for training quadrupedal robots to imitate agile locomotion skills from animals, where the learned policies can then be transferred from simulation to the real world through a sample-efficient domain adaptation process. Kumar et al. [64] present a rapid motor adaptation algorithm that adapts in real time to unseen real-world scenarios.
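Architecturally, many of these adaptation modules are simple encoder-decoder networks that translate observations from a source domain into the style of a target domain, as in the following PyTorch sketch; the layer sizes are illustrative, and training typically relies on paired data or adversarial objectives as in [17] and [110].

```python
import torch.nn as nn

class AdaptationModule(nn.Module):
    """Encoder-decoder sketch that maps images from one domain (e.g.,
    real camera frames) into the style of another (e.g., canonical
    simulation renderings); the frozen policy then consumes the output."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, source_image):
        return self.decoder(self.encoder(source_image))
```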

EVALUATION OF APPROACHES
In this section, we present a systematic evaluation of guided RL approaches. As we show here, combining multiple methods leads to improvements in all three guided RL dimensions, namely, efficiency, effectiveness, and sim-to-real transfer, especially when specific combinations are used. In the following, we first describe the methodical approach and then present key insights for both individual methods and combinations thereof. Table 2 provides an overview of the guided RL approaches discussed in the "Description of Methods" section. According to the three dimensions of guided RL (see the "Concept of Guided RL" section), we identify, for each of the discussed approaches, whether 1) the overall training time has been reduced (efficiency), 2) improved policy performance has been achieved (effectiveness), and 3) the trained policy has been deployed to the real world (sim-to-real). Specifically, we adopt the achievements claimed by the authors themselves along the three dimensions, verified by means of figures, tables, and specific text passages, respectively. Furthermore, for a more in-depth analysis, the last column of the table shows which specific guided RL methods were used in each of the approaches.

METHODICAL APPROACH
Consequently, Table 2 provides a structured overview of the approaches both in terms of achievements along the three dimensions and used guided RL methods.
Based on those classified references, Figure 8 displays the normalized contribution of the respective methods in terms of efficiency, effectiveness, and sim-to-real. For instance, among the covered papers adopting hierarchical RL, many approaches have shown improvements in terms of policy performance, and hence, this method seems to contribute significantly toward increasing the effectiveness of the learning approach. To increase the statistical significance of the evaluation, not only the papers whose main contribution is the corresponding method are considered but also all papers adopting that method (see the "Guided RL Methods" column in Table 2).

KEY INSIGHTS ON INDIVIDUAL METHODS
As the quantitative evaluation of references has shown (Figure 8), particular methods tend to lead to improvement in terms of efficiency, effectiveness, and sim-to-real. The following key insights can support selecting individual methods to increase the probability of an approach being more efficient and effective and reaching real-world deployment.

IMPROVING THE EFFICIENCY
As our findings show, parallel learning architectures, abstract learning, and learning from demonstration data, in particular, often lead to accelerating the RL training process. First, an efficient parallelization of the algorithmic components allows scaling the learning problem to different sizes [32], [44], [88]. Second, simplifying the learning task by means of task-specific action spaces and hybrid model-based and model-free approaches can improve the overall efficiency. Finally, training based on expert demonstrations tends to be rich in information and hence can accelerate policy training [7], [124], [131]. Furthermore, the efficiency can likely be improved by employing more instructive state representations, applying a curriculum to gradually tackle difficult learning tasks, and utilizing accurate simulation environments.

IMPROVING THE EFFECTIVENESS
In terms of effectiveness, in particular, offline RL, hierarchical RL, and curriculum learning seem to have a significant impact on the overall policy performance. On the one hand, training policies based on recorded datasets can be a valuable path to effectiveness since the full range of potential information can be extracted from the samples [18], [39], [133]. On the other hand, utilizing curricula and hierarchical learning schemes turns out to be viable for improving the policy performance and tackling even more complex robotics tasks [83], [111], [113]. Besides this, multiple other methods can contribute to enhance the overall task performance, such as meaningfully formulating the overall learning problem, incorporating demonstration data, and deploying parallel learning structures.

IMPROVING SIM-TO-REAL TRANSFER
As our evaluation results show, domain randomization, domain adaptation, and the perfect simulator are methods often deployed for transferring policies trained in simulation to the real world. First, domain randomization turns out to be a popular method often adopted for successful sim-to-real transfer [84], [102], [113], which is likely due to its simple implementation that can be easily adopted for most robotics problems. Second, domain adaptation in terms of adaptation modules is often employed to successfully transfer between simulated and real worlds and vice versa [17], [43], [52]. Finally, many approaches strive to reduce the reality gap by improving the realism of the simulators in terms of better robot models, better physics computation, and better environment representation to close the sim-to-real gap [30], [64], [141].

KEY INSIGHTS ON GUIDED RL COMPLIANCE
While efficient and effective policy training as well as real-world robotics deployment form the natural dimensions of guided RL, combining these three dimensions constitutes the overall goal (see Figure 2). For this purpose, we analyze correlations among exactly those papers that were able to achieve improvements in all three dimensions, described in the following as guided RL compliant (and marked with an asterisk in Table 2). Overall, we identify three common patterns among the guided RL-compliant papers, which simultaneously accelerate the training process (efficiency), improve the policy performance (effectiveness), and transfer the policies to the real world (sim-to-real).

USING MULTIPLE GUIDED RL METHODS
First, we find that the guided RL-compliant papers tend to use a variety of guided RL methods. For instance, [7], [27], and [51] utilize at least three guided RL approaches, while [43], [70], and [82] deploy five or more approaches to obtain improvements in all three guided RL dimensions.

COMBINING PARTICULAR GUIDED RL METHODS
Second, we note not only that the number of used methods tends to be an important factor but also that combining particular guided RL methods can improve the probability of simultaneously improving efficiency, effectiveness, and sim-to-real transfer. By analyzing the Pearson correlation coefficients, we observe high positive correlations between domain adaptation and domain randomization for sim-to-real transfer as well as between parallel learning and domain randomization for data-driven scalability. Moreover, our analysis shows that abstract learning and learning from demonstration are often combined by means of reduced action spaces, and reward design is often paired with the perfect simulator to incorporate further knowledge into both the simulation and the reward function.
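Such a correlation analysis can be reproduced directly from Table 2: encoding method usage as a binary papers-by-methods matrix, the Pearson coefficients between columns indicate which methods tend to co-occur. The matrix below is a hypothetical toy example, not the actual table data.

```python
import numpy as np

def method_correlations(usage):
    """Pearson correlation between the method-usage columns of a binary
    (papers x methods) matrix."""
    return np.corrcoef(usage, rowvar=False)

# Toy example with three papers and three methods (columns: domain
# randomization, domain adaptation, parallel learning).
usage = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 0, 1]])
print(method_correlations(usage))
```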

EXPLOITING MULTIPLE LEVELS OF THE GUIDED RL PIPELINE
Finally, we observe that the guided RL-compliant papers also tend to specifically incorporate multiple levels of the guided RL pipeline (see Figure 1). For instance, [33], [60], and [65] employ guided RL methods for two or three pipeline stages. Moreover, several of the guided RL-compliant papers [43], [70] even integrate knowledge into all four levels of the guided RL pipeline to simultaneously accelerate the training process (efficiency), improve the policy performance (effectiveness), and achieve sim-to-real transfer.

DISCUSSION OF CHALLENGES AND DIRECTIONS
In this section, we outline potential challenges and future directions in the field of guided RL. We start the discussion by looking at specific approaches first (see Table 3) and then distill common challenges and directions.

METHODICAL CHALLENGES AND DIRECTIONS
We summarize our findings on the main approaches to guided RL at a high level in Table 3. For each approach, the table provides the taxonomy, its main motivation, the central idea, potential challenges, and our perspective on current and future research directions. Details of the methods themselves and corresponding papers can be found in the "Description of Methods" section, while the challenges and directions of these approaches are discussed in more detail in the following along the guided RL pipeline levels.

PROBLEM REPRESENTATION
A key challenge with state representation turns out to be balancing the richness of the observable space against the computational effort [22], [75], [138]. A potential way to mitigate this challenge is to deliberately combine multimodal sensor information, such as additional tactile sensors for touch information [24], [86]. With reward design, a potential challenge can be selecting reward terms and reward parameters that represent the target task in an accurate way [54], [93]. Potential directions include bioinspired reward shaping [119], parameter optimization [100], and inverse RL [37] for automatically finding appropriate reward functions. Finally, selecting an abstracted action space instead of a complex one can largely simplify the training task (e.g., [106] and [136]). We see a potential challenge in choosing appropriate levels and methods of abstraction, such as the joint space and task space [82]. Two potential directions include latent action spaces [106] and hybrid strategies [27], where model-free RL and model-based approaches are interleaved meaningfully.

LEARNING STRATEGY
A key challenge with offline RL seems to be effectively processing the collected training data to extract the necessary information [1], [18], [39]. Current and future directions in this domain focus on introducing offline RL datasets [1], [26] and proposing novel algorithms to improve data usage [1], [39], [133]. With parallel learning, a main challenge is to design the information flow of the components to be parallelized [32], [44], [88]. Directions include improving the robustness of the parallel learning scheme [11], [32] while scaling to larger architectures [11], [44], [88]. In cases of learning from demonstration, potential challenges include accurate behavior modeling and extrapolating the behavior to new situations [7], [35], [124]. Current directions include generalizing behavior beyond specific demonstrations and creating sophisticated benchmarks for evaluating trained policies [2], [21], [68].

TASK STRUCTURING
Potential challenges with curriculum learning include selecting the right sequence of subtasks to be trained and suitable levels of difficulty [36], [113]. One current direction is the effective progression of the learning tasks, for example, with increased locomotion velocities [111] and challenging terrains [113]. Another direction is to automate the process of task generation, e.g., via multiple competing or cooperating policies [83]. With hierarchical RL, on the other hand, a potential challenge includes designing an appropriate hierarchical structure with task-specific responsibilities of the individual policies [66], [95], [139]. Current directions include deploying a hierarchical learning structure to solve more complex and long-horizon tasks [72], [103].

SIM-TO-REAL
Since the reality gap affects all parts of the simulator, potential challenges with perfect simulators include both the realism of robot models and environments as well as the physics computation accuracy [64], [85], [141]. Directions include successful zero-shot transfer without retraining on the real system [64]. Moreover, we see high potential in further integrating real data into the real-to-sim loop [102]. With domain randomization, a challenge often found is determining the most expedient randomization parameters and ranges. A current direction is to leverage randomization ranges that are the most informative [84] and provide context for the randomization [108]. Another direction is to improve the robustness of agents by introducing perturbations of, e.g., noisy rewards [128] and random external forces [113]. Finally, with domain adaptation, potential challenges lie in appropriately selecting the source and target domains and designing the adaptation module, respectively [17], [43], [52], [143]. Current research directions include the identification of useful source and target domains as well as alternative generative models to correctly represent the target domains.

COMMON CHALLENGES AND DIRECTIONS
Based on our evaluation study (see the "Evaluation of Approaches" section) and methodological analysis (see the "Methodical Challenges and Directions" section), we identify three common challenges in the field of guided RL and outline potential directions in the following.

SAMPLE-EFFICIENT POLICY TRAINING
Learning control policies in a data-driven fashion naturally requires many interactions with the environment and hence constitutes one of the key challenges to accelerate the training process. Specific limitations include the amount and quality of available training data, on the one hand, and the ability to efficiently process such data, on the other hand. In particular, parallel learning approaches, task-specific action spaces, and leveraging expert demonstrations represent potential ways to improve training efficiency.

COMPLEX AND LONG-HORIZON TASKS
Another major challenge is to effectively train policies for high performance on complex and long-horizon robotics tasks, e.g., complex object stacking, combined locomotion and manipulation, and multiagent scenarios. In particular, such tasks turn out to be challenging since interacting deliberately with objects in the environment requires advanced reasoning capabilities. Potential ways to circumvent this challenge are to train on recorded offline datasets and to deliberately apply task structuring approaches (e.g., curriculum learning).

REAL-WORLD ROBOTICS DEPLOYMENT
When simulation environments are adopted for training policies grounded on synthetic training data, bridging the reality gap is a key challenge for successful real-world robotics deployment. As our evaluation has shown, improving the realism of the simulator, randomizing simulation parameters, and training adaptation modules can lead to the zero-shot transfer of policies trained solely in simulation. Alternative promising directions include learning directly on the real system and leveraging real-world data in an offline RL fashion to circumvent the reality gap.

CONCLUSION
In this article, we presented a taxonomy for integrating different types of knowledge into RL to enhance the efficiency and effectiveness of real-world robotics, which we describe using the term guided RL. Based on a systematic and comprehensive literature review, we presented a description of available approaches in this field. Moreover, we quantitatively evaluated the relations among these and found that 1) using multiple guided RL methods, 2) combining particular methods, and 3) exploiting multiple levels of the guided RL pipeline can significantly improve the training process. Finally, we hope this conceptual clarification and review of guided RL helps other RL and robotics researchers to accelerate the training process and improve performance for real-world robotics tasks.