Hindsight Goal Ranking on Replay Buffer for Sparse Reward Environment

This paper proposes Hindsight Goal Ranking (HGR), a method for prioritizing replay experience that overcomes a limitation of Hindsight Experience Replay (HER), which generates hindsight goals by uniform sampling. HGR samples with higher probability the states visited in an episode that have larger temporal difference (TD) error, which is regarded as a proxy measure of how much the RL agent can learn from an experience. The sampling is performed in two steps: first, an episode is sampled from the replay buffer according to the average TD error of its experiences; then, for the sampled episode, a hindsight goal leading to larger TD error is sampled with higher probability from the future visited states. Combined with Deep Deterministic Policy Gradient (DDPG), an off-policy model-free actor-critic algorithm, the proposed method learns significantly faster than the same algorithm without prioritization on four challenging simulated robotic manipulation tasks. The empirical results show that HGR uses samples more efficiently than previous methods across all tasks.

Despite the many accomplishments, considerable challenges lie ahead in transferring these successes to complex real-world tasks. An important challenge is to design more sample-efficient reinforcement learning algorithms, especially for sparse reward environments. To address this issue, Lillicrap et al. propose the Deep Deterministic Policy Gradient (DDPG) [20], which considers an agent capable of learning continuous control such as robot manipulation, navigation, and locomotion. Schaul et al. [31] develop the Universal Value Function Approximators (UVFAs), which allow the value function to generalize over both states and goals (multi-goal). Moreover, to make the agent learn faster in sparse reward environments, Andrychowicz et al. [2] introduce Hindsight Experience Replay (HER), which enables the agent to learn even from undesired outcomes. HER combined with DDPG lets the agent learn to accomplish more complex robotic tasks in sparse reward environments.
In HER, failed episodes are sampled uniformly from the replay buffer; subsequently, their goals are also sampled uniformly from the visited states such that the failed episode is transmuted into a successful one. As a consequence, HER does not consider which visited states and episodes might be most valuable for learning, which is a probable cause of sample inefficiency. Learning would be more sample efficient if the episodes and goals were prioritized according to their importance. The challenge is to determine a criterion for measuring the importance of goals for replay. A recent approach referred to as Energy-Based Prioritization (EBP) [41] proposes an energy-based criterion to measure the significance of an episode. Yet, the goals are still sampled uniformly from the future visited states within an episode. Even the extension of Prioritized Experience Replay (PER) [32] considered in [41] samples goals uniformly within an episode. Zhao et al. [40] introduce a prioritization method that encourages the policy to visit diverse goals. However, this method still treats all goals equally. In this work, the significance of a goal is judged by the Temporal Difference (TD) error, which is an implicit way to measure learning progress [32,1]. Within an episode, a future visited state with high TD error is labeled as the hindsight goal more frequently. The significance of an episode is measured by the average TD error of its experiences, each with its hindsight goal set to one of the visited states in the episode.
To summarize, our paper makes the following contributions. We present HGR, a method that prioritizes experience when choosing the hindsight goal. The proposed method is applicable to any robotic manipulation task to which an off-policy multi-goal RL algorithm can be applied. We demonstrate the effectiveness of the proposed method on four challenging robotic manipulation tasks. We also compare the sample efficiency of our method with baselines including vanilla HER [2], Energy-Based Prioritization (EBP) [41], Maximum Entropy-Regularized Prioritization (MEP) [40], and one-step prioritized experience replay [41]. The empirical results show that the proposed method converges significantly faster than all baselines. Specifically, two-step ranking needs 2.9 times fewer samples than vanilla HER, 1.9 times fewer than one-step prioritized experience replay, 1.3 times fewer than EBP, and 1.8 times fewer than MEP. We also conduct an ablation study to investigate the effect of each step.
The remainder of this paper is organized as follows. Section 2 describes the background of the problem and the related RL algorithms. Section 3 summarizes the related works. Section 4 presents our proposed method for ranking hindsight goals in the replay buffer. Section 5 shows the results of our method and a comparison with recent methods. Finally, Section 6 concludes this work and discusses its advantages and disadvantages.

Background
In this section, the classical framework of RL and its extension to the multi-goal setting are introduced, and the two algorithms most relevant to the proposed method, Deep Deterministic Policy Gradient and Hindsight Experience Replay, are reviewed.

Reinforcement Learning
A reinforcement learning algorithm attempts to find the optimal policy for interacting with an unknown environment so as to maximize the cumulative discounted reward that the agent receives for the actions it performs. The environment is assumed to be fully observable. This problem is typically modeled as a Markov Decision Process (MDP). The MDP is described by a tuple $\langle S, A, R, T, p_0, \gamma, H \rangle$, where $S$ is a set of states, $A$ is a set of actions, $R$ is a reward function, $T$ is a set of transition probabilities (usually unknown) mapping current states and actions to future states, $S \times A \to S$, $\gamma \in [0, 1)$ is a discount factor, $p_0$ is the distribution of the initial state, and $H$ is the horizon of an episode. Here, the horizon is assumed finite. A policy maps a state to an action, $\pi : S \to A$, and can be stochastic or deterministic.
Every episode starts with an initial state $s_0$ sampled from the distribution $p_0$. Let the agent be in state $s_t$ at time $t$. It takes an action $a_t \sim \pi(s_t)$ and immediately receives a reward $r_t = R(s_t, a_t)$ from the environment. The environment responds to the agent's action and presents a new state $s_{t+1}$, sampled from $T(\cdot \mid s_t, a_t)$. The return from a state is defined as the discounted sum of future rewards, $R_t = \sum_{i=t}^{H} \gamma^{i-t} r_i$ with $r_i = R(s_i, a_i)$. The return depends on the chosen actions, and therefore on the policy $\pi$, and may be stochastic. The goal of reinforcement learning is to learn a policy that maximizes the expected return from the start distribution, $\mathbb{E}_{s_0}[R_0 \mid s_0]$. The state-action value function is used in many reinforcement learning algorithms to indicate the expected return when an agent is in state $s_t = s$, performs action $a_t = a$, and thereafter follows the policy $\pi$:
$Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right]$ (1)
Many approaches in reinforcement learning make use of the well-known recursive relationship known as the Bellman equation:
$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim T}\left[ r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right]$ (2)
If the policy is deterministic, we can denote it as a function $\mu : S \to A$, and the inner expectation is avoided:
$Q^{\mu}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim T}\left[ r_t + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1})) \right]$ (3)
Let $\pi^*$ denote an optimal policy; its state-action value satisfies $Q^{\pi^*}(s, a) \ge Q^{\pi}(s, a)$ for every $s \in S$, $a \in A$, and any policy $\pi$. All optimal policies (there could be several) have the same Q-value, referred to as the optimal Q-value function and denoted by $Q^*$. This $Q^*$ satisfies the optimal Bellman equation,
$Q^{*}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim T}\left[ r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \right]$ (4)
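As a small worked illustration of the return defined above, the following sketch computes the discounted sum of rewards for one finite episode. The function name `discounted_return` and the list-of-rewards representation are illustrative assumptions, not part of the original method.

```python
def discounted_return(rewards, gamma, t=0):
    """Discounted return R_t = sum_{i=t}^{H} gamma^(i-t) * r_i.

    `rewards` is the list of per-step rewards of one finite episode
    (an assumed representation chosen for this sketch).
    """
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)


# Sparse-reward style episode: -1 until the goal is reached, then 0.
episode_rewards = [-1.0, -1.0, 0.0]
r0 = discounted_return(episode_rewards, gamma=0.5)        # -1 - 0.5 + 0
r1 = discounted_return(episode_rewards, gamma=0.5, t=1)   # -1 + 0
```

With a sparse reward, the return only reflects how many steps pass before the goal is reached, which is why informative experience is scarce in this setting.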

Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) interleaves learning an approximation to Q * (s, a) with learning an approximation to µ * (s), and it does so in a way that is specifically adapted to environments with continuous action spaces. Two deep architectures representing the critic and actor networks are shown in Figure 1. Here, Q * (s, a) is presumed to be differentiable with respect to the action argument. DDPG serves as the backbone of the proposed algorithm, but this need not be the case: we could have used Twin Delayed DDPG (TD3) [12] or Soft Actor-Critic (SAC) [13].
DDPG uses two deep architectures (for stability, four architectures are used, as discussed in detail below) to perform actor-critic policy gradient, with a replay buffer that stores real-world experiences for training the actor and critic networks.
Figure 1: Two networks: (1) the actor network, which takes the observed state as input and predicts the action that maximizes the action-value function, and (2) the critic network, which takes a state-action pair as input and predicts the value of the action-value function. The state and action are respectively represented as 5- and 3-dimensional vectors in the figure.
The next state from the replay buffer and its corresponding action predicted by the actor network are fed into the critic network to estimate the action-value function for the next state and its predicted action. The critic network is trained such that the action-value function of the current state and its action from the replay buffer matches the one-step look-ahead of the action-value function, which is defined as the sum of the experienced reward stored in the replay buffer and the discounted action-value of the next state and estimated action.
To stabilize the training of the critic, one critic network is used for training and a separate critic network is used for outputting the target value. Similarly, two separate actor networks are used, one for training and one for computing the next action. The actor-critic networks are repeatedly trained in the manner discussed above. The parameters of the target networks are slowly updated toward those of the training networks using a moving average. Exploration can be performed by adding noise to the actor network's output or to the parameters of the actor network.
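The slow moving-average update of the target networks can be sketched as follows. This is a minimal illustration on flat lists of weights; the name `soft_update` and the convention that `tau` is the fraction of the old target weight that is kept are assumptions made for this sketch.

```python
def soft_update(target_weights, online_weights, tau):
    """Moving-average target update: target <- tau * target + (1 - tau) * online.

    `tau` close to 1 means the target network changes slowly
    (assumed convention for this sketch).
    """
    return [tau * t + (1.0 - tau) * w
            for t, w in zip(target_weights, online_weights)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.95)  # first weight moves 5% toward 1.0
```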
Mathematically, DDPG maintains a deterministic policy $\mu_\theta(s)$ parameterized by $\theta$ and a critic $Q_\phi(s, a)$ parameterized by $\phi$. DDPG alternates between running a policy to collect experience and updating the parameters. The episodes are collected using a behavior policy, which is the deterministic policy mixed with noise, i.e., $\pi_b(s) = \mu_\theta(s) + \mathcal{N}$, where $\mathcal{N}$ is noise such as mean-zero Gaussian noise or noise generated by an Ornstein-Uhlenbeck process. The critic is trained by minimizing the following loss function, which encourages the approximated Q-value function to satisfy the Bellman equation:
$L(\phi) = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \left( y - Q_\phi(s, a) \right)^2 \right]$ (5)
where $y = r + \gamma\, Q_{\phi^-}(s', \mu_{\theta^-}(s'))$. Here $Q_{\phi^-}$, $\mu_{\theta^-}$, and $D$ are respectively the target Q-value function, the target policy, and the experience replay buffer containing a set of experiences $\{(s, a, r, s')\}$.
For stable training, the target $Q_{\phi^-}$ is typically maintained using a separate network, whose weights are periodically updated or averaged over the current weights of the main network [22,39,38]. Subsequently, the actor is learned by maximizing the following objective function with respect to $\theta$:
$J(\theta) = \mathbb{E}_{s \sim D}\left[ Q_\phi(s, \mu_\theta(s)) \right]$ (6)
The derivative of this objective is computed using the deterministic policy gradient theorem [35],
$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim D}\left[ \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s) \right]$ (7)
In this step, the parameter $\phi$ of the Q-value function is kept fixed. To make training more stable, the target policy is similarly maintained using a separate network, and its weights are updated periodically by a moving average.
In practice, the expectations in Eq. (5) and Eq. (7) are approximated by mini-batch samples from the experience replay [21]. Specifically, given a batch of experiences $B = \{(s_i, a_i, r_i, s'_i)\}$, the mean squared error of the Bellman equation is approximated by
$L(\phi) \approx \frac{1}{|B|} \sum_{i} \left( y_i - Q_\phi(s_i, a_i) \right)^2$ (8)
and the deterministic policy gradient is approximated by
$\nabla_\theta J(\theta) \approx \frac{1}{|B|} \sum_{i} \nabla_a Q_\phi(s_i, a)\big|_{a = \mu_\theta(s_i)} \nabla_\theta \mu_\theta(s_i)$ (9)
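The mini-batch approximation of the critic loss can be sketched in a few lines. The critic, target critic, and target policy are passed in as plain Python callables; these names and the scalar state and action used in the usage lines are illustrative assumptions, and a real implementation would use neural networks and automatic differentiation.

```python
def critic_loss(batch, q, q_target, mu_target, gamma):
    """Mean squared Bellman error over a mini-batch of (s, a, r, s_next).

    q, q_target and mu_target are callables standing in for the critic,
    the target critic and the target policy (assumed interfaces).
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # One-step look-ahead target computed with the target networks.
        y = r + gamma * q_target(s_next, mu_target(s_next))
        total += (y - q(s, a)) ** 2
    return total / len(batch)


# Toy stand-ins with scalar states and actions.
loss = critic_loss(
    batch=[(0.0, 0.0, -1.0, 1.0), (0.0, 0.0, -1.0, 3.0)],
    q=lambda s, a: 0.0,
    q_target=lambda s, a: s + a,
    mu_target=lambda s: 1.0,
    gamma=0.5,
)
```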

Universal Value Function Approximators
Universal Value Function Approximators (UVFAs) [31] are an extension of DQN to the multi-goal setting. In this setup, there is more than one goal the agent may try to achieve. This setup is also known as multi-task RL or goal-conditioned RL. Let $G$ denote the space of all possible goals. The reward function $R$ is modified so that it depends on a state, an action, and a goal $g \in G$, i.e., $r_t = R(s_t, a_t, g)$. For simplicity, we assume that the goal space is contained in the state space, i.e., $G \subseteq S$, which is satisfied in our environment. Every episode starts with an initial state and a goal sampled from the distribution $p(s_0, g)$. After sampling, the goal is fixed within the episode. At every timestep, the policy takes as input not only the current state but also the goal of the current episode, $\mu : S \times G \to A$. The Q-value function now also depends on the goal, $Q^{\mu}(s_t, a_t, g) = \mathbb{E}_{s_t \sim T,\, a_t \sim \mu,\, g \sim G}\left[R_t \mid s_t, a_t, g\right]$. In [31], the authors show that it is possible to learn an approximator of the Q-value function using direct bootstrapping from the Bellman equation, similar to DQN. The extension of UVFAs to Deep Deterministic Policy Gradient is straightforward. It results in a modified update for the critic,
$L(\phi) = \mathbb{E}_{(s, a, r, s', g) \sim D}\left[ \left( y - Q_\phi(s, a, g) \right)^2 \right]$ (10)
where $y = r + \gamma\, Q_{\phi^-}(s', \mu_{\theta^-}(s', g), g)$, and in a modified update for the actor,
$\nabla_\theta J(\theta) = \mathbb{E}_{(s, g) \sim D}\left[ \nabla_a Q_\phi(s, a, g)\big|_{a = \mu_\theta(s, g)} \nabla_\theta \mu_\theta(s, g) \right]$ (11)
In [31], a value function approximator that generalizes over both states and goals is considered. Two architectures for approximating the value function are depicted in Figure 2: one directly concatenates the state and goal and outputs the value for the concatenated input, and the other processes the state and goal separately before combining their respective outputs to predict the value. The proposed algorithm is based on the architecture that directly concatenates the state and goal, since it is simple and can be easily integrated into various RL algorithms.

Hindsight Experience Replay
Despite a wide range of advances in the application of Deep Reinforcement Learning, training an agent in an environment with sparse rewards remains a major challenge, especially in robotic tasks where the desired goal is challenging and the reward is sparse. A reward that is commonly used in a sparse reward environment is as follows:
$R(s_t, a_t, g) = \begin{cases} 0, & \text{if } \lVert s_t - g \rVert \le \rho \\ -1, & \text{otherwise} \end{cases}$ (12)
where $s_t$, $g$, and $\rho$ are respectively the current state, the desired goal, and a tunable hyperparameter. Intuitively, this function rewards the agent if its current state is within the threshold $\rho$ of the desired goal.
In this environment, the agent will not receive the positive reward "0" for a long period and will have difficulty learning. As a result, the agent faces the sample inefficiency problem, which is one of the main concerns in RL. Andrychowicz et al. [2] propose a simple yet effective method referred to as Hindsight Experience Replay (HER), which relabels the goals of the existing experiences in the replay buffer to overcome the sample inefficiency problem in sparse reward environments.
HER duplicates the episodes in the replay buffer, but with goals chosen such that the episode is either successful or becomes successful in its future steps. In [2], the authors propose four strategies for selecting a visited state as the hindsight goal: final, future, episode, and random. In the final strategy, the hindsight goal is the terminal state of the sampled episode.
In the future strategy, the hindsight goal is selected randomly from the states at future time steps with respect to the chosen experience. In the episode strategy, the hindsight goal is an arbitrary state from the same episode as the sampled experience, regardless of whether it is observed before or after the chosen experience. In the random strategy, the hindsight goal is sampled randomly from the visited states encountered so far in the whole replay buffer. All strategies are effective, but the future strategy outperforms the other three. In this paper, only the future strategy is considered, and all experiments are conducted based on this strategy. HER can be integrated with any off-policy RL algorithm, assuming that a corresponding goal can be found for any state in the state space. This assumption is also satisfied in our environment.
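The sparse reward described above can be sketched as a short function. The name `sparse_reward`, the use of a Euclidean distance between positions, and the default tolerance of 0.05 (5 cm, the value quoted later for the experiments, assuming positions in meters) are assumptions made for this illustration.

```python
import math


def sparse_reward(achieved, goal, rho=0.05):
    """Return 0 if `achieved` lies within distance rho of `goal`, else -1.

    Positions are sequences of coordinates; the Euclidean norm is an
    assumption of this sketch.
    """
    return 0.0 if math.dist(achieved, goal) <= rho else -1.0


near = sparse_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.04))  # inside the tolerance
far = sparse_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.10))   # outside the tolerance
```

Because the function only needs a position and a goal, it can be re-evaluated for any relabeled goal, which is what hindsight relabeling relies on.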
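The future strategy with uniform sampling, which is the baseline that this paper later replaces with a prioritized choice, can be sketched as follows. The function name and the representation of an episode as a list of achieved states are assumptions for illustration.

```python
import random


def sample_future_goal(achieved_states, j, rng=random):
    """Uniformly pick a hindsight goal among the states visited after step j.

    `achieved_states[i]` is the state reached at step i of one episode
    (assumed layout); requires j to be before the last step.
    """
    i = rng.randrange(j + 1, len(achieved_states))
    return i, achieved_states[i]


episode = ["s0", "s1", "s2", "s3", "s4"]
index, goal = sample_future_goal(episode, j=1, rng=random.Random(0))
```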

Related works
In sparse reward environments, naive exploration by adding noise to the action or by using an ε-greedy policy is bound to fail at fully exploring the continuous, high-dimensional action space needed to learn an appropriate policy. For a long horizon, the task becomes exponentially more difficult. Furthermore, the number of samples that can be collected in real-world tasks is generally limited, so sample efficiency becomes critical. Thus, exploring diverse outcomes and learning policies in a sparse reward environment is challenging.
Nair et al. (2018) [24] approach this problem using demonstrations combined with HER. Here, the demonstrations are used as a guide for exploration. The proposed algorithm attempts to teach the agent to learn gradually from easy targets to challenging targets, which is a form of curriculum learning. Florensa et al. (2017) [10] propose Reverse Curriculum Generation. In this method, the agent starts from an initial state that is right next to the goal state, then two steps away from the goal, and so on. The authors of [5] attempt to incorporate HER with well-known Imitation Learning (IL) algorithms such as Behavior Cloning (BC) [28] or Generative Adversarial Imitation Learning (GAIL) [14].
In HER, the key assumption is that the goals are sampled from a set of states that need to be visited. Nevertheless, in real-world problems such as energy-efficient transport or robotic trajectory tracking, rewards are often complex combinations of desirable objectives rather than sparse objectives. Eysenbach et al. (2020) [7] propose inverse RL to generalize goal-relabeling techniques to arbitrary classes of reward functions, ranging from multi-task settings and discrete sets of rewards to linear reward functions. In this method, inverse RL is used as a relabeling strategy to infer the most likely goal given a trajectory. Similar work is proposed by Li et al. (2020) [19], who use approximate inverse RL to sample a suitable goal for a given trajectory. Pong et al. [25,29] generalize the state in multi-goal RL to raw pixels and propose to sample goals from a VAE prior. These methods can be considered a different family of strategies for generating hindsight goals.
In this paper, the efficiency of experience replay using replay buffer is studied, and an algorithm is proposed to improve the sample efficiency further. HER is an effective algorithm for reducing sample complexity in a sparse reward environment. However, the uniform sampling from future visited states for hindsight goals that HER is based on could be improved for obtaining better sample efficiency.
HER replaces the actual goal of an experience with a randomly sampled state visited in the future, without considering the significance of the sampled state [27]. The recent method referred to as Energy-Based Prioritization (EBP) [41] proposes to prioritize episodes in the replay buffer. Specifically, based on the work-energy principle in physics, the authors introduce an episode energy function that measures the importance of an episode. Subsequently, higher-energy episodes are replayed more frequently. The episode energy function can be considered an alternative to the TD error for measuring the significance of an episode. However, EBP samples goals uniformly within a sampled episode and thus does not consider the importance of future visited states. Moreover, the authors also compare their method with an extension of Prioritized Experience Replay (PER) [32], which prioritizes the experiences in the replay buffer rather than the episodes. However, after an experience is sampled, the goal is still sampled randomly from future time steps with respect to the chosen experience, which is sample-inefficient. Another approach, referred to as Maximum Entropy-Regularized Prioritization (MEP) [40], is proposed to encourage the policy to visit diverse goals. Specifically, the trajectories are prioritized during policy training such that the distribution of experienced goals in the replay buffer is as uniform as possible. However, the goals within a considered episode are still sampled uniformly. In this work, to address the sample inefficiency in HER, we propose a two-step ranking that uses the TD error to judge which goals and episodes are valuable for learning.

Hindsight Goal Ranking
To learn in a sparse reward environment, Hindsight Experience Replay (HER) relabels the goal of a failed episode with one of the future visited states such that the episode becomes successful. However, the hindsight goal is sampled uniformly from the future visited states attained in the episode, which may not be the most beneficial choice for sample efficiency. In this section, we answer the questions "which hindsight goals should be generated?" and "how should the hindsight goals be selected within an episode?". An extension of the HER algorithm is developed based on Deep Deterministic Policy Gradient (DDPG).

Two-step prioritized hindsight goal generation
In HER, the episodes are initially sampled randomly from the experience buffer for replay. Subsequently, the hindsight goal of a sampled episode is selected by uniformly sampling a future visited state. The replay process does not try to determine which goals and episodes are more valuable for learning [27]. In [41], Energy-Based Prioritization (EBP) prioritizes an episode in the replay buffer by its energy. However, the future visited states in an episode are still sampled uniformly to generate a hindsight goal. Also in [41], the extension of Prioritized Experience Replay (PER) prioritizes experiences instead. However, within the episode containing the chosen experience, the hindsight goal is still selected randomly from the set of future visited states. Instead of uniformly sampling future visited states for relabeling as hindsight goals, this paper improves sample efficiency by prioritizing the future visited states within an episode according to the magnitude of the TD error $\delta$, computed by Eq. (13).
This criterion has been considered a proxy measure of how much the RL agent can learn from an experience: concretely, the TD error measures how far the value is from its next-step bootstrap estimate [1,32]. Using the TD error for prioritization is particularly well suited to DDPG [20], which already needs to compute the TD error to update the parameters of the Q-value function. Specifically, in the replay buffer, the importance of an experience paired with a future visited state, which is a candidate hindsight goal, is ranked based on the magnitude of its TD error. Note that since we follow the future strategy [2], a future visited state should be ranked relative to the others in the same episode. Thus, in the considered episode, a future visited state with a larger TD error magnitude is labeled as the hindsight goal more frequently, so that for each sampled episode the agent attempts to maximize what it can learn from an experience. Assume that a sampled episode has $(H - 1)$ experiences. For the $j$-th experience of the episode, the $i$-th future visited state, where $(j + 1) \le i \le H$, is sampled with the following probability:
$P(j, i) = \frac{|\delta_{ji}|^{\alpha}}{Z}$ (14)
where the normalization term is $Z = \sum_{j=1}^{H-1} \sum_{i=j+1}^{H} |\delta_{ji}|^{\alpha}$, and $\delta_{ji}$ is the TD error of the $j$-th experience with the $i$-th visited state as its goal. The exponent $\alpha$ is a hyper-parameter that is set before sampling.
Figure 3: Schematic view of HGR. The DDPG and AGENT components contain learnable modules.
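For orientation, one form of the goal-conditioned TD error that is consistent with the critic target defined for the multi-goal setting is sketched below. This is an assumption based on the target $y = r + \gamma Q_{\phi^-}(s', \mu_{\theta^-}(s', g), g)$ used earlier and is not a verbatim reproduction of the referenced equation.

```latex
% Assumed form of the TD error for an experience (s, a, r, s') relabeled with goal g,
% using the target networks Q_{\phi^-} and \mu_{\theta^-} defined earlier:
\delta = r + \gamma \, Q_{\phi^-}\!\left(s', \mu_{\theta^-}(s', g), g\right) - Q_{\phi}(s, a, g)
```

Under this reading, $\delta$ is simply the difference $y - Q_\phi(s, a, g)$ whose square appears in the critic loss, so it is available at no extra cost during a DDPG update.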
When α = 0, uniform sampling is performed, and the proposed algorithm becomes the vanilla HER.
To guarantee that the probability of sampling any of the visited states is non-zero, a small positive constant $\epsilon$ is added to $|\delta_{ji}|$.
The agent can achieve better sample efficiency by maximizing what it can learn from the whole replay buffer, which requires prioritizing the episodes as well as the goals. Episodes can be ranked in a similar manner to the goals, where the importance of an episode is measured by the average TD error of all experiences within the episode, i.e., over all possible combinations of a chosen experience with a future visited state. Let $K$ be the number of combinations of experiences and goals in an episode. Then, the priority of the $n$-th episode in the replay buffer is defined as the average TD error over the $K$ combinations, $\delta^{(n)} = \frac{1}{K} \sum_{k=1}^{K} |\delta_k|$, where $\delta_k$ is the TD error of the $k$-th experience-goal combination out of a total of $K$ combinations. Finally, the $n$-th episode is sampled with probability
$P(n) = \frac{|\delta^{(n)}|^{\alpha}}{Z}$ (15)
where the normalization term is $Z = \sum_{n} |\delta^{(n)}|^{\alpha}$. Here $\alpha$ determines how much prioritization is used: $\alpha = 0$ is equivalent to the uniform case (no prioritization), i.e., all episodes have the same probability of being sampled. A small positive number $\epsilon$ is also added to the priority to prevent zero probability.
The schematic view of HGR is shown in Figure 3. There are five components: (1) DDPG, which concurrently learns the action-value function $Q$ and a deterministic policy $\mu$; (2) the AGENT's behavior policy, which is a mixture of the deterministic policy $\mu_\theta$ and Gaussian noise. The behavior policy takes as input the current state $s_t$ and the pursued goal $g$, and produces an action $a_t$ to interact with (3) the ENVIRONMENT. Subsequently, the environment presents a new state $s_{t+1}$ to the agent. Immediately, the action $a_t$ and the new state $s_{t+1}$ are stored in (4) the REPLAY BUFFER $D$ of size $N$. When learning $Q$ and $\mu$, (5) HGR samples an episode and an experience with its hindsight goal from the distributions (15) and (14), respectively. This sampling process is performed $M$ times to obtain a batch of data for a DDPG update.
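The goal-sampling rule within one episode can be sketched as follows. The dictionary layout keyed by (experience index, goal index) and the function name are assumptions chosen for this illustration; the small constant is added to the TD-error magnitude before exponentiation.

```python
def goal_probabilities(td_errors, alpha, eps=1e-6):
    """Turn TD errors of experience-goal pairs into sampling probabilities.

    `td_errors` maps (j, i), with i > j, to the TD error of experience j
    relabeled with the state visited at step i (assumed layout).
    """
    scores = {pair: (abs(d) + eps) ** alpha for pair, d in td_errors.items()}
    z = sum(scores.values())  # normalization over all pairs of the episode
    return {pair: score / z for pair, score in scores.items()}


errors = {(1, 2): 1.0, (1, 3): -3.0, (2, 3): 0.0}
prioritized = goal_probabilities(errors, alpha=1.0)
uniform = goal_probabilities(errors, alpha=0.0)  # every pair gets 1/3
```

Setting `alpha=0.0` recovers uniform sampling over the pairs, matching the statement that the method then reduces to vanilla HER.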
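Episode-level prioritization can be sketched in the same spirit. This illustration assumes that the episode priority averages TD-error magnitudes and draws the episode with `random.choices`; the function names are hypothetical, and a large buffer would use a sum-tree rather than this linear-time draw.

```python
import random


def episode_priority(td_errors):
    """Average TD-error magnitude over the experience-goal pairs of an episode."""
    return sum(abs(d) for d in td_errors) / len(td_errors)


def sample_episode(priorities, alpha, eps=1e-6, rng=random):
    """Draw an episode index with probability proportional to (priority + eps)^alpha."""
    weights = [(p + eps) ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=1)[0]


priorities = [episode_priority([1.0, -3.0]), episode_priority([0.1, 0.1])]
chosen = sample_episode(priorities, alpha=0.6, rng=random.Random(0))
```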

Implementation:
There are a total of H(H + 1)/2 experience-goal combinations per episode in our experiment. The goals are sampled based on heuristics: the sampling complexity is O(n), and the complexity of updating a priority is O(1). The replay buffer, in contrast, is very large (millions of entries), so to sample from the distribution (15) we use a sum-tree data structure similar to that used in [32], where every parent node is the sum of its child nodes and the leaf nodes represent the priorities. Its complexity for both sampling and updating is O(log n).

Algorithm 1 (steps 7 to 17)
     ...
  7: Sample an action a_t using the behavioral policy from A:
  8:     a_t ← µ(s_t ‖ g) + N                                   ▷ ‖ denotes concatenation
  9: Execute the action a_t and observe a new state s_{t+1}
 10: Store the transition (s_t, g, a_t, s_{t+1}) in D with maximal priority δ = max
     ...
 12: if episode mod U ≡ 0 then
 13:     for k = 0 to M-1 do
 14:         Sample n-th episode for replaying based on P(n)     ▷ episode prioritization
 15:         Sample (s_j, g_j, a_j, s_{j+1}) and g_i based on P(j, i)   ▷ goal prioritization
 16:         r_j ← r(s_j, a_j, g_i)                               ▷ recalculate reward (HER)
 17:         Compute importance sampling weight by Eq. (18)
     ...
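The sum-tree mentioned above can be sketched as a compact array-based structure. This is an illustrative implementation written for this text, not the authors' code; the class and method names are assumptions.

```python
class SumTree:
    """Binary tree in which each parent stores the sum of its two children.

    Leaves hold priorities; both update and lookup cost O(log n).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # nodes[1] is the root; leaves occupy nodes[capacity : 2 * capacity].
        self.nodes = [0.0] * (2 * capacity)

    def update(self, index, priority):
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:  # refresh the sums on the path to the root
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find(self, mass):
        """Return the leaf index selected by `mass` in [0, total())."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity


tree = SumTree(capacity=4)
for idx, prio in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(idx, prio)
```

Drawing `mass` uniformly from `[0, tree.total())` and calling `tree.find(mass)` selects leaf `n` with probability proportional to its stored priority.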

Prioritization and Bias trade-off
The Q-value function is estimated based on stochastic updates from samples drawn from the replay buffer. When samples are drawn uniformly, the estimation is unbiased; however, when the samples are prioritized, the estimation becomes biased, because prioritization prevents the samples from being drawn from the distribution that defines the expectation of the Q-value function. Consequently, prioritization induces a bias in the estimation of the Q-value function [32] and therefore changes the solution to which the estimates converge. To overcome this problem, we can correct the bias using importance-sampling (IS) weights. Specifically, when updating the parameters of the approximated Q-value function using the experiences in sampled episodes, the corresponding gradient is scaled by the IS weight of the episode, which is defined in Eq. (16). This IS weight scales down the gradient of a sample that is updated frequently and leaves it unchanged when the sample is rarely updated (line 21 in Algorithm 1).
$w^{(n)} = \left( \frac{1}{N_e} \cdot \frac{1}{P(n)} \right)^{\beta}$ (16)
where $N_e$ is the number of collected episodes in the replay buffer and $\beta$ is a hyper-parameter that controls how much of the bias is corrected. If $\beta = 1$, the weight fully compensates for the non-uniform probability $P(n)$. In practice, we linearly schedule $\beta$, which starts from $\beta_0$ and ends at one during training. Here $\beta_0$ is a tunable parameter.
Annealing the bias by multiplying by the IS weight of the episode is an effective method for reducing the bias. However, the bias induced by the prioritization within an episode still remains. To compensate for this bias, a small modification is introduced. As mentioned in the previous section, there are H(H + 1)/2 combinations of experience-goal pairs. Therefore, the IS weight for the $i$-th goal with respect to the $j$-th experience in an episode is
$w_{ji} = \left( \frac{2}{H(H+1)} \cdot \frac{1}{P(j, i)} \right)^{\beta'}$ (17)
where $H$ is the horizon and $\beta'$ plays a role similar to $\beta$ in controlling the bias correction. The final IS weight that corrects the bias for an experience-goal pair is computed by
$w^{(n)}_{ji} = w^{(n)} \cdot w_{ji}$ (18)
For a new experience with an unknown TD error, the maximal priority, $P_t = \max_{i<t} P_i$, is assigned to guarantee that all experiences are replayed at least once. For more stable convergence, the IS weights are normalized by their maximum, $1 / \max w^{(n)}_{ji}$; thus the gradient is always scaled downwards. The detailed algorithm is presented in Algorithm 1.
HER is based on the assumption that the reward function is defined as $R(s \,\Vert\, g, a)$, where $\Vert$ denotes concatenation, which allows the agent to evaluate the reward for any state and goal, as done in the reward recalculation step of Algorithm 1. This assumption is not very restrictive and is satisfied in our environment. For example, in the reaching task, the reward function is defined through the Euclidean distance between the current end-effector position and the target position. Therefore, when collecting experience, we need not observe and store the reward in the replay buffer (lines 9 and 10 of Algorithm 1). We use ε-greedy to alternate between the uniform random policy and the behavior policy for better exploration. Furthermore, the behavior policy is a mixture of the deterministic policy and Gaussian noise (line 8 of Algorithm 1). For details regarding the noise and ε-greedy, see Section 5.2.
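The two importance-sampling corrections and their normalization can be sketched as follows. The function names are hypothetical, and the uniform reference probability for goals is taken to be one over the H(H + 1)/2 experience-goal pairs stated in the text.

```python
def episode_is_weight(p_episode, num_episodes, beta):
    """IS weight for an episode sampled with probability p_episode."""
    return (1.0 / (num_episodes * p_episode)) ** beta


def goal_is_weight(p_goal, horizon, beta_prime):
    """IS weight for an experience-goal pair sampled with probability p_goal."""
    num_pairs = horizon * (horizon + 1) / 2  # pair count as stated in the text
    return (1.0 / (num_pairs * p_goal)) ** beta_prime


def normalized_weights(weights):
    """Divide by the largest weight so every gradient is scaled downwards."""
    w_max = max(weights)
    return [w / w_max for w in weights]


# Combined weight for one sample: product of the episode and goal corrections.
w = episode_is_weight(0.5, num_episodes=4, beta=1.0) * goal_is_weight(0.2, horizon=4, beta_prime=1.0)
batch_weights = normalized_weights([w, 1.0])
```

A sample drawn with exactly the uniform probability receives weight one, so the correction only affects samples that prioritization over- or under-represents.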

Experiments
This section describes the robot simulation benchmark used for evaluating the proposed method. Then, we investigate the following questions:
• Does the hindsight goal ranking benefit HER?
• Does the hindsight goal ranking improve the sample efficiency in the robot benchmark environment?
• What is the benefit of prioritizing the goals versus episodes?

Environmental benchmarking
HGR is evaluated on the 7-DOF Fetch Robotic-arm 1 simulations environment provided by OpenAI Gym 2 [27,3], using MuJoCo physics engine [37]. The robotic arm environment is based on currently available hardware 3 and is designed as a standard benchmark for Multi-goal RL [27]. In this environment, the agent is required to complete several manipulation tasks with different objectives in each scenario. A 7-DOF Robotics arm with parallel grippers is used to manipulate an object placed on a table in front of the robot, as shown in Figure 4. There are four different tasks of the robotic arm with the difficulty level increasing: 1. Reaching (FetchReach-v1): The robot arm try to move its gripper from initial position to a desired target position. This target is located arbitrarily in 3D space. 3. Pushing (FetchPush-v1): A box is placed randomly on a table in front of the robot, and the task is to move it to a target location on the table. The robot fingers are locked to prevent grasping. The learned behavior is usually a mixture of pushing and rolling.
4. Sliding (FetchSlide-v1): A puck is placed on a long slippery table, and the target position is outside of the robot's reach so that it has to hit the puck with such a force that it slides and then stops at the target location due to the friction.
The state is a vector consisting of the position, orientation, linear velocity, and angular velocity of all robot joints and objects. A goal represents the desired position and orientation of an object. There is an acceptable range around the desired position and direction. We use a fully sparse reward for all tasks like Eq. (12) with a tolerance range of ρ in our environment is 5 cm. If the object (or end-effector) is not in range of the goal, the agent receives a negative reward, "-1"; otherwise, the positive reward "0" is received.
Performance: The proposed method (HER with HGR) is compared against the following baselines: vanilla HER [2], HER with Energy-Based Prioritization (EBP) [41], HER with Maximum Entropy-Regularized Prioritization (MEP) [40], and HER with one-step prioritization (PER) [32,41]. We run the experiments in all four robotic object-manipulation environments with different random seeds. To evaluate sample efficiency, we compare how many samples the agent needs to reach mean success rates of 50%, 75%, and 95%. For FetchSlide, since it is hard to reach high performance without demonstrations, we evaluate at success rates of 25% and 45%. We also compare the final success rate at the end of training across methods. In each experiment, the agent is evaluated ten times every epoch, and the success rate is the average over these ten evaluations. Each result is averaged over five different random seeds, and the shaded area in the plots represents the standard deviation. In our experiments, we use 1 CPU instead of the 19 CPUs used in previous work [2,41].
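As an illustration of this sample-efficiency measure, the sketch below averages per-seed success-rate curves and reports the first epoch at which the mean curve reaches a given threshold. The helper names `mean_curve` and `first_epoch_reaching` are hypothetical, and counting in epochs rather than raw samples is a simplifying assumption.

```python
def mean_curve(curves_per_seed):
    """Average equal-length per-seed success-rate curves, epoch by epoch."""
    n_seeds = len(curves_per_seed)
    return [sum(values) / n_seeds for values in zip(*curves_per_seed)]


def first_epoch_reaching(success_rates, threshold):
    """Return the 1-based epoch at which the success rate first reaches
    `threshold`, or None if the threshold is never reached."""
    for epoch, rate in enumerate(success_rates, start=1):
        if rate >= threshold:
            return epoch
    return None
```

The ratio of these epoch counts between two methods then gives a speed-up factor of the kind reported in the comparison.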

Training details
We used identical architectures for the actor and critic networks: a three-layer network in which each layer consists of 256 units with ReLU as the activation function. To train HGR, the Adam [17] optimizer was used with a learning rate of 10^-3 for both the actor and critic networks. We updated the target Q function and target policy using a moving average with a factor of 0.95. In the experiments, we used a discount factor γ of 0.98. We applied L2 regularization to the actions to prevent the predicted actions from becoming too large. We used tanh as the activation function of the final layer so that the output values lie between -1 and 1. For exploration, we added zero-mean Gaussian noise with a standard deviation of 0.2 to the deterministic policy. We also used the epsilon-greedy algorithm, with a factor ε = 0.3, to alternate between a uniform random policy and the behavior policy for better exploration.
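The target update and exploration scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the factor 0.95 weights the old target parameters, that parameters are plain NumPy arrays, and that noisy actions are clipped back to [-1, 1]; the names `soft_update` and `explore` are hypothetical.

```python
import numpy as np

TAU = 0.95       # moving-average factor for the target networks
NOISE_STD = 0.2  # standard deviation of the Gaussian exploration noise
EPSILON = 0.3    # probability of taking a uniformly random action


def soft_update(target_params, online_params, tau=TAU):
    """Return tau * target + (1 - tau) * online for each parameter array."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


def explore(policy_action, rng, noise_std=NOISE_STD, epsilon=EPSILON):
    """Turn a deterministic policy output into an exploratory action in [-1, 1]."""
    policy_action = np.asarray(policy_action, dtype=float)
    if rng.random() < epsilon:
        # With probability epsilon, ignore the policy and act uniformly at random.
        return rng.uniform(-1.0, 1.0, size=policy_action.shape)
    # Otherwise, perturb the policy action with zero-mean Gaussian noise.
    noisy = policy_action + rng.normal(0.0, noise_std, size=policy_action.shape)
    return np.clip(noisy, -1.0, 1.0)
```

A call such as `explore(actor_output, np.random.default_rng(0))` would be made once per environment step during data collection.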

Comparison with recent methods
In the considered tasks, the goal can be changed to an arbitrary location at the beginning of a new episode and is kept fixed within the episode. Hence, to solve these tasks, the RL algorithm must be able to adapt to multiple goals. Figure 5 shows the evolution of the success rate in the four benchmark environments. Compared with the baselines vanilla HER and HER with PER, HER combined with HGR converges significantly faster in all four tasks. Compared with Energy-Based Prioritization (EBP; see footnote 5), our method performs better in FetchSlide and FetchPush, and slightly better in FetchPickAndPlace. Compared with HER with Maximum Entropy-Regularized Prioritization (MEP), our method performs better in all environments.
The final performance of the trained agents is shown in Table 3. In FetchReach, all methods achieve the maximum performance. In the other three environments, HER+HGR outperforms the other methods. Overall, DDPG combined with HGR uses the replay buffer more sample-efficiently: it is 2.9× faster than vanilla HER, and 1.3× and 1.8× faster than EBP [41] and MEP [40], respectively. Our two-step ranking method thus benefits HER and uses samples more effectively than the baseline methods in the robotic environments. However, ranking the replay buffer incurs overhead: the training time is roughly three times longer than with Energy-Based Prioritization and four times longer than with vanilla HER.
Figure 5: The evolution of the mean test success rate across five seeds, with standard deviation, for all four tasks simulated on the Fetch robot.
Footnote 5: We used the authors' code at https://github.com/ruizhaogit/EnergyBasedPrioritization

Ablation studies
To investigate the relative benefit of goal prioritization and episode prioritization, ablation studies are conducted on all four benchmark environments: FetchReach, FetchPush, FetchPickAndPlace, and FetchSlide. To disable episode prioritization, we set the episode-level α and β to 0; to disable goal prioritization, we set the goal-level α and β to 0. The evolution of the success rate under the different prioritization types is shown in Figure 6. In FetchReach and FetchSlide, the benefit of each ranking is unclear, as the former is an easy task to learn and the latter is difficult without the support of demonstrations. In FetchReach, each type of prioritization leads to better performance than vanilla HER. In FetchSlide, ranking with only the prioritized goal is even worse than no prioritization, but the combination of the two steps achieves better results. In FetchPush, at the 100th episode, corresponding to one million time steps, the success rates of "full", "prioritized goal", "prioritized episode", and no prioritization are 99.05, 89.74, 71.70, and 31.59 percent, respectively. This indicates that ranking the goals is more important than ranking the episodes in the FetchPush environment. In contrast, episode prioritization is more important in FetchPickAndPlace. Specifically, at the 100th episode in the FetchPickAndPlace environment, the success rates of "full", "prioritized goal", "prioritized episode", and no prioritization are 99.40, 43.25, 70.13, and 27.02 percent, respectively. Overall, in comparison with vanilla HER, the two-step ranking of HGR at both the episode level and the goal level improves sample efficiency and final performance. How large the benefit is depends on the task.
Figure 6: Ablation study for three environments. Legend: Vanilla; Ours (w/ prioritized episode); Ours (w/ prioritized goal); Ours (full). "full" means using two-step ranking, "w/ prioritized goal" means ranking only goals and sampling episodes uniformly, and "w/ prioritized episode" means ranking only episodes and sampling goals uniformly.
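The effect of switching off one level of prioritization can be illustrated with a simplified sketch of two-step sampling. It assumes, following PER, that α acts as a priority exponent so that α = 0 yields uniform sampling; it flattens each episode into a single array of absolute TD errors, one per candidate hindsight goal, and it omits the importance-sampling correction governed by β. The function names and the small constant `eps` are illustrative, not taken from the paper.

```python
import numpy as np


def priority_probs(td_errors, alpha, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha.

    With alpha = 0 every entry gets the same probability (uniform sampling).
    """
    scaled = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return scaled / scaled.sum()


def two_step_sample(td_errors_per_episode, rng, alpha_episode=1.0, alpha_goal=1.0):
    """Sample an episode index, then a hindsight-goal index within that episode."""
    # Step 1: episode priority is the average absolute TD error of its experiences.
    episode_priorities = [np.mean(np.abs(td)) for td in td_errors_per_episode]
    episode_probs = priority_probs(episode_priorities, alpha_episode)
    episode = int(rng.choice(len(episode_probs), p=episode_probs))
    # Step 2: within the chosen episode, goals with larger TD error are more likely.
    goal_probs = priority_probs(td_errors_per_episode[episode], alpha_goal)
    goal = int(rng.choice(len(goal_probs), p=goal_probs))
    return episode, goal
```

Under these assumptions, setting `alpha_episode=0.0` corresponds to the "w/ prioritized goal" variant, and setting `alpha_goal=0.0` corresponds to the "w/ prioritized episode" variant.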

Conclusion
In this paper, a prioritized replay method for the multi-goal setting in sparse reward environments is considered. Inspired by the prioritization method in PER, which was proposed for a single goal and a discrete action space, prioritization for multi-goal environments with continuous action spaces is studied. The proposed method divides the prioritized sampling into two steps: first, an episode is sampled according to the average TD error of the experiences with hindsight goals within the episode; then, for the sampled episode, experiences with hindsight goals leading to larger TD error are sampled with higher probability. The empirical results show that HER with HGR significantly improves sample efficiency in multi-goal RL with sparse rewards compared with vanilla HER, and that its performance is marginally higher than that of state-of-the-art algorithms. However, HER with HGR requires O(log n) computation time to search and update the priorities.
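One common way to obtain O(log n) search and update of priorities is a sum tree, as widely used for prioritized replay. The sketch below is an illustration of that general data structure under this assumption; the text above does not specify which structure HGR uses, and the class name `SumTree` and its methods are hypothetical.

```python
class SumTree:
    """Binary tree whose internal nodes store the sum of their children's priorities."""

    def __init__(self, capacity):
        # Round the number of leaves up to a power of two for simple indexing.
        self.capacity = 1
        while self.capacity < capacity:
            self.capacity *= 2
        # nodes[1] is the root; leaves occupy nodes[capacity : 2 * capacity].
        self.nodes = [0.0] * (2 * self.capacity)

    def update(self, index, priority):
        """Set the priority of leaf `index` and refresh its ancestors: O(log n)."""
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        """Sum of all priorities."""
        return self.nodes[1]

    def find(self, mass):
        """Return the leaf index whose cumulative priority range contains `mass`: O(log n)."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity
```

Drawing `mass` uniformly from the interval [0, total()) and calling `find(mass)` samples a leaf with probability proportional to its priority.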