Policy Return: A New Method for Reducing the Number of Experimental Trials in Deep Reinforcement Learning

Using the same algorithm and hyperparameter configuration, deep reinforcement learning (DRL) can derive drastically different results from multiple experimental trials, and many of these results are unsatisfactory. Because of this instability, researchers have to perform many trials to validate an algorithm or a set of hyperparameters in DRL. In this article, we present the policy return method, a new design for reducing the number of trials required when training a DRL model. This method allows the learned policy to return to a previous state when it becomes divergent or stagnant at any stage of training. When returning, a certain percentage of stochastic data is added to the weights of the neural networks to prevent a repeated decline. Extensive experiments on challenging tasks and various target scores demonstrate that the policy return method can bring about a 10% to 40% reduction in the required number of trials compared with the corresponding original algorithms, and a 10% to 30% reduction compared with state-of-the-art algorithms.

I. INTRODUCTION
Deep reinforcement learning is a combination of deep learning and reinforcement learning, enabling it to possess both the perception ability of deep learning and the policy-making ability of reinforcement learning [1], [2]. In recent years, models based on deep reinforcement learning have achieved excellent performance across various domains, for instance, beating top human players in Go matches [3], [4], controlling the operation of complex machinery [5]-[8], allocating network resources [9], and improving wireless communication technologies [10], [11].
However, while deep reinforcement learning has made remarkable achievements, it has long faced the issue of requiring many trials due to unstable results [12], [13]. With appropriate multilayer perceptron function approximators and tuned hyperparameters, supervised learning can reliably acquire a satisfactory result, and if the training process is performed multiple times, the learning curves acquire similar distributions. These phenomena, however, do not occur with deep reinforcement learning. Most often, when using the same algorithm and hyperparameters, the results of multiple training processes differ widely. Even a DRL model that once produced an excellent policy could, in all probability, result in nonconvergent learning on another trial.

(The associate editor coordinating the review of this manuscript and approving it for publication was Victor S. Sheng.)
When training a DRL model, many researchers may encounter the dilemma that when using the same algorithm and hyperparameters, in some trials, the learning curves cannot converge after a prolonged training process or suddenly diverge from a previous convergence, but in other trials, they converge well. As a result, researchers have to perform many trials for a particular configuration and then choose the relatively best policy from the results.
There are many reasons for the instability of DRL. Strong correlations among state sequences easily lead to divergence, and the data distribution that changes as the algorithm learns new behaviors also causes trouble for DRL [12]. Experience replay and distributed training are used to reduce the adverse impacts caused by highly correlated samples and changes in the data distribution [14]-[17]. However, in our experience, a DRL algorithm using these methods may still have a high chance of diverging during the training process, and we highlight this in experiment 1.

(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
For DRL, researchers often use random seeds to generate stochastic states at the beginning of an episode or add randomness to the action selection procedure for further exploration [18]. At the same time, a DRL model is often initialized by random weights within a certain range or obeying a given distribution. Environmental stochasticity and stochasticity in the learning process are crucial factors that lead to various results [12].
In recent years, researchers have made many efforts to make DRL competent for highly challenging tasks by optimizing algorithms [19]-[22], but few have paid attention to the large number of trials caused by unstable results. This article seeks to reduce the number of required trials by introducing a method that allows the learned policy to return to a previous better state to increase the chance of convergence; we call this the policy return method. When returning, a certain percentage of stochastic weights is added to the DRL model to improve its exploration ability and avoid a repeated decline. We conducted experiments on challenging tasks, and the results indicate that our proposed method can effectively reduce the number of trials when training a DRL model.
The remainder of the article is organized as follows. Section II introduces the technical background of this article. In Section III, we detail the policy return method. In Section IV, we describe detailed experimental settings, the results and discussions. Finally, we draw a conclusion from our research in Section V.

II. TECHNICAL BACKGROUND
In this section, we introduce reinforcement learning, deep reinforcement learning and some academic concepts that enable the comprehension of subsequent research procedures. A few DRL algorithms that are used in our experiments are also introduced.

A. RL AND DRL
Reinforcement learning (RL) is an area of machine learning that learns policy knowledge by exploring the environment [23]. RL follows the discounted Markov decision process (S, A, γ, P, r) [24]. At step t during training, the algorithm chooses an action a_t ∈ A according to its policy π_θ(a|s_t) based on the current state s_t ∈ S, and then the environment offers a reward r(s_t, a_t) and the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). The discounted cumulative reward is defined as

R_t = \sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k}),   (1)

and the action value function Q(s, a) is defined as

Q(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a].

Traditional RL algorithms often use a Q-table to record the Q-values that represent the learned policy, updated obeying the Bellman equation [25]. Q-learning [26], a classic RL algorithm, is updated by

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],

where \alpha is the learning rate. In 2013, DeepMind utilized a multilayer neural network in place of the Q-table, calling it a Q-network [1]; this was a seminal work in the field of DRL. By differentiating the loss function with respect to the weights θ, the gradient of the Q-network is defined as

\nabla_{\theta} L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right) \nabla_{\theta} Q(s, a; \theta) \right].

With deep neural networks, reinforcement learning can handle diversified state data and even pixelated image information, and the networks also give RL the capacity to handle complex continuous tasks.
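To make the tabular update concrete, the following sketch applies the Q-learning rule above to a toy Q-table; the table sizes, reward, and step values are illustrative assumptions, not taken from this article.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((3, 2))                 # toy table: 3 states, 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])                       # 0.1 after one update from an all-zero table
```

With a neural network, the same temporal-difference target drives a gradient step on the Q-network weights instead of an in-place table write.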

B. DEEP DETERMINISTIC POLICY GRADIENT
Deep deterministic policy gradient (DDPG) [15] is a DRL algorithm that uses the actor-critic structure [27], and it generates deterministic actions rather than a policy distribution. The DDPG algorithm has four networks: an actor µ(s|θ^µ), a critic Q(s, a|θ^Q), a target actor µ′(s|θ^{µ′}) and a target critic Q′(s, a|θ^{Q′}). The weights of the target networks are updated by slowly tracking the learned networks, θ′ ← τθ + (1 − τ)θ′, with τ ≪ 1. The critic Q(s, a|θ^Q) is optimized by minimizing the loss

L(\theta^Q) = \mathbb{E}\left[ \left( Q(s_t, a_t|\theta^Q) - y_t \right)^2 \right], \quad y_t = r(s_t, a_t) + \gamma Q'(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})|\theta^{Q'}),

and the actor policy µ(s|θ^µ) is updated by the sampled policy gradient

\nabla_{\theta^\mu} J \approx \mathbb{E}\left[ \nabla_a Q(s, a|\theta^Q)\big|_{a=\mu(s_t)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_t} \right].

DDPG is an outstanding DRL algorithm. In experiment 3, we integrate the policy return method into DDPG and test it on a challenging task to verify whether our proposed method works.
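As an illustration of the soft target update θ′ ← τθ + (1 − τ)θ′ described above, here is a minimal sketch with network weights represented as plain NumPy arrays (an assumption made for brevity; real DDPG implementations update framework tensors layer by layer).

```python
import numpy as np

def soft_update(target_weights, learned_weights, tau=0.005):
    """Soft target update: theta' <- tau*theta + (1 - tau)*theta', per weight array."""
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_weights, learned_weights)]

theta = [np.ones((2, 2))]            # learned network weights (toy values)
theta_target = [np.zeros((2, 2))]    # target network starts elsewhere
theta_target = soft_update(theta_target, theta, tau=0.005)
print(theta_target[0][0, 0])         # 0.005: the target drifts slowly toward theta
```

Because τ ≪ 1, the target networks change slowly, which is what stabilizes the bootstrapped critic targets.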

C. PROXIMAL POLICY OPTIMIZATION ALGORITHMS
Proximal policy optimization (PPO) algorithms [28] are policy gradient methods for reinforcement learning that optimize a ''surrogate'' objective function using stochastic gradient ascent. Similar to TRPO [29], PPO can choose an adaptive step size to update its policy while preserving learning stability, and it is easier to implement than TRPO. The newest PPO algorithms use a clipping method instead of the Kullback-Leibler divergence to restrict the step size of the update [28], and the loss function for updating the policy in PPO is

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t \right) \right].

Here, r_t(θ) denotes the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t), so r_t(θ_old) = 1, and Â_t is the advantage estimate, defined as

\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T),

where V(s_t) is the state value provided by the critic network.
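The clipped surrogate objective can be sketched in a few lines; the ratios and advantages below are made-up numbers for illustration, not values from the experiments.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L^CLIP: mean of min(r*A, clip(r, 1-eps, 1+eps)*A) over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.8, 1.0, 1.5])    # pi_theta(a|s) / pi_theta_old(a|s)
adv = np.array([1.0, -0.5, 2.0])     # advantage estimates A_t
obj = ppo_clip_objective(ratio, adv) # the 1.5 ratio is clipped to 1.2 before averaging
```

The `min` keeps the objective pessimistic: a ratio that has moved too far from 1 can no longer increase the surrogate, which bounds the effective step size.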
PPO performs very well on continuous tasks and has become the main DRL algorithm of OpenAI. In experiment 2, we compare the original PPO with PPO integrated with the policy return method on a continuous control task.

D. DISTRIBUTED PPO
Distributed PPO (DPPO) [17] is a modified PPO algorithm that uses multiple agents and multiple threads to speed up the training and reduce the correlations among samples. As Fig.1 shows, distributed PPO has some worker agents that learn in different environments and transmit learned policy gradients to the master agent. Every environment runs in an individual thread, and the weights of workers are updated in accordance with those of the master at regular intervals. In this study, we perform a few tests on distributed PPO in experiment 1 to observe the various results obtained based on a certain DRL algorithm and hyperparameter configurations.

III. POLICY RETURN METHOD
The policy return method makes copies of the networks in the original DRL algorithms, so it has two sets of networks with one called the optimal network and the other called the exploratory network. The optimal network is used to record the current optimal policy, and it is also the final policy for learning. The exploratory network is used to learn in the environment and transmit a better policy π θ e to the optimal network. If the performance of the exploratory network deteriorates or cannot surpass that of the optimal network after a period of training, we make the exploratory network return to the policy of the optimal network by copying weights θ o to θ e and adding a certain percentage of the stochastic data σ to the exploratory network. Fig.2 schematically shows the policy transmission between the two networks.

A. RECORDING THE OPTIMAL POLICY
In DRL, any tentative action reaps a reward from the environment, and researchers often use the episode reward (EPR), the sum of the rewards over all steps in an episode, to represent the performance of a learning model. The EPR is calculated by

EPR = \sum_{s=1}^{M} r_s,

where M is the maximum step limit of an episode and s is the index of the learning steps. At time t, let A denote the average of the latest K EPRs of the exploratory network:

A = \frac{1}{K} \sum_{i=t-K+1}^{t} EPR_i.

During training, the first A is calculated when the K-th episode is completed. Let A_max denote the current maximum of A. A is updated every time an episode is completed, and if A_max < A, we set A_max = A and copy the weights of the exploratory network to the optimal network (θ_o ← θ_e), so that the optimal network always records the current optimal policy.
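The recording rule of this subsection can be sketched as follows; the `OptimalPolicyRecorder` class name and the dict-of-floats stand-in for network weights are our own illustrative assumptions.

```python
from collections import deque
import copy

class OptimalPolicyRecorder:
    """Track the average of the latest K EPRs and snapshot the exploratory
    weights (theta_o <- theta_e) whenever a new maximum average appears."""

    def __init__(self, K):
        self.eprs = deque(maxlen=K)      # latest K episode-rewards
        self.A_max = float("-inf")
        self.optimal_weights = None

    def record(self, epr, exploratory_weights):
        self.eprs.append(epr)
        if len(self.eprs) < self.eprs.maxlen:
            return False                 # A is only defined after K episodes
        A = sum(self.eprs) / len(self.eprs)
        if A > self.A_max:
            self.A_max = A
            self.optimal_weights = copy.deepcopy(exploratory_weights)
            return True
        return False

rec = OptimalPolicyRecorder(K=3)
for epr in [1.0, 2.0, 3.0, 4.0]:
    rec.record(epr, {"w": epr})          # toy "weights" tagged with the episode's EPR
print(rec.A_max)                         # 3.0 = mean of the latest three EPRs [2, 3, 4]
```

A deep copy is important here: the optimal network must keep a frozen snapshot, not a live reference to the exploratory weights.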

B. POLICY RETURN
In this research, if the rewards of the exploratory network cannot surpass those of the optimal network for C consecutive episodes, the exploratory network is forced to return to the optimal network, which holds the current optimal policy. The judgment for return is

A \leq A_{max} \ \text{for} \ C \ \text{consecutive episodes},   (13)

where C is the maximum period of stagnation allowed. Fig.3 shows the process of policy return: the green points mark where the current optimal policy is recorded, and the pink points mark the moments of return. After a decline lasting C episodes, at a pink point, the policy returns to the latest recorded point and continues to learn. The policy may not improve immediately after the return; here we only describe the probable effect of a policy return. The actual experimental curves for policy return are shown in Fig.6 and Fig.8 in Sec.IV.
To avoid repeating a previous decline and to improve the exploration ability of the model, we also add a certain percentage of stochastic data σ to the exploratory network. Stochastic data with the same shape as the weights θ_e are easy to acquire by reinitializing the exploratory network, so σ = θ_e^init, and the return method is defined as

\theta_e \leftarrow (1 - \phi)\,\theta_o + \phi\,\sigma,

where ϕ is the return ratio factor that controls the percentage of stochastic data, with ϕ ≪ 1. Algorithm 1 shows the core policy return method in pseudocode.
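A minimal sketch of the return step, assuming the combination θ_e ← (1 − ϕ)θ_o + ϕ θ_e^init (our reading of the return rule, since the original equation is not fully legible here) and representing weights as NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def policy_return(theta_o, phi=0.05):
    """Blend the stored optimal weights with a fraction phi of freshly
    re-initialized (stochastic) weights: theta_e <- (1 - phi)*theta_o + phi*sigma."""
    sigma = [rng.standard_normal(w.shape) for w in theta_o]  # sigma = theta_e_init
    return [(1.0 - phi) * wo + phi * s for wo, s in zip(theta_o, sigma)]

theta_o = [np.full((2, 2), 0.5)]   # stand-in for the optimal network's weights
theta_e = policy_return(theta_o)   # exploratory network after the return
```

With ϕ ≪ 1 the returned policy stays close to the optimal one, while the small stochastic component perturbs it enough to avoid retracing the same decline.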

IV. EXPERIMENTS
A. EXPERIMENT 1: OBSERVING THE DRL RESULTS FROM TRIALS WITH THE SAME ALGORITHM AND HYPERPARAMETERS
To observe the difference between the results from multiple DRL trials that use a certain algorithm and hyperparameters, we performed a number of trials for the distributed PPO algorithm on the Pendulum-v0 task.
Environment: Pendulum-v0 is a classic continuous control task from OpenAI Gym. The goal is to make a pole stand up by applying a torque, and Fig.4 shows the motion sequences of a successful operation for Pendulum-v0. More information about Pendulum-v0 can be found at http://gym.openai.com.
Experimental Details: The DPPO algorithm used in this experiment is introduced in Sec.II-D. We used 3 worker agents running in different threads to learn in the Pendulum-v0 environment.

Algorithm 1 Policy Return
1: Randomly initialize the optimal network and the exploratory network with weights θ_o and θ_e.
2: Initialize A (the average EPR of the latest K episodes).
3: Initialize A_max.
4: Initialize the size of the reward buffer L_B.
5: for episode = 1 to N do
6:     Initialize reward buffer B.
7:     Initialize EPR.
8:     for step = 1 to M do
9:         Run policy π_{θ_e}, collecting {s, a, r}.
10:        Update the exploratory network using the chosen DRL algorithm.
11:        Update EPR: EPR = EPR + r.
12:    end for
13:    Calculate A, the average EPR of the latest K episodes.
14:    Store A in B.
15:    if A > A_max then
16:        Set A_max = A.
17:        Update the optimal network: θ_o ← θ_e.
18:    end if
19:    if A ≤ A_max has lasted C episodes (see Eq.(13)) then
20:        Get weights θ_e^init by randomly initializing the exploratory network.
21:        Return the exploratory network: θ_e ← (1 − ϕ)θ_o + ϕ θ_e^init.
22:    end if
23: end for

Results and Discussion: From Fig.5, we can see clearly that the same configuration produced very different results. Only half of the results (b, c) were convergent; (a) was divergent throughout the training process, and (d) suddenly became divergent after achieving convergence, with the divergence maintained until the end of training. For this experiment, we may choose policy (b) or (c) as the final result. However, if we only conduct one trial and obtain a poor result such as that of (a) or (d), it is unclear whether to try again or abandon the current algorithm. The hard reality is that researchers have to perform many trials when tuning a DRL algorithm. Therefore, it is important to study how to verify a DRL algorithm or a set of hyperparameters with fewer trials.

B. EXPERIMENT 2: COMPARISON OF PPO WITH PR-PPO ON PENDULUM-v0
In this study, we performed 50 trials for the original PPO and PPO with the policy return method (PR-PPO) on the Pendulum-v0 task, and each trial had 2000 learning episodes. The PPO algorithms used in this experiment are introduced in Sec.II-C.
Environment: We used the Pendulum-v0 environment to test our algorithm in this experiment and this environment was introduced in experiment 1.
Experimental Details: In this experiment, the trials of PPO and the trials of PR-PPO used the same configurations except for the exclusive hyperparameters (A max , L B , C, ϕ) of PR-PPO; therefore, they could prominently demonstrate the influence of the policy return method on DRL. The detailed configurations for this experiment are shown in Tab.2.
Results and Discussion: Fig.6 shows two learning curves from the PR-PPO trials. In panel (a), we note that, in most cases, the returns at the pink dots bring the learning to an improved state, and with the returns, the score of the optimal network becomes increasingly higher. In panel (b), the first half of the learning process is difficult, but after a few policy returns, the learning stops diverging and finally gets back on track. Fig.7 compares the average learning curve of the 50 PPO trials with that of the 50 PR-PPO trials, showing that PR-PPO not only makes the learning more stable but also obtains higher final scores than PPO.

[FIGURE 6. Learning curves obtained by PPO with the policy return method on the Pendulum-v0 task: panel (a) shows the capacity of the policy return method to help the learning obtain higher scores, and panel (b) shows how the returns save the learning from a poor situation.]
Tab.3 lists the required number of trials (N_r) for various target scores. N_r is estimated by N_r = 50/w, where w is the number of trials whose best 100-episode performance surpasses the target score. Tab.3 indicates that the stable learning brought about by the policy return method leads to fewer required trials at all target score levels, and as the target score increases, this benefit becomes more and more significant.
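The N_r estimate is simple enough to express directly; the numbers below are illustrative, not results from Tab.3.

```python
def required_trials(total_trials, successes):
    """Estimate N_r = total/successes: expected trials per run that beats the target."""
    if successes == 0:
        return float("inf")   # the target score was never reached in the sample
    return total_trials / successes

print(required_trials(50, 20))   # 2.5: on average, 2.5 trials per qualifying result
```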

C. EXPERIMENT 3: COMPARISON OF DDPG WITH PR-DDPG ON BipedalWalker-v2
To verify the general suitability of the policy return method with regard to DRL algorithms, we applied it to DDPG, which is a popular off-policy DRL method. Similar to experiment 2, we performed multiple trials for the original DDPG and DDPG with policy return (PR-DDPG) on a highly complicated task, BipedalWalker-v2.
Environment: BipedalWalker-v2 is a bipedal walking control task, also from OpenAI Gym. The goal of this task is to learn to make the robot walk on uneven ground by driving four leg joints.
Experimental Details: For the DDPG method, the actor network has two hidden layers with 500 and 200 units, and the critic network has one hidden layer with 700 units. To improve the sampling efficiency, a prioritized experience replay method [31] was used when choosing transitions. Tab.4 shows the details of this experiment.
Results and Discussion: As the results in Fig.8 demonstrate, the policy return method can also help DDPG avoid divergence during the learning process. Fig.9, which shows a comparison of average learning curves, further indicates the advantage brought about by policy returns. The statistically derived numbers of trials required for various target scores are listed in Tab.5, and these show that the required number of trials can be reduced by 0.06 to 0.58 depending on the target score. From this experiment, we can conclude that the policy return method has wide applicability to deep reinforcement learning algorithms.

D. EXPERIMENT 4: COMPARISON OF THE POLICY RETURN METHOD WITH OTHER STATE-OF-THE-ART ALGORITHMS
Improving the stability of the learning is one of the most effective ways to reduce the number of trials in DRL. Experience replay [2] and distributed learning [16] are two state-of-the-art methods with regard to improving the stability of DRL. In this experiment, we set a few target scores at different levels and applied these two state-of-the-art methods and the policy return method to a DRL with an actor-critic structure to challenge these target scores. For every method, the number of trials needed to successfully achieve each target score was recorded.
Environment: We conducted these comparisons on the Pendulum-v0 and BipedalWalker-v2 tasks.
Experimental Details: Every method was tested 10 times on each target score, and the average required numbers of trials were compared. For the experience replay method, the capacity of the experience memory was 50000 and 200000 for Pendulum-v0 and BipedalWalker-v2, respectively, and a prioritized sampling method [31] was used when choosing transitions. For the distributed learning method, we used 5 agents to learn in the respective environment. For the policy return method, the inspection period C was 200 episodes for Pendulum-v0 and 100 episodes for BipedalWalker-v2. The details of the actor-critic structure are listed in Tab.6.
Results and Discussion: As Fig.10 shows, all methods required more trials as the target score was set higher, and in both environments, the policy return method used the fewest trials to reach every target score. In the Pendulum-v0 trials, when the target score was set higher than −180, the policy return method reduced the number of required trials by more than 1 compared with the experience replay and distributed learning methods. In the BipedalWalker-v2 trials, the policy return method reached a score of 290 with an average of 1.9 attempts, while experience replay, the better-performing of the two state-of-the-art methods, needed approximately 2.6 attempts.

V. CONCLUSION
In this work, we have proposed the policy return method to reduce the number of trials required in deep reinforcement learning. Our method monitors the learning performance in real time and compels the learned policy to return to the current optimal policy when the learning falls into a poor state. The experimental results demonstrate that the policy return method can reduce the number of trials required by existing DRL algorithms by 10 to 40 percent, and it can achieve a fairly good score with nearly one fewer trial than the state-of-the-art algorithms. A drawback worth discussing is that the policy return method needs a copy of the networks to store the optimal policy; as a result, more GPU memory is required, but this issue will become increasingly negligible as graphics cards are upgraded. Although the policy return method is not impeccable, it provides practical help, and we hope it can become a useful component of DRL architectures.

VI. CODE AVAILABILITY
The source code for the policy return method is available at https://github.com/Parker1988/Experiments.