ACERAC: Efficient Reinforcement Learning in Fine Time Discretization

One of the main goals of reinforcement learning (RL) is to provide a way for physical machines to learn optimal behavior instead of being programmed. However, effective control of machines usually requires fine time discretization. The most common RL methods apply independent random elements to each action, which is not suitable in that setting: it causes the controlled system to jerk, and it does not ensure sufficient exploration, since a single action is not long enough to create a significant experience that could be translated into policy improvement. In our view, these are the main obstacles that prevent the application of RL in contemporary control systems. To address these pitfalls, in this article we introduce an RL framework and adequate analytical tools for actions that may be stochastically dependent at subsequent time instants. We also introduce an RL algorithm that approximately optimizes a policy producing such actions. It applies experience replay (ER) to adjust the likelihood of sequences of previous actions in order to optimize the expected $n$-step returns that the policy yields. The efficiency of this algorithm is verified against four other RL methods [continuous deep advantage updating (CDAU), proximal policy optimization (PPO), soft actor-critic (SAC), and actor-critic with ER (ACER)] on four simulated learning control problems (Ant, HalfCheetah, Hopper, and Walker2D) at diverse time discretizations. The algorithm introduced here outperforms the competitors in most of the cases considered.


I. INTRODUCTION
The subject of this paper is reinforcement learning (RL) [1]. This field offers methods of learning to make sequential decisions in dynamic environments. One application of such methods is the literal implementation of "machine learning," i.e., enabling machines and software to learn optimal behavior instead of being programmed.
The usual goal of RL methods is to optimize a policy that samples an action based on the current state of a learning agent. The only stochastic dependence between subsequent actions is through state transition: an action moves the agent to another state, which determines the distribution of the next action. The main analytical tools in RL are based on this lack of other dependence between actions. For example, for a given policy, its value function expresses the sum of discounted rewards the agent may expect starting from a given state. This sum of rewards does not depend on actions taken before the given state has been reached. Hence, only the given state and the policy matter.
Lack of dependence between actions beyond state transition leads to the following difficulties. In physical implementations of RL, e.g., in robotics, the lack of dependence usually means that white noise is added to control actions. However, this makes control discontinuous and subject to constant rapid changes. In addition, this is often impossible to implement, since the electric motors executing these actions cannot change their output too quickly. Even if such control is possible, it requires large amounts of energy, makes the controlled system shake, and exposes it to damage.
Control frequency for real-life control systems can be much higher than that of the simulated environments for which RL methods are designed. The typical frequency of the control signal for environments commonly used as benchmarks for RL algorithms ranges from 20 to 60 Hz [2], while the control frequency considered for real-life robots is 10 times higher, from 200 to 500 Hz [3], and can be even higher, up to 1000 Hz [4]. Therefore, finer time discretization should be considered to make RL more suitable for automation and robotics.
The lack of dependence between actions beyond state transition may also reduce the efficiency of learning, as follows. Each action is then an independent random experiment that may lead to policy improvement. However, due to the limited accuracy of (action-)value function approximation, the consequences of a single action may be hard to recognize. The finer the time discretization, the more serious this problem becomes. The consequences of a random experiment distributed over several time instants could be more tangible and thus easier to recognize.
Additionally, fine time discretization makes policy evaluation more difficult, as it requires accounting for more distant rewards. Technically, the discount factor needs to be larger, which makes learning more difficult for most RL algorithms [5]. The above problems are serious enough to prevent the wide applicability of RL in real-life control systems.
To avoid the above pitfalls, we introduce in this paper a framework in which an action is both a function of state and a value of a stochastic process whose subsequent values are dependent. In particular, these subsequent values can be autocorrelated, which makes the resulting actions close to one another. A part of the action trajectory can then create a distributed-in-time random experiment that leads to policy improvement. An RL algorithm is also introduced that optimizes a policy based on the above principles.
The contribution of this paper may be summarized by the following points:
• A framework is introduced in which an action is both a function of state and a value of an autocorrelated stochastic process. This framework is suited for the application of RL to the optimization of control in physical systems, e.g., in robots.
• An ACERAC algorithm, based on Actor-Critic structure and experience replay, is introduced that approximately optimizes a policy in the aforementioned framework.
• An extensive study is described with four benchmark learning control problems (Ant, HalfCheetah, Hopper, and Walker2D) at diverse time discretizations. Using these problems, the performance of the ACERAC algorithm is compared with state-of-the-art RL methods.
This paper extends [6] in several directions. We introduce here the notion of adjusted noise, which is the input to the noise-value function. Also, when manipulating the policy parameter, the value of the noise-value function at the end of the action sequence is taken into account. The experimental study of the resulting algorithm is almost entirely new.
The rest of the paper is organized as follows. The problem considered here is formulated in Sec. II. Sec. III overviews the related literature. Sec. IV introduces a policy that produces autocorrelated actions, along with tools for its analysis. Sec. V introduces the ACERAC algorithm that approximately optimizes this policy. Sec. VI presents simulations that compare the presented algorithm with state-of-the-art reinforcement learning methods. The last section concludes the paper.

II. PROBLEM FORMULATION
We consider here the standard Markov Decision Process (MDP) model [1], in which an agent operates in discrete time $t = 1, 2, \dots$. At time $t$ the agent finds itself in a state, $s_t \in S$, takes an action, $a_t \in A$, receives a reward, $r_t \in \mathbb{R}$, and is transited to another state, $s_{t+1} \sim P_s(\cdot|s_t, a_t)$, where $P_s$ is a fixed but unknown conditional probability.
The goal of the agent is to learn to designate actions so as to be able to expect, at each $t$, the highest discounted rewards in the future. To ensure exploration, a random component is usually introduced into the action selection.
We mainly consider the application of the MDP model to the control of physical devices. Therefore, we assume that both $S$ and $A$ are spaces of vectors of real numbers [7]. We also assume fine time discretization typical for such applications, which means that, when designating actions, the agent should account for rewards that are quite distant in terms of discrete-time steps. This translates into a discount parameter close to 1, e.g., $\gamma \in (0.995, 1)$. The known reasons for the instability of learning with such a large $\gamma$ [5] need to be overcome.
To ensure applicability to control of physical machines, we require that the actions should generally be close for subsequent t, even if they are random.Also, the learning should be efficient in terms of the amount of experience needed to optimize the agent's behavior.

III. RELATED WORK
A general way to make subsequent actions close is the autocorrelation of the randomness on which these actions are based.Efficiency in terms of experience needed can be provided by experience replay.We focus on these concepts in the literature review below.

A. Stochastic dependence between actions
An autocorrelated stationary stochastic process, now referred to as the Ornstein-Uhlenbeck (OU) process, was analyzed in [8]. This process is the only autocorrelated Gaussian stochastic process that has the Markov property [9].
A policy with autocorrelated actions was analyzed in [10]. That policy was optimized by a standard RL algorithm that did not account for the dependence between actions. In [11], a policy was analyzed whose parameters were incremented by the OU stochastic process. Essentially, this resulted in autocorrelated random components of actions. In [12], a policy was analyzed that produced an action as the sum of OU noise and a deterministic function of the state. However, no learning algorithm was presented in that paper that accounted for the specific properties of this policy.

B. Reinforcement learning for fine time discretization
In [13], RL at arbitrarily fine time discretization is analyzed. It is proven that RL based on the action-value function cannot be effective when time discretization becomes sufficiently fine, and the importance of the dependence of the action noise across subsequent timesteps is noted. In the aforementioned work, an RL algorithm called Deep Advantage Updating (DAU) for discrete actions and its variant for continuous actions (CDAU) are introduced. These methods are based on estimating the advantage function and are presented as immune to time discretization. They are based on the Deep Q-Network (DQN) [14] and Deep Deterministic Policy Gradient (DDPG) [15] algorithms, respectively, and use the OU process as autocorrelated noise.
Integral Reinforcement Learning (IRL) is an approach to learning control policies for continuous-time environments. IRL is based on the assumption that the control problem can be divided into a hierarchy of control loops [16]. This assumption is usually not satisfied in challenging tasks; thus IRL is not applicable to tasks with arbitrary state-transition dynamics, but only to those belonging to a relatively narrow class [17].

C. Reinforcement learning with experience replay
The Actor-Critic architecture for RL was first introduced in [18]. Approximators were applied to this structure for the first time in [19]. Basic on-line RL algorithms use consecutive events of the agent-environment interaction to update the policy. To boost the efficiency of these algorithms, experience replay (ER) can be applied, i.e., storing the events in a database, sampling them, and using them for policy updates several times per actual event [20]. ER was combined with the Actor-Critic architecture for the first time in [21].
However, the application of experience replay to Actor-Critic encounters the following problem. The learning algorithm needs to estimate the quality of a given policy based on the consequences of actions that were registered when a different policy was in use. Importance sampling estimators are designed to do that, but they can have arbitrarily large variances. In [21], the problem was addressed by truncating the density ratios present in those estimators. In [22], specific correction terms were introduced for that purpose.
Another approach to the aforementioned problem is to prevent the algorithm from inducing a policy that differs too much from the one tried. This idea was first applied in Conservative Policy Iteration [23]. It was further extended in Trust Region Policy Optimization [24]. This algorithm optimizes a policy under the constraint that the Kullback-Leibler divergence between this policy and the one being tried should not exceed a given threshold. The K-L divergence becomes an additive penalty in the Proximal Policy Optimization algorithms, namely PPO-Penalty and PPO-Clip [25].
A way to avoid the problem of estimating the quality of a given policy based on another one tried is to approximate the action-value function instead of the value function. Algorithms based on this approach are DQN [14], DDPG [15], and Soft Actor-Critic (SAC) [26]. Although OU noise was added to actions in the original version of DDPG, the algorithm was not adapted to this fact in any specific way. SAC uses white noise in actions and is considered one of the most efficient algorithms in this family.

IV. POLICY WITH AUTOCORRELATED ACTIONS
In this section, we introduce a framework for reinforcement learning in which subsequent actions are stochastically dependent beyond state transition. We also design tools for the analysis of such a policy.
Let an action, $a_t$, be designated as
$$a_t = \pi(s_t, \xi_t; \theta), \qquad (1)$$
where $\pi$ is a deterministic transformation, $s_t$ is the current state, $\theta$ is a vector of trained parameters, and $(\xi_t)_{t=1}^{\infty}$ is a stochastic process with values in $\mathbb{R}^d$. We require this process to have the following properties:
• Stationarity: the marginal distribution of $\xi_t$ is the same for each $t$.
• Zero mean: $\mathrm{E}\xi_t = 0$ for each $t$.
• Autocorrelation decreasing with growing lag: essentially, values of the process are close to each other at close time instants.
• Markov property: for any $t$ and $k, l \ge 0$, the conditional distributions $(\xi_{t+k}|\xi_{t-1})$ and $(\xi_{t+k}|\xi_{t-1}, \xi_{t-1-l})$ are the same. In words, the dependence of future values of the process, $\xi_{t+k}$, $k \ge 0$, on its past is entirely carried over by $\xi_{t-1}$.
Consequently, if only $\pi$ (1) is continuous in all its arguments, and subsequent states $s_t$ are close to each other, then the corresponding actions are close too, even though they are random. Because they are close, they are feasible in physical systems. Because they are random, they create a consistent distributed-in-time experiment that can give a clue for policy improvement.
Below we analyze an example of $(\xi_t)$ that meets the above requirements.
An example of such a process is the autoregressive recursion
$$\xi_t = \alpha\xi_{t-1} + \sqrt{1-\alpha^2}\,\varepsilon_t, \qquad (4)$$
where $\alpha \in [0, 1)$ and $(\varepsilon_t)$ are i.i.d. normal vectors, $\varepsilon_t \sim N(0, C)$. In fact, marginal distributions of the process $(\xi_t)$, as well as its conditional distributions, are normal, and their parameters have compact forms. Let us denote by $\bar\xi^n_t = [\xi_t^T, \dots, \xi_{t+n-1}^T]^T$ the vector of $n$ subsequent values of the process. The distribution of $\bar\xi^n_t$ is normal,
$$\bar\xi^n_t \sim N(0, \Omega^n_0),$$
where $\Omega^n_0$ (21) is a matrix dependent on $n$, $\alpha$, and $C$. The conditional distribution $(\bar\xi^n_t | \xi_{t-1})$ is normal,
$$(\bar\xi^n_t | \xi_{t-1}) \sim N(B^n\xi_{t-1}, \Omega^n_1),$$
where both $B^n$ (24) and $\Omega^n_1$ (25) are matrices dependent on $n$, $\alpha$, and $C$.

b) Noise-value function: In policy (1) there is a stochastic dependence between actions beyond the dependence resulting from the state transition. Therefore, the traditional understanding of a policy as a distribution of actions conditioned on the state does not hold here. Each action depends on the current state, but also on previous states and actions. The analytical usefulness of the traditional value function and action-value function is thus limited.
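For concreteness, the required properties can be checked empirically in simulation. The sketch below assumes the first-order autoregressive (AR(1)) form $\xi_t = \alpha\xi_{t-1} + \sqrt{1-\alpha^2}\,\varepsilon_t$ with $C = I$, a standard discrete-time analogue of the OU process; the paper's exact parameterization may differ in details:

```python
import numpy as np

def ar1_noise(alpha, d, T, rng):
    """Simulate a stationary AR(1) Gaussian noise process in R^d.

    xi_t = alpha * xi_{t-1} + sqrt(1 - alpha^2) * eps_t,  eps_t ~ N(0, I).
    The marginal distribution is N(0, I) for every t (stationarity, zero
    mean), and corr(xi_t, xi_{t+k}) = alpha^k decays with the lag k.
    """
    xi = np.empty((T, d))
    xi[0] = rng.standard_normal(d)      # start from the stationary marginal
    c = np.sqrt(1.0 - alpha ** 2)
    for t in range(1, T):
        xi[t] = alpha * xi[t - 1] + c * rng.standard_normal(d)
    return xi

rng = np.random.default_rng(0)
xi = ar1_noise(alpha=0.9, d=1, T=200_000, rng=rng)

# Empirical checks of the required properties:
var = xi.var()                           # stationary marginal variance, ~1
lag1 = np.mean(xi[1:, 0] * xi[:-1, 0])   # autocorrelation at lag 1, ~alpha
lag5 = np.mean(xi[5:, 0] * xi[:-5, 0])   # autocorrelation at lag 5, ~alpha^5
```

With $\alpha$ close to 1 the trajectory of $\xi_t$ changes slowly, so actions built on it stay close at subsequent time instants, while the process remains Markov and zero-mean.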
Our objective now is to define an analytical tool in the form of a function that satisfies the following:
R1. A hard requirement: the function designates an expected value of future discounted rewards based on the entities that this expected value is conditioned on.
R2. An efficiency requirement: a small change of the policy corresponds to a small change of this function. While this is not necessary, it facilitates concurrent learning of the policy and an approximation of this function.
In order to meet the above requirements we introduce an adjusted noise, $(u_t)_{t=1}^{\infty}$, as follows. $u_t$ and $\xi_t$ belong to the same space $\mathbb{R}^d$. Let
$$u_{t-1} = f(\xi_{t-1}; s_t, \theta), \qquad (8)$$
where $f$ is a bijective function in $\mathbb{R}^d$ parameterized by $\theta$ and the state. Formally, we can apply $f$ to convert $\xi_{t-1}$ to $u_{t-1}$ and back whenever necessary.
As an analytical tool satisfying the aforementioned hard requirement R1, we propose the noise-value function defined as
$$W^\pi(u_{t-1}, s_t) = \mathrm{E}\Big(\textstyle\sum_{i \ge 0} \gamma^i r_{t+i} \,\Big|\, u_{t-1}, s_t\Big). \qquad (10)$$
The course of events starting at time $t$ depends on the current state $s_t$ and the value $u_{t-1}$. Because of the Markov property of $(\xi_t)$ (3) and the direct equivalence between $\xi_{t-1}$ and $u_{t-1}$, the pair $(u_{t-1}, s_t)$ is a proper condition for the expected value of future rewards.
To satisfy the aforementioned efficiency requirement R2, we design the $f$ function (8) based on $\pi$. It should make the distribution of an initial part of the action trajectory $(a_t, \dots)$ similar for given $u_{t-1}, s_t$, regardless of the policy parameter $\theta$. Therefore, when $\theta$ changes due to learning, the arguments of the $W^\pi$ function (10) still define similar circumstances in which the rewards start being collected. This prevents large changes in the shape of $W^\pi$. An example of an appropriate $f$ function is provided below in (18).
We can consider the pair $(u_{t-1}, s_t)$ a state of an extended MDP. Therefore, the noise-value function has all the properties of the ordinary value function. In particular, we consider the $n$-step look-ahead equation in the form
$$W^\pi(u_{t-1}, s_t) = \mathrm{E}\Big(\textstyle\sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n W^\pi(u_{t+n-1}, s_{t+n}) \,\Big|\, u_{t-1}, s_t\Big). \qquad (11)$$
It says that the noise-value function is the expected sum of the several first rewards, and that the rest of them are designated by the noise-value function itself.
The algorithm introduced below manipulates the policy $\pi$ (1) to make $n$-step sequences of registered actions more or less likely in the future. Let us consider
$$\pi(\bar a^n_t | \bar s^n_t, \xi_{t-1}; \theta), \qquad (12)$$
a probability density of the action sequence $\bar a^n_t$ conditioned on the sequence of visited states $\bar s^n_t$, the preceding noise value $\xi_{t-1}$, and the policy parameter $\theta$. This density is defined by $\pi$ (1) and the conditional probability distribution $(\bar\xi^n_t|\xi_{t-1})$. The algorithm defined in the next section updates $\theta$ to manipulate the above distribution.
c) The neural-AR policy: A simple and practical way to implement $\pi$ (1) is as follows. A feedforward neural network, $A(s; \theta)$, has input $s$ and weights $\theta$. An action is designated as
$$a_t = A(s_t; \theta) + \xi_t, \qquad (14)$$
for $\xi_t$ in the form (4).

Let us analyze the distribution $\pi$ (12). In what follows, the density of the normal distribution with mean $\mu$ and covariance matrix $\Omega$ is denoted by $N(\mu, \Omega)$. Let us also denote $\bar A(\bar s^n_t; \theta) = [A(s_t; \theta)^T, \dots, A(s_{t+n-1}; \theta)^T]^T$. It can be seen that the distribution $(\bar a^n_t|\bar s^n_t, \xi_{t-1})$ is normal, namely $N(\bar A(\bar s^n_t; \theta) + B^n\xi_{t-1}, \Omega^n_1)$ (see (6) and (7)). Therefore,
$$\pi(\bar a^n_t | \bar s^n_t, \xi_{t-1}; \theta) = N\big(\bar A(\bar s^n_t; \theta) + B^n\xi_{t-1}, \Omega^n_1\big)(\bar a^n_t).$$
What is of paramount importance is the log-density gradient $\nabla_\theta \ln \pi$. For $\pi$ defined as (14), it can be expressed in closed form; see (17) and Appendix A. The $f$ function (8) may have the form
$$f(\xi; s, \theta) = A(s; \theta) + \alpha\xi. \qquad (18)$$
Then $u_{t-1} = A(s_t; \theta) + \alpha\xi_{t-1}$ is the expected value of $a_t$ given $\theta$, $s_t$, and $\xi_{t-1}$.
Consequently, this definition of $f$ limits the differences between noise-value functions of different policies. This is because $W^\pi(u, s)$ means, for any policy, the expected sum of future rewards received starting from the same point: the current state equal to $s$ and the expected action equal to $u$. Therefore, if $W^\pi$ is accurately approximated for the current policy and this policy is updated, the approximation of $W^\pi$ needs only a limited adjustment.
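A minimal sketch of the neural-AR policy and the adjusted noise, assuming the AR(1) form of $(\xi_t)$ and with a tiny random-weight network standing in for $A(s; \theta)$ (the networks used in the experiments have two hidden layers of 256 units):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_actor(state_dim, action_dim, hidden=32):
    """Tiny MLP standing in for the actor network A(s; theta)."""
    W1 = rng.standard_normal((hidden, state_dim)) * 0.1
    W2 = rng.standard_normal((action_dim, hidden)) * 0.1
    return lambda s: W2 @ np.tanh(W1 @ s)

A = make_actor(state_dim=4, action_dim=2)
alpha = 0.9
xi_prev = np.zeros(2)                      # xi_{t-1}

def act(s, xi_prev):
    """One step of the neural-AR policy: a_t = A(s_t; theta) + xi_t."""
    eps = np.sqrt(1 - alpha ** 2) * rng.standard_normal(xi_prev.shape)
    xi = alpha * xi_prev + eps             # AR(1) noise update
    a = A(s) + xi                          # the action
    u = A(s) + alpha * xi_prev             # adjusted noise: E[a_t | s_t, xi_{t-1}]
    return a, xi, u

s = rng.standard_normal(4)
a, xi, u = act(s, xi_prev)
```

Since $\mathrm{E}[\xi_t|\xi_{t-1}] = \alpha\xi_{t-1}$, the adjusted noise `u` is exactly the conditional expectation of the action, which is what makes the pair $(u, s)$ a stable argument for the noise-value function.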
V. ACERAC: ACTOR-CRITIC WITH EXPERIENCE REPLAY AND AUTOCORRELATED ACTIONS

The RL algorithm presented in this section has an actor-critic structure. It optimizes a policy of the form (1) and uses the critic, $W(u, s; \nu)$, an approximator of the noise-value function (10) parametrized by the vector $\nu$. The critic is trained to approximately satisfy (11).
A constant parameter of the algorithm is a natural number $n$. It denotes the length of the action sequences whose probabilities the algorithm adjusts. At each time instant of the agent-environment interaction, the policy (1) is applied. Also, data is registered that enables recall of the tuple $\bar s^n_t, \bar a^n_t, \bar\pi^n_t, \bar r^n_t, s_{t+n}$, where $\bar\pi^n_t = \pi(\bar a^n_t|\bar s^n_t, \xi_{t-1}; \theta)$. The general goal of policy training in ACERAC is to maximize $W^\pi(u_{j-1}, s_j)$ for each state $s_j$ registered during the agent-environment interaction. To this end, previous time instants are sampled, and the sequences of actions that followed these instants are made more or less probable depending on their returns. More specifically, $j$ is sampled from $\{t-M, \dots, t-n\}$, where $M$ is the memory buffer length, and $\theta$ is adjusted along a policy gradient estimate, which is derived in Appendix B. In other words, the conditional density of the sequence of actions $\bar a^n_j$ is increased or decreased depending on the return this sequence of actions yields.

A. Actor & Critic training
At each $t$-th instant of the agent-environment interaction, experience replay is repeated several times in the form presented in Algorithm 1 to calculate actor and critic weight updates.
In Line 2, the algorithm selects an experienced event to replay, with starting time index $j$. In the following lines, the vectors of states $\bar s^n_j = [s_j^T, \dots, s_{j+n-1}^T]^T$ and actions $\bar a^n_j = [a_j^T, \dots, a_{j+n-1}^T]^T$ are considered. In Lines 3-4, $X_{j-1}$ and $X_{j+n-1}$ are appointed to be the values of the noise with which the current policy would designate the past actions. Then, in Lines 5-6, the corresponding adjusted noise values $u_{j-1}$ and $u_{j+n-1}$ are calculated.
In Line 7, a temporal difference is computed. It determines the relative quality of $\bar a^n_j$. In Line 8, a softly truncated density ratio is computed. The density ratio implements two ideas. Firstly, $\theta$ changes in the course of optimization, so the conditional distribution $(\bar a^n_j|\xi_{j-1})$ is now different than it was at the time when the actions $\bar a^n_j$ were executed. The density ratio compensates for this discrepancy of distributions. Secondly, to limit the variance of the density ratio, the soft-truncating function $\psi_b$ is applied, for a certain $b > 1$. In the ACER algorithm [21], the hard truncation function $\min\{\cdot, b\}$ is used for the same purpose, namely limiting the density ratios necessary to designate updates under action-distribution discrepancies. However, soft truncation distinguishes the magnitudes of density ratios and works slightly better than hard truncation. In Line 9, an improvement direction for the actor is computed. The sum of $\nabla_\theta \ln \pi(\bar a^n_j|\bar s^n_j, X_{j-1}; \theta)\,d^n_j(\theta, \nu)\,\rho_j(\theta)$ and $\gamma^n \nabla_\theta W(u_{j+n-1}(\theta), s_{j+n}; \nu)\,\rho_j$ is an estimate of an improvement direction of $W^\pi(u_{j-1}, s_j)$, derived in Appendix B. It is designed to increase/decrease the likelihood of the sequence of actions $\bar a^n_j$ proportionally to $d^n_j(\theta, \nu)$. $L(s, \theta)$ is a loss function that penalizes the actor for producing actions that do not satisfy constraints, e.g., exceed their boundaries.
In Line 10, an improvement direction for the critic, $\Delta\nu$, is computed. It is designed to make $W(\cdot, \cdot\,; \nu)$ approximate the noise-value function (10) better.
Finally, the improvement directions $\Delta\theta$ and $\Delta\nu$ are applied to update $\theta$ and $\nu$, respectively, with the use of ADAM, SGD, or another method of stochastic optimization.
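A single replay step along these lines can be sketched as follows. This is a simplified, single-sample illustration with placeholder actor, critic, and density functions; the soft-truncating function is assumed to be $\psi_b(x) = b\tanh(x/b)$ (the exact form used in the paper may differ), and the $\gamma^n \nabla_\theta W$ term of the actor direction, as well as the critic direction $\Delta\nu$, are omitted for brevity:

```python
import numpy as np

def soft_truncate(x, b=3.0):
    """Assumed psi_b: a smooth analogue of the hard truncation min{x, b}.
    Unlike min{x, b}, it still distinguishes the magnitudes of ratios
    above b while keeping them bounded."""
    return b * np.tanh(x / b)

def replay_step(batch, actor, critic, log_pi, alpha, gamma, n):
    """One experience-replay update direction (sketch of Algorithm 1).

    `batch` holds a registered fragment: s_{j-1}, a_{j-1}, states
    s_j..s_{j+n}, actions a_j..a_{j+n-1}, rewards r_j..r_{j+n-1}, and
    pi_old, the density of the action sequence when it was executed.
    `actor(s)` returns A(s; theta); `critic(u, s)` approximates W;
    `log_pi` returns the log-density of the sequence and a placeholder
    gradient with respect to theta.
    """
    s, a, r = batch["s"], batch["a"], batch["r"]

    # Noise with which the current policy would designate past actions:
    X_prev = batch["a_prev"] - actor(batch["s_prev"])   # X_{j-1}
    X_last = a[n - 1] - actor(s[n - 1])                 # X_{j+n-1}

    # Adjusted noise values (18), each using the *next* state:
    u_prev = actor(s[0]) + alpha * X_prev               # u_{j-1}
    u_last = actor(s[n]) + alpha * X_last               # u_{j+n-1}

    # n-step temporal difference, the relative quality of the sequence:
    returns = sum(gamma ** i * r[i] for i in range(n))
    d = returns + gamma ** n * critic(u_last, s[n]) - critic(u_prev, s[0])

    # Softly truncated density ratio:
    logp, grad_logp = log_pi(a, s, X_prev)
    rho = soft_truncate(np.exp(logp) / batch["pi_old"])

    # Actor improvement direction (main term only):
    dtheta = grad_logp * d * rho
    return dtheta, d

# Tiny smoke usage with linear placeholders (illustration only):
actor = lambda s: 0.5 * s
critic = lambda u, s: float(np.sum(u) + np.sum(s))
def log_pi(a, s, X_prev):
    return 0.0, np.ones(3)        # placeholder log-density and gradient

batch = {
    "s_prev": np.zeros(2), "a_prev": np.zeros(2),
    "s": [np.ones(2) * i for i in range(4)],   # s_j .. s_{j+n}, n = 3
    "a": [np.ones(2)] * 3,
    "r": [1.0, 1.0, 1.0],
    "pi_old": 1.0,
}
dtheta, d = replay_step(batch, actor, critic, log_pi,
                        alpha=0.9, gamma=0.99, n=3)
```

The sign and magnitude of `d` decide whether the replayed action sequence is made more or less probable, and `rho` keeps the off-policy correction bounded.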
Implementation details of the algorithm using the neural-AR policy (14) are presented in (17) and in Appendix A.

VI. EMPIRICAL STUDY
This section presents simulations whose purpose was to compare the algorithm introduced in Sec. V to state-of-the-art reinforcement learning methods. We compared the new algorithm to Actor-Critic with Experience Replay (ACER) [21], Proximal Policy Optimization (PPO) [25], Soft Actor-Critic (SAC) [26], and Continuous Deep Advantage Updating (CDAU) [13]. We selected these algorithms as different state-of-the-art approaches that also apply trajectory updates (PPO, CDAU) or control exploration (SAC, CDAU). We used the RLlib implementation [27] of SAC and PPO, and the implementation of CDAU published by its authors. Our experimental software is available online.

For the comparison of the RL algorithms to be most informative, we chose four challenging tasks inspired by robotics: Ant, Hopper, HalfCheetah, and Walker2D (see Fig. 2) from the PyBullet physics simulator [28]. A simulator that is more popular in the RL community is MuJoCo [29]. Hyperparameters that assure optimal performance of ACER, SAC, and PPO applied to the considered environments in MuJoCo are well known. However, PyBullet environments introduce several changes to the MuJoCo tasks, which make them more realistic and thus more difficult. Additionally, physics in MuJoCo and PyBullet differ slightly [30]; hence we needed to tune the hyperparameters. We based the hyperparameters in our experiments on the values used for MuJoCo environments reported in the original papers. We also followed the authors' guidelines when selecting hyperparameters for tuning.
We do not limit the experiments to the original environments only. We also use modified ones, with 3 and 10 times finer time discretization, to verify how the algorithms work in these circumstances.
We used the actor and critic structures described in [26] for each learning algorithm. That is, both structures had the form of neural networks with two hidden layers of 256 units each.

A. Experimental setting
Each learning run with the basic time discretization lasted for 3 million timesteps. Every 30000 timesteps of training, a simulation was run with frozen weights and without exploration for 5 test episodes. The average sum of rewards within a test episode was registered. Each run was repeated 5 times.
In experiments with, respectively, 3 and 10 times finer time discretization, the number of timesteps for a run and between tests was increased, respectively, 3 and 10 times. Also, to keep the scale of the sum of discounted rewards, the discount parameter was increased from 0.99 to, respectively, $0.99^{1/3}$ and $0.99^{1/10}$, and the rewards were divided, respectively, by 3 and 10. The number of model updates was kept constant across discretizations. The data buffer was increased 3 and 10 times, respectively. In ACER, the $\lambda$ parameter was increased to $\lambda^{1/3}$ and $\lambda^{1/10}$, respectively. Also, in ACERAC, the $n$ coefficient was increased 3 and 10 times, respectively, and $\alpha$ was increased to $\alpha^{1/3}$ and $\alpha^{1/10}$, respectively.
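These rescaling rules can be collected in a small helper function; the default $\lambda$ and $\alpha$ values below are illustrative placeholders, not the values used in the experiments:

```python
def rescale_for_discretization(k, gamma=0.99, lam=0.9, alpha=0.5, n=4):
    """Adjust hyperparameters when the time step becomes k times finer.

    Discounting per unit of real time is preserved by gamma -> gamma^(1/k),
    so that gamma over k steps equals the original gamma over one step.
    Rewards are divided by k to keep the scale of the sum of discounted
    rewards; lambda and alpha are rescaled like gamma, and the sequence
    length n grows k times.
    """
    return {
        "gamma": gamma ** (1.0 / k),
        "reward_scale": 1.0 / k,
        "lambda": lam ** (1.0 / k),
        "alpha": alpha ** (1.0 / k),
        "n": n * k,
    }

h = rescale_for_discretization(k=10)
# h["gamma"] raised to the 10th power recovers the original 0.99
```

This keeps the effective horizon, measured in real time rather than in steps, comparable across discretizations.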
For each environment-algorithm-discretization triple, hyperparameters such as step-sizes were optimized to yield the highest ultimate average rewards. The values of these hyperparameters are reported in Appendix C.

B. Ablation results
Figures 3, 4, and 5, respectively, present results for ACERAC at the original, 3 times, and 10 times finer time discretization, with different $\alpha$ and $n$. The primary goal of these experiments was to verify whether the concepts introduced with ACERAC really contribute to performance. For $\alpha = 0$, the autocorrelation of actions is switched off. It is seen in the graphs that $\alpha = 0$ yields inferior performance. $n$ defines the length of the sequences of actions whose density is manipulated in the course of learning. It is seen that our proposed default values usually yield optimal or close-to-optimal performance, but in some cases smaller $n$ proves better.

C. Results
Figures 6, 7, and 8, respectively, present learning curves for all four environments and all compared algorithms. The figures are for, respectively, the original, 3 times, and 10 times finer time discretization. Each graph shows how the sum of rewards in test episodes evolves in the course of learning. Solid lines represent the average sums of rewards, and shaded areas represent their standard deviations.
It is seen in Figures 6-8 that, over the 12 combinations of tasks and time discretizations, ACERAC yielded the best performance in 6 cases, in 2 cases it yielded the best performance ex aequo with ACER, and in the remaining cases it still yielded reasonable performance.
A curious result of our experiments was the extraordinarily high rewards obtained in some experiments with time discretization 10 times finer than the original. Namely, ACER and ACERAC obtained such results for Ant, and ACERAC for Hopper and Walker2D. Apparently, these environments require fast intervention of control, and no algorithm is able to learn it at coarser time discretization.
It can also be seen that for most discretizations and problems ACERAC obtained relatively good results in the initial training steps, which is a desirable feature in robotic control [31].

D. Discussion
The performance of the algorithms in our experiments with fine time discretization can be attributed to two features. The first one is autocorrelated actions. ACERAC and CDAU use them, but only ACERAC utilizes their properties; the other considered algorithms do not use them at all. It can be seen in Figures 6-8 that ACERAC achieved the best performance for 8 discretization-environment pairs out of 12. Switching off the autocorrelation of actions worsens the efficiency. Autocorrelated actions seem to be an efficient way to organize exploration, better than actions without any stochastic dependence beyond state transition. However, a policy with autocorrelated actions requires specialized training, which is provided in ACERAC.
The second feature is whether the algorithms use 1-step returns (SAC and CDAU) or $n$-step returns (PPO, ACER, and ACERAC). The impact of this choice on performance is complex. For fine enough time discretization and a large enough discount parameter, 1-step returns are expected to fail due to the limited accuracy of the critic. However, if the critic is accurate enough for the task at hand, small $n$ values work quite well. This is visible in the upper part of Fig. 5, where 1 proves to be the best value of $n$ in ACERAC for Ant at the finest analyzed time discretization. Hence, $n > 1$ may be a remedy for an inaccurate critic, but an accurate one is a better remedy.
Even though CDAU was designed in [13] to assure efficient RL at fine time discretization, that algorithm yielded poor performance in our experiments. However, it was presented there as an extension of a method for discrete actions, DAU, and no experimental material on CDAU was presented in the original paper.

VII. CONCLUSIONS AND FUTURE WORK
In this paper, a framework has been introduced for the application of reinforcement learning to policies that admit stochastic dependence between subsequent actions beyond state transition. This dependence is a tool that enables reinforcement learning in physical systems at fine time discretization. It can also yield better exploration and therefore faster learning.
An algorithm based on this framework, Actor-Critic with Experience Replay and Autocorrelated aCtions (ACERAC), was introduced. Its efficiency was verified in simulations of four learning control problems, namely Ant, HalfCheetah, Hopper, and Walker2D, at diverse time discretizations. The algorithm was compared with CDAU, PPO, SAC, and ACER. ACERAC exhibited the best performance in 8 out of 12 discretization-environment pairs.
It would be desirable to combine the framework proposed here with adapting the amount of randomness in actions by introducing a reward for the entropy of their distribution, as is done in PPO. Also, the framework proposed here has been specially designed for applications in robotics. An obvious next step in our research is to apply it in this area, which is more demanding than simulations.
The above estimator is not feasible. Firstly, it is based on the noise-value function, which is unknown. Secondly, it uses the density ratio, which could make its variance excessive.
In the feasible version of the above estimator, we use an approximator of the noise-value function, and the density ratio is softly truncated from above.

C. Algorithms' hyperparameters
This section presents the hyperparameters used in the simulations described in Sec. VI. For the original time discretization, all algorithms used a discount factor equal to 0.99. Common parameters for the off-line algorithms (i.e., ACERAC, ACER, SAC, and CDAU) are presented in Tab. I. Hyperparameters specific to the different algorithms are depicted in Tabs. II-IX. The hyperparameters were tuned using grid search over values spread by a factor of about 3: $\dots, 10^{-6}, 3\cdot10^{-6}, 10^{-5}, \dots$, with the exception of the clip parameter for PPO, whose only considered values were 0.1, 0.2, and 0.3, as suggested by the authors of this algorithm [25].

Fig. 6. Learning curves for the original time discretization: average sums of rewards in test trials. Environments: Ant, HalfCheetah, Hopper, and Walker2D.

Fig. 7. Learning curves for time discretization 3 times finer than the original: average sums of rewards in test trials. Environments: Ant, HalfCheetah, Hopper, and Walker2D.

Fig. 8. Learning curves for time discretization 10 times finer than the original: average sums of rewards in test trials. Environments: Ant, HalfCheetah, Hopper, and Walker2D.