A Functional Clipping Approach for Policy Optimization Algorithms

Proximal policy optimization (PPO) has yielded state-of-the-art results in policy search, a subfield of reinforcement learning, with one of its key components being the use of a surrogate objective function to restrict the step size of each policy update. Although this restriction is helpful, the algorithm still suffers from performance instability and optimization inefficiency caused by the sudden flattening of the objective curve outside the clipping range. To address this issue, we present a novel functional clipping policy optimization algorithm, named the Proximal Policy Optimization Smoothed algorithm (PPOS), whose critical improvement is the use of a functional clipping method in place of the flat clipping method. We compare our approach with PPO and with PPORB, which adopts a rollback clipping method, and show that our approach can conduct more accurate updates than other PPO variants. We demonstrate that it outperforms the latest PPO variants in both performance and stability on challenging continuous control tasks. Moreover, we provide an instructive guideline for tuning the main hyperparameter of our algorithm.

Reinforcement learning, especially deep model-free reinforcement learning, has achieved great progress in recent years. Its applications range from video games [9], [14] and board games [21] to robotics [12], [13] and challenging control tasks [5], [19]. Among these approaches, policy gradient (PG) methods are widely used in model-free policy search, contributing significantly to the fields of robotics and high-dimensional control. PG methods update the policy with a stochastic estimate of the gradient of the expected return and gradually converge to the optimal policy [16]. One of the most significant problems of PG-based methods is determining a suitable step size for each policy update: since the sampled data highly depend on the current policy, an improper step size will either prolong the policy search or even degrade the policy. As a result, the trade-off between learning speed and robustness is a central challenge for PG methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Shagufta Henna.
The trust region policy optimization (TRPO) method [18] proposes an objective function (also called the ''surrogate'' objective) and maximizes it subject to a constraint, expressed as a KL-divergence bound, on the size of the policy update. However, this constraint is computationally inefficient and difficult to scale to high-dimensional problems and complex network architectures. To address this problem, proximal policy optimization (PPO), which applies a clipping mechanism directly to the likelihood ratio, was introduced [20]. This clipping mechanism significantly reduces the complexity of the optimization process. To maintain learning robustness, PPO removes the incentive to push the policy away from the old one once the ratio leaves the clipping range, making it more efficient in both implementation and computation than the trust-region-based TRPO.
Despite its success, PPO also has flaws in restricting the update step size: PPO cannot truly bind the likelihood ratio within the clipping range [25]. Aiming at a better step-size restriction, proximal policy optimization with rollback (PPORB) applies a rollback operation to the clipping function to inhibit the policy from being pushed away during training [26]. PPORB adopts a straight downward-sloping function instead of the original flat function when the ratio is out of the clipping range, which counteracts PPO's residual incentive to seek a large policy update. However, this solution introduces new problems: when the ratio becomes extremely large, the clipped objective diverges toward negative infinity, contradicting its original aim. Additionally, the inner workings of PPORB's hyperparameter α, the slope of the rollback function, remain poorly understood, because PPORB only provides empirical values for specific experiments instead of a tuning method or function.
Inspired by the insights above, we propose a novel PPO clipping method, named the Proximal Policy Optimization Smoothed algorithm (PPOS), which combines the strengths of both PPO and PPORB. We apply a functional clipping mechanism that prevents large policy updates while guaranteeing the convergence of the clipped ratio. We compare PPOS with PPO and PPORB in high-dimensional control environments. We also analyze the newly introduced hyperparameter in relation to the dimensionality of five problems and provide a useful guideline for readers to choose its value according to the dimension of their own problems.

II. PRELIMINARIES
This section discusses the background information needed to understand our policy search approach. We first describe the reinforcement learning (RL) framework, which is based on the Markov decision process (MDP). Then, we give an overview of policy search methods. We continue by introducing trust-region methods and actor-critic methods. Finally, we delve into the Proximal Policy Optimization (PPO) algorithm, which is currently one of the state-of-the-art algorithms in policy search.
An RL procedure consists of an agent and an environment, where the agent interacts with the environment and gains rewards during this interaction. We model RL as a Markov decision process (MDP). An MDP is defined by the tuple (S, A, p, ρ_0, r, γ), where S ⊆ R^{d_S} is the set of states of the agent and A ⊆ R^{d_A} is the set of actions which can be conducted by the agent. p(s_{t+1} | s_t, a_t) denotes the probability of transitioning from state s_t to s_{t+1} given the action conducted by the agent at time step t. p(s_{t+1} | s_t, a_t) is usually unknown in high-dimensional control tasks, but we can sample from it in simulation or from a physical system in the real world. ρ_0 denotes the initial state distribution of the agent, and r(s_t, a_t) denotes the reward received in state s_t when the agent executes action a_t. Finally, γ ∈ (0, 1) is the discount factor. A policy in RL is defined as the map (or function) from the state space to the action space. The goal of RL is to find a stationary policy π that maximizes the long-term expected reward [18],

    J(π) = E_{s∼ρ_π, a∼π} [ Q^π(s, a) ],    (1)

where Q^π(s_t, a_t) = E_{s_{t+1}, a_{t+1}, ...} [ Σ_{l=0}^{T−t} γ^l r(s_{t+l}, a_{t+l}) ],
in which ρ_π(s) is the state distribution under π, i.e., the probability of visiting state s under π, Q^π(s_t, a_t) is the state-action value function, which evaluates the state-action pair, s_{t+1} ∼ p(s_{t+1} | s_t, a_t), a_t ∼ π(a_t | s_t), and T is the terminal time step. We can also compute the value function V^π(s_t) = E_{a_t, s_{t+1}, ...} [ Σ_{l=0}^{T−t} γ^l r(s_{t+l}, a_{t+l}) ], which evaluates the state s_t, and the advantage function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t), which indicates how good the chosen action a_t is in the state s_t.
In policy search, the goal is to find a policy π that maximizes Eq. (1). In recent work, the policy usually adopts a neural network structure (e.g., [14], [15], and [24]), which loosely mimics networks of biological neurons. In policy gradient methods [22], the policy is improved by updating it with the following surrogate performance objective,

    L(π) = E_{s,a} [ r_{s,a}(π) A^{π_old}(s, a) ],

where r_{s,a}(π) = π(a | s)/π_old(a | s) is the likelihood ratio between the new policy π and the old policy π_old, and A^{π_old}(s, a) is the advantage function of the old policy π_old.
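To make the surrogate objective concrete, the following is a minimal sketch of the likelihood ratio and the (unclipped) surrogate for a discrete-action policy; the array shapes and function names are our own illustrative choices, not part of the original method.

```python
import numpy as np

def likelihood_ratio(pi_new, pi_old, actions):
    """Per-sample ratio r_{s,a}(pi) = pi(a|s) / pi_old(a|s).

    pi_new, pi_old: arrays of shape (N, num_actions) holding action
    probabilities for N sampled states; actions: chosen action indices (N,).
    """
    idx = np.arange(len(actions))
    return pi_new[idx, actions] / pi_old[idx, actions]

def surrogate_objective(pi_new, pi_old, actions, advantages):
    """Unclipped policy-gradient surrogate: mean of r_{s,a} * A_old(s,a)."""
    r = likelihood_ratio(pi_new, pi_old, actions)
    return float(np.mean(r * advantages))

# Two states, two actions: the new policy raises the probability of the
# sampled action in state 0 and lowers it in state 1.
pi_old = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_new = np.array([[0.6, 0.4], [0.4, 0.6]])
acts = np.array([0, 0])
adv = np.array([1.0, -1.0])
# ratios are 1.2 and 0.8, so the surrogate is (1.2*1.0 + 0.8*(-1.0)) / 2 = 0.2
```

Note that nothing in this plain surrogate bounds the ratio; that is exactly the gap the clipping mechanisms discussed below address.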

A. TRUST REGION POLICY OPTIMIZATION
The common way of conducting policy updates is to use samples from the old policy π_old to improve the current policy π at each iteration. However, the update may not be valid for the new policy in practice, because the estimates of ρ_π(s) and Q^π(s, a) are based on the old policy, which can result in some extremely large updates. To address this problem, trust-region optimization (TRO) methods keep the new policy from moving far away from the old policy; this idea was first introduced in the relative entropy policy search (REPS) algorithm [17]. Many variants of TRO methods were introduced after REPS (e.g., [1], [2], [3]). Most of these methods bound the KL-divergence of the policy update, which prevents the update from becoming unstable while keeping it effective. Trust region policy optimization (TRPO) [18] adopts this bound to optimize the policy: the surrogate objective of TRPO is maximized subject to a constraint on the KL-divergence between the old and current policy,

    max_π L(π)  s.t.  E_s [ D_KL( π_old(· | s) || π(· | s) ) ] ≤ δ,

where δ is a constant hyperparameter and ''·'' denotes any action a. In general, when δ is small enough, the update is valid, since the difference between the new policy and the old policy is not substantial.
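The trust-region constraint above can be illustrated with a small sketch for discrete-action policies; the helper names and the toy δ value are our own assumptions, not TRPO's implementation.

```python
import numpy as np

def mean_kl(pi_old, pi_new, eps=1e-12):
    """Mean KL(pi_old || pi_new) over sampled states, for discrete policies.

    pi_old, pi_new: arrays of shape (N, num_actions) of action probabilities.
    """
    kl = np.sum(pi_old * (np.log(pi_old + eps) - np.log(pi_new + eps)), axis=1)
    return float(np.mean(kl))

def trust_region_ok(pi_old, pi_new, delta):
    """TRPO-style feasibility check: the update is acceptable only while
    the average KL-divergence stays below the trust-region radius delta."""
    return mean_kl(pi_old, pi_new) <= delta

pi_old = np.array([[0.5, 0.5]])
pi_near = np.array([[0.55, 0.45]])  # small step: stays inside the trust region
pi_far = np.array([[0.95, 0.05]])   # large step: violates the constraint
```

Enforcing this check exactly requires a constrained optimizer (conjugate gradient plus a line search in TRPO), which is precisely the computational burden PPO's clipping avoids.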

B. PROXIMAL POLICY OPTIMIZATION
TRPO has been theoretically shown to be capable of ensuring monotonic performance improvement of the policy. Nonetheless, due to the KL-divergence constraint, TRPO is computationally inefficient and difficult to scale up to high-dimensional problems. In order to simplify the implementation and reduce the computational requirements, the PPO algorithm [20] was introduced. PPO restricts the policy update by applying a clipping function F to the ratio r_{s,a}(π) directly,

    L^CLIP(π) = E_{s,a} [ min( r_{s,a}(π) A^{π_old}(s, a), F(r_{s,a}(π), ε) A^{π_old}(s, a) ) ].    (6)

In PPO, F is defined as

    F^PPO(r_{s,a}, ε) = clip(r_{s,a}, 1 − ε, 1 + ε),    (7)

where 1 − ε and 1 + ε are called the lower and upper clipping range, respectively, and ε ∈ (0, 1) is a hyperparameter. The minimization operation in Eq. (6) keeps the gradient when r_{s,a} is out of the clipping range. The clipping mechanism is illustrated by the red dashed line in Fig. 1.
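The flat clipping and the per-sample clipped surrogate can be sketched in a few lines (the function names are our own; the mechanism follows the standard PPO formulation):

```python
import numpy as np

def f_ppo(ratio, eps=0.2):
    """Flat PPO clipping: clip(r, 1 - eps, 1 + eps)."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def l_clip(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, F_PPO(r) * A).
    The min keeps the pessimistic (lower) term, so a gradient survives
    whenever the unclipped objective is the worse of the two."""
    return np.minimum(ratio * advantage, f_ppo(ratio, eps) * advantage)

# With A > 0 and r above the range, the clipped term caps the objective:
# l_clip(1.5, 1.0) = min(1.5, 1.2) = 1.2.
# With A < 0, the min keeps the unclipped, more pessimistic term:
# l_clip(1.5, -1.0) = min(-1.5, -1.2) = -1.5.
```

Note that once the ratio is outside the range (with A > 0), the objective is flat and its gradient is zero; nothing actively pulls the ratio back, which is the weakness discussed in Section II-C.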
Both TRPO and PPO adopt the actor-critic method [10], in which the value function Q^π(s, a) is estimated by a parameterized function approximator Q(s, a; ω), where the vector ω is the parameter of the approximator, e.g., a linear model or, more commonly, a neural network. This estimator is called the critic, and it can have a lower variance than traditional Monte Carlo estimators [27]. The critic's estimate is used to optimize the policy π(a | s; θ), also called the actor, parameterized by θ. The actor is usually represented by a neural network. The policy parameters can then be updated with the gradient of the surrogate objective Eq. (6) as

    θ_{t+1} = θ_t + β ∇L^CLIP(θ_t),

where β is the step size and ∇L^CLIP(θ_t) denotes the gradient of L^CLIP at θ_t.

C. PPO WITH ROLLBACK
PPO has made progress in different control tasks, especially those with high-dimensional observation and action spaces, but it has also raised concerns among researchers about whether the clipping mechanism can truly restrict the policy ([8], [25]). In reality, PPO is incapable of binding the likelihood ratio within the clipping range, because its clipping mechanism can still push the ratio out of that range, taking the policy out of bounds.
To address this problem, PPORB [26] proposes a rollback operation on the clipping function, attempting to remove the incentive of moving far away, defined as

    F^RB(r_{s,a}, ε, α) = −α r_{s,a} + (1 + α)(1 + ε),  if r_{s,a} ≥ 1 + ε;
                          −α r_{s,a} + (1 + α)(1 − ε),  if r_{s,a} ≤ 1 − ε;
                          r_{s,a},                      otherwise,    (9)

where α is a hyperparameter that decides the force of the rollback operation. When the ratio r_{s,a} is out of the clipping range, it makes a negative contribution to the output instead of being kept constant. PPORB is a clear improvement over the original PPO algorithm, but it still leaves room for improvement. Firstly, the rollback operation does not converge when the ratio is extremely large: in high-dimensional tasks, this makes the policy search sway around the optimal policy. Secondly, the rollback operation introduces an extra hyperparameter which can only be tuned from experience on the same problem, yet different hyperparametric choices produce significantly different results, with reported values for this parameter ranging from 0.02 to 0.3.
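A sketch of the rollback clipping, reconstructed from the description above (the exact constants of Eq. (9) follow [26]; the function name is ours), makes the divergence problem easy to see:

```python
import numpy as np

def f_rollback(ratio, eps=0.2, alpha=0.3):
    """Rollback clipping (sketch of PPORB's F): inside [1-eps, 1+eps] the
    ratio passes through; outside, the function slopes *downward* with
    gradient -alpha, actively pushing the ratio back toward the range.
    Note the flaw discussed above: the output is unbounded as r grows."""
    lower, upper = 1.0 - eps, 1.0 + eps
    return np.where(
        ratio > upper, -alpha * ratio + (1.0 + alpha) * upper,
        np.where(ratio < lower, -alpha * ratio + (1.0 + alpha) * lower, ratio))

# Continuous at the boundary: f_rollback(1.2) = 1.2.
# Just outside: f_rollback(2.0) = -0.3*2.0 + 1.3*1.2 = 0.96.
# Far outside, the output diverges: f_rollback(100.0) = -28.44.
```

The last line illustrates why an extreme ratio sends the clipped value toward negative infinity, the exact failure mode the functional clipping of Section III is designed to remove.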

III. METHOD
In this section, we first describe our reinforcement learning method, then compare it with other proximal policy optimization algorithms, and finally discuss the stability and strengths of our method.
As discussed in Section II-C, while the rollback operation can, to some extent, restrict the likelihood ratio, it leaves two problems unsolved: the convergence property and the choice of hyperparameters. We build our method on the structure of PPO, with the general form for a sample (s, a),

    L_{s,a}(π) = min( r_{s,a}(π) A^{π_old}(s, a), F(r_{s,a}(π), ·) A^{π_old}(s, a) ),

where F is a clipping function which attempts to restrict the policy update and ''·'' stands for any hyperparameters of F. In PPO, F is the ratio-based clipping function F^PPO of Eq. (7), which becomes flat whenever the ratio is out of the clipping range.
In PPORB, F^PPORB applies the rollback operation of Eq. (9) instead of keeping the function flat. We modify this function to improve the performance and stability of restricting the likelihood ratio with a functional clipping method, defined as

    F^PPOS(r_{s,a}, ε, α) = (1 + ε) − tanh( α (r_{s,a} − 1 − ε) ),  if r_{s,a} ≥ 1 + ε;
                            (1 − ε) − tanh( α (r_{s,a} − 1 + ε) ),  if r_{s,a} ≤ 1 − ε;
                            r_{s,a},                                otherwise,

where α is a hyperparameter that decides the scale of the functional clipping; the decreasing slope becomes steeper as α grows. Combining the original PPO algorithm with this novel functional clipping method, we obtain the Proximal Policy Optimization Smoothed algorithm, shown in Fig. 1, which plots L^CLIP(π) as a function of the likelihood ratio r_{s,a}(π). We utilize the gradient profile of the hyperbolic tangent activation function to obtain the smoothing effect. As the figure depicts, when r_{s,a}(π) is out of the clipping range, the slope of L^PPOS(π) starts with a large absolute value and gradually converges to zero. This smoothing mechanism guarantees the convergence of the clipping function. In contrast to PPORB, PPOS will never produce an extremely large clipping output, which is unacceptable in policy search. When α < 0, the function is smoother and more natural in shape; however, the actual performance is not better than with α > 0, as shown in Section IV-C. The hyperbolic tangent could also be replaced by other activation functions, leaving room for further exploration.
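The tanh-based functional clipping can be sketched as follows. Since the paper's exact equation is not reproduced in this excerpt, the form below is an assumption reconstructed from the stated properties (pass-through inside the range, initial rollback slope of magnitude α at the boundary, smooth saturation to a bounded value), and the function name is ours:

```python
import numpy as np

def f_ppos(ratio, eps=0.2, alpha=0.3):
    """Sketch of PPOS-style functional clipping (assumed form): inside
    [1-eps, 1+eps] the ratio passes through; outside, the output rolls
    back from the boundary with initial slope -alpha and smoothly
    saturates via tanh, so it can never diverge as the rollback does."""
    lower, upper = 1.0 - eps, 1.0 + eps
    return np.where(
        ratio > upper, upper - np.tanh(alpha * (ratio - upper)),
        np.where(ratio < lower, lower - np.tanh(alpha * (ratio - lower)), ratio))

# Continuous at the boundaries, and bounded for arbitrarily large ratios:
# f_ppos(1000.0) stays near upper - 1 = 0.2 instead of diverging.
```

Compared with the rollback sketch, the only change is replacing the linear outer branches with saturating tanh branches; this single change is what restores convergence of the clipped output.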

IV. EXPERIMENTS
In this section, we present experiments assessing our proposed algorithm in terms of performance and stability during policy search. We compare our method with the aforementioned state-of-the-art PPO variants in different environments. We also consider different hyperparametric choices, show their corresponding results, and give a quick guideline with tuning suggestions for deploying our algorithm on different applications.

A. EXPERIMENTAL SETUP
We test the performance of our algorithm against the following benchmarks: (a) PPO: the original PPO algorithm with the author-recommended clipping range hyperparameter ε = 0.2, as seen in [20]. (b) PPORB: PPO with the rollback function. The rollback coefficient α is set to 0.3 for all tasks (except for the Humanoid-v2 task), as suggested by the authors in [26]. (c) PPOS: our proposed method, with the α coefficient set according to Table 2.
The algorithms are evaluated on continuous control benchmark tasks implemented in OpenAI Gym [4], simulated by MuJoCo [23] on the Tianshou platform.1 The chosen MuJoCo environments are version 2 (v2) for all experiments. The actor and critic networks are neural networks with one hidden layer of 128 neurons, and the adopted activation function is the ReLU function [6]. Each algorithm was run with ten random seeds for 250,000 timesteps, obtained as 2,500 steps per epoch over 100 epochs (except for the Humanoid-v2 task, run for 1 million timesteps due to its higher dimensionality), using the ADAM optimizer [11]. Details can be viewed in Table 3 in the Appendix.

B. COMPARISON WITH OTHER METHODS
We conducted experiments in five MuJoCo environments with continuous control tasks of both high and low dimensionality, as shown in Fig. 2. In Fig. 3 we plot the episode rewards during the training process on these tasks.

FIGURE 3. Episode rewards of the policy during the training process, averaged over ten random seeds on MuJoCo tasks, comparing our method with PPO and PPORB. The solid lines are mean values over ten trials, while the shaded area depicts the 60% confidence interval. |O| represents the dimension of the observation.

Our PPOS method with the functional clipping mechanism significantly outperforms both PPO and PPORB on hard tasks characterized by high observation and action dimensionality (e.g., Humanoid-v2 with |O| = 376 and |A| = 17, and Ant-v2 with |O| = 111 and |A| = 8), both in learning speed and in final reward. On HalfCheetah-v2, with medium dimensionality (|O| = 17 and |A| = 6), our method achieves a better final reward in the later stage. The advantage given by the clipping mechanism is less salient in lower dimensionality, and PPOS is comparable to both PPO and PPORB in such tasks (e.g., Swimmer-v2 with |O| = 8 and |A| = 2, and Reacher-v2 with |O| = 11 and |A| = 2). The stricter clipping mechanisms of PPORB and PPOS tend to slow these algorithms down in their initial stage (especially in low dimensions, while speeding them up in higher dimensionality) through smaller policy updates, but eventually they enable these algorithms to asymptote at higher rewards.
The numerical analysis, with means and standard deviations, is presented in Table 1. PPOS reaches a better performance on 80% of the adopted MuJoCo tasks and, as seen in Table 1, it was virtually tied with PPORB on the low-dimensional Swimmer-v2 task (while presenting a lower standard deviation). On average, PPOS reaches 8% higher rewards than PPO and 7% higher than PPORB, while its standard deviation is 3% lower than both. A possible explanation is that the functional clipping mechanism successfully keeps the policy update within a reasonable range, whereas PPO and PPORB make the update either too large or too small. Moreover, PPORB's clipping can by nature generate strongly negative updates, which is detrimental to the policy search.
We also compared the entropy of the three methods throughout training, as shown in Fig. 4. In general, PPOS and PPORB have lower entropy than PPO in most tasks, which means they make the trained policy more deterministic. In Humanoid-v2, PPOS even achieves lower entropy than PPORB, which makes the policy more practical for implementation on real robots.

C. CHOICE OF THE HYPERPARAMETER
Correct hyperparametric choices are crucial for policy search algorithms. For example, in PPORB the coefficient of the rollback operation α can vary from 0.3 to 0.05 across different tasks. However, in most former algorithms, e.g., [18], [20], and [26], the suggested value was only provided empirically for specific tasks. With this in mind, we evaluate our hyperparameter choices on varied tasks and provide a selection criterion for the hyperparameter value at different problem dimensionalities.
We conduct this experiment on the Humanoid-v2 and Ant-v2 tasks, which have higher-dimensional observation spaces, since in low-dimensional tasks the performance differences between PPO, PPORB, and PPOS can be very subtle. As shown in Fig. 5, in the Humanoid-v2 task the coefficient α = 0.05 achieves the best performance among the tested values, while α = 0.2 is the best choice in the Ant-v2 task. As discussed in Section III, a negative α, which yields a smoother clipping function, is incapable of improving performance on Ant-v2. A possible remedy for this issue is decreasing the clipping range toward a lower value. For the other three tasks, with relatively low dimensionality, the default α = 0.3 can be regarded as the benchmark, as presented in Table 2.
Once we had empirically found the optimal parameters for all five environments, we fit an exponential regression function, α = 0.3333 e^{−0.0048 |O|}, to the α values with respect to the dimension of the observation space, as shown in Fig. 6. In general, the proposed α is lower when the dimension of the observation is higher. Our explanation is that a lower α value is effective in preventing unnecessarily large policy updates, which are more likely to occur in high dimensions. For low-dimensional tasks, the default α = 0.3 balances performance and robustness. The exponential regression has two key advantages over linear and polynomial regressions: first, it converges smoothly to 0 as |O| grows, matching the intuition that higher dimensions call for lower α values; second, it expresses the same data more efficiently than higher-order polynomial expressions.
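The guideline above can be applied directly; the following one-liner (the helper name is ours) evaluates the fitted regression and roughly recovers the tuned values reported in the text:

```python
import math

def suggested_alpha(obs_dim):
    """Guideline from the fitted exponential regression:
    alpha = 0.3333 * exp(-0.0048 * |O|),
    where |O| is the dimension of the observation space."""
    return 0.3333 * math.exp(-0.0048 * obs_dim)

# Humanoid-v2 (|O| = 376) -> ~0.05, Ant-v2 (|O| = 111) -> ~0.20,
# Swimmer-v2 (|O| = 8)    -> ~0.32 (i.e., close to the default 0.3).
```

In practice, a user deploying PPOS on a new environment would read off |O| from the environment's observation space and round the regression output to a convenient value.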

V. CONCLUSION
Although PPO-family algorithms have achieved state-of-the-art performance in policy search on high-dimensional control tasks, they also present flaws when restricting the step of each policy update. The original clipping method of PPO cannot prevent large policy updates, while the improved PPORB cannot guarantee the convergence of those same updates. Additionally, as hyperparameters were proposed for specific tasks, users are left without guidance for retuning those parameters when redeploying the algorithm to their own needs. Based on these observations, in this paper we proposed the PPOS algorithm, which combines a functional clipping mechanism with the original PPO. Our work significantly improves the ability to restrict policy updates and maintain stability, as demonstrated through extensive results. We also provide a helpful guideline for tuning the only hyperparameter used by our method, enabling efficient redeployment on novel environments in the future.
While in this paper we focused on the hyperbolic tangent function and the tuning of its coefficient, in the future we will consider clipping mechanisms built on different functions, as other functions have different properties that will affect the performance of the policy search. We will also study the relationship between the clipping range and the clipping function: the default ε = 0.2 is widely used in different variants of PPO, so there is considerable room for exploring the clipping range. Another possible future improvement is the trust-region method, which is another mechanism to restrict policy updates [18]. Finally, implementations of deep reinforcement learning algorithms suffer from repetitive tuning of hyperparameters [7]. Although we proposed a tuning method for one hyperparameter, there are still many hyperparameters, as seen in Table 3, which may need to be explored in those frameworks.

APPENDIX
See Table 3.
WANGSHU ZHU received the bachelor's degree in computer science from ShanghaiTech University, in 2019, where he is currently pursuing the master's degree in computer science with the School of Information and Science Technology. He joined the Living Machines Laboratory, ShanghaiTech University, in 2018. His current research interests include model-based fault tolerance for manipulators and reinforcement learning methods for robotics. He also focuses on the combination of model-based learning methods and robots. He was named a Merit Student of ShanghaiTech University in 2020.