
Fast-PPO: Proximal Policy Optimization with Optimal Baseline Method


Abstract:

Deep deterministic policy gradient (DDPG) is a useful deep reinforcement learning approach, but it tends to suffer from unstable gradient estimates. Recent methods such as PPO merely limit the policy update to a lower speed, maintaining stability only incidentally. In this paper, we model the problem under an advantage actor-critic (A2C) architecture. We first analyze the simplified analytic solution in A2C and show that the instability of the policy update is mainly attributable to two factors: the variance of the action estimate and the variance of the accumulative reward. To address this, we propose Fast-PPO, a PPO method with an optimal baseline. Specifically, our hybrid optimal baseline accounts for both the advantage of the action estimate and the estimate of the accumulative reward. As a result, the action estimate converges faster in the right direction, and the accumulative reward is kept under lower variance. Experimental results demonstrate that Fast-PPO outperforms other methods.
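
To illustrate the variance-reduction idea the abstract describes, the sketch below shows a standard PPO clipped surrogate loss in which a baseline is subtracted from the accumulative return to form the advantage. This is a minimal sketch assuming a PyTorch setup; it uses an ordinary learned value estimate `baseline_values` as the baseline, not the paper's hybrid optimal baseline, and the function name `ppo_clipped_loss` is illustrative only.

```python
# Minimal PPO-style clipped surrogate loss with a baseline-subtracted advantage.
# Assumption: the baseline is a generic learned value estimate V(s), standing in
# for (but not equal to) the hybrid optimal baseline proposed in the paper.

import torch


def ppo_clipped_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     returns: torch.Tensor,
                     baseline_values: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate objective (returned as a loss to minimize).

    Subtracting a baseline from the accumulative return yields an advantage
    estimate, which lowers the variance of the policy-gradient estimate
    without changing its expectation.
    """
    # Advantage: accumulative reward minus baseline (detached so that the
    # gradient here flows only through the policy).
    advantages = (returns - baseline_values).detach()
    # Normalizing advantages is a common additional variance-reduction step.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Probability ratio between the new and old policies.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Clipped surrogate: limits the size of each policy update.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for a batch of transitions.
    batch = 32
    log_new = torch.randn(batch, requires_grad=True)
    log_old = torch.randn(batch)
    rets = torch.randn(batch)
    vals = torch.randn(batch)
    loss = ppo_clipped_loss(log_new, log_old, rets, vals)
    loss.backward()
    print("loss:", loss.item())
```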
Date of Conference: 18-20 December 2020
Date Added to IEEE Xplore: 16 February 2021
Conference Location: Shanghai, China
