I. Introduction
Reinforcement learning (RL) aims to obtain optimal policies that maximize the expected accumulated reward by interacting with the environment via trial and error [1]. Recently, owing to advances in deep learning (DL), deep reinforcement learning has achieved great success and has been applied to many challenging problems, such as video games [2], Go [3], the Watson DeepQA system [4], autonomous driving [5], [6], and multi-robot systems [7]. However, training a deep reinforcement learning robot directly in the real world typically faces two main challenges [8]. The first is sample efficiency: learning an optimal policy for a real-world robot generally requires millions of samples, which can take several months to collect because task executions in the real world are expensive and time-consuming. The second is safety: a deep reinforcement learning robot may damage itself or harm living things in its surroundings during trial-and-error exploration [9], [10].
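To make the trial-and-error loop above concrete, the following is a minimal sketch of tabular Q-learning on a hypothetical five-state chain environment (the environment, state count, and all hyperparameters are illustrative assumptions, not part of this work): the agent explores with epsilon-greedy actions and updates its value estimates toward the received reward plus the discounted best next value.

```python
import random

# Hypothetical toy MDP (illustration only): a 5-state chain.
# Action 1 moves right, action 0 moves left; reaching the
# last state yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # assumed hyperparameters
random.seed(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration: the "trial and error" of RL.
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, r, done = step(s, a)
        # Q-learning update toward reward + discounted best next value.
        best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# After training, the greedy policy moves right in every non-terminal state.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)
```

Even in this tiny setting, many episodes of interaction are needed before the value estimates support a good policy, which previews the sample-efficiency challenge discussed next.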