I. Introduction
Reinforcement learning (RL) has achieved success across a range of automated control tasks such as robotic locomotion [1], navigation [2], and transportation management [3], [4]. However, the trial-and-error nature of RL hinders its application to many real-world tasks, since the large number of failures incurred by unconstrained policies threatens the safety of users and systems. Safe RL [5] has therefore been proposed to impose constraints on agents and enhance the safety of policies both after convergence and during the training process. The training-time safety problem, also called safe exploration (i.e., reducing the number of constraint violations during learning), is widely considered challenging and significant, especially when the dynamics of the environment are unknown. In this paper, we focus not only on the safe RL problem of satisfying constraints after convergence, but also take a step toward safe exploration.