Guided Soft Actor Critic: A Guided Deep Reinforcement Learning Approach for Partially Observable Markov Decision Processes

Most real-world problems are essentially partially observable, and the environmental model is unknown. There is therefore a significant need for reinforcement learning approaches in which the agent perceives the state of the environment only partially and noisily. Guided reinforcement learning methods address this issue by providing additional state knowledge to reinforcement learning algorithms during the learning process, allowing them to solve a partially observable Markov decision process (POMDP) more effectively. However, such guided approaches are relatively rare in the literature, and most existing ones are model-based, meaning that they must first learn an appropriate model of the environment. In this paper, we propose a novel model-free approach that combines the soft actor-critic method with supervised learning concepts to solve real-world problems formulated as POMDPs. In experiments performed on OpenAI Gym, an open-source simulation platform, our guided soft actor-critic approach outperformed other baseline algorithms, gaining 7~20% more maximum average return on five partially observable tasks constructed from continuous control problems and simulated in MuJoCo.


I. INTRODUCTION
Recently, deep learning techniques have been implemented in reinforcement learning (RL) as function approximators, yielding deep RL, to solve high-dimensional, continuous state-space problems [1]-[4]. In Markov decision process (MDP)-based deep RL approaches, it is generally assumed that the agent can fully observe the true state of the environment and that the environment's next state is determined solely by the current state and the chosen action. This ensures that the agent can act optimally by simply reacting to the current state of the environment. However, this assumption does not hold for many real-world problems. In real-world environments, agents often perceive the state of the environment only partially and noisily; in other words, the Markov property is seldom found in real-world environments. A reactive policy found under the MDP assumption cannot be considered optimal if the observations do not satisfy the Markov property. In such cases, the partially observable Markov decision process (POMDP) framework is used. A POMDP extends the MDP by allowing an agent to represent a sequential decision-making process under uncertainty.
The development of RL algorithms for solving POMDPs poses many challenges. In an MDP, the agent explicitly observes state transitions, and the generative model is easily estimated using empirical estimators. In a POMDP, partial and noisy observations must be used to estimate the transition and reward models. Beyond direct observation, POMDPs require observation probabilities and a probability distribution over all possible states of the underlying MDP to predict the next state. To compute the optimal policy for a POMDP with known parameters, an augmented MDP over a continuous belief space must be solved. Finally, it is not trivial to integrate prediction and planning with guarantees during exploration, and no no-regret strategy is currently known. Guided RL methods address this issue by providing additional state knowledge to RL algorithms during the learning process, allowing them to solve POMDPs more effectively.
In the literature, most deep RL approaches have been applied to Markov decision processes; deep RL for POMDPs has not been studied sufficiently. Following the successful performance of the deep Q-network (DQN), several approaches have been developed for POMDPs, mainly focusing on remembering important observations by implementing recurrent neural networks (RNNs) [5]-[8]. However, especially in real-world scenarios, there are also partially observable environments in which the agent must take information gathering into account and therefore needs much more exploration and interaction with these environments to learn how to collect and use the information.
In this study, we propose a novel approach that combines the soft actor-critic method [17] and supervised learning concepts to solve real-world problems, formulating them as POMDPs. In experiments performed on OpenAI Gym [18], an open-source simulation platform, our guided SAC approach outperformed other baseline algorithms trained directly on five partially observable tasks constructed from continuous control problems simulated in MuJoCo [19]. The main contribution of this study is as follows: our proposed guided SAC approach outperforms the comparison methods on partially observable continuous control experiments owing to our novel architecture and interaction loop for the classic soft actor-critic. Unlike the original concept of SAC, in the training phase the guided-SAC architecture consists of two actors and a critic. To generate the samples, the two actors iteratively interact with the same environment. The final control actor chooses actions based on partial observations, while the guiding actor chooses actions by perceiving full-state observations. Both the final control and guiding policies are trained to maximize both the expected reward and the entropy. Furthermore, a deep neural network (DNN) is trained to teach the control policy actor to act like the guiding policy actor. This paper is organized into six sections, including the introduction. The second section provides a summary of the deep reinforcement learning taxonomy, a brief review of guided policy search methods, and a detailed literature survey of deep reinforcement learning approaches for partially observable Markov decision processes. The third section provides the background information necessary to comprehend our novel approach. The fourth section gives a detailed explanation of the proposed methodology. The fifth section presents and discusses the results obtained. The sixth section presents the conclusion and future work plan of this study.

II. RELATED WORK
In this section, we first summarize the taxonomy of the DRL algorithms. Second, we provide a brief review of guided policy search methods from a historical perspective. In the last part of the section, we provide a detailed literature survey of deep reinforcement learning approaches for a partially observable Markov decision process.

A. TAXONOMY OF DEEP REINFORCEMENT LEARNING ALGORITHMS
Reinforcement learning is a machine learning subfield in which an agent attempts to learn actions by maximizing the reward received through interaction with the environment [20]. Conventional RL policies such as tabular, linear, and radial basis networks have a number of drawbacks, especially in terms of high-level behavior representation and the computational cost of updating parameters. Combining RL and deep learning can extract more information from raw inputs by passing them through multiple neural layers, which can have a large number of parameters and represent highly nonlinear problems [10]. By using deep learning as a function approximator in RL [21], deep RL methods can approximate optimal policies and/or value functions.
All methods in deep reinforcement learning can be divided into the following categories: model-free/model-based, value-based/policy-based, and on-policy/off-policy. Figure 1 shows the taxonomy of deep reinforcement learning methods. In real-world problems, interaction with the physical world may be restricted for a variety of reasons, such as safety and cost. The amount of interaction required in the real world can be reduced by learning a dynamics model of the environment; exploration can then be performed on the learned model. With model-based approaches, the agent tries to learn the transition and reward functions, which are utilized when taking actions. Forming such a model requires gathering information on the dynamics of the environment. Unlike model-based approaches, model-free approaches do not require this type of information: they learn directly from samples of the underlying MDP without constructing an explicit model. Model-free policy search methods require more training data and a longer training time than model-based methods [22].
Value-based methods estimate value functions directly and infer from them the policies to be implemented. These value functions are calculated by combining the immediate reward with the discounted value of the next state. Policy-based methods instead discover policies that select actions without using value function estimates [22]. The distinction between value-based and policy-based methods reflects a trade-off between performance and strategy complexity. Both kinds of methods need to propose actions and evaluate their outcomes. However, whereas value-based methods emphasize determining the maximum cumulative return and derive a policy from these value estimates, policy-based methods try to establish directly the optimal policy that maximizes the cumulative discounted reward. Furthermore, actor-critic approaches, also known as hybrid approaches, combine value-based and policy-based methods [22].

FIGURE 1. Taxonomy of Deep Reinforcement Learning Algorithms
On-policy algorithms are RL algorithms that train agents based solely on experience gathered under the current policy. They are sample-inefficient because changing the agent's policy and actions during training invalidates earlier experience, so prior experience cannot be reused for training purposes. Off-policy algorithms, in contrast, can employ the experience gained from other policies [66].

B. GUIDED POLICY SEARCH
Guided policy search algorithms assist policy learning with additional information and mechanisms. The type of assistance can vary depending on the domain of the problem or the policy search method used.
Levine and Koltun [67] proposed a guided policy search method using trajectory optimization to learn policies directly and prevent poor local optima. Levine and Koltun [67] employed an importance-sampled variant of the likelihood ratio estimator to incorporate off-policy guiding samples, created using differential dynamic programming, into the policy search. Guided policy search algorithms have been applied to continuous control domains, such as challenging locomotion tasks, robotic manipulation, and control [68][69][70]. Levine and Abbeel [71] performed a guided policy search under unknown dynamics with a hybrid method and optimized trajectory distributions for large continuous problems using iteratively refitted local linear models. Zhang et al. [72] proposed decomposing the policy search problem into two phases, trajectory optimization and supervised learning, in order to acquire policies with effective memory and recall procedures in continuous control tasks. In the context of guided policy search, Zhang et al. [73] presented an approach that integrated model predictive control (MPC) and RL. To supervise the learning of the neural network policy, Zhang et al. [73] employed MPC, which received the full state of the system only during the training phase. Zhang et al. [73] used a model-based technique to create guiding samples for guided policy search by combining MPC with offline trajectory optimization, assuming that an approximate model of the system dynamics is available in the training phase.
To improve their effectiveness in solving POMDPs, guided RL techniques use additional state information throughout the learning process. There are only a few prominent examples, such as the studies by [67] or [73] on guided policy search.
Egorov [5] suggested extending the DQN approach by mapping an action-observation history to an optimal action to solve POMDPs. Egorov [5] showed that DQNs can learn good policies but require substantially more computational resources, and that although the Q-values converge, the policies are vulnerable to minor disturbances. Another approach, called the deep recurrent Q-network (DRQN), can be implemented in a model-free representation using additional LSTM layers [6]. The trained policy network can capture all historical data in the LSTM layer, and the output actions are based on all historical observations. After performing extensive analysis on DRQN, Foerster et al. [7] proposed deep distributed recurrent Q-networks (DDRQN), which used both prior observations and actions as inputs. The action-specific deep recurrent Q-network (ADRQN) uses action-observation pairs to obtain an optimal policy in partially observable domains [8].
Jin et al. [9] proposed a novel algorithm called ARM that does not require access to a Markovian state and is based on the notion of regret minimization. They demonstrated that regret-estimating approaches handle partial observability better than regular RL. Le et al. [10] proposed a hierarchical deep RL approach for learning in a hierarchical POMDP. Instead of using the deep deterministic policy gradient (DDPG) for continuous control, Heess et al. [11] proposed a recurrent deterministic policy gradient (RDPG) for the POMDP, which uses recurrent neural networks (RNNs) to approximate the actor and critic. The autonomous navigation of an unmanned aerial vehicle (UAV) was modeled in [12] as a POMDP under uncertainty in a large-scale complex environment. Wang et al. [12] proposed fast-RDPG, an actor-critic-based method with function approximation that generates policy updates by directly maximizing the expected long-term accumulated discounted reward. The study in [13] extended RDPG to solve partially observable bipedal locomotion tasks; however, only a single task was investigated. Igl et al. [14] proposed deep variational RL (DVRL), using inductive biases in the network to learn a generative model that can imitate belief updates. DVRL is a state-of-the-art deep RL method for POMDPs. DVRL uses the evidence lower bound (ELBO) loss to estimate belief states directly for recurrent RL policies, while probabilistic inference for learning control (PILCO) uses dynamic models learned from beliefs rather than observations. Previous studies have focused on partial observations, but none have discussed stale observations. Pajarinen et al. [15] proposed compatible policy search (COPOS), a variant of trust region policy optimization (TRPO).
Although the original TRPO and proximal policy optimization (PPO) were planned for an MDP with an infinite horizon, they have recently been expanded for a finite horizon [16].
In this study, inspired by [73], we propose a novel architecture and interaction loop for SAC [17], one of the state-of-the-art methods for MDPs, to solve partially observable continuous control tasks more efficiently in settings that demand much more exploration and interaction with the environment, using a guided policy search approach to optimize the policy.

III. PRELIMINARIES
This section provides background information necessary to comprehend our guided soft actor-critic approach. We first briefly introduce partially observable Markov decision process (POMDP) fundamentals and then present the classic soft actor-critic method.

A. PARTIALLY OBSERVABLE MARKOV DECISION PROCESS
In many real-world problems, the agent cannot fully observe the state of the current environment, which is necessary for MDP-based learning methods. The Markov property for MDPs does not hold in real-world scenarios in which observations are incomplete or noisy, such as autonomous driving, where the agent can only observe parts of the environment and does not have a complete view of the world. Under uncertainty, the results of actions depend on more than the current state signal. POMDPs generalize MDPs to decision-making processes under partial observability.
A POMDP captures the dynamics of many real-world environments better because it explicitly acknowledges that the sensations acquired by the agent are simply fragmentary glimpses of the underlying system state [6]. Formally, a POMDP is defined [74] by the tuple (S, A, O, T, R, Ω, γ), where S is the state space, A is the action space, O is the observation space, T is the transition model, R is the reward function, Ω is the observation model, and γ is the discount factor, consisting of
• a set of states s ∈ S,
• a set of actions a ∈ A,
• a set of observations o ∈ O the agent can experience of its world,
• a reward function defined as R: S × A → ℝ,
• the initial state distribution μ₀(s) specifying the probability of state s being the initial state of the decision process,
• the history h_t = (o₀, a₀, …, a_{t−1}, o_t) as a vector containing all past observations and taken actions.
A POMDP is distinguished from a fully observable MDP by the fact that the agent perceives an observation o ∈ O rather than observing the actual state of the environment. As a result, POMDPs have two more elements in their description than MDPs. The set of observations O represents all possible sensor readings that the agent can receive. The observation function Ω indicates which particular observation the agent receives, depending on the next state s′ and possibly also on its action a [70].
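To make the tuple above concrete, its components can be sketched as a small container type. This is a minimal illustration, not code from the paper; the class name `POMDP`, the field names, and the toy two-state instance are our own assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class POMDP:
    """Container for the POMDP tuple (S, A, O, T, R, Omega, gamma)."""
    states: Sequence[Any]        # S: the state space
    actions: Sequence[Any]       # A: the action space
    observations: Sequence[Any]  # O: the observation space
    transition: Callable         # T(s, a) -> next state (deterministic in this toy)
    reward: Callable             # R(s, a) -> scalar reward
    observe: Callable            # Omega(s_next, a) -> distribution over observations
    gamma: float                 # discount factor

# Toy two-state instance: the agent never sees the state itself,
# only a noisy binary observation of it.
toy = POMDP(
    states=[0, 1],
    actions=["stay", "switch"],
    observations=["low", "high"],
    transition=lambda s, a: 1 - s if a == "switch" else s,
    reward=lambda s, a: 1.0 if s == 1 else 0.0,
    # Probabilities of ["low", "high"] given the next state.
    observe=lambda s_next, a: {0: [0.8, 0.2], 1: [0.2, 0.8]}[s_next],
    gamma=0.99,
)
```

Because the agent only receives samples from `observe`, any policy must reason over histories or beliefs rather than raw states, which is exactly the extra difficulty discussed above.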

B. Soft Actor Critic (SAC)
SAC, which is defined for RL tasks involving continuous actions, is an off-policy optimization algorithm for stochastic policies. Entropy regularization is a key element of SAC: the policy is trained to maximize a trade-off between the expected return and entropy, a measure of the randomness of the policy [18]. This is strongly related to the exploration-exploitation trade-off. More exploration occurs when entropy increases, which has the potential to accelerate learning. It can also prevent the policy from prematurely converging to a bad local optimum. The objective function of SAC comprises both a reward term and an entropy term. SAC concurrently learns using three networks: a state value function V parametrized by ψ, a soft Q-function Q parametrized by θ, and a policy function π parametrized by ϕ [17].
The value network is trained to minimize the squared residual error [17]:

J_V(ψ) = E_{s_t∼D} [ ½ ( V_ψ(s_t) − E_{a_t∼π_ϕ}[ Q_θ(s_t, a_t) − log π_ϕ(a_t|s_t) ] )² ]  (2)

where D is the distribution of the replay buffer. The gradient of Eq. 2 can be estimated using an unbiased estimator [17]:

∇̂_ψ J_V(ψ) = ∇_ψ V_ψ(s_t) ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_ϕ(a_t|s_t) )  (3)

where the actions are sampled according to the current policy instead of the replay buffer. The Q network is trained by minimizing the soft Bellman residual [17]:

J_Q(θ) = E_{(s_t, a_t)∼D} [ ½ ( Q_θ(s_t, a_t) − Q̂(s_t, a_t) )² ]  (4)

with the target

Q̂(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼p} [ V_ψ̄(s_{t+1}) ]  (5)

where V_ψ̄ is a target value network. Finally, the policy parameters are learned by directly minimizing the expected KL-divergence [17]:

J_π(ϕ) = E_{s_t∼D} [ D_KL( π_ϕ(·|s_t) ∥ exp(Q_θ(s_t, ·)) / Z_θ(s_t) ) ]  (7)

The policy is reparametrized by using a neural network transformation

a_t = f_ϕ(ε_t; s_t)  (8)

where ε_t is an input noise vector. The objective in Equation 7 can be rewritten as follows:

J_π(ϕ) = E_{s_t∼D, ε_t∼N} [ log π_ϕ( f_ϕ(ε_t; s_t) | s_t ) − Q_θ( s_t, f_ϕ(ε_t; s_t) ) ]  (9)

The gradient of Equation 9 can be approximated with

∇̂_ϕ J_π(ϕ) = ∇_ϕ log π_ϕ(a_t|s_t) + ( ∇_{a_t} log π_ϕ(a_t|s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) ∇_ϕ f_ϕ(ε_t; s_t)  (10)

where a_t is evaluated at f_ϕ(ε_t; s_t) [17].

IV. GUIDED DEEP REINFORCEMENT LEARNING APPROACH FOR PARTIALLY OBSERVABLE MARKOV DECISION PROCESS
Guided policy search (GPS) approaches [67]-[73] provide an idea for combining RL and supervised learning to solve POMDP problems by incorporating previous shortened observation-action pairs into current state-action representations. Model-based approaches need to learn a dynamics model of the environment to reduce the amount of interaction required in the real world. However, this can be challenging for complex tasks and requires extensive computation. On the other hand, model-free approaches learn a policy directly from interactions with the environment and have much more flexibility because they do not need an exact representation of the environment. To achieve effective outcomes with model-free approaches, an adequate number of interactions with the environment must be provided, depending on the complexity of the task. For this reason, we focus on a model-free approach, as opposed to most traditional GPS methods. We present a new approach to solving continuous control POMDP problems by combining a model-free guided framework with SAC.
Our guided SAC, like that of [73], has two phases: training and testing. In the training phase, the guided-SAC architecture consists of two actors and a critic. To generate the samples stored in the replay buffer D, the two actors interact with the same environment iteratively. Figure 2 shows the guided-SAC architecture and the interaction loop of the guided SAC in the training phase. The final control actor chooses actions based on partial observations, while the guiding actor chooses actions by perceiving the full-state observations. The critic's job is to assess the policy and, as a result, update the action-value function, a measure of the total discounted reward, using data gathered through environmental observations. Targets for the Q-functions are computed using

y(r, s′, d) = r + γ(1 − d) ( min_{i=1,2} Q_{ϕ_targ,i}(s′, ã′) − α log π_θ(ã′|s′) ),  ã′ ∼ π_θ(·|s′)

and the Q-functions are updated by one step of gradient descent using

∇_{ϕ_i} (1/|B|) Σ_{(s,a,r,s′,d)∈B} ( Q_{ϕ_i}(s, a) − y(r, s′, d) )²  for i = 1, 2.

Then, the actors are fed the estimated discounted reward. The policy of each actor is updated by one step of gradient ascent using

∇_θ (1/|B|) Σ_{s∈B} ( min_{i=1,2} Q_{ϕ_i}(s, ã_θ(s)) − α log π_θ( ã_θ(s) | s ) )

where ã_θ(s) is a sample from π_θ(·|s) that is differentiable w.r.t. θ via the reparametrization trick.
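The clipped double-Q target above can be sketched with scalar Q-values as follows. This is an illustrative sketch; the function and variable names are ours, and in practice the quantities are batched network outputs.

```python
def sac_target(r, q1_targ, q2_targ, logp_next, done, gamma=0.99, alpha=0.2):
    """Clipped double-Q target
    y(r, s', d) = r + gamma * (1 - d) * (min_i Q_targ,i(s', a') - alpha * log pi(a'|s')),
    where a' is sampled from the current policy at s'."""
    soft_value = min(q1_targ, q2_targ) - alpha * logp_next
    return r + gamma * (1.0 - float(done)) * soft_value

# Non-terminal transition: both target Q-values and the next action's
# log-probability enter the backup.
y = sac_target(r=1.0, q1_targ=5.0, q2_targ=4.0, logp_next=-1.0, done=False)
# Terminal transition: only the immediate reward remains.
y_done = sac_target(r=1.0, q1_targ=5.0, q2_targ=4.0, logp_next=-1.0, done=True)
```

Taking the minimum of the two target Q-values counteracts overestimation bias, and the `- alpha * logp_next` term adds the entropy bonus to the backup.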
In turn, the actor chooses an action based on the current state and interacts directly with the environment using the estimated action-value function. For each update, the guiding policy is trained with the history of observation-action pairs along with the current state from the replay buffer, while the control actor updates its policy with only the history of observation-action pairs from the replay buffer. Both are trained to maximize the expected reward while also maximizing entropy, as in the original concept of SAC [17]. Through α, known as the entropy temperature, the agent selects random actions to explore the environment. This parameter is adjusted automatically with gradient descent to guarantee adequate exploration at the beginning of learning and, thereafter, a greater emphasis on maximizing the predicted reward. For the policy improvement step, the policy distribution is updated toward the softmax distribution of the current Q-function by decreasing the Kullback-Leibler divergence [75]. In our approach, to use samples generated from different policies, the distributions of the two policies should be similar. We use the adjustable entropy temperature α because it improves the performance and stability of the method [75] and helps guarantee that the two policies converge to the same policy.
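The automatic temperature adjustment can be sketched as a gradient step on α toward a target entropy. The paper does not spell out this update, so the following is an assumption-laden sketch of the standard scheme: the names, the learning rate, and the target-entropy value are illustrative, and optimizing log α (rather than α) keeps the temperature positive.

```python
import math

def alpha_step(log_alpha, logp_actions, target_entropy, lr=3e-4):
    """One gradient-descent step on the temperature objective
    J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)],
    taken with respect to log_alpha (sketched up to a positive factor)."""
    avg_logp = sum(logp_actions) / len(logp_actions)
    grad = -(avg_logp + target_entropy)  # dJ/d(log_alpha), up to an alpha factor
    return log_alpha - lr * grad

# Here the policy entropy (~0.75 nats) exceeds the target (-3.0), so the
# step decreases alpha, shifting emphasis toward reward maximization.
log_alpha = math.log(0.2)
new_log_alpha = alpha_step(log_alpha, logp_actions=[-0.5, -1.0], target_entropy=-3.0)
```

Conversely, when the policy becomes too deterministic (log-probabilities high, entropy below target), the same step raises α and restores exploration pressure.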
The target networks are updated with

ϕ_targ,i ← ρ ϕ_targ,i + (1 − ρ) ϕ_i  for i = 1, 2  (16)

We also train a DNN using the data stored in the replay buffer to teach the control policy actor to act like the guiding policy actor. With this iterative interaction approach, the same trajectory is generated by both the guiding and control policies.
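Eq. 16 is a Polyak averaging step over the target-network parameters. As a sketch over flat parameter lists (the function name and the example values are ours):

```python
def polyak_update(targ_params, params, rho=0.995):
    """phi_targ <- rho * phi_targ + (1 - rho) * phi, applied elementwise,
    so the target network trails the online network slowly."""
    return [rho * t + (1.0 - rho) * p for t, p in zip(targ_params, params)]

targ = [0.0, 1.0]
online = [1.0, 1.0]
targ = polyak_update(targ, online, rho=0.9)  # moves 10% toward the online net
```

A ρ close to 1 keeps the Q-targets in Eq. 16 slowly moving, which stabilizes the bootstrapped updates.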
Algorithm 1 shows the guided SAC approach. The numbers indicating the order of steps in Figure 2 are referenced in Algorithm 1.

V. EXPERIMENTS
We investigated the following research question during the experiments: Can our proposed guided deep reinforcement learning approach outperform the original DRL baseline algorithms on partially observable robotic tasks constructed based on continuous control problems? We selected the stateof-the-art DRL baseline algorithms TD3 [40] and SAC [17] to compare the performance of our proposed guided SAC approach.
Deep reinforcement learning requires a large number of trial-and-error episodes to learn an optimal policy. Simulators are needed to accomplish this in a cost-effective and timely manner because they allow these episodes to occur in the digital world. As an example, consider teaching a robot to walk by letting a real, physical robot attempt and fail 100,000 times until it can walk reliably and consistently. To avoid causing damage in the real world, it is preferable to first train in a simulation platform and then transfer the policy to the real world. We ran experiments in five different MuJoCo [19] environments that provide continuous control tasks of increasing difficulty and action dimensionality.

A. ENVIRONMENTS AND TASKS
In this study, we compared our guided SAC approach with the state-of-the-art baseline algorithms (SAC and TD3) on five partially observable tasks (HalfCheetah-v2, Hopper-v2, InvertedDoublePendulum-v2, Walker2d-v2, and Swimmer-v2) that we constructed based on high-dimensional, continuous control problems in MuJoCo in the OpenAI Gym framework (Figure 4). Typical observations for these tasks include positions, angles, velocities, or forces.

FIGURE 4. POMDP MUJOCO TASKS -A) HALFCHEETAH-V2 B) HOPPER-V2 C) INVERTEDDOUBLEPENDULUM-V2 D) WALKER2D-V2 E) SWIMMER-V2
Because the original tasks have a fully observable state space, we modified them to simulate the partial observability of real applications. The modification details of the MuJoCo tasks for the POMDP setting are given in Appendix A.

B. EXPERIMENTAL SETTINGS
To compare our results, the original implementations of SAC and TD3, as well as the OpenAI baselines, were used. We trained the algorithms for a total of 1 million time steps on the MuJoCo tasks. One core of an Intel® Xeon® Gold 6148 processor and 16 GB of RAM were used for each job, which took about 12 hours per task. All trials were carried out with ten random seeds each. Performance was assessed by running the mean policy without action noise for ten trajectories once per 10,000 steps, and the average return over the ten trajectories was reported. The results of our experiments were plotted as learning curves of the average cumulative reward over the last 10 episodes. Feedforward neural networks are used to represent the policies and/or value functions in all evaluated techniques. The SAC and TD3 algorithms use two hidden neural network layers of size (256, 256) with rectified linear units (ReLU) as the activation function [40], [17]. For the baseline algorithms, the default hyperparameters from OpenAI SpinningUp [68] were used. The hyperparameters of our proposed algorithm were set empirically to those of SAC, and the network architectures of the guided SAC were selected to have the same number of parameters as the networks in SAC (see Appendix A). The observation history has a fixed length of four, which means that the input of the RL algorithm includes the current observation as well as the four previous observations.
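The fixed-length observation history can be sketched as a simple deque-based stacker. This is our own illustration of the mechanism, not the paper's implementation; the class name, the zero-padding at episode start, and the flattening order are assumptions.

```python
from collections import deque

class HistoryStacker:
    """Keeps the last k observations and concatenates them into the input
    vector fed to the policy; frames are zero-padded after a reset."""
    def __init__(self, obs_dim, k=4):
        self.obs_dim, self.k = obs_dim, k
        self.reset()

    def reset(self):
        # Start each episode with k all-zero frames.
        self.buf = deque([[0.0] * self.obs_dim for _ in range(self.k)],
                         maxlen=self.k)

    def push(self, obs):
        self.buf.append(list(obs))
        # Flatten oldest-to-newest into one input vector.
        return [x for frame in self.buf for x in frame]

stacker = HistoryStacker(obs_dim=2, k=3)
vec = stacker.push([0.5, -0.5])  # two zero frames + the current observation
```

Feeding such a stacked vector lets a feedforward policy condition on a short window of the past, which is how partial observability is mitigated without recurrent layers.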

C. RESULTS
To compare our guided SAC approach with SAC and TD3, we ran all tasks under both full observability and partial observability. Figure 5 shows the learning curves for the sampled tasks, with the first column representing fully observable performance and the second column representing the POMDP results. Tables 2 and 3 summarize the results. Guided SAC performed best on four of the five POMDP MuJoCo tasks, outperforming the state-of-the-art DRL baseline algorithms by 7∼20% on each of them. Only on the POMDP InvertedDoublePendulum-v2 task did TD3 perform better than guided SAC. In general, guided SAC achieves significantly better performance on difficult POMDP tasks than the baselines.

VI. CONCLUSION
To summarize our contribution, we propose a novel approach called guided SAC, which combines a model-free guided framework with SAC for POMDPs. We proposed a new agent-environment interaction method for generating training samples in which the guiding and control actors interact with the same environment iteratively. The control actor selects actions based on partial observations, whereas the guiding actor can take actions by perceiving full-state observations. We evaluated our approach using modified, partially observable MuJoCo tasks. The proposed guided SAC was compared to the state-of-the-art DRL baseline algorithms (SAC and TD3) on the POMDP versions of the sampled continuous control tasks in MuJoCo. We learned the optimal policy successfully and obtained good empirical results, showing that guided SAC outperforms the baselines on POMDPs. Our guided soft actor-critic approach outperformed the state-of-the-art DRL baseline algorithms, gaining 7∼20% more maximum average return on four out of five partially observable tasks.
In future studies, we aim to apply our proposed algorithm to autonomous driving for partially observable environments, such as pedestrian collision avoidance problems.

A. HYPERPARAMETERS
The implementation of the algorithms was based on OpenAI Spinningup [18].

B. ENVIRONMENTS AND TASKS DETAILS
To create partially observable environments, we modified the MuJoCo tasks listed above by deactivating parts of the observation: a number of dimensions of the observation vector were selected and overwritten with zeros. We chose dimensions whose deactivation had a noticeable impact on overall performance (Table 5).
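The masking described above can be sketched as a small wrapper-style function that zeroes the chosen indices of the observation vector. The function name and the example indices are illustrative, not the actual dimensions listed in Table 5.

```python
def mask_observation(obs, hidden_dims):
    """Simulate partial observability by overwriting the selected
    dimensions of the observation vector with zeros."""
    return [0.0 if i in hidden_dims else x for i, x in enumerate(obs)]

full_obs = [0.3, -1.2, 0.8, 2.1]
# Hypothetically hide dimensions 1 and 3 (e.g., velocity components).
partial_obs = mask_observation(full_obs, hidden_dims={1, 3})
```

In practice this would be applied inside a Gym-style observation wrapper so that the agent only ever sees the masked vector while the simulator keeps the full state.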