Enhanced Probabilistic Inference Algorithm Using Probabilistic Neural Networks for Learning Control

Among model-based methods of reinforcement learning (RL), the probabilistic inference for learning control (PILCO) algorithm, which relies on Gaussian processes (GPs) to build probabilistic dynamics models, has attracted attention because it requires only a small amount of data and can learn from scratch in a few trials by explicitly incorporating model uncertainty into long-term planning. However, the time complexity of the GP, which is cubic with respect to the number of trials, makes it challenging to scale the framework to high-dimensional observation spaces. Moreover, the cost function of a task is restricted to a locally quadratic function so that the policy gradient can be calculated analytically. To solve these problems, we propose a probabilistic neural network (PNN) to replace the GP in building probabilistic dynamics models and develop a deterministic control policy by using long-term predictions. In particular, we determine model uncertainty through basic prior knowledge and calculate the cumulative cost by sampling from state distributions. This approach helps reduce both the influence of model error and the time consumption. Compared with state-of-the-art model-based RL, the proposed approach can reconcile data efficiency and speed of learning even in high-dimensional observation spaces.


I. INTRODUCTION
The use of reinforcement learning (RL) for robot control has been gaining popularity over the last few years [1], [2] in a wide spectrum of robotics applications [3], [4]. Approaches to RL can be divided into two main classes: model-free (also known as direct) and model-based (also known as indirect) methods. The main difference between them lies in whether a model of the interactions between the robot and the environment is employed. In model-free methods, e.g., Q-learning [5] and TD-learning [6], there is no model, and thus the rewards and optimal actions are derived by a trial-and-error approach using a physical system. Although this approach has attracted significant interest from scientists and engineers in the field of robot control, sampling trajectories to derive the optimal policy becomes a disadvantage when applied to robots [7]. The model-based approach is thus preferable.
The associate editor coordinating the review of this manuscript and approving it for publication was Huaping Liu.
Model-based methods feature models of transition dynamics that are used to derive the rewards and optimal actions. This characteristic significantly reduces physical interactions between the robot and its environment, and results in significantly less mechanical wear. It can help discover better policies with fewer trials because the dynamics model can generalize knowledge of the system to unobserved states [8]. However, the main disadvantage of model-based learning is that it suffers severely from model errors, especially in long-term predictions. In practice, most recent work still relies on a large amount of data for the unsupervised training of autoencoders [9] to build a control policy. Although available task-specific prior knowledge can reduce model errors and increase data efficiency in RL [10], a dramatic improvement without prior knowledge can be achieved with probabilistic inference for learning control (PILCO) [11]. PILCO uses a Gaussian process (GP) to build a probabilistic dynamics model and propagates uncertain state distributions analytically to realize high data efficiency on low-dimensional control tasks. This enables the probabilistic dynamics model to consider the long-term consequences (cumulative cost) of the parameterization of a given control policy [12]. Thus, the key to achieving high data efficiency is building a probabilistic dynamics model [13]-[16], because such a model incorporates uncertainty throughout planning and prediction.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Current probabilistic dynamics models rely on GPs to propagate state distributions through time. However, the time complexity of the GP is cubic with respect to the number of trials, which makes it challenging to scale the framework to high-dimensional observation spaces. Furthermore, the cost function of a task is restricted to a locally quadratic function because the policy gradient needs to be solved analytically. In this paper, we solve these problems by replacing the GP with a neural network while maintaining the framework's probabilistic nature and its data efficiency-related benefits. To this end, we need to incorporate model uncertainty into the long-term planning for model prediction and policy evaluation. Therefore, we propose a probabilistic neural network (PNN) that expresses predictive uncertainty by using basic prior knowledge to build a probabilistic dynamics model. Prior knowledge here is simply a measure of uncertainty, not manual input of knowledge to the model or policy. Policy gradients are obtained by sampling from state distributions; thus, the cost function of a task is no longer limited to a locally quadratic function. We call the proposed approach APILCO because it is an approximate scheme based on the traditional PILCO. It uses deterministic inference for long-term prediction and policy evaluation. The framework is similar to that of PILCO but reduces computation time by using a PNN instead of a GP, and handles arbitrary cost functions by sampling from state distributions.
The paper is organized as follows: after discussing related work in Section II, we describe the key ideas of the proposed APILCO learning framework in Section III, i.e., the probabilistic dynamics model, the approximate cumulative cost evaluation, and the approximate gradient-based policy improvement. Experimental results are provided in Section IV, where we discuss the key properties, limitations, and future extensions of APILCO. Conclusions are drawn in Section V.

II. RELATED WORK
Ways of treating parameter uncertainty in a control system have long been a key subject of research in robust and adaptive control [17]. A certainty equivalence approach is usually applied, which takes the estimated values of model parameters as true values without uncertainty [18]. PILCO, a model-based RL algorithm, builds its policy by treating model uncertainties as noise [19]. It explicitly considers uncertainties when training a control policy, achieving unprecedented data efficiency on several control benchmarks [20]. In this section, we outline PILCO and analyze the key areas where it is effective.
Algorithm 1 APILCO
1: Init: Sample policy parameters θ and probabilistic
2: neural network parameters.
3: Apply random control signals and record data.
4: Repeat.
The parameters θ of a deterministic policy π are initialized randomly; PILCO then executes the policy from initial states (sampled from an initial distribution $p(S_0)$) until the time horizon $T$. During policy execution, the observed transitions are recorded and appended to the training data. The GP dynamics model is then retrained using all transition data. For a given state $s_{t-1}$, PILCO predicts a Gaussian distribution of the next state, $p(s_t \mid s_{t-1}, a_{t-1}) = \mathcal{N}(s_t \mid \tilde{\mu}_t, \tilde{\Sigma}_t)$, which is consistent with the transitions of a Markov model. Accordingly, PILCO can analytically predict subsequent state distributions from an initial state distribution (e.g., from $p(S_0)$ to $p(S_T)$), and the state distributions follow a joint Gaussian assumption $s_t \sim \mathcal{N}(\tilde{\mu}_t, \tilde{\Sigma}_t)$. These distributions are used to evaluate the cumulative cost $J^{\pi}$ of policy π in (1), where both $\mu_t$ and $\Sigma_t$ are functionally dependent on $\tilde{\mu}_t$ and $\tilde{\Sigma}_t$. Hence, PILCO can analytically compute the gradients of the cumulative cost $J^{\pi}$ with respect to the policy parameters θ. The cumulative cost $J^{\pi}$ is finally optimized to yield an improvement in policy. PILCO then executes the optimized policy in a new trial to determine whether it is locally optimal given the probabilistic dynamics model. The data efficiency of PILCO can be attributed to its probabilistic dynamics model, which avoids model bias. Model bias arises from the selection of a single dynamics model $\hat{f}$, which is assumed to be the correct model without uncertainty [15]. While a single dynamics model can provide accurate short-term state predictions (e.g., $p(S_1)$), its long-term predictions (e.g., $p(S_T)$) are inaccurate due to the compounding effects of errors in $T$-step prediction.
Because optimizing the cumulative cost $J^{\pi}$ must balance the perceived cost across distributions, e.g., $p(S_1)$ and $p(S_T)$, the optimization compromises on the cost of $p(S_1)$ based on the perceived cost of $p(S_T)$, even though the prediction $p(S_T)$ is almost completely random noise; i.e., given sufficient model uncertainty, $p(S_T)$ is a wide distribution that is nearly invariant to policy π. PILCO mitigates the effect of model error through a probabilistic dynamics model that can express uncertainty, and integrates model uncertainties into long-term planning and decision-making.

III. APPROXIMATE PROBABILITY INFERENCE
In the following, we detail the key components of APILCO: the probabilistic dynamics model, the approximate cumulative cost evaluation, and the approximate gradient-based policy improvement.

A. PROBABILISTIC DYNAMICS MODEL LEARNING
We do not use a GP to build the probabilistic dynamics model (PDM); instead, we propose a PNN to estimate the variance of the state distributions on which we condition our inference. Because the PNN is a parametric model, its parameter set imposes a fixed structure on the prediction. The number of parameters is determined in advance, and is independent of the number of observations and the sample size. The PNN makes one-step predictions as a Gaussian distribution that is established by using the predicted values as means and the uncertainties as variances.
We assume the tasks are within a finite state space and normalize the states before training. The input to the PDM is the state-action pair $\tilde{x}_{t-1} = [s_{t-1}, a_{t-1}]$. The PDM performs one-step prediction to yield the mean $\tilde{\mu}_t$ and the appended output $sa_t = [sc_t, ss_t]$, which is used to calculate the variance $\tilde{\Sigma}_t$ of the predicted Gaussian distribution.
The training targets of $\tilde{\mu}_t$ and $sa_t$ are $s_t^*$ and $sa_t^*$, respectively, where α is a hyper-parameter constraining $sa_t$ to $[-\alpha, \alpha]$; this bound is enforced by using tanh as the activation function of the last layer of the PDM.
The training target of the PDM is $y_t^* = [s_t^*, sa_t^*] \in \mathbb{R}^{3D}$. From the trigonometric functions, we obtain basic prior knowledge about $sa_t$. The entire training set is $\mathcal{D} = \{X := [\tilde{x}_0, \ldots, \tilde{x}_{n-1}],\ Y := [y_1^*, \ldots, y_n^*]\}$, on which the PDM incurs training errors. In addition, we define an evaluation of uncertainty, $\mathrm{Eva}(sa_t)$. Our prediction of the variance $\tilde{\Sigma}_t$ is based on a higher-level assumption that there is a correlation between the precision of prediction and the difference between $f(X)$ and $Y$. Therefore, there is also a correlation between predictive uncertainties and $\mathrm{Eva}(sa_t)$. We express this uncertainty by the variance $\tilde{\Sigma}_t$ of the Gaussian distribution in (6), where $e$ and β are hyper-parameters. Thus, the Gaussian distribution can be obtained as $s_t \sim \mathcal{N}(\tilde{\mu}_t, \tilde{\Sigma}_t)$. Unlike the GP model, which becomes ineffective in high-dimensional spaces, the PNN is good at processing high-dimensional data, and this is a significant advantage for PILCO. Finally, we tested the effectiveness of our prediction and evaluated the uncertainty of the PNN and the GP with small sample data. As shown in Fig. 1, the PNN achieved almost the same results as the GP, even though the PNN is a parametric model.
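The idea of deriving uncertainty from a trigonometric consistency check can be illustrated with a minimal sketch. The equations for the auxiliary targets, for Eva, and for the variance in (6) are not recoverable from this text, so every formula here is an assumption: the targets are taken to be `alpha*cos(s)` and `alpha*sin(s)`, Eva is taken to be the deviation from the identity `sc^2 + ss^2 = alpha^2`, and `variance_from_eva` is a purely hypothetical monotone map that only reuses the hyper-parameter names `e` and `beta`.

```python
import numpy as np

ALPHA = 0.5  # hypothetical value of the hyper-parameter alpha


def auxiliary_targets(s_norm, alpha=ALPHA):
    """Assumed targets [sc*, ss*] = [alpha*cos(s), alpha*sin(s)], each in [-alpha, alpha]."""
    return np.concatenate([alpha * np.cos(s_norm), alpha * np.sin(s_norm)], axis=-1)


def eva(sa, alpha=ALPHA):
    """Assumed uncertainty evaluation: deviation from sc^2 + ss^2 = alpha^2 per dimension."""
    d = sa.shape[-1] // 2
    sc, ss = sa[..., :d], sa[..., d:]
    return np.abs(sc**2 + ss**2 - alpha**2)


def variance_from_eva(eva_value, e=1e-4, beta=1.0):
    """Hypothetical monotone map from Eva to a variance; NOT the paper's equation (6)."""
    return e + beta * eva_value


s = np.array([0.3, -1.2])                 # normalized D = 2 state
consistent = auxiliary_targets(s)         # satisfies the identity exactly
inconsistent = consistent + 0.1           # mimics an inaccurate network output
print(eva(consistent), eva(inconsistent))
```

Under these assumptions, an output that violates the identity receives a larger Eva and therefore a larger predicted variance, which matches the correlation between prediction precision and uncertainty that the text assumes.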

B. POLICY EVALUATION
Minimizing and evaluating $J^{\pi}$ in (1) requires long-term predictions of the evolution of the state. The one-step predictions of the PDM are obtained as shown in (2) and (6); analytical solutions of $f(x)$ form the premise for obtaining those of the state distributions at each step, $[p(S_1), \ldots, p(S_T)]$.
To predict $s_t$ from $p(S_{t-1})$, we require a distribution $p(s_{t-1}, a_{t-1})$; because the policy π is deterministic, $a_{t-1} = \pi(s_{t-1}, \theta)$. Although the one-step predictions $s_t$ follow a Gaussian distribution, the prediction process of the PDM is a black box. Therefore, we compute the cumulative cost per step by sampling from the state distributions. At time $t$, the mean and variance of the output distribution for each sample $i$ are $\tilde{\mu}_t^i$ and $\tilde{\Sigma}_t^i$, respectively, where $i \in [1, \ldots, M]$. Thus, the mean $\mu_t$ of the predicted state distribution can be approximated by the average of the $\tilde{\mu}_t^i$ over the samples.
To compute the covariance matrix $\Sigma_t$ of the state distribution, we need to distinguish diagonal and off-diagonal elements. Using the law of iterated variances, the diagonal elements combine the expected predictive variance $\mathbb{E}[\tilde{\Sigma}_t^i]$ with the variance of the predicted means, where $\tilde{\mu}$ is known from (2) and $\tilde{\Sigma}$ is known from (6). The off-diagonal terms do not contain the additional term $\mathbb{E}[\tilde{\Sigma}_t^i]$ because our approach follows the assumption of conditional independence of GP models: different target dimensions do not covary for a given $\tilde{x}_{t-1}$. In place of the analytical solution, our implementation calculates (10) on a set of $M$ candidate values sampled randomly from the Gaussian distribution $p(S_{t-1})$, which provides suitably diverse samples to approximately solve equations that cannot be solved analytically and thereby obtain $p(S_t) \sim \mathcal{N}(\mu_t, \Sigma_t)$.
Although the state distributions are approximated by samples, this approach has advantages: there are no restrictions on policy π, so a policy of arbitrary complexity can be implemented. The cumulative cost is the sum over time steps of the expected cost $c(s_t)$ under the state distributions. If the integration required for (12) is analytically intractable due to the form of $c(s_t)$, $J^{\pi}$ can be approximated by averaging the cost over the sampled states.
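The sample-based moment matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `moment_match` is a hypothetical helper name, the per-sample predictive variances are assumed to be diagonal (one value per state dimension), and the 1/M normalization is a choice made here.

```python
import numpy as np


def moment_match(mu_tilde, var_tilde):
    """Collapse M one-step Gaussian predictions into one Gaussian N(mu, sigma).

    mu_tilde:  (M, D) predicted means, one row per sample
    var_tilde: (M, D) predicted per-dimension variances, one row per sample
    """
    mu = mu_tilde.mean(axis=0)
    centered = mu_tilde - mu
    # Covariance of the predicted means: the only contribution off the diagonal.
    sigma = centered.T @ centered / mu_tilde.shape[0]
    # Law of iterated variances: the expected predictive variance is added
    # to the diagonal elements only.
    sigma[np.diag_indices_from(sigma)] += var_tilde.mean(axis=0)
    return mu, sigma


mu_tilde = np.array([[0.0, 0.0], [2.0, 2.0]])
var_tilde = np.array([[1.0, 1.0], [1.0, 3.0]])
mu, sigma = moment_match(mu_tilde, var_tilde)
print(mu)     # [1. 1.]
print(sigma)  # [[2. 1.] [1. 3.]]
```

In this two-sample example the off-diagonal entry (1.0) comes only from the spread of the predicted means, while each diagonal entry also includes the average predictive variance.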

C. GRADIENTS FOR POLICY IMPROVEMENT
Both $\mu_t$ and $\Sigma_t$ are functionally dependent on the mean $\tilde{\mu}_t^i$ and covariance $\tilde{\Sigma}_t^i$ of the predictions for the samples, which are drawn randomly from $\mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$. Although the analytical solution of $J^{\pi}$ cannot be obtained as in PILCO, we can still approximate the gradients of the cumulative cost $J^{\pi}$ with respect to policy π. Above all, the computation of the cumulative cost $J^{\pi}$ forms a recurrent neural network composed of policy π and the PDM, so optimization methods for neural networks can be used for policy learning.
The sampling approach approximates the analytical solution and simplifies the calculation of distributed costs in each step.
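A forward pass of this sampling scheme might look like the sketch below. It only evaluates the cumulative cost and computes no gradients; the function names, the linear `toy_pdm`, the quadratic `toy_cost`, and both policies are hypothetical stand-ins chosen so that the example runs, and summing the cost over steps 0 to the horizon is an assumption.

```python
import numpy as np


def propagate(mu_tilde, var_tilde):
    """Collapse M one-step predictions into a single Gaussian belief."""
    mu = mu_tilde.mean(axis=0)
    centered = mu_tilde - mu
    sigma = centered.T @ centered / mu_tilde.shape[0]
    sigma[np.diag_indices_from(sigma)] += var_tilde.mean(axis=0)
    return mu, sigma


def cumulative_cost(mu0, sigma0, policy, pdm, cost, horizon, n_samples, seed=0):
    """Sampling-based estimate of the cumulative cost over steps 0..horizon."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.asarray(mu0, float), np.asarray(sigma0, float)
    total = 0.0
    for t in range(horizon + 1):
        s = rng.multivariate_normal(mu, sigma, size=n_samples)  # (M, D) samples
        total += float(cost(s).mean())    # sampled expected cost at step t
        if t < horizon:
            a = policy(s)                 # deterministic policy
            mu, sigma = propagate(*pdm(s, a))
    return total


# Hypothetical stand-ins, used only to make the sketch executable.
def toy_pdm(s, a):
    return 0.9 * s + 0.1 * a, np.full_like(s, 0.01)  # means, variances


def toy_cost(s):
    return np.sum(s**2, axis=1)           # any cost function can be plugged in


def no_control(s):
    return np.zeros_like(s)


def damping(s):
    return -5.0 * s


mu0, sigma0 = np.array([1.0, 1.0]), 0.01 * np.eye(2)
print(cumulative_cost(mu0, sigma0, no_control, toy_pdm, toy_cost, 10, 500))
print(cumulative_cost(mu0, sigma0, damping, toy_pdm, toy_cost, 10, 500))
```

Because the cost enters only through `cost(s).mean()`, nothing in the loop requires it to have a particular analytical form. A differentiable implementation would express the same loop in an automatic-differentiation framework.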

IV. EXPERIMENT AND DISCUSSION
This section describes the experimental setup. To test the validity of the proposed algorithm, all tasks had continuous states and continuous actions in discrete time. To show intuitively how the algorithms work, we present a detailed analysis of the key properties of APILCO.

A. TASKS FOR COMPARISON
In the following, we evaluate the quality of APILCO for long-term predictions in terms of computational requirements and learning speed. Moreover, we reveal the properties of the PNN. For these assessments, we apply APILCO to nonlinear control tasks and compare it with PILCO.

1) TASK DESCRIPTION
We considered two numerical simulation tasks, double-pendulum swing-up and cart-pole swing-up, to evaluate APILCO in terms of learning speed and quality of approximate inference. More details of the tasks are provided below.
Double-pendulum swing-up task: The double-pendulum system is a two-link robot arm with two actuators, as shown in Fig. 2(a). The state $s$ is given by the angles $\theta_1$ and $\theta_2$ and the corresponding angular velocities $\dot{\theta}_1$ and $\dot{\theta}_2$ of the two links, measured with respect to the vertical. Each link is 1 m in length and weighs 0.5 kg. The torques $u_1$ and $u_2$ are in the range from −3 to 3 Nm. The control signals are changed at intervals of 50 ms. The objective is to learn a control policy that can swing the double pendulum up from an initial distribution $p(s_0)$ around $\mu_0 = [\pi, \pi, 0, 0]^T$ and balance it in the inverted position with $s_d = [0, 0, *, *]$, where $*$ represents an arbitrary velocity because the cost function is independent of the velocities. The prediction period is 1.5 s. The task is challenging because the policy must capture the interplay of two correlated control signals. To solve the double-pendulum swing-up task, we parametrized the preliminary policy as a multilayer neural network.
Cart pole swing-up task: The cart pole system consists of a cart moving along the horizontal and a freely swinging pendulum attached to it, as shown in Fig. 2(b). The state of the system is expressed by the position $x$ of the cart, the velocity $\dot{x}$, the angle θ of the pendulum (measured from the hanging-down position), and the angular velocity $\dot{\theta}$. In our simulation, we set the masses of the cart and the pendulum to 0.700 kg and 0.325 kg, respectively, the length of the pendulum to 0.6 m, and the coefficient of friction between the cart and the ground to 0.1 Ns/m. The prediction period is set to 2.5 s. The control signal is changed at intervals of 100 ms.
A horizontal force $u \in [-10, 10]$ N is applied to the cart. The objective is to learn a control policy that can swing the pendulum up from around $\mu_0 = [x, \dot{x}, \theta, \dot{\theta}]^T$ and balance it in the inverted position while keeping the cart in the middle, i.e., $s_d = [0, *, \pi, *]^T$. A linear policy cannot solve the task [21]; APILCO learns a nonlinear policy with a multilayer neural network.
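As a small worked example, the prediction periods and control intervals stated for the two tasks determine how many one-step predictions each long-term rollout contains. The helper name `prediction_steps` is hypothetical.

```python
def prediction_steps(horizon_s, control_interval_s):
    """Number of discrete control steps that fit in the prediction period."""
    return round(horizon_s / control_interval_s)


# Values taken from the task descriptions above.
print(prediction_steps(1.5, 0.050))  # double pendulum: 30 steps
print(prediction_steps(2.5, 0.100))  # cart pole: 25 steps
```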
For comparison with PILCO, both tasks use the saturating cost function that PILCO requires, (15), with $\sigma_c = 0.5$. The Euclidean distance between the target position and the tip of the pendulum is penalized.
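For orientation, a saturating cost of the kind used in the PILCO literature can be sketched as below. The exact expression of (15) is not recoverable from this text, so the form `1 - exp(-d^2 / (2 * sigma_c^2))` is an assumption; only the width `sigma_c = 0.5` and the use of the Euclidean distance `d` come from the description above.

```python
import numpy as np


def saturating_cost(d, sigma_c=0.5):
    """Assumed saturating cost of a distance d: 0 at the target, approaching 1 far away."""
    d = np.asarray(d, dtype=float)
    return 1.0 - np.exp(-0.5 * d**2 / sigma_c**2)


distances = np.array([0.0, 0.25, 0.5, 1.0, 3.0])
print(saturating_cost(distances))
```

The cost rises from 0 at the target and flattens out near 1, so states far from the target are all penalized roughly equally.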

2) RESULTS AND DISCUSSION
In this section, we compare the results of APILCO and PILCO on the tasks; Fig. 2 shows the results. The PNN used to fit the dynamics model had three hidden layers with 150 units and sigmoid activation functions. The control policy was a three-layer neural network with 150 units, and the weight decay of the NN dynamics model was set to $2 \times 10^{-3}$. We used a "replay buffer" of finite size (the most recent 10 episodes) and discarded older episodes of data.
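A fixed-size buffer that keeps only the most recent episodes can be sketched with the standard library. The episode dictionaries and the helper name `make_replay_buffer` are illustrative assumptions, not the authors' data structures.

```python
from collections import deque


def make_replay_buffer(max_episodes=10):
    """Buffer that silently drops the oldest episode once it is full."""
    return deque(maxlen=max_episodes)


replay_buffer = make_replay_buffer()
for episode_id in range(13):
    # Each entry stands in for the transitions recorded during one episode.
    replay_buffer.append({"episode": episode_id, "transitions": []})

print(len(replay_buffer))           # 10
print(replay_buffer[0]["episode"])  # 3: episodes 0-2 were discarded
```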
We alternated between fitting a dynamics model and optimizing the control policy. For each method, we first generated a single random episode before iterating over the main loop for 50 episodes. In each iteration, a single episode of data was acquired by executing the task for the prediction horizon and recording the resulting transitions. We evaluated each method in each iteration by calculating the total cost and time consumption. Both PILCO and APILCO used the same physical simulator and cost function settings. Fig. 3 shows the total costs of PILCO and APILCO. In both tasks, PILCO converged after 7-8 episodes, whereas APILCO required 10-20 episodes before reaching a cost slightly lower than that incurred by PILCO. Although APILCO was not as data efficient as PILCO, it was considerably more efficient in terms of time consumption, as shown in Fig. 4.
We measured the running time of both APILCO and PILCO, with both algorithms running on the CPU. The time consumption of APILCO for 50 episodes was only 16 minutes, whereas PILCO took 250 minutes for 50 episodes on the same computer.
The time complexity of the simulation (with gradient information) was technically constant. Our experiments used a dataset of size $N$, input dimensionality $X$, and action dimensionality $U$. Because the dynamics model was trained only on the 10 most recent episodes (so that the PNN would assign a larger weight to newer information, which is more likely to be useful for training a control policy than older data), we used $P$ samples to represent the distributions.
Because the "replay buffer" was limited to the ten most recent trials, the time complexity of APILCO was constrained to $O(PN(X + U))$, while PILCO had a time complexity of $O(N^2 X^2 (X + U)^2)$ [22]. As more data were collected, PILCO took more time for each trial. The minimum time PILCO needed to converge exceeded the time APILCO required for both tasks. Consequently, PILCO is unsuited to tasks requiring a large number of episodes or those with high-dimensional states.
Fig. 5 shows the total cost and cumulative cost of APILCO per trial, where the cumulative cost in the PDM was slightly higher than the total cost in a dynamics model. This is because the PDM slightly overestimated the variance of the state distributions. Fig. 6 shows a typical example of the position of the cart and the angle of the pendulum. In the early episodes of learning, the PDM was not ideal for the long-term prediction of state distributions, and the distributions of the states were multimodal. APILCO dealt with this inappropriate modeling by learning a control policy that forced the states into a unimodal distribution such that a Gaussian approximation was appropriate, as shown in Figs. 6(c) and (d).
We explain this behavior as follows: Assuming that APILCO found different paths that led to a target, a wide Gaussian distribution was required to capture all possible paths.
However, when computing the expected cost, uncertainty in the predicted state led to higher expected cost, assuming that the mean was close to that of the target. Therefore, APILCO optimized the control policy to push the marginally multimodal trajectory distribution into a single mode.
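The effect described here can be checked numerically with a small sketch. It assumes a one-dimensional distance to the target and the saturating form `1 - exp(-d^2 / (2 * sigma_c^2))`, which is an assumed shape rather than the paper's exact cost; `expected_cost` is a hypothetical helper.

```python
import numpy as np


def expected_cost(mean, std, sigma_c=0.5, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the expected saturating cost for d ~ N(mean, std^2)."""
    rng = np.random.default_rng(seed)
    d = rng.normal(mean, std, size=n_samples)  # signed distance to the target
    return float(np.mean(1.0 - np.exp(-0.5 * d**2 / sigma_c**2)))


narrow = expected_cost(mean=0.0, std=0.1)  # confident prediction at the target
wide = expected_cost(mean=0.0, std=1.0)    # uncertain prediction, same mean
print(narrow, wide)
```

With the mean held at the target, the wider distribution incurs a clearly higher expected cost (about 0.55 versus about 0.02 under these assumptions), so a cost-minimizing policy is pushed toward narrow, unimodal state distributions.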

B. SCALING TO HIGHER DIMENSIONS OR/AND ARBITRARY COST FUNCTION TASKS
To calculate the analytical gradient, PILCO needs to use the saturating cost function in (15). However, the cost function directly affects policy search, and is usually determined by the task rather than the algorithm [23]. APILCO calculates the cumulative cost by sampling from state distributions, as in (14). Therefore, APILCO can deal with tasks that have an arbitrary cost function. For these assessments, we applied it to three nonlinear control tasks.

1) TASK DESCRIPTION
We considered three simulated tasks (cart pole swing up, two-DOF manipulator location task, and seven-DOF manipulator location task). The cost functions were determined by tasks rather than algorithms.
The state of the cart pole system was described by the position $x$, the velocity $\dot{x}$, the complex representation $[\cos(\theta), \sin(\theta)]$ of the angle θ of the pendulum (measured from the hanging-down position), and the angular velocity $\dot{\theta}$. The objective was to learn a control policy to swing the pendulum up from around $\mu_0 = [x, \dot{x}, \theta, \dot{\theta}]^T$ and balance it in the inverted position in the middle of the track, i.e., around $s_d = [0, *, \pi, *]^T$. The simulation model is given in Fig. 7(a). We chose the squared Euclidean distance between the target position (pendulum upright in the middle of the track) and the tip of the pendulum as the cost function.
The two-DOF manipulator consisted of two links, and the state $s$ was given by the complex representation of the angles, $[\cos(\theta_1), \sin(\theta_1), \cos(\theta_2), \sin(\theta_2)]$, as well as the corresponding angular velocities $\dot{\theta}_1$ and $\dot{\theta}_2$ of the inner and outer links, respectively. The control signals were changed every 100 ms. The objective was to learn a control policy that forced the two links from an initial distribution $p(s_0)$ of approximately $\mu_0 = [\theta_1^0, \theta_2^0, 0, 0]^T$ to approximately the target state $s_d$ (Fig. 7(b)). We chose the cost as a function of the Euclidean distance and the rotating velocity, where $d$ is the Euclidean distance between the target position and the end position of the manipulator, and $\alpha = [\alpha_1, \alpha_2]^T$ is a constant weight vector.
The seven-DOF manipulator consisted of seven links, as shown in Fig. 7(c). The state $s$ was given by the complex representation of the angles $[\cos(\theta), \sin(\theta)]$, $\theta = [\theta_1, \theta_2, \ldots, \theta_7]$, and the corresponding angular velocities $\dot{\theta}_i$ of the links. The control signals were changed every 20 ms. The objective was to learn a control policy that forced the links from an initial distribution $p(x_0)$ and kept the end of the manipulator in the target state $s_d = [\theta_d, \dot{\theta}_d]$ (where $\theta_d$ is the target angle and $\dot{\theta}_d$ is the target velocity, set to zero). The prediction horizon was 2 s. We chose a cost function in which $\beta = [\beta_1, \beta_2, \ldots, \beta_7]^T$ is a weight vector and $a$ is the torque applied to each link.
Note that the angles in these tasks used the complex representation $[\cos(\theta_i), \sin(\theta_i)]$, whose components are correlated variables. However, the state variables were assumed to be independent random variables in the PDM. The influence of this is discussed later.

2) EVALUATION OF KEY PROPERTIES
In this part, we discuss the key properties of the approximate inference method: the influence of the independence assumption on the state distribution and the influence of the initial distribution on APILCO.
We compared the total cost and cumulative cost per trial in Fig. 8. APILCO maintained excellent performance on tasks with arbitrary cost functions, which could not be implemented by PILCO due to the limitations of its cost function. Although we could not directly compare the cumulative cost obtained by the analytical solution in (13) with that obtained by the sampling approximation in (14), the learning success on the tasks indicated that sampling was effective in approximating the analytical cumulative cost.
It was assumed that the state variables of the PNN were independent in the three tasks mentioned above. Although the angles were given by complex representations (sine and cosine functions), which violated the assumption of independent variables, APILCO was still effective for these tasks.
If the predicted variables became inconsistent with the correlation imposed by the complex representation, a difference between the predicted and real states would arise and produce a larger variance in the distributions of the predicted states. A larger variance is incompatible with a low cumulative cost. Because the aim of improving the control policy was to reduce the cumulative cost, the optimization forced the variables of the predicted states to remain correlated.
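The tension between the complex representation and the independence assumption can be illustrated numerically. In this sketch (all names hypothetical, not the authors' code), angles are drawn from a Gaussian, mapped to `[cos, sin]`, and then replaced by an independent (diagonal) Gaussian with the same per-component mean and variance; the function reports how far the independent samples drift from the constraint `cos^2 + sin^2 = 1`.

```python
import numpy as np


def unit_circle_violation(theta_mean, theta_std, n_samples=50_000, seed=0):
    """Mean |c^2 + s^2 - 1| when (cos, sin) are modeled as independent Gaussians."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(theta_mean, theta_std, size=n_samples)
    feats = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # exactly on the circle
    mu, var = feats.mean(axis=0), feats.var(axis=0)
    # Independent per-dimension Gaussian fitted to the two features.
    indep = rng.normal(mu, np.sqrt(var), size=(n_samples, 2))
    return float(np.mean(np.abs(np.sum(indep**2, axis=1) - 1.0)))


print(unit_circle_violation(0.0, 0.05))  # narrow angle distribution: small violation
print(unit_circle_violation(0.0, 1.0))   # wide angle distribution: large violation
```

When the angle distribution is narrow, the independent model stays close to the unit circle, which is consistent with the observation that reducing the predicted variance keeps the variables effectively correlated.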
As shown in Fig. 8, in tasks (a) and (c), the correlation among the variables had no negative effect. In task (b), although it had no effect on the total cost, it rendered the cumulative cost unstable. The difference between task (b) and the others was that the initial state $s_0$ was randomly chosen as the mean of the initial distribution used to improve the control policy. We chose the 58th episode of control policy learning as an example to illustrate the phenomenon. The cumulative cost of the episode was overestimated, as shown in Fig. 8(b).
The overestimation of the cumulative cost can be explained intuitively. In the trial, the variables of subsequent states were always correctly correlated even if the initial state was unobserved. When the subsequent states fitted the predictions, the control policy could successfully complete the task, as shown in Fig. 9. Nevertheless, in long-term predictions, an unobserved state might have made the variables of subsequent states inconsistent with the correlation; a wide Gaussian distribution was then required to capture the variability of the state distribution, and this distribution was almost invariant with regard to policy π.

V. CONCLUSION
To overcome the problems in the current PILCO algorithm, we replaced the GP with a PNN in building probabilistic dynamics models and developed a deterministic control policy through long-term predictions. We determined model uncertainty through basic prior knowledge and calculated the cumulative cost by sampling from state distributions. Simulation experiments demonstrated that the proposed approach could extend the probabilistic inference framework to high-dimensional observation spaces. It achieved a high learning speed with data efficiency similar to that of PILCO, and did not require human input, informative initializations, or other kinds of prior knowledge. In addition, the PNN avoided the cubic time complexity of the GP and overcame the restriction on the cost function. The results showed that the PNN was useful for learning probabilistic dynamics models and formulating a control policy.