Importance-Weighted Variational Inference Model Estimation for Offline Bayesian Model-Based Reinforcement Learning

This paper proposes a model estimation method in offline Bayesian model-based reinforcement learning (MBRL). Learning a Bayes-adaptive Markov decision process (BAMDP) model using standard variational inference often suffers from poor predictive performance due to covariate shift between offline data and future data distributions. To tackle this problem, this paper applies an importance-weighting technique for covariate shift to variational inference learning of a BAMDP model. Consequently, this paper uses a unified objective function to optimize both model and policy. The unified objective function can be seen as an importance-weighted variational objective function for model training. The unified objective function is also considered as the expected return for policy planning penalized by the model’s error, which is a standard objective function in MBRL. This paper proposes an algorithm optimizing the unified objective function. The proposed algorithm performs better than algorithms using standard variational inference without importance-weighting. Numerical experiments demonstrate the effectiveness of the proposed algorithm.


I. INTRODUCTION
Reinforcement learning (RL) is a promising framework for autonomously learning a policy from interaction data [1]. Online model-free RL methods have succeeded in applications where the data can be obtained easily, such as games [2], [3]. However, such methods are often impractical for applications where the data collection is expensive, such as robotics or healthcare [4], [5]. Data-efficiency is one of the fundamental issues in RL.
There are several approaches for increasing data-efficiency in RL. One is model-based reinforcement learning (MBRL). In MBRL, the agent explicitly learns an environment model and utilizes it to improve a policy [6], [7], [8]. Bayesian MBRL is a subfield of MBRL in which the agent explicitly takes uncertainty about an environment model into account [9], [10]. Based on Bayes-optimal exploration/exploitation tradeoff in Bayesian MBRL, the data-efficiency can be further improved. Offline RL is also a data-efficient RL approach [11]. In offline RL, the agent learns a policy from previously collected data. Meta-RL is another approach for data-efficient RL [12]. In meta-RL, the agent learns a policy from data collected from multiple similar environments, assuming that each environment is drawn from some distribution every episode. Combining these data-efficient RL approaches has also been investigated.
The associate editor coordinating the review of this manuscript and approving it for publication was Tony Thomas.
Motivated by increasing data-efficiency, this paper discusses a Bayesian MBRL approach for offline meta-RL. A standard model in Bayesian MBRL is a Bayes-adaptive Markov decision process (BAMDP) [9], [10]. A task distribution to draw a task instance in meta-RL can be represented as a prior distribution over MDPs in a BAMDP. A BAMDP is also reasonable for offline RL, as its goal is offline optimization of possible trial and error under its environment model and prior distribution. For these reasons, a BAMDP is a promising model for offline meta-RL.
Conventional Bayesian MBRL methods assume that a BAMDP is given in advance, implying that an environment is accurately represented by the likelihood function and the prior distribution specified in the BAMDP. This assumption is valid when using a flexible black-box model to infer from sufficient data from a current environment. However, it often fails to hold when using a structured model with a low-dimensional latent task representation to infer from only a few data from a current environment. With an inaccurate model, Bayesian MBRL may not work in a real environment because the belief update fails [13]. How to obtain an accurate structured BAMDP remains an open question.
Recent meta-MBRL research has discussed learning latent variable models based on the variational inference framework to obtain a latent task representation in meta-RL [14], [15], [16], [17], [18]. A typical approach is to optimize an evidence lower bound, which implicitly assumes that the data distribution does not change. This implicit assumption appears not only in meta-MBRL but also in MBRL, e.g., [8], [19], and [20]. However, in MBRL, the distribution of data previously collected to train a model differs from the distribution of data obtained in the future when applying a policy improved using the learned model. Such a situation is called covariate shift or distribution shift [11].
In the case of online MBRL, the effect of ignoring covariate shift is relatively mild. This is because, in the online setting, the policy is gradually improved and converges, so the difference between the constantly updated data-collecting policy and the improved policy gradually becomes small. Indeed, most of the above-mentioned meta-MBRL methods suppose online learning settings. However, in the case of offline MBRL, the data-collecting policy is no longer updated, so its difference from the improved policy is significant, and thus the effect of ignoring covariate shift is also significant. Prior work [17] addresses another issue that arises in offline meta-MBRL but does not discuss covariate shift.
This paper discusses learning a BAMDP model considering covariate shift. This paper leverages the idea of learning an MDP model considering covariate shift [21]. The main idea of [21] is importance-weighted maximum likelihood estimation, weighted by the ratio of the distributions, to predict future data more accurately when applying an improved policy. The importance-weighted objective is also an estimate of the expected return in an MDP penalized by model error. The algorithm in [21] optimizes the importance-weighted objective with respect to both model and policy. This paper proposes to extend this idea from MDP model learning to BAMDP model learning. The outline of the discussion is similar to [21]. Firstly, this paper presents a unified objective function viewed as an importance-weighted variational objective function for training a model and as the expected return penalized by the model's error for planning a policy. Secondly, this paper proposes an algorithm to optimize it with respect to both model and policy. Both this paper and [21] belong to the decision-aware model learning approaches [22], [23]. Prior works [24], [25] are also similar approaches in that they consider importance-weighting with the distribution ratio. The difference is that this paper and [21] consider the data distribution in a simulation MDP model when applying a planned policy, not the data distribution in a real MDP as in other approaches. Using the data distribution in MDP model simulation has two advantages. Firstly, unlike in a real MDP, data when applying a newly planned policy in a simulation MDP model are accessible to the agent, and the importance-weight can be obtained in the standard framework of density ratio estimation [26]. Secondly, optimizing the importance-weighted variational objective with respect to policy takes the same form as standard BAMDP planning, and the proposed algorithm can use an existing BAMDP planning algorithm as a policy planning subroutine.
Sect. II describes the notations of MDP and BAMDP. Sect. III explains the problem setting of offline meta-MBRL in this paper and presents an importance-weighted variational objective. Sect. IV proposes an algorithm to optimize the importance-weighted variational objective. Sect. V demonstrates the effectiveness of the proposed algorithm in numerical experiments. Sect. VI concludes this paper.

II. PRELIMINARY
A. MDP
This paper considers a discounted infinite horizon MDP [27]. Let S be the state space. Let A be the action space. Let ρ(s) be the initial state distribution. Let P(s′|s, a) be the transition probability function. Let r(s, a) be the reward function. Let π be a policy. Let γ be the discount factor. The state and state-action distributions are
$$d^{\pi}_{P}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s \mid \rho, \pi, P), \qquad d^{\pi}_{P}(s, a) = d^{\pi}_{P}(s)\,\pi(a \mid s).$$
The expected return is
$$\eta^{\pi}_{P} = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}\left[r(s, a)\right].$$

B. BAMDP
A BAMDP is an augmented MDP whose augmented state is $(b_t, s_t)$, where $b_t$ is the agent's belief over MDPs at timestep t [9], [10]. For simplicity, this paper assumes that reward function r is known. In that case, the agent's belief is over transition probability function P. The prior distribution, i.e., the agent's belief at timestep t = 0, is $b_0(P)$. The likelihood function is $l(P; s_t, a_t, s_{t+1}) = P(s_{t+1} \mid s_t, a_t)$. The posterior distribution, i.e., the agent's belief at t ≥ 1, is updated using the Bayes rule,
$$b_{t+1}(P) = \Pr(P \mid b_0, s_0, a_0, \dots, s_t, a_t, s_{t+1}) = \Pr(P \mid b_t, s_t, a_t, s_{t+1}) \propto b_t(P)\,P(s_{t+1} \mid s_t, a_t).$$
The transition probability function in a BAMDP is
$$\Pr(b_{t+1}, s_{t+1} \mid b_t, s_t, a_t) = \mathbb{1}\left[b_{t+1}(P) = \Pr(P \mid b_t, s_t, a_t, s_{t+1})\right]\,\mathbb{E}_{P\sim b_t}\left[P(s_{t+1} \mid s_t, a_t)\right].$$
By the assumption, the reward function in a BAMDP is r(b, s, a) = r(s, a). The expected return in a BAMDP is
$$\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(b_t, s_t, a_t) \,\middle|\, \rho, b_0, \pi\right]. \tag{1}$$
A Bayes-optimal policy is a policy that maximizes (1). Since a BAMDP is an augmented MDP whose augmented state is (b, s), a Bayes-optimal policy is a function of (b, s). In principle, when given a BAMDP, i.e., when given likelihood function l(P; s, a, s′) = P(s′|s, a) and prior distribution $b_0(P)$, a Bayes-optimal policy can be planned offline, as (1) can be computed offline [10].
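The Bayes-rule belief update described above can be illustrated on a small discrete case. The following is a minimal sketch, assuming a finite set of candidate transition tables (the two toy MDPs `P_sticky` and `P_flippy`, and the helper names `update_belief` and `predictive`, are hypothetical and not from the paper). It applies $b_{t+1}(P) \propto b_t(P)\,P(s_{t+1} \mid s_t, a_t)$ and also computes the belief-averaged next-state distribution, which is the standard form of the state transition in a BAMDP.

```python
import numpy as np

def update_belief(belief, models, s, a, s_next):
    """One Bayes-rule step: b_{t+1}(P) is proportional to b_t(P) * P(s_next | s, a)."""
    likelihood = np.array([P[s, a, s_next] for P in models])
    unnormalized = belief * likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observed transition has zero probability under every candidate model")
    return unnormalized / total

def predictive(belief, models, s, a):
    """Belief-averaged next-state distribution E_{P ~ b}[P(. | s, a)]."""
    return sum(b * P[s, a] for b, P in zip(belief, models))

# Two hypothetical MDPs with 2 states and 1 action; tables are indexed as [s, a, s'].
P_sticky = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])
P_flippy = np.array([[[0.2, 0.8]], [[0.8, 0.2]]])
models = [P_sticky, P_flippy]

b0 = np.array([0.5, 0.5])                            # uniform prior over the two MDPs
b1 = update_belief(b0, models, s=0, a=0, s_next=0)   # observe that the state stayed at 0
print(b1, predictive(b0, models, s=0, a=0))
```

Observing a transition that the "sticky" model explains better shifts the belief toward it, which is the mechanism a Bayes-optimal policy exploits when it conditions on (b, s).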

III. PROBLEM SETTING AND OBJECTIVE FUNCTION
This paper assumes a meta-RL setting where a task represented by an MDP is drawn from a distribution. This paper considers optimizing the expected return averaged over MDPs as a reasonable criterion in meta-RL. For simplicity, this paper assumes that state space S, action space A, initial state distribution ρ(s), and reward function r(s, a) are the same between all MDPs. In that case, the expected return averaged over MDPs is $\mathbb{E}_{P\sim b_0}[\eta^{\pi}_{P}]$, which is the same as (1). That is, policy optimization in meta-RL in this setting can be seen as policy optimization in a BAMDP whose likelihood function and prior distribution are specified by P and $b_0$ (hereinafter called "the real BAMDP"). As described in Sect. I, this paper considers a setting where the real BAMDP is inaccessible, and only offline data are given. Even in principle, a Bayes-optimal policy cannot be planned offline in this setting, as the real BAMDP is not given. Throughout, this paper discusses a model-based approach to optimize (1) in this setting.
This paper assumes that offline data are collected from M real MDPs sampled from $b_0$. Let $D^{\mathrm{ofl}}_m$ be the offline data collected in the m-th real MDP, whose transition probability function is $P_m$, and write sa = (s, a) for shorthand. To represent $P_m(s' \mid sa)$ and $P_m \sim b_0$, the agent uses a latent variable model denoted by $P_{\theta,z}(s' \mid sa)$ and $z \sim \beta^{0}_{\phi}$, where θ is a model parameter vector shared between MDPs, z is a latent variable vector to specify one MDP, and $\beta^{0}_{\phi}$ is the prior distribution parameterized with φ. Hereinafter, this paper refers to a BAMDP model whose likelihood function and prior distribution are specified by $P_{\theta,z}$ and $\beta^{0}_{\phi}$ as "the simulation BAMDP." Let $\beta^{t}_{\phi}$ be the agent's belief at timestep t. By the assumption, the reward function in the simulation BAMDP is $r(\beta_{\phi}, sa) = r(sa)$. In the MDP whose transition probability function is $P_{\theta,z}$, let $\eta^{\pi}_{\theta,z}$ be the expected return, let $d^{\pi}_{\theta,z}(sa)$ be the state-action distribution, and let $D^{\pi}_{\theta,z}$ be simulated data collected using policy π.
The model-based meta-RL setting in this paper is summarized as follows:
• the agent trains simulation BAMDP parameter (θ, φ) using the offline data obtained in the real BAMDP,
• the agent uses the trained simulation BAMDP to plan policy π to optimize the expected return in the real BAMDP.
Below, this paper discusses how to train (θ, φ) and plan π. The first idea is to train (θ, φ) to optimize a standard latent variable model learning criterion and then plan π to optimize a standard MBRL criterion. This paper calls it "two-stage optimization." The second idea is to iterate between training (θ, φ) and planning π to optimize a unified objective function. This paper calls it "joint optimization." The former is a natural extension of existing methods, whereas the latter is what this paper proposes. Sections III-A and III-B describe objective functions for these ideas, respectively. Sections IV-A and IV-B show algorithms for these objective functions, respectively.

A. OBJECTIVE FUNCTION FOR TWO-STAGE OPTIMIZATION
1) FIRST STAGE: TRAINING (θ, φ)
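The first stage is to train (θ, φ) based on variational inference for latent variable model learning. As a standard method, this paper uses a variational autoencoder (VAE) [28]. Given $D^{\mathrm{ofl}}$, the log marginal likelihood function is
$$\ln \Pr(D^{\mathrm{ofl}} \mid \theta) = \sum_{m=1}^{M} \ln \int p(z) \prod_{n=1}^{N} P_{\theta,z}(s'_{m,n} \mid sa_{m,n})\, dz, \tag{2}$$
where p(z) is the prior distribution for VAE learning. Using Jensen's inequality, Equation (2) is bounded as
$$\ln \Pr(D^{\mathrm{ofl}} \mid \theta) \ge \sum_{m=1}^{M} \left\{ \mathbb{E}_{z\sim q_{\phi}(z \mid D^{\mathrm{ofl}}_m)}\left[\sum_{n=1}^{N} \ln P_{\theta,z}(s'_{m,n} \mid sa_{m,n})\right] - \mathrm{KL}\left(q_{\phi}(z \mid D^{\mathrm{ofl}}_m)\,\|\,p(z)\right) \right\}, \tag{3}$$
where $q_{\phi}(z \mid D^{\mathrm{ofl}}_m)$ is a variational distribution parameterized with φ. Let (θ*, φ*) denote parameters that maximize (3).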
The initial belief in the simulation BAMDP is ideally the true latent variable distribution obtained after VAE learning. As a reasonable approximation, this paper uses $\beta^{0}_{\phi^*}(z) = \frac{1}{M}\sum_{m=1}^{M} q_{\phi^*}(z \mid D^{\mathrm{ofl}}_m)$, which can be seen as a latent distribution learned from data and is called the average encoding distribution [29] or aggregated posterior [30].
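To make the two ingredients of this stage concrete, the following is a minimal sketch of the KL regularization term and of sampling from the average encoding distribution $\frac{1}{M}\sum_m q_{\phi^*}(z \mid D^{\mathrm{ofl}}_m)$. It assumes, for illustration only, that each variational posterior is a diagonal Gaussian described by a mean and a log standard deviation and that p(z) is a standard normal; the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_sigma = np.asarray(log_sigma, dtype=float)
    return 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma, axis=-1)

def sample_aggregated_posterior(mus, log_sigmas, n_samples, rng):
    """Draw z from the uniform mixture (1/M) sum_m N(mu_m, diag(sigma_m^2)).

    A mixture component (one offline task) is picked uniformly at random,
    then z is sampled from that task's Gaussian posterior.
    """
    mus = np.asarray(mus, dtype=float)                # shape (M, latent_dim)
    log_sigmas = np.asarray(log_sigmas, dtype=float)  # shape (M, latent_dim)
    idx = rng.integers(0, mus.shape[0], size=n_samples)
    eps = rng.standard_normal((n_samples, mus.shape[1]))
    return mus[idx] + np.exp(log_sigmas[idx]) * eps

rng = np.random.default_rng(0)
z0 = sample_aggregated_posterior([[0.0], [1.0], [2.0]], [[-1.0], [-1.0], [-1.0]], 5, rng)
print(z0.shape, kl_to_standard_normal([1.0], [0.0]))
```

Sampling by first choosing a task index keeps the cost per sample independent of M, which is one reason the mixture form is convenient when M is moderate.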

2) SECOND STAGE: PLANNING π
The second stage is to plan π using the simulation BAMDP represented by $P_{\theta^*,z}$ and $\beta^{0}_{\phi^*}$. The most naive idea is to optimize the expected return in the simulation BAMDP, with (φ, θ) = (φ*, θ*). However, even in the case of an MDP, this idea often does not work for offline MBRL [20]. An improved idea is to optimize a penalized expected return in an MDP whose penalized reward function is r(s, a) − λu(s, a), where u(s, a) is an estimate of the model's error, and λ is the user-chosen penalty coefficient [20].
Similarly, this paper considers a penalized version of the expected return in the simulation BAMDP. Writing the initial belief explicitly as with (φ, θ) = (φ*, θ*) as the second stage objective function, where $u_{m,\theta,z}(sa)$ is an estimate of the model's error between $P_m(\cdot \mid sa)$ and $P_{\theta,z}(\cdot \mid sa)$.
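The penalized reward r(s, a) − λu(s, a) can be illustrated on a single simulated trajectory. The sketch below is only an illustration of the penalty idea under simplifying assumptions: it takes precomputed reward and error-estimate sequences (hypothetical inputs) and returns the discounted penalized sum; it does not reproduce the paper's second-stage objective over the simulation BAMDP.

```python
def penalized_return(rewards, errors, lam, gamma):
    """Discounted sum of r_t - lam * u_t along one trajectory.

    rewards: sequence of r(s_t, a_t)
    errors:  sequence of model-error estimates u(s_t, a_t)
    lam:     user-chosen penalty coefficient
    gamma:   discount factor
    """
    total = 0.0
    discount = 1.0
    for r, u in zip(rewards, errors):
        total += discount * (r - lam * u)
        discount *= gamma
    return total

# A trajectory whose second step visits a region where the model is less reliable.
print(penalized_return([1.0, 1.0], [0.0, 1.0], lam=0.5, gamma=0.9))
```

With λ = 0 the function reduces to the ordinary discounted return, so the penalty only changes the ranking of policies that rely on poorly modeled state-action pairs.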

B. OBJECTIVE FUNCTION FOR JOINT OPTIMIZATION
In the joint optimization, this paper gives the agent's belief at timestep t = 0 in the form of $\beta^{0}_{\phi}(z) = \frac{1}{M}\sum_{m=1}^{M} q_{\phi}(z \mid D^{\mathrm{ofl}}_m)$, as in the two-stage optimization. This paper also approximates the expected return in the real BAMDP by The difference between the expected return in the simulation BAMDP and the approximate expected return in the real BAMDP is bounded as where and ν is a constant. For the derivation, see Appendix A.
A lower bound of the approximate expected return in the real BAMDP is bounded as The first term is the expected return in the simulation BAMDP, and the second penalizes the policy evaluation error between the real and simulation BAMDPs.
Inspired by increasing the objective function by maximizing the lower bound, this paper defines a penalized objective function by where c ∈ [0, C] is a user-chosen penalty coefficient. The main idea of the joint optimization is to iteratively optimize (θ, φ) and π based on an estimate of (7). This paper uses the MM framework [31] to optimize (7). When updating from $(\theta_i, \phi_i, \pi_i)$, the surrogate function is . Below, this paper omits the constant term.

1) ESTIMATED OBJECTIVE FUNCTION FOR TRAINING (θ, φ)
Equation (8) can be rewritten as , . This paper estimates the above equation by where , and κ̂ is an estimate of κ. How to estimate them is described in Sect. IV-B.
Equation (9) can be interpreted as a kind of variational inference because (9) is similar to (3) in the following points. Firstly, $w^{\pi}_{m,\theta,z}(sa)$ is importance-weighting to address covariate shift between $d^{\mathrm{ofl}}_m(sa)$ and $d^{\pi}_{\theta,z}(sa)$. Secondly, $\ell_{m,n}(\theta, z; \kappa)$ is a utility function modified from the log-likelihood function. Thirdly, ν scales the KL divergence regularization term in the same manner as β-VAE [32]. Based on the interpretation of (9) as a kind of variational inference, this paper uses it to update (θ, φ). This paper calls it "importance-weighted variational inference for BAMDP."
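The three points above (per-sample importance weights, a per-sample utility, and a ν-scaled KL term) can be sketched numerically. The following is not Equation (9): as a loudly stated simplification, it uses the plain log-likelihood as a stand-in for the utility $\ell_{m,n}$, takes the weights and KL values as given arrays, and the function name is hypothetical. It only shows how importance weights change which samples dominate a variational objective.

```python
import numpy as np

def weighted_variational_objective(log_lik, weights, kl, nu=1.0):
    """Importance-weighted, KL-regularized objective (to be maximized).

    log_lik: array (M, N), per-task per-sample log-likelihood
             (a stand-in for the paper's utility term; an assumption)
    weights: array (M, N), importance weight of each offline sample
    kl:      array (M,), KL divergence of each task's variational posterior from the prior
    nu:      constant scaling the KL regularization, as in beta-VAE
    """
    log_lik = np.asarray(log_lik, dtype=float)
    weights = np.asarray(weights, dtype=float)
    kl = np.asarray(kl, dtype=float)
    return float(np.sum(weights * log_lik) - nu * np.sum(kl))

log_lik = [[-1.0, -2.0], [-3.0, -4.0]]
kl = [0.5, 0.5]
unweighted = weighted_variational_objective(log_lik, np.ones((2, 2)), kl)
focused = weighted_variational_objective(log_lik, [[2.0, 0.0], [0.0, 0.0]], kl)
print(unweighted, focused)
```

With all weights equal to one and ν = 1, the value coincides with a single-sample estimate of a standard evidence lower bound, which is the sense in which the weighted form generalizes the unweighted one.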

2) ESTIMATED OBJECTIVE FUNCTION FOR PLANNING π
Equation (8) can also be rewritten as . The resulting estimated objective function is which is a penalized version of the expected return in the simulation BAMDP.

a: COMPARISON TO TWO-STAGE OPTIMIZATION
The objective function for planning π is essentially the same for the joint optimization and the two-stage optimization, comparing (10) and (5). In the joint optimization, the objective function for training (θ, φ) is consistent with the one for planning π, as (9) and (10) are both estimates of (8). However, in the two-stage optimization, the objective function for training (θ, φ) is different from the one for planning π, comparing (3) and (5). In other words, for one objective, the joint optimization optimizes it with respect to both (θ, φ) and π, whereas the two-stage optimization does it with respect to only π. As a result, the joint optimization is better than the two-stage optimization in terms of optimizing one objective.

b: ADVANTAGE OF USING $d^{\pi}_{\theta,z}$
It is also possible to consider importance-weighting with $d^{\pi}_m(sa) / d^{\mathrm{ofl}}_m(sa)$, because another bound similar to (6) can also be derived by replacing $d^{\pi}_{\theta,z}$ in L(θ, φ; π) with $d^{\pi}_m$; see Sect. IV of [21]. However, in that case, the resulting variant of (10) does not have the same form as the objective function of a BAMDP planning problem. One advantage of using $d^{\pi}_{\theta,z}$ is that (10) is a BAMDP planning objective function and can be optimized using an existing BAMDP planning algorithm. Another advantage is that, since the agent cannot access data sampled from $d^{\pi}_m(sa)$ in the real BAMDP but can generate data sampled from $d^{\pi}_{\theta,z}(sa)$ in the simulation BAMDP, the importance weight $w^{\pi}_{m,\theta,z}(sa)$ can be obtained in the standard framework of density ratio estimation [26], which is a simpler setting.

IV. ALGORITHM
A. ALGORITHM FOR TWO-STAGE OPTIMIZATION
The main idea of the two-stage optimization is to train BAMDP parameter (θ, φ) and subsequently plan policy π.
2) SECOND STAGE: PLANNING π
Line 3 in Algorithm 1 optimizes (5), which is policy planning. Inspired by VariBAD [16], this paper approximately gives an augmented state in the BAMDP by a pair of a state and a variational approximation of the belief. To reduce computational efforts, as the prior for the variational approximation of the belief, this paper uses a variational distribution that minimizes the KL divergence from $\beta^{0}(z) = \frac{1}{M}\sum_{m} q_{\phi^*}(z \mid D^{\mathrm{ofl}}_m)$. As the likelihood function for the variational approximation of the belief, this paper uses $P_{\theta,z}$, the decoder trained as in Line 2. This paper trains $u_{m,\theta,z}(sa)$ in (5) using input data $\{sa_{n,m}, z, \mu_{\phi,m}, \ln\sigma_{\phi,m}\}_{n,m}$ and output data

B. ALGORITHM FOR JOINT OPTIMIZATION
The main idea of the joint optimization is to iterate between training (θ, φ) and planning π. Algorithm 2 shows the outline.
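The alternation between model training and policy planning can be sketched as a short control loop. This is a hedged skeleton, not Algorithm 2 itself: the three callables (`train_unweighted`, `train_weighted`, `plan_policy`) are hypothetical placeholders, and the ordering (train, then plan, in every iteration, with unweighted training only in the first iteration because the initial policy gives no meaningful importance weights) is an assumption drawn from the description in this section.

```python
def joint_optimization(train_unweighted, train_weighted, plan_policy, init_policy, n_iters):
    """Alternate between model training and policy planning.

    train_unweighted(): returns model parameters from standard variational inference
    train_weighted(params, policy): returns parameters from importance-weighted training
    plan_policy(params, policy): returns a policy planned in the simulation model
    """
    policy = init_policy
    params = None
    for i in range(n_iters):
        if i == 0:
            # The initial policy carries no useful information for importance weighting.
            params = train_unweighted()
        else:
            params = train_weighted(params, policy)
        policy = plan_policy(params, policy)
    return params, policy

# Stub callables that only record the order in which the steps run.
calls = []
params, policy = joint_optimization(
    train_unweighted=lambda: calls.append("train_unweighted") or {"iter": 0},
    train_weighted=lambda p, pi: calls.append("train_weighted") or {"iter": p["iter"] + 1},
    plan_policy=lambda p, pi: calls.append("plan") or f"policy_{p['iter']}",
    init_policy="policy_init",
    n_iters=2,
)
print(calls, params, policy)
```

Passing the subroutines as arguments mirrors the point made earlier that an existing BAMDP planning algorithm can be plugged in as the planning step.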

1) TRAINING (θ, φ)
At the first iteration, where π remains an initial value, importance-weighting depending on π is not reasonable. Line 4 in Algorithm 2 optimizes (3), as with the two-stage optimization. At the subsequent iterations, Line 6 in Algorithm 2 optimizes (9). Below, this paper discusses how to execute Line 6 concretely.
In principle, $v^{\pi}_{\theta,z}(s)$ may be estimated by a meta-RL extension of LSDG [34]. However, in practice, estimating $v^{\pi}_{\theta,z}(s)$ is computationally unrealistic if θ is high-dimensional. Specifically, LSDG in a single MDP RL setting requires estimating the same number of value functions as the dimension of model parameters, and $v^{\pi}_{\theta,z}(s)$ additionally needs its meta-RL version. In the case of an MDP, the numerical experiments in [21] observe that importance-weighted model estimation ignoring this term can also perform better than unweighted model estimation. Assuming that this also holds for a BAMDP, this paper ignores $v^{\pi}_{\theta,z}(s)$ in the gradient-based optimization.
Estimating $w^{\pi}_{m,\theta,z}$ is meta-learning of density ratio, where the source datasets are $D^{\mathrm{ofl}}_m$ and the target datasets are $D^{\pi}_{\theta_j,z}$. This paper estimates $w^{\pi}_{m,\theta,z}$ using neural networks that take sa, z, $\mu_{\phi,m}$ as input, denoted by $\hat{w}^{\pi}_{m,\theta,z}$. Since $\mu_{\phi,m}$ encodes data from the m-th MDP, it contains the information of the source distribution. Since z specifies a simulation MDP, it captures the characteristics of the target distribution. Adding latent representations of both source and target distributions to input is inspired by [35].
$(\theta_j, \phi_j, \pi_j) \leftarrow (\theta, \phi, \pi)$.
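Density ratio estimation by probabilistic classification can be sketched on one-dimensional data. The example below is a simplified stand-in, not the paper's estimator: it uses a linear logistic classifier trained by plain gradient descent instead of neural networks, omits the latent inputs z and $\mu_{\phi,m}$, applies no regularization of the weights, and all names and the two Gaussian datasets are hypothetical. It relies on the standard identity that, for a classifier separating target samples (label 1) from source samples (label 0), the odds p/(1 − p) are proportional to the density ratio.

```python
import numpy as np

def fit_density_ratio(x_source, x_target, n_steps=2000, lr=0.5):
    """Fit a logistic classifier and return w(x) ~ p_target(x) / p_source(x)."""
    x = np.concatenate([x_source, x_target])
    y = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    feats = np.column_stack([np.ones_like(x), x])    # bias and slope of a linear logit
    theta = np.zeros(feats.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-feats @ theta))
        theta -= lr * feats.T @ (p - y) / len(y)     # gradient step on the mean logistic loss
    prior_correction = len(x_source) / len(x_target)

    def ratio(x_query):
        x_query = np.asarray(x_query, dtype=float)
        logit = theta[0] + theta[1] * x_query
        return prior_correction * np.exp(logit)      # odds p / (1 - p) equal exp(logit)

    return ratio

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=2000)   # stands in for offline (source) samples
x_tgt = rng.normal(1.0, 1.0, size=2000)   # stands in for simulated (target) samples
w = fit_density_ratio(x_src, x_tgt)
print(float(w(2.0)), float(w(-1.0)))
```

For these two unit-variance Gaussians the true ratio is exp(x − 0.5), so source points that look like target data receive weights above one and the rest are down-weighted.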

V. NUMERICAL EXPERIMENTS
A. POLICY EVALUATION
Firstly, to illustrate the effectiveness of importance-weighted variational inference for BAMDP, this paper discusses the problem of predicting behaviors of a given target policy. This problem can be seen as a policy evaluation problem, as the expected return is computed from the predicted behavior. This paper compares the behaviors predicted with standard variational inference and with importance-weighted variational inference when training BAMDP models expressed using the same NN model. This paper considers an inverted pendulum task, where state s is a pair of angle and angular velocity, and action a is torque input. The environmental variation in meta-RL is that the viscosity coefficient of the equation of motion behind a real MDP changes every episode. The offline data is collected using a random policy in 100 sampled real MDPs. The target policy is a controller that swings up and stabilizes the pendulum to (0, 0) in the real MDP whose viscosity coefficient is zero. For more details, see Appendix B.
The outline of variational inference is as follows. The agent considers a one-dimensional latent variable z. The agent represents each model by neural networks. For learning each model, the agent uses the data obtained from 80 real MDPs for training and the rest for validation. For regularizing importance-weighting, the agent uses α = 0.2. The number of iterations of Algorithm 3 is five. For more details, see Appendix B.
Fig. 1 illustrates the predicted behavior when using standard variational inference, i.e., optimizing (3). The 100 subplots correspond to the 100 sampled real MDPs. In each subplot, the horizontal and vertical axes stand for angle and angular velocity, respectively. The black lines show real future data when applying the target policy from initial state (π, 0) in each real MDP, which is the ground truth behavior the agent wants to predict. Multiple black line patterns show that the target policy planned for zero viscosity coefficient swings more weakly than expected as the viscosity coefficient increases, finally failing to swing up. The colored markers show simulated future data when applying the target policy from the same initial state in each simulation MDP, whose latent variable is the encoding of the offline data collected in the real MDP in the same subplot. That is, this is the prediction that the agent obtains using the trained model. Note that since the state transition model is estimated as a probabilistic model, there are variations in the predicted behavior, which are drawn in different colors. The red markers indicate (0, 0). The top 20 subplots and the bottom 80 subplots are the real MDPs where the offline data for validation and training are collected, respectively. There is a big difference between the black lines and the colored markers, meaning that the simulation BAMDP trained using standard variational inference does not capture the behavior of the target policy.
Fig. 2 illustrates the predicted behavior when using importance-weighted variational inference for BAMDP. Note that the black lines, i.e., real future data, are the same as Fig. 1. The difference between the black lines and the colored markers in Fig. 2 is small compared to Fig. 1. Thus, the simulation BAMDP trained using importance-weighted variational inference for BAMDP captures the behavior of the target policy more accurately compared to standard variational inference.
Fig. 3 shows the offline data colored based on the logarithm of the importance-weights at the fifth iteration of Algorithm 3. This figure also shows the same black lines as Fig. 1 for reference. Roughly speaking, data points close to the black line are colored brightly, assigning large importance-weighting. Such importance-weighting is effective for more accurately predicting behaviors of the target policy.
Fig. 4 illustrates the relationship between the real MDP parameter and the simulation MDP latent variable when using standard variational inference. The horizontal axis stands for the viscosity coefficient, which is the real MDP parameter and is inaccessible to the agent. The vertical axis indicates the one-dimensional latent variable mean of the approximate belief, which encodes the offline data collected in the same real MDP and is accessible to the agent. The orange and blue markers are the results of the real MDPs where the offline data for validation and training are collected, respectively. This figure also shows that the simulation BAMDP learned using standard variational inference is not very accurate.
Fig. 5 illustrates the relationship between the real MDP parameter and the simulation MDP latent variable when using importance-weighted variational inference for BAMDP. The magnitude relation of the one-dimensional latent variable accessible to the agent roughly captures the magnitude relation of the viscosity coefficient inaccessible to the agent. This figure also shows that the simulation BAMDP learned using importance-weighted variational inference for BAMDP is more accurate. Note that, for the few subplots that do not capture the ground truth behaviors, the viscosity coefficient is close to the critical point where the target policy cannot swing up.

B. POLICY OPTIMIZATION
Next, this paper discusses policy optimization experiments to demonstrate the effectiveness of the proposed algorithm. This paper presents the results of the inverted pendulum task described in Sect. V-A and a cartpole swing-up task. For the cartpole task, the environmental variation in meta-RL is that the pole mass and the pole length of the equation of motion behind a real MDP change every episode. Similar to the inverted pendulum task, the offline data is collected using a random policy in 100 sampled real MDPs. For more details, see Appendix B.
The outline of the two-stage optimization and the joint optimization is as follows. The agent considers a one-dimensional latent variable in the inverted pendulum task and a two-dimensional one in the cartpole swing-up task. The agent uses a decoder with 48 hidden units in the inverted pendulum task and one with 64 hidden units in the cartpole swing-up task. The other settings are the same between the inverted pendulum and cartpole swing-up tasks. For regularizing importance-weighting, the agent uses α = 0.2. The number of iterations of Algorithm 3 is five. The number of iterations of Algorithm 2 is two. The agent uses SAC [37] as a policy planning subroutine to learn an augmented-state-dependent policy in the simulation BAMDP. For more details, see Appendix B.
Table 1 shows the result of the two-stage optimization and the joint optimization. Note that the two-stage optimization is an existing method, and the joint optimization is the proposed algorithm, as described in Sect. III. For each task, Table 1 reports the score averaged over five runs with different random seeds. For each run, this paper estimates the expected return by averaging the return in 100 sampled real MDPs. For both tasks, the joint optimization achieves better performance. Figs. 6 and 7 show the behaviors in the real BAMDP when planned using the two-stage optimization and the joint optimization, respectively. The policy planned using the two-stage optimization cannot stabilize the pendulum around (0, 0) as shown in Fig. 6, leading to the worse performance shown in Table 1. This is because the simulation BAMDP trained by the two-stage optimization cannot accurately represent transitions around (0, 0). The joint optimization trains the simulation BAMDP by assigning larger importance-weighting to data around (0, 0). As a result, the policy planned using the joint optimization can stabilize the pendulum around (0, 0) as shown in Fig. 7, resulting in the better performance shown in Table 1.
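The scoring protocol described above (a mean over 100 sampled real MDPs per run, then a mean over five runs) can be sketched as follows. This is a minimal sketch with hypothetical names: `return_fn` and `sample_task` are placeholders supplied by the caller, and the toy return used in the usage lines is invented purely so the snippet runs; it is not the paper's environment or result.

```python
import numpy as np

def average_score(return_fn, sample_task, n_runs=5, n_tasks=100):
    """Average the return over sampled tasks within each run, then over runs.

    return_fn(seed, task): return obtained by the policy of run `seed` on `task`
    sample_task(rng):      draws one task (one real MDP) from the task distribution
    """
    per_run = []
    for seed in range(n_runs):
        rng = np.random.default_rng(seed)
        returns = [return_fn(seed, sample_task(rng)) for _ in range(n_tasks)]
        per_run.append(float(np.mean(returns)))
    return float(np.mean(per_run)), per_run

# Toy usage: a task is a viscosity coefficient, and the made-up return decreases with it.
score, per_run = average_score(
    return_fn=lambda seed, viscosity: -viscosity,
    sample_task=lambda rng: rng.uniform(0.0, 0.3),
)
print(score, per_run)
```

Keeping the per-run means alongside the overall mean makes it easy to report the spread across seeds as well as the average.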

VI. CONCLUSION AND FUTURE DIRECTIONS
This paper discusses importance-weighted variational inference to train a BAMDP model in offline Bayesian MBRL. The proposed algorithm optimizes a unified objective function that is an importance-weighted variational objective function for training a model and is a penalized expected return for planning a policy. In theory, since a method using standard variational inference without importance-weighting optimizes an objective function of interest only with respect to a policy, the proposed algorithm is better in terms of optimizing one objective function. In practice, numerical experiments demonstrate that the proposed algorithm can perform better.
Future directions to improve the proposed algorithm are as follows. Firstly, this paper considers the case where the number of real MDPs collected in offline data, M, is not large. To address a large number of real MDPs, the average encoding distribution, $\beta^{0}(z) = \frac{1}{M}\sum_{m} q_{\phi^*}(z \mid D^{\mathrm{ofl}}_m)$, needs to be approximated by a mixture of variational posteriors with pseudo-inputs [38] or a similar technique. Secondly, applying to large-scale tasks is an important challenge. One of the bottlenecks is density ratio estimation in high-dimensional settings, as this is itself a research topic [39], [40]. It is necessary to incorporate recent developments. Thirdly, improving variational inference of BAMDP as a latent variable model is essential for both unweighted and importance-weighted settings.

APPENDIX A DERIVING POLICY EVALUATION ERROR BOUND
The policy evaluation error between the real and simulation MDPs is bounded as follows (see Sect. IV-A of [21]).
where ξ (θ, φ, z; π) = E sa∼ dπ θ,z ,s

APPENDIX B NUMERICAL EXPERIMENT SETTINGS
The inverted pendulum task and the cartpole swing-up task are modifications of OpenAI Gym [41] environments. The modified parts are as follows. For the inverted pendulum task, the time discretization width is 0.1, the mass is 0.5, the viscosity coefficient is uniformly sampled from [0, 0.3] as task variation, the initial angle and angular velocity are uniformly sampled from [−0.75π, 0.75π] and [−5, 5], and the cost function is 1 − exp(−0.5 × angle²). For the cartpole swing-up task, the goal is changed from balancing to swing-up, the time discretization width is 0.05, the pole mass and length are uniformly sampled from [0.05, 0.3] and [0.4, 0.5] as task variation, the initial angle is uniformly sampled from
The details of model training are as follows. For the encoder, $f_{\phi}$ and $[\mu_{\phi}, \log\sigma_{\phi}]$ are four-layer neural networks with ReLU activation with 32 hidden units. The decoder, $P_{\theta,z}$, is a two-layer neural network with ReLU activation with 48 hidden units in the inverted pendulum task and with 64 hidden units in the cartpole swing-up task. The encoder and the decoder are trained using standard variational inference or importance-weighted variational inference for BAMDP. The importance-weight model, $\hat{w}^{\pi}_{m,\theta,z}$, is a four-layer neural network with tanh activation with 32 hidden units and is learned using a logistic regression loss and α = 0.2. The penalty model, $\hat{u}_{m,\theta,z}$, is a four-layer neural network with tanh activation with 16 hidden units and is learned using a regression loss.
The discount factor is γ = 0.99.The constant scaling the KL divergence regularization term is ν = 1.The penalty coefficient for importance-weighted variational inference for BAMDP is c = 0.1.The penalty coefficient for standard variational inference is λ = κ, to compare with importance-weighted variational inference for BAMDP under the same condition.
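The inverted pendulum settings listed in this appendix can be collected into a small sampling routine. The sketch below covers only the quantities stated in the text (the viscosity range, the initial-state ranges, and the cost function); the function names and the dictionary layout are hypothetical, and the equation of motion itself is not reproduced because it is not given here.

```python
import numpy as np

def sample_pendulum_task(rng):
    """Sample the task variation and the initial state for the inverted pendulum."""
    return {
        "viscosity": rng.uniform(0.0, 0.3),                      # task variation
        "angle": rng.uniform(-0.75 * np.pi, 0.75 * np.pi),       # initial angle
        "angular_velocity": rng.uniform(-5.0, 5.0),              # initial angular velocity
    }

def pendulum_cost(angle):
    """Cost 1 - exp(-0.5 * angle^2): zero at the upright target angle 0."""
    return 1.0 - np.exp(-0.5 * angle**2)

rng = np.random.default_rng(0)
tasks = [sample_pendulum_task(rng) for _ in range(100)]   # 100 sampled real MDPs
print(tasks[0], pendulum_cost(0.0), pendulum_cost(np.pi))
```

The cost saturates near 1 away from the target, so most of the learning signal comes from states close to angle 0.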
Let $D^{\mathrm{ofl}}_m = \{(s_{m,n}, a_{m,n}, s'_{m,n})\}_{n=1}^{N}$ be the offline data collected in the m-th real MDP, where $(s_{m,n}, a_{m,n}, s'_{m,n})$ is the n-th transition sample observed in the m-th real MDP. Let $D^{\mathrm{ofl}} = \{D^{\mathrm{ofl}}_m\}_{m=1}^{M}$ be the entire offline data. Let $P_m$ be the m-th real MDP's transition probability function. Hereinafter, for notational shorthand, this paper uses sa = (s, a), $sa_{m,n} = (s_{m,n}, a_{m,n})$, and $sas'_{m,n} = (s_{m,n}, a_{m,n}, s'_{m,n})$. Let $d^{\mathrm{ofl}}_m(sa)$ be the underlying state-action distribution of $sa_{m,n}$.

1) ESTIMATED OBJECTIVE FUNCTION FOR TRAINING (θ, φ)
It is also possible to consider importance-weighting with $d^{\pi}_m(sa) / d^{\mathrm{ofl}}_m(sa)$
a: ESTIMATING κ
This paper estimates κ by κ

FIGURE 1. Behaviors in real and simulation BAMDPs when using standard variational inference (policy evaluation).

FIGURE 2. Behaviors in real and simulation BAMDPs when using importance-weighted variational inference for BAMDP (policy evaluation).

FIGURE 4. Real MDP parameter and simulation MDP latent variable when using standard variational inference (policy evaluation).

FIGURE 5. Real MDP parameter and simulation MDP latent variable when using importance-weighted variational inference for BAMDP (policy evaluation).

FIGURE 6. Behaviors in real and simulation BAMDPs when planned using two-stage optimization (inverted pendulum policy optimization).

FIGURE 7. Behaviors in real and simulation BAMDPs when planned using joint optimization (inverted pendulum policy optimization).