Safe Reinforcement Learning Using Wasserstein Distributionally Robust MPC and Chance Constraint

In this paper, we address the chance-constrained safe Reinforcement Learning (RL) problem using function approximators based on Stochastic Model Predictive Control (SMPC) and Distributionally Robust Model Predictive Control (DRMPC). We use Conditional Value at Risk (CVaR) to measure the probability of constraint violation and thus safety. In order to provide a policy that is safe by construction, we first propose using a parameterized nonlinear DRMPC at each time step. DRMPC optimizes a finite-horizon cost function subject to the worst-case constraint violation over an ambiguity set. As the ambiguity set, we use a statistical ball around the empirical distribution whose radius is measured by the Wasserstein metric. Unlike sample-average-approximation SMPC, DRMPC provides a probabilistic guarantee on the out-of-sample risk and requires fewer samples of the disturbance. Q-learning is then used to optimize the parameters of the DRMPC scheme to achieve the best closed-loop performance. Wheeled Mobile Robot (WMR) path planning with obstacle avoidance is considered to illustrate the efficiency of the proposed method.


I. INTRODUCTION
Enforcing safety in the presence of uncertainty and stochasticity of nonlinear dynamical systems is a challenging task [1]. Chance constraints are a common mathematical model of safety, requiring a user-specified upper bound on the probability of constraint violation [2]. However, chance constraints are computationally challenging to handle due to their nonconvexity. Conditional Value at Risk (CVaR) [3] is a convex risk measure that has received considerable attention in decision-making problems, such as Markov Decision Processes (MDPs) [4], [5].
The theory of stochastic optimal control typically assumes that the probability distribution of the disturbance is fully known. However, this assumption may not hold in many real-world applications, and one needs to estimate the probability distribution. Moreover, stochastic optimization is challenging to solve, especially for non-convex problems [6].
In data-driven stochastic optimization, Sample Average Approximation (SAA) is a fundamental way to estimate the probability distribution of the random variables [7]. SAA typically needs quite an extensive data set to fulfill risk constraints accurately. Distributionally Robust Optimization (DRO) is an alternative that overcomes this problem. DRO tackles stochastic optimization by considering the worst-case distribution in an ambiguity set. There are several ways to construct ambiguity sets, e.g., moment ambiguity [8], the Prohorov-based ball [9], the Kullback-Leibler divergence-based ball [10], and the Wasserstein-based ball [11]. The Wasserstein-based ball is a statistical ball in the space of probability distributions around the empirical distribution, whose radius is measured using the Wasserstein distance. The radius of the ball then represents the conservatism of the DRO problem. Unlike the SAA method, Wasserstein DRO provides a probabilistic guarantee based on finite samples in a tractable formulation [12].
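To make the Wasserstein distance concrete: in one dimension, between two empirical distributions with equally many samples, it reduces to the average absolute difference of the order statistics. The following minimal sketch (an illustrative helper, not part of the paper) demonstrates this:

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with the same number of samples: the optimal transport plan matches
    sorted samples, so the distance is the mean absolute difference of
    the order statistics."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# A distribution has zero distance to itself; shifting every sample
# by c moves the distribution by exactly |c|.
samples = [0.1, -0.3, 0.7, 0.2]
print(wasserstein_1d(samples, samples))                     # 0.0
print(wasserstein_1d(samples, [w + 0.5 for w in samples]))  # ~0.5
```

This matches the interpretation in the text: the distance is the minimum cost of transporting probability mass from one distribution to the other.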
Model Predictive Control (MPC) is an optimization-based control approach operating with a receding horizon [13]. MPC employs a (possibly inaccurate) model of the real system dynamics to produce an input-state sequence over a given finite horizon. The resulting trajectory optimizes a given cost function while explicitly enforcing the system constraints. The optimization problem is solved at each time instance based on the current system state, and the first input of the optimal solution is applied to the system. Due to the finite-horizon scheme and (possibly) model mismatch, MPC usually delivers a reasonable but suboptimal approximation of the optimal policy. This paper uses DRO in the chance-constrained nonlinear MPC. This approach is known as Distributionally Robust MPC (DRMPC) [14].
Reinforcement Learning (RL) is a technique for solving problems involving MDPs. RL typically requires a function approximator to approximate the optimal policy, value function, or action-value function. For instance, Q-learning has been used in [15] for unmanned vehicle applications. In [16], the comparison of MPC and RL has been studied in the distributed setting. Recently, MPC has been used as a structured function approximator for RL algorithms. In this method, a parameterized MPC scheme is used in order to generate the policy and/or value functions of the real system. RL algorithms can then be used to adjust the MPC parameters to achieve the best closed-loop performance. The combination of MPC and RL has been proposed and justified in [17], where it is shown that an MPC scheme can theoretically generate the optimal policy and value functions for a given system even if the MPC model is inaccurate. Recent research has further developed and demonstrated this approach [18], [19].

A. RELATED WORKS
In [14], the authors have proposed to use DRMPC to utilize its benefits for motion control. A DRMPC has been applied to the multi-area dynamic optimal power flow in [20] to better hedge the uncertainties of distributed generation and loads. For Gaussian processes, a learning-based DRMPC has been proposed in [21]. A learning-based DRMPC has been developed for chance-constrained Markovian switching systems with unknown switching probabilities in [22]. The authors have shown that this framework provides mean-square stability of the system without requiring explicit knowledge of the transition probabilities. In [23], a DRO has been proposed for chance-constrained data-enabled predictive control with stochastic linear time-invariant systems. In [24], a DRMPC algorithm has been presented for spacecraft circular orbital rendezvous and docking problems. A soft-constrained DRMPC has been proposed for linear systems in [25].
A robust MPC scheme has been used as a function approximator for safe RL in [26]. Control Barrier Functions (CBF) have been used in the safe RL context in [27]. A safe RL-CBF framework has been developed to guarantee safety and improve exploration in [28]. Probabilistic safety in learning-based control methods has been provided in [29] based on probabilistic model predictive safety certification. In [30], the safe RL problem is formulated as a constrained MDP. Then a Lyapunov approach has been proposed to solve it.

B. CONTRIBUTIONS
In many real stochastic systems, only a limited amount of data on the uncertainties and disturbances is available. Therefore, traditional methods such as SAA cannot accurately estimate the distribution of these random variables. An accurate distribution is all the more important in safety-critical systems, where a safe controller must be designed for the system. In this paper, we propose to use a parameterized nonlinear DRMPC based on the Wasserstein metric as a function approximator for RL in order to generate a family of policies that are safe by construction. DRMPC is subject to the chance constraint, approximated by the CVaR risk measure. We reformulate Wasserstein DRMPC as a tractable optimization. Then we use the Q-learning technique to optimize the parameters of the DRMPC scheme to achieve the best closed-loop performance among the safe policies.

C. ORGANIZATION
The paper is structured as follows. Section II details safe RL and chance constraints. Section III provides safe policies based on the SMPC scheme, evaluated using the SAA method; moreover, we formulate CVaR as a convex approximation of chance constraints. Section IV formulates a tractable DRMPC scheme and provides out-of-sample guarantees. Section V details Q-learning as an efficient way to optimize the parameters of the DRMPC scheme. Section VI provides a numerical simulation, and Section VII concludes the paper.
Notation: We denote the set of real numbers, non-negative real numbers, extended real numbers, non-negative integers, and natural numbers by R, R≥0, R̄ := R ∪ {−∞, +∞}, Z, and N, respectively, while I_{i:j} refers to the set {i, i + 1, . . . , j}. Vectors in R^n are denoted by bold letters, e.g., a. For given vectors x, y, ⟨x, y⟩ := x⊤y denotes the usual inner product. A function f : R^n → R̄ is proper if f(x) < +∞ for at least one x and f(x) > −∞ for every x in R^n. The conjugate of a function f : R^n → R̄ is f*(y) := sup_{x∈R^n} ⟨x, y⟩ − f(x). The support function of a set W is σ_W(x) := sup_{y∈W} ⟨x, y⟩. For a scalar a, we define (a)+ := max{a, 0}.
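The conjugate function in the notation above can be probed numerically. The sketch below (illustrative only, not part of the paper) evaluates f*(y) = sup_x ⟨x, y⟩ − f(x) over a grid and checks the textbook fact that f(x) = x²/2 is self-conjugate, i.e., f*(y) = y²/2:

```python
def conjugate(f, y, grid):
    """Numerically evaluate the conjugate f*(y) = sup_x <x, y> - f(x)
    by maximizing over a finite grid (scalar case)."""
    return max(x * y - f(x) for x in grid)

grid = [i / 100.0 for i in range(-500, 501)]   # x in [-5, 5], step 0.01
f = lambda x: 0.5 * x * x                      # f(x) = x^2/2 is self-conjugate
# f*(y) should equal y^2/2; at y = 2 the sup is attained at x = y = 2.
print(conjugate(f, 2.0, grid))  # 2.0
print(conjugate(f, 0.0, grid))  # 0.0
```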

II. SAFE REINFORCEMENT LEARNING
In this section, we formulate safe Reinforcement Learning (RL) using chance constraints. Let us consider the following (possibly) nonlinear discrete-time stochastic dynamical system:

s_{k+1} = f(s_k, a_k, w_k),   (1)

where k ∈ Z is the time index, s_k ∈ X ⊆ R^n is the system state, a_k ∈ U ⊆ R^m is the control input, w_k ∈ W ⊂ R^d is a random variable representing the stochastic disturbance of the system, and f : R^{n+m+d} → R^n is a Borel-measurable function. Note that the notation in (1) is standard in the control literature, while the RL literature typically uses the conditional probability notation P[s_{k+1} | s_k, a_k] for the state transition. We then make the following assumption on W.
Assumption 1: The disturbance set W is convex and closed.

We will use this assumption in the rest of the paper to reformulate DRO as finite convex programming. A deterministic policy π : X → U maps the state space to the input space and determines how to choose the input a_k at each state s_k. We aim to find the optimal safe policy π⋆, given by the solution of:

π⋆ ∈ arg min_π E_{s_0∼µ_0}[ V^π(s_0) ]   (2)
s.t.  P[ s_{k+i} ∈ S ] ≥ α, ∀i ∈ I_{1:I}, ∀k ∈ Z,   (3c)

where µ_0 is the probability distribution of the initial state s_0 and V^π : X → R̄ is the value function associated with the policy π, defined as follows:

V^π(s) := E[ Σ_{k=0}^{∞} γ^k L(s_k, π(s_k)) | s_0 = s ],

where L : X × U → R̄ is the stage cost, γ ∈ (0, 1] is the discount factor, S ⊆ X is a safe set, and α ∈ (0, 1) is a user-chosen confidence level. The chance constraint (3c) guarantees probabilistic safety of the state trajectories s_{k+i} over a finite horizon of length I ∈ N given the state s_k at each time instance k. In fact, we generalize the common chance constraint in the literature to be satisfied not only one step ahead but over a finite horizon ahead at every time instance. This paper provides such policies using both an SMPC scheme and a DRMPC scheme with horizon I. The safe set S can be defined as follows:

S := { s ∈ X | h_j(s) ≤ 0, ∀j ∈ I_{1:J} },   (4)

where h_j : X → R̄ specifies a state constraint and J is the number of constraints. For the sake of simplicity, and in order to avoid the complexity of joint constraints, we consider the following individual constraint:

P[ h_j(s_{k+i}) ≤ 0 ] ≥ α, ∀j ∈ I_{1:J},   (5)

Then one can verify that, using (4), (5) implies (3c).
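The individual chance constraint (5) can be checked empirically by Monte Carlo. The sketch below (illustrative scalar constraint and Gaussian state distribution, all values placeholders) estimates P[h(s) ≤ 0] and compares it with α:

```python
import random

def empirical_satisfaction(h, state_samples):
    """Fraction of sampled states satisfying h(s) <= 0: a Monte Carlo
    estimate of P[h(s) <= 0] in the chance constraint."""
    return sum(1 for s in state_samples if h(s) <= 0) / len(state_samples)

random.seed(0)
h = lambda s: s - 1.0                                    # safe iff s <= 1
states = [random.gauss(0.0, 0.5) for _ in range(10000)]  # s ~ N(0, 0.5^2)
p_safe = empirical_satisfaction(h, states)
alpha = 0.9
print(p_safe, p_safe >= alpha)  # P[s <= 1] ~ 0.977 for N(0, 0.5^2)
```

Such a sample-based check is exactly what the SAA approach of the next section systematizes inside the MPC scheme.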

Assumption 2: Each function −h_j is proper, convex, and lower semi-continuous.
In the next section, we will use an SMPC scheme in order to provide a family of safe policies.

III. STOCHASTIC MPC-BASED POLICY
In the RL context, we consider a family of parameterized policies π_θ with parameter vector θ ∈ R^p and seek the parameters θ that provide the best closed-loop performance. More specifically, (2) is reformulated as:

min_θ E_{s_0∼µ_0}[ V^{π_θ}(s_0) ].   (6)

Instead of solving (3) directly, we use a function approximator based on the MPC scheme to extract a policy π_θ that satisfies (3c) by construction for all parameters θ. More specifically, consider the following parameterized SMPC at time instant k:

min_{a, s}  γ^I T_θ(s_{I+k|k}) + Σ_{i=0}^{I−1} γ^i l_θ(s_{i+k|k}, a_{i+k|k})   (7a)
s.t.  s_{i+k+1|k} = f(s_{i+k|k}, a_{i+k|k}, w_i),   (7b)
P[ h_j(s_{i+k|k}) ≤ 0 ] ≥ α, ∀i ∈ I_{1:I}, ∀j ∈ I_{1:J},   (7c)
a_{i+k|k} ∈ U, s_{k|k} = s_k,   (7d)

where T_θ : X → R̄ and l_θ : X × U → R̄ are the parameterized terminal cost and stage cost, respectively. This parameterization allows one to provide a family of policies that are safe for all θ ∈ R^p. Then, by tuning the parameters θ and thereby reshaping the cost function of the MPC scheme, one can achieve the best closed-loop performance. The decision variables a = {a_{k|k}, . . . , a_{I+k−1|k}} and s = {s_{k|k}, . . . , s_{I+k|k}} are the input and state sequences, respectively. The parameterized policy π_θ at time instance k is then extracted as follows:

π_θ(s_k) = a⋆_{k|k},   (8)

where a⋆_{k|k} is the first input of the optimal solution of SMPC (7). The use of a parameterized MPC scheme as a function approximator to capture the optimal policy and value function was proposed and justified in [17]. Moreover, the authors showed that RL methods such as Q-learning and policy gradient can be used to adjust the MPC scheme parameters and achieve the best closed-loop performance.
We ought to stress here that the MPC scheme (7) provides a family of safe policies for all parameters θ, based on the best state-input sequence minimizing a finite-horizon parameterized cost function. Obviously, a richer parameterization of the stage cost and terminal cost provides a more extensive set of policies. Tuning the parameters θ then yields the optimal policy within the provided policy family. We detail Q-learning as a practical way of adjusting the parameters in Section V.
In order to tackle the chance constraint (7c), a natural measure of risk is the value-at-risk (VaR). For a random variable r and confidence level α, VaR_α is defined as follows:

VaR_α(r) := min{ x ∈ R | P[r ≤ x] ≥ α }.   (9)

In fact, VaR represents the worst-case loss with probability α.
Then one can show that:

P[ h_j(s) ≤ 0 ] ≥ α  ⟺  VaR_α( h_j(s) ) ≤ 0.   (10)

Unfortunately, VaR is in general non-convex, and optimization models involving VaR are numerically intractable for high-dimensional, non-normal distributions. An alternative measure of risk is the conditional value-at-risk (CVaR), defined as follows:

CVaR_α(r) := min_{x∈R} { x + (1/(1 − α)) E[(r − x)+] }.   (11)

Indeed, CVaR is a coherent risk measure satisfying conditions such as convexity and monotonicity [3]. Risk management with CVaR can be done quite efficiently: CVaR can be handled with convex and linear programming methods, while VaR is comparably complicated to optimize. Detailed benefits and concepts of CVaR can be found in, e.g., [31].
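For an empirical distribution, the minimization in the CVaR definition can be evaluated directly and agrees with the average of the worst (1 − α) fraction of outcomes. A quick illustrative sketch (not the paper's implementation):

```python
def cvar_tail(samples, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction."""
    n_tail = max(1, round(len(samples) * (1 - alpha)))
    return sum(sorted(samples)[-n_tail:]) / n_tail

def cvar_ru(samples, alpha):
    """Rockafellar-Uryasev form: min_x x + E[(r - x)+] / (1 - alpha);
    for an empirical distribution the minimum is attained at a sample,
    so it suffices to evaluate the objective at the sample values."""
    n = len(samples)
    obj = lambda x: x + sum(max(r - x, 0.0) for r in samples) / ((1 - alpha) * n)
    return min(obj(x) for x in samples)

losses = list(range(1, 11))        # r in {1, ..., 10}
print(cvar_tail(losses, 0.8))      # mean of {9, 10} = 9.5
print(cvar_ru(losses, 0.8))        # 9.5: the two formulas agree
```

The second form is the one that turns the CVaR constraint into the LP (15) inside the MPC scheme.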
It can be shown that CVaR upper-bounds VaR and that, as α → 1, CVaR approximates VaR increasingly accurately, i.e.:

VaR_α(r) ≤ CVaR_α(r),  so  CVaR_α( h_j(s) ) ≤ 0 ⟹ P[ h_j(s) ≤ 0 ] ≥ α.   (12)

Note that in engineering applications we are often interested in a very low probability of failure (α → 1). Then using CVaR, with its numerical and mathematical benefits, introduces very little conservatism. Using CVaR, MPC (7) can be approximated by replacing the chance constraint (7c) with:

CVaR_α( h_j(s_{i+k|k}) ) ≤ 0, ∀i ∈ I_{1:I}, ∀j ∈ I_{1:J}.   (13c)

At each time k, we first consider N_s independent and identically distributed (i.i.d.) samples of the disturbance w_i, denoted by ŵ^m_i, i ∈ I_{1:I}, m ∈ I_{1:N_s}. Then N_s scenarios are described as follows:

s^m_{i+k+1|k} = f( s^m_{i+k|k}, a^m_{i+k|k}, ŵ^m_i ),   (14)

where s^m_{i+k|k} and a^m_{i+k|k} are the predicted state and input of the m-th scenario at time k + i given time k. We then define auxiliary variables x^m_i for i ∈ I_{1:I}, m ∈ I_{1:N_s} in order to approximate CVaR in (13c) by the following tractable Linear Programming (LP) constraints, ∀i ∈ I_{1:I}:

η_i + (1/((1 − α) N_s)) Σ_{m=1}^{N_s} x^m_i ≤ 0,
x^m_i ≥ h_j( s^m_{i+k|k} ) − η_i,  x^m_i ≥ 0, ∀m ∈ I_{1:N_s}.   (15)

In [32], it has been shown that for N_s → ∞ the approximation in (15) converges to its exact value with probability one.
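The scenario construction in (14) can be sketched as follows (illustrative scalar model; the dynamics f, horizon, and disturbance law are placeholders, not the paper's system):

```python
import random

def draw_scenarios(f, s0, inputs, n_scenarios, sample_w):
    """Propagate n_scenarios i.i.d. disturbance sequences through the
    dynamics s_{i+1} = f(s_i, a_i, w_i); every scenario starts from the
    same measured state s0, mirroring s^m_{k|k} = s_k in the text."""
    trajectories = []
    for _ in range(n_scenarios):
        traj = [s0]
        for a in inputs:
            traj.append(f(traj[-1], a, sample_w()))
        trajectories.append(traj)
    return trajectories

random.seed(1)
f = lambda s, a, w: 0.9 * s + a + w          # illustrative scalar model
trajs = draw_scenarios(f, s0=1.0, inputs=[0.0] * 5, n_scenarios=4,
                       sample_w=lambda: random.uniform(-0.1, 0.1))
print(len(trajs), len(trajs[0]))  # 4 scenarios, horizon 5 -> 6 states each
```

The constraint functions h_j would then be evaluated along each trajectory to form the LP (15).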
Substituting (15) into (13) yields the sample-based SMPC scheme (16). From a theoretical point of view, SMPC (16) requires N_s → ∞ in order to accurately approximate the original MPC (13). In the next section, we introduce the DRMPC scheme to overcome this problem.

IV. DISTRIBUTIONALLY ROBUST MPC-BASED POLICY
In order to tackle the limited distributional information available from finitely many samples, we use Distributionally Robust Optimization (DRO) in the chance constraint of the MPC scheme. In this section, we suppress the subscript i, denoting the horizon index, to simplify the notation.
The core idea of the theoretical developments in this section was proposed in [12] for general optimization problems. For the sake of clarity, in the context of learning-based MPC, we detail these developments in this section.
We use the Wasserstein metric to define an ambiguity set as a ball around the empirical distribution P̂. The optimization is then solved with respect to the worst-case distribution in the ambiguity set. The empirical distribution is

P̂ := (1/N_s) Σ_{m=1}^{N_s} δ_{ŵ^m},   (17)

where δ_w is the Dirac measure concentrated at w. Then we define the Wasserstein ball D around the empirical distribution P̂ as the ambiguity set:

D := { P ∈ P(W) | d_W(P, P̂) ≤ ε },   (18)

where P(W) denotes the set of Borel probability measures on the support W, ε ≥ 0 is the radius of the ball, and d_W : P(W) × P(W) → R≥0 is the Wasserstein metric, defined as follows:

d_W(P_1, P_2) := inf_κ { ∫_{W×W} ||w_1 − w_2|| κ(dw_1, dw_2) | Π^l κ = P_l, l = 1, 2 },   (19)

for all distributions P_1, P_2 ∈ P(W), where Π^l κ denotes the l-th marginal of the transportation plan κ for l = 1, 2 [33]. Indeed, the Wasserstein distance between P_1 and P_2 can be interpreted as the minimum transportation cost of moving the probability mass from P_1 to P_2. Distributionally robust optimization then minimizes the worst-case cost over all distributions in the ambiguity set. The distributionally robust version of constraint (13c) can be written as follows:

sup_{P∈D} CVaR^P_α(c) ≤ 0,   (20)

where, for the sake of simplicity, we define the new variable c := max_j h_j(s). We then recall the definition of CVaR:

CVaR^P_α(c) = min_{η∈R} { η + (1/(1 − α)) E_P[(c − η)+] }.   (21)

We then use the minimax inequality for (20):

sup_{P∈D} min_{η∈R} { η + (1/(1 − α)) E_P[(c − η)+] } ≤ min_{η∈R} { η + (1/(1 − α)) sup_{P∈D} E_P[(c − η)+] }.   (22)

On the other hand, with ℓ(w) := (c − η)+, the inner worst-case expectation is the constrained optimization:

sup_{P∈P(W)} E_P[ ℓ(w) ]  s.t.  d_W(P, P̂) ≤ ε.   (23)

Then, using the Lagrangian function for the constrained optimization (23):

inf_{λ≥0} sup_{P∈P(W)} { E_P[ ℓ(w) ] + λ( ε − d_W(P, P̂) ) },   (24)

where λ ∈ R≥0 is the Lagrange multiplier. Using Theorem 1 in [34]:

sup_{P∈D} E_P[ ℓ(w) ] = min_{λ≥0} { λε + sup_{P∈P(W)} ( E_P[ ℓ(w) ] − λ d_W(P, P̂) ) }
                      = min_{λ≥0} { λε + (1/N_s) Σ_{m=1}^{N_s} sup_{w∈W} ( ℓ(w) − λ||w − ŵ^m|| ) }.   (25)

In fact, the first equality in (25) follows from the strong duality shown in [34], and the second equality holds because P(W) contains all the Dirac distributions supported on W.
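A small numerical check of the dual formula (25) for the simplest loss ℓ(w) = w (grids stand in for the inner suprema and the outer minimization; purely illustrative, with a hypothetical bounded support): shifting all probability mass right by the radius ε should be the worst case, so the worst-case mean equals the empirical mean plus ε.

```python
def worst_case_mean(samples, eps, support, n_grid=2001):
    """Evaluate the dual of sup_{P in Wasserstein ball} E_P[w] for the
    linear loss l(w) = w on a bounded 1-D support, via grids:
      min_{lam >= 0} lam*eps + (1/N) sum_m sup_w (w - lam*|w - w_m|)."""
    lo, hi = support
    w_grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    lam_grid = [j / 100.0 for j in range(0, 301)]   # lambda in [0, 3]
    best = float("inf")
    for lam in lam_grid:
        inner = sum(max(w - lam * abs(w - wm) for w in w_grid)
                    for wm in samples) / len(samples)
        best = min(best, lam * eps + inner)
    return best

samples = [0.2, 0.4, 0.6]   # empirical mean 0.4
# With support [0, 10], the worst case shifts all mass right by eps = 0.5:
print(worst_case_mean(samples, eps=0.5, support=(0.0, 10.0)))  # ~0.9
```

The minimizing multiplier is λ = 1, the Lipschitz constant of ℓ, in line with the interpretation of λ as the transport price.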
Introducing new auxiliary variables y_m, we can rewrite (25) as follows:

min_{λ≥0, y}  λε + (1/N_s) Σ_{m=1}^{N_s} y_m   (26a)
s.t.  sup_{w∈W} ( ℓ(w) − λ||w − ŵ^m|| ) ≤ y_m, ∀m ∈ I_{1:N_s},   (26b)

where y = {y_m}_{m=1}^{N_s}. From the definition of the dual norm, we decompose the expression inside the supremum in constraint (26b) as follows [12]:

λ||w − ŵ^m|| = max_{||ξ||⋆ ≤ λ} ⟨ξ, w − ŵ^m⟩,   (27)

where ||·||⋆ := sup_{||ξ||≤1} ⟨·, ξ⟩ is the dual norm. Since {ξ | ||ξ||⋆ ≤ λ} is a convex set, we use the minimax inequality, and (27) allows constraint (26b) to be expressed through conjugate and support functions. Optimization (26) can then be written as the finite convex program (30), where [−c]* is the conjugate of −c evaluated at ξ_{m,2} − v_m. Note that under Assumptions 1 and 2, (30) is a finite convex program [12]. Restoring the index i, the DRMPC scheme based on the Wasserstein metric reads as (31), shown at the bottom of the next page, where ξ = {ξ_{i,1}, ξ_{i,2}}_{i=0}^{I}. The parameterized safe policy π^DRMPC_θ based on the DRMPC scheme at time instance k is then extracted as follows:

π^DRMPC_θ(s_k) = a⋆_{k|k},   (32)

where a⋆_{k|k} is the first input of the optimal solution of DRMPC (31). Note that the optimal first inputs a^{m⋆}_{k|k} are identical across all scenarios, since the random samples are generated from the same given state s^m_{k|k} = s_k and the uncertainty cannot be anticipated [35]. We therefore select any of the optimal a^{m⋆}_{k|k} as a⋆_{k|k}.

A. OUT-OF-SAMPLE GUARANTEE
Unlike the SAA method, Wasserstein DRMPC provides a probabilistic guarantee on the out-of-sample performance with finitely many samples. More specifically, let us consider the following inequality:

CVaR^P_α( h_j( s⋆_{k+i|k} ) ) ≤ 0,   (33)

where s⋆_{k+i|k} is the optimal solution of (31) and P is an unknown arbitrary distribution. It is then desirable to fulfill inequality (33) with high probability, i.e.:

P^{N_s}[ (33) holds ] ≥ 1 − β,   (34)

where β is a user-specified confidence level. It has been shown in [12] that if the Wasserstein radius ε_i is chosen as

ε_i(β) = ( log(c_1 β^{−1}) / (c_2 N_s) )^{1/max(d,2)},   (35)

then (34) holds, where c_1, c_2 are positive constants. In fact, we have assumed that the measure concentration inequality holds [36], i.e., B := E_P[ exp(||w||^a) ] < ∞ for some a > 1 (light-tailed distribution); then c_1, c_2 depend on a, B, and the disturbance dimension.
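The qualitative behavior of the radius rule (35) — the ambiguity ball shrinking as more samples arrive — can be sketched as follows (c_1, c_2 are illustrative placeholders; the true constants depend on a, B, and the disturbance dimension as stated above):

```python
import math

def wasserstein_radius(n_samples, beta, c1=2.0, c2=1.0, d=2):
    """Radius epsilon(beta) = (log(c1 / beta) / (c2 * N))^(1/max(d,2)),
    as in (35); c1, c2 here are hypothetical placeholder constants."""
    return (math.log(c1 / beta) / (c2 * n_samples)) ** (1.0 / max(d, 2))

radii = [wasserstein_radius(n, beta=0.05) for n in (10, 100, 1000)]
print(radii)  # radius shrinks as more disturbance samples are drawn
```

A tighter confidence requirement (smaller β) enlarges the ball, while more data shrinks it, trading conservatism against the out-of-sample guarantee (34).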
We must emphasize that analysis and (probabilistic) finite-sample guarantees are essential in the context of RL and stochastic optimization because, in practice, there is typically limited access to real system data. Such analysis can cover various criteria in the RL context, such as convergence rate [37], regret [38], and performance [39].
The next Proposition summarizes the theoretical development of this section.
Proposition 1: Under Assumptions 1 and 2, DRMPC admits the tractable reformulation (31), and the extracted policy π^DRMPC_θ, based on N_s finite i.i.d. samples, satisfies (34) ∀k ∈ Z, ∀i ∈ I_{1:I}, ∀θ ∈ R^p, for user-specified β and α and any distribution P, if ε_i is selected according to (35).

B. FEASIBILITY PRE-FILTRATION
As discussed, satisfying a state-dependent hard constraint with α = 1 is generally impossible. The same problem arises when the required α is higher than what the nature of the problem allows. This manifests itself in solving (31) when no solution can be found.
This problem is known as infeasibility of the optimization. A common way to resolve the feasibility issue is to soften the constraints using slack variables: positive decision variables that allow the inequalities to be violated, while the violation is penalized in the cost function.
A common way to use slack variables is to add them to the original cost function. However, in this case, finding proper positive coefficients is still challenging. Another way is to build an optimization problem as a pre-filtration that finds feasible slack variables and then applies them to the original optimization problem. More specifically, we consider the pre-filtration problem (36): over the variables s, a, η, λ, y, ξ, v and nonnegative slacks σ_i, minimize the sum of the slacks subject to (31b), with constraint (31c) softened by σ_i, and denote its optimal solutions by σ⋆_i. We then replace constraint (31c) with its softened counterpart (37), in which the zero right-hand side is replaced by σ⋆_i, ∀i ∈ I_{1:I}. Then the DRMPC scheme always has a feasible solution.
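The two-stage pre-filtration idea can be illustrated on a toy one-dimensional problem (a grid search stands in for the actual convex solver; all numbers are illustrative): the original constraints are infeasible, stage one finds the smallest slack that restores feasibility, and stage two re-solves the original objective under the relaxed constraint.

```python
def grid_min(objective, feasible, grid):
    """Minimize over a finite grid, restricted to feasible points;
    returns None when no grid point is feasible (infeasibility)."""
    pts = [x for x in grid if feasible(x)]
    return min(pts, key=objective) if pts else None

grid = [i / 100.0 for i in range(-200, 201)]    # x in [-2, 2]
hard = lambda x: x <= 0.0 and x >= 1.0          # original constraints clash
print(grid_min(lambda x: x, hard, grid))        # None: infeasible

# Stage 1 (pre-filtration): smallest slack sigma making x <= sigma
# feasible alongside the unrelaxable requirement x >= 1.
sigmas = [s / 100.0 for s in range(0, 201)]
sigma_star = next(s for s in sigmas
                  if any(x <= s and x >= 1.0 for x in grid))
print(sigma_star)  # 1.0: the minimum violation the problem forces

# Stage 2: solve the original objective with the relaxed constraint.
relaxed = lambda x: x <= sigma_star and x >= 1.0
x_star = grid_min(lambda x: (x - 2.0) ** 2, relaxed, grid)
print(x_star)  # 1.0
```

This avoids hand-tuning penalty weights: the slack is made exactly as large as feasibility requires, no larger.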
Note that, inverting the procedure by which DRMPC (31) was obtained, one can see that DRMPC (31) with the feasible constraint (37) is equivalent to a robust constraint of the form (38), while (20) may be infeasible. Note that the DRMPC scheme provides a family of safe policies π^DRMPC_θ for all tuning parameters θ. Therefore, in the next stage, it is necessary to update the parameters to achieve the best performance. The next section details the Q-learning method as a practical way of updating the parameters θ to achieve the best closed-loop performance.

V. Q-LEARNING BASED ON DRMPC SCHEME
Q-learning is a powerful, well-known, and popular method in the field of RL, practical thanks to its relatively low computational cost, especially in engineering and economic applications [40]. In fact, Q-learning is a classical model-free RL algorithm that tries to capture the optimal action-value function by tuning the parameter vector θ so that Q_θ ≈ Q⋆, where Q_θ is the parameterized action-value function and Q⋆ is the optimal action-value function [41]. The optimal action-value function Q⋆ is defined as follows:

Q⋆(s, a) := L(s, a) + γ E[ min_{a'} Q⋆(s_+, a') | s, a ],   (39)

where s_+ denotes the successor state. The parameterized action-value function Q_θ(s_k, a_k) based on the DRMPC scheme (31) can be formulated as problem (40), obtained from (31) by adding the constraint

a_{k|k} = a_k,   (40c)

while the approximation of the value function V_θ is recovered from (40) when constraint (40c) is removed. One can then verify that the MPC-based action-value function and value function satisfy the fundamental Bellman equations [17]. Q-learning solves the following Least Squares (LS) problem:

min_θ E[ ( Q⋆(s_k, a_k) − Q_θ(s_k, a_k) )² ].   (41)

In order to solve (41), the Temporal-Difference (TD) method uses the following update rule for the parameters θ at state s_k [42]:

δ_k = L(s_k, a_k) + γ V_θ(s_{k+1}) − Q_θ(s_k, a_k),
θ ← θ + ζ δ_k ∇_θ Q_θ(s_k, a_k),   (42)

where the scalar ζ > 0 is the learning step size, δ_k is labelled the TD error, and the input a_k is selected according to the corresponding parametric policy π^DRMPC_θ(s_k), possibly with the addition of a small random exploration that preserves safety. The gradient ∇_θ Q_θ required in (42) can be obtained by a sensitivity analysis of the DRMPC scheme (40), as detailed in [17] for generic MPC schemes.
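The TD update (42) can be illustrated in its simplest tabular form, where θ holds one entry per state-action pair (a toy two-state MDP with cost minimization, not the DRMPC-based approximator of the paper):

```python
import random

def q_learning(episodes=500, gamma=0.5, zeta=0.5, max_steps=50):
    """Tabular Q-learning with the TD rule
        Q(s,a) += zeta * (L + gamma * min_a' Q(s',a') - Q(s,a)).
    Toy MDP: state 0 offers action 0 (stay, cost 1) and action 1
    (jump to the absorbing terminal state 1, cost 3)."""
    Q = {(0, 0): 0.0, (0, 1): 0.0}
    random.seed(0)
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = random.choice([0, 1])           # pure exploration
            if a == 0:
                cost, s_next = 1.0, 0
            else:
                cost, s_next = 3.0, 1           # terminal
            v_next = 0.0 if s_next == 1 else min(Q[(s_next, b)] for b in (0, 1))
            Q[(s, a)] += zeta * (cost + gamma * v_next - Q[(s, a)])
            if s_next == 1:
                break
            s = s_next
    return Q

Q = q_learning()
print(Q)  # Q(0,0) -> 2, Q(0,1) -> 3: staying is optimal under gamma = 0.5
```

The fixed point matches the Bellman equation (39): Q(0,0) = 1 + 0.5 · min(Q(0,0), Q(0,1)) gives 2, and Q(0,1) = 3 since the terminal state has zero value.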
In order to generate a_k, we first add a small exploration noise to the policy, i.e.:

a^e_k = π^DRMPC_θ(s_k) + ρ_k,   (43)

where ρ_k ∈ E is a random variable providing the exploration. One can easily observe that a^e_k may not deliver a safe input. Therefore, a safety filtration based on the DRMPC scheme is needed to provide safe exploration. More specifically, consider the parametric DRMPC scheme (44), which is (31) with the additional parameter ρ_k entering the cost so as to steer the first input toward a^e_k. Then a_k(θ, s_k, ρ_k) = a⋆_{k|k}(θ, s_k, ρ_k) delivers a safe input after exploration, where a⋆_{k|k} is the first input of the optimal solution of (44). Fig. 1 illustrates the safe exploration based on the DRMPC scheme; the DRMPC safe set is defined as follows:

DRMPC safe set := { a_{k|k} | ∃ s, a, η, λ, y, ξ, v : (40b) }.

FIGURE 1. Illustration of safe exploration for the Q-learning method in a 2-input system. The safe exploration input satisfies a_k ∈ safe exploration set, while a^e_k ∈ exploration set and π^DRMPC_θ(s_k) ∈ DRMPC safe set.

In the policy gradient method, the projection technique results in a bias in the gradient of the performance function [43]. Roughly speaking, this is because the safe exploration set may not be a centered ball, as shown in Fig. 1. We have proposed a robust MPC scheme in [44] to solve this bias issue. The proposed method can easily be applied to the DRMPC scheme for the policy gradient method, but doing so is outside the focus of the current work. Fig. 2 illustrates the proposed safe learning method using the DRMPC scheme.
Remark 1: The proposed method can be applied to general nonlinear stochastic dynamics with an unknown distribution of the stochasticity. Obviously, the computational effort increases as the dimension and complexity of the dynamics grow.
Remark 2: In this paper, we do not focus on the convergence of the learning method. It is well known that, under mild assumptions, the Q-learning technique generates a sequence of parameters θ that converges to the parameters best estimating the exact optimal action-value function. The extracted policy is then an optimal policy among the provided safe policies. Convergence conditions for the Q-learning method can be found in, e.g., [45].
Remark 3: Closed-loop stability of the policy resulting from an MPC scheme, for the nominal model used in the MPC scheme, is straightforward under mild assumptions on the stage cost, terminal cost, and constraints. However, these conditions are not painless for general stochastic systems and for stochastic and robust MPC. This aspect is not the main scope of the current work; we note, however, that closed-loop stability properties in the functional space have recently been addressed in [46] for general stochastic systems (MDPs).
The proposed approach is summarized in Algorithm 1. At each time step k, it applies the safe exploration using (43) and (44) to get the input a_k, applies the input a_k to the dynamics (1) to get s_{k+1}, and updates the parameters θ_{k+1} ← θ_k using the Q-learning technique, e.g., (42) (the radii ε_i are among the parameters); at the end, the last parameters θ_0 = θ_{k+1} are saved.
The next section provides a numerical case study for the proposed method.

VI. NUMERICAL SIMULATION
In this section, we consider Wheeled Mobile Robot (WMR) path planning while avoiding static obstacles. The stochastic nonlinear dynamics read:

x_{k+1} = x_k + t_e v_k cos(φ_k) + w_{k,1},
y_{k+1} = y_k + t_e v_k sin(φ_k) + w_{k,2},   (46)
φ_{k+1} = φ_k + t_e ψ_k + w_{k,3},

where s_k = [x_k, y_k, φ_k]⊤, a_k = [v_k, ψ_k]⊤, and ||w_k||_∞ ≤ 0.1 are the system state, input, and disturbance, respectively.
Here x_k and y_k are the position of the robot in the plane and φ_k ∈ [−π, π] is the orientation angle. The sampling time t_e is selected as 0.2 s for the simulation. The control inputs v_k and ψ_k are the linear and angular velocities, respectively, and are restricted by the input constraints (47). For simplicity, we consider obstacles of elliptic shape. Hence, the obstacle avoidance condition can be written as the following inequality:

( (x − o_{x,j}) / r_{x,j} )² + ( (y − o_{y,j}) / r_{y,j} )² ≥ 1,

where (o_{x,j}, o_{y,j}) and (r_{x,j}, r_{y,j}) are the center and radii of the j-th ellipse (j = 1, . . . , J), respectively, and J is the number of obstacles. First, we simulate SMPC with CVaR constraints based on sample average approximation, as well as DRMPC, and compare them with deterministic MPC. As shown in Fig. 3, there are some constraint violations in the deterministic MPC scheme. As the probability level α increases, the distance between the path and the obstacle increases in SMPC. As mentioned, this method usually requires a large amount of data to capture the chance constraint accurately. Moreover, as shown in Fig. 3, the path planned using DRMPC is farther from the obstacle. We then consider for RL the stage cost L(s, a) shown in (48) at the top of the page, combining input and tracking penalties of the form ||a|| + |φ| + |x − 8| + |y| with a logarithmic barrier term r, where τ and ω are small positive constants. Since h_j depends only on x and y, the function r also depends only on x and y. Note that the logarithmic barrier function is inspired by the constrained optimization context [47]; it can be evaluated for every s, while taking a large value when the constraints are violated. Fig. 4 illustrates r(x, y). We include the radius of the Wasserstein ball among the DRMPC parameters in order to tune it using Q-learning. Fig. 5 shows the average stage cost during each mission. As can be seen, the average stage cost decreases over five missions for both SMPC and DRMPC. However, DRMPC has lower average costs, and Q-learning is more effective in the DRMPC scheme than in the SMPC scheme.
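A minimal simulation sketch of the WMR model and the elliptic obstacle test (assuming a standard Euler-discretized unicycle model consistent with the variables described in the text; the paper's exact (46)-(48) may differ in detail):

```python
import math

def wmr_step(s, a, w, t_e=0.2):
    """One Euler step of a unicycle-type WMR model:
    s = (x, y, phi), a = (v, psi), additive disturbance w."""
    x, y, phi = s
    v, psi = a
    return (x + t_e * v * math.cos(phi) + w[0],
            y + t_e * v * math.sin(phi) + w[1],
            phi + t_e * psi + w[2])

def in_obstacle(x, y, ox, oy, rx, ry):
    """Elliptic obstacle check: inside iff the normalized squared
    distance to the center (ox, oy) with radii (rx, ry) is below one."""
    return ((x - ox) / rx) ** 2 + ((y - oy) / ry) ** 2 < 1.0

# Disturbance-free sanity check: driving straight (phi = 0, psi = 0)
# at v = 1 for 5 steps covers 5 * 0.2 * 1 = 1.0 in x.
s = (0.0, 0.0, 0.0)
for _ in range(5):
    s = wmr_step(s, (1.0, 0.0), (0.0, 0.0, 0.0))
print(s)                                         # ~ (1.0, 0.0, 0.0)
print(in_obstacle(*s[:2], 1.0, 0.0, 0.5, 0.5))   # True: robot at the center
```

In the closed-loop experiments, the disturbance entries would be drawn from the unknown distribution bounded by ||w_k||_∞ ≤ 0.1, and the obstacle check would enter the scheme through the constraint functions h_j.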
The greater improvement in the DRMPC scheme is due to the additional freedom and parameters in the provided policies, such as the radius of the Wasserstein ball around the empirical distribution, whereas the standard SMPC scheme has no such parameter. Obviously, tuning the radius of the Wasserstein ball, and consequently adjusting the conservatism of the safe policy, positively impacts the closed-loop performance.

VII. CONCLUSION
In this paper, we proposed using a tractable Distributionally Robust MPC (DRMPC) scheme in order to provide a policy for Reinforcement Learning (RL) that is safe by construction. DRMPC optimizes the cost function subject to the worst-case distribution in a given statistical ball around the empirical distribution, whose radius is measured using the Wasserstein metric. Moreover, Conditional Value at Risk (CVaR) was used as a convex approximation of the chance constraints in the DRMPC scheme. We used Q-learning to update the parameters of the DRMPC scheme and showed the efficiency of the method in the path planning of a Wheeled Mobile Robot (WMR). Considering model mismatch, joint chance constraints, and neural-network-based cost functions in the DRMPC scheme are directions for future work.