Probabilistic Safety Guarantees for Markov Decision Processes

Rafal Wisniewski, Senior Member, IEEE, and Manuela L. Bujorianu

Abstract: This article aims to incorporate safety specifications into Markov decision processes. Explicitly, we address the minimization problem up to a stopping time with safety constraints. We establish a formalism leaning upon the evolution equation to achieve our goal. We show how to compute the safety function with dynamic programming. In the last part of this article, we develop several algorithms for safe stochastic optimization using linear and dynamic programming.

I. INTRODUCTION
The point of departure is a Markov decision process (MDP) with a finite number of states and actions. The overall objective of this article is twofold: 1) to formulate stochastic safety as dynamic programming (DP) and 2) to incorporate probabilistic safety guarantees into the stochastic optimization of MDPs. To the best of our knowledge, a method of directly unifying MDPs and safety is missing in the literature. Undoubtedly, several complementary approaches tackle safety in optimization; they are described in the related-work paragraph.

a) Motivation: Recently, the subject of DP has enjoyed a resurgence [1], [2]. The explanation for this increase in popularity is reinforcement learning (RL), a powerful and prevalent method for learning from data and subsequently generating optimal decisions [3]. In a nutshell, DP provides the mathematical structure for RL. Applications of DP can be found in robotics [4], autonomous vehicles [5], drones [6], and water networks [7], to name a few examples. On the other hand, [8] showed that for optimization problems where the constraints are formulated as cost functions, the principle of optimality does not hold (for multichain MDPs), and the value function depends on the initial distribution. Consequently, such optimization problems cannot be solved by DP. In contrast, linear programming (LP) provides the means of solving constrained MDP problems [9], [10]. Specifically, in this work, we strive to combine the results on constrained MDPs with safety [11]. Safety assigns the probability of reaching the undesired states, the forbidden set. The intended result is an optimization algorithm that keeps the system on the desired safety level. Specifically, the probability that the process realizations hit the forbidden states before reaching the target set remains below a certain value p. This is the concept of p-safety introduced in [12]. This definition of safety is related to the reach-avoidance problem thoroughly studied in
formal verification, where the problem of safety can be formulated as a temporal logic specification [13].
b) Novelty: The approach of this article fruitfully combines ideas from constrained MDPs with the concept of p-safety. The p-safety formalism is a rigorous mathematical framework that encapsulates most of the probabilistic safety formulations in the literature: the standard probabilistic reachability problem, the reach-avoidance problem, and bounded probabilistic reachability. The analytical reasoning in the current work leans upon elements of probabilistic potential theory. Already in [14] and [15], it has been shown that the analytical approach based on potential theory provides straightforward proofs of the barrier certificate's properties.
The list of original contributions of this work includes the following.
1) p-safety is reformulated as a DP problem.
2) The evolution equation, relating the initial, the hitting, and the occupation measures, is introduced into DP.
3) The safe MDP is formulated as optimization with constraints. The resulting formulation involves two occupation measures, for safety and for optimality, which are subsequently combined in LP.

c) Related work:
The subject of minimizing an expected cost without safety constraints is not new, and it is well known that its solution is obtained by solving Bellman's equation [2] or an LP [16]. The safety verification problem for stochastic systems has also been addressed in the literature [17]. The article [11] extended the approach based on barrier certificates to the discrete setting of Markov chains (MCs). Pragmatic methods for safe DP and RL have been addressed in [18]. In that reference, safety is ensured with a barrier function, which serves as a soft constraint on the system. On the other hand, the work [19] proposes a supervisor that prevents the applied control action from driving the system into unsafe regions.
Several approaches have been grafted onto different model predictive control (MPC) techniques in the context of safe learning. The shielding approach [20], [21] synthesizes a backup policy that is proven safe and subsequently uses it to override the learned policy in order to guarantee safety. Another approach is verifying safety on the fly using MPC safety certification [22]. A similar research line is adaptive RL [23], [24], where safety is computed for the next k steps and unsafe actions are blocked. There is an intrinsic tradeoff when choosing the number k of steps. If k is too small, an MDP might end up in a state where all actions are unsafe even though a safe policy exists. If k is too large, the complexity of the shielding algorithm for blocking unsafe actions becomes prohibitive.

d) Approach in this work:
We take as the starting point an MDP with a (stationary) policy that, for each state, provides the probability of choosing a particular control action. Nonetheless, we face a challenge. To compute the optimal path to the target states, we need a random time at which the process reaches the target set before hitting the forbidden set. Our solution to this challenge is to use the evolution equation [25], which relates the occupation measure to the hitting probability. The occupation measure corresponds to the expected number of visits to the states. When examining the hitting probability, we consider two sets: the set of target states and the set of forbidden states. Consequently, a safety function S_π is derived from the evolution equation. The safety function provides the probability of hitting the forbidden set. It is shown that the safety function is the solution of Bellman's equation and can be computed by an iterative procedure analogous to the one used for computing the value function in DP. In the second part of this article, we combine stochastic optimization with safety guarantees. Consequently, we formulate the optimization with a constraint: minimizing the value function V_π subject to keeping the system p-safe, S_π ≤ p. Here, the cost and the safety are the expected values of accumulated rewards up to stopping times. To this end, we reformulate the LP in [10] to address the time horizon specified by the two stopping times.
In the final section, we relax the concept of safety. Subsequently, we develop a local optimization algorithm, meaning that the control action at each state i is computed only using the information available from its neighbors. We introduce a local concept of safety, called relative safety. Equipped with this new concept, we define an optimization problem.
e) Organization of this article: We shed light on the preliminary objects of this work, MCs and MDPs, in Sections II and III. Specifically, the focus in these sections is on formulating the evolution equation for the occupation and hitting measures. The stochastic optimization with a stopping time is the matter of Section IV. It is shown in Section V that the safety function can be computed as the accumulated cost of the probability of entering the unsafe set; hence, the safety function is the solution to Bellman's equation. The main result, an algorithm for safe MDPs, is developed in Section VI.

A. Notation
For a countable set U and a set R, we write R^U := R^{|U|} for the Cartesian product of |U| copies of R. We use I_U to denote the identity matrix on U, and 1_U to denote the vector of ones on U. For two vectors v, w ∈ R^m, we will use the Hadamard product v • w defined by (v • w)(i) = v(i)w(i). The notation v ≥ 0 means v(i) ≥ 0 for all i ∈ {1, ..., m}. The Kronecker delta, denoted by δ_j, is δ_j(i) = 1 for i = j, and 0 otherwise.

II. MCS AND EVOLUTION EQUATION
Let X be a countable set of states denoted by letters i, j, .... A probability distribution ν on X, (ν(j))_{j∈X}, is thought of as a row vector ν ∈ R^X_{≥0} (with Σ_{j∈X} ν(j) = 1). A function f : X → R is thought of as a column vector f = (f(j))_{j∈X}.
Suppose that (X_t) := (X_t)_{t∈N} is a discrete-time (homogeneous) MC with transition probabilities p_ij := P[X_{t+1} = j | X_t = i]. The transition matrix P of (X_t) is P := (p_ij)_{i,j∈X}; the k-step transition probabilities are collected in P^k, where P^k = PP···P is the k-fold matrix product.
Let H be an arbitrary subset of X, which will be kept fixed. We will call H the taboo set. Later in this article, the taboo set will be the complement of the union of the set of all goal states and the set of forbidden states. We restrict the transition probabilities of the MC (X_t) to the set H. These are the taboo transition probabilities [26]. We collect the taboo transition probabilities into the matrix Q := (p_ij)_{i,j∈H}. In this case, the matrix Q is substochastic, i.e., the row sums satisfy Σ_{j∈H} p_ij ≤ 1.

We introduce the occupation (Green) operator of H
G := Σ_{k=0}^∞ Q^k.  (2)
G is well defined if the states in H are transient, i.e., for all i ∈ H, we have P[X_k = i for infinitely many k | X_0 = i] = 0. From (2), it follows that
G(I_H - Q) = (I_H - Q)G = I_H  (3)
where, recall, I_H is the identity matrix on H.
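As a concrete illustration, the Green operator of a small taboo set can be computed either in closed form via relation (3) or as a truncated Neumann series. The sketch below uses a hypothetical 3-state substochastic matrix Q; the numbers are illustrative and not taken from the article.

```python
import numpy as np

# Green operator of a taboo set H: G = sum_{k>=0} Q^k = (I_H - Q)^{-1}.
# Q is a hypothetical 3-state substochastic taboo transition matrix
# (row sums < 1, so every state of H is transient).
Q = np.array([
    [0.5, 0.2, 0.0],
    [0.1, 0.4, 0.3],
    [0.0, 0.2, 0.6],
])

G = np.linalg.inv(np.eye(3) - Q)          # closed form from relation (3)

# The same operator as a truncated Neumann series sum_{k=0}^{199} Q^k.
G_series = sum(np.linalg.matrix_power(Q, k) for k in range(200))

assert np.allclose(G, G_series, atol=1e-8)
# G[i, j] is the expected number of visits to j, starting from i,
# before the chain leaves H.
```

The truncation error of the series decays geometrically because the spectral radius of a substochastic Q with transient states is strictly below 1.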

A. Evolution of the MC
To study the reach-avoidance problem (reach the target set while avoiding the forbidden set), we examine the process up to the first hitting time of a target set or a forbidden set. We associate a reward with each state and ask two questions: What is the cost of getting to the target set, and what is the probability that the process reaches the forbidden set before the target set? To this end, we will use the evolution equation relating the occupation measure and the hitting probability, which we characterize first.
Suppose that τ is a stopping time; for instance, τ = τ_E is the first hitting time of some set E, i.e., τ_E := min{t ≥ 0 | X_t ∈ E}. The remaining part of this article assumes that the stopping time τ is finite almost surely (a.s.). Specifically, this is the case if the states in H are transient. Suppose that D is a subset of X. Let ρ_{<τ}(D) be the random variable that describes the amount of time the MC spends in D before time τ has passed
ρ_{<τ}(D) := Σ_{t=0}^{τ-1} 1_D(X_t).  (4)
The (state) occupation measure γ_{<τ} for (X_n) is defined as the expectation of ρ_{<τ}(·) in (4), i.e., γ_{<τ}(D) := E ρ_{<τ}(D). As its name suggests, γ_{<τ} is a measure. We define the integral w.r.t. γ_{<τ} of a vector function f as
⟨γ_{<τ}, f⟩ := Σ_{i∈X} γ_{<τ}(i) f(i).  (6)
This equation will be instrumental for computing the accumulated cost of the process until the stopping time τ. The (state) hitting measure λ_τ(D) is the expectation of the time the process lies in a set D ⊂ X at the time τ, i.e., λ_τ(D) := P[X_τ ∈ D]. We define the hitting operator corresponding to the stopping time τ as the integral of a function f with respect to λ_τ
⟨λ_τ, f⟩ := Σ_{j∈X} λ_τ(j) f(j).
Specifically, let E and U be disjoint subsets of X. We think of E as a target set and U as a forbidden set. Suppose that τ = τ_{U∪E}, the first hitting time of the union of U and E. Then, ⟨λ_{τ_{U∪E}}, 1_U⟩ is the
probability that the process hits the forbidden set before the target set. This relation will be instrumental for the computation of safety. When the initial state is i, we employ the probability P_i and use the notations γ^i_{<τ} and λ^i_τ. Similarly, for the initial probability μ, we use P_μ and the notations γ^μ_{<τ} and λ^μ_τ. The occupation measure and the hitting probability are connected by the adjoint or evolution equation [25]
⟨λ^μ_τ, f⟩ = ⟨μ, f⟩ + ⟨γ^μ_{<τ}, Lf⟩  (9)
where L is the generator, L = P - I_X. The measure triplet (μ, γ^μ_{<τ}, λ^μ_τ) with γ^μ_{<τ}(i) = 0 for i ∈ X \ H and λ^μ_τ(j) = 0 for j ∈ H uniquely characterizes the given Markov process [27].
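The evolution equation can be checked numerically on a toy chain. The sketch below uses a hypothetical 4-state chain with taboo set H = {0, 1}, target state 2, and forbidden state 3; it builds the occupation measure from the Green operator and verifies the measure form λ = μ + γL componentwise.

```python
import numpy as np

# Toy check of the evolution equation lambda = mu + gamma L with
# generator L = P - I.  States: H = {0, 1}; states 2 (target) and
# 3 (forbidden) are absorbing.  All numbers are hypothetical.
P = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.3, 0.3, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
H = [0, 1]
mu = np.array([1.0, 0.0, 0.0, 0.0])       # start in state 0

Q = P[np.ix_(H, H)]
gamma = np.zeros(4)
gamma[H] = mu[H] @ np.linalg.inv(np.eye(2) - Q)   # occupation measure

L = P - np.eye(4)                                  # generator
lam = mu + gamma @ L                               # evolution equation

assert np.allclose(lam[H], 0.0)       # no hitting mass in the taboo set
assert np.isclose(lam.sum(), 1.0)     # tau is finite a.s.
# lam[3] is the probability of hitting the forbidden state before the
# target state, i.e., the safety function evaluated at state 0.
```

The two assertions mirror the support conditions of the measure triplet: the occupation measure lives on H and the hitting measure on its complement.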

III. MDPS AND EVOLUTION EQUATION
We suppose that (U_t) is a process with values in a countable set U of actions and study the conditional probabilities P[X_{t+1} = j | X_t = i, U_t = u] for i, j ∈ X and u ∈ U. We remark that the Markov property holds for the MDPs, and we introduce the transition probabilities
p_iuj := P[X_{t+1} = j | X_t = i, U_t = u]
where (i, u, j) ∈ X × U × X.
By a Markov policy, we understand a family of stochastic kernels π_t = (π_{t,iu}), where π_{t,iu} is the probability of choosing the action u at the state i at time t. We think about the policy π as the to-be-designed stochastic control. In this work, we restrict our attention entirely to stationary policies; a Markov policy is stationary if π_{t,iu} does not depend on the time, i.e., π_{t,iu} = π_iu for all t. To conclude, the stationary policy π is seen as the (possibly infinite-dimensional) matrix
π := (π_iu)_{(i,u)∈X×U}  (10)
with entries between 0 and 1, corresponding to the probability that at the state i, the control action has the value u.
Let D be the standard simplex in R^U. So, for each fixed i, the probabilities (π_iu)_{u∈U} belong to D. We will write π ∈ D^X (the Cartesian product of |X| copies of the set D) even though we mean its matrix representation (π_iu)_{(i,u)∈X×U} with Σ_{u∈U} π_iu = 1 for every i ∈ X. For a stationary policy π, using the law of total probability, the transition probabilities of the induced chain are
p_ij(π) = Σ_{u∈U} π_iu p_iuj.
For the policy π, we define the transition probability matrix P(π) := (p_ij(π))_{i,j∈X}. A straightforward calculation shows that the operator P(π) on vector functions f has the following expression:
P(π)f = Σ_{u∈U} π_u • (P(u)f)
where • is the Hadamard product, P(u) := (p_iuj)_{i,j∈X}, and π_u := (π_iu)_{i∈X} is thought of as a row measure associated to each action u ∈ U. This "factorization" will be a key tool for obtaining most expressions in the following sections.
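The factorization of P(π) can be verified directly. The matrices P(u) and the policy π below are hypothetical; the check compares the induced transition matrix acting on a test function with the Hadamard-product form.

```python
import numpy as np

# Verify P(pi) f = sum_u pi_u . (P(u) f), where "." is the Hadamard
# product, P(u) holds the transition probabilities for a fixed action u,
# and pi_u is the column of the (hypothetical) policy matrix for u.
P_u = {
    0: np.array([[0.7, 0.2, 0.1],
                 [0.4, 0.4, 0.2],
                 [0.1, 0.1, 0.8]]),
    1: np.array([[0.2, 0.5, 0.3],
                 [0.1, 0.6, 0.3],
                 [0.3, 0.3, 0.4]]),
}
pi = np.array([[0.5, 0.5],     # pi[i, u]: probability of action u at state i
               [0.9, 0.1],
               [0.2, 0.8]])

# Induced transition matrix: p_ij(pi) = sum_u pi_iu * p_iuj.
P_pi = sum(pi[:, u][:, None] * P_u[u] for u in P_u)

f = np.array([1.0, -2.0, 3.0])           # arbitrary test function
lhs = P_pi @ f
rhs = sum(pi[:, u] * (P_u[u] @ f) for u in P_u)   # Hadamard form

assert np.allclose(lhs, rhs)
assert np.allclose(P_pi.sum(axis=1), 1.0)          # P(pi) is stochastic
```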

A. State-Action Occupation Measure
For a target set E, let the taboo set be H := X \ E. At the outset, we recall the notion of a reward
ρ : X × U → R_{≥0}.
The function ρ induces a process (ρ_t) by ρ_t := ρ(X_t, U_t). We suppose that the process (U_t) is generated by a policy π, which will be characterized in the following. Let τ be a stopping time, for example, the first hitting time of a set. The cost for the policy π up to time τ is
V_π(i) := E_i Σ_{t=0}^{τ-1} ρ(X_t, U_t)  (14)
and the vector V_π := (V_π(i))_{i∈X}. The aim of DP is to evaluate the cost function V_π and subsequently to find a minimizing stationary policy. To meet this aim, this article will use the evolution equation defined in the last section. All the objects used below depend on the policy π; herein, the expectation, the transition matrix, the occupation measure, and the hitting probability. Therefore, to enhance readability, we occasionally suppress π from the notation. At the outset, notice that in (14), since τ is a random variable, the expectation operator cannot be moved under the summation symbol, as is customarily done in standard DP and RL (see [2] and [3]).
In the realm of altering policies, we enhance the evolution equation to capture the frequencies of visiting the states and actions. We examine the process (X_t, U_t) with initial distribution μ of (X_0, U_0). Moreover, because of policy stationarity, the initial distribution of U_0 has no effect on X_t or U_t for t > 0. Let μ(·) = Σ_{u∈U} μ(·, u). To this end, we define a state-action occupation measure by
γ_{<τ}(i, u) := E Σ_{t=0}^{τ-1} 1_{(i,u)}(X_t, U_t)  (15)
and a state-action hitting measure by
λ_τ(j, u) := P[X_τ = j, U_τ = u].
For a given π = (π_iu)_{(i,u)∈X×U}, the state-action and state occupation measures are linked as follows:
γ_{<τ}(i) = Σ_{u∈U} γ_{<τ}(i, u)  (16)
γ_{<τ}(i, u) = π_iu γ_{<τ}(i).  (17)
The identity (17) follows from P[X_t = i, U_t = u, t < τ] = π_iu P[X_t = i, t < τ], where we have used that the policy is stationary. Similarly, the state-action hitting measure and the state hitting measure are related by (16) and (17) with γ_{<τ} replaced by λ_τ.
We return to the evolution equation (9).
Lemma 1: The evolution equation for the state-action measures is
Σ_{u∈U} λ_τ(j, u) = μ(j) + Σ_{u∈U} Σ_{i∈X} γ_{<τ}(i, u)(p_iuj - δ_j(i)).  (18)
Proof: The state occupation measure satisfies (9). The left-hand side of (18) follows from (16), whereas the right-hand side is the consequence of the following computation:
Σ_{i∈X} γ_{<τ}(i)(p_ij(π) - δ_j(i)) = Σ_{i∈X} Σ_{u∈U} γ_{<τ}(i, u)(p_iuj - δ_j(i))
which uses (16), (17), and the law of total probability. ∎
The evolution equation (18) can be reformulated as
Σ_{u∈U} λ_τ(·, u) = μ(·) + Σ_{u∈U} γ_{<τ}(·, u) L(u)
where L(u) := P(u) - I_X. Hence, the evolution equation for the state-action measure is an average of the evolution equations over constant actions.

IV. STOCHASTIC OPTIMIZATION WITH STOPPING TIME
We shall call a policy π transient on H if the MDP (X_t) with the policy π is transient, i.e., P_π[X_t ∈ H for infinitely many t | X_0 ∈ H] = 0.

A. DP With Stopping Time
The following result shows that the cost V_π restricted to H can be computed as a potential of the "charge" R_π.
Proposition 1: For an MDP (X_t) with the state-action space (X, U) and a target set E ⊂ X, let π be a transient policy on H := X \ E. Let τ := τ_E be the first hitting time corresponding to E, and let ρ be the reward.
Then, the cost function V_π in (14) restricted to H is
V_π|_H = G(π) R_π  (19)
where R_π(i) := Σ_{u∈U} π_iu ρ(i, u), and G(π) is the kernel associated to H and π given by (2). We remark that the assumption of π being transient ensures that the optimization problem is feasible for the policy π. If the probability of staying in the taboo set H were not 0, the system would never reach the target set E.
Proof: Write P := P_π and R := R_π. We claim that
V_π(i) = Σ_{k=0}^∞ E_i[ρ_k ; k < τ], i ∈ H.  (20)
The claim follows from the monotone convergence theorem applied to the partial sums in (14). We observe that, for i ∈ H, E_i[ρ_k ; k < τ] = (Q(π)^k R)(i), since the process stays in H before τ. Inserting this equation into (20) shows the claim, i.e., (19). From (6), we conclude that
V^μ_π = ⟨γ^μ_{<τ}, R_π⟩.  (21)
In the second part of the proof, we use the evolution equation (9) to evaluate f|_H for any f such that f(e) = 0 for all e ∈ E and for any initial measure μ. Without loss of generality, suppose that the states are numbered such that the first states belong to H and the remaining ones to E. Then, the (possibly infinite-dimensional) transition matrix P := P(π) is decomposed into blocks indexed by H and E, with Q := P(π)|_H in the upper-left corner, and G is the Green operator defined in (2). By the relation (3) and the evolution equation (9), the identity (21) follows. ∎
We strive to solve the following optimization problem:
V_* := min_{π∈D^H} V_π  (22)
where H is the taboo set, and D^H is the Cartesian product of |H| copies of the simplex D. V_* is called the value function.
From (3), we obtain that G(π)(I_H - Q(π)) = I_H, and the result is the celebrated formula in DP
V_π|_H = R_π + Q(π) V_π|_H  (23)
with the boundary condition V_π(e) = 0 for e ∈ E. Let Δ(π) := I - Q(π) be the discrete Laplacian operator on the taboo set H; then, we write (23) as
Δ(π) V_π|_H = R_π.
Let us recall some standard results of DP [2], which will be instrumental in the following sections. Suppose that there is at least one transient policy π. The optimal cost V_* restricted to the set H satisfies Bellman's equation
V_*(i) = min_{u∈U} [ρ(i, u) + Σ_{j∈X} p_iuj V_*(j)], i ∈ H.
Furthermore, the function V_*|_H is the (coordinatewise) limit of the sequence (V_n) defined by
V_{n+1}(i) = min_{u∈U} [ρ(i, u) + Σ_{j∈X} p_iuj V_n(j)]
with an arbitrary initial condition V_0 ≥ 0.
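The iterative scheme above can be sketched in a few lines. The 2-taboo-state, 2-action MDP below is hypothetical; the target state is absorbing, and the boundary condition keeps V at 0 there.

```python
import numpy as np

# Value-iteration sketch for Bellman's equation with a stopping time:
# V(i) = min_u [ rho(i, u) + sum_j p_iuj V(j) ] for i in the taboo set
# H = {0, 1}, with boundary condition V = 0 on the target set E = {2}.
# All numbers are hypothetical.
P_u = {
    0: np.array([[0.6, 0.2, 0.2],
                 [0.3, 0.4, 0.3],
                 [0.0, 0.0, 1.0]]),
    1: np.array([[0.2, 0.2, 0.6],
                 [0.0, 0.3, 0.7],
                 [0.0, 0.0, 1.0]]),
}
rho = np.array([[1.0, 2.0],   # rho[i, u]: cost of action u at state i
                [1.0, 2.0]])
H = [0, 1]

V = np.zeros(3)               # arbitrary V0 >= 0; V stays 0 on E
for _ in range(500):
    for i in H:
        V[i] = min(rho[i, u] + P_u[u][i] @ V for u in P_u)

# At the fixed point, V restricted to H satisfies Bellman's equation.
bellman = [min(rho[i, u] + P_u[u][i] @ V for u in P_u) for i in H]
assert np.allclose(V[H], bellman)
```

In this example, the costlier action reaches E faster, so the optimal policy trades the per-step cost against the expected number of steps until the stopping time.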

B. Linear Programming
We follow the idea of [9], showing that the value function (22) is the largest among the functions V : H → R satisfying the inequality
V(i) ≤ ρ(i, u) + Σ_{j∈X} p_iuj V(j) for all i ∈ H, u ∈ U.
Such functions are called subharmonic vectors in the MDP context.
Lemma 2: The value function satisfies
V_*|_H = sup{V | V is subharmonic}
where the sup is to be understood coordinatewise.
We use the action-state evolution equation (18) to formulate the LP. At the outset, we define
V^μ_π := Σ_{i∈H} μ(i) V_π(i)
where, as before, τ is the first hitting time of E, τ = τ_E. Furthermore, we let V^μ_* := min_{π∈D^H} V^μ_π.
Lemma 3: Suppose that X_0 has the initial distribution μ with support in H; then
V^μ_* = min_α Σ_{i∈H} Σ_{u∈U} α(i, u) ρ(i, u)  (27)
over the measures α on H × U that satisfy
μ(j) + Σ_{u∈U} Σ_{i∈H} α(i, u)(p_iuj - δ_j(i)) = 0 for all j ∈ H, α ≥ 0.
Furthermore, the policy is given by
π_iu = α(i, u)/α^μ(i)  (28)
where α^μ(i) = Σ_{u∈U} α(i, u).
Proof: From (15), we have
V^μ_π = Σ_{i∈H} Σ_{u∈U} γ^μ_{<τ}(i, u) ρ(i, u).
The state-action evolution equation uniquely characterizes the process (X_t, U_t). Subsequently, noticing that the support of λ^μ_τ(·, u) is in the complement of H, the constraint above is (18) restricted to j ∈ H; combined with (21) in the proof of Proposition 1 and substituting γ^μ_{<τ} by α, the equality (27) follows. Finally, the policy (28) follows from (16) and (17). ∎

V. SAFETY
We formulate safety as a DP problem. In the previous section, we have considered the terminal set E and its complement, the taboo set H. We extend this situation by adding an extra set U, the set of forbidden states. We suppose that U is disjoint from E. Now, the taboo set is
H := X \ (E ∪ U).
The definition of safety is taken from [11]. For each state in X, the safety function gives the probability that the realizations hit the forbidden set U before reaching the target set E.
We consider the problem of finding a policy π such that the safety function satisfies the following condition:
S_π(i) := P_i[τ_U < τ_E] ≤ p for all i ∈ H
where τ_A denotes the first hitting time of a set A. We have again suppressed the policy π in the notation, P = P(π).
To compute the safety function S_π, we apply the evolution equation (9) with the initial distribution μ concentrated at i and with τ = τ_{E∪U}, the first hitting time of E ∪ U
⟨λ^i_τ, f⟩ = f(i) + ⟨γ^i_{<τ}, L(π)f⟩ for all f : X → R.
We observe that the safety function is S_π(i) = λ^i_τ(U). We unfold the evolution equation
Σ_{j∈X} λ^i_τ(j) f(j) = f(i) + Σ_{j∈H} γ^i_{<τ}(j)(L(π)f)(j).
Since the function f is arbitrary, for the specific choice of f such that f(j) = 0 for j ∈ E, f(j) = 1 for j ∈ U, and (L(π)f)(j) = 0 for j ∈ H, we have Σ_{k∈U} λ^i_τ(k) = f(i). In conclusion, the safety function S_π is the solution s of the following problem:
(L(π)s)(j) = 0 for j ∈ H, s(j) = 1 for j ∈ U, s(j) = 0 for j ∈ E.  (30)
The problem (30) is known as the Dirichlet problem. Its solution is unique. Since (30) is linear in s, we formulate it in terms of matrices.
To this end, we suppose that the states are numbered in the following order: the states in H come first, then those in U, and finally those in E. We decompose P := P(π) accordingly into blocks
P(π) =
[ Q   T_U  T_E ]
[ *   *    *   ]
[ *   *    *   ]  (31)
where Q := P(π)|_H, and T_U and T_E collect the transition probabilities from H into U and E, respectively (the remaining blocks, marked *, play no role in what follows).
Lemma 4: Suppose that the MDP (X_t) with the policy π is transient on H. Let
K_π := T_U 1_U  (32)
where 1_U is the column vector of 1s of length |U|. Then, the safety function is given by
S_π|_H = G(π) K_π  (33)
and it is the solution of the following Poisson equation:
(I_H - Q(π)) S_π|_H = K_π.  (34)
Furthermore, the sequence (S^n_π) defined by
S^{n+1}_π = Q(π) S^n_π + K_π  (35)
for an arbitrary S^0_π converges pointwise to S_π|_H.
Proof: Applying the decomposition (31) to the Dirichlet problem (30), we get (33). Then, (33) and (3) imply (34). ∎
We regard (35) as a discrete-time dynamical system and observe that the eigenvalues of Q(π) are in the open unit disk. Consequently, the sequence (S^n_π) converges to G(π)K_π. The value of K_π at a state i ∈ H is the probability of reaching one of the forbidden states in U in a single time step. The characterization of the safety function in (33) corresponds to the probability of reaching U after staying entirely in the set H.
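Both characterizations of the safety function, the closed form (33) and the iteration (35), can be compared on a toy chain. The policy-induced chain below is hypothetical, with taboo set H = {0, 1}, forbidden state 2, and target state 3.

```python
import numpy as np

# Safety function of a hypothetical policy-induced chain.
# S_pi|_H = G(pi) K_pi solves the Poisson equation (I_H - Q) S = K_pi,
# and the iteration S_{n+1} = Q S_n + K_pi converges to it from any
# starting point, since the eigenvalues of Q lie in the open unit disk.
P = np.array([
    [0.5, 0.2, 0.1, 0.2],
    [0.3, 0.3, 0.3, 0.1],
    [0.0, 0.0, 1.0, 0.0],   # forbidden state, absorbing
    [0.0, 0.0, 0.0, 1.0],   # target state, absorbing
])
H, U_forbidden = [0, 1], [2]

Q = P[np.ix_(H, H)]
K = P[np.ix_(H, U_forbidden)].sum(axis=1)   # one-step entry probability

S_closed = np.linalg.inv(np.eye(2) - Q) @ K  # closed form (33)

S = np.zeros(2)                              # arbitrary S0
for _ in range(300):
    S = Q @ S + K                            # iteration (35)

assert np.allclose(S, S_closed, atol=1e-9)
assert np.all((0.0 <= S) & (S <= 1.0))       # S is a probability
```

Comparing S with a prescribed level p at the initial states is exactly the p-safety check.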
The safety function can be computed as the expectation of a cumulative reward. Suppose that μ is the initial distribution of the process (X_t), X_0 ∼ μ. We consider the safety function
S^μ_π := Σ_{i∈H} μ(i) S_π(i).
Corollary 1: Suppose that μ is the initial distribution of the process (X_t) and the support of μ is in H. The safety function S^μ_π is given by
S^μ_π = E_μ Σ_{t=0}^{τ-1} κ(X_t, U_t)
where τ = τ_{U∪E}, and κ(i, u) := Σ_{j∈U} p_iuj for all i ∈ H.
Proof: From (32), K_π(i) = Σ_{u∈U} π_iu κ(i, u). On the other hand, from Lemma 4, S_π|_H = G(π)K_π, and the conclusion follows as in the proof of Proposition 1. ∎
Corollary 1 allows us to formulate the safe optimization as a constrained MDP. Specifically, from Lemma 3, the safety function S^μ_π is computed from
S^μ_π = Σ_{i∈H} Σ_{u∈U} γ^μ_{<τ}(i, u) κ(i, u)
where γ^μ_{<τ} is the occupation measure, the solution of the evolution equation (18). The measure γ^μ_{<τ} (restricted to H) corresponds to the occupation measure of H when there is no escape. Subsequently, the function κ is the reward to go into the set U.

VI. SAFETY GUARANTEE IN STOCHASTIC OPTIMIZATION
We combine safety and stochastic optimization. The objective is to reach the goal states in the subset E ⊂ X with minimal expected cumulative reward, subject to the safety constraint that the probability of reaching the set U of unsafe states before E remains below a prior level p. In other words, for 0 ≤ p ≤ 1 and an initial distribution μ of X_0 supported on H, we strive to find the minimum V_* of the cost
V^μ_π = E_μ Σ_{t=0}^{τ_E - 1} ρ(X_t, U_t)  (39)
on H subject to S^μ_π ≤ p, over the stationary (mixed) policies π ∈ D^{X\E}. First, we reformulate the safety function in the constraint as the cost
S^μ_π = E_μ Σ_{t=0}^{τ-1} κ(X_t, U_t)  (40)
with τ := τ_{E∪U} and κ(i, u) := Σ_{j∈U} p_iuj. We notice that the stopping times in the cost function (39) and in the constraint (40) are not the same; furthermore, τ = τ_{U∪E} ≤ τ_E almost surely. As before, we define the (state-action) occupation and hitting measures for both stopping times. We use a simplified notation, where we suppress the names of the stopping times and the initial measure: γ_A(i, u) and λ_A(j, u) denote the occupation and hitting measures up to τ = τ_{U∪E} with initial distribution μ, and γ_B(i, u) and λ_B(j, u) denote the corresponding measures for the process restarted at τ and stopped at τ_E, for (i, u) ∈ X × U. Consequently, there are two evolution equations: the first one governs the measures with the index A, and the second those with the index B
Σ_{u∈U} λ_A(j, u) = μ(j) + Σ_{u∈U} Σ_{i∈X} γ_A(i, u)(p_iuj - δ_j(i))
Σ_{u∈U} λ_B(j, u) = Σ_{u∈U} λ_A(j, u) + Σ_{u∈U} Σ_{i∈X} γ_B(i, u)(p_iuj - δ_j(i)).  (41)
Both λ_A and λ_B are zero on H; λ_B is additionally zero on U. In conclusion, the cost and the constraint are expressed in terms of the occupation measures as
V^μ_π = Σ_{i∈X} Σ_{u∈U} (γ_A(i, u) + γ_B(i, u)) ρ(i, u), S^μ_π = Σ_{i∈H} Σ_{u∈U} γ_A(i, u) κ(i, u).
To conclude, safe stochastic optimization is formulated as the following linear program.
Proposition 2: Suppose that there is a transient policy π ∈ D^{X\E}; then, the minimum of V^μ_π over stationary policies π ∈ D^{X\E} is the value of the linear program minimizing Σ_{i,u} (γ_A(i, u) + γ_B(i, u)) ρ(i, u) over nonnegative measures γ_A, γ_B, λ_A, λ_B subject to the evolution equations (41) and the safety constraint Σ_{i,u} γ_A(i, u) κ(i, u) ≤ p. By summing the two evolution equations in (41), we observe that γ := γ_A + γ_B satisfies the following evolution equation:
Σ_{u∈U} λ_B(·, u) = μ(·) + Σ_{u∈U} γ(·, u) L(u).
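For a fixed policy-induced chain, the two-phase construction behind (41) can be verified at the state level (a simplification: the article works with state-action measures and optimizes over policies). The 4-state chain below is hypothetical; phase A runs up to the first hit of U ∪ E, and phase B continues from there up to the first hit of E.

```python
import numpy as np

# Two-phase evolution equations for a fixed chain (hypothetical):
# H = {0, 1} (taboo), U = {2} (forbidden, not absorbing here, since the
# cost runs until tau_E), E = {3} (target, absorbing).
P = np.array([
    [0.5, 0.2, 0.2, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.1, 0.5, 0.3],   # forbidden state: process continues toward E
    [0.0, 0.0, 0.0, 1.0],   # target state, absorbing
])
H, U_set = [0, 1], [2]
mu = np.array([1.0, 0.0, 0.0, 0.0])

def stopped_measures(P, taboo, init):
    """Occupation and hitting measures up to the first exit from `taboo`."""
    Q = P[np.ix_(taboo, taboo)]
    gamma = np.zeros(len(P))
    gamma[taboo] = init[taboo] @ np.linalg.inv(np.eye(len(taboo)) - Q)
    lam = init + gamma @ (P - np.eye(len(P)))
    return gamma, lam

# Phase A: up to tau = tau_{U∪E}.  Phase B: from tau up to tau_E,
# started from the phase-A hitting measure.
gamma_A, lam_A = stopped_measures(P, H, mu)
gamma_B, lam_B = stopped_measures(P, H + U_set, lam_A)

# Summed evolution equation: lam_B = mu + (gamma_A + gamma_B) L.
gamma = gamma_A + gamma_B
assert np.allclose(lam_B, mu + gamma @ (P - np.eye(4)))
assert np.isclose(lam_B[3], 1.0)   # tau_E is finite a.s.
S = lam_A[2]                       # safety is carried by phase A alone
assert 0.0 <= S <= 1.0
```

In the LP of Proposition 2, the cost pairs with γ_A + γ_B while the safety constraint pairs with γ_A only, mirroring the split above.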