Efficient Strategy Synthesis for MDPs With Resource Constraints

František Blahoudek, Petr Novotný, Melkior Ornik, Pranay Thangeda, and Ufuk Topcu

Abstract: We consider qualitative strategy synthesis for the formalism called consumption Markov decision processes. This formalism can model the dynamics of an agent that operates under resource constraints in a stochastic environment. The presented algorithms work in time polynomial with respect to the representation of the model and they synthesize strategies ensuring that a given set of goal states will be reached (once or infinitely many times) with probability 1 without resource exhaustion. In particular, when the amount of resource becomes too low to safely continue in the mission, the strategy changes the course of the agent toward one of a designated set of reload states where the agent replenishes the resource to full capacity; with a sufficient amount of resource, the agent attempts to fulfill the mission again.
We also present two heuristics that attempt to reduce the expected time that the agent needs to fulfill the given mission, a parameter important in practical planning. The presented algorithms were implemented, and numerical examples demonstrate (i) the effectiveness (in terms of computation time) of the planning approach based on consumption Markov decision processes and (ii) the positive impact of the two heuristics on planning in a realistic example.
Index Terms: consumption Markov decision process, planning, resource constraints, strategy synthesis

I. INTRODUCTION
Autonomous agents like driverless cars, drones, or planetary rovers typically operate under resource constraints. A lack of the critical resource usually leads to a mission failure or even to a crash.
Autonomous agents are often deployed in stochastic environments, which exhibit uncertain outcomes of the agents' actions. Markov decision processes (MDPs) are commonly used to model such environments for planning purposes. Intuitively, an MDP is described by a set of states and transitions between these states. In a discrete-time MDP, the evolution happens in discrete steps and each transition has two phases: first, the agent chooses some action to play; second, the resulting state is chosen randomly based on a probability distribution determined by the action and the agent's current state.
The interaction of an agent with an MDP is formalized using strategies. A strategy is simply a recipe that tells the agent, at every moment, which action to play next. The problem of finding strategies suitable for given objectives is called strategy synthesis for MDPs.
As the main results of this paper, we solve strategy synthesis for two kinds of objectives in resource-constrained MDPs. These two objectives are (i) almost-sure reachability of a given set of states T , and (ii) almost-sure Büchi objective for T . That is, the synthesized strategies ensure that, with probability 1 and without resource exhaustion, some target from T will be reached at least once or T will be visited infinitely often.
We also present two heuristics that improve the practical utility of the presented algorithms for planning in resource-constrained systems. In particular, the goal-leaning and threshold heuristics attempt, as a secondary objective, to reach T in a short time. Further, we briefly describe our tool implementing these algorithms and we demonstrate that our approach, specialized to the qualitative analysis of resource-constrained systems, can solve this task faster than the state-of-the-art general-purpose probabilistic model checker STORM [1].

A. Current approaches to resource constraints
There is a substantial body of work in the area of verification of resource-constrained systems [2]-[11]. A naive approach is to model such systems as finite-state systems with states augmented by an integer variable representing the current resource level. The resource constraint requires that the resource level never drops below zero.
The well-known energy model [2], [3] avoids encoding the resource level into the state space: instead, the model uses an integer counter, transitions are labeled by integers, and taking a transition labeled by c results in c being added to the counter. Thus, negative numbers stand for resource consumption while positive ones represent charging. Many variants of both MDP- and game-based energy models have been studied. In particular, [12] considers strategy synthesis for energy MDPs with qualitative Büchi and parity objectives. The main limitation of energy models is that, in general, they are not known to admit strategy synthesis algorithms that work in time polynomial with respect to the representation of the model. Indeed, already the simplest problem, deciding whether a non-negative energy can be maintained in a two-player energy game, is at least as hard as solving mean-payoff graph games [3]; whether the latter belongs to P is a well-known open problem [13]. This hardness translates also to MDPs [12], making polynomial-time strategy synthesis for energy MDPs impossible without a theoretical breakthrough.

B. Consumption MDPs
Our work is centered around consumption MDPs (CMDPs), a model motivated by real-world vehicle energy consumption and inspired by consumption games [14]. In a CMDP, the agent has a finite storage capacity, each action consumes a non-negative amount of the resource, and replenishing of the resource happens only in designated states, called reload states, as an atomic (instant) event. In particular, resource levels are not encoded in the state space.
Reloading as an atomic event and bounded capacity are the key ingredients for an efficient analysis of CMDPs. Our qualitative strategy synthesis algorithms provably work in time polynomial with respect to the representation of the model. Moreover, they synthesize strategies with a simple structure and an efficient representation via binary counters.
We first introduced CMDPs and presented the algorithm for the Büchi objective in [15]. In contrast to [15], this paper contains the previously omitted proofs, it extends the algorithmic core with the reachability objective, and it introduces the goal-leaning and threshold heuristics that attempt to improve the expected time to reach targets. Moreover, the presentation in this manuscript is based on new notation that simplifies understanding of the merits and proofs, it uses more pictorial examples, and, finally, we provide a numerical example that uses STORM as a baseline for comparison.

C. Outline
Section II introduces CMDPs with the necessary notation and it is followed by Section III, which discusses strategies with binary counters. Sections IV and V solve two intermediate objectives for CMDPs, namely safety and positive reachability, which serve as stepping stones for the main results. The solution for the Büchi objective is conceptually simpler than the one for almost-sure reachability and is thus presented first, in Section VI, followed by Section VII for the latter. Section VIII defines the expected reachability time and proposes the two heuristics for its reduction. Finally, Section IX briefly describes our implementation and two numerical examples: one showing the effectiveness of CMDPs for the analysis of resource-constrained systems and one showing the impact of the proposed heuristics on the expected reachability time. For better readability, two rather technical proofs were moved from Section V to the Appendix.

II. PRELIMINARIES
We denote by N the set of all non-negative integers and by N̄ the set N ∪ {∞}. For a set I and a vector v ∈ N̄^I indexed by I, we write v(i) for the component of v with index i ∈ I. We assume familiarity with basic notions of probability theory.
A. Consumption Markov decision processes (CMDPs)

Definition 1 (CMDP). A consumption Markov decision process (CMDP) is a tuple C = (S, A, ∆, γ, R, cap) where S is a finite set of states, A is a finite set of actions, ∆ : S × A × S → [0, 1] is a transition function such that for all s ∈ S and a ∈ A we have ∑_{t∈S} ∆(s, a, t) = 1, γ : S × A → N is a consumption function, R ⊆ S is a set of reload states where the resource can be reloaded, and cap is a resource capacity.
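For concreteness, the components of Definition 1 map directly onto a small data structure. The following Python sketch is illustrative only (it is not FIMDP's actual API); all names are ours.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

State = str
Action = str

@dataclass
class CMDP:
    """Illustrative sketch of Definition 1 (not FIMDP's API)."""
    states: Set[State]
    actions: Set[Action]
    delta: Dict[Tuple[State, Action], Dict[State, float]]  # Delta(s, a, .) as a distribution
    gamma: Dict[Tuple[State, Action], int]                  # consumption gamma(s, a) >= 0
    reloads: Set[State]                                     # the set R
    cap: int                                                # resource capacity

    def succ(self, s: State, a: Action) -> Set[State]:
        """Succ(s, a) = {t | Delta(s, a, t) > 0}."""
        return {t for t, p in self.delta[(s, a)].items() if p > 0}

    def check(self) -> None:
        """Each state-action pair must induce a probability distribution."""
        for (s, a), dist in self.delta.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9
            assert self.gamma[(s, a)] >= 0
```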
Visual representation. CMDPs are visualized as shown in Fig. 1 for the CMDP ({r, s, t, u, v}, {a, b}, ∆, γ, {r, t}, 20). States are circles, reload states are double-circled, and target states (used later for reachability and Büchi objectives) are highlighted with a green background. The capacity is given in the yellow box. The functions ∆ and γ are given by (possibly branching) edges in the graph. Each edge is labeled by the name of the action and by its consumption enclosed in brackets. Probabilities of outcomes are given by gray labels in the proximity of the respective successors. For example, the cyan branching edge stands for ∆(s, b, u) = ∆(s, b, v) = 1/2 and γ(s, b) = 5. To avoid clutter, we omit the probability 1 for non-branching edges and we merge edges that differ only in action names and otherwise are identical. As an example, the edge from r to s means that ∆(r, a, s) = ∆(r, b, s) = 1 and γ(r, a) = γ(r, b) = 1. The colors of edges do not carry any special meaning; they are only used later in the text for easy identification of the particular actions.

For s ∈ S and a ∈ A, we denote by Succ(s, a) the set {t | ∆(s, a, t) > 0}. A path is a (finite or infinite) state-action sequence α = s_1 a_1 s_2 a_2 s_3 ⋯ ∈ (S·A)^ω ∪ (S·A)*·S such that s_{i+1} ∈ Succ(s_i, a_i) for all i. We define α_i = s_i and we say that α is s_1-initiated. We use α_{..i} for the finite prefix s_1 a_1 … s_i of α, α_{i..} for the suffix s_i a_i …, and α_{i..j} for the infix s_i a_i … s_j. A finite path is a cycle if it starts and ends in the same state, and it is simple if none of its proper infixes forms a cycle. The length of a path α is the number len(α) of actions on α, with len(α) = ∞ if α is infinite.
An infinite path is called a run. We typically name runs by variants of the symbol ϱ. A finite path is called a history. We use last(α) for the last state of a history α. For a history α with last(α) = s_1 and for β = s_1 a_1 s_2 a_2 …, we define the joint path α·β = α a_1 s_2 a_2 … .
A CMDP is decreasing if for every cycle s_1 a_1 s_2 … a_{k−1} s_k there exists 1 ≤ i < k such that γ(s_i, a_i) > 0. Throughout this paper we consider only decreasing CMDPs. The only places where this assumption is used are the proofs of Theorem 3 and Theorem 7.

B. Resource: consumption and levels
The semantics of the consumption function γ, the reload states R, and the capacity cap naturally capture the evolution of resource levels along paths in C. Intuitively, each computation of C must start with some initial load of the resource, actions consume the resource, and reload states replenish the resource level to cap. The resource is depleted if its level drops below 0, which we indicate by the symbol ⊥ in the following.
Formally, let α = s_1 a_1 s_2 … s_n (where n might be ∞) be a path in C and let 0 ≤ d ≤ cap be an initial load. We write α^d to denote the fact that α started with d units of the resource. We say that α is loaded with d and that α^d is a loaded path. The resource levels of α^d form the sequence RL_C(α^d) = r_1, r_2, …, r_n where r_1 = d and, for 1 ≤ i < n, the next resource level r_{i+1} is defined inductively, using c_i = γ(s_i, a_i) for the consumption of a_i, as

    r_{i+1} = r_i − c_i     if s_i ∉ R, r_i ≠ ⊥, and r_i − c_i ≥ 0,
    r_{i+1} = cap − c_i     if s_i ∈ R, r_i ≠ ⊥, and cap − c_i ≥ 0,
    r_{i+1} = ⊥             otherwise.

A loaded path is safe if ⊥ does not occur among its resource levels. For a finite loaded path α^d, we write lastRL(α^d) for the last resource level r_n.
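The inductive definition of resource levels translates directly into code. Continuing the illustrative sketch above, the following function computes RL(α^d) for a finite path, with None standing for ⊥:

```python
from typing import List, Optional

def resource_levels(cmdp: CMDP, path: list, d: int) -> List[Optional[int]]:
    """RL(alpha^d) for path = [s1, a1, s2, a2, ..., sn]; None encodes depletion."""
    states, actions = path[0::2], path[1::2]
    levels: List[Optional[int]] = [d]
    for s, a in zip(states, actions):
        r = levels[-1]
        if r is None:                  # once depleted, always depleted
            levels.append(None)
            continue
        c = cmdp.gamma[(s, a)]
        base = cmdp.cap if s in cmdp.reloads else r   # reloading is atomic
        levels.append(base - c if base - c >= 0 else None)
    return levels
```

A loaded path is then safe exactly when the returned list contains no None entry.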

C. Strategies
A strategy σ for C is a function assigning an action to each loaded history. An evolution of C under the control of σ, starting in some initial state s ∈ S with an initial load d ≤ cap, creates a loaded path α^d with α = s_1 a_1 s_2 … as follows. The path starts with s_1 = s, and for i ≥ 1 the action a_i is selected by the strategy as a_i = σ((s_1 a_1 s_2 … s_i)^d), and the next state s_{i+1} is chosen randomly according to the values of ∆(s_i, a_i, ·). Repeating this process ad infinitum yields an infinite sample run (loaded by d). Loaded runs created by this process are σ-compatible. We denote the set of all σ-compatible s-initiated runs loaded by d by Comp_C(σ, s, d).
We denote by P^σ_{C,s,d}(A) the probability that a sample run from Comp_C(σ, s, d) belongs to a given measurable set of loaded runs A. For details on the formal construction of measurable sets of runs, see [16].

D. Objectives and problems
A resource-aware objective (or simply an objective) is a set of loaded runs. The objective S (safety) contains exactly all loaded runs that are safe. Given a target set T ⊆ S and i ∈ N, the objective R^i_T (bounded reachability) is the set of all safe loaded runs that reach some state from T within the first i steps, which is R^i_T = {ϱ^d ∈ S | ϱ_j ∈ T for some 1 ≤ j ≤ i + 1}. The union R_T = ⋃_{i∈N} R^i_T forms the reachability objective. Finally, the objective B_T (Büchi) contains all safe loaded runs that visit T infinitely often.
The safety objective (never depleting the critical resource) is of primary concern for agents in CMDPs. We reflect this fact in the following definitions. Let us now fix a target set T ⊆ S, a state s ∈ S, an initial load d, a strategy σ, and an objective O. We say that σ loaded with d in s surely satisfies O, written σ |=^d_s O, if Comp_C(σ, s, d) ⊆ O; that it satisfies O with positive probability, written σ |=^d_{s,>0} O, if P^σ_{s,d}(O) > 0; and that it satisfies O almost surely, written σ |=^d_{s,=1} O, if P^σ_{s,d}(O) = 1. For each of these satisfaction relations, the vector of minimal levels assigns to each state s the minimal load d with which some strategy satisfies O from s (and ∞ if no such load exists); we write ml[O]_C, ml[O]^{>0}_C, and ml[O]^{=1}_C for the sure, positive, and almost-sure variants, respectively, and we call a strategy achieving the minimal level a witness strategy. We consider the following qualitative problems for CMDPs: safety, positive reachability, almost-sure Büchi, and almost-sure reachability, which amount to computing ml[S]_C, ml[R]^{>0}_C, ml[B]^{=1}_C, and ml[R]^{=1}_C, respectively, together with the corresponding witness strategies. The solutions of the latter two problems build on top of the first two.

E. Additional notation and conventions
For a given R' ⊆ S, we denote by C(R') the CMDP that uses R' as the set of reload states and otherwise is defined as C. Throughout the paper, we drop the subscripts C and T in symbols whenever C or T is known from the context. Calligraphic font (e.g., C) is used for names of CMDPs, sans serif (e.g., S) is used for objectives (sets of loaded runs), and vectors are written in bold. Action names are letters from the start of the alphabet, while states of CMDPs are usually taken from the later parts of the alphabet (starting with r). The symbol α is used for both finite and infinite paths, while ϱ is used only for infinite paths (runs). Finally, strategies are always variants of σ or π.

Example 2. The two runs from Example 1 are sample runs created by two different memoryless strategies: σ_a, which always picks a in s, and σ_b, which always picks b in s, respectively. As the first of these runs is the only s-initiated run of σ_a, we have that σ_a |=^2_s S. However, σ_a is not useful if we attempt to eventually reach t, and we clearly have P^{σ_a}_{s,2}(R_{{t}}) = 0. On the other hand, the second run witnesses the fact that σ_b does not even satisfy the safety objective for any initial load. As we have no other choice in s, we can conclude that memoryless strategies are not sufficient in our setting. Consider instead a strategy π that picks b in s whenever the current resource level is at least 10 and picks a (and reloads in r) otherwise. Loaded with 2 in s, π satisfies safety and it guarantees reaching t with positive probability: in s, we need at least 10 units of the resource to return to r in the case we are unlucky and b leads us to u; if we are lucky, b leads us directly to t, witnessing that P^π_{s,2}(R_{{t}}) > 0. Moreover, at every revisit of r there is a 1/2 chance of hitting t during the next attempt, which shows that π in fact satisfies the reachability objective almost surely.

Remark. While computing the sure satisfaction relation |= on a CMDP follows approaches similar to those used for solving consumption 2-player games [14], the solutions for |=_{>0} and |=_{=1} differ substantially. Indeed, imagine that, in the CMDP from Fig. 1, the outcome of the action b from state s is resolved by an adversarial player (who replaces the random resolution). The player can always pick u as the next state, and then the strategy π does not produce any run that reaches t. In fact, there would be no strategy that guarantees reaching t against such a player at all.
The strategy π from Example 2 uses finite memory to track the resource level exactly. Under the standard definition, a strategy is a finite-memory strategy if it can be encoded by a memory structure, a type of finite transducer (a finite-state machine with outputs). Tracking resource levels using states of transducers is memory-inefficient. Instead, the next section introduces resource-aware strategies that rely on binary counters to track resource levels.

III. STRATEGIES WITH BINARY COUNTERS
In this section, we define a succinct representation of finite-memory strategies suitable for CMDPs. Let us fix a CMDP C = (S, A, ∆, γ, R, cap) for the rest of this section. In our setting, strategies need to track the resource levels of histories. A non-exhausted resource level is always a number between 0 and cap, which can be represented with a binary-encoded bounded counter. A binary-encoded counter needs log_2 cap bits of memory to represent numbers between 0 and cap (the same way as integer variables in computers). Representing resource levels using states of transducers would instead require cap states.
We call strategies with such binary-encoded counters finite counter strategies. In addition to the counter, a finite counter strategy needs rules that select actions based on the current resource level, and a rule selector that picks the right rule for each state.
Definition 2 (Rule). A rule ϕ for C is a partial function from the set {0, …, cap} to A. An undefined value for some n is indicated by ϕ(n) = ⊥.
We use dom(ϕ) = {n ∈ {0, …, cap} | ϕ(n) ≠ ⊥} to denote the domain of ϕ and we call the elements of dom(ϕ) border levels. We use Rules_C for the set of all rules for C.
A rule compactly represents a total function using intervals. Intuitively, the selected action is the same for all values of the resource level in the interval between two border levels. Formally, let l be the current resource level and let n_1 < n_2 < ⋯ < n_k be the border levels of ϕ, sorted in ascending order. The selection according to the rule ϕ for l, written select(ϕ, l), picks the action ϕ(n_i), where n_i is the largest border level such that n_i ≤ l. In other words, select(ϕ, l) = ϕ(n_i) if the current resource level l lies in [n_i, n_{i+1}) (putting n_{k+1} = cap + 1). For completeness, we set select(ϕ, l) = a for some globally fixed action a ∈ A if l < n_1; in particular, select(ϕ, ⊥) = a.
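A sketch of the interval-based selection, assuming a rule is stored as a dictionary from border levels to actions (names are illustrative):

```python
FALLBACK_ACTION = "a0"   # the globally fixed action used below all border levels

def select(rule: dict, level) -> str:
    """Pick rule(n_i) for the largest border level n_i <= level."""
    if level is None:                      # level = bottom (depleted resource)
        return FALLBACK_ACTION
    eligible = [n for n in rule if n <= level]
    return rule[max(eligible)] if eligible else FALLBACK_ACTION
```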
A binary-encoded counter that tracks the resource levels of paths, together with a rule selector Φ, encodes a strategy σ_Φ. A rule selector for C is a function Φ : S → Rules_C that assigns a rule to each state.
Let α^d with α = s_1 a_1 … s_n be a loaded history. We assume that we can access the value of lastRL(α^d) from the counter and we set σ_Φ(α^d) = select(Φ(s_n), lastRL(α^d)). A strategy σ is a finite counter strategy if there is a rule selector Φ such that σ = σ_Φ. The rule selector can be imagined as a device that implements σ using a table of size O(|S|), where the size of each cell corresponds to the number of border levels times O(log cap) (the latter representing the number of bits required to encode a level). In particular, if the total number of border levels in Φ is polynomial in the size of the CMDP, then so is the number of bits required to represent Φ (and thus σ_Φ). This contrasts with the traditional representation of finite-memory strategies via transducers [17], since transducers would require at least Θ(cap) states to keep track of the current resource level.

IV. SAFETY
In this section, we present Algorithm 2, which computes ml[S] and the corresponding witness strategy. Such a strategy guarantees that, given a sufficient initial load, the resource will never be depleted regardless of the resolution of the actions' outcomes. In the remainder of the section we fix a CMDP C = (S, A, ∆, γ, R, cap).
A safe run loaded with d has the following two properties: (i) it never consumes more than cap units of the resource between two consecutive visits of reload states, and (ii) it consumes at most d units of the resource before it reaches the first reload state. To ensure (i), we need to identify a maximal subset R' ⊆ R of reload states for which there is a strategy σ that, starting in some r ∈ R', can always reach R' again using at most cap units of the resource. To ensure (ii), we need a strategy that suitably navigates towards R' while not reloading and while using at most d units of the resource.
In summary, for both properties (i) and (ii) we need to find a strategy that can surely reach a set of states (R') without reloading and within a certain limit on consumption (cap and d, respectively). We capture the desired behavior of such strategies by a new objective N (non-reloading reachability).

A. Non-reloading reachability
The problem of non-reloading reachability in CMDPs is similar to the problem of minimum-cost reachability in regular MDPs with non-negative costs, which was studied before [18]. In this subsection, we present a new iterative algorithm for this problem, which fits better into our framework and is implemented in our tool. The reachability objective R is defined as a subset of S, and thus relies on resource levels. The following definition of N follows ideas similar to those we used for R, but (a) it ignores what happens after the first visit of the target set, and (b) it uses the cumulative consumption instead of resource levels, in order to ignore the effect of reload states.
Given T ⊆ S and i ∈ N, the objective N^i_T (bounded non-reloading reachability) is the set of all (not necessarily safe) loaded runs ϱ^d such that ϱ_j ∈ T for some 1 ≤ j ≤ i + 1 and the cumulative consumption of the first j − 1 actions of ϱ is at most d. The union N_T = ⋃_{i∈N} N^i_T forms the non-reloading reachability objective. Let us now fix some T ⊆ S. In the next few paragraphs, we discuss the solution of sure non-reloading reachability of T: computing the vector ml[N_T] and the corresponding witness strategy. The solution is based on backward induction (with respect to the number of steps needed to reach T). The key concept here is the value of an action a in a state s based on a vector v ∈ N̄^S, denoted AV(v, s, a) and defined as

    AV(v, s, a) = γ(s, a) + max_{t ∈ Succ(s,a)} v(t).
Intuitively, AV(v, s, a) is the consumption of a in s plus the worst value of v among the relevant successors. Now imagine that v is equal to ml[N^i]; that is, for each state s it contains the minimal amount of resource needed (without reloading) to reach T in at most i steps. Then AV(ml[N^i], s, a) is the minimal amount of resource needed to reach T in i + 1 steps when playing a in s.
The following functional F : N̄^S → N̄^S is a simple generalization of the standard Bellman functional used for computing shortest paths in graphs. We use

    F(v)(s) = 0 for s ∈ T, and F(v)(s) = min_{a∈A} AV(v, s, a) for s ∉ T.

To complete our induction-based computation, we need to find the right initialization vector x_T for F. As the intuition behind the action value hints, x_T should be precisely ml[N^0] and thus it is defined as x_T(s) = 0 for s ∈ T and as x_T(s) = ∞ otherwise.
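The action values and one application of F are a few lines of code on top of the CMDP sketch from Section II (vectors are dictionaries mapping states to numbers, with math.inf standing for ∞):

```python
import math

def action_value(cmdp: CMDP, v: dict, s: State, a: Action) -> float:
    """AV(v, s, a): consumption of a plus the worst v-value among successors."""
    return cmdp.gamma[(s, a)] + max(v[t] for t in cmdp.succ(s, a))

def bellman_step(cmdp: CMDP, targets: Set[State], v: dict) -> dict:
    """One application of F: 0 on targets, minimal action value elsewhere."""
    return {s: 0 if s in targets else
               min(action_value(cmdp, v, s, a) for a in cmdp.actions)
            for s in cmdp.states}

def ml_N(cmdp: CMDP, targets: Set[State]) -> dict:
    """Iterate F on x_T until a fixed point; this computes ml[N_T]."""
    v = {s: 0 if s in targets else math.inf for s in cmdp.states}
    while (w := bellman_step(cmdp, targets, v)) != v:
        v = w
    return v
```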
Lemma 1. For all i ∈ N it holds that ml[N^i] = F^i(x_T).

Proof. We proceed by induction on i. The base case i = 0 is trivial. Now assume that the lemma holds for some i ≥ 0. To reach T from a state s ∉ T within at most i + 1 steps, a strategy must play some action a_1 in the first step and then reach T within i steps. Therefore, given a state s ∉ T and the load d = ml[N^{i+1}](s), each witness strategy must play an action a_1 with AV(ml[N^i], s, a_1) ≤ d. On the other hand, let a_m be the action with the minimal AV in s based on ml[N^i]. The strategy that plays a_m in the first step and then mimics some witness strategy for ml[N^i] witnesses that ml[N^{i+1}](s) ≤ AV(ml[N^i], s, a_m). Together, ml[N^{i+1}](s) = min_{a∈A} AV(ml[N^i], s, a) = F^{i+1}(x_T)(s).

Theorem 1. Let n be the length of the longest simple path in C. Then iterating F on x_T reaches a fixed point after at most n + 1 steps, and this fixed point equals ml[N_T].

Proof. For the sake of contradiction, suppose that F does not yield a fixed point after n + 1 steps. Then there exist a state s and an index i ≥ n + 1 such that ml[N^{i+1}](s) < ml[N^i](s), together with an s-initiated run ϱ that first reaches T after exactly i + 1 steps and such that ml[N^{i+1}](s) is the consumption of the first i + 1 actions of ϱ. Such a run must exist, as otherwise some ml[N^j](s) with j ≤ i could be improved. As n is the length of the longest simple path in C, we can conclude that there are two indices f < l ≤ n + 1 such that ϱ_f = ϱ_l. Since the consumption is non-negative, cutting the cycle ϱ_{f..l} out of ϱ yields a run that reaches T in fewer steps with at most the same consumption, a contradiction with the minimality of ml[N^i](s).

Witness strategy for ml[N]. Any memoryless strategy σ that picks, for each history ending with a state s, some action a_s such that AV(ml[N], s, a_s) = ml[N](s) is clearly a witness strategy for ml[N]; that is, σ |= N.

B. Safely reaching reloads from reloads
The objective N is sufficient for property (ii) with T = R'. But we cannot use it off-the-shelf to guarantee property (i): at most cap units of the resource may be consumed between two consecutive visits of R'. For that, we need to solve the problem of reachability within at least 1 step (starting in T alone does not count as reaching T here). We define N^i_{+T} in the same way as N^i_T, but we enforce that the witnessing index j satisfies j > 1, and we set N_{+T} = ⋃_{i∈N} N^i_{+T}. To compute ml[N_{+T}], we slightly alter F using the following truncation operator: for a vector v ∈ N̄^S, we denote by ⌈v⌉_T the vector with ⌈v⌉_T(t) = 0 for t ∈ T and ⌈v⌉_T(s) = v(s) for s ∉ T.
The new functional G, applied to v, computes the new value in the same way as F does for all states (including the states from T), but treats v(t) as 0 for t ∈ T; that is, G(v)(s) = min_{a∈A} AV(⌈v⌉_T, s, a) for all s ∈ S.
Let ∞_S ∈ N̄^S denote the vector with all components equal to ∞. Clearly, ⌈∞_S⌉_T = x_T, and thus it is easy to see that for all s ∉ T and i ∈ N we have G^i(∞_S)(s) = F^i(x_T)(s) = ml[N^i_{+T}](s). A slight modification of the arguments used to prove Lemma 1 and Theorem 1 shows that G indeed computes ml[N_{+T}] and that we need at most n + 1 iterations to reach the desired fixed point. Algorithm 1 iteratively applies G on ∞_S until a fixed point is reached.
Now, with Algorithm 1 at hand, we can compute ml[N_{+R}]_C and see which reload states should be avoided by safe runs: the reload states that need more than cap units of the resource to surely reach R again. We call such reload states unusable in C.

C. Detecting useful reloads and solving the safety problem

Using Algorithm 1, we can identify the reload states that are unusable in C. However, this does not automatically mean that the remaining reload states form the desired set R' for property (i). Consider the CMDP D in Fig. 2. The only reload state that is unusable is w (ml[N_{+R}](w) = ∞). But clearly, all runs that avoid w must avoid v and x as well; this intuition is backed up by the values shown in Fig. 3. Property (i) indeed translates to ml[N_{+R'}](r) ≤ cap for all r ∈ R'; naturally, we want to identify the maximal R' ⊆ R for which this holds. Algorithm 2 finds the desired R' by iteratively removing unusable reload states from the current candidate set R' until there is no unusable reload state left in R' (lines 3-7).

With the right set R' in hand, we can move on to property (ii) of safe runs: navigating safely towards the reload states in R', which amounts to the objective N_{R'} from Section IV-A. We can reuse Algorithm 1 for this purpose, as ml[N] and ml[N_+] coincide on all states outside the target set. Based on properties (i) and (ii), we claim that ml[S] = ⌈ml[N_{+R'}]⌉_{R'}. Indeed, we need at most cap units of the resource to move between the reload states of R', and we need at most ml[N_{+R'}](s) units of the resource to reach R' from a state s.
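The outer loop of Algorithm 2 is easy to express on top of the primitives above. The following sketch is illustrative; the parameter ml_N_plus stands for Algorithm 1, i.e., a function computing ml[N_{+T}] for a given target set:

```python
def safe_minimal_levels(cmdp: CMDP, ml_N_plus) -> dict:
    """Compute ml[S] by iteratively removing unusable reload states."""
    good = set(cmdp.reloads)                 # candidate set R'
    while True:
        n = ml_N_plus(cmdp, good)            # ml[N_{+R'}] in C(R')
        unusable = {r for r in good if n[r] > cmdp.cap}
        if not unusable:
            break
        good -= unusable
    # Apply the truncation: 0 on R', infinity for values above cap.
    return {s: 0 if s in good else (n[s] if n[s] <= cmdp.cap else math.inf)
            for s in cmdp.states}
```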
Whenever ml[S](s) > cap for some state s, the exact value is not important for us; the meaning is still that there is no strategy σ and no initial load d ≤ cap such that σ |=^d_s S. Hence, we can set ml[S](s) = ∞ in all such cases. To achieve this, we extend the operator ⌈·⌉_T into ⌈·⌉^cap_T, which, in addition to setting the components of T to 0, sets every component whose value exceeds cap to ∞.

Correctness. We first prove that, upon termination, ⌈n⌉^cap_{R'}(s) ≤ ml[S]_C(s) holds for each s ∈ S whenever the latter value is finite. This is implied by the fact that ml[N_{+R'}] ≤ ml[S] is an invariant of the algorithm. To see that, it suffices to show that at every point of the execution, ml[S](t) = ∞ for each t ∈ R ∖ R': if this holds, each strategy that satisfies S must avoid the states in R ∖ R' (due to property (i) of safe runs), and thus the first reload state visited by runs compatible with such a strategy must be from R'.
Let R'_i denote the contents of R' after the i-th iteration. We prove, by induction on i, that ml[S](t) = ∞ for all t ∈ R ∖ R'_i. For i = 0 we have R = R'_0, so the statement holds trivially. For i > 0, let t ∈ R ∖ R'_i. Then there must exist some j < i such that n(t) = ml[N_{+R'_j}](t) > cap, hence no strategy can safely reach R'_j from t, and by the induction hypothesis, the reload states from R ∖ R'_j must be avoided by strategies that satisfy S. Together, as C is decreasing, there is no strategy σ and no load d ≤ cap such that σ |=^d_t S.

Finally, using Lemma 2, the existence of a memoryless witness strategy follows from the fact that a strategy that fixes one min-safe action in each state is safe; the complexity bound follows from Theorem 3.
Example 4. Figure 4 shows again the CMDP from Fig. 1, now including the values computed by Algorithms 1 and 2. Algorithm 2 stores the values of ml[N] into n and, because no value is ∞, returns just ⌈n⌉^cap_{R'}. The strategy σ_a from Example 2 is a witness strategy for ml[S]. As ml[S](s) = 2, no strategy is safe from s with initial load 1.

V. POSITIVE REACHABILITY
In this section, we present the solution of the problem called positive reachability. We focus on strategies that, given a set T ⊆ S of target states, satisfy the objective R_T (note that R_T ⊆ S, so such strategies are in particular safe) with positive probability. The main contribution of this section is Algorithm 3, which computes ml[R]^{>0} and the corresponding witness strategy. As before, for the rest of this section we fix a CMDP C = (S, A, ∆, γ, R, cap) and a set T ⊆ S.
Let s ∈ S ∖ T be a state, let d be an initial load, and let σ be a strategy such that σ |=^d_{s,>0} R_T. Intuitively, in the first step, σ must pick an action a such that no outcome of a can lead to resource exhaustion and, at the same time, at least one successor s' ∈ Succ(s, a) can be entered with enough resource to still satisfy R_T with positive probability. It must then continue in a similar fashion from s' until either T is reached (and σ produces the desired run from R_T) or until there is no such action.
To formalize the intuition, we define two auxiliary functions. Let us fix a state s, an action a, a successor s' ∈ Succ(s, a), and a vector x ∈ N̄^S. We define the hope value of s' for a in s based on x, denoted HV(x, s, a, s'), and the safe value of a in s based on x, denoted SV(x, s, a), as follows:

    HV(x, s, a, s') = max( {x(s')} ∪ { ml[S](t) | t ∈ Succ(s, a), t ≠ s' } ),
    SV(x, s, a) = γ(s, a) + min_{s' ∈ Succ(s,a)} HV(x, s, a, s').

The hope value of s' for a in s represents the lowest level of the resource that the agent needs to have after playing a in order to (i) have at least x(s') units of the resource if the outcome of a is s', and (ii) survive otherwise.
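In code, the hope and safe values read as follows (continuing the sketch above; ml_S is the safety vector computed by Algorithm 2):

```python
def hope_value(cmdp: CMDP, x: dict, ml_S: dict, s: State, a: Action, t: State) -> float:
    """HV(x, s, a, t): level needed after playing a so that outcome t
    still carries x(t) units while every other outcome stays safe."""
    others = [ml_S[u] for u in cmdp.succ(s, a) if u != t]
    return max([x[t]] + others)

def safe_value(cmdp: CMDP, x: dict, ml_S: dict, s: State, a: Action) -> float:
    """SV(x, s, a): consumption plus the best hope value among successors."""
    return cmdp.gamma[(s, a)] + min(
        hope_value(cmdp, x, ml_S, s, a, t) for t in cmdp.succ(s, a))
```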
We again use functionals to iteratively compute ml[R^i]^{>0}, with a fixed point equal to ml[R]^{>0}. The main operator B just applies ⌈·⌉^cap_R to the result of an auxiliary functional A. The application of ⌈·⌉^cap_R ensures that whenever the result is higher than cap, it is set to ∞, and that in reload states the value is either 0 or ∞, which is in line with what is expected from ml[R]^{>0}.

The following two lemmata relate B to ml[R]^{>0} and show that B applied iteratively on y_T reaches a fixed point in a number of iterations that is polynomial with respect to the representation of C. As their proofs are quite straightforward but technical, we moved them to the Appendix.

Example 5. Consider again the CMDP from Fig. 1 with the target set T = {t}; we can safely reach t from all reload states in this example. In the iteration of Algorithm 3 in which p(s) = 2 for the first time, the selector is updated to play a in s with resource levels between 2 (included) and 10 (excluded). Note that the computed Φ exactly matches the rule selector mentioned in Example 3.

Theorem 5. Algorithm 3 computes the vector ml[R]^{>0} in time polynomial with respect to the representation of C.

Proof. The complexity part follows from Lemma 4 and the fact that each iteration takes only a linear number of steps. The correctness part is an immediate corollary of Lemma 4 and the fact that Algorithm 3 iterates B on y_T until a fixed point is reached.
Theorem 6. Upon termination of Algorithm 3, the computed rule selector Φ encodes a strategy σ_Φ such that σ_Φ |=^v_{>0} R for v = ml[R]^{>0}. As a consequence, a polynomial-size finite counter strategy for the positive reachability problem can be computed in time polynomial with respect to the representation of C.
Proof. The complexity follows from Theorem 5. Indeed, since the algorithm has polynomial complexity, the size of Φ is also polynomial. The correctness proof is based on the following invariant of the main repeat-loop. The vector p and the finite counter strategy π = σ_Φ have these properties: (a) it holds that p ≥ ml[S]; (b) the strategy π is safe; (c) for each s ∈ S and each d such that p(s) ≤ d ≤ cap, there is a finite π-compatible path α^d with α = s_1 a_1 s_2 … s_n, s_1 = s, and s_n ∈ T, such that RL(α^d) = r_1, r_2, …, r_n never drops below p, that is, r_i ≥ p(s_i) for all 1 ≤ i ≤ n. The theorem then follows from parts (b) and (c) of this invariant and from Theorem 5.
Clearly, all parts of the invariant hold after the initialization on Lines 2 to 5. Part (a) of the invariant follows from the definitions of SV and HV. In particular, if p_old ≥ ml[S], then SV(p_old, s, a) ≥ AV(ml[S], s, a) ≥ ml[S](s) for all s and a. Part (b) follows from (a), as the action assigned to Φ on Line 14 is safe for s with p(s) units of the resource (again, due to p(s) = SV(p_old, s, a) ≥ AV(ml[S], s, a)); hence, only actions that are safe for the corresponding state and resource level are assigned to Φ. By Lemma 2, π is safe.
The proof of (c) is more involved. Assume that an iteration of the main repeat-loop was performed. Denote by π_old the strategy encoded by p and Φ from the previous iteration. Let s be any state such that p(s) ≤ cap. If p(s) = p_old(s), then (c) follows directly from the induction hypothesis: for each state q, Φ(q) was only redefined for values smaller than p_old(q), and thus the history witnessing (c) for π_old is also π-compatible.
The case where p(s) < p_old(s) is treated similarly. We denote by a the action selected on Line 9 and assigned to Φ(s) for p(s) on Line 14. By the definition of SV, there must be t ∈ Succ(s, a) such that HV(p_old, s, a, t) + γ(s, a) = SV(p_old, s, a) (which is equal to p(s) before the truncation on Line 11). In particular, it holds that l = lastRL((s a t)^{p(s)}) ≥ p_old(t) (even after the truncation). Then, by the induction hypothesis, there is a t-initiated finite path β witnessing (c) for π_old. The loaded history α^{p(s)} with α = (s a t)·β is (i) compatible with π and, moreover, (ii) such that RL(α^{p(s)}) never drops below p. Indeed, (i) Φ(s)(p(s)) = a (see Line 14), and (ii) Φ was only redefined for values lower than p_old, and thus π mimics π_old from t onward. For initial loads p(s) < d ≤ cap the same arguments apply. This finishes the proof of the invariant and thus also the proof of Theorem 6.

VI. BÜCHI: VISITING TARGETS REPEATEDLY
This section solves the almost-sure Büchi problem. As before, for the rest of this section we fix a CMDP C = (S, A, ∆, γ, R, cap) and a set T ⊆ S.
The solution builds on the positive reachability problem similarly to how the safety problem builds on the non-reloading reachability problem. In particular, we identify the largest set R' ⊆ R such that from each r ∈ R' we can safely reach R' again (in at least one step) while restricting ourselves only to safe strategies that (i) avoid R ∖ R' and (ii) guarantee positive reachability of T in C(R') from all r ∈ R'.
Intuitively, at each visit of R', such a strategy can attempt to reach T. With an infinite number of attempts, we reach T infinitely often with probability 1 (almost surely). Formally, we show that for a suitable R' we have ml[B]^{=1}_C = ml[R]^{>0}_{C(R')} (where C(R') denotes the CMDP defined as C with the exception that R' is the set of reload states).
Algorithm 4 identifies the suitable set R' using Algorithm 3, in a similar fashion as Algorithm 2 handles safety using Algorithm 1. In each iteration, we declare as non-reload states all states of R' from which positive reachability of T within C(R') cannot be guaranteed. This is repeated until we reach a fixed point. The number of iterations is clearly bounded by |R|.
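The fixed-point loop of Algorithm 4 then mirrors the one of Algorithm 2. A sketch, where pos_reach_levels stands for Algorithm 3 run on C(R'):

```python
def buchi_reloads(cmdp: CMDP, targets: Set[State], pos_reach_levels) -> Set[State]:
    """Largest R' within R from which T is positively reachable in C(R')
    with at most cap units of the resource."""
    good = set(cmdp.reloads)
    while True:
        b = pos_reach_levels(cmdp, good, targets)   # ml[R_T]^{>0} in C(R')
        bad = {r for r in good if b[r] > cmdp.cap}  # covers infinite values too
        if not bad:
            return good
        good -= bad
```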
Theorem 7. Upon termination of Algorithm 4, for the strategy σ_Φ encoded by Φ it holds that σ_Φ |=^b_{=1} B with b = ml[B]^{=1}_C. As a consequence, a polynomial-size finite counter strategy for the almost-sure Büchi problem can be computed in time polynomial with respect to the representation of C.
Proof. The complexity part follows from the fact that the number of iterations of the repeat-loop is bounded by |R| and from Theorems 5 and 6.
For the correctness part, we first prove that σ_Φ loaded with b(s) in a state s almost surely satisfies B in C(R'); then we argue that the same holds also for C; finally, we show that b ≤ ml[B]^{=1}_C (the converse inequality follows from the fact that every run of C(R') is also a run of C).

For the first part, note that b = ml[R]^{>0}_{C(R')} upon termination, and b(r) = 0 for all r ∈ R'. Therefore, there is θ > 0 such that upon every visit of some state r ∈ R' we have P^{σ_Φ}_{C(R'),r,0}(R) ≥ θ. As C(R') is decreasing, every safe infinite run created by σ_Φ in C(R') must visit R' infinitely many times. Hence, with probability 1 we reach T at least once. The argument can then be repeated from the first point of visit of T to show that with probability 1 we visit T at least twice, three times, etc., ad infinitum. By the monotonicity of probability, we get P^{σ_Φ}_{C(R'),s,d}(B) = 1 for every d ≥ b(s); as σ_Φ never attempts to reload in R ∖ R', the same holds in C.

Finally, assume for the sake of contradiction that there are a state s ∈ S, a load d < b(s), and a strategy σ such that σ loaded with d in s almost surely satisfies B in C. Then there must be at least one loaded history α^d created by σ such that α^d visits some r ∈ R ∖ R' before reaching T (otherwise d ≥ b(s)). Then either (a) ml[R]^{>0}_C(r) = ∞, in which case any σ-compatible extension of α^d avoids T; or (b) ml[R]^{>0}_{C(R')}(r) > cap, in which case there must be an extension of α that visits, between the visit of r and T, another r' ∈ R ∖ R' with r' ≠ r. We can then repeat the argument, eventually reaching case (a) or running out of the resource, a contradiction with σ |=^d_{s,C} S.

VII. ALMOST-SURE REACHABILITY
In this section, we solve the almost-sure reachability problem, that is, computing the vector ml[R_T]^{=1} and the corresponding witness strategy for a given set of target states T ⊆ S.

A. Reduction to Büchi
In the absence of resource constraints, reachability can be viewed as a special case of Büchi: we can simply modify the MDP so that playing any action in a target state t ∈ T results in looping in t, thus replacing reachability with an equivalent Büchi condition. In consumption MDPs, the transformation is slightly more involved, due to the need to "survive" after reaching T. Hence, for every CMDP C and a target set T we define a new CMDP B(C, T) so that solving B(C, T) w.r.t. the Büchi objective entails solving C w.r.t. the reachability objective. Formally, for C = (S, A, ∆, γ, R, cap) we have B(C, T) = (S', A, ∆', γ', R', cap), where S' = S ∪ {sink} for a fresh state sink and the differing components are defined as follows:
• we have ∆'(sink, a, sink) = 1 for each a ∈ A;
• we have R' = R ∪ {sink};
• for each t ∈ T and a ∈ A we have ∆'(t, a, sink) = 1 and ∆'(t, a, s) = 0 for all s ∈ S;
• for each t ∈ T and a ∈ A we have γ'(t, a) = ml[S]_C(t);
• we have γ'(sink, a) = 1 for each a ∈ A; and
• we have ∆'(s, a, t) = ∆(s, a, t) and γ'(s, a) = γ(s, a) for every s ∈ S ∖ T, every a ∈ A, and every t ∈ S.
We can easily prove the following: every witness strategy σ for the almost-sure Büchi objective in B(C, T) yields a witness strategy for almost-sure reachability of T in C. Consider a strategy π in C which, starting in some state s, mimics σ until some t ∈ T is reached (this is possible since B(C, T) only differs from C on T ∪ {sink}), and upon reaching T switches to mimicking an arbitrary safe strategy. Since σ reaches sink, and thus also T, almost surely, so does π. Moreover, since σ is safe, upon reaching some t ∈ T the current resource level is at least ml[S]_C(t), since consuming this amount is enforced in the next step. This is sufficient for π to prevent resource exhaustion after switching to a safe strategy.
It follows that ml[R_T]^{=1}_C ≤ ml[B]^{=1}_{B(C,T)}. The converse inequality can be proved similarly, by a straightforward conversion of a witness strategy for almost-sure reachability in C into a witness strategy for the almost-sure Büchi objective in B(C, T). Hence, we can solve almost-sure reachability for C by constructing B(C, T) and solving the latter for the almost-sure Büchi objective via Algorithm 4. The construction of B(C, T) can clearly be performed in time polynomial in the representation of C (using Algorithm 2 to compute ml[S]_C), hence almost-sure reachability can also be solved in polynomial time.
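The construction of B(C, T) is a local modification of C. A sketch with the CMDP class from Section II (the name of the sink state is assumed fresh):

```python
import copy

def buchi_reduction(cmdp: CMDP, targets: Set[State], ml_S: dict) -> CMDP:
    """Build B(C, T); assumes ml_S[t] is finite for every t in targets."""
    b = copy.deepcopy(cmdp)
    sink = "sink"                       # assumed not to clash with existing states
    b.states.add(sink)
    b.reloads.add(sink)                 # R' = R plus the sink
    for a in b.actions:
        b.delta[(sink, a)] = {sink: 1.0}       # the sink loops forever
        b.gamma[(sink, a)] = 1
        for t in targets:
            b.delta[(t, a)] = {sink: 1.0}      # targets lead to the sink
            b.gamma[(t, a)] = ml_S[t]          # forces a survivable level at t
    return b
```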

B. Almost-sure reachability without model modification
In practice, building B(C, T ) and translating the synthesized strategy back to C is inconvenient. Hence, we also present an algorithm to solve almost-sure reachability directly on C. The algorithm consists of a minor modification of the already presented algorithms.
To argue the correctness of the algorithm, we need a slight generalization of the MDP modification. We call a vector v ∈ N̄^S a sink vector for C if and only if, for every s ∈ S, either 0 ≤ v(s) ≤ ml[S]_C(s) or v(s) = ∞. By F(v) we denote the set {s ∈ S | v(s) < ∞} of states with a finite value of v, and we call each member of this set a sink entry. We say that v is a sink vector for T if F(v) = T. Given a CMDP C, a target set T, and a sink vector v for T, we define a new CMDP B(C, T, v) in exactly the same way as B(C, T), except for the fourth point: for every t ∈ T we put γ'(t, a) = v(t) for all a ∈ A. Note that B(C, T) = B(C, T, ml[S]_C). Then, we run the modified algorithm on C. The correctness can be argued similarly as for Algorithm 5: let R_i be the contents of R' in the i-th iteration of Algorithm 2 executed on B(C, T, v), and let R̂_i be the contents of R' in the i-th iteration of the modified algorithm executed on C. Clearly, sink ∈ R_i for all i. An induction on i shows that for all i we have R̂_i = R_i ∖ {sink}, so both algorithms terminate in the same iteration. The correctness then follows from the correctness of Algorithm 2. Now we can proceed to solve almost-sure reachability. Algorithm 6 combines (slightly modified) Algorithms 3 and 4 to compute ml[R_T]^{=1} and a corresponding witness rule selector.
Given a CMDP C = (S, A, ∆, γ, R, cap) and T ⊆ S, Algorithm 6 computes the vector ml[R_T]^{=1} directly on C, mirroring the computation of Algorithm 4 on B(C, T). The equivalence of the two computations follows immediately from the fact that in the latter computation, sink always stays in R'.

VIII. IMPROVING EXPECTED REACHABILITY TIME
The number of steps that a strategy needs on average to reach the target set T, called the expected reachability time (ERT), is a property of practical importance. For example, we expect that a patrolling unmanned vehicle visits all the checkpoints in a reasonable amount of time. The presented approach is purely qualitative (it ensures reachability with probability 1) and thus does not consider the number of steps at all. In this section, we introduce two heuristics that can improve the ERT: the goal-leaning heuristic and the threshold heuristic. These slight modifications of the algorithms produce strategies that can often hit T sooner than the strategies produced by the unmodified algorithms.

A. Expected reachability time
To formally define the ERT, we introduce a new objective: given i ∈ N, the objective R^{=i}_T contains each safe loaded run ϱ^d such that ϱ_i ∈ T and ϱ_j ∉ T for all 1 ≤ j < i (the set of all safe loaded runs ϱ^d such that the minimal j with ϱ_j ∈ T is equal to i). Finally, the expected reachability time for a strategy σ, an initial state s ∈ S, an initial load d ≤ cap, and a target set T ⊆ S is defined as

    ERT(σ, s, d, T) = Σ_{i∈N} i · P^σ_{s,d}(R^{=i}_T),

with ERT(σ, s, d, T) = ∞ whenever σ does not reach T almost surely.
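Since the ERT of a finite counter strategy is easy to estimate empirically, the following Monte Carlo sketch (in the spirit of the evaluation in Section IX; all names are ours) approximates it by simulation:

```python
import random

def simulate_ert(cmdp: CMDP, strategy, s: State, d: int, targets: Set[State],
                 runs: int = 10_000, horizon: int = 200) -> float:
    """Average number of steps until the first visit of T over sampled runs.
    strategy(state, level) performs the rule-selector lookup; runs that do
    not reach T within the horizon contribute the full horizon."""
    total = 0
    for _ in range(runs):
        state, level, steps = s, d, 0
        while state not in targets and steps < horizon:
            a = strategy(state, level)
            if state in cmdp.reloads:          # atomic reload before consuming
                level = cmdp.cap
            level -= cmdp.gamma[(state, a)]    # safe strategies keep this >= 0
            dist = cmdp.delta[(state, a)]
            state = random.choices(list(dist), weights=list(dist.values()))[0]
            steps += 1
        total += steps
    return total / runs
```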
The running example for this section is the CMDP in Fig. 5.

B. Goal-leaning heuristic
Actions to play in certain states with a particular amount of the resource are selected on Line 9 of Algorithm 3 (and on Line 15 of Algorithm 6) based on the actions' safe values SV. In the CMDP from Fig. 5, this value in s based on ml[R]^{=1} is equal to 2 for both actions a and b. Thus, Algorithm 3 (and also Algorithm 6) returns σ_a or σ_b randomly, based on the resolution of the arg min operator on Line 9 (Line 15 in Algorithm 6) for s. The goal-leaning heuristic fixes the resolution of the arg min operator to always pick a in this example.
The ordinary arg min operator randomly selects an action from the pool of actions with the minimal value v_min of the function SV for s (and the current values of p_old). Loosely speaking, the goal-leaning arg min operator instead chooses the action whose chance of reaching the desired successor used to obtain v_min is maximal among the actions in this pool.
The value SV is computed using the successors' hope values (HV); see Section V. The goal-leaning arg min operator records, when computing the HV values, also the transition probabilities of the desired successors. Let s be a state, let a be an action, and let t ∈ Succ(s, a) be the successor of a in s that minimizes HV(p_old, s, a, t) and maximizes ∆(s, a, t) (in this order). We denote by p_{s,a} the value ∆(s, a, t). The goal-leaning arg min operator chooses the action a in s that minimizes SV(p_old, s, a) and maximizes p_{s,a} (again in this order).
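A sketch of the goal-leaning arg min, reusing hope_value and safe_value from Section V (the lexicographic keys implement the two-level comparison):

```python
def goal_leaning_argmin(cmdp: CMDP, x: dict, ml_S: dict, s: State) -> Action:
    """Among actions minimizing SV, prefer the one whose desired successor
    (the HV-minimizer, ties broken by probability) is most likely."""
    def key(a: Action):
        t = min(cmdp.succ(s, a),
                key=lambda u: (hope_value(cmdp, x, ml_S, s, a, u),
                               -cmdp.delta[(s, a)][u]))   # desired successor
        p_sa = cmdp.delta[(s, a)][t]
        return (safe_value(cmdp, x, ml_S, s, a), -p_sa)
    return min(cmdp.actions, key=key)
```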
In the example from Fig. 5 we have p_{s,a} = 1 and p_{s,b} = 1/10 (as v is the desired successor) in the second iteration of the repeat-loop on Lines 6 to 15. In the last iteration, p_{s,a} remains 1 and p_{s,b} changes to 9/10, as the desired successor changes to r. In both cases, a is chosen by the goal-leaning arg min operator since p_{s,a} > p_{s,b}.
Correctness. We have only changed the behavior of the arg min operator in cases where multiple candidates exist. The correctness of our algorithms does not depend on this choice, and thus the proofs apply also to the variant with the goal-leaning operator.
While the goal-leaning heuristic is simple, it has a great effect in practical benchmarks; see Section IX. However, there are scenarios where it still fails. Consider now the CMDP in Fig. 6 with capacity at least 3. Note that now γ(s, b) equals 1 instead of 2. In this case, even the goal-leaning heuristic prefers b to a in s whenever the current resource level is at least 1. The reason for this choice is that SV(p_old, s, b) = 1 < 2 = SV(p_old, s, a) from the second iteration of the repeat-loop onward. Note also that the strategy σ_a that always plays a in s is not a witness strategy for ml[R]^{=1}, as σ_a needs at least 2 units of the resource in s. The desired strategy π should behave in s as follows: play a if the current resource level is at least 2 and play b otherwise. We have ERT(π, s, 2, {t}) = 2 and ERT(π, s, 1, {t}) = 3.8. In the next section, we extend the goal-leaning heuristic to produce π for the (updated) running example.

C. Threshold heuristic
The threshold heuristic is parametrized by a probability threshold 0 ≤ θ ≤ 1. Intuitively, when we compute the value of SV for b in s, we ignore the hope values of successors t ∈ Succ(s, b) with ∆(s, b, t) < θ. With θ = 0.2, the state v in our example is no longer considered a valid outcome of b in s in the second iteration, and thus SV(p_old, s, b) = ∞. Therefore, a is picked with SV(p_old, s, a) = 2. Only in the fourth iteration is the action b considered in s again: in this iteration, p_old(r) is 0 (r is a reload state) and, with ∆(s, b, r) = 0.9, the state r passes the threshold and we finally have SV(p_old, s, b) = 1. The resulting finite counter strategy is exactly the desired strategy π from above.
Formally, we parametrize the function SV by θ as

    SV_θ(x, s, a) = γ(s, a) + min{ HV(x, s, a, s') | s' ∈ Succ(s, a) and ∆(s, a, s') ≥ θ },

where we assume that the minimum of the empty set is equal to ∞. The new function SV_θ is a generalization of SV = SV_0. To implement this heuristic, we need, in addition to the goal-leaning arg min operator, to use SV_θ instead of SV in Algorithms 3 and 6.
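The change against safe_value from Section V is a single filter on the successors (a sketch; the minimum over an empty set becomes ∞):

```python
def safe_value_threshold(cmdp: CMDP, x: dict, ml_S: dict,
                         s: State, a: Action, theta: float) -> float:
    """SV_theta(x, s, a): ignore successors with probability below theta."""
    hvs = [hope_value(cmdp, x, ml_S, s, a, t)
           for t in cmdp.succ(s, a) if cmdp.delta[(s, a)][t] >= theta]
    return cmdp.gamma[(s, a)] + (min(hvs) if hvs else math.inf)
```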
There is, however, still one caveat introduced by the threshold. By ignoring some outcomes, the threshold heuristic might compute only an over-approximation of ml[R]^{>0}. As a consequence, the strategy σ computed by the heuristic might be incomplete; it might be undefined for a resource level from which the objective is still satisfiable.
In order to make σ complete and to compute ml[R]^{>0} precisely, we continue with the iterations, but now using SV_0 instead of SV_θ. To be more precise, we include Lines 6 to 15 of Algorithm 3 twice (and analogously for Algorithm 6), once with SV_θ and once with SV_0 (in this order).
This extra fixed-point iteration can complete σ and improve p to match ml[R]^{>0} using the rare outcomes ignored by the threshold. As a result, σ behaves according to the threshold heuristic for sufficiently high resource levels and, at the same time, it achieves the objective from every state-level pair where this is possible.
Correctness. The function SV_θ clearly over-approximates SV, as we only restrict the domain of the min operator. The invariant of the repeat-loop from the proof of Theorem 6 still holds even when using SV_θ instead of SV (and it obviously holds in the second loop with SV_0). The extra repeat-loop with SV_0 converges to the correct fixed point due to the monotonicity of p over the iterations. Thus, Theorems 5 and 6 hold even when using the threshold heuristic.

D. Limitations
The suggested heuristics naturally do not always produce strategies with the least possible ERT for a given CMDP, state, and initial load. Consider the CMDP in Fig. 7 with capacity at least 2. Both heuristics prefer (regardless of θ) b in s, since ∆(s, b, v) > ∆(s, a, v) = ∆(s, a, u). Such a strategy yields an ERT from s equal to 2 2/3, while the strategy that plays a in s achieves an ERT equal to 2. This non-optimality must be expected, as the presented algorithm is purely qualitative and does not convey the quantitative analysis that would be required to compute the precise ERT of strategies. However, there is no known polynomial (with respect to the CMDP representation) algorithm for the quantitative analysis of CMDPs that we could use here instead of our approach.
While other, perhaps more involved, heuristics might be invented to solve some particular cases, qualitative algorithms, which do not track the precise values of the ERT, naturally cannot guarantee optimality with respect to the ERT. The presented heuristics are designed to be simple (both in principle and in computational overhead) and to work well on systems with rare undesired events.
The threshold heuristic relies on a well-chosen threshold θ, which needs to be supplied by the user. Moreover, different thresholds work well for different models: typically, θ should be chosen higher than the probability of the most common rare events in the model. As the presented algorithms rely on the fact that the whole model is known, a suitable threshold might be inferred from the model automatically.
Despite these limitations, we show the utility of the presented heuristics on a case study in the next section.

IX. IMPLEMENTATION AND EVALUATION
We have implemented Algorithms 1 to 6, including the proposed heuristics, in a tool called FIMDP (Fuel in MDP). The rest of this section presents two numerical examples that demonstrate the utility of FIMDP in realistic environments. In particular, we first compare the speed of strategy synthesis via CMDPs performed by FIMDP with the speed of strategy synthesis via regular MDPs with energy constraints encoded in states, performed by STORM. The second example shows the impact of the heuristics from Section VIII on the expected reachability time. Jupyter notebooks at https://github.com/FiMDP/FiMDP-evaluation/tree/tac contain (not only) the scripts and instructions needed to reproduce the presented results.

A. Tools, examples, and evaluation setting
FIMDP is an open-source library for CMDPs. It is written in Python and is well integrated with interactive Jupyter notebooks [19] for the visualization of CMDPs and algorithms. It can also process models given in the PRISM [20] or JANI [21] languages.
STORM is an open-source, state-of-the-art probabilistic model checker designed to be efficient in terms of time and memory. STORM is written in C++ and STORMPY is its Python interface. The examples are based on models generated by FIMDPENV, a library of simulation environments for real-world resource-constrained problems that can be solved via CMDPs. Table I lists the homepages of these tools and the versions used to create the presented results.
We demonstrate the utility of CMDPs and FIMDP on high-level planning tasks for unmanned underwater vehicles (UUVs) operating in an ocean with stochastic currents. FIMDPENV models this scenario based on [22]. The model discretizes the area of interest into a 2D grid-world. A grid-world of size n consists of n × n cells, see Fig. 8 (left). Each cell in the grid-world forms one state in the corresponding CMDP; some of them are reload states, and some of them form the set of targets T. The set of actions consists of two classes: weak actions consume less energy but have stochastic outcomes, whereas strong actions have deterministic outcomes with the downside of a significantly higher resource consumption. For each class, the environment offers up to 8 directions (east, north-east, north, north-west, west, south-west, south, and south-east), see Fig. 8 (right). All experiments were performed on a PC with an Intel Core i7-8700 3.20 GHz 12-core processor and 16 GB RAM, running Ubuntu 18.04 LTS.

B. Strategy synthesis for CMDPs in FIMDP and STORM
We use the UUV environment from FIMDPENV to generate 15 strategy synthesis tasks with a Büchi objective. The complexity of a task is determined by the grid size (the number of cells on each side) and the capacity. We use grid sizes 10, 20, and 50. For each grid size n, we create five tasks with capacities equal to 1, 2, 3, 5, and 10 times n. We solve each task modeled as a CMDP using FIMDP, and modeled as a regular MDP with resource constraints encoded in states and actions using STORM. We express the qualitative Büchi property in PCTL [23] for STORM. Figure 9 presents the running times (averaged over 10 independent runs) needed for each task by FIMDP (•) and by STORM (×). There is one plot for each grid size, and a dot (x, y) indicates that the corresponding tool needed y seconds on average for the task with capacity x. We can observe that FIMDP outperforms STORM in terms of computation time in all test cases, with the exception of the smallest problems. For the small tasks, STORM benefits from its efficient implementation in C++. The advantage of FIMDP lies in the fact that the state space of CMDPs (and thus also the time needed for their analysis) does not grow with rising capacity.

C. Comparing heuristics for improving ERT
This section investigates the novel heuristics from a practical, optimal decision-making perspective. The test scenario is based on the UUV environment with grid size 20, a single reload state, and one target state. The objective of the agent is to reach the target almost surely. We consider four strategies generated for almost-sure reachability by the presented algorithms: the standard strategy (using the randomized arg min operator), and strategies generated using the goal-leaning arg min operator with thresholds θ equal to 0, 0.3, and 0.5. For each strategy, we run 10000 independent runs and measure the number of steps needed to reach the target. By averaging the collected data, we approximate the expected reachability time (ERT) of the strategies. Table II shows the average number of steps needed to reach the target by each of the strategies. The strategy built using the standard arg min operator did not reach the target within the first 200 steps in any of the 10000 trials. The goal-leaning arg min operator itself helps a lot to navigate the agent towards the goal; however, it still relies on rare events in some places. Setting θ = 0.3 helps to avoid these situations, as the unlikely outcomes are no longer considered, and, finally, θ = 0.5 forces the agent to use strong actions almost exclusively. While using thresholds led to a better ERT in this particular environment, the result might not hold in general. The best choice of threshold depends solely on the environment model and the exact probabilities of outcomes.

X. CONCLUSION & FUTURE WORK
We presented consumption Markov decision processes, models for stochastic environments with resource constraints, and we showed that strategy synthesis for qualitative objectives is efficient. In particular, our algorithms that solve synthesis for the almost-sure reachability and almost-sure Büchi objectives in CMDPs work in time polynomial with respect to the representation of the input CMDP. In addition, we presented two heuristics that can significantly improve the expected time needed to reach a target in realistic examples. These heuristics improve the utility of the presented algorithms for planning in stochastic environments under resource constraints. The experimental evaluation of the suggested methods confirmed that the direct analysis of CMDPs in our tool is faster than the analysis of an equivalent MDP, even when the latter is performed by the state-of-the-art tool STORM (with the exception of very small models).
Possible directions for future work include extensions to quantitative analysis (e.g., minimizing the expected resource consumption or the reachability time), stochastic games, or partially observable settings.
František Blahoudek is a postdoctoral researcher at the Faculty of Information Technology, Brno University of Technology, Czech Republic. He was a postdoctoral researcher in the group of Ufuk Topcu at the University of Texas at Austin. He received his Ph.D. degree from the Masaryk University, Brno in 2018. His research focuses on automata in formal methods and on planning under resource constraints.
Petr Novotný is an assistant professor at the Faculty of Informatics, Masaryk University, Czech Republic. He received his Ph.D. degree from Masaryk University in 2015. His research focuses on the automated analysis of probabilistic programs, the application of formal methods in the domains of planning and reinforcement learning, and the theoretical foundations of probabilistic verification.
Melkior Ornik is an assistant professor in the Department of Aerospace Engineering and the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. He received his Ph.D. degree from the University of Toronto in 2017. His research focuses on developing theory and algorithms for learning and planning of autonomous systems operating in uncertain, complex, and changing environments, as well as in scenarios where only limited knowledge of the system is available.
Pranay Thangeda is a graduate student in the Department of Aerospace Engineering and the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. He received his M.S. degree in Aerospace Engineering from the University of Illinois at Urbana-Champaign in 2020. His research focuses on developing algorithms that exploit side information for efficient planning and learning in unknown environments.
Ufuk Topcu is an associate professor in the Department of Aerospace Engineering and Engineering Mechanics and the Oden Institute at The University of Texas at Austin. He received his Ph.D. degree from the University of California at Berkeley in 2008. His research focuses on the theoretical, algorithmic, and computational aspects of design and verification of autonomous systems through novel connections between formal methods, learning theory and controls.