Model Predictive Control in Partially Observable Multi-Modal Discrete Environments

Autonomous systems operate in environments that can be observed only through noisy measurements. Thus, controllers should compute actions based on their beliefs about the surroundings. In these settings, we design a Model Predictive Controller (MPC) based on a continuous-state Linear Time-Invariant (LTI) system model operating in a discrete-state environment described by a Hidden Markov Model (HMM). Environment constraints are modeled as chance constraints and environment observations can be asynchronous with system state measurements and controller updates. We show how to approximate the solution of the MPC problem defined over the space of feedback policies by optimizing over a trajectory tree, where each branch is associated with an environment measurement. The proposed approach guarantees chance constraint satisfaction and recursive feasibility. Finally, we test the proposed strategy on navigation examples in partially observable environments, where the proposed MPC guarantees chance constraint satisfaction.

associated with the state estimate. These bounds should then be leveraged in a robust control design to guarantee constraint satisfaction [3].
When uncertainties are multi-modal, the optimal control policy minimizing the expected cost can be computed by modeling the problem using a Partially Observable Markov Decision Process (POMDP), which is a decision-making formalism to jointly model estimation and control problems. Unfortunately, solving POMDPs is computationally intractable, even for systems with discrete state and action spaces [4]. For this reason, several strategies have been proposed in the literature to approximate the solution to POMDPs [5], [6], [7], [8], [9].
We focus on Mixed-Observable Markov Decision Processes (MOMDPs), where perfect state feedback is unavailable only for a subset of the state variables [10]. In particular, we focus on control problems where perfect state feedback is available for the system state, whereas only partial noisy discrete measurements are available to estimate the environment mode. These settings are common in several practical applications such as autonomous driving and robot navigation, where it is often possible to compute a reasonably accurate estimate of the vehicle state, but it is hard to estimate the state of the surroundings, which is multi-modal. For instance, in autonomous driving the environment state could encode the intentions of other vehicles or pedestrians, e.g., the intent of drivers to perform lane-change or lane-keeping maneuvers [11], [12], [13].
Several strategies have been proposed for the control design of autonomous systems operating in partially observable environments [11], [12], [13], [14], [15], [16], [17], [18], [19]. These approaches leverage a Model Predictive Controller (MPC) which solves an optimization problem over a trajectory tree. Each branch of the tree is associated with either a sensor measurement, a disturbance realization, or an environment mode; thus such a trajectory tree encodes a policy. The resulting MPC policy computes control actions to influence the environment or to gather sensor measurements that can be used for inference [13], [14], [15], [16].
In this letter, we model the environment evolution and the sensor accuracy using a Hidden Markov Model (HMM). Then, we design an MPC policy that optimizes a trajectory tree constructed based on the environment's HMM and the current belief. Our contribution is twofold. First, we show how to construct a trajectory tree that guarantees chance constraint satisfaction. Compared to previous works, we update the constraint enforced at each branch of the tree based on the environment belief and the imposed chance constraints. In particular, we design Algorithm 1 to compute a set of constraints that guarantees chance constraint satisfaction. As shown in the results section, the proposed strategy guarantees chance constraint satisfaction, while standard scenario MPC approaches fail. Second, we show that our MPC design guarantees recursive feasibility. To guarantee recursive feasibility in the case of asynchronous observations and chance constraints, we design an MPC problem where the optimization is defined over a trajectory tree, where each branch is associated with an observation sequence and a time-varying set of constraints. Finally, we test our strategy on a navigation example, where the environment state is unknown to the controller. We show that our MPC guarantees chance constraint satisfaction and recursive feasibility, even when only noisy environment measurements are available.
Notation: We denote the ith element of a vector v ∈ R^n as v[i]. For a function Z : R^n → R and a vector v ∈ R^n, we denote by Z(v) the value of the function Z at the point v. Furthermore, for a vector v, we define the function Sort(v) sorting the elements of v in descending order and the function ArgSort(v) returning the indices that would sort the vector, i.e., v[ArgSort(v)] = Sort(v). Given a set S ⊂ R^n, we define its complement as S^c = R^n \ S and its cardinality as |S|. The set of nonnegative integers is denoted as Z_{0+} = {0, 1, 2, . . .}, and the set of nonnegative reals as R_{0+} = [0, ∞). Finally, we use the symbol ∅ to denote the empty set.
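For concreteness, the Sort/ArgSort convention above can be checked with a short NumPy snippet (an illustrative sketch; NumPy is not part of the letter):

```python
import numpy as np

def sort_desc(v):
    """Sort(v): elements of v in descending order."""
    return np.sort(v)[::-1]

def argsort_desc(v):
    """ArgSort(v): indices that would sort v in descending order."""
    return np.argsort(v)[::-1]

v = np.array([0.2, 0.5, 0.3])
# The defining identity v[ArgSort(v)] = Sort(v):
assert np.array_equal(v[argsort_desc(v)], sort_desc(v))
```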

II. PROBLEM FORMULATION A. System and Environment Models
We consider the following linear time-invariant system:

x_{t+1} = A x_t + B u_t, (1)

where x_t ∈ R^n and u_t ∈ R^m denote the state and the control input at time t. The above system operates in an environment represented by partially observable discrete states. We model the environment evolution over the discrete state space E as a Hidden Markov Model (HMM) with environment states e_t ∈ E and discrete observations o_t ∈ O, defined by the transition probabilities P(e_{t+1} = e' | e_t = e) and the observation likelihoods P(o_t = o | e_t = e). As the environment state e_t is partially observable, it is common practice to introduce the following belief vector [21]:

b_t[e] = P(e | o_t, x_t, b_0), ∀e ∈ E, (2)

where each element b_t[e] represents the posterior probability that the state of the environment e_t equals e ∈ E, given the observation vector o_t ∈ O^k collecting the k observations stored up to time t, the system trajectory x_t ∈ R^{n×(t+1)}, and the belief vector b_0 at time t = 0. We recall that the belief vector is a sufficient statistic and it can be recursively updated using Bayes' rule [21]. System (1) is subject to the input and state constraints:

u_t ∈ U and P(x_t ∈ X(e_t) | b_t) ≥ 1 − ε, ∀t ∈ {0, 1, . . .}, (3)

for a user-specified risk level ε ∈ [0, 1). Notice that at each time t the constraint set X(e_t) is a function of the partially observable environment state e_t, which is not known at execution time. For this reason, the above chance constraint is conditioned on the environment belief b_t.

B. Control Objectives
Our goal is to design a control policy π mapping the state x_t ∈ R^n and the environment belief b_t ∈ B to a control action u_t ∈ R^m, i.e.,

u_t = π(x_t, b_t). (4)

The above policy (4) in closed-loop with system (1) should guarantee that the input and state constraints (3) are satisfied. Throughout this letter we make the following assumptions.
Assumption 1: The input and state constraint sets U and X(e) are compact sets containing the origin for all e ∈ E.
Assumption 2: During the control task, we collect K environment observations. Furthermore, we know the time steps t_1, . . . , t_K at which these K observations are collected. Thus, we introduce the following set collecting these time instances:

T_obs = {t_1, . . . , t_K}. (5)

Our problem is motivated by the navigation task shown in Figure 1, where a drone has to fly from an origin to a destination while avoiding windy areas. In this example, the system state x_t represents the position of the drone, while the environment state e_t represents the location of the windy area. Such a location is unknown, but we know that it may be either in region #1 (blue ellipse) or region #2 (red ellipse). In this example, a robust plan (blue trajectory) would simply avoid both possible windy areas. On the other hand, a policy based on observations about the wind location would first fly the drone toward the windy regions and then adjust its trajectory based on the measurements (green tree of trajectories).

Fig. 1. Navigation task where a drone has to plan a route without knowing the exact location of the windy region, which is inferred during navigation via noisy measurements.

III. CONTROL DESIGN
This section describes the proposed control strategy. First, we introduce an MPC policy that meets the design requirements from Section II-B, but it requires solving an infinite-dimensional chance-constrained optimization problem. Afterward, we propose a finite-dimensional approximation. The properties of this approximation are discussed in Section IV.

A. Belief Update
In this section, we present the belief update equation. The belief vector from (2) is a sufficient statistic for an HMM and it can be recursively computed based on the collected observations [21]. As discussed in Assumption 2, the time instances at which observations are collected are known and stored in the set T_obs. Given such a set of time instances, we write the belief update as follows [21]:

b_{t+1}[e'] = ( W(o_{t+1}, e') Σ_{e∈E} T(e, e') b_t[e] ) / ( Σ_{ē∈E} W(o_{t+1}, ē) Σ_{e∈E} T(e, ē) b_t[e] ), if t + 1 ∈ T_obs, (6)

and

b_{t+1}[e'] = Σ_{e∈E} T(e, e') b_t[e], if t + 1 ∉ T_obs,

where

T(e, e') = P(e_{t+1} = e' | e_t = e) (7)

and

W(o, e) = P(o_t = o | e_t = e) (8)

denote the transition probabilities and the observation likelihoods of the HMM.
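The recursion above is the standard HMM Bayes filter and can be sketched in a few lines of NumPy (an illustrative sketch, not the letter's implementation; the matrices T and W below are placeholder values):

```python
import numpy as np

def belief_update(b, T, W=None, obs=None):
    """One-step HMM belief update (Bayes' rule [21]).

    b:   current belief, b[e] = P(e_t = e | observations so far)
    T:   transition matrix, T[e, e'] = P(e_{t+1} = e' | e_t = e)
    W:   observation likelihoods, W[e, o] = P(o_t = o | e_t = e)
    obs: index of the collected observation, or None if t+1 not in T_obs
    """
    b_pred = T.T @ b                 # prediction through the transition model
    if obs is None:
        return b_pred                # no observation at this time step
    b_corr = W[:, obs] * b_pred      # measurement correction
    return b_corr / b_corr.sum()     # normalization (Bayes' rule)

# Illustrative two-mode environment (placeholder numbers).
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])
W = np.array([[0.6, 0.4],            # sensor correct with probability 0.6
              [0.4, 0.6]])
b = belief_update(np.array([0.5, 0.5]), T, W, obs=0)  # → [0.6, 0.4]
```

Starting from a uniform belief, one observation favoring mode 0 with accuracy 0.6 shifts the belief to [0.6, 0.4], matching a direct Bayes computation.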

B. The MPC Optimization Problem
Given the system state x_t and the environment belief b_t at time t, we introduce the following MPC optimization problem:

J_∞(x_t, b_t) = min_{π_t} E[ Σ_{k=t}^{t+N−1} h(x_{k|t}, u_{k|t}) + V(x_{t+N|t}) ]
subject to  x_{k+1|t} = A x_{k|t} + B u_{k|t},
            u_{k|t} = π_{k|t}(x_{k|t}, b_{k|t}) ∈ U,
            P(x_{k|t} ∈ X(e_{k|t}) | b_{k|t}) ≥ 1 − ε,
            b_{k+1|t} given by the belief update (6),
            x_{t+N|t} ∈ X_F, x_{t|t} = x_t, b_{t|t} = b_t,
            ∀k ∈ {t, . . . , t + N − 1}, (9)

where π_t = {π_{t|t}, . . . , π_{t+N−1|t}} and the expectation is taken over the predicted observations. In the above problem, h : R^n × R^m → R and V : R^n → R represent the stage cost and the terminal cost, respectively. Furthermore, the terminal constraint set X_F satisfies the following assumption:

Assumption 3: The terminal constraint set X_F ⊂ X(e) for all e ∈ E is a control invariant set, i.e., for all x ∈ X_F there exists a u ∈ U such that Ax + Bu ∈ X_F.

In problem (9), the variable x_{k|t} indicates the predicted state at time k for a prediction computed at time t. The same notation is used for the control action u_{k|t}, the belief vector b_{k|t}, the observation o_{k|t}, and the environment state e_{k|t}. Note that if k ∉ T_obs, we have that o_{k|t} = ∅. Thus, at the predicted time k, the policy π_{k|t} maps the predicted state x_{k|t} and belief b_{k|t} to the control action u_{k|t}.
Solving problem (9) is challenging for two reasons: (i) the optimization is defined over the space of feedback policies {π_{t|t}, . . . , π_{t+N−1|t}}, which are continuous functions with uncountably many degrees of freedom, rendering the optimization problem infinite-dimensional; and (ii) the predicted system states are subject to chance constraints. To overcome these challenges, in the next section we first rewrite the above problem as an optimization over a tree of trajectories. Then, we leverage this reformulation to approximate the chance constraints. The proposed reformulation builds upon our previous work [16], which did not consider constraint sets that change as a function of the environment state.

C. Finite-Dimensional Reformulation
In this section, we reformulate the chance-constrained optimization problem (9) as a finite-dimensional problem. First, we introduce the observation vector o_{t:t+N} collecting the observations from time t to time t + N, i.e.,

o_{t:t+N} = [o_{t_k}, . . . , o_{t_j}], (10)

where t_k and t_j are the time steps at which the kth and jth observations are collected. Without loss of generality, we assume that t ≤ t_k < . . . < t_j ≤ t + N. Let M_{t:t+N} be the number of observations collected from time t to t + N; then there are |O|^{M_{t:t+N}} possible observation sequences, which we denote as

o^1_{t:t+N}, . . . , o^{S_{t:t+N}}_{t:t+N}, with S_{t:t+N} = |O|^{M_{t:t+N}}. (11)

Leveraging the S_{t:k} = |O|^{M_{t:k}} observation sequences from (10)-(11), we define the finite-dimensional optimization problem:

J_f(x_t, b_t) = min Σ_{i=1}^{S_{t:t+N}} [ Σ_{k=t}^{t+N−1} (1^T v^i_{k|t}) h(x^i_{k|t}, u^i_{k|t}) + (1^T v^i_{t+N|t}) V(x^i_{t+N|t}) ], (12)

subject to, for each branch i ∈ {1, . . . , S_{t:t+N}} and each k ∈ {t, . . . , t + N − 1}: the system dynamics x^i_{k+1|t} = A x^i_{k|t} + B u^i_{k|t}; the input and chance constraints u^i_{k|t} ∈ U and P(x^i_{k|t} ∈ X(e^i_{k|t}) | b^i_{k|t}) ≥ 1 − ε, with b^i_{k|t} = v^i_{k|t}/(1^T v^i_{k|t}); the terminal constraint x^i_{t+N|t} ∈ X_F; the unnormalized belief update (12c); the initial conditions x^i_{t|t} = x_t and v^i_{t|t} = b_t; and the non-anticipativity constraints u^i_{k|t} = u^j_{k|t} whenever the observation sequences o^i and o^j coincide up to time k. In (12), v^i_{k|t} is the unnormalized belief vector and constraint (12c) represents the unnormalized belief update equation:

v^i_{k+1|t}[e'] = W(o^i_{k+1}, e') Σ_{e∈E} T(e, e') v^i_{k|t}[e], if k + 1 ∈ T_obs, and v^i_{k+1|t}[e'] = Σ_{e∈E} T(e, e') v^i_{k|t}[e] otherwise, (12c)

where T and W are defined as in (7)-(8). The unnormalized belief vector is initialized using the belief b_t and it allows us to rewrite the expectation as a summation [16, Proposition 1].
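The role of the unnormalized belief can be illustrated with a short sketch (illustrative Python with placeholder transition and observation matrices, not the letter's implementation): propagating v^i along each observation branch without normalizing makes the branch weight 1^T v^i equal the probability of that branch's observation sequence, so the branch weights sum to one and the expected cost becomes a weighted sum over branches.

```python
import numpy as np

T = np.array([[0.9, 0.1],   # illustrative transition matrix
              [0.1, 0.9]])
W = np.array([[0.6, 0.4],   # illustrative observation likelihoods
              [0.4, 0.6]])

def propagate(v, obs):
    """Unnormalized belief update along one branch (cf. (12c))."""
    v = T.T @ v             # prediction step
    if obs is not None:
        v = W[:, obs] * v   # observation correction, no normalization
    return v

b0 = np.array([0.5, 0.5])
# Two observations -> |O|^2 = 4 branches; the weight of branch i is 1^T v^i.
weights = []
for o1 in (0, 1):
    for o2 in (0, 1):
        v = propagate(propagate(b0, o1), o2)
        weights.append(v.sum())

# The branch weights are the observation-sequence probabilities and sum to 1,
# so an expectation over observations is a weighted sum over the branches.
assert np.isclose(sum(weights), 1.0)
```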
Proposition 1: For all x ∈ R^n and b ∈ B, we have that J_∞(x, b) = J_f(x, b), where J_∞ and J_f denote the optimal costs of problems (9) and (12), respectively. Moreover, π*_{t|t}(x_t, b_t) = u^{1,*}_{t|t}.

Proof: As the predicted belief b_{k|t} is defined by the belief b_t and the M_{t:k} observations collected from time t to time k, the policy π_{k|t} is evaluated at most |O|^{M_{t:k}} times. Thus, optimizing over the set of policies from (9) is equivalent to optimizing over the set of control actions from (12). Finally, as the expected cost in (9) is equivalent to the cost function in (12) [16, Proposition 1], we conclude that J_∞(x, b) = J_f(x, b) and π*_{t|t}(x_t, b_t) = u^{1,*}_{t|t} for all x ∈ R^n and b ∈ B.

The key assumption leveraged by the proposed reformulation is that the HMM is defined over a set of discrete states. This allows us to reformulate the expectation as a summation and the policy as a finite set of control actions. Thus, no assumption on the cost function is required.

D. Chance Constraint Reformulation
We now present the chance constraint approximation strategy. For each predicted time k and observation sequence i, we use Algorithm 1 to compute a set of environment states C^i_{k|t} such that P(e^i_{k|t} ∈ C^i_{k|t} | b^i_{k|t}) ≥ 1 − ε. Then, we leverage such a set to reformulate the chance constraint from problem (12).
In Algorithm 1, we first compute the predicted belief b^i_{k|t}. Then, we sort the belief vector (line 6) and compute the vector e_sort collecting the environment states sorted in descending order of belief (line 7), i.e., b^i_{k|t}[e_sort[j]] = Sort(b^i_{k|t})[j] for all j ∈ {1, . . . , |E|}. See Section I for further details on the notation. In line 8, we initialize the scalar p_env that keeps track of the probability that e^i_{k|t} ∈ C^i_{k|t}, i.e., p_env = P(e^i_{k|t} ∈ C^i_{k|t}). Finally, we append e_sort[j] to the set C^i_{k|t} until the probability that e^i_{k|t} belongs to C^i_{k|t} is at least 1 − ε. Given the sets C^i_{k|t} computed with Algorithm 1, we introduce the following finite time optimal control problem:

minimize the cost of problem (12) subject to the constraints of problem (12), where the chance constraint is replaced by the deterministic constraints x^i_{k|t} ∈ X(e), ∀e ∈ C^i_{k|t}. (13)

Given the optimal solution {u^{1,*}_{t|t}, . . . , u^{S_{t:t+N},*}_{t+N−1|t}} from the above problem, we define the MPC policy as:

π_MPC(x_t, b_t) = u^{1,*}_{t|t}. (14)

Next, we show that the above policy is recursively feasible and guarantees that the state and input constraints are satisfied.

IV. PROPERTIES
First, we show that the policy (14) is recursively feasible.

Theorem 1: Consider system (1) in closed-loop with the MPC policy (14). Let Assumptions 1-3 hold and assume that problem (13) is feasible at time t = 0. Then, problem (13) is feasible at all times t ∈ {0, 1, . . .} and the closed-loop system satisfies the input and state constraints.

Proof: Assume that at time t the belief and control sequences (15)-(16) are the optimal solution from problem (13), and consider the candidate solution (17) at time t + 1 obtained by shifting each optimal branch and appending a control action ū^i satisfying A x^{i,*}_{t+N|t} + B ū^i ∈ X_F. Note that the existence of such a control action is guaranteed by Assumption 3. Furthermore, from definition (16), the shifted predicted beliefs satisfy b^j_{k|t+1} = b^i_{k|t} for all k, which in turn implies that C^j_{k|t+1} = C^i_{k|t}. Thus, the tree of trajectories associated with the predicted candidate input from (17) satisfies the state and input constraints. Finally, from Assumption 3, we have that X_F ⊂ X(e) for all e ∈ E, which implies that for all x ∈ X_F and b ∈ B we have P(x ∈ X(e_t) | b) = 1. This fact, together with the feasibility of the optimal solution (15), implies that (17) is a feasible solution for problem (13) at time t + 1. Thus, π_MPC(x_t, b_t) = u^{1,*}_{t|t} ∈ U at all times.
In Proposition 1, we showed that the infinite-dimensional problem (9) is equivalent to the finite-dimensional chance constraint problem (12), which is still challenging to solve. Next, we show that the optimal solution from problem (13) is feasible for problem (12), i.e., the chance constraint problem (12) can be approximated by solving (13).
Proposition 2: An optimal solution to problem (13) is a feasible solution for problem (12).
Proof: Let (15) be an optimal solution to problem (13). By construction, (15) satisfies all constraints of problem (12) except possibly the chance constraint, as problems (12) and (13) differ only in how the state constraints are enforced. Furthermore, by Algorithm 1, x^{i,*}_{k|t} ∈ X(e), ∀e ∈ C^i_{k|t} implies that P(x^{i,*}_{k|t} ∈ X(e^i_{k|t}) | b^{i,*}_{k|t}) ≥ 1 − ε, which leads to the desired result.
Note that the result from Proposition 2 and the recursive feasibility from Theorem 1 imply that the chance constraint from (3) is satisfied at all times.

V. NAVIGATION EXAMPLE
We consider an LTI system of the form (1), where at each time t the state x_t ∈ R^4 collects the position and velocity of the drone, and the input u_t = [a^x_t, a^y_t] collects the accelerations, which are subject to saturation constraints, i.e., u_t ∈ U = {u ∈ R^m : ||u||_∞ ≤ 20} for all t ∈ {0, 1, . . .}.
For the initial condition x_0 = [−4, 0, 0, 0] and b_0 = [0.5, 0.5], we tested the proposed strategy on a navigation task where a drone has to reach a goal state x_goal = [14, 0, 0, 0], while avoiding with high probability a windy region X_wind. The MPC problem with horizon N = 22 is solved with CasADi [22], using the stage cost h(x, u) = 0.1||x − x_goal||²₂ + ||u||²₂ and the terminal cost V(x) = 10³||x − x_goal||²₂. The exact location of the windy region is only partially known and has to be inferred from partial noisy observations. In particular, we know that the center of the windy region may be either at loc_0 = [7, −0.2] or at loc_1 = [6, 0.2]. We design a controller that avoids the windy region X_wind with high probability by enforcing the following chance constraint:

P(x_t ∉ X_wind(e_t) | b_t) ≥ 1 − ε, (19)

for all t ∈ {0, 1, . . .}. In the above chance constraint, each element of the two-dimensional belief vector b_t represents the probability of the center of the windy region being at location loc_0 or loc_1. The belief is computed based on the noisy observations collected at times t_1 = 4 and t_2 = 8. At time t_1 the sensor returns an observation that is correct with probability 0.6, while at time t_2 the probability of the observation being correct is 0.75. Notice that the accuracy of the sensor increases over time, as it would in a real scenario where the drone gets closer to the area of interest. Furthermore, we assume to know the region where the measurements can be taken. We performed the control task 1000 times, randomly sampling the wind location and the noisy observations collected by the controller. Out of these 1000 trials, the controller flew the drone over the windy region only 106 times. Thus, we verify that the chance constraint is empirically satisfied for the closed-loop system. We emphasize that the controller does not know the exact location of the windy area, and the control action is computed based on noisy observations and the known sensor accuracy.
TABLE I: Comparison with a scenario MPC approach.
TABLE II: Computational time.

Figure 2 shows the closed-loop trajectories for all possible wind locations and noisy observations. Notice that in the scenarios from Figures 2(b), 2(c), 2(f), and 2(g), the controller receives contradicting observations about the wind location, thus it decides to avoid both regions where the wind may be located. Indeed, when the observations collected at times t_1 and t_2 are not in agreement, the controller does not have a strong belief about the wind location, and to satisfy the chance constraint (19) it is forced to avoid both regions. On the other hand, when the two observations are in agreement, the controller decides to fly over one of the possible windy areas, as shown in Figures 2(a), 2(d), 2(e), and 2(h). It is important to underline that, as the sensor accuracy is 0.6 at time t_1 and 0.75 at time t_2, there is a low probability that both measurements are incorrect and that the controller flies over the windy region. Figure 3 shows the planned tree of trajectories at times t ∈ {1, 5, 9} for an experiment where o_{t_1} = loc_0 and o_{t_2} = loc_0. Note that observations about the wind location are collected at times t ∈ T_obs = {t_1 = 4, t_2 = 8}. Thus, for t < t_1 the controller plans a trajectory tree that branches twice, as the controller will behave differently as a function of the collected observations (Fig. 3(a)). For t_1 < t < t_2, the controller plans a trajectory that branches once, as only one observation will be collected in the future (Fig. 3(b)). Finally, for t > t_2 the controller plans a single trajectory (Fig. 3(c)). This example shows that the tree of trajectories encodes a policy where each branch represents how the closed-loop system would evolve depending on the collected observations.
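The qualitative behavior described above can be reproduced with a back-of-the-envelope Bayes computation (an illustrative sketch; it assumes the wind location is static, so the belief update reduces to multiplying likelihoods and normalizing):

```python
import numpy as np

def posterior(b0, observations):
    """Belief over the two wind locations after a list of
    (observed_location, sensor_accuracy) pairs, assuming a static
    environment (identity transition matrix)."""
    b = np.array(b0, dtype=float)
    for obs, acc in observations:
        # Likelihood of the observation under each hypothesis.
        lik = np.array([acc if obs == e else 1.0 - acc for e in (0, 1)])
        b = lik * b
        b /= b.sum()               # Bayes-rule normalization
    return b

b0 = [0.5, 0.5]
agree = posterior(b0, [(0, 0.6), (0, 0.75)])       # both point to loc_0
contradict = posterior(b0, [(0, 0.6), (1, 0.75)])  # contradicting readings
```

With agreeing observations the belief concentrates on one location (about [0.82, 0.18]), so flying over the other region satisfies the chance constraint; with contradicting observations the belief stays spread (about [0.33, 0.67]), which is why the controller must avoid both regions.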
Most importantly, we notice that each branch satisfies different constraints, i.e., the planned trajectory avoids either wind location #1 (red ellipse), wind location #2 (green ellipse), or both regions. These constraints are computed via Algorithm 1 and they allow us to guarantee chance constraint satisfaction.
We compare the proposed approach with a scenario MPC, where the optimization is carried out over a trajectory tree and each branch considers only the constraint associated with a single environment mode, as in [11], [13]. Table I shows the percentage of constraint violations over 1000 random simulations. Notice that only the proposed approach empirically satisfies the chance constraint (19). Finally, in Table II we report the computational times.

VI. CONCLUSION
We presented an MPC design for autonomous systems operating in partially observable discrete environments. First, we reformulated the MPC problem as a finite-dimensional optimization problem over a trajectory tree. Then, we presented an algorithm to compute the constraints enforced at each branch of the tree. We demonstrated that our approach guarantees recursive feasibility and chance constraint satisfaction.