Abstraction-Based Planning for Uncertainty-Aware Legged Navigation

This article addresses the problem of temporal-logic-based planning for bipedal robots in uncertain environments. We first propose an Interval Markov Decision Process abstraction of bipedal locomotion (IMDP-BL). Motion perturbations from multiple sources of uncertainty are incorporated into our model using stacked Gaussian process learning in order to achieve formal guarantees on the behavior of the system. We consider tasks which can be specified using Linear Temporal Logic (LTL). Through a product IMDP construction combining the IMDP-BL of the bipedal robot and a Deterministic Rabin Automaton (DRA) of the specifications, we synthesize control policies which allow the robot to safely traverse the environment, iteratively learning the unknown dynamics until the specifications can be satisfied with satisfactory probability. We demonstrate our methods with simulation case studies.

Legged locomotion and navigation have been extensively studied in recent years due to the accessibility of reliable, highly-agile quadrupedal [1], [2] and bipedal [3], [4] robotic platforms. Compared to more widely adopted mobile robot and drone platforms, legged robots offer the promise of high mobility in difficult terrain with additional manipulation capabilities for material transportation.
Numerous planning works for legged robots have focused on optimization methods for locally stable control [5], [6], [7]. These methods are primarily concerned with formulating tracking controllers which allow legged robots to track desired motion trajectories formulated by higher-level task planners, especially in the presence of motion perturbations.
In the field of formal methods, there has been much work on using Linear Temporal Logic (LTL) specifications for task and motion planning (TAMP) [8], [9], [10], [11], [12], [13], [14]. Formal methods approaches to TAMP allow for formal guarantees on task satisfaction and offer expressive semantic languages to represent broad classes of TAMP objectives.
Many studies have focused on mobile robot navigation in partially observable domains through exploration [15], [16], re-synthesis when encountering unexpected obstacles [17], [18], and receding-horizon planning [19]. However, these existing approaches are better suited for guaranteeing successful navigation and collision avoidance in environments with simple robot dynamics such as point-mass mobile vehicles. Legged navigation in complex environments while incorporating safety-critical locomotion kinematics and dynamics is largely an open research question.
In this work, we apply temporal-logic-based planning techniques onto legged navigation in order to synthesize control policies for legged robots which enforce the satisfaction of complex tasks in uncertain environments. From a formal methods perspective, our work demonstrates the applicability of abstraction-based verification and synthesis techniques onto legged robotic systems with complex dynamics. With respect to the locomotion community, our work advances the use of formal methods as a framework to enable robots to execute a broad class of navigation tasks with formal guarantees on safety and satisfiability.

A. RELATED WORK
High-level task planning for legged robots has not been widely explored, with many works focusing on developing a reduced-order-model-based motion planner which designs optimal locomotion trajectories [20], [21], [22], [23], [24], [25]. One major difficulty in applying high-level task planning approaches to legged locomotion problems is the need for reduced-order dynamical models which enable computationally tractable planning while capturing the full-body kinodynamics of the robot [26], [27]. One high-level approach which has seen success is the use of Markov Decision Process (MDP) planners for control policy synthesis [28], [29]. These MDP-based approaches abstract the motion trajectories of the robot by planning using a subset of the robot's states which approximate the low-level states.
In the formal methods community, Interval Markov Decision Process (IMDP) [30] abstraction-based modeling has shown potential for enabling planning for systems with complex dynamics [31], [32], [33], [34], [35]. As opposed to standard MDPs, IMDPs have transition probability intervals, allowing for the modeling of stochastic or uncertain dynamics [36], [37], [38]. Multiple approaches have been proposed in the literature for the problem of control policy synthesis for IMDP-abstracted systems with respect to ω-regular specifications, including partitioning and refinement strategies [39], invariant set computation [40], and two-player games [41]. These approaches address scenarios where the transition probability intervals of the graph remain static. In contrast, in this work we develop graph-based techniques which synthesize control policies to improve transition probability intervals online, thus increasing the probability of satisfying the desired specifications. In particular, we shrink these intervals via integration of Gaussian process (GP) learning [42] of uncertainties into the IMDP as explored in some recent works [43]. GPs have proven to be a popular approach for learning uncertainties in robotic planning problems [44], [45], [46], [47], [48]. IMDPs render a coherent structure to map learned GP high-confidence bounds on the uncertainty into transition probability intervals so that verification and synthesis tasks explicitly consider uncertainty. This methodology combines the formal guarantees associated with abstraction-based planners with the online adaptation capabilities of learning.

B. CONTRIBUTIONS
This study proposes a novel integrated planning framework for legged locomotion which fuses high-level formal-methods-based control synthesis techniques, learning methods, and low-level dynamics-aware motion plans in order to enable temporal-logic-based planning capabilities for bipedal robots. We first formulate an IMDP model for bipedal locomotion which incorporates the complex dynamics associated with legged locomotion. We then integrate a multi-layer GP learning structure into the IMDP abstraction to allow for learning of correlated robot model and environmental (i.e., terrain) uncertainties. Using this IMDP model, we develop and prove the validity of a planning methodology which allows a robot to iteratively traverse its environment to perform online learning while maintaining safety with respect to complex LTL specifications. Finally, we map our high-level planner onto a low-level full-body kinematics-based bipedal robot model and demonstrate simulation results.
Within our framework of temporal-logic-based planning for legged locomotion, we make the following contributions:
• We develop a novel IMDP abstraction of legged locomotion systems, which possess more complex dynamics than the systems used for the IMDP methodology in our previous work [49]. For high-level planning, we abstract the system using reduced-order Prismatic Inverted Pendulum Model (PIPM) dynamics. We also explicitly consider the effects of low-level full-body kinematic constraints in our IMDP planner, allowing for computationally tractable synthesis of high-level control policies which can be executed on the full-order legged dynamical model.
• We generalize our previous work on IMDP planning in [49] to broaden the language of allowable specifications from the syntactically co-safe fragment of LTL to the entire language of LTL. Using graph-theoretic techniques, we develop theory and algorithms for the synthesis of control policies for data sampling which are nonviolating with respect to infinite-horizon specifications. We demonstrate the computational tractability of our approach on case studies with state space sizes which are orders of magnitude larger than those in [49].
• We propose a stacked GP learning model for structured learning of the motion uncertainties from the modeling and environmental sources inherent to legged locomotion. This allows our planner to compensate for a priori unknown motion perturbations and improve online the probability of satisfying high-level tasks.

A. REDUCED-ORDER LOCOMOTION DYNAMICS
We model the rigid body dynamics of a bipedal robot using the prismatic inverted pendulum model (PIPM) [50]. In this model, the mass of the robot is modeled as a single point at the hip of the robot. This point is called the 3D Center of Mass (CoM) and has position p_com = (x, y, z), whose components are the sagittal, lateral, and vertical coordinates, respectively. The robot also has angular motion with orientation angles (φ, θ, ψ) and a foot contact position p_foot = (x_foot, y_foot, z_foot). Following the exposition in [51], we take the position and velocity of the CoM to be the system state ξ = (p_com, ṗ_com) = (x, y, z, ẋ, ẏ, ż) ∈ Ξ ⊆ R^6, where Ξ is the set of admissible CoM positions and velocities. Fig. 1 illustrates these parameters. The second-order CoM dynamics of the legged robot at the q-th step are

  ẍ = ω_q²(x − x_foot,q),  ÿ = ω_q²(y − y_foot,q),  z̈ = a_q ẍ + b_q ÿ,   (1)

where the lumped CoM of the robot (with mass m and gravitational acceleration g) evolves along the piecewise-planar surface z = a_q x + b_q y + c_q, with ω_q the phase-space asymptotic slope parameter, a_q and b_q the slope coefficients, and c_q the constant bias term. The hybrid control inputs to this system are the discrete foot contact position and the PIPM asymptotic slope ω_q. The derivation of (1) is found in [50]. For hybrid locomotion planning, we introduce apex and keyframe states:

Definition 1 (Apex State): The apex state occurs when the CoM sagittal position is equal to the location of the foot contact in the sagittal axis (i.e., the point in the walking step when the robot's body is exactly over its foot contact). The coordinates of the CoM at the apex state are (x_apex, y_apex, z_apex).
Definition 2 (Locomotion Keyframe State): A keyframe state of the PIPM is defined as k = (d, Δθ, Δz_foot, v_apex, z_apex) ∈ K, where
• d := x_apex,n − x_apex,c is the walking step length;
• Δθ := θ_apex,n − θ_apex,c is the yaw angle change;
• Δz_foot := z_foot,n − z_foot,c is the height change;
• v_apex is the CoM sagittal apex velocity;
• z_apex is the global CoM height at apex.
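As a concrete illustration of the apex state in Definition 1, the following minimal sketch evaluates the standard closed-form solution of the sagittal PIPM pendulum dynamics ẍ = ω²(x − x_foot) and solves for the apex time; the function names and the closed-form solution used here are our own assumptions, not code from the paper.

```python
import math

def pipm_sagittal_step(x0, xdot0, x_foot, omega, t):
    """Closed-form sagittal CoM trajectory of the PIPM for one walking
    step, x_ddot = omega^2 * (x - x_foot).  Returns (x(t), xdot(t))."""
    dx = x0 - x_foot
    x = x_foot + dx * math.cosh(omega * t) + (xdot0 / omega) * math.sinh(omega * t)
    xdot = dx * omega * math.sinh(omega * t) + xdot0 * math.cosh(omega * t)
    return x, xdot

def time_to_apex(x0, xdot0, x_foot, omega):
    """Time at which the CoM passes directly over the foot contact
    (the apex state of Definition 1): solve x(t) = x_foot, i.e.
    tanh(omega*t) = -(x0 - x_foot)*omega / xdot0."""
    dx = x0 - x_foot
    return math.atanh(-dx * omega / xdot0) / omega
```

For a step with the CoM starting behind the foot contact and moving forward, `time_to_apex` returns a positive time at which `pipm_sagittal_step` places the CoM exactly over the foot.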

Summary of the PIPM planning approach:
We use the apex states of the robot as discrete-time states for high-level planning. At each apex state, the objective of the discrete-time planner is to target an apex state to reach in the next step. The parameters of the keyframe state correspond to the actions commanded at each apex state. The high-level action a_HL = (d, Δθ, Δz_foot) ∈ A_HL determines the (x, y, z) coordinates of the next apex state. In order to calculate the corresponding action in the PIPM dynamics, we first need to compute the additional parameters a_LL = (v_apex, z_apex) for the action using a low-level motion planner. Together, the keyframe state parameters determine the foot placement of the next walking step. The keyframe state also maps to a corresponding asymptotic slope ω_q required for tracking the continuous trajectory in between apex states.

B. TEMPORAL-LOGIC-BASED TASK PLANNING
Definition 3 (Interval Markov Decision Process): An IMDP is a tuple I = (Q, A, Ť, T̂, Q_0, O, L), where
• Q is a finite set of states,
• A is a finite set of actions,
• Ť, T̂ : Q × A × Q → [0, 1] are lower and upper bounds, respectively, on the transition probability from state q ∈ Q to state q′ ∈ Q under action α ∈ A,
• Q_0 ⊆ Q is a set of initial states,
• O is a finite set of atomic propositions or observations,
• L : Q → O is a labeling function.

A(q) denotes the set of actions available at q. Moreover, for all q, q′ ∈ Q and all α ∈ A(q), Ť and T̂ satisfy Ť(q, α, q′) ≤ T̂(q, α, q′) and Σ_{q′∈Q} Ť(q, α, q′) ≤ 1 ≤ Σ_{q′∈Q} T̂(q, α, q′). In this work, we consider specifications which can be written using the syntax of Linear Temporal Logic (LTL):

Definition 4 (LTL Syntax): An LTL formula is constructed from the grammar

  φ ::= true | o | ¬φ | φ_1 ∧ φ_2 | ◯φ | φ_1 U φ_2,

where o ∈ O is an observation and φ, φ_1, and φ_2 are LTL formulas. We define the next operator ◯ as satisfying φ in the next state transition and the until operator U as satisfying φ_1 until φ_2 is satisfied; the eventually operator is ♦φ := true U φ, and the always operator is □φ := ¬♦¬φ. It is well known that the satisfaction of LTL formulas can be checked using deterministic Rabin automata [52]:

Definition 5 (Deterministic Rabin Automaton): A deterministic Rabin automaton (DRA) is a tuple R = (S, O, δ, s_0, F), where S is a finite set of states, O is the input alphabet, δ : S × O → S is the transition function, s_0 ∈ S is the initial state, and F = {(G_i, B_i)}, i = 1, . . . , n, with G_i, B_i ⊆ S, is the acceptance condition.

The semantics of a deterministic Rabin automaton are defined over infinite input words in O^ω (the set of infinite sequences of observations). A run of R over an infinite word o(1)o(2) . . . ∈ O^ω is a sequence of states w_S = s_0 s_1 s_2 . . . such that s_i = δ(s_{i−1}, o(i)) for all i ≥ 1. A run w_S admits a set inf(w_S) = {w_S(i) : ∀m ∈ N ∃k > m s.t. w_S(k) = w_S(i)}, defined as the set of states in w_S which appear infinitely often. Then, a run w_S is accepted if there exists i ∈ {1, . . . , n} such that inf(w_S) ∩ G_i ≠ ∅ and inf(w_S) ∩ B_i = ∅. If w_S is accepted by R, we say that w_S |= R.
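For an eventually-periodic run, inf(w_S) is exactly the set of states on the repeating cycle, so the Rabin condition reduces to a finite check; a minimal sketch (function and variable names are ours):

```python
def rabin_accepts(cycle_states, acceptance):
    """Check the Rabin acceptance condition for an eventually-periodic
    run whose repeating cycle visits `cycle_states`.  For such a run,
    inf(w_S) is exactly the set of cycle states.  `acceptance` is a list
    of (G_i, B_i) pairs; the run is accepted iff for some i it visits
    G_i infinitely often and B_i only finitely often."""
    inf_set = set(cycle_states)
    return any(inf_set & set(G) and not inf_set & set(B)
               for G, B in acceptance)
```

For example, with acceptance pair (G, B) = ({g}, {b}), a cycle through g but not b is accepted, while any cycle touching b is rejected.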

C. GAUSSIAN PROCESS LEARNING
In order to learn the uncertainties present in our system, we utilize Gaussian process (GP) regression:

Definition 6 (Gaussian Process Regression): Gaussian process (GP) regression models a function g_i : R^n → R as a distribution with covariance function κ : R^n × R^n → R, conditioned on a dataset D = {(z_j, y_j^i)}_{j=1,...,m}, where z_j ∈ R^n is an input and y_j^i is an observation of g_i(z_j) under Gaussian noise with variance σ²_ν_i. Let K ∈ R^{m×m} be a matrix defined elementwise by K_{jℓ} = κ(z_j, z_ℓ), and for z ∈ R^n, let k(z) = [κ(z_1, z), . . . , κ(z_m, z)]^T. Then, the predictive distribution of g_i at a test point z is the conditional distribution of g_i given D, which is Gaussian with mean μ_{g_i,D}(z) and variance σ²_{g_i,D}(z) given by

  μ_{g_i,D}(z) = k(z)^T (K + σ²_ν_i I_m)^{−1} Y_i,
  σ²_{g_i,D}(z) = κ(z, z) − k(z)^T (K + σ²_ν_i I_m)^{−1} k(z),

where I_m is the m × m identity matrix and Y_i = [y_1^i, . . . , y_m^i]^T. In practice, we use a sparse Gaussian process regression approximation [53] to reduce computational complexity.
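The posterior mean and variance formulas of Definition 6 can be implemented directly; the sketch below assumes a squared-exponential kernel (the paper leaves the kernel choice open), and all names are ours:

```python
import numpy as np

def gp_posterior(Z, Y, z_star, noise_var, lengthscale=1.0):
    """Predictive mean and variance of GP regression at a test point,
    following the standard posterior formulas of Definition 6.  A
    squared-exponential kernel is assumed for illustration."""
    def kappa(a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * lengthscale ** 2))

    m = len(Z)
    K = np.array([[kappa(Z[j], Z[l]) for l in range(m)] for j in range(m)])
    k = np.array([kappa(zj, z_star) for zj in Z])
    G = K + noise_var * np.eye(m)           # K + sigma_nu^2 * I_m
    mu = k @ np.linalg.solve(G, Y)          # k(z)^T (K + s^2 I)^-1 Y
    var = kappa(z_star, z_star) - k @ np.linalg.solve(G, k)
    return mu, var
```

With near-noise-free data, the posterior mean interpolates the training observations and the variance collapses near them, while far from the data the variance reverts to the prior.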

III. PROBLEM SETUP
We now define our planning problem for bipedal robot locomotion. We consider a 3D environment for the bipedal robot in which there exists a priori a rectangular partition R on the x–y plane with partition regions {R_n}_{n∈N}:

  R_n = {(x, y) : a_x,n ≤ x ≤ b_x,n, a_y,n ≤ y ≤ b_y,n},

where a_x,n and a_y,n are lower bounds on the x and y coordinates, respectively, and b_x,n and b_y,n are upper bounds.
We assume that the bipedal robot can be modeled using the PIPM dynamics as introduced in Section II-A.
Assumption 1: There exists a model error between the PIPM used for planning and the low-level controller of the bipedal robot which causes deviation from the desired trajectory of motion. Additionally, the terrain elevation of the environment is unknown, which results in a separate motion perturbation. Both sources of uncertainty can be characterized and learned.
Assumption 2: The bipedal robot experiences a stochastic yaw perturbation ν at each walking step which can be modeled as a zero mean random variable with stationary, symmetric, and unimodal distribution ρ ν . We further assume a bounded support on the distribution so that the perturbation cannot drive the system into arbitrary regions of the space. In particular, this makes the distribution sub-Gaussian.
In this work, we specifically consider the effect of uncertainties on the yaw angle error. Using only proprioceptive sensing for legged robots, the results in [54], [55], [56], [57] prove that the absolute yaw angle is unobservable, so that estimates of this quantity drift over time. To address this issue, we explicitly model yaw uncertainties using GPs. Our objective is to develop a high-level planning algorithm which considers the robot's full-body kinematics to satisfy an LTL objective in the presence of robot system and environmental uncertainties:

Problem 1: Design a temporal-logic-based planning algorithm for a bipedal robot to safely traverse its environment and learn system and environmental uncertainties in order to satisfy given LTL specifications with satisfactory probability under worst-case resolutions of the transition probability intervals of the discretized model of the robot.

Subproblem 1.1: Design an IMDP model of bipedal locomotion based on Prismatic Inverted Pendulum Model dynamics. Integrate Gaussian process learning of system and environmental uncertainties into the bipedal IMDP in order to model and account for motion perturbations.

Subproblem 1.2: Synthesize a control policy for an IMDP-abstracted system which allows the robot to traverse its environment while maintaining nonzero probability of satisfying the given LTL specifications under best-case resolutions of the transition probability intervals.

IV. BIPEDAL IMDP FORMULATION
In this section, we propose a high-confidence IMDP model of bipedal robot locomotion which incorporates Gaussian process learning of system uncertainties.

A. IMDP STATE DEFINITION
We first define the IMDP abstraction states:

Definition 7 (IMDP Abstraction States): An IMDP abstraction state X_q, q ∈ Q, is defined as

  X_q = R_q × [a_θ,q, b_θ,q] × {ψ_q},

where R_q is a rectangular partition region, a_θ,q and b_θ,q are lower and upper bounds, respectively, on the yaw angle, and ψ_q ∈ {left, right} is the left or right foot stance index.
Given the full-body kinematic constraints of the robot, we assume that forward and backward steps are feasible, with a limited turning motion possible in the forward direction.

Fig. 2. Overall IMDP-BL framework. Planning is done using a product IMDP model which considers dynamical constraints from a PIPM bipedal robot framework. Additionally, the product IMDP relies on GP learning of uncertainties for safe planning with respect to an LTL specification. Control actions from the IMDP planner are input to the PIPM dynamics, where they are mapped to low-level control actions for the physical robot. Trajectory data from the robot executing the control policy is fed back into the stacked GP model, updating the transition probabilities in the product IMDP. Black arrows/text illustrate the flow of information through the system and green text shows the corresponding parameters.

Formally, we have the action set

  A = {N, NE, NW, S},

where N corresponds to a straight forward step, NE is a forward step with a right turn, NW is a forward step with a left turn, and S is a straight backward step. We use a high-level waypoint planner δ to map IMDP actions to target IMDP states. We first perform an intermediate mapping from IMDP actions to high-level keyframe state actions a_HL = (d_α, θ_α, Δz_foot,α), as in Definition 2. Then, we map high-level keyframe state actions to their corresponding target IMDP states as follows. Given an initial IMDP state X_q with a geometric center (x_c,q, y_c,q), a mean yaw angle θ_c,q, and an action α ∈ A, we target the center of the IMDP state X_q* closest to the desired endpoint, which is calculated as

  (x_c,q + Δx, y_c,q + Δy),  Δx = d_α cos(θ_c,q + θ_α),  Δy = d_α sin(θ_c,q + θ_α),

where Δx and Δy are the x and y components of the desired step, respectively. Once each state-action pair has been associated with a target state during planning time, we implement a controller to execute the appropriate trajectory at runtime.
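The waypoint planner δ can be sketched as follows; the step-length and turn values in `ACTIONS` are hypothetical placeholders for the robot's kinematic limits, and the sign convention (right turn = negative yaw change) is our assumption:

```python
import math

# Hypothetical (step length d_a, yaw change theta_a) per high-level action;
# actual values would come from the robot's full-body kinematic limits.
ACTIONS = {"N": (0.4, 0.0), "NE": (0.4, -0.3), "NW": (0.4, 0.3), "S": (-0.4, 0.0)}

def target_point(center, theta_c, action):
    """Desired endpoint of a step: advance d_a along the commanded
    heading theta_c + theta_a from the current region center."""
    d, dth = ACTIONS[action]
    xc, yc = center
    heading = theta_c + dth
    return xc + d * math.cos(heading), yc + d * math.sin(heading)

def closest_state(point, centers):
    """Target the IMDP state whose region center is closest to the endpoint."""
    return min(centers, key=lambda q: (centers[q][0] - point[0]) ** 2
                                      + (centers[q][1] - point[1]) ** 2)
```

A straight forward step from the origin at zero yaw then targets whichever state center lies nearest the point one step length ahead.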

B. GAUSSIAN PROCESS FORMULATION
Inspired by stacked Gaussian process learning [58], a hierarchical framework in which output parameters of lower-level GPs are used as inputs to higher-level GPs, we propose a multi-layer GP learning framework to model the sources of bipedal locomotion uncertainty. We first use a Gaussian process ẑ(x, y) to model the elevation z of the terrain, which takes as input the x- and y-coordinates of the position. We assume that at each step the robot can measure the terrain elevation at its current CoM position; therefore, at a given step i we can record the x–y position (x_i, y_i) and the observation z_i. Next, we model uncertainty in the yaw angle due to model error at each step using Gaussian processes. We first introduce a Gaussian process θ̂(a_HL), which takes as input the commanded high-level action a_HL = (d, Δθ, Δz_foot) and outputs a mean expected deviation μ_θ̂ and a variance σ²_θ̂ from the commanded yaw change. This Gaussian process models the error induced by the low-level dynamics which are not accounted for in the reduced-order PIPM model. Then, we model the overall expected variance of the yaw angle deviation using another Gaussian process σ̂²_θ, which takes as inputs the variance of the terrain uncertainty of the step, σ²_z, and the variance of the predicted yaw deviation, σ²_θ̂. Intuitively, this Gaussian process captures the range of potential yaw angle deviations of a step, which should increase when there is more uncertainty in the Gaussian processes used to predict the terrain elevation and yaw angle deviation. Fig. 2 shows the structure of the GPs.
Samples for each of the GPs are collected as follows. At each step, the robot collects a sample of the terrain elevation ((x⁺_foot, y⁺_foot), z⁺_foot), where (x⁺_foot, y⁺_foot) is the x–y foot stance after the step and z⁺_foot is the corresponding observation of the terrain elevation. This data point is then added to the sample set of the GP ẑ(x, y).
The robot has a target yaw angle θ_c,q but, due to estimation errors and motion perturbations, actually reaches a yaw angle of θ⁺ after each step. We define a yaw angle error θ_e := θ⁺ − θ_c,q and use this error as an observation for our remaining GPs as follows. First, we generate samples (a_HL, θ_e) for the model error Gaussian process θ̂, where a_HL is the high-level control action used for the step, calculated using the controller in Section IV-C. Next, we generate samples ((σ²_z, σ²_θ̂), |θ_e|) for the yaw error Gaussian process σ̂²_θ, where the inputs are the variances of the terrain and model error GPs used for a step and the value to be predicted is the absolute value of the yaw deviation. Our motivation for using the absolute value of the yaw error is that it captures the total error induced by the model error, terrain uncertainty, and noise, whereas the signed yaw error will not capture the cumulative effect of the zero-mean stochastic noise.
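The sample-collection bookkeeping above can be sketched as follows; the variable and function names are illustrative, not from the paper:

```python
# Minimal sketch of the stacked-GP sample bookkeeping.  The three
# datasets feed, respectively, the terrain GP z_hat, the model-error GP
# theta_hat, and the stacked yaw-variance GP sigma2_theta.
terrain_data, model_error_data, yaw_var_data = [], [], []

def record_step(foot_xy, z_obs, a_hl, theta_target, theta_actual,
                sigma2_z, sigma2_theta_hat):
    """Log one walking step.  foot_xy/z_obs sample the terrain GP; the
    signed yaw error samples the model-error GP; its absolute value,
    paired with the two lower-level GP variances, samples the stacked
    variance GP."""
    terrain_data.append((foot_xy, z_obs))
    theta_e = theta_actual - theta_target      # yaw angle error
    model_error_data.append((a_hl, theta_e))
    yaw_var_data.append(((sigma2_z, sigma2_theta_hat), abs(theta_e)))
```

Each call appends one training point to each of the three GP datasets, mirroring the hierarchical structure: the stacked GP's inputs are the variances reported by the two lower-level GPs for that step.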

C. CONTROLLER DESIGN
Define a family of controllers K_q : Z × X → A_HL which take as input a current pose ζ = (x_com, y_com, z_foot, θ, ψ) ∈ Z and, as a second input, a target IMDP state X_q′ with center (x_c,q′, y_c,q′, θ_c,q′). The controller then outputs a high-level waypoint-targeting control action a_HL = (d, Δθ, Δz_foot), as shown in Fig. 3.
First, the controller calculates a desired step length d as the planar distance from the current CoM position to the center of the target state:

  d = ‖(x_c,q′ − x_com, y_c,q′ − y_com)‖.

The foot elevation change Δz_foot is calculated as

  Δz_foot = μ_ẑ(x⁺, y⁺) − z_foot,

where (x⁺, y⁺) is the desired foot placement, which is analytically calculated online at each step, and μ_ẑ is the mean of the GP prediction of the foot placement elevation.
The yaw angle change Δθ is calculated using the GP prediction θ̂(a_HL) of the expected yaw angle deviation:

  Δθ = (θ_c,q′ − θ) − μ_θ̂(a_HL),

where μ_θ̂ is the GP estimate of the yaw angle deviation due to model error.

D. IMDP TRANSITION PROBABILITY CALCULATION
We next formulate transition probability bounds for the IMDP abstraction states. Given the controller developed in Section IV-C, we separate the IMDP state transition into three components. First, we assume that the rectangular partition region transition is deterministic; that is, given a current IMDP state X_q with rectangular partition region R_q and an action which targets IMDP state X_q* with rectangular partition region R_q*, we assume that the actual next IMDP state X_q′ has rectangular partition region R_q′ = R_q*. Additionally, we assume that the foot stance of the robot alternates between IMDP states, i.e., if the current IMDP state X_q has stance ψ_q = left, the next IMDP state X_q′ always has stance ψ_q′ = right, and vice versa. Then, the uncertainty in IMDP state transitions arises solely from yaw uncertainty. We assume that this yaw uncertainty has two components. First, uncertainty in the GP estimations of the terrain elevation and reduced-order model error induces learnable controller error. We also assume a nonlearnable disturbance modeled as stochastic noise. Given a current IMDP state X_q and an action which targets IMDP state X_q*, we model the yaw uncertainty induced by GP estimation error using the GP σ̂²_θ. We first calculate a maximum terrain GP variance σ²_z,max and a maximum model GP variance σ²_θ̂,max for the state-action pair as follows. We can conservatively estimate the maximum terrain variance σ²_z,max under the action as

  σ²_z,max = max_{(x,y) ∈ R_q*} σ²_z(x, y),

where σ²_z(x, y) is the variance of the GP prediction of the terrain at a given x–y location.
We now define a series of parameters in order to calculate the maximum model error GP variance. First, since the controller always seeks to move to the center (x_c,q*, y_c,q*) of the target IMDP region X_q*, lower and upper bounds on the step length input to the model error GP are

  d_min = min_{(x,y)∈R_q} ‖(x_c,q* − x, y_c,q* − y)‖,  d_max = max_{(x,y)∈R_q} ‖(x_c,q* − x, y_c,q* − y)‖,

which are the distances of the closest and furthest points, respectively, in X_q to the center of X_q*. Next, we define lower and upper bounds on the yaw angle change inputs to the model error GP as

  Δθ_min = θ_c,q* − b_θ,q,  Δθ_max = θ_c,q* − a_θ,q,

where a_θ,q and b_θ,q are lower and upper bounds on the yaw angles in state X_q, and θ_c,q* is the central yaw angle for state X_q*. Finally, we define lower and upper bounds on the foot elevation change input to the model error GP as

  Δz_min = min_{(x,y)∈R_q*} ẑ(x, y) − max_{(x,y)∈R_q} ẑ(x, y),  Δz_max = max_{(x,y)∈R_q*} ẑ(x, y) − min_{(x,y)∈R_q} ẑ(x, y),

where ẑ(x, y) is the mean of the GP prediction of the terrain elevation at point (x, y).
We now calculate the maximum model error GP variance as

  σ²_θ̂,max = max { σ²_θ̂(d, Δθ, Δz_foot) : d ∈ [d_min, d_max], Δθ ∈ [Δθ_min, Δθ_max], Δz_foot ∈ [Δz_min, Δz_max] }.

Given the maximum terrain and model error GP uncertainties, we then take the corresponding mean prediction of the yaw uncertainty GP, γ = μ_σ̂²θ(σ²_z,max, σ²_θ̂,max), under these parameters to be a measure of uncertainty in the yaw angle for each state-action pair. We can then bound the yaw angle θ⁺ after the action using γ:

  θ_c,q* − β√γ ≤ θ⁺ ≤ θ_c,q* + β√γ,   (20)

where θ_c,q* is the desired yaw angle after the action is taken and β is a high-confidence interval factor calculated using methods in [43]. We note that although the yaw angle measurements we use to train the GPs are perturbed by the stochastic noise as in Assumption 2, it is established in [57] that we can place high-confidence bounds on the underlying learnable function as in Equation (20). The effect of the noise is to increase the magnitude of the high-confidence interval factor β, thus making the bounds on the underlying function more conservative. However, our approach assumes a priori conservative bounds to initialize the GPs, so the calculated bounds always remain conservative, and thus the validity of the overall approach is not affected by noise. Practically, the noise perturbation results in the robot having to collect more data samples to generate sufficiently tight bounds. We can then calculate lower and upper transition probability bounds for the IMDP state-action pair by treating the high-confidence transition region as nondeterministic and centering the stochastic noise at probability-minimizing and probability-maximizing points in this region:

Lemma 1 (Yaw-Based Transition Probability Bounds): Consider X_q, X_q′ with q, q′ ∈ Q and action α_q* ∈ A_q. The lower bound Ť and upper bound T̂ on the transition probability from X_q to X_q′ under α_q* are given by

  Ť(X_q, α_q*, X_q′) = min_{c ∈ [θ_min, θ_max]} ∫_{a_θ,q′}^{b_θ,q′} ρ_ν(θ − c) dθ,
  T̂(X_q, α_q*, X_q′) = max_{c ∈ [θ_min, θ_max]} ∫_{a_θ,q′}^{b_θ,q′} ρ_ν(θ − c) dθ,   (22)

where we recall that ρ_ν is the probability density function of the stochastic noise ν and a_θ,q′ and b_θ,q′ are the lower and upper boundary points of region X_q′.
We define θ_min and θ_max as the endpoints of the high-confidence yaw interval in (20):

  θ_min = θ_c,q* − β√γ,  θ_max = θ_c,q* + β√γ.

Proof: The proof of the validity of these transition probability bounds follows from the proof outlined in Theorem 1 of [49] as a special one-dimensional case, with yaw angles rather than Cartesian coordinates considered. □

Fig. 1 illustrates the nondeterministic transition regions and the stochastic noise centering used in Lemma 1.
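As a concrete illustration of the bounds in Lemma 1, the following sketch computes them for Gaussian noise; the paper only requires a symmetric, unimodal, sub-Gaussian density, so the Gaussian choice and all names here are our own assumptions:

```python
import math

def gauss_mass(a, b, c, sigma):
    """P(a <= theta <= b) when theta ~ N(c, sigma^2)."""
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return phi((b - c) / sigma) - phi((a - c) / sigma)

def transition_bounds(a, b, theta_min, theta_max, sigma):
    """Lower/upper transition probabilities into the yaw interval [a, b]
    when the noise center can lie anywhere in [theta_min, theta_max].
    For a symmetric unimodal density, the interval mass is maximized by
    centering at the midpoint of [a, b] (clamped to the admissible
    range) and minimized at the farthest admissible center."""
    mid = 0.5 * (a + b)
    c_max = min(max(mid, theta_min), theta_max)   # clamp midpoint
    c_min = (theta_min if abs(theta_min - mid) >= abs(theta_max - mid)
             else theta_max)
    return gauss_mass(a, b, c_min, sigma), gauss_mass(a, b, c_max, sigma)
```

When the high-confidence region collapses to a point (θ_min = θ_max), the lower and upper bounds coincide, recovering a standard MDP transition probability.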

E. IMDP FOR BIPEDAL LOCOMOTION DEFINITION
With the individual components of the IMDP defined along with the integration of a stacked GP framework, we can now formally define an IMDP abstraction of bipedal locomotion:

Definition 8 (IMDP for Bipedal Locomotion): An IMDP for bipedal locomotion (IMDP-BL) is an IMDP I = (Q, A, Ť, T̂, Q_0, O, L) in which the states are the abstraction states of Definition 7, the actions are A = {N, NE, NW, S} with the waypoint planner of Section IV-A, and the transition probability bounds Ť and T̂ are calculated via the stacked GP framework as in Lemma 1.

The IMDP-BL abstracts the dynamics of bipedal robots and also incorporates multi-component GP learning of uncertainties, solving Subproblem 1.1. We note that the reduced-order bipedal PIPM dynamics in (1) are outside of the class of dynamics considered in the previous work [49] for the IMDP-based methodology. Additionally, the IMDP-BL explicitly accounts for low-level full-body kinematics unique to bipedal locomotion so that the proposed IMDP planning method can be realistically implemented on bipedal robots. Fig. 2 shows the structure of the proposed IMDP-BL framework, with interconnections between the high-level planning, low-level dynamics, and GP learning components.

V. CONTROL POLICY SYNTHESIS
We now turn our attention to the verification and synthesis of control policies for the IMDP-BL against LTL specifications. The structure of our methodology broadly follows our previous work in [49], in which we constructed similar control synthesis techniques for the syntactically co-safe fragment of LTL (scLTL). In this section, we introduce generalized algorithms which are able to accommodate complex specifications from the general class of LTL.

A. PRODUCT IMDP
To check LTL specifications against observations from an IMDP abstraction of a bipedal robot, we utilize a product IMDP (PIMDP) construction synthesized from the IMDP-BL and a DRA of the LTL specifications, defined as follows:

Definition 9 (Product IMDP): Given an IMDP-BL I = (Q, A, Ť, T̂, Q_0, O, L) and a DRA R = (S, O, δ, s_0, F), the product IMDP P = I ⊗ R is an IMDP with states Q × S and actions A, where
• the transition probability bounds are Ť⊗((q, s), α, (q′, s′)) = Ť(q, α, q′) if s′ = δ(s, L(q′)), and 0 otherwise, and analogously for T̂⊗,
• (q_0, δ(s_0, L(q_0))) ∈ (Q × S) is a set of initial states of I ⊗ R, and
• the acceptance condition is lifted as G_i^⊗ = Q × G_i and B_i^⊗ = Q × B_i for i = 1, . . . , n.

The PIMDP construction encodes both the abstracted dynamics of the robot as well as specification satisfaction. For LTL specifications, [59, Def. 10.128, Thm. 10.129] shows that specification checking on a PIMDP structure reduces to checking reachability of accepting maximal end components, defined as follows:

Definition 10 (End Component [60]): An end component of a finite PIMDP P is a pair (T, Act), with T ⊆ (Q × S) and Act : T → 2^A, such that
• ∅ ≠ Act(q, s) ⊆ A(q) for all states (q, s) ∈ T,
• (q, s) ∈ T and α ∈ Act(q, s) implies {(q′, s′) : T̂(q, α, q′) > 0, s′ = δ(s, L(q))} ⊆ T,
• the digraph induced by (T, Act) is strongly connected.
We note that upper bound transition probabilities are used to calculate end components in order to enable all possible edges in the PIMDP. This ensures that the end components are robust to all realizations of the transition probabilities.
Definition 11 (Maximal End Component): A maximal end component (MEC) is an end component (T, Act) that is not properly contained within any other end component.

Maximal end components are pairwise disjoint, so the number of MECs is bounded above by the number of PIMDP states, thus reducing the computational complexity of specification checking. In order to use graph-theoretic techniques to calculate reachability probabilities, it is necessary to resolve the transition probability bounds in the PIMDP to reduce it to a product MDP. We define adversaries for this purpose:

Definition 12 (PIMDP Adversary): Given a PIMDP state (q, s) and action α, an adversary ξ ∈ Ξ is an assignment of transition probabilities T_ξ to all states (q′, s′) such that

  Ť((q, s), α, (q′, s′)) ≤ T_ξ((q, s), α, (q′, s′)) ≤ T̂((q, s), α, (q′, s′)).
In particular, we use a minimizing adversary, which realizes transition probabilities such that the probability of satisfying the specification is minimal, and a maximizing adversary, which maximizes the probability of satisfaction.
Definition 13 (Control Policy): A control policy π ∈ Π of a PIMDP is a mapping (Q × S)⁺ → A, where (Q × S)⁺ is the set of finite sequences of states of the PIMDP.
Then, reachability probabilities in the PIMDP are uniquely determined by the control policy and adversary selected.
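The extreme adversaries of Definition 12 admit a standard greedy resolution used in interval value iteration; the sketch below illustrates the idea (function and variable names are ours, and the value ordering is assumed given):

```python
def resolve_adversary(lower, upper, values, minimizing=True):
    """Resolve transition probability intervals to an extreme
    distribution: start every successor at its lower bound, then spend
    the remaining probability mass greedily, worst-valued successors
    first for a minimizing adversary (best-valued first when
    maximizing).  `lower`, `upper`, `values` are dicts keyed by
    successor state; returns a valid distribution within the bounds."""
    p = dict(lower)
    budget = 1.0 - sum(lower.values())
    order = sorted(values, key=values.get, reverse=not minimizing)
    for s in order:
        give = min(upper[s] - lower[s], budget)
        p[s] += give
        budget -= give
    return p
```

A minimizing adversary thus pushes as much mass as the intervals allow toward low-value successors, and a maximizing adversary toward high-value ones, while the result always sums to one.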

B. GRAPH PRUNING
In order to enable online learning of the uncertainties in the system, we require paths through the state-space which the legged robot can take in order to collect samples of the dynamics without violating the specifications of interest. The key idea is that verification of LTL specifications reduces to checking reachability of accepting MECs. Then, states which have zero probability of reaching any accepting end component under a maximizing control policy and maximizing adversary are unsafe, i.e. the robot is guaranteed to violate the specification at these states. On the other hand, states which have zero probability of reaching an accepting MEC under a maximizing control policy and minimizing adversary but have nonzero probability under a maximizing adversary are considered safe to sample, as these are states which could benefit most from uncertainty reduction.
Algorithm 1 details our methodology for graph pruning in order to generate a nonviolating sub-graph to sample. First, we assume that MECs corresponding to accepting conditions in the product IMDP have been calculated. The MECs themselves can be calculated using standard algorithms [59,Algorithm 47]. Then, accepting MECs are simply those which contain states in an accepting condition G i but no states from the corresponding rejecting set B i . In line 1, we initialize the accepting states of the PIMDP to be those in the union of all of the accepting MECs. In lines 4-14, we prune any state-action pairs which have zero probability of reaching the accepting states under a maximizing adversary. If this pruning leaves any state with no available actions, we add this state to a set of hazard states to be excluded from the nonviolating sub-graph. In lines 15-28, we prune state-action pairs which have nonzero probability under a maximizing adversary to reach any hazard state. Finally, in line 31 we return the remaining state-action pairs as a nonviolating sub-graph.
Once a nonviolating sub-graph has been generated, a control policy to safely sample the state space can be synthesized using the method detailed in Algorithm 2. For states which are in the nonviolating sub-graph (lines 2-4), we randomly select from the available actions for each state. For states which are outside of the nonviolating sub-graph (lines 5-7), we select the action which produces the highest probability of returning to the nonviolating sub-graph.
Algorithm 1: Nonviolating PIMDP Generation.
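A sketch of such a sampling policy follows, under our (hypothetical) reading that out-of-graph states greedily maximize the upper-bound probability mass of re-entering the sub-graph; the data layout and names are assumptions:

```python
import random

def sampling_policy(subgraph, transitions, state, rng=random):
    """Sketch of Algorithm 2: inside the nonviolating sub-graph, explore by
    picking a random available action; outside it, greedily pick the action
    whose upper-bound probability mass on sub-graph states is largest.
    `subgraph`: state -> set of safe actions (hypothetical encoding);
    `transitions`: state -> action -> {successor: (p_lo, p_hi)}."""
    if state in subgraph and subgraph[state]:
        # Random exploration keeps the collected dynamics samples diverse.
        return rng.choice(sorted(subgraph[state]))
    best_action, best_mass = None, -1.0
    for action, succs in transitions[state].items():
        mass = sum(p_hi for t, (_, p_hi) in succs.items() if t in subgraph)
        if mass > best_mass:
            best_action, best_mass = action, mass
    return best_action
```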
Proof: By construction, Algorithm 1 produces as output a sub-graph in which every state has nonzero probability of reaching an accepting maximal end component, and hence of satisfying the specification. Therefore, executing a control policy generated by Algorithm 2, which performs control policy synthesis over the nonviolating sub-graph from Algorithm 1, ensures that the robot only traverses states which have nonzero probability of satisfying the LTL specifications. Thus, the robot can traverse the state space under this control policy indefinitely while remaining in nonviolating regions, proving the safety of the control policy generated from Algorithms 1 and 2.
Therefore, executing Algorithms 1 and 2 on a bipedal PIMDP produces a control policy which allows the robot to traverse its state space indefinitely while guaranteeing safety with respect to LTL specifications, solving Subproblem 1.2.

A. TASK PLANNING ALGORITHM
Given the IMDP-BL model developed in Section IV and the LTL control policy synthesis framework proposed in Section V, we can now formulate a complete bipedal task planning algorithm for safe online learning with respect to complex LTL specifications, detailed in Algorithm 3.
Given a biped with PIPM dynamics and LTL specification φ with desired probability of satisfaction P_sat, we first construct an IMDP-BL of the bipedal dynamics (initializing the GP models of the system and environmental uncertainties with conservative bounds) and a DRA of the LTL specification φ. We then construct a PIMDP and calculate minimum and maximum probabilities of satisfaction from the initial state (lines 1-3). If the initial minimum probability of satisfaction is too small (line 4), we next check the maximal probability of satisfaction. If this probability is less than P_sat, we know that the LTL specification cannot be satisfied with sufficient probability regardless of how well we learn the uncertainties, so the algorithm is terminated (lines 5-7). Otherwise, we use Algorithms 1 and 2 to calculate and traverse a nonviolating PIMDP, collecting a batch of data on the system dynamics for GP retraining (lines 8-14). Finally, we recalculate the PIMDP transition probabilities based on the new GPs (lines 16-18) and recheck against the desired probability of satisfaction. If we have reached the desired threshold, we output a final control policy which maximizes the probability of satisfying the specification under a minimizing adversary from all states (lines 19-20); else, we repeat the sampling process. The algorithm terminates once a satisfying control policy is found, the nonexistence of a satisfying control policy has been proved, or a maximum number of batches has been reached. Algorithm 3 provides a complete framework for the bipedal task planning problem with online learning of uncertainties, solving Problem 1.
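The outer loop described above can be sketched as a skeleton with the model-specific steps injected as callables; all three callables here are hypothetical placeholders, not the paper's implementation:

```python
def synthesis_loop(p_sat, max_batches, compute_bounds, sample_batch, retrain):
    """Skeleton of Algorithm 3's outer loop. `compute_bounds()` returns the
    pessimistic and optimistic satisfaction probabilities for the initial
    state, `sample_batch()` traverses the nonviolating sub-graph collecting
    dynamics samples (Algorithms 1-2), and `retrain()` refits the GPs and
    recomputes the PIMDP transition intervals."""
    p_lo, p_hi = compute_bounds()
    for _ in range(max_batches):
        if p_lo >= p_sat:
            return "policy-found", p_lo
        if p_hi < p_sat:
            # Even perfect uncertainty reduction cannot reach p_sat.
            return "infeasible", p_hi
        sample_batch()
        retrain()
        p_lo, p_hi = compute_bounds()
    return ("policy-found", p_lo) if p_lo >= p_sat else ("budget-exhausted", p_lo)
```

The three exits mirror the article's three termination conditions: a satisfying policy, a proof of nonexistence, or an exhausted batch budget.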

B. LOCOMOTION-SPECIFIC MODELING CONSIDERATIONS
Algorithm 3: Synthesis Algorithm.
Input: Bipedal PIPM (1), LTL specification φ, P_sat
Output: Satisfying control policy π†
1: Construct IMDP-BL I from System (1), DRA R from φ, PIMDP P from I and R, initial GP ĝ(x);
2: Calculate P̌_max((q_0, δ(q_0, s_0)) |= φ) for the initial state;
3: Calculate P̂_max((q_0, δ(q_0, s_0)) |= φ) for the initial state;
4: while (P̌_max < P_sat) and (Count < MaxBatches) do
5:   if P̂_max < P_sat then
6:     return No satisfying control policy exists;
7:   end
8:   Find nonviolating PIMDP P using Algorithm 1;
9:   Find MEC to cycle in with corresponding control policy π* using Algorithm 2;
10:  for NumSteps do
11:    Take action π*(q) at current state q;
12:    Sample terrain z_foot, yaw error θ_e;
13:  end
14:  Reconstruct GPs using collected samples;
15:  Recalculate transition probability intervals for P;
16:  Recalculate P̌_max((q_0, δ(q_0, s_0)) |= φ);
17:  Recalculate P̂_max((q_0, δ(q_0, s_0)) |= φ);
18: end
19: if P̌_max ≥ P_sat then
20:   return Control policy π†;
21: end

In order to apply our planning method to bipedal robots, additional leg kinematic considerations (e.g., full-body kinematics) must be taken into account. Thus, we must solve the problem of converting high-level waypoints produced by the PIMDP task planner into motion plans via the PIPM. Using an optimization-based approach as in [51, Algorithm 1], we first solve for the PIPM keyframe state (v_apex, z_apex) in the motion planner (see Definition 2). This gives us middle-level waypoints from which we can interpolate a trajectory of the robot's CoM motion. These middle-level waypoints can then be translated into low-level joint motion plans. We next map the low-level kinematic constraints associated with legged locomotion to the high-level PIMDP task planner. First, we consider the problem of lateral CoM motion deviation. The nonperiodic gaits we design result in sinusoidal center-of-mass trajectories, i.e., the CoM motion oscillates between the desired CoM waypoint and the actual foot contact position. To ensure accurate waypoint tracking, we follow the results in [51, Prop. V.1] and assume a bound y_1,max on the maximum lateral deviation between the target waypoint and the CoM apex position.
Similarly, we assume a bound y_2,max on the maximum lateral deviation between the CoM and foot apex positions. Fig. 3 illustrates these parameters. These global bounds can be enforced with appropriate selection of v_apex ([51, Algorithm 1]) and a restriction of the possible set of motions (which we do by restricting the set of IMDP-BL actions to be finite).
Second, we need to enforce CoM and foot position constraints in order to ensure the validity of our high-level PIMDP planner with respect to low-level full-body kinematic constraints. Our IMDP planner assumes that the robot CoM always reaches the desired x-y region at each walking step. Furthermore, although high-level planning focuses on CoM positions, we also want to ensure that the robot avoids stepping in unsafe regions. We can enforce the CoM constraint with the appropriate selection of the cell length l of the x-y grid regions, and we can obtain safety guarantees with respect to foot positions via graph pruning:
Theorem 2 (Safe Low-Level PIMDP Planning): Given CoM lateral apex deviation bounds y_1,max, y_2,max as in Section VI-B, geometrically safe planning with respect to CoM apex positions is enforced if the grid region cell length satisfies l > 2y_1,max. Additionally, safety with respect to foot positions (i.e., no foot lands in a violating region) is enforced by pruning PIMDP states for which a violating x-y grid region lies within a circle of radius y_1,max + y_2,max.
Proof: First, we prove the CoM safety condition. Since the PIMDP controller (Section IV-C) produces discrete actions which always target the center of a 2D grid region, the robot's CoM stays within the target grid region at each step under a CoM deviation of at most y_1,max if the distance from the center of each region to its boundary is at least y_1,max in all directions. For a hyperrectangular region, this is accomplished if each cell of the region has a side length greater than or equal to 2y_1,max. Note that since the target point under each step is always the center of a region, lateral deviations do not accumulate over multiple steps.
We now prove the condition of safety with respect to foot locations. We know that the maximum deviation of the foot position from the center of a targeted 2D grid region is bounded by y_1,max + y_2,max. Therefore, the range of potential foot locations is bounded by a circle with radius y_1,max + y_2,max. If the circle for a PIMDP state intersects a violating region, the PIMDP state is violating. Then, by including these new violating states in the graph pruning method detailed in Algorithm 1, we can again use Theorem 1 to guarantee that the robot will never step in a violating region. Fig. 3 illustrates these safety conditions.
Theorem 2 can be applied to guarantee safety of the PIMDP planner in the presence of low-level full-body kinematic constraints, and in particular to the special case of the backward step action in the IMDP-BL (i.e., action "S" in (6)). Physically, this motion is executed through a "turn-in-place" sequence wherein the robot performs a predefined series of steps in place to change its yaw angle by 180° and then takes a forward step to the desired location. Extending the arguments in Theorem 2, the safety of this "turn-in-place" sequence corresponds to the condition that there exists no violating region within radius y_1,max + y_2,max of the region from which the turn is initiated, since each step during the turn has a target waypoint corresponding to the center of the initial region with an increment only in the yaw angle.
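The geometric conditions of Theorem 2 reduce to two elementary checks: a cell-width test for CoM safety and a circle-versus-rectangle test for foot safety. A sketch (helper names and the numeric bounds used below are our own, not from the paper):

```python
def com_safe(cell_len, y1_max):
    """Theorem 2, CoM condition: a lateral apex deviation of up to y1_max
    stays inside a square cell targeted at its center if the center-to-edge
    distance exceeds y1_max, i.e. cell_len > 2 * y1_max."""
    return cell_len > 2.0 * y1_max

def foot_violates(center, radius, cell_min, cell_max):
    """Theorem 2, foot condition: possible foot placements lie in a circle
    of radius y1_max + y2_max around the targeted cell center. Returns True
    if that circle reaches the axis-aligned violating cell [cell_min,
    cell_max], in which case the PIMDP state must be pruned."""
    cx, cy = center
    # Distance from the circle center to the rectangle along each axis
    # (zero when the center's coordinate lies inside the cell's extent).
    dx = max(cell_min[0] - cx, 0.0, cx - cell_max[0])
    dy = max(cell_min[1] - cy, 0.0, cy - cell_max[1])
    return dx * dx + dy * dy <= radius * radius
```

The same `foot_violates` test covers the turn-in-place case, since every step of the turn targets the center of the initial region.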
An alternative approach for obtaining safety guarantees while considering low-level legged kinematic constraints would be to discretize the 2D grid environment more coarsely, allowing the foot positions to always remain in the same region as the robot CoM. This would also necessitate IMDP-BL actions corresponding to longer sequences of steps, e.g., predefined multi-step sequences. However, we choose to focus on scenarios with fine grid-world discretizations in order to showcase the level of precision available with our proposed approach and to illustrate our algorithms on an IMDP-BL structure that explicitly accounts for legged kinematic constraints.

VII. CASE STUDY
We demonstrate our methodologies utilizing the Digit bipedal robot from Agility Robotics [4]. The Digit robot weighs 45 kg and has 16 degrees of freedom: three joints in each arm (shoulder roll and pitch, elbow), and five in each leg (hip roll, pitch, and yaw, knee, and ankle).
We consider an environment discretized as a 10 × 10 grid as shown in Fig. 4, with each square region having a side length of 0.2 meters. Each region is divided into 24 yaw angle bins which are 15° wide. For each combination of grid region and yaw angle bin, there exist two IMDP-BL states: one for a left-foot contact state and one for a right-foot contact state. At each IMDP-BL state, the robot has up to four potential actions. First, the robot can walk forward along its current yaw 0.4 meters (two regions forward) or turn around and walk backward 0.4 meters (two regions backward). Additionally, in left-foot states the robot can turn right 15° and step 0.4 meters (two regions forward and one region right), and in right-foot states the robot can turn left 15° and step 0.4 meters (two regions forward and one region left). In all cases, the discrete IMDP-BL action targets the center of the IMDP-BL state which is closest to the target point under these parameters. Finally, actions which would lead outside the boundaries of the environment are removed. Note that this formulation, which accounts for bipedal legged kinematic constraints, restricts the motion trajectories of the robot considerably as compared to, e.g., a mobile robot which can target any adjacent state during grid cell transitions.
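The discretization above can be enumerated directly; the encoding below is a hypothetical sketch, and it checks out against the 4800-state IMDP-BL count quoted for the first case study:

```python
from itertools import product

# Hypothetical encoding of the case-study IMDP-BL state space: a 10 x 10
# grid, 24 yaw bins of 15 degrees, and a left/right foot-contact flag,
# i.e. 10 * 10 * 24 * 2 = 4800 states.
GRID, YAW_BINS, FEET = 10, 24, ("L", "R")
STATES = list(product(range(GRID), range(GRID), range(YAW_BINS), FEET))

def actions_for(foot):
    """Foot-dependent action set per the case study: forward and backward
    steps are always available, while the 15-degree turning step depends on
    the stance foot (left-foot states turn right, right-foot states turn
    left). Only the three explicitly described actions are enumerated, and
    filtering of actions that would leave the grid is omitted here."""
    turn = "turn_right_15" if foot == "L" else "turn_left_15"
    return ("forward_0.4m", "backward_0.4m", turn)
```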
We assume that the terrain elevation varies continuously within [−0.1, 0.1] m across the environment, and we construct a yaw perturbation for each step which is a nonlinear function of the commanded action as well as the terrain elevation error between the robot's estimate and the ground truth.

A. REACH-AVOID CASE STUDY
First, we consider a reach-avoid specification wherein the robot must reach a goal state while avoiding hazard states, written as the LTL specification φ_1 = ¬Haz U Goal. To illustrate the effectiveness of our approach under highly perturbed conditions, we assume a priori that the learnable motion perturbations can cause the robot to deviate up to 15° in yaw angle from its target during each step, along with up to an additional 10° of perturbation due to bounded stochastic noise, which we take as a zero-mean Gaussian with standard deviation σ_ν = 2.5°, truncated at ±10° = 4σ_ν. This gives an initial probability of satisfaction of 0.18. Fig. 4 shows the results of the case study simulation. The robot initially cannot safely traverse through the obstacles, so it traverses the lower half of the environment, training its GPs until it gains enough confidence to reach the upper half of the environment. From there, after collecting more samples, the probability of satisfaction P̌_max exceeds the desired threshold P_sat = 0.99, and the robot is able to successfully reach the goal.
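The stochastic yaw noise described above is straightforward to sample by rejection; a minimal sketch (the function name is ours) using the case-study parameters:

```python
import random

def sample_yaw_noise(sigma_deg=2.5, bound_deg=10.0, rng=random):
    """Rejection-sample the zero-mean truncated Gaussian used for the
    stochastic yaw noise: sigma = 2.5 deg, truncated at +/-10 deg = 4*sigma.
    At a 4-sigma truncation the rejection rate is negligible (well under
    0.01% of draws), so the loop almost always exits on the first pass."""
    while True:
        v = rng.gauss(0.0, sigma_deg)
        if abs(v) <= bound_deg:
            return v
```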
The simulation was run using Georgia Tech's PACE computing infrastructure [61]. Using a 24-core Intel Xeon Gold 6226 CPU and an RTX-6000 GPU with 32 GB of memory, the complete algorithm took 24 minutes 28 seconds to run, comprising four iterations of batch sampling. The initial sampling batch had four steps to allow the robot to gain a baseline understanding of the environment, and subsequent batches consisted of eight steps to gather more data.

B. INTERMEDIATE PERSISTENCE CASE STUDY
Next, we consider a more complex scenario to demonstrate the expressivity of the specifications considered. The specification now requires the robot to reach a goal state while avoiding hazards and also to return to a set of "checkpoint" regions at least once every three steps along its trajectory. Written as an LTL specification, this task becomes

φ_2 = (¬Haz U Goal) ∧ □¬(¬Checkpoint ∧ ◯¬Checkpoint ∧ ◯◯¬Checkpoint) (26)

Note that this specification effectively quadruples the size of the PIMDP state space, from 4800 states in the first case study to 19200 states. In this case study, the number of hazard states is reduced in order to maintain feasibility of the problem under the checkpoint constraint; otherwise, the parameters are the same as in the first case study.
FIGURE 5. Illustration of the second case study results. The robot's objective is to avoid the red hazard states and reach the green goal state while touching pink states every three steps or less along its trajectory. In this case, the trajectory of the robot is altered from the reach-avoid case study so that the robot is able to return to pink states, but it is still able to train its GP models by traversing the safe lower half of the state space until it is able to gain confidence to reach the goal state.
Fig. 5 shows the simulation results for this case study. In this case, the trajectory of the robot is noticeably altered from the first case study in order to favor "checkpoint" states, although the general structure of the environment remains similar. As in the first case study, the robot traverses the safer lower region (with an initial probability of satisfying the specification of 0.46) before collecting enough data to move upwards and eventually reach the goal.
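The checkpoint constraint can be checked on a finite trace with a simple monitor; this sketch encodes our reading of the specification (region names and the function are hypothetical): a trace satisfies it if the goal is reached hazard-free and no three consecutive steps all miss a checkpoint.

```python
def satisfies_phi2(trace, checkpoints, goal, hazards):
    """Finite-trace monitor for the reach-avoid-with-checkpoints task:
    reach `goal` without entering `hazards`, never letting three
    consecutive steps all land outside the checkpoint set."""
    misses = 0
    for region in trace:
        if region in hazards:
            return False
        if region == goal:
            return True
        misses = 0 if region in checkpoints else misses + 1
        if misses == 3:  # three consecutive non-checkpoint steps
            return False
    return False  # goal never reached
```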
Although the hazard set has been reduced and there exist many checkpoint states, the low-level legged kinematic constraints on the robot's motion make this problem challenging to solve. In particular, it is possible for the robot to enter a non-checkpoint region in a configuration such that it cannot reach a checkpoint region within the next two steps, even if such regions are adjacent to the robot's position. Because the PIMDP planner performs infinite-horizon control policy synthesis, states with these nonobvious unsafe configurations are pruned, enforcing safe robot trajectories. Thus, this case study illustrates both the planning capabilities of our abstraction-based approach and the unique challenges inherent to legged locomotion systems as compared to, e.g., mobile robots. The complete simulation took 10 hours 11 minutes, with four batches and 18 robot steps per batch. This clearly illustrates the curse of dimensionality with regard to PIMDP state-space size, which we will address in future work.

VIII. DISCUSSION, CONCLUSION, AND FUTURE WORK
In this work, we introduced a temporal-logic-based planner for bipedal robot locomotion which allows for GP online learning of system and environmental uncertainties. We first formulated an IMDP model of bipedal locomotion which produces control policies compatible with low-level legged kinematic constraints. Additionally, we incorporated stacked GP learning of correlated multi-source yaw and terrain uncertainty. Furthermore, we developed a safe learning procedure which allows a robot to traverse its environment while respecting complex LTL specifications. Finally, we demonstrated our algorithms in simulation case studies on reduced-order PIPM bipedal dynamics. Our approach intersects the fields of formal methods, learning, and locomotion in order to perform planning with respect to complex specifications on robots with nonlinear, high-DOF dynamics in highly uncertain environments.
Our work lays the theoretical and algorithmic foundations for the application of temporal-logic-based planning techniques to legged locomotion systems. In particular, we demonstrate an approach which synthesizes high-level control policies that are compatible with low-level bipedal locomotion kinematic constraints. Our future work will focus on physical hardware implementation. We note that several technical challenges remain in translating the simulation results in this work to physical hardware experiments. Among these are the need for a robust low-level controller to track the desired reduced-order PIPM trajectory and the need for experimental characterization of the motion perturbations experienced by the Digit robot, both of which are the subject of ongoing research [51]. Additionally, future work will address the computational complexity of our algorithms, in particular techniques to reduce the effect of the curse of dimensionality associated with abstraction-based methods.