Provable Traffic Rule Compliance in Safe Reinforcement Learning on the Open Sea

For safe operation, autonomous vehicles have to obey traffic rules that are set forth in legal documents formulated in natural language. Temporal logic is a suitable concept to formalize such traffic rules. Still, temporal logic rules often result in constraints that are hard to solve using optimization-based motion planners. Reinforcement learning (RL) is a promising method to find motion plans for autonomous vehicles. However, vanilla RL algorithms are based on random exploration and do not automatically comply with traffic rules. Our approach accomplishes guaranteed rule-compliance by integrating temporal logic specifications into RL. Specifically, we consider the application of vessels on the open sea, which must adhere to the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS). To efficiently synthesize rule-compliant actions, we combine predicates based on set-based prediction with a statechart representing our formalized rules and their priorities. Action masking then restricts the RL agent to this set of verified rule-compliant actions. In numerical evaluations on critical maritime traffic situations, our agent always complies with the formalized legal rules and never collides while achieving a high goal-reaching rate during training and deployment. In contrast, vanilla and traffic rule-informed RL agents frequently violate traffic rules and collide even after training.


I. INTRODUCTION
Reinforcement learning (RL) has provided promising results for a variety of motion planning tasks, e.g., autonomous driving [1], [2], robotic manipulation [3], [4], and autonomous vessel navigation [5]–[7]. RL algorithms learn a capable policy through random exploration. As random exploration is inherently unsafe, RL agents are mainly trained and tested in simulation only. To transfer the capabilities of RL-based motion planning systems to the physical world, the agents have to be safe. Safe RL extends RL algorithms with safety considerations. Most safe RL approaches constrain the learning softly, e.g., by integrating risk measures in the reward function or by adapting the optimization problem for obtaining a policy considering constraints [8]. However, for safety-critical tasks, such as motion planning in the physical world, hard safety guarantees are necessary, which most safe RL approaches cannot provide.
Provably safe RL achieves hard safety guarantees during training and operation by combining RL with formal methods [8]. The safety specifications regarded in provably safe RL are so far mainly avoid specifications, i.e., it is ensured that unsafe areas and actions are always avoided. However, the notion of safety for real-world tasks is often more complex than avoiding unsafe sets. For autonomous vehicles, legal safety is usually required, meaning that vehicles do not cause collisions by obeying traffic rules [9], [10]. To apply formal methods, these traffic rules need to be formalized. Temporal logic is suited to formalize traffic rules [9], [11]–[15], as it can capture their spatial and temporal dependencies well. Still, efficient and generalizable integration of formalized traffic rules in motion planning approaches is an open research problem.

All authors are with Technical University of Munich, Germany; TUM School of Computation, Information and Technology, Department of Computer Engineering; Munich Center for Machine Learning (MCML). {hanna.krasowski, althoff}@tum.de
In this work, we propose a provably safe RL approach that ensures legal safety by complying with traffic rules formalized in temporal logic for the application of autonomous vessel navigation. Fig. 1 displays the concept of our approach. We develop a statechart that reflects the formalized traffic rules and their hierarchy. Regular collision avoidance rules are followed as long as there is no immediate collision risk, and an emergency operation that executes a last-minute maneuver is immediately activated once a collision becomes likely. For the regular collision avoidance rules, an application-specific maneuver synthesis method based on a search algorithm is developed to efficiently identify actions that are compliant with traffic rules. For emergency operation, we detect imminent collisions of the vessels using set-based reachability analysis and design an emergency controller that aims to prevent collisions as much as possible. Rule-compliant actions for both regular and emergency operation are computed online based on our statechart and are used to constrain the RL agent so that it can only select verified actions. Our main contributions are:
• We are the first to introduce a safe RL approach that ensures provable satisfaction of open-sea maritime collision avoidance rules, which are formally specified via temporal logic;
• We improve our previously formalized maritime traffic rules [15], newly formalize the last-minute maneuver rule from the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS), and develop a rule-compliant emergency controller;
• Our provably safe maneuver synthesis for discrete action spaces efficiently identifies safe actions online;
• We train provably safe RL and safety-informed RL agents on critical maritime traffic situations and evaluate their performance in different deployment configurations on handcrafted and recorded maritime traffic data.
Fig. 1. Proposed provably safe RL approach for autonomous vessels. First, traffic rules for collision avoidance are formalized with temporal logic (see Sec. III). Based on the formal specification, the set of rule-compliant actions is identified (see Sec. IV and Sec. V), which is integrated in the RL process so that the agent can only select actions that are rule-compliant (see Sec. VI). Note that the statechart in Fig. 3 models this specification (see Sec. IV).

The remainder of this article is structured as follows: We present and discuss related literature in Sec. II, introduce relevant concepts published preliminarily to this article, and state the problem in Sec. III. We present the formalized traffic rules and prove that a statechart models the traffic rules in Sec. IV. We describe our rule-compliant maneuver synthesis in Sec. V. The RL approach is detailed in Sec. VI. In Sec. VII, we discuss our experimental results on critical maritime traffic situations and conclude in Sec. VIII.

II. RELATED WORK
We categorize related work into safety specifications for maritime motion planning, motion planning approaches for autonomous vessels, and provably safe RL.
a) Safety specification for maritime motion planning: The notion of safety in maritime motion planning is usually compliance with the maritime traffic rules describing collision avoidance maneuvers [16]. The most relevant maritime traffic rules for collision avoidance are specified in the COLREGS [17]. Often, these traffic rules are indirectly integrated in the motion planning approach, e.g., through geometric thresholds [18]–[23], virtual obstacles [24], or cost functions [6], [25]–[29]. However, these approaches usually do not capture the temporal properties of collision avoidance rules, and the implemented interpretation of the COLREGS is often opaque.
Another concept is to formalize the traffic rules and directly use them in motion planning. This is a more faithful consideration of traffic rules than the previously mentioned indirect integration. Additionally, the rule formalization is usually parameterized, which eases adaptation. Temporal logic is suited to formalize the COLREGS since it captures temporal dependencies and can thus model sophisticated specifications of encounter situations. There are two relevant studies that formalize maritime traffic rules with temporal logic. Torben et al. [30] formalize COLREGS with signal temporal logic for automatic testing of autonomous vessels. This has the advantage that robustness measures specified through signal temporal logic formulas can be used as costs for motion planning approaches, since they quantify rule compliance. Krasowski et al. [15] formalize COLREGS with metric temporal logic and evaluate compliance on real-world maritime traffic data. They discuss that the COLREGS are currently not well posed for more than two vessels, which needs to be addressed by regulators to make autonomous vessels admissible for commercial deployment in the real world. How to best employ the temporal logic formalizations of [15], [30] for motion planning is an open research question, for which we propose a solution in this work.

b) Motion planning for vessels:
The motion planning literature can be categorized into single-agent and multi-agent motion planning problems [31], where multi-agent settings are often distinguished into cooperative [32]–[34] and noncooperative [35], [36] settings. In this article, we regard single-agent motion planning. Maritime motion planners are often divided into three building blocks [37]: a guidance system generating reference trajectories, a control system for tracking reference trajectories, and a state observer. For example, one line of single-agent motion planning research employs search-based algorithms based on motion primitives, e.g., rapidly-exploring random trees [29], [38], [39]. Other studies employ model predictive control (MPC) [26], [28], [40] to obtain an optimal control signal. In contrast to using search algorithms based on a finite set of motion primitives, MPC directly optimizes the controller in the continuous state and input space. In particular, the studies [26], [28] show promising results on multi-obstacle scenarios, and Kufoalor et al. [28] even evaluate their approach in real-world experiments with two obstacle vessels. However, for MPC, an optimization problem must be solved repeatedly, which can be computationally costly.
RL is a well-suited machine learning approach to solve single-agent motion planning tasks in uncertain environments [6], [7], [41]–[44]. Regarded scenarios are usually on the open sea with other non-reactive dynamic obstacles [7], [42], [43] and static obstacles [6], [41], [44]. To achieve behavior that adheres to maritime traffic rules, the reward function considers rule compliance to minimize risks, but does not guarantee compliance because the reward function is only maximized [6], [7], [41]–[44]. In contrast, provably safe RL approaches ensure safety [8].

c) Provably safe RL: Provably safe RL approaches ensure safety during training and operation. There are three conceptual approaches for provably safe RL [8]: action replacement, action projection, and action masking. In this article, we present an action masking approach, for which the agent can only choose actions that are verified as safe. Most research on action masking considers discrete action spaces; common applications are autonomous driving [45]–[50] and power systems [51]. Usually, the action verification is tailored to the specific application and, thus, cannot be directly transferred to other applications.
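As a minimal illustration of action masking (our own sketch; the function and variable names are not from the cited works), the agent's action choice can be restricted to the subset of actions verified as safe:

```python
def masked_greedy_action(q_values, safe_mask):
    """Pick the highest-valued action among those verified as safe.

    safe_mask[i] is True iff action i was verified as safe; if no action
    is verified, a fallback (e.g., an emergency controller) must take over.
    """
    safe = [i for i, ok in enumerate(safe_mask) if ok]
    if not safe:
        raise RuntimeError("no verified-safe action; invoke fallback controller")
    return max(safe, key=lambda i: q_values[i])

# the unsafe action 0 has the highest Q-value but cannot be selected
best = masked_greedy_action([0.9, 0.1, 0.7], [False, True, True])
```

During exploration, the same mask restricts random action sampling, so the guarantee holds in training as well as deployment.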
Another way to distinguish provably safe RL approaches is by the safety specification. Most approaches consider safety specifications that can be formalized as containment in a safe set or avoiding intersection with unsafe sets. A few works regard safety specifications based on temporal logic [52]–[54], which can additionally model temporal dependencies in safety specifications. The studies [52], [53] use model checking to determine whether a given action fulfills a linear temporal logic formula, which expresses the safety specification. Their approaches are transferable between applications but limited to discrete action and state spaces. In contrast, Li et al. [54] leverage linear temporal logic specifications to synthesize control barrier functions, which are used to project unsafe actions proposed by the agent to safe actions. This allows them to apply their approach to continuous action and state spaces. However, their approach cannot deal with dynamic obstacles that are not controllable, such as other traffic participants. To the best of our knowledge, we are the first to formulate a provably safe RL approach for the application of autonomous vessels and to include temporal safety specifications in the online safety verification of RL agents operating in a continuous state space.

III. PRELIMINARIES AND PROBLEM STATEMENT
a) Notation and dynamics: We denote sets by calligraphic letters, vectors are boldfaced, and predicates are written in Roman typestyle. The Minkowski sum is defined as A ⊕ B = {a + b | a ∈ A, b ∈ B}. A traffic rulebook ⟨Φ, ≤⟩ is a tuple where Φ is the set of formalized rules and ≤ is the order [55]. We denote that the model Ξ and its initial state ξ entail the rulebook ⟨Φ, ≤⟩ by Ξ, ξ |= ⟨Φ, ≤⟩.
The state of a vessel s ∈ R^4 consists of the position p = [p_x, p_y] ∈ R^2 in the Cartesian coordinate frame as well as the orientation θ ∈ R and the orientation-aligned velocity v ∈ R. The operator proj_□ projects a state to the state dimensions indicated by □, and R(Υ) = {R(υ) | υ ∈ Υ} denotes the set of rotation matrices for the angles Υ, with R(υ) being the rotation matrix for the angle υ. To model the ego vessel (i.e., the autonomous vessel we control), we use a yaw-constrained model Ω_yc with orientation-aligned acceleration a ∈ R and turning rate ω ∈ R as control inputs:

ṗ_x = v cos(θ), ṗ_y = v sin(θ), θ̇ = ω, v̇ = a. (1)

The control input is denoted as u(t) = [a(t), ω(t)] and the initial state as s_0.

b) Set-based prediction of vessels: To obtain predictions that enclose all possible behaviors of a traffic participant, the concept of set-based predictions for road traffic participants [56] can be transferred to maritime traffic. The fundamental idea is to define abstract models and perform reachability analysis for them. We first specify the dynamics used for the prediction, and then introduce the reachable sets and occupancy sets. Finally, we discuss the special case of a closed-loop system.
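Assuming the standard kinematic reading of the yaw-constrained model (state [p_x, p_y, θ, v], inputs a and ω), its forward simulation can be sketched with a simple explicit-Euler discretization. This is only an illustration of the dynamics, not the integrator used in the paper:

```python
import math

def step_yaw_constrained(s, u, dt):
    """One explicit-Euler step of the yaw-constrained model:
    state s = [px, py, theta, v], control input u = [a, omega]."""
    px, py, theta, v = s
    a, omega = u
    return [px + v * math.cos(theta) * dt,
            py + v * math.sin(theta) * dt,
            theta + omega * dt,
            v + a * dt]

# keeping course and speed (a = 0, omega = 0) for 1 s at 5 m/s heading east
s = [0.0, 0.0, 0.0, 5.0]
for _ in range(10):
    s = step_yaw_constrained(s, [0.0, 0.0], 0.1)
```

With zero inputs the vessel travels in a straight line, which is exactly the "keep course and speed" control sequence used later for stand-on vessels.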
For vessels, we assume that the abstract model is a point-mass model Ω_pm with velocity and acceleration constraints:

ṗ_x = v_x, ṗ_y = v_y, v̇_x = a_x, v̇_y = a_y, with ‖[v_x, v_y]‖ ≤ v_pm,max and ‖[a_x, a_y]‖ ≤ a_pm,max. (2)

The maximum velocity and maximum acceleration are denoted by v_pm,max and a_pm,max, respectively. To ensure formal safety of our approach, the two constraints must be chosen such that the point-mass model over-approximates the behavior of vessels using reachset conformance [57]. The state of the model Ω_pm is abbreviated by x = [p_x, p_y, v_x, v_y].
The time-point reachable sets for the model Ω_pm are calculated with set-based reachability analysis [58] based on the initial state s_0, time step size ∆t, and the time horizon t_pred. Note that the state s_0 is transformed into x_0 by using trigonometry to convert [v, θ] into [v_x, v_y]. The time-interval reachable sets are computed as in [58], [59] and are denoted by R_∆t(s_0, Ω_pm, t_pred). To obtain the occupancy sets from the time-interval reachable sets, the reachable sets are projected to the position domain and enlarged by the spatial extensions of the vessel V rotated by all possible reachable orientations using the Minkowski sum:

O_pm(t) = proj_p(R_∆t(s_0, Ω_pm, t_pred)) ⊕ R(Υ_t) V, (3)

where Υ_t is the set of orientations reachable at time t. For a detailed derivation of the occupancy sets, we refer the interested reader to [56, Sec. V-A].
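To give a flavor of how occupancy sets over-approximate motion, here is a deliberately simplified 1-D sketch: per-axis interval bounds for a point mass with bounded acceleration, enlarged by the vessel half-extent via an interval Minkowski sum. The actual approach uses full set-based reachability [58]; the function and its arguments are our own illustration and ignore the velocity bound:

```python
def occupancy_interval(p, v, a_max, half_extent, t0, t1):
    """1-D over-approximation of where a point mass with |a| <= a_max can be
    during [t0, t1], enlarged by the vessel half-extent (interval Minkowski
    sum). The bounds p + v*t - 0.5*a_max*t^2 and p + v*t + 0.5*a_max*t^2 are
    concave resp. convex in t, so their extremes lie at the interval endpoints."""
    lo = min(p + v * t - 0.5 * a_max * t * t for t in (t0, t1))
    hi = max(p + v * t + 0.5 * a_max * t * t for t in (t0, t1))
    return lo - half_extent, hi + half_extent

# obstacle at p=0 moving at 3 m/s, |a| <= 1 m/s^2, 2 m half-length, over [0 s, 1 s]
bounds = occupancy_interval(0.0, 3.0, 1.0, 2.0, 0.0, 1.0)
```

An intersection test between the ego and obstacle intervals per axis then gives a conservative collision check, mirroring the role of the occupancy sets O_pm.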
The occupancy sets O_pm are calculated for the open-loop system Ω_pm since we do not have access to the control input of other traffic participants. However, for an ego vessel, we have a precise model Ω_yc and access to the control input. Thus, the forward simulation of our closed-loop system with the control input u(t) provides the time-point reachable sets. The occupancy is denoted by:

O_yc(s_0, u(t), t) = proj_p(s(t)) ⊕ R(θ(t)) V_ego, (4)

where s(t) is the forward-simulated state at time t.

c) Problem statement: The COLREGS specify the traffic rules for collision avoidance on the open sea for power-driven vessels in natural language. These traffic rules are satisfiable for two vessels. For more than two vessels, unsatisfiable traffic situations can occur, e.g., a vessel needs to keep its course and speed with respect to one vessel and perform an avoidance maneuver with respect to another vessel. The COLREGS do not specify how to adequately resolve such conflicting situations with more than two vessels. Due to the lack of legal specifications, we regard traffic situations with two vessels only. In particular, we assume:
1) The traffic situation is an open-sea situation without traffic signs, traffic separation zones, or static obstacles;
2) There is one traffic participant vessel obs and one autonomous vessel ego, which are both power-driven;
3) The dynamics of the autonomous vessel is modeled by (1);
4) The current state of the traffic participant vessel s_obs is observed without measurement errors;
5) In the initial state of the traffic situation, none of the collision avoidance rules specified in the COLREGS apply.
We define the traffic rulebook ⟨Φ, ≤⟩ that describes the legally relevant collision avoidance rules of the COLREGS given our assumptions 1) and 2). The formal traffic rules are denoted by Φ and the hierarchy between them by ≤. Based on the traffic rules, we search for an RL approach that ensures that the RL agent only selects safe, i.e., rule-compliant, actions leading to rule-compliant trajectories. Thus, the overall problem is to find a policy

π_s* = arg max_{π_s} E[Σ_t γ^t r_t],

where ζ_{π_s} |= ⟨Φ, ≤⟩.
The observation space of the RL agent is S, the set of provably rule-compliant actions is A_s, and the trajectories ζ_{π_s} are solutions of (1) when following the RL policy π_s. To address this problem, we first introduce the rulebook ⟨Φ, ≤⟩ and prove that a statechart Γ entails the rulebook in Sec. IV. Then, we describe the synthesis of rule-compliant maneuvers and detail the safe-by-design action selection in Sec. V. Finally, we describe the RL specification in Sec. VI.

IV. SPECIFICATION
Our previous work [15] formalizes the COLREGS rules specifying collision avoidance between two power-driven vessels on the open sea. The temporal operators used are G, F, and U; if there is a subscript, the temporal operator is evaluated over the time interval indicated by the subscript. The operator G(ϕ) evaluates to true iff ϕ is true for all future time steps. In contrast, for the operator F(ϕ), ϕ only has to be true for at least one future time step. The until operator ϕ1 U ϕ2 is true iff ϕ1 holds for all time steps until ϕ2 holds. In this section, we introduce the legal specification through a rulebook and detail the novel formalization of the emergency rule. Finally, we introduce the statechart Γ and show that it models the specification.
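The bounded-trace semantics of these operators can be sketched with a small evaluator over discrete traces. This is our own minimal illustration; the formalization in [15] uses metric temporal logic with continuous time intervals:

```python
def G(phi):
    """Globally: phi must hold at every remaining step of the trace."""
    return lambda trace, t=0: all(phi(trace, k) for k in range(t, len(trace)))

def F(phi):
    """Finally: phi must hold at some remaining step of the trace."""
    return lambda trace, t=0: any(phi(trace, k) for k in range(t, len(trace)))

def U(phi1, phi2):
    """Until: phi1 holds at every step before the first step where phi2 holds."""
    def until(trace, t=0):
        for k in range(t, len(trace)):
            if phi2(trace, k):
                return True
            if not phi1(trace, k):
                return False
        return False
    return until

# example trace of speeds; "slow" holds until the speed exceeds 10 m/s
speeds = [4.0, 6.0, 12.0]
slow = lambda trace, t: trace[t] < 10.0
fast = lambda trace, t: trace[t] > 10.0
```

On this trace, G(slow) is false, F(fast) is true, and U(slow, fast) is true, matching the informal definitions above.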

A. Traffic Rulebook
Table I lists all formalized rules considered in this work. While the predicates can be evaluated on any two vessels, the predicate arguments are set to be evaluated for the ego vessel with respect to an obstacle vessel according to the COLREGS. The traffic rule R2 enforces a safe speed, which is trivially ensured through the ego vessel dynamics. Thus, we do not include this rule in the traffic rulebook.
Definition 1 (Rules Φ). The rulebook consists of the rules R1 and R3–R6 specified in Table I.
We introduce the emergency rule R 1 to reflect the COLREGS specification that if the other vessel does not take appropriate actions for collision avoidance, the ego vessel has to react and perform a last-minute maneuver for collision risk minimization.
COLREGS Requirement 1 (Rulebook order ≤).Rule R 1 is always prioritized over rules R 3 -R 5 , and R 6 has the lowest priority.Rules R 3 -R 5 are all of equal priority.
The predicates of rule R1 are detailed in Sec. IV-B. Note that we use the term emergency maneuver for the last-minute maneuver through which the ego vessel minimizes the collision risk and thereby achieves legal safety. In the literature, the term failsafe planning is also frequently used [8], [60].
Rules R3–R6 describe how vessels have to behave in a COLREGS encounter situation. In these encounter situations, the vessels are on a collision course, meaning that the vessels would collide in the near future if no appropriate collision avoidance measures are taken. There are three different encounter situations specified in the COLREGS, as illustrated in Fig. 2: overtaking (R5, R6), crossing (R3, R6), and head-on encounters (R4). In an encounter, a vessel can be a give-way or a stand-on vessel. A give-way vessel is required to change course and perform a collision avoidance maneuver. A stand-on vessel has the obligation to keep its course and speed. The predicate for determining a stand-on vessel is keep (see Appendix-A). The stand-on rule R6 has the lowest priority since whenever the other vessel changes its course so that the ego vessel becomes the give-way vessel, the give-way rules R3 to R5 apply (see COLREGS Requirement 1).

To formalize that a give-way encounter is persistent for at least the reaction time, we use the following temporal logic specification, where {give_way} can take the values from {crossing, head_on, overtake} (see Appendix-A) and * denotes additional arguments for the predicates:

G_[0, t_react]({give_way}(*)).

We assume that both vessels keep their course and speed to obtain rule-compliant predictions for their future states. These predicted states allow us to evaluate ahead of time if the encounter situation will persist long enough so that the ego vessel has to perform a collision avoidance maneuver. The reaction time t_react does not indicate the minimum required reaction time of a human operator but instead specifies how much time the human operator would require to decide if the encounter situation persists. Given that a give-way encounter is detected, a rule-compliant collision avoidance maneuver has to be conducted until ¬collision_possible evaluates to true (see Table I, R3–R5). The time interval for performing a rule-compliant maneuver is t_react + 2 t_maneuver, where 2 t_maneuver approximates the time required for the maneuvering.
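The persistence check can be sketched as follows, assuming constant-velocity ("keep course and speed") predictions for both vessels; the predicate ahead is a hypothetical stand-in for the actual encounter predicates of Appendix-A:

```python
def encounter_persists(encounter_pred, s_ego, s_obs, t_react, dt):
    """Evaluate an encounter predicate on constant-velocity predictions of
    both vessels over [0, t_react]; states are tuples (px, py, vx, vy)."""
    def keep_course(s, t):
        return (s[0] + s[2] * t, s[1] + s[3] * t, s[2], s[3])
    steps = int(round(t_react / dt))
    return all(encounter_pred(keep_course(s_ego, k * dt),
                              keep_course(s_obs, k * dt))
               for k in range(steps + 1))

# hypothetical stand-in predicate: the obstacle is still ahead of the ego vessel
ahead = lambda ego, obs: obs[0] > ego[0]
```

Only if the predicate holds at every predicted step over the reaction time is the give-way encounter considered persistent and a maneuver triggered.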

B. Emergency Rule Predicates
We use the predicate collision_possible to determine if two vessels are on a collision course for rules R3–R6. Because the rules R3–R6 assume a constant velocity, we use the velocity obstacle concept [61] for this predicate. However, the velocity obstacle concept is not sufficient for detecting imminent risk as necessary for R1. Thus, we present four predicates in this section that are relevant for our formalization of rule R1.
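The velocity obstacle idea behind collision_possible can be sketched with a closest-point-of-approach test; this is our simplification of the actual predicate, and the names and the safety radius r are our assumptions:

```python
import math

def collision_possible(p_ego, v_ego, p_obs, v_obs, r, t_horizon):
    """True iff the vessels pass closer than r within t_horizon, assuming
    both keep course and speed (closest point of approach of the relative
    motion p_rel + v_rel * t)."""
    px, py = p_obs[0] - p_ego[0], p_obs[1] - p_ego[1]
    vx, vy = v_obs[0] - v_ego[0], v_obs[1] - v_ego[1]
    vv = vx * vx + vy * vy
    # time of closest approach, clamped to [0, t_horizon]
    t_cpa = 0.0 if vv == 0.0 else max(0.0, min(t_horizon, -(px * vx + py * vy) / vv))
    return math.hypot(px + vx * t_cpa, py + vy * t_cpa) < r
```

A head-on pair closing at constant speed triggers the predicate, while vessels on parallel courses with sufficient lateral separation do not.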
First, we define an auxiliary position predicate determining if vessel m is in a relative orientation sector of vessel l:

in_sector(s_l, s_m, β̲, β̄) ⇐⇒ (h_(θ_l+β̲)ᵀ p_m ≤ b_(l,θ_l+β̲)) ∧ (h_(θ_l+β̄)ᵀ p_m ≥ b_(l,θ_l+β̄)), (5)

where the lower relative orientation is β̲ and the upper relative orientation is β̄, both relative to the orientation of vessel l. The normal vector h_i is the unit vector in the direction i − π/2, and b_(l,i) is the offset to the origin for a line through the position of vessel l in the direction i. We illustrate the sector predicate with two specific usages in Fig. 4.

Second, we use set-based prediction for rule R1 to detect potential collisions in the near future. In particular, we predict the future occupancy of the obstacle vessel until the time horizon t_pred as described in (3) and that of the ego vessel as in (4), for the control sequence to keep course and speed as demanded for stand-on vessels. If the ego occupancy and the predicted occupancy of the obstacle vessel intersect, the ego vessel is in an emergency situation:

is_emergency ⇐⇒ ∃t ∈ [t_0, t_0 + t_pred] : O_yc(s_ego, u_keep(t), t) ∩ O_pm(s_obs, t) ≠ ∅, (6)

where t_0 is the current time.
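The auxiliary sector predicate introduced above can be sketched in angle form, which is equivalent in spirit to the half-plane formulation; this is our own simplification assuming sector bounds within (−π, π]:

```python
import math

def in_sector(p_l, theta_l, p_m, beta_lo, beta_hi):
    """True iff vessel m lies in the relative-orientation sector
    [beta_lo, beta_hi] of vessel l (angles measured counterclockwise
    relative to l's heading theta_l)."""
    bearing = math.atan2(p_m[1] - p_l[1], p_m[0] - p_l[0]) - theta_l
    bearing = (bearing + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return beta_lo <= bearing <= beta_hi
```

For instance, an "ahead sector" check would use a narrow interval around zero relative bearing, while an "astern" check would use bearings near ±π.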
Third, the predicate emergency_maneuver describes a maneuver that minimizes the risk of collision for the specific traffic situation. We detail our interpretation of emergency_maneuver in Sec. V-A.
Fourth, an emergency situation is resolved when the obstacle vessel is behind the ego vessel, is moving away from the ego vessel, and the Euclidean distance between both vessels is larger than a specified threshold.

C. Specification-compliant Statechart
The overall rule specification is modeled by the statechart Γ in Fig. 3. Due to assumption 5), the initial state in every traffic situation is the state ρ0. There are two main states for normal operation and emergency operation. During normal operation, whenever the predicate collision_possible is true, the corresponding maneuver state for R3–R6 (see blue states in Fig. 3) is entered and the collision avoidance maneuver is started.
Proposition 1. In the statechart Γ, the maneuver states ρ1–ρ4 correspond to the situations in which the predicates keep, head_on, crossing, and overtake hold, respectively.

Proof: This follows directly from the definition of the predicates keep, head_on, crossing, and overtake (see Appendix-A), which are true for the states of the statechart ρ1–ρ4, respectively.

Lemma 1. For two specific vessels, at most one of the predicates keep, head_on, crossing, or overtake can be true at the same time.
Proof: The predicates keep, head_on, crossing, and overtake cannot apply at the same time due to their mutually exclusive specification. The detailed proof is in Appendix-B.
If an emergency situation is detected, the statechart transitions to the emergency operation state until the emergency situation is resolved.
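This transition logic can be encoded as a hypothetical, minimal sketch; the state names and priority resolution are our simplification, and the actual statechart Γ in Fig. 3 is richer:

```python
def transition(state, preds):
    """One evaluation of the simplified statechart: emergency operation has
    priority; during normal operation the matching encounter state is active
    while collision_possible holds; rho0 is normal operation without
    applicable rules."""
    if state == "rho5":                       # emergency operation
        return "rho0" if preds["is_emergency_resolved"] else "rho5"
    if preds["is_emergency"]:
        return "rho5"
    if not preds["collision_possible"]:
        return "rho0"
    # encounter states rho1-rho4 (cf. Proposition 1), mutually exclusive (Lemma 1)
    for rho, enc in (("rho1", "keep"), ("rho2", "head_on"),
                     ("rho3", "crossing"), ("rho4", "overtake")):
        if preds[enc]:
            return rho
    return state

preds = {"is_emergency": False, "collision_possible": True,
         "keep": False, "head_on": True, "crossing": False,
         "overtake": False, "is_emergency_resolved": False}
```

The emergency branch is checked first, mirroring COLREGS Requirement 1 that R1 dominates R3–R6.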
Proof: The initial state ρ0 fulfills the rulebook by assumption 5) (see Sec. III.c). We continue by proving compliance with each rule:

(I) R1: Whenever is_emergency holds, rule R1 applies and rules R3–R6 do not (see COLREGS Requirement 1), which is realized by transitioning to ρ5 (see Fig. 3). The state ρ5 can only be exited iff is_emergency_resolved evaluates to true. Thus, the transitions to and from ρ5 directly represent R1.
(II) R3–R5: If collision_possible ∧ ¬is_emergency is true, then Γ has to represent rules R3–R6. Whenever collision_possible becomes true, it can be deduced from Lemma 1 and Proposition 1 that the statechart transitions to a state ρi, i ∈ {1, ..., 4}.
(III) R6: Once rule R6 applies, i.e., keep is true, the statechart transitions to ρ1 and stays there until ¬keep ∨ ¬collision_possible ∨ is_emergency. If ¬keep ∧ collision_possible, an encounter of higher priority is present (see COLREGS Requirement 1) and R3–R5 apply. In this situation, the statechart transitions to one of the states ρi for i ∈ {2, ..., 4}, and the remaining proof steps are stated in case (II). Identically to case (II), if ¬collision_possible is true, the statechart Γ transitions to ρ0, and if is_emergency holds, the statechart transitions to ρ5.

V. RULE-COMPLIANT MANEUVER SYNTHESIS
Given our specification-compliant statechart Γ, we need to identify rule-compliant actions for the individual states ρi of the statechart. Trivially, for the state ρ0 all actions are rule-compliant since no rules apply. We introduce the synthesis of emergency maneuvers in Sec. V-A and of encounter maneuvers in Sec. V-B. Finally, we detail how we ensure a selection of only safe actions for the RL agent in Sec. V-C.

A. Emergency Maneuver
Once we detect an emergency situation, i.e., the statechart is in ρ5, the ego vessel is legally required to evade the obstacle vessel in a manner that minimizes the risk of collision. In similar motion planning applications, such as autonomous driving [10], autonomous aerial traffic [62], or human-robot environments [63], states that are safe for infinite time are used to identify a legally safe emergency maneuver. In contrast, the current COLREGS do not state specifically how to interpret "minimize risk" or the characteristics of an invariably safe state. Thus, we cannot provide a formal specification. Consequently, we cannot verify risk-minimizing behavior. Nevertheless, we identify three situations in which different emergency maneuvers are appropriate: base mode, ahead mode, and stern mode (see Fig. 4).
In the ahead case (see Fig. 4b), the obstacle vessel is in the ahead sector in front of the ego vessel, and the orientation difference between the ego vessel orientation and the reversed orientation of the obstacle vessel is at most ∆_ahead. This is formalized by the predicate ahead_emergency, which uses the predicate in(ρ5); in(ρ5) evaluates to true if and only if the statechart Γ is in state ρ5. In this ahead situation, steering to the stern of the obstacle vessel would lead to an even more critical situation, as both vessels would encounter each other head-on, given the obstacle vessel approximately keeps its speed and course. Thus, we instead require the ego vessel to turn 90°. The direction of turning is determined as presented in Fig. 5. Depending on the situation, turning 90° can be enough to resolve the emergency situation. Yet, if the emergency is not resolved and the traveled distance of the ego vessel from the start of the maneuver is larger than d_min,ahead, the emergency controller switches to the base mode (see Fig. 4a) and steers the ego vessel behind the stern of the obstacle vessel.
The stern case is necessary for situations where the obstacle vessel is almost astern of the ego vessel and still relatively far away (see Fig. 4c):

stern_emergency(s_ego, s_obs, ∆_stern, u_acc(t), V_ego, V_obs, …), (7)

with the control sequence u_acc(t) = [a_stern, 0 rad s⁻¹] for t ≤ t_react and u_acc(t) = [0 m s⁻², 0 rad s⁻¹] for t_react < t ≤ t_pred. By using the set-based prediction within this predicate, we ensure that we only use this controller mode if it is certain that accelerating would resolve the situation. In such a situation, performing an emergency maneuver that navigates the ego vessel to the stern of the obstacle vessel would be an unnecessarily long detour, given that a short acceleration period would also resolve the emergency situation.
For the base case (see Fig. 4a), the emergency situation can be safely resolved by steering to a position behind the stern of the obstacle vessel. The base emergency situation is formalized by:

base_emergency ⇐⇒ in(ρ5) ∧ ¬ahead_emergency ∧ ¬stern_emergency ∧ ¬is_emergency_resolved.
Alg. 1 summarizes the control mode selection when entering the emergency operation state (see Fig. 3) and is an instantiation of the predicate emergency_maneuver of rule R1 in Table I for our problem statement. For the base and ahead modes, the target positions are depicted in Fig. 4 and obtained with the functions get_target_base and get_target_ahead, respectively. Given the target position, a reachable desired position given the current state is identified, and a control input toward this desired position is generated (for details on the controller design, see Appendix-C). The controller is abbreviated by the function tracking_controller.
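Since base_emergency is defined as the complement of the two special cases, the mode selection on entering emergency operation can be sketched as follows; this is a simplification of Alg. 1, which additionally handles mode switches such as ahead-to-base:

```python
def emergency_mode(ahead_emergency, stern_emergency):
    """Select the emergency controller mode: the ahead and stern predicates
    cover the special cases, and base_emergency holds exactly when neither
    special case applies."""
    if ahead_emergency:
        return "ahead"
    if stern_emergency:
        return "stern"
    return "base"
```

The selected mode then determines the target position (get_target_ahead or get_target_base) that tracking_controller steers toward, or the acceleration sequence u_acc(t) for the stern mode.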

B. Encounter Maneuvers
Given a persistent give-way encounter is detected (i.e., the statechart in Fig. 3 is in one of the states ρ1, ..., ρ4), we identify safe actions that result in safe maneuvers resolving the encounter.
Set-based predictions are well suited to verify that no collisions occur even if not all vessels comply with the regular collision avoidance rules R3–R6. Still, for the regular collision avoidance rules, the implicit assumption in the COLREGS is that both vessels comply with them. Thus, for identifying actions of the ego vessel that are compliant with these rules, we can use a rule-compliant prediction for the obstacle vessel. For the three specified encounter situations (see Fig. 2), we differentiate between the ego vessel being the give-way vessel (R3–R5 apply) and the stand-on vessel (R6 applies). First, we detail the verification of actions given the ego vessel is the stand-on vessel, i.e., in(ρ1). Then, we describe the more intricate synthesis given that the ego vessel is the give-way vessel (ρi where i ∈ {2, ..., 4}), and finally, we summarize our encounter action synthesis.
a) Stand-on maneuver synthesis for ρ1: The trivial action for the predicate keep is a_keep = [a = 0 m s⁻², ω = 0 rad s⁻¹], i.e., keeping course and speed. Note that for this trivial action there is no explicit maneuver time, and the action space needs to be restricted to this action until the ego vessel is no longer the stand-on vessel or an emergency is detected (see Fig. 3).
b) Give-way maneuver synthesis for ρ2–ρ4: For all give-way maneuvers, a significant change of orientation (i.e., at least ∆_large_turn) is required so that other traffic participants can identify give-way maneuvers (see Fig. 2). For head-on and crossing encounters, the give-way vessel is always obliged to turn toward the right. For the overtake encounter, the suitable turning direction depends on the orientation of the obstacle vessel, but this is not further specified in the COLREGS. For our maneuver synthesis, the turning direction is to the left if the orientation of the obstacle vessel is more to the right than the orientation of the ego vessel; otherwise, the turning direction is to the right.
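The turning-direction rule for the overtake encounter can be sketched as follows, assuming headings measured counterclockwise in radians (a hypothetical helper illustrating our interpretation, not the paper's implementation):

```python
import math

def overtake_turn_direction(theta_ego, theta_obs):
    """Turning direction for an overtake give-way maneuver: turn left if the
    obstacle heading points to the right of the ego heading, else turn right."""
    # signed heading difference wrapped to [-pi, pi); negative = to the right
    diff = (theta_obs - theta_ego + math.pi) % (2 * math.pi) - math.pi
    return "left" if diff < 0 else "right"
```

For head-on and crossing encounters no such case distinction is needed, since the COLREGS always demand a turn to the right.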
Given the turning direction, we identify candidate actions, construct maneuvers based on them, and verify whether a maneuver complies with the rules. Candidate actions lead to trajectories that already fulfill the minimal turning requirement within the maneuver segment time t_m. A maneuver is verified if the predicate collision_possible is false at the end of the maneuver and the occupancies of both vessels do not intersect during the maneuver (see (8)), where t_0 is the current time, t_end = n t_m with n ∈ N+ is the time horizon of the maneuver, s_ego,t_end is the final state of the maneuver, and u_m(t) is the control sequence for the maneuver trajectory. The predicted obstacle state at t_end is s_obs,t_end, and the set V_obs+ is the spatial extension of the obstacle enlarged by the safety factor d_obs,safety in width and length. The occupancy of the obstacle vessel is based on the assumption that the obstacle vessel keeps its speed and course, i.e., the control sequence u_keep(t). This assumption is compliant with the COLREGS collision avoidance rules for the crossing and overtake encounters. In case of the head-on encounter, the predicted trajectory for the obstacle vessel is a conservative prediction, since the obstacle vessel would also need to evade to the right to be rule-compliant. Assuming that the obstacle vessel keeps its course and speed means that the ego vessel has to turn more to resolve the encounter situation.
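As a toy illustration of this verification condition, the following sketch checks a sampled maneuver trajectory against a disc-shaped over-approximation of the occupancies. The function name `maneuver_verified`, the disc abstraction, and the radii are simplifying assumptions for illustration, not the paper's set-based implementation.

```python
import math

# Hedged sketch of the maneuver verification described above: a maneuver
# is verified if the ego occupancy never intersects the enlarged obstacle
# occupancy during the maneuver, and no collision is possible at its end.
# Occupancies are simplified to discs here (assumption for illustration).
def maneuver_verified(ego_traj, obs_traj, r_ego, r_obs_plus,
                      collision_possible_at_end):
    for p_ego, p_obs in zip(ego_traj, obs_traj):
        if math.dist(p_ego, p_obs) < r_ego + r_obs_plus:
            return False  # occupancies intersect during the maneuver
    return not collision_possible_at_end
```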
With the turning direction and the maneuver verification predicate defined in (8), we want to determine all actions that lead to verified maneuvers. The generation of maneuvers based on candidate actions is computed by a breadth-first search with rule-compliant pruning. The search algorithm is detailed in Alg. 2. Note that to obtain a control sequence for multiple actions, we introduce the function a2u. For a maneuver segment trajectory, the control input corresponding to an action is held constant for a maneuver segment time t_m while (1) is forward simulated. We initialize a search tree with a maneuver segment trajectory resulting from the candidate turning action a_c. A candidate action a_c ensures that the orientation of the ego vessel changes at least Δ_large_turn within t_m. Potentially, this first maneuver segment trajectory already results in a verifiable maneuver (cf. Alg. 2, lines 2-3). If not, the search tree is extended by (a) a maneuver segment trajectory based on the candidate action a_c (cf. Alg. 2, lines 17-18), and (b) maneuver segment trajectories for each action a ∈ A_acc, which keep the speed or accelerate the ego vessel (cf. Alg. 2, lines 19-21). If the action of the maneuver segment trajectory that should be extended (obtained with the function last) does not correspond to a_c, the maneuver is only extended with the previously used action (cf. Alg. 2, lines 24-25). This has the effect that the vessel does not switch between different accelerations during the maneuver. The expansion of the search tree is stopped (a) if at least one trajectory sequence is verified for the current search tree depth, i.e., for time horizon t_end, or (b) if the maneuver horizon t_max,m is reached. Note that t_max,m follows from the rule specification and is t_react + 2 t_maneuver. The search tree generation is illustrated in Fig. 6 for three give-way encounters. Due to the rule-compliant pruning, our search algorithm has time complexity O(n N_c N_acc) for tree generation, where N_c ∈ N+ is the number of candidate actions a_c, and N_acc ∈ N+ is the number of actions in A_acc.
c) Actions for encounter maneuvers: Alg. 3 summarizes the action verification to achieve rule-compliant maneuvers for rules R3-R6 given that the statechart Γ is in an encounter state (i.e., ∃i ∈ {1, ..., 4}: in(ρi)). We denote the search tree generation with build_st (see Alg. 2), and the detection of actions in the correct turning direction for overtake situations is abbreviated by the function get_turning_act. The result of Alg. 3 is the safe action set A_s and the verified part of the search tree G.
In an encounter situation in which the ego vessel has to give way, a maneuver of the verified part of the search tree G is performed until there is no collision risk with respect to the obstacle vessel. In particular, the actions are conducted for at least the maneuver segment time t_m. At the end of a maneuver segment, the encounter situation is either resolved, or the action selection is constrained to the children of the selected search tree node. If G is empty, the ego vessel is a stand-on vessel and the only selectable action is a_keep.
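The structure of the breadth-first search with rule-compliant pruning can be sketched as follows. This is a simplified sketch, assuming actions are opaque labels and the verification predicates are supplied as callables; it mirrors the expansion and stopping logic described above, not the paper's Alg. 2 verbatim.

```python
# Minimal sketch of the breadth-first maneuver search with rule-compliant
# pruning (in the spirit of Alg. 2). Names and the callable predicates
# segment_ok / maneuver_verified are illustrative assumptions.
def build_st(a_c, A_acc, t_m, t_max_m, segment_ok, maneuver_verified):
    """Return verified maneuvers as lists of actions (one list per maneuver)."""
    frontier = [[a_c]]          # root: the candidate turning action
    t_end = t_m
    while frontier and t_end <= t_max_m:
        # Prune sequences whose latest segment already violates the rules.
        frontier = [seq for seq in frontier if segment_ok(seq)]
        # Stop at this depth as soon as at least one maneuver is verified.
        done = [seq for seq in frontier if maneuver_verified(seq, t_end)]
        if done:
            return done
        # Expand: repeat the candidate action or switch to an acceleration,
        # but never switch between different accelerations afterwards.
        nxt = []
        for seq in frontier:
            if seq[-1] == a_c:
                nxt.append(seq + [a_c])
                nxt.extend(seq + [a] for a in A_acc)
            else:
                nxt.append(seq + [seq[-1]])  # keep the previous acceleration
        frontier = nxt
        t_end += t_m
    return []                   # no verified maneuver within the horizon
```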

C. Safe-by-design Action Selection
We utilize a discrete action space for RL since this realizes efficient online safety verification and makes the encounter action verification feasible. In particular, we define an action set A of 49 discrete actions. One action is the emergency action a_em, and the others result from the combination of turning rates and accelerations: A = {a_em} ∪ A_regular with A_regular = A_a × A_ω, (9) where A_a is the finite set describing the allowed normal accelerations and A_ω is the finite set describing the allowed turning rates.
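The construction of this action set can be sketched directly. The concrete values and cardinalities below (6 accelerations × 8 turning rates) are hypothetical and chosen only so that the total matches the 49 actions stated above; the paper specifies the actual sets A_a and A_ω in its Table II.

```python
# Sketch of the discrete action set A = {a_em} ∪ (A_a × A_ω) from (9).
# The sample values below are hypothetical; only |A| = 49 is from the text.
A_a = [-0.10, -0.05, 0.0, 0.02, 0.05, 0.10]                  # accelerations (m/s^2), assumed
A_omega = [-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04] # turning rates (rad/s), assumed

a_em = ("emergency",)                                # dedicated emergency action
A_regular = [(a, w) for a in A_a for w in A_omega]   # 6 x 8 = 48 combinations
A = [a_em] + A_regular

assert len(A) == 49
```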
In the previous sections, we derived the verification of rule-compliant actions. By constraining the RL agent to these rule-compliant actions, we ensure by design that only safe actions are executed and, consequently, only safe trajectories are performed. Theorem 2 states the solution to our problem statement in (5).
Theorem 2. Legal safety specified by ⟨Φ, ≤⟩ can be ensured through constraining the action space of the RL agent to A s (ρ) since all actions in A s (ρ) are specification-compliant actions.
Proof: To prove this statement, we derive the safe action set A s for all states of the statechart Γ.
(I) Initial state ρ 0 : Since no rules apply in this state as proven in Theorem 1, any action is compliant with the specification and A s (ρ 0 ) = A regular .
(II) Emergency state ρ5: We constrain the actions of the RL agent to the emergency action a_em returned by Alg. 1, i.e., A_s(ρ5) = {a_em}.
(III) Encounter states ρ1-ρ4: Based on Theorem 1, the maneuver predicates for the respective encounter situations must hold in these states to comply with the specification. Alg. 3 returns the synthesized rule-compliant maneuvers and the respective actions A_s(ρi) where i ∈ {1, ..., 4}.
Given A_s(ρ), we can constrain the action selection of the RL agent to A_s(ρ) with standard action masking [8] to obtain the safe policy π_s. Since the safe policy π_s only allows rule-compliant actions from A_s, the trajectories ζ_πs are compliant with the legal safety specification ⟨Φ, ≤⟩.
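A minimal, framework-free sketch of this masking step, assuming a categorical policy over the discrete action set; the fallback to a uniform distribution over the safe set when all safe actions have zero probability is our own illustrative choice, not part of the paper's method.

```python
# Sketch of standard action masking: the policy's distribution over the
# discrete action set is restricted to the verified safe set A_s(rho)
# before sampling, so only rule-compliant actions can ever be executed.
def masked_distribution(probs, safe_ids):
    """Zero out probabilities of non-verified actions and renormalize."""
    masked = [p if i in safe_ids else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:  # degenerate policy: fall back to uniform over A_s
        masked = [1.0 / len(safe_ids) if i in safe_ids else 0.0
                  for i in range(len(probs))]
        total = 1.0
    return [p / total for p in masked]
```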

VI. REINFORCEMENT LEARNING
For the task of autonomous vessel navigation on the open sea, we design a simulation environment based on CommonOcean benchmarks [64] and the yaw-constrained dynamics in (1). A CommonOcean benchmark contains a planning problem, which specifies the goal area and initial state of the ego vessel, as well as a scenario, which specifies the traffic situation, i.e., for this study, the trajectory of the obstacle vessel and the navigational area. At the start of an episode, a CommonOcean benchmark is randomly selected from the training set and the agent is provided with the initial observation. Based on the observation, the agent selects an action from the action set and receives the corresponding reward and the next observation of the environment (see Fig. 1).
If the safety verification is activated, the agent can only select from the verified safe action set A_s as derived in Sec. V. We regard a setting with finite time horizon episodes and terminate the episode in specified situations (see Sec. VI-A).
The observation space, termination conditions, action space, action selection constraints, and reward function are detailed in the following paragraphs.

A. Observation Space and Termination
The observation space has 27 dimensions. We specify four types of observations: ego vessel observations, goal observations, surrounding traffic observations, and termination observations. Fig. 7 visualizes the ego vessel observations, goal observations, and surrounding traffic observations for the time step t.
The four ego vessel observations are the velocity v_ego and orientation θ_ego of the ego vessel state s_ego, and the acceleration a_ego and turning rate ω_ego corresponding to the ego vessel control input. The five continuous goal observations are the Euclidean distance to the goal d_goal, the remaining time steps until the maximal time step of the episode k_max, the orientation difference to the goal orientation range β_goal, and the longitudinal d_long and lateral d_lat position with respect to the line from the initial state to the center of the goal area. The observations d_long and d_lat are relevant since they indicate the deviation of the ego vessel from the optimal path when no other vessels need to be avoided. Additionally, we provide one Boolean goal observation that evaluates to true whenever min(|d_lat|, |d_long|) is larger than the distance d_hull, i.e., the ego vessel is far away from the path between the initial state and the goal area.
The surrounding traffic observations are the distance d_j, angle β_j, and distance rate ḋ_j for the detected vessel in sector j ∈ {1, ..., J}, where J is the number of sectors. Vessels are only detected if their Euclidean distance to the ego vessel is at most the sensing distance d_sense. For this study, we align the sectors with the sectors specified for the COLREGS collision avoidance rules. Thus, we obtain the four sectors front, left, right, and behind, and twelve observation variables, as depicted in Fig. 7.

The five termination observations are Boolean observations and indicate whether
• the maximal time step was reached (1_time = 1),
• the vessel is outside of the navigational area (1_area = 1),
• the vessel velocity is zero (1_stopped = 1),
• the vessel collided (1_collision = 1),
• the vessel reached the goal area (1_goal = 1).
We terminate the episode when the ego vessel has stopped, as reverse driving is not meaningful on the open sea, and the termination resets the agent to a more meaningful initial state of another CommonOcean benchmark. The termination conditions follow directly from the termination observations, as we terminate the episode as soon as one of these observations evaluates to true.
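Putting the four observation groups together, the 27 dimensions can be assembled as in the following sketch. The field and function names are illustrative assumptions; only the group sizes (4 + 5 + 1 + 12 + 5 = 27) are taken from the text.

```python
# Illustrative composition of the 27-dimensional observation vector from
# the observation groups described above; all field names are assumptions.
def build_observation(ego, goal, sectors, term):
    assert len(sectors) == 4          # front, left, right, behind
    obs = [ego["v"], ego["theta"], ego["a"], ego["omega"]]          # 4 ego
    obs += [goal["d_goal"], goal["k_remaining"], goal["beta_goal"],
            goal["d_long"], goal["d_lat"]]                          # 5 goal
    far_from_path = min(abs(goal["d_lat"]), abs(goal["d_long"])) > goal["d_hull"]
    obs += [1.0 if far_from_path else 0.0]                          # 1 Boolean goal
    for s in sectors:                                               # 12 traffic
        obs += [s["d"], s["beta"], s["d_dot"]]
    obs += [term[k] for k in ("time", "area", "stopped",
                              "collision", "goal")]                 # 5 termination
    assert len(obs) == 27
    return obs
```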

B. Reward
The reward is designed such that the agent is reinforced for goal-reaching behavior and penalized for unsafe or inefficient behavior. In particular, we design a reward function based on sparse and dense components. The sparse rewards are related to the termination conditions and to using the emergency planner, where the c_i denote the reward coefficients, which are all negative except for c_goal. Additionally, we define four types of dense rewards: for COLREGS compliance, advancing to the goal, keeping the velocity, and deviation from the path between the initial state and the goal. To incentivize behavior that is compliant with the collision avoidance rules specified in the COLREGS, we utilize a reward component specified in [41, Eq. (26)], in which the angle ϕ ∈ [−π, π] specifies the relative angle between the ego orientation and the orientation toward the obstacle vessel, v_obs,ϕ specifies the velocity component of the obstacle vessel velocity in the radial direction from the ego vessel to the obstacle vessel, and d_obs is the observed distance to the obstacle vessel, i.e., the respective d_j. The parameters α, γ_ϕ,dyn, ζ_v, and ζ_obs,d are set to the same values as defined in [41]. Further, we define a reward component r_goal that supports the agent in learning how to reach the goal by providing a reward proportional to the advance toward or retreat from the goal since the previous time step; here, the center position of the goal area is p_goal, p_ego,t is the current ego position, p_ego,t−1 is the ego vessel position at the previous time step, and c_reach is a scaling coefficient. On the open sea, vessels typically navigate in a narrow speed range. To enforce this also for the RL agent, the reward component r_velocity provides a penalty proportional to the deviation from the desired speed range; the parameters v_low and v_high define the speed range bounds, and c_v is the reward coefficient. The last reward component informs the agent about its deviation from the direct path between the initial
state and the goal area, where the coefficient c_deviate scales the penalty proportional to the absolute lateral deviation |d_lat|, and c_deviate d_hull is the maximum of the reward component r_deviate. Finally, the reward function is given by the sum of all components: r = r_sparse + r_colregs + r_goal + r_velocity + r_deviate. (10)

VII. NUMERICAL EXPERIMENTS

Critical encounter situations are rare in maritime traffic data. Thus, such data is not well suited for training RL agents that should learn how to handle encounter situations. Therefore, we construct random CommonOcean benchmarks [64] that represent critical encounters as the foundation of our simulation environment. In particular, we initialize the ego vessel and the other vessel approximately 2000 m to 3500 m away from their closest encounter position. The initial velocity range for both vessels is [3 m s⁻¹, 7 m s⁻¹]. For the obstacle vessel, we generate a trajectory with close to constant course and speed, and disturb the initial orientation and velocity with values sampled uniformly from [−0.05 rad, 0.05 rad] and [−0.1 m s⁻¹, 0.1 m s⁻¹], respectively, to make the trajectory more realistic. The goal area is approximately 4500 m away from the initial position of the ego vessel and is 400 m long and 60 m wide. The time horizon for a scenario is k_max = 170 time steps with a time step size of Δt = 10 s. In total, we constructed 2000 CommonOcean benchmarks [64] and randomly split them into a 70 % training and 30 % testing set. The model of the ego vessel is the yaw-constrained model in (1), and we use the parameters of a container vessel 2. We reduce the maximum velocity specified in the vessel parameters to 9.5 m s⁻¹ to better match a realistic velocity range for open-sea maneuvering.
Next to the simulation environment, we need to specify parameter values for the safety verification approach, the ego vessel, and reinforcement learning. Table II summarizes the parameters. Note that the emergency controller can use the full control input space specified for the ego vessel through the intervals [−a_max, a_max] and [−ω_max, ω_max]. For normal operation, we reduce the control input limits to a more reasonable range for open-sea maneuvering. This is reflected by the sets of allowed accelerations A_a and turning rates A_ω (see Table II). As model-free RL algorithm, we use proximal policy optimization (PPO) [65]. Our implementation is based on stable-baselines3 [66] and the action masking implementation in [8]. The agent networks are multi-layer perceptrons with two layers and 64 neurons per layer.

A. Evaluation concept
To comprehensively evaluate our approach, we introduce two benchmark agents next to our provably safe agent and compare different deployment setups. We train all three agents in our simulation environment, which is based on the training data of critical CommonOcean benchmarks [64]. The trained agents are:
1) the baseline agent with the reward function r = r_sparse + r_goal + r_velocity + r_deviate, i.e., r_colregs = 0 in (10), and no safety verification,
2) the rule-reward agent, which is informed by the COLREGS reward r_colregs, i.e., reward function (10), and
3) the safe agent with safety verification and reward function (10).
The baseline agent represents a straightforward RL implementation in which the agent is informed about unsafe actions only sparsely through a collision penalty. The rule-reward agent models the state of the art for traffic-rule-informed open-sea vessel navigation [6], [7], [41], [42], because its reward function includes the COLREGS reward r_colregs. For each agent type, we use ten random seeds and train an agent per seed for three million environment steps.

2 The container vessel is the vessel type 1 from commonocean.cps.cit.tum.de/commonocean-models.
We evaluate the deployment performance of the trained agents on the testing set of the handcrafted critical scenarios and on scenarios from recorded traffic data 3. For the rule-reward and baseline agents, we investigate the performance without safety verification, i.e., as trained, and with safety verification enabled. Including the safety verification after training allows us to evaluate whether guaranteeing traffic rule compliance only after training is sufficient. Note that the action space of the two benchmark agents is A_regular, except for deployment with safety verification.
We consider critical scenarios from recorded traffic data to examine the generalization of the agents to real-world situations. To this end, we use marine traffic data from three large open-sea areas off the US coast from [15] and extract critical encounters. In particular, we only use scenarios in which the distance between two vessels drops to 5000 m or lower. Further, we ensure that the paths of both vessels cross each other. Then, we replace one vessel by an ego vessel to generate the initial state and goal area. The initial state is part of the recorded trajectory and is selected about 2000 m before the closest encounter. The position of the goal area is also part of the recorded trajectory and is about 2000 m after the closest encounter. We use the same shape for the goal area as in our handcrafted scenarios. In total, we identify 49 critical scenarios in the three open-sea areas from traffic data of January 2019 (about 30 GB of raw Automatic Identification System (AIS) data).
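The two filter criteria above (minimum distance of 5000 m or lower, crossing paths) can be sketched as follows. This is a hedged sketch under simplifying assumptions: trajectories are synchronized position samples, and path crossing is approximated by intersecting the straight segments between each trajectory's first and last points.

```python
import math

def min_distance(traj_a, traj_b):
    """Minimum Euclidean distance over synchronized trajectory samples."""
    return min(math.dist(p, q) for p, q in zip(traj_a, traj_b))

def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly intersects segment q1-q2."""
    def orient(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (orient(p1, p2, q1) * orient(p1, p2, q2) < 0 and
            orient(q1, q2, p1) * orient(q1, q2, p2) < 0)

def is_critical(traj_a, traj_b, d_crit=5000.0):
    """Keep encounters that get close enough and whose paths cross."""
    return (min_distance(traj_a, traj_b) <= d_crit and
            segments_cross(traj_a[0], traj_a[-1], traj_b[0], traj_b[-1]))
```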
We evaluate our agents based on the goal-reaching rate, reward, episode length, collisions, emergency controller usage, and rule violations. Rule violations reflect how often per episode the regular collision avoidance rules are violated. For that, we count:
• every time step in which the stand-on vessel obligation is violated;
• every crossing, overtaking, and head-on encounter for which no proper collision avoidance maneuver is taken.

a) Training evaluation: Fig. 8 shows the training curves for the three agent types. The average reward curves show similar convergence across agent types, although the baseline and rule-reward agents achieve slightly higher rewards after three million training steps. Note that for the displayed reward curves, the emergency penalty and the COLREGS reward term r_colregs are subtracted for comparability. The goal-reaching rate curves mirror the reward curves, and the agents reach the goal in about 90 % of all scenarios at the end of training. We observe that the agent types without safety verification reach the goal slightly more often. Importantly, there are no collisions and no rule violations for the safe agent (see Fig. 8c and Fig. 8d). For the baseline and rule-reward agents, the collision rate remains relatively stable at around 5 % during the full training time. Rule violations for the baseline and rule-reward agents slightly decrease but never reach zero. This suggests that complying with the COLREGS effectively achieves collision avoidance.
b) Deployment evaluation: The results averaged over ten random seeds for each agent type are summarized in Table III. For the handcrafted scenarios, the rule-reward agents reach the goal in 90.7 % of the scenarios. This is about 5 % higher than for the baseline and safe agents. Yet, only the safe agent achieves zero collisions and no rule violations. The rule-reward agent collides and violates the rules fewer times than the baseline agent. If the safety verification is enabled for the baseline and rule-reward agents, the goal-reaching rate drops significantly, by approximately 40 %. Additionally, for the safe agent, the emergency controller intervenes on average in 6 % of the time steps in an episode, whereas for the rule-reward and baseline agents with activated safety verification, the emergency controller is needed in approximately 10 % of the time steps in an episode.
Table III also displays the testing results on the 49 critical recorded traffic scenarios for the different agent types. The rule-reward agent reaches the goal most often and exhibits the lowest average episode length. Interestingly, the goal-reaching rate for the baseline and rule-reward agents drops only by about 5 % when activating our safety verification approach. The collision rate and rule violation rate are smaller than for the handcrafted scenarios. With activated safety verification, we observe no collisions and no rule violations. Note that the differences in the reported means for goal-reaching rate and emergency steps between the agents with activated safety verification are statistically insignificant 4.

C. Discussion

a) Safety in handcrafted scenarios: The safety verification ensures that the encounter traffic rules are never violated, and we empirically observe that no collisions occur. However, this results in a lower goal-reaching rate than for the soft-constrained rule-reward agent. One reason for this observation might be that, with safety verification, the task is more difficult to solve, since the agent is often constrained to avoidance maneuvers before it can maneuver freely again. Thus, the safe agent can explore less freely compared to the baseline and rule-reward agents. The drop in the goal-reaching rate when the safety verification is enabled only after training is likely due to a distribution shift, as the baseline and rule-reward agents are probably led to states that they explored less frequently or not at all during training.

b) Safety on recorded scenarios: In contrast, testing the rule-reward and baseline agents with safety verification on the scenarios from recorded traffic data does not lead to such a significant drop. At the same time, the agent setups without safety verification exhibit fewer rule violations and fewer collisions on the recorded maritime traffic scenarios. Both observations indicate that the scenarios based on recorded data are less
critical than the handcrafted situations and, thus, easier to solve for the agents that were not constrained to rule-compliant actions during training. Generally, the agents generalize well to the scenarios based on recorded data. Since identifying critical situations in recorded maritime traffic data is computation-heavy and critical situations are very rare, the small gap between realistic recorded and randomly handcrafted situations is compensated by being able to create many scenarios: the 49 critical situations resulted from one month of maritime traffic data off the US coast, whereas the 2000 handcrafted critical situations were generated in a matter of minutes. Yet, recorded scenarios do not fully represent the variety of the real world. Thus, future work should investigate whether our safe agent also performs well on a real-world test bed.

c) Requirements for multi-vessel traffic situations: Real-world traffic situations can include more than two vessels on a collision course. Our formalized traffic rules can be evaluated for these more complex traffic situations, as demonstrated in [15]. Yet, the current version of the COLREGS does not provide a clear collision avoidance specification if more than two vessels are involved. Thus, a formal verification cannot be developed due to the lack of a clear specification. Future work should investigate extensions of the COLREGS to fill this specification gap and consequently realize provably rule-compliant motion planning in multi-vessel traffic situations.

d) Action space choice: The discrete action space makes it possible to efficiently identify rule-compliant actions. However, a continuous action space would allow the agent to explore all possible actions. This significantly increases the challenge of identifying safe actions, because a continuous action space contains infinitely many individual actions. Yet, one approach to investigate in future work could be obtaining rule-compliant state sets as proposed in
[67] and correcting actions proposed by the agent to safe actions, e.g., with action projection as in [68].
e) Satisfiability of rules: The parametrization of the temporal logic rules eases re-adjusting to regulation changes. Yet, these parameters must be manually tuned to ensure that the temporal logic rules are satisfiable. For example, it is important that an encounter situation is detected early enough so that no emergency situation arises during a give-way maneuver. For instance, theorem provers could help to verify that the chosen rule parameters guarantee that the rules are satisfiable. However, formulating this proof is challenging due to the continuous state and action spaces, and is subject to future work.

VIII. CONCLUSION
We are the first to propose a provably safe RL approach for autonomous power-driven vessels on the open sea that achieves provable compliance with traffic rules formalized in temporal logic. To this end, we introduced an online verification approach that identifies the set of safe actions based on our formalized rules. Our formal emergency detection and emergency controller achieve collision avoidance for the regarded traffic situations even if other vessels do not comply with the traffic rules. In critical maritime traffic situations, our safe RL agent achieves rule compliance, in contrast to state-of-the-art agents that are informed about safety only through the reward. At the same time, all agents achieve a satisfactory goal-reaching performance on critical traffic situations. Our evaluation on recorded traffic situations shows that our safe RL agent generalizes beyond the distribution of the training data. This study is a first step toward learning-based motion planning systems complying with traffic rules for autonomous vessel navigation.

APPENDIX
A. Predicates specified in [15]
In Table IV, we briefly recapitulate the predicates specified in [15] and refer the interested reader to our previous work [15] for detailed explanations. Subsequently, we introduce the necessary notation that was not yet introduced in this article and explain the re-parametrization of the predicate collision_possible.
The trajectory of vessel i consists of states at discrete time steps and is denoted by T_i. The velocity vector based on the state of the vessel is v_i = proj_v(s_i) unit_v(s_i). We define a clock cl(T_i, s_i) that starts at the initial time step of a trajectory and returns the elapsed time for a state s_i. Further, we require a function state(T_i, t_k), which returns the state of a trajectory at time t_k. The modulo operator mod(a, b) returns the remainder of a/b for a, b ∈ R using floored division. The function t_s returns the time at which the respective predicate of a predicate trace last changed from false to true. The collision cone CC′ is based on the velocity obstacle concept [61], and its construction is detailed in [15, Fig. 1].
For this work, we made two re-parametrizations of collision_possible, which determines whether two vessels l and m are on a collision course and, thus, could collide within the time t_horizon. First, we also want to detect a collision course if the vessels would pass each other with insufficient distance. Thus, we use r_m = 3 l_m for the collision cone CC′ instead of r_m = l_m as in [15, Fig. 1]. This results in detecting a collision possibility if the vessels would not keep a safe distance of at least two lengths of vessel m. Second, we evaluate a set of vessel velocities V_l with respect to their collision possibility instead of only the current velocity v_l. In particular, we check the collision possibility for all velocities that deviate from v_l by at most the velocity difference v_ε, which we set to 1 m s⁻¹ for our numerical evaluations.
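The second re-parametrization can be illustrated with the following sketch, which evaluates a finite sample of velocities within v_ε of the current velocity against a heavily simplified cone test. The cone test is a toy stand-in for the velocity obstacle construction of [61] (where the enlarged radius r_m = 3 l_m would widen the cone), and all names are assumptions.

```python
import math

def in_collision_cone(v, rel_pos, half_angle):
    """Toy cone test: velocity points toward the obstacle within half_angle."""
    diff = abs(math.atan2(v[1], v[0]) - math.atan2(rel_pos[1], rel_pos[0]))
    return min(diff, 2 * math.pi - diff) <= half_angle

def velocity_set(v_l, v_eps=1.0, n=8):
    """Finite sample of velocities deviating from v_l by at most v_eps."""
    vs = [v_l]
    for k in range(n):
        ang = 2 * math.pi * k / n
        vs.append((v_l[0] + v_eps * math.cos(ang),
                   v_l[1] + v_eps * math.sin(ang)))
    return vs

def collision_possible(v_l, rel_pos, half_angle, v_eps=1.0):
    """True if any sampled velocity near v_l falls into the collision cone."""
    return any(in_collision_cone(v, rel_pos, half_angle)
               for v in velocity_set(v_l, v_eps))
```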

B. Proof of Lemma 1
Proof: To prove that only one of the predicates keep, crossing, head_on, and overtake can evaluate to true, we show for each combination that the conjunction is false when evaluated for two vessels l and m. (I) For the combination of crossing and head_on, it follows directly from the relative positions detected by the respective sector predicates that the predicates cannot be true at the same time. (II) For the combination of crossing and overtake, let us assume that the crossing predicate is true. Then, vessel m is oriented toward the left and is in the right sector of vessel l (see Fig. 3 and Fig. 4 in [15]). Thus, it is geometrically impossible for vessel l to be in the behind sector of vessel m, and overtake cannot be true. (III) The predicates head_on and overtake cannot be true simultaneously, as the relative positions and orientations contradict each other similarly to case (II). In particular, if vessel m is in the front sector of vessel l and their relative orientation is in [π − Δ_head-on, π + Δ_head-on], then vessel l cannot be in the behind sector of vessel m, and the conjunction evaluates to ⊥.

The predicate keep is a disjunction of two cases in which the vessel has to keep its course and speed. Thus, we have to show that both statements of the disjunction evaluate to false. The explanations for the equation steps are marked with small letters in round brackets, e.g., (a), and follow after the respective equations.
Fig. 1. Proposed provably safe RL approach for autonomous vessels. First, traffic rules for collision avoidance are formalized with temporal logic (see Sec. III). Based on the formal specification, the set of rule-compliant actions is identified (see Sec. IV and Sec. V) and integrated in the RL process so that the agent can only select rule-compliant actions (see Sec. VI). Note that the statechart in Fig. 3 details the computation of verified rule-compliant actions and comprises two modes: normal operation and emergency operation. The resulting safe agent achieves rule compliance and collision avoidance during training and deployment, while agents without the safety verification of actions violate the formalized traffic rules and still collide after training (see Sec. VII).

Fig. 3 .
Fig. 3. Statechart Γ modeling the legal safety specification with predicates at the transitions. The states for the regular collision avoidance rules R3-R6 are depicted in blue and the emergency operation state for rule R1 in red. For safety verification of actions, the algorithms identifying the set of rule-compliant actions (indicated in brackets) are employed given the current state ρ_i of the statechart.

Fig. 4 .
Fig. 4. Emergency controller modes with the set-based occupancy prediction of the obstacle vessel in orange and the occupancy of the ego vessel in blue for several time intervals. The orientations of the ego vessel and the obstacle vessel are indicated with dashed lines, and the emergency maneuver is depicted by green arrows or occupancies. The green cross indicates the target position for the base and ahead modes. The sectors for which the predicate in_sector is true are shown in gray for the ahead and stern modes. The visualization of the sectors includes the arguments of the predicate in_sector in dark blue and the point of origin in black.

Fig. 5 .
Fig. 5. Visualization of the turning direction cases. The obstacle vessel is depicted in orange and the ego vessel in blue. Arrows indicate orientations, and positions are marked with dots. The turning direction case is indicated by the superscript. For cases 1 and 3, the ego vessel should turn right, and for cases 2 and 4, the ego vessel should turn left.

,Fig. 7 .
Fig. 7. Illustration of observations with sensing range and four sectors in gray, goal region in green, initial position with direct path to goal region in blue, and obstacle vessel for the previous time step t − 1 and the current time step t in orange.

Fig. 8 .
Fig. 8. Mean and bootstrapped 95 % confidence interval for training curves for baseline, rule-reward, and safe agents averaged over ten random seeds.

(IV) overtake ∧ keep:
overtake(s_l, s_m, •) ∧ keep(s_l, s_m, •)
(a) = (overtake(s_l, s_m, •) ∧ (in_left_sector(s_l, s_m) ∧ ...)) ∨ (overtake(s_l, s_m, •) ∧ overtake(s_m, s_l, •))
(b) = (in_behind_sector(s_l, s_m) ∧ ... ∧ in_left_sector(s_l, s_m) ∧ ...) ∨ (overtake(s_l, s_m, •) ∧ overtake(s_m, s_l, •))
(c) = ⊥ ∨ ⊥ = ⊥

Algorithm 1 emergency_maneuver(s_ego, s_obs, *). Input: current state of ego vessel s_ego, current state of obstacle vessel s_obs, emergency mode mode, initial time t_0, time step size Δt, acceleration control sequence u_acc(t). Output: control input u(t_i). While the emergency is not resolved, the algorithm selects a target position (get_target_ahead if the ego vessel has moved at least d_min,ahead in ahead mode, otherwise get_target_base), computes [a, ω] with a tracking controller, returns u(t_i) = [a, ω], steps the environment, and advances t_i ← t_i + Δt.

Algorithm 2 build_st. Input: candidate action a_c, accelerating actions A_acc, current state of obstacle vessel s_obs, current state of ego vessel s_ego, maneuver segment time t_m, maneuver horizon t_max,m, control sequence u_keep(t). Output: verified part of the search tree G.

Algorithm 3 Encounter action verification. Input: stand-on action a_keep, turning-right actions A_tr, turning-left actions A_tl, accelerating actions A_acc, current state of obstacle vessel s_obs, current state of ego vessel s_ego, encounter predicate ψ_e. Output: set of safe actions A_s, verified part of the search tree G.

Fig. 6. Example search trees for the three give-way encounter situations in which the ego vessel has to give way. The prediction of the obstacle vessel is depicted in orange and the maneuver segment trajectories in green with a dot for the final state. The trajectories based on actions from A_acc are displayed as dashed lines; only one trajectory based on actions from A_acc is displayed for visualization purposes. The candidate actions initializing the search trees are a_d,1 and a_d,2, where d is either tr for turning right or tl for turning left. The mark ✓ indicates that the maneuver is verified by the maneuver_verified predicate, and ✗ indicates that the maneuver is not rule-compliant.

TABLE III
Testing results on 600 handcrafted and 49 recorded scenarios. Note: The rule-reward and baseline agents are abbreviated as RR and base. Ep. length is the average episode time horizon. Emerg. steps denotes the percentage of steps for which the emergency controller intervened.