COLREG-Compliant Collision Avoidance for Unmanned Surface Vehicle Using Deep Reinforcement Learning

Path following and collision avoidance, be it for unmanned surface vessels or other autonomous vehicles, are two fundamental guidance problems in robotics. For many decades, they have been the subject of academic study, leading to a vast number of proposed approaches. However, they have mostly been treated as separate problems, and the proposed solutions have typically relied on non-linear first-principles models with parameters that can only be determined experimentally. The rise of deep reinforcement learning (DRL) in recent years suggests an alternative approach: end-to-end learning of the optimal guidance policy from scratch by means of trial and error. In this article, we explore the potential of Proximal Policy Optimization (PPO), a DRL algorithm with demonstrated state-of-the-art performance on continuous control tasks, when applied to the dual-objective problem of controlling an autonomous surface vehicle in a COLREGs-compliant manner such that it follows an a priori known desired path while avoiding collisions with other vessels along the way. Based on high-fidelity elevation and AIS tracking data from the Trondheim Fjord, an inlet of the Norwegian Sea, we evaluate the trained agent's performance in challenging, dynamic real-world scenarios where the ultimate success of the agent rests upon its ability to navigate non-uniform marine terrain while handling challenging, but realistic vessel encounters.


Introduction
Autonomous vehicles are one of the most interesting prospects associated with the rise of Artificial Intelligence (AI) and Machine Learning (ML) in recent years. Specifically, the success of Deep Learning (DL) applications in an ever-increasing number of domains, ranging from computer vision to imperfect-information games, has put the former pie-in-the-sky proposal of self-driving vehicles on the horizon of technological development.
While automated path following, at least in the maritime domain, has been a relatively trivial endeavor in the light of classical control theory and is a well-established field of research [1][2][3][4][5][6][7][8][9][10][11], considerably more advanced capabilities are required to navigate unknown, dynamic environments; characteristics that, generally speaking, apply to the real world. Reactive collision avoidance, i.e. the ability to perform evasive manoeuvres that mitigate collision risk based on a sensor-based perception of the local environment, remains a very challenging undertaking (e.g., see [12][13][14][15]). This is not to say, however, that the topic is not well-researched; a wide variety of approaches have been proposed, including especially (but not exhaustively) artificial potential field methods [16][17][18], dynamic window methods [19][20][21], velocity obstacle methods [22,23] and optimal control-based methods [24][25][26][27][28]. However, it appears from a literature review that, when applied to autonomous vehicles with non-holonomic and real-time constraints, the approaches suggested so far suffer from one or more of the following drawbacks [29][30][31][32]:
• Unrealistic assumptions about, or neglect of, the vessel dynamics.
• Inability to scale to environments of non-trivial complexity (e.g. multi-obstacle scenarios).
• Excessive computation time requirements.
• Disregard for desirable output trajectory properties, including smoothness, continuity, feasibility and safety.
• Incompatibility with external environmental forces such as wind, currents and waves.
• Stability issues caused by singularities.
• Sub-optimal outputs due to local minima.
• Requirement of a precise mathematical model of the controlled vessel.
Focusing on the maritime domain, this paper explores how Deep Reinforcement Learning (DRL), a machine learning paradigm concerned with using DL to iteratively approximate optimal behavior policies in unknown environments, can be used to train an end-to-end autopilot mechanism capable of avoiding collisions at sea. When path following and collision avoidance are treated as separate challenges, DRL-based methods have already demonstrated remarkable potential, yielding promising results in a multitude of studies, including especially [33][34][35][36][37] for the former problem domain and [38][39][40][41] for the latter.
For a preliminary study, we simulate a small-sized supply ship model, equip it with a rangefinder sensor suite and train a DRL-based controller using Proximal Policy Optimization (PPO). A carefully constructed reward function, which balances the prioritization of path adherence against that of collision avoidance (which can be considered competing objectives), is used to guide the agent's learning process. Finally, we evaluate its performance in challenging, dynamic test scenarios reconstructed from real-world terrain and maritime traffic data.

Policy Network
Figure 1: Flowchart outlining the structure of the guidance system explored in this study. At each time-step, the agent receives an observation vector s^(t), and then, according to its policy π, which is implemented as a neural network, outputs an action (i.e. control vector) a^(t), influencing the state of the simulated environment. During training, the agent's policy is continuously improved by means of gradient ascent based on the reward signal r^(t) that it receives at each time-step. This constant feedback enables the agent, whose policy is initially nothing more than a clean slate with no intelligent characteristics, to improve its capabilities through a trial-and-error based approach. Its learning objective is simple: find the policy that yields the highest expectation of the agent's long-term future reward.

Motivation
Arguably, the most promising aspect of autonomous vessels is not the obvious economic impact resulting from increased efficiency and the replacement of costly human labor, but instead the potential to eliminate injuries and material damage caused by collisions. According to the European Maritime Safety Agency, which annually publishes statistics on maritime accidents related to the EU member states, almost half of casualties at sea are "navigational in nature, including contact, collision and grounding or stranding" [42].
Validating a DRL-based approach to vessel guidance in a simulated environment can pave the way for applying the technology on a real, physical vessel. Since maritime collisions, of which 65.8% can be attributed to human error [43], account for hundreds of injuries each year in the EU alone (as shown in Figure 2), a positive result could be a preliminary step on the important path towards the adoption of AI systems for autonomous vessel guidance. Due to the limitations of existing methods, this is yet to take place on a large scale.

Background

Maritime Navigation Rules
For collision avoidance at sea, adherence to the International Regulations for Preventing Collisions at Sea (COLREGs) [44] is crucial. Formulated long before autonomous vessels became a possibility, the COLREGs were designed to prevent collisions between two or more vessels. The two main takeaways from these rules relevant for this work are: 1) the give-way vessel should take early and substantial action, and 2) a safe speed should be maintained at all times, such that course alteration is effective towards avoiding collisions where there is sufficient sea-room. Furthermore, the following rules provide clear instructions on how maritime vessels should behave upon encounters with other ships.
Rule 14: Head-on situation
(a) When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision, each shall alter her course to starboard so that each shall pass on the port side of the other.

Rule 15: Crossing situation
When two power-driven vessels are crossing so as to involve risk of collision, the vessel which has the other on her own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.

Rule 16: Action by give-way vessel
Every vessel which is directed to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.
Rule 18: Responsibilities between vessels
(a) A power-driven vessel underway shall keep out of the way of:
(ii) a vessel restricted in her ability to manoeuvre.
Since the vessel controlled by the RL agent (the own-ship) is significantly smaller than the vessels encountered, it is, as a result of Rule 18, required to act as the give-way vessel in all situations.
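To make the rules above concrete, the sketch below classifies an encounter from the own-ship's perspective. The angular thresholds are illustrative assumptions of ours, not values prescribed by the COLREGs or used in this paper:

```python
import math

def wrap(angle):
    """Wrap an angle to [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def classify_encounter(own_heading, target_heading, bearing_to_target):
    """Rough encounter classification in the spirit of Rules 14-15.
    The 15 deg / 165 deg / 112.5 deg thresholds are illustrative
    assumptions only. bearing_to_target is the absolute bearing from
    the own-ship to the target vessel; all angles in radians."""
    rel_bearing = wrap(bearing_to_target - own_heading)
    rel_course = wrap(target_heading - own_heading)
    # Rule 14: target roughly dead ahead on a nearly reciprocal course.
    if abs(rel_bearing) < math.radians(15) and abs(rel_course) > math.radians(165):
        return "head-on"
    # Rule 15: target on our starboard side -> we are the give-way vessel.
    if 0 < rel_bearing < math.radians(112.5):
        return "crossing-give-way"
    return "other"
```

For instance, a target dead ahead on an opposite course is classified as head-on, while one approaching from starboard triggers the give-way case.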
(a) Head-on situation, target vessel free to move.
(b) Head-on situation, target vessel restricted in movement.
(c) Crossing situation, target vessel free to move.
Figure 3: Expected behavior from the own-ship (colored in blue) in head-on and crossing encounters according to COLREGs.

Coordinate frames
In order to model the dynamics of marine vessels, one must first define the coordinate frames. Two coordinate frames typically used in vehicle control applications are of particular interest: the geographical North-East-Down (NED) frame and the body frame. The NED reference frame {n} = (x_n, y_n, z_n) forms a tangent plane to the Earth's surface, making it useful for terrestrial navigation. Here, the x_n-axis is directed north, the y_n-axis is directed east and the z_n-axis is directed towards the center of the Earth.

Assumption 1 (State space restriction). The vessel is always located on the surface and thus there is no heave motion. Also, there is no pitching or rolling motion.
The origin of the body-fixed reference frame {b} = (x_b, y_b, z_b) is fixed to the current position of the vessel in the NED frame, and its axes are aligned with the heading of the vessel such that x_b is the longitudinal axis, y_b is the transversal axis and z_b is the normal axis pointing downwards. However, as the vessel is restricted to surface-level motion in our application, only the North and East components are of interest.
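Under Assumption 1, only the yaw rotation relates the two frames. A minimal sketch of transforming body-frame velocities into NED-frame velocity components:

```python
import math

def body_to_ned(u, v, psi):
    """Rotate body-frame surge/sway velocities (u, v) into NED-frame
    north/east velocity components, given yaw angle psi and no
    roll/pitch motion (Assumption 1)."""
    x_dot = math.cos(psi) * u - math.sin(psi) * v
    y_dot = math.sin(psi) * u + math.cos(psi) * v
    return x_dot, y_dot

# A vessel heading due east (psi = pi/2) with pure surge speed 2 m/s
# moves purely eastwards in the NED frame:
north, east = body_to_ned(2.0, 0.0, math.pi / 2)
```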
Figure 4: Illustration of the NED and body coordinate frames.

State variables
Following SNAME notation [45], the state vector consists of the generalized coordinates η = [x_n, y_n, ψ]^T, where x_n and y_n are the North and East positions, respectively, in the reference frame {n}, and ψ is the yaw angle, i.e. the current angle between the vessel's longitudinal axis x_b and the North axis x_n. Correspondingly, the translational and angular velocity vector ν = [u, v, r]^T consists of the surge (i.e. forward) velocity u, the sway (i.e. sideways) velocity v as well as the yaw rate r.

Vessel model
To facilitate further research, we base the vessel dynamics on CyberShip II, a 1:70 scale replica of a supply ship, which has a length of 1.255 m and a mass of 23.8 kg [46]. Training the RL algorithm on a small vessel such as CyberShip II would allow for a relatively straightforward deployment on a real-world model ship for further testing of the algorithm. However, the symbolic representation of the dynamics of a surface vessel, which is obtained from well-researched ship maneuvering theory, is the same regardless of the vessel; the distinctions lie solely in the numerical matrix parameters. Thus, if it can be demonstrated that an RL agent can control a small-sized model ship in an intelligent manner, there is reason to believe that controlling a full-sized ship would be within its reach.
As it is equipped with rudders and propellers aft, as well as one bow thruster fore, CyberShip II is a fully actuated ship. This means that it could, in principle, be commanded to follow an arbitrary trajectory in the state space, as it is able to accelerate independently in every relevant DOF simultaneously. However, for the purpose of simplifying the RL agent's action space, we disregard the bow thruster in this study and allow only the aft thrusters and control surfaces to be applied by the Reinforcement Learning (RL) agent as control signals. This omission is further motivated by the fact that bow thrusters have limited effectiveness at higher speeds [47]. Thus, the control vector can be modelled as f = [T_u, T_r]^T, where T_u represents the force input in surge and T_r represents the moment input in yaw.

Assumption 2 (Calm sea). There are no external disturbances to the vessel such as wind, ocean currents or waves.
Given Assumption 2, the 3-DOF vessel dynamics can be expressed in the compact matrix-vector form

$$\dot{\eta} = R_{z,\psi}(\psi)\,\nu$$
$$M\dot{\nu} + C(\nu)\nu + D(\nu)\nu = Bf$$

where R_{z,ψ}(ψ) represents a rotation of ψ radians around the z_n-axis as defined by

$$R_{z,\psi}(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Furthermore, M ∈ R^{3×3} is the mass matrix and includes the effects of both rigid-body and added mass, C(ν) ∈ R^{3×3} incorporates centripetal and Coriolis effects and D(ν) ∈ R^{3×3} is the damping matrix. Finally, B ∈ R^{3×2} is the actuator configuration matrix. The numerical values of the matrices are taken from [48], where the model parameters were estimated experimentally for CyberShip II in a marine control laboratory.
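As an illustration, one forward-Euler integration step of this model can be sketched as follows. The diagonal mass and damping values below are placeholders, not the experimentally identified CyberShip II parameters of [48], and the Coriolis term C(ν) is omitted for brevity (it matters mainly at higher speeds):

```python
import math

def step(eta, nu, f, dt):
    """One forward-Euler step of the 3-DOF model
        eta_dot = R_z(psi) nu,   M nu_dot + D nu = B f  (C omitted).
    eta = (x, y, psi), nu = (u, v, r), f = (Tu, Tr)."""
    x, y, psi = eta
    u, v, r = nu
    Tu, Tr = f                   # surge force, yaw moment
    M = (25.8, 33.8, 2.76)       # placeholder mass terms (not from [48])
    D = (2.0, 7.0, 0.5)          # placeholder linear damping terms
    # Kinematics: rotate body velocities into the NED frame.
    x_dot = math.cos(psi) * u - math.sin(psi) * v
    y_dot = math.sin(psi) * u + math.cos(psi) * v
    # Kinetics: actuators affect only surge and yaw (no bow thruster).
    u_dot = (Tu - D[0] * u) / M[0]
    v_dot = (-D[1] * v) / M[1]
    r_dot = (Tr - D[2] * r) / M[2]
    return ((x + dt * x_dot, y + dt * y_dot, psi + dt * r),
            (u + dt * u_dot, v + dt * v_dot, r + dt * r_dot))
```

Starting from rest, a positive surge force builds forward speed before the position changes, mirroring the second-order nature of the kinetics.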

Deep reinforcement learning
Applications of RL to high-dimensional, continuous control tasks rely heavily on function approximators to generalize over the state space. Even if classical, tabular solution methods such as Q-learning can be made to work (provided a discretization of the continuous action space), this is not considered an efficient approach for control applications [49]. In recent years, given their remarkable generalization ability over high-dimensional input spaces, the dominant approach has been the application of deep neural networks optimized by means of gradient methods. There are, however, different approaches to how the networks are utilized, and thus their semantic interpretation in the context of the learning agent differs.
In Q-learning-based methods such as Deep Q-Learning (DQN) [50], a deep neural network is used to predict the expected value (i.e. long-term, cumulative reward) of state-action pairs, which reduces the policy to an optimization problem over the set of available actions given the current state. In gradient-based policy methods, on the other hand, the policy itself is implemented as a deep neural network whose weights are optimized by means of gradient ascent (or approximations thereof). Lately, several algorithms built on this principle have gained significant traction in the RL research community, most notably Deep Deterministic Policy Gradient (DDPG) [49], Asynchronous Advantage Actor-Critic (A3C) [51] and Proximal Policy Optimization (PPO) [52]. For continuous control tasks, this family of DRL methods is commonly considered the more efficient approach [53]. Based on previous work, where the PPO algorithm significantly outperformed other methods on a learning problem similar to the one covered in this study [54], we focus our efforts on this method.

RL Preliminaries
First, we model the interplay between the agent and the environment as an infinite-horizon discounted Markov Decision Process (MDP), formally defined by the 6-tuple (S, A, p, p_0, r, γ), where
• S is the state space,
• A is the action space,
• p : S × A × S → [0, 1] defines the conditional transition probabilities for the next state s' such that p(s'|s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a),
• p_0 : S → [0, 1] is the initial state distribution, i.e. p_0(s) = Pr(S_0 = s),
• r : S × A → R returns the numeric reward at each time-step as a function of the current state and applied action,
• γ ∈ [0, 1] is the discount factor for future rewards.
The agent draws its actions from its policy π. The policy may be a deterministic function (as in DDPG), but in the context of PPO, it is modelled as a stochastic function.
The conditional action distribution given the current state s is given by π(a|s) : S × A → [0, 1], with π(a|s) = Pr(A_t = a | S_t = s). Specifically, we assume that the agent draws actions from a multivariate Gaussian distribution whose mean is output by a neural network parametrized by the weights θ. Formally, this translates to a_t ∼ π_θ(·|s_t), where t is the current time-step.
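A minimal sketch of such a Gaussian policy for a scalar action; in the actual agent, the mean would be produced by the policy network, whereas here it is just a number:

```python
import math
import random

def log_prob(a, mean, std):
    """Log-density of a scalar Gaussian policy pi(a|s) = N(mean, std^2).
    For a multivariate policy with diagonal covariance, the
    per-dimension terms simply sum."""
    return (-0.5 * ((a - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

def sample_action(mean, std):
    """Draw a_t ~ pi(.|s_t); mean stands in for the network output."""
    return random.gauss(mean, std)
```

The log-probability is what the policy gradient expressions below operate on, since it is differentiable with respect to the network output.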
Next, we introduce the state-value function V^π(s) and the action-value function Q^π(s, a). V^π(s) is the expected return from time t onwards given the state s, whereas Q^π(s, a) is the expected return from time t onwards, additionally conditioned on the current action a. Formally, we have

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\middle|\, S_t = s\right]$$
$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\middle|\, S_t = s, A_t = a\right]$$

where the random variable R_{t+k} represents the reward at time-step t + k.
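A Monte Carlo estimate of these expectations can be obtained from sampled rollouts; a minimal sketch of the discounted-return computation:

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * R_{t+k} for one finite rollout. Averaged
    over many rollouts started from s_t under pi, this estimates
    V^pi(s_t)."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation avoids powers
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
```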

Policy gradients
The stochasticity of the policy enables us to translate the RL problem, i.e. the search for the optimal policy, into the problem of optimizing the expectation

$$J(\theta) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$$

The family of policy gradient methods, to which PPO belongs, approaches this by gradient ascent, updating the parameter vector θ according to

$$\theta_{t+1} \leftarrow \theta_t + \alpha\,\hat{\nabla}_\theta J(\theta)$$

where ∇̂_θ J(θ) is a stochastic estimate of ∇_θ J(θ). From the policy gradient theorem [55] we have that the policy gradient satisfies

$$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a Q^\pi(s, a)\, \nabla_\theta \pi(a|s, \theta)$$

where μ is the steady-state distribution under π such that μ(s) = lim_{t→∞} Pr{S_t = s | A_{0:t−1} ∼ π}. Following the steps outlined in [56], this can be algebraically transformed to

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[ Q^\pi(S_t, A_t)\, \nabla_\theta \ln \pi(A_t | S_t, \theta) \right]$$

Also, it can be shown that one can greatly reduce the variance of this expression by replacing the state-action value function Q^π(s, a) by Q^π(s, a) − b(s), where the baseline function b(s) can be an arbitrary function not depending on the action a, without introducing bias in the estimate. Commonly, b(s) is set to be the state-value function V^π(s), which yields the advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

which represents the expected improvement obtained by an action compared to the default behavior. This leads to

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[ A^\pi(S_t, A_t)\, \nabla_\theta \ln \pi(A_t | S_t, \theta) \right]$$

Thus, an unbiased empirical estimate of the policy gradient based on N episodic rollouts of the policy is

$$\hat{\nabla}_\theta J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t} \hat{A}\big(s_t^{(n)}, a_t^{(n)}\big)\, \nabla_\theta \ln \pi\big(a_t^{(n)} \big| s_t^{(n)}, \theta\big)$$

A^π(s, a) is, like Q^π(s, a) and V^π(s), unknown, and must thus be estimated by a function approximator Â(s, a). Generalized Advantage Estimation (GAE), as proposed in [57], is the most notable approach. GAE makes use of a function approximator V̂(s), commonly an artificial neural network trained on the discounted empirical returns, to approximate the actual value function V^π(s).
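The GAE estimator mentioned above can be sketched as follows. Given per-step rewards and value estimates from one rollout, it blends n-step advantage estimates with an exponential decay parameter λ; this simplified sketch assumes a non-terminal bootstrap value is appended to `values`:

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation [57].
    rewards: [r_0, ..., r_{T-1}];  values: [V(s_0), ..., V(s_T)]
    (one extra bootstrap entry).  Returns [A_0, ..., A_{T-1}]."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With λ = 0 this reduces to the one-step TD residual; with λ = 1 it recovers the Monte Carlo advantage, trading bias against variance in between.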

Proximal policy optimization
PPO, as well as its predecessor, Trust Region Policy Optimization (TRPO) [58], does not optimize the policy directly via the empirical policy gradient estimate, even though this is feasible. TRPO instead optimizes the surrogate objective function

$$L^{TRPO}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\,\hat{A}_t \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}$$

which provides theoretical guarantees for policy improvement. However, as this relies on an approximation that is valid only in a local neighborhood, carefully choosing the step size is critical to avoid instabilities. Unlike TRPO, where this is achieved by imposing a hard constraint on the relative entropy between the current and next policy, PPO elegantly incorporates the preference for a modest step size into the optimization target, yielding a more efficient algorithm [52]. Specifically, it instead focuses on maximizing the clipped surrogate objective

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\Big( r_t(\theta)\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\hat{A}_t \Big) \right]$$

where r_t(θ) is a shorthand for the probability ratio defined above. The PPO training process, which is written in pseudocode format in Algorithm 1, can then be summarized as follows: at each iteration, PPO first collects batches of Markov trajectories from concurrent rollouts of the current policy. Next, the policy is updated according to a stochastic gradient descent update scheme.

Terrain data

Our maritime simulation environment is made from a digital reconstruction of the Trondheim Fjord (Figure 5), an inlet of the Norwegian Sea. Specifically, it is based on a digital terrain model (DTM) provided by the Norwegian Mapping Authority (Kartverket). The data set, called DTM10, is generated from airborne laser scanning and has a horizontal resolution of 10x10 meters with coverage of the entire Norwegian mainland [59]. The coordinates are given according to the Universal Transverse Mercator (UTM) rectangular projection system, which partitions the Earth into 60 north-south zones, each of which has a 6 degree longitudinal span. Within each zone, which is indexed consecutively from zone 1 (180°W to 174°W) to zone 60 (174°E to 180°E), a mapping from latitude/longitude coordinates to a Cartesian x-y coordinate system is performed based on a local flat-Earth assumption. Given the vast number of zones used in the UTM projection system, the approximated coordinates, which of course have inherent distortions because of the spherical shape of the Earth, are of relatively high accuracy. The DTM10 data set is given with respect to zone 33.

Tracking data
We obtain a sample of historical vessel tracking data in the Trondheim Fjord area from a query of the Norwegian Coastal Administration's AIS Norway data service. The automatic identification system (AIS) is an automatic tracking system which provides both static (e.g. vessel dimensions) and dynamic (e.g. vessel position, heading and speed) information based on vessel transmissions. Within the field of autonomous surface vehicle guidance, AIS information is often used as a supplementary data source that is, by method of sensor fusion, combined with marine radar in collision avoidance algorithms. Additionally, given a large enough sample time within the area of interest, it provides a historical model of the marine traffic in the area. In our case, the historical data results from a 10-day query, ranging from January 26, 2020 to February 6, 2020, of all recorded traffic (Figure 7) within a rectangular area around the Trondheim Fjord. Depending on the transmitter characteristics of each individual vessel, the resulting tracking data resolution varies from 2 to 20 seconds, facilitating a high-accuracy reconstruction of each vessel's trajectory in our simulation. As the AIS tracking data represents vessel position by latitude/longitude coordinates, a conversion to the zone 33 UTM x-y coordinate system is called for. To do the conversion, we utilize the from_latlon method provided by the Python package utm [60].

Training environment
DRL-based autonomous agents have a remarkable ability to generalize their policy over the observation space, including the domain of unseen observations. Given the complexity and heterogeneity of the Trondheim Fjord environment, with archipelagos, shorelines and skerries (see Figure 7), this ability is fundamental to the agent's performance. However, the training environment, in which the agent is supposed to evolve from a blank slate to an intelligent vessel controller, must be representative, challenging and unpredictable to facilitate this generalization. Of course, the most representative choice for a training scenario would be the Trondheim Fjord itself, which would, were it not for the generalization issues associated with this approach [61], also allow for training the agent via behavior cloning based on the available vessel tracking data. However, given the resolution of our terrain data, the resulting obstacle geometry is typically very complex, leading to overly high computational demands for simulating the functioning of the distance sensor suite. Thus, the better choice is to carefully craft an artificial training scenario with simple obstacle geometries. To reflect the dynamics of a real marine environment, we let the stochastic initialization method of the training scenario spawn target vessels with deterministic, linear trajectories. Additionally, circular obstacles, which are scattered around the environment, are used as a substitute for real-world terrain. A randomly chosen initialization of the training environment is shown in Figure 8. The circles are static obstacles, whereas the vessel-shaped objects move according to the trajectory lines.

Observation vector
Here, the goal is to engineer an observation vector s containing sufficient information about the vessel's state relative to the path, as well as information from the sensors.
To achieve this, the full observation vector is constructed by concatenating navigation-based and perception-based features, which formally translates to s = [s_n, s_p]^T. In the context of this paper, we use the term navigation to refer to the characterization of the vessel's state, i.e. its position, orientation and velocity, with respect to the desired path.
On the other hand, perception refers to the observations made via the rangefinder sensor measurements. In the following, the path navigation feature vector s_n and the elements culminating in the perception-based feature vector s_p are covered in detail.

Path navigation
A sufficiently information-rich path navigation feature vector would be one that, on its own, could facilitate a satisfactory path-following controller (without any consideration for obstacle avoidance). A few concepts often used in the field of vessel guidance and control are useful in order to formalize this. First, we introduce the mathematical representation of the parameterized path, which is expressed as

$$p_d(\omega) = [x_d(\omega),\; y_d(\omega)]^T$$

where x_d(ω) and y_d(ω) are given in the NED frame. Navigation with respect to the path necessitates a reference point on the path which is continuously updated based on the current vessel position. Even though other approaches exist, this reference point is best thought of as the point on the path that has the closest Euclidean distance to the vessel, given its current position, as visualised in the example illustration shown in Figure 9. To find this, we calculate the corresponding value of the path variable ω at each time-step. This is an equivalent problem formulation because the path is defined implicitly by the value of ω. Formally, this translates to the optimization problem

$$\bar{\omega}(t) = \arg\min_{\omega} \left[\, (x_n(t) - x_d(\omega))^2 + (y_n(t) - y_d(\omega))^2 \,\right]$$

which, using the Newton-Raphson method, can be solved accurately and efficiently at each time-step. Here, the fact that the Newton-Raphson method only guarantees a local optimum is a useful feature, as it prevents sudden path variable jumps, given that the previous path variable value is used as the initial guess [62].
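A numerical sketch of this Newton-Raphson search, using finite-difference derivatives of the squared path distance; the straight-line example path is illustrative only:

```python
def closest_path_point(px, py, x_d, y_d, omega0, h=1e-5, iters=20):
    """Newton-Raphson search for the path variable omega minimizing the
    squared distance f(omega) = ||p - p_d(omega)||^2, with derivatives
    taken numerically. Starting from the previous step's omega keeps the
    solution in the local neighbourhood, avoiding path-variable jumps."""
    def f(w):
        return (px - x_d(w)) ** 2 + (py - y_d(w)) ** 2
    w = omega0
    for _ in range(iters):
        d1 = (f(w + h) - f(w - h)) / (2 * h)           # f'(w)
        d2 = (f(w + h) - 2 * f(w) + f(w - h)) / (h * h)  # f''(w)
        if abs(d2) < 1e-12:
            break
        w -= d1 / d2
    return w

# Straight 45-degree path x_d(w) = w, y_d(w) = w; the point closest to
# (1, 0) lies at w = 0.5.
w = closest_path_point(1.0, 0.0, lambda t: t, lambda t: t, omega0=0.0)
```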
Accordingly, we define the corresponding Euclidean distance to the path, i.e. the deviation between the desired path and the current track, as the cross-track error (CTE). Formally, we thus have

$$\epsilon(t) = \sqrt{ (x_n(t) - x_d(\bar{\omega}))^2 + (y_n(t) - y_d(\bar{\omega}))^2 }$$

Next, we consider the look-ahead point p_d(ω̄ + Δ_LA) to be the point which lies a constant distance further along the path from the reference point p_d(ω̄). The parameter Δ_LA, the look-ahead distance, is set by the user and controls how aggressively the vessel should reduce the distance to the path. Look-ahead based steering, i.e. setting the direction to the look-ahead point as the desired course angle, is a commonly used guidance principle [63].
We then define the heading error ψ̃ as the change in heading needed for the vessel to navigate straight towards the look-ahead point from its current position, as illustrated in Figure 9. This is calculated from

$$\tilde{\psi} = \mathrm{atan2}\!\left( y_d(\bar{\omega} + \Delta_{LA}) - y_n,\; x_d(\bar{\omega} + \Delta_{LA}) - x_n \right) - \psi$$

where ψ is the vessel's current heading and x_n, y_n are the current NED-frame vessel coordinates as defined earlier.
However, even if minimizing the heading error yields good path adherence, taking into account the path direction at the look-ahead point might improve the smoothness of the resulting vessel trajectory. Referring to the first-order path derivatives as x_p(ω) and y_p(ω), the path angle γ_p can, in general, be expressed as a function of the arc length ω such that

$$\gamma_p(\omega) = \mathrm{atan2}\!\left( y_p(\omega),\; x_p(\omega) \right)$$

As visualized in Figure 9, the path direction at the look-ahead point is then given by γ_p(ω̄ + Δ_LA). Accordingly, we can then define the look-ahead heading error, which is zero when the vessel is heading in a direction parallel to the path direction at the look-ahead point, as

$$\tilde{\psi}_{LA} = \gamma_p(\bar{\omega} + \Delta_{LA}) - \psi$$

Our assumption is then that the navigation feature vector s_n, defined as outlined in
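The two heading errors above can be computed as in the following sketch; the callables representing the path and its tangent angle, and the angle-wrapping convention, are illustrative choices:

```python
import math

def wrap(angle):
    """Wrap an angle to [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def heading_errors(x, y, psi, path_point, path_angle, omega_bar, delta_la):
    """Heading error psi_tilde toward the look-ahead point and the
    look-ahead heading error relative to the path tangent.
    path_point(omega) -> (x_d, y_d); path_angle(omega) -> gamma_p."""
    x_la, y_la = path_point(omega_bar + delta_la)
    psi_tilde = wrap(math.atan2(y_la - y, x_la - x) - psi)
    psi_tilde_la = wrap(path_angle(omega_bar + delta_la) - psi)
    return psi_tilde, psi_tilde_la
```

For a vessel sitting exactly on a straight path and heading along it, both errors vanish; standing off to the side produces a heading error that steers it back toward the look-ahead point.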

Sensing
Using a set of rangefinder sensors as the basis for obstacle avoidance is a natural choice, as it yields a comprehensive, easily interpretable representation of the neighbouring obstacle environment. This should also enable a relatively straightforward transition from the simulated environment to a real-world one, given the availability of common rangefinder sensors, be it lidars, radars, sonars or depth cameras. In our setup, the vessel is equipped with N distance sensors with a maximum detection range of S_r, which are distributed uniformly with 360 degree coverage, as illustrated in Figure 10. While the area behind the vessel is obviously of lesser importance, and not necessary to consider for navigating purely static terrain [54], the possibility of overtaking situations where the agent must react to another vessel approaching from behind makes full sensor coverage a necessity.

Sensor partitioning
The most natural approach to constructing the final observation vector would be to concatenate the path information feature vector with the array of sensor outputs. However, initial experiments with this approach were aborted as it became apparent that the training process had stagnated at a very unsatisfactory agent performance level. A likely explanation for this failure is the size of the observation vector fed to the agent's policy and value networks; as it becomes overly large, the agent suffers from the well-known curse of dimensionality.
Due to the resulting network complexity, as well as the exponential relationship between the dimensionality and the volume of the observation space, the agent fails to generalize to new, unseen observations in an intelligent manner [64]. This calls for a significant dimensionality reduction. This can, of course, be achieved simply by reducing the number of sensors, which would also have the fortunate side effect of reducing the simulation's computational needs. Unfortunately, this approach also turned out unsuccessful, even after testing a wide range of smaller sensor setups. Clearly, when the sensor count becomes too low, the agent's perception of the neighboring obstacle environment is simply too scattered to facilitate satisfactory obstacle-avoiding behavior in challenging scenarios such as the ones used for training the agent. As balancing the trade-off between sensor resolution and observation dimensionality appears intractable, a more involved approach is called for.
A natural approach is to partition the sensor suite into D sectors, each of which produces a scalar measurement that is included in the final observation vector, effectively summarizing the local sensor readings within the sector. However, given our desire to minimize its dimensionality, dividing the sensors into sectors of uniform size is likely sub-optimal, as obstacles located in front of the vessel are significantly more critical, and thus require a higher degree of perception accuracy, than those located at its rear.
In order to realize such a non-uniform partitioning, we use a logistic function, a choice that also fulfills our general preference for symmetry. Assuming a counter-clockwise ordering of sensors and sectors starting at the rear of the vessel, we map a given sensor index i ∈ {1, . . ., N} to a sector index k = κ(i) ∈ {1, . . ., D} through a non-linear mapping based on the logistic sigmoid function σ, where a scaling parameter γ_C controls the density of the sector distribution such that decreasing it yields a more evenly distributed partitioning. In Figure 11, the practical output of this sensor mapping procedure is visualised, with the sectors being narrowest near the front of the vessel.
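One plausible concrete instance of such a logistic mapping is sketched below; the exact formula used in the paper is not reproduced here, so this variant is a hypothetical one that only shares the stated qualitative behavior:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sensor_to_sector(i, N, D, gamma_c=0.05):
    """Hypothetical logistic sensor-to-sector mapping kappa(i).
    Sensors are indexed 1..N counter-clockwise from the rear, so the bow
    corresponds to the middle indices, where the sigmoid is steepest and
    the sectors therefore narrowest; decreasing gamma_c flattens the
    sigmoid, i.e. evens out the partition."""
    return 1 + int(D * sigmoid(gamma_c * (i - N / 2.0)))
```

With N = 180 sensors and D = 9 sectors, the sector straddling the bow spans only a few sensors, while the rear-facing sectors each absorb dozens.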
We can then formally define the distance measurement vector for the k-th sector, denoted w_k, as the vector containing the measurements x_i for all i ∈ {1, . . ., N} such that κ(i) = k. Next, we seek a mapping f : R^n → R which takes the vector of distance measurements w_k, for an arbitrary sector index k, as input, and outputs a scalar value based on the current sensor readings within the sector. Always returning the smallest measured obstacle distance within the sector, i.e. f = min (in the following referred to as min pooling), is a natural approach which yields a conservative and thereby safe observation vector. As can be seen in Figure 12a, however, this approach might be overly restrictive in certain obstacle scenarios, where feasible openings between obstacles are inappropriately overlooked. And even if the opposite approach (max pooling, i.e. f = max) solves this problem, it is straightforward to see, e.g. in Figure 12b, by considering the fact that the presence of the small obstacle near the vessel is ignored, that it might lead to dangerous navigation strategies. In order to alleviate the problems associated with min and max pooling, a new approach is required. The feasibility pooling procedure, which was introduced in [54], calculates the maximum reachable distance within each sector, taking into account the location of the obstacle sensor readings as well as the width of the vessel. This method iterates over the sensor readings in ascending order of the distance measurements, and for each resulting distance level checks whether it is feasible for the vessel to advance beyond this level. As soon as the widest opening available within a distance level is deemed too narrow given the width of the vessel, the maximum reachable distance has been reached. Formally, we define f to be the algorithm outlined in Algorithm 2.

Require:
    Vessel width W ∈ R+
    Angle between neighboring sensors θ
    Sensor rangefinder measurements for the current sector x = {x1, . . ., xn}
procedure FEASIBILITYPOOLING(x)
    Initialize I to be the indices of x, sorted in ascending order of the measurements xi
    for i ∈ I do
        di ← θ xi (arc length)
        y ← di/2 (opening width)
        si ← false (opening was found)
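The procedure can be sketched in Python as follows. The inner bookkeeping is our own reconstruction of the feasibility check described in the text (scan each ascending distance level and return the first level whose widest opening is narrower than the vessel), not a verbatim copy of Algorithm 2.

```python
def feasibility_pooling(x, W, theta):
    """Maximum reachable distance within a sector.

    x: rangefinder readings of the sector's rays, in ray order;
    W: vessel width; theta: angle between neighboring rays.
    """
    for xi in sorted(x):                       # ascending distance levels
        di = theta * xi                        # arc length between rays at level xi
        opening, max_opening = di / 2.0, 0.0   # half-arc margin at a run's edge
        for xj in x:                           # scan rays in spatial order
            if xj > xi:                        # ray reaches beyond this level:
                opening += di                  # current opening widens
                max_opening = max(max_opening, opening)
            else:                              # ray blocked at/before this level:
                opening = di / 2.0             # opening is interrupted
        if max_opening < W:                    # widest opening too narrow
            return xi                          # xi is the max reachable distance
    return max(x)                              # whole sector is reachable
```

For instance, a narrow gap between two close obstacles is rejected when it is slimmer than the vessel, while a sufficiently wide gap lets the full distance behind it count as reachable.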

Motion detection
Simply feeding the pooled current rangefinder sensor readings to the agent's policy network will, without any doubt, be insufficient for the agent to learn a policy for intelligently avoiding moving obstacles. A continuous snapshot
(a) Full distance is reachable. (b) Less than half the distance is reachable.

Figure 13: Illustration of the feasibility pooling algorithm for two different scenarios. After sorting the sensor indices according to the corresponding distance measurements, the algorithm iterates over them in ascending order and, at each step, decides whether the vessel can feasibly continue past this point. In the scenario displayed on the right, the opening is deemed too narrow for the full distance to be reachable.
of the environment can facilitate a purely reactive (but still intelligent [54]) agent in a static environment, but without explicit or implicit knowledge of the velocities of nearby obstacles, such an agent will invariably fail when placed in a dynamic environment, as it will be unable to distinguish between stationary and moving obstacles.
An implicit approach worth mentioning is to process the sensor readings sequentially using a Recurrent Neural Network (RNN). In recent years, RNN architectures such as the Long Short-Term Memory (LSTM) have gained a lot of traction in the ML research community [65] and have been successfully applied to sequential RL problems. An example of this is the LSTM-based AlphaStar agent, which reached grandmaster level in the popular real-time strategy game StarCraft II [66]. It is therefore possible that a high-performing collision avoidance policy could be found by feeding a recurrent agent with sensor readings. If such an implementation were shown to be successful, it would facilitate a very straightforward transition to an implementation on a physical vessel, as no specialized sensor equipment for measuring object velocities would be needed. However, even if sequentially feeding sensor readings to a recurrent network might sound relatively trivial, the motion of the vessel induces rotations of the observed environment, complicating the situation. Initial experimentation with an off-the-shelf recurrent policy compatible with our simulation environment confirmed the difficulties with this approach: even in a purely static environment, the recurrent agent was incapable of learning how to avoid collisions.
Thus, this preliminary study focuses on the explicit approach, i.e. providing the obstacles' velocities as features in the agent's observation vector. Admittedly, while the implementation of this is trivial in a simulated environment, as obstacle velocities can simply be accessed as object attributes, a real-world implementation will necessitate a reliable way of estimating obstacle velocities from sensor data. Even if this can be challenging due to uncertainty in the sensor readings, object tracking is a well-researched computer vision discipline. We reserve the implementation of such a method for future research, but refer the reader to [67] for a comprehensive overview of the current state of the field.
For each sector, we provide the decomposed velocity of the closest moving obstacle within the sector as features for the agent's observation vector. Specifically, the decomposition, which yields the x and y components of the obstacle velocity, is done with respect to the coordinate frame in which the y-axis is parallel to the center line of the sensor sector in which the obstacle is detected. This is illustrated in Figure 14. For each sector k, we denote the corresponding decomposed x and y velocities by v_x,k and v_y,k, respectively. Naturally, if there are no moving obstacles present within the sector, both components are zero.
Figure 14: Velocity decomposition for two moving obstacles, α and β. For each obstacle, its velocity vector is decomposed into x and y components relative to the obstacle sector, such that the decomposed y-component is parallel to the center line of the corresponding sector and is positive if the obstacle is moving towards the vessel.
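Under the stated convention (y-axis along the sector center line, positive towards the vessel), the decomposition amounts to a rotation of the obstacle's velocity into the sector frame. A minimal sketch, where the exact sign conventions are our assumption:

```python
import math

def decompose_velocity(vx, vy, sector_angle):
    """Decompose an obstacle velocity (vx, vy), given in the vessel's body
    frame, into the frame of the sector whose center line lies at
    `sector_angle` relative to the vessel's heading."""
    # unit vector along the sector center line, pointing away from the vessel
    ux, uy = math.cos(sector_angle), math.sin(sector_angle)
    # y component: along the center line, positive when approaching the vessel
    v_y = -(vx * ux + vy * uy)
    # x component: perpendicular to the center line
    v_x = -vx * uy + vy * ux
    return v_x, v_y
```

An obstacle straight ahead (sector_angle = 0) moving towards the vessel then yields a positive v_y, while a receding one yields a negative v_y, matching the Figure 14 convention.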

Perception state vector
As having access to both obstacle distances and obstacle velocities is critical for achieving satisfactory obstacle-avoiding agent behavior, we include both in the perception state vector.
To avoid discontinuities in the obstacle distance features caused by the sudden transition from 0 to S r at the point of detection, we introduce the concept of obstacle closeness.
The closeness to an obstacle is defined such that it is 0 if the obstacle is undetected, i.e. further away from the vessel than the maximum sensor range S_r, and 1 if the vessel has collided with the obstacle. Furthermore, within this range, it is reasonable to map distance to closeness in a logarithmic fashion, such that, in accordance with human intuition, the difference between 10 m and 100 m is more significant than the difference between, for instance, 510 m and 600 m. Formally, a distance d maps to closeness via c : R → [0, 1]; one mapping with these properties is c(d) = clip(1 − ln(1 + d)/ln(1 + S_r), 0, 1), which satisfies c(0) = 1 and c(S_r) = 0. By concatenating the closeness of the reachable distance and the decomposed obstacle velocity from every sector, we then define the perception state vector s_p as s_p = [c(d̄_1), v_x,1, v_y,1, . . ., c(d̄_D), v_x,D, v_y,D], where d̄_k denotes the feasibility-pooled distance of the k-th sector.
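The logarithmic closeness mapping can be sketched as follows; the base-shifted log form is one mapping satisfying the stated endpoints, and the sensor range value is an assumption.

```python
import math

def closeness(d, s_r=1500.0):
    """Distance-to-closeness mapping: 1 at collision (d = 0), 0 at or
    beyond the sensor range s_r (the default range is an assumed value)."""
    d = min(max(d, 0.0), s_r)                       # clip to the valid range
    return 1.0 - math.log(1.0 + d) / math.log(1.0 + s_r)
```

With this mapping, the 10 m to 100 m closeness gap is indeed much larger than the 510 m to 600 m gap, as intended.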

Reward function
Any RL agent is motivated by the pursuit of maximizing its reward. The simplest, and thus highly sought-after, approach to rewarding an RL agent is to reward it at the end of each episode; at that point, one already knows whether the agent succeeded or failed. However, given the length of a full episode, such a reward function is extremely sparse, leaving the agent with a near impossible learning task. This calls for a continuous reward signal that rewards the agent based on its current adherence to its objectives, i.e. how well it is currently doing with respect to both path following and obstacle avoidance. Given the complexity of the dual-objective learning problem studied here, as well as the general tendency of RL agents to exploit the reward function in any way possible (e.g. standing still, or going in circles), designing an appropriate reward function r(t) is paramount for the agent to exhibit the desired behavior after training.
It is natural to reward the agent separately for its performance in the two relevant domains: path following and collision avoidance. Thus, we introduce the independent reward terms r_path^(t) and r_colav^(t), representing the path-following and the obstacle-avoiding reward components, respectively, at time t. Furthermore, as suggested in [54], we introduce the weighting coefficient λ ∈ [0, 1] to regulate the trade-off between the two competing objectives. In addition, as it is crucial to penalize the agent whenever it collides with an obstacle, we introduce the negative reward term r_collision, which is activated upon collision. This leads to the preliminary reward function r^(t) = λ r_path^(t) + (1 − λ) r_colav^(t) + r_collision^(t).

Path following performance
A natural approach to incentivizing path adherence is to reward the agent for minimizing the current absolute cross-track error |ε^(t)|. In [62], a Gaussian reward function centered at ε = 0 with standard deviation σ_e was suggested. However, we argue that the absolute exponential reward function exp(−γ|ε^(t)|) has more desirable characteristics due to its fatter tails, as seen in Figure 15a. By avoiding the vanishing improvement gradient that the Gaussian reward exhibits at large absolute cross-track errors, the absolute exponential reward function ensures that the agent is rewarded even for a slight improvement to a very unsatisfactory state. However, this alone does not reflect our desire for the agent to actually make progress along the path; the RL agent, greedy as it is, would eventually develop a policy of standing still indefinitely after closing the gap to the path. Thus, the reward signal must be expanded to incorporate an incentive for motion, and not just arbitrary motion, but movement in the right direction.
The already defined look-ahead heading error ψ̃ is a natural basis for formalizing this. Specifically, we consider the term (u^(t)/U_max) cos ψ̃^(t), with U_max being the maximum vessel speed, which yields zero reward if the vessel is heading in a direction perpendicular to the path, and a negative reward if the agent is tracking backwards. Multiplying this with the cross-track error reward component defined earlier is a natural choice, and yields the provisional reward function r_path^(t) = (u^(t)/U_max) cos ψ̃^(t) exp(−γ|ε^(t)|). Given this reward function, however, we note that if the vessel is standing still (i.e. u^(t) = 0), or if it is heading in a direction perpendicular to the path (i.e. ψ̃^(t) = ±π/2), the agent receives zero reward regardless of the cross-track error, which is undesired. Similarly, if the cross-track error grows very large, i.e. exp(−γ|ε^(t)|) → 0, the reward signal is zero regardless of the vessel's velocity and heading. Thus, we add a constant multiplier term γ_r to both reward components, yielding the final path-following reward function r_path^(t) = ((u^(t)/U_max) cos ψ̃^(t) + γ_r)(exp(−γ|ε^(t)|) + γ_r) − γ_r², where the −γ_r² term removes the constant reward bias implied by this choice.
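The path-following reward described above can be sketched numerically. Here γ = 0.05 follows Figure 15, while U_max and γ_r are assumed values for illustration.

```python
import math

def path_reward(u, psi_err, cte, u_max=2.0, gamma=0.05, gamma_r=0.1):
    """Path-following reward: the speed/heading term and the cross-track
    term are each offset by gamma_r, multiplied, and the constant bias
    gamma_r**2 is subtracted. u_max and gamma_r are assumed values."""
    speed_term = (u / u_max) * math.cos(psi_err) + gamma_r
    cte_term = math.exp(-gamma * abs(cte)) + gamma_r
    return speed_term * cte_term - gamma_r ** 2
```

With these values, on-path full-speed motion yields 1 + 2γ_r = 1.2, standing still on the path yields a small but nonzero reward, and tracking backwards is penalized, matching the design goals stated above.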

Static obstacle avoidance performance
Collision avoidance involves avoiding both collisions with other vessels and running ashore (or colliding with some other static obstacle). However, as any human sailor would, we treat the two aspects separately. In the following, we refer to the former as dynamic, and the latter as static, obstacle avoidance.
In order to encourage obstacle-avoiding guidance behavior, penalizing the agent for the closeness of nearby terrain in a strictly increasing manner seems reasonable. However, we note that the severity of obstacle closeness intuitively does not grow linearly as the distance shrinks, but instead grows in a quasi-exponential fashion.
Furthermore, given the presence of a nearby static obstacle, the penalty given to the agent must depend on the orientation of the vessel with regard to the obstacle, in such a manner that obstacles located near the stern of the vessel are of significantly lower importance than obstacles currently right in front of it. Thus, given a static obstacle located at distance x, at the angle θ with respect to the centerline of the vessel, we propose the raw closeness penalty r_obst,stat(x, θ) of Equation 23, in which α_x is of the order of magnitude of the sensor range, such that sufficiently large negative rewards are given as objects get closer to the own-ship.
For practical reasons, we use the distances measured by the rangefinder sensors as surrogates for obstacle closeness, and penalize each sensor reading according to r_obst,stat(x_i, θ_i), where x_i is the i-th distance sensor measurement and θ_i is the vessel-relative angle of the corresponding sensor ray. In order to cancel the dependency on the specific sensor suite configuration, i.e. the number of sensors and their vessel-relative angles, that arises when this penalty term is summed over all sensors, we compute the overall static obstacle-avoidance reward as a weighted average of the per-sensor penalties, which is visualised on a logarithmic scale in Figure 16.
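Since Equation 23 is not reproduced here, the penalty kernel below is an assumption: an angular weight 1/(1 + γ_θ|θ|), with γ_θ = 10 and γ_x = 0.1 taken from Figure 16, combined with an exponential distance term. What the sketch does illustrate faithfully is the weighted-average normalization that cancels the sensor-suite dependency.

```python
import math

def static_obstacle_reward(distances, angles, gamma_theta=10.0, gamma_x=0.1):
    """Weighted-average static obstacle penalty over all sensor rays.
    The per-ray kernel is an assumed stand-in for Equation 23."""
    weights, penalties = [], []
    for x, th in zip(distances, angles):
        w = 1.0 / (1.0 + gamma_theta * abs(th))   # frontal rays weigh most
        p = -w * math.exp(-gamma_x * x)           # assumed closeness penalty
        weights.append(w)
        penalties.append(p)
    # normalizing by the weight sum cancels the sensor count/layout dependency
    return sum(penalties) / sum(weights)
```

With this weighting, a close obstacle dead ahead dominates the average, whereas the same obstacle astern is heavily discounted, as the text prescribes.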

Dynamic obstacle avoidance performance
For dynamic obstacle avoidance, we expand on the framework developed for static obstacles. Firstly, the penalty needs to reflect the relevant COLREGs. Since the COLREGs are defined according to the bearing of a target ship relative to the own-ship, an intuitive way to guide the RL agent towards COLREGs compliance is to adjust the static obstacle penalty (Equation 23) according to the relative bearing of the dynamic obstacle. The area around a vessel is normally split into three sectors: port, starboard, and stern, as illustrated in Figure 17a. Therefore, a tunable, sector-dependent parameter ζ_x was added to allow for differentiated weighting of these sectors. According to the COLREGs, it is desirable that crossings take place on the port side, meaning that the weighting of sensor readings on the starboard side should be higher. However, since it is assumed in this work that the target vessels have restricted maneuverability, sensor readings on the port side and astern must also be sufficiently penalized. Denoting starboard as "st.b.", we thus have that ζ_x,st.b. < ζ_x,port ≤ ζ_x,stern.
(a) Illustration of sectors around the own-ship.
Furthermore, the reward must reflect the variable risk associated with the direction of a target ship's velocity; an approaching target ship gives rise to a much higher risk than a receding one. In addition, the relatively steep weighting function used in Equation 23 was exchanged for a flatter function of the form 1/(1 + exp(x)), so as to give dynamic obstacles detected around the own-ship sufficient priority. Making these adjustments to the static obstacle penalty, the raw penalty for a single dynamic obstacle, Equation 26, was chosen to include the angular weighting term 1/(1 + exp(γ_θ,dyn|θ|)), where x is the distance to the obstacle, θ is the vessel-relative (azimuth) angle, and v_y is the velocity component in the direction towards the vessel. The scaling factor ζ_v is given as a function of the angle θ and the velocity v_y, such that the reward efficiently guides the agent towards COLREGs-compliant behavior. It was found that an algorithm with less explicit classification of situations, and therefore fewer parameters, was in fact harder to tune due to the resulting high level of dependency between different encounter situations. For instance, since the starboard side is already heavily penalised, a lighter weighting of velocity was needed to prevent the agent from reacting too strongly when detecting a target ship on the starboard side. The sign of the velocity component of the target ship towards the own-ship is therefore used to determine whether the target ship is moving towards or away from the own-ship, which, together with the sensor angle θ, provides a good basis for determining a reasonable scaling factor for v_y. The scaling factor ζ_v is therefore given as a piecewise function of sign(v_y) and θ. Finally, as was done for static obstacles, we compute the dynamic obstacle-avoidance reward as a weighted average over the sensors, where each term is weighted by (1 − λ_i), with λ_i a parameter regulating the relative importance of path following and collision avoidance in an encounter situation. This parameter function
depends on the velocity v_y,i detected, and takes the distance x_i measured by the sensor as input, according to the logistic function λ_i(x_i) = 1/(1 + exp(−γ_λ(v_y)(x_i − α_λ(v_y)))). Here, α_λ(v_y) and γ_λ(v_y) are tunable parameters. Two sets of constant values were chosen such that the overall function for λ_i depends solely on the sign of the speed v_y of the target ship towards the own-ship, giving higher values when v_y < 0. In other words, λ_i incorporates the difference in risk between crossing ahead of and astern of a target ship, allowing the agent to return to path following more quickly when the target ship is moving away from the own-ship.
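As a concrete illustration of this trade-off parameter, the sketch below uses a logistic function that increases with the measured distance, with constants switching on the sign of v_y; all numeric values are assumptions for illustration only.

```python
import math

def lambda_i(x, v_y, alpha=(150.0, 75.0), gamma=(0.02, 0.05)):
    """Trade-off parameter lambda_i(x): near 0 when a target ship is
    critically close, near 1 when path following should dominate. The
    (alpha, gamma) pairs hold (approaching, receding) constants; all
    numbers are assumed, not the paper's tuned values."""
    a, g = (alpha[0], gamma[0]) if v_y >= 0 else (alpha[1], gamma[1])
    return 1.0 / (1.0 + math.exp(-g * (x - a)))
```

At a given distance, a receding target (v_y < 0) yields a higher λ_i than an approaching one, so the agent reverts to path following sooner, as described above.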

Total reward
Combining the penalties for static and dynamic obstacle avoidance introduced in Eqs. 24 and 28, the total collision avoidance penalty function r_colav^(t) is obtained as the sum of the static and dynamic components (Equation 32). Further, in order to encourage the agent to complete the path within a reasonable time frame, a constant living penalty r_exists < 0 was added. Combining all the elements presented then yields the expression for the final overall reward function. The relative weighting of the path and collision avoidance rewards regulated by λ^(t) was, as previously discussed, found necessary to avoid overly lenient collision avoidance manoeuvres when encountering a target ship close to the path. Note that each component i in the sum r_colav is multiplied by a weighting term (1 − λ_i).
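The combination rule can be sketched as follows. The living penalty of −1 follows Table 4; the collision penalty magnitude is an assumption, and the use of the smallest λ_i to weight the path term follows the regulation described in the text.

```python
def total_reward(r_path, colav_terms, lambdas, r_exists=-1.0,
                 collided=False, r_collision=-10000.0):
    """Overall reward: the path term is weighted by the most critical
    (smallest) lambda_i, each collision-avoidance term by (1 - lambda_i),
    plus the constant living penalty. r_collision is an assumed value."""
    lam = min(lambdas) if lambdas else 1.0          # most critical encounter
    r_colav = sum((1.0 - li) * ri for li, ri in zip(lambdas, colav_terms))
    reward = lam * r_path + r_colav + r_exists
    if collided:
        reward += r_collision                       # activated upon collision
    return reward
```

With no nearby target ships the reward reduces to the path term minus the living penalty; a critically close ship (small λ_i) suppresses the path term and amplifies its collision-avoidance penalty.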
Since small values of λ_i indicate the critical presence of another ship (and hence that less priority should be given to the path-following objective), the smallest value of λ_i is chosen to regulate r_path. Formally, this translates to λ^(t) = min_i λ_i.

Software implementation

Tools and libraries
Our solution is based on the Python framework OpenAI Gym [68], which has become a de facto standard for DRL interfaces. By implementing our simulation environment as an extension of OpenAI Gym, it is straightforward to train state-of-the-art, parallelizable RL agents on our scenarios. We use Stable Baselines [69], a Python library providing a wide range of well-documented, off-the-shelf RL algorithms, including PPO, for training our agent. The most computationally challenging aspect of the simulation, the calculation of the intersection points between the sensor rays and the boundaries of nearby obstacles, is handled efficiently by the shapely Python library [70], which offers an easy-to-use interface to a wide range of geometric analysis operations.

Simulation parameters
In our setup, both the policy network and the value network used in the PPO algorithm's advantage estimation have two hidden layers with 64 units each, with the tanh activation function used throughout. Furthermore, the hyperparameter values presented in Table 2 were used for the PPO algorithm, the vessel configuration (including a look-ahead distance of 3 km) is given in Table 3, and the parameters in Table 4 were used for customizing the reward function. This choice of reward function parameters stems from intuitive reasoning about the desired characteristics of the agent's guidance behavior and how it relates to the parameters, with further adjustments made based on observations during testing.
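The 2 × 64 tanh architecture can be sketched from scratch as below; in practice the networks are instantiated by Stable Baselines, and the input dimension here is illustrative.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Weight/bias pairs for a fully connected net with the given sizes."""
    return [(rng.standard_normal((m, n)) * np.sqrt(1.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, params):
    """Two tanh hidden layers and a linear head, matching the described
    policy/value architecture."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)         # hidden layers use tanh throughout
    W, b = params[-1]
    return x @ W + b                   # linear output head
```

For the policy network the output head would parameterize the action distribution, while the value network uses a single output unit.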

Evaluation
To provide a comprehensive basis for evaluating the agent's performance, we test the trained agent in three different test domains.

COLREGs compliance
First, artificial vessel encounter scenarios, in which COLREGs compliance can easily be categorized as a success or failure in binary terms, are created to quantify the trained agent's performance in a simple and unambiguous manner. Specifically, we simulate head-on and crossing scenarios, and expect the vessel to adhere to the relevant COLREGs rules. The head-on scenario and the first crossing scenario correspond to the situations illustrated in Figure 3, which allows for easy comparison later on.

Training environment performance
Next, we provide a statistical evaluation of the agent based on random samples of the training scenario. More precisely, we evaluate the degree to which the agent avoids collisions, as well as the degree to which it adheres to its path-following objective, by simulating its behavior in new (i.e. unseen) permutations of the training scenario. As described, the training environment is challenging, with a dense scattering of both static and dynamic obstacles.

Table 4: Reward configuration (living penalty: −1).

Real-world-based experiments
Finally, by combining high-fidelity terrain data with AIS tracking data from the Trondheim Fjord area, we construct three digital real-world environments in which the vessel's performance can be evaluated. In the corresponding scenario plots, the dashed black line represents the desired vessel trajectory. Each target vessel is drawn at its initial position with an arrow whose length corresponds to its initial speed; additionally, each target vessel's trajectory is drawn as a transparent, dotted red line.

Ørland-Agdenes
This scenario takes place in the heavily trafficked entrance region of the fjord: the region between the municipalities Ørland and Agdenes. After spawning near the coastline, the vessel must blend into two-way traffic and follow the path until it reaches the opening of the fjord. In particular, the agent is tested on its ability to handle head-on and overtaking situations. In the stochastically generated training environment, the agent is subject to a wide variety of unrealistic situations during training; these have been discarded.

AIS-based environment
Extending the testing to scenarios based on real-world AIS data, it can be seen that the agent behaves in a COLREGs-compliant manner in situations where the COLREGs clearly define an expected behaviour. Some examples of this are presented in Figure 23, where situations similar to those shown in Figure 22 were chosen for comparison. The main difference between the training environment and the AIS-based environment, however, lies in the shapes and sizes of the static obstacles, which represent land and islands in the AIS-based environment. As seen in Figure ??, the agent has generalized sufficiently to tackle these scenarios with ease. Further, overall COLREGs-compliant trajectories undertaken by the agent in the Trondheim, Ørland-Agdenes, and Froan scenarios can be seen in Figure 25. Although the agent had no issue traversing the complex geography of Froan, it struggled when encountering target ships in restricted waters (see Figure 24). The main explanation is likely that the training environment does not reflect these situations well enough for the agent to be prepared for them. For instance, in the training environment, the own-ship can always sail around a circular obstacle when encountering a target ship close to such an obstacle. In the Froan scenario, this is not the case, and the own-ship is prone to getting lost while attempting to find other ways to the goal. It should be noted that in the scenario presented in Figure 25c, the own-ship did not encounter a target ship after entering the narrow end section of the desired path; the scenario is included to showcase the agent's ability to navigate restricted waters in the absence of target ships.

Conclusion
In this study, we demonstrated that an RL-based autonomous vessel can avoid collisions with other vessels while at the same time following a desired trajectory without running aground. With no a priori knowledge of the environment except for the waypoints of its desired path, the agent makes reactive control decisions based on rangefinder sensors measuring the distance to nearby obstacles, be they static obstacles such as the shoreline or dynamic obstacles such as other vessels. The agent was trained in an artificial, simulated environment and evaluated in a digital reconstruction of the Trondheim Fjord area. Based on a representative sample of the marine traffic in the area, the trained vessel was evaluated by its performance in realistic encounter scenarios. Our results suggest that DRL agents, if trained in a stochastic, generic obstacle environment, are capable of performing complex guidance tasks.
The successful demonstration in a simulated environment shows great promise for the viability of implementing the approach on a real-world vessel. As the approach requires no knowledge of the vessel's internal dynamics, and allows us to easily adapt the agent's behavior by customizing the performance measure, our work lays the groundwork for further research which may, given equally positive results, bring significant value to the field of autonomous guidance.

Figure 2 :
Figure 2: Human injuries per year according to maritime accident statistics published by the European Maritime Safety Agency.

Figure 5 :
Figure 5: Map of the Norwegian mainland highlighting the area of interest.

Figure 6 :
Figure 6: Digital terrain reconstructed from DTM10 (Norwegian Mapping Authority) rendered in 3D for debugging and showcasing purposes. Specifically, this shows a view of the Bymarka area, a nature reserve on the west side of Trondheim.

Figure 7 :
Figure 7: Snapshot of the marine traffic from January 26, 2020 to February 6, 2020 in the Trondheim Fjord area based on AIS tracking data. Each line represents one recorded travel.

Figure 8 :
Figure 8: Random sample of the stochastically generated path following training scenario with moving obstacles. The circles are static obstacles, whereas the vessel-shaped objects are moving according to the trajectory lines.

Figure 9 :
Figure 9: Illustration of key concepts for navigation with respect to path following. The path reference point p_d(ω̄), i.e. the point yielding the closest Euclidean distance to the vessel, is here located right of the vessel, while the look-ahead reference point p_d(ω̄ + ∆_LA) is located a distance ∆_LA further along the path.

Figure 10 :
Figure 10: Rangefinder sensor suite attached to autonomous surface vessel.

Figure 11 :
Figure 11: Rangefinder sensor suite partitioned into D = 9 sectors according to the mapping function κ with the scale parameter γ_C = 0.13.

Figure 12 :
Figure 12: Pooling techniques for sensor dimensionality reduction. For the sectors colored green, the maximum distance S_r was output, implying that the sector is clear of any obstacles. It is obvious that min pooling yields an overly restrictive observation vector, effectively telling the agent that a majority of the travel directions are blocked. On the other hand, max pooling yields overly optimistic estimates, potentially leading to dangerous situations. The feasibility pooling algorithm, however, mirrors an intuitive reasoning about the reachability within each sector, producing a more intelligent estimate.

(a) Cross-section of the path-following reward landscape assuming path-tangential full-speed motion, visualized for both Gaussian and absolute exponential kernels for cross-track error rewarding. (b) Level curves of the path-following reward function assuming full-speed motion.

Figure 15 :
Figure 15: Cross-section and level curves for the path-following reward function with γ = 0.05.

Figure 16 :
Figure 16: Static obstacle closeness penalty landscape as a function of obstacle distance and angle relative to the vessel, with the scale parameters γ_θ = 10, γ_x = 0.1. The maximum penalty is imposed for obstacles located right in front of the vessel.

Figure 22 :
Figure 22: Agent performing common naval collision avoidance maneuvers in the training environment. Agent trajectories are drawn with blue dashed lines, and the target ships are drawn in red.

Figure 23 :
Figure 23: COLREGs-compliant agent performing common naval collision avoidance maneuvers in the AIS-based environment. Agent trajectories are drawn with blue dashed lines, and the target ships are drawn in red.

Figure 24 :
Figure 24: Agent (in blue) getting lost attempting to find an alternate route to the goal after encountering a target ship (in red).
Agent's trajectory in the Trondheim test scenario.
Agent's trajectory in the Froan test scenario.

Figure 25 :
Figure 25: COLREGs-compliant agent trajectories in the test scenarios, drawn as blue dashed lines; target ships and their trajectories are drawn in red.

Table 1: Path-following feature vector s_n at timestep t. The listed features should provide a sufficient basis for the agent to intelligently adhere to the desired path.

Table 2: Hyperparameters for the PPO algorithm.

The values in Table 2 were used for the PPO algorithm. In terms of the vessel setup, the values in Table 3 were used. Finally, the parameters in Table 4 were used for the reward function.