Obstacle Avoidance for UAS in Continuous Action Space Using Deep Reinforcement Learning

Obstacle avoidance for small unmanned aircraft is vital for the safety of future urban air mobility (UAM) and Unmanned Aircraft System (UAS) Traffic Management (UTM). There are many techniques for real-time robust drone guidance, but many of them operate in discretized airspace and control space, which would require an additional path-smoothing step to provide flexible commands for a UAS. To provide safe and efficient computational guidance of operations for unmanned aircraft, we explore the use of a deep reinforcement learning algorithm based on Proximal Policy Optimization (PPO) to guide autonomous UAS to their destinations while avoiding obstacles through continuous control. The proposed scenario state representation and reward function can map the continuous state space to continuous control of both heading angle and speed. To verify the performance of the proposed learning framework, we conducted numerical experiments with static and moving obstacles. Uncertainties associated with the environments and safety operation bounds are investigated in detail. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%.


Introduction
From delivery drones to autonomous electric vertical take-off and landing (eVTOL) passenger aircraft, modern unmanned aircraft systems (UAS) can perform many different tasks efficiently, including goods delivery, surveillance, public safety, weather monitoring, disaster relief, search and rescue, traffic monitoring, videography, and air transportation (Balakrishnan et al., 2018; Kopardekar et al., 2016). Urban air mobility (UAM) is likely to occur in urban areas close to buildings or airports. Thus, it is expected that UAS can use onboard detect-and-avoid systems to avoid other traffic, hazardous weather, terrain, and man-made and natural obstacles without constant human intervention (Kopardekar et al., 2016).
Among the decentralized methods, the conflict resolution problem can also be formulated as a Markov Decision Process (MDP). Reinforcement Learning (RL) has proven to be a good solution to aircraft traffic management, though mostly with traditional algorithms (Sun & Zhang, 2019). The next-generation airborne collision avoidance system (ACAS X) formulates the collision avoidance system (CAS) problem as a partially observable Markov Decision Process (POMDP) and has been extended to unmanned aircraft, named ACAS Xu (Kochenderfer et al., 2012). Both ACAS X and ACAS Xu use Dynamic Programming (DP) to determine the expected cost of each action (Manfredi & Jestin, 2016; Owen et al., 2019). Chryssanthacopoulos & Kochenderfer (2012) combined decomposition methods and DP for optimized collision avoidance with multiple threats.
Traditional RL algorithms require a fine discretization scheme for the state space and a finite action space. Discretization potentially reduces safety by introducing discretization errors and cannot provide flexible maneuver guidance for UAS. In addition, discretizing a large airspace implies a high computational demand and can be time-consuming. Tree-search-based algorithms (Yang & Wei, 2018, 2020) have also been applied to CAS problems using an MDP formulation without state discretization, but they typically require high onboard computation time to accommodate the continuous state space.
The large and continuous state and action spaces present a challenge for conflict resolution problems using reinforcement learning. Recently, Deep Reinforcement Learning (DRL) has been studied to address this challenge by applying deep neural networks to approximate the cost and optimal policy functions.
The development of DRL algorithms, such as Policy Gradient (Sutton et al., 2000), Deep Q-Networks (DQN) (Mnih et al., 2013), Double DQN (Van Hasselt et al., 2016), Dueling DQN (Wang et al., 2015), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), and Proximal Policy Optimization (PPO) (Schulman et al., 2017), has increased the potential for automation. Li et al. (2019) used DQN to compute corrections for an existing collision avoidance approach to account for dense airspace. In Yang et al. (2019), the feasibility of using DQN-based algorithms for UAV obstacle avoidance is verified. Wulfe (2017) concluded that DQN can outperform value iteration both in evaluation performance and in solution speed when solving a UAV collision avoidance problem. The performance of an agent in avoiding a single aircraft up to multiple aircraft using the DQN algorithm is investigated in Keong et al. (2019). Brittain et al. (2020) proposed a novel deep multi-agent reinforcement learning framework based on PPO to identify and resolve conflicts among a variable number of aircraft in a high-density, stochastic, and dynamic en-route sector. The DRL work mentioned above operates in continuous state but discrete action space.
There has been less progress on utilizing DRL to solve UAS conflict resolution with continuous control. To the best of the authors' knowledge, this is the first study to develop a DRL approach based on the PPO algorithm that allows the UAS to navigate successfully in continuous state and action spaces. The benefit of computing in continuous space is that there is no need to discretize the state space or to smooth results in postprocessing. The proposed model, with the optimal policy obtained after offline training, can be used for real-time online UAS trajectory planning. The main contributions of this paper are as follows:

• A PPO-based framework is proposed for UAS to avoid both static and moving obstacles in continuous state and action spaces.
• A novel scenario state representation and reward function are developed that can effectively map the environment to maneuvers. The trained model can generate continuous heading angle commands and speed commands.

• We have tested the effectiveness of the proposed learning framework in the environment with static obstacles, the environment with static obstacles and UAS position uncertainty, and the deterministic and stochastic environments with moving obstacles. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%.
The remainder of this paper is organized as follows. Section 2 describes the background of the Markov Decision Process and Deep Reinforcement Learning. Section 3 presents the model formulation using a Markov Decision Process for UAS conflict resolution in continuous action space. In Section 4, numerical experiments are presented to show the capability of the proposed approach to make the UAS learn to avoid conflict. Section 5 concludes this paper.

Background
In this section, we briefly review the background of the Markov Decision Process (MDP) and Deep Reinforcement Learning (DRL).

Markov Decision Process (MDP)
Since the 1950s, MDPs (Bellman, 1957) have been well studied and applied across a wide range of disciplines (Howard, 1964; White, 1993; Feinberg & Shwartz, 2012), including robotics (Koenig & Simmons, 1998; Thrun, 2002), automatic control (Mariton, 1990), economics, and manufacturing. In an MDP, the agent may choose any available action a based on the current state s at each time step. The process responds at the next time step by moving into a new state s′ with a certain transition probability and gives the agent a corresponding reward r.
More precisely, the MDP includes the following components:

1. The state space S, which consists of all possible states.

2. The action space A, which consists of all the actions that the agent can take.

3. The transition function T(s_{t+1} | s_t, a_t), which describes the probability of arriving at state s_{t+1} given the current state s_t and action a_t.

4. The reward function r(s_t, a_t, s_{t+1}), which determines the immediate reward (or expected immediate reward) received after transitioning from state s_t to state s_{t+1} due to action a_t. In general, the reward depends on the current state, the current action, and the next state. r_t is the immediate reward at time step t, and R_t is the total discounted reward from time step t onwards.

5. A discount factor γ ∈ [0, 1], which decides the preference for immediate rewards versus future rewards. Setting the discount factor to less than 1 is also beneficial for the convergence of the cumulative reward.
In an MDP problem, a policy π is a mapping from a state to a distribution over actions (a stochastic policy) or to one specific action (a deterministic policy). The goal of the MDP is to find an optimal policy π* that, if followed from any initial state, maximizes the expected cumulative reward:

π* = argmax_π E[ Σ_{t=0}^{∞} γ^t r_t | π ].

The Q-function and value function are two important concepts in MDPs. The optimal Q-function Q*(s, a) represents the expected cumulative reward received by an agent that starts from state s, picks action a, and chooses actions optimally afterward. Therefore, Q*(s, a) indicates how good it is for an agent to pick action a while in state s. The optimal value function V*(s) denotes the maximum expected cumulative reward when starting from state s, which can be expressed as the maximum of Q*(s, a) over all possible actions:

V*(s) = max_a Q*(s, a).
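As a concrete illustration of these definitions, the sketch below runs tabular value iteration on a tiny two-state MDP and recovers V*(s) as the maximum of Q*(s, a) over actions. The transition and reward tables are made up for this sketch, not taken from the paper.

```python
import numpy as np

# Toy 2-state, 2-action MDP (tables invented purely for illustration).
n_states, n_actions = 2, 2
gamma = 0.9

# T[s, a, s'] = transition probability; R[s, a] = expected immediate reward.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Value iteration: Q(s, a) <- R(s, a) + gamma * sum_s' T(s, a, s') * max_a' Q(s', a')
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = Q.max(axis=1)          # V*(s) = max_a Q*(s, a)
    Q = R + gamma * T @ V      # Bellman optimality backup

V_star = Q.max(axis=1)         # optimal value function
pi_star = Q.argmax(axis=1)     # greedy (deterministic) optimal policy
```

After convergence, the greedy policy with respect to Q* is an optimal deterministic policy, which is exactly the object the traditional (tabular) RL algorithms discussed below try to compute.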
Compared with value-based DRL algorithms, policy-based algorithms are effective in high-dimensional or continuous action spaces and can learn stochastic policies, which is beneficial when there is uncertainty in the environment.
Typically, a policy-based algorithm uses a function approximator such as a neural network to represent the policy π(s), where the input is the current state and the output is the probability of each action (for a discrete action space) or an action distribution (for a continuous action space). After each trajectory τ, the algorithm updates the parameters of the function approximator to maximize the cumulative reward using gradient ascent:

∇_θ J(π_θ) = E[ Σ_t ∇_θ log π_θ(a_t | s_t) R_t ],   (4)

where J(π_θ) is the expected cumulative reward of the policy π parameterized by θ, π_θ(a_t | s_t) is the probability of action a_t in state s_t, and R_t is the cumulative reward gathered by the agent over the remaining trajectory τ in one episode.
The general idea of Eq. (4) is to reduce the probability of sampling an action that leads to a lower return and to increase the probability of an action that leads to a higher reward. One issue, however, is that the cumulative reward usually has very high variance, which slows convergence. To address this issue, researchers proposed the actor-critic algorithm (Sutton & Barto, 2018), where a critic function is introduced to approximate the state value function V(s_t).
By subtracting the value function V(s_t), the expectation of the gradient remains unchanged while the variance is reduced dramatically:

∇_θ J(π_θ) = E[ Σ_t ∇_θ log π_θ(a_t | s_t) (R_t − V(s_t)) ],   (5)

where the function approximator V(s_t) is updated to approximate the value function at s_t.
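The variance-reduction idea can be sketched numerically. In the minimal example below, the rewards, baseline values, and per-step score vectors are all made-up placeholders; it only illustrates how the reward-to-go R_t is computed and how subtracting the baseline yields the advantages that weight the policy gradient:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_{k>=t} gamma^(k-t) r_k, the reward-to-go from step t."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return np.array(out[::-1])

# One illustrative trajectory (rewards and baseline values are invented).
rewards = [0.0, 0.0, 1.0, -0.5, 2.0]
returns = discounted_returns(rewards)

# Critic baseline V(s_t); subtracting it leaves the gradient's expectation
# unchanged while reducing its variance.
baseline = np.array([0.5, 0.6, 0.9, 0.4, 1.5])
advantages = returns - baseline

# Policy-gradient estimate: sum_t grad(log pi(a_t|s_t)) * advantage_t.
# grad_log_pi stands in for the per-step score vectors (3 parameters here).
grad_log_pi = np.ones((len(rewards), 3))   # placeholder gradients
grad_J = (grad_log_pi * advantages[:, None]).sum(axis=0)
```

In practice the baseline is produced by the critic network and the score vectors come from automatic differentiation; the structure of the estimator is the same.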

Markov Decision Process Formulation
In this study, the UAS and intruders are modeled as point masses. The objective of the proposed conflict resolution algorithm is to find the shortest path for a UAS to its goal while avoiding conflict with other UAS and static obstacles. Guiding the UAS to its destination is a discrete-time stochastic control process that can be formulated as a Markov Decision Process (MDP). In the following subsections, we introduce the MDP formulation by describing its state representation, action space, terminal states, and reward function. For this work, a deep reinforcement learning method, the proximal policy optimization algorithm developed in Schulman et al. (2017), is adopted; the reasons and details are also introduced in this section.

State representation
The agent gains knowledge of the environment from the state of the formulated MDP. The state should include all the information necessary for an agent to take optimal actions. In this paper, we let s_t denote the agent's state at time t. All parameters are normalized before querying the neural network.
The state can be divided into two parts, s_t = [s_t^0, s_t^1], where s_t^0 denotes the part related to the agent itself and the goal, and s_t^1 denotes the part related to the environment, such as obstacles. We use w_t to denote the environment information (representing moving/static obstacles in this paper) and set s_t^1 = [w_t^1, w_t^2, ..., w_t^n], where w_t^i contains the information of obstacle i. To speed up the training of the DRL algorithm, we transform the state following the robot-centric parameterization in Chen et al. (2017), where the agent is located at the origin and the x-axis points toward its goal.
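A minimal sketch of this robot-centric (goal-centric) transform in a 2D world frame follows; the specific positions used in the example are made up for illustration:

```python
import numpy as np

def to_goal_frame(agent_pos, goal_pos, points):
    """Rotate/translate world coordinates so the agent sits at the origin
    and the +x axis points toward the goal (robot-centric parameterization)."""
    agent_pos = np.asarray(agent_pos, float)
    goal_pos = np.asarray(goal_pos, float)
    dx, dy = goal_pos - agent_pos
    theta = np.arctan2(dy, dx)             # bearing of the goal in world frame
    rot = np.array([[np.cos(theta),  np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])  # rotation by -theta
    return (np.asarray(points, float) - agent_pos) @ rot.T

# In this frame the goal lies on the +x axis at distance d_g.
goal_local = to_goal_frame((1.0, 1.0), (4.0, 5.0), [(4.0, 5.0)])
```

All positions and velocities fed to the network (goal, obstacles, intruders) are expressed in this rotated frame, which removes the absolute orientation of the scenario from the learning problem.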

State representation for static obstacle avoidance
In the simulations of static obstacle avoidance, the agent-related state is s_t^0 = [d_g, v_x, v_y], where d_g is the agent's distance to the goal and v_x, v_y denote the agent's velocity components. The information of obstacle i is w_t^i = [P_y^i, d^i], where P_y^i is the y-axis position of obstacle i and d^i is the agent's distance to the center of obstacle i. P_y^i is introduced to help the agent learn the globally optimal solution: for example, when approaching the obstacle, turn a small angle counterclockwise if P_y^i is positive, which means the agent is on the right side of the line passing through the obstacle center and the goal. We note that v and P are vectors in the transformed coordinate system.

State representation for moving obstacle avoidance
As for moving obstacle avoidance, the position of the goal, (g_x, g_y), is added to s_t^0. The information of intruder i is w_t^i = [P^i, V^i, d^i, V_ref^i], where P^i is the position of intruder i, V^i is the velocity of intruder i, d^i is the distance between the agent and intruder i, and V_ref^i is the velocity of the agent relative to intruder i. We note that v, g, P, and V are vectors in the transformed coordinate system.
More specifically, at each time step the agent selects an action a_h ∈ A_h and changes its heading angle ψ:

ψ_{t+1} = ψ_t + a_h.

Action space for deterministic intruder avoidance
As for the deterministic intruder avoidance case, besides the heading angle change a_h ∈ A_h, the UAS is also controlled by a speed command, which is updated every second; during each interval the two commands are fixed. Since a UAS is more flexible than manned aircraft and there is no available regulation on UAS speed changes, the UAS speed command a_v can be chosen from A_v. More specifically, the agent selects an action a_v ∈ A_v and changes its speed v at the next time step t + 1:

v_{t+1} = v_t + a_v.

In real-world applications, however, making a sharp turn is usually not desirable for controlling a UAS. Thus a penalty on large heading or speed changes, reflecting power consumption, may be considered in future work.
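Assuming simple point-mass kinematics over the fixed one-second interval, the two commands can be sketched as below. The speed bounds and clipping behavior are our own illustrative assumptions, not values from the paper:

```python
import numpy as np

def step(x, y, psi, v, a_h, a_v, dt=1.0, v_min=10.0, v_max=30.0):
    """Advance a point-mass UAS one time step under a heading-change command
    a_h (radians) and a speed-change command a_v (m/s).

    v_min/v_max are illustrative bounds we assume to keep the speed physical.
    """
    psi = psi + a_h                                  # heading update
    v = float(np.clip(v + a_v, v_min, v_max))        # speed update, bounded
    x = x + v * np.cos(psi) * dt                     # position integration
    y = y + v * np.sin(psi) * dt
    return x, y, psi, v
```

Within each one-second interval the commands (a_h, a_v) are held constant, matching the update scheme described above.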

Terminal state
In the current study, a conflict is defined as the distance from the agent to the obstacle being less than a minimum separation distance. When the UAS operation is deterministic, a buffer zone is not necessary and the minimum separation distance is set to zero.

Terminal state for static obstacle avoidance
The terminal state for static obstacle avoidance includes two different types of states:

• Conflict state: the distance between the agent and the obstacle is less than the minimum separation distance.
• Goal state: the agent is within 400 m from the destination.

Terminal state for moving obstacle avoidance
The episode terminates only when the agent is within 200 m from the destination, which indicates the agent accomplishes the navigation task.

Reward function
To guide the agent to reach its goal and avoid conflict, the reward function is designed to reward accomplishment while penalizing conflicts and failure to move toward the goal.

Reward function for static obstacle avoidance
In the simulations of static obstacle avoidance, the reward function R(s, a) takes the form of Eq. (14), where we set a reward for the goal state and a penalty for the conflict state. The linear term of the reward function guides the UAS toward the destination, and the constant per-step penalty encourages the shortest path.

Reward function for moving obstacle avoidance
In the simulations of intruder avoidance, the reward function R(s, a) takes a form similar to Eq. (14).
In this reward function, c_g, c_0, c_1i, c_2i, c_3i are the coefficients of the different cost terms and should be balanced so that the agent learns conflict resolution and goal achievement simultaneously. When the ownship is close to an intruder, the inverse-tangent term of the reward function is activated to keep the distance in an appropriate range. With the coefficients set as in the stochastic intruder case in Section 4.2.1, the relation between the distance and the inverse-tangent term, 17(arctan(0.1(d_i − 12)) − π/2), is shown in Fig. 1. The agent starts to incur a penalty when the distance approaches 250 m. This reward setting helps the agent avoid conflicts with intruders at a relatively early stage. We note that c_2i and c_3i can be tuned to fit different separation standards.
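A sketch of this inverse-tangent term with the coefficients quoted above follows. We assume d_i is expressed in the paper's simulation units (1 unit = 10 m); the function name is our own:

```python
import numpy as np

def intruder_penalty(d_i):
    """Inverse-tangent penalty term from the stochastic intruder case:
    17 * (arctan(0.1 * (d_i - 12)) - pi/2).

    The term is always negative and approaches zero as the distance to the
    intruder grows, so the penalty fades out at large separations.
    """
    return 17.0 * (np.arctan(0.1 * (d_i - 12.0)) - np.pi / 2.0)
```

Because arctan saturates at π/2, the penalty is bounded below and shrinks smoothly with distance, which is what lets the agent react to intruders at a relatively early stage rather than only at the separation boundary.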

Proximal policy optimization algorithm
One drawback of the gradient ∇_θ J(π_θ) proposed in Sutton & Barto (2018), shown in Eq. (5), is that one bad update can have large destructive effects and hinder the final performance of the model. The Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) was proposed to address this problem by introducing a policy changing ratio describing the change from the previous policy to the new policy at time step t:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t),

where θ_old and θ denote the network weights before and after the update.
By restricting the policy changing ratio to the range [1 − ε, 1 + ε], with ε set to 0.2 in this paper, the PPO loss functions for the actor and critic networks are formulated as follows:

L_actor(θ) = −E_t[ min(r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t) ] − β·H(π(·|s_t)),   (17)

L_critic = E_t[ (V(s_t) − R_t)^2 ],   (18)

where ε is a hyperparameter that bounds the policy changing ratio r_t(θ). In Eq. (17) and Eq. (18), the advantage function A_t measures whether the action is better or worse than the policy's default behavior. The policy entropy term β·H(π(·|s_t)) is added to the actor loss to encourage exploration by discouraging premature convergence to sub-optimal deterministic policies.
In the implementation, we use a two-layer multilayer perceptron (MLP) with 64 hidden units per layer for both the actor and critic networks. The tanh function is chosen as the activation function for the hidden layers.
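The clipped surrogate in Eq. (17) can be sketched as a small NumPy function; the ratio and advantage values below are illustrative:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-step clipped surrogate:
    min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t).
    The actor maximizes the mean of this quantity (equivalently, minimizes
    its negative in the loss)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, pushing the ratio beyond 1 + eps earns no extra
# objective, which caps the size of any single policy update.
obj_small = ppo_clip_objective(np.array([1.1]), np.array([1.0]))
obj_big = ppo_clip_objective(np.array([2.0]), np.array([1.0]))
```

The min of the clipped and unclipped terms makes the objective a pessimistic bound: a step that moves the ratio far from 1 can never look better than a step that stays within [1 − ε, 1 + ε], which is what prevents the large destructive updates mentioned above.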

Numerical experiments
Numerical experiments are presented in this section to evaluate the proposed conflict resolution model in continuous action space. There are two categories of collision avoidance: static obstacle avoidance and moving obstacle avoidance.

For static obstacle avoidance, we investigate the performance for different obstacle shapes and sizes and under uncertainty in UAS operation. We also study the environment with stochastic intruders under control of heading angle, and the environment with deterministic intruders under control of heading angle and speed. In all simulations, one pixel in the figures represents 10 m in the real world. The PPO implementation is based on OpenAI Baselines (Dhariwal et al., 2017). The deep reinforcement learning model for each case is trained for 30 million time steps.

Static obstacle avoidance
The simulation environment is a free-flight airspace of 4 km length and 4 km width. The UAS speed is set to 20 m/s. During training, the starting position of the aircraft is randomly sampled from the four edges of the airspace boundary and rounded to integer coordinates for simplicity. The goal is located at (2500, 2500). In this experiment, we study two types of static obstacles, circular and rectangular, as shown in Fig. 2.

The plus sign represents the goal position and the blue region represents the no-passing area. The episode reward mean is shown in Fig. 3, which shows the episode reward growing and the policy converging to the optimal solution. To visualize the performance of the proposed conflict resolution model, we generate a testing set of 160 trajectories starting from different origins; the testing origins are chosen every 100 m along each edge of the airspace boundary. Heading angles from the 160 trajectories are collected, and the heading angle is plotted every 15 time steps. From Fig. 4, it can be seen that the agent selects heading angles pointing toward the goal while avoiding the no-passing region. The agent also chooses the optimal behavior according to the relative positions of the agent, obstacle, and goal. For example, near the lower-left obstacle, if the agent's position is above the line passing through the obstacle center and the goal, the UAS takes a small left turn to avoid the obstacle; otherwise, the UAS bypasses the lower semicircle. For the 160 generated trajectories in Fig. 4, there are no failures.
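The testing set of origins can be generated as follows; the helper name is ours, and it assumes each corner is counted once so that 40 points per edge give exactly 160 origins:

```python
import numpy as np

def boundary_origins(side=4000.0, spacing=100.0):
    """Test origins every `spacing` meters along the four edges of a square
    airspace, counting each corner exactly once (40 per edge -> 160 total)."""
    ticks = np.arange(0.0, side, spacing)        # 0, 100, ..., 3900
    bottom = [(t, 0.0) for t in ticks]           # walk the boundary
    right = [(side, t) for t in ticks]           # counterclockwise,
    top = [(side - t, side) for t in ticks]      # edge by edge
    left = [(0.0, side - t) for t in ticks]
    return bottom + right + top + left

origins = boundary_origins()   # 160 distinct starting positions
```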

Rectangular obstacle avoidance
The difference between this rectangular obstacle case and the previous circular obstacle case lies in the condition used to check whether the agent is in a conflict state. The environment for this case is shown in Fig. 2(b), and the testing result of 160 trajectories is shown in Fig. 5.
The performance in Fig. 5 is similar to the result in Fig. 4. Results in Fig. 4 and Fig. 5 show that the proposed model can make the UAS learn to find the shortest path while avoiding static obstacles of different sizes and shapes.

Circular obstacle avoidance with uncertainty
This case is studied to examine how the proposed conflict resolution model handles uncertainty. UAS operation is stochastic, and randomness exists in almost every aspect of UTM. Including uncertainty quantification of aircraft operation is critical for future safety analysis (e.g., deviation from a trajectory plan due to wind, true airspeed, or positioning error) (Hu et al., 2020; Liu & Goebel, 2018; Hu & Liu, 2020; Pang et al., 2019a,b, 2021). Thus, to model the uncertainties in UAS operation, we form a circle whose center is the predicted UAS position without uncertainty and whose radius is the separation requirement, 75 m. With 90% probability, the UAS position is exactly at the center of the circle; with 10% probability, the UAS position is located at a point on the circle at a uniformly distributed angle. This uncertainty is applied when calculating the agent's position at the next time step after taking action a.
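A minimal sketch of this uncertainty model follows; the function name and the use of a seeded NumPy generator are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def noisy_position(pred_xy, radius=75.0, p_center=0.9):
    """Sample the realized UAS position: with probability p_center it equals
    the predicted position; otherwise it lies on a circle of the given radius
    around the prediction, at a uniformly distributed angle."""
    pred_xy = np.asarray(pred_xy, float)
    if rng.random() < p_center:
        return pred_xy
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return pred_xy + radius * np.array([np.cos(angle), np.sin(angle)])
```

Every realized position is therefore either exactly the prediction or exactly 75 m away from it, which is why the separation requirement doubles as the uncertainty radius in this case.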
The testing results are shown in Fig. 6. In Fig. 6

Moving obstacle avoidance
For the moving intruder aircraft avoidance case, the speed of the intruders is set to 20 m/s. The origins and heading angles of the three intruders are assumed to follow uniform distributions whose ranges are shown in Table 2. The origin coordinate of the agent is uniformly sampled from (75 ∼ 135, 0 ∼ 25). The goal is located at (100, 200). The agent moves at 20 m/s. The intruders are designed to cross the line connecting the UAS origin and the goal.
The demonstration of one scenario and UAS performance is shown in Fig. 8.
Information related to the ownship is plotted in blue, and black represents the information of the intruders. The plus sign denotes the origin and the star sign is the goal for the agent. The centers of the circles are the aircraft positions, which are plotted every 5 time steps and labeled with the time step every 10 time steps. The radius of each circle represents the aircraft speed. In this scenario, the agent

Deterministic-intruder avoidance with control of heading angle and speed
We also investigate the possibility of using the proposed reward function to generate both heading angle change commands and speed commands. This investigation is valuable when changing the heading angle alone cannot efficiently resolve conflicts. Moreover, with the extra option of changing speed, the UAS may have less impact on the flight plans of other aircraft and on airspace capacity.
However, due to the larger action space, the training process needs more effort.
The origin and heading angle of the three intruders are listed in Table 3.
The origin coordinate of the agent is (100, 210) and the goal is located at (100, 0).

Conclusion
Pham et al. (2019) proposed a method inspired by the Deep Q-learning and Deep Deterministic Policy Gradient algorithms that can resolve conflicts, with a success rate of over 81%, in the presence of traffic and varying degrees of uncertainty. Ma et al. (2018) developed a generic framework that integrates an autonomous obstacle detection module and an actor-critic reinforcement learning (RL) module to produce reactive obstacle avoidance behavior for a UAV. Experiments in Schulman et al. (2017) tested PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and showed that PPO outperforms other online policy gradient methods, striking a favorable balance between sample complexity, simplicity, and wall-clock time. A PPO-based conflict resolution model is therefore very valuable for UAS traffic management, which is the major motivation of this study.

In this work, we present a method for using deep reinforcement learning to
Action space for static obstacle avoidance and stochastic intruder avoidance

In the implementations of static obstacle avoidance and stochastic intruder avoidance, the action represents the change in heading angle of the controlled UAS at each time step. The action space A_h is a continuous, bounded range of heading angle changes.

In the implementations of static obstacle avoidance with uncertainty and moving obstacle avoidance, the UAS position uncertainty is taken into account. The separation requirement is determined according to the operational safety bound proposed in Hu et al. (2020). With a UAS speed of 20 m/s and other UAS operational parameters following the mean values shown in Table 3 of Hu et al. (2020), the minimum separation distance is 75 m for static obstacle avoidance and 150 m for moving obstacle avoidance.

Figure 1 :
Figure 1: Reward related to the distance from the agent to the intruder.

Figure 3 :
Figure 3: Episode reward mean for (a) circular obstacle avoidance, (b) rectangular obstacle avoidance, and (c) circular obstacle avoidance with probabilistic agent position.
The UAS learns to bypass the obstacle on one side, depending on the relative positions of the agent, obstacle, and goal. No failure occurs among the 160 test trajectories.
(a), the agent's position with uncertainty is plotted. In Fig. 6(b), an uncertainty of 75 m is added to the obstacle, indicated by the red circle. As expected, the UAS tries to keep 75 m away from the obstacles, so either method works when simulating collision avoidance under uncertainty. One failure occurs near the upper-left obstacle in Fig. 6(a), and three failures occur near the lower-left obstacle in Fig. 6(b). Common to these failures is that the agent's origin lies approximately on the line passing through the obstacle center and the goal. A possible reason is that the policy network gets stuck in a local optimum, since the two neighboring trajectories behave well.

Figure 6 :
Figure 6: (axis unit: ×10 m) (a) Results with probabilistic agent position. (b) Results with uncertainty added to obstacles. +: goal; blue: no-passing area; arrow: selected heading direction; red: separation requirement due to uncertainty in UAS operation.

Figure 7 :
Figure 7: (a) Episode reward mean of stochastic-intruder avoidance with control of heading angle. (b) Episode reward mean of deterministic-intruder avoidance with control of heading angle and speed.

Figure 8 :
Figure 8: Demonstration of one scenario and UAS performance of stochastic-intruder avoidance with control of heading angle. Number: time step. Blue: ownship; black: intruders. Circle center: UAS position; circle radius: UAS speed. +: origin; *: goal.

Figure 9 :
Figure 9: Minimum distance results of stochastic-intruder avoidance with control of heading angle. Orange line: separation requirement of 150 m. Blue dot: the minimum distance from the agent to the three intruders within each episode.

Figure 10 :
Figure 10: Demonstration of the scenario and UAS performance of deterministic-intruder avoidance with control of heading angle and speed. Number: time step. Blue: ownship; black: intruders. Circle center: UAS position; circle radius: UAS speed. +: origin; *: goal.

Figure 11 :
Figure 11: Minimum distance results of deterministic-intruder avoidance with control of heading angle and speed. Orange line: separation requirement of 150 m. Blue dot: the minimum distance from the agent to the three intruders within each episode.
allow the UAS to navigate successfully in urban airspace with a continuous action space. Both static and moving obstacles are simulated, and the trained UAS has the capability to reach its goal and resolve conflicts simultaneously. We also investigate the performance for different static obstacle shapes and sizes and under uncertainty in UAS operation. Stochastic intruders are considered in the training process of the moving obstacle experiments. Moreover, we investigate the ability of the proposed reward function to resolve conflict through both heading angle and speed. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%. To make the proposed algorithm more practical and efficient in the real world, future work could model some of the intruders as agents, allowing cooperation among multiple aircraft.

Morgan, D., Chung, S.-J., & Hadaegh, F. Y. (2014). Model predictive control of swarms of spacecraft using sequential convex programming. Journal of Guidance, Control, and Dynamics, 37, 1725-1740. doi:10.2514/1.G000218.

Owen, M. P., Panken, A., Moss, R., Alvarez, L., & Leeper, C. (2019). ACAS Xu: Integrated collision avoidance and detect and avoid capability for UAS. In 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC) (pp. 1-10). IEEE.

Pallottino, L., Feron, E. M., & Bicchi, A. (2002). Conflict resolution problems for air traffic management systems solved with mixed integer programming. IEEE Transactions on Intelligent Transportation Systems, 3, 3-11. doi:10.1109/6979.994791.

Pang, Y., Xu, N., & Liu, Y. (2019a). Aircraft trajectory prediction using LSTM neural network with embedded convolutional layer. In Proceedings of the Annual Conference of the PHM Society, volume 11.

Pang, Y., Yao, H., Hu, J., & Liu, Y. (2019b). A recurrent neural network approach for aircraft trajectory prediction with weather features from Sherlock. In AIAA Aviation 2019 Forum (p. 3413).

Pang, Y., Zhao, X., Yan, H., & Liu, Y. (2021). Data-driven trajectory prediction with weather uncertainties: A Bayesian deep learning approach. Transportation Research Part C: Emerging Technologies, 130, 103326.

Pham, D.-T., Tran, N. P., Alam, S., Duong, V., & Delahaye, D. (2019). A machine learning approach for conflict resolution in dense traffic scenarios with uncertainties.

Pontani, M., & Conway, B. A. (2010). Particle swarm optimization applied to space trajectories. Journal of Guidance, Control, and Dynamics, 33, 1429-1441. doi:10.2514/1.48475.
There are two cases for moving obstacle avoidance: a stochastic intruder case with control of heading angle, and a deterministic intruder case with control of heading angle and speed. In the stochastic intruder case, the scenario changes every episode; each intruder has a different origin and heading angle in each episode, but within one episode the intruders keep fixed heading angles. The reward coefficients are listed in Table 1. The episode reward mean is shown in Fig. 7. To visualize the performance of the proposed conflict resolution model, we generate a testing set of 500 episodes following the training settings for each case. The minimum distance from the agent to the three intruders within each episode is also collected.