Representation Learning and Reinforcement Learning for Dynamic Complex Motion Planning System

Indoor motion planning challenges researchers because of the high density and unpredictability of moving obstacles. Classical algorithms work well with static obstacles but suffer from collisions when obstacles are dense and dynamic. Recent reinforcement learning (RL) algorithms provide safe solutions for multiagent robotic motion planning systems. However, these algorithms face challenges in convergence: slow convergence speed and suboptimal converged results. Inspired by RL and representation learning, we introduced ALN-DSAC: a hybrid motion planning algorithm in which attention-based long short-term memory (LSTM) and novel data replay combine with discrete soft actor–critic (SAC). First, we implemented a discrete SAC algorithm, i.e., SAC in the setting of a discrete action space. Second, we optimized existing distance-based LSTM encoding with attention-based encoding to improve data quality. Third, we introduced a novel data replay method that combines online and offline learning to improve the efficacy of data replay. The convergence of our ALN-DSAC outperforms that of the trainable state of the art. Evaluations demonstrate that our algorithm achieves nearly 100% success with less time to reach the goal in motion planning tasks than the state of the art. The test code is available at https://github.com/CHUENGMINCHOU/ALN-DSAC.


I. INTRODUCTION
Indoor service robots have appeared in airports and restaurants to provide services to visitors (e.g., luggage delivery and food delivery). However, these robots suffer poor motion planning performance in scenarios with dense and dynamic obstacles (pedestrians) because of the unpredictability of obstacle motion. This hinders the robots' further commercial use. Some robots' motions are controlled by classical path planning algorithms, such as graph search algorithms (e.g., A* [1]), sample-based algorithms (e.g., the rapidly exploring random tree (RRT) [2]), and interpolating curve algorithms [3], [4], [5], [6], [7]. These algorithms work well in static, low-speed scenarios with few obstacles. However, they cause robots to suffer many collisions in complex cases because they generate motions or paths online. Online motion depends on map updates that require considerable computing resources. Reaction-based algorithms, such as the dynamic window approach (DWA) [8] and optimal reciprocal collision avoidance (ORCA) [5], react quickly to the unpredictability of obstacles. This enables the robot to quickly avoid slow-moving obstacles. However, these algorithms require environment updates whose consumed time must be considered. Deep learning (DL) generates the robot's motion by running trained models whose consumed time can be ignored. Classical DL, such as the convolutional neural network (CNN) [9], generates instant motion, which is a one-step prediction that does not consider the task goal and therefore yields suboptimal trajectories. Recent progress in deep reinforcement learning (RL), such as optimal-value RL (e.g., the dueling deep Q network (DQN) [10]) and policy gradient RL (e.g., the asynchronous advantage actor–critic (A3C) algorithm [11]), enables the robot to consider the task goal and instant obstacle avoidance simultaneously. These algorithms provide near-optimal solutions to handle complex cases. However, training RL algorithms faces many challenges: slow convergence speed and suboptimal converged results caused by high bias and variance.
In this article, we conclude that data quality, the efficacy of data replay strategies, and the optimality of RL algorithms are the main aspects that decide the performance of RL-based motion planning.
Data Quality (Representation Methods): Bai et al. [9] and Long et al. [12] learned obstacle features directly from source images, which have poor data quality because of background noise and unnecessary information. Representation learning [13] partly alleviates this problem by interpreting and encoding the features of the robot and obstacles to improve data quality. Representation methods that fit motion planning tasks are the long short-term memory (LSTM) [14], [15], attention weight (AW) [16], [17], and some graph representation learning methods [18] such as the relation graph (RG) [19].
Efficacy of Data Replay: The data replay strategy makes use of saved data to improve the convergence speed dramatically compared to online learning [14], which learns episodic data with an online feature. Typical data replay strategies consist of experience replay (ER) [20] and prioritized experience replay (PER) [21], [22]. ER introduces less bias and variance when the algorithm learns from a random batch of data because the sampled data for training have the same data distribution, i.e., they are independently and identically distributed (i.i.d.).

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Optimality of RL: RL algorithms basically include optimal-value RL and policy gradient RL. They are primarily tested in games such as Atari games. Their representatives are DQN [20] and the actor–critic algorithm [23]. Many variants followed. DQN evolved into double DQN [24], dueling DQN [10], and soft Q-learning [25], while the actor–critic algorithm basically evolved in three directions: the multithread direction, the deterministic direction, and the monotonic-improvement direction. The multithread direction uses multiple threads and policy entropy to accelerate convergence; examples are A3C and A2C [11]. The deterministic direction proves that the policy to select actions can be stable or deterministic in one state s, with actions decided directly by this state, a ← µθ(s), while its counterpart, the stochastic policy, selects actions by probability, a ∼ πθ(a | s).
Technical Difficulties: The methods mentioned above have several weaknesses.
1) Among representation methods, LSTM suffers from a suboptimal encoding strategy when encoding obstacle features. Poor and suboptimal orders (e.g., random order and ordering by robot-obstacle distance [14]) cause suboptimal converged results. The networks of AW and RG suffer from slow convergence.
2) In data replay, learning from stored batch data in ER ignores the online feature of episodic data and the importance of individual samples [35]. PER also lacks consideration of the online feature of episodic data, but it considers the importance of each sample through data prioritization. Prioritization in PER finds a better tradeoff between stochastic sampling and greedy sampling. However, this prioritization changes the data distribution. Hence, bias is introduced in PER, although the importance sampling weight (ISW) [21] is applied to partly alleviate this problem.
3) In RL algorithms, considerable bias and variance are introduced even in deterministic cases such as DDPG, although many tricks are used to reduce them, for instance, the double Q network [24] and the advantage architecture [10] to reduce the overestimation of the Q value, and the one-step/multistep actor–critic architecture to reduce the variance. Moreover, many RL algorithms, such as A3C and A2C, are not data-efficient. They are on-policy algorithms, which require more data for training.
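The prioritization and importance-sampling correction mentioned in point 2 can be sketched in a few lines. This is a minimal illustration of proportional prioritization in the style of [21]; the function name and the α/β values are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices proportionally to priority**alpha and return the
    importance-sampling weights (ISW) that partly correct the induced bias."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()                       # sampling distribution P(i)
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    # ISW: w_i = (N * P(i))^(-beta), normalized by the max for stability
    w = (len(probs) * probs[idx]) ** (-beta)
    return idx, w / w.max()

idx, w = sample_prioritized([0.1, 0.5, 2.0, 0.9], batch_size=3)
```

High-priority transitions are sampled more often, and their weights are scaled down accordingly so that the expected gradient stays close to the uniform-replay one.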
Existing trainable motion planning algorithms have many weaknesses in the three aspects mentioned. Trainable motion planning methods include RG [19], PPO with multiple robots [12], [29], CADRL [36], LSTM-A2C [11], LSTM-RL [14], and SARL [16]. RG is the combination of a relation graph and DQN; it faces problems of data quality and RL optimality. The graph network is hard to train, although RG can represent the robot-obstacle relationship, and DQN brings high bias and variance, which causes slow convergence. PPO with multiple robots faces data quality problems because it learns obstacle features from source images with much noise. CADRL learns the pairwise feature of the robot and one obstacle by DQN; the trained model is then applied to the multiple-obstacle case [36]. It faces problems of data quality and RL optimality because it is myopic and uses only the closest obstacle feature for training instead of all obstacle features. DQN in CADRL also brings high bias and variance. LSTM-A2C and LSTM-RL face all three problems simultaneously because LSTM encodes the obstacle features in distance-based order [14], which only partly represents the robot-obstacle relationship; A2C/A3C lack data replay; and A2C/A3C and DQN bring high bias and variance. SARL consists of AW and DQN, where the attention network interprets robot-obstacle features as efficacious neural network weights [16]. However, it faces a problem in RL optimality because DQN brings high bias and variance.
Motivation: To address the problems mentioned above: 1) We first implemented the discrete SAC (DSAC) algorithm to improve the optimality of RL. The architecture of DSAC is the double Q-learning actor–critic, which is currently the most efficient architecture.
2) We optimized existing distance-based LSTM encoding to improve data quality. Distance-based LSTM encodes pairwise robot-obstacle features by the distance between the robot and each obstacle. This represents the robot-obstacle relationship only partly; therefore, it is improved by attention-based encoding, which computes the attention-based importance of obstacles to decide how to encode the robot-obstacle features.
3) We introduced a novel data replay method that combines online and offline learning to improve the efficacy of data replay. Existing RL algorithms are fed with either online experience or offline experience. We attempted to fuse them to create a new experience. Moreover, online learning and offline learning coexist in our novel data replay method: RL is fed with the new experience when it learns online, and with experience sampled from the replay buffer when it learns offline.
Interpretability: ALN-DSAC is interpretable or explainable because it fits human intuitions to learn and make decisions in nonlinear motion planning tasks. This contributes to better hidden features of the dynamic robot and obstacles. The features are then learned by an explainable RL (DSAC). Specifically:
1) The attention network is derived from the human attention mechanism; therefore, it is understandable to humans. Attention-based LSTM encoding better describes the robot-obstacle relationship through attention-based obstacle importance.
2) Humans learn better from complete experience. The novel data replay better keeps complete, time-sequential, episodic online experience. This provides high-quality inputs. These two factors contribute to better hidden features.
3) RL is explainable because it derives from the learning process of humans or other creatures. It is a basic reward/punishment-action process, which is understandable to humans.

II. PRELIMINARIES

A. Preliminary of RL
The Markov decision process (MDP) is a sequential decision process based on the Markov chain [39]. A Markov chain is defined by a variable set X = {X_n : n > 0} with the property p(X_{t+1} | X_t, ..., X_1) = p(X_{t+1} | X_t). This means that the state and action of the next step depend only on the state and action of the current step. An MDP is described as a tuple ⟨S, A, P, R⟩. S denotes the state; here, it refers to the state of the robot and obstacles. A denotes an action taken by the robot. An action A = [θ, v] is selected from the action space. In this article, the action directions are θ ∈ {0, π/8, ..., 15π/8} (16 headings), and the speed in each direction is v ∈ {0.2, 0.4, ..., 1}. Hence, the action space consists of 81 actions, including a stop action. P denotes the probability of transiting from one state to the next state. R denotes the reward or punishment received by the robot after executing actions. The reward function in this article is defined by (1), where p_current denotes the position of the robot, p_g denotes the position of the goal, d_min denotes the minimum distance between the robot and obstacles, and d_start_to_goal denotes the distance from the start to the goal. Our reward function (1) is modified from [16], which cannot work without imitation learning. Equation (1) accelerates the convergence speed by attaching a reward to the final position of the robot, which encourages the robot to approach the goal. Other crucial terms include value, policy, value function, and policy function. The value denotes how good one state is, or how good one action is in one state. The value consists of the state value (V value) and the state-action value (Q value). A value is defined by the expectation of accumulative rewards, V(s) = E[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s], where γ is a discount factor. The policy denotes the way actions are selected. In the function approximation case, the policy is represented by networks. The value function in the RL scope is represented by networks that estimate the value of the environmental state via function approximation [40]. The policy function is also represented by neural networks. Actions are selected in an indirect way (e.g., a ← argmax_a R(s, a) + Q(s, a; θ) in DQN [20], [41]) or a direct way (e.g., π_θ : s → a in the actor–critic algorithm [23]).
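The 81-action discrete space described above (16 headings in steps of π/8, five speeds, plus a stop action) can be enumerated directly. This is a minimal sketch; the function name is ours, not from the released code.

```python
import math

def build_action_space(n_headings=16, speeds=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Enumerate the discrete action space as (vx, vy) velocity pairs:
    one pair per heading/speed combination, plus one stop action."""
    actions = [(0.0, 0.0)]  # stop action
    for k in range(n_headings):
        theta = 2 * math.pi * k / n_headings  # 0, pi/8, ..., 15*pi/8
        for v in speeds:
            actions.append((v * math.cos(theta), v * math.sin(theta)))
    return actions

A = build_action_space()  # 16 * 5 + 1 = 81 actions
```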

B. Problem Formulation
ORCA introduced a competent simulator that includes a dynamic robot and obstacles in a fixed-size 2-D indoor area. The robot and obstacles move toward their goals simultaneously and avoid collisions with each other (Fig. 2). This simulator creates circle- and square-crossing scenarios that add predictable complexity to tasks.

Fig. 2. Circle- and square-crossing simulators. Obstacles are randomly generated near the brink of the circle in a circle-crossing environment. In a square-crossing environment, obstacles are randomly generated on the left side or right side.

Let s represent the state of the robot, let a and v represent the action and velocity of the robot, respectively, and let p represent the position of the robot. Let s_t represent the state of the robot at time step t. s_t consists of observable and hidden parts, s_t = [s_t^obs, s_t^h], s_t ∈ R^9. The observable part refers to factors that can be measured or observed by others. It consists of position, velocity, and radius, s^obs = [p_x, p_y, v_x, v_y, r], s^obs ∈ R^5. The hidden part refers to factors that cannot be seen by others. It consists of the planned goal position, preferred speed, and heading angle, s^h = [p_gx, p_gy, v_pref, θ], s^h ∈ R^4. The state, position, and radius of obstacles are described by ŝ, p̂, and r̂, respectively. The robot plans its motion by obeying policy π : (s_{0:t}, ŝ^obs_{0:t}) → a_t, while obstacles obey π : (ŝ_{0:t}, s^obs_{0:t}) → a_t. The robot policy π : (s_{0:t}, ŝ^obs_{0:t}) → a_t denotes that the algorithm uses the robot state s_{0:t} and the observable obstacle state ŝ^obs_{0:t} to obtain policy π, which outputs action a_t at time step t. The obstacle policy π : (ŝ_{0:t}, s^obs_{0:t}) → a_t denotes that the algorithm uses the obstacle state ŝ_{0:t} and the observable robot state s^obs_{0:t} to obtain policy π, which outputs action a_t at time step t. The objective of the robot is to minimize the time to its goal E[t_g] (2) under the policy π without collisions with obstacles. The constraints of the robot's motion planning are formulated via (3)-(6), which represent the collision avoidance constraint, goal constraint, kinematics of the robot, and kinematics of obstacles, respectively. The collision avoidance constraint denotes that the distance between the robot and obstacles, ‖p_t − p̂_t‖_2, should be greater than or equal to the sum of the radii of the robot and obstacles, r + r̂. The goal constraint denotes that the position of the robot p_tg should equal the goal position p_g when the robot reaches its goal. The kinematics of the robot denotes that the position p_t equals the sum of the previous robot position p_{t−1} and the change of position Δt · π : (s_{0:t}, ŝ^obs_{0:t}), where π : (s_{0:t}, ŝ^obs_{0:t}) is the velocity decided by the policy π. The kinematics of obstacles is the same as that of the robot.
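The kinematic and collision-avoidance constraints above amount to a position update and a per-obstacle distance test. A minimal sketch under the paper's formulation; function names are illustrative.

```python
def step_position(p_prev, v, dt):
    """Robot kinematics: p_t = p_{t-1} + dt * v, where v is the velocity
    chosen by the policy."""
    return (p_prev[0] + dt * v[0], p_prev[1] + dt * v[1])

def collision_free(p_robot, p_obs, r_robot, r_obs):
    """Collision-avoidance constraint: ||p - p_hat||_2 >= r + r_hat."""
    dx, dy = p_robot[0] - p_obs[0], p_robot[1] - p_obs[1]
    return (dx * dx + dy * dy) ** 0.5 >= r_robot + r_obs
```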

III. METHODS
We first present the mechanisms of LSTM, AW, and DSAC as fundamental knowledge. Then, we introduce LSTM-DSAC and AL-DSAC, which work as ablations for validating the contributions of DSAC, attention-based LSTM, and the novel data replay. Finally, AL-DSAC is optimized by integrating the novel data replay to form ALN-DSAC.

A. LSTM, AW, and DSAC

1) LSTM: It encodes the pairwise robot-obstacle features E^lstm = [e_1^lstm, ..., e_N^lstm] to form the description of the obstacle state S_o^lstm. E^lstm can be encoded in different orders in LSTM, such as the order by maximum distance, the order by minimum distance [14], and random order, where d_i denotes the distance between the robot and obstacle i, rank_min(d_i) denotes that E^lstm is ordered by minimum robot-obstacle distance, and rank_max(d_i) and random(d_i) denote that E^lstm is ordered by maximum distance and randomly, respectively. The environment state for training, S^lstm, is defined as the combination of the robot state s_r and the obstacle state S_o^lstm, where E^lstm consists of the pairwise robot-obstacle features e_i^lstm. Each pairwise robot-obstacle feature is defined as the combination of the robot state s_r and the state of each obstacle o_i, where N denotes the number of obstacles. Note that s_r and o_i here are robot-centric states, which are simply transformed from the states described in Section II-B, where d_g denotes the distance of the robot to its goal and o_i denotes the robot-centric observable state of the ith obstacle. Note that i denotes the obstacle's order, which is generated randomly in the simulator when an episode of the experiment starts. In o_i, the radius sum r_i + r denotes the collision constraint of each obstacle with the robot. The radii (r and r_i) and the radius sum in the feature definition enable the robot to quickly become aware of the safe distance to each obstacle. This contributes to the convergence.
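Distance-ordered encoding amounts to sorting the pairwise features before they enter the LSTM. A minimal sketch of the rank_min/rank_max/random orders; the function name and state layouts are illustrative (the first two entries of each state are taken to be the position).

```python
import random

def order_pairwise_features(robot_state, obstacles, order="min"):
    """Build pairwise robot-obstacle features e_i = [s_r, o_i] and sort
    them by robot-obstacle distance before LSTM encoding."""
    def dist(obstacle_part):
        dx = obstacle_part[0] - robot_state[0]
        dy = obstacle_part[1] - robot_state[1]
        return (dx * dx + dy * dy) ** 0.5
    feats = [list(robot_state) + list(o) for o in obstacles]
    k = len(robot_state)  # the obstacle part starts after the robot state
    if order == "min":
        feats.sort(key=lambda f: dist(f[k:]))
    elif order == "max":
        feats.sort(key=lambda f: dist(f[k:]), reverse=True)
    else:
        random.shuffle(feats)
    return feats
```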
2) Attention Weight: The AW-based [16], [42] obstacle feature S_o^aw combines with the robot feature s_r to form the environmental state S^aw for training. S_o^aw is defined as the attention-weighted sum of interaction features, S_o^aw = Σ_{i=1}^n α_i h_i, where α_i and h_i denote the attention score and the interaction feature of the robot and obstacle o_i, respectively, and n denotes the number of obstacles. The interaction feature is defined by h_i = f_h(e_i; w_h), where f_h(·) and w_h denote a multilayer perceptron (MLP) and its weights, respectively, and e_i denotes the embedded feature obtained from the pairwise robot-obstacle feature. The attention score is defined by α_i = f_α(e_i, e_mean; w_α), where f_α(·) and w_α denote an MLP and its weights, respectively, and e_mean denotes the mean of all embedded features. The embedded feature is obtained by e_i = f_e(s_r, o_i, M_i; w_e), where f_e(·) and w_e denote an MLP and its weights, respectively, and M_i denotes the occupancy map of obstacle o_i, built from the local state vectors w′_j of the other obstacles o_j ∈ N_i near obstacle o_i using an indicator function [16]. The mechanism of AW is shown in Fig. 3.
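The AW pooling above (embed, score against the mean embedding, softmax-normalize, weighted sum) can be sketched as follows, with a plain callable standing in for the MLP f_α. This illustrates the mechanism only; it is not the paper's network.

```python
import numpy as np

def attention_pool(embeddings, interactions, score_fn):
    """AW-style pooling in the spirit of [16]: score each embedded pairwise
    feature against the mean embedding, softmax the scores, and return the
    weighted sum of interaction features (S_o^aw)."""
    e = np.asarray(embeddings, dtype=float)      # (n, d_e) embedded features
    h = np.asarray(interactions, dtype=float)    # (n, d_h) interaction features
    e_mean = e.mean(axis=0)                      # mean of all embeddings
    scores = np.array([score_fn(ei, e_mean) for ei in e])
    w = np.exp(scores - scores.max())            # numerically stable softmax
    alpha = w / w.sum()                          # attention scores alpha_i
    return alpha, (alpha[:, None] * h).sum(axis=0)
```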
3) DSAC: The policy of a classical RL algorithm is obtained by maximizing the expected-return objective. SAC considers the reward and entropy simultaneously. The objective of SAC is defined as the maximum entropy objective J(π) = Σ_t E_{(s_t, a_t)∼ρ_π}[r(s_t, a_t) + αH(π(· | s_t))], where H(π(· | s_t)) denotes the entropy and α is the temperature parameter, which decides the importance of the entropy versus the reward for controlling the stochasticity of the optimal policy. This means that the temperature further encourages exploration to find promising avenues and give up unpromising ones. In objective maximization, the SAC policy certainly converges to the optimal policy by soft policy iteration, which consists of policy evaluation and policy improvement. The optimal policy is obtained by repeatedly applying policy evaluation and policy improvement. Policy evaluation [32] proves that the repeated application of T^π converges to the soft Q value of π as k → ∞, where T^π is a modified Bellman backup operator given by T^π Q(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}}[V(s_{t+1})], with V(s_t) = E_{a_t∼π}[Q(s_t, a_t) − α log π(a_t | s_t)]. Policy improvement [32] proves that Q^{π_new} ≥ Q^{π_old} in objective maximization, where π_new is defined by π_new = argmin_{π′∈Π} D_KL(π′(· | s_t) ‖ exp(Q^{π_old}(s_t, ·)/α)/Z^{π_old}(s_t)), where Z^{π_old}(s_t) is the partition function for distribution normalization. It can be ignored because it does not contribute to the gradient of the new policy. Q^{π_old} guides the policy update to ensure an improved new policy. The new policy is constrained to a parameterized family of distributions π′ ∈ Π, like Gaussians, to ensure that the new policy is tractable and optimal. Given the repeated application of policy evaluation and improvement, policy π eventually converges to the optimal policy π*, with Q^{π*} ≥ Q^π for all π ∈ Π. SAC is the combination of soft policy iteration and function approximation. In (19), the temperature α is either a fixed value or an adaptive value. In function approximation, networks θ and φ are used to approximate the action value and the policy. The action value objective and its gradient are obtained from the soft Bellman residual, where Q̄ is the target action value.
The state value V(s_{t+1}; θ̄) is approximated by a target action value network θ̄, and γ is a discount factor. The policy objective and its gradient are obtained via the reparameterization a_t = f_φ(ϵ_t; s_t), where f_φ(ϵ_t; s_t) is the network transformation, in which ϵ_t is an input noise vector sampled from a fixed distribution such as a spherical Gaussian. The temperature objective is defined by J(α) = E_{a_t∼π_t}[−α log π_t(a_t | s_t) − αH̄], where H̄ is the target entropy. The temperature objective gradient is obtained by approximating dual gradient descent [43].
Eventually, the networks and temperature are updated by gradient steps, with the target networks updated softly by θ̄ ← τθ + (1 − τ)θ̄, where the soft-update coefficient τ ∈ (0, 1). SAC is used in tasks with continuous action spaces. However, the action space in this article is discrete. Hence, SAC should be modified to suit our task. Some modifications [33] should be made. They are summarized as follows.
1) The Q function should change from Q : S × A → R to Q : S → R^{|A|}, so that the Q values of all possible actions are output, instead of only the Q value of the action taken by the robot.
2) The output of the policy should be the action distribution π : S → [0, 1]^{|A|}, instead of the mean and covariance of the action distribution of SAC, π : S → R^{2|A|}.
3) In (25), the expectation E_{a_t∼π_t}[·] is obtained by Monte Carlo estimation, which involves taking an expectation over the action distribution [33]. In a discrete action space, the expectation should be calculated directly, instead of by Monte Carlo estimation. Hence, the temperature objective and, similarly, the policy objective change into their exact-expectation forms.

B. LSTM-DSAC and AL-DSAC

LSTM-DSAC: When designing the architecture of LSTM-DSAC, it is worthwhile to consider how the LSTM connects with the networks. There are two options: 1) one LSTM, meaning that the actor and critic networks share one LSTM, as used in [14]; and 2) two LSTMs, meaning that the actor network and critic network have different LSTMs. One LSTM encoder implies that the actor–critic architecture does not contribute to the convergence of the LSTM. The LSTM updates its networks according to its own gradient, where k denotes the batch length and W denotes the network weights in the LSTM. E denotes the prediction error, defined on the prediction vector (hidden state) h(·). In the one-LSTM case, the LSTM converges according to its own gradient, and there is no constraint to regulate the gradient of the LSTM. Hence, the gradient of the LSTM may be small or unstable. The consequence is that the LSTM encoder may converge slowly and unstably. There may be large differences in convergence among the LSTM encoder, actor network, and critic network; therefore, the overall convergence may be poor.
In the two-LSTM case, each LSTM combines with the critic network or actor network to form two integrated networks [Fig. 4(b)]. The integrated networks form a new actor and critic, which are different from the original actor network and critic network. These new actor and critic (with LSTM) are in an actor–critic relationship, which contributes to the convergence of the LSTM according to the chain rule in the backpropagation process [Fig. 4(a)]. This means that the gradient received by the LSTM encoders changes from (33) to a gradient modulated by χ, a discount factor that decides the portion of the gradient contributed by the actor network or critic network; θ and φ denote the critic network and actor network in function approximation. The gradient of the actor or critic networks works as a constraint to accelerate and stabilize the convergence of the LSTM. The two LSTMs converge at different paces, but the two encoders fit their connected actor or critic network. This contributes to the overall convergence of all networks. Hence, two LSTMs are used in the design of LSTM-DSAC, and the LSTM-DSAC architecture is designed as in Fig. 4(b).

AL-DSAC: The work in [14] shows that the first encoded feature has a large impact on the LSTM gates, while the rear encoded features have less impact. This means that the LSTM learns more from the front encoded features and forgets more of the rear features when encoding a state with pairwise robot-obstacle features. In LSTM-DSAC, the pairwise robot-obstacle features are encoded in minimum-distance order (7). This implies that an obstacle close to the robot has larger importance [14]. This is not always true because, according to human intuition, the obstacle importance also depends on the speed and moving direction of the obstacle, not only on its distance to the robot.
AW [16] addresses this problem well by introducing the attention score to evaluate obstacle importance, instead of the distance. Pairwise obstacle features are ranked by attention-based importance (attention score). Then, the ranked pairwise obstacle features are encoded by LSTM.
In existing work [16] and LSTM-DSAC, the LSTM connects with the actor or critic network to form a new actor or critic. Intuitively, AW could connect with the actor or critic network to form an AW-based DSAC (AW-DSAC). Then, AW would be updated indirectly by the loss gradient. However, this makes the attention network receive a small gradient in backpropagation, and slow convergence follows.
To solve this problem, we optimize the indirect update of AW into a direct one by separating AW from the actor or critic network [Fig. 5(a)]. The separate AW, actor, and critic form a double actor–critic architecture, which means: 1) a direct actor–critic relationship between actor and critic and 2) an indirect actor–critic relationship between the attention network and the actor/critic [Fig. 5(b)]. As a result, the actor and critic contribute to each other's convergence. The loss gradient of the actor contributes to the convergence of AW, and AW also contributes to the convergence of the actor/critic. To be specific, the importance of each pairwise robot-obstacle feature is computed by AW, and the pairwise robot-obstacle features are then ranked or reordered by their importance. The LSTM then encodes the reordered pairwise robot-obstacle features. Hence, the LSTM remembers more of the robot-obstacle features with large importance and forgets some features with small importance. This improves the data quality and indirectly contributes to the overall convergence. Finally, the AL-DSAC architecture is designed as in Fig. 5(c).
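The reordering step at the heart of AL-DSAC is a sort by attention score before LSTM encoding. A minimal sketch; the function name is ours.

```python
def rank_by_attention(pairwise_feats, attention_scores):
    """AL-DSAC preprocessing: reorder pairwise robot-obstacle features by
    attention-based importance (largest score first), so the LSTM encodes,
    and therefore remembers, the most important obstacles first."""
    order = sorted(range(len(attention_scores)),
                   key=lambda i: attention_scores[i], reverse=True)
    return [pairwise_feats[i] for i in order]
```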

C. ALN-DSAC
The work in [35] indicated that episodic data with an online feature contribute more to convergence than offline data stored in the replay buffer. Note that the online feature denotes the priority or weight of the environment state at different time steps. This weight is represented by the accumulative reward received in each state. AL-DSAC considers merely learning from offline data in the replay buffer. It does not make the best of episodic data with the online feature (online weighted data); therefore, there is room for further improvement in convergence.
Inspired by [35] and [44], which learn online data and offline data simultaneously to improve the convergence of RL, we propose ALN-DSAC to learn online weighted data and offline data simultaneously. This is achieved by the novel PER, which prepares the online combined data and offline prioritized data for training.
To prepare the online combined data [Fig. 6(a)], online episodic data cannot be used for batch learning directly because their data distribution is different from that of batch data. Data with different distributions will cause overfitting in training.
To solve this problem, we first prepare prioritized data-1 (size n). It is then combined with the online weighted episodic data (size m) to form the combined data. Prioritized data-1 is selected from the sampled batch data by the weight or priority, which is the accumulative reward (Monte Carlo return) defined by (34). To prepare the offline prioritized data [Fig. 6(b)], K-batch experiences are sampled from the replay buffer. The offline prioritized data are then selected from the concatenated K experiences by the weight defined by (34). However, the sampled K-batch experiences share similarity with each other. This makes the offline prioritized data lack diversity in training, and slow convergence speed and suboptimal converged results follow. To solve the problem of large similarity among the sampled K experiences, we first apply the cosine difference [44] to describe the similarity of the sampled K experiences. Then, if the similarity of the offline prioritized data is smaller than the threshold, ξ < ξ_threshold, the offline prioritized data are selected for training. Otherwise, a new batch of data is randomly resampled from the replay buffer to replace the offline prioritized data for training. The similarity ξ is defined by ξ = 1 − cos⁻¹((v_1 · v_2)/(‖v_1‖‖v_2‖)), where v_1 and v_2 denote the priority vectors of two sampled experiences and cos⁻¹((v_1 · v_2)/(‖v_1‖‖v_2‖)) denotes the cosine difference.
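The cosine-difference similarity test can be sketched directly from the definition of ξ above. Function names are illustrative; note that identical priority vectors give ξ = 1 (maximally similar), so smaller ξ means more dissimilar experiences.

```python
import math

def experience_similarity(v1, v2):
    """Similarity xi = 1 - arccos((v1 . v2) / (||v1|| ||v2||)) between the
    priority vectors of two sampled experiences."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    cos_diff = math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))  # clamp for safety
    return 1.0 - cos_diff

def accept_for_training(v1, v2, xi_threshold):
    """Keep the prioritized data only if it is sufficiently dissimilar."""
    return experience_similarity(v1, v2) < xi_threshold
```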
For t ̸ = T ter minal in episode i do 6.
Off-Train If length (D) < batch size l 8.
Compute priorities of data in entire episode: Algorithm (Pseudocode): ALN-DSAC is described in Algorithm 1, while Algorithms 2-4 are its subalgorithms.In Algorithm 1, replay buffer, networks are first initialized.These networks include the attention network θ att , critic networks (θ c1 and θ c2 ), target critic networks ( θc1 and θc2 ), and policy network θ p .Here, double critic architecture is used to reduce the overestimation of Q value.Then, the experience of each state ⟨s t , a t , r t , S t+1 ⟩ is obtained by If length (D) < batch size l, networks learn from the prioritized data-2, which is obtained by subalgorithm Off-Train.
Once the robot reaches the terminal state (finding the goal, collision, and timeout), priorities of episodic experience are obtained by P t = ∞ k γ k−t r k .However, saving only one episodic experience to replay buffer at the end of every episode will cause poor data diversity in early stage training.This means that experiences sampled via subalgorithms Off-Train in each step for batch learning are the same or almost the same.Networks trained with these similar experiences converge slowly.Delayed infusion of recent experiences [44] is adopted to solve problem.This means that M episodic experiences (format ⟨s t , a t , r t , S t+1 , P t ⟩) are saved in every M episodes, instead of current one episodic Concatenate K -batch data 6.
Sort data by their priorities 7.
Select largest-priority data E prioriti zed−data−2 with the length l 8. else 9.
Resample one batch data E random ⟨s, a, r, S ′ , P⟩ from D 10. Forward-Backpropagation Algorithm 3 On-Off-Train 1.// Prepare prioritized data-1 and combined data 2.Sample one batch data E random from replay buffe D 3.Sort the data E random by its priority 4.Select the largest-priority data E prioriti zed−data−1 with length n, n = l − m 5.Concatenate the online and offline data: Current episodic experience E i then combines the prioritized data-1 to form the combined data, which is fed to subalgorithm On-Off-Train.Previous steps repeat until the convergence of networks.Then, networks are saved for evaluations.Algorithm 2 (network training based on offline data) is executed in every step of an episode.K -batch experiences {E 1 , E 2 , . . ., E K } are randomly sampled from replay buffer.Experience is in the format with priorities < s l ′ , a l ′ , r l ′ , S l ′ , P l ′ >, l ′ ∈ l and its similarity ξ is computed by ξ = 1 − cos −1 ((v 1 • v 2 )/(∥v 1 ∥∥v 2 ∥)).If ξ < ξ threshold , these K experiences are concatenated and sorted by their priorities to prepare the prioritized data-2 E prioritized-data-2 .Otherwise, new batch experience E random < s, a, r, S ′ , P > is randomly sampled from replay buffer.Finally, networks are updated via feedforward and backpropagation processes Forward-Backpropagation.
Algorithm 3 (network training based on combined data) is executed at the end of an episode. One batch experience E_random is first randomly sampled from the replay buffer and sorted by priority to prepare the prioritized data-1 E_prioritized-data-1. The current online experience E_i then combines with E_prioritized-data-1 to form a combined experience E_combined ⟨s, a, r, S′, P⟩. Finally, the networks are updated using the combined experience in the feedforward and backpropagation process Forward-Backpropagation.
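A minimal sketch of this combination step, assuming transitions are stored as tuples with the priority in the last field (the buffer layout and sizes are illustrative):

```python
import random

def on_off_combined(buffer, episode, batch=32):
    """Combine the current online episode (length m) with the
    highest-priority offline transitions (length n = batch - m)
    so one training batch mixes fresh and replayed data."""
    m = len(episode)                               # online transitions
    sample = random.sample(buffer, batch)          # offline candidates
    sample.sort(key=lambda e: e[4], reverse=True)  # sort by priority P
    prioritized = sample[:batch - m]               # prioritized data-1
    return episode + prioritized                   # combined data, length l
```

Keeping the whole online episode guarantees the newest policy's behavior is always represented in the batch, while the prioritized tail reuses the most informative offline data.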
Algorithm 4 denotes network feedforward and backpropagation. The input E can be one of E_random, E_prioritized-data-2, and E_combined; these experiences have the same length l. The feedforward process consists of computing four loss values: 1) critic loss; 2) policy loss; 3) attention loss; and 4) temperature loss.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Algorithm 4 Forward-Backpropagation (continued)
11. Calculate critic loss: L_critic = MSE(q_1, Q̂_next) + MSE(q_2, Q̂_next)
12. // Calculate policy loss
13. Calculate current probability distribution and its log value: p, log p ← f_θp(s_ranked)
14. Calculate current critic values
15. Calculate current entropy
16. Calculate expectation of current Q value
17. Calculate policy loss
18. // Calculate attention loss
19. Calculate attention loss

Pairwise robot-obstacle features in S′ are reordered by their AW to obtain the ranked next state S′_ranked, which is fed to the policy network to obtain the next probability distribution and its log value, p_next and log p_next. Note that here, the policy network is combined with LSTM to form a new actor. S′_ranked is also fed to the double target critic networks θ_c1/c2 to obtain the next target Q values. However, the next target Q values cannot be used to compute the critic loss directly; they must be combined with the entropy to form the discounted next target Q value. This is achieved by computing the expectation of the next state value via

V(S′) = Σ_{a∈A} p_next(a)[Q_target(S′, a) − α log p_next(a)]

where α is the temperature parameter. Note that the Q value here is in the format Q : S × A → R^|A|; hence, the next state value is the sum over the Q values of every action, instead of the Q value of the single action taken by the robot. Thus, the next target Q value is obtained by

Q̂_next = r(s, a) + γ E_{S′∼p}[V(S′)].

To calculate the current Q value, the current state s is first fed to the attention network θ_att to generate its AW by AW ← f_θatt(s). Pairwise robot-obstacle features in s are then reordered by the obtained AW. The current Q value is computed by feeding the reordered pairwise robot-obstacle features s_ranked to the double critic networks θ_c1/c2 to obtain the current critic values q_1 and q_2. Note that here, the critic networks are combined with LSTM to form a new critic. Finally, the critic loss is obtained by summing the mean square errors (MSEs) of q_1/q_2 against Q̂_next.
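The discounted next target Q value can be sketched numerically as follows; passing the network outputs in as plain arrays, and taking the usual double-critic minimum, are assumptions for illustration:

```python
import numpy as np

def next_target_q(r, p_next, q1_next, q2_next, alpha=0.2, gamma=0.9):
    """Discounted next target Q for discrete SAC.
    p_next: policy distribution over |A| actions at S'_ranked,
    q1_next/q2_next: target critic outputs, each of shape (|A|,)."""
    q_min = np.minimum(q1_next, q2_next)   # double-critic minimum
    # V(S') = sum_a p(a|S') [ Q(S',a) - alpha * log p(a|S') ]
    v_next = np.sum(p_next * (q_min - alpha * np.log(p_next)))
    # Q_next = r(s,a) + gamma * V(S')
    return r + gamma * v_next
```

Because the action space is discrete, the expectation over actions is an exact sum weighted by the policy distribution, not a sampled estimate.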

TABLE I PARAMETERS OF ALN-DSAC

The convergences of these algorithms are almost the same, although AL-DSAC performs slightly better than the rest; their differences are small (reward difference < 0.1). Fig. 8(e) shows the convergence of these three algorithms in the five-obstacle case: AL-DSAC and ALN-DSAC perform almost the same, and both better than LSTM-DSAC. Fig. 8(f) shows the convergence of these three algorithms in the ten-obstacle case: ALN-DSAC performs best, followed by AL-DSAC and LSTM-DSAC. Fig. 8(d)-(f) also shows that the number of delayed experiences matters. With fewer obstacles (e.g., two and five), a small delayed-experience number (e.g., 5) works well; a large delayed-experience number (e.g., 10) works well with more obstacles (e.g., ten). Fig. 8(g)-(i) shows the comparisons of LSTM-DSAC, AL-DSAC, and ALN-DSAC against the state of the arts in convergence. In the five-obstacle case, all three algorithms outperform the state of the arts. In the ten-obstacle case, AL-DSAC and ALN-DSAC outperform the state of the arts; LSTM-DSAC converges faster than the state of the arts, but its converged result is worse than that of SARL.

C. Model Evaluations
We first present the converged reward of all algorithms in training. Then, all trained models are evaluated according to the evaluation criteria, which consist of qualitative evaluation, quantitative evaluation, computational evaluation, and robustness evaluation.
2) Qualitative Evaluation (Trajectory Quality): The motion planning process and the policy evolvement process are described before evaluation to clarify: 1) how the motion planning task is accomplished and 2) how the model evolves as the training data increase. Motion planning processes with five and ten obstacles are shown in Fig. 9: the robot is controlled by ALN-DSAC, while the obstacles are controlled by ORCA. Policy evolvement processes with five and ten obstacles are shown in Fig. 10 in the supplementary material. Models of ALN-DSAC trained with different numbers of experiences (1k-30k episodes) are executed, and trajectories are generated according to the trained models. Policy evolvement is observed by analyzing the time to reach the goal. The policy evolves stably despite small fluctuations (e.g., at 18k and 24k episodes in the cases with five and ten obstacles, respectively).
We analyze the trajectories and conclude that they consist of three types: bypass, wait-cross, and cross. The bypass strategy is the most efficient and safest, while the wait-cross and cross strategies lack efficiency and safety. In the bypass strategy, the robot quickly bypasses all obstacles that move toward the center and their goals (Fig. 10). In the wait-cross strategy, the robot keeps waiting or moves slowly until the obstacles move away from the center; it then moves quickly straight across the center to reach its goal. In the cross strategy, the robot moves toward the center and its goal with medium speed and short distances to the obstacles. The learned motion planning strategies are shown in Table III. LSTM-DSAC, AL-DSAC, ALN-DSAC, and SARL learned the bypass strategy (the most efficient and safest one). LSTM-DSAC, LSTM-A2C, LSTM-RL, and CADRL learned the wait-cross strategy. ORCA is the only algorithm that uses the cross strategy.
3) Quantitative Evaluation: Quantitative evaluation of the algorithms is measured by six criteria: success rate, time to goal, collision rate, timeout rate, mean distance to obstacles, and mean discounted reward. The success rate, collision rate, and timeout rate denote the rates of the cases in which the robot reaches the goal, collides with obstacles, and does not reach its goal within the time limit (25 s), respectively. The 500 tests are conducted in the circle-crossing simulator to evaluate each algorithm in the cases with five or ten obstacles. The detailed comparisons are shown in Table IV. LSTM-DSAC, AL-DSAC, and ALN-DSAC outperform nearly all state of the arts in the cases with five and ten obstacles. LSTM-DSAC performs well in the five-obstacle case for two reasons: 1) the high efficiency of DSAC in convergence and 2) the competence

The main hardware is an Intel Core i7-9750H processor with 16-GB memory. The trainings and evaluations are based on the CrowdNav simulation system [5], [16]. The robot operating system (ROS) and physical implementations are shown at https://www.youtube.com/watch?v=9znVReBmfwI and https://www.youtube.com/watch?v=bH5FbA14AqE.
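The six criteria above can be aggregated from per-episode test records roughly as below; the record format is an assumption for illustration, not the CrowdNav evaluation code:

```python
def evaluate(episodes, gamma=0.9):
    """Aggregate the test criteria from per-episode records.
    Each record: (outcome, time_to_goal, mean_dist, rewards), where
    outcome is 'success' | 'collision' | 'timeout'."""
    n = len(episodes)
    # success / collision / timeout rates over all test episodes
    rates = {k: sum(e[0] == k for e in episodes) / n
             for k in ('success', 'collision', 'timeout')}
    # time to goal is only meaningful for successful episodes
    times = [e[1] for e in episodes if e[0] == 'success']
    mean_time = sum(times) / len(times) if times else None
    mean_dist = sum(e[2] for e in episodes) / n
    # mean discounted reward: sum_t gamma^t r_t, averaged over episodes
    mean_return = sum(
        sum(gamma ** t * r for t, r in enumerate(e[3])) for e in episodes) / n
    return rates, mean_time, mean_dist, mean_return
```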

Fig. 2. Circle- and square-crossing simulators. Obstacles are randomly generated near the brink of the circle in a circle-crossing environment. In a square-crossing environment, obstacles are randomly generated on the left side or the right side.

Fig. 3. Mechanism of AW. Here, we present the mechanism to compute the attention-based feature of one obstacle, S^aw_oi. It combines with the features of the other obstacles, S^aw_oj, j ∈ N, to form the feature of obstacles S^aw_o.
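The AW mechanism of Fig. 3 can be sketched as softmax-scored, priority-ordered pooling; the simple linear scoring vector below stands in for the paper's attention network and is purely illustrative:

```python
import numpy as np

def attention_pool(pairwise, w):
    """Score each pairwise robot-obstacle feature, softmax the scores
    into attention weights (AW), and reorder the features by weight.
    pairwise: (N, d) feature matrix; w: (d,) scoring vector (assumed)."""
    scores = pairwise @ w
    scores = scores - scores.max()              # numerical stability
    aw = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    order = np.argsort(-aw)                     # largest weight first
    return aw, pairwise[order]                  # AW and ranked features
```

The ranked features would then be fed to the LSTM, so the most important obstacle is pooled last and dominates the hidden state.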

Fig. 8. Training results. In (c), ALN-DSAC-1/5/10Delay denotes the novel data replay with one/five/ten delayed experiences. ALN-DSAC-NoPrioritizedData denotes that prioritized sampling method-1/2 is replaced by random sampling in the novel data replay. AL-DSAC-PER denotes that AL-DSAC uses classical PER. In (g), CADRL does not support multiobstacle training: its networks are trained in the one-obstacle case, and the trained models can then be applied to multiobstacle motion planning. Hence, the training of CADRL is presented in an independent figure.
of distance-based LSTM to represent the obstacle importance. However, distance-based LSTM only partly represents the obstacle importance, and its weakness therefore appears in the ten-obstacle case. AL-DSAC improves the distance-based pooling with attention-based pooling.
V. CONCLUSION

Existing motion planning algorithms face challenges in data quality (representation learning), efficacy of data replay, and optimality of the RL algorithm. Given these challenges, we first proposed LSTM-DSAC, which uses DSAC to learn features pooled by distance-based LSTM. Then, distance-based pooling is improved by priority-based pooling; this is AL-DSAC, which uses an attention network to compute the importance of each obstacle, after which LSTM pools the pairwise robot-obstacle features in a priority-based order. Finally, we proposed ALN-DSAC, where online episodic experience is utilized in training; the algorithm thus learns from online and offline experiences simultaneously, which further improves the convergence. We did extensive evaluations of ALN-DSAC against the state of the arts. Experiments show that ALN-DSAC outperforms the state of the arts in most evaluations. Future research will focus on the improvement of representation learning such as the attention mechanism. It faces the challenge of overfitting if the network is too deep and complex; this is expected to be solved by integrating the skip connection and other robust pooling methods such as LSTM pooling.

Algorithm 1 (fragment)
9. Store M episodes' data E_M_delay
10. Update replay buffer: D ← D ∪ E_M_delay if i % M = 0
11. On-Off-Train
12. i = i + 1
13. Save models: θ_att, θ_c1, θ_c2, and θ_p

in replay buffer (steps 1 and 2). For learning from online data, at the end of each episode, prioritized sampling method-1 is used to generate prioritized data-1 (step 3), which combines with the current online weighted data to form the combined data (step 4). In training, the combined data are fed to an attention network to generate AW, which is used to reorder the pairwise robot-obstacle features in each state of the combined data (steps 5 and 6). The actor and critic (combinations of LSTM and the actor/critic networks) then learn from the reordered data (step 7).
Algorithm 4 Forward-Backpropagation
1. // Calculate critic loss
2. Calculate value of next attention weight: AW_next ← f_θatt(S′)
3. Rank next state according to AW_next: S′_ranked ← rank_max_weight(S′)
4. Calculate next probability distribution and its log value: p_next, log p_next ← f_θp(S′_ranked)
…
7. Calculate next target Q value: Q̂_next = r(s, a) + γ E_{S′∼p}[V(S′)]
8. Calculate value of current attention weight: AW ← f_θatt(s)
9. Rank current state according to AW: s_ranked ← rank_max_weight(s)

Algorithm 4 Forward-Backpropagation (continued)
… Update the critic networks: θ_ci ← θ_ci − γ∇_θci L_critic, i ∈ {1, 2}
24. Update the policy network: θ_p ← θ_p − γ∇_θp L_policy
25. Update the attention network: θ_att ← θ_att − γ∇_θatt L_attention
26. Update the temperature: α ← α − γ∇_α L_α, α ← e^α

The next target Q value and the current Q value are required to compute the critic loss. To compute the next target Q value, the next state S′ is first fed to the attention network θ_att to generate its AW by AW_next ← f_θatt(S′).

1) Converged Reward: Converged rewards are listed in Table II. AL-DSAC and ALN-DSAC outperform the other algorithms in the five-obstacle case; ALN-DSAC outperforms the other algorithms in the ten-obstacle case. Note that LSTM-A2C is an on-policy RL algorithm, which is trained with 60k episodic experiences.
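Steps 24-26 correspond to per-loss gradient updates. A sketch of the policy and temperature losses for a single state is given below, using the common log-α parameterization (matching the α ← e^α line); the exact loss forms are assumptions consistent with discrete SAC, not the paper's code:

```python
import numpy as np

def policy_and_alpha_losses(p, q1, q2, log_alpha, target_entropy):
    """Discrete-SAC policy and temperature losses for one state.
    p: policy distribution over |A|; q1/q2: current critic outputs."""
    alpha = np.exp(log_alpha)      # temperature kept positive via exp
    logp = np.log(p)
    q_min = np.minimum(q1, q2)
    # L_policy = E_{a~p}[ alpha * log p(a) - Q(s, a) ], exact sum over actions
    policy_loss = np.sum(p * (alpha * logp - q_min))
    entropy = -np.sum(p * logp)
    # L_alpha pushes the policy entropy toward the target entropy
    alpha_loss = log_alpha * (entropy - target_entropy)
    return policy_loss, alpha_loss
```

Gradient descent on these scalars with respect to the policy parameters and log α reproduces the direction of updates 24 and 26: a too-deterministic policy (entropy below target) drives α up, restoring exploration.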

TABLE II CONVERGED REWARD OF ALL ALGORITHMS IN CASES WITH FIVE AND TEN OBSTACLES

TABLE III LEARNED MOTION PLANNING STRATEGIES

TABLE IV FIVE HUNDRED TESTS OF ALL ALGORITHMS IN CASES WITH FIVE AND TEN OBSTACLES IN CIRCLE-CROSSING ENVIRONMENT

TABLE V TIME COST OF ALGORITHM TRAINING IN CASES WITH FIVE AND TEN OBSTACLES

TABLE VI FIVE HUNDRED TESTS IN SQUARE-CROSSING ENVIRONMENT IN CASES WITH FIVE AND TEN OBSTACLES

TABLE VII VALUE CHANGES FROM CIRCLE-CROSSING ENVIRONMENT TO SQUARE-CROSSING ENVIRONMENT