Bidirectional Progressive Neural Networks With Episodic Return Progress for Emergent Task Sequencing and Robotic Skill Transfer


Abstract:

The human brain and behavior provide a rich venue that can inspire novel control and learning methods for robotics. In an attempt to exemplify such a development, taking inspiration from how humans acquire knowledge and transfer skills among tasks, we introduce a novel multi-task reinforcement learning framework named Episodic Return Progress with Bidirectional Progressive Neural Networks (ERP-BPNN). The proposed ERP-BPNN model 1) learns in a human-like interleaved manner by 2) autonomous task switching based on a novel intrinsic motivation signal and, in contrast to existing methods, 3) allows bidirectional skill transfer among tasks. ERP-BPNN is a general architecture applicable to several multi-task learning settings; in this paper, we present the details of its neural architecture and show its ability to enable effective learning and skill transfer among morphologically different robots in a reaching task. The developed Bidirectional Progressive Neural Network (BPNN) architecture enables bidirectional skill transfer without requiring incremental training and seamlessly integrates with online task arbitration. The task arbitration mechanism developed is based on soft Episodic Return Progress (ERP), a novel intrinsic motivation (IM) signal. To evaluate our method, we use quantifiable robotics metrics such as ‘expected distance to goal’ and ‘path straightness’ in addition to the usual reward-based measure of episodic return common in reinforcement learning. With simulation experiments, we show that ERP-BPNN achieves faster cumulative convergence and improves performance in all metrics considered among morphologically different robots compared to the baselines. Overall, our method provides a human-inspired and efficient multi-task reinforcement learning approach with interleaved learning, making it highly suitable for lifelong learning applications.
Published in: IEEE Access ( Volume: 12)
Page(s): 69690 - 69699
Date of Publication: 16 May 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Developing robots capable of effective autonomous and continual learning requires the exploitation of acquired knowledge without human intervention, which could be described as the main goal of lifelong robot learning [1], [2], [3]. A key feature of human learning is autonomous task switching and interleaved learning, which is not addressed in mainstream machine learning and robotics research [4]. Deep reinforcement learning (RL) is an area of machine learning that leverages deep learning to make decisions by learning from real-time or recorded interactions with the environment. Traditional deep RL methods, however, often do not include mechanisms to exploit the potential benefits of interleaved learning and bidirectional skill transfer based on partial learning. In contrast to prior works, our proposed model integrates human-like interleaved learning, leverages intrinsic motivation for autonomous task switching, and includes a novel bidirectional progressive architecture tailored for deep multi-task reinforcement learning. In sum, we aim to help fill this gap by developing a multi-task reinforcement learning framework that can sustain, and importantly benefit from, interleaved task learning and allow bidirectional skill transfer among tasks.

In humans, interleaved learning has been shown to yield improved recall of information and better long-term memory retention than blocked learning [5], [6], [7]. Supporting this behavioral data, the human brain is endowed with mechanisms against task interference and forgetting, such as internal rehearsal of experiences and memory consolidation [8], [9]. To enable interleaved learning, the question of when and which task to engage during multi-task learning must be answered. Developmental learning offers some inspiration: during learning, an infant autonomously decides what to do or play without external directions dictating which task (s)he needs to engage in. This behavior of infants is usually associated with the notion of intrinsic motivation (IM), which guides behavior through a putative internal reward system [10]. IM can be based on curiosity, novelty, learning progress (LP), or even challenge; as such, it has been used in robotics as LP [11] and curiosity [12], and in machine learning as novelty [13] or surprise [14]. In the current study, we adopt an IM approach and propose a novel learning progress (LP) signal for RL tasks that guides task switching. Overall, inspired by the above discussion, we aim to develop a multi-task reinforcement learning (RL) framework with autonomous task switching, which can benefit from interleaved task learning without suffering from task interference. We argue that such a learning system may benefit a wide range of robot learning scenarios, ranging from human-like learning for a social robot to skill transfer applications among morphologically different robots.

While significant progress has been made in deep RL for robotics, most approaches [15] have focused on transferring skills between robots with identical action spaces [16], [17] and tasks with pixel-level state inputs [18], [19], [20]. Nevertheless, achieving multi-task reinforcement learning between robots with different physical structures provides insights for human-inspired reinforcement learning and acts as a critical driving force for machine learning research [21]. This setting is particularly significant because the differing state and action spaces provide a unique perspective on knowledge generalization. In light of this, we choose learning and transferring skills among morphologically different robots as the target to address with the developed framework.

One of the key challenges in lifelong learning is catastrophic interference/forgetting, which needs to be considered if a robot is to learn continually. When novel instances to be learned diverge greatly from previously observed ones, new information may overwrite the already acquired knowledge by modifying the representations shared among multiple tasks, leading to catastrophic forgetting. To minimize interference while learning a novel task, several techniques have been proposed in the literature, such as restricting updates to the network parameters, dynamically allocating resources, or rehearsing old task samples while learning new ones [1]. In this study, similar to Progressive Neural Networks (PNN) [22], we utilize task-specific resources while learning to prevent possible task interference, but we also allow bidirectional inter-task connectivity to support positive skill transfer.

In sum, to enable human-like interleaved multi-task learning while avoiding task interference, we develop a multi-task reinforcement learning system composed of (1) a novel architecture called BPNN that improves on the PNN [22], [23] and (2) a novel intrinsic motivation signal, Episodic Return Progress (ERP), for task switching. Unlike the PNN architecture, which restricts skill transfer to the forward direction and requires learning previous tasks until convergence before transferring to the next task, our BPNN method enables bidirectional skill transfer during training. This means that skill transfer can happen midway through multi-task learning, among all tasks, without one task waiting for another to finish. The ERP signal evaluates task progress based on episodic return values, detecting the task that contributes most to enhancing overall performance across multiple tasks. The efficacy of the proposed multi-task learning framework is shown by its application to the learning of a reaching skill by morphologically different robots, namely two-degrees-of-freedom (2-DoF), 3-DoF, and 4-DoF manipulators (Figure 1). The conducted systematic experiments show that synergistic multi-task learning is possible due to the bidirectional inter-task skill transfer provided by the proposed BPNN architecture and the ERP-based task switching.

FIGURE 1. (a) 2-DoF, (b) 3-DoF, (c) 4-DoF reacher robot arm environments.

The rest of the paper is organized as follows. In Section II, we present an overview of the related studies present in the literature. Then, we describe our method in detail in Section III, providing metrics used to evaluate the performance of our proposed method. Section IV details the experiments and presents the results for skill transfer between morphologically different reacher robots. Finally, we discuss the broader impact of our method and future research directions in Section V.

SECTION II.

Related Work

A. Human Learning

Numerous examples demonstrate how neuroscience and artificial intelligence have paved the way for each other [24], [25], [26]. In this vein, the ability of humans to acquire multiple skills with ease throughout their lives may guide machine learning research on the multi-task learning and lifelong learning fronts. Human learning, especially during infancy and childhood, is characterized by autonomous engagement in play, i.e., exploration, where no task is learned completely in one sitting. Besides the ecological implausibility of focusing on a single task until mastery, interleaved learning may allow positive skill transfer among tasks from partial learning if adequate mechanisms are engaged. This notion is supported by contextual-interference (CI) effect studies showing that practicing tasks in an interleaved regime often results in improved learning compared to practicing in a blocked order [27]. Benefits of CI have been linked to increased brain activity during interleaved practice as opposed to repetitive practice [28], [29]. In addition, the CI effect improves not only information retention but also skill transfer between similar tasks [30].

On the other hand, machine learning settings usually prefer blocked learning to interleaved learning. With a few exceptions [31], [32], multi-task learning settings generally assume that either a random task (or a subset of tasks) is chosen for each training trial or that training proceeds to the next task only after one task is mastered. In the case of continual learning, tasks are learned sequentially as each task arrives [33]. This is in contrast with how humans learn. The beneficial impacts of interleaved learning have thus not been thoroughly examined within the machine learning literature so far. Investigating these effects has the potential to enhance the alignment of artificial intelligence with the underlying mechanisms of the human brain.

B. Intrinsic Motivation

Intrinsic motivation (IM) is a significant topic in infant cognitive development and learning, which refers to motivation originating from innate satisfaction instead of the extrinsic reward gained from the environment [34], [35]. IM has been adopted for enabling open-ended robot learning [36], and improving self-supervised exploration [37], [38]. Intrinsic rewards can arise from exploring novel states, satisfying an ingrained curiosity, or the rate of acquiring skills and knowledge in an environment. Inclusion of intrinsic rewards to the extrinsic rewards from the environment is one way of solving challenging exploration problems [39] and discovering a diverse set of skills [40] in deep reinforcement learning. In our approach, unlike the existing uses of IM in reinforcement learning, we propose to utilize episodic return progress (ERP) as a higher-level IM signal to dynamically switch tasks for learning in an online manner instead of modulating the reward signal guiding RL. Consequently, the dynamic task switching carried out by the task selection mechanism leads to an emergent interleaved multi-task learning regime.

C. Curriculum Reinforcement Learning

Curriculum learning methods focus on discovering a goal or task sequencing procedure that can lead to a faster convergence during training or improved performance compared to random sequencing [41]. Most curriculum learning techniques require a priori domain knowledge of tasks to distinguish task levels to train from easier tasks to more difficult tasks [42], [43], [44]. For example, for predicting the output of a short Python code with Long Short Term Memory (LSTM) networks, the task levels can be identified by the number of nestings and the number of digits in the integers [45]. After manually identifying these task difficulty measures, it can be demonstrated that while training, a combination of a random curriculum and a naive curriculum where tasks are selected in ascending order of difficulty performs better than using only a random or a naive curriculum [45]. However, the difficulty of each task might not be readily available a priori in the robotics domain, and thus an automatic or emergent curriculum formation can be desirable. Deep Q-Networks (DQN) with prioritized experience replay [46] assigns importance to transitions based on their associated temporal difference error, thereby selecting data for more frequent replay in single-task reinforcement learning. On the other hand, our method prioritizes which task network is allowed to learn in an online manner. Hence, it operates at a higher level than DQN with prioritized experience replay’s prioritization scheme and creates emergent interleaved task-switching patterns on the fly. As such, the learning scheduling obtained is quite different from what can be obtained through a transition-level curriculum or a usual task curriculum where each task is learned to completion.

D. Multi-Task Reinforcement Learning

Multi-task learning involves sharing skills and knowledge between multiple tasks, where each task is identified as either a source or a target task during training. Typically, tasks share a part of the neural network model, and the model integrates a task conditioning parameter that identifies the task during training. PathNet is a technique that uses a tournament selection genetic algorithm to evolve pathways of a neural network for multi-task, lifelong, and forward transfer learning [47]. However, PathNet is trained consecutively on reinforcement learning tasks, meaning that the source task needs to be trained until convergence before moving on to the target task. This procedure does not allow backward transfer between tasks. Similarly, in PNN [22], training the source task until convergence is required to transfer skills to other tasks connected to the trained task.

Current state-of-the-art methods do not consider human-inspired interleaved learning as a viable approach for multi-task learning, yet there are potential benefits to adopting a human-like learning strategy. In this vein, ERP-BPNN supports bidirectional transfer and does not require convergence on one task to use previously learned representations in other tasks. Hence, ERP-BPNN fits naturally into a multi-task learning framework in which all tasks act as both source and target tasks throughout training; this is made possible by the BPNN architecture, which allows the integration of learning-progress-based task selection and bidirectional transfer.

SECTION III.

Method

We propose a novel multi-task reinforcement learning framework that integrates a bidirectional progressive neural network (BPNN) with a unique architecture and soft task-switching mechanism crafted for RL, inspired by our previous work [4]. The BPNN architecture consists of bidirectional lateral connections among the hidden layers of each fully connected task network to allow skill transfer. By allocating a separate network module for each task, the model avoids negative task transfer at the core level but allows potential positive transfer to take place due to the bidirectional lateral connections among networks.

A. Bidirectional Progressive Neural Networks

We initialize a fully connected neural network module $m \in M$ with task parameters $\theta _{m}$ for each task ${\mathcal {T}}_{i}$ with index i. As each task corresponds to a module, we use the terms “network module” and “task” interchangeably. Since each robot controls a different number of joints, each network module has a task-specific output layer with a different number of neurons $n^{m}_{L}$, where $l \in \{1,\ldots,L\}$ denotes the layer index and L is the output layer.

During training, each network module receives the input of the task selected for training. In this way, the lateral connections can receive representations of other tasks for skill transfer. Subsequently, we compute the hidden activations $h_{l}^{(m)} \in \mathbb {R}^{n^{m}_{l}}$ as \begin{equation*} h_{l}^{(m)}=f\left ({{W_{l}^{(m)} h_{l-1}^{(m)}+b_{l}^{(m)}+\sum _{t\neq m}\left ({{U_{l}^{(m:t)} h_{l-1}^{(t)}+b_{l}^{(m:t)}}}\right)}}\right) \tag {1}\end{equation*} where $U_{l}^{(m:t)}$ and $b_{l}^{(m:t)}$ denote the weight matrix and the bias vector corresponding to the lateral connections of the ordered pair of modules $(m,t)$, from the previous layer $h_{l-1}^{(t)}$ of module t to the $l^{th}$ layer of module m. $W_{l}^{(m)} \in \mathbb {R}^{n^{m}_{l} \times n^{m}_{l-1}}$ and $b_{l}^{(m)}$ are the weight matrix and bias vector, respectively, for layer l of module m, and f is the element-wise activation function. In the BPNN architecture, linear layers transmit information from other tasks, represented by $\boldsymbol {\Sigma }$ in Fig. 2(c), which functions as a summation module. The linear transformations applied to the previous layers of other tasks allow the incoming lateral signals to be tuned: a negative transfer can be suppressed and a positive transfer enhanced by adjusting the lateral weights through gradient descent-based learning. In this work, we use the activation function $\tanh (x)=\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}$, following the suggested hyperparameters for Proximal Policy Optimization (PPO) [48] for continuous control. In the initial phase, the lateral connections of each module are frozen, and all tasks are individually trained for a predetermined number of $K_{init}$ training iterations to jumpstart task-specific learning. In the experiments reported in Section IV, we set $K_{init}=20$. In the subsequent learning steps, the parameters of the task modules that are not selected for training are frozen to avoid negative transfer between tasks, preventing gradient flow into the networks associated with the remaining tasks. In contrast, the parameters of, and the lateral connections incoming to, the task module selected for training are unfrozen to facilitate task-specific learning as well as skill transfer from other tasks. At a high level, this training mechanism and architecture enable bidirectional information transfer by allowing gradient flow through the subset of neural network parameters associated with the selected task while integrating information from other tasks. How the dynamic task selection takes place is described next.
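The following PyTorch-style sketch illustrates Eq. (1) for a single hidden layer of one task module. The class name BPNNLayer and its constructor arguments are illustrative rather than taken from the original implementation, and the sketch assumes the lateral inputs are passed in as a list of previous-layer activations from the other modules.

```python
import torch
import torch.nn as nn

class BPNNLayer(nn.Module):
    """Hypothetical sketch of one BPNN hidden layer (Eq. 1): the module's own
    affine transform plus a learned affine transform of each other module's
    previous-layer activation, summed and passed through tanh."""
    def __init__(self, in_dim, out_dim, lateral_in_dims):
        super().__init__()
        self.own = nn.Linear(in_dim, out_dim)                 # W_l^(m), b_l^(m)
        self.laterals = nn.ModuleList(
            [nn.Linear(d, out_dim) for d in lateral_in_dims]  # U_l^(m:t), b_l^(m:t)
        )

    def forward(self, h_prev_own, h_prev_others):
        z = self.own(h_prev_own)
        for lateral, h_t in zip(self.laterals, h_prev_others):
            z = z + lateral(h_t)                              # summation module (Sigma in Fig. 2c)
        return torch.tanh(z)
```

Freezing a non-selected module then amounts to setting requires_grad = False on its own and incoming lateral parameters, which blocks gradient flow exactly as described above.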

FIGURE 2. A high-level ERP-BPNN architecture (a, b) demonstrates ERP selecting Task 1 (a) and then Task 2 (b), showcasing bidirectional flow for skill transfer in a many-to-many fashion among three tasks. (c) Graphical representation of the ERP-BPNN framework with ERP task switching, wherein Task 1 is selected to learn. Weight updates are denoted by red (Task 1) arrows; dashed gray arrows indicate no gradient flow during the learning update. In the current report, Tasks 1, 2, and 3 refer to RL tasks for the 2-DoF, 3-DoF, and 4-DoF Reacher robot arms illustrated in Fig. 1(a), (b), and (c), respectively.

B. Task Switching by A Novel Intrinsic Motivation Signal: Average Soft Episodic Return Progress

We propose a task-switching mechanism for multi-task reinforcement learning based on a novel Intrinsic Motivation (IM) signal, namely Average Soft Episodic Return Progress (ERP), that captures the learning progress of an agent in the RL context. At each optimization iteration k, for each task network module m, we record the expected discounted cumulative reward, or average episodic return, denoted by $R_{m}(k)$. To compute $R_{m}(k)$, we first normalize the immediate rewards to ensure the exponential moving average of the rewards has a constant variance and clip them to $(-10,10)$ to stabilize training [49]. We then compute the mean of the returns over $P \cdot \kappa$ trajectories, where $P=8$ is the number of parallel RL environments and $\kappa =2$ is the number of episodes completed in each RL environment. The ERP for task m at step k, $ERP_{m}(k)$, is the slope of the line fitted by least squares to $R_{m}(k-w+1), R_{m}(k-w+2), \ldots, R_{m}(k)$, where w is a predetermined window size ($w=5$ in the experiments reported in this paper). The least squares solution is given by \begin{equation*} ERP_{m}(k)=\frac {w\left ({{\sum ^{w-1}_{i=0} X_{k}\lbrack i\rbrack Y_{m}(k)\lbrack i\rbrack }}\right)-\sum ^{w-1}_{i=0} X_{k}\lbrack i\rbrack \sum ^{w-1}_{i=0} Y_{m}(k)\lbrack i\rbrack }{w\left ({{\sum ^{w-1}_{i=0} \left ({{X_{k}\lbrack i\rbrack }}\right)^{2}}}\right)-\left ({{\sum ^{w-1}_{i=0} X_{k}\lbrack i\rbrack }}\right)^{2}} \tag {2}\end{equation*} where $X_{k}=[k-w+1,k-w+2,\ldots,k]$ and $Y_{m}(k)=[R_{m}(k-w+1),R_{m}(k-w+2),\ldots,R_{m}(k)]$.
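Assuming the per-task history of average episodic returns is stored in a simple list, the ERP slope of Eq. (2) can be computed with an ordinary least-squares fit; the function below is a minimal sketch (the closed form in Eq. (2) and NumPy's polyfit yield the same slope).

```python
import numpy as np

def episodic_return_progress(returns, w=5):
    """ERP at the current iteration: slope of the least-squares line fitted to the
    last w recorded average episodic returns R_m(k-w+1), ..., R_m(k) (Eq. 2).
    Assumes len(returns) >= w, which holds after the K_init bootstrapping run."""
    y = np.asarray(returns[-w:], dtype=np.float64)
    x = np.arange(len(returns) - w + 1, len(returns) + 1, dtype=np.float64)
    slope, _intercept = np.polyfit(x, y, deg=1)  # equivalent to the closed form of Eq. (2)
    return slope
```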

For bootstrapping the ERP computation, at the beginning of a multi-task learning session, each task is given an initial run of $K_{init}\gt w$ iterations. We then compute the ERP for each module individually for the initial run and subsequently at each iteration to select the task that has made the highest recent progress. The objective of the ERP procedure is to select tasks dynamically by identifying the task with the highest recent progress so that the selected task can continue its rapid progress. Selecting the most efficient task for training is particularly important as it can facilitate skill transfer to the other tasks through the bidirectional lateral connections of BPNN. For instance, the tasks that have made less progress can benefit from the steeper improvement of the selected task. Since ERP is computed for all tasks at each iteration, other tasks can benefit from the top-performing task and increase their chances of being selected. The window size should be chosen to balance noise reduction against the tracking of recent updates: a large window size w can lead to selecting the initially top-performing task for an extended number of iterations, whereas a small window size can introduce noise during training. Thus, to monitor the recent changes in the progress of all tasks, we empirically selected the window size as five based on a grid search. Consecutive selection of the same task will eventually reach a plateau during training, either because the task is learned or because a local minimum is encountered. At that point, the reduction in progress of the current task allows other tasks with more progress to be selected, and learning in those tasks may later help the plateaued task to jumpstart. This dynamic task selection regime can be considered analogous to flow state theory [50], which strives to maintain a balance between challenging and effortless tasks.

C. Multi-Task Reinforcement Learning

We initialize two separate BPNN architectures for the critic and policy networks to integrate our method into the actor-critic reinforcement learning framework. The policy network receives the state of the corresponding task as input and outputs the mean of a diagonal multivariate Gaussian distribution with a learned log standard deviation parameter independent of the state. Correspondingly, the critic network receives the state and learns the value function. Hence, only the parameters of the training task’s actor-critic modules and their corresponding lateral connections are updated with the Adam optimizer [51] during task learning. As there are multiple extensions of PPO [53], we use the hyperparameters tuned in [52] for the reacher task. The extended PPO loss $L^{{\mathcal {T}}_{i}}_{PPO}$, with a clipped value function loss and an entropy bonus, can be formulated [48] as \begin{equation*} L^{{\mathcal {T}}_{i}}_{PPO}=\mathbb {E}_{t}\left [{{L^{CLIP_{{\mathcal {T}}_{i}}}_{t}(\theta)-c_{1}L^{VF_{{\mathcal {T}}_{i}}}_{t}(\theta)+c_{2}S^{{\mathcal {T}}_{i}}[\pi _{\theta }](s_{t})}}\right ] \tag {3}\end{equation*} where $L^{CLIP_{{\mathcal {T}}_{i}}}_{t}(\theta)$ is the clipped PPO surrogate objective, $L^{VF_{{\mathcal {T}}_{i}}}_{t}(\theta)$ is the clipped value loss with coefficient $c_{1}$, and $S^{{\mathcal {T}}_{i}}[\pi _{\theta }](s_{t})$ is the entropy bonus for the actor network with coefficient $c_{2}$. Although the actor and critic neural networks are separate, we denote their parameters collectively as $\theta$, as in previous works [48], for brevity; $\theta (\pi)$ and $\theta (\phi)$ refer to the actor-network and critic-network parameters, respectively.
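A minimal sketch of the extended PPO objective of Eq. (3) is given below for the selected task; the tensor arguments and the coefficient values shown are illustrative defaults rather than the tuned hyperparameters adopted from [52].

```python
import torch

def ppo_loss(ratio, adv, values, old_values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate + clipped value loss + entropy bonus (Eq. 3), written as
    a loss to be minimized. All arguments are per-timestep tensors for task T_i."""
    # Clipped surrogate objective L^CLIP (negated, since it is maximized in Eq. 3)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()
    # Clipped value-function loss L^VF
    v_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    value_loss = torch.max((values - returns) ** 2, (v_clipped - returns) ** 2).mean()
    # Entropy bonus S[pi_theta] encourages exploration (subtracted from the loss)
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```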

Algorithm 1 ERP-BPNN

Require: ${\mathcal {T}}_{i}$: task with index i, ${\mathcal {T}}$: set of tasks, $m_{{\mathcal {T}}_{i}}$: network module of ${\mathcal {T}}_{i}$

1:  Constants: $\kappa =2$ (#episodes), $P=8$ (#parallel environments), $K_{init}=20$ (#jumpstart training iterations)
2:  for all ${\mathcal {T}}_{i}$ do
3:    Unfreeze module $m_{{\mathcal {T}}_{i}}$
4:    for $k \in \{1,\ldots,K_{init}\}$ do
5:      for $p \in \{1,\ldots,P\}$ do
6:        Sample $\kappa$ episodes by running policy $\pi _{\theta ^{old}_{{\mathcal {T}}_{i}}}$
7:        $\forall t$, compute advantage estimates $\hat {A}_{t}$
8:      end for
9:      Record average episodic return $R_{m_{{\mathcal {T}}_{i}}}(k)$
10:     Optimize $L^{{\mathcal {T}}_{i}}_{PPO}$ w.r.t. $\theta _{{\mathcal {T}}_{i}}$ by Equation (3)
11:     $\theta ^{old}_{{\mathcal {T}}_{i}} \leftarrow \theta _{{\mathcal {T}}_{i}}$
12:   end for
13:   Freeze module $m_{{\mathcal {T}}_{i}}$
14: end for
15: while not done do
16:   Calculate progress for all tasks using ERP
17:   Choose task ${\mathcal {T}}_{i}$ with maximum ERP
18:   Unfreeze module $m_{{\mathcal {T}}_{i}}$
19:   for $p \in \{1,\ldots,P\}$ do
20:     Sample $\kappa$ episodes by running policy $\pi _{\theta ^{old}_{{\mathcal {T}}_{i}}}$
21:     $\forall t$, compute advantage estimates $\hat {A}_{t}$
22:   end for
23:   Record average episodic return $R_{m_{{\mathcal {T}}_{i}}}(k)$
24:   Optimize $L^{{\mathcal {T}}_{i}}_{PPO}$ w.r.t. $\theta _{{\mathcal {T}}_{i}}$ by Equation (3)
25:   $\theta ^{old}_{{\mathcal {T}}_{i}} \leftarrow \theta _{{\mathcal {T}}_{i}}$
26:   Freeze module $m_{{\mathcal {T}}_{i}}$
27:   for all ${\mathcal {T}}_{j} \in {\mathcal {T}} \setminus \{{\mathcal {T}}_{i}\}$ do
28:     for $p \in \{1,\ldots,P\}$ do
29:       Sample $\kappa$ episodes by running policy $\pi _{\theta ^{old}_{{\mathcal {T}}_{j}}}$
30:     end for
31:     Record average episodic return $R_{m_{{\mathcal {T}}_{j}}}(k)$
32:   end for
33: end while
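The sketch below mirrors the control flow of Algorithm 1 in Python under stated assumptions: tasks[i] is assumed to expose a hypothetical collect_rollouts() returning a rollout batch and the average episodic return, and modules[i] hypothetical freeze()/unfreeze()/ppo_update() helpers; neither interface is prescribed by the paper.

```python
import numpy as np

def erp_bpnn_training(tasks, modules, k_init=20, w=5, total_iters=2000):
    """Outer loop of Algorithm 1: jumpstart each task, then repeatedly train the
    task with the highest ERP while keeping the returns of all tasks up to date."""
    returns = {i: [] for i in range(len(tasks))}

    # Jumpstart phase: train each module alone (laterals frozen) for K_init iterations.
    for i, task in enumerate(tasks):
        modules[i].unfreeze()
        for _ in range(k_init):
            batch, avg_ret = task.collect_rollouts()   # P parallel envs, kappa episodes each
            returns[i].append(avg_ret)
            modules[i].ppo_update(batch)                # optimize Eq. (3) w.r.t. theta_i
        modules[i].freeze()

    # Interleaved phase: pick the task with the highest recent return progress (ERP).
    for _ in range(total_iters):
        erp = {i: np.polyfit(np.arange(w), r[-w:], 1)[0] for i, r in returns.items()}
        sel = max(erp, key=erp.get)
        modules[sel].unfreeze()
        batch, avg_ret = tasks[sel].collect_rollouts()
        returns[sel].append(avg_ret)
        modules[sel].ppo_update(batch)
        modules[sel].freeze()
        # Evaluate the remaining tasks so their ERP estimates stay current.
        for j, task in enumerate(tasks):
            if j != sel:
                _, avg_ret_j = task.collect_rollouts()
                returns[j].append(avg_ret_j)
```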

D. Evaluation Metrics

Designing a reward function is crucial and challenging in deep reinforcement learning, primarily because the reward function may not fully encapsulate all attributes expected from a learning agent. In the Reacher environment, for example, the need often arises to tune the reward function coefficients or introduce additional parameters, which requires significant time, resources, and careful tuning to meet predetermined metrics. In this sense, it is important to evaluate how a straightforward, uninformed reward function performs in terms of metrics meaningful for the task domain at hand. Therefore, in this section, in addition to the classical RL metric of episodic return, we also present task domain metrics used to evaluate the collective performance of morphologically different agents. To account for the early stopping introduced in [54], we evaluate all tasks after each iteration, save the best policies obtained up to that iteration, and use them in our evaluations. Notably, saving the best policy based on the cumulative performance of all tasks for all methods ensures an accurate comparison of the proposed method against the baselines and enhances performance across all methods after training is completed.

1) Episodic Return

Maximum episodic return tracks the best expected discounted cumulative reward achieved across all tasks in an iteration. We report the best expected discounted cumulative reward after each training iteration to ensure a fair comparison with the baselines. Crucially, the ERP task-switching procedure provides an inherent early-stopping mechanism, provided that we allow a fixed number of cumulative training iterations under resource constraints. Since the task is to reach a given point in space with minimum effort, we follow the reward function definition in [55]: \begin{equation*} \mathbf {r} = -\|p_{\text {fingertip}} - p_{\text {target}}\|_{2} - \sum \text { a}^{2} \tag {4}\end{equation*} where a is the action, $-\sum \text { a}^{2}$ is the control cost, and $p_{\text {fingertip}}$ and $p_{\text {target}}$ are the positions of the fingertip and the target, respectively.
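For concreteness, the reward of Eq. (4) can be computed as follows; the function mirrors the standard Gymnasium Reacher reward and assumes positions and actions are passed as NumPy arrays.

```python
import numpy as np

def reacher_reward(fingertip_pos, target_pos, action):
    """Eq. (4): negative Euclidean distance to the target minus a control cost."""
    dist_cost = np.linalg.norm(fingertip_pos - target_pos)   # ||p_fingertip - p_target||_2
    ctrl_cost = np.sum(np.square(action))                    # sum of squared joint actions
    return -dist_cost - ctrl_cost
```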

2) Distance to the Goal

Distance to the goal is the minimum expected L2-norm distance between the final end-effector position and the goal position obtained over all tasks. The Reacher environment only terminates after 50 timesteps, which is equal to the episode length. Hence, we expect the end-effector of the reacher to stay in the immediate vicinity of the goal until episode termination.

3) Deviation from Shortest Path to the Goal

At a high level, we expect an efficient learning agent to follow the shortest feasible path to the goal, which is also desirable for many robotic applications. Thus, we introduce a straightness metric that measures the minimum expected deviation from the shortest path to the goal taken by the manipulator’s end-effector over all tasks. We define the deviation metric ${\mathcal {D}}$ across tasks as \begin{equation*} {\mathcal {D}}=E_{{\mathcal {T}}} \left [{{ \left ({{\sum _{t=1}^{H} L_{2}(\boldsymbol {x}_{{\mathcal {T}}}(t), \boldsymbol {x}_{{\mathcal {T}}}(t-1)) }}\right) - L_{2}(g, \boldsymbol {x}_{{\mathcal {T}}}(0))}}\right ]\end{equation*} where ${\mathcal {T}}$ is a task, $\boldsymbol {x}_{{\mathcal {T}}}(t)$ is the position of the end-effector for task ${\mathcal {T}}$ at timestep t, H is the episode length, g is the goal, and $\boldsymbol {x}_{{\mathcal {T}}}(0)$ is the initial end-effector position. We first compute the length of the spatial trajectory by summing the $L_{2}$-norm of the distance between the current and the previous end-effector positions. Subsequently, we subtract the $L_{2}$-norm of the distance between the goal position and the initial end-effector position to obtain the deviation from the shortest path to the goal for task ${\mathcal {T}}$. After each iteration, we calculate the path length traveled by all agents for every task and record the smallest expected deviation from the optimal path to the goal obtained by the learning algorithm.
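A minimal sketch of this metric for a single episode is shown below, assuming the end-effector positions are logged as an array with one row per timestep (including the initial position).

```python
import numpy as np

def path_deviation(ee_positions, goal):
    """Deviation from the shortest path: total end-effector path length minus the
    straight-line (L2) distance from the initial position to the goal."""
    ee = np.asarray(ee_positions, dtype=np.float64)
    path_length = np.sum(np.linalg.norm(np.diff(ee, axis=0), axis=1))  # sum of step-wise L2 distances
    straight_line = np.linalg.norm(np.asarray(goal) - ee[0])           # L2(g, x(0))
    return path_length - straight_line
```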

E. Implementation Details

Since the morphological diversity among the robots results in different state space dimensions and our BPNN architecture has a static input layer, we standardize the state input dimensions of the 2-DoF and 3-DoF robots to match the maximum state space dimension, that of the 4-DoF robot, across all tasks. Hence, we apply zero-padding to the dimensions corresponding to the angular velocity and the angle of each missing link. To encourage skill transfer among the networks, we limit the computational resources available to the task network modules by setting the number of hidden layers to three and the hidden layer size to two.
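A possible zero-padding routine is sketched below; it assumes the entries for the missing links can simply be appended as zeros at the end of the observation vector, whereas the actual placement of the padded dimensions depends on the Reacher observation ordering.

```python
import numpy as np

def pad_state(state, max_dim):
    """Zero-pad a robot's observation to the shared input dimension
    (that of the 4-DoF robot) so every task module receives equally sized inputs."""
    state = np.asarray(state, dtype=np.float32)
    pad = np.zeros(max_dim - state.shape[0], dtype=np.float32)  # missing link angles/velocities
    return np.concatenate([state, pad])
```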

For the Reacher tasks, we use PPO [48] as the RL algorithm and adopt the hyperparameter set from [52] for the remaining hyperparameters. After each training iteration, we sample trajectories from each task, excluding the task trained most recently, using the latest BPNN actor and critic parameters. Then, we compute the ERP, maximum episodic return, minimum expected distance to the goal, and minimum expected deviation from the shortest path to the goal for each task. Additionally, we run eight parallel environments per task to reduce simulation time. Each environment runs two episodes, with each episode lasting 50 steps, thereby accumulating a total of 800 timesteps per task.

SECTION IV.

Experiments

In this section, we first present the environment used for benchmarking multi-task learning for morphologically different robots illustrated in Fig. 1. Then, we elaborate on ERP-based task switching and present results obtained with the evaluation metrics introduced in Section III-D.

A. Single Goal Multi-Task Learning Between Morphologically Different Robots

To demonstrate skill transfer between morphologically different robots, we modified the Reacher-v2 environment, simulated in MuJoCo [56] using the Gymnasium framework [55]. The learning experiments were conducted on a modern computer equipped with an NVIDIA 3090 GPU and a 10-core Intel i9 CPU at 3.7 GHz. The experimental setup consists of three different reacher environments with distinct action spaces, specifically featuring 2-DoF, 3-DoF, and 4-DoF robot arms. Although each environment has unique dynamics due to its distinct morphology, all environments share the same reward function. The reward function defined in Eq. 4 comprises a control cost and the negative norm of the vector between the end-effector of the reacher and a predetermined goal. We consider the baselines RANDOM-BPNN and random Multi-Layer Perceptron (RANDOM-MLP) to evaluate the performance of our ERP-BPNN approach across the various metrics. RANDOM-BPNN follows random task selection while featuring the same underlying BPNN architecture as ERP-BPNN. Similarly, RANDOM-MLP follows the same random task selection procedure as RANDOM-BPNN but utilizes a separate actor-critic network pair for each task with no lateral connections among different tasks. Accordingly, RANDOM-MLP can be regarded as training a PPO agent for each task separately.

B. Results

In this section, we report the results of our method ERP-BPNN compared to the baselines of RANDOM-BPNN and RANDOM-MLP, based on the evaluation metrics presented in Section III-D. Table 1 shows each method’s mean and standard deviation results obtained during 100K episodes across five random seeds for each metric. The results indicate that ERP-BPNN can obtain superior performance compared to the baselines across all metrics with the lowest standard deviation. More importantly, as seen from the Episodic Return plot in Fig. 3 (a), the proposed model, ERP-BPNN, achieves faster convergence than the baselines.

TABLE 1. Episodic Return, Distance to Goal, and Deviation from Shortest Path to Goal results across 5 random seeds using ERP-BPNN, RANDOM-BPNN, and RANDOM-MLP.

FIGURE 3. Performances of the proposed model, ERP-BPNN, and the two baselines, RANDOM-BPNN and RANDOM-MLP, across five random seeds are shown in terms of (a) maximum episodic return, (b) minimum expected final end-effector distance to goal, and (c) minimum expected deviation from the shortest path to the goal.

Since the combination of the BPNN architecture with ERP-based task selection (ERP-BPNN) yields the best results, surpassing the random task selection strategy (RANDOM-BPNN), it can be argued that the ERP task selection procedure is essential for successful multi-task learning with positive transfer among tasks. Interestingly, the random task selection strategy (RANDOM-BPNN), although inferior to ERP-BPNN, performs better than the RANDOM-MLP baseline in terms of Episodic Return and yields better endpoint accuracy, measured as Distance to Goal, at the end of the training (Table 1). This indicates that the lateral connections among task networks may facilitate limited positive skill transfer even with random task selection. However, this picture is not reflected in the Deviation from Shortest Path to Goal measure, where random task selection leads to negative transfer (see Fig. 3(c), episodes $\gt 55K$). In the early stages of learning, the performance of ERP-BPNN and RANDOM-BPNN is better than that of RANDOM-MLP (episodes $\lt 55K$); however, while the proposed ERP-BPNN enjoys positive skill transfer and thus continues to improve, RANDOM-BPNN suffers from negative interference, degrading to a performance worse than RANDOM-MLP in this metric (Fig. 3(c), episodes $\gt 55K$).

In order to get an intuitive understanding of the learned policy performances, we illustrate typical trajectories followed by the end-effector of the Reacher robot in Fig. 4(a) when controlled by policies acquired by the $1750^{th}$ learning iteration, corresponding to approximately 80K episodes, i.e., when approximately 80% of the training is completed. Observe that at this training stage, the proposed ERP-BPNN has already acquired the skill to reach the target accurately through a less curved path compared to the baselines. In particular, the RANDOM-BPNN diverges from the shortest path to the goal for the 4-DoF robot arm, whereas the RANDOM-MLP diverges from the shortest path to the goal for both 3-DoF and 4-DoF Reacher Robot arms.

FIGURE 4. The policies obtained by our model and the baselines during multi-task learning in terms of straightness (a) and endpoint accuracy (b) are demonstrated (at the $1750^{th}$ policy update).

To have an intuitive grasp of the reaching ability obtained on morphologically different robots by the proposed model and the baselines, a typical goal position and the final end-effector positions reached for each robot are shown in Fig. 4(b). Final end-effector positions reached using each method indicate that the agent trained using ERP-BPNN is consistently closest to the goal across all environments compared to the baselines. Likewise, the results for Deviation from Shortest Path to Goal in Fig. 4(a) show that the agent trained with ERP-BPNN can reach the goal by taking a shorter path than the baselines in the 4-DoF reacher robot arm environment. Importantly, accurate reaching using a straighter path can be learned significantly faster by our proposed method ERP-BPNN compared to the baselines as evidenced by the distance plots given in Fig. 3(b).

C. ERP-Based Dynamic Task Selection

To examine the task engagement patterns that emerge with the ERP-based task selection mechanism, we record the selection frequency of each task over a moving learning window, i.e., a fixed number of learning iterations. Using this scheme, the selection frequencies of the tasks are plotted against the training iteration count in Fig. 5. Although ERP primarily selects the more straightforward 2-DoF Reacher robot arm as the one to learn initially (Fig. 5, time range: 0-10), it gradually starts selecting the more complex tasks with the 3-DoF and 4-DoF Reacher robots, respectively (Fig. 5, time range: 10-150). We then observe that ERP continues selecting the 4-DoF Reacher robot for skill transfer more frequently (Fig. 5, time range: 250-500). This task selection behavior suggests that the 4-DoF Reacher robot has benefited from the skill transfer of the other tasks and started to learn more rapidly later in the training. Upon closer examination of the episodic return plots for all tasks in Fig. 3(a) and the standard deviations across iterations in Table 1, ERP-BPNN exhibits a consistent overall improvement across training iterations, achieving faster convergence with a lower standard deviation compared to the baselines. In line with this, the faster convergence of ERP-BPNN indicates that the consecutive selections of the 4-DoF Reacher task have collectively increased the maximum episodic return.

FIGURE 5. Selection frequency plot of task switching by average episodic return progress with an iteration window size of $\nu =35$.

SECTION V.

Conclusion

This paper introduces Episodic Return Progress with Bidirectional Progressive Neural Networks (ERP-BPNN), a novel multi-task reinforcement learning approach incorporating a human-like interleaved learning mechanism, and shows its application to skill transfer across morphologically different robots. ERP-BPNN comprises a Bidirectional Progressive Neural Network (BPNN) that enables efficient, bidirectional skill transfer and a soft Episodic Return Progress (ERP) mechanism for dynamic task selection. The BPNN architecture is specifically designed to enable skill transfer through bidirectional lateral connections, drawing inspiration from the human brain’s adeptness at retaining existing knowledge while acquiring new skills. Episodic return progress-based (ERP) task selection complements BPNN by autonomously selecting the tasks to engage in learning during training, eliminating the need for prior domain knowledge to evaluate difficulty levels of tasks. We demonstrate that soft ERP-based task selection, with BPNN, achieves higher performance and faster convergence than the baselines across all the tested metrics of Episodic Return, Distance To the Goal, and Deviation from Shortest Path to Goal.

The insights presented in this work are closely related to the field of human brain-inspired reinforcement learning (RL) research. By implementing an Intrinsic Motivation (IM) signal and leveraging curiosity-driven exploration within the context of multi-task RL, the ERP-BPNN framework closely aligns with biological processes observed in the human brain. Specifically, the behavior of dopaminergic neurons, known for their critical role in reward-based learning, mirrors the temporal difference (TD) prediction errors utilized in RL algorithms [24], [57], [58]. BPNN provides a mechanism for a shared multi-task RL architecture that prevents task interference during skill transfer by adjusting knowledge transfer from other tasks during training. In addition, the ERP-guided freezing of lateral connections among task modules prevents interference with previously acquired knowledge. Furthermore, the bidirectional lateral connections augment noise in the task-specific computational flow, akin to noisy computations in the brain [26], alleviating catastrophic forgetting [59] while amplifying plasticity [60]. Analogous to the modular control architecture in the cerebellum [61], as corroborated by functional magnetic resonance imaging studies [62], [63], BPNN adopts a modular architecture. This architectural choice is instrumental in decreasing catastrophic forgetting by instantiating a module for each task.

Overall, we have designed a cognitive architecture applicable to a wide range of lifelong learning scenarios. Yet, an important future work that remains is to extend our architecture to support heterogeneous learning, where each task may require a different type of learning mechanism. For instance, one module may support supervised learning while others may handle reinforcement and unsupervised learning tasks. Other future work involves enhancing the framework to accommodate more complex lifelong RL scenarios, including real-world applications. This can be achieved by increasing the number of tasks and improving the bidirectional skill transfer mechanisms to enhance learning capacity and generalization ability.
