Introduction
Developing robots capable of effective autonomous and continual learning requires exploiting acquired knowledge without human intervention, which can be described as the main goal of lifelong robot learning [1], [2], [3]. A key feature of human learning is autonomous task switching and interleaved learning, which is not addressed in mainstream machine learning and robotics research [4]. Deep reinforcement learning (RL) is an area of machine learning that leverages deep learning to make decisions by learning from real-time or recorded interactions with the environment. Traditional deep RL methods often lack mechanisms to exploit the potential benefits of interleaved learning and bidirectional skill transfer based on partial learning. In contrast to prior works, our proposed model integrates human-like interleaved learning, leverages intrinsic motivation for autonomous task switching, and includes a novel bidirectional progressive architecture tailored for deep multi-task reinforcement learning. In sum, we aim to help fill this gap by developing a multi-task reinforcement learning framework that can sustain and, importantly, benefit from interleaved task learning and allow bidirectional skill transfer among tasks.
In humans, interleaved learning has been shown to yield improved recall of information and better long-term memory retention than blocked learning [5], [6], [7]. Supporting this behavioral data, the human brain is endowed with mechanisms against task interference and forgetting, such as internal rehearsal of experiences and memory consolidation [8], [9]. To enable interleaved learning, the question of when and which task to engage during multi-task learning must be answered. Developmental learning offers some inspiration: during learning, an infant autonomously decides what to do or play without external directions dictating what task (s)he needs to engage in. This behavior of infants is usually associated with the notion of intrinsic motivation (IM), which guides behavior through a putative internal reward system [10]. IM can be based on curiosity, novelty, learning progress (LP), or even challenge; as such, it has been used in robotics as LP [11] and curiosity [12], and in machine learning as novelty [13] or surprise [14]. In the current study, we adopt an IM approach and propose a novel learning progress signal for RL tasks that guides task switching. Overall, inspired by the above discussion, we aim to develop a multi-task RL framework with autonomous task switching that can benefit from interleaved task learning without suffering from task interference. We argue that such a learning system may benefit a wide range of robot learning scenarios, ranging from human-like learning for a social robot to skill transfer among morphologically different robots.
While significant progress has been made in deep RL for robotics, most approaches [15] have focused on transferring skills between robots with identical action spaces [16], [17] and tasks with pixel-level state inputs [18], [19], [20]. Nevertheless, achieving multi-task reinforcement learning between robots with different physical structures provides insights for human-inspired reinforcement learning and acts as a critical driving force for machine learning research [21]. This is particularly significant due to the differing state and action spaces, which provide a unique perspective on knowledge generalization. In light of this, we choose learning and transferring skills among morphologically different robots as the target to address with the developed framework.
One of the key challenges in lifelong learning is catastrophic interference/forgetting, which needs to be considered if a robot is to learn continually. When novel instances to be learned diverge greatly from previously observed ones, new information may overwrite the already acquired knowledge by modifying the representations shared among multiple tasks, leading to catastrophic forgetting. To minimize such interference while learning a novel task, several techniques have been proposed in the literature, such as restricting updates to the network parameters, dynamically allocating resources, or rehearsing old task samples while learning new ones [1]. In this study, similar to Progressive Neural Networks (PNN) [22], we utilize task-specific resources during learning to prevent possible task interference, but we also allow bidirectional inter-task connectivity to support positive skill transfer.
In sum, to enable human-like interleaved multi-task learning while avoiding task interference, we develop a multi-task reinforcement learning system composed of (1) a novel architecture called BPNN that improves upon the PNN [22], [23] and (2) a novel intrinsic motivation signal, Episodic Return Progress (ERP), for task switching. Unlike the PNN architecture, which restricts skill transfer to the forward direction and requires learning previous tasks until convergence before transferring to the next task, our BPNN method enables bidirectional skill transfer during training. This means that skill transfer can happen midway through multi-task learning, among all tasks, without one task having to wait for another to finish. The ERP signal evaluates task progress based on episodic return values, detecting the task that contributes most to enhancing overall performance across multiple tasks. The efficacy of the proposed multi-task learning framework is shown by its application to the learning of reaching skills by morphologically different robots, namely two degrees of freedom (2-DoF), 3-DoF, and 4-DoF manipulators (Figure 1). The conducted systematic experiments show that synergistic multi-task learning is possible due to the bidirectional inter-task skill transfer provided by the proposed BPNN architecture and the ERP-based task switching.
The rest of the paper is organized as follows. In Section II, we present an overview of the related studies present in the literature. Then, we describe our method in detail in Section III, providing metrics used to evaluate the performance of our proposed method. Section IV details the experiments and presents the results for skill transfer between morphologically different reacher robots. Finally, we discuss the broader impact of our method and future research directions in Section V.
Related Work
A. Human Learning
Numerous examples demonstrate how neuroscience and artificial intelligence have paved the way for each other [24], [25], [26]. In this vein, the ability of humans to acquire multiple skills with ease throughout their lives may guide machine learning research in multi-task learning and lifelong learning fronts. Human learning, especially during infancy and childhood, is characterized by autonomous engagement in play, i.e., exploration, where no task is learned completely in one sitting. Besides the fact that focusing on a single task until mastery is ecologically unreasonable, interleaved learning may allow positive skill transfer among tasks from partial learning if adequate mechanisms are engaged. This notion is supported by the contextual-interference (CI) effect studies showing that practicing tasks in an interleaved regime often results in improved learning compared to practicing in a block order [27]. Benefits of CI have been linked to increased brain activity during interleaved practice as opposed to repetitive practice [28], [29]. In addition, the CI effect not only improves information retention but also skill transfer between similar tasks [30].
On the other hand, machine learning settings usually prefer blocked learning to interleaved learning. Multi-task learning settings generally assume that either a random task (or a subset of tasks) is chosen for each training trial or that training proceeds to the next task after one task is mastered with a few exceptions [31], [32]. In the case of continual learning, tasks are learned sequentially as each task arrives [33]. This is in contrast with how humans learn. It is clearly seen that the beneficial impacts of interleaved learning have not been thoroughly examined within the machine learning literature thus far. Investigating these effects has the potential to enhance the alignment of artificial intelligence with the underlying mechanisms of the human brain.
B. Intrinsic Motivation
Intrinsic motivation (IM) is a significant topic in infant cognitive development and learning, which refers to motivation originating from innate satisfaction instead of the extrinsic reward gained from the environment [34], [35]. IM has been adopted for enabling open-ended robot learning [36], and improving self-supervised exploration [37], [38]. Intrinsic rewards can arise from exploring novel states, satisfying an ingrained curiosity, or the rate of acquiring skills and knowledge in an environment. Inclusion of intrinsic rewards to the extrinsic rewards from the environment is one way of solving challenging exploration problems [39] and discovering a diverse set of skills [40] in deep reinforcement learning. In our approach, unlike the existing uses of IM in reinforcement learning, we propose to utilize episodic return progress (ERP) as a higher-level IM signal to dynamically switch tasks for learning in an online manner instead of modulating the reward signal guiding RL. Consequently, the dynamic task switching carried out by the task selection mechanism leads to an emergent interleaved multi-task learning regime.
C. Curriculum Reinforcement Learning
Curriculum learning methods focus on discovering a goal or task sequencing procedure that can lead to a faster convergence during training or improved performance compared to random sequencing [41]. Most curriculum learning techniques require a priori domain knowledge of tasks to distinguish task levels to train from easier tasks to more difficult tasks [42], [43], [44]. For example, for predicting the output of a short Python code with Long Short Term Memory (LSTM) networks, the task levels can be identified by the number of nestings and the number of digits in the integers [45]. After manually identifying these task difficulty measures, it can be demonstrated that while training, a combination of a random curriculum and a naive curriculum where tasks are selected in ascending order of difficulty performs better than using only a random or a naive curriculum [45]. However, the difficulty of each task might not be readily available a priori in the robotics domain, and thus an automatic or emergent curriculum formation can be desirable. Deep Q-Networks (DQN) with prioritized experience replay [46] assigns importance to transitions based on their associated temporal difference error, thereby selecting data for more frequent replay in single-task reinforcement learning. On the other hand, our method prioritizes which task network is allowed to learn in an online manner. Hence, it operates at a higher level than DQN with prioritized experience replay’s prioritization scheme and creates emergent interleaved task-switching patterns on the fly. As such, the learning scheduling obtained is quite different from what can be obtained through a transition-level curriculum or a usual task curriculum where each task is learned to completion.
D. Multi-Task Reinforcement Learning
Multi-task learning involves sharing skills and knowledge between multiple tasks, where each task is identified as either a source or a target task during training. Typically, tasks share a part of the neural network model, and the model integrates a task conditioning parameter that defines the task during training. PathNet is a technique that uses a tournament selection genetic algorithm to evolve pathways of a neural network for multi-task, lifelong, and forward transfer learning [47]. However, PathNet is trained consecutively for reinforcement learning tasks, meaning that the source task needs to be trained until convergence before moving on to the target task. This procedure does not allow backward transfer between tasks. Similarly, in PNN [22], training the source task until convergence is required to transfer skills to other tasks connected to the trained task.
Current state-of-the-art methods do not consider human-inspired interleaved learning as a viable approach for multi-task learning, yet there are potential benefits of adopting a human-like learning strategy. In this vein, ERP-BPNN supports bidirectional transfer and does not require convergence in one task to use previously learned representations in other tasks. Hence, ERP-BPNN fits better into the multi-task learning framework where all tasks are source and target tasks throughout training. This is essential because the BPNN architecture allows for the integration of learning progress and bidirectional transfer.
Method
We propose a novel multi-task reinforcement learning framework that integrates a bidirectional progressive neural network (BPNN) with a unique architecture and soft task-switching mechanism crafted for RL, inspired by our previous work [4]. The BPNN architecture consists of bidirectional lateral connections among the hidden layers of each fully connected task network to allow skill transfer. By allocating a separate network module for each task, the model avoids negative task transfer at the core level but allows potential positive transfer to take place due to the bidirectional lateral connections among networks.
A. Bidirectional Progressive Neural Networks
We initialize a fully connected neural network module for each task and connect the hidden layers of these modules to one another through bidirectional lateral connections.
During training, each network module receives the input of the task selected for training. In this way, the lateral activations can carry representations of other tasks for skill transfer. Subsequently, we compute the hidden activations of layer $l$ in module $m$ as \begin{equation*} h_{l}^{(m)}=f\left({W_{l}^{(m)} h_{l-1}^{(m)}+b_{l}^{(m)}+\sum_{t\neq m}\left({U_{l}^{(m:t)} h_{l-1}^{(t)}+b_{l}^{(m:t)}}\right)}\right) \tag{1}\end{equation*} where $f$ is the activation function, $W_{l}^{(m)}$ and $b_{l}^{(m)}$ are the weights and bias of module $m$ at layer $l$, and $U_{l}^{(m:t)}$ and $b_{l}^{(m:t)}$ parameterize the lateral connection from task $t$'s module to task $m$'s module.
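To make the lateral computation of Eq. (1) concrete, below is a minimal PyTorch sketch of a single BPNN hidden layer. The class and argument names, layer sizes, and the choice of tanh as the activation $f$ are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class BPNNHiddenLayer(nn.Module):
    """One hidden layer of task module m with lateral inputs from the other
    task modules, following Eq. (1). Sizes and tanh are illustrative assumptions."""

    def __init__(self, in_dim: int, out_dim: int, n_tasks: int, task_id: int):
        super().__init__()
        self.task_id = task_id
        # W_l^(m) and b_l^(m): the module's own column
        self.own = nn.Linear(in_dim, out_dim)
        # U_l^(m:t) and b_l^(m:t): one lateral projection per other task t != m
        self.lateral = nn.ModuleDict(
            {str(t): nn.Linear(in_dim, out_dim) for t in range(n_tasks) if t != task_id}
        )

    def forward(self, h_prev_own: torch.Tensor, h_prev_others: dict) -> torch.Tensor:
        # h_prev_others maps task id -> previous-layer activation of that task's module
        out = self.own(h_prev_own)
        for t, h in h_prev_others.items():
            # detach() blocks gradients from flowing into the (frozen) source modules,
            # while the lateral weights U_l^(m:t) themselves remain trainable
            out = out + self.lateral[str(t)](h.detach())
        return torch.tanh(out)
```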
(a), (b) A high-level view of the ERP-BPNN architecture, with ERP selecting Task 1 (a) and then Task 2 (b), showcasing bidirectional flow for skill transfer in a many-to-many fashion among three tasks. (c) Graphical representation of the ERP-BPNN framework with ERP task switching, wherein Task 1 is selected for learning. Weight updates are denoted by red (Task 1) arrows; dashed gray arrows indicate no gradient flow during the learning update. In the current report, Tasks 1, 2, and 3 refer to the RL tasks for the 2-DoF, 3-DoF, and 4-DoF Reacher robot arms illustrated in Fig. 1 (a), (b), and (c), respectively.
B. Task Switching by a Novel Intrinsic Motivation Signal: Average Soft Episodic Return Progress
We propose a task-switching mechanism for multi-task reinforcement learning based on a novel Intrinsic Motivation (IM) signal, namely Average Soft Episodic Return Progress (ERP), that captures the learning progress of an agent in the RL context. At each optimization iteration $k$, for each task network module $m$, we record the expected discounted cumulative reward, i.e., the average episodic return. Collecting the last $w$ recorded returns of module $m$ in $Y_{m}(k)$ and the corresponding iteration indices in $X_{k}$, the ERP of module $m$ is computed as the least-squares slope of these returns over the window: \begin{equation*} ERP_{m}(k)=\frac{w\sum_{i=0}^{w-1} X_{k}[i]\,Y_{m}(k)[i]-\sum_{i=0}^{w-1} X_{k}[i]\sum_{i=0}^{w-1} Y_{m}(k)[i]}{w\sum_{i=0}^{w-1} \left({X_{k}[i]}\right)^{2}-\left({\sum_{i=0}^{w-1} X_{k}[i]}\right)^{2}} \tag{2}\end{equation*}
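Eq. (2) is the ordinary least-squares slope of the last $w$ recorded returns against their iteration indices; the following NumPy sketch computes it under that reading, with the function and argument names chosen for illustration.

```python
import numpy as np

def episodic_return_progress(returns, w):
    """Eq. (2): least-squares slope of the last w average episodic returns of a
    task module against their iteration indices. `returns` is assumed to hold the
    recorded average episodic returns for that module, oldest first."""
    y = np.asarray(returns[-w:], dtype=float)   # Y_m(k)
    x = np.arange(len(y), dtype=float)          # X_k
    n = len(y)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = n * np.sum(x ** 2) - np.sum(x) ** 2
    return num / den if den != 0 else 0.0       # zero progress before a full window exists
```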
For bootstrapping the ERP computation, at the beginning of a multi-task learning session, each task is given an initial ERP bootstrapping run of a fixed number of training iterations so that a full window of episodic returns is available before ERP-based task switching begins.
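One plausible realization of the "soft" aspect of ERP-based switching, shown here only as an assumption and not necessarily the exact rule used in this work, is to sample the next task from a softmax over the current ERP values rather than greedily taking the arg max:

```python
import numpy as np

def select_task(erp_values, temperature=1.0, rng=None):
    """Soft ERP-based task selection: tasks with higher return progress are more
    likely to be chosen. Softmax sampling is an illustrative assumption."""
    rng = rng or np.random.default_rng()
    v = np.asarray(erp_values, dtype=float) / temperature
    v -= v.max()                                  # numerical stability
    probs = np.exp(v) / np.exp(v).sum()
    return int(rng.choice(len(probs), p=probs))
```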
C. Multi-Task Reinforcement Learning
We initialize two separate BPNN architectures for the critic and policy networks to integrate our method into the actor-critic reinforcement learning framework. The policy network receives the state of the corresponding task as input and outputs the mean of a diagonal multivariate Gaussian distribution with a learned log standard deviation parameter independent of the state. Correspondingly, the critic network receives the state and learns the value function. Only the parameters of the training task's actor-critic modules and their corresponding lateral connections are updated using the Adam optimizer [51] during task learning. Since there are multiple extensions of PPO [53], we use the hyperparameters tuned for the reacher task in [52]. The resulting per-task PPO loss for task ${\mathcal{T}}_{i}$ is \begin{equation*} L^{{\mathcal{T}}_{i}}_{PPO}=\mathbb{E}_{t}\left[{L^{CLIP_{{\mathcal{T}}_{i}}}_{t}(\theta)-c_{1}L^{VF_{{\mathcal{T}}_{i}}}_{t}(\theta)+c_{2}S^{{\mathcal{T}}_{i}}[\pi_{\theta}](s_{t})}\right] \tag{3}\end{equation*}
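As a reference for Eq. (3), here is a condensed PyTorch-style sketch of the per-task loss. The clip range and the coefficients $c_1$, $c_2$ are shown with common PPO defaults, which are assumptions rather than the tuned settings used in this work.

```python
import torch

def per_task_ppo_loss(logp_new, logp_old, advantages, values, value_targets,
                      entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Per-task clipped PPO objective of Eq. (3); the value returned is -L so that
    minimizing it performs gradient ascent on the objective."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = torch.min(ratio * advantages, clipped * advantages).mean()
    l_vf = (values - value_targets).pow(2).mean()
    l_entropy = entropy.mean()
    return -(l_clip - c1 * l_vf + c2 * l_entropy)
```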
Algorithm 1 ERP-BPNN
Constants: number of tasks $M$, ERP window size $w$, number of bootstrap iterations $B$, number of episodes sampled per iteration $E$
for all task modules $m \in \{1, \ldots, M\}$ do
    Unfreeze module $m$ and its incoming lateral connections
    for iteration $b = 1, \ldots, B$ do
        for episode $e = 1, \ldots, E$ do
            Sample a trajectory for task $m$ with the current BPNN policy
        end for
        Record the average episodic return of task $m$
        Optimize module $m$ and its lateral connections with the PPO loss (Eq. 3)
    end for
    Freeze module $m$
end for
while not done do
    Calculate progress for all tasks using ERP (Eq. 2)
    Choose task $m^{*}$ by soft ERP-based selection
    Unfreeze module $m^{*}$ and its incoming lateral connections
    for episode $e = 1, \ldots, E$ do
        Sample a trajectory for task $m^{*}$ with the current BPNN policy
    end for
    Record the average episodic return of task $m^{*}$
    Optimize module $m^{*}$ and its lateral connections with the PPO loss (Eq. 3)
    Freeze module $m^{*}$
    for all tasks $t \neq m^{*}$ do
        for episode $e = 1, \ldots, E$ do
            Sample an evaluation trajectory for task $t$ with the latest BPNN parameters
        end for
        Record the average episodic return of task $t$
    end for
end while
D. Evaluation Metrics
Designing a reward function is crucial and challenging in deep reinforcement learning, primarily because the reward function may not fully encapsulate all attributes expected from a learning agent. One example is the Reacher environment, where the need often arises to tune the reward function coefficients or introduce additional parameters. However, adhering to predetermined metrics in this way requires significant time and resources for careful parameter tuning. In this sense, it is important to evaluate how a straightforward, uninformed reward function performs in terms of metrics that are meaningful for the task domain at hand. Therefore, in this section, in addition to the classical RL metric of episodic return, we also present task domain metrics used to evaluate the collective performance of morphologically different agents. To account for the early stopping introduced in [54], we evaluate all tasks after each iteration, save the best policies obtained up to that iteration, and use them in our evaluations. Notably, saving the best policy based on the cumulative performance of all tasks for every method ensures a fair comparison of the proposed method against the baselines and enhances the final performance of all methods.
1) Episodic Return
Maximum episodic return tracks the best expected discounted cumulative reward achieved across all tasks in an iteration. We report the best expected discounted cumulative reward after each training iteration to ensure a fair comparison with the baselines. Crucially, the ERP task-switching procedure provides an inherent early-stopping mechanism, provided that we allow a fixed number of cumulative training iterations under resource constraints. Since the task is to reach a given point in space with minimum effort, we follow the reward function definition in [55]: \begin{equation*} r = -\|p_{\text{fingertip}} - p_{\text{target}}\|_{2} - \sum_{i} a_{i}^{2} \tag{4}\end{equation*}
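A direct translation of Eq. (4) into NumPy, assuming the fingertip position, target position, and action vector are available as arrays:

```python
import numpy as np

def reacher_reward(p_fingertip, p_target, action):
    """Eq. (4): negative Euclidean distance from the fingertip to the target
    minus a control cost given by the sum of squared action components."""
    dist_cost = np.linalg.norm(np.asarray(p_fingertip) - np.asarray(p_target))
    ctrl_cost = np.sum(np.square(action))
    return -dist_cost - ctrl_cost
```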
2) Distance to the Goal
Distance to the goal is the minimum expected L2-norm distance between the final end-effector position and the goal position obtained over all tasks. The Reacher environment only terminates after 50 timesteps, which is equal to the episode length. Hence, we expect the end-effector of the reacher to stay in the immediate vicinity of the goal until episode termination.
3) Deviation from Shortest Path to the Goal
At a high level, we expect an efficient learning agent to follow the shortest feasible path to the goal, which is also desirable for many robotic applications. Thus, we introduce a straightness metric that measures the minimum expected deviation from the shortest path to the goal taken by the manipulator's end-effector over all tasks. We define the deviation metric as \begin{equation*} {\mathcal{D}}=E_{\mathcal{T}} \left[{\left({\sum_{t=1}^{T} L_{2}(\boldsymbol{x}_{\mathcal{T}}(t), \boldsymbol{x}_{\mathcal{T}}(t-1))}\right) - L_{2}(g, \boldsymbol{x}_{\mathcal{T}}(0))}\right]\end{equation*} where $\boldsymbol{x}_{\mathcal{T}}(t)$ denotes the end-effector position at timestep $t$ in task $\mathcal{T}$, $g$ is the goal position, and $T$ is the episode length.
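Read this way, $\mathcal{D}$ compares the length of the path traced by the end-effector with the straight-line distance from its start position to the goal. A NumPy sketch under that reading (the argument names are assumed for illustration):

```python
import numpy as np

def path_deviation(positions, goal):
    """Deviation from the shortest path: the traced end-effector path length minus
    the straight-line distance from the initial position to the goal.
    `positions` is assumed to be the sequence x(0), ..., x(T)."""
    pts = np.asarray(positions, dtype=float)
    traced = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    shortest = np.linalg.norm(np.asarray(goal, dtype=float) - pts[0])
    return traced - shortest
```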
E. Implementation Details
Given that the morphological diversity among robots results in different state space dimensions and that our BPNN architecture has a static input layer, we standardize the state input dimensions of the 2-DoF and 3-DoF robots to match the maximum state space dimension across all tasks, which is that of the 4-DoF robot. Hence, we apply zero-padding to the dimensions corresponding to the angle and angular velocity of each missing link. To encourage skill transfer among networks, we limit the computational resources available to the task network modules by setting the number of hidden layers to three and the hidden layer size to two.
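A minimal sketch of the zero-padding step described above; where exactly the padded entries are placed in the state vector is an assumption made for illustration.

```python
import numpy as np

def pad_state(state, target_dim):
    """Zero-pad a lower-DoF robot's state so that every task shares the 4-DoF
    input dimension; the padded entries stand in for the missing links' angles
    and angular velocities."""
    state = np.asarray(state, dtype=float)
    return np.concatenate([state, np.zeros(target_dim - state.shape[0])])
```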
For the Reacher tasks, we use PPO [48] as the RL algorithm and adopt the remaining hyperparameters from the set tuned in [52]. After each training iteration, we sample trajectories from each task, excluding the task trained most recently, using the latest BPNN actor and critic parameters. Then, we compute the ERP, the maximum episodic return, the minimum expected distance to the goal, and the minimum expected deviation from the shortest path to the goal for each task. Additionally, we run eight parallel environments per task to reduce simulation time. Each environment runs two episodes, with each episode lasting 50 steps, thereby accumulating a total of 800 timesteps per task.
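The per-task rollout budget above (8 parallel environments, 2 episodes of 50 steps each) could be collected with Gymnasium's vector API roughly as follows. The standard `Reacher-v4` id is used here only as a stand-in for the modified multi-DoF Reacher environments, and the random action is a placeholder for the BPNN policy.

```python
import gymnasium as gym

# "Reacher-v4" stands in for the modified multi-DoF Reacher environments,
# which are not part of the standard Gymnasium registry.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("Reacher-v4") for _ in range(8)])
obs, _ = envs.reset(seed=0)
for _ in range(2 * 50):                      # two 50-step episodes per environment
    actions = envs.action_space.sample()     # placeholder for the BPNN policy
    obs, rewards, terms, truncs, infos = envs.step(actions)
envs.close()                                 # 8 envs x 100 steps = 800 timesteps per task
```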
Experiments
In this section, we first present the environment used for benchmarking multi-task learning for morphologically different robots illustrated in Fig. 1. Then, we elaborate on ERP-based task switching and present results obtained with the evaluation metrics introduced in Section III-D.
A. Single Goal Multi-Task Learning Between Morphologically Different Robots
To demonstrate skill transfer between morphologically different robots, we modified the Reacher-v2 environment, simulated in MuJoCo [56] using the Gymnasium framework [55]. The learning experiments were conducted on a modern computer equipped with an NVIDIA 3090 GPU and a 10-core Intel i9 CPU at 3.7 GHz. The experimental setup consists of three different reacher environments, each having a distinct action space, specifically featuring 2-DoF, 3-DoF, and 4-DoF robot arms. Although each environment has unique dynamics due to its distinct morphology, all environments share the same reward function. The reward function defined in Eq. 4 comprises a control cost and the negative norm of the vector between the end-effector of the reacher and a predetermined goal. We consider the baselines RANDOM-BPNN and random Multi-Layer Perceptron (RANDOM-MLP) to evaluate the performance of our ERP-BPNN approach across various metrics. RANDOM-BPNN abides by random task selection while featuring the same underlying BPNN architecture as ERP-BPNN. Similarly, RANDOM-MLP follows the same random task selection procedure as RANDOM-BPNN but utilizes a separate actor-critic network pair for each task with no lateral connections among different tasks. Accordingly, RANDOM-MLP can be regarded as training a PPO agent for each task separately.
B. Results
In this section, we report the results of our method ERP-BPNN compared to the baselines of RANDOM-BPNN and RANDOM-MLP, based on the evaluation metrics presented in Section III-D. Table 1 shows each method’s mean and standard deviation results obtained during 100K episodes across five random seeds for each metric. The results indicate that ERP-BPNN can obtain superior performance compared to the baselines across all metrics with the lowest standard deviation. More importantly, as seen from the Episodic Return plot in Fig. 3 (a), the proposed model, ERP-BPNN, achieves faster convergence than the baselines.
Performances of the proposed model, ERP-BPNN, and the two baselines of RANDOM-BPNN and RANDOM-MLP across five random seeds are shown in terms of (a) maximum episodic return, (b) minimum expected final end-effector distance to goal, and (c) minimum expected deviation from the shortest path to the goal.
Since the combination of the BPNN architecture with ERP-based task selection (ERP-BPNN) yields the best results, surpassing the random task selection strategy (RANDOM-BPNN), it can be argued that the ERP task selection procedure is essential for successful multi-task learning with positive transfer among tasks. Note that, interestingly, the random task selection strategy (RANDOM-BPNN), although inferior to ERP-BPNN, performs better than the RANDOM-MLP baseline in terms of Episodic Return and yields better endpoint accuracy measured as Distance to the Goal at the end of training (Table 1). This indicates that the lateral connections among task networks may facilitate limited positive skill transfer even with random task selection. However, this picture is not reflected in the Deviation from the Shortest Path to the Goal measure, where random task selection leads to negative transfer (see Fig. 3(c), episodes >55K). In the early stages of learning, the performance of ERP-BPNN and RANDOM-BPNN is better than that of RANDOM-MLP (Fig. 3).
In order to get an intuitive understanding of the learned policy performances, we illustrate typical trajectories followed by the end-effector of the Reacher robot in Fig. 4(a) when it is controlled by the policies acquired by the proposed model and the baselines.
The policies obtained by our model and the baselines during multi-task learning are illustrated in terms of straightness (a) and endpoint accuracy (b).
To have an intuitive grasp of the reaching ability obtained on morphologically different robots by the proposed model and the baselines, a typical goal position and the final end-effector positions reached for each robot are shown in Fig. 4(b). Final end-effector positions reached using each method indicate that the agent trained using ERP-BPNN is consistently closest to the goal across all environments compared to the baselines. Likewise, the results for Deviation from Shortest Path to Goal in Fig. 4(a) show that the agent trained with ERP-BPNN can reach the goal by taking a shorter path than the baselines in the 4-DoF reacher robot arm environment. Importantly, accurate reaching using a straighter path can be learned significantly faster by our proposed method ERP-BPNN compared to the baselines as evidenced by the distance plots given in Fig. 3(b).
C. ERP-Based Dynamic Task Selection
To examine the task engagement patterns that emerge with the ERP-based task selection mechanism, we record the selection frequency of each task over a moving learning window, i.e., a fixed number of learning iterations. Using this scheme, the selection frequencies of the tasks are plotted against training iteration counts in Fig. 5. As can be seen, although ERP primarily selects the more straightforward 2-DoF Reacher robot arm to learn initially (Fig. 5, time range: 0-10), it gradually starts selecting the more complex tasks with the 3-DoF and 4-DoF Reacher robots, respectively (Fig. 5, time range: 10-150). Then, we observe that ERP continues selecting the 4-DoF Reacher robot for skill transfer more frequently (Fig. 5, time range: 250-500). This task selection behavior suggests that the 4-DoF Reacher robot has benefited from skill transfer from the other tasks and started to learn more rapidly later in the training. Upon closer examination of the episodic return plots for all tasks in Fig. 3(a) and the standard deviations across iterations in Table 1, ERP-BPNN exhibits a consistent overall improvement across training iterations, achieving faster convergence with lower standard deviation compared to the baselines. In line with this, the faster convergence of ERP-BPNN indicates that consecutive selections of the 4-DoF Reacher task have collectively increased the maximum episodic return.
Selection frequency plot of task switching driven by average soft episodic return progress, computed over a fixed iteration window.
Conclusion
This paper introduces Episodic Return Progress with Bidirectional Progressive Neural Networks (ERP-BPNN), a novel multi-task reinforcement learning approach incorporating a human-like interleaved learning mechanism, and shows its application to skill transfer across morphologically different robots. ERP-BPNN comprises a Bidirectional Progressive Neural Network (BPNN) that enables efficient, bidirectional skill transfer and a soft Episodic Return Progress (ERP) mechanism for dynamic task selection. The BPNN architecture is specifically designed to enable skill transfer through bidirectional lateral connections, drawing inspiration from the human brain's adeptness at retaining existing knowledge while acquiring new skills. ERP-based task selection complements BPNN by autonomously selecting the tasks to engage in during training, eliminating the need for prior domain knowledge to evaluate task difficulty levels. We demonstrate that soft ERP-based task selection, combined with BPNN, achieves higher performance and faster convergence than the baselines across all the tested metrics of Episodic Return, Distance to the Goal, and Deviation from the Shortest Path to the Goal.
The insights presented in this work are closely related to the field of human brain-inspired reinforcement learning research. By implementing an Intrinsic Motivation (IM) signal and leveraging curiosity-driven exploration within the context of multi-task RL, the ERP-BPNN framework closely aligns with biological processes observed in the human brain. Specifically, the behavior of dopaminergic neurons, known for their critical role in reward-based learning, mirrors the temporal difference (TD) prediction errors utilized in RL algorithms [24], [57], [58]. BPNN provides a shared multi-task RL architecture that prevents task interference during skill transfer by adjusting the knowledge transferred from other tasks during training. In addition, the ERP-guided freezing of lateral connections among task modules prevents interference with previously acquired knowledge. Furthermore, the bidirectional lateral connections introduce noise into the task-specific computational flow, akin to noisy computations in the brain [26], alleviating catastrophic forgetting [59] while amplifying plasticity [60]. Analogous to the modular control architecture of the cerebellum [61], as corroborated by functional magnetic resonance imaging studies [62], [63], BPNN adopts a modular architecture. This architectural choice is instrumental in reducing catastrophic forgetting by instantiating a module for each task.
Overall, we have designed a cognitive architecture applicable to a wide range of lifelong learning scenarios. Yet, an important future work that remains is to extend our architecture to support heterogeneous learning, where each task may require a different type of learning mechanism. For instance, one module may support supervised learning while others may handle reinforcement and unsupervised learning tasks. Other future work involves enhancing the framework to accommodate more complex lifelong RL scenarios, including real-world applications. This can be achieved by increasing the number of tasks and improving the bidirectional skill transfer mechanisms to enhance learning capacity and generalization ability.