Meta-Learning Based Tasks Similarity Representation for Cross Domain Lifelong Learning

Deep neural networks outperform humans on many specific single tasks, but they struggle to handle a sequence of new tasks from different domains. Deep learning-based models need to retain the parameters of learned tasks to perform well on new tasks, and they forfeit the ability to generalize from previous data, which is inconsistent with human learning. We propose a novel lifelong learning framework that guides the model to learn new knowledge without forgetting old knowledge by learning a similarity representation based on meta-learning. First, we employ a cross-domain triplets network (CDTN) that minimizes the maximum mean discrepancy (MMD) between the current task and the knowledge base to learn a domain-invariant similarity representation among tasks in different domains. Second, we propose a soft attention network (SAN) that assigns different weights according to the learned similarity representation of tasks. In addition, a low-level feature enhancement module (LFEM) based on self-attention mechanisms is developed to capture domain-invariant similarity information. The experimental results show that our method effectively reduces catastrophic forgetting compared with state-of-the-art methods when learning many tasks. Moreover, we show that the proposed method hardly forgets old knowledge while continuously enhancing the performance on old tasks, which is more in line with the human way of learning.


I. INTRODUCTION
In the past few years, both traditional machine learning methods based on probabilistic and statistical models and deep learning methods such as deep convolutional networks (DCN) [1], transformers [2], and deep reinforcement learning (DRL) [3] have outperformed humans in some specific tasks, such as Go, image recognition, and natural language processing [4], [5], [6]. However, most existing deep models have a fatal flaw: they cannot continuously learn a sequence of tasks across domains as human beings do. Deep models tend to fall into catastrophic forgetting when they are optimized for new tasks, because of their inherent optimization methods. Because the data distributions of the new task and the old task differ, their optimal solutions differ, so the weights of a trained model change when it is trained on the new task, which leads to catastrophic forgetting. In contrast, humans do not quickly forget previous knowledge; instead, what has been learned in previous tasks accelerates the learning of new tasks.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.
To prevent catastrophic forgetting, some approaches optimize parameter updates for new tasks in a space orthogonal to old tasks [8]. Others use rehearsal, adding a small number of training samples from old tasks to new ones, mimicking human review behavior [9], [10]. Distillation [11] is a popular method for ensuring good performance across all tasks. Overparameterization is leveraged in some methods [7], [12], [13] to activate or expand neurons for new tasks. Recently, Mahalanobis similarity was employed as a learning parameter to learn meaningful features while linearly increasing parameters as tasks increase [14]. However, most lifelong learning methods assume tasks are from the same distribution, ignoring the more general scenario of tasks from different domains.
FIGURE 1. The training process of lifelong learning (from task 1 to tasks 2, 3 and 4 respectively). Tasks 1, 2 and 3 come from the same domain, and task 4 comes from a different domain. The weight update trajectories show that task similarity information, especially cross-domain similarity information, is very important for lifelong learning.
Meta-learning [15], also referred to as "learning to learn," involves training models to learn characteristics beyond the specifics of a task. For instance, models can learn to represent similarities between tasks, which allows for quick adaptation to new tasks. If two tasks are similar, their distance in the feature space is small, and vice versa. Despite these benefits, most current meta-learning approaches [16], [17], [18] are trained and tested on data from the same distribution. As a result, existing meta-learners are ineffective at learning essential similarity representations when tasks belong to different domains. Fig. 1 shows the training process of lifelong learning (from task 1 to tasks 2, 3 and 4 respectively). Tasks 1, 2 and 3 come from the same domain, but task 4 comes from another domain. The black point is the weights W1 obtained from task 1. When the model continues to learn other tasks, the optimization trajectories of the training weights are shown as the red, green and blue arrows in Fig. 1. Due to the high similarity between task 1 and task 2, the weights quickly converge (short path) to W2, which may be better than the original W1. Task 3 is less similar to task 1 than task 2 is, but it is still in the same domain. The updated weights W3 change greatly compared with W2, resulting in catastrophic forgetting to a certain extent. When learning task 4, which is situated in another domain, the training trajectory shows that the weights W4 change even more compared with W3, which leads to catastrophic forgetting. Therefore, task similarity information, especially cross-domain similarity information, is very important for lifelong learning. The starting point of the meta-learning used by us is the same as that of other meta-learning methods like GPT-3 [2].
Through pre-training, we can get a broader optimization space, which is convenient for fine-tuning on other tasks. Reference [2] uses a large network with about 175 billion parameters to get a better result. However, in this paper, the proposed cross-domain triplet network can perform meta-learning across domains and learn the similarity between tasks, so there is no need for a complex network structure. To address these issues, we propose a lifelong learning method based on the similarity of tasks in different domains that effectively reduces catastrophic forgetting. The main contributions of this paper are summarized as follows:
This paper is organized as follows: Section II provides an overview of previous work on lifelong learning and meta-learning methods. Section III outlines the proposed method with a detailed explanation. In Section IV, the experimental results are presented. Finally, Section V concludes the paper.

II. RELATED WORK
Lifelong learning is a method of training models in a way that prevents them from losing previously acquired knowledge as they learn new tasks. This is achieved by balancing the model's plasticity and stability, allowing it to continually absorb new information without forgetting old knowledge. Related formulations of lifelong learning include continual learning [19], class incremental learning [20], and task incremental learning [21].
Early approaches to lifelong learning were categorized into several main methods, such as minimizing representation overlap [22], [23], utilizing past samples or generated virtual samples, and implementing dual architectures [24], [25]. For instance, the method in [26] used clustering techniques to assess the similarity and transfer invariance between tasks. Additionally, reinforcement learning methods and meta-learning technology have also shown promising results in lifelong learning tasks [29], [30], [31].
VOLUME 11, 2023 36693 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Due to limited resources, some early lifelong learning methods could only learn a small number of tasks sequentially using a specific shallow architecture. However, ELLA [32] was introduced as a general lifelong learning algorithm that operates within a multi-task learning framework, allowing the model to learn multiple basic learning models continuously. This led to the development of probability-based [33] and non-parametric Bayesian methods [34], which share information between tasks through linear combinations of basis vectors to enhance model performance. However, these methods have limitations in terms of the types of learning tasks they can handle. To overcome this limitation, GO-MTL [35] proposed a sparse shared model to address the multi-task learning problem.
However, these methods are applicable only to simple tasks and are not suitable for handling a large number of complex tasks. With the recent growth of interest in deep neural networks, researchers have paid more attention to avoiding catastrophic forgetting in lifelong learning. They have conducted empirical studies on the impact of dropout and activation functions on catastrophic forgetting and have explored task incremental learning from a theoretical standpoint [36], [37], [38].
Deep lifelong learning methods can be broadly categorized into three categories. The first category is rehearsal-based methods, which are similar to human review. These methods take into account the impact of old tasks when the model learns new tasks, allowing it to better remember them and avoid catastrophic forgetting. Distillation technology is often used in rehearsal-based methods, which enables quick learning of new tasks using only a few samples. One example of a rehearsal-based method is the ICARL algorithm [9], which uses a teacher network and a student network to quickly converge all learned tasks with a small number of training samples. This approach allows for storing only a few samples from previous tasks when learning a new task, thereby reducing memory overhead.
A different approach to address the problem is the GEM method [41]. Instead of storing training samples, GEM stores the gradient of previous tasks, ensuring that the gradient update for new tasks is orthogonal to previous tasks. This reduces the interference of previous knowledge. Some GAN-based methods have also been proposed to generate high-quality images and model the data-generating distribution of previous tasks, allowing for retraining on generated examples [25], [42], [43], [44]. However, these methods require more calculations and additional resources, despite their ability to reduce storage space.
GAN-based methods provide storage space savings but require additional calculations, while other approaches like Continual Prototype Evolution (CPE) [39] combine the nearest-mean classifier approach with a more efficient reservoir-based sampling scheme. For more detailed experiments on rehearsal for lifelong learning, refer to [40].
The second category of deep lifelong learning methods uses regularization to control parameter updates. These methods assign a weight to each parameter based on its importance for previous tasks and adjust it accordingly. LwF [8] limits the changes to parameters that are consistent with previous tasks. EWC [7] measures the significance of parameters using the Fisher information matrix from previous training. However, this method can restrict the network too much when there are many tasks and hinder new learning. Some methods, like the SI algorithm [45], address this problem by considering the variation of parameters from previous to new tasks. However, this approach can lead to unstable results with stochastic gradient descent. MAS [46], proposed by Aljundi et al., allows unsupervised estimation of parameter importance, so it can be adapted to different settings without supervision; it was later extended to the task-free setting [49], [50]. VCL [47] applies a variational framework for continual learning. Other Bayesian-based methods [48], [45] estimate parameter importance online during task training. However, these methods often have difficulty converging.
The third category of deep lifelong learning methods comprises neuron activation or expansion techniques. These methods use different parameters for different tasks, or allocate extra parameters to new tasks when the network has spare capacity. However, the model parameters can fill up quickly as the number of tasks grows. PackNet [12] ranks weights in the network based on their significance and only trains the current task with the first 50% of the selected weights. HAT [71] freezes the parameters of previous tasks or allocates a separate model for each task when learning new tasks. The network structure remains fixed, with specific components assigned to each task. During the training of a new task, the parameters of previous tasks are masked and converted into embeddings. After passing through these embeddings, the network transforms them into masks. HAT [71] also uses sparsity in its loss function, making it more sophisticated. These methods usually need a task oracle to activate the relevant masks or task branches during prediction, which limits them to a multi-head setup and prevents them from handling a head shared between tasks. Expert Gate [51] solves this problem by learning an auto-encoder gate. In contrast to methods with a fixed number of network weights, there are also methods such as progressive network [52], dynamic memory network [53], and DER [20] that grow the network structure. Whenever a new task arrives, suitable neurons are added to train it. However, these methods are restricted to small-scale task learning due to constraints on the number of parameters.
Meta-learning, or learning to learn, is a machine learning method that aims to acquire higher-level knowledge, such as task-level and hyperparameter-level knowledge. This higher-level knowledge, known as meta-knowledge, helps the model learn from new data more quickly and effectively. MAML [54] applies gradient descent to acquire the hyperparameters or initial parameters of the basic learner. The method in [55] employs a read and write mechanism and merges it with soft attention and temporal convolution to access data from previous events. The siamese network [56], matching network [57], and prototype network [58] rely on the metric learning paradigm and use deep neural networks to acquire a mapping function from the input space to the feature space. The model learns to place examples that belong to the same category near each other and examples that belong to different categories far from each other, enabling it to classify new tasks effectively. Meta-learning is more robust than traditional similarity measurement methods. Reference [59] applies meta-learning methods to acquire generalized parameters that are not specific to either old or new tasks to prevent catastrophic forgetting. Reference [68] introduced a differentiable Bayesian change point detection scheme to improve meta-learning methods for continual learning tasks. Building on the idea of MAML [54], the authors of [60] suggested an activation-gating function that selectively activates neurons. However, these methods struggle to acquire cross-domain task similarity well.
To conclude, the existing methods for lifelong learning encounter major difficulties in terms of resource utilization and model performance when handling a large number of tasks. To address these difficulties, we propose a meta-learning-based task similarity representation framework for lifelong learning.

III. METHOD
The proposed cross-domain lifelong learning framework, shown in Fig. 2, is composed of two stages. In the first stage, a Cross Domain Triplets Network (CDTN) is trained to acquire a similarity representation of tasks across domains. In the second stage, a Soft Attention Network (SAN) uses this similarity information, combined with knowledge from a knowledge base, to obtain a task-specific attention map and assign specific weights to the network, enabling it to learn the current task effectively.
As illustrated in Algorithm 1, we start by initializing the two sets of network parameters, W1 and W2, and setting the hyperparameters. When a new task t arrives, we first train the CDTN by optimizing the weights W1 using the meta-learning loss (MLL) and the maximum mean discrepancy loss (MMDL) to obtain the similarity representation $b_l^t$, as shown in lines 2 to 6. In lines 7 to 9, we use a 1 × 1 convolution on $b_l^t$ to match the number of channels of the SAN and compute the attention map a using Eq. (8). To preserve prior knowledge, we calculate the gradients using Eq. (9) and update W2 with the cross-entropy loss (CEL) for the classification task, all while keeping W1 fixed.

9: Multiply attention maps a by feature maps F from SAN.
10: Compute gradients using Eq. (9) and update W2 using cross-entropy loss (CEL) (for classification tasks).
11: end for
12: end for

A. THE PROPOSED LIFELONG LEARNING PROBLEM SETTING
First, we have a set of labeled samples from various domains that serve as the knowledge base. This knowledge base differs from a knowledge graph and does not have a graph structure. These training samples are used to pre-train a base model so that it can be fine-tuned more effectively on new tasks and can better learn the similarity between tasks, which agrees with the definition of meta-learning. The model faces a series of supervised learning tasks, denoted as Z1, Z2, ..., Zt, where each task Zt = f(x, y) consists of data X(t) and Y(t), drawn randomly from different distributions D1, D2, ..., Dt. X(t) represents a set of data samples for task t, while Y(t) represents the corresponding ground truth labels, typically either 1 or −1 for classification tasks or a real number for regression tasks. Each task t has N training samples. Unlike previous methods, we adopt a more flexible definition of lifelong learning tasks, in which tasks do not have to come from the same distribution. T represents the total number of tasks encountered so far. At each time step, the model is given a batch of labeled data for a task t in T. After being trained on each batch, the model can predict on instances of any previous or current task without access to X(T) and Y(T) from the previous tasks.

B. CROSS DOMAIN TRIPLETS NETWORK FOR TASK SIMILARITY REPRESENTATION
In the lifelong learning scenario, where tasks come from different distributions, traditional meta-learning methods struggle to learn the similarity of tasks across domains. To overcome this issue, we introduce the Cross Domain Triplets Network, shown in the left half of Fig. 2. This network consists of a shared-weight triplets network that is designed to learn the similarity representation of domain-invariant features. In conventional deep neural networks, the features become more specific toward the last layers, causing a transferability gap that increases with the differences between domains. To tackle this problem, in the first step, we aim to reduce the inter-domain differences between samples and increase the differences within the same domain. This improves the diversity of the sample distribution, leading to better extraction of domain-invariant features. We measure the inter-domain distribution differences using the maximum mean discrepancy (MMD), as described in Eqs. (1) and (2).
where x and y are the samples from the knowledge base and the current task respectively, f is the mapping function (here, the deep neural network), and m and n are the numbers of samples in a batch from the knowledge base and the current task respectively.
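For reference, a common empirical form of the MMD that matches the symbols defined above is given below. This is a sketch that assumes the standard biased estimator is meant; the paper's own Eqs. (1) and (2) may differ in normalization, and $\mathcal{H}$ (the feature space induced by f) is notation introduced here for illustration.

```latex
\mathrm{MMD}(X, Y) =
\left\| \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{j=1}^{n} f(y_j) \right\|_{\mathcal{H}}
```

Under this form, the MMD is zero when the mean embeddings of the knowledge-base batch and the current-task batch coincide, and it grows as the two batches drift apart in feature space.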
To ease calculation, we use the kernel embedding of distributions: the hidden layers of the convolutional neural network (CNN) that are related to the learning task are mapped into a reproducing kernel Hilbert space (RKHS), and the distance between different domains is reduced by the multi-kernel optimization method, as shown in Eq. (3).
where k is the Gaussian kernel function shown in Eq. (4).
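The kernel form of the MMD can be illustrated with a short NumPy sketch. It assumes the usual biased estimator of the squared MMD, k(x, x') + k(y, y') − 2 k(x, y) averaged over all pairs, and approximates the multi-kernel idea by averaging over a few Gaussian bandwidths. The function names, the bandwidth values, and the toy feature batches are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)) for every pair of rows."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigmas=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared MMD, averaged over several bandwidths."""
    total = 0.0
    for sigma in sigmas:
        k_xx = gaussian_kernel(x, x, sigma).mean()
        k_yy = gaussian_kernel(y, y, sigma).mean()
        k_xy = gaussian_kernel(x, y, sigma).mean()
        total += k_xx + k_yy - 2.0 * k_xy
    return total / len(sigmas)

# Toy feature batches (assumed): knowledge base, same-domain task, shifted-domain task.
rng = np.random.default_rng(0)
kb_feats = rng.normal(0.0, 1.0, size=(64, 8))
same_feats = rng.normal(0.0, 1.0, size=(64, 8))
shifted_feats = rng.normal(2.0, 1.0, size=(64, 8))

mmd_same = mmd_squared(kb_feats, same_feats)
mmd_shifted = mmd_squared(kb_feats, shifted_feats)
```

In this toy setup the batch drawn from a shifted distribution yields a larger estimate than the batch drawn from the same distribution as the knowledge base, which is the quantity the CDTN is trained to reduce across domains.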
The proposed CDTN has another purpose, which is to create similarity representations. The inputs for the network are training samples from the knowledge base. The network creates these similarity representations by minimizing the meta-learning function (MLF), as shown in Eq. (5). The distance between the network embeddings for two inputs is minimized if they belong to the same class, and is pushed above a margin value n if they belong to different classes.
where $x_i$ and $x_j$ form an image pair, and $x'$ equals 1 if the two images come from the same class and 0 if they come from different classes. n is a margin parameter that balances the training loss.
Furthermore, to better obtain the cross-domain similarity feature representation, we introduce a low-level feature enhancement module (LFEM). This approach is based on the observation that shallow neural networks possess some degree of robustness to object features across domains. To extract domain-invariant information, we use a self-attention module, as depicted in Fig. 3. In this module, feature map A is first transformed into B, C, and D via three 1 × 1 convolution layers. Then, B and C are reshaped and multiplied, and the attention map S is obtained from the product through the Softmax function. Finally, the feature map D is multiplied by S, and the resulting feature map is added to A to obtain the final feature map E, as shown in Eqs. (6) and (7).
where $s_{ji}$ measures the degree of influence of position j on position i. Positions with similar features reinforce each other, which enhances the ability of feature extraction. The value at each position of E is obtained from a weighted sum of the features at all positions, added to the original feature.
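The self-attention steps described above (three 1 × 1 convolutions, Softmax over the product of B and C, weighting of D, residual addition of A) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the weight shapes, the reduced channel count for B and C, and the choice of normalizing over the source positions j are assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(feat, weight):
    """1x1 convolution: mix channels of a (C_in, H, W) map with a (C_out, C_in) weight."""
    return np.einsum("oc,chw->ohw", weight, feat)

def self_attention(a, w_b, w_c, w_d):
    """Return the enhanced map E and the attention map S for feature map A."""
    channels, height, width = a.shape
    n = height * width
    b = conv1x1(a, w_b).reshape(-1, n)        # (C', N)
    c = conv1x1(a, w_c).reshape(-1, n)        # (C', N)
    d = conv1x1(a, w_d).reshape(channels, n)  # (C, N)
    energy = b.T @ c                          # (N, N): energy[i, j] = B_i . C_j
    s = softmax(energy, axis=1)               # row i: weights over positions j (assumed axis)
    weighted = d @ s.T                        # column i: sum_j s[i, j] * D_j
    e = weighted + a.reshape(channels, n)     # residual connection with A
    return e.reshape(channels, height, width), s

# Toy input (assumed shapes): 4 channels on a 3x3 grid, B and C reduced to 2 channels.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 3, 3))
w_b = rng.normal(size=(2, 4))
w_c = rng.normal(size=(2, 4))
w_d = rng.normal(size=(4, 4))
enhanced, attention = self_attention(feature_map, w_b, w_c, w_d)
```

Here `attention[i, j]` plays the role of the influence of position j on position i, and the residual addition means that E falls back to A when the attended term vanishes.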

C. SOFT ATTENTION NETWORK
As depicted in the right half of Fig. 2, in the lifelong learning stage, to prevent catastrophic forgetting and retain previous task knowledge, we propose a Soft Attention Network (SAN) based on the soft attention mechanism to allocate specific parameters to new tasks. Moreover, we also factor in the similarity information between tasks, which can enhance the performance of related old tasks during new task learning. The attention mechanism is feature-wise instead of channel-wise, allowing the model to learn as many tasks as possible without the need for additional hyperparameters to regulate network unit plasticity. The intermediate similarity feature from the trained Cross Domain Triplets Network is transformed into a channel-matching feature map through four bottleneck layers, and the attention map is then generated via the Sigmoid function, as shown in Eq. (8). This attention map encompasses task-specific features and task similarity information. The attention maps are multiplied with the feature maps of the SAN to yield task-specific features, and the final classification result is obtained through a fully connected layer and Softmax. We use cross-entropy loss and Stochastic Gradient Descent (SGD) to train the network. To ensure that information learned from previous tasks is preserved upon learning a new task, the gradients are conditioned on the attention value. If the attention value of a feature is high, the feature is beneficial to learning the task, and thus its gradient update should be substantial. Conversely, if the attention value is low, the feature is not valuable to the task, and its gradient update should be reduced, as demonstrated in Eq. (8).
where g is the gradient and δ is the learning rate. β is a hyperparameter that controls the gradient update. A low β provides plasticity to the units and capacity for adaptation, but the network may easily forget what it has learned. A high β prevents forgetting, but the network may have difficulty adapting to new tasks.
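The following NumPy sketch illustrates the two mechanisms described in this subsection: generating a Sigmoid attention map from the similarity feature and using it to modulate both the SAN features and the gradient step. The paper's exact equations are not reproduced here. In particular, scaling the gradient by the attention value raised to the power β is a hypothetical rule chosen only because it matches the qualitative behavior described above (high attention gives a larger update, high β suppresses updates); the shapes, the single 1 × 1 convolution used for channel matching, and all names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_map(similarity_feat, w_match):
    """Match the channel count with a 1x1 convolution, then squash into (0, 1)."""
    matched = np.einsum("oc,chw->ohw", w_match, similarity_feat)
    return sigmoid(matched)

def conditioned_sgd_step(weights, grads, attn, lr=0.1, beta=2.0):
    """HYPOTHETICAL rule (not the paper's equation): scale gradients by attn ** beta."""
    return weights - lr * (attn ** beta) * grads

rng = np.random.default_rng(0)
similarity_feat = rng.normal(size=(6, 4, 4))   # intermediate CDTN feature (assumed shape)
san_feat = rng.normal(size=(8, 4, 4))          # SAN feature map F (assumed shape)
w_match = rng.normal(size=(8, 6))              # 1x1 convolution weights, 6 -> 8 channels

attn = attention_map(similarity_feat, w_match)
task_feat = attn * san_feat                    # feature-wise modulation of F

unit_weights = np.zeros_like(san_feat)         # one toy weight per feature unit
unit_grads = np.ones_like(san_feat)
plastic = conditioned_sgd_step(unit_weights, unit_grads, attn, beta=0.5)
stable = conditioned_sgd_step(unit_weights, unit_grads, attn, beta=4.0)
```

With attention values in (0, 1), the larger β produces smaller steps for every unit, which mirrors the trade-off between plasticity and forgetting stated in the text. Setting β to 0 recovers a plain SGD step.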

D. LOSS FUNCTION
The loss function of the proposed framework is divided into two parts. In the CDTN, we minimize the MMD between the current task and the knowledge base to learn features that are invariant across domains, and we minimize the MLF to learn the similarity information. The joint loss function is given in Eq. (9), where $L_{MMD}(x_i, y_i)$ and $L_{MMD}(x_i, x_j)$ are the inter-domain loss and the intra-domain loss respectively, $L_{MLF}(x_i, y_j)$ is the meta cross-entropy loss, and θ is a hyperparameter that balances the two loss functions. In the SAN, the loss function is the cross-entropy loss for classification tasks.
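The MLF term is described in Section III-B as pulling same-class embeddings together and pushing different-class embeddings at least a margin n apart. One loss with exactly that behavior is the standard contrastive form sketched below. Whether the paper's Eq. (5) uses squared distances or this exact weighting is an assumption, and the function name and toy embeddings are hypothetical.

```python
import numpy as np

def contrastive_loss(emb_i, emb_j, same_class, margin=1.0):
    """Mean contrastive loss over pairs.

    same_class is 1.0 for pairs from the same class and 0.0 otherwise.
    Same-class pairs are penalized by their squared distance; different-class
    pairs are penalized only when they are closer than the margin.
    """
    dist = np.linalg.norm(emb_i - emb_j, axis=1)
    pull = same_class * dist ** 2
    push = (1.0 - same_class) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push))

# Three toy pairs: identical same-class pair, a close different-class pair,
# and a distant different-class pair (distances 0, 0.5 and 3).
emb_i = np.zeros((3, 2))
emb_j = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
same_class = np.array([1.0, 0.0, 0.0])
loss = contrastive_loss(emb_i, emb_j, same_class, margin=1.0)
```

Only the second pair contributes here: it is a different-class pair at distance 0.5, inside the margin of 1, so it adds (1 − 0.5)² = 0.25 before averaging over the three pairs.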

IV. EXPERIMENT
We evaluate the effectiveness of our lifelong learning approach through several experiments and analyze how each component contributes to the results.

A. DATASET AND TRAINING DETAILS
For our experiments, we have selected 8 image classification tasks [60], [61], [62], [63], [64], [65], [66], [67]. We randomly choose a portion of the data as the knowledge base. The datasets have class numbers ranging from 10 to 100, with 80% of the data used for training and 20% for testing. In our experiments, we use the ResNet-18 [69] architecture as our backbone and SGD as the optimization method. We train our network with the knowledge base in two stages with different learning rates. We reduce the learning rate if the validation loss does not improve and stop the training early if needed. We use 64 samples per batch and the same settings for all methods. We run our method on a computer with a CPU, a GPU, and 32 GB of RAM.

B. EVALUATION CRITERIA
To assess the generic performance of the model, we calculate the average accuracy (AA) on all the testing datasets from tasks t to T after training each task t. The AA is defined as in Eq. (10). A higher AA value indicates better performance of the model.
where TP and TN are the numbers of correctly classified positive and negative samples, $P_t$ and $N_t$ are the numbers of positive and negative samples for task t, and T is the total number of tasks.
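A minimal sketch of the AA computation, assuming it is the mean over tasks of the per-task accuracy (TP + TN) / (P_t + N_t). The exact form of Eq. (10) is not reproduced here, and the function name and the counts are made up for illustration.

```python
def average_accuracy(per_task_counts):
    """per_task_counts: list of (tp, tn, p, n) tuples, one tuple per task."""
    accuracies = [(tp + tn) / (p + n) for tp, tn, p, n in per_task_counts]
    return sum(accuracies) / len(accuracies)

# Two toy tasks with per-task accuracies 85/100 and 60/80.
counts = [(40, 45, 50, 50), (30, 30, 40, 40)]
aa = average_accuracy(counts)
```

Averaging per-task accuracies, rather than pooling all samples, gives every task the same weight regardless of its test-set size.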
We also use the average forgetting rate (AF) in Eq. (11) to measure how much the model forgets the old knowledge. The lower the AF, the better the model remembers the old knowledge. If the AF is negative, it means the model improves the old task when learning a new task.
where $A_\tau$ is the accuracy obtained on a task when it was the latest task, $A_t$ is the accuracy on that task when it is tested on the current model obtained after the latest task, and T is the number of tasks. Note that the forgetting rate of the latest task is 0.
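A minimal sketch of the AF computation, assuming it is the mean over tasks of the drop from the accuracy measured right after a task was learned to the accuracy measured on the current model. The exact form of Eq. (11) is not reproduced here, and the function name and accuracy values are made up for illustration.

```python
def average_forgetting(acc_when_learned, acc_current):
    """Mean of (accuracy right after learning a task) - (accuracy on the current model)."""
    drops = [a_tau - a_t for a_tau, a_t in zip(acc_when_learned, acc_current)]
    return sum(drops) / len(drops)

# Three toy tasks: the first improves, the second degrades, the latest is unchanged.
acc_when_learned = [0.80, 0.70, 0.90]
acc_current = [0.82, 0.65, 0.90]
af = average_forgetting(acc_when_learned, acc_current)
```

A negative value arises when old tasks score higher on the current model than they did right after they were learned, which is the backward-transfer case highlighted in the text.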

C. PERFORMANCE EVALUATION 1) COMPARATIVE EXPERIMENTS
We compare the AA and AF of different methods after 8 tasks in Table 1. Our method has the best AA of 70.21%.
Other methods like DER [20] have good AA but require more network parameters and computation. Methods like EWC [7] and IMM [72] have lower AA due to gradient update issues. Methods like GEM [41] and ICARL [9] perform well, but their performance drops as the number of tasks increases and the number of stored old samples decreases. GAN-based methods [73] can generate enough samples and achieve good results, but they need more resources.
Our method has the lowest average forgetting rate, AF (-0.07%), which means that learning new tasks helps previous tasks perform better. This is because the tasks are similar, so new tasks can fine-tune the parameters of earlier tasks. Methods based on weight regularization, such as EWC [7] and IMM [72], have trouble updating parameters with new tasks. As more tasks are learned, the gradient direction becomes more restricted and forgetting increases. Methods based on rehearsal, like GEM [41] and ICARL [9], can lower forgetting, but they need more training samples for earlier tasks as the number of tasks grows. Otherwise, the average forgetting rate goes up. Some methods based on attention, such as PackNet and HAT [71], use fixed weight masks to avoid forgetting, but they ignore the similarities between tasks. As a result, learning new tasks does not improve previous tasks, and these methods cannot handle many tasks.
We tested how well our method can avoid catastrophic forgetting by learning eight tasks and measuring the forgetting rate against other methods (see Fig. 4). The multi-task learning method learns all tasks at once and never forgets. Our method improves previous parameters by using task similarity and performs better on new tasks that are related to old ones. This leads to a low forgetting rate. Other methods only try to reduce forgetting. PackNet [12] and HAT [71] have limited capacity and do worse on new tasks than our method, but they keep all the knowledge by locking task parameters with masks. EWC [7] and IMM [72] still forget over time because they do not solve the forgetting problem completely. GEM [41] and ICARL [9] also forget little, but they need to store training samples for use when learning new tasks, which takes more space.

2) EFFECTS OF MODEL CAPACITY
Network capacity is important for lifelong learning. A model with high capacity can learn more tasks. Fig. 5 shows how the model uses its parameters. When a new task is learned, more weights are used. During training, the usage rate drops slowly at first, then faster until it stops. This means the network can be made smaller by 10% to 50%, depending on the task. When learning task 4, fewer new parameters are used because it is similar to task 2. The method uses the task similarity to improve learning. But when learning task 8, with no similar task before, the usage amount goes up by about 10% in top 5 tasks. The method uses fewer parameters than PackNet (25% to 80%) and HAT (15% to 70%) when learning similar tasks. Table 2 shows how well the model can do multi-task classification. The accuracy stays the same even when learning 10 tasks in the CIFAR-100 dataset, without forgetting. When more tasks are added, the performance on old tasks improves. This is because the method uses the similarity between tasks and the sparsity from the loss function to learn many tasks continuously.

3) ABLATION STUDY
To evaluate the contribution of each module of the proposed method, an ablation study was conducted. In the study, one part of the method was omitted while the rest was kept intact. The average accuracy and average forgetting rate are shown in Table 3.
The results showed that the CDTN alone increased the average accuracy by about 4% and reduced the average forgetting rate by nearly 0.2%. This suggests that the task similarity information helps in learning new tasks. Moreover, the LFEM in the lifelong learning steps boosted the average accuracy by more than 2%, indicating that the LFEM is very effective. Fig. 6 illustrates the visualization results of similarity when learning three binary classification tasks. As more tasks are learned, CDTN can recognize the similarities among tasks, resulting in a small distance between similar tasks. This maximizes the utilization of parameters to ensure model accuracy.

V. CONCLUSION
We propose a new lifelong learning method that uses the task similarity to retain previous task information while learning new tasks. First, we use a Cross Domain Triplets Network (CDTN) to learn the similarity representation between tasks across different domains. Then, we use a Soft Attention Network (SAN) to assign different weights to different tasks in the deep network based on the similarity representation which effectively captures task similarity information. It significantly improves the lifelong learning performance compared to previous methods when dealing with a number of sequential tasks. In addition, our method avoids catastrophic forgetting and enhances the performance of old tasks when learning new tasks. In the future, we plan to add distillation technology to our framework to further improve our model.