Gating Mechanism in Deep Neural Networks for Resource-Efficient Continual Learning

Catastrophic forgetting is the well-known tendency of a deep neural network in continual learning to forget previously learned knowledge when optimizing for sequentially incoming tasks. To address the issue, several methods have been proposed in research on continual learning. However, these methods cannot fully preserve the previously learned knowledge when training for a new task. Moreover, these methods are susceptible to negative interference between tasks, which may lead to catastrophic forgetting. The interference becomes increasingly severe when there exists a notable gap between the domains of tasks. This paper proposes a novel method of controlling gates to select a subset of parameters learned for old tasks, which is then used to optimize a new task while efficiently avoiding negative interference. The proposed approach executes the subset of old parameters that provides positive responses by evaluating the effect of using the old and new parameters together. The execution or skipping of old parameters through the gates is based on multiple responses collected across the network. We evaluate the proposed method in different continual learning scenarios involving image classification datasets. By applying the proposed gating mechanism, which selectively involves the set of old parameters that provides positive prior knowledge to newer tasks, the proposed method outperforms other competitive methods and requires fewer parameters than state-of-the-art methods during inference. Additionally, we further demonstrate the effectiveness of the proposed method through various analyses.


I. INTRODUCTION
Deep neural networks generally access the complete data of all tasks when learning multiple tasks [1], [2]. A more challenging scenario of learning multiple tasks, known as continual learning [3]-[13], assumes that a task is observed at a specific time without access to the data of the previous tasks. When tasks appear sequentially, a deep learning model prioritizes the current task but forgets the knowledge of previous tasks. This phenomenon is called catastrophic forgetting [3], [14], which represents a major obstacle to the success of continual learning.
The existing research on continual learning primarily addresses the problem of forgetting the knowledge of previous tasks. Recently proposed methods attempt to prevent the forgetting of previous knowledge while exploiting current information. Methods for continual learning can be categorized as regularization [3]-[5], replay [6]-[8], dynamic architecture [9]-[11], and structural allocation [12], [13] approaches. (The associate editor coordinating the review of this manuscript and approving it for publication was Miaohui Wang.)
The regularization-based strategy [3]-[5] identifies important parameters and prevents their update while learning the knowledge of the current task. However, the ability to curb changes in the parameter values is limited, especially when learning a long sequence of tasks. The replay-based strategy [6]-[8] stores a small number of training examples. The stored set is utilized to perform joint training with the set of the current task [6], [7] or to seek key parameters that retain the knowledge of previous tasks [8]. Note that because the same examples are reused when learning subsequent tasks, overfitting may occur [15]. The dynamic architecture-based strategy [9]-[11] generally introduces new learnable parameters when a new task is observed [9] or when the network fails to achieve a predetermined criterion on loss or validation accuracy [16]. However, this approach incurs a higher computational cost than other approaches, which reduces its applicability. The structural allocation strategy [12], [13], [17] assigns task-specific parameters in a fixed network and prevents the update of the previous parameters while training a current task. In the approach proposed in [12], all parameters, including the previous ones, are considered in training a current task. This may incur negative interference between tasks due to the indiscriminate use of previous parameters (knowledge) with the new ones.
The proposed method is a type of structural allocation strategy. Unlike other kinds of strategies, the structural allocation approach assigns a disjoint set of parameters for a task and prevents the rewrite of previous parameters [12]. In other words, such methods do not update the previous parameter sets and thus do not forget the knowledge of the previous tasks. Despite their notable advantage of protecting the previous knowledge, indiscriminate use of the previous parameter set when learning a new task may negatively affect the optimization of the network. This aspect highlights the importance of associating previous parameters that provide a positive response to network optimization while skipping other parameters that generate negative interference.
In this study, we establish a novel method to selectively skip previous parameters that negatively interfere with the current task based on the proposed gating mechanism. The proposed method includes a feature extractor consisting of units (disjoint sets of parameters) and multiple classification heads for sequential tasks. For each observed task, the proposed method first allocates parameters in an element-wise manner into disjoint groups through three steps: training, pruning, and retraining. When learning a new task, the gates are controlled to execute or skip the previous parameters.
To this end, we exploit the previous parameters that yield a positive network response to the current task for the gate control. In particular, two main responses to the current task are considered, namely, (i) the effect of the high-level feature and (ii) the amount of information in the lower-level features. The high-level features represent the response obtained from the end of the network, i.e., loss, and low-level features correspond to a feature map response from an intermediate layer of the network. By controlling the gates using both types of features, the proposed method effectively skips previous parameters that negatively interfere with the current task. Furthermore, the proposed method does not induce a memory overhead to store data of previous tasks, as in replay-based approaches [6]- [8].
We apply the proposed method to a range of continual learning scenarios using the CIFAR-100 [18] and ImageNet-50 [19] datasets. Experiments are conducted in which the number of tasks and the size of the backbone model are varied for each dataset. Furthermore, we obtain results for additional scenarios involving different semantic information between tasks. Experimental results show that the proposed method outperforms existing continual learning approaches regardless of the similarity in the task domains. The contributions of this work can be summarized as follows:
• We propose a novel method to explore a task-specific network structure by controlling the gate of each unit guided by the high- and low-level responses from the network.
• The proposed method can effectively detect previously learned parameters that are helpful in accomplishing the current task from the responses.
• Experimental results show that the proposed method not only minimizes the harmful interference between tasks but also requires fewer parameters to perform the task.
We briefly introduce related work in continual learning in Section II. In Section III, we describe the proposed method with the gating mechanism. In Section IV, we present experimental results of the proposed method along with those of compared approaches. Finally, we conclude this work in Section V.

II. RELATED WORK
This section introduces four different types of strategies in continual learning and gated neural network methods.

A. CONTINUAL LEARNING
Regularization strategies [3]-[5], [20] identify key parameters of older tasks that must be protected while learning a new task. The importance of the parameters is determined through a sensitivity measure [3], the change in loss [4], or the derivative of the output of the network [5]. However, the consolidation of important parameters weakens the learning efficiency when addressing a long sequence of tasks [21]. Replay strategies [6]-[8], [22] store a small set of training examples to replay when training on a current task. This line of methods employs the stored set to jointly train the network with current data [6], [7], [22]. Another approach [8] alternatively uses the stored set for gradient estimation with important parameters of old tasks. However, the strategy incurs an additional memory overhead to store the examples. In addition, overfitting may occur due to repeated training with a small fixed number of replay data.
Dynamic expansion strategies [9]- [11], [16] increase the size of the model through a pre-determined criterion for loss or accuracy. References [9] and [11] proposed new dynamic models to accommodate each incoming task. References [10] and [16] attempted to expand the network size by adding new learnable parameters. However, the dynamic expansion strategy involves a notable limitation as it expands the size of the network whenever a pre-determined criterion is not satisfied. Consequently, the computational cost increases proportionally to the network size, rendering it difficult to apply this approach in practical problems.
Structural allocation strategies [12], [13], [17], [23] generally allocate the parameter sets in a single feature extractor through pruning [12], learnable binary masks [17] or the attention mechanism with gradient descent [13] for each task. To maintain previous knowledge, the approach updates the current set of parameters [12], [17] while preventing the previous ones from being updated or restricting the update of critical parameters of the earlier tasks. However, if all the previous parameters are used to perform optimization for the current task [12], negative interference among tasks may be introduced, which may be especially severe for tasks of different domains.

B. GATED NETWORK
In gated networks [24]-[27], gates are controlled to identify a suitable computational path of the network for a given task. However, to search for a computational path or its corresponding subnetwork, additional networks or modules are required [24]-[27]. In addition, these approaches may execute or skip entire residual blocks [25], [26] or channels [24], [27] during inference. Moreover, these methods have been established for scenarios in which all data are simultaneously accessed. In contrast, the proposed method is applied to a more challenging continual learning scenario in which tasks appear sequentially. Because the domains of tasks are unknown in real-world scenarios, we attempt to determine an optimal computational path in a single backbone architecture for each task. In addition, the proposed method does not require additional search modules for the gating mechanism. Instead, the gates are controlled using the low- and high-level information collected within the network.

III. GNN-CL: GATED NEURAL NETWORK IN CONTINUAL LEARNING
In this section, we describe the proposed gated neural network for continual learning, named GNN-CL. Section III-A introduces the framework of GNN-CL and the issues of existing works. Sections III-B and III-C describe the two responses used for the gating mechanism. The mathematical symbols used in this work are listed in Table 1.

A. FRAMEWORK
GNN-CL aims to skip previously learned parameters that may cause negative task interference. It controls the gates using two responses of the network, which are collected from the intermediate part and end of the network. The framework follows the structural allocation strategy, in which each task-specific parameter set does not overlap those of other tasks.
The problem of interest pertains to the learning of sequentially incoming tasks T = {T_1, T_2, · · · }, where each task is observed at a specific time without access to the data of earlier tasks. The base network has a single feature extractor f(·) and task-specific classifiers c_{w_i}(·) parameterized with w_i. The feature extractor f(·) consists of L units, wherein a unit can be a layer or a group of layers. Specifically, we define the feature extractor with L units as

f(·) = (U_L ∘ U_{L-1} ∘ · · · ∘ U_1)(·),

where U_l denotes the l-th unit and ∘ is the function composition operator. Similarly, we define the output of the l-th intermediate unit as

f_l(x) = (U_l ∘ · · · ∘ U_1)(x).

Initially, all parameters in f(·) are initialized to α^0 such that α^0 = ⋃_{l=1}^{L} α^0_l. When the first task is observed, α^0 is assigned to the first task as θ^1. T_1 is trained by updating θ^1 in the feature extractor f and w_1 in the classifier c_{w_1}. After training T_1, redundant parameters in θ^1 are removed from the feature extractor to provide space for the subsequent task. The discarded parameters are re-initialized as α^1, whereas the surviving parameters are fine-tuned to produce θ̂^1. Finally, the parameters in the feature extractor are composed of θ̂^1, which performs the first task, and α^1, which is allocated to the next task. For example, the parameter set of the l-th unit U_l becomes [θ̂^1_l, α^1_l]. When the i-th task is observed, we allocate the initialized parameters α^{i−1} as θ^i. From this, one can predict ŷ^i for T_i using all previous parameters [12]:

ŷ^i = c_{w_i}(f(x^i; [θ̂^1, · · · , θ̂^{i−1}, θ^i])).

The objective function for the i-th task is

min_{θ^i, w_i} L(ŷ^i, y^i),

where L denotes the classification loss. This objective updates the newly allocated parameters θ^i and w_i, while the other parameters are kept constant. After learning T_i, the sparse parameter set θ̂^i is obtained through pruning and retraining. Similarly, future tasks are treated with the train-prune-retrain strategy. However, the approach that uses all parameters of previous tasks [12] when learning the current task is exposed to potential risks of negative interference between tasks.
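The train-prune-retrain allocation described above can be sketched numerically. The following is a minimal, illustrative implementation of magnitude-based parameter allocation over a single weight tensor, not the authors' code; the names (allocate_task, free_mask, keep_ratio) are our own, and training is assumed to happen between calls.

```python
import numpy as np

def allocate_task(weights, free_mask, keep_ratio=0.25):
    # Magnitude pruning over the free (not-yet-owned) parameters:
    # keep the largest-magnitude fraction for the current task and
    # release the rest for future tasks.
    magnitudes = np.abs(weights) * free_mask
    n_keep = max(1, int(keep_ratio * free_mask.sum()))
    thresh = np.sort(magnitudes[free_mask])[-n_keep]
    task_mask = (magnitudes >= thresh) & free_mask
    new_free = free_mask & ~task_mask  # released for subsequent tasks
    return task_mask, new_free

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
free = np.ones_like(w, dtype=bool)
m1, free = allocate_task(w, free)  # ownership of task 1 (theta_hat^1)
m2, free = allocate_task(w, free)  # ownership of task 2 (theta_hat^2)
assert not (m1 & m2).any()         # task parameter sets are disjoint
```

Because each task owns a disjoint mask and owned parameters are frozen, earlier tasks are never overwritten, which is the core property of the structural allocation strategy.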
When the domains of the current and previous tasks are different, harmful information from the previous tasks may disturb the learning of the current task. To mitigate this phenomenon, the proposed method provides a parameter selection approach by introducing and controlling gates. Figure 1 presents a conceptual illustration of the gating mechanism. The figure on the left represents the network before controlling the gates. The operations of unit l with and without the use of the previous knowledge are presented in the middle. In the units, we obtain the information I^{(i,t)}_l by executing the i-th previous set of parameters θ̂^i_l and the t-th new set of parameters θ^t_l. Furthermore, we obtain the loss L^{(i,t)}_l by further executing c_{w_t}. Similarly, we obtain the information I^{(t)}_l and loss L^{(t)}_l using only θ^t_l (without θ̂^i_l). The gated network of the l-th unit is shown in the figure on the right.

B. LOW-LEVEL RESPONSE
To select previous parameters that are helpful in accomplishing the current task, we first consider the outputs of the intermediate layers of the network. Specifically, we use the low-level features of the network, i.e., feature maps, and control the gates based on the information of the feature maps generated by the previous sets of parameters.
The proposed method learns a new task by using the parameters that generate feature maps with rich information for the given task. To measure the relative amount of information with respect to different usages of parameters, it employs singular value decomposition (SVD) [28]. The singular values of feature maps can reflect the amount of information [29]. We denote the feature maps obtained in the l-th unit using both the current and the i-th previous parameter sets, and using only the current parameter set, as h^{(i,t)}_l(x^t) and h^{(t)}_l(x^t), respectively. The SVD of a feature map is

h_l(x^t) = Σ_{k=1}^{K} u_k s_k v_k^T,

where u_k and v_k denote the left and right singular vectors, respectively, and s_k is the k-th singular value of h_l(x^t). The decomposition can be divided into two terms by the k'-th rank, where the left (low-rank) term Σ_{k=1}^{k'} u_k s_k v_k^T contains a considerable amount of information and the right (high-rank) term Σ_{j=k'+1}^{K} u_j s_j v_j^T contains relatively insignificant information. Consequently, the amount of feature information is dominant in the left term. We use the singular values that reflect the information of a feature map as

S(h_l(x^t_j)) = Σ_{k=1}^{K} s_k,

where s_k is the k-th singular value of the feature map and S(·) is the sum of the singular values of a feature map for input image x^t_j. I^{(i,t)}_l and I^{(t)}_l represent the average feature map information in the l-th unit, calculated as

I^{(i,t)}_l = (1/N) Σ_{j=1}^{N} S(h^{(i,t)}_l(x^t_j)),   I^{(t)}_l = (1/N) Σ_{j=1}^{N} S(h^{(t)}_l(x^t_j)).

If I^{(i,t)}_l > I^{(t)}_l, the i-th parameter set provides richer positive information for the current task T_t. Considering this aspect, we define the gate control g^i_l that applies the set of previous parameters θ̂^i_l in the l-th unit as follows:

g^i_l = 1 if I^{(i,t)}_l > I^{(t)}_l, and g^i_l = 0 otherwise.

The feature extractor f(x^t) incorporated with the gate control is denoted as f_g(x^t).

VOLUME 10, 2022
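The low-level response can be sketched as follows, assuming 2-D feature maps; the helper names (info, low_level_gate) are illustrative and not from the paper. The information measure is the sum of singular values (the nuclear norm), and the gate opens only when executing the previous parameter set enriches the average information over a batch.

```python
import numpy as np

def info(fmap):
    # S(.): sum of singular values of a 2-D feature map,
    # used as the measure of feature information.
    return np.linalg.svd(fmap, compute_uv=False).sum()

def low_level_gate(maps_with_old, maps_without_old):
    # Gate g_l^i: open (1) only if executing the i-th previous parameter
    # set increases the average feature information over the batch.
    i_with = np.mean([info(h) for h in maps_with_old])
    i_without = np.mean([info(h) for h in maps_without_old])
    return int(i_with > i_without)

# A full-rank map carries more information than a rank-1 map of
# comparable scale, so the gate opens for the richer response.
rich = [np.eye(4)]           # singular values 1,1,1,1 -> S = 4
flat = [np.ones((4, 4)) / 2] # rank-1, singular value 2 -> S = 2
assert low_level_gate(rich, flat) == 1
assert low_level_gate(flat, rich) == 0
```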

C. HIGH-LEVEL RESPONSE
The proposed method takes the loss as another response for the gate control and explores the sets of previous parameters that incur a small loss for the new task. The loss is employed as a measure to find relevant tasks [30], [31]. We seek the sets of previous parameters that further minimize the loss when used together with the current parameter set. We denote the losses incurred in unit l when using only the current parameters and when using both the current and the i-th previous sets of parameters as L^{(t)}_l and L^{(i,t)}_l, respectively. Likewise, L^{(i,t)}_l is compared with L^{(t)}_l to control the gates associated with the previous parameters.
Finally, the gate control, an improved version of Eq. (9), for the i-th previous parameter set is defined as

ĝ^i_l = 1 if I^{(i,t)}_l > I^{(t)}_l and L^{(i,t)}_l < L^{(t)}_l, and ĝ^i_l = 0 otherwise.

The gate control is implemented by minimizing the loss of the current task while maximizing the information. After the gate control, the final network can be expressed as f_ĝ(x^t). In summary, the proposed method mitigates the negative interference between tasks through the gate control, which enriches the information of the feature maps while minimizing the loss for the new task.
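The combined gate and the resulting gated forward pass can be sketched as follows. This is a schematic, not the authors' implementation: the unit/branch dictionary structure and all names (combined_gate, gated_forward) are our own, and the branches are stand-ins for computations over the current and previous parameter sets.

```python
def combined_gate(info_with, info_without, loss_with, loss_without):
    # Improved gate: execute the i-th previous parameter set only if it
    # both enriches feature information and lowers the current-task loss.
    return int(info_with > info_without and loss_with < loss_without)

def gated_forward(x, units, gates):
    # Gated feature extractor sketch: each unit has a branch over the
    # current-task parameters and optional branches over previous-task
    # parameters; a previous branch contributes only when its gate is open.
    h = x
    for unit, unit_gates in zip(units, gates):
        out = unit["new"](h)
        for g, old_branch in zip(unit_gates, unit["old"]):
            if g:  # execute helpful previous parameters, skip harmful ones
                out = out + old_branch(h)
        h = out
    return h

units = [
    {"new": lambda h: 2 * h, "old": [lambda h: h]},       # unit 1
    {"new": lambda h: h + 1, "old": [lambda h: 10 * h]},  # unit 2
]
gates = [[1], [0]]  # keep the old branch in unit 1, skip it in unit 2
assert gated_forward(1.0, units, gates) == 4.0
```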

IV. EXPERIMENTS
We compared the proposed method, GNN-CL, with other continual learning methods. The datasets used in the experiments and the implementation details are discussed in Sections IV-A and IV-B, respectively. Sections IV-C and IV-D present the results on ImageNet-50 and CIFAR-100, respectively. In Section IV-E, we analyze the proposed gate mechanism, including an ablation study. All experiments were conducted using the PyTorch library [32] and an NVIDIA 2080Ti GPU.

A. DATASETS
We applied the proposed method to various task-incremental scenarios in continual learning. The datasets used in the experiments included ImageNet-50 [19] and CIFAR-100 [18]. Following the procedure specified in [22], ImageNet-50 was constructed by randomly selecting 50 subclasses from the original ImageNet-1K [19] dataset and resizing the images to a resolution of 32 × 32. The CIFAR-100 dataset [18] was split in the order of the labels provided.
In particular, we composed a scenario by splitting the dataset into 20 tasks (super-classes), each of which involved five classes. The characteristics of the datasets used in the experiments are summarized in Table 2.
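The label-order split described above can be sketched as follows; the helper name (split_by_label_order) is hypothetical. Shuffling the class list before partitioning would give the random-order variant used in later experiments.

```python
def split_by_label_order(class_labels, num_tasks):
    # Partition the sorted class labels into num_tasks contiguous,
    # equally sized groups: one group of classes per sequential task.
    classes = sorted(set(class_labels))
    per_task = len(classes) // num_tasks
    return [classes[t * per_task:(t + 1) * per_task]
            for t in range(num_tasks)]

# CIFAR-100: 100 classes -> 20 tasks of 5 classes each.
tasks = split_by_label_order(range(100), num_tasks=20)
assert len(tasks) == 20 and tasks[0] == [0, 1, 2, 3, 4]
```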

B. IMPLEMENTATION DETAILS
We used the ResNet-20 [33] and WideResNet-28-2 (WRN-28-2) [34] backbone networks. The proposed and existing methods have a classification head (fully connected layer) for each task. We defined each task as learning a set of classes at a time. Each experiment was conducted by dividing the dataset into sequential tasks according to the provided labels. Three criteria were considered to divide the dataset: label order (L), random order (R), and super-class order (S). The random order strategy shuffled the labels and divided them in the resulting order to produce sequential tasks. Table 3 summarizes the settings of the different scenarios. The experiment based on ResNet-20 (WRN-28-2) for 20 tasks constructed by the random order is denoted as N-R-20 (W-R-20). To show the effectiveness of the proposed method, we compared it with EWC [3], SI [4], and MAS [5] from the regularization strategy, GEM [8] from the replay strategy, and PackNet [12] from the structural allocation strategy.
In the experiments, we compared ours with a joint training method (Joint) using all the data and a fine-tuning method (Fine-tune) trained using data from only the current task. In addition, we compared the proposed approach with a naïve version of the replay method (Replay) [35], which randomly stores a part of the previous data and performs joint training with the current data.
All datasets in the experiments were subjected to random horizontal flip and random cropping augmentation during training. The proposed method trained the network until convergence with an initial learning rate of 0.01 using stochastic gradient descent with a momentum of 0.9. The learning rate was multiplied by 0.1 at epochs 50 and 75. Note that PackNet and the proposed method discarded approximately 75% of the parameters in each task, pruned according to the absolute values of the parameters. We controlled the gates for the last three units of the network in the main scenarios, with each unit corresponding to a ResNet block [33]. Results for different numbers of controlled units are presented in the ablation study.
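The choice of which units to gate can be sketched as a simple schedule; the helper name (gated_units) is illustrative. Only the last few units are gated, while earlier units always execute all previously learned parameters.

```python
def gated_units(num_units, num_gated=3):
    # True for units that receive a gate: only the last `num_gated`
    # units; earlier units always use all previous parameters.
    return [l >= num_units - num_gated for l in range(num_units)]

# ResNet-20 organized as 9 units: gates applied to the last three only.
assert gated_units(9) == [False] * 6 + [True] * 3
```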

C. ImageNet-50 RESULTS
We applied the proposed method to ImageNet-50 divided into 5, 10, or 25 sequential tasks. The classes in the ImageNet-50 dataset were randomly selected from the original dataset [19]. We trained all the methods until convergence and derived the average accuracy over all tasks after training on the last task. Table 4 (top) summarizes the results obtained using ResNet-20 [33] on ImageNet-50. Note that the best and second-best results are boldfaced and underlined, respectively. Overall, the average accuracy increases as the number of tasks increases because the number of classes in each task decreases. The fine-tuning method exhibits inferior performance, with average accuracies of 25% and 50% for the cases involving 10 and 25 tasks, respectively. This finding highlights the importance of retaining the helpful knowledge of previous tasks. For ImageNet-50 divided into five tasks, the GEM-1K and Replay-2K methods show slightly higher performance than the other continual learning strategies. However, these methods require additional memory to store examples of previous tasks. The regularization strategies, EWC and SI, outperform PackNet by 2.06% and 2.79%, respectively; however, their performance is inferior to that of the proposed GNN-CL (by 5.57% and 4.84%, respectively). The performance of the proposed method is the closest to that of joint training and higher than that of the other competitors. Specifically, the proposed method achieves accuracies that are 1.51% and 1.88% higher than those attained using GEM-1K in the cases involving 10 and 25 tasks, respectively. Furthermore, GNN-CL outperforms PackNet [12], which adopts all previous parameter sets, regardless of the number of tasks (performance enhancements of 7.63%, 6.67%, and 1.88% in the scenarios involving 5, 10, and 25 incremental tasks, respectively).
Table 4 (bottom) presents the average accuracy on ImageNet-50 using WRN-28-2, with the difference in the average accuracies between WRN-28-2 and ResNet-20 given in parentheses. The performance of the methods is enhanced as the size of the network increases. However, the regularization methods, EWC and SI, achieve a lower accuracy when using the larger network. A similar trend has been reported in [15]. In contrast to EWC and SI, the performance of MAS is enhanced when WRN-28-2 is implemented. The replay approach, GEM-1K, shows the most competitive performance against the proposed method. Moreover, this approach exhibits a consistent performance improvement regardless of the number of tasks. The structural allocation methods, PackNet and ours, improve performance as the larger-scale network is used. The proposed method outperforms PackNet by 4.36%, 3.62%, and 2.8% when the number of tasks is 5, 10, and 25, respectively. The results indicate that the negative interference between tasks can be reduced through the proposed gate mechanism for networks of different sizes.

TABLE 5. Continual learning results of the compared methods on CIFAR-100 with respect to average accuracy (%) using ResNet-20 (N) and WRN-28-2 (W).
In addition, we present the average accuracies associated with using ResNet-20 in Figure 2. The accuracy at the rightmost point matches the values presented in Table 4, as the average accuracy over all tasks is considered. The accuracies of GNN-CL and PackNet are slightly lower than those of the other methods after learning the first task. Unlike the other methods, which perform the first task using all network parameters, the structural allocation methods show a marginal decrease in accuracy because certain parameters are pruned for the subsequent tasks. The proposed method retains the previous knowledge and learns the current task using only the previous parameters that provide positive responses, thereby outperforming PackNet.

D. CIFAR-100 RESULTS
We validated the proposed method on the CIFAR-100 dataset [18]. Specifically, experiments were conducted with sequences of 5, 10, and 20 tasks on the ResNet-20 network [33]. We divided CIFAR-100 into tasks by the label order. Table 5 (top) summarizes the results obtained using the compared approaches. We report the final accuracy, which is obtained by averaging the accuracies over all tasks. The best and second-best results are boldfaced and underlined, respectively. Similar to the previous experiments, the average accuracy increases as the number of tasks increases for most continual learning methods. GEM-1K, which exhibits the most competitive performance on ImageNet-50, performs worse than the structural allocation methods on CIFAR-100. EWC performs worse than SI for the sequence of five tasks but better for 10 tasks. MAS yields unsatisfactory performance, and its accuracies remain unchanged for different numbers of tasks. The performance of another regularization method, SI, deteriorates as the number of tasks increases. The structural allocation approaches outperform the other regularization and replay methods. GNN-CL consistently outperforms PackNet, with margins of 1.99%, 0.31%, and 1.19% for 5, 10, and 20 tasks, respectively. Notably, GNN-CL also outperforms Replay-2K on CIFAR-100, even though it does not outperform this competitor on ImageNet-50. The average accuracy pertaining to ResNet-20 is presented in Figure 3. The results at the rightmost point of each figure are identical to the results presented in Table 5 (top). Table 5 (bottom) lists the average accuracies on CIFAR-100 using WRN-28-2, with the difference in the average accuracies between WRN-28-2 and ResNet-20 given in parentheses. The performance of the replay and structural allocation methods improves with the larger network. In contrast, the regularization methods do not show satisfactory performance for WRN-28-2 as they forget the previous knowledge.
The replay approach, GEM-1K, shows better performance than the regularization methods in most cases. However, GEM-1K achieves inferior results as the number of tasks increases (10 and 20 tasks), contrary to the case of the ImageNet-50 dataset. The structural allocation strategy, PackNet, outperforms the other approaches but underperforms ours by 1.36%, 4.41%, and 5.87% for 5, 10, and 20 tasks, respectively.

E. ANALYSIS
In this subsection, we discuss the effect of different division strategies on the proposed method (IV-E1), the parameter consumption of the network (IV-E2), negative interference (IV-E3), the effect of the number of gating units (IV-E4), and the ablation study of the proposed method (IV-E5).

1) EFFECT OF DIFFERENT DIVISION STRATEGIES
We analyzed the influence of different division strategies for the CIFAR-100 dataset on the considered methods. We report the final average accuracy for the dataset divided by the random label order to produce sequential tasks. Furthermore, we present the accuracy difference between the random division and the division by label order. Table 6 (top) reports the final average accuracy using ResNet-20. The final accuracy is obtained by averaging the accuracies over all tasks. The best and second-best results are indicated in bold font and underline, respectively. The accuracy of the regularization strategy is lower than that of the other strategies. The replay strategy, GEM-1K, shows a higher average accuracy than the regularization strategy. In particular, its improvement is greater than that of the other methods when the number of tasks increases from 10 to 20. However, this approach is less accurate than the structural allocation strategy, regardless of the number of tasks. The structural allocation methods, PackNet and the proposed method, consistently achieve a higher average accuracy than the other strategies for all numbers of tasks. Similar results and trends are observed in the middle of the table when the large-scale network, WRN-28-2, is implemented. Table 6 (bottom) presents the results for CIFAR-100 split by the super-class order. The trend of the results is similar to that of the case in which CIFAR-100 is split by the label order, as shown in Table 5. The proposed method outperforms the other continual learning approaches by a larger margin on average. Figure 4 reports the average accuracy on the tasks of CIFAR-100 split by the super-class order.

2) PARAMETER CONSUMPTION
We analyzed the number of parameters and the corresponding accuracy on CIFAR-100. In particular, we split the CIFAR-100 dataset into 10 tasks by label order. Figure 5 shows the average parameter consumption in each task and the corresponding average accuracy. All strategies except the structural allocation strategy use all parameters of the backbone network to perform the tasks. The structural allocation methods, PackNet and the proposed method, use fewer parameters than the other approaches owing to the pruning step. Moreover, as shown in Figure 5, the proposed method performs better than PackNet while using even fewer parameters.

3) NEGATIVE INTERFERENCE
We implicitly measured the negative interference among tasks by comparing the training loss and its standard deviation for the proposed method with and without the presented gate control mechanism. We used ResNet-20 on ImageNet-50 divided into 10 tasks with random label order. Figure 6 reports the training loss of each task. The loss of the proposed method with the gating mechanism is similar to that without it for the first task. For the subsequent tasks, the training loss of the proposed method is noticeably smaller and more stable than that without the gating mechanism. This shows that the proposed gate control mechanism selectively skips the parameters of previous tasks that negatively affect the optimization of the current task.

4) NUMBER OF GATING UNITS
We investigated the performance of the proposed method under different numbers of gating units. Specifically, we applied the proposed gating module to the last three to nine units in ResNet-20, which has a total of nine units. CIFAR-100 was divided into 10 tasks according to the label order. Figure 7 shows the results of the experiment. From the figure, we observe that controlling a large number of units deteriorates performance. Controlling seven to nine units yields inferior performance compared with controlling three to five units. This finding indicates that applying gates to deeper units, which represent better task-specific features, is more effective and can enhance the performance.

5) ABLATION STUDY
An ablation study of the gate control was conducted under the same experimental setup as in the previous experiment. We compared the proposed method without each response (intermediate feature map or loss) and without both responses, as described in the method section. As shown in Figure 8, the variant without the loss response exhibits the lowest performance. The variant without the intermediate feature-map response also shows unsatisfactory performance. In contrast, when both responses are used, the proposed approach outperforms all ablated variants, indicating that the two responses work complementarily to minimize the negative interference between tasks.

V. CONCLUSION
In this work, we have addressed the catastrophic forgetting issue in continual learning that prevents the efficient optimization of a deep neural network for sequential tasks. To alleviate the issue, we have established a novel method of selecting units with the gate control in structural allocation based continual learning. The proposed gated network employs the previous parameters that are helpful for the current task using two responses collected from the intermediate layers and the end of the network. By selectively using the helpful parameters learned from the previous tasks, the proposed method effectively learns the current task by maximizing the information of the feature maps and minimizing the loss. This framework also reduces the negative interference between tasks. A diverse set of experiments indicates that the proposed method outperforms other continual learning competitors with different learning strategies. The effectiveness of the proposed approach under different learning scenarios was extensively evaluated, and the proposed method exhibited competitive performance among the considered approaches without requiring additional parameters. We have also provided thorough analyses of the proposed method under different experimental setups.