Reduce the Difficulty of Incremental Learning With Self-Supervised Learning

Incremental learning requires a learning model to continuously learn new tasks without forgetting those it has already learned. However, when a deep learning model learns new tasks, it catastrophically forgets the tasks it learned before. Researchers have proposed methods to alleviate catastrophic forgetting, but these methods only consider extracting the features relevant to previously learned tasks while suppressing the extraction of features for tasks not yet learned. As a result, when a deep learning model learns new tasks incrementally, it must quickly learn to extract the features relevant to each newly arrived task; this requires a significant change in the model's feature-extraction behavior, which increases the learning difficulty. The model is therefore caught in a dilemma: reduce the learning rate to retain existing knowledge, or increase the learning rate to learn new knowledge quickly. We present a study that aims to alleviate this problem by introducing self-supervised learning into incremental learning methods. We believe that a task-independent self-supervised learning signal helps the model extract features that are not only effective for the current task but also suitable for tasks it has not yet learned. We give a detailed algorithm that combines self-supervised learning signals with incremental learning methods. Extensive experiments on several datasets show that the self-supervised signal significantly improves the accuracy of most incremental learning methods without requiring additional labeled data. We found that the self-supervised learning signal works best for replay-based incremental learning methods.


I. INTRODUCTION
The long-term goal of artificial intelligence is to build an agent that can act autonomously over a long time; this requires the agent to continuously learn new tasks to adapt to a changing environment and deal with unknown new objects after it has been trained and deployed in the natural environment. However, an agent model based on a deep neural network will forget previously learned tasks when it learns new tasks incrementally. Specifically, when a neural network model learns a new task, if only the samples of the new task are used for training, the recognition accuracy on the previously learned tasks drops significantly. This phenomenon is called the catastrophic forgetting problem [1]. The main reason is that when a neural network model learns new tasks serially, the newly learned knowledge overwrites the old knowledge. A trivial way to prevent catastrophic forgetting is to keep all learned samples and, whenever the model needs to learn a new task, retrain it on all learned tasks simultaneously. This solution is unsustainable because storing all learned samples and retraining the model from scratch requires massive storage space and computing resources; since a mobile agent carries very limited storage and computing resources, it cannot store all the learned samples. The purpose of incremental learning research is therefore to overcome catastrophic forgetting under limited computing and storage resources and provide the agent with the ability to keep learning. In recent years, this problem has received increasing attention from researchers, and many methods have been proposed to solve the catastrophic forgetting problem [2]-[5] through various constraints.

(The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.)
However, these methods do not consider that, in the incremental learning scenario, the feedback signal obtained while learning the current task inhibits the extraction of features relevant to future tasks. For example, when the model learns to recognize dogs, it must treat a cat in the picture as background and not extract the cat's features. Yet when the model later needs to learn a new task incrementally, such as recognizing cats, it must quickly learn to extract the features of the new class (cat); this requires a significant change in the model's feature-extraction behavior, which increases the difficulty of optimization and puts the model in the stability/plasticity dilemma of reducing the learning rate to retain existing knowledge or increasing it to learn new knowledge quickly [6]. Figure 1 schematically shows that the model must respond differently to the same input sample in the above situations. This paper proposes introducing self-supervised learning to alleviate this problem. Self-supervised learning obtains supervision signals from customized pseudo-labels: for example, cropping, resizing, color distortion, and Gaussian noise generate multiple views of an image and the model judges whether two views come from the same picture [7]; the image is rotated by a specific angle and the model predicts the rotation angle [8]; or the image is divided into shuffled blocks and the model predicts their original positions [9]. Because the features extracted by a self-supervised model are generally applicable and independent of the specific subtasks of incremental learning, neural networks trained with self-supervised signals have achieved significant performance improvements on many downstream tasks (such as image classification and object detection) [10], [11].
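As a concrete illustration of the rotation pretext task of [8], the following minimal pure-Python sketch (the function names are ours, not from the paper, and a tiny nested list stands in for an image tensor) generates rotated copies of an image together with the pseudo-labels the model is trained to predict:

```python
import random

def rotate90(img, times):
    """Rotate a 2-D image (list of rows) clockwise by 90 degrees, `times` times."""
    for _ in range(times):
        img = [list(row) for row in zip(*img[::-1])]
    return img

def rotation_pretext_batch(images, k=4):
    """For each image, pick a random rotation in {0, 90, 180, 270} degrees
    and return (rotated_image, pseudo_label) pairs; the model is trained
    to predict the pseudo-label (the rotation index), with no human labels."""
    batch = []
    for img in images:
        label = random.randrange(k)  # pseudo-label: 0..k-1
        batch.append((rotate90(img, label), label))
    return batch
```

In a real pipeline the rotation would be applied to image tensors and the pseudo-label fed to a small classification head, but the labeling logic is the same.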
We integrate the supervision signals of self-supervised learning into the incremental learning process, providing constraints during training that prompt the model to extract not only features effective for the current task but also features potentially useful for tasks not yet learned. When the model then learns new tasks incrementally, it can iterate more smoothly and quickly toward points in parameter space that satisfy both the new and old tasks. Figure 2 illustrates the parameter optimization trajectory when the network is trained with different methods. After completing task one, the model sits in a parameter region with a low error rate for task one. If the model then continues training only on the samples of task two, its parameters iterate to a region with a low error rate for task two, as shown by the red arrow; this region has a high error rate for task one, meaning the model has catastrophically forgotten the knowledge learned from task one. With an incremental learning method, the model instead optimizes toward the yellow region where both task one and task two perform well, as shown by the yellow arrow; however, this region is far from the optimal parameter region of the unlearned task (e.g., task three, the green region marked by the circle). Incremental learning combined with self-supervised learning optimizes the model's parameters to a point that balances the currently learned tasks (task one and task two) while staying closer to the optimal parameter region of the unlearned task (task three), such as the blue region shown by the blue arrow.
By combining different datasets and incremental learning methods, we verify that fusing self-supervised and supervised learning signals significantly improves the accuracy of class incremental learning (class-IL) methods without additional labeled data or storing more samples. In particular, self-supervised learning has a significant effect on replay-based class-IL.

II. RELATED WORK
A. INCREMENTAL LEARNING
Early scholars encountered the catastrophic forgetting problem in the 1980s [1] when studying incremental learning based on neural networks. Kolen et al. found that the main reason is cliff-like changes in the parameter space: small changes in model parameters caused by incremental learning may cause drastic changes in model output [12]. Early related work includes introducing multiple network architectures [13], increasing the independence of features [14], and replaying samples [15], [16]. However, due to the limitations of computing resources, these works only used few samples and shallow neural network architectures. With the popularity of deep neural networks in recent years, more and more works have tried to solve the catastrophic forgetting problem in incremental learning. These works can be roughly divided into replay-based methods [17]-[19], regularization-based methods [2], [5], [20], and parameter isolation methods [3], [4].
Regularization-based methods reduce forgetting of learned tasks by adding constraints to the loss function. These methods do not need to store trained samples, which reduces memory usage. According to the constraints used, they can be divided into data-driven constraint methods and prior-based constraint methods. When the model starts training a new task, LwF [25] uses the output of the new samples under the old model as soft labels to constrain the update of the model parameters and thus reduce forgetting of the learned tasks. Elastic Weight Consolidation (EWC) [5], Memory Aware Synapses (MAS) [26], and Synaptic Intelligence (SI) [20] compute the importance of each model parameter for the learned tasks and use it as a constraint on parameter updates, limiting changes to important parameters when learning new tasks and thereby reducing forgetting. These methods require additional space to store the importance of the model parameters for the learned tasks.
Parameter isolation-based methods such as Progressive Neural Networks (PNN) [3] freeze the network parameters of all old tasks and learn each new task by building a new network. To retain previously learned knowledge, the model feeds the data of the new task into all frozen old networks and merges the outputs of each network, preventing forgetting when learning new tasks. Dynamically Expandable Networks (DEN) [4] train new tasks by dynamically selecting some parameter nodes of the existing network and expanding the network's parameters when needed. Expert Gate (EG) [27] selects and copies the old model most similar to the new task, chosen through an autoencoder gate, and fine-tunes it on the new task. Deep Adaptation Networks (DAN) [28] freeze the learned filters of old tasks but learn new filters for new tasks as linear combinations of the existing filters. Such methods must continuously expand the model's capacity as new tasks are learned.

B. SELF-SUPERVISED LEARNING
Traditional supervised deep learning methods rely heavily on large amounts of labeled data. However, data labeling is time-consuming, laborious, and prone to introducing noise, which significantly limits the development of deep learning. Self-supervised learning constructs pseudo-labels from the data itself to obtain supervision signals, removing the dependence on manually labeled data. Feature representations learned with self-supervision improve performance on many downstream tasks such as image classification, object detection, and semantic segmentation. Scholars have proposed a variety of ways to construct pseudo-labels from raw data, such as using cropping, resizing, and color distortion to generate multiple samples from the same image and training the model to judge whether a pair of images comes from the same image [7]. Gidaris et al. [8] proposed rotating training images by random angles and training the model to predict the rotation angle. Doersch et al. [9] shuffle the blocks of an image and train the model to predict their original positions. Qian et al. [29] proposed training the model to judge whether the order of sample frames in a video is correct.
However, there have been few attempts to combine self-supervised learning with incremental learning. We believe that self-supervised learning signals complement specific incremental learning tasks, improving the generalization of the features extracted by the model and making incremental learning easier. The work most related to ours is [30], which only discusses combining self-supervised signals with the OWM algorithm [31]. In contrast, we discuss combining self-supervised learning with the main current types of incremental learning methods. Through extensive experiments on the CIFAR-10 and Tiny ImageNet datasets, we found that self-supervised learning alleviates catastrophic forgetting more effectively in replay-based incremental learning methods than in parameter isolation-based ones. In further experiments, we found that the self-supervised learning signal extracts more value from the cached samples, and its benefit grows as the number of cached samples increases.
Experiments in Section IV show that incremental learning based on the fusion of self-supervised learning signals and sample replay can achieve the current best incremental learning performance with only a slight increase in storage and computation costs.

III. METHODOLOGY
A. FORMULATION
In this section, we formally define the incremental learning problem. An incremental learning problem lets a model M with parameters θ learn T tasks sequentially. The training dataset D_t of each task t ∈ {1, . . . , T} consists of N_t independent and identically distributed training samples {x_{t,i}, y_{t,i}}_{i=1}^{N_t}, and the label sets of different tasks y_i and y_j are disjoint:

y_i ∩ y_j = ∅, ∀ i ≠ j. (1)

The goal of incremental learning is to optimize the parameters θ from the training samples so as to minimize the sum of the model's losses over all T tasks:

L = Σ_{t=1}^{T} E_{(x,y)∼D_t} [ℓ_ce(M_θ(x), y)], (2)

where ℓ_ce is the cross-entropy loss. Since incremental learning assumes that the model learns each task sequentially, the model can only be optimized on the first n tasks learned so far, where n ≤ T, so each time the model minimizes the objective

L = Σ_{t=1}^{n} E_{(x,y)∼D_t} [ℓ_ce(M_θ(x), y)]. (3)

The optimal solution of this objective is the parameter region with the lowest error rate on the first n tasks. The problem with this objective is that its optimization direction does not consider parameter regions with low error rates on tasks to be learned in the future, so the optimal parameters found easily fall far from the optimal parameters of the unlearned tasks (take the (n+1)-th as an example); this makes it difficult for the model to update its parameters to a region with a low error rate on the first n + 1 tasks when it continues to learn the (n+1)-th task.
We add a self-supervised learning term to the incremental learning objective, letting the model balance the incremental task loss L and the self-supervised loss L_ssl during optimization. The optimized objective is

L_total = L + α · L_ssl, with L_ssl = E_x [ℓ_ce(M_θ(f_t(x)), ŷ)], (4)

where L_ssl is the self-supervised term, f_t transforms the input image, ŷ is the pseudo-label generated by the transformation, and α is a hyperparameter balancing the incremental learning loss and the self-supervised learning loss. Adding L_ssl to the loss function allows the model not only to consider the labels of the current task during incremental learning but also to maintain a perception of other unknown objects, so that it adapts faster when learning to recognize new classes. For example, without a self-supervised signal, when learning to recognize a new class of objects (such as dogs), the model must suppress its response to unlearned objects (such as cats) that may appear in the sample. After adding the self-supervised signal to the objective, the model must extract semantic information about unknown objects for the self-supervised task, such as judging the orientation of an unlearned object or the positions of its body parts. The self-supervised signal requires the model to retain a perception of unknown objects in the feature map; therefore, the model learns to classify them correctly when it later learns new classes incrementally. Due to storage space limitations, the agent cannot save samples of all learning tasks over its life cycle. Incremental learning assumes that the training data of different tasks arrive in sequence and that the data of the (n−1)-th task is not retained when training the n-th task, so formula (3) cannot be computed because the data in D_1 . . . D_{n−1} is missing. Different incremental learning methods use different strategies to compensate for the missing data of D_1 . . .
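The combined objective L + α · L_ssl can be sketched numerically as follows (a minimal pure-Python example assuming softmax probabilities are already available; the helper names are illustrative, not from the paper):

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability of the correct class."""
    return -math.log(probs[label])

def total_loss(task_probs, task_label, ssl_probs, ssl_label, alpha):
    """L_total = L + alpha * L_ssl: the task loss uses the real class label,
    the self-supervised loss uses the pseudo-label (e.g. a rotation index)."""
    return (cross_entropy(task_probs, task_label)
            + alpha * cross_entropy(ssl_probs, ssl_label))
```

With a small α (the paper uses 0.05), the self-supervised term acts as a gentle regularizer rather than dominating the task loss.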
D_{n−1}, such as sample replay, model parameter constraints, and parameter isolation. We use L_inc to denote the constraint added by the incremental learning method; the optimization objective then becomes

L_total = L_n + L_inc + α · L_ssl, (5)

where L_n is the loss on the current task n. Different incremental learning algorithms define L_inc differently. For example, regularization-based methods such as EWC [5] and SI [20] use a quadratic parameter-importance penalty,

L_inc = Σ_i F_i (θ_i − θ*_i)^2, (6)

while LwF [25] uses a distillation constraint between the outputs of the old and current models on the current task's data,

L_inc = ℓ_ce(Y_o, Ŷ_o), (7)

where θ* are the model parameters after learning the previous tasks, F and Ω record the importance of the network weights in different forms, and Y_o and Ŷ_o are the outputs of the previous and current models on the current task's data, respectively.
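The EWC/SI-style parameter-importance constraint can be sketched as follows (pure Python over flat parameter lists; a simplification of the real per-tensor implementation, with our own function name):

```python
def importance_penalty(theta, theta_star, importance):
    """Quadratic penalty sum_i F_i * (theta_i - theta*_i)^2: parameters
    that were important for old tasks (large F_i) are pulled back toward
    the values theta*_i recorded after learning those tasks, while
    unimportant parameters (F_i near 0) remain free to change."""
    return sum(f * (t - ts) ** 2
               for f, t, ts in zip(importance, theta, theta_star))
```

The methods differ mainly in how the importance values are estimated (Fisher information for EWC, path integrals for SI), not in the form of the penalty.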
Replay-based methods, such as ER [16], [18] and Dark Experience Replay (DER) [21], sample part of the training data of each task into a cache M and use the cached data instead of D_1 . . . D_{n−1} to compute the loss of the learned tasks. The corresponding constraint is

L_inc = E_{(x,z)∼M} [ ||h_θ(x) − z||^2 ], (8)

where z is the cached output extracted from the data x before the model was updated, and h_θ(x) is the output extracted from x by the current model. We can flexibly add the self-supervised learning strategy to both replay-based and regularization-based methods. We use predicting image rotation [8] as the implementation of the self-supervised signal. For the replay-based methods, we can compute the self-supervised loss on the cached samples (M), the current task samples D_n, or both; in Section V we experiment on these cases and analyze the results in detail. The self-supervised loss is

L_ssl = −(1/K) Σ_{m=1}^{K} E_x [ log f_θ^{ŷ_m}(Rot(x | ŷ_m)) ], with ŷ_m = (m − 1) · 360/K, (9)

where the function Rot(·|m) rotates the image by m degrees, the function f_θ^{ŷ}(·) predicts the probability that the pseudo-label of the rotated image is ŷ, and, following [8], we set K to 4 (rotations of 0°, 90°, 180°, and 270°).
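The DER-style replay constraint can be sketched as follows (plain-Python lists stand in for tensors and the function names are ours; the current model's outputs are matched against the logits z cached when each sample entered the buffer):

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def replay_loss(buffer, model):
    """Average || h_theta(x) - z ||^2 over the cache: z is the output
    recorded when x was stored, model(x) is the current model's output.
    Matching logits rather than hard labels preserves the old model's
    full response, which is what makes DER a distillation-style replay."""
    return sum(mse(model(x), z) for x, z in buffer) / len(buffer)
```

A training step would add this term, weighted by a coefficient, to the loss on the current task's minibatch.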

B. ALGORITHM DESCRIPTION
Algorithm 1 describes the calculation process of combining self-supervised learning and DER in detail.
The function reservoir(M, (x, z)) implements the reservoir sampling algorithm [32], ensuring that samples from different tasks have the same probability of being stored in the buffer.
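The reservoir sampling step can be sketched as follows (a standard implementation of [32], not the authors' exact code; `seen_count` is the number of samples streamed so far, including the current one):

```python
import random

def reservoir(buffer, item, capacity, seen_count):
    """Keep a uniform random sample of a stream: while the buffer has
    room, append; once full, the new item replaces a random slot with
    probability capacity / seen_count, so every streamed item ends up
    in the buffer with equal probability regardless of its task."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randrange(seen_count)
        if j < capacity:
            buffer[j] = item
    return buffer
```

Because the replacement probability shrinks as more samples stream past, no explicit per-task bookkeeping is needed to keep the buffer balanced across tasks.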

IV. EVALUATION METRICS
The literature [33] divides incremental learning into task incremental learning (task-IL), class-IL, and domain incremental learning (domain-IL). All three divide the samples into multiple tasks, each containing several classes, and the model learns the tasks and their classes incrementally. The main difference between task-IL and class-IL is that task-IL provides the identity of the task a test sample belongs to at test time, while class-IL does not; class-IL must therefore distinguish test samples among all learned classes, making it the most demanding of the three settings. Task-IL only needs to distinguish test samples among the classes of the given task; because each task contains few classes, task-IL is the easiest of the three. Domain-IL assumes the class structure is the same in all tasks but the domain switches between tasks; it does not provide task information at test time and only needs to predict the class of a sample within a single task, so it is harder than task-IL and simpler than class-IL. This section experiments on the effect of self-supervised learning on class-IL and task-IL. Our experimental settings are consistent with [21], [33], [34].

[Algorithm 1, floated here in the original layout: loss ← loss_nt + α · loss_der + β · loss_ssl; back-propagate loss and optimize {θ, W_rot} with SGD; M ← reservoir(M, (x, z)).]

A. EVALUATION PROTOCOL
Based on the universality, diversity, and complexity of the samples, we selected the relatively simple CIFAR-10 [35], the more complex Tiny ImageNet [36], and the traffic sign dataset GTSRB [37] to comprehensively test the effectiveness of the algorithm; all results are averaged over five runs.

1) CIFAR-10
CIFAR-10 consists of 60,000 32 × 32 color images belonging to 10 common object classes, with 6,000 images per class. The dataset has 50,000 training images and 10,000 test images. Following the incremental learning setting, we divide the dataset into five tasks, each containing two classes.
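This class-IL split can be sketched as follows (our own illustrative helper, not the paper's code): the 10 CIFAR-10 class ids are partitioned into five tasks of two consecutive classes each.

```python
def split_into_tasks(labels, classes_per_task, num_tasks):
    """Group sample indices into incremental tasks by class id:
    task t holds the classes [t*classes_per_task, (t+1)*classes_per_task),
    so each task's dataset is disjoint in label space from the others."""
    tasks = [[] for _ in range(num_tasks)]
    for idx, y in enumerate(labels):
        tasks[y // classes_per_task].append(idx)
    return tasks
```

The same helper covers the Tiny ImageNet split (200 classes, ten tasks of 20) and the GTSRB split (43 tasks of one class) by changing the two parameters.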

2) TINY ImageNet
Tiny ImageNet consists of 120,000 64 × 64 color images covering 200 common object classes. Each class has 500 training images, 50 validation images, and 50 test images. Following the incremental learning setting, we divide Tiny ImageNet into ten tasks, each containing 20 classes. Examples of images randomly sampled from the dataset are shown in Figure 3.

3) GTSRB
GTSRB consists of more than 50,000 color images covering 43 classes of traffic signs. Following the incremental learning setting, we divide GTSRB into 43 tasks, each containing one class. Examples of images randomly sampled from the dataset are shown in Figure 4.

4) NETWORK ARCHITECTURE
Following [19] and [21], we use an unpretrained ResNet-18 [22] as the network architecture; the details of ResNet-18 are shown in Table 1. In the table, [kw × kh, cn; kw × kh, cn] denotes a residual block, where kw and kh are the kernel width and height and cn is the number of channels, and [kw × kh, cn; kw × kh, cn] × rn means the residual block is repeated rn times; please refer to [22] for more details about the residual block.

5) DATA AUGMENTATION STRATEGY
We augment the training data with random cropping and horizontal flipping.

6) HYPERPARAMETER SELECTION
Following the hyperparameter settings of [21], we set the learning rate lr to 0.03, the batch size to 32, and the number of data iterations per task to 50. The hyperparameters of the self-supervised part are uniformly set to 0.05. All networks are trained with the Stochastic Gradient Descent (SGD) optimizer.

B. EXPERIMENTAL RESULTS
To experimentally verify the effect of self-supervised learning on incremental learning, we selected regularization-based methods (LwF, EWC, SI) and replay-based methods (ER, DER, DERPP). We did not evaluate parameter isolation-based incremental learning methods because, according to the experimental results of [21], they do not effectively prevent catastrophic forgetting. We ran detailed experiments on class-IL and task-IL. We set the batch size to 32 in all experiments. For the regularization-based methods, we add the self-supervised signal directly to the new task samples; for the replay-based methods, we add it to the cached samples and fix the number of cached samples at 200. All results are the average accuracy and variance over five runs. Table 2 and Table 5 show the experimental results on CIFAR-10 and Tiny ImageNet, respectively. As seen in the class-IL results, adding self-supervised learning to different incremental learning methods generally improves classification accuracy. In particular, self-supervised learning significantly improves the classification accuracy of replay-based incremental learning methods. As shown in Table 2, DER's class-IL accuracy on CIFAR-10 increased from 75.47% to 77.27%, and its task-IL accuracy increased from 92.88% to 93.78%; self-supervised learning also improves ER and DERPP. Therefore, adding self-supervised learning clearly helps class-IL. However, the self-supervised signal has a slight negative effect on the accuracy of the LwF method; we believe this is because LwF must exactly fit the outputs of the previous tasks, so the model is less robust to the introduction of new supervision signals.
Table 4 shows the p-values of the experimental results; as we can see, the improvement of the self-supervised signal on the replay-based methods has very high statistical confidence. Figures 5 and 7 show the impact of self-supervised learning signals on class-IL for the various incremental learning algorithms on the CIFAR-10 and Tiny ImageNet datasets, respectively.
Since we set the GTSRB dataset to contain only one class per task, the model must perform incremental learning over a sequence of 43 tasks. Preliminary experiments on the regularization-based methods showed that they cannot effectively preserve knowledge over such a long incremental task sequence; for example, the LwF-based method achieves an average accuracy of only 7.8%, while the replay-based methods work well. We therefore focused our experiments on the replay-based methods (ER, DER, DERPP). In the GTSRB experiments, we set the batch size to 32 and fixed the number of cached samples at 1000. We tested both class-IL and task-IL and found that all task-IL results reached 100%; hence, Table 3 only shows the class-IL results on the GTSRB dataset.
As we can see in the class-IL results, adding self-supervised learning to the different replay-based incremental learning methods generally improves classification accuracy. Among them, the accuracy of the DER method increases from 88.27% to 91.68%, and the ER method increases from 94.14% to 96.12%, exceeding the 95.56% of DERPP and the 95.86% of DERPP+SSL.
We also observed that the accuracy improvement from self-supervised learning on task-IL is not as significant as on class-IL. We believe the main reason is that task-IL is the simplest incremental learning setting and its accuracy is already relatively high, so the room for improvement is limited; moreover, adding the self-supervised signal also slightly interferes with the optimization of the model, which limits further improvement in classification accuracy. Figures 6 and 8 show the influence of self-supervised learning signals on task-IL for the various algorithms on the CIFAR-10 and Tiny ImageNet datasets, respectively.

V. MODEL ANALYSIS
A. HOW TO ADD THE SELF-SUPERVISED SIGNAL TO INCREMENTAL LEARNING
1) SAMPLES FROM WHICH THE SELF-SUPERVISED SIGNAL IS GENERATED
In the replay-based incremental learning algorithm, we can generate self-supervised signals from the new task samples, the cached samples, or both. This section examines the impact of each choice on incremental learning. We ran five trials on CIFAR-10 and Tiny ImageNet using the self-supervised + DER (RDER) method. The average accuracy and variance are shown in Table 6. As we can see from the results, applying the self-supervised signal only to the cached samples improves the accuracy of incremental learning more than applying it to the new samples or to both the new and cached samples. We also find that the more samples are cached, the greater the benefit of self-supervised learning. We conclude that the self-supervised learning signal significantly improves the accuracy of replay-based incremental learning algorithms, and the key is to apply it to the cached samples.

2) THE EFFECT OF CACHE SIZE ON SELF-SUPERVISED SIGNAL
This section investigates the effect of buffer size on the self-supervised signal in the replay-based incremental learning algorithm. We set the cache size to 200, 500, and 1000 and repeated each experiment five times on the CIFAR-10 and Tiny ImageNet datasets; the average results and their variance are shown in Table 7 and Table 8. As we can see, the accuracy of class-IL improves significantly as the cache size increases. With a cache size of 200, the average class-IL accuracy improves by 2.06% and 0.76% on CIFAR-10 and Tiny ImageNet, respectively. With a cache size of 500, the improvements are 2.86% and 1.73%, showing that the effect of the self-supervised signal on the incremental learning model grows significantly with the cache size. With a cache size of 1000, the improvements reach 5.13% and 1.86%, showing that the benefit continues to grow as the cache size increases. We believe this is because the self-supervised signal can more fully exploit the latent information in the cached samples, thereby better alleviating the model's forgetting of learned knowledge during incremental learning. We also find that self-supervised signals do not significantly help task-IL.

VI. CONCLUSION
Incremental learning requires continuously learning new tasks while overcoming the network's catastrophic forgetting of learned tasks. Taking advantage of the fact that self-supervised learning is independent of the specific tasks of incremental learning, we add self-supervised learning to incremental learning methods to smooth the large changes in feature-extraction behavior that occur as the model learns different tasks, thereby reducing the difficulty of learning. We added self-supervised learning signals to six different incremental learning algorithms and ran extensive experiments on CIFAR-10 and Tiny ImageNet to test the influence of self-supervised learning on incremental learning. The experimental results show that self-supervised learning can significantly improve the accuracy of class-IL algorithms, especially for replay-based incremental learning methods, and its effect improves significantly as the number of cached samples increases. We also found in experiments that adding self-supervised learning to the cached samples is more effective than adding it to the new task samples or to both simultaneously; this shows that self-supervised learning more effectively retains the knowledge of learned tasks through the cached samples. This property allows us to significantly improve the performance of replay-based incremental learning methods at a low computational cost. Recently, contrastive self-supervised learning has made significant progress; however, it is harder to converge and requires more negative samples than pseudo-label-based methods, so designing a suitable contrastive self-supervised method for incremental learning is what we plan to do next.
In the long term, we may also try to introduce methods such as the automatic discovery of high-order patterns [38] to help incremental learning methods better alleviate catastrophic forgetting.