Looking Through the Past: Better Knowledge Retention for Generative Replay in Continual Learning

In this work, we improve generative replay in a continual learning setting so that it performs well in challenging scenarios. Because of the growing complexity of continual learning tasks, it is becoming more popular to apply the generative replay technique in the feature space instead of the image space. Nevertheless, such an approach does not come without limitations. In particular, we notice that the degradation of the continually trained model's performance can be attributed to the fact that the generated features are far from the original ones when mapped to the latent space. Therefore, we propose three modifications that mitigate these issues. More specifically, we incorporate distillation in latent space between the current and previous models to reduce feature drift. Additionally, a latent matching between the reconstructions and the original data is proposed to improve the alignment of generated features. Further, based on the observation that reconstructions preserve knowledge better, we add cycling of generations through the previously trained model to make them closer to the original data. Our method outperforms other generative replay methods in various scenarios. Code is available at https://github.com/valeriya-khan/looking-through-the-past.


I. INTRODUCTION
The traditional approach to machine learning involves training models on shuffled training data to ensure independent and identically distributed conditions, enabling the model to learn generalized parameters for the entire data distribution. In continual learning, on the other hand, models are trained on sequential tasks, with only data from the current task available at any given time. Such a scenario is more realistic in some applications, for example those with privacy concerns, where the old data may become unavailable. However, models trained in such an incremental fashion suffer from catastrophic forgetting [24], a significant drop in accuracy on previously acquired knowledge.
Class-Incremental Learning (CIL) is a widely adopted setting where the classifier is trained on new classes incrementally using a sequence of separate datasets [22]. Different regularization methods can be used to preserve knowledge [16], [37]; however, the performance is significantly lower without utilizing exemplars from the previous tasks. Therefore, generative models [5] have gained significant attention as a source of synthetic data that can substitute the data from previous tasks.
Despite the promising setup, it turns out to be very challenging to scale approaches based on generative models in CIL to datasets more demanding than MNIST or CIFAR-10 [32]. Generative replay methods perform poorly on datasets with a larger number of classes or with more complex data. This can be attributed to the fact that modeling high-dimensional images is a challenging task during incremental learning, and the quality of the generations degrades as the number of learned tasks increases.
Therefore, more recent methods [18] introduce replay in the feature space of a trained and frozen feature extractor. The data is first passed through the feature extractor, and the resulting features are used as the training data for the generative part. In this case, the distribution of the data has lower dimensionality and is much easier for the generator to learn.
Brain-Inspired Replay (BIR) [33] is one of the recent works that uses feature-based generative replay. In their work, the authors introduce several modifications to make a variational autoencoder (VAE) able to learn and generate longer sequences of more complex data. The best results reported by the authors are obtained when BIR is combined with the Synaptic Intelligence (SI) [37] regularization method, which suggests that BIR alone is not sufficient for generative feature replay and that other regularization techniques may yield better results. This motivates us to analyze in depth VAE-based replay approaches, with BIR as their flagship example. We observe that there remains a significant difference between the features produced from real data and from synthetic data. Our hypothesis is that this difference leads to a significant degradation of the quality of the replay data, and therefore we propose two modifications that diminish the problem. Firstly, we introduce a new loss term for minimizing the difference between the encoded latent vectors of the original sample and the reconstructed sample. This loss enables the encoder to learn how to reverse the operation of the decoder. Secondly, we propose to refine the quality of rehearsal samples. To that end, we introduce a cycling method in which we iterate the generated data through the previously trained model (decoder and encoder), and only after that feed it to the replay buffer for training the new model. As we show in our analysis, this reduces the discrepancy between original and generated features for classification (see Figure 1) and, as a result, improves the final model accuracy. The proposed changes allow us to significantly improve the results over our baseline method.
Overall, the contribution of this study is threefold: • Based on the analysis of existing generative replay methods, we identify the weaknesses of VAE-based approaches, such as the degradation of generated data and the distribution mismatch between the features obtained from original and synthetic data.
• To mitigate the discovered problems, we propose a new generative replay method for class-incremental learning. Our method uses a latent matching loss to better align the latent vectors of reconstructed and original data. Also, we distill the latent representations of the current data obtained through the previous and current models. Furthermore, we incorporate the cycling of generations to diminish the difference between the original and synthetic data.
• We perform a series of experiments to show that our approach outperforms the baseline method (BIR).
In addition, we demonstrate through an ablation study that each improvement we introduce makes an incremental contribution to the overall performance of the model.

II. RELATED WORKS
Continual learning methods can be divided into three categories that we overview in this section.
Regularization methods aim to strike a balance between preserving previously acquired knowledge and providing sufficient flexibility to incorporate new information. To that end, regularization is applied to slow down the updates of the most important weights. In particular, in Elastic Weight Consolidation (EWC) [13] the authors propose to use Fisher Information to select the model's important weights, while in Synaptic Intelligence (SI) [37] and Memory Aware Synapses (MAS) [1] additional information is stored together with each parameter. Similarly, in Learning Without Forgetting (LWF) [17] an additional distillation loss on the current data is used to match the output of the model trained on the previous task with that of the new one. In this work, we use distillation techniques to align representations of old and new features similarly to LWF.
Dynamic architecture methods create different versions of the base model for each task. This is usually implemented by creating additional task-specific submodules [29], [35], [36], or by selecting different parts of the base network [4], [20], [21], [23]. Such approaches reduce catastrophic forgetting at the expense of expanding memory requirements.
Rehearsal methods involve storing and replaying past data to prevent catastrophic forgetting. The simplest implementation of this approach employs a memory buffer where a subset of examples from previous tasks can be stored [2], [3], [9], [19], [27]. Such an approach achieves high performance and can significantly reduce catastrophic forgetting.
However, the memory buffer has to store a significant number of examples and, hence, grows with each task. Also, in some domains, using historical data is not possible due to privacy concerns. Therefore, generative models are often used to synthesize past data. The first example of generative replay for a CIL model is [32], where a generative model (e.g., a Generative Adversarial Network (GAN) [5]) is used as a source of rehearsal examples. This idea is further extended to other generative methods such as Variational Autoencoders in [12], [34], and [25], or Normalizing Flows [28] in [31]. In [15], the authors overview the general performance of generative models as a source of rehearsal examples, showing that even though GANs outperform other solutions, all the methods struggle when evaluated on more complex benchmark scenarios. Therefore, to simplify the problem, in Brain-Inspired Replay (BIR) [33] the authors introduce a new idea known as feature replay and propose to focus on replaying internal data representations instead of the original samples. This idea was further explored in [10], with a split between short- and long-term memory, and in [18], where the authors employ conditional GANs. Our method falls into the generative feature replay category, as we directly base our approach on the BIR method. This work is an extension of a workshop paper originally presented at ICCV [11]. In this version, we add experiments on the mini-ImageNet dataset and a detailed evaluation of the quality of rehearsal examples with precision and recall analysis. The results of these experiments are presented in Table 2 and Figure 6. In addition, we present Algorithm 1 and Figures 2, 3, and 4 for better comprehension of the method.

FIGURE 1. Principal Component Analysis (PCA) is performed on the original latent representations and the generated ones after 0, 10, and 20 passes during cycling. The PCA visualizations and Fréchet distances both imply that the cycling procedure decreases the discrepancy between original and generated data when the appropriate number of passes is performed [11].

III. METHOD

A. PROBLEM DEFINITION
This study addresses image classification in a class-incremental setting. We train the model on a sequence of n tasks: T_1, T_2, ..., T_n, where each task t consists of {X^(t), Y^(t)} drawn from the distribution D^(t), where X is a set of training samples, Y is a set of corresponding class labels, and 1 ≤ t ≤ n. During the training of task t, the model has no access to the data of previous tasks.
In class-incremental learning, the model has to be trained to predict the labels for all the tasks seen so far.

B. BASELINE MODEL
The Brain-Inspired Replay (BIR) method [33] serves as a baseline for our work. The model consists of a feature extractor and a VAE on top of it that plays the role of the feature generator. The generator part is utilized to create synthetic data for the replay of old knowledge. It has an encoder part q_φ and a decoder part p_ψ. The goal of the encoder is to map a sample x to a probabilistic latent variable z, and the goal of the decoder is to map the latent variable z back to a reconstruction x̂. Typically, the objective of training a VAE is to maximize a variational lower bound on the evidence (ELBO), or, equivalently, to minimize the per-sample loss:

L(x; φ, ψ) = E_{z∼q_φ(·|x)}[ −log p_ψ(x|z) ] + D_KL( q_φ(·|x) || p(·) ),

where q_φ(·|x) = N(μ(x), σ(x)²I) is the posterior, p(·) = N(0, I) is the prior over the latent variables, and D_KL is the Kullback-Leibler divergence.
For a prior distribution equal to N(0, I), the KL divergence can be calculated in closed form as:

D_KL( q_φ(·|x) || p(·) ) = (1/2) Σ_{j=1}^{D} ( μ_j² + σ_j² − 1 − log σ_j² ),

where D is the latent dimension. The reconstruction loss in this work is the expected element-wise binary cross-entropy between the input and its reconstruction:

L_recon(x) = −E_{z∼q_φ(·|x)}[ Σ_{p=1}^{N} ( x_p log x̂_p + (1 − x_p) log(1 − x̂_p) ) ],

where N is the size of the input, x_p is the p-th entry of the original input x, and x̂_p is the p-th entry of the reconstruction x̂.
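For concreteness, the two terms above can be computed as in the following PyTorch sketch; the tensor names (x, x_hat, mu, log_var) and the binary cross-entropy reconstruction term are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def vae_per_sample_loss(x, x_hat, mu, log_var):
    """Negative ELBO for a diagonal-Gaussian posterior and N(0, I) prior.

    x, x_hat: (batch, N) original features and their reconstruction, in [0, 1]
    mu, log_var: (batch, D) encoder outputs parameterizing q(z|x)
    """
    # D_KL( N(mu, sigma^2 I) || N(0, I) ) summed over the D latent dimensions
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var, dim=1)
    # Element-wise reconstruction term (binary cross-entropy, one common choice)
    recon = F.binary_cross_entropy(x_hat, x, reduction="none").sum(dim=1)
    return (recon + kl).mean()

# Hypothetical usage with random tensors of matching shapes
x = torch.rand(8, 256)          # original features
x_hat = torch.rand(8, 256)      # decoder output passed through a sigmoid
mu, log_var = torch.zeros(8, 64), torch.zeros(8, 64)
loss = vae_per_sample_loss(x, x_hat, mu, log_var)
```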
In order to generate samples from specifically chosen classes, the prior can be changed from the standard normal distribution to a Gaussian mixture with each class modeled as a separate component:

p(z) = Σ_{c=1}^{N_classes} p(Y = c) · p_X(z|c),

where p_X(·|c) = N(μ_c, σ_c I) for c = 1, ..., N_classes, μ_c and σ_c are the trainable mean and standard deviation for class c, X is the set of means and standard deviations of all N_classes classes, and p(Y = c) is the class prior.
For the current task with hard targets (labels), L_latent takes the form of the KL divergence between the posterior and the prior component of the target class:

L_latent(x, y) = (1/2) Σ_{j=1}^{D} ( log σ_{y_j}² − log σ_j² − 1 + ( σ_j² + (μ_j − μ_{y_j})² ) / σ_{y_j}² ),

where μ_{y_j} is the j-th element of μ_y and σ_{y_j} is the j-th element of σ_y. For the replay, this loss is estimated for a soft target ỹ as:

L_latent(x, ỹ) ≈ log q_φ(z|x) − log Σ_{j=1}^{N_classes} ỹ_j p_X(z|j), with z ∼ q_φ(·|x),

where ỹ_j is the j-th entry of ỹ, and the estimation of the expectation is performed with a single Monte Carlo sample for each input. The classification loss is calculated for the current task as follows:

L_class(x, y) = −log p_θ(Y = y | x),

where p_θ is the conditional probability distribution defined by the model parameters.
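A minimal sketch of the hard-target latent loss, assuming the class-prior means and log-variances are stored as trainable tensors; all names and interfaces are illustrative.

```python
import torch

def latent_loss_hard(mu, log_var, y, class_mu, class_log_var):
    """KL( N(mu, sigma^2 I) || N(mu_y, sigma_y^2 I) ) per sample, hard labels.

    mu, log_var: (batch, D) encoder outputs
    y: (batch,) integer class labels
    class_mu, class_log_var: (num_classes, D) trainable class-prior parameters
    """
    mu_y, log_var_y = class_mu[y], class_log_var[y]      # (batch, D)
    kl = 0.5 * torch.sum(
        log_var_y - log_var - 1.0
        + (log_var.exp() + (mu - mu_y).pow(2)) / log_var_y.exp(),
        dim=1,
    )
    return kl.mean()
```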
In the replay part of the BIR method, the classification loss is substituted by a distillation loss. Typically, the objective of knowledge distillation is to transfer knowledge from a teacher model to a student model. Knowledge distillation is performed by minimizing the distance between the output vectors of the softmax function in the teacher and student models. One of the problems of this approach is that the predicted probability of the true class is usually close to 1. Hence, the probability vector is close to the one-hot ground-truth label vector and does not provide additional information. To mitigate this problem, a softmax with temperature is incorporated [8]. The distillation loss is calculated by:

L_distill(x, ỹ) = −T² Σ_{c=1}^{N_classes} ỹ_c log p_θ^T(Y = c | x),

where T is the softmax temperature, ỹ are the soft targets produced by the previous model with temperature T, and p_θ^T denotes the model's class probabilities computed with temperature T.
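The temperature-scaled distillation loss can be sketched as follows; the T² scaling factor, a common convention that keeps gradient magnitudes comparable across temperatures, is an assumption here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target cross-entropy between temperature-scaled teacher and student outputs.

    student_logits, teacher_logits: (batch, num_classes) raw classifier outputs.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=1)   # teacher soft labels
    log_probs = F.log_softmax(student_logits / T, dim=1)  # student log-probabilities
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T ** 2)
```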

C. IMPROVED FEATURE REPLAY
This section describes our proposed modifications to the BIR method, which serves as the baseline. These changes aim to mitigate the problems of VAE-based feature replay: (1) misalignment between original and reconstructed data, (2) latent drift during continual training, and (3) a large difference between generations and original samples.

1) LATENT MATCHING FOR RECONSTRUCTIONS AND ORIGINAL DATA
The first modification we propose aims to improve the VAE model's performance under continual retraining. To that end, we propose a latent matching regularization that enforces the encoder to reverse the decoding operation performed by the decoder. More specifically, we pass the sample x through the encoder model to get the latent representation z_o. After that, we reconstruct the original sample by passing this latent vector through the decoder and obtain x̂. Then, the reconstruction is passed through the encoder model, and the latent representation z_r is obtained. In particular, we calculate the regularization on the means and standard deviations output by the encoder. To that end, we utilize the mean squared error (MSE) loss to measure the difference between the obtained latent representations. Therefore, we introduce the latent match loss, which is defined as follows:

L_LM(x) = MSE(μ_o, μ_r) + MSE(σ_o, σ_r),

where (μ_o, σ_o) and (μ_r, σ_r) are the means and standard deviations produced by the encoder for the original sample x and its reconstruction x̂, respectively, and the MSE is taken over the D latent dimensions. The visualisation of our latent match loss is presented in Figure 2.
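A sketch of the latent match loss under the assumption that the encoder returns the posterior mean and log-variance; function names and interfaces are illustrative.

```python
import torch
import torch.nn.functional as F

def latent_match_loss(encoder, decoder, x):
    """Encourage the encoder to invert the decoder: encode x, decode it,
    re-encode the reconstruction, and match the two (mu, log_var) pairs with MSE.

    encoder(x) is assumed to return (mu, log_var); decoder(z) returns x_hat.
    """
    mu_o, log_var_o = encoder(x)
    z = mu_o + torch.randn_like(mu_o) * (0.5 * log_var_o).exp()  # reparameterization
    x_hat = decoder(z)
    mu_r, log_var_r = encoder(x_hat)
    return F.mse_loss(mu_r, mu_o) + F.mse_loss(log_var_r, log_var_o)

# Hypothetical usage with toy modules
enc_mu, enc_logvar = torch.nn.Linear(256, 64), torch.nn.Linear(256, 64)
encoder = lambda v: (enc_mu(v), enc_logvar(v))
decoder = torch.nn.Linear(64, 256)
loss = latent_match_loss(encoder, decoder, torch.randn(8, 256))
```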

2) LATENT DISTILLATION
As mentioned in Section III-B, the BIR method does not have any mechanism for the prevention of feature drift, i.e., the distribution change in feature space during training on new data. To prevent that, we add a latent distillation loss computed similarly to [18]. In order to calculate the loss during task t, we pass the sample through the previous model's encoder E_{t−1} and the current model's encoder E_t, and obtain latent representations z_{t−1} and z_t, respectively. The latent distillation loss is the MSE between the latent representations of the previous and current model:

L_LD(x) = MSE( z_t, z_{t−1} ) = (1/D) Σ_{j=1}^{D} ( z_{t,j} − z_{t−1,j} )².

The latent distillation loss serves as a regularization term that controls forgetting, similarly to the SI regularization in the BIR method. Nevertheless, our latent distillation achieves better performance. Figure 3 presents the latent distillation loss.
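A sketch of the latent distillation term, assuming each encoder returns its latent representation directly and the previous encoder is kept frozen.

```python
import torch
import torch.nn.functional as F

def latent_distillation_loss(prev_encoder, curr_encoder, x):
    """MSE between latent codes of the frozen previous-task encoder and the
    current encoder for the same input (illustrative interfaces)."""
    with torch.no_grad():
        z_prev = prev_encoder(x)   # E_{t-1}(x), kept fixed
    z_curr = curr_encoder(x)       # E_t(x), receives gradients
    return F.mse_loss(z_curr, z_prev)

# Hypothetical usage with toy linear encoders
prev_enc, curr_enc = torch.nn.Linear(256, 64), torch.nn.Linear(256, 64)
loss = latent_distillation_loss(prev_enc, curr_enc, torch.randn(8, 256))
```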

3) CYCLING
Our hypothesis is that even with the first two modifications added, there is still a large discrepancy between the generated and original features. To minimise this effect, we propose a cycling mechanism inspired by the idea presented by Gopalakrishnan et al. in [6]. In that work, the authors propose to recursively pass images from the buffer through a pre-trained autoencoder in order to better align them with the data from a new task. Here, we use a similar mechanism with our Variational Autoencoder to align generations of data from the previous task with data reconstructions.
The visualisation of our cycling mechanism is presented in Figure 4.
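The cycling procedure itself can be sketched as below; re-encoding to the posterior mean and the default number of passes are illustrative assumptions, and the interfaces mirror the sketches above.

```python
import torch

@torch.no_grad()
def cycle_generations(prev_encoder, prev_decoder, z_sampled, num_passes=10):
    """Recursively pass generated features through the previous model
    (decoder -> encoder -> decoder -> ...) before using them for replay.
    prev_encoder is assumed to return (mu, log_var)."""
    x_gen = prev_decoder(z_sampled)       # initial generation
    for _ in range(num_passes):
        mu, _ = prev_encoder(x_gen)       # re-encode (use the posterior mean)
        x_gen = prev_decoder(mu)          # re-decode
    return x_gen                          # replay features for the new model

# Hypothetical usage with toy modules
enc_mu, enc_logvar = torch.nn.Linear(256, 64), torch.nn.Linear(256, 64)
prev_encoder = lambda v: (enc_mu(v), enc_logvar(v))
prev_decoder = torch.nn.Linear(64, 256)
replay_feats = cycle_generations(prev_encoder, prev_decoder, torch.randn(8, 64))
```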
In order to verify this hypothesis, we calculate the Fréchet distance [7], which measures the similarity of two Gaussian distributions. Typically, it is utilized to estimate the quality of generated images (known as the Fréchet Inception Distance). In this case, we use it to measure the quality of the generated latent representations. Figure 5 presents the decrease in the Fréchet distance between the distributions of generated and original latent vectors as the number of passes through the previous model increases. Therefore, we add cycling to the training procedure.
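As a reference, a sketch of the Fréchet distance computed directly on two sets of latent vectors (the quantity reported in Figure 5), using NumPy and SciPy.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two sets of latent vectors, treating each set
    as a Gaussian with empirical mean and covariance (as in FID, but on latent
    representations rather than Inception features)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Hypothetical usage with random latent vectors
real = np.random.randn(500, 64)
gen = np.random.randn(500, 64) + 0.5
print(frechet_distance(real, gen))
```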
An empirical evaluation of cycling and the number of passes used is presented together with the other experiments in Section V-B.

D. FINAL TRAINING OBJECTIVE
To summarize, we present our modified VAE-based replay method with all the improvements incorporated into the training routine via a single objective for the class-incremental setting. This objective can be divided into two main parts: L_current and L_replay. The current task loss L_current combines the classification loss on the current data with the generative losses (reconstruction and latent) and the two regularization terms introduced above, i.e., the latent match loss and the latent distillation loss. The replay loss L_replay for the previous tasks follows the replay loss of the baseline, with the classification loss replaced by the distillation loss on the replayed features. Finally, the total objective is the sum of these two losses:

L = L_current + L_replay.

The final loss is used for training the VAE and the classifier using the current task data and the generative replay data passed through the previous model a defined number of times. The resulting loss is a combination of components without any coefficients to balance them; tuning such coefficients could be investigated further. The ablation study is provided in Section V-C. The steps of the overall training procedure can be found in Algorithm 1.
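A sketch of how the two parts of the objective could be combined in code; the grouping of loss components into L_current and L_replay reflects our reading of the description above, and all names are illustrative placeholders.

```python
import torch

def total_objective(current_losses, replay_losses):
    """Sum the current-task and replay loss components without balancing
    coefficients, as described above. Both arguments are dicts of scalar tensors."""
    loss_current = sum(current_losses.values())
    loss_replay = sum(replay_losses.values())
    return loss_current + loss_replay

# Hypothetical usage with placeholder scalar tensors
current_losses = {
    "classification": torch.tensor(1.2),
    "reconstruction": torch.tensor(0.8),
    "latent": torch.tensor(0.3),
    "latent_match": torch.tensor(0.1),
    "latent_distillation": torch.tensor(0.2),
}
replay_losses = {
    "distillation": torch.tensor(0.9),
    "reconstruction": torch.tensor(0.7),
    "latent": torch.tensor(0.4),
}
loss = total_objective(current_losses, replay_losses)
```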

IV. EXPERIMENTAL SETUP

A. DATASET
We evaluate the models on two commonly used benchmarks that are challenging for the generative replay setup: the CIFAR-100 dataset [14] and mini-ImageNet. CIFAR-100 consists of 100 object classes with 45,000 images for training, 5,000 for validation, and 10,000 for testing. All images are 32 × 32 pixels. Mini-ImageNet contains 50,000 training images and 10,000 testing images evenly distributed across 100 classes. All images are of size 84 × 84.

B. IMPLEMENTATION DETAILS
As a framework for our experiments, we use PyTorch [26]. We use a ResNet-32 model for feature extraction. We pretrain the feature extractor on the 50 classes contained in the first task and freeze it afterwards. The same procedure is used for mini-ImageNet, with ResNet-32 substituted by ResNet-18. During pretraining, we utilize the strong data augmentations from the PyCIL framework [38] to improve the feature extraction model. During the class-incremental training of the generator and classifier, we use weaker data augmentations to minimize distortions of the original data. More specifically, we first pad images by 4 pixels with zero values, and then crop the image at a random location to the size of 32 × 32 for CIFAR-100 and 84 × 84 for mini-ImageNet. Lastly, random horizontal flips are applied. We train the encoder part on top of the feature extractor for 10,000 iterations for the first task and for 5,000 iterations for the remaining tasks. The Adam optimizer from PyTorch [26] is used for the experiments with a learning rate of 1e-4.
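For illustration, the weak augmentation pipeline described above for CIFAR-100 could be expressed with torchvision transforms as follows; the final ToTensor step is our addition for completeness.

```python
from torchvision import transforms

# Pad by 4 pixels with zeros, randomly crop back to 32x32, then random horizontal flip;
# for mini-ImageNet the crop size would be 84x84 instead.
weak_augmentation = transforms.Compose([
    transforms.Pad(4, fill=0),
    transforms.RandomCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),   # assumed final conversion step
])
```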

C. EVALUATION
For evaluation, we use the average overall accuracy metric as in [33]. It is the average accuracy of the model on the test data of all tasks seen up to the current one. In addition, to evaluate the overall performance, we calculate the average incremental accuracy over all tasks, obtained by taking the average of the accuracies after each task. Each experiment is performed over 3 random seeds, and the mean is reported.
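As a simple illustration, the average incremental accuracy reduces to averaging the per-task average accuracies; the numbers below are hypothetical.

```python
def average_incremental_accuracy(acc_after_each_task):
    """Average incremental accuracy: mean of the average overall accuracies
    measured after finishing each task.

    acc_after_each_task: list where the i-th entry is the average accuracy on
    the test data of all tasks seen so far, evaluated after training task i."""
    return sum(acc_after_each_task) / len(acc_after_each_task)

# Hypothetical example: accuracies after each of 6 tasks
print(average_incremental_accuracy([0.78, 0.71, 0.66, 0.62, 0.59, 0.57]))
```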

V. RESULTS AND ANALYSIS

A. MAIN RESULTS
For the experiments on CIFAR-100 and mini-ImageNet, 50 classes are contained in the first task, following [33], and the remaining 50 classes are divided evenly into 5, 10, and 25 tasks. The average incremental accuracies for CIFAR-100 are shown in Table 1, and the accuracies after each task for T = 6, 11, 26 are shown as plots in Figure 6 (top). Our method shows better results in comparison with the baseline and the regularization methods.

FIGURE 8. Comparison of the Precision/Recall curves for features generated after each task with either the standard BIR method (left) or our improved version (right). Our method is able to retain a much better precision-recall tradeoff of the generated samples. The experiment is performed for CIFAR-100 and T = 6.
The second-best method is BIR+SI, but it is consistently worse than the proposed approach.
Similar results are presented for the mini-ImageNet dataset, which consists of bigger images than CIFAR-100. Table 2 presents the average incremental accuracy for this dataset. As for CIFAR-100, our method outperforms the others in terms of average incremental accuracy. Moreover, the difference between ours and BIR+SI becomes more significant with an increasing number of tasks: for T = 26 we reach 48.94 while BIR+SI reaches 43.78. The other regularization-based baselines fall far behind in this scenario. In Figure 6 (bottom), we see the accuracies after each task. For mini-ImageNet, BIR achieves better average accuracy in the second task for T = 6 and T = 11. This can be attributed to better plasticity (no SI). However, with longer training and more tasks, our method outperforms the others.
For both datasets, SI alone yields results comparable to finetuning. A simple application of LwF works well for a smaller number of larger tasks (T = 6 and T = 11), but for longer sequences (T = 26) the performance drops significantly. Here, a better adjustment of the regularization hyperparameters could play a more important role. Our proposed method does not suffer from this issue.

B. NUMBER OF CYCLES
We analyze the influence of the number of passes during the cycling procedure on the average incremental accuracy for 6 tasks. According to the results presented in Figure 7, there is a drop in performance for a small number of passes, but increasing the number improves the accuracy significantly. We suggest searching for an optimal number of passes depending on the dataset and split scenario used.

C. ABLATION STUDY
We perform an ablation study by adding the proposed modifications one by one to the baseline method. The obtained results can be seen in Table 3. The ablation study suggests that each of our modifications contributes significantly to the total performance of the model, with an overall increase in average incremental accuracy of 5.56% over the baseline.

D. ANALYSIS OF PRECISION AND RECALL
Finally, we analyze our model's performance in terms of the quality of the generations. To that end, we use the distribution precision and recall metrics proposed in [30]. As the authors indicate, these metrics disentangle the FID score into two aspects: the quality of generated results (precision) and their diversity (recall). We calculate these two metrics at the feature level and compare the resulting scores between the standard BIR method and our improved approach. As presented in Figure 8, our improvements allow the model to retain both higher precision and recall of the regenerated samples.
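For reference, a simplified k-NN manifold estimate of distribution precision and recall in the spirit of [30]; it produces single precision and recall values rather than the full curves shown in Figure 8, and the neighbourhood size k is an assumption.

```python
import torch

def knn_radii(feats, k=3):
    """Distance to the k-th nearest neighbour within the same set (excluding self)."""
    d = torch.cdist(feats, feats)                    # pairwise distances
    d.fill_diagonal_(float("inf"))
    return d.topk(k, largest=False).values[:, -1]    # (n,)

def precision_recall(real, gen, k=3):
    """k-NN manifold estimate of distribution precision and recall:
    precision is the fraction of generated samples falling inside the k-NN
    manifold of real samples, recall is the symmetric quantity."""
    real_r, gen_r = knn_radii(real, k), knn_radii(gen, k)
    d_gen_to_real = torch.cdist(gen, real)           # (n_gen, n_real)
    precision = (d_gen_to_real <= real_r.unsqueeze(0)).any(dim=1).float().mean()
    recall = (d_gen_to_real.T <= gen_r.unsqueeze(0)).any(dim=1).float().mean()
    return precision.item(), recall.item()

# Hypothetical usage with random feature sets
p, r = precision_recall(torch.randn(500, 64), torch.randn(500, 64))
```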

VI. CONCLUSION AND FUTURE WORK
In conclusion, we propose modifications to improve VAE-based generative replay in the class-incremental setting. We observe a disparity between the latent representations of the original and generated data. Therefore, we incorporate the latent match loss that addresses this problem. To mitigate the shift in the feature space during training on new data, we add the latent distillation loss. Finally, we propose the cycling of the generated features through the previous model to decrease the distance between the distributions of original and generated samples. This allowed us to scale generative approaches to more complex datasets, such as mini-ImageNet. The performed ablation study illustrates the increase in performance due to each component.
In the future, we plan to scale our method to perform well on more challenging scenarios, such as the ImageNet dataset and longer sequences of tasks. This is a notable limitation of numerous generative replay methods, which are unsuitable for larger datasets, whereas our approach holds a significant advantage in this regard.
A. IMPACT STATEMENT

By using the generative approach for continual learning, our method does not require storing exemplars of past data; therefore, it addresses concerns about private or sensitive data, which are applicable in some scenarios. However, generative models can retain the biases present in the training data, and we strongly advise a careful examination of their performance to ensure unbiased outcomes.

FIGURE 2. Visualisation of the latent matching loss. We minimize the difference between latent vectors of the original samples and their reconstructions.

FIGURE 3. Visualisation of the latent distillation loss that reduces the feature drift between tasks.

FIGURE 4. Visualisation of the cycling procedure. Each time we generate a batch of rehearsal samples (orange stars), we pass the generated outputs several times through the Variational Autoencoder in the recursive passing procedure. As a consequence, the final generations exhibit a considerably improved alignment with the reconstructions of the original training data (green dots).

FIGURE 5. Fréchet distance between the distributions of original and generated latent vectors depending on the number of cycles. Zero cycles means the model without the cycling procedure. As the number of cycles increases (up to some point), the distribution of generated representations better aligns with the original one.

Algorithm 1.

FIGURE 6. Comparison of average accuracies on CIFAR-100 (top) and mini-ImageNet (bottom) after each task for 6, 11, and 26 tasks (from left to right), with the first task containing 50 classes.

FIGURE 7. Average incremental accuracy as a function of the number of cycles for T = 6.

TABLE 1. The average incremental accuracies on CIFAR-100 with the first task containing 50 classes and the remaining 50 classes split equally into 5, 10, and 25 tasks.

TABLE 2. The average incremental accuracies on mini-ImageNet with the first task containing 50 classes and the remaining 50 classes split equally into 5, 10, and 25 tasks.

TABLE 3. Ablation study of our method for the class-incremental learning setting with T = 6 and CIFAR-100. Average incremental accuracy is reported for ResNet-32.