Confidence Calibration for Incremental Learning

Class incremental learning is an online learning paradigm in which the classes to be recognized are gradually increased under a limited memory budget, storing only a partial set of examples of past tasks. At a task transition, we observe an unintentional imbalance of confidence, or likelihood, between the classes of the past tasks and the new task. We argue that this imbalance aggravates catastrophic forgetting in class incremental learning. We propose a simple yet effective learning objective to balance the confidence between the classes of the old tasks and the new task in the class incremental learning set-up. In addition, we compare various sample memory configuration strategies and propose a novel sample memory management policy to further alleviate forgetting. The proposed method outperforms the state of the art in many evaluation metrics, including accuracy and forgetting ($F$), by a large margin (up to 5.71% in $A_{10}$ and 17.1% in $F_{10}$) in extensive empirical validations on multiple visual recognition datasets such as CIFAR100, TinyImageNet and a subset of ImageNet.


I. INTRODUCTION
We consider the problem of 'incremental learning (IL)', a continual learning paradigm in which the model is expected to learn a set of tasks and infer a sample's label without knowing its originating task. Specifically, we consider a classification problem where new sets of classes are added incrementally with a budgeted memory storing the training examples of past tasks, referred to as 'class-incremental learning' [1]. As in other continual learning set-ups, IL algorithms suffer from the stability-plasticity dilemma [2]. This dilemma is often addressed as two contradicting issues treated separately: catastrophic forgetting of old knowledge and intransigence in learning new knowledge [3]. With the success of deep neural networks in machine learning [4]-[6], many IL algorithms are based on deep neural networks [3], [7]-[9]. Unfortunately, training a neural network is known to suffer from catastrophic forgetting even outside the incremental scenario [10], [11], and the forgetting problem worsens in class incremental learning scenarios.
The limited memory resource in the IL set-up, which we name a 'sample memory' (also called a 'representative memory' [9]), is used to alleviate forgetting [12]. But the budget limitation causes two critical problems for supervised learning: 1) class imbalance between the classes in the previous tasks and the new task, and 2) the lack of an optimal sampling strategy. The class imbalance problem leads a model to overfit to the new task and thus makes examples from the past tasks prone to being inferred as new-task examples. The sampling policy is less explored in the literature.

(The associate editor coordinating the review of this manuscript and approving it for publication was Benyun Shi.)
Investigating the class imbalance problem and the suboptimal sampling strategy, we observe two phenomena: 1) the predicted confidence, or likelihood, of an example at a task transition has an artificial offset toward the new task due to the sample memory, and 2) most sampling strategies are only marginally better than uniform random sampling. Based on these observations, we propose 1) a novel method to calibrate the inferred confidence to alleviate forgetting in IL models by balancing the likelihoods of the past and new tasks' classes, and 2) a novel sample memory management policy using a proposed scoring function based on the cumulative likelihood over the epochs. In empirical validations, we show that the proposed method outperforms state-of-the-art IL algorithms in various metrics, including the average accuracy ($A_{avg}$) over all task transitions, the accuracy of the final model ($A_{10}$) and forgetting ($F_{10}$), by a non-trivial margin (up to 5.71% in $A_{10}$ and 17.1% in $F_{10}$).

A. RELATED WORK 1) LEARNING SEQUENCE OF TASKS
A machine learning paradigm in which a model is trained on a sequence of tasks has been referred to by various names such as {continuous, continual, sequential, incremental, life-long} learning. As these terms are not precisely defined, each work has its own learning scenarios and constraints. But all these paradigms face a common problem: the stability-plasticity dilemma [13], which refers to the contradicting difficulties for a model to not only learn a new task but also keep past knowledge while the statistics of the samples continuously change. In particular, forgetting past knowledge, referred to as catastrophic interference or forgetting [10], [11], is exacerbated by the use of deep neural networks. Among the strategies summarized in [1], we are particularly interested in continual learning with a memory of previous tasks and no task identity given at inference, which is referred to as 'incremental learning' in [12]. Please refer to [14] for a comprehensive review.
Incremental learning (IL) is a sequential learning paradigm with limited memory resources [7], [9], [12]. There are many proposals on memory management strategies [3], [7], [9], better regularization strategies [8], [15]-[19], using an external memory with unlabeled data [20], and a method robust to the order of given tasks when task identity is given [21]. We review the ones addressing forgetting and the sample memory management policy in detail.
Catastrophic forgetting or interference has been well studied but is still an open problem in neural networks [16], [18], [19], [22]-[27]. Catastrophic forgetting is aggravated in the incremental learning set-up [10], [11]. There are two problem set-ups in incremental learning. We briefly review how previous works alleviate catastrophic forgetting in both set-ups as follows.
The first set-up is class incremental learning, in which the task identity is unknown during inference. In this set-up, recent works [3], [7], [9] use previous tasks' samples that contain the knowledge of previous tasks. iCaRL [7] and end-to-end incremental learning (E2E) [9] use knowledge distillation [28] to preserve the knowledge of previous tasks. Chaudhry et al. [3] use elastic weight consolidation (EWC) [8]. Our proposed method falls into this problem set-up.

2) SAMPLE MEMORY MANAGEMENT
As the sample memory management policy affects accuracy, a few strategies have been proposed in the literature [3], [7], [9]. Interestingly, many proposals show marginal accuracy improvement over uniform sampling despite their computational complexity [3], [7], [9]. These methods include Herding selection [33] used in [7], sampling proportional to a histogram of each sample's distance to the class mean [9], using the distance to the decision boundary, and the entropy of the output softmax distribution [3]. Recently, generative models have been employed to generate past-task samples [34]-[37] instead of sampling. The sample generation strategy for IL is an active research topic and shows promising results in relatively straightforward experimental validations (e.g., on MNIST and SVHN). But on these datasets, sampling from the uniform distribution already achieves saturated accuracy [3], and no promising results have been reported on challenging datasets yet. In addition, due to the difference in the number of samples belonging to the past and the new tasks, we argue that the class imbalance problem exacerbates the overfitting to the new tasks, which leads to an accuracy drop. The class imbalance problem, however, is relatively less explored in the IL context [9], [38], [39]. Ditzler et al. propose a dynamically-weighted consult and vote (DW-CAV) algorithm for an ensemble of classifiers for better plasticity [39]. Pang et al. use the weighted-LPSVM to address class imbalance in IL [38].

3) CONFIDENCE CALIBRATION IN NEURAL NETWORK
Guo et al. show that a high-capacity neural network (e.g., ResNet) exhibits overconfidence, which leads to an accuracy drop in the ordinary learning set-up, and propose multiple logit calibration methods [40]. We observe overconfidence in the classes of the new task in the IL set-up and propose to calibrate the confidence. Pereyra et al. use a maximum entropy prior on the logits as a regularizer against overly confident output distributions [41]. Kuleshov and Ermon propose a calibration method for the online learning set-up, particularly when inputs can come from a potentially adversarial source [42]; their objective is clearly different from our proposal, which addresses the artificial overconfidence on the set of new classes.

4) EVALUATION METRICS FOR INCREMENTAL LEARNING
The average accuracy of an incrementally trained classification model over all task transitions is the most widely used evaluation metric [9], [20]. The average may exclude the accuracy on the first task, as the model at the first task is not yet incrementally trained [9], [20]. The accuracy at the last task has also been used [3]. There are a number of measures for evaluating the amount of forgetting in IL models, including Backward transfer, Forward transfer, Remembering, and measures for the efficiency of IL models [3], [29], [43]. The Backward transfer (BWT) captures the influence of learning a new task on the previous tasks, which can be interpreted as a measure of forgetting [29]. The BWT averages the accuracy drops at all task transitions. As the BWT can be negative, [44] further proposes variants of BWT: Positive Backward Transfer ($\max(BWT, 0)$) and Remembering ($1 - |\min(BWT, 0)|$). Chaudhry et al. propose a measure of forgetting ($F$) which computes the accuracy drop between the best accuracy over all tasks and the accuracy at the current task [3]. There are also measures for the model's ability to learn new knowledge: Forward Transfer (FWT) [29], [43] and Intransigence ($I$) [3]. Other than these recognition performance measures, [44] proposes multiple measures to evaluate the efficiency of IL models.
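As a concrete sketch, the BWT [29] and $F$ [3] metrics above can be computed from an accuracy matrix $R$, where $R_{i,j}$ is the accuracy on task $j$ measured right after training on task $i$; this is a minimal NumPy sketch under the common matrix-layout convention, not code from the paper:

```python
import numpy as np

def backward_transfer(R):
    # BWT [29]: average change in accuracy on each past task after the
    # final task, relative to when that task was first learned.
    # R[i, j] = accuracy on task j right after training task i.
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

def forgetting(R):
    # F [3]: average drop from the best accuracy ever reached on each
    # past task to its accuracy after the final task.
    T = R.shape[0]
    return float(np.mean([np.max(R[:T - 1, j]) - R[T - 1, j]
                          for j in range(T - 1)]))
```

The variants of [44] follow directly, e.g. Positive Backward Transfer is `max(backward_transfer(R), 0)` and Remembering is `1 - abs(min(backward_transfer(R), 0))`.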

II. APPROACH
We consider a class-incremental classification problem where tasks are defined as disjoint sets of classes and are sequentially given [1]. Formally, we are given a dataset $D = \{x_i, y_i\}$ where $x_i$ is the $i$-th example, $y_i \in \{1, \ldots, C\}$ is the corresponding ground truth label and $C$ is the number of classes in the dataset. Each task is a set of classes to be recognized. On a set of tasks $T = \{t_1, \ldots, t_n\}$, a classification model at the $k$-th task, parameterized by $\theta_{t_k}$, is learned, where $t_k = \{c \mid c \in \{1, \cdots, C\},\ c \notin T_p^k\}$ is the $k$-th task and $T_p^k$ is the set of previous tasks ($T_p^k = t_1 \cup \ldots \cup t_{k-1}$), with a fixed-size sample memory [12] that contains a subset of the examples in $T_p^k$, and the previous task's model parameters $\theta_{t_{k-1}}$. Each task $t_k$ consists of a set of classes whose configuration and arrival order are not given to the model a priori.
In the literature [3], [8], [9], [16], [17], the IL objective is formulated with two terms: remembering the knowledge learned in the past tasks, $L_R(\cdot)$, and learning new knowledge in the new (current) task, $L_N(\cdot)$: $L(D; \theta_{t_k}) = L_R(D; \theta_{t_k}) + L_N(D; \theta_{t_k})$. Given that the IL set-up uses the sample memory, we further split $L_R(\cdot)$ into a remembering loss for preserving the knowledge of past tasks, $L_P(\cdot)$, and a reminding loss for the past knowledge through the sample memory, $L_M(\cdot)$. Following the notation of [9], we consolidate $L_M(\cdot)$ and $L_N(\cdot)$ into $L_C(\cdot)$, which we call the classification loss. We write the incremental learning objective at task $t_k$ as:

$$L(D; \theta_{t_k}) = L_P(D; \theta_{t_k}) + L_C(D; \theta_{t_k}). \quad (1)$$

Following [9], we use the knowledge-distillation loss [28] for $L_P(\cdot)$ because of its effectiveness in transferring the knowledge of past tasks. For $L_C(\cdot)$, we propose a novel loss, based on cross-entropy, that calibrates the imbalanced logits between the classes of the previous tasks and the new task to address overconfidence on the new task.
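To illustrate the structure of the objective, the NumPy sketch below combines a temperature-softened distillation term (standing in for $L_P$ [28]) with a plain cross-entropy term standing in for $L_C$ (which the paper replaces with the logit-balanced version). This is a minimal sketch with function names of our own choosing, not the paper's training code; the temperature $T = 2$ is the hyper-parameter mentioned in the experiments:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # L_P sketch: cross-entropy between temperature-softened teacher
    # and student distributions [28], preserving past-task knowledge.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(-np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=-1)))

def il_objective(student_logits, teacher_logits, labels, T=2.0):
    # Eq. (1): L = L_P + L_C, with plain cross-entropy as L_C here.
    p = softmax(student_logits)
    n = len(labels)
    l_c = float(-np.mean(np.log(p[np.arange(n), labels] + 1e-12)))
    return distillation_loss(student_logits, teacher_logits, T) + l_c
```

In practice the teacher logits come from the frozen model $\theta_{t_{k-1}}$ of the previous task.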

A. OVERCONFIDENCE ON NEW TASK
In training a model on $t_k$, the data given to the model consists of a subset (usually a small fraction) of examples of the past tasks in the sample memory and a large number of examples of the new task. Despite forgetting prevention mechanisms such as fine-tuning or knowledge distillation [28], the model tends to overfit to the small set of samples from the past tasks and the samples of the new task. This makes the model less generalizable for the classes in the past tasks, which can be interpreted as forgetting of the past knowledge, and further biases the model toward classifying into the classes of the new task through the increased confidence of the new-task classes in the inferred likelihood $\hat{y}$. The artificially high confidence for the classes in the new task results in low classification accuracy, especially on the past tasks, and eventually leads to low overall accuracy. Figure 1-(a) compares the average of logits on old and new classes over all test samples at the last task (Baseline), and Figure 1-(b) shows the corresponding confusion matrices. We observe that the average logit of the new classes is much higher than that of the old classes (Baseline) (Figure 1-(a)), and this imbalance causes artificially high confusion of the old classes toward the new classes (Figure 1-(b)).

1) LOSS FUNCTION TO COMPENSATE FOR THE OVERCONFIDENCE
To compensate for the unexpected logit differences between the past tasks and the new task, we propose a calibrated loss that adds the difference between the logits of the classes of the past tasks ($T_p^k$) and the classes of the new task ($t_k$) to the logits of the new classes. We call this addition 'Logit Balancing'. It differs from [45], which requires a held-out set to optimize a bias correction layer, because we calibrate the logits during training without requiring a held-out set.
Formally, the classification loss $L_C(D; \theta_{t_k})$ with the logit balancing is written as a modified cross-entropy loss:

$$L_C(D; \theta_{t_k}) = -\frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} \log \tilde{\sigma}(z_i)_{y_i}, \quad (2)$$

where $\tilde{\sigma}(\cdot)$ is our modified softmax function with the logit balancing, which adds the balancing parameter $\gamma$ to the logits of the classes in $t_k$ before normalization. We compute and update $\gamma$ in each epoch of training from the difference between the average logit of the classes in $t_k$ and that of the classes in $T_p^k$:

$$\gamma = \alpha - \beta, \quad (3)$$

where $\alpha$ and $\beta$ are the averaged logits of the current task and the previous tasks, respectively, excluding the logit of the ground truth class, which is not part of the abnormal logits.
Here $D_k$ denotes the training set at the $k$-th task; we train the model together with the sample memory $M_k$ at the $k$-th task, and $|D_k|$ is its cardinality. In computing $\alpha$ and $\beta$, we average the logits of the new and old tasks respectively to compute the 'overall' logit difference between the old and new tasks. For the difference statistic in particular, we use $\alpha - \beta$ over a ratio ($\alpha/\beta$) for numerical stability in learning, since $\beta$ is often close to zero. When $\gamma > 0$, the $\gamma$-adjusted logit increases the loss on the new classes, and thus the gradient magnitude for the new classes increases. The larger gradient penalizes the overconfidence through back-propagation and reduces the artificially high confidence in the new classes. When $\gamma < 0$, in contrast, the $\gamma$-adjusted logit increases the loss on the old classes and penalizes the old classes. When $\gamma = 0$, the modified cross-entropy is identical to the original cross-entropy. Note that the proposed logit balancing can be combined with any method that uses a vanilla cross-entropy loss for $L_C(\cdot)$.
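A minimal NumPy sketch of the logit balancing described above, under the reading that $\gamma$ is added to the new-class logits inside the softmax (so that $\gamma = 0$ recovers the vanilla cross-entropy, as stated); the function names and exact masking details are our assumptions, not the paper's code:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def estimate_gamma(logits, labels, new_classes):
    # gamma = alpha - beta: average logit over new-task classes minus
    # the average over old-task classes, excluding each sample's
    # ground-truth logit (which is legitimately large).
    z = np.asarray(logits, dtype=float)
    mask = np.ones_like(z, dtype=bool)
    mask[np.arange(len(labels)), labels] = False     # drop GT logits
    new = np.zeros(z.shape[1], dtype=bool)
    new[list(new_classes)] = True
    alpha = z[:, new][mask[:, new]].mean()
    beta = z[:, ~new][mask[:, ~new]].mean()
    return float(alpha - beta)

def balanced_cross_entropy(logits, labels, new_classes, gamma):
    # Modified cross-entropy: add gamma to the new-class logits inside
    # the softmax, so inflated new-class confidence raises the loss and
    # its gradient pushes those logits back down. gamma = 0 recovers
    # the vanilla cross-entropy.
    z = np.array(logits, dtype=float)
    z[:, list(new_classes)] += gamma
    p = softmax(z)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))
```

With overconfident new-class logits and old-class ground truths, a positive $\gamma$ strictly increases the loss relative to the vanilla cross-entropy, which is the intended penalization.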
We quantify the overconfidence on the new task by the Expected Calibration Error (ECE) [46], as in [40]. In Figure 2, we visualize the overconfidence via the reliability diagram [47] and the ECE of three models (refer to the caption for details). As shown in the figure, the E2E model [9], which uses an advanced forgetting prevention mechanism, and the baseline both exhibit overconfidence, indicated by large gaps and high ECE values. In contrast, our proposal reduces both the gap and the ECE value. Note that the ECE of ours is not zero because our calibration objective is to predict the correct task, while the ECE quantifies the difference between class accuracy and predicted confidence. Nevertheless, the significantly reduced ECE value shows that the proposed 'logit balancing' method reduces the calibration error resulting from the overconfidence.
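For reference, the ECE [46] used above bins predictions by confidence and averages the accuracy-confidence gap weighted by bin population; a short NumPy sketch (the function name and bin count are ours):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    # Bin predictions by their confidence, then accumulate the
    # |accuracy - mean confidence| gap of each bin, weighted by the
    # fraction of samples falling into that bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

A perfectly calibrated model (confidence matches accuracy in every bin) scores 0; an overconfident model scores the average amount by which confidence exceeds accuracy.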

B. SAMPLE MEMORY MANAGING POLICY
Following the notion of a 'representative memory' [9] that resembles the human forgetting procedure, we define $M_k$ as the representative memory at task $t_k$ and call it the 'sample memory' for simplicity. At the first task transition ($1 \rightarrow 2$, the beginning of $t_2$), examples of the previous tasks ($T_p^2$) are sampled into the sample memory by a policy. Then, at a later task transition ($k \rightarrow k+1$, where $k > 1$), examples of $T_p^k$ in the sample memory are deducted and examples of $t_k$ are added by the policy while keeping the memory size fixed. To better maintain the past knowledge with the fixed budget, we need a good policy for adding and deducting examples. However, most of the proposed sampling strategies are either uniform random sampling or complicated yet only marginally better than uniform random [3], [7], [9]. We argue that the confidence or likelihood of an example can be a measure of knowledge about the class distribution: if an example is highly confident, it is highly representative of (as it is most frequent in) the distribution of its class.
The likelihood at each epoch of learning, however, also depends on the learning progress; for instance, at the end of learning a task, the majority of samples have a likelihood around 1.0 unless the network underfits. To alleviate the effect of the training progress when extracting the knowledge of the class distribution from the inferred likelihood, we propose to average the likelihoods over the epochs (called the cumulative likelihood, denoted as $\xi_i$) and use it as a score for sampling. Formally, the cumulative likelihood $\xi_i$ of the $i$-th sample is written as:

$$\xi_i = \frac{1}{E} \sum_{e=1}^{E} \sigma(z_i)_{e, y_i}, \quad (4)$$

where $E$ is the number of epochs in training a model for a task and $\sigma(z_i)_{e, y_i}$ is the likelihood of the ground truth class at epoch $e$. Note that $\sigma(\cdot)_e$ is the vanilla softmax function, not the modified version proposed in Eq. 2. Figure 3 shows the histogram of the cumulative likelihood with a few examples shown; we argue that those examples represent the knowledge about the class well (e.g., the shape of the tree class is clearly shown in the examples with high $\xi_i$) and vice versa. Figure 4 shows more examples: images in CIFAR-100 with various cumulative likelihood scores. Similar to the examples in Figure 3, we observe that the samples with a high score (cumulative likelihood) are the ones with representative shapes of the class they belong to, while the images with a low score exhibit the class-oriented appearance only in a fraction of the image (e.g., sunflower and mushroom) or barely visibly (e.g., bus).
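The cumulative likelihood above can be tracked with a running sum updated once per epoch; a minimal NumPy sketch (class and method names are ours):

```python
import numpy as np

class CumulativeLikelihood:
    """Track xi_i: each sample's ground-truth softmax likelihood
    averaged over the training epochs (vanilla softmax)."""

    def __init__(self, n_samples):
        self.sum = np.zeros(n_samples)
        self.epochs = 0

    def update(self, logits, labels, indices):
        # Call with the logits produced for the samples at `indices`
        # during the current epoch.
        z = np.asarray(logits, dtype=float)
        z = z - z.max(axis=1, keepdims=True)       # stable softmax
        p = np.exp(z)
        p = p / p.sum(axis=1, keepdims=True)
        self.sum[np.asarray(indices)] += p[np.arange(len(labels)), labels]

    def end_epoch(self):
        self.epochs += 1

    def scores(self):
        # xi_i = (1/E) * sum over epochs of the ground-truth likelihood
        return self.sum / max(self.epochs, 1)
```

The resulting `scores()` are the per-sample values that drive the memory selection policy described next.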
Toneva et al. claim that 'unforgettable' examples are not very useful during training [22]. Following their claim, we propose to discard those highly 'unforgettable' samples (specifically, the top 20% of the training set sorted by $\xi_i$ in descending order) and randomly sample the examples from the rest. At the next task transition, say $t_{k+1}$, we discard the examples with the lowest score for each class in the sample memory (the number of classes is $(k-1) \cdot |t_k|$) due to the fixed memory budget, where $|M_k|$ denotes the memory budget and $|t_k|$ denotes the number of classes in task $t_k$, and add examples belonging to the new task. In the removal or deduction process for each class, we first discard the top 20% 'unforgettable' samples. Among the remainder, we discard all samples scoring lower than the top $\eta = \frac{|M_k|}{(k-1) \cdot |t_k|}$ samples. We then randomly select $\frac{|M_k|}{k \cdot |t_k|}$ samples among the $\eta$. We illustrate the proposed sample memory managing policy in Figure 5.

(Figure 6 caption: The score for the 'Ours' selection method is calculated by the likelihood of a pre-trained model on CIFAR100. 'Ours-Reversed' selects the samples with the lowest score, the opposite of 'Ours'. 'Drop top-k best' and 'Drop top-k worst' drop the 'k' best and worst samples with respect to the cumulative likelihood, respectively. Note that 'Ours' outperforms the others.)
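The per-class selection step above can be sketched as follows; this is a simplified sketch that covers only the drop-top-20%-then-sample-uniformly step (the deduction at later transitions applies the same scores against the shrinking per-class budget), and the function name is ours:

```python
import random

def select_for_class(scores, n_keep, drop_top_frac=0.2, seed=0):
    # Per-class selection: discard the top `drop_top_frac` most
    # 'unforgettable' samples (highest cumulative likelihood), then
    # draw the class budget uniformly at random from the remainder.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_drop = int(len(order) * drop_top_frac)
    candidates = order[n_drop:]           # everything below the top slice
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, min(n_keep, len(candidates))))
```

The random draw among the surviving candidates keeps the memory diverse instead of greedily storing only the highest-scoring samples.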
In Figure 6, we empirically show that our scoring function performs better than the scoring functions used in the state-of-the-art methods. Interestingly, reversing the score ($1 - \xi_i$) results in the worst selection policy, worse than random. This implies that our scoring function indeed measures the worth of samples to be kept in the memory.

C. EVALUATION METRICS
The $F$ metric [3] and the BWT [29] have been proposed to measure forgetting. However, we argue that the $I$ metric [3], which is defined for the 'single-head configuration', yields an artificially favorable $I$ in more forgetful models, because new-task samples are barely confused with the old tasks, as shown in the 'Baseline' of Figure 1-(a). To exclude the effect of forgetting from the $I$ measure, we propose to also report $I_{10}$ evaluated in a multi-head manner, which confines the accuracy computation to the classes of the sample's own task. At first glance, it may sound unreasonable to use a multi-head based metric for a single-head configuration. But we argue that, even though the model is trained in a single-head configuration, the intransigence in a multi-head configuration is a helpful metric for the model's plasticity, i.e., its ability to learn new knowledge without the effect of forgetting, because the model classifies a sample only among the classes belonging to the given task in the multi-head configuration. Thus, we additionally report the $I$ at task $t_k$ in the multi-head configuration [3], denoted by $I^{mh}_k$, in all our experimental validations.
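The multi-head evaluation above amounts to restricting each argmax to the classes of the sample's own task; a NumPy sketch (the function name and the two lookup tables are our assumptions):

```python
import numpy as np

def multi_head_accuracy(logits, labels, task_of, task_classes):
    # Multi-head evaluation: each prediction is restricted to the
    # classes of the sample's own task, so confusion with other tasks
    # (and hence forgetting) does not affect the measurement.
    # task_of:      ground-truth label -> task id
    # task_classes: task id -> list of class indices in that task
    logits = np.asarray(logits, dtype=float)
    correct = 0
    for z, y in zip(logits, labels):
        cls = task_classes[task_of[y]]          # classes of y's task
        pred = cls[int(np.argmax(z[cls]))]      # argmax within the task
        correct += int(pred == y)
    return correct / len(labels)
```

Note the contrast with single-head evaluation, where an inflated new-class logit would capture the argmax regardless of the sample's task.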
III. EXPERIMENTS
A. EXPERIMENTAL SET-UP 1) DATASETS
The CIFAR100 dataset consists of 60,000 images of 32 × 32 pixels in 100 object categories. Among the 60,000 images, 50,000 are used for training a model and 10,000 for testing.
The TinyImageNet dataset consists of 120,000 images of 64 × 64 pixels in 200 object categories. Among the 120,000 images, 100,000 are used for training, 10,000 for validation, and 10,000 for testing. Note that we opt to use the validation set for testing instead of submitting predicted labels on the test set to the evaluation server hosted by Stanford.
For the ImageNet-100 dataset, we use 1,300 training samples and 50 validation samples from each class, following [7], [9]. Note that the random 100 classes of our subset may differ from those of [7], [9]. We resize (aspect-ratio preserving) and crop all images to 224 × 224 pixels.

2) TASK CONFIGURATION
Following the standard task configuration [7], [9], we define 10 tasks where each task consists of 10 classes (CIFAR100 and ImageNet-100) or 20 classes (TinyImageNet). Since the task configuration should be blind to the incremental learning algorithm, we report the average and standard deviation over three different random task configurations, following [3], [9].
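Generating such a random task configuration is a matter of shuffling the class labels and splitting them into disjoint, equally-sized groups; a small sketch (function name is ours):

```python
import random

def make_task_configuration(n_classes, n_tasks, seed):
    # Shuffle the class labels and split them into equally-sized
    # disjoint tasks, e.g. 100 CIFAR100 classes -> 10 tasks of 10.
    rng = random.Random(seed)
    classes = list(range(n_classes))
    rng.shuffle(classes)
    per_task = n_classes // n_tasks
    return [classes[i * per_task:(i + 1) * per_task]
            for i in range(n_tasks)]
```

Running this with three different seeds yields the three random configurations over which results are averaged.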

3) SAMPLE MEMORY SIZE
We choose 2,000 for the experiments on CIFAR100 following [7], and choose 4,000 and 5,200 for the experiments on TinyImageNet and ImageNet-100, respectively (the same relative size considering the training set size).

4) EVALUATION METRICS
For the evaluation metrics, we use the average accuracy ($A_{avg}$) over all task transitions (excluding the first task's accuracy, as it is not incrementally learned [9], [20]), the final model's accuracy ($A_k$, where $k$ is the final task; it is denoted as 'Average' in [44]), and the forgetting ($F_k$), intransigence ($I_k$) and multi-head intransigence ($I^{mh}_k$) metrics at task $t_k$ proposed by [3]. Note that some previous works include the first task's accuracy in the average accuracy [7].

5) MODEL ARCHITECTURES AND TRAINING DETAILS
We use ResNet-32 [52] as the backbone for the experiments on CIFAR100 and TinyImageNet, with an input size modification, and ResNet-18 [52] for the experiments on ImageNet-100. We use images of size 32 × 32, 64 × 64 and 224 × 224 for CIFAR100, TinyImageNet and ILSVRC 2012, respectively. For each incremental step, we train the network for 80 epochs. The learning rate starts at 0.1 and is divided by 10 every 20 epochs. We train the networks using stochastic gradient descent with mini-batches of 128 samples, a weight decay of $10^{-4}$, and a momentum of 0.9. For data augmentation, we use flipping and random crops in all experiments.
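The step learning-rate schedule above can be written as a one-line rule (a sketch; the function name is ours):

```python
def learning_rate(epoch, base_lr=0.1, drop_every=20, factor=10.0):
    # Step schedule per incremental step: start at 0.1 and divide by
    # 10 every 20 epochs over the 80-epoch training budget.
    return base_lr / (factor ** (epoch // drop_every))
```

Over the 80 epochs this yields learning rates of 0.1, 0.01, 0.001 and 0.0001 for epochs 0-19, 20-39, 40-59 and 60-79, respectively.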

B. IMPLEMENTATION DETAILS
We compare our method with a baseline and two state-of-the-art methods. The baseline is a vanilla IL algorithm with a uniform sampling policy for the sample memory management, performing fine-tuning for $L_P(\cdot)$. The two state-of-the-art methods are our extension of EWC [8] with the sample memory, denoted as EWC+SM, and the E2E [9]. We evaluate the methods by various metrics: average accuracy ($A_{avg}$), accuracy at the final (10-th) task ($A_{10}$), forgetting ($F_{10}$) and intransigence ($I_{10}$).
We apply the data augmentation technique (DA) on the CIFAR100 dataset, following [9]. Table 1 shows that the DA does not diminish the performance gain of our method over the other state-of-the-art methods in any evaluation metric.
We do not compare with continual learning methods that require the task identity for inference, such as SI [15], MAS [31], HAT [32], GEM [29] and VCL [30], since that set-up is less challenging than ours, which does not require any task information. Note that the results of EWC+SM are not directly comparable to those reported in [3], since we use a stronger backbone network architecture (ResNet-32) and a budgeted sample memory.

C. COMPARISON WITH STATE OF THE ARTS
Our proposed method outperforms the state of the art by large margins in all metrics except $I_{10}$, as shown in Table 1 (up to 5.71% in $A_{10}$). This implies that our method improves the forgetting at the expense of the intransigence ($I_{10}$). Observing that the best model in terms of intransigence is the baseline, we argue that its intransigence measure is artificially lower (i.e., better) due to the overconfidence on the new task at the expense of severe forgetting (see the related discussion of Figure 1 in Sec. II-A). Thus, we additionally report $I^{mh}_{10}$ to show the intransigence without the effect of overconfidence (see Sec. II-C for a detailed discussion). On the $I^{mh}_{10}$ metric, an intransigence measure excluding the effect of forgetting, our proposed method outperforms the E2E model.

VOLUME 8, 2020
(TABLE 1. Comparison with state-of-the-art methods on the CIFAR100 dataset. The E2E refers to E2E incremental learning [9]. The (DA) denotes the usage of the data augmentation scheme used in [9], to ensure that our reproduction of the E2E method is valid. '∼' refers to a value inferred from graphs in the paper. 'Ours w/o x' refers to the ablated models of our proposed method; for x, KD refers to knowledge distillation [28], SM refers to the sample memory and LB refers to the proposed 'Logit Balancing'. [Abl.] denotes the ablation study. ↑ indicates higher is better, ↓ vice versa.)
In Figure 7, we show the detailed accuracies and the forgetting ($F$) at each task transition as a function of the number of tasks given. Note that, in Figure 7-(d), the forgetting of the other models increases as they learn more tasks. In contrast, our method shows not only the lowest forgetting but also almost no increase even after it learns several tasks. Meanwhile, as shown in Figure 7-(a), (b), and (c), the accuracy gradually decreases as more tasks are given. Our method shows a clear gain over the state of the art in both accuracy and forgetting.
Recently, Lee et al. [20] proposed the 'global distillation' objective for incremental learning. However, their experimental set-up is significantly different from ours.
In particular, the global distillation objective requires an additional neural network specialized to the current task. To compare [20] with our method, we modified the following in the experimental set-up: 1) as they use a large stream of external data while we do not, we excluded the external data to isolate the effect of the proposed loss function; and 2) we changed their network backbone (Wide-ResNet [53]) to ResNet [52], the same architecture that we use. We show the comparison results in Table 3.
Despite the smaller network capacity, ours outperforms [20] in $A_{avg}$, $A_{10}$, and $F_{10}$. Ours shows relatively higher intransigence ($I_{10}$), but we argue that Lee et al. achieve the better intransigence ($I_{10}$) at the expense of severe forgetting of the previous tasks.

D. ABLATION STUDY
The ablation results ([Abl.] in Table 1) show that our method takes only a marginal benefit of the knowledge distillation (KD); thus removing the KD results in a marginal accuracy drop (within the standard deviation of $F_{10}$). Note that the logit balancing significantly improves the forgetting (by 13.99%: 31.38% → 17.28%), and the sample memory management policy also contributes significantly (by 4.19%: 21.47% → 17.28%). However, the logit balancing sacrifices the $I_{10}$ heavily, because the single-head $I_{10}$ benefits from the very overconfidence that accompanies the forgetting (see Sec. II-C). In $I^{mh}_{10}$ (the intransigence without the effect of the overconfidence), the logit balancing increases the intransigence only marginally.

E. EVALUATION ON VARIOUS TASK CONFIGURATIONS
Yoon et al. [54] emphasized that the configuration (e.g., size and order) of tasks in continual learning strongly affects the performance of the model. Hence, we consider various task configurations. We compare ours to the E2E on various task sizes in Table 4. Note that ours consistently outperforms the E2E across the task sizes. We also evaluate our method on a task configuration in which tasks are remarkably different from each other while the classes within a task are mutually confusing ('Distinctive-10' in Table 4). To make the Distinctive-10 set-up, we use the super-class definition of CIFAR100 (each super-class contains 5 classes): we merge two super-classes into a task and conduct experiments. Ours outperforms the E2E by a similar margin as in the task-size-10 setting, but the accuracy drops by ∼2%.

(TABLE 3. Comparison of [20] and ours on the CIFAR100 dataset. ↑ indicates higher is better, ↓ vice versa. Note that [20] uses additional network capacity.)

F. COMPARABLE METHODS TO THE LOGIT BALANCING
We argue that the logit balancing alleviates the effect of overfitting to the new classes, leading to better accuracy and less forgetting. To support the argument, we compare the accuracy of models with various balancing methods: one with no balancing (denoted as 'Baseline'), one with the proposed logit balancing (denoted as 'Logit Balanced'), and one with the balanced fine-tuning (denoted as 'Balanced Fine-tuning' or BF) proposed in [9]. To isolate the benefit of the balancing algorithms, we use simple fine-tuning for $L_P(\cdot)$ and simple uniform random sampling in Figure 1-(c), (d).
As shown in Figure 1-(b), the BF alleviates the class imbalance problem and thus the confusion from the classes of the old tasks to the classes of the new task, which is what the logit balancing aims at. We hypothesize that the logit balancing implicitly alleviates the class imbalance problem as well. Since the bias correction layer [45] requires a held-out set for optimization after training, it is not directly comparable to our logit balancing. Figure 7-(b), (e) show that our logit balancing outperforms the BF in accuracy and forgetting at all task transitions, and the margin increases as the model learns more classes. At the final task, the logit balancing outperforms the BF by a large margin (+4.41% in accuracy and −19.61% in $F_{10}$). Other than the baseline and the BF, we also compare our method with a post-processing method that reduces the bias from data imbalance and from the teacher network during distillation [55], by applying it to the E2E method. Our method ($A_{avg}$: 59.12% ± 2.98% and $A_{10}$: 48.87% ± 2.14%) outperforms the E2E with the post-processing ($A_{avg}$: 57.08% ± 1.53% and $A_{10}$: 46.73% ± 0.13%), where we use the same hyper-parameter used in the other experiments, $T = 2$.

G. COMPARISON WITH OTHER SAMPLE MEMORY MANAGEMENT POLICIES
We now compare our proposed sample memory management ('Our SM') with random sampling ('Random') and Herding selection ('Herding'). To isolate the effect of the sample memory management policy alone, we use simple fine-tuning for $L_P(\cdot)$ and use neither the proposed logit balancing nor the balanced fine-tuning [9]; we summarize the results in Figure 7-(c) and (f).
We observe that our sample memory managing method outperforms the state of the arts by a large margin (+2.44% in accuracy and −2.98% in F_10 at the last task). More interestingly, the margin increases as the model learns more classes, i.e., as the incremental steps progress. Figure 3-(b) shows that the choice of selection method makes a large difference in performance, especially when few examples remain per class (e.g., 10, 20, 50, or 100) in the sample memory. Note that the improvement in accuracy by our selection policy over the other methods is marginal at the early tasks but increases later, because the number of samples per class decreases as the incremental steps progress.
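The policy compared above keeps the samples with the highest cumulative likelihood (as stated in the conclusion). A minimal sketch, in which the snapshot-summing score and the per-class top-k selection are our assumptions about the details:

```python
import numpy as np

def select_exemplars(probs_history, labels, per_class):
    """probs_history: (T, N) array holding each sample's softmax probability
    for its true class at T training snapshots; labels: (N,) class labels.
    Keep the top `per_class` samples of each class ranked by cumulative
    likelihood, i.e. the sum of true-class probabilities over snapshots."""
    score = probs_history.sum(axis=0)            # cumulative likelihood per sample
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        ranked = idx[np.argsort(-score[idx])]    # highest cumulative likelihood first
        keep.extend(ranked[:per_class].tolist())
    return sorted(keep)
```

Under this scoring, samples the model recognized consistently well across training carry the most information about the class distribution and are retained when the per-class budget shrinks.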

H. RESULTS ON TinyImageNet AND ImageNet-100
We further conduct experiments on two other visual recognition datasets, TinyImageNet and ImageNet-100, and summarize the results in Table 2. Our proposed method outperforms the other state of the arts by a large margin in all metrics except I_10, implying that our method improves forgetting at the expense of intransigence. We argue that our method finds a better trade-off between forgetting and intransigence, resulting in better accuracy.

1) TinyImageNet
Following the experiments with CIFAR-100, we evaluate the methods with the same metrics: average accuracy (A_avg), accuracy at the final (10th) task (A_10), forgetting (F_10), intransigence (I_10), and multi-head intransigence (I^mh_k) (refer to Table 2). At the final task, our proposed method outperforms the end-to-end method by a large margin, +5.60% in accuracy (A_10) and −11.92% in F_10, as shown in Figure 8-(a) and (d).
To isolate the benefit of the logit balancing, we use simple fine-tuning for L_p(·) and simple uniform random sampling. Figure 8-(b) and (e) show that our logit balancing improves accuracy marginally over the BF while outperforming the BF in forgetting at all task transitions. As a trade-off between forgetting and intransigence due to the stability-plasticity dilemma, the intransigence of our method is higher than that of the other methods (refer to Table 2). We hypothesize that our method gains accuracy from this better trade-off, which follows from less forgetting (F_10). In addition, the margin between our logit balancing and the BF (and the baseline) increases as the model learns more classes. At the final task, logit balancing outperforms the baseline by a huge margin (−18.85% in forgetting (F_10) and +4.7% in accuracy (A_10)) and the BF by a large margin (−7.76% in forgetting (F_10) and +0.09% in accuracy (A_10)). Figure 8-(c) and (f) show the effect of the sample memory policy using the same setting as Sec. III-G. We observe that our sample memory managing method outperforms the state of the arts by a large margin (+2.15% in accuracy (A_10) and −1.95% in F_10 at the last task). Though the improvement in accuracy by our sampling policy over the other methods is marginal at the early tasks, it increases later because the sampling strategy becomes more important as the number of samples per class decreases (see the related discussion in Sec. III-G).

2) ImageNet-100
For the experiments on the ImageNet-100 dataset, we use the same experimental comparisons and the same set of evaluation metrics as in the CIFAR-100 and TinyImageNet experiments. The only difference is that we report results in both Top-1 and Top-5 accuracy. In Figure 9-(a) and (d), our proposed method outperforms the End-to-End method by a large margin on Top-5 (+2.55% in accuracy (A_10) and −4.51% in F_10). Likewise, in Figure 10-(a) and (d), our proposed method outperforms the End-to-End method by a large margin on Top-1 (+3.20% in accuracy (A_10) and −9.92% in F_10).

Figure 9-(b) and (e) show that our logit balancing outperforms, or performs only marginally worse than, the BF. Figure 10-(b) and (e) show that our logit balancing outperforms the BF in accuracy and forgetting at all task transitions, and the margin increases as the model learns more classes. At the final task, logit balancing clearly outperforms the BF by a large margin (+2.44% in accuracy (A_10) and −7.71% in F_10 on Top-1). Despite the label noise of the ImageNet dataset, we argue that the Top-1 based F is a more precise measure, as Top-5 averages out the harm of some misclassifications.

Figure 9-(c), (f) and Figure 10-(c), (f) show the effect of the sample memory policy using the same setting described in Sec. III-G. In Figure 9-(c) and (f), we observe that random selection performs better than our sample memory managing method in Top-5 accuracy. Figure 10-(c) and (f) show that our sample memory management policy outperforms the state of the arts by a large margin on Top-1 (+3.77% in accuracy at the last task). At the same time, our policy performs on par with random selection in forgetting (especially with the Top-1 measure) despite the better accuracy. This is because forgetting is defined by the difference between the best accuracy among previous models on a task and the accuracy of the current model.
This implies that our sample memory managing method improves accuracy but does not significantly improve forgetting. However, our method shows the lowest standard deviation in forgetting compared to the random and herding selections, implying that it samples meaningful examples across various task configurations. In addition, we observe that our sample memory managing method performs marginally worse than the random selection baseline in Figure 9-(c) and (f) in Top-5 accuracy. We conjecture that this is because our method is based on the cumulative likelihood, which relies on Top-1 accuracy. Together with the label noise, our proposed sampling score may not show the best performance due to the discrepancy between the Top-1 and Top-5 evaluation strategies.
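The forgetting measure used throughout these comparisons follows the definition stated above: for each past task, the drop from the best accuracy any earlier model achieved on it to the current model's accuracy. A minimal sketch of this definition (the convention of averaging the per-task drops is our assumption):

```python
def forgetting(acc_matrix):
    """acc_matrix[i][j]: accuracy on task j after training through task i
    (defined for j <= i). Returns F at the final step: the average, over
    all past tasks, of the drop from the best earlier accuracy on that
    task to the final model's accuracy on it."""
    T = len(acc_matrix)
    drops = []
    for j in range(T - 1):                      # every task except the last
        best_prev = max(acc_matrix[i][j] for i in range(j, T - 1))
        drops.append(best_prev - acc_matrix[T - 1][j])
    return sum(drops) / len(drops)
```

A lower F_10 therefore means the final model stays closer to the best accuracy it ever had on each past task, which is why a sampling policy can raise accuracy without moving F much: it lifts both the best earlier accuracy and the final accuracy together.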

IV. CONCLUSION
We present a novel class-incremental learning method for the setup with a sample memory. The proposal consists of two parts: logit balancing and a sample memory managing policy. We observed an unintentional imbalance of confidence between the classes of past tasks and the new task; logit balancing alleviates this problem in class incremental learning by adding a learning objective for confidence calibration. Since most sample memory managing policies are only marginally better than uniform random sampling, we used the cumulative likelihood score of samples as the selection criterion, because it represents knowledge of the class distribution. In experimental validations on multiple datasets including CIFAR100, TinyImageNet, and ImageNet-100, our approach outperforms the state of the arts, such as End-to-End [9] and our extension of EWC [8] (denoted EWC+SM), by large margins in forgetting.