Class-Incremental Learning of Convolutional Neural Networks Based on Double Consolidation Mechanism

Class-incremental learning is a model learning technique that helps classification models incrementally learn new target classes and accumulate knowledge. It has become one of the major concerns of the machine learning and classification community. To overcome the catastrophic forgetting that occurs when a network is trained sequentially on a multi-class data stream, a double consolidation class-incremental learning (DCCIL) method is proposed. In the incremental learning process, the network parameters are adjusted by combining knowledge distillation and elastic weight consolidation, so that the network can better maintain its recognition ability on the old classes while learning the new ones. Incremental learning experiments are designed, and the proposed method is compared with popular incremental learning methods such as EWC, LwF, and iCaRL. Experimental results show that the proposed DCCIL method achieves better incremental accuracy than the current popular incremental learning algorithms and can effectively improve the expansibility and intelligence of the classification model.


I. INTRODUCTION
Incremental learning [1], also known as lifelong learning [2] or continual learning, aims to enable learning models to keep learning, just as human beings do. The way the human brain learns is incremental: for example, if a child gets to know the tiger and the lion in the zoo, he can retain this knowledge and later learn new species such as the dolphin and the seal. However, most machine learning systems can only learn once from the available training data, and the resulting models are often task-oriented and isolated [3]. In contrast, incremental learning is a dynamic model learning technique that helps models continuously acquire new knowledge and accumulate it over time. To achieve this, incremental learning has to overcome catastrophic forgetting, that is, it must learn new knowledge without forgetting the old.
According to the degree of difficulty, incremental learning can be divided into three categories [4]: task-incremental learning (Task-IL), domain-incremental learning (Domain-IL), and class-incremental learning (Class-IL). Among them, Class-IL is the closest to human learning behavior and the most challenging. Because the number of output classes is not known in advance, researchers have to develop methods that model new classes while preserving pre-trained performance [5]. Rebuffi et al. [6] gave a clear statement of class-incremental learning. Formally, an algorithm must have the following three attributes to qualify as class-incremental: (1) it can learn from a data stream in which the sample data of different classes occur at different times; (2) it can provide competitive multi-class classification accuracy for the classes observed so far; (3) its computational requirements and memory footprint should remain bounded, or grow very slowly, relative to the number of classes observed so far.
The research on incremental learning is in the ascendant. In 2000, Christophe et al. [7] proposed the notion of incremental learning, and in 2001, Polikar et al. [8] standardized the definition of incremental learning, which was widely accepted. In 2007, Shen and Osamu [9] proposed the self-organizing incremental neural network (SOINN), which is based on competitive learning and has been applied to unsupervised learning tasks. In 2009 and 2011, two incremental learning algorithms, Learn++.NSE [10] and Learn++.NC [11], were proposed, further expanding the application scope of incremental learning in non-stationary environments. For supervised learning, the broad learning approach proposed by Chen and Liu [12] can update the network with only a small amount of computation; however, this algorithm is only suitable for shallow neural networks. In 2017, the DeepMind team proposed the EWC method [13] and realized continuous learning of multiple tasks for deep neural network models. By modifying the learning rule, the model can learn new tasks without forgetting the information of old tasks. In the same year, Li and Hoiem [14] proposed another approach, learning without forgetting (LwF), which can be seen as a combination of distillation networks [15] and fine-tuning [16]. Although the EWC and LwF methods perform well in task-incremental learning, they are not effective in the Class-IL scenario. The iCaRL method proposed by Rebuffi et al. [6] improves on LwF and is superior to most current incremental learning methods. However, because historical data are needed to train the new network, the iCaRL approach is not consistent with the way humans learn.
In this paper, we propose a novel Class-IL method, DCCIL (double consolidation class-incremental learning), that allows convolutional neural networks to learn in the following way: only new class data are used to train the network, while its original capabilities are preserved. The DCCIL method draws on the idea of knowledge distillation in LwF and the idea of weight consolidation in EWC, and combines them with the nearest-mean-of-exemplars classifier used in iCaRL to achieve better performance in the Class-IL scenario. Under the hood, DCCIL uses a convolutional neural network (CNN). In principle, the DCCIL strategy is largely architecture-agnostic and could be used on top of other feature or metric learning strategies. With the proposed algorithm, training is easy and fast, and the results are very promising.
The rest of this paper is organized as follows: Section II introduces the related works of incremental learning; in Section III, we explain the details of DCCIL method; in Section IV, experimental results on sonar image dataset are presented and show that DCCIL can class-incrementally learn over a series of target recognition tasks; finally, concluding remarks are provided in Section V.

II. RELATED WORK
In the domain of incremental learning, a key issue is to avoid catastrophic forgetting [17]. For example, AlphaGo [18], trained on the game of Go, can only play Go and knows nothing about chess. If AlphaGo were trained to learn chess, it would be likely to forget its knowledge of Go, which we do not want to happen. Specific to neural networks, as shown in Fig. 1, the classification model continuously learns two different tasks: Dataset1 is the old task data and Dataset2 is the new task data. The learning process of a neural network is essentially an optimization process over its parameters (weights and biases), and tasks are solved by constantly adjusting the network parameters. When the neural network is trained with Dataset2, the weights learned from the old task are usually overwritten to adapt to the new task, resulting in forgetting the information related to the old task. Therefore, after learning new tasks, even neural networks with the most advanced model structures are likely to degrade or fail completely on the old tasks.
To overcome catastrophic forgetting, researchers have come up with several solutions, the most famous of which include EWC and replay methods using knowledge distillation. This section will give a detailed introduction to EWC and knowledge distillation, and based on this, propose our DCCIL method.

A. ELASTIC WEIGHT CONSOLIDATION
In 2017, Kirkpatrick et al. proposed the elastic weight consolidation (EWC) algorithm. The EWC method can help a model gradually learn new tasks without forgetting the old ones; its working principle is explained in detail below.
Inspired by the synaptic consolidation mechanism of the human brain, they found a way to tackle catastrophic forgetting. An artificial neural network is made up of connections in a way that is somewhat similar to the connections between neurons in the brain. After the network learns a task, we can calculate how important each connection weight is to that task. When the network learns a new task, each connection weight can be protected to a different degree depending on how important the connection is to the old task. This allows the network to learn new tasks without incurring significant computing costs while retaining what it has learned from previous tasks. We can think of the connections between nodes as springs. The stiffness of a spring is proportional to the importance of the connection; a stiffer spring makes the connection weight harder to change. For this reason, the algorithm is called ''elastic weight consolidation.'' How to judge the importance of parameters and how to constrain them is the key to implementing the algorithm. In EWC, the Fisher Information Matrix (FIM) [19] is used to determine the importance of parameters for previous tasks. After the network completes the training of a task, the Fisher information values of the network parameters can be calculated. Parameters with low Fisher values are considered redundant and can be modified to store new task information. The FIM is calculated as
F(θ) = E[s(θ) s(θ)^T],
where s(θ) is a score function defined as the gradient of the log-likelihood, s(θ) = ∇_θ log p(x; θ), used to measure the influence of a parameter on the probability distribution. We do not know the exact form of the likelihood function p(x; θ), but we can use the empirical distribution to estimate the FIM.
Given a data set {x^(1), …, x^(N)}, the FIM can be approximated as
F̂(θ) = (1/N) Σ_{i=1}^{N} s(θ; x^(i)) s(θ; x^(i))^T,
where s(θ; x^(i)) = ∇_θ log p(x^(i); θ). The diagonal values of the FIM reflect the reliability of the corresponding parameters: the larger the value, the more reliable the estimate and the more information it carries about the data distribution.
Again, take a neural network that continuously learns two tasks as an example, and assume that the data sets of task T_A and task T_B are D_A and D_B, respectively. We can use the FIM to measure how much information about D_A is carried by each parameter in the network. To constrain the network parameters, the loss function can be set as follows when the network is trained on task T_B:
L(θ) = L_B(θ) + (λ/2) Σ_{j=1}^{Q} F_{A,j} (θ_j − θ*_{A,j})²,
where L_B(θ) is the cross-entropy loss of task T_B, F_{A,j} is the jth diagonal element of the FIM computed on task T_A, θ*_A are the parameters learned on task T_A, λ is a hyper-parameter balancing the importance of the two tasks, and Q is the total number of network parameters.
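To make the mechanism above concrete, the following PyTorch sketch estimates the diagonal of the empirical FIM and evaluates the EWC penalty. The function names and the dict-based bookkeeping are our own illustrative choices, not code from the EWC paper.

```python
import torch

def fisher_diagonal(model, samples):
    """Estimate the diagonal of the empirical FIM: for each parameter,
    average the squared gradient of the log-likelihood over the data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in samples:                       # one (input, label) pair at a time
        model.zero_grad()
        log_probs = torch.log_softmax(model(x.unsqueeze(0)), dim=1)
        log_probs[0, y].backward()             # gradient of log p(y | x; theta)
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(samples) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """(lam/2) * sum_j F_j * (theta_j - theta*_j)^2 over all parameters."""
    penalty = torch.zeros(())
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
```

After training on task T_A, `fisher` and a copy of the parameters would be stored; during training on T_B, `ewc_penalty` is added to the cross-entropy loss of T_B.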

B. KNOWLEDGE DISTILLATION
Knowledge distillation was not originally proposed for incremental learning, but it has been widely used in this domain. It is a simple and effective method for model compression and model training, proposed by Hinton et al. [15] in 2015. In deep learning, over-parameterized deep neural networks are often constructed to achieve better prediction. Such networks have very strong learning ability but require large amounts of memory and computing resources, which is very unfavorable for deployment. At the same time, it is empirically difficult to train a small model from scratch to the performance of a large model. Fortunately, with knowledge distillation, knowledge can be migrated from a large network or an ensemble of networks to a smaller network for more efficient deployment.
The small model to be trained is called the student network, and the large model that has already been trained is called the teacher network. Our goal is to make the output of the student network sufficiently close to that of the teacher network, but doing this directly raises a problem: the Softmax output of the teacher network is similar to a one-hot vector, with one value large and the others small. In this case, the output of the teacher network provides very limited supervision information. Compared with this almost one-hot ''hard'' output, we hope the output can be ''softer,'' that is, the probability distribution can be more moderate. Knowledge distillation provides an effective method. Consider a generalized Softmax function:
q_i = exp(z_i / T) / Σ_j exp(z_j / T),
where z_i is the value of the model's ith output node before the Softmax function is applied, and T is the temperature parameter. It is easy to show that q_i converges to a one-hot vector as the temperature T approaches 0 and becomes softer as T gets higher. The knowledge distillation method assists the training of the student network by introducing a soft-target term related to the teacher network into the total loss, thereby realizing knowledge transfer. The whole process of knowledge distillation is shown in Fig. 2: the predicted output of the teacher network on the left is transformed by the generalized Softmax to obtain a soft probability distribution, whose values lie between 0 and 1 and are relatively moderate. The hard-target is the true label of the input sample, which can be represented by a one-hot vector. The total loss is the weighted average of the cross-entropy losses corresponding to the soft-target and the hard-target. The larger the weight of the soft-target, the more the training of the student network depends on the teacher network.
When training the student network, a high T is used to make the Softmax distribution soft enough; after training, the temperature is set to T = 1 for inference.
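As a brief illustration (the function names, and the T and alpha values, are our own choices), the generalized Softmax and the weighted soft/hard loss can be written in PyTorch as:

```python
import torch
import torch.nn.functional as F

def softmax_t(logits, T):
    """Generalized Softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    return F.softmax(logits / T, dim=-1)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted combination of the soft-target loss (KL divergence to the
    teacher's tempered distribution) and the hard-target cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    softmax_t(teacher_logits, T),
                    reduction="batchmean") * (T * T)  # T^2 keeps gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```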

III. DOUBLE CONSOLIDATION CLASS-INCREMENTAL LEARNING
For the memory function of the human brain, neuroscientists have thus far given two explanations: systems consolidation [20] and synaptic consolidation [21]. Systems consolidation refers to the process of imprinting memory from a fast-learning part of our brain into a slow-learning part, through conscious or unconscious recall, which may occur while dreaming. The second mechanism, synaptic consolidation, means that if some synaptic connections in the cerebral cortex are important for previous learning tasks, the brain reduces the plasticity of these synapses and keeps them stable, so as to achieve continuous learning.
In our view, the process of knowledge distillation is a transfer of knowledge from one network structure to another and is similar to the brain's systems consolidation mechanism, in which knowledge acquired in the fast-learning part is imprinted to the slow-learning part. And the idea of the EWC method, as described in subsection II.A, is analogous to the brain's synaptic consolidation mechanism.
The process of memorizing knowledge in the human brain is obviously much more complicated. Memory capacity might well be impaired if the brain relied solely on systems consolidation or synaptic consolidation. With this in mind, we propose an incremental learning method combining the systems consolidation mechanism and the synaptic consolidation mechanism: in the network learning process, knowledge distillation and weight consolidation are used at the same time, so that the model can better retain the knowledge learned from the old target classes while learning the new classes. For this reason, we refer to the proposed approach as ''double consolidation class-incremental learning'' (DCCIL).

A. OVERVIEW OF THE DCCIL METHOD
In this subsection, we elaborate on the overall implementation of the DCCIL method. As shown in Fig. 3, DCCIL has two key points: during the training stage, the network updates its parameters through the double consolidation mechanism and learns the features of the continuous multi-class targets in the data stream; during the test stage, the DCCIL method uses the trained convolutional neural network module as a feature extractor, ϕ: χ → R^d, followed by a nearest-mean-of-exemplars (NME) classifier [6], [22] for target classification (see subsection III.C for a detailed explanation).
During the training stage, the DCCIL method updates the model parameters and the exemplar sets using Algorithm 1 whenever a new class of data needs to be learned. In this process, the update of the model parameters is the core of class-incremental learning; see subsection III.B for the specific parameter update strategy. Besides, the exemplar sets need to be updated as new classes are added: the old class exemplar sets are reduced by discarding exemplars at random so that only m exemplars are kept per class, and the exemplar set of each new class is constructed by randomly selecting m samples from its training data.
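The exemplar bookkeeping described above amounts to random sampling under a fixed memory budget. A minimal sketch follows; the helper names mirror the ReduceExemplar/SelectExemplar routines of Algorithm 1, but the code itself is our own illustration.

```python
import random

def reduce_exemplar(exemplars, m):
    """Keep only m exemplars of an old class, discarding the rest at random."""
    return random.sample(exemplars, min(m, len(exemplars)))

def select_exemplar(train_images, m):
    """Randomly select m exemplars from a new class's training data."""
    return random.sample(train_images, min(m, len(train_images)))

def update_exemplar_sets(P, new_class_data, M):
    """P: per-class exemplar lists of the old classes; new_class_data:
    per-class training images of the new classes; M: total memory budget."""
    t = len(P) + len(new_class_data)   # classes observed so far
    m = M // t                         # exemplars per class
    old = [reduce_exemplar(P_y, m) for P_y in P]
    new = [select_exemplar(X_y, m) for X_y in new_class_data]
    return old + new
```

With M = 90 and 9 classes observed, each class keeps m = 10 exemplars, matching the budget used in the experiments.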

Algorithm 1 Network Training Based on DCCIL Method
Start with:
  Θ // current model parameters
  P = (P_1, …, P_{s−1}) // current exemplar sets
  FIM = (F_1, …, F_{k−1}) // the FIMs of the first k−1 tasks
Input:
  X^s, …, X^t // training images of new classes s, …, t
  M // the total size of the exemplar sets
Θ ← UpdateRepresentation(X^s, …, X^t; FIM, Θ)
FIM ← (F_1, …, F_k) // record the FIM of the kth task
m ← M / t // number of exemplars per class
for y = 1, …, s − 1 do
  P_y ← ReduceExemplar(P_y, m) // for old classes
end
for y = s, …, t do
  P_y ← SelectExemplar(X^y, m) // for new classes
end
P ← (P_1, …, P_t) // new exemplar sets

The classification part of the model differs between the training stage and the testing stage. For training, the learnable convolution module is followed by a fully-connected layer with as many sigmoid output nodes as classes observed so far. When there is a new class to learn, the number of nodes in the fully-connected layer increases accordingly. The feature vectors calculated by the convolution module are L2-normalized before being fed into the fully-connected layer. The network parameters are grouped into a fixed number of parameters for feature extraction and a variable number of weight vectors. We denote the latter by w_1, …, w_t ∈ R^d, where t is the number of classes observed so far. For any class y ∈ {1, …, t}, the output of the fully-connected layer is
g_y(x) = 1 / (1 + exp(−a_y(x))),
where a_y(x) = w_y^T ϕ(x) and x is an example image. Although these output values can be interpreted as the probability that the target belongs to a certain class, in the DCCIL method the fully-connected layer is only used to calculate the loss at the training stage to guide the parameter update, while the final classification relies on the NME classifier.
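The growth of the fully-connected layer and the sigmoid outputs g_y(x) can be sketched in PyTorch as follows; `expand_output_layer` is an illustrative helper of ours, not code from the paper, and we omit the bias to match a_y(x) = w_y^T ϕ(x).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def expand_output_layer(fc, n_new):
    """Add n_new output nodes to the fully-connected layer, copying the
    weight vectors w_1..w_t of the classes observed so far."""
    t, d = fc.weight.shape
    new_fc = nn.Linear(d, t + n_new, bias=False)  # a_y(x) = w_y^T phi(x)
    with torch.no_grad():
        new_fc.weight[:t] = fc.weight
    return new_fc

def class_outputs(fc, features):
    """g_y(x) = sigmoid(a_y(x)); the features phi(x) are L2-normalized
    before entering the fully-connected layer."""
    phi = F.normalize(features, p=2, dim=-1)
    return torch.sigmoid(fc(phi))
```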

B. NETWORK PARAMETER UPDATE
When DCCIL obtains the data X^s, …, X^t of the new classes s, …, t, Algorithm 2 describes the routine that updates the network parameters step by step. First, the network training set is formed from the images of the new classes. Second, each training image is fed into the pre-update network, and the output vector q^i of the network is recorded. Finally, the network is trained on this set, and the parameters are updated by minimizing the constructed loss function. In the DCCIL method, the total loss for the network to be updated consists of 3 items, namely the classification loss, the distillation loss, and the weight consolidation item. The specific form is as follows:
L(Θ) = −Σ_{(x_i, y_i)} [ Σ_{y=s}^{t} ( δ_{y=y_i} log g_y(x_i) + δ_{y≠y_i} log(1 − g_y(x_i)) ) + Σ_{y=1}^{s−1} ( q_y^i log g_y(x_i) + (1 − q_y^i) log(1 − g_y(x_i)) ) ] + (λ/2) Σ_{j=1}^{N_params} F̄_j^{(k−1)} (θ_j − θ*_{j}^{(k−1)})²,
where δ is an indicator function defined as δ{true} = 1 and δ{false} = 0, g_y(x_i) is the output value of the network as defined in (6), N_params is the total number of parameters in the network, F̄_j^{(k−1)} is the summation of the jth diagonal elements of the Fisher information matrices of the first k − 1 learning tasks, and θ*_{j}^{(k−1)} is the value of the jth network parameter after training on task k − 1 is finished. The specific calculation formula of F̄_j^{(k−1)} is
F̄_j^{(k−1)} = Σ_{i=1}^{k−1} γ^{k−1−i} F_j^{(i)},
where γ is a hyper-parameter used to adjust the influence of previous tasks on weight consolidation; it is set to 1 in this paper.

Algorithm 2 DCCIL UpdateRepresentation
Input: X^s, …, X^t // training images of new classes s, …, t
Require: Θ // current model parameters
// form the training set from the new-class images only
D ← ∪_{y=s,…,t} {(x, y) : x ∈ X^y}
// record the outputs of the pre-update network for the old classes
for all (x_i, ·) in D do
  q_y^i ← g_y(x_i) for y = 1, …, s − 1
end
// run network training, minimizing the total loss L(Θ)
Θ ← argmin_Θ L(Θ)
The loss function L(Θ) we constructed encourages the network to output the correct class label on the new-class output nodes (reflected in the classification loss), to output the recorded soft-targets on the old-class output nodes (reflected in the distillation loss), and to keep stable those weights that are important to the previous tasks during the parameter update (reflected in the weight consolidation item).
The feature representation update process described in Algorithm 2 looks complex, but its implementation is similar to network fine-tuning: the loss function is minimized on a new data set with the previously learned network parameters as the initial values. Therefore, DCCIL can be implemented with standard end-to-end learning methods, such as the back-propagation algorithm, and the latest network optimization techniques, such as Dropout [23] and batch normalization [24], can also be applied to network training. Unlike network fine-tuning, however, the DCCIL approach changes the loss function significantly to overcome catastrophic forgetting. In addition to the standard classification loss item, the DCCIL method has a distillation loss item and a weight consolidation item, which encourage the network to improve the classification accuracy on the new classes without losing the discriminative information learned from previous tasks.
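The three-part loss can be assembled in PyTorch roughly as below. This is a sketch under our own conventions: classes 0..n_old−1 are the old classes, the sigmoid outputs and one-hot targets are precomputed, and fisher_sum/old_params hold the accumulated Fisher diagonals and the post-task-(k−1) parameters.

```python
import torch
import torch.nn.functional as F

def dccil_loss(g, targets, q_old, n_old, model, fisher_sum, old_params, lam=1.0):
    """Classification + distillation + weight consolidation.

    g:        sigmoid outputs of the network, shape (batch, t)
    targets:  one-hot labels over all t classes, shape (batch, t)
    q_old:    recorded pre-update outputs on the old nodes, shape (batch, n_old)
    """
    # classification loss on the new-class output nodes
    cls = F.binary_cross_entropy(g[:, n_old:], targets[:, n_old:])
    # distillation loss: match the recorded soft targets on the old nodes
    dist = F.binary_cross_entropy(g[:, :n_old], q_old)
    # weight consolidation item protecting parameters of previous tasks
    cons = torch.zeros(())
    for n, p in model.named_parameters():
        cons = cons + (fisher_sum[n] * (p - old_params[n]) ** 2).sum()
    return cls + dist + 0.5 * lam * cons
```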

C. NEAREST-MEAN-OF-EXEMPLARS CLASSIFIER
In this paper, the nearest-mean-of-exemplars classifier is used for the final target classification. To predict the class y* of a new image x, the prototype vectors µ_1, …, µ_t of all classes observed so far are calculated. For class y, the prototype vector is
µ_y = (1/|P_y|) Σ_{p∈P_y} ϕ(p),
where |P_y| is the number of exemplars of class y and p is a sample in the exemplar set P_y, so µ_y is the average feature vector of all exemplars of class y. For an image x to be inferred, its feature vector only needs to be computed by the convolution module and compared with the prototype vector of each class; the image is then assigned to the class of the nearest prototype vector:
y* = argmin_{y=1,…,t} ‖ϕ(x) − µ_y‖.
The reason we use exemplar features as the basis of classification comes from considering the human visual recognition process. Although current research on the working mechanism of the human brain is limited [25], the brain can be regarded as a very accurate feature extractor and a superb painter. For example, when we hear the word ''cup,'' various types of cups come to mind. It is possible that our brain redraws the cups using previously extracted and stored features about ''cup.'' These constructed cups serve as the exemplars of ''cup'' that the brain stores. In other words, the brain stores not what the object looks like, but the features of the object. When we ''see'' an unfamiliar ''cup,'' the brain quickly extracts its features and, by analogy with the features of various objects previously stored, classifies the ''cup'' into the category of objects most similar to it, thus recognizing it as a water cup.
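The two formulas above translate directly into code. A minimal sketch, assuming the feature extractor is a callable mapping one image tensor to a feature vector (the function names are ours):

```python
import torch

def class_prototypes(feature_extractor, exemplar_sets):
    """mu_y: mean feature vector of the stored exemplars of each class y."""
    protos = []
    with torch.no_grad():
        for P_y in exemplar_sets:
            feats = torch.stack([feature_extractor(p) for p in P_y])
            protos.append(feats.mean(dim=0))
    return torch.stack(protos)

def nme_classify(feature_extractor, x, prototypes):
    """y* = argmin_y || phi(x) - mu_y ||: the nearest prototype wins."""
    with torch.no_grad():
        dists = torch.norm(prototypes - feature_extractor(x), dim=1)
    return int(dists.argmin())
```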

IV. EXPERIMENTS AND RESULTS
In this section, we design experiments to verify the effectiveness of the proposed DCCIL method in the Class-IL scenario and compare its classification accuracy with that of other methods. The PC used for the experiments runs the CentOS 7 operating system with an Intel Core i7-9800X @ 3.80 GHz processor, an RTX 2080Ti GPU, and 64 GB of memory. PyTorch is used to implement all the algorithms in this section.

A. TASK PROTOCOL
Up to now, there is no agreed benchmark protocol for evaluating class-incremental learning methods. Therefore, we use the following evaluation procedure: for a given multi-class target data set, the classes are first arranged in a random order and fixed. Then, each method is trained in a class-incremental manner on the available training data. After each batch of classes, the classification ability of the model is tested with the test data of the classes that have already been trained, and the evaluation criterion is the classification accuracy. The convolutional neural network is trained on the Echoscope sonar image data set [26], which contains sonar images of 9 classes of real underwater targets, including a cornerstone, a diver, a remotely operated vehicle, a shipwreck, an aircraft, and a tank wreck; the number of images for each target class is shown in Table 1. Taking the task protocol shown in Fig. 4 as an example, we divide the original Echoscope sonar image data set into 3 tasks, each of which contains 3 target classes; the order of the tasks and classes can be set freely.

FIGURE 4. Schematic of the task protocol. An incremental learning algorithm learns these 3 tasks successively, and the goal is to achieve good classification accuracy not only on each task but also on all 9 classes after all tasks are learned.

For each class, 150 sonar images are randomly selected from the data set to form the training set, and all pseudo-color images are converted to gray-scale and resized to 227 × 227 before being fed to the network.
For a fair comparison, the same convolutional neural network architecture, shown in Table 2, is used for all methods. In the table, k is the size of the convolution or pooling kernel, s is the stride, d is the number of convolution kernels or nodes, p is the padding parameter, and δ is the dropout rate. The number of nodes in the output layer is adjusted dynamically as target classes are added, and the maximum number of output nodes is 9 in this experiment.

B. PERFORMANCE ANALYSIS OF DCCIL METHOD
In Section III, we described the proposed DCCIL method in detail. To verify its effectiveness, experiments are carried out according to the protocol designed above. The standard back-propagation algorithm is used to train the network, and the hyper-parameters are set as follows: the number of training iterations per task is 1000, the basic learning rate is 0.001, the weight decay parameter is 0.001, and the batch size is 45. The maximum number of exemplars is M = 90; that is, when there are at most 9 classes of targets to be classified, each class can store up to 10 exemplars, which occupy very little memory.
To objectively evaluate the incremental learning performance of the DCCIL method, we tested two other learning schemes with the same model structure, namely joint learning and fine-tuning. Neither of them conforms to the specification of incremental learning, but their classification results can serve as baselines for incremental learning methods.
Fine-tuning learns the tasks one after another in the given order without taking any measures to prevent catastrophic forgetting. We can also think of the fine-tuning method as learning a new multi-class classifier by fine-tuning the parameters of the previously learned network. Since fine-tuning does nothing to avoid the catastrophic forgetting that occurs during incremental learning, we can consider its experimental results as a lower bound.
In contrast, joint learning can be seen as an upper bound for incremental learning. When learning each task, the joint learning method always uses the data of all the tasks so far to train the model. Since all the historical data are involved in each task, it is equivalent to relearning the historical tasks while learning the new one. Therefore, the joint learning method can achieve good performance on each task. Its main drawback is that, as the number of learning tasks increases, the computation and memory consumption of network training become more and more onerous. Moreover, when historical data are not available, joint learning cannot be carried out. The experimental results are shown in Fig. 5. The value of each point in the figure is the recognition accuracy across all classes observed up to a certain time point. We can observe that the accuracy of the DCCIL method is significantly better than that of the fine-tuning method, while the performance gap with the joint learning method is small, with a recognition accuracy of 87.44% across the 9 classes. As expected, the overall recognition accuracy of the joint learning method is the best, up to 89.90%, because it uses the training data of all tasks. Besides, because the fine-tuning method does nothing to prevent catastrophic forgetting, its accuracy across the 9 classes is only 33.08%, which means that the knowledge learned from the first two tasks (containing 6 classes) is forgotten and only the last 3 classes can be effectively classified. In Fig. 6, we show the recognition accuracy on each task after the network has learned all 3 tasks. We can observe that for the first two tasks, the classification results of DCCIL are much better than those of the fine-tuning method and close to the accuracies of the joint learning method.
For the DCCIL method, the accuracy on the second task is lower than that on the first task because the second task is more difficult, which also lowers the accuracy of the joint learning method. For the third task, the classification results of DCCIL are similar to those of the fine-tuning and joint learning methods. As it is the most recent learning task, the recognition accuracies of the 3 methods are all relatively high.
In addition to the recognition accuracy, we also compared DCCIL, fine-tuning and joint learning methods from several other aspects. Table 3 shows the advantages and disadvantages of the 3 methods in various aspects.
Based on the experimental results and the information in Table 3, it can be seen that the DCCIL method proposed in this paper is well suited to Class-IL scenarios. When we want to add recognition classes to an existing network model, the recognition accuracy of the DCCIL approach is close to that of joint learning. Moreover, DCCIL does not require historical data to participate in training, so the network update process is efficient.
The DCCIL method uses the NME classification strategy in the classification part, which requires each class to store a certain number of exemplars. To minimize the memory footprint, the exemplar set of each class contained at most 10 sonar images in the experiment above. Here, to figure out the impact of the memory budget M on incremental learning performance, we change the size of the exemplar sets, increasing the total number of exemplars from 45 to 450. With the learning model and the other hyper-parameters unchanged, the experimental results are shown in Fig. 7. The value of each point in the figure is the recognition accuracy across all 9 classes. As can be seen from Fig. 7, with more exemplars, the recognition accuracy of the DCCIL method gradually improves. The accuracy of the network is 86.97% with only 5 exemplars per class and 88.63% with 50 exemplars per class. This is because the more exemplars each class has, the more accurately its feature mean can be estimated, which benefits image inference. If the recognition system allows more exemplars to be stored, the target recognition accuracy of the DCCIL method is expected to improve further.

C. METHOD COMPARISON
We carry out extensive experiments to compare the class-incremental learning performance of the proposed DCCIL method with that of several popular incremental learning methods, which are briefly introduced below.
(1) Elastic weight consolidation (EWC): a weight consolidation term is added to the loss function to reduce changes to the network parameters that are important for old tasks. See subsection II.A for a detailed introduction.
(2) Learning without forgetting (LwF): a loss item about replayed data is added to the loss function of the current task: L_total = L_current + L_replay + R. There are two main differences between the DCCIL method and the LwF method in terms of the parameter update. First, after learning an old task, the DCCIL method needs to calculate the Fisher information matrix to measure the importance of each parameter to that task. Second, the loss function of the DCCIL method differs from that of LwF: the third item of LwF is a regularization item, while that of DCCIL is a weight consolidation item. The implementation of the LwF method in this experiment follows the paper [14].
(3) Incremental classifier and representation learning (iCaRL): the characteristic of this method is that, during model updates, some images from the old tasks are selected and mixed with the new task data for network training, so as to retain the discriminative information of historical classes. The implementation of the iCaRL method in this experiment follows [6], and the number of exemplars is also set to 90. Both the proposed DCCIL method and the iCaRL method require exemplars to support classification, but there are three obvious differences: 1) the DCCIL method does not require exemplars to participate in network training; 2) in the classification stage, the exemplars used by the DCCIL method are randomly selected, while those in the iCaRL method are selected by a clustering algorithm based on herding [27]; 3) the DCCIL method draws on the memory mechanism of the human brain and restricts the updating of network parameters through both weight consolidation and knowledge distillation, whereas the iCaRL method only uses knowledge distillation. Therefore, compared with the iCaRL approach, the DCCIL approach has three advantages in terms of implementation: 1) historical data does not participate in training on the new task, so network training is more efficient; 2) the tedious process of exemplar selection and update is avoided; 3) catastrophic forgetting is addressed from the perspective of the loss function, so the algorithm is easy to generalize and has better application prospects.
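The double consolidation idea contrasted above, combining a distillation term (as in LwF/iCaRL) with an EWC-style Fisher-weighted penalty, can be sketched as a single training objective. The following is an illustrative NumPy sketch under assumed names (`dccil_loss`, `lam`, `T`, etc.), not the paper's exact formulation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled, numerically stable softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dccil_loss(new_logits, labels, teacher_logits, student_logits,
               params, old_params, fisher, lam=100.0, T=2.0):
    """Sketch of a double-consolidation objective:
    cross-entropy on the new classes + knowledge distillation on the
    old-class outputs + an EWC-style weight consolidation penalty."""
    # 1) standard classification loss on the new task
    p_new = softmax(new_logits)
    ce = -np.mean(np.log(p_new[np.arange(len(labels)), labels] + 1e-12))
    # 2) distillation: match the old model's softened outputs on old classes
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kd = -np.mean(np.sum(pt * np.log(ps + 1e-12), axis=1)) * T * T
    # 3) consolidation: penalise moving parameters that the (diagonal)
    #    Fisher information marks as important for previous tasks
    ewc = sum(np.sum(fisher[k] * (params[k] - old_params[k]) ** 2)
              for k in params)
    return ce + kd + lam * ewc
```

In this sketch the `fisher` dictionary would hold the diagonal Fisher information estimated after the previous task (e.g., as averaged squared log-likelihood gradients), so that only parameters important to old tasks are strongly anchored.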
All methods use the same convolutional neural network structure designed in subsection IV.A. Unlike DCCIL and iCaRL, the last layer of both EWC and LwF is a Softmax output layer. The main hyperparameters of each method are set in the same way: the Adam algorithm is used for parameter optimization with β1 = 0.9 and β2 = 0.999, and each model is trained for 1000 iterations with an initial learning rate of 0.001 and a batch size of 45. The experimental results are shown in Fig. 8, where the value of each point is the recognition accuracy across all classes observed up to that point in time.
As can be seen from Fig. 8, as the number of tasks increases, the recognition accuracy of each method declines on the whole. Although each approach uses a different strategy for dealing with catastrophic forgetting, historical knowledge is more or less discarded when the network learns new tasks. Compared with the other methods, the accuracy curve of DCCIL declines more slowly, indicating that the DCCIL method has a stronger ability to retain historical knowledge. Because the EWC and LwF methods are not well suited to class-incremental learning, their multi-class recognition accuracies start to drop sharply from the second task. The iCaRL approach performs well because it uses a small amount of historical data when learning new tasks, which to some extent violates the principle of not using historical data during incremental learning. After completing all three tasks, we compare the class-incremental learning ability of each method by the overall recognition accuracy; the results are shown in Table 4. As can be seen from Table 4, EWC and LwF have poor learning ability in the Class-IL scenario, while iCaRL and the proposed DCCIL method both achieve good incremental recognition accuracy. Specifically, the LwF method, which uses knowledge distillation, achieves higher recognition accuracy than the EWC method, which relies solely on weight consolidation. Among the four methods, the accuracy of the DCCIL method is the best, since DCCIL uses both knowledge distillation and weight consolidation to train the network, thereby better preserving the knowledge learned from historical tasks while learning new ones. In addition, the NME classifier used to classify targets in the prediction stage is robust to changes in the feature representation.

V. CONCLUSION
To overcome the catastrophic forgetting that often occurs in neural networks during continuous learning, this paper proposes a new class-incremental learning method named DCCIL. The DCCIL method draws on the memory mechanism of the human brain: it uses knowledge distillation and weight consolidation strategies simultaneously during the network training stage, and adopts the nearest-mean-of-exemplars classifier to recognize targets in the prediction stage. Experimental results on the sonar image data set show that the average recognition accuracy of the DCCIL method reaches 87.44%, which is close to the recognition accuracy of the non-incremental joint learning method and better than popular incremental learning methods such as iCaRL, LwF, and EWC. Besides, as the number of target classes increases, the performance of DCCIL decreases slowly, indicating that the method has a strong ability to preserve the knowledge of historical tasks. Although the experiments are carried out on a sonar image data set, the DCCIL method is general and can be applied to Class-IL scenarios with optical images or medical images.