
Opportunistic Dynamic Architecture for Class-Incremental Learning



Abstract:

Continual learning has attracted increasing attention over the last few years, as it enables models to continually learn new tasks over time, which has significant implications for many real-world applications. A large number of continual learning techniques have been proposed and achieve promising performance; however, many of them commit to a fixed, large architecture at the beginning, which can waste memory and incur high training cost. To directly tackle this challenge, we propose an Opportunistic Dynamic Architecture, ODA, based on a mixture of experts. ODA can automatically grow with more experts for new incoming tasks and opportunistically shrink by merging experts with similar weights. We evaluated ODA on three commonly used datasets: CIFAR-100, CUB, and iNaturalist, and compared it against eight existing continual learning techniques. ODA not only outperforms these techniques but does so with a parameter size that is slightly smaller on average, maintaining memory efficiency without compromising accuracy. Furthermore, ODA achieves this with only around 16% of the training time across all datasets when updating for each new task, making it a highly resource-efficient solution for continual learning applications.
Published in: IEEE Access (Volume: 13)
Page(s): 59146 - 59156
Date of Publication: 31 March 2025
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent years we have witnessed rapid advances in artificial neural networks and deep learning. These advances have been widely applied in a range of domains including medicine, transportation, agriculture, and finance. One key question emerging in the deep learning community is how to make deep learning models automatically update and evolve over time so that they can be deployed and used over an extended period.

This research is often referred to as continual learning, which is the ability to learn new tasks and accumulate knowledge continuously while retaining previously learned information. It is crucial in many real-world applications, including improving personalised healthcare by continually updating knowledge of patients’ symptoms and recovery progression [1], [2], and enhancing the driving safety of autonomous vehicles by continually updating their understanding of road conditions, traffic patterns, and unexpected events [3], [4]. The challenge in these example applications lies in the unpredictability of new tasks and task sequences, as well as the impracticality of learning all tasks simultaneously. Therefore, machine learning models need to learn tasks in a streaming manner, which leads to catastrophic forgetting. That is, knowledge related to old tasks is progressively overwritten while learning new tasks, resulting in a decline in overall performance over time.

Prior attempts to mitigate catastrophic forgetting have primarily involved preserving the weights associated with old tasks [5] or incorporating strategies such as replaying samples from old tasks [6]. While these approaches have demonstrated promising performance, their main limitation lies in an early commitment to a large network, even when the initial task includes only a limited number of classes (e.g., 2 or 5 classes). This architectural choice can lead to a substantial memory footprint and prolonged training times due to the abundance of parameters.

Recent advancements in continual learning have shifted towards dynamic architectures, where the network’s structure evolves to accommodate an increasing number of tasks [7]. This progressive approach aims to optimise the network’s capacity in response to task variations, offering a potential solution to the challenges posed by catastrophic forgetting while minimising resource inefficiencies. This paper introduces an opportunistic dynamic architecture (ODA), where the network proactively identifies opportunities to both expand and compress its structure. This adaptive approach is designed to accommodate new tasks effectively while also addressing memory limitations.

ODA (as shown in Figure 1) is built on top of a Mixture of Experts (MoE), which assigns an expert to a specific task, with a gating mechanism to regulate expert selection and integration. MoE [8], [9] is an emerging architecture for many language models because of its computational efficiency. MoE in ODA differs from these MoE architectures in that we host a collection of heterogeneous experts trained independently for different tasks. In the context of continual learning, naively mapping an expert to each new task can result in a large number of experts, leading to the accumulation of redundant weights. To tackle this problem, ODA continuously clusters similar classes from incoming tasks into groups. An expert is created for each group, allowing for updates with classes that share similarities with the expert’s existing classes. ODA can also merge similar experts to accommodate on-demand memory constraints. This dynamic strategy optimises the architecture for memory efficiency and task adaptation in continual learning.

FIGURE 1. The workflow of opportunistic dynamic architecture (ODA), which clusters each task’s data into groups, creates or assigns an expert for each group, and merges experts opportunistically.

The contributions of our work are listed as follows.

  1. We propose a novel MoE-based dynamic architecture approach for continual learning that automatically grows and compresses the structure of the network in response to incoming tasks and memory restrictions of devices.

  2. We evaluate our approach on three widely-used datasets: Canadian Institute for Advanced Research (CIFAR-100) [10], Caltech-UCSD Birds (CUB) [11], and iNaturalist [12]. The comparison with state-of-the-art continual learning techniques demonstrates that ODA achieves comparable accuracy (76% on CIFAR-100, 88% on CUB, and 71% on iNaturalist) while maintaining a small training time; the average time is 37 seconds per task and can be as low as 12% of that of the most expensive techniques.

SECTION II.

Related Work

The main challenge of continual learning is catastrophic forgetting (CF), and representative strategies for addressing CF include regularisation, rehearsal, and dynamic networks. Regularisation-based approaches penalise large updates on parameters identified as important for a certain task, including knowledge distillation in learning without forgetting (LwF) [5] and elastic weight consolidation (EWC) [13]. However, regularisation alone does not prevent forgetting, and recent approaches combine regularisation with rehearsal; that is, they store or generate samples for old tasks and merge these samples with new tasks’ data when updating the network. A classic approach is Incremental Classifier and Representation Learning (iCaRL) [6], which employs a herding algorithm to select the most informative samples for each task and utilises knowledge distillation when updating the network. More recent approaches adopt class prototypes to characterise old classes’ distributions [14]. However, the main limitation of these approaches is that they assume a fixed network, whose parameters may eventually saturate when there is a high volume of tasks to learn.

Recent research has increasingly focused on dynamic networks, which have the advantage of dynamically extending the network’s representational capacity. These approaches include extending neurons or branches and using masks to select parameter subspaces for each task. In the following, we introduce representative approaches in each category.

Early works focus on neuron expansion; that is, adding neurons to a certain layer, or introducing a new layer or subnetwork to the existing model. For example, ProgressNet [15] creates a new network for each task and uses the task ID to select which network to use for prediction. However, it is impractical to know the task ID a priori, so ProgressNet only works in task-incremental learning, not in class-incremental learning. To improve on this, Progress & Compress [16] progresses the architecture with new parameters for each new task and then compresses by distilling the new parameters into a shared knowledge base. Similarly, ExpertGate [17] is composed of a collection of expert models, each corresponding to a task. Each expert is designed as an autoencoder, and expert selection is based on a task relevance score derived from the reconstruction errors. Based on the relevance score, the model decides whether to create a new expert or to reuse and fine-tune an existing one. Dynamically expandable representation (DER) [18] introduces a new backbone network for each new task and aggregates the features from all backbones into a super-feature. However, this approach often accumulates a large number of redundant parameters, requiring substantial memory and extended training time. To tackle this problem, dynamically expandable networks (DEN) [19] were proposed to add new neurons only when necessary; that is, when the retraining loss for a new task exceeds a threshold. DEN also eliminates unnecessary neurons using group-sparsity regularisation.

Another strand is to use masks to allocate parameter subspaces for each task; that is, a binary mask is applied to the neurons or parameters to indicate whether they will be updated (if the mask value is 1) or frozen. This helps mitigate the forgetting effect by reducing interference during model updates for different tasks, while still allowing parameter sharing across tasks for feature learning. For example, PackNet [20] identifies important neurons for the current task and releases unimportant neurons for future tasks. This is achieved via iterative pruning, activation values, and uncertainty estimation. Piggyback [21] proposes learning binary masks without updating the backbone network. The task-specific binary mask weight matrix ‘piggybacks’ onto the backbone network.

Additive parameter decomposition (APD) [22] decomposes the network parameters at each layer of the target network. When a new task arrives, it first utilises the shared parameters and then learns the incremental difference that requires task-specific parameters. To improve the effectiveness of parameter sharing, APD clusters the task-specific parameters into hierarchically shared parameter groups. However, this approach still assumes a fixed network with limited capacity; therefore, its parameters will saturate when a high number of tasks are learnt. Hence, this approach requires sparsity constraints on parameter usage and selective reuse of the old parameters, which might affect the learning of new tasks [23].

Else-net [24] is another example of a dynamic architecture, consisting of multiple layers of elastic units, each of which comprises several learning blocks. Each block stores different knowledge from different human actions, and the selection of a relevant block is managed by a switch block. While dynamic architectures in continual learning show good performance, Zhou et al. argue that the memory budget used to extend the model can instead be allocated to an exemplar buffer to improve standard models [25]. They compare three exemplar-based algorithms with two model-based ones, giving the exemplar-based approaches more memory. They found the best configuration by sharing the shallow layers in the model backbone and only creating deep layers for new tasks. Building on this work, we apply pre-trained feature extractors and only create experts for new tasks.

To maintain low memory overhead as the number of tasks grows, as well as at test time, Douillard et al. [26] expand the model with special tokens for each task. These tokens have the same dimension as the data features and are concatenated to the input during training and prediction. As the new tokens only grow linearly in size, the total token size for all tasks remains considerably small compared to expanding the backbone of the model. Xie et al. [27] apply the expansion technique to capture new domain representations in domain-incremental learning (DIL). The intra-class structure is learnt using a von Mises-Fisher (vMF) mixture model. After expanding the mixture model in each training step, a reduction strategy that merges existing clusters is applied to make it more compact.

Compared to the above approaches, our method expands and merges experts based on class similarity and memory constraints. It requires neither prior knowledge of task IDs nor the storage of extra parameters such as masks.

SECTION III.

Opportunistic Dynamic Architecture

This section describes our proposed technique, ODA, which is built on the MoE framework, with each expert assigned to a set of similar classes. The key novelty of ODA resides in its automated expansion and expert fusion capabilities. For each task, classes are first clustered into groups, after which an expert is created and trained for each group. Expert weights are monitored to merge experts with similar weights to reduce memory redundancy when needed. The following sections describe the two main steps: expert creation and fusion, with a focus on how to create and train experts, integrate predictions from experts, and merge experts to mitigate catastrophic forgetting in both the experts and gates.

A. Problem Statement

Class-incremental continual learning refers to a sequence of tasks with mutually exclusive class sets. Each task T_{i} is associated with a dataset D_{i} = \{(x_{i}^{(j)}, y_{i}^{(j)}) | y_{i}^{(j)} \in C_{i}\}^{N_{i}}_{j=1} , where x_{i}^{(j)} is the jth input feature and y_{i}^{(j)} is the output label belonging to the corresponding class set C_{i} . The objective is to recognise all the classes from the tasks being observed.
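For illustration, the short sketch below shows one way such a class-incremental task stream could be constructed from a labelled dataset; it is not the paper's code, and the helper name make_task_stream and the class count per task are illustrative assumptions.

# A minimal sketch of building a class-incremental task stream with mutually
# exclusive class sets C_i (illustrative; classes_per_task follows the CIFAR-100 setting).
import numpy as np

def make_task_stream(labels, classes_per_task=10, seed=0):
    """Split a label array into tasks with mutually exclusive class sets."""
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(labels))
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        C_i = classes[start:start + classes_per_task]      # class set of task T_i
        idx = np.flatnonzero(np.isin(labels, C_i))          # sample indices forming D_i
        tasks.append({"classes": C_i, "indices": idx})
    return tasks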

B. Expert Creation

The initial phase involves the aggregation of similar classes to form a cohesive expert group. Drawing inspiration from the multi-task learning literature [28], we group similar classes to enhance overall classification performance. We illustrate this concept via a toy example by randomly selecting 20 classes from CIFAR-100. Three distinct grouping settings are explored: grouping by superclass, by clustering, and by random assignment. An expert is created for each group, and the classification accuracy of the MoE across these settings is evaluated. As presented in Figure 2, clustering-based grouping achieves accuracies close to those of superclass-based grouping, both of which outperform random grouping. This example highlights that building experts for similar classes can improve classification accuracy.

FIGURE 2. A toy example investigating the performance of different grouping strategies. The results show that clustering similar classes to construct an expert achieves comparable accuracy to superclass grouping and higher accuracy than random grouping.

1) Prototyping and Clustering

For each incoming task, we create a prototype using a Gaussian Mixture Model (GMM) for each class and then run a clustering algorithm on the prototypes. This reduces the computational complexity of clustering and prevents samples from the same class from being assigned to different groups.

To cluster similar classes within a task, we apply K-means clustering to the prototype of each class, varying the number of clusters, K, between two and three. This ensures that the classes are grouped into at least two and at most three clusters. The optimal number of clusters is determined by the highest Silhouette score, a widely used metric that quantifies clustering quality by measuring how similar each sample is to its assigned cluster compared with neighbouring clusters [29].
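A minimal sketch of this step is given below. It assumes the class prototype is taken as the mean of a single-component GMM fitted to that class's features (the paper does not specify the exact prototype construction), that features_by_class is a hypothetical mapping from class IDs to feature arrays, and that each task contains more classes than the largest K.

# Prototyping (one GMM per class) followed by K-means over prototypes,
# with K in {2, 3} chosen by the highest Silhouette score.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_task_classes(features_by_class, k_range=(2, 3), seed=0):
    """features_by_class: dict {class_id: array of shape (n_samples, d)}."""
    class_ids = list(features_by_class)
    # One GMM per class; its mean serves as the class prototype.
    prototypes = np.vstack([
        GaussianMixture(n_components=1, random_state=seed)
        .fit(features_by_class[c]).means_[0]
        for c in class_ids
    ])
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:                                   # K varied between two and three
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(prototypes)
        score = silhouette_score(prototypes, labels)    # clustering quality
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # Map each cluster id to its member classes (one expert is built per group).
    return {g: [c for c, lab in zip(class_ids, best_labels) if lab == g]
            for g in range(best_k)}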

2) Expert Network Creation

A small expert model (e.g., with one hidden layer of 128 neurons) is created for each cluster within a task, and each is trained independently (i.e., outside the MoE framework) using the training data from its corresponding classes. Once each expert has been trained independently, we freeze its weights, integrate it into the MoE framework, and train the gate and classification mapper (CM) layer, which projects the logits of each expert to a unified layer. The CM layer’s dimension is C_{i} \times C_{all} , where C_{i} is the number of classes known by expert i, and C_{all} is the total number of classes seen so far. The gate learns the importance coefficients for the existing M experts and integrates their outputs into the final prediction:
\begin{equation*}\hat {y}= \sum _{i\in [1, M]} g_{i} \, \text {CM}(\boldsymbol {z}_{i}), \tag {1}\end{equation*}
where \boldsymbol {z}_{i} is the logit output from expert i, CM maps the logits to the unified classification layer, and g_{i} is the gate coefficient for that expert. The gate and CM layer are trained with a mixture of replay samples from the buffer and samples generated by the GMM. With this two-step training strategy, each expert can still be used independently to classify what it has learnt by taking the output from its own classification layer, while also being used in an ensemble with the other experts in the ODA architecture by taking the output from the CM layer.
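The sketch below is a minimal PyTorch rendering of Eq. (1): frozen experts, one CM linear layer per expert projecting its logits to the unified class space, and a gate producing the coefficients g_i. It assumes each expert is an nn.Sequential ending in a Linear layer; the class ODAHead and all layer sizes are illustrative, not the paper's implementation.

# Gated combination of frozen experts through per-expert classification mappers (Eq. 1).
import torch
import torch.nn as nn

class ODAHead(nn.Module):
    def __init__(self, experts, feat_dim, total_classes, gate_hidden=256):
        super().__init__()
        self.experts = nn.ModuleList(experts)            # each maps features -> own logits
        for p in self.experts.parameters():
            p.requires_grad = False                      # experts are frozen after training
        # One CM per expert: expert logits (C_i) -> unified classes (C_all).
        self.cm = nn.ModuleList(
            nn.Linear(e[-1].out_features, total_classes) for e in experts
        )
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, gate_hidden), nn.ReLU(),
            nn.Linear(gate_hidden, len(experts)), nn.Softmax(dim=-1)
        )

    def forward(self, x):                                # x: (batch, feat_dim)
        g = self.gate(x)                                 # (batch, M) gate coefficients
        z = [cm(e(x)) for e, cm in zip(self.experts, self.cm)]   # M tensors of (batch, C_all)
        return sum(g[:, i:i + 1] * z[i] for i in range(len(z)))  # weighted sum, Eq. (1)

Only the gate and CM parameters receive gradients here, which mirrors the two-step strategy of training experts first and the gate/CM afterwards.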

C. Expert Fusion

To calculate the distance between the parameters of experts, we first employ singular value decomposition (SVD) to decompose them, as SVD has shown promising results in calculating the distance between high-dimensional vectors [30].

To merge experts, we need to address three questions: (1) when to merge, (2) which experts to merge, and (3) how to merge without degrading the performance. Our motivation for merging experts stems from limited memory capacity or the maximum number of experts that a device can host. When the current number of experts reaches the maximum, we merge the most similar experts. Similarity is calculated based on their hidden layer’s weights. While the Jonker-Volgenant algorithm [31] has been adopted to calculate the similarity between experts’ weights [9], it is computationally expensive. Here, we first employ SVD to decompose the weights and calculate the distance between their singular values [30]. A small distance indicates that the singular values of both layers are close, suggesting they capture similar information or exhibit similar patterns. More specifically, for each layer l in expert i and expert j, SVD is performed to decompose their parameters (i.e., the weights and biases) into three main components: the left singular vectors (\boldsymbol {U}_{i,l}, \boldsymbol {U}_{j,l}) , the singular values (\boldsymbol {S}_{i,l}, \boldsymbol {S}_{j,l}) , and the right singular vectors (\boldsymbol {V}_{i,l}, \boldsymbol {V}_{j,l}) :
\begin{align*} \boldsymbol {W}_{i,l} & = \boldsymbol {U}_{i,l} \boldsymbol {S}_{i,l} \boldsymbol {V}_{i,l}, \tag {2}\\ \boldsymbol {W}_{j,l} & = \boldsymbol {U}_{j,l} \boldsymbol {S}_{j,l} \boldsymbol {V}_{j,l}. \tag {3}\end{align*}

Then the distance between two experts is calculated as the sum of Euclidean distances between the corresponding singular values of their layers \boldsymbol {S}_{i,l} and \boldsymbol {S}_{j,l} :
\begin{equation*} d_{i,j} = \sum _{l=1,2} |\boldsymbol {S}_{i,l} - \boldsymbol {S}_{j,l}| \tag {4}\end{equation*}
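A minimal sketch of Eqs. (2)–(4) is shown below. It assumes the compared layers of the two experts have identical shapes (true for ODA's fixed-size hidden layers), and the layer name "hidden.weight" is a hypothetical placeholder for whichever layers are compared.

# SVD-based distance between two experts: Euclidean distance between singular values.
import torch

def expert_distance(expert_i, expert_j, layer_names=("hidden.weight",)):
    """Sum over layers of the Euclidean distance between singular values (Eq. 4)."""
    d = 0.0
    si, sj = expert_i.state_dict(), expert_j.state_dict()
    for name in layer_names:                      # e.g. the hidden layer(s) only
        S_i = torch.linalg.svdvals(si[name])      # singular values of W_{i,l}
        S_j = torch.linalg.svdvals(sj[name])      # singular values of W_{j,l}
        d += torch.linalg.norm(S_i - S_j).item()  # distance between the two spectra
    return d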

After calculating the distance between each pair of experts and ranking the pairs in ascending order, we select the top pairs for merging until the memory budget is met. If there is any overlap in the identified expert pairs, for example, if two pairs of experts (i, j) and (i, k) are both in the top list, we merge the three experts i, j, k . Another key consideration is that expanding an expert with too many classes is not ideal, as the model may become saturated. Therefore, we set a maximum limit on the number of classes an expert can take and do not merge any expert whose number of classes already exceeds the threshold. In the future, we may also investigate checking neuron activations at each layer to determine whether an expert has become saturated.

Once we have identified the experts, we perform the merging by averaging the weights of their hidden layers. For the classification layer, we concatenate their weights, as they correspond to different classes in each expert. We then align the norms of the weight vectors of the classes from the two experts. Since each expert is trained independently with potentially different numbers of training samples, their weight magnitudes can differ, potentially causing bias when we fine-tune the merged expert. To address this, we adopt the weight alignment technique [32] that balances the weight magnitudes between old and new classes in class-incremental learning. More specifically, let \boldsymbol {W}_{i} and \boldsymbol {W}_{j} refer to the weights of the classification layer corresponding to the ith and jth experts; that is:
\begin{align*} \boldsymbol {W}_{i} & = (\boldsymbol {w}_{i,1}, \boldsymbol {w}_{i,2}, \ldots, \boldsymbol {w}_{i, C_{i}}), \tag {5}\\ \boldsymbol {W}_{j} & = (\boldsymbol {w}_{j,1}, \boldsymbol {w}_{j,2}, \ldots, \boldsymbol {w}_{j, C_{j}}), \tag {6}\end{align*}
where C_{i} and C_{j} are the numbers of classes in the two experts. Their norms are defined as:
\begin{align*} \text {Norm}_{i} & = (||\boldsymbol {w}_{i,1}||, \ldots, ||\boldsymbol {w}_{i, C_{i}}||), \tag {7}\\ \text {Norm}_{j} & = (||\boldsymbol {w}_{j,1}||, \ldots, ||\boldsymbol {w}_{j, C_{j}}||). \tag {8}\end{align*}

We concatenate their weights W = (\boldsymbol {W}_{i}, \boldsymbol {W}_{j}) and normalise them as
\begin{align*} \boldsymbol {W}^{\prime } & = (\boldsymbol {W}_{i}, \gamma \boldsymbol {W}_{j}), \tag {9}\\ \gamma & = \frac {\text {Mean}(\text {Norm}_{i})}{\text {Mean}(\text {Norm}_{j})}. \tag {10}\end{align*}
This ensures that the mean norms of both experts’ class weights are equal. In summary, given a collection of experts 1, 2, \ldots, m , each with k layers, the merging function merge is defined as follows:
\begin{align*} & \text {merge}(E_{1}, E_{2}, \ldots, E_{m}) \\ & = \begin{cases} \text {average}(\boldsymbol {W}_{1, j}, \boldsymbol {W}_{2, j}, \ldots, \boldsymbol {W}_{m, j}) & \forall j \in [1, k-1] \\ \text {wa}(\text {concat}(\boldsymbol {W}_{1, j}, \boldsymbol {W}_{2, j}, \ldots, \boldsymbol {W}_{m,j})) & j=k \end{cases} \tag {11}\end{align*}
where \boldsymbol {W}_{1, j}, \boldsymbol {W}_{2, j}, \ldots, \boldsymbol {W}_{m, j} represent the weights at the jth layer of each expert, which are averaged if j is not the last layer (i.e., the classification layer); otherwise, the weights are first concatenated and then aligned via normalisation. Here, average is a function that averages the weights of each layer, concat is a function that concatenates the last-layer weights from all the experts, and wa is the above weight alignment function.
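The following sketch shows the two-expert case of Eqs. (9)–(11): hidden-layer weights are averaged, and classification-layer weights are concatenated after weight alignment. The function names and the separation into hidden/classification tensors are illustrative assumptions, not the paper's code.

# Merging two experts: average hidden layers, align and concatenate classification layers.
import torch

def weight_align(W_i, W_j):
    """Scale W_j so that both experts' mean per-class weight norms match (Eqs. 7-10)."""
    norm_i = W_i.norm(dim=1)                      # per-class weight norms of expert i
    norm_j = W_j.norm(dim=1)                      # per-class weight norms of expert j
    gamma = norm_i.mean() / norm_j.mean()         # Eq. (10)
    return torch.cat([W_i, gamma * W_j], dim=0)   # Eq. (9), rows correspond to classes

def merge_two_experts(hidden_i, hidden_j, cls_i, cls_j):
    merged_hidden = (hidden_i + hidden_j) / 2.0   # average the non-classification layer
    merged_cls = weight_align(cls_i, cls_j)       # concatenate + align the last layer
    return merged_hidden, merged_cls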

Once the experts are merged, we fine-tune the new expert with data from all the merged experts, consisting of a mixture of replay samples from the buffer and samples generated by the GMMs. While it would be preferable to use only generated samples, we find that incorporating a small number of real samples significantly reduces forgetting. After training, the merged expert replaces the merging experts and classifies their classes. Algorithm 1 outlines the overall training process of ODA.

Algorithm 1 ODA Training Procedure

for each task t do
    generate a prototype and a GMM for each class in t
    run clustering on the prototypes
    create and train an expert for each cluster
    store in the buffer the replay samples for each class in t
    if the current number of experts exceeds the maximum number of experts then
        select the experts whose number of classes is below a pre-defined threshold
        compute the similarity between the weights of their hidden layers
        select and merge the most similar experts
        fine-tune the merged expert using the buffer and GMM samples
    end if
    train the gate and classification mapper layer
end for

SECTION IV.

Experiment, Results, and Discussion

This section describes our experiment methodology including datasets, evaluation metrics, and baseline selection, and presents and discusses the results.

A. Experiment Methodology

The experiments are conducted on three datasets that have been widely used in continual learning [33], [34]: CIFAR-100 [10], CUB [11], and iNaturalist [12].

The CIFAR-100 dataset consists of 60,000 colour images of size 32\times 32 , spanning 100 fine classes and 20 coarse classes. Each image is associated with two labels: fine and coarse. The fine label specifies the actual class to which the image belongs, whereas the coarse label denotes its superclass. For each fine class, there are 500 training samples and 100 test samples. Thus, CIFAR-100 is a balanced dataset.

The CUB dataset consists of 200 bird species with about 30 samples per class, among which 5994 samples are for training and 5794 for testing. Like CIFAR-100, this dataset is balanced.

The iNaturalist dataset contains over 800,000 images of more than 5,000 different plant and animal species. The dataset includes several classification groups based on taxonomy: category, kingdom, family, supercategory, class, phylum, genus and order. In our experiment, we use the category group and select the top 1011 categories. This results in a long-tailed dataset where the number of samples per class ranges from 1000 down to 80.

Metrics: Performance is measured using two of the most commonly used metrics in continual learning [33]: Final Average Accuracy (FAA) and Final Forgetting (FF). FAA measures the average accuracy of the model across all tasks at the end of the learning process, providing a comprehensive view of the model’s performance after exposure to the full sequence of tasks:
\begin{equation*} \text {FAA} = \frac {1}{T} \sum _{i=1}^{T} A_{T,i}, \tag {12}\end{equation*}
where T is the total number of tasks and A_{T,i} is the accuracy on task i after learning all T tasks.

FF measures the degree of forgetting for each task by comparing the model’s accuracy on that task immediately after learning it with its accuracy at the end of the learning process. This quantifies how much information is lost over time:
\begin{equation*} \text {FF} = \frac {1}{T-1} \sum _{i=1}^{T-1} (A_{i,i} - A_{T,i}), \tag {13}\end{equation*}
where A_{i,i} is the accuracy on task i right after training on it, and A_{T,i} is the accuracy on task i after learning all T tasks.
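Both metrics can be computed directly from an accuracy matrix, as in the sketch below, which assumes acc[t][i] stores the accuracy on task i measured after learning task t (a common convention, not stated explicitly in the paper).

# FAA (Eq. 12) and FF (Eq. 13) from a T x T accuracy matrix.
import numpy as np

def final_average_accuracy(acc):
    acc = np.asarray(acc)
    T = acc.shape[0]
    return float(acc[T - 1].mean())                       # mean of the last row, Eq. (12)

def final_forgetting(acc):
    acc = np.asarray(acc)
    T = acc.shape[0]
    drops = [acc[i, i] - acc[T - 1, i] for i in range(T - 1)]
    return float(np.mean(drops))                          # average accuracy drop, Eq. (13)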

B. Baseline Selection

To evaluate ODA, a set of commonly used baseline models is selected. This selection covers all three categories in continual learning: regularisation, replay, and dynamic architecture. They are LwF [5], EWC [13], iCaRL [6], GDumb [35], bias correction (BiC) [36], dark experience replay (DER++) [37], ProgressNet [15] and Vanilla Ensemble [38]. Two additional baselines serve as the upper bound (joint training) and the lower bound (fine-tuning). Most of the implementations are sourced from the existing libraries Mammoth [37], FACIL [39] and Avalanche [40]. A Vanilla Ensemble is implemented in-house, as no readily available implementation was found.

C. Implementation and Hyperparameter Selection

We use a Vision Transformer [41] to extract features for the datasets.

For ODA, each expert consists of a hidden layer with 128 neurons, and the gate has a hidden layer with 256 units, which is intentionally kept small. In practice, the maximum number of experts can be decided by the hardware memory constraints, and the maximum number of classes in an expert can be decided by the model size. In our experiments, we set the maximum number of experts to 6, 10 and 20 and the maximum number of classes in an expert to 17, 20, and 50 for CIFAR-100, CUB and iNaturalist, respectively. This setup aims to encourage merging, and we have also evaluated ODA’s performance with different configurations of these two parameters. For consistency, the same architecture is adopted for all the baseline models such that the total number of parameters of ODA is less than or equal to theirs; that is, one hidden layer with 1000, 1500 and 2000 units for CIFAR-100, CUB and iNaturalist, respectively.

For rehearsal, a buffer size of 10, 10 and 20 samples per class is chosen for CIFAR-100, CUB and iNaturalist, respectively. ODA is updated using both replay samples and GMM-generated samples. We experimented with well-established regularisation techniques, including knowledge distillation (KD) and its variants; however, they did not significantly improve the performance and required more memory. Therefore, we do not use any regularisation techniques in our current implementation.

We run a grid search for the hyperparameters of ODA; that is, the search range for learning rates is {0.1, 0.01, 0.001, 0.0001} and for batch sizes is {32, 64, 128}. The FAAs for the different combinations are reported in Table 2, and we select the combination yielding the highest FAA, which is a learning rate of 0.001 and a batch size of 64. For the baseline models, we use the following hyperparameters: a learning rate of 0.01 and a batch size of 64 for Joint Training, Finetune, and ProgressNet; and a learning rate of 0.03 and a batch size of 64 for the rest. The technique-specific hyperparameters are kept as provided by each framework.

TABLE 1. Experiments for the Three Datasets. The Numbers are in Percentage (%) With the Standard Deviation in Brackets. ODA Outperforms the State-of-the-Art Continual Learning Techniques on All the Datasets.
TABLE 2. FAA of Hyperparameters Via Grid Search on Learning Rates and Batch Sizes.

D. Overall Performance

The most commonly used class-incremental continual learning settings are considered, where an equal number of randomly selected classes (10, 20, and 50, respectively) is assigned to each task for CIFAR-100, CUB, and iNaturalist. Table 1 presents the overall performance across the three datasets in terms of FAA and FF, with the mean score reported and the standard deviation given in brackets. ODA outperforms the state-of-the-art techniques on all the datasets.

As discussed in Section IV-C, combining selected raw samples from the buffer with GMM-generated samples improves performance. The number of GMM-generated samples was set to 100, 20, and 100 for CIFAR-100, CUB, and iNaturalist, respectively. The choice of 20 for CUB is due to its limited training set of about 30 samples per class, ensuring that the GMM samples do not overpower the raw ones.

Across all the datasets, ODA demonstrates the best accuracy, followed by BiC, iCaRL, and DER++. In terms of forgetting, iCaRL exhibits the least forgetting overall. Among the dynamic architecture approaches, ODA shows the least forgetting, roughly half to a third of that of ProgressNet and the Vanilla Ensemble.

An example of the training progression for an expert and the gate is illustrated in Figure 3 and Figure 4. As each individual expert is trained on samples with close representations, optimal performance is achieved in the early epochs. Similarly, when the gate and the classification mapper layer are trained, only a small number of epochs is required to reach high accuracy and low loss.

FIGURE 3. The performance progression of training one expert over 100 epochs on CIFAR-100. Accuracy and loss stabilise around epoch 15.

FIGURE 4. The performance progression of training the gate and the classification mapper over 100 epochs on CIFAR-100. The accuracy of both the gate and the classification mapper stabilises around epoch 15, while their losses still show small fluctuations.

Figure 5 illustrates the performance progression of experts and gates on the CIFAR-100 dataset. The average validation accuracy of all experts and the gate at the end of each task is calculated from five experiments. The dashed line indicates the mean accuracy. The gate performs well, with its average top-2 accuracy reaching 90.42%, suggesting that the gate can accurately assign higher coefficients to the correct experts. This implies that in the future we may only need to activate the top-k experts for prediction to further reduce computational cost. The expert accuracy drops more than the gate’s. Each expert is initially trained with real samples in the training data, but as they merge and are updated with only a small portion of the original samples alongside pseudo-samples from GMM, their performance degrades.

FIGURE 5. The performance progression of the experts, the gate and the gate’s top-2 accuracy on CIFAR-100. The expert accuracy is obtained by averaging all available experts at the end of training each task. The gate maintains high accuracy over time with the newly introduced experts; however, the experts’ accuracy suffers a large decrease with merging.

E. GMM Pseudo-Samples

Data augmentation plays a crucial role in enhancing the generalisation and performance of deep learning models [42]. In computer vision, many techniques have been developed for this purpose, ranging from directly manipulating images, such as flipping, rotating, cropping, and mixing, to using generative models to reproduce images with similar features.

Similarly, GMMs can augment datasets by generating pseudo-samples that capture the characteristics of the data within the same class. In our study, these pseudo-samples improve ODA’s performance. For example, on CUB, the accuracy of ODA even slightly exceeds that of joint training. This is likely due to the small size of the original training set and the closely matched pseudo-samples generated by the GMM, which make the class distributions captured by the GMM more effective during the retraining steps after merging. Figure 6 shows the t-SNE distribution of real samples versus GMM-generated samples for five classes in CUB, showing that the GMM samples closely resemble the real ones.
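For reference, a minimal sketch of this generation step is given below. It assumes one GaussianMixture is kept per class in the (frozen) feature space and that a single mixture component is sufficient; both the helper names and the component count are illustrative assumptions.

# Generating GMM pseudo-samples for replay, one fitted GMM per class.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmm(class_features, n_components=1, seed=0):
    return GaussianMixture(n_components=n_components, random_state=seed).fit(class_features)

def generate_pseudo_samples(gmms_by_class, n_per_class=100):
    """Return (features, labels) of GMM-generated pseudo-samples for all stored classes."""
    X, y = [], []
    for class_id, gmm in gmms_by_class.items():
        samples, _ = gmm.sample(n_per_class)      # draw from the class distribution
        X.append(samples)
        y.append(np.full(n_per_class, class_id))
    return np.vstack(X), np.concatenate(y)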

FIGURE 6. t-SNE of real images vs. GMM samples for iNaturalist. The distribution of the GMM samples is very close to that of the original ones. The same shape represents the same class, and the colour separates real and generated samples.

The effect of various pseudo-sample sizes on CIFAR-100 and CUB is illustrated in Figure 7. In general, generating more pseudo-samples with the GMM helps improve accuracy. However, choosing the right size also requires considering the trade-off between training time and the available memory.

FIGURE 7. The effect of various pseudo-sample sizes generated by GMM on CIFAR-100 and CUB. The general trend is that more generated samples yield better performance, but the trade-off with training time and memory must be considered.

F. Parameter Size

Following the decision in Section IV-C to keep the architecture small, we compare the parameter sizes of ODA and the baselines in Figure 8. The plot presents a comparison of parameter sizes across different models for the CIFAR-100, CUB, and iNaturalist datasets. On average, ODA maintains a parameter size comparable to the ‘Others’ baselines, showing that our model achieves competitive resource usage without excessive growth in parameters. Importantly, despite having a similar parameter footprint to the others, ODA is able to deliver better performance.

FIGURE 8. Parameter size comparison between ODA and the baselines. All baselines but Vanilla Ensemble are grouped as ‘Others’. The parameter size is in millions. The average size of one expert is also provided for comparison. ODA uses much fewer parameters than the other continual learning techniques.

Additionally, the plot includes the average parameter size of a single expert in ODA, which remains minimal at around 100K parameters across datasets. This compact expert design not only aids scalability but also allows ODA to efficiently utilise memory and encourage expert merging where appropriate. Overall, ODA provides a balanced and effective solution by controlling parameter size while achieving strong performance outcomes.
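As a rough sanity check of the reported per-expert size, the sketch below counts the parameters of one expert, assuming a 768-dimensional feature input (a typical ViT feature size, not stated here), one 128-unit hidden layer, and roughly 20 output classes; all three values are illustrative assumptions.

# Approximate parameter count for a single ODA-style expert.
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

expert = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 20))
print(count_parameters(expert))   # on the order of 1e5, consistent with ~100K per expert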

G. Computational Cost

We ran all experiments on an NVIDIA RTX A6000 with 48 GB of GDDR6 GPU memory. Figure 9 compares the training time per task with the best-performing techniques across the three datasets. ODA takes 12.78%, 12.22% and 26.10% of BiC’s training time, and 16.19%, 15.38%, and 13.70% of iCaRL’s, on CIFAR-100, CUB, and iNaturalist, respectively. Since iNaturalist has the largest number of classes (i.e., 1011) and more classes to learn per task (i.e., 50), its training time is significantly longer than that of the other two datasets. Figure 10, Figure 11 and Figure 12 show the progression of training time for each task on CIFAR-100, CUB, and iNaturalist. With more classes to learn, iCaRL’s training time increases much faster than ODA’s and BiC’s. ODA’s training time stays low and does not vary much, as it separates the training of the experts and the gate. Table 3 lists the training time for each key component in ODA; the gate training and the SVD used to calculate the similarity between expert weights consume the most time. In ODA, the number of parameters grows gradually as more experts are created, ultimately reaching 88% of that of the baseline models.

TABLE 3. Training Time Per Key Component in ODA (in Seconds). The Most Computationally Expensive Component is the Gate Training, and the Expert Training and Updating has Similar Profile.
FIGURE 9. Comparison of training time between ODA and the best-performing techniques: iCaRL and BiC. ODA only consumes around 13% of the training time of these techniques when updating with a new task.

FIGURE 10. Comparison of training time on each task of CIFAR-100 between ODA and the best-performing techniques: iCaRL and BiC. The comparison techniques increase training time with more tasks, while ODA stays small because ODA freezes most of the experts and only needs to update one expert and the gate, which reduces the training cost.

FIGURE 11. Comparison of training time on each task of CUB between ODA and the best-performing techniques: iCaRL and BiC.

FIGURE 12. Comparison of training time on each task of iNaturalist between ODA and the best-performing techniques: iCaRL and BiC.

H. Impact of the Number of Experts

The impact of the number of experts on performance is also investigated, specifically how accuracy declines as the number of experts is reduced. This analysis aims to understand the effect of the merging process on long-term performance. Table 4 reports the accuracy and parameter size (in millions) on CUB. Generally, fewer experts result in lower accuracy, fewer parameters, and faster training. This is expected, as more aggressive merging leads to more frequent retraining and worse forgetting in experts. This highlights the need for careful management of the merging process to avoid compromising the overall performance of the architecture.

TABLE 4. Accuracy, the Number of Parameters (in Millions), and Training Time Per Task (in Seconds) for Different Numbers of Experts.

I. Impact of Gate Architecture

Figure 13 compares accuracies with different gate architectures on all three datasets. On CIFAR-100 and CUB, the best performing gate is one hidden layer with 256 neurons. On CIFAR-100, increasing the size of the gate does not significantly improve the accuracy, as the number of experts is quite small. On iNaturalist, which has many more experts, we start with the larger architecture and then increase the depth of the gate by adding an extra hidden layer with 128 neurons; the accuracy improves greatly. A general rule drawn from this experiment is that the more classes to learn and the more experts to be created, the larger the gate needs to be.

FIGURE 13. Comparison of accuracies with different gate architectures on all three datasets. When the dataset has more classes and tasks, the gate needs a larger architecture to manage the growing number of experts.

SECTION V.

Conclusion and Future Work

This paper presents ODA for class-incremental learning. It is built on MoE, enabling the model to grow with more experts and to shrink by merging experts. The advantage is that ODA does not need to commit to a large model at the beginning, optimising memory use and reducing training time. The experimental results demonstrate that ODA achieves accuracy improvements of 2.39%, 6.34%, and 2.54% over the best-performing techniques on CIFAR-100, CUB, and iNaturalist, with much less training time. In terms of parameter size, ODA also uses a slightly smaller parameter budget than the baselines, which makes it a suitable candidate for maintaining good performance where memory is limited. This makes it promising to deploy on resource-constrained devices for real-world continual learning applications.

A. Limitation Discussion

One major limitation of ODA is the significant degradation in expert accuracy after merging. Future work will explore methods to mitigate the forgetting effect on experts. Another limitation is the computational cost associated with dynamically managing experts, including calculating the similarity between experts’ weights. Furthermore, the scalability of ODA to very large datasets and highly dynamic environments has not been extensively tested, potentially limiting its applicability to more complex scenarios. Future work will address these limitations by enhancing the merging mechanism, introducing advanced federated learning techniques, and optimising the system for large-scale, real-world deployments.

Additionally, ODA currently assumes that task boundaries are clearly defined and that training data for new tasks is fully annotated, which may not align with the realities of real-world applications where task boundaries are ambiguous, and labels are scarce.
